Site Reliability Engineer

Göteborg

Details

Conduct code reviews for reported cases, and handle the development and delivery of fixes.

Infrastructure Automation and Configuration Management
Develop and maintain automation tools, scripts, and configuration management systems.
Implement Infrastructure as Code (IaC) practices using tools like Ansible, Terraform, or Kubernetes.
Collaborate with development and operations teams to automate build, test, and deployment processes.

Reliability Engineering and Resilience
Design and implement systems and processes to enhance infrastructure reliability and resilience.
Continuously improve system reliability by analyzing logs and trends, identifying areas for improvement, and implementing preventative measures.

System Monitoring and Incident Response
Develop and manage monitoring tools and systems to track software and infrastructure health, performance, security, and availability.
Set up alerts, dashboards, and metrics for proactive detection and response to incidents.
Investigate and diagnose root causes of incidents and work towards resolution in a timely manner.

Continuous Improvement and Collaboration
Drive a culture of continuous improvement by identifying areas for automation and efficiency.
Document procedures, incidents, and best practices for knowledge sharing and team efficiency.
Stay updated on industry trends and emerging technologies to propose innovative solutions.
Collaborate closely with cross-functional teams to ensure smooth operation of systems.

Required skills & experience
Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience) with 5+ years of DevOps/SRE work.
Proficient in scripting/programming languages such as Python and Bash.
Experience with cloud platforms (AWS preferred).
Experience with DevOps practices, CI/CD, and monitoring tools.
Experience with automation tools and configuration management frameworks such as Terraform, AWS CDK, Puppet, or Ansible.
Strong troubleshooting and problem-solving skills with keen attention to detail.
Excellent communication and collaboration skills to work effectively in a cross-functional team environment.
Strong experience in system administration, infrastructure management, or site reliability engineering.

Additional information specifically for this job request
Additionally, you should have
A good general understanding of distributed systems and microservice architecture.
A solid technical background in IT system development/system administration.
Software engineering background and/or experience in tool development (e.g., Python, JavaScript, Java, or Kotlin).
Experience working with Application Performance Monitoring tools (e.g., Prometheus and Grafana).
Good knowledge of SLAs, SLOs, and SLIs, and how to use metrics to measure service levels and objectives.
Experience working with centralized logging platforms (e.g., Elastic stack, Splunk, Datadog).
Experience working with container orchestration (e.g., Kubernetes).

Facts

City: Göteborg
Work Time: 100%
Start: May 27, 2024
End: April 24, 2026

Application deadline: April 22, 2024

Do you have questions about the assignment?

Contact person: Loui Nydelius
Email: loui.nydelius@asociety.se
Phone: 46733534482

Ref.no: 10623


About A Society

A Society is the consulting company of tomorrow that provides consultants, at the top of their game, with access to the most attractive assignments. We maximise advantages in the flexible labor market for consultants and customers alike.
