Arcesium
Engineering Manager, Platform Reliability Engineering
Job Description
We are looking for an experienced Engineering Manager to lead our Site Reliability Engineering (SRE) team. The ideal candidate will have a strong background in SRE principles and practices, as well as experience managing and mentoring engineers. The SRE Manager will be responsible for the overall success of the SRE team, including ensuring that our systems are reliable, scalable, and secure.

The team is responsible for monitoring the stability and availability of mission-critical production systems, managing incidents for quicker resolution, and establishing BAU. The team also builds tools and infrastructure that all development teams use for monitoring and troubleshooting.

What you’ll do (Responsibilities):
- Lead and manage the SRE team in the design, implementation, and operation of our SRE practices and processes.
- Lead and manage a team of engineers, providing coaching, technical guidance, mentorship, goal (OKR) and performance management, and career management for direct reports.
- Mentor and develop SRE engineers to ensure that they have the skills and knowledge necessary to be successful in their roles.
- Work with other engineering teams to ensure that our systems are designed and implemented in a way that is reliable, scalable, and secure.
- Represent the SRE team to other stakeholders within the company.

Operations management:
- Manage on-call rotations to provide 24-hour coverage.
- Provide day-to-day support of the dashboard, including responding to outages and triaging cases escalated by clients and internal teams.
- Review processes periodically and drive continual improvement.
- Bring a flair for automation and seek opportunities to automate manual processes and service catalog items.
- Own operational success by continuously monitoring the stability and tech KPIs of the team and remediating any issues.
- Own the incident management process.
- Own end-to-end availability and performance of mission-critical services and build automation to prevent problem recurrence.

What you’ll need (Qualifications):
- 9+ years of experience in SRE or a related field
- Strong understanding of SRE principles and practices
- Experience with observability tools
- Experience with incident response and management
- Reliability: exposure to chaos engineering and other reliability practices, including disaster recovery, is good to have
- Experience with cloud computing platforms such as AWS
- Experience with Kubernetes
- Experience in Agile practices (Scrum)
- Excellent analytical, problem-solving, and troubleshooting skills
- Excellent communication and presentation skills
- Experience managing and mentoring engineers
- Ability to work independently and as part of a team
- Ability to delegate, monitor, and make progress

The Company offers excellent benefits, an informal and collegial working environment, and an attractive compensation package.

Members of the Arcesium Company Group do not discriminate in employment matters on the basis of sex, race, color, caste, creed, religion, pregnancy, national origin, age, military service eligibility, veteran status, sexual orientation, marital status, disability, or any other protected class.