We are looking for an experienced Site Reliability Engineer to work in a professional software development hub, responsible for delivering software solutions used by our client’s external private and business customers across the globe.
Our customer is a global retail company with over 16,000 stores in 26 countries, serving more than 6 million customers a day and employing about 130,000 people in its stores and support offices.
You will be focused on the Next Generation Retail Platform, aiming to improve its reliability, telemetry, observability, performance, maintainability, and quality.
Your role
- Understanding the business criticality of supported services
- Overseeing the production environment by monitoring availability, setting up alerting, and taking a holistic view of system health
- Defining, gathering, and analyzing metrics from all systems (operating systems, applications, and PaaS such as Azure and AWS) to assist in fault finding and performance tuning
- Integrating monitoring, alerting, ticketing, and paging systems to provide instant information in case of an incident, with appropriate incident thresholds defined
- Actively proposing and implementing improvements to processes, access management, change management, resource utilization, and the service lifecycle
- Providing operational support for production issues: engaging Support and Product Teams and triaging P1 production incidents
- Demand forecasting and capacity planning: ensuring a proper capacity-to-cost ratio and efficient running of services, avoiding over- and under-provisioning
- Defining SLIs, SLOs, and SLAs, and building dashboards to observe metrics
- Driving or participating in postmortems
- Leveraging automation to perform operations to scale with load and for menial tasks (toil)
- Building critical paths of products built on the platform in cooperation with Product Teams
- Mapping critical paths to logical and physical resources in cooperation with architects and DevOps
- Partnering with Product Teams in capacity planning
- Cooperating with architects and development teams to influence service design and the permanent resolution of defects, in line with architectural principles and development practices
Your skills
- Bachelor’s degree in computer science, IT, engineering, system analysis, or a related field (or equivalent or proven experience)
- 7 years of experience in IT industry development
- 3 years of experience in support or maintenance of a microservices platform
- Proven experience with setting up monitoring / alerting and reliability engineering
- Excellent communication, analytical, planning, organizational, and technical skills
- Motivated and driven by achieving long-term business outcomes
- Ability to speak English at C1 level
- Experience in Azure
- Good understanding of product management, agile principles, and development methodologies
- A proactive approach to identifying problems, performance bottlenecks, and areas for improvement
- Technical knowledge of areas such as microservice architecture, Java development, reliability patterns, SQL and NoSQL databases, CI / CD, Azure, TDD, BDD, etc.
Job no. 240614-PO7G3