Introduction
A career in IBM Software means you'll be part of a team that transforms our customers' challenges into solutions.
Seeking new possibilities and always staying curious, we are a team dedicated to creating the world's leading AI-powered, cloud-native software solutions for our customers. Our renowned legacy creates endless global opportunities for our IBMers, so the door is always open for those who want to grow their careers.
IBM's product and technology landscape includes Research, Software, and Infrastructure. Entering this domain positions you at the heart of IBM, where growth and innovation thrive.
Your Role And Responsibilities
As a Site Reliability Engineering (SRE) manager in the IBM Software organization, you will manage and lead a team of SRE engineers. Responsibilities include ensuring the reliability, scalability, and operational efficiency of IBM Asset Lifecycle Management services. You will hire, train, and mentor team members, assign tasks, set goals, and conduct performance evaluations. You will work closely with development teams, SRE peers, and engineering managers to automate infrastructure management, optimize system performance, and enhance monitoring capabilities. Overall, an SRE manager plays a crucial role in aligning engineering and operations to achieve reliable software systems. You will combine technical expertise with leadership and management skills to drive continuous improvement and ensure high-quality service delivery.
Key Responsibilities
Leadership
- Provide strategic guidance to engineering teams on architectural decisions and directions.
- Empower teams to achieve technical excellence, with a focus on reliability, scalability, and simplicity.
- Foster collaboration across engineering, product, and other cross-functional teams to deliver optimal solutions.
Monitoring & Observability
- Design and implement monitoring solutions to gain insights into system health, performance, and reliability.
- Build and maintain intuitive dashboards for real-time visibility into critical system metrics.
- Set up proactive alerting mechanisms to detect and resolve issues before they impact end users.
Incident Management
- Lead incident response, performing root cause analysis (RCA) and implementing long-term fixes to improve system resilience.
- Build observability solutions with monitoring, logging, and alerting using tools like Prometheus, Grafana, and Instana.
- Define and monitor Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure service reliability.
Security & Compliance
- Ensure compliance with security best practices and regulatory requirements across all infrastructure components.
- Implement secret management, encryption, and access control for sensitive systems and data.
- Participate in security audits, vulnerability assessments, and compliance automation efforts.
Cross-Team Collaboration & DevOps Culture
- Collaborate closely with development, operations, and security teams to design and implement resilient architectures.
- Promote SRE best practices, such as blameless postmortems, incident retrospectives, and operational readiness reviews.
- Mentor junior engineers and contribute to knowledge sharing across teams to build a strong SRE culture.
Preferred Education
Bachelor's Degree
Required Technical And Professional Expertise
- Bachelor's degree in computer science engineering / information technology.
- 5+ years of experience working in global organizations with the ability to effectively communicate with executives, leaders, and individual contributors across the organization.
- 5+ years of SRE experience working with telemetry, observation, self-healing solutions, and platform automation.
- Cloud & Infrastructure: Expertise in Kubernetes, OpenShift, Docker, IBM Cloud, and other cloud platforms.