Lead Site Reliability Engineer (SRE)

QAR 181-363/day

Indeed

Full-time

Onsite

No experience limit

No degree limit

CC2J+22 Umm Salal Muhammed, Qatar

Description

Summary: Lead SRE function to ensure high availability, scalability, and performance of services, mentor SREs, and collaborate on reliable systems.

Highlights:

1. Lead SRE function for high availability and performance.
2. Mentor SREs and collaborate on reliable system design.
3. Architect and maintain scalable, highly available infrastructure.

**Lead Site Reliability Engineer - Job Description**

**Position overview**

Lead the SRE function to ensure high availability, scalability, and performance of services; mentor SREs and collaborate across engineering to design reliable systems and automate operations.

**Key responsibilities**

* Own reliability goals (SLOs/SLIs) and incident response processes; lead postmortems and corrective actions.
* Design, implement, and operate monitoring, alerting, and observability platforms (metrics, logs, tracing).
* Architect and maintain scalable, highly available infrastructure on cloud providers (AWS, GCP, Azure) and/or on-prem.
* Build and maintain CI/CD pipelines, infrastructure-as-code (Terraform, CloudFormation), and deployment automation.
* Develop automation for operational tasks, capacity planning, and disaster recovery.
* Lead incident management: triage, coordinate response, communicate status, and drive RCA.
* Mentor and grow SRE/ops engineers; set best practices for reliability, on-call, and runbooks.
* Collaborate with product and platform teams to design resilient architectures and perform reliability reviews.
* Implement cost optimization, security hardening, and compliance controls in operational workflows.
* Conduct performance tuning, chaos engineering experiments, and load testing to validate system resilience.
* Maintain and evolve service ownership, CI policies, and operational runbooks.

**Required qualifications**

* Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
* 7+ years in Site Reliability, DevOps, or systems engineering with at least 2 years in a senior/lead role.
* Strong experience with cloud platforms (AWS/GCP/Azure) and core services (compute, storage, networking).
* Expertise in infrastructure-as-code (Terraform, CloudFormation) and configuration management (Ansible, Puppet, Chef).
* Proficiency with containerization and orchestration (Docker, Kubernetes) and service mesh technologies.
* Deep knowledge of observability tooling (Prometheus, Grafana, ELK/EFK, Jaeger, Datadog, New Relic).
* Solid scripting and programming skills (Python, Go, Bash) for automation.
* Proven incident management and postmortem experience; familiarity with SRE practices (Error Budgets, SLOs).
* Experience with CI/CD systems (Jenkins, GitLab CI, GitHub Actions) and Git workflows.
* Strong leadership, communication, and cross-team collaboration skills.

**Preferred qualifications**

* Experience leading distributed teams and mentoring engineers.
* Background in security engineering, compliance (SOC2, ISO), and infrastructure cost management.
* Familiarity with service meshes (Istio, Linkerd), observability at scale, and edge/CDN architectures.
* Experience with large-scale databases, caching systems, and message queues (Postgres, Cassandra, Redis, Kafka).
* Certifications: AWS Professional, GCP Professional, or Kubernetes (CKA/CKAD).

Job Types: Full-time, Permanent

Pay: QAR181.88 - QAR363.77 per hour

Work Location: In person

Source: Indeed