




Summary: Lead SRE efforts to ensure system reliability, scalability, and performance through observability, automation, incident response, and team mentorship.

Highlights:

1. Lead SRE strategy and critical service reliability roadmaps
2. Drive automation, performance tuning, and scalability improvements
3. Mentor SRE team members and lead professional development

**Site Reliability Engineer (SRE) Lead: Job Description**

**Summary**

Lead SRE efforts to ensure the reliability, scalability, and performance of production systems by building observability, automation, and incident-response practices while mentoring SREs and partnering with engineering teams.

**Key responsibilities**

* Own reliability strategy: define SLOs/SLIs, error budgets, and reliability roadmaps for critical services.
* Lead on-call rotations, incident response, major-incident coordination, post-mortems, and remediation of systemic issues.
* Design and implement monitoring, alerting, logging, tracing, and observability tooling (Prometheus, Grafana, ELK/OpenSearch, Jaeger, Datadog).
* Build automation for provisioning, deployments, recovery, and self-healing (IaC, runbooks, playbooks, automation scripts).
* Drive capacity planning, performance tuning, and scalability improvements across services and infrastructure.
* Collaborate with engineering and product teams to improve service architecture, reduce toil, and embed SRE practices in development lifecycles.
* Manage incident metrics and runbooks, and run regular game days and DR exercises to validate preparedness.
* Lead hiring, onboarding, mentorship, and professional development for SRE team members; conduct performance reviews.
* Implement change management, release gating, and safe rollout patterns (feature flags, canary, progressive delivery).
* Oversee CI/CD reliability, deployment pipelines, and integration with platform tooling (Kubernetes, container runtimes, service mesh).
* Enforce security, compliance, and operational standards in collaboration with security and platform teams.
* Report reliability, uptime, and operational health metrics to engineering leadership and recommend investment priorities.
* Evaluate and adopt tooling and processes that improve observability, alert fidelity, and incident resolution time.

**Qualifications**

* Bachelor's degree in Computer Science or Engineering, or equivalent experience, preferred.
* 5+ years in SRE/DevOps/Operations, with 2+ years in a technical lead or team-lead role.
* Strong experience with cloud platforms (AWS/Azure/GCP), container orchestration (Kubernetes), and IaC (Terraform, CloudFormation).
* Deep familiarity with monitoring and observability stacks (Prometheus/Grafana, ELK/OpenSearch, Jaeger/OpenTelemetry, Datadog).
* Proficiency in scripting and automation (Python, Go, Bash, or similar) and CI/CD tooling (Jenkins, GitHub Actions, GitLab CI).
* Solid understanding of distributed systems, networking, storage, and security principles.
* Experience defining SLOs/SLIs, managing error budgets, and running post-mortems.
* Excellent troubleshooting, communication, and stakeholder-management skills; proven ability to lead during incidents.

**Preferred skills**

* Certifications: Kubernetes (CKA), cloud provider certifications (AWS/GCP/Azure), or SRE/DevOps-related certifications.
* Experience with service mesh (Istio/Linkerd), chaos engineering (Chaos Monkey, Gremlin), and advanced release strategies.
* Familiarity with FinOps and cloud cost optimization practices.
* Background in scaling global systems and multi-region architectures.

Job Types: Full-time, Permanent

Pay: QAR152.29 - QAR929.90 per hour

Work Location: On the road


