




Summary: Lead Site Reliability Engineer responsible for defining SRE strategy, building platform capabilities, and driving culture to improve uptime and scalability of distributed systems.

Highlights:

1. Lead design, implementation, and operation of highly available systems.
2. Define and own SLOs/SLIs, monitoring, and alerting strategies.
3. Mentor and grow SRE and platform engineers; lead hiring and development.

**Lead Site Reliability Engineer (SRE) - Job Description**

**Role summary**

Lead reliability, scalability, and operability of distributed systems by defining SRE strategy, building platform capabilities, and driving culture and processes that reduce toil and improve uptime.

**Key responsibilities**

* Lead design, implementation, and operation of highly available, scalable production systems across cloud and on-prem environments.
* Define and own SLOs/SLIs, error budgets, monitoring, and alerting strategies; drive SLI/SLO adoption across teams.
* Lead incident response, post-incident reviews, root-cause analysis, and remediation; implement preventative measures.
* Build and maintain observability stacks (metrics, logs, tracing) and dashboards (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
* Architect and operate CI/CD and deployment platforms (Argo CD, Spinnaker, GitHub Actions, GitLab CI) enabling safe, automated rollouts (canary, blue/green, feature flags).
* Design, implement, and maintain self-service platform tooling for developers (Kubernetes/EKS/GKE/AKS, service meshes, operators).
* Drive Infrastructure as Code practices (Terraform, Pulumi, CloudFormation) and manage infrastructure lifecycle, drift detection, and compliance.
* Automate operational runbooks, remediation, capacity planning, and routine maintenance to minimize manual toil.
* Own reliability-related security practices: secrets management, IAM, network policies, vulnerability scanning, and secure configurations.
* Mentor and grow SRE and platform engineers; lead hiring, performance reviews, and career development.
* Partner with engineering, product, and security teams to influence design decisions for fault tolerance and operability.
* Manage on-call rotations and escalation policies, and ensure adequate coverage; coordinate across teams during major incidents.
* Drive cost optimization, visibility into cloud spend, and capacity forecasting.

**Required qualifications**

* 7+ years in site reliability, platform, or DevOps engineering roles with progressive leadership responsibility.
* Proven experience operating production distributed systems at scale on at least one major cloud provider (AWS, GCP, or Azure).
* Deep expertise with Kubernetes and container ecosystems; experience running large clusters and multi-cluster environments.
* Strong IaC experience (Terraform required; CloudFormation/Pulumi a plus).
* Extensive experience with observability tooling (Prometheus, Grafana, ELK/EFK, OpenTelemetry) and incident management platforms (PagerDuty, Opsgenie).
* Solid software engineering skills (Python, Go, or similar) for automation, tooling, and reliability engineering.
* Demonstrated experience setting and enforcing SLOs/SLIs and reducing MTTR through engineering practices.
* Experience with CI/CD systems and deployment strategies (Argo CD, Spinnaker, Flux, GitOps).
* Strong systems, networking, and security fundamentals.
* Excellent leadership, communication, and stakeholder management skills; proven ability to influence across orgs.
* Experience mentoring engineers and leading cross-functional initiatives.

Job Types: Full-time, Permanent

Pay: QAR23.71 - QAR86.45 per hour

Expected hours: 40 per week

Work Location: In person


