




Summary: The Site Reliability Engineer ensures reliability, scalability, and performance of production systems through automation, incident response, and observability.

Highlights:

1. Ensure reliability, scalability, and performance of production systems.
2. Design, build, and maintain production infrastructure and automation.
3. Lead incident response and drive corrective actions for system reliability.

**Site Reliability Engineer (SRE): Job Description**

**Overview**

* Ensure reliability, scalability, and performance of production systems by applying software engineering to operations, building automation, and improving incident response and observability.

**Key Responsibilities**

* Design, build, and maintain production infrastructure, automation, and platform tooling to reduce toil and improve reliability.
* Define and track SLOs/SLIs, measure error budgets, and take remediation actions to meet availability targets.
* Implement and maintain CI/CD pipelines, deployment automation, and release strategies (blue/green, canary).
* Build monitoring, logging, tracing, and alerting systems; create dashboards and runbooks for on-call teams.
* Lead incident response, coordinate post-incident reviews (RCA/blameless postmortems), and drive corrective actions.
* Perform capacity planning, performance tuning, and resource optimization for services and infrastructure.
* Manage and operate container orchestration platforms (Kubernetes/EKS/GKE/AKS) and supporting services.
* Automate provisioning and configuration using IaC (Terraform, CloudFormation, Ansible) and manage secrets/configuration securely.
* Implement fault-tolerant architectures, disaster recovery, backup strategies, and multi-region designs.
* Collaborate with developers to improve observability, reliability, and operational readiness of services.
* Harden systems for security and compliance; implement patching, vulnerability scanning, and access controls.
* Mentor engineering teams on reliability best practices and contribute to SRE culture and tooling.

**Required Skills & Qualifications**

* 3–6+ years of experience in SRE, DevOps, or production operations engineering (adjust per level).
* Strong experience with cloud platforms (AWS, GCP, Azure) and managed services.
* Proficiency with containerization and orchestration (Docker, Kubernetes) and related tooling (Helm; Istio/Linkerd optional).
* Experience with infrastructure-as-code (Terraform, CloudFormation) and configuration management.
* Strong scripting/programming skills (Python, Go, Bash) for automation and tooling.
* Familiarity with observability stacks (Prometheus, Grafana, Datadog, ELK/OpenSearch, Jaeger/Zipkin).
* Deep understanding of networking, load balancing, storage, and OS internals (Linux).
* Experience implementing CI/CD (GitHub Actions, Jenkins, GitLab CI) and release automation.
* Proven incident management experience and ability to work under pressure.
* Strong collaboration, communication, and documentation skills.

**Preferred**

* Experience defining SLO/SLA frameworks and driving organization-wide adoption.
* Background in distributed systems, large-scale production services, or platform engineering.
* Experience with chaos engineering, fault injection, or resilience testing.
* Familiarity with policy-as-code (OPA, Sentinel), service meshes, and GitOps workflows.
* Certifications (CKA, AWS/Azure/GCP certs) or contributions to open-source SRE tooling.

Pay: QAR15,321.44 - QAR22,214.09 per month

Work Location: In person


