56 Site Reliability Engineer jobs in Egypt

Site Reliability Engineer

EGP120000 - EGP250000 Y The Coca-Cola Company

Posted today

Job Description

Egypt • Bulgaria • Greece

Information Technology

Hybrid

Experienced Professionals

Department: Digital Factory, Digital & Technology Platform Services.

We are seeking a Site Reliability Engineer (SRE) to join our Integration Factory team. This role is pivotal in ensuring the reliability, scalability, and performance of our integration platforms and services. You will work at the intersection of software engineering and operations, focusing on performance and availability, automation, observability, and continuous improvement of our integration services (e.g. problem management, reducing user-created incidents, and reducing MTTR to 48 hours or less).

YOUR KEY RESPONSIBILITIES:

Maintain and enhance the reliability and availability of integration platforms (e.g., API gateways, message brokers, ETL pipelines).
Design and implement monitoring, logging, alerting, and observability to ensure system health and performance.
Contribute to integration design by defining monitoring and end-to-end observability requirements.
Automate deployment, scaling, and recovery processes using Infrastructure as Code (IaC) and CI/CD pipelines.
Collaborate with API & Event consumers, the integration product manager, the integration development team, and integration architects to ensure best practices and continuous improvement in system design and deployment (e.g. feature prioritization).
Troubleshoot and resolve incidents in production environments, performing root cause analysis and postmortems.
Define and track Performance & Availability, Service Level & Operating Level Agreements (SLA, OLA), Mean-Time-To-Resolve (MTTR) and customer and peer satisfaction (NPS, P4G).
Continuously improve system resilience, fault tolerance, and recovery strategies.
Work closely with the integration support team to ensure accurate reporting and effective incident handling.
Work with the automated testing and observability teams to validate that monitoring points effectively detect and report issues.
Determine which dashboards to create in the observability platform.

ARE THESE YOUR SECRET INGREDIENTS?

Required:

o A passion for creating robust, scalable platforms that accelerate innovation.

o Bachelor's degree in computer science, engineering, a related technical field, or equivalent practical experience.

o 5+ years of experience in Site Reliability Engineering, DevOps, and/or similar roles (e.g. Level 2 and 3 Operations Engineer, Integration Development).

o Strong understanding of core integration design principles and patterns (REST, GraphQL), authentication methods (OAuth, API Keys), and data formats (JSON, XML).

o Proficiency in scripting and automation (e.g., Python, Bash, Terraform, and Ansible).

o Experience with cloud platforms (e.g. Azure).

o Familiarity with monitoring and observability tools (e.g. Dynatrace).

o Solid understanding of CI/CD pipelines (e.g. Azure DevOps), containerization (Docker), and orchestration (Kubernetes).

o Exceptional communication and influencing skills, with a demonstrated ability to lead by consensus and drive standardization across multiple teams.

o Strong analytical and problem-solving skills, with a data-driven approach to decision-making.

o Good proficiency in English as a day-to-day business language is a must

Preferred:

o Previous experience as a software developer, solutions architect, or in a similar technical role.

o Hands-on experience with enterprise integration platforms and technologies.

o Specific experience with SAP integration tools such as SAP BTP Integration Suite, SAP BTP API-M, OData services, APIs, iDocs and RFCs.

o Familiarity with event-driven architecture and streaming platforms like Apache Kafka.

o Experience with cloud-based API management services, particularly Azure API Management.

o Experience with agile development methodologies (e.g., SAFe, Scrum, Kanban).

o Familiarity with DevOps practices and tools.

o Experience with Azure, D365, SAP S/4HANA and SAP MDG. SAP and MS Azure Certifications are a plus.

o Excellent leadership and communication skills.

o Strong problem-solving and decision-making abilities.

o Strong organizational and time-management skills.

o Ability to work in a fast-paced and dynamic environment.

o FMCG background or experience working with FMCG

ABOUT YOUR NEW TEAM:

We are Coca-Cola Hellenic, a growth-focused consumer goods business and strategic bottling partner of the Coca-Cola Company. We bottle, distribute and sell an unrivalled range of products in 29 markets in Europe, Africa and Eurasia. As we do, we create value for all stakeholders, support socio-economic growth and build a more positive environmental impact.

We bring together more than 30,000 people from over 70 nationalities, coming from five continents. The diversity of our markets, from mature to emerging economies, provides a wide range of attractive opportunities for growth.

We nurture our talents. We give opportunities to people across all functions and levels, as well as different geographies, backgrounds and education. We are willing to take a risk on the people we believe in, even if they don't have the perfect experience. We have faith in what every person can be.

And although we have so much to be proud of, we always stay humble. We believe the real magic happens – for us and for you – when we OPEN UP.

AT COCA-COLA HBC, DIVERSITY HELPS US THRIVE

At Coca-Cola HBC, we are an inclusive employer that thrives on diversity. This means our environment provides equal opportunities for all, regardless of race, color, religion, age, disability, sexual orientation, or gender identity. Join us in nurturing a culture where everyone belongs and contributes to our collective success.

Site Reliability Engineer

EGP120000 - EGP240000 Y Link Datacenter

Posted today

Job Description

Role Overview

A Senior DevOps / Site Reliability Engineer (SRE) to design, build, and maintain scalable infrastructure systems. The ideal candidate has deep expertise in Linux administration, container orchestration, CI/CD, and modern DevOps practices, with the ability to mentor junior team members and drive automation across environments.

Key Responsibilities

Lead the design, deployment, and administration of Linux-based infrastructure.
Architect, maintain, and optimize CI/CD pipelines for development and production workloads.
Build and manage containerized workloads using Docker and Kubernetes (including HA setups, storage, and networking).
Troubleshoot complex system, networking, and DNS-related issues across distributed systems.
Implement monitoring, logging, and alerting solutions to ensure system reliability and performance.
Automate operational tasks using scripting and Infrastructure as Code (Terraform, Ansible, etc.).
Collaborate with development and security teams to ensure best practices in system reliability and compliance.
Mentor and guide junior engineers in Linux administration, DevOps, and automation best practices.

Required Skills & Knowledge

5+ years of experience in Linux systems engineering/DevOps.
Strong expertise in Linux administration (performance tuning, kernel-level debugging, storage systems).
Advanced understanding of networking, DNS, load balancing, and firewalls.
Proven experience managing Docker and Kubernetes clusters (including upgrades, scaling, and troubleshooting).
Hands-on experience with CI/CD tools (Jenkins, GitLab CI, ArgoCD, etc.).
Strong automation skills (Bash, Python, Ansible, Terraform).
Knowledge of security best practices in systems, containers, and networks.
Ability to design resilient, highly available infrastructure systems.
Deep expertise in administration, tuning, backup, recovery, clustering, and replica sets for relational and NoSQL databases (MySQL, PostgreSQL, and MongoDB).

Senior Site Reliability Engineer

EGP120000 - EGP240000 Y Ericsson

Posted today

Job Description

Grow with us
As the Company at the forefront of the creation of the Mobile world, and with more than 60,000 patents to our name, we've made it our business to make a mark. Being part of Ericsson empowers you to learn, lead and perform at your best, shaping future technology. Ericsson is an inclusive employer where you are recognized for the skills, talent, and perspective you bring to the team.

Within the Solution Area Cognitive Networks Solutions Software R&D (SA CNS SW R&D), we offer the opportunity to collaborate with highly qualified Global teams and to enable success stories for our customers. You will be exposed to groundbreaking technology (5G, ML/AI, Automation and Cloud computing) and support the delivery of multiple Data-Intensive projects from our customer base and internal requirements. You must have the ability to perform hands-on Cloud and Data engineering tasks independently coupled with the appropriate testing. Are you ready to write the future with us?

Come, and be where it begins.

About this opportunity:
We are seeking a seasoned Senior/Tech Lead Site Reliability Engineer (SRE) to oversee the design, deployment, and maintenance of our cloud-native SaaS infrastructure on AWS, working with Ericsson R&D Global Teams. This position demands in-depth expertise in AWS technologies (including Fargate and App Runner), Terraform, Python (AWS SDK), Helm, and GitOps. The individual will lead a dedicated SRE team to establish best practices, optimize service reliability, and continuously improve security and performance across a multi-tenant SaaS environment.

What you will do:

Cloud-Native SaaS Architecture:
Architect, deploy, and manage multi-tenant SaaS Cognitive Solutions on AWS using AWS Services (e.g., IAM, S3, EKS, ECS, Fargate, App Runner, RedShift, SNS, SQS, EventBridge, Athena, SageMaker, Aurora, DynamoDB, Cognito, API Gateway, etc.) to build Microservices, Data Flows, Data Warehouse, and AI/ML models, emphasizing scalability, reliability, and cost efficiency.
Champion microservices, container orchestration, and serverless paradigms to ensure high availability and optimal performance.
Design, develop, and maintain SaaS Control Plane APIs that manage multi-tenancy in a cloud-based SaaS environment, drawing on experience in building scalable and secure APIs that enable efficient tenant management, access control, and resource provisioning.
Infrastructure as Code (IaC)
Develop and maintain infrastructure definitions using Terraform to enable reliable, automated, and repeatable deployments.
Collaborate with cross-functional teams to incorporate IaC principles into CI/CD pipelines, accelerating feature releases and minimizing downtime.
Site Reliability Engineering & Observability:
Define and track Service Level Indicators (SLIs) and Objectives (SLOs), establishing error budgets that align with organizational goals.
Implement robust observability solutions (e.g., AWS CloudWatch, CloudTrail, AWS Config, etc.) to proactively detect and resolve performance bottlenecks.
Containerization & Helm
Utilize Kubernetes (EKS) and Helm charts to package, configure, and deploy containerized applications efficiently.
Streamline container orchestration workflows, focusing on auto-scaling, upgrades, rollbacks, and enhanced service resiliency.
GitOps & Automation
Employ GitOps tools (Argo CD, Flux) to govern infrastructure and application deployments through declarative, version-controlled configurations.
Automate operational tasks using scripting languages (Python, Bash, PowerShell) and AWS SDK (boto3), improving developer productivity and reducing manual overhead.
DevSecOps & Compliance
Embed security best practices within the software development lifecycle, covering identity and access management (IAM), networking, VPC, encryption, and monitoring.
Ensure adherence to cloud compliance standards (SOC 2, HIPAA, GDPR, etc.), performing regular audits and vulnerability scans to maintain a robust security posture.
AI & Machine Learning Operations (MLOps)
Provide operational support for AI/ML models running on AWS, collaborating with data science teams to optimize performance and reliability.
Integrate MLOps methodologies into existing workflows, ensuring seamless model deployment, monitoring, and updates.
Performance & Cost Optimization
Conduct capacity planning, load testing, and performance tuning across AWS resources.
Leverage reserved instances, auto-scaling, and right-sizing strategies to balance reliability, performance, and cost effectiveness.
Incident Management & Continuous Improvement
Oversee on-call rotations and lead incident response, rapidly mitigating service disruptions and guiding root cause analysis.
Foster a culture of continuous improvement, refining operational processes and enhancing platform architecture to boost resilience.
Leadership & Mentorship
Manage and mentor a cross-functional SRE team, promoting a collaborative, results-driven environment and advancing professional growth.
Collaborate with product owners, development teams, and stakeholders to align SRE priorities with broader business objectives.

You Will Bring

Education: Bachelor's degree in Computer Science, Computer Engineering, or a related field.
Experience
Overall Software Development: 6+ years of professional experience in software development.
Site Reliability Engineering: 3+ years of dedicated SRE experience with a primary focus on AWS cloud services and infrastructure.
Technical Expertise
Cloud Computing Concepts: Deep understanding of virtualization, networking, and storage in public cloud environments.
AWS Proficiency: Demonstrated ability to manage, operate, and secure AWS services (e.g., IAM, S3, EKS, ECS, Fargate, App Runner, RedShift, SNS, SQS, EventBridge, Athena, SageMaker, Aurora, DynamoDB, Cognito, API Gateway, etc.).
AWS for AI/ML: Hands-on support of AI/ML model operations on AWS, collaborating with data science teams and optimizing ML workloads.
Kubernetes & Container Management: Proven experience with Kubernetes (preferably EKS) for container orchestration, including deploying and maintaining production workloads.
Helm Package Management: Skilled in creating and managing Helm charts for Kubernetes-based applications.
IaC Frameworks: Proficiency in Terraform and Burrito (if applicable), ensuring production-grade, scalable infrastructure definitions.
Scripting & Automation: Advanced skills in Python (including AWS SDK/boto3), Bash, and/or PowerShell for automating cloud operations.
DevSecOps & GitOps: Hands-on experience integrating security best practices into CI/CD pipelines, leveraging GitOps tools (Argo CD, Flux) for declarative deployments.
MLOps: Working knowledge of machine learning lifecycle management, ensuring robust and efficient AI/ML model deployments.
Linux Administration: Strong background in Linux system management, performance tuning, and troubleshooting.
Networking: Expertise in VPNs, firewalls, routing, switching, DNS, load balancers, and related security considerations.
Monitoring & Observability: Proficiency with one or more monitoring solutions (Datadog, Prometheus, Grafana, CloudWatch) to drive proactive incident response.
Security & Compliance: In-depth familiarity with SOC 2, HIPAA, GDPR, and best practices around IAM, encryption, and network segmentation.
Problem-Solving & Communication: Demonstrated strength in diagnosing complex technical issues and effectively communicating solutions to varied stakeholders.
Certifications
AWS Certifications: AWS Certified Solutions Architect (Associate/Professional), AWS Certified DevOps Engineer – Professional, or other relevant certifications.
Additional certifications in GCP, Azure, security (CISSP, CISM) are considered advantageous.
Additional Desirable Qualifications
Other Cloud Environments: Exposure to Azure or further GCP services beyond AI/ML is beneficial.
Advanced Programming/Scripting: Experience in Python, Go or other modern languages is a plus.
Team Leadership: Demonstrated success in building and leading cross-functional teams, including performance management and strategic planning.
Non-technical skills:
Be inspired by the needs of fast-changing environments.
Happy to work within distributed teams.
Coordinate with software, DevSecOps, and domain experts.
Proactive & team player.
Excellent oral and written communication skills.

Senior Site Reliability Engineer

EGP90000 - EGP120000 Y Sana Commerce

Posted today

Job Description

Company Description
At Sana Commerce, we're committed to creating an inclusive environment because we know our diverse workforce is one of our greatest strengths.
What started in 2007 with a pizza and a plan has grown into a fast-moving SaaS company that helps manufacturers, distributors, and wholesalers thrive in B2B commerce complexity.

Our mission? To transform the way businesses buy and sell, so they can grow, build stronger relationships, and make the most of digital commerce. Join us and take ownership of your career in a dynamic, fast-moving environment.

At Sana Commerce, we're looking for a Senior Site Reliability Engineer to strengthen our reliability, observability, and automation capabilities across our Azure and Kubernetes-based platforms. This role blends hands-on operational excellence with engineering practices, ensuring uptime today while building the systems that make tomorrow more resilient.

This SRE position focuses on engineering reliability in everything we do: automating repetitive tasks, improving monitoring signals, running deep root cause analysis, and shaping systems for scalability. You'll be the engineer others look to during critical incidents, and the one raising the bar on how we prevent them in the first place.

What you'll get:

The opportunity to make an impact at a fast-growing SaaS scale-up;
A global and customized onboarding program (rated 9.1/10 by previous hires);
A hybrid working model – 3 days from the office, 2 days from home.

Job Description
What you'll be doing

Lead incident response and root cause analysis by driving deep investigations, educating the team, and delivering actionable post-incident insights that prevent recurrence.
Manage Kubernetes and Azure environments by owning cluster configurations, platform usage, and ensuring availability, cost efficiency, and security best practices.
Develop observability and monitoring strategies with Dynatrace, Honeycomb, ElasticSearch, Kibana/Grafana, and Azure Monitor to measure performance, user impact, and continuously refine alerts and dashboards.
Implement and maintain edge and CDN integrations (Fastly WAF, bot management, CDN) to enhance performance, security, and reliability of customer-facing services.
Write and debug automation scripts in PowerShell, Bash, Python, or C#, ensuring logging, rollback, and versioning practices make the platform more resilient and self-healing.
Drive Infrastructure-as-Code adoption with Terraform, Bicep, and ARM to standardize environments, automate deployments, and reduce manual interventions.
Optimize system and application performance through deep monitoring, dump analysis, and right-sizing of resources to eliminate bottlenecks and maximize efficiency.
Collaborate across teams to break down complex problems, contribute to CI/CD and SDLC improvements, and embed reliability into development and release pipelines.
Participate in the on-call rotation by taking ownership of incidents, coordinating responses, and ensuring sustainable fixes rather than temporary workarounds.

Qualifications
What you bring

8+ years of experience in SRE, DevOps, or Cloud Infrastructure, with demonstrated ownership of large-scale systems.
Strong hands-on knowledge of Microsoft Azure services and practical experience operating Azure Kubernetes clusters in production.
Expertise in Dynatrace, Honeycomb, ElasticSearch, Kibana/Grafana, Azure Monitor (KQL). Able to design actionable monitoring that leads to prevention, not just detection.
Proficient in at least one programming/scripting language (PowerShell, Bash, Python, or C#). Strong debugging and logging practices.
Hands-on experience with Infrastructure-as-Code (Terraform, Bicep, or ARM) to automate and manage cloud infrastructure.
Solid understanding of TCP/IP protocols and troubleshooting network issues in distributed systems.
Ability to go beyond surface fixes, identify patterns, and engineer permanent improvements.
Strong communicator who can work with cross-functional teams and explain complex issues simply.
Microsoft Certified: Azure Administrator Associate
CKA: Certified Kubernetes Administrator

Who we are:
So, what does it mean to be a part of the Sana Commerce team?

At Sana Commerce, our values guide how we work, collaborate, and drive success.

Champions of Our League. "We deliver lasting success, balancing quick wins and long-term value." We take pride in our unique product and extensive B2B knowledge and continuously strive to improve. No matter our role, we bring value every day, helping our customers and partners succeed.
Supercharge Our Customers. "We're revolutionizing B2B commerce together, helping our customers to lead and succeed." Our customers are at the heart of everything we do. We go beyond solutions, providing the tools and support they need to grow.
Determined to Grow. "We embrace challenges, growing and raising the bar for ourselves and our industry." We take on challenges, seek feedback, and keep learning. Every setback is a chance to improve and move forward.
Bold Together. "We dare to be bold because we have each other's back." We collaborate across teams and time zones, challenge the status quo, and support each other to achieve the best outcomes.

Job descriptions can be tough to interpret. Even if you may not tick all the boxes, we strongly encourage you to apply if you still feel like you are a great match for this role. Please explain your motivation for the role of Data Engineer (AI/ML) in a cover letter.

Senior Site Reliability Engineer

EGP90000 - EGP120000 Y Smooth Professional

Posted today

Job Description

We are seeking a seasoned Senior Site Reliability Engineer to enhance the resilience, performance, and scalability of our cloud-based platforms. The ideal candidate will combine strong operational skills with engineering discipline, automating tasks, improving monitoring, and owning incident response.

Key Responsibilities

Lead and manage incident response processes: root-cause analysis, post-incident reviews, and ensuring preventative actions
Operate and maintain Kubernetes clusters and Azure cloud environments, focusing on reliability, security, and cost‐efficiency
Design, implement, and evolve observability/monitoring systems and dashboards; refine alert thresholds and metrics for better insight into performance and failure modes
Develop automation scripts (e.g. in Bash, PowerShell, Python, or C#) for operational tasks, with logging, rollback, version control
Use Infrastructure-as-Code (IaC) tools (e.g. Terraform, ARM templates, Bicep) to standardize and automate infrastructure provisioning and changes
Optimize system and application performance: resource sizing, bottleneck identification, performance tuning
Integrate CDN, WAF or edge caching, bot protection to improve system security and performance
Collaborate across teams (development, product, operations) to embed reliability in architecture, CI/CD pipelines, and deployment processes
Participate in on-call rotation and ensure sustainable, long term fixes rather than temporary patches

Required Qualifications & Skills

8+ years of experience in infrastructure, site reliability engineering (SRE), DevOps or cloud operations roles with demonstrable ownership of large-scale systems
Strong hands-on experience with Microsoft Azure services and operating Kubernetes in production
Proficiency with observability tools (e.g. Dynatrace, Honeycomb, ElasticSearch/Kibana/Grafana, or similar); ability to design meaningful SLIs/SLOs/alerts
Programming/scripting proficiency in one or more languages (PowerShell, Bash, Python, C#, etc.) with good practices (logging, versioning, error handling)
Expertise in Infrastructure-as-Code to automate provisioning/deployment and reduce manual operations
Deep understanding of distributed systems, networking, performance optimization
Strong communication skills; ability to explain technical issues and collaborate with cross-functional teams

Nice to Have

Certifications relevant to Azure or Kubernetes (e.g. Azure Administrator, Certified Kubernetes Administrator)
Experience with CDN / WAF integrations and edge architectures
Experience in bot management, security hardening, or securing distributed systems
Experience working in SaaS environments or with high-availability SLAs

Senior, Site Reliability Engineer

EGP120000 - EGP240000 Y Mrsool

Posted today

Job Description

Who Are We

Welcome to the world of Mrsool, where on-demand delivery meets unparalleled user needs to deliver anything you desire. As one of the largest delivery platforms in the Middle East and North Africa (MENA) region, Mrsool has captivated users with its unique and seamless experience, earning it the highest ratings among all major delivery platforms on both Apple's App Store and Google's Play Store.

What sets Mrsool apart is its commitment to providing an unmatched "order anything from anywhere" experience. This extraordinary feat is made possible by our extensive fleet of dedicated on-demand couriers. With their unwavering dedication, they ensure that your desired items reach your doorstep, no matter where you are.

Whether it's a late-night craving, a forgotten item, or a special gift for a loved one, Mrsool is here to deliver, quite literally. We take pride in the convenience we offer, empowering you to get what you need when you need it, all at the tap of a button.

The Job in a Nutshell

We are looking for a highly skilled Senior Site Reliability Engineer to ensure the reliability, scalability, and performance of our systems. The ideal candidate brings deep expertise in AWS, Kubernetes, and modern cloud infrastructure, along with strong problem-solving skills and a proactive approach to improving system resilience and automation.

If you're eager to take on this rewarding opportunity, we'd love to hear from you. Apply today.

What You Will Do

Develop and maintain monitoring and alerting systems to proactively identify and address issues.
Troubleshoot and escalate production incidents to minimize downtime and improve system reliability.
Continuously improve our infrastructure and processes to optimize scalability and efficiency.
Participate in and take ownership of on-call rotations as needed to ensure 24/7 support for our application.
Perform routine maintenance and upgrades as needed to keep our systems up to date.
Contribute to ongoing efforts to improve our security posture and compliance with industry standards.
Communicate complex technical concepts clearly and concisely to both technical and non-technical stakeholders in order to make the right decision.
Mentor and coach junior engineers, fostering their professional growth and enabling them to deliver high-quality work.
Stay up-to-date with the latest advancements and trends in site reliability engineering and share knowledge and insights with the team.
Identify opportunities for organizational enhancements and propose alternatives to optimize team structures and execution.
Collaborate with development teams to design and implement automated deployment and testing pipelines.
Collaborate with development teams to design and implement scalable infrastructure.

Requirements
What Are We Looking For

Bachelor's degree in Computer Engineering, Computer Science, or related field.
5+ years of experience in a similar role, preferably with experience in a high-traffic, high-availability environment.
Proficiency in at least one programming language (Python, Ruby, Java, Go, etc.).
Strong understanding of cloud infrastructure and related technologies (AWS, GCP, Azure, Kubernetes, Docker, etc.)
Excellent troubleshooting and problem-solving skills.
Experience with one or more automation and configuration management tools (Chef, Ansible, Puppet, Terraform, etc.).
Familiarity with monitoring and alerting tools (Prometheus, Grafana, Nagios, etc.)
Strong communication and interpersonal skills, enabling effective collaboration with cross-functional teams.
Ability to navigate ambiguity, set clear expectations, and thrive in a fast-paced, dynamic environment.
A strong grasp of computer science fundamentals when it comes to dealing with distributed systems and networks.

Benefits
What We Offer You

Inclusive and Diverse Environment: We foster an inclusive and diverse workplace that values innovation and provides flexibility.
Competitive Compensation: Our compensation packages are competitive and include potential share options. Additionally, you will benefit from a performance-based commission/ incentive structure, rewarding your achievements.
Personal Growth and Development: We are committed to your professional development, offering regular training and an annual learning stipend to help you advance your career in a fast-paced, dynamic environment.
Autonomy and Mentorship: You'll enjoy a degree of autonomy in your role, supported by mentorship and ambitious goals that drive both your personal success and the company's growth.

Senior Site Reliability Engineer

EGP90000 - EGP120000 Y Trella

Posted today

Job Description

About Us
Ready to change the world? We're reinventing freight and logistics at Trella. We're backed by a number of leading VC companies (YC, Maersk Growth, Algebra Ventures and Raed Ventures), and we're looking for the best talent out there to help us build and scale our product offering. We aspire to create a step-change in the industry and we want you to be a part of the journey.

We are innovative problem-solvers on this adventure together. Working at Trella means that you'll be surrounded by colleagues who are constantly pushing boundaries, thinking ahead, and meeting the high standards we set for ourselves. When we build, we do so in a product-led way: we value our customer experience and scalability, and we prioritize how we build our product accordingly.

Our Purpose
At Trella, our Vision is to Empower our Communities to move Economies Forward, and we're doing this by building a digital experience that provides our Shippers, Carriers and Teams with the right technology and platform that reduces the costs of moving goods. Simply, we're trying to disrupt and reinvent trucking, and empower our economies. We have launched from Egypt to Saudi Arabia, Pakistan and UAE, and are looking to build and expand our footprint across the MENA-P region.

What You'll Do:

Lead and collaborate with Engineering teams to build an SRE culture and maintain development, staging, and production systems.
Manage infrastructure reliability, scalability, and security using SRE principles, including setting up SLOs, tracking error budgets, Production Readiness Reviews (PRR), etc.
Automate infrastructure provisioning using Terraform / CloudFormation.
Automate IT infrastructure tasks using Chef, Puppet, Ansible, etc.
Proactively monitor and optimize infrastructure costs.
Maintain and improve API gateways, web servers, cache and database configurations, CI/CD pipelines, etc.
Maintain and monitor the deployment and orchestration of servers, Docker containers, databases, and general backend infrastructure.
Participate in on-call rotations, take an active part in resolving incidents, and write production incident reports.
Hire, mentor and coach junior team members.

What You'll Need:

BS/MS in Computer Science, IT or related technical field with 6+ years of relevant experience.
Expert programming and scripting skills, preferably Bash and shell scripting.
Expert in computer science foundations such as operating systems, computer networks, and OWASP principles.
Excellent system debugging skills and hands-on experience optimizing database configurations and queries for Postgres.
Expert in at least one web server such as HAProxy, nginx, Apache, etc.
Experience working with established cloud platforms like AWS, Azure or Google Cloud.
Experience building and setting up reliable CI/CD pipelines, observability platforms, and monitoring and alerting tools.
Experience using Linux systems and command-line system administration.
Experience with automation tools like Chef, Puppet, Ansible, etc.
Experience with a Kubernetes platform such as EKS.

What We Offer

Hybrid work model with flexible working hours.
The experience of working in one of Forbes Middle East's top 50 most funded start-ups in MENA.
Annual performance review.
Flexible leave policy that supports your work-life balance and personal needs.
Development opportunities in a rapidly growing multinational company.
Early payday option, allowing you to access your earnings sooner and manage expenses and financial planning with greater ease.
Supporting our colleagues to build and grow themselves through Learning & Development initiatives.

Senior Site Reliability Engineer II

EGP90000 - EGP120000 Y Careem

Posted today

Job Description

Careem is building the Everything App for the greater Middle East, making it easier than ever to move around, order food and groceries, manage payments, and more. Careem is led by a powerful purpose to simplify and improve the lives of people and build an awesome organisation that inspires. Since 2012, Careem has created earnings for over 2.5 million Captains, simplified the lives of over 70 million customers, and built a platform for the region's best talent to thrive and for entrepreneurs to scale their businesses. Careem operates in over 70 cities across 10 countries, from Morocco to Pakistan.

Why Join Us?

At Careem, you'll:

Work with one of the region's most advanced engineering platforms.
Solve real-world challenges at scale impacting millions of users.
Learn and grow with a high-performance team.
Contribute to AI-integrated infrastructure that supports both traditional services and next-gen intelligent agents.

About the Role

As an SRE Engineer (L10) on the Storage & Infrastructure team, you'll focus on building, scaling, and automating our core data services. You'll work with a range of distributed systems including MySQL, Postgres, Kafka, Cassandra, Redis, and OpenSearch, while also supporting emerging workloads like vector databases, embedding stores, and LLM query caches. This role blends operational excellence with an opportunity to support AI-driven use cases like retrieval-augmented generation (RAG), agent memory systems, and AI observability tooling.

What You'll Do

Deploy, scale, and maintain cloud-native data systems on AWS.
Automate storage operations using IaC (Terraform, Pulumi, etc.).
Support AI-related infrastructure (e.g. Milvus, Weaviate, or Pinecone).
Collaborate with ML engineers and platform teams to support LLM-powered services.
Optimize and monitor performance across services using Prometheus, Grafana, OpenTelemetry, etc.
Participate in on-call rotations and contribute to post-incident reviews.
Help design secure, scalable environments that are AI-ready and cost-efficient.

You'll Thrive If You Have

5–8 years of experience operating distributed systems at scale.
Proficiency with one or more languages (e.g. Go, Python, Bash).
Strong understanding of cloud infrastructure (preferably AWS).
Experience with IaC and CI/CD pipelines.
Familiarity with Kafka, Redis, Cassandra, or similar systems.
Exposure to AI infrastructure (bonus): vector stores, model serving platforms (e.g. Ray, LangChain, LlamaIndex).
Curiosity to learn about integrating infrastructure with AI agents and LLM-based applications.

What we'll provide you

We offer colleagues the opportunity to drive impact in the region while they learn and grow. As a full time Careem colleague, you will be able to:

Work and learn from great minds by joining a community of inspiring colleagues.
A chance to help shape the future of AI-read
Put your passion to work in a purposeful organisation dedicated to creating impact in a region with a lot of untapped potential.
Explore new opportunities to learn and grow every day.
Work 4 days a week in office & 1 day from home, and remotely from any country in the world for 30 days a year with unlimited vacation days per year. (If you are in an individual contributor role in tech, you will have 2 office days a week and 3 to work from home.)
Access to healthcare benefits and fitness reimbursements for health activities including gym, health club, and training classes.

Site Reliability Engineer/ Expert/ Specialist

EGP90000 - EGP120000 Y SITA

Posted today

Job Description

Overview
WELCOME TO SITA
We're the team that keeps airports moving, airlines flying smoothly, and borders open. Our tech and communication innovations are the secret behind the success of the world's air travel industry.

You'll find us at 95% of international hubs. We partner closely with over 2,500 transportation and government clients, each with their own unique needs and challenges. Our goal is to find fresh solutions and cutting-edge tech to make their operations run like clockwork. Want to be a part of something big?

Are you ready to love your job? The adventure begins right here, with you, at SITA.

About The Role & Team
The Site Reliability Engineer is responsible for the proactive support of products to ensure high product performance, with a continuous focus on improvement. The role involves identifying and resolving the root causes of operational incidents, implementing solutions to enhance stability, and preventing recurrence.

The Site Reliability Engineer manages the creation and maintenance of the event catalogue to trigger events and develops both manual remediation approaches and automated workflows to address alerts. Additionally, they oversee the deployment of IT services and solutions, ensuring seamless integration with minimal disruption.

What You'll Do

Design, build, and maintain support systems to ensure high availability, scalability, and performance of critical infrastructure.
Lead incident response and root cause analysis for system failures, including problem investigations and coordination with relevant teams.
Implement and manage automation for system provisioning, deployment, self-healing, and performance monitoring to increase operational efficiency.
Establish and monitor SLIs/SLOs, proactively identify performance issues, and drive continuous improvements in service reliability.
Collaborate with development and operations teams to embed reliability best practices and evolve toward zero-downtime architecture.
Manage and optimize an event catalog, including event definitions, thresholds, remediation actions, and relevance across products.
Develop event response protocols, provide training, and ensure efficient handling of incidents across teams.
Drive post-incident reviews and feedback loops to enhance event definitions and service reliability.
Oversee quality and readiness of deployments, ensuring clear processes, assigned responsibilities, and minimal disruption.
Maintain deployment schedules and conduct risk assessments to ensure operational stability and deployment readiness.
Coordinate and execute deployment plans, manage resources, and incorporate feedback for continuous process improvement.
Manage CI/CD pipelines and infrastructure as code, ensuring seamless integration between development and operations.
Support and evolve DevOps practices, automating operational tasks and maintaining tools to drive ongoing efficiency.

Qualifications
ABOUT YOUR SKILLS

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Several years of experience in IT operations, service management, or infrastructure management, including roles such as Site Reliability Engineer, Problem Manager, or DevOps Manager.
Proven experience in managing high-availability systems and ensuring operational reliability.
Extensive experience in root cause analysis (RCA), incident management, and developing permanent solutions for recurring service disruptions.
Hands-on experience with CI/CD pipelines, automation, system performance monitoring, and the implementation of infrastructure as code.
Strong background in collaborating with cross-functional teams (development, operations, engineering, etc.) to improve operational processes and service delivery.
Experience in managing deployments, risk assessments, and optimizing event and problem management processes.
Familiarity with cloud technologies, containerization, and scalable architecture, including experience with zero-downtime deployment strategies.

NICE-TO-HAVE

Master's degree or professional certifications in Service Management, ITIL, or related fields.

What We Offer
We're all about diversity. We operate in 200 countries and bring together 60 different languages and cultures. We're really proud of our inclusive environment. Our offices are comfortable and fun places to work, and we make sure you get to work from home too. Find out what it's like to join our team and take a step closer to your best life ever.

Flex Week: Work from home up to 2 days/week (depending on your team's needs).
Flex Day: Make your workday suit your life and plans.
Flex Location: Take up to 30 days a year to work from any location in the world.
Employee Wellbeing: We've got you covered with our Employee Assistance Program (EAP), for you and your dependents 24/7, 365 days/year. We also offer Champion Health - a personalized platform that supports a range of wellbeing needs.
Professional Development: Level up your skills with our training platforms, including LinkedIn Learning.
Competitive Benefits: Competitive benefits that make sense with both your local market and employment status.

SITA is an Equal Opportunity Employer. We value a diverse workforce. In support of our Employment Equity Program, we encourage women, aboriginal people, members of visible minorities, and/or persons with disabilities to apply and self-identify in the application process.

Site Reliability Engineer – SAP Master Data Governance

EGP120000 - EGP240000 Y Coca-Cola HBC

Posted today

Job Description

Department: Data & Automation, Digital & Technology Platform Services.

Location: Egypt, Cairo.

As a Site Reliability Engineer - Master Data Governance, you will play a critical role in ensuring the reliability, scalability, automation and performance of our Data Governance production systems. You will work closely with our development and operations teams to build and maintain products, improve system reliability, and develop automated solutions to address operational challenges. Your role will involve optimizing platform efficiency, driving faster troubleshooting, and minimizing downtime to deliver superior business outcomes and customer experiences.

YOUR KEY RESPONSIBILITIES:

• Proactive Improvements: Proactively analyze incident patterns to provide diagnostic insights, helping to identify problems in the area of Data Governance and driving their resolution in collaboration with the product team.

• Complex Problem Troubleshooting: Lead troubleshooting efforts for complex issues in collaboration with application support partners, including proactive management of alerts and participation in crisis management teams as required, regardless of working hours.

• Operational Excellence: Influence and collaborate with Data Governance product teams to prioritize and swiftly implement user stories focused on operational improvements.

• Automation: Identify and implement automation opportunities to enhance operational efficiency across all tasks. Collaborate with the Cyber Security team to improve security measures.

• Observability Utilization: Use observability tools such as Dynatrace, ServiceNow, Azure Monitor, AquaSec, or SAP MDG Inbound/Outbound monitoring tools to identify and implement product improvements and automations aimed at enhancing incident resolution, product performance, and availability.

• Collaboration: Collaborate with architects, DevOps, and Applications Support engineers to find solutions that improve product stability and reduce incidents with a proactive, long-term approach. Participate in DevOps routines, contributing to problem identification and implementation of improvements.

• SRE Community Engagement: Work with Central SRE, SIAM manager, and the SRE Community of Practice to leverage synergies and share best practices.

ARE THESE YOUR SECRET INGREDIENTS?

• Bachelor's or Master's degree in IT, Engineering, or Computer Science.

• Proven experience in Site Reliability Engineering, DevOps, or a similar role.

• 5+ years of professional experience in complex IT landscapes and full stack technologies – mobile and web applications, Cloud (Azure, AWS, Google), OS, database, storage, virtualization, network.

• 2–5 years of experience in developing/maintaining Master Data objects within SAP applications (SAP MDG, S/4 HANA, SAP BPM, SAP MDM).

• Hands-on experience with BP (Customer, Supplier, Contact Person) Objects in either SAP MDG or S/4 HANA.

• Proven experience in building end-to-end solutions, from requirement gathering to deployment.

• Experience integrating SAP MDG with enterprise systems like MS Dynamics365, S/4 HANA, SAP CRM.

• Strong proficiency in configuring, deploying, troubleshooting and supporting the SAP MDG application with various customizations.

• Experience with monitoring and troubleshooting SAP MDG and S/4 HANA background jobs.

• Familiarity with reusable components, workflows, and libraries.

• Strong expertise in automation, monitoring, and performance tuning in SAP landscapes.

• ITIL certification is considered an advantage.

• Strong collaboration and communication skills, with the ability to work effectively across teams.

• Excellent problem-solving skills and the ability to work under pressure during incident resolution.

• Fluent in written and verbal English.

ABOUT YOUR NEW TEAM:

And although we have so much to be proud of, we always stay humble. We believe the real magic happens – for us and for you – when we OPEN UP.

AT COCA-COLA HBC, DIVERSITY HELPS US THRIVE
