Cloud Site Reliability Engineer

Damia Group Portugal

Tech recruitment experts on a mission to provide the best recruitment exper…

Lisbon, PT

Permanent

DevOps / Sysadmin

Senior (5+ years)

Requires work permit

Languages: Required: English | Nice to have: Portuguese

Description

About the company: Damia Group is an international tech recruitment agency with three decades of experience. We arrived in Portugal 7 years later with a mission to transform IT recruitment experiences and, through them, achieve better results. We believe in long-term relationships built on a transparent and relaxed mindset. In a short period, we have won over both scale-ups and larger organisations by delivering well-curated candidate shortlists, higher job offer acceptance rates, and shorter time-to-fill.

Requirements

About the role: As a Senior Engineer on the Cloud Platform team, your impact will be measured by the continuous improvement of our platform’s reliability, scalability, and security posture.
SLO Ownership & Error Budget Management: Take direct ownership of the established Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services (e.g., latency, availability, error rate). You will manage the Error Budget and use it as the primary driver for prioritizing reliability work.
Scale and Harden the Core Platform: Apply deep technical expertise in Kubernetes, AWS, traffic management, and Infrastructure-as-Code to scale and harden the foundational platform that powers the company’s product workloads.
Drive Systemic Improvements: This role centers on hands-on engineering skill, technical leadership, and systemic reliability improvements within our complex, distributed, multi-region platform.

What you'll do:

Kubernetes Platform Engineering
Use your Kubernetes and AWS expertise to evolve EKS lifecycle, multi-tenant isolation, and regional consistency, ensuring clusters remain secure, performant, and predictable as we scale.
Traffic & Ingress Reliability
Apply advanced knowledge of cloud-native traffic management and API gateways.
Infrastructure-as-Code at Scale
Demonstrate mastery in IaC to manage complex, multi-region architecture.
Security & Access Control
Drive a zero-trust posture by establishing service guardrails and access controls across the platform.
Reliability Engineering & Incident Leadership
Demonstrate strong diagnostic and incident-response leadership to rapidly isolate issues across clusters, networks, and workloads.
Collaboration, Influence & Mentorship
Guide and influence engineering teams across the organization through design reviews, operational best practices, and reliability-focused decision-making.

Requirements:

Core Platform & Infrastructure Expertise
Demonstrate great skill in managing complex, distributed environments at scale, specifically focusing on:
- Cloud-Native Orchestration: Expertise in Kubernetes
- Infrastructure Automation: Mastery of Infrastructure-as-Code (IaC), including Terraform
- Advanced Networking & Connectivity: Understanding of core networking fundamentals, including routing, DNS, network segmentation (VPC/subnets), and connectivity services (e.g., transit gateways and network endpoints)
- Platform Systems: Deep competence in traffic/ingress systems and strong programming fundamentals in Go or Python
Security & Reliability Skills
- Fluency with IAM/IRSA, Vault, mTLS, and least-privilege design, combined with a proven ability to deliver measurable reliability improvements through automation, guardrails, and smart engineering
Leadership & Communication
- Demonstrate a strong operational mindset, excellent technical communication (both written and verbal), and the ability to influence designs, mentor others, and elevate platform engineering practices across teams
Experience and Proficiency
Demonstrate advanced proficiency and technical leadership in managing large-scale, resilient production systems. This experience is typically gained through roles such as:
- Site Reliability Engineer (SRE)
- Cloud Platform Engineer
- DevOps Engineer
- Other closely related infrastructure roles