Cloud Site Reliability Engineer

Damia Group Portugal
Tech recruitment experts on a mission to provide the best recruitment exper
Permanent
Senior (5+ years)
Requires work permit
Languages: Required: English | Nice to have: Portuguese

Description

About the company: Damia Group is an international tech recruitment agency with 3 decades of experience. Our arrival in Portugal, 7 years later, was set on a mission to transform IT recruitment experiences and, through them, achieve better results. We believe in long-term relationships with a transparent and relaxed mindset. In a short period, we have reached the hearts of both scale-ups and larger organisations by delivering spot-on curated candidate shortlists, increased job offer acceptance rates and shorter time-to-fill.

Requirements

About the role: As a Senior Engineer on the Cloud Platform team, your impact will be measured by the continuous improvement of our platform’s reliability, scalability, and security posture.
  • SLO Ownership & Error Budget Management
    Take direct ownership of the established Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for core platform services (e.g., latency, availability, error rate). You will manage the Error Budget and use it as the primary driver for prioritizing reliability work.
  • Scale and Harden the Core Platform
    Apply deep technical expertise in Kubernetes, AWS, traffic management, and Infrastructure-as-Code to scale and harden the foundational platform that powers the company’s product workloads.
  • Drive Systemic Improvements
    This role centers on hands-on engineering skill, technical leadership, and systemic reliability improvements within our complex, distributed, multi-region platform.
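To illustrate the error-budget idea mentioned above, here is a minimal Python sketch. It is not taken from the company's tooling: the function name, the request-based availability SLO, and the sample numbers are all hypothetical, chosen only to show how an error budget can be turned into a number that guides prioritization.

```python
def error_budget_remaining(slo_target: float,
                           total_requests: int,
                           failed_requests: int) -> float:
    """Return the unspent fraction of the error budget.

    Hypothetical helper: assumes a request-based availability SLO, where
    slo_target is the required success ratio (e.g. 0.999 for "three nines").
    A result below 0 means the budget is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        # No traffic in the window, so none of the budget has been spent.
        return 1.0
    # Number of failures the SLO tolerates over this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative numbers only: a 99.9% SLO over 1,000,000 requests tolerates
# about 1,000 failures, so 250 failures leave roughly 75% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"Error budget remaining: {remaining:.0%}")
```

Under this sketch, a team might slow feature rollouts and prioritize reliability work as the remaining fraction approaches zero, and treat a negative value as a signal to pause risky changes.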

What you'll do:
  • Kubernetes Platform Engineering
    Use your Kubernetes and AWS expertise to evolve EKS lifecycle, multi-tenant isolation, and regional consistency, ensuring clusters remain secure, performant, and predictable as we scale.
  • Traffic & Ingress Reliability
    Apply advanced knowledge of cloud-native traffic management and API gateways.
  • Infrastructure-as-Code at Scale
    Demonstrate mastery in IaC to manage complex, multi-region architecture.
  • Security & Access Control
    Drive a zero-trust posture by establishing service guardrails and access controls across the platform.
  • Reliability Engineering & Incident Leadership
    Demonstrate strong diagnostic and incident-response leadership to rapidly isolate issues across clusters, networks, and workloads.
  • Collaboration, Influence & Mentorship
    Guide and influence engineering teams across the organization through design reviews, operational best practices, and reliability-focused decision-making.

Requirements:
  • Core Platform & Infrastructure Expertise
    Demonstrate strong skill in managing complex, distributed environments at scale, specifically focusing on:
    - Cloud-Native Orchestration: Expertise in Kubernetes
    - Infrastructure Automation: Mastery of Infrastructure-as-Code (IaC), including Terraform
    - Advanced Networking & Connectivity: Understanding of core networking fundamentals, including routing, DNS, network segmentation (VPC/subnets), and connectivity services (e.g., transit gateways and network endpoints)
    - Platform Systems: Deep competence in traffic/ingress systems and strong programming fundamentals in Go or Python
  • Security & Reliability Skills
    - Fluency with IAM/IRSA, Vault, mTLS, and least-privilege design, combined with a proven ability to deliver measurable reliability improvements through automation, guardrails, and smart engineering
  • Leadership & Communication
    - Demonstrate a strong operational mindset, excellent technical communication (both written and verbal), and the ability to influence designs, mentor others, and elevate platform engineering practices across teams
  • Experience and Proficiency
    Demonstrate advanced proficiency and technical leadership in managing large-scale, resilient production systems. This experience is typically gained through roles such as:
    - Site Reliability Engineer (SRE)
    - Cloud Platform Engineer
    - DevOps Engineer
    - Other closely related infrastructure roles