We’re looking for a Middle/High-Middle DevOps/SRE Engineer to help run and improve our production platform on GCP and GKE, fronted by Cloudflare, with observability in Datadog and CI/CD in GitHub Actions.
You’ll work closely with Senior/Principal engineers, implementing reliability improvements, expanding monitoring coverage, and reducing operational toil, all of which matter especially in a high-load system with sudden traffic spikes.
Operate and support production systems on GCP, primarily GKE and managed services.
Execute platform improvements and operational tasks delegated by Senior/Principal owners.
IaC & Delivery Enablement
Implement infrastructure changes via Terraform (and Terragrunt where used).
Maintain and evolve Helm charts and Kubernetes manifests.
Improve reliability of GitHub Actions / CI/CD workflows and deployment automation.
Observability & Monitoring (Datadog)
Participate in incident response and operational support: triage, mitigation using runbooks, escalation, and follow-up fixes.
Contribute to postmortems with clear facts, timelines, and actionable remediation tasks.
Security Basics (DevSecOps)
Run/configure security tooling and monitoring, help triage findings, and implement fixes under guidance.
Support secure-by-default practices (secrets hygiene, access controls, baseline hardening).
Cost Awareness
Identify and implement cost optimizations (right-sizing, waste removal, efficiency improvements) without harming reliability.
Hands-on production experience with Kubernetes (ideally GKE) and basic cluster operations.
Working experience with Terraform and Helm in PR-based workflows.
Familiarity with GCP services used in SaaS operations (e.g., Cloud SQL, BigQuery, Bigtable, Pub/Sub, Cloud Run, Memorystore).
Monitoring/alerting and troubleshooting skills (preferably Datadog).
Strong scripting/automation mindset to reduce manual work and prevent repetitive incidents.
Reliability awareness: understanding how changes affect availability/latency and how to operate under SLA constraints.
Cloudflare basics (WAF, DNS, edge concepts; Workers and CDN experience are a plus).
Experience writing/maintaining runbooks and participating in postmortems.
Exposure to SOC 2 / PCI-DSS requirements or willingness to learn.
Experience with high-load consumer products or game development.
Improved monitoring coverage and healthier alerting (less noise, faster detection).
Faster, safer deployments with fewer manual steps and fewer production regressions.
Incidents are triaged effectively and resolved within expected timelines.
Platform reliability improves through steady delivery of operational fixes and automation.
Costs trend in the right direction thanks to recurring optimizations and guardrails.
Cloud-only, high-load environment with real engineering challenges (not “just keep the lights on”).
Small team with ownership, autonomy, and quick iteration.
Strong opportunity to grow into broader platform ownership and SRE leadership paths.
Direct impact on reliability, scalability, and developer velocity.
Aghanim helps game developers achieve financial and creative independence by providing the solutions they need to launch, run, and grow their businesses.