Fox IT is a Portuguese IT consulting company that has been ranked among the top 5% of SMEs in Portugal for the fifth consecutive year and, for the first time, in the national Top 10 of its sector. Our mission is to bring a fairer approach to the IT market, promoting a more balanced exchange between us, our consultants, and our clients. More recently, we earned the “Happy at Work” certification, which reinforces our commitment to valuing our people.
We’re looking for a Senior Distributed Systems & SRE Lead to join a high-impact engineering engagement, responsible for the health and performance of a data infrastructure comprising over 2,750 nodes across 11 distinct technologies (Oracle, PostgreSQL, MySQL, MariaDB, Cassandra, Solr, Elasticsearch, OpenSearch, Redis, Kafka, RabbitMQ, MapR). You’ll provide L3 support to an existing operational team for all technologies listed, plus some L2 support for Oracle and PostgreSQL. The mission is to support and resolve major incidents while building an automation library (Ansible/Terraform) to standardize operations across the global footprint.
Responsibilities:
· Ensure distributed systems (Kafka, Cassandra, Elasticsearch, Solr, OpenSearch, MapR/Hadoop) are running optimally and provide first-line response for major incidents on the database side.
· Serve as a Subject Matter Expert for relational, non-relational, and messaging technologies.
· Collaborate with other engineering teams on complex lifecycle events, such as upgrades and migrations.
· Lead deep-dive forensic Root Cause Analysis (RCA) for recurring issues and production outages, with the goal of eliminating them permanently.
· Manage the health and scaling of distributed clusters (Kafka, Cassandra, Elasticsearch, OpenSearch, Solr), including partition rebalancing and node decommissioning.
· Lead the configuration and tuning of Elasticsearch/OpenSearch and Solr indexing strategies for shard stability and search optimization.
· Develop Terraform modules and Ansible roles to standardize and automate environment management and deployments.
· Maintain the stability of large-scale big data solutions (MapR/Hadoop).
· Participate in a 24/7 on-call rotation and cover weekend interventions as required, with flexibility in weekday scheduling to compensate.
Requirements:
- 7+ years of hands-on experience managing Apache Kafka (brokers/ZooKeeper/KRaft), Apache Cassandra (ring management/repair), Elasticsearch/OpenSearch, and Solr.
- Solid experience with Redis, RabbitMQ, and MapR/Hadoop.
- Expert-level Linux/Unix administration and shell scripting (Bash).
- Proven track record with Ansible and Terraform for automation, deployment, and patch management.
- Schedule flexibility (working hours: 14:00-23:00 / 09:00-18:00; must accommodate team meeting schedules).
- Willingness and availability to take part in a 24/7 on-call rotation.
- Commitment to flexible shift patterns, including weekends, balanced with business week time-off to ensure a sustainable allocation.
Nice to Have:
- Willingness to cross-train and handle triage for any technologies in scope (including relational and NoSQL databases, and messaging systems).
- Experience with Oracle, PostgreSQL, MySQL, MariaDB, Cassandra, Solr, Elasticsearch, OpenSearch, Redis, Kafka, RabbitMQ, and MapR.
Work model: Hybrid (3 days on-site in Lisbon - Almada)
What we offer:
- Collaborative, innovative work environment.
- Exciting projects with cutting-edge technologies.
- Continuous learning and career growth opportunities.
- 26 days of vacation.
- Health insurance.
- €50 birthday bonus.
- Training.
Take the Leap, Join Us!
Work location: Hybrid remote work in Almada