SRE Lead

Job Description

Social Links is a global provider of OSINT and Open Data technologies. We are developing a modular Open Intelligence Platform that aggregates hundreds of data sources and delivers intelligence through AI agents, pipelines, and customizable workflows.

Our infrastructure spans both legacy on-premises deployments and a new AWS-based cloud-native platform. To support this transformation, we’re looking for a Lead Site Reliability Engineer who will own the full lifecycle of system reliability, from process design to hands-on implementation.

As Lead SRE, you will:

Take ownership of our current on-premises systems and stabilize them.
Build modern SRE practices from the ground up.
Drive our transition to the AWS cloud (architecture, tooling, observability).
Manage and mentor a team of DevOps and SysOps engineers.

This role is ideal for someone who wants to own reliability architecture and be a key strategic contributor to how a high-impact, AI-powered platform evolves.

Key Responsibilities:

Define and implement SRE practices: SLO/SLA management, incident response, postmortems, alerting policies.
Lead the team responsible for:
- On-prem infrastructure (Linux, VPNs, networking, firewalls, Zabbix).
- DevOps and CI/CD workflows.
- Platform observability (Prometheus, Grafana, Loki, Tempo).
Architect and scale cloud-native infrastructure using AWS services:
- EC2, VPC, EKS, S3, IAM, CloudWatch, Route53, etc.
Oversee migration of services and systems from on-prem to cloud.
Own logging, metrics, recovery processes, disaster recovery planning (DRP), and secure runtime environments.
Implement infrastructure automation and self-healing mechanisms.
Build internal documentation, runbooks, and operational guidelines.
Act as a mentor and lead the reliability culture across engineering.

What We’re Looking For:

5+ years in infrastructure/SRE/DevOps roles, 2+ years in technical leadership.
Expert knowledge of Linux, Bash, system automation.
Deep understanding of core networking: VPN, TCP/IP, DNS, routing, NAT, firewalls.
Hands-on experience with on-prem operations and modernization.
Experience with monitoring: Zabbix, Prometheus, Grafana.
Proven experience with AWS (high priority): EC2, IAM, VPC, EKS, S3, CloudWatch.
Strong skills in CI/CD tooling: GitHub Actions, GitLab CI, ArgoCD, Helm, Kustomize.
Experience implementing SRE disciplines: SLOs, error budgets, incident management.
Proficiency in writing clear documentation and infrastructure standards.

Nice to Have:

Experience with OpenFaaS, Kubernetes, Terraform, Ansible.
Familiarity with SOC2, ISO 27001, GDPR compliance practices.
Python scripting for automation.
Experience with Vault, OPA, RBAC, and Zero Trust architectures.

Why Join Us:

A strategic role where you define infrastructure and reliability culture from the ground up.
Full ownership over reliability, observability, and platform resiliency.
A growing, global, product-driven company with engineering at the center.
Flexible remote environment with stock options and leadership visibility.
A foundational role with a clear growth path toward Head of Infrastructure/SRE.

If you turn chaos into structure and systems into strategy, this role is for you.