Armeta Inc. is developing advanced AI-driven systems that transform how large-scale engineering and construction projects are evaluated and approved. Our technology automates complex, compliance-heavy processes, ensuring accuracy and trustworthiness.
We are building a high-performance, on-premise computing platform to power our complex multi-agent, data, and backend systems, and we are looking for a DevOps engineer to build and manage this critical infrastructure.
Key Responsibilities
- Design, build, and maintain our high-availability on-premise infrastructure, based on Kubernetes and bare-metal servers (including supercomputers and NVIDIA DGX systems).
- Develop and manage robust CI/CD pipelines (e.g., GitLab CI, Jenkins) for automated building, testing, and deployment of all services.
- Manage the deployment, scaling, and operation of our core technology stack, including:
  - Backend microservices (FastAPI);
  - AI multi-agent systems and LLM-serving platforms;
  - Distributed compute clusters (specifically Ray);
  - Object storage systems (specifically MinIO).
- Implement and manage comprehensive monitoring, logging, and alerting solutions (e.g., Prometheus, Grafana, ELK/Loki) to ensure system health and performance.
- Manage NVIDIA DGX hardware, including GPU drivers, CUDA, and high-performance networking (e.g., InfiniBand).
- Automate infrastructure provisioning and configuration management using IaC tools (e.g., Ansible, Terraform).
- Work closely with AI and Backend teams to ensure a smooth, reliable path from research and development to production.
- Implement and maintain on-premise security best practices, including network policies, access control, and vulnerability management.
Qualifications
- Expert-level knowledge of Kubernetes (K8s) and the container ecosystem (Docker).
- Proven experience managing on-premise, bare-metal server environments. Experience with public cloud (AWS, GCP) is a plus, but on-premise expertise is essential.
- Strong experience with CI/CD tools (e.g., GitLab CI, Jenkins, GitHub Actions).
- Strong experience with Infrastructure as Code (IaC) tools (especially Ansible and Terraform).
- 5+ years of hands-on experience in DevOps, SRE, or a similar role.
- Deep understanding of networking principles (TCP/IP, load balancing, firewalls, VPCs).
- Proficiency in scripting and automation (e.g., Python, Bash).
- Experience with monitoring and logging stacks (e.g., Prometheus, Grafana).
Preferred Qualifications (Bonus Points)
- Strong experience with MLOps tools and platforms (e.g., Kubeflow, MLflow, Seldon Core, KServe).
- Hands-on experience with NVIDIA GPU management, CUDA, and the NVIDIA GPU Operator for K8s.
- Direct experience deploying and managing Ray clusters.
- Direct experience deploying and managing MinIO clusters.
- Experience with high-performance networking (e.g., InfiniBand).
- Experience with distributed storage systems (e.g., Ceph).