We are looking for a remote LLM / ML Infrastructure Engineer experienced with Rust/C++ and CUDA.
Our client is building a decentralized AI infrastructure focused on running and serving ML models directly on user-owned hardware (on-prem / edge environments).
A core component of the product is a proprietary “capsule” runtime for deploying and running ML models. Some components currently rely on popular open-source solutions (e.g., llama.cpp), but the strategic goal is to replace these community-driven components with in-house ML infrastructure to gain complete control over performance, optimization, and long-term evolution.
In parallel, the company is developing:
its own network for generating high-quality, domain-specific datasets,
fine-tuned compact models for specialized use cases,
a research track focused on ranking, aggregation, accuracy improvements, and latency reduction.
The primary target audience is B2B IT companies.
The long-term product vision is to move beyond generic code generation and focus on high-performance, hardware-aware, and efficiency-optimized code generation.
1. Applied ML Track (Primary focus for this role)
Development of ML inference infrastructure
Building and evolving proprietary runtime capsules
Porting and implementing ML algorithms on a custom architecture
Low-level performance optimization across hardware platforms
2. Research Track
ML research with published papers
Improvements in answer quality and inference efficiency
Experiments with aggregation, ranking, and latency reduction
👉 This position is primarily focused on the applied ML / engineering track.
This is a strongly engineering-oriented ML role focused on inference, performance, and systems-level implementation rather than model experimentation.
📌 Approximately 90% of the work is hands-on coding and optimization.
You will
Implement ML algorithms from research papers into production-ready code (see the sketch after this list)
Port existing ML inference algorithms to the company’s proprietary architecture
Develop and optimize inference pipelines
Optimize performance, memory usage, and latency
Integrate and adapt open-source ML solutions (LLaMA, VLMs, llama.cpp, etc.)
Contribute to the foundational architecture of the ML platform
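As promised above, here is a minimal illustration of what implementing an algorithm from a paper can look like: the online softmax recurrence from Milakov & Gimelshein (2018), the single-pass normalization that Flash Attention builds on. This is a sketch for the posting, not the client's code:

#include <cmath>
#include <cstdio>
#include <vector>

// Online softmax: one pass over the logits, tracking a running maximum m and
// a running normalizer d; d is rescaled whenever a new maximum appears.
void online_softmax(const std::vector<float>& x, std::vector<float>& out) {
    float m = -INFINITY;  // running max
    float d = 0.0f;       // running sum of exp(x_i - m)
    for (float v : x) {
        float m_new = std::fmax(m, v);
        d = d * std::exp(m - m_new) + std::exp(v - m_new);  // rescale, then add
        m = m_new;
    }
    out.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        out[i] = std::exp(x[i] - m) / d;
}

int main() {
    std::vector<float> logits = {1.0f, 2.0f, 3.0f}, probs;
    online_softmax(logits, probs);
    for (float p : probs) std::printf("%.4f\n", p);  // 0.0900 0.2447 0.6652
}

The value of the trick is numerical stability in a single pass, which is what makes the attention computation tileable on GPU.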
Key Responsibilities
Inference Infrastructure Development:
○ Design and implementation of a cross-platform engine for ML model inference
○ Development of low-level components in Rust and C++ with focus on maximum performance
○ Creation and integration of APIs for interaction with the inference engine
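As a rough illustration of the API point, an inference engine is typically exposed through a small, stable C ABI so Rust and C++ components (and higher-level bindings) can interoperate. Every name below is hypothetical, invented for this posting rather than taken from the product:

#include <stddef.h>
#include <stdint.h>

extern "C" {

typedef struct CapsuleEngine CapsuleEngine;  // opaque handle

// Load a model and prepare device memory; returns nullptr on failure.
CapsuleEngine* capsule_create(const char* model_path, int device_id);

// Generate up to max_tokens continuation tokens for a tokenized prompt.
// Returns the number of tokens written to out_tokens, or -1 on error.
int capsule_generate(CapsuleEngine* eng,
                     const int32_t* prompt, size_t prompt_len,
                     int32_t* out_tokens, size_t max_tokens);

void capsule_destroy(CapsuleEngine* eng);

}  // extern "C"

A flat C ABI over an opaque handle is a common design choice because it gives Rust, C++, and higher-level language bindings one stable, versionable boundary.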
Performance Optimization:
○ Implementation of modern optimization algorithms: Flash Attention, PagedAttention, continuous batching (see the kernel sketch after this block)
○ Development and optimization of CUDA kernels for GPU-accelerated computations
○ Profiling and performance tuning across various GPU architectures
○ Optimization of memory usage and model throughput
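To make PagedAttention concrete: the KV cache is stored in fixed-size physical blocks, and a per-sequence block table maps logical token positions to those blocks, so memory is allocated on demand instead of reserved for the maximum sequence length. The CUDA kernel below is a minimal sketch of that indirection; block size, head dimension, and layout are all assumptions:

#include <cuda_runtime.h>

constexpr int BLOCK_TOKENS = 16;   // tokens per KV-cache block (assumption)
constexpr int HEAD_DIM     = 128;  // attention head dimension (assumption)

// Writes each token's new key vector into a paged KV cache: the block table
// maps logical block indices to physical blocks, so sequences grow in
// fixed-size chunks instead of one contiguous preallocated region.
__global__ void write_k_paged(const float* __restrict__ k_new,       // [num_tokens, HEAD_DIM]
                              float* __restrict__ k_cache,           // [num_blocks, BLOCK_TOKENS, HEAD_DIM]
                              const int* __restrict__ block_table,   // logical block -> physical block
                              const int* __restrict__ positions,     // sequence position of each token
                              int num_tokens) {
    int token = blockIdx.x;  // one CUDA block per token
    int d = threadIdx.x;     // one thread per vector element: coalesced access
    if (token >= num_tokens || d >= HEAD_DIM) return;

    int pos  = positions[token];
    int phys = block_table[pos / BLOCK_TOKENS];
    int slot = pos % BLOCK_TOKENS;
    k_cache[(phys * BLOCK_TOKENS + slot) * HEAD_DIM + d] = k_new[token * HEAD_DIM + d];
}

// Launch: write_k_paged<<<num_tokens, HEAD_DIM>>>(k_new, k_cache, block_table, positions, num_tokens);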
Model Operations:
○ Implementation of efficient model quantization methods (GPTQ, AWQ, GGUF) — see the sketch after this block
○ Development of memory management system for working with large language models
○ Integration of support for various model architectures (LLaMA, Mistral, Qwen, and others)
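For the quantization point, the shared pattern behind GPTQ-, AWQ-, and GGUF-style formats is group-wise low-bit storage: packed 4-bit integers plus a per-group float scale (and, in some formats, a zero point). The details differ per format; the sketch below assumes a group size of 32 and a zero-point layout, purely for illustration:

#include <stddef.h>
#include <stdint.h>

constexpr int GROUP_SIZE = 32;  // weights per quantization group (assumption)

// Group-wise 4-bit dequantization: two 4-bit weights per byte, one float
// scale and zero point per group. Real formats differ in packing and in
// whether a zero point is stored; this shows the shared idea, not one spec.
void dequantize_q4(const uint8_t* packed, const float* scales,
                   const float* zeros, float* out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        uint8_t byte = packed[i / 2];
        int q = (i % 2 == 0) ? (byte & 0x0F) : (byte >> 4);  // low or high nibble
        size_t g = i / GROUP_SIZE;
        out[i] = scales[g] * (static_cast<float>(q) - zeros[g]);
    }
}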
Requirements
Strong proficiency in Rust or C++
Hands-on experience with GPU / hardware acceleration, including:
CUDA, ROCm (AMD), or Metal (Apple Silicon)
Solid understanding of:
LLM principles
core ML algorithms
modern ML approaches used in production systems
Ability to read ML research papers and implement them in code
Ability to write clean, efficient, highly optimized code
Interest in systems-level ML and low-level performance optimization
High level of autonomy:
take existing algorithms from research or open-source,
understand them deeply,
adapt and integrate them into a new architecture
Fluent English
We offer
Remote-first setup (work from anywhere)
Dubai working hours
High level of ownership and autonomy
Flat structure
Salary in cryptocurrency
An opportunity to create a great product that will disrupt the AI market