Carbon3 develops, owns and operates AI infrastructure across the UK, powered by renewable energy. By converting legacy industrial and energy sites into modern data-centre facilities, Carbon3 combines brownfield regeneration with cleaner, more efficient, scalable compute capacity for healthcare, research, finance, enterprise, and public-sector organisations.
Role Summary:
We are seeking a DevOps Engineer to design, operate, and continuously improve our Kubernetes-based AI infrastructure. This role focuses on cloud-native platform engineering, GPU-accelerated workloads, reliability, automation, and customer enablement.
You will play a key role in delivering a production-grade AI platform that enables ML engineers, data scientists, and enterprise customers to build and run AI workloads at scale.
You will be responsible for the reliability, scalability, and performance of our Kubernetes-based GPU platforms. You will ensure our AI platform operates securely and efficiently while delivering an exceptional customer experience. This is a hands-on platform engineering position focused on systems reliability, automation, and continuous improvement.
Key Responsibilities:
Kubernetes Platform Operations:
- Operate and evolve a production Kubernetes environment supporting GPU-accelerated AI workloads.
- Manage cluster lifecycle (deployment, upgrades, scaling, resilience, multi-node operations).
- Implement high availability, failover, and maintenance strategies to minimise disruption.
- Enable as-a-service (aaS) capabilities and segmentation for multi-tenant workloads.
- Maintain Infrastructure as Code tooling and its lifecycle.
- Manage network overlays and block, file, and object storage.
- Work with Ansible, YAML, Terraform, Python, Jenkins, and GitOps.
GPU & AI Infrastructure Engineering:
- Manage NVIDIA GPU infrastructure within Kubernetes (device plugins, drivers, CUDA compatibility).
- Implement GPU partitioning and workload isolation strategies (e.g., MIG, quotas, namespaces).
- Monitor and optimize GPU utilization, workload efficiency, and cluster capacity.
- Support AI/ML training and inference workloads with performance tuning and best practices.
Reliability, Monitoring & Automation:
- Design and maintain observability frameworks (metrics, logs, tracing).
- Implement proactive monitoring, alerting, and capacity planning.
- Lead incident response for platform-level events and drive root cause analysis.
- Automate operational workflows and infrastructure provisioning (IaC, configuration management).
- Contribute to platform reliability engineering practices (SLOs, SLAs, error budgets).
Security & Governance:
- Implement RBAC, network policies, and security hardening.
- Ensure secure multi-tenant workload isolation.
- Maintain compliance, data protection, and access governance standards.
Customer & Platform Enablement:
- Support the customer lifecycle, from onboarding and provisioning through to operations.
- Provide guidance on workload configuration, scaling strategies, and best practices.
- Collaborate with engineering and vendor teams to resolve complex platform issues.
- Produce high-quality technical documentation and operational playbooks.
Required Experience & Skills:
- Strong hands-on experience operating production Kubernetes clusters.
- Experience with GPU-enabled Kubernetes environments.
- Solid Linux system administration, networking, storage and security skills.
- Experience with Infrastructure as Code and automation.
- Strong understanding of distributed systems, APIs, and cloud-native architectures.
- Experience implementing monitoring and observability solutions (e.g., Prometheus, Grafana).
- Proven incident management and root cause analysis experience.
- Strong communication skills and ability to work cross-functionally.
Desirable Experience:
- Experience operating AI/HPC infrastructure.
- Deep understanding of Kubernetes scheduling, networking, and storage.
- Experience with high-performance datacentre networking and tuning.
- Background in DevOps or Site Reliability Engineering (SRE).
Why Join Carbon3.ai:
You’ll be joining a mission-driven start-up building critical national infrastructure, where operational excellence directly enables growth. This role offers high visibility with leadership, real autonomy, and the chance to shape how a next-generation company operates at scale.
Diversity & Inclusion:
Carbon3.ai is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.



