Rivos

DL Communications Collectives SW Engineer

Posted 19 Days Ago
Remote
Hybrid
32 Locations
Entry level
The DL Communications Collectives SW Engineer will design and implement optimized communication libraries for distributed systems used in deep learning. Responsibilities include working with GPUs, optimizing low-latency communication, collaborating with hardware teams, and validating library performance. The role demands strong problem-solving skills and a collaborative mindset in a fast-paced environment.
The summary above was generated by AI

We are working on software to improve the Deep Learning ecosystem and help hardware engineers build great Deep Learning parallel systems.

We are looking for a strong candidate with a background in writing systems software for networking devices (and optionally the Linux kernel networking stack or network drivers), ideally someone who has implemented network protocols or has worked on OpenMPI.

This role involves designing and implementing highly optimized communication collectives libraries similar to UCC (Unified Collective Communication) and NCCL (NVIDIA Collective Communications Library). The ideal candidate will work closely with hardware and software teams to ensure efficient data communication and synchronization across multiple AI accelerators in a distributed system, enabling scalable deep learning and high-performance computing applications.

You will be learning technical and organizational skills from industry veterans: how to write performant and readable code; how to structure and communicate projects, ideas, and progress; how to work effectively with the Open Source community.

We are big proponents of Open Source and Free software and contribute back our improvements to all the great projects we use.


We prefer candidates who work out of one of our offices, but will consider remote candidates as well.

Responsibilities

  • Build up the communication components of an AI software stack.
  • Port AI software to run on a new hardware platform.
  • Profile and tune communications within AI applications.
  • Design, develop, and optimize communication collectives (e.g., AllReduce, AllGather, Broadcast, ReduceScatter) for large-scale distributed computing and machine learning frameworks.
  • Implement and optimize communication algorithms (ring, tree, butterfly, etc.) tailored for our architectures and multi-node clusters.
  • Ensure low-latency, high-bandwidth communication across multi-GPU setups, supporting interconnects such as PCIe and InfiniBand.
  • Collaborate with hardware engineers and other software teams to optimize performance.
  • Implement fault tolerance and scalability mechanisms in distributed systems to handle large-scale workloads.
  • Write unit tests and benchmark tools to validate the performance and correctness of collective operations.
  • Stay current with advancements in hardware and networking technologies to continuously improve the library's performance.

Requirements

  • Strong understanding of GPU architectures and platforms (NVIDIA CUDA, AMD ROCm) and experience in GPU programming (CUDA, HIP, or similar).
  • Proficiency in designing and implementing parallel and distributed algorithms, particularly communication collectives.
  • Experience with network interconnects (NVLink, PCIe, InfiniBand, RDMA) and understanding of their performance implications.
  • Hands-on experience with communication collectives libraries like UCC, NCCL, or MPI.
  • Strong knowledge of concurrency, synchronization, and memory consistency models in multi-threaded and distributed environments.
  • Experience with profiling and optimizing low-level performance (memory bandwidth, latency, throughput) on GPU architectures.
  • Familiarity with deep learning frameworks (TensorFlow, PyTorch, etc.) and their use of communication collectives.
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment.
  • Network driver experience recommended.
  • Excellent written and verbal communication skills.
  • Strong organizational skills and a high degree of self-motivation.
  • Ability to work well in a team and be productive under aggressive schedules.

Optional Requirements

  • Experience with NumPy, PyTorch, TensorFlow or JAX
  • Experience with Rust
  • Experience with CUDA, OpenCL, OpenGL, or SYCL
  • Coursework or experience with Machine Learning algorithms

Education and Experience

  • Bachelor’s, Master’s, or PhD in Computer Engineering, Software Engineering or Computer Science

Top Skills

CUDA
HIP


