My group has an opening for a PhD Student:

This fully funded PhD position at Huawei in Paris, in collaboration with EURECOM, offers an exciting opportunity to work on enhancing the efficiency of AI platforms through full observability. Generative Artificial Intelligence (GenAI) has emerged as a transformative technology, with tools like ChatGPT and DALL-E gaining widespread adoption and significantly impacting various industries. These technologies are built on foundation models driven by the Transformer architecture and trained on vast datasets, presenting unique challenges in scalability and power requirements.

This PhD project seeks to address these challenges within Cloud Native environments, which offer the flexibility needed to efficiently utilize expensive dedicated hardware infrastructure. The research will focus on developing observability systems for distributed GenAI inference and training. The successful candidate will explore several critical areas, including network monitoring to address both end-host and in-network challenges in the context of distributed GenAI models; GenAI application monitoring (possibly transparent) of training/inference processes; and the integration of network and application layer monitoring to achieve a holistic system overview that captures the complex interplay between GenAI applications and the underlying system infrastructure. Additionally, based on the enhanced observability offered by the proposed system, the research aims to develop methodologies to minimize the resource consumption of GenAI applications without compromising their performance.

Key questions that will form the basis of the research include identifying the challenges of network monitoring in the context of GenAI within Cloud Native environments, understanding the challenges of application monitoring in this context, exploring ways to integrate network and application layer monitoring for a comprehensive system view, and developing systems to reduce the resource consumption of GenAI applications.

Ideal candidates for this position will have a strong passion for complex and distributed systems, along with a Master’s degree in Computer Science, Networking, or a related field. Familiarity with Cloud Native platforms (e.g., Kubernetes, Docker), distributed training/inference of LLMs and AI models, and system programming (e.g., C, Rust, C++, P4) is a strong plus. If you are interested or want to know more, drop us an email.

Contacts: Roberto Morabito (Eurecom), Gabriele Castellano (Huawei), Massimo Gallo (Huawei)

To apply, send us an email with your CV, a motivation letter, and references (if any).
