Enabling Seamless AI Workloads: Achieving Zero-Downtime Upgrades for FUSE in Kubernetes - Weiwei Zhu, juicedata.inc
In high-throughput AI workloads such as autonomous driving and large-scale recommendation systems, the underlying file system must sustain large volumes of continuous I/O to keep GPUs fully utilized. However, upgrading or restarting a Filesystem in Userspace (FUSE) client in Kubernetes often causes file descriptor invalidation, cache loss, and write interruptions, leading to job failures and wasted resources.
In this session, we will introduce a practical solution for enabling seamless, zero-downtime upgrades of user-space file systems within Kubernetes. Drawing from a large-scale production deployment, we will demonstrate how we implemented self-healing mounts and rolling client upgrades for a FUSE-based distributed file system, deeply integrated with Kubernetes CSI and Operators.
We will highlight key failure cases encountered during early versions, explain why the default CSI lifecycle is inadequate for FUSE-based systems, and share how we redesigned the client upgrade process to maintain active I/O sessions without disruption.
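The abstract does not say how active I/O sessions are preserved across a client upgrade. One widely used Unix building block for this kind of handover, shown here only as an assumption and not as the speaker's confirmed design, is passing an open file descriptor from the old process to the new one over a Unix domain socket (SCM_RIGHTS). In a FUSE client the descriptor would be the open FUSE device handle; in this minimal sketch a temporary file stands in for it, and the function names are hypothetical. It requires Python 3.9+ on a Unix system.

```python
import os
import socket
import tempfile


def hand_over_fd(sock, fd):
    """Old client side: send an open descriptor to the successor (SCM_RIGHTS)."""
    socket.send_fds(sock, [b"fd"], [fd])


def take_over_fd(sock):
    """New client side: receive the descriptor and return it."""
    _msg, fds, _flags, _addr = socket.recv_fds(sock, 1024, 1)
    return fds[0]


def demo():
    # A socketpair stands in for the Unix socket shared by old and new clients.
    old_end, new_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with tempfile.TemporaryFile() as f:
        f.write(b"written by old client;")
        f.flush()
        hand_over_fd(old_end, f.fileno())
        new_fd = take_over_fd(new_end)
    # The old handle is now closed, but the received descriptor refers to the
    # same open file, so the "session" (including its offset) survives.
    os.write(new_fd, b"written by new client")
    os.lseek(new_fd, 0, os.SEEK_SET)
    data = os.read(new_fd, 1024)
    os.close(new_fd)
    old_end.close()
    new_end.close()
    return data


if __name__ == "__main__":
    print(demo())
```

Because the received descriptor shares the same open file description as the original, the kernel-side state stays alive even after the sender closes its copy. That property is what makes this pattern a plausible basis for restarting a user-space client without invalidating the mount.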