
Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe

Formal Metadata

Title
Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe
Series Title
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You may use, modify, and reproduce the work or its content for any legal purpose, and distribute and make it publicly available in unaltered or altered form, provided you credit the author/rights holder in the manner they specify.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject Area
Genre
Abstract
We manage a 200TB glusterfs cluster in production, and along the way we have learned some key lessons. In this session, we will share with you:
- The minimal health checks needed for a glusterfs volume to ensure high availability and consistency.
- The problems we experienced with the current cluster expansion (rebalance) steps in glusterfs, how we managed to avoid the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm for the future.
- How we schedule our maintenance activities so that we never have downtime, even if things go wrong.
- How we reduced the time to replace a node from weeks to a day.

As the number of clients increased, we had to scale the system to handle the growing load. Here are our learnings from scaling glusterfs:
- How to profile glusterfs to find performance bottlenecks.
- Why the client-io-threads feature didn't work for us, and how we improved our applications to achieve 4x throughput by scaling mounts instead.
- How to improve the incremental heal speed, and the patches we contributed upstream.
- A roadmap for glusterfs based on these findings.
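The abstract does not spell out the checks themselves, but the kind of minimal health checks, profiling, and mount scaling it refers to can be sketched with the standard gluster CLI. This is an illustrative sketch, not the speakers' actual tooling; the volume name `myvol`, host `server1`, and mount paths are placeholders:

```shell
# Confirm all peers in the trusted storage pool are connected
gluster peer status

# Verify every brick process is online and check per-brick disk usage
gluster volume status myvol detail

# List files pending self-heal; a list that keeps growing usually
# indicates a brick or network problem on a replicated volume
gluster volume heal myvol info

# Profile the volume to find performance bottlenecks: collect
# per-brick latency and FOP statistics, inspect them, then stop
gluster volume profile myvol start
gluster volume profile myvol info
gluster volume profile myvol stop

# "Scaling mounts instead of client-io-threads": mount the same volume
# more than once and spread application I/O across the mount points
mount -t glusterfs server1:/myvol /mnt/myvol-1
mount -t glusterfs server1:/myvol /mnt/myvol-2
```

Each additional FUSE mount gets its own client process, which is one way to add client-side parallelism without relying on the client-io-threads translator.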