Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe


Formal Metadata

Title
Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe
Number of Parts
542
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
We manage a 200TB glusterfs cluster in production. While managing it, we learnt some key points. In this session, we will share with you:
- The minimal health checks that are needed for a glusterfs volume to ensure high availability and consistency.
- The problems we experienced with the current cluster expansion steps (rebalance) in glusterfs, how we managed to avoid the need for rebalancing data for our use case, and a proof of concept for a new rebalance algorithm for the future.
- How we schedule our maintenance activities such that we never have downtime, even if things go wrong.
- How we reduced the time to replace a node from weeks to a day.
As the number of clients increased we had to scale the system to handle the increasing load; here are our learnings scaling glusterfs:
- How to profile glusterfs to find performance bottlenecks.
- Why the client-io-threads feature didn't work for us, and how we improved applications to achieve 4x throughput by scaling mounts instead.
- How to improve the incremental heal speed, and the patches contributed to upstream.
- The road map for glusterfs based on these findings.
Transcript: English(auto-generated)
Please welcome Sanju and Pranith. Thank you guys. Good morning guys. I'm Sanju and he's Pranith. We work at PhonePe.
Today we are going to discuss the lessons we learned while managing a GlusterFS cluster at this scale, some of the problems we have faced, and the solutions we came up with. PhonePe is the leading Indian digital payments and technology company,
headquartered in Bangalore, India, and it uses the Unified Payments Interface, which was introduced by the Government of India. So in India, if you are thinking of any payment, you can do it using the PhonePe app. This is how our PhonePe app home screen looks.
We see 800k RPS on our edge layer every day and we do 130 million daily transactions. This generates lots of records and documents that we have to store, and,
as per the regulations in India, we have to store all of them in India only. So PhonePe has a private cloud where we store all these things, and we need a service to store and retrieve files from that cloud. We have developed a service called DocStore,
which writes data to GlusterFS and fetches data from GlusterFS. So coming to the question: why did we choose GlusterFS? We didn't want a metadata server, because we have lots of small files and we didn't want to store all that metadata, and GlusterFS has no metadata server.
So we went ahead with it, and our team had earlier success with the GlusterFS project, so they were confident that GlusterFS would work for our use case. So here we are. This is the data flow to and from GlusterFS. All the traffic is fronted by a CDN and the request is forwarded to Nginx, and
Nginx sends the request to the API gateway. The API gateway can choose to store or retrieve a file itself, or it can send the request to a backend service. Now if the backend service wants to store a
file or fetch a file, it can be a POST or GET request, that is, it can store or it can retrieve. It sends the request to DocStore. DocStore then stores or retrieves the data from the GlusterFS servers. DocStore also uses
Elasticsearch to store some of the metadata, Aerospike to store the auth-related info and some of the rate limiting features, and RMQ for asynchronous jobs like deletions and batch operations. And this is our team.
Today's agenda is an introduction to GlusterFS, and then we will discuss the different problems that we have faced, the solutions that we are using, and some proposals as a roadmap. What is GlusterFS? GlusterFS is a distributed file system.
That means whenever you do a write, the data is distributed across multiple servers. These servers have directories, which we call bricks, and this is where the data actually gets stored.
So this is a typical GlusterFS server. Each server can have multiple bricks. The bricks have an underlying file system where the data is stored, and in the root partition we store the GlusterFS configuration. This is how a 3x3 GlusterFS volume looks. When I say 3x3,
the first three is about distribution. What is a mount point? We can mount a GlusterFS volume on any machine over the network and read and write from that machine. Now, from the client where the mount happens,
whenever a write comes, it is distributed across three sub-volumes based on the hash range allocation. We will talk more about the hash range in the coming slides.
The other three is the replica count: the data is replicated three times. So whenever a write comes, it chooses one of the sub-volumes, and a sub-volume is a replica set. Here it is three, so the write is replicated thrice.
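(For reference, a 3x3 volume like the one described can be created roughly as follows; the volume name, hostnames and brick paths here are invented, and the three servers are assumed to already be in one trusted pool. With replica 3, every consecutive group of three bricks on the command line becomes one replica set, giving three distribute sub-volumes of three replicas each.)

    gluster volume create docs replica 3 \
        server1:/bricks/b1/docs server2:/bricks/b1/docs server3:/bricks/b1/docs \
        server1:/bricks/b2/docs server2:/bricks/b2/docs server3:/bricks/b2/docs \
        server1:/bricks/b3/docs server2:/bricks/b3/docs server3:/bricks/b3/docs
    gluster volume start docs
    mount -t glusterfs server1:/docs /mnt/docs    # FUSE mount on any client machine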
Over to Pranith. Hello. Yeah, so let's look at some numbers that we see at PhonePe for the DocStore service, and therefore for GlusterFS. In a day, we see about 4.3 million uploads and 9 million downloads,
with a peak upload RPS of 200 and a peak download RPS of 800. The aggregate upload size per day is just 150 GB, not a lot, but the download size is 2.5 TB. So it is a completely read-heavy workload, and this is after a CDN is fronting it. That means only when the file is not available in the CDN
does the call come to GlusterFS; the file is pulled onto the CDN and then served. And this is how the RPS (requests per second) is distributed throughout the day.
The uploads are reasonably uniform from 6 a.m. to 5 in the evening, then taper off for the rest of the day, whereas the downloads follow a bimodal distribution with one peak at around 12 p.m. and another at around 7 p.m. The latencies are a function of the size of the file.
So the upload (POST) latencies have a mean of about 50 ms and a p99 of around 250 ms. Similarly, for GETs the mean is around 10 ms and the p99 is around 100 ms.
Let's look at the configuration that we use at PhonePe for GlusterFS. We have 30 nodes in the cluster. Each node contributes two bricks, and each brick is 10 TB backed by a ZFS pool.
So 30 times 20 TB, that is, 600 TB of raw capacity, and we use replica 3, so the usable size is 200 TB, out of which 130 TB is in use at the moment. Let's now go to the problems that we faced and how we solved them. I'll start off with the capacity expansion problem that we solved.
Then Sanju will take over and talk about the data migration problem that we solved. I'll talk about how we debug performance issues and how we solved problems using that method. Then Sanju will finish with the maintenance activities that we do to prevent problems.
Before we talk about the capacity expansion problem, let's try to understand a bit about the distribution. The data is distributed across the servers based on hashes.
In this diagram we have three distribute sub-volumes, and each sub-volume is a replica 3. When you create a directory, the copy of that directory in each of these three replica sets gets a hash range, and whenever you create a file or try to read a file, GlusterFS
computes the hash of the file name, figures out which of these directories in the three sub-volumes holds that hash range, and gets or stores the file on that node. So for folks who are well versed with databases, this is a lot like sharding,
but the entity that is getting sharded here is the directory, based on the file names. All right, so files can have varying sizes; for example, in our setup the minimum size is less than a KB but the maximum is around 26 GB. So you will run into the problem where
some of the shards, or distribute sub-volumes, fill up before the others, and you need to handle that as well. There is a feature in GlusterFS called min-free-disk: once a brick hits that level, then the next time you create the directory, the hash range will not be allocated to the sub-volumes that have crossed the threshold. So for example here,
even though there are three distribute sub-volumes, data is going to only two, because the middle one has crossed the threshold. So the hash range is distributed between the remaining two, 50% and 50%, instead of the one third each you would expect normally.
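(The option behind this is cluster.min-free-disk; the value below is only an example and the volume name is a placeholder.)

    gluster volume set docs cluster.min-free-disk 10%   # per the behaviour described above, sub-volumes below 10% free space stop receiving new hash ranges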
So let's talk about the actual process of increasing the capacity and why it didn't work for us. When you want to increase the capacity, that is, bring in more distribute sub-volumes or shards, the way you do it is: first you run gluster peer probe,
which brings the new machines into the cluster. Then you do another operation called add-brick, which adds the new bricks to your volume. Then you have to run gluster volume rebalance to redistribute the data among the nodes equally.
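(In command form, the standard expansion flow is roughly the following; host and volume names are placeholders, and adding three bricks to a replica-3 volume forms one new replica set, that is, one new distribute sub-volume.)

    gluster peer probe new-server1
    gluster peer probe new-server2
    gluster peer probe new-server3
    gluster volume add-brick docs \
        new-server1:/bricks/b1/docs new-server2:/bricks/b1/docs new-server3:/bricks/b1/docs
    gluster volume rebalance docs start      # the step that caused the problems described next
    gluster volume rebalance docs status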
So what are the problems that we faced? When we did the benchmark, the rebalance had an application latency impact of up to 25 seconds in some cases, and as I mentioned, most of our p99 latencies are just in milliseconds. So this would be like a partial timeout, a partial outage for us.
So this is not going to work for us. The other thing we noticed is that for large volumes the rebalance may take up to months, and at the moment GlusterFS rebalance does not have pause and resume, so we can't restrict the maintenance activity to off-peak hours.
That is one more problem. The other one we have seen is about the data migration itself: when you go from one distribute sub-volume, or shard, to two, you would expect 50% of the data to be transferred. That's all right. But when you go from 9 shards (distribute sub-volumes) to 10, you want to migrate only about 10% of the data,
yet GlusterFS still transfers about 30% to 40%, irrespective of the number of sub-volumes. So the rebalance itself may take so much time with our
workload that by the time we want to do the next capacity expansion, the previous rebalance may not even have completed. So that is also not going to work for us. These are the three main problems that we have seen. This is the solution that we are using now, and then there is a proposal as well.
Since we know that the hash range allocation is based on both the number of sub-volumes and the number of sub-volumes with free space, what we are doing in our DocStore application is that every night we create
new directories. The directory structure is something like the namespace the clients are going to use, slash year, slash month, slash day. So each day you create new directories, and based on the space that is available, only the sub-volumes that have space get a hash range allocation.
So you never really run into the problem of having to rebalance that much, because we have seen that with our workloads reads are distributed uniformly, and as we have seen it is a read-heavy workload and writes are just a few, so we were okay with this solution in the interim.
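(A minimal sketch of that nightly job, assuming a FUSE mount at /mnt/docs and an invented client namespace; DocStore does this internally, so the paths here are purely illustrative.)

    MOUNT=/mnt/docs
    NS=invoices                                               # hypothetical client-facing namespace
    mkdir -p "${MOUNT}/${NS}/$(date -d tomorrow +%Y/%m/%d)"   # pre-create the next day's directory

(Because the directory is brand new each day, its hash ranges are handed out according to the free space available at that moment, which steers new writes away from full sub-volumes.)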
But long term, the solution that we have proposed, which is yet to be accepted but for which we did a proof of concept, is that if you use jump consistent hashing instead of the current hashing, then when you go from 9 sub-volumes to 10, only about 10 percent of the data gets rebalanced.
So that is what we want to get to, and it is something we are focusing on this year. All right, over to you Sanju. Yeah, so let's look at the problems that we faced while migrating data.
So we had a use case where we wanted to move the complete data present on one server to another server. In GlusterFS the standard way of doing this is to use the replace-brick operation. When you do a replace-brick operation,
there is a process called the self-heal daemon which copies all the data present on the old server to the new server. To copy 10 TB of data it takes around two to three weeks, which is a huge amount of time. We wanted to reduce this time, so we came up with a new approach.
Let us understand a few aspects of GlusterFS before we jump to the solution, so that we understand our approach better. The write flow in GlusterFS is something like this. Whenever a write comes, based on the hash range allocation Pranith just spoke about, it chooses one of the sub-volumes.
The data goes to all the replicas, that is, all the servers in that sub-volume. Now let's say we have chosen replica set 0:
the write goes to all the machines in that sub-volume. It is client-side replication, so the client sends the write to all the machines and waits for the success responses to come back. The client assumes the write is successful only when a quorum number of
success responses has come. Now let's say one of the nodes is down, in our case server 2; it can be the node being down, or the brick process being unhealthy and unresponsive at times. So something happened:
the write came to one of the sub-volumes and went to all three replica servers, but server 2 didn't respond with a success response. Server 1 and server 3 responded with success, so the client assumes the write is successful. Now when server 2 is back up,
to keep the data consistent, server 2 should get the data it missed while it was down. Who takes care of this job? It is the shd. The shd, the self-heal daemon, is a daemon process which reads the pending heal data, that is, whatever data was missing; we call it a pending heal.
It reads from one of the good copies; in our case server 1 and server 3 are the good copies and server 2 is the bad copy, so the shd reads the data from one of the good copies and writes it to
server 2. So server 2 has all the data once the self-heal finishes healing. We will use this as part of our approach as well. Our approach is: we kill the brick which we want to migrate; say we want to migrate from server 3 to server 4.
So we have to copy all the data, right? And self-heal takes two to three weeks. In our case we kill that brick, and since we are using the ZFS file system,
we take a ZFS snapshot and transfer this snapshot from server 3 to server 4, that is, from the old server to the new server, and then we perform the replace-brick operation. While we are performing the replace-brick operation, server 4, the new server,
already has all the data which server 3 had. Once the replace-brick operation is performed, server 4 is part of the sub-volume, and the heals take place from server 1 and server 2 to server 4. So now we have reduced the amount of data that we are healing.
Previously we were copying all the data, around 10 TB, from server 3 to server 4, but here we are healing only the data which came in after killing the brick and
before doing the replace-brick operation. So the data we heal is reduced hugely. With this approach it now takes only around 50 hours to complete: if we are using spinning disks it takes 48 hours to transfer the
snapshot of 10 TB and 2 hours for healing the data, but it is only 8 to 9 hours if we are using SSDs; with SSDs it takes about 8 hours to transfer the snapshot and around 40 minutes to complete the heals. So we came
from two to three weeks down to one or two days, or nine hours we can say. We are using the netcat utility to transfer the snapshot; it gave us very good performance, roughly a 60 percent improvement, and we have in-flight checksums at both ends, on the old server and on the new server,
so that we can check whether we transferred the snapshot correctly and did not lose any data, and it reduces the time. I have kept the exact commands that we used in this link, and we also have a rollback plan.
So let's say we have started this activity but we haven't performed the replace-brick yet, because once the replace-brick is performed the sub-volume will already have server 4 as a part of it. Before we perform the replace-brick,
that is, while we are at this point, if we decide we don't want to do this anymore, all we need to do is start the volume with force so that the brick process we killed comes back up. Once it is up, the shd copies the data from the good copies to
the bad copy, the old server, so that we have consistent data across all our replicated servers. Yeah, it is that easy, and we want to popularize this method so that it helps the community.
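(The exact commands used are in the link mentioned above; as a rough sketch only, with invented dataset, volume and host names and an arbitrary netcat port, the flow looks something like this.)

    # On the old server: find and kill only the brick process being migrated
    gluster volume status docs                        # note the PID of old-server:/tank/b1/docs
    kill <brick-pid>
    # Stream a ZFS snapshot of the brick dataset to the new server (run checksums on both ends)
    nc -l 9000 | zfs receive tank/b1                  # run on the new server first
    zfs snapshot tank/b1@migrate                      # on the old server
    zfs send tank/b1@migrate | nc new-server 9000
    # Swap the brick; shd then heals only the writes that arrived after the kill
    gluster volume replace-brick docs old-server:/tank/b1/docs new-server:/tank/b1/docs commit force
    # Rollback (only valid before the replace-brick): restart the killed brick and let shd heal it
    gluster volume start docs force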
Yeah, over to Pranith. We will now talk about the performance issues that we faced and how we solved them. This is the graph that we saw in our prod setup while doing this migration,
when something happened that we did not account for. The latencies shot up to one minute here, and I've said that they are supposed to be only milliseconds, so this is horrible; there was about two hours of partial outage because of this. So let's see how these things can be debugged and fixed. We have a method called gluster volume profile in
GlusterFS. What you do is you start profiling on the volume, then you run your benchmark or whatever your workload is, and then you keep executing gluster volume profile info incremental.
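(For reference, the profiling workflow is just the following; the volume name is a placeholder.)

    gluster volume profile docs start
    # ... run the benchmark or wait for the suspect workload ...
    gluster volume profile docs info incremental    # per-brick stats since the previous 'info' call
    gluster volume profile docs stop                # when you are done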
It keeps giving you stats about what is happening to the volume during that time. For each of the bricks in the volume you get an output like this, where for that interval (in this case interval nine), for each block size you see the number of reads and writes that came in, and for all of the internal file operations
you see on the volume you get the number of calls, the latency distribution (min, max, average latency), and the percentage of latency taken by each file operation internally. What we saw when this ZFS issue happened is that the lookup call was taking more than a second, which is not what we generally see, so we knew something was happening
during the lookup operation. So we ran strace on the brick and found that there is one internal directory (.glusterfs/indices) where listing just three entries was taking 0.35
seconds. So imagine this: you do ls, it shows you just three entries, but it takes 0.35 seconds and sometimes even a second. After looking at this we found that ZFS has this behaviour where if you create a lot of files in one directory,
like millions, and then you delete most of them, an ls can take up to a second. This bug has been open for more than two years, I think, so we didn't know whether ZFS would fix the issue anytime soon. So in GlusterFS we patched it by caching this information, so that
we don't have to keep doing this operation. Now you wouldn't see this if you are using any of the latest GlusterFS releases. But yeah, this is one issue that we found and fixed.
The second issue is about increasing the RPS that we get from the volume. There was a
new application getting launched at the time, and the RPS they wanted was more than what we were giving: they wanted something like 300 to 360 RPS,
but when we did the benchmark we were getting only around 250 RPS. So we wanted to figure out what was happening. We ran benchmarks on the prod cluster itself and saw that one of the threads was getting saturated. There is a feature in
GlusterFS called client-io-threads, where multiple threads take the responsibility of sending requests over the network, so we thought let's just enable it and it will solve all our problems. We enabled it and it made things worse: from 250 RPS it went down. So we realized there is a contention problem on the client side that we are yet to fix. For now, what we
did is, on the DocStore containers where we were doing only one mount, we now do three mounts and distribute the uploads and downloads over them. (Audience question.)
So the question is which GlusterFS client we are using: the answer is the FUSE client, and the thread that is saturating is the FUSE thread. What we are doing is we have created multiple mounts on the container and we distribute the load in the application itself: uploads go to all three and downloads also go to all three. That is one thing we did to solve the CPU saturation problem.
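(The multi-mount workaround is just the same volume mounted several times, each mount getting its own FUSE client process; hostname and paths below are illustrative.)

    mount -t glusterfs server1:/docs /mnt/docs-1
    mount -t glusterfs server1:/docs /mnt/docs-2
    mount -t glusterfs server1:/docs /mnt/docs-3

(The application then spreads uploads and downloads across /mnt/docs-1, -2 and -3 so that no single FUSE thread becomes the bottleneck.)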
The other thing we noticed, and this is again part of the gluster volume profile output, which tells you for each block size the number of reads and writes, is that most of the writes were coming in as 8KB. When we later looked at the DocStore Java application, we saw that the I/O block size Java was using was the default of 8KB, so we just increased
it to 128KB. These two changes combined gave us 2x to 3x the numbers, and we also increased the number of VMs that we use to mount the client, so all put together we got something like
a 10x performance improvement compared to earlier, so we are set for maybe two or three years. All right, let's now move on to health checks.
For any production cluster, some health checks are needed, so I will talk about the minimal health checks needed for a GlusterFS cluster. GlusterFS already provides POSIX health checks: there is a health thread which does a 1KB write every 15 or 30 seconds, and there is an
option to set the time interval at which you want to do this. If you set it to zero, that means you are disabling the health check; you can set it to 10 seconds or so. It sends a write and checks whether the disk is responsive enough and the brick is healthy.
If it doesn't get a response within a particular time, it kills the brick process, so that we get to know that something is wrong with that brick.
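(The interval being referred to is the storage.health-check-interval volume option, in seconds, with 0 disabling the check; for example, with a placeholder volume name:)

    gluster volume set docs storage.health-check-interval 10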
The rest of the checks are things we keep externally: we have a script and some config. The POSIX health checks are the ones that come with the GlusterFS project; the cluster health checks are ours. We have a config where we specify the number of nodes in the cluster, that is, the
expected number of nodes, and using the gluster peer status or gluster pool list command we check the number of nodes actually present in the cluster. We check whether the two are equal; if not, we raise an alert saying
that something unexpected is happening. We also check whether each node is in the Connected state. In a GlusterFS cluster the nodes can be in different states, Connected, Rejected or Disconnected, depending on how
the GlusterFS management daemon is doing. The expectation is that all nodes are in the Connected state, so we check whether the nodes are connected, and if they are not we get an alert saying
that one of the nodes is not in the Connected state. We have some health checks for the bricks as well. We keep the number of bricks present in each volume in the config, and from the gluster volume info output you get how many bricks are actually present in that volume, and we check that they are equal. Another check we
have on the bricks is that if a brick is not online, we get to know it from the gluster volume status command, and you get an alert saying that one of your bricks is down. Whenever a server or a brick is down there will be some
pending heals, and you can check the pending heals using the gluster volume heal info command; if there are any pending heals you will see entries. If the entry count is non-zero, you get an alert saying you have pending heals in your cluster,
which means something unexpected or unwanted is going on; that can be a brick down, a node down, anything. And we always log profile info incremental to our debug logs from the health check, so that whenever we see an issue, like the ones Pranith just spoke
about which can be solved by looking at the profile info output, that output is available. We always ship these to our log backup servers, and the exact commands that we are using are listed in this link.
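(A condensed sketch of such an external check script, with an invented volume name, expected counts and alerting hook; the real script and exact commands are in the link above, and the output parsing may need adjusting for a given GlusterFS version.)

    #!/bin/bash
    # Minimal cluster health checks, roughly as described in the talk.
    VOL=docs
    EXPECTED_NODES=30
    EXPECTED_BRICKS=60
    alert() { echo "ALERT: $*" >&2; }    # placeholder for the real alerting hook

    # 1. Node count and peer state
    nodes=$(gluster pool list | tail -n +2 | wc -l)
    [ "$nodes" -eq "$EXPECTED_NODES" ] || alert "expected $EXPECTED_NODES nodes, found $nodes"
    bad_peers=$(gluster pool list | tail -n +2 | awk '$NF != "Connected"')
    [ -z "$bad_peers" ] || alert "peers not in Connected state: $bad_peers"

    # 2. Brick count and brick processes online
    bricks=$(gluster volume info "$VOL" | grep -c '^Brick[0-9]')
    [ "$bricks" -eq "$EXPECTED_BRICKS" ] || alert "expected $EXPECTED_BRICKS bricks, found $bricks"
    offline=$(gluster volume status "$VOL" | awk '/^Brick/ && $(NF-1) == "N" {print $2}')
    [ -z "$offline" ] || alert "bricks offline: $offline"

    # 3. Pending heals
    pending=$(gluster volume heal "$VOL" info | awk '/Number of entries:/ {sum += $NF} END {print sum+0}')
    [ "$pending" -eq 0 ] || alert "$pending pending heal entries on $VOL"

    # 4. Keep incremental profile output around for later debugging
    gluster volume profile "$VOL" info incremental >> /var/log/gluster-profile.log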
We also have some maintenance activities, because things can go bad sometimes. We have a replica 3 setup in our production, so at any point of time a quorum number of brick processes should be up so that reads
and writes can go on smoothly. So whenever we are doing something which might take some downtime of a brick process, or which puts extra load on a particular server, we do it only on one server from each replica set, so that even if
that server goes down, or the brick process running on that server goes down, we won't have an issue, because there are two other replica servers which can serve all the reads and writes. We do a few activities in this way. One is ZFS scrubbing: ZFS scrubbing
is about checksumming the data to see whether it is in proper condition or not. We do migrations in this way as well, doing them on one server
at a time from each replica set, so that even if it is down for some time or something doesn't work out, we are in a good place. Upgrades we also do in the same manner.
We have also made some contributions. The data migration part that I spoke about is production-ready; we have used it in our production. Pranith has given some sessions covering many internals of GlusterFS, which are very useful for any GlusterFS developer who wants to learn about the many translators we have in GlusterFS. Recently we
fixed a single point of failure that was present in the geo-replication feature; it was merged upstream very recently, last week. And this year we are looking at another thing, the hashing strategy that Pranith has proposed: once it is accepted by the community,
we will take it up and develop it. Yeah, that's all we had, folks. Thank you. Yeah, I just want to let you guys know, about the production-ready part: we actually migrated in total
375 TB using the method that Sanju talked about, so it is ready and you guys can use it. I think it should work even with Btrfs, basically any file system that has a snapshot feature. Thank you guys. I think we have a few minutes for questions if you have any; otherwise you can catch us there.
(Audience question.) So the question is how do we handle a disk failure. Basically, the problem I showed you, where we had the ZFS
issue with minutes of latency, that was the first time it happened in production for us, and initially we were waiting for the machine itself to be fixed so that it would come back, and that went on for a week or so, and the amount of data
that needed to be healed became so much that it coincided with our peak hours. So the standard operating procedure we have come up with after this issue is: if a machine goes down or a disk goes down, we can just get it back online in nine
hours, so why wait? We just consider that node dead, get a new machine, do what Sanju mentioned using the ZFS snapshot migration, and bring it up. Do we have the ZFS backup somewhere? The answer is no; you have
the ZFS data on the active bricks, so you take a snapshot on one of the good active bricks and transfer that snapshot. Any other questions? I think that's it. Thank
you guys, thanks a lot.