Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe

Formal Metadata

Title
Lessons learnt managing and scaling 200TB glusterfs cluster @PhonePe
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
We manage a 200TB glusterfs cluster in production. While operating it we learnt some key lessons, and in this session we will share:
- The minimal health checks needed for a glusterfs volume to ensure high availability and consistency (a minimal health-check sketch follows this abstract).
- The problems we experienced with the current cluster expansion steps (rebalance) in glusterfs, how we managed to avoid the need to rebalance data for our use case, and a proof of concept for a new rebalance algorithm for the future.
- How we schedule our maintenance activities so that we never have downtime, even if things go wrong.
- How we reduced the time to replace a node from weeks to a day.
As the number of clients increased we had to scale the system to handle the growing load. Here is what we learnt while scaling glusterfs:
- How to profile glusterfs to find performance bottlenecks (see the profiling sketch below).
- Why the client-io-threads feature didn't work for us, and how we improved applications to achieve 4x throughput by scaling mounts instead (see the mount fan-out sketch below).
- How to improve the incremental heal speed, and the patches we contributed upstream.
- A road map for glusterfs based on these findings.
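
To make the health-check point concrete, here is a minimal sketch, not PhonePe's actual tooling: it assumes the standard gluster CLI is on PATH and a replicated volume, and the volume name "data-vol" and the zero-pending-heal threshold are placeholders.

#!/usr/bin/env python3
# Minimal glusterfs volume health-check sketch (illustrative only).
# Assumes the standard `gluster` CLI; "data-vol" is a hypothetical volume name.
import subprocess
import sys

VOLUME = "data-vol"

def run(cmd):
    """Run a gluster CLI command and return its stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def check_peers():
    # Every peer in `gluster pool list` should report the "Connected" state.
    out = run(["gluster", "pool", "list"])
    return all(line.split()[-1] == "Connected"
               for line in out.splitlines()[1:] if line.strip())

def check_bricks():
    # Every brick in the volume status detail should be online ("Online : Y").
    out = run(["gluster", "volume", "status", VOLUME, "detail"])
    return all(line.rstrip().endswith("Y")
               for line in out.splitlines() if line.strip().startswith("Online"))

def check_pending_heals(threshold=0):
    # Sum "Number of entries:" across bricks from `gluster volume heal <vol> info`;
    # anything above the threshold needs attention before maintenance work.
    out = run(["gluster", "volume", "heal", VOLUME, "info"])
    pending = 0
    for line in out.splitlines():
        if line.strip().startswith("Number of entries:"):
            value = line.split(":")[1].strip()
            if value.isdigit():
                pending += int(value)
    return pending <= threshold

if __name__ == "__main__":
    checks = {"peers": check_peers, "bricks": check_bricks, "heals": check_pending_heals}
    failed = [name for name, fn in checks.items() if not fn()]
    if failed:
        print("UNHEALTHY:", ", ".join(failed))
        sys.exit(1)
    print("OK")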
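
For the profiling point, glusterfs ships a built-in io-stats profiler driven by the gluster CLI. The snippet below is only an illustrative wrapper around those commands; the volume name and the 60-second sampling window are arbitrary assumptions, not the speakers' setup.

# Sketch of collecting per-FOP latency statistics with glusterfs' built-in profiler.
# Assumes the standard `gluster` CLI; "data-vol" is a placeholder volume name.
import subprocess
import time

VOLUME = "data-vol"

def gluster(*args):
    return subprocess.run(["gluster", "volume", *args],
                          check=True, capture_output=True, text=True).stdout

gluster("profile", VOLUME, "start")          # enable io-stats collection on the volume
time.sleep(60)                               # let the workload run for a sampling interval
report = gluster("profile", VOLUME, "info")  # cumulative and interval per-FOP latencies
gluster("profile", VOLUME, "stop")           # disable profiling to avoid ongoing overhead
print(report)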
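
For the mount-scaling point, the general idea is to mount the same volume several times and fan application I/O out across the mounts, so that each FUSE client carries a share of the load. The sketch below shows one way an application could do that; the mount points and the hashing scheme are hypothetical and not the actual application changes described in the talk.

# Illustrative sketch of spreading client I/O across several FUSE mounts of the
# same glusterfs volume, instead of relying on client-io-threads.
import hashlib
import os

# The same volume mounted N times, e.g.:
#   mount -t glusterfs server:/data-vol /mnt/data-0   (and likewise for -1, -2, -3)
MOUNTS = ["/mnt/data-0", "/mnt/data-1", "/mnt/data-2", "/mnt/data-3"]

def pick_mount(relative_path: str) -> str:
    """Deterministically map a file path to one mount so that concurrent
    workers fan out over independent FUSE clients."""
    digest = hashlib.md5(relative_path.encode()).digest()
    return MOUNTS[digest[0] % len(MOUNTS)]

def open_for_read(relative_path: str):
    return open(os.path.join(pick_mount(relative_path), relative_path), "rb")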