CFS: A Distributed File System for Large Scale Container Platforms
Formal Metadata
Title of Series: SIGMOD 2019 (part 153 of 155)
Number of Parts: 155
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/43071 (DOI)
Transcript: English (auto-generated)
00:02
I'm Haifeng, from JD. My talk is about CFS, a distributed file system for large-scale container platforms. So why yet another distributed file system? The background is that JD is running extremely large-scale Kubernetes clusters in production.
00:24
Roughly millions of Docker containers run on top of tens of thousands of bare-metal servers. The graph here shows the three-tier tech stack of our container platform. The top layer is the containerized applications. The middle tier is for container
00:45
orchestration, that is, Kubernetes. The bottom layer is the container storage, which provides persistent volumes to the upper-layer applications. There are actually two options for the storage service in this infrastructure. One is a file system,
01:05
while the other one is a block store. However, we argue that a file system is a better abstraction than a block store in this scenario. Firstly, a file system allows data sharing among different
01:21
containers, while a block store is usually used exclusively by one. Secondly, a distributed file system is more scalable than a block store, because a block device has a fixed size after allocation. So we built CFS. The main motivation for us
01:44
is to decouple storage from compute for cloud-native applications, such that we can aggregate all the storage resources in a data center to enable more flexible compute resource
02:00
scheduling. That's the motivation and background. During the journey of building CFS, we made several key design decisions. Firstly, multi-tenancy: supporting lots of volumes. Here a volume means an individual file system namespace. That means CFS is not a single
02:22
distributed file system; it can run lots of distributed file systems mounted by different applications' containerized instances. The benefit of multi-tenancy is better resource utilization. Secondly, because the workloads on a container platform are highly diversified,
02:47
we have to optimize for both large and small files, and for both sequential and random access patterns. Third, due to the large number of files to store, the metadata store
03:03
needs to be distributed and, moreover, separated from the resource manager. Next, we provide a POSIX-compliant API for ease of use. Finally, we prefer strongly
03:21
consistent replication for both the metadata and the data subsystem, because we are an e-commerce business, and strongly consistent replication is also easier to maintain. Now let's look at the high-level system architecture of CFS. There are roughly four subsystems. The first one
03:47
and the second one actually have a very similar structure, except that the metadata subsystem is entirely in memory while the data subsystem is on disk. The resource manager is responsible for
04:02
cluster coordination, overseeing all the meta and data nodes. The client provides the file system interface in user space, with its own cache of cluster topology, inode info,
04:22
replica leader info, and so on. Please note that a volume is a logical concept: a volume is a single file system namespace mounted by one or multiple
04:42
containers. From the resource manager's view, a volume consists of a set of meta partitions and data partitions. A single meta or data partition is assigned to only one volume.
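To make that layout concrete, here is a minimal sketch in Go of how a volume maps to its partitions. The type and field names are hypothetical, for illustration only; they are not the actual CFS definitions.

```go
package main

import "fmt"

// MetaPartition is the unit of metadata replication. It owns a contiguous
// inode range of exactly one volume.
type MetaPartition struct {
	ID         uint64
	InodeStart uint64   // inclusive lower bound of the inode range
	InodeEnd   uint64   // inclusive upper bound of the inode range
	Replicas   []string // meta nodes holding replicas of this partition
}

// DataPartition is the unit of data replication; it stores extents on disk.
type DataPartition struct {
	ID       uint64
	Replicas []string // data nodes holding replicas of this partition
}

// Volume is a single file system namespace mounted by one or more containers.
// From the resource manager's view it is just a set of partitions.
type Volume struct {
	Name           string
	MetaPartitions []MetaPartition
	DataPartitions []DataPartition
}

func main() {
	v := Volume{
		Name: "app-logs",
		MetaPartitions: []MetaPartition{
			{ID: 1, InodeStart: 1, InodeEnd: 1 << 20,
				Replicas: []string{"meta-1", "meta-2", "meta-3"}},
		},
		DataPartitions: []DataPartition{
			{ID: 1, Replicas: []string{"data-1", "data-2", "data-3"}},
		},
	}
	fmt.Printf("volume %q: %d meta partitions, %d data partitions\n",
		v.Name, len(v.MetaPartitions), len(v.DataPartitions))
}
```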
05:02
Now let's talk about the metadata subsystem. The metadata subsystem is essentially a distributed in-memory database for metadata, and each volume has two large tables: one for inodes (an inode, or index node, is a standard concept in the file system field), and the other one is the
05:26
directory entry table. The inode table is sharded by inode number ranges, and the directory entries are co-located with their parent inodes. The metadata
05:44
subsystem is formed by a set of meta nodes. A meta node is a server usually configured with a large amount of memory, because all the metadata is kept in memory. Each meta node hosts a set of
06:01
partitions. The meta partition works as the unit of replication, and each meta partition contains a range of inodes and their child entries, stored as two in-memory B-trees.
06:21
Persistence is achieved by checkpointing as well as the raft log, which also plays the role of a write-ahead log. Note that the metadata subsystem is highly scalable, because the resource manager can dynamically create new meta partitions according to the usage of each volume.
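As a rough illustration of the range sharding (a sketch with made-up names, not the real CFS client code), a metadata request can be routed by finding the meta partition whose inode range covers the target inode:

```go
package main

import (
	"fmt"
	"sort"
)

// metaPartition describes one shard of a volume's inode table (illustrative).
type metaPartition struct {
	id         uint64
	inodeStart uint64 // inclusive
	inodeEnd   uint64 // inclusive
}

// findMetaPartition returns the partition whose inode range contains ino.
// Partitions are assumed sorted by inodeStart and non-overlapping.
func findMetaPartition(parts []metaPartition, ino uint64) (metaPartition, bool) {
	// Binary search for the first partition whose range does not end before ino.
	i := sort.Search(len(parts), func(i int) bool { return parts[i].inodeEnd >= ino })
	if i < len(parts) && parts[i].inodeStart <= ino {
		return parts[i], true
	}
	return metaPartition{}, false
}

func main() {
	parts := []metaPartition{
		{id: 1, inodeStart: 1, inodeEnd: 999},
		{id: 2, inodeStart: 1000, inodeEnd: 1999},
		{id: 3, inodeStart: 2000, inodeEnd: 1<<63 - 1},
	}
	if p, ok := findMetaPartition(parts, 1500); ok {
		fmt.Printf("inode 1500 is served by meta partition %d\n", p.id)
	}
}
```

Because directory entries are co-located with their parent inode, a lookup of a child name is routed by the parent's inode number to the same partition.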
06:50
While the metadata subsystem is entirely in memory, the data subsystem is a distributed on-disk store for the file contents. The data subsystem consists of a set of data nodes,
07:05
and each data node runs a set of data partitions, each working as a replication unit, while each data partition holds a set of extents. Here an extent means a range of contiguous file blocks. We made some specialized optimizations for large and small files.
07:31
Large and small files have very different layouts. A large file usually consists of a list of extents, simply because it is large. In contrast, we pack multiple small-file objects
07:46
together into a single extent: each data partition has a set of pre-created extents reserved for appending small-file objects. Interestingly, for deletion we don't use the
08:04
classic log-structured, compaction-based techniques. We leverage the hole-punching feature provided by the Linux kernel, that is, the fallocate system call, so that we can avoid explicit
08:23
garbage collection and keep our code base much simpler.
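As a rough sketch of the idea (illustrative only, not the actual CFS code, and the file name is made up), deleting a small-file object packed inside an extent can release its blocks in place with fallocate's hole-punching flags on Linux:

```go
package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// punchHole releases the on-disk blocks of a deleted small-file object that
// lives at [offset, offset+size) inside an extent file, without rewriting or
// compacting the extent. KEEP_SIZE leaves the extent file size unchanged.
func punchHole(f *os.File, offset, size int64) error {
	return unix.Fallocate(int(f.Fd()),
		unix.FALLOC_FL_PUNCH_HOLE|unix.FALLOC_FL_KEEP_SIZE,
		offset, size)
}

func main() {
	f, err := os.OpenFile("extent_0001", os.O_RDWR, 0)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Suppose a 4 KiB small-file object stored at offset 1 MiB was deleted.
	if err := punchHole(f, 1<<20, 4<<10); err != nil {
		log.Fatal(err)
	}
}
```

The underlying file system reclaims the punched range, so no separate compaction or garbage-collection pass is needed.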
08:44
We propose scenario-aware replication for the data subsystem: each data node runs two replication protocols in parallel for different write patterns. For sequential writes, we adopt the classic primary-backup protocol, so each write request incurs only one disk I/O.
09:05
For overwrites, that is, random updates, we adopt the raft replication protocol. Moreover, we leverage multi-raft to further reduce the heartbeat overhead. So why two protocols? Because if we only used primary-backup
09:27
replication, that could cause too many fragments and hurt read performance. If we only used raft replication, then for each write request we would have to write the data twice:
09:41
once for the raft log and once for applying the actual update. By combining the two replication protocols for different write patterns, we can achieve strong replication consistency with good performance. We also reuse the code from the metadata subsystem
10:03
for the multi-raft implementation. On the other hand, the drawback of this approach is that it brings some engineering challenges. For example, the failure recovery mechanism
10:22
becomes somewhat complicated. Here is the workflow for sequential write replication. First, the client fetches the leader info from its cache and sends the write request to the leader. The leader forwards the request to the other replicas, and the leader and the followers
10:45
commit their local writes in parallel, and then the leader replies to the client.
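To make that flow concrete, here is a minimal Go sketch of the parallel-commit step on the leader. The types and helpers are hypothetical stand-ins, not the actual CFS data node code.

```go
package main

import (
	"fmt"
	"sync"
)

// writeReq is a sequential append to an extent in a data partition (illustrative).
type writeReq struct {
	partitionID uint64
	extentID    uint64
	data        []byte
}

type replica struct{ addr string }

// commitLocal stands in for a replica appending the data to its local extent file.
func (r replica) commitLocal(req writeReq) error {
	fmt.Printf("%s: appended %d bytes to extent %d\n", r.addr, len(req.data), req.extentID)
	return nil
}

// leaderWrite has the leader and the followers commit in parallel and replies
// to the client only after every replica has committed, which is what gives
// the strong consistency described in the talk.
func leaderWrite(leader replica, followers []replica, req writeReq) error {
	replicas := append([]replica{leader}, followers...)
	errs := make(chan error, len(replicas))
	var wg sync.WaitGroup
	for _, r := range replicas {
		wg.Add(1)
		go func(r replica) {
			defer wg.Done()
			errs <- r.commitLocal(req)
		}(r)
	}
	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err // in the real system this would trigger failure recovery
		}
	}
	return nil
}

func main() {
	err := leaderWrite(
		replica{addr: "data-1"},
		[]replica{{addr: "data-2"}, {addr: "data-3"}},
		writeReq{partitionID: 7, extentID: 42, data: []byte("hello")},
	)
	fmt.Println("write err:", err)
}
```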
11:04
For overwrite replication, the workflow is as follows: the client fetches the leader info from its cache, submits a raft command for this write request, and then the raft replication runs. The resource manager is the brain of the whole CFS cluster. It is responsible for both data node and meta node management.
11:22
It can expand volumes according to their capacity and data usage. The resource manager is also highly available, running as a replicated state machine.
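As an illustration of that expansion policy (a sketch with made-up names and thresholds, not the actual resource manager logic), the manager can decide to create new data partitions once a volume's overall usage crosses a threshold:

```go
package main

import "fmt"

// partitionUsage reports how full one data partition of a volume is.
type partitionUsage struct {
	id    uint64
	used  uint64 // bytes written
	total uint64 // partition capacity in bytes
}

// needsExpansion returns true when the volume's overall usage ratio exceeds
// the threshold, meaning the resource manager should allocate new partitions.
func needsExpansion(parts []partitionUsage, threshold float64) bool {
	var used, total uint64
	for _, p := range parts {
		used += p.used
		total += p.total
	}
	return total > 0 && float64(used)/float64(total) > threshold
}

func main() {
	parts := []partitionUsage{
		{id: 1, used: 100 << 30, total: 120 << 30},
		{id: 2, used: 90 << 30, total: 120 << 30},
	}
	if needsExpansion(parts, 0.75) {
		fmt.Println("volume above 75% capacity: create new data partitions")
	}
}
```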
11:44
Now let's talk about the performance tests. First of all, the large file test setup is based on the fio tool set. In this test case, each client operates on a separate, very large file. We can see that
12:02
CFS outperformed Ceph in the random read and write tests. The reason is that all the metadata is served from memory, so CFS avoids extra reads and writes on the metadata path.
12:32
The small file test setup is based on the mdtest tool. The file sizes range from roughly one kilobyte to 100 kilobytes.
12:45
We can see that CFS is better in both the read and write tests. There are several reasons, and the most important one is the storage layout optimization for small files, which minimizes the communication
13:04
between the client and the resource manager. The metadata test also uses the mdtest tool. We have several clients, and each client runs roughly 60 processes.
13:22
Please note that the y-axis uses a logarithmic scale. CFS is better in six out of seven tests, gaining roughly a 3x performance boost on average.
13:42
This system is in production use at my company. For example, we use it to store lots of critical application data: search and advertising indexes, AI training platform data, database backups, and so on.
14:01
Moreover, we use it as the underlying storage for our in-memory database services. CFS is available as an open-source project under the Apache 2.0 license. It also has a Container Storage Interface (CSI) integration.
14:23
There is still a lot of work to do in the future. For example, we are implementing parallel multi-raft to further boost replication performance, and adding DPDK- and RDMA-based networking. Moreover, we need to build more tools to make the ecosystem richer.
14:44
In conclusion, CFS is a distributed file system built from scratch for cloud-native applications on very large container platforms, with the several features described above. So that's my talk.
15:00
Any questions? Thank you.