Cost-Efficient Virtual Petabytes Storage Pools

Formal Metadata

Title: Cost-Efficient Virtual Petabytes Storage Pools
Subtitle: using MARS
Number of Parts: 95
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract: Background data migration via MARS on sharded local storage is the key for massive cost savings and even better total performance, compared to big cluster architectures using expensive dedicated storage networks.
Transcript: English (auto-generated)
Good evening, ladies and gentlemen, welcome. My audience is a little bit small due to the late hour, but hopefully I can offer you some interesting ideas.
If you have questions, you may ask them in German, but I will stay in English because of the video recording and the English audience, so I will translate any questions presented in German. So what is the subject of the talk? In theory it is about costs, but costs are of course not the only thing to consider here.
So this is my agenda. I have the architectural level first, that is the problem space. Of course the central subject of costs is not all there is to look at, so there are the scaling
properties, which are very important for any big storage system, and of course there is reliability. It is the most important, even though it is sometimes only the third one in the hierarchy. So the first three items are about the problem space, and the next ones are about the solution space, of course.
So let's start with one architecture which is proposed throughout the internet and hyped at the moment, which I call big cluster architecture. Let's start from the top here. Should I use this or the other one for pointing? I'll use this one.
Now look, 1&1 is a web hoster, and we have several millions of contracts and several millions of home directories. I think at my last count it was more than nine million home directories.
Ok, now please observe that each home directory is already a separate data space, separate from each other. So your input data, each web contract, is already isolated or should be isolated from each other. That means your data set is already partitioned. This is an important property, and now
if you want to have a cluster architecture, the storage here, you have three layers. One is the user layer, then you have some front-ends here, which are servicing your HTTP requests, and then you have a storage network and some storage server somewhere.
If you are using a big cluster architecture, then in theory almost any front-end could serve any request from any user. And the same is also true for the storage boxes. That means the data is striped in some way over all of these storage boxes.
That's the basic idea of these big cluster architectures. And if you look at it, first you have real-time access between your front-ends and your storages, because if something is not in the caches here, the local caches of the kernel on your front-end, it has to be fetched from the storage servers. And this is real-time access. And the next observation is that you have N machines. You can
imagine you have several petabytes of data if you have millions of customers and several thousands of machines. So this N is bigger than 1000 here. And think about what it means to have O(N²) connections: the number of potential connections grows with the square.
Of course the traffic itself scales up only linearly, but the number of TCP connections which are open at the same time is O(N²). That's the important part to observe here. This is understandable, because potentially any of these
front-end machines can request data from any of the back-ends, N to N. So probably this is not the best architecture if you have a dataset which is already partitioned. There are some use cases, let me emphasize this, which are well tailored for this architecture.
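To put rough numbers on this N-to-N picture before coming to the use-case examples, here is a tiny illustration; the figures are purely illustrative, not from the talk:

```python
# Illustrative only: how potential connection count and traffic scale with pool size.
frontends = backends = 1000          # "this N is bigger than 1000 here"

potential_connections = frontends * backends   # any front-end may talk to any back-end: O(N^2)
traffic_scale = frontends                      # payload traffic grows with the user base: O(N)

print(f"potential TCP connections: {potential_connections:,}")   # 1,000,000
print(f"traffic scales with:       {traffic_scale:,} machines")  # linear
```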
For example, looking at a search engine, any user may enter any search keyword and then any of the front-ends might be addressed. So it makes a difference whether not only your dataset but also your access paths are already partitioned in some way.
So for example, if you use Ceph or similar architecture for your storage boxes here, there are use cases where Ceph is the optimal solution, but not for our web hosting here, just to make it clear. Okay, now let's look at, something has gone wrong here, there are two slides.
The next slide is the sharding architecture, probably known to some of you.
The idea is you have storage plus front-end in one single unit, which may be the same server but also may be a storage server, a small storage server and several clients, but in small islands. Even if you have some small storage network here between them, you have local switches for them.
So you have no big O(N²) network in between. And now the problem is load balancing, which is viewed here as the most important problem, and the idea of what I am doing here is that you have a small network, which is only O(N), and it is used for batch migration.
So your logical volumes are transferred during operation, while the data is being updated, that's an important point. The data is active, and during that your data can be migrated to a different server for load balancing or, for example, for hardware life cycles, to get rid of your old servers and migrate to newer boxes and so on.
The idea is that this migration network is much smaller, it's only O(N) and it's background traffic, so you don't need to dimension your network in such a way that you have plenty of resources for any load peaks. The main problem with the big cluster architecture is that typical loads from the internet, for example
DoS attacks and similar scenarios, are overloading it at the wrong moment, they are creating packet storms. And TCP is no real-time protocol, it has not been designed for storage networks, it has been designed for surviving
nuclear war, as you probably know from history, and this means your TCP send buffer is a queue by concept. So the only means of congestion control or flow control in your network is packet loss, that's the basic
principle of the internet, and this means your storage network from the previous slide must be dimensioned in such a way that no incident happens and you don't get queued, which means it has to be over-dimensioned. And for this replication network, you can simply use some background channel, you can even use traffic shaping.
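As a toy illustration of the "send buffer is a queue" point; all numbers here are invented for the example, not measurements:

```python
# Toy model: a FIFO send buffer under sustained overload. Once the arrival rate
# exceeds the service rate, the backlog (and thus latency) grows until the buffer
# is full and packets are dropped -- which real-time storage traffic cannot
# tolerate, while background replication traffic can be shaped instead.
def backlog_mb(seconds, arrival_mbit_s, service_mbit_s, buffer_mb):
    excess_mb_per_s = max(0.0, arrival_mbit_s - service_mbit_s) / 8.0
    return min(buffer_mb, excess_mb_per_s * seconds)

# A 10-second peak of 12 Gbit/s offered load on a 10 Gbit/s storage link:
print(backlog_mb(10, 12_000, 10_000, 512))   # 512 -> buffer full, loss and stalls follow
```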
Okay, now let's look at a comparison of the ideas behind it. How to get reliability, high availability into the system,
the classical solution, which some of you probably already know, is DRBD. Who has already used DRBD or similar systems? Okay, three, four people here, okay, thank you. So you know what I'm talking about; if you want to know what Mars is, it's almost the same as DRBD, but asynchronous.
The main difference is that Mars is using transaction logs, so all the updates are first recorded in a sequential log and then transferred in background to the secondary side. So you have pairs of machines typically, so we have a primary side where you normally are operating and the secondary side is a passive copy which is just mirroring the active one.
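A purely conceptual sketch of that write path, in Python rather than the real MARS kernel code; the names and structure here are mine:

```python
# Conceptual sketch only, not the MARS implementation: the primary completes writes
# locally and appends them to a sequential transaction log; a background shipper
# replays the log on the secondary, so the primary never waits for the WAN.
import queue, threading

transaction_log = queue.Queue()              # stands in for the on-disk sequential log

def primary_write(block_no, data, local_device):
    local_device[block_no] = data            # local write completes immediately
    transaction_log.put((block_no, data))    # recorded in the log, shipped later

def secondary_replayer(remote_device):
    while True:
        block_no, data = transaction_log.get()   # arrives over the (slow) long-distance link
        remote_device[block_no] = data           # replayed strictly in log order

primary_disk, secondary_disk = {}, {}
threading.Thread(target=secondary_replayer, args=(secondary_disk,), daemon=True).start()
primary_write(42, b"some block", primary_disk)
```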
And of course only one of them is actually servicing the requests, the other one is the backup, and if the distance, the physical distance between these two sides, is more than let's say 30 or 50 km, then you are typically talking of geo-redundancy. And geo-redundancy means that a whole data centre may fail, for example if you have an earthquake or a terrorist attack
or whatever, then you can lose your whole data centre and you have a complete backup data centre where all your data is residing. So we can switch over in case of emergency. Now let's look at a certain other case where only single nodes are failing. This can occur during normal operations.
At any time. For example, if you have 1,000 servers and a reliability of 99.99%, four nines, then it means you will have about 2,000 hours of downtime in total in your pool per year, around that.
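A back-of-the-envelope version of that computation; the assumptions are mine, and planned maintenance and correlated failures push the real figure higher:

```python
# Rough estimate of aggregate node downtime in a large pool.
servers = 1000
availability = 0.9999                     # "four nines" per server
hours_per_year = 24 * 365                 # 8760

per_server_h = (1 - availability) * hours_per_year    # ~0.88 h per server and year
pool_node_hours = servers * per_server_h              # ~876 node-hours per year, before maintenance
print(f"{per_server_h:.2f} h per server, about {pool_node_hours:.0f} node-hours across the pool")
```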
You can compute it. It's more than many people imagine. Now let's look at the case that two nodes are failing at the same time.
So if they are failing at the same time, here in this architecture it's no problem if they are on different pairs because you see this node, this red one is failing, you are just switching over, you have a failover to the other one and if this one fails you are failing over to the other pair in the other data centre.
So it's not a big incident of the whole data centre but only a small incident of your local machine and all your local disks which are connected to this machine are of course no longer accessible. If the same happens with the big cluster here, whatever it is, you can add other names here. These are just examples.
If you have two replicas here, what will be the problem here? Typically these replicas, in the case of Ceph, are objects corresponding to one file, to one block, whatever mode you are using here. When two nodes are failing and you have two replicas, you have a problem, because not all of your data is present anymore.
And the main difference is that here it matters which nodes are failing. So the probability of hitting exactly the same pair is much lower. But the probability that any two failures produce an incident here is much higher, by roughly a factor of N.
So this is the first problem of the big cluster architecture. And the second one: if you have an incident at the storage layer, you will likely also have an incident at your client servers, not at one of them but at almost all of them,
or it depends on the data distribution and the hash function and some other effects, but I have written here that O(N) clients are typically affected, or may be affected at least in the worst case. So the spread of your incident is much wider, you have a kind of error propagation here throughout this O(N²) network.
So is it clear to you that the architectural impact here is that two replicas are not enough here? In order to compete with this architecture you need at least three replicas, if not four. Okay, agreement, disagreement somewhere?
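For the transcript reader, a deliberately simplified toy model of that argument; the model is my own, not from the slides:

```python
# Toy model: exactly two of N nodes fail at the same time.
from math import comb

N = 1000
double_failures = comb(N, 2)                 # all possible pairs of failed nodes

# Sharded pairs: data becomes unavailable only if the two failed nodes are partners.
sharded_loss = (N // 2) / double_failures    # = 1 / (N - 1)

# Big cluster, 2 replicas, data striped over all nodes: with enough objects almost
# every pair of nodes shares at least one object, so nearly any double failure hurts.
striped_loss = 1.0                           # approximation

print(f"sharded pairs : {sharded_loss:.3%} of double failures lose data")
print(f"striped, 2 rep: ~100%, roughly a factor of N-1 = {N - 1} worse")
```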
I think it's clear, so you don't need exact mathematical formulas, you can see it by just looking at the problem. Okay, now the question is: what does that mean for your costs? Okay, if you have a big cluster here, then you need at least three replicas, and this means 300% of disks, the number of disks times three.
And by the very construction you have storage nodes and you have client nodes, and they have the same order, O(N), similar numbers. So you are using about three times the number of servers you would need for this model. One simple sharding model is: you just have local RAID-6, nothing else.
And the data protection is done by the RAID, preferably a hardware RAID controller with a BBU cache, this is very important for performance. And you have hardware caching there and a battery backup or some small gold-cap capacitor in case of power failure.
And the hardware does everything for you, it's a small card, so you are not wasting any height unit here, and the power consumption is also very small with this model. Because you have no dedicated storage servers at all, only this one card, and this consumes not much power and not much space.
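A quick capacity comparison of the two protection schemes; the disk counts are illustrative, not from the talk:

```python
# Usable fraction of raw disk capacity, ignoring hot spares and filesystem overhead.
disks_per_shard = 12
raid6_usable    = (disks_per_shard - 2) / disks_per_shard   # ~83%, survives any 2 disk failures
replica3_usable = 1 / 3                                     # ~33% with three full replicas

print(f"local RAID-6 ({disks_per_shard} disks): {raid6_usable:.0%} usable")
print(f"big cluster, 3 replicas:     {replica3_usable:.0%} usable")
```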
Okay, but of course you need a backup, and you need it in both cases, because you have no further redundancy. If your whole cluster as such is failing, you also have a problem with this architecture, and the probability of the whole cluster failing is not as low as many people are thinking, that's the problem here. So having a big cluster or having more replicas is no substitute for backup, in any case.
Okay, is this clear? Now the next one is we are looking at geo-redundancy and many people are claiming that geo-redundancy is extremely costly. It produces much cost because everything is doubled. This is true in theory and it's especially true for the big cluster architecture.
Some people believe that you just need to distribute your big cluster over two data centers and then you have geo-redundancy. This is not true, because true geo-redundancy means you really have to survive an earthquake. An earthquake means it takes several months to restore your data center into operation.
This is the scenario we are talking about, important for insurance for example. So if you have a bigger company and you want to survive this you have to be able to continue operations for several months in your second data center.
For this failure scenario you need at least six replicas here, you have to double all of it. Because otherwise, during the three or four or five months or even one year or whatever it would take to restore, you will have the same failure scenarios as if you had one data center.
You are just losing half of your total hardware. And in this case of course you are also doubling, that's clear. But you can use Mars for long distance, or DRBD for room redundancy, but then we are no longer talking about big disasters. It's clear that even there the cost difference is significant.
Now let's combine it. It's a kind of unfair comparison here, because on the one side we are returning back to the three replicas,
no geo-redundancy, and on the other side we are keeping the geo-redundancy from the last slide, and comparing it this way this time. It's interesting that even there it is cheaper, that's the key point of the slide here. So this means: with geo-redundancy you have one client, one storage server in the best
case, and you have to double it, and you also need the replication network. But compare it to this side, where you have O(N) clients plus three times your storage servers for three replicas. Assuming that one server has the same size and the same number of disks and so on.
And your storage network, which is also replicated as a whole throughout the data centers. And in addition, I have forgotten, you also need a replication network for the geo-redundancy, I forgot this. So the cost of the hardware, it should be clear to you, shows a vast difference here.
But there's a functional difference: here you have this geo-failure scenario covered, and here you don't have it. So you get even better functionality here, a better insurance, almost like a life insurance; you can sleep better. And here you have more cost and a less safe, less secure system.
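A very rough server-count sketch of that comparison, under illustrative assumptions of my own (identical chassis and disk counts on both sides):

```python
# Hardware count only; the networks are noted as comments because their costs differ in kind.
N = 1000   # shards / front-ends needed to carry the load

sharded_with_geo   = 2 * N          # every combined storage+front-end shard doubled in a second DC
big_cluster_no_geo = N + 3 * N      # N client servers plus three replicas on dedicated storage servers

print(f"sharded + geo-redundancy: {sharded_with_geo} servers + a background replication link")
print(f"big cluster, no geo:      {big_cluster_no_geo} servers + an over-dimensioned storage network")
```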
And it's possible, I think it's a new idea, I haven't seen it before but please correct me if you know a similar idea somewhere. This is an idea how to make it even cheaper, the geo-redundancy. What's the key idea here? It's asymmetric.
Full geo-redundancy means that one of your data centers is operating at 100% CPU power, the other one at almost 0%. Okay, so you are wasting CPU resources there, you are just deploying servers for the case of emergency.
So this means doubling the cost is something one could try to reduce, and the idea is: the storage cannot be reduced. The total storage is needed times two, no way around this. But for CPU power the situation can become, or could become, different if you have three data centers, three locations instead of two.
And A1 and A2 means this is the primary side here. The primary side is divided in two halves, the A1 is replicated here to A1 prime, green in secondary role by default. The A2 is replicated here and now the same thing symmetrically, so the Bs are replicated to here and to here
and the Cs from here, here's the primary side, are replicated, the secondary sides are these two data centers, symmetrically. Understandable hopefully. Now what about CPU consumption? Storage is clear, this is a factor of two but during normal operations each data center has 100% CPU for each part
and in case one of them breaks down, let's say data center one has an earthquake and the two other data centers are surviving, then you have to switch over the A1 to here and the A2 to here.
That means the CPU consumption of this data center increases only by 50%. It's not doubling and also it's the same with the other data center. So this means it's a challenge for a system like Mars how to tackle this problem here,
how to have a flexible assignment of CPU to storage. Our first model was having a fixed association between storage and CPU power, the first attempt and the second one is how to do it better. This is the next slide here, the basic idea of my talk.
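Before the next slide, a small sketch of the CPU arithmetic of the asymmetric scheme just described; the A/B/C resource groups follow the slide, while the concrete failover mapping and numbers are my own illustration:

```python
# Each data centre is primary for two resource groups (100% of its normal load)
# and secondary for one group from each of the two other data centres.
primaries = {
    "DC1": ["A1", "A2"],
    "DC2": ["B1", "B2"],
    "DC3": ["C1", "C2"],
}
failover_target = {"A1": "DC2", "A2": "DC3",
                   "B1": "DC3", "B2": "DC1",
                   "C1": "DC1", "C2": "DC2"}
load_per_group = 50    # two groups = 100% CPU during normal operations

def load_after(failed_dc):
    load = {dc: len(groups) * load_per_group
            for dc, groups in primaries.items() if dc != failed_dc}
    for group in primaries[failed_dc]:
        load[failover_target[group]] += load_per_group   # each survivor picks up one extra group
    return load

print(load_after("DC1"))   # {'DC2': 150, 'DC3': 150}: +50% each, not +100%
```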
Flexible Mars sharding and cluster on demand, what does it mean? In this example here we have a hypervisor running on this single storage box, the RAID system is in hardware, we have logical volumes here, and the VMs are either KVM or, let's say, LXC containers or whatever. Running on the same box, the number of course can be scaled up and down as you like
and in case the CPU power of this box here, this hypervisor instance in one of your data centers, is not enough, and you have another one, for example the secondary, passive one, where you have more than enough CPU power, you are switching over, you are exporting this volume here,
this Mars-replicated volume, to another one, over the local replication network preferably, but it's an exceptional case, it's not the default. The default is to have the VMs or the containers on the same hypervisor whenever possible.
So you have no storage network traffic at all by default, and only in exceptional cases, if something happens to a certain machine, or whatever happens in this case of an emergency breakdown, then you are exceptionally using iSCSI
or the new feature, the Mars remote device, which should appear during this year, it's not yet in production. So it's just a replacement for iSCSI, with a little bit better performance according to my measurements. I need one microphone, please repeat your question yourself.
Does it work? I think the microphone doesn't work, you have to switch it on. Here, try it. What triggers the failover? There are two possibilities, if you have really thousands of machines,
you wouldn't rely on something like Heartbeat or similar; of course, if you have a low number of clusters, you can do it all automatically, but at 1&1 we prefer the manual method,
because automatic methods tend to produce some unexpected results. You have a network operations center anyway, which has to watch thousands of network connections or several dozens of lines and so on and so on.
And there are different types of incidents which can be automatically repaired or not, and so on, so anyway we have 24x7 personnel in place. So at the moment we don't have automatics there, but in my last slide I will come back to your question. So for example for community purposes, having this GPL software operating somewhere else,
it's an interesting question you are raising here. At 1&1 we internally do it the manual way at the moment. We have our own cluster manager, I will mention it later. So hopefully there are no more questions on this slide, so it should be clear hopefully what happens.
And now, what cluster on demand means should be clear, flexible Mars sharding should now also be clear, and what is background migration? The idea can be seen here: you have this primary side, logical volume 3, here. Of course you have another replica at the old data center, which has failed at the moment, for example,
and you are just creating a third replica on this machine, the same one where you are already exporting via iSCSI, where you have enough storage space. And this replication does a full sync, similar to DRBD, which also does a full sync in the background
while your data is being modified by your application, your block device, logical volume. So after some hours or if it's a very big, huge device then it may take one day, we have some up to 40 terabyte logical volumes in some cases at the moment,
and some very big boxes, 300 terabytes in total on one box, so 48-spindle or even 60-spindle RAID systems on some machines, several RAID sets on one machine, high-capacity ones and lower-capacity ones for better performance and similar things.
So after a while you will have it up to date, the same notion as with DRBD, you know this "up to date" from DRBD already, and it's the same with Mars. And well, the next step is not depicted here, but you can imagine: what will I do then? The last step is to switch the primary role to here, and then this iSCSI connection is not needed anymore.
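As a sketch of what such a switchover could look like when scripted: the marsadm sub-commands join-resource, leave-resource and primary/secondary are the ones mentioned a moment later in the talk; everything else here (the host handling, the sync check) is my own placeholder, not the 1&1 tooling:

```python
# Hedged sketch of a background migration, not production tooling.
import subprocess, time

def marsadm(*args, host):
    # Run a marsadm command on the given node via ssh.
    subprocess.run(["ssh", host, "marsadm", *args], check=True)

def sync_finished(resource, host):
    # Placeholder: in practice you would inspect `marsadm view <resource>` output here.
    raise NotImplementedError

def migrate(resource, device, old_host, new_host):
    marsadm("join-resource", resource, device, host=new_host)   # temporary extra replica, full sync starts
    while not sync_finished(resource, new_host):                # wait until the new replica is up to date
        time.sleep(60)
    marsadm("secondary", resource, host=old_host)               # hand over the primary role...
    marsadm("primary", resource, host=new_host)                 # ...to the new node
    marsadm("leave-resource", resource, host=old_host)          # afterwards, drop the old replica
```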
And this is just the idea of the background migration, create a temporary additional replica, so you can even start with one replica and never create another replica at all, but if you have to migrate it then you have a temporary second replica
and afterwards of course you are destroying your old replica. There are some commands like marsadm join-resource and leave-resource, and similar commands have already appeared in DRBD, so this is no accident, because I am in contact with Philipp Reisner from LINBIT
and the look and feel from a sysadmin perspective should be very similar in both products. So that's the basic idea: you are not using a big cluster, you are migrating the data explicitly upon your request; as a sysadmin, you are telling the system where the data should reside,
you can migrate it at any time, to anywhere as you like, so you have a big cluster at the metadata level, that means all the resources in your big cluster are known, where they are residing but the actual data path, the actual IO is not a big cluster IO.
The actual data is where possible only local, with no network at all, and only in those places where needed you are using the network. That's the basic idea. Well, now my last slide, the second last one,
what's Mars, it's under GPL, well I have a 100 page manual on GitHub, you can download it, read it, I have also some sysadmin instructions, step-by-step instructions, how to set up your first Mars cluster and so on, very similar to DRBD, if you already know DRBD, you will become familiar with it,
there are some differences in detail because of the asynchronous modes of Mars: where you expect that in DRBD something is done synchronously, you have asynchronous versions in Mars, and ok, it's stable, because it has been in production for more than two years, or three years now,
the 1&1 geo-redundancy feature is public, it's in our public advertising, probably you noticed it, we have geo-redundancy in our advertising, and this is just the backbone of it. We have more than 2000 servers,
and with the databases it's even more, so this number is too low; it's even more because the separate database servers are also replicated via Mars, I think it should be more than 3500 in total, so this figure is even too low, and then the installed capacity is even higher,
more than 8 petabytes of data. Interesting for you is the number of inodes here; web hosting means that you have extremely many very small files, ok, and the only file system which really has a low error rate here is XFS,
not ZFS, not btrfs; ext4 has some problems as we know, so we are using XFS. And Mars has now more than 30 million operating hours on the big clusters and the big systems, for more than two years now. Ok, what we are currently doing, our internal project, is increasing density and saving costs,
and this means that the migration is the current work I'm doing at the moment. There's one old branch, which is called the stable branch,
which has been in production and has collected this number of operating hours, and we have now been using the new one for two weeks; officially it's in alpha stage but it has already been in production for two weeks now, and of course in a few months I will label it better, and hopefully by about the end of the year I will also label it stable,
so it's collecting some experience of course, maybe a little bit higher incident rate with the first new versions, new functionality, but the experience is that the reliability of the software is higher than that of the hardware,
and this has to be the case if you are creating an HA solution: the software has to be more reliable than the hardware, otherwise it's not HA, it's not high availability, so this is just a requirement for the software. Well, future plans, what I'm thinking about there:
I would like to have some feedback from you. I'm starting bottom-up here, with what's already done: we started with DRBD some years ago and migrated to Mars because of some network problems, so you can imagine that if you have two data centers and you have several lines, 10-gig lines, I think we have more than eight between them,
whatever it is, but there's a lot of other traffic which has nothing to do with your web hosting, different applications and whatever, and this means that sometimes a backup is running and whatever else is running, so we have some peak loads there, and also peak loads occurring on your DRBD replication;
then you have a problem, and in another talk I even have a slide about what happens if DRBD is overloaded. So you replace it by Mars; that was the reason why I created Mars some years ago, started the Mars project. What we are currently doing is this one: we are creating a virtual, LVM-like, virtual pool,
where you can migrate, so load balancing is done by migrating, and virtual machine pools. At the moment we have our own cluster manager, CM3, which is not ready for publication. I would like to have it published, it's created by a different team than mine,
but I think it's not possible, because it relies on the internal infrastructure, on certain databases, even instances, and our internal infrastructure, and this is not for publication, it doesn't help anyone, that's the problem here. So what I'm thinking about is having a libvirt plugin; I already have a beta version of it,
but it's not yet fully tested. There's already a libvirt plugin for DRBD, and the Mars one is very similar, it's not totally different from it, but somebody has to test it, and my time is limited, so I would like to have some feedback from the community here if possible. And the interesting new point is,
to your question from before: what about automatic load balancing, not only failure scenarios, but also load balancing and the like. So now there's a question: this is to be done, as a separate implementation, hopefully under GPL and published on GitHub, I would prefer that,
or going into existing projects like libvirt or OpenStack or Kubernetes for example, why not, if it is easily possible. So I would like to collect some feedback from you, what would you prefer,
well not all of you probably have this application area of thousands of servers, but if you have a wish list, and could just make your check mark somewhere, what would you prefer,
no opinions, no preferences, okay, well you can also talk to me offline or send me an email, so this is just a question I'm asking, and probably I should gain more experience. I have a lot of experience with our own system, but not with new systems growing on the green meadow, greenfield systems,
because I'm working on a very old system which has been in operation for decades now, and has some history of course, and whatever is happening there, so it makes a complete difference whether I'm working in an area where a traditional system is very stable,
with processes, IT processes and so on, where rollout is no easy story, to get something out, rollout is really a mess there, but thanks to geo-redundancy it's no big problem, you can switch over: you are starting at the passive side, you are putting the new kernel on the passive side
and then just hand over. But even if you have 2000 machines and you have an error rate of whatever, less than one, let's say one per mille, not even one percent, then you have to look after it, because if you have several thousands of machines, a fully automatic load balancing has some risks,
that's the problem here, so your reliability has to be really extremely high there, in this area. So this is my last slide, any questions? Yes, tomorrow.
Yes, the question was: I said we switched from DRBD to Mars because of network problems. Yes. The whole question again. Do you think the change from, your question is:
were most of these network problems on the geo-redundancy side, on the long-distance side, or within one data centre? No, it's really the geo-redundancy; I can confirm that DRBD works very well with crossover cables,
it's constructed for that case. DRBD is simply not constructed for long-distance replication, and we have some kind of long distance, we have high latency and we have no constant latencies there. In case of traffic jams and similar events, it's simply not constructed for this case, and that was the reason why there is a niche for Mars.
Stated simply. And this is also the reason why I'm trying to be, to some degree, compatible to DRBD as much as possible, because many sysadmins are used to it, it has some reputation in the community, DRBD has some reputation.
Mars is a relatively new product, an open-source product, so it has to gain some reputation of course, so it's best for me, I think, to do something similar, and I'm also in contact with LINBIT in this case. So I think this is the right way to go. I'm regarding DRBD as a system,
I'm even recommending DRBD in the cases where it's better suited. If you have crossover cables, don't use Mars, then use DRBD. But if you are planning long distances, if you have more than one router in between or more than one switch in between, and if the total throughput
of all of your switch ports together is much higher, if your intermediate connection is more than an order of magnitude lower than that total, then you should also consider Mars. Look into it, do some proof of concept, gain some experience, and then slowly migrate, as we have done.
If you have some operational problems, never touch the running system, you know, that's also clear. But in case you have an alternative: about our migration I can tell some war stories. The first point was, we have ITIL, that means we have to be able to roll back at any time,
and this is possible. Because it's a drop-in replacement, in some sense, not completely, because there's a separate space for the /mars directory, but the DRBD metadata is on separate volumes, and if you have it on separate logical volumes, the metadata,
it's no problem to use Mars as a drop-in replacement for DRBD. And migrating even thousands of instances is of course no problem, I wrote some migration scripts, and it was two nights, one for the European cluster, a big cluster, and the other one for the US cluster, and we migrated more than 98% in one night,
and there were some remains where they had problems, which were then migrated later. So you have to automate it if you really have some thousands of machines, and I have some more war stories if you like, if you have more questions about big clusters.
This is interesting for you, a very old experience. Probably you know one product of 1&1 called MyWebsite. It started several years ago, and it's partly written in Java and partly written in PHP, it's a very heterogeneous system,
and it started on the green meadow, as a greenfield project. This is a big cluster architecture, with a low number of machines. So we had two storage servers and about, I think, six client servers in total. And they started with NFS. So it's a big cluster architecture by its nature,
but of course it's not really big, it's a small one. And after it grew to about, I don't remember exactly, about 20,000 customers, it started to have incidents. And the problem is always the same: you look at it, and you see there's a TCP send buffer somewhere at the server side,
and suddenly there's a peak, it's going up. And after a while it's going down, so something is hanging there. And the experience, as you probably know the problem from your own experience, is that the problem is the TCP send buffer
and it's not real time, TCP is no real-time protocol. So what did they do? The sysadmins, on their own, started to replace NFS with OCFS2. Oh, no, no, don't laugh. OCFS2 is better than NFS, obviously, because it scaled: the first scalability limit with NFS
was at about 20,000 customers, and OCFS2 was working until about 30 or 35,000 customers. Then the same problems, the same picture as before. What did they do? They looked around.
They used GlusterFS. And as you probably know from talks about GlusterFS, it's better than NFS, and I can confirm, yes, this is true. And GlusterFS worked until, what do you think? More than 50,000 customers on these boxes, and then the same problem as before. So it was a typical startup project,
starting small, and nowadays we have 10 times more in the meantime. So what did I do? I got involved in the project, I talked with the application architect, not the system architect, but the application architect. And I explained to him the very first slide about big clusters and this O(N²) network
I was explaining to you. And he immediately said: oh yes, this is O(N²), true, by concept. Now what do we need? We need a sharding architecture. We have converted the whole architecture to sharding, in this case sharding of small clusters, because there was already some clustering there.
And what do you think? It works until today. Even if we would need to scale up by another order of magnitude, 10 times more customers, we would be lucky if that would happen. But technically it's no problem to scale up, because sharding always scales, by its definition. So if sharding does not scale,
then the internet does not scale. Sharding scales like the internet. When sharding does not help, then the internet does not work anymore. It's the same problem, in essence. So this is what I want to tell you: big clusters have their merits. They even have their application areas.
There are use cases for it, depending on the type of work. But if you don't need them, don't use them. That's the bottom line. Any more questions? Thank you for your attention.