
Easy Geo-redundant Failover with MARS and systemd


Formal Metadata

Title
Easy Geo-redundant Failover with MARS and systemd
Subtitle
How to Survive Serious Disasters
Number of Parts
94
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
The talk describes a simple setup of long-distance replication with minimum effort. The new systemd interface of MARS will drastically reduce your effort to make your existing complex solution geo-redundant. Geo-redundancy / mass data replication over long distances is now much easier to manage for sysadmins. Although systemd has some shortcomings and earns some criticism, it can ease your automation of handover / failover when combined with the new unit-file template generator from the long-distance data replication component MARS. It is very flexible, supporting arbitrary application stacks, virtual machines, containers, and much more. MARS is used by 1&1 IONOS for geo-redundancy of thousands of LXC containers, and on several petabytes of data, with very low cost.
Transcript: English (auto-generated)
Hi, my name is Thomas, and I want to tell you something about my current work at 1&1. What I have been doing there for almost 10 years now is dealing with disasters and preventing disaster scenarios.
First I want to give some introduction about why this is needed, and why you need long-distance, asynchronous replication, not a synchronous one, for true disaster prevention and disaster management.
Then: what is the difference between long distances and short distances? There is a fundamental one, and it also shows up in cluster management, in how you manage these setups when you deal with particular loads and scenarios.
I have implemented a new solution in MARS, which is a new, very easy and simple interface to systemd. Just using systemd as the cluster manager is something very simple, and I hope you find it useful if you are using MARS or something similar.
And of course there are newer developments: I'm about to release a few new features in the next months, so there are probably some users who will hopefully benefit, and I also want to get a picture from you of what is of interest to you. An example went through the press in the last few months.
There was a smaller disaster involving some satellites. Later you could read in the press that a failure inside of a data center was the reason, and that switching over to another data center did not work as expected.
So this is one of the cases where you need... well, DR means disaster recovery, but many people think disaster recovery is only backup. The generic term disaster recovery means not only restoring from backup; it can also mean what I'm doing here.
A better term is continuous data protection, which means you continuously have a copy, not a daily one or whatever; you continuously have the data, or almost the most recent data, and you can switch over in, let's say, a few minutes or an hour or whatever. So you will survive a disaster in a much shorter time than a restore from backup. Because if you only have backups and a few petabytes of data, then of course it may take a few days or even a week or much more time to restore your business.
And this can be really disastrous, not only for your data center but for your whole business. And if you work for a publicly listed company, it might even become a formal requirement. Speaking of requirements: you all know, most of the audience is from Germany,
that BSI means Bundesamt für Sicherheit in der Informationstechnik, so it's a German authority. Around December they released a new paper. Here's the German title; in English it means that it's about the geographic locations of data centers.
It is for those cases where we are talking about critical infrastructures. And you know that the German legislation, at least, has tightened the definitions of what a critical infrastructure is. For example, if you are operating DNS of a certain size, it's likely that you fall under it; there is some chance that you are affected by this. At the moment it's just, let's say, a recommendation. And there are also some discussions, for example with the German VDE, the Verband der Elektrotechnik, and its engineers and so on.
They dislike this idea, and they have their own definition, which requires only five kilometers of distance. But when a government authority is behind it, it is quite possible that the legislation will be tightened in this direction in the future.
So it may happen that in the future you will have to deal with that, even from a compliance perspective: not only because you want to offer a good service to your customers, want to stay online and be available, but possibly also because of something more serious than a recommendation.
Okay, we want to turn the red thing into the green one; now, how to do this? This is on the next slide. So I'm going a little bit back to basics.
What you see here is a classical operating system stack, from Unix in the 1970s even. You have certain layers. What you typically provide is a service, to your customers or internally for your company, which may be a critical one in this case or not, whatever that means.
And you have certain options for distributing this. A potential cut point for distribution would be here at the application layer, and there are certain cases where this is even, I think, the best one. For example, mail queues: they can be replicated at mail level, more fine-grained and application-specific. You can read exactly the metadata of your mails, where they are going, for whom they are, what their precedence is, and so on.
So there are cases where I think that a MySQL replication is also a good example. You have more fine-grained control over transactions and so on. So there are certain cases where I would prefer the application layer. But there are other use cases where you need to be more generic, like replication of file systems or of whole virtual machines.
And here there's sometimes a question, should we do it at file system layer or at block layer? And some people think that file system layer is a good idea, but it's not a good idea for long distances.
The reason is on this slide: you have a caching layer in between those layers. In the kernel, it's the page cache and the dentry cache, and one of these caches was even introduced into the kernel by myself around 20 years ago. So this is just my former working area.
If you want to distribute the system at this layer, you have to deal with the number of file-system-related system calls per second, and the numbers on this slide are way too low. A modern server with, let's say, 32 or 64 CPU threads (it's no problem to get such servers, probably you have some of them) may see millions of syscalls per second, at least in some cases. But below the caching layer it's not only a 1:100 reduction, it may be 1:1000. With well-tuned caches, on our servers, I have measured around 1:1000.
So with a 99.9% hit rate in the cache, only the cache misses appear at the block layer. That means if you have long-distance replication to do, there's a clear answer where to do it: please don't try it at the file system layer, do it at the block layer.
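To make that reduction concrete, here is a tiny back-of-the-envelope sketch using the rough orders of magnitude from above; the rates are illustrative assumptions, not measurements:

```bash
# Illustrative numbers only: VFS-level syscall rate vs. what reaches the block layer
SYSCALLS_PER_SEC=1000000      # busy many-core server, order of magnitude from the talk
CACHE_HIT_PERMILLE=999        # ~99.9% page/dentry cache hit rate
echo "block-layer I/Os per second left to replicate: $(( SYSCALLS_PER_SEC * (1000 - CACHE_HIT_PERMILLE) / 1000 ))"
```

With these assumed numbers, a million syscalls per second shrink to roughly a thousand block-layer requests per second, which is the traffic a long-distance replication actually has to carry.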
So for long distances there is no other realistic choice here. Okay, any questions on this so far? Well, then let's look at an example at 1&1 IONOS. I'm working in shared hosting Linux, and there we have this single application (well, we have multiple applications,
but this single application is running across two data centers, which are around 50 kilometers apart for historic reasons). The total number of customer home directories is on the slide; the slides are on the internet, you can download them afterwards,
so you don't need to take a photo, but you can if you like. This is just the raw number of customer home directories; the number of domains is slightly less, I think only around 6 million or so. The number of inodes is also an interesting number: we have 10 billion of them, and we take daily backups of them.
This is also a challenge, of course, because we have extremely many small files, and the size distribution is a very interesting one: it follows Zipf's law, roughly an exponential distribution of file sizes. This is a very interesting observation from my daily work. The total space currently allocated is, I think, almost 5 petabytes in the meantime
(the slide is not up to date yet), and we have a growth rate of around 20% per year, which is certainly a challenge to deal with. And MARS is also a solution not only for replication, but also for migration of data in the background, because you can migrate data or file systems even while they are being modified.
And as far as I know, there are only a few open source components which can do that during operation while keeping strict consistency in the system. I don't know whether it's also possible with Ceph, but DRBD and MARS
can certainly do it; they are constructed for this case. OK. Well, synchronous replication doesn't work over long distances. If you don't believe it, just try iSCSI over more than 50 km, or try iSCSI through, let's say, a network bottleneck.
You can configure a small drop rate, say 1% or 2% packet loss, just for testing, and try it. Then you will see. It's very simple to test.
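If you want to reproduce that experiment, a minimal sketch with Linux traffic control could look like the following; the interface name and loss rate are placeholders, and this disturbs live traffic, so only do it on a test box:

```bash
# Inject ~2% packet loss on a test interface to see how synchronous protocols behave
tc qdisc add dev eth0 root netem loss 2%
# ... run the iSCSI / synchronous-replication experiment across this path ...
tc qdisc del dev eth0 root netem    # remove the impairment again
```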
OK, so that means you need asynchronous replication. And there you have certain options, of course. For example, the application-specific one is clear: MySQL replication is constructed for this case. It's done by the developers of that application, so you are fine with that.
But at the generic layers, you have the choice between commercial appliances and open source. Well, I've made some comparisons: it's around a factor of 10 in cost, even if you get good rebates, compared to self-built storage. So what you typically want is open source, not only because we are at an open source conference,
but because I think it's also clearly the best solution. You have two main components which could do it. One is DRBD, but it is not asynchronous. Some people think it also has an asynchronous mode, but that's not really true; I will explain it later if it's interesting for you.
It does some very small buffering, but only in the TCP send buffer, and this is way too small. It doesn't work in practice. There's a reason why we migrated from DRBD to MARS in our data centers, with the numbers you saw on the previous slide.
So MARS was originally constructed for this use case: persistent buffering in a file system, in files with log rotation, transaction logging like in a database, where each write request is one transaction. It's as if you had a database where each write request is treated as a transaction.
That means you have anytime consistency, for any write request. But the problem is not the block layer itself. If you have a file system like XFS, you are not calling sync or fsync all the time, so anyway you are losing around, let's say, 10 seconds until the page flush daemon of the kernel flushes the dirty pages.
So for typical applications you are losing, say, 10 or 20 seconds anyway, depending on your kernel parameters, at least for file system applications. So asynchronous replication isn't a real problem there:
it doesn't add much to the data loss which occurs in failure scenarios anyway. You have some data loss there, and if you want, I can explain why it's unavoidable, because the CAP theorem explains this.
Here I have a small example. Let's assume the application throughput is somewhat constant, which is never true in practice (it's also an exponential distribution), but the network is flaky, so sometimes you have packet loss. For example, if you are coupling two data centers, you will have backup traffic at night or whatever,
so you have load peaks where you get packet loss. And packet loss typically introduces very flaky behavior into the network, so you have no guaranteed throughput. If you want guaranteed throughput, you can spend a lot of money on dedicated separate lines.
You can do that, but there is a cost argument against it. So what MARS does: if the application throughput exceeds the network throughput, like in this area, it just buffers in the transaction log. And when you later have better throughput again, because the network bottleneck is gone, it catches up.
That's the simple idea here. And here, just for comparison, this would be the TCP send buffer size, which is a few megabytes at most. But in the /mars file system, where the transaction logs reside, you have gigabytes or even a terabyte if you want. So my dimensioning recommendation is: dimension /mars to survive, let's say,
a power blackout of two days, one weekend, in order to not need a full resync of all the data.
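As a rough aid for that recommendation, here is a minimal sizing sketch; the 5 MB/s average write rate is purely an assumed example, so measure your own workload before sizing anything:

```bash
# Hypothetical sizing of /mars for ~2 days of buffered transaction logs
AVG_WRITE_MB_PER_SEC=5                  # assumed average write load across all replicated volumes
SECONDS_TO_SURVIVE=$((2 * 24 * 3600))   # two days, e.g. a weekend-long outage
echo "suggested /mars size: $(( AVG_WRITE_MB_PER_SEC * SECONDS_TO_SURVIVE / 1024 )) GiB plus headroom"
```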
This is the theoretically best possible throughput behavior, and many people don't know that DRBD would become inconsistent during such phases. Why? Because if the application throughput exceeds the network throughput, you either get stuck or you have to disconnect. And when you reconnect, you have to catch up, and all the areas which have to be caught up must be resynced. During this catch-up you are inconsistent, because DRBD cannot remember the order of requests.
The transaction log, by contrast, is sequential and records the original data as it is. So even if you have a disaster, even a rolling disaster where multiple events occur, you will always have a consistent mirror; it will just reflect a state from the past,
a former state. That's all. But the point is: whatever is there is consistent, and inside one replicated logical volume it is strictly consistent, like any ordinary block device. Between the replicas, data center A and data center B,
you are eventually consistent. So we have two consistency models at two levels, which are independent of each other. Now, what is my talk about? Cluster management for this. Of course you can use some proprietary cluster manager somewhere.
We have our own. So what I'm talking about in this presentation is not really in use at 1&1 yet, only in the lab at the moment. But it may happen that future versions of our internal CM3 cluster manager become a front end for the new systemd interface, because we are using systemd for many purposes now anyway.
But our system is much, much older than systemd. It's around 20 years old now, and back then systemd did not yet exist, of course. Well, Pacemaker doesn't really work for this. Why? There's a simple reason.
Pacemaker has a shared-disk model, which was the standard model originally; its predecessors had this model back in the 1980s. The model is very simple: you have one disk which cannot fail, for example from IBM or so, which "could not really fail" in the 1980s.
That was the preferred model, and you just switch over your clients. That was the original idea, the original architecture. If you are shaking your heads, you are right.
My experience is that another group in our company tried to use Pacemaker for this. It did not work as expected, because split-brain handling is not built in and is rather clumsy to deal with. So a better cluster manager, a true geo-redundancy cluster manager with automatic failover,
is lacking at the moment. And I'm using the term cluster manager the way it is used internally at 1&1: a cluster manager is something which is triggered manually and then does the rest. So it's not fully automatic at the moment. Of course an additional layer could be added to make it automatic, but then you would have to implement a quorum and similar things,
which would be possible with the MARS metadata protocols that are already implemented, but it's not done yet. It could be one of my next projects, but it's not on my roadmap at the moment. Now, what I want to talk about is using systemd.
All of you are probably already using systemd, whether you like it or not. There have been some discussions in the community about systemd; I don't want to repeat them. It has some merits and it also has some shortcomings. But whatever your opinion about systemd is,
almost all distros have it now, and whatever you are running, you are already relying on it. The one potential problem many people see with it, I think, is the monolithic architecture.
Well, I think it's not really monolithic from a user's viewpoint, because you can install your own unit files as you like. And this is exactly the interface I'm using for MARS. The idea is: MARS has had dynamic resource creation and deletion operations from the beginning.
If you are using it, you know them: marsadm join-resource and leave-resource. You create another replica via join-resource; originally you create the resource via marsadm create-resource. Similar commands also exist in DRBD; if you know DRBD, DRBD has even taken over some parts from MARS.
leave-resource just gives up a replica, and then you can recycle it for whatever purpose. And MARS already has an internal macro processor for displaying, let's say, whatever you want to program there.
The marsadm view command uses some macros internally, and this macro processor can do more or less whatever you want. That is the idea behind the systemd interface. Okay, let's start with an example. This is an example template which is already online in the MARS repo.
It is a very trivial systemd unit template for mounting your /dev/mars/<resource> device somewhere. If you already know systemd, you will wonder what these special symbols are:
the at symbol and the percent symbol. These are simply macro processor directives. You can see that even the file name in the systemd subdirectory contains a macro; it is a substitution of the resource name. So the template is written only once
and can be used for, let's say, 100 or 1000 resources in your whole system. In this particular example, in the first line, I'm computing another variable called mount; this is the mount point. What the template really does is in these two lines:
what is mounted is /dev/mars/<resource name>, and the mount point is some path that also ends in the resource name. And I recommend you consistently use the same name for the block device, for the mount point, for the file system, for an iSCSI export, for an NFS export of this file system, whatever you are doing with it:
always use the same name, because otherwise you will get confused over time; an inflation of names doesn't help in any way. So this is a built-in convention here. And, well, what does the macro do? It just substitutes the dashes with slashes.
You know that Lennart Poettering invented this dash substitution in his unit files: if you have a mount unit for systemd, the slashes in the mount point are simply replaced with dashes. This is a systemd convention, not a MARS one.
And this macro is just the inverse operation: convert the systemd unit's name back into a mount point path with slashes. That's the idea here. It calls a substitute operation, which is a regular expression substitution in the macro processor, and then you compute the mount point.
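For readers without the slide: the following is not the actual template from the MARS repo, just a minimal sketch of what one expanded instance could look like for a hypothetical resource called mydata, using systemd's normal path escaping (which you can inspect with systemd-escape):

```bash
# systemd's dash convention: /mnt/mydata  <->  mnt-mydata.mount
systemd-escape -p --suffix=mount /mnt/mydata   # prints: mnt-mydata.mount
systemd-escape -pu mnt-mydata                  # prints: /mnt/mydata

# A hand-written mount unit roughly equivalent to one generated instance of the template:
cat > /etc/systemd/system/mnt-mydata.mount <<'EOF'
[Unit]
Description=Mount of MARS resource mydata
Requires=dev-mars-mydata.device
After=dev-mars-mydata.device

[Mount]
What=/dev/mars/mydata
Where=/mnt/mydata
Type=xfs
EOF
systemctl daemon-reload
```

The point of the MARS template mechanism is that you never write such a unit per resource by hand; the macro processor generates the equivalent from one template.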
I hope this is understandable if you know systemd; okay, I don't need to spend more time on it. Now, how do you use this template? It's very simple. Once in the lifetime of a resource, you have to create it. Typically you have a volume group with the logical volume manager,
and you create a logical volume of a certain size; our biggest are around 40 terabytes in certain exceptional cases. Then you create a MARS resource, and after you have created it, /dev/mars/<resource name> appears on the primary side. Then you place some data in it.
For example, you make a new XFS file system. Maybe I shouldn't have hard-coded XFS in the example, because you can also create a zpool here, and you can also trivially drive zpool import and export operations via systemd templates. I have even implemented that, but not published it.
I should publish it on GitHub so you can see what it looks like. So you can hand over complete zpools and ZFS instances, whatever you like; there is no limit on what's inside your block device. Okay, so you do this only once, and then you tell MARS once what the start unit name and the stop unit name are.
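A minimal sketch of that once-per-resource setup, with a hypothetical volume group vg0, resource name mydata, and invented unit names. The set-systemd-unit subcommand is how recent MARS releases expose the "tell MARS the unit names" step, but please check `marsadm help` and the MARS manual for the authoritative syntax:

```bash
# Once per resource, on the designated primary:
lvcreate -L 100G -n mydata vg0                    # backing logical volume
marsadm create-resource mydata /dev/vg0/mydata    # register it with MARS
mkfs.xfs /dev/mars/mydata                         # put a file system (or a zpool) on it

# Tell MARS which systemd unit represents the application stack on top:
marsadm set-systemd-unit mydata mnt-mydata.mount  # start unit (a separate stop unit may be given too)
```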
Here you can invent new unit names which don't even exist as files in your file system, and the macro processor of MARS will analyze them. Now let me go back one slide. Look here: this caret symbol before "unit" means it matches
against the unit name you have provided, and it turns it into a mount point name. So you have a substitution operator, which is the at symbol, and the caret symbol is the opposite: it matches against the given name. So you invent a new template name out of the blue,
you invent it just by using it, and the macro processor does what I mean. Okay, that's the basic idea behind it. So if we have 3,800 resources, we don't need to write 3,800 versions of the same systemd unit:
you write it only once, and the rest is done by the macro processor. That's it; that's the basic idea behind my presentation.
Well, an important point: MARS, and also DRBD, really distinguishes two scenarios. One is the planned handover, and the other is the unplanned failover where something breaks down. In the planned case, you ordinarily stop the application at the old primary site. Then the umount is done,
which may take some time. For example, if you have XFS instances with quota mounts and quota customers on them, it may take a while to sync all this quota information. I have corner cases where it takes a few minutes just to umount; it can happen if you have millions of inodes per instance and a few billion in total.
It depends on the file system and on what the file system implementers did there. The mount on the other side may also take some time, for journal replay and similar things; it depends on the use case, on your load, and on your data patterns. Then the handover protocol ensures
that all of the data has appeared at the new site, and then, of course, the final start at the new site is triggered.
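Without the systemd templates, the classic manual flow for such a planned handover would look roughly like this sketch; the resource and unit names are hypothetical, with the systemd interface MARS drives the unit start/stop for you, and the MARS manual remains the authoritative reference for the exact procedure:

```bash
# On the old primary: stop the application stack and release the resource
systemctl stop mnt-mydata.mount      # or your full application unit(s)
marsadm secondary mydata             # give up the primary role

# On the new primary: take over once all transaction-log data has been applied
marsadm primary mydata               # ordered handover, waits for the log replay
systemctl start mnt-mydata.mount     # mount and restart the application stack
```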
And via this method, you might even migrate the data to a completely different data center, even on a different continent, whatever it is, because the sync which is originally done for creating a new replica runs in the background. It has background priority, also on the network (lower network priority), and you can even use traffic shaping. This is very different from DRBD: try to traffic-shape DRBD's channels. Don't try it. With MARS you can do it. There you can see the difference.
That's the difference between synchronous and asynchronous replication. And it means the same mechanism can also be used for migrating data to different locations, for hardware lifecycle, if you want to get rid of your old hardware. I have done this for a few thousand machines last year.
My presentation last year was about this. Well, these are my final slides already. MARS is GPL. It's a kernel module, it's on GitHub, and the manual has much more than 100 pages; I think almost 200 now, so I think the number on the slide is wrong.
It has been in production since... oh, what's going on here? Ah, okay, something was missing on the slide. Okay: the first productive use was in 2014,
and the first mass rollout in 2015, one year later. So we have had mass production for four years now, on enterprise-critical data from which the company earns much of its revenue. So it's enterprise-critical usage, and we have a big number of servers here.
The biggest server has around 300 terabytes on LVM level. And the biggest resources are 30 or 40 terabytes in some exceptional cases. In some cases it's even a regular one. Typical sizes are only between 1 and 3 terabytes per container,
per LXC container. Total number of inodes already mentioned. And, well, we are concentrating many LXC containers on one hypervisor. Typically 7 to 10, in some cases 12 or more. It depends on the size.
So we can even dynamically grow, and in some cases even shrink, the sizes, which is on the next slide. Football is a sub-project of MARS, created last year originally for hardware lifecycle, but it's also used for load balancing.
If you have an overloaded server with, let's say, 10 LXC containers and some customers are making trouble, or you have a DDoS attack on one of them, you just migrate one of them away to another host. Well, it takes some time, it's not instant, but you can create the replica in advance
if you want to be prepared. Typical times are about two hours for one terabyte, and more than a day for the very big containers. It depends on the network, of course, and on certain other factors, and on load: if you are permanently overloading the system, it takes longer,
because the ordinary write-back by MARS takes precedence over the migration process. This is deliberate: the application should feel almost nothing of this background migration's impact on performance. For this I have implemented my own scheduler in the kernel module,
which has nothing to do with the ordinary kernel schedulers; it exists just for this functionality. It's in production, and the main operation is migrating containers from one cluster to another. The downtime is very low.
So for hardware lifecycle you don't have live migration anyway because the hardware is changing from a very old IBM Blade to a newer Dell hardware at the moment, or AMD or whatever. And this means you won't use any processor-specific functionality anyway.
Then you can shrink. This is done via a local rsync, so it creates another file system with better parameters. For example, the XFS agcount parameter was originally, 10 years ago, only four, which is a performance bottleneck; you increase it to a bigger number when you recreate the file system. Expanding is online, without downtime,
it's just using xfs_growfs. So you can dynamically grow your file systems; the tooling automatically does the LVM part, lvextend and so on. You increase the size of the logical volumes everywhere in the whole cluster,
regardless of the number of replicas, then run a marsadm resize command, which propagates it up the layers, and then you simply grow the XFS file system with xfs_growfs, and you have more space there.
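Spelled out as commands, the online-grow sequence just described could look like this sketch; the volume group, resource name, and sizes are placeholders, and in the setup from the talk this is wrapped by tooling rather than typed by hand:

```bash
# 1. Grow the backing logical volume on *every* host that holds a replica
lvextend -L +10G /dev/vg0/mydata

# 2. Let MARS propagate the new size through the replication layer
marsadm resize mydata

# 3. Finally grow the file system on the current primary, online
xfs_growfs /mnt/mydata
```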
And if the whole system is bursting because you have no space anymore, you move one of the containers away. That's the idea: load balancing via Football. So in reality you have a virtual LVM pool spanning your whole data center, and if space is missing somewhere, you just migrate a container to some other machine, to some other cluster,
provided, of course, that the networking and the rest of your infrastructure is built for that, which in our case it is. And so you don't need a storage network at all with this system. That is the big advantage. Migration traffic only occurs if you really need to migrate the data;
otherwise we have local storage in most cases. Many people believe you always need a storage network, a distinction between storage servers and client servers, but we don't have it, and with petabytes of data that is possible thanks to this online migration with MARS.
Think about it: it's much cheaper. You are not only saving the network and its costs; the storage itself is cheaper and the performance is much better, because you don't have this network bottleneck in between. iSCSI is always a bottleneck. If you have ever made a tcpdump of iSCSI traffic,
do it and you will turn pale at what's being done at the protocol level, because iSCSI has to stay backwards compatible with old SCSI protocols from the 1980s. This resource and request slot allocation is overhead, overhead, overhead, and in MARS I don't have all this overhead,
so the local device is certainly much more performant. So, I think we are at the last slide now: future plans. What am I currently doing? I don't have much time for MARS, because my main job is maintaining a downstream kernel,
which is rolled out to more than 10,000 servers in total, and I have certain special patches I shouldn't talk about, dealing with security and several other things, and with those billions of inodes and so on. So that is my main job, and MARS is just a side job, or whatever you want to call it,
but in the last few months I had some time to implement a few new features; I'm not sure whether they are interesting for you. The original MARS is a prototype from 2010/2011, originally designed in the lab, and there I used MD5 checksums in the transaction log. These MD5 checksums are very important, I think,
because they have rescued the lives of machines and of customer data many times. Very old, flaky hardware has BBU caches; after ten years the battery goes bad,
then the RAM behind it fades away, and you suddenly have RAM corruption. In the worst case that means you lose the whole data of the whole machine. It can happen; that's why you have the geo-redundancy. In these cases MARS will typically detect it,
because the MD5 checksums are wrong, and it refuses to apply the defective log files at the secondary side. You also get an error message, and then the sysadmins come to me crying and complaining that MARS is defective. No, MARS is correct; your machine is defective, and you don't want to apply this data. Yes, there is data loss.
There is no way around it, because you don't want to apply this defective data, otherwise your secondary side would be trashed as well. So: marsadm primary --force, switch over forcefully, and you accept this data loss. That's the point here.
Unfortunately, this MD5 is a performance hog in some sense; it consumes a lot of CPU. If you run MD5 with the userspace md5sum tool, you can check that yourself.
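A crude userspace illustration of that cost (this only shows the relative expense of hashing a 4K-block stream in userspace; it is not the in-kernel code path MARS actually uses):

```bash
# Push ~4 GiB of zeros through md5sum, then compare with a no-op baseline
time dd if=/dev/zero bs=4k count=1000000 status=none | md5sum
time dd if=/dev/zero bs=4k count=1000000 status=none | cat > /dev/null
```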
I have timed several CRC algorithms, and according to my benchmarks CRC32C is the fastest one, on 4K blocks only, because each block is handled (and compressed) individually. Of course that's not the best possible compression, but I want to retain the property that each log entry is a transaction in its own right, so I don't have longer compression runs.
That's not a bug, it's a feature: the blocks stay small and independent. And CRC32C is already used in the networking code; the TCP/IP stack uses it in several places. So it's nothing new, you don't really lose anything, and you don't need a cryptographically strong hash there.
It's not needed, because this is not protection against attackers, it's protection against defective hardware and similar things. But of course, in the future version you will be able to select among them: if you have plenty of CPU power and want to afford it, you can stick with MD5, but the new default will be CRC32C.
And I noticed that my test suite runs much faster than before. I have not yet measured the real impact on IOPS, but I have certain cases where 40,000 IOPS are possible on a classical local disk system without SSDs, with RAID 5.
If you want to dive in, I can even explain why this is the case, because there are existing cases where it's even faster than a raw device, and I can explain why. Some other projects: log file compression. Maybe not used at 1&1, I think, because our data center lines are good enough,
but if you have really long-distance replication, let's say from the East Coast of Australia to the West Coast, or to Asia, or to another continent, it may pay off, I'm not sure. It's compression at two layers: the log file itself can be compressed before being transferred over the network,
and an independent option is to compress only at the network transport. But if you have one primary and multiple secondaries, you then have to compress it several times. So I think both options are needed, and I will provide both of them. Then another thing:
scalability in the number of resources per host is not the best at the moment, because it has emerged from a prototype. This has to be improved, and I have some ideas how, for example several controller threads per resource or similar things. More hosts per cluster is on my medium-term roadmap,
and if you could have a big cluster of thousands of machines at the metadata exchange level, where each host knows the status of every other one, at least potentially with somewhat higher update intervals, with the Lamport clock algorithms I'm using here, then split-cluster and join-cluster operations wouldn't be necessary anymore.
You would have one big cluster at the metadata level, but the data IO path stays preferably local. From the sysadmin, operational perspective it would look very similar to a big-cluster approach like Swift or Ceph.
Okay, then, well, I think this is something I should get more time for, because at the moment I have no time for it, and if it doesn't go upstream into the Linux kernel, I fear it will not survive in the long term,
so this definitely needs to be done in the next years at least. What's also lacking is more tooling integration: at the moment it's more or less a standalone component like DRBD, and support by other projects is not the best yet, so this should be improved. It would be good if it were integrated into OpenStack and several other projects;
the problem is that I have no time for this, so community help would be a good idea. If you want to do this, please go ahead; I will support it as much as I can, to integrate it into whatever tooling you already have or would like to have.
So this is the end of my talk; let's start with questions, and I can also add whatever you like from the appendix, I have some more slides. I think you'll get a mic.
We have a question here in the front row. Yes: you mentioned that MD5 checksumming is very CPU intensive. Does that already take into account that CPUs usually have custom computation units on the die
for performing these operations, or is that not used within the Linux kernel? Yes, there are several hardware acceleration units available, but I fear we don't have them active at the moment, or whatever it is. Either the kernel is already using Intel SSE and similar instructions or not;
I'm just using the kernel config as it currently is, and we currently have kernel 4.4, which is not the newest one, because, you know, in practice you need some time to get it stable, and so on; that's where my timing is at the moment. But once I have published it,
you can download it and test it, and please give me feedback. I think there's a reason that MD5 is the slowest one: if it's not implemented fully in hardware, it will be the slowest by construction, and there's a reason why the networking people, not only in BSD but also in the Linux kernel,
are using CRC32C. There's a reason for it, I think, a traditional, very old one, and even with better hardware support, whether from AMD or Intel or both, or from ARM or whatever chipset manufacturer and server provider, I think it won't change very much.
Of course it can help a lot, but I have noticed that this is one of the bottlenecks of MARS. This bottleneck will be relieved, of course. Next question: what would be the main advantages of using MARS over DRBD?
Well, it's a matter of the application use case. In the MARS manual there's a chapter about DRBD versus MARS, and I clearly recommend: for short-distance replication where you don't have a switch in between but a crossover cable, you should use DRBD, because it consumes fewer resources.
DRBD is constructed for that case, and MARS is not; at least at the moment it doesn't attempt to attack this in any way. So there's a clear separation, with a gray zone in between, of course. If you have long-distance, asynchronous replication, don't use DRBD. You can buy the DRBD Proxy product, of course,
but we have tried it, and it's RAM buffering. It's expensive, RAM is the most expensive memory you can buy, and the recommendation is even to have dedicated hosts just for this buffering, whereas MARS does it all on the local host. Even on our biggest machines, the 300-terabyte ones,
the /mars directory has one terabyte, which is less than 1% of the total capacity, for the /mars transaction logs, and this works perfectly fine. So you typically survive an outage of one day without problems, except if you are restoring a full backup;
then, okay, there are several corner cases where it can fill up very quickly, but typical scenarios like web hosting are around these sizes, and it's not a real problem there. Okay, another question?
So, how to get started? Let's say I have an existing server that already has LVM with file systems on top, and I would like to enable MARS. Do I need to copy data around first, or can I just enable MARS? Good question. It's described in the MARS manual; there's even a step-by-step instruction
for this case, or a similar one. You need some spare space in your LVM for /mars, of course. You could start with 10 gigs, but I wouldn't recommend it; 20 is the minimum, 50 is better, let's take 100 gigs, okay? If you have a serious application,
then you should use, let's say, 200 gigs or so. If you just want a test setup, 20 gigs are enough for a first start. So you create the /mars file system. Your kernel should have the pre-patch; newer kernels also work without the pre-patch, but performance is worse then, and you need to compile MARS as an external kernel module,
which is possibly also doable via DKMS, but I wouldn't recommend it. It's described in the step-by-step instructions, and somebody has even contributed a DKMS file that I'm not using myself. You can try it, and if it has a bug (it may have), please improve it
or send it back to me, give feedback. I have not really tested it, but somebody else created it and it's in the contrib directory somewhere. Then you say modprobe mars, and then the first command you have to give is on the previous slide, already mentioned there.
No, it's on this one. You create the resource, and you may already have data there, which happens if you are migrating from DRBD to MARS. Regardless of whether you have internal or external DRBD metadata
(with external metadata it's a little simpler), you can just use the device directly, because the DRBD metadata sits at the end of the block device; you will only unnecessarily lose a few megabytes inside your block device. I have tried to make it possible to migrate directly to MARS, and also back.
I insist that it must be possible to migrate back to DRBD again, because this is open source and we have to work together. So you run marsadm create-resource against your volume group, your logical volume, and then /dev/mars/<resource> appears on your primary a few seconds later. It has exactly the same content
as your logical volume, one to one. There is absolutely no difference, except (and now come the exceptions) that if your system has a power outage, it may be inconsistent, and then you have the recovery phase from the transaction log file. This is very, very similar to MySQL.
If you have a MySQL instance and it crashes, you need a recovery phase at MySQL startup; you know this as an experienced admin, because this is how performance is optimized in databases. And if you know MySQL, MySQL replication, MySQL transaction log replay
and similar things, then you also know how MARS works. And if you know DRBD, then you already have 70 or 80% of the commands; it's very similar. The details are of course somewhat different, but as an experienced admin you will have no bigger problems, I think. Hopefully.
Several people in our company have mastered it, and it's in use by several teams; some of them use it for quite different purposes than shared hosting. And I know that several people around the globe are also using it for long-distance replication somewhere, but I don't have exact statistics on users.
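Condensed into commands, the getting-started path just described might look like the following sketch. Host names, the volume group vg0, the resource name mydata, and the /mars sizing are all placeholders, and the exact cluster commands should be checked against the MARS manual and `marsadm help`:

```bash
# On both hosts: spare LV for the transaction logs, mounted at /mars
lvcreate -L 100G -n mars vg0
mkfs.ext4 /dev/vg0/mars && mkdir -p /mars && mount /dev/vg0/mars /mars
modprobe mars                        # after building the out-of-tree kernel module

# On the primary (hostA): create the cluster and the resource on the existing LV
marsadm create-cluster
marsadm create-resource mydata /dev/vg0/mydata   # /dev/mars/mydata appears shortly after

# On the secondary (hostB): join the cluster and attach a same-sized local LV as replica
marsadm join-cluster hostA
marsadm join-resource mydata /dev/vg0/mydata     # initial full sync runs in the background
```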
And how would this look for ZFS? If you have ZFS, would you replicate the devices and then put... Ah, ZFS replication. Okay, this is a good question that I have addressed in the newer versions of the MARS manual. Some people think about
making snapshots and then incrementally replicating the snapshots. Is this possible? Yes. But there are at least two drawbacks. One is that you are losing some time: the snapshots are point-in-time snapshots, and on average the lag is something like 50% longer,
because you are replicating an old state while the new state keeps changing in the meantime. And if you do it with a script in an endless loop, which some people are doing, then each iteration runs for a few seconds.
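For illustration, such a snapshot-shipping loop typically looks like this ZFS sketch; the pool, dataset, and host names are invented, and this is exactly the pattern being criticized here, not something MARS does:

```bash
# Naive incremental zfs send/receive loop: the replication lag grows whenever
# writes outpace the transfer, and old snapshots pin space in the pool.
PREV=$(date +%s)
zfs snapshot tank/data@$PREV
zfs send tank/data@$PREV | ssh backuphost zfs receive -F tank/data   # initial full copy
while sleep 10; do
    NOW=$(date +%s)
    zfs snapshot tank/data@$NOW
    zfs send -i tank/data@$PREV tank/data@$NOW | ssh backuphost zfs receive -F tank/data
    zfs destroy tank/data@$PREV
    PREV=$NOW
done
```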
And if the write rate increases, if you are writing more data than can be replicated, the lag keeps going up, and in some sense it may happen that the ZFS pool fills up completely, and then you are in a real mess. From a practical viewpoint, you should never exhaust the space of your zpool, and these snapshots can become a serious problem there.
MARS also has a similar problem, in that the /mars directory with the transaction logs may overflow. This is called emergency mode, and there are explicit means for dealing with it. That is the one thing. The other thing is that MARS can individually switch over at short notice,
both handover and failover. Here we have the planned handover; the unplanned failover is very similar to DRBD: you would say drbdadm disconnect and drbdadm primary --force, and you just replace drbdadm with marsadm. It's exactly the same thing, so if you are using DRBD, you will know this.
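As a sketch, the forced failover just described looks like this; the resource name is a placeholder, and as stated above you deliberately accept some data loss here:

```bash
# On the surviving site, when the old primary is gone or its data is defective:
marsadm disconnect mydata
marsadm primary --force mydata   # take over forcefully; expect a split brain to resolve later
```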
And you will typically have a split brain afterwards, just as with DRBD, so there's not really a difference there. Okay, and this is per resource. If you are doing it with ZFS, you have no such safeguards: for example, you may have replication running in the wrong direction and you won't notice it. Both DRBD and MARS protect you against this.
There is control functionality in the kernel module, in the marsadm script and so on, which protects you from accidentally overwriting your good data with the old data and similar things. And there's a table in the MARS manual with a comparison, ZFS versus DRBD versus MARS, three columns.
Have a look at it, and if you think something is wrong there, we can discuss it and I will correct it, of course, because I'm not the big ZFS expert. There are some cases where the ZFS approach could be beneficial, for example if you are just creating snapshot-based backups. There it could help,
And ZFS could be the easier solution in this case. But if you want to have instant failover to the other side, we are using it not only for the emergency case, even for ordinary kernel updates. And you know, spectrum meltdown. So we typically have a kernel update each one or two months. It's clear in the last months.
Always, all the time. How is it done? You reboot the current secondary site with the new kernel and then you just switch over. And then, of course, you reboot the old primary site, which is now secondary. The downtime is a few seconds to a minute or around that, depending on the XFS mount and the transaction replay and similar things.
So a kernel update can even be done during business hours, if necessary; it's no big problem. You have a small downtime because of the 50-kilometer distance. We have no live migration implemented at the moment, and you wouldn't implement it between different kernel versions anyway.
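A hedged sketch of that rolling kernel update for a single resource; the host and resource names (host-a, host-b, lv0) are illustrative assumptions:

```bash
# 1. Reboot the current secondary site into the new kernel.
ssh host-b reboot
# ... wait until host-b is back and replication has caught up, e.g.:
ssh host-b marsadm view lv0

# 2. Hand over: make host-b the new primary. The small downtime happens here
#    (application stop on A, transaction replay and application start on B).
ssh host-b marsadm primary lv0

# 3. Reboot the old primary (now secondary) into the new kernel.
ssh host-a reboot
```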
But in practice it's good enough, and if you do it at midnight, then it's also no problem. So what's the big problem there? We are using it regularly. And there is even another use case. Say you have ten resources in total on one side; we have dimensioned the machines for this case.
The machines are dimensioned for this. But sometimes there is an overload because customers are creating endless loops with backups: cp -a of the whole home directory to a temporary space, then create a zip file, and once the zip file is created, it's copied once again. This is typically a script produced by customers.
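The kind of customer script meant here looks roughly like this caricature (not real customer code):

```bash
#!/bin/bash
# Caricature of the self-inflicted backup loop described above.
mkdir -p "$HOME/backups"
while true; do
    cp -a "$HOME" /tmp/backup-staging                     # copy the whole home directory
    zip -r "/tmp/backup-$(date +%F).zip" /tmp/backup-staging
    cp "/tmp/backup-$(date +%F).zip" "$HOME/backups/"     # and copy the archive back again
done
```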
You are laughing; you know it too. And similar things. We have a few million customers, you cannot control all of them, and they are just doing stupid things from a sysadmin viewpoint. So then, what do I do? Half of the resources go to data center A,
the other half runs at B. It's a short-term measure, and it immediately relieves the load; the load goes down. If you have an incident because of high load, because of whatever is popping up, a DDoS attack for example. Well, we have a DDoS proxy and some measures against that. But whatever could happen,
you have additional CPU power in the other data center and then you just do a butterfly operation: one half of the resources runs on the one side, the other resources run on the other side. That's the idea behind this.
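Because handover is per resource, such a redistribution is just a loop over marsadm; a hedged sketch with made-up host and resource names:

```bash
# Hedged sketch of the "butterfly" redistribution: run half of the
# resources in data center A, the other half in B.
RESOURCES=(lv0 lv1 lv2 lv3 lv4 lv5 lv6 lv7 lv8 lv9)

for i in "${!RESOURCES[@]}"; do
    res=${RESOURCES[$i]}
    if (( i % 2 == 0 )); then
        ssh dc-a-host marsadm primary "$res"   # even-numbered resources on side A
    else
        ssh dc-b-host marsadm primary "$res"   # odd-numbered resources on side B
    fi
done
```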
So for a sysadmin it's even a comfortable thing to have this handover feature, and in certain cases you will start to like it. Next question. So, currently, the handover of MARS is done manually,
if I'm understanding you right. I don't understand your question fully. We are using a lot of DRBD resources, and we have integrated those, or rather we control those, with Pacemaker and the DRBD resource agents, where we can define on which side the VM or the service is running.
And the handover is then done by Pacemaker. We even have geo clusters using DRBD and Pacemaker. Yes, it's sometimes critical, but right now everything works fine. Okay, but not with that high an availability requirement?
For high availability reasons. You are just telling the same story, because another team in our company has tried the same with MARS, and this is a very similar experience. The problem we have is that we want to achieve 99.98% reliability, and we may violate it only in rare cases.
And this means an error rate of 1% is way too much. Okay. So whatever you are doing with Pacemaker, with whatever high-level cluster management that tries to automate things, you have to be very careful. This is the reason why we are currently doing it manually. We have a 24/7 network operations center
which watches everything in the whole company, including these machines. And these guys are responsible for pressing the button at the right moment if they really detect an outage or whatever is happening there.
Of course, you lose some time if it's not automatic. But at the moment we are operating this way, and this is the proven way. And the experience with automatic failover with Pacemaker and similar ideas is that if you want to fail over or fail back, it sometimes produces false positives, false alarms and so on.
And that rate could be too high. So this is just experience, and I have not invested much time into it. Where you are right is that it should be improved. Probably I should implement another cluster manager for long distances, where a different protocol is in use, like quorum consensus done in a different way.
And MARS already, at least potentially, supports this through the Lamport clock algorithm it uses internally. The metadata, the status information, is propagated with eventually consistent protocols, so timestamps are compared. Lamport clock means that if NTP isn't working correctly
and the clocks on both sides are different, and you send a message from A to B, then B notices: oh, the message is arriving before it has been created. That cannot be, it's wrong. So its own Lamport clock is advanced by this difference, by this delta.
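The rule being described is the classic Lamport clock update on message receipt; a tiny sketch of that rule, not MARS's actual implementation:

```bash
# Tiny sketch of the Lamport rule described above (not MARS's real code):
# on receiving a message stamped t_msg, the local clock jumps forward if needed,
# so logical time never runs backwards even when NTP is broken.
receive_message() {
    local t_msg=$1
    if (( t_msg >= local_clock )); then
        local_clock=$(( t_msg + 1 ))   # adopt the sender's time plus one tick
    else
        local_clock=$(( local_clock + 1 ))
    fi
}

local_clock=0
receive_message 42    # message from a host whose clock runs ahead
echo "$local_clock"   # prints 43: we jumped forward, never backwards
```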
This means the Lamport clock is a virtual clock which always proceeds monotonically forward in time, so it never runs backwards. And this protocol ensures that when you say primary --force on one side
and the other side is disconnected because the network is down for some reason, it may operate in split brain for a while, but once the network is healthy again, it automatically reconnects. That's a difference to DRBD; with DRBD you have to issue a command. In our case, we basically have a three-node cluster with two-way DRBD replication, and the third node is for quorum only.
And if one side fails or the network disconnects, and the side with the better connection is still online and has the application running, then sometimes, yes, you have the possibility of a split brain which you have to recover manually. Yeah. But that's the downside of the DRBD failover. Yeah, yeah, yeah.
This is one of the things I have a slide for. I can explain why it happens also with DRBD: it's the CAP theorem. The CAP theorem explains why this will happen, because once you have a network, it can fail independently from your nodes; the network has its own failure modes.
This is the P property, the partition tolerance. You will have the problem whatever you are doing; this is a fundamental law like Einstein's speed of light. There is just no chance against it, okay? So that much is clear, and I think the cluster managers are not ready for this, and I first have to think about
how to implement it correctly, or at least as correctly as possible, because full correctness isn't possible; that is just what the CAP theorem is telling us. It's not possible to always do it the way you want, but you can do it as well as possible, so a best-effort principle here. And I'm not sure how to do it at
the moment. I know that there are some shortcomings. If you want to try it, I would be very excited if you would try it and give me feedback on what should be done better here. And if you want to have a good solution, we could start working together, it's no problem. I think this is one of the things that is just desperately needed by some admins
and by some use cases. I have heard similar things from many other teams who have already presented here; there is a need for it that is currently not addressed, and I don't have too much time for this at the moment. This is one of the weak points. But the problem is not MARS-specific, it exists with DRBD as well.
You are right. Any further questions? I think we are at the end of our time now, or not? Okay. Thank you for this very lively discussion; we had a lot of questions, so there is some interest. Thank you.