Geo-redundant Failover with MARS, Now and in Future
Formal Metadata
Title: Geo-redundant Failover with MARS, Now and in Future
Number of Parts: 62
License: CC Attribution 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/59715 (DOI)
FrOSCon 2022, 46 / 62
Transcript: English (auto-generated)
00:08
So, thank you for being here. Microphone is not yet on, but there's a green LED.
00:33
Something is not working, yes you are right, but what is it?
01:01
Wrong channel. Ah, now you can hear me, I can hear myself, so you probably can understand me, I hope so. Okay, then I will start. You saw the logo of where I'm working, but this is a private presentation here. So, while I'm talking from my experience as an employee at 1&1 IONOS,
01:23
you will hear something about geo-redundancy and what it means to actually do it at a certain scale. We'll talk about open source, specifically the Linux kernel and what can be done with it, in order to hopefully solve some of your problems.
01:42
Probably most of you are sysadmins, or coming from that direction, or otherwise interested. So I hope to help you in some way. What I will talk about is this: first the motivation, why geo-redundancy is needed. As communicated in the title, it's about disasters.
02:04
Like for example in the Ahrtal here in Germany, or at several other places in Europe, or wherever it may happen, and how to deal with it. This means you need asynchronous replication, I'll explain why shortly, and then the current status of where I'm working and what we already have at scale at 1&1 IONOS.
02:25
But the main part of this presentation is about the future thingy, about what is the prosumer device. The most important keywords are already here on this intro slide. I have only around 12 or 13 slides, so hopefully we have time for discussion.
02:44
And you can ask questions at any time, very simply. So if something is not understandable, please raise your hands. And I will explain whatever you want to hear about what's local, what's remote, what to do, what's service, what's planned and what's unplanned and so on.
03:04
If necessary, we can dive into details, because we have solutions for many things. All of it in total is too much for a presentation of this type, that's clear. So I hope to meet your needs. Well, we start with where I'm working.
03:21
The MARS project has been in operation since 2014. And some numbers are here. So in the meantime, we have six data centers on two continents and one island. This is the British island, but this is starting now. So those data centers are under construction in some way,
03:43
but already serve part of this platform here. Some numbers: the number of customers is in the millions range. Then the number of inodes, that is the number of files and
04:01
directories, is in the billions range. And this makes a difference. Then the amount of allocated space is not the same as the size of the file systems, since the whole file systems need to be replicated. The one is seven petabytes in total, the other one is ten.
04:20
The number of XFS instances is also probably a number you will be interested in as a sysadmin. And an important number is also the growth rate in terms of allocation. So this is just to give a little background on what I'm talking about here and what you need in order to deal with it.
04:41
Okay, if you have a small data center, of course, there are different requirements than if you have many bigger ones. But if you have questions, please don't hesitate to ask. So any questions for this slide, then let's go to the next.
05:01
Short motivation: some of it has been in the media already, you know. And of course, the situation has not become better lately, for various reasons we don't need to discuss here. What's a disaster? It may be an earthquake, or a flood like in the Ahrtal. It may be some attack of whatever kind, it need not be a terrorist attack.
05:24
A full power outage. And it's important to know it's not only the full loss of a complete data center. There are more incidents, like loss of, let's say, the network, the local network, or loss of power, or partial loss of power, only one of the three phases or whatever.
05:43
All of this may induce such a disastrous incident, which means your company, where you are working, is in danger. This is the subject we are talking about here. Your company is in danger, your products cannot be served anymore. In the worst case you may even lose your job, hopefully that does not occur.
06:05
This is the countermeasure here, or one of the potential countermeasures. And even the government is already paying some attention to this. Because if you look at the BSI, the German government authority for these cases,
06:21
it has had papers on this for years. Some of them are in German, but they can be translated; you can read them. And the bottom line is you should have at least 200 kilometers of distance, plus some more criteria, like where it is located: for example, not everything in the Rhine Valley and so on,
06:41
because there might be incidents like earthquakes, or floods in the Rhine Valley may even go all the way down to, let's say, the Netherlands or wherever. So it's not just where I'm living, it may happen everywhere. And here is also the Rhine Valley.
07:02
So it's important to think about where to place your data centers, the distances, and many other factors, if you are responsible for such a thing. Okay, questions for this? Then that should be enough. Well, then on to the next subject.
07:24
No questions? You are saving your questions. Synchronous versus asynchronous replication: some people think it can be easily done, let's say, with DRBD or whatever. There are some limitations. There are loads which will work this way, yes, but not all of them.
07:41
And for what I have shown you, I'm very sure that it does not work the synchronous way. You need the asynchronous way there, because it depends on load patterns, on the behavior of customers and so on, whether you are able to replicate this. And for most of you, this is probably not new.
08:00
If you have MySQL replication somewhere, you probably know it already, because it's also working asynchronously, but at the application layer. And if you have commercial solutions, then you simply pay for them. But we are at an open source conference, which means I can explain why it's cheaper, if you have questions on this.
08:23
And, well, what are the features of MARS? Let's look at the last item here. It's truly asynchronous, and it has persistent buffering. This means if you have a loss of a machine, or of the whole data center, or whatever, you will have a transaction log like in MySQL, and this transaction log is always usable.
08:45
This is the basic idea. You have, in addition, some CRC or MD5 checksums in order to be able to detect when your data is corrupt for whatever reason, so the transaction log itself is checked.
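To make the idea concrete, here is a toy sketch of such a checksummed, append-only transaction log. The record format, the MD5 choice, and all names are invented for illustration; MARS's real on-disk log format is different. The point it demonstrates is "anytime consistency": replay applies records in order and stops at the first truncated or corrupt record, so the replayed state is always a valid, possibly slightly old, one.

```python
import hashlib
import struct

def append_record(log: bytearray, payload: bytes) -> None:
    # record layout (invented): 4-byte big-endian length, 16-byte MD5, payload
    digest = hashlib.md5(payload).digest()
    log += struct.pack(">I", len(payload)) + digest + payload

def replay(log: bytes) -> list[bytes]:
    """Return all records up to the first corrupt or truncated one."""
    records, pos = [], 0
    while pos + 20 <= len(log):
        (length,) = struct.unpack_from(">I", log, pos)
        digest = log[pos + 4 : pos + 20]
        payload = log[pos + 20 : pos + 20 + length]
        if len(payload) < length or hashlib.md5(payload).digest() != digest:
            break                       # stop replay: the tail is unusable
        records.append(payload)
        pos += 20 + length
    return records

log = bytearray()
for data in (b"write A", b"write B", b"write C"):
    append_record(log, data)
log[-3] ^= 0xFF                         # simulate corruption of the last record
print(replay(bytes(log)))               # only the intact prefix survives
```

The mirror replaying this log simply ends up at the state before "write C", which is exactly the delayed-but-usable behavior described above.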
09:01
And newer MARS features do the same at the network transfer layer. And the important property is anytime consistency, which means even with a delay from asynchronous replication, your mirror is always usable, in the sense of the state you would have had after a power loss
09:21
at a certain point of real time, but of course, it may be a little bit delayed for whatever reason, like network outages and so on. This is the basic idea. Here's a slide we can skip; I can explain, if you are interested, why the block layer is much better
09:41
than any file system layer. Questions for this? Could you please take the mic? Are CRC and MD5 still sufficient with these data volumes? With these data volumes, you mean; well, it's independent.
10:01
So you can select for performance reasons; the CRC is much faster. Okay, but isn't it very limited? No, it's not very limited, because, oh, good question, very good question. The reason why I'm using it, whichever one you configure, it is not against attacks.
10:22
It's against loss of data, which means CRC is enough, because CRC-like checks are already used in the network layer anyway, CRC32 at the Ethernet layer and checksums in TCP, and they are already used at many other places in typical storage stacks, almost everywhere,
10:41
but you just don't see it in most cases. For example, in your disk, in the disk hardware, you have CRC-like technologies even on your platters and in your firmware, but you just don't see it. And it's automatically correcting errors, but this would be a deep dive into the storage architecture and the hardware and so on.
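As a small illustration of this point: both CRC32 and MD5 catch accidental bit flips, and CRC32 is the cheaper of the two. Only against a deliberate attacker would CRC32 be too weak, since an attacker can forge a matching CRC; against data loss, either works. A minimal Python sketch, not MARS code:

```python
import hashlib
import zlib

# a block of replicated payload, then the same block with one flipped bit
block = b"some replicated block payload" * 100
crc_ok = zlib.crc32(block)
md5_ok = hashlib.md5(block).digest()

corrupted = bytearray(block)
corrupted[123] ^= 0x01                  # single-bit accidental corruption

# CRC32 is guaranteed to detect any single-bit error; MD5 detects it too
assert zlib.crc32(bytes(corrupted)) != crc_ok
assert hashlib.md5(bytes(corrupted)).digest() != md5_ok
print("both checksums detect the flipped bit")
```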
11:02
But if necessary, you can read about it; I will give you some links. So this is a very, very good question, but it's not the subject of this talk. More questions? Then we go to the next one. Okay. So what's the current state of MARS?
11:23
It's a kernel module for the Linux kernel. You need a pre-patch in order to build it, and it has documentation: the MARS user manual, around 140 pages, maybe more than enough for sysadmins, but senior sysadmins should be and will be able to deal with it.
11:41
The slides will also be on the internet afterwards, but you can take your own photos, of course. Well, and there's another document, the architecture guide, which has even more pages, and which is interesting for architects and probably some managers dealing with geo-redundancy, which is not a simple task. It's more complex than assumed in many cases.
12:04
Here, the MARS kernel module and the marsadm tool are very similar to DRBD in concept and in operations. So it's hopefully coming close to what you probably expect from such a thing.
12:24
The most important story is that it's a kind of backbone of one of the platforms, one having millions of customers, inside of 1&1 IONOS. Okay. More questions? Ah, okay. Yes. Yes, in some sense, yes.
12:44
But we probably need the mic again, because the others didn't hear it. So IONOS is using MARS as their primary geo-redundant data strategy?
13:04
In some sense, yes; in another sense, not fully. It depends on the application class, first of all, and second, on the type of application which is running. We have a LAMP stack: Linux, Apache, MySQL, PHP. But MySQL replication is done via MySQL replication.
13:22
And this was on the previous slide. There are reasons; we can discuss them, I can explain the reasons, and they are also in the architecture guide. Just read it. Is it used? It's used at that scale, yes, it's used for file systems on XFS.
13:43
And on the next layer: we don't have VMs at the moment, but LXC containers, which can be used like VMs. We have LXC containers. Okay, they have similar properties; the main difference is that the same kernel is responsible for everything, the same kernel instance,
14:04
and this also saves some overhead. Yeah, for people who have real experience with VMs, it's clear that for this type of application it's better to have LXC than a separate kernel for everything. Also, there are some trade-offs, for example with kernel updates and so on.
14:24
But from an open source perspective, what I want to tell you about here are the future plans. Currently, only kernel 5.4 is supported, and the future story is that I need some time, because I'm probably the only guy who's really working on it, in a very big company.
14:49
But some branches are already on GitHub. They start with WIP, which is Work in Progress. And the prosumer device is also in a one-year-old branch now,
15:01
but I hope to be able to rebase it by the end of this year. So what I'm talking about is more on the documentation level, but of course, if you talk with me, you can get much more. Okay, so this would be the main part, if you have no more problems, no more questions,
15:24
because I think most of you will be here to hear what's new, what's upcoming. So what's the basic idea? Well, the typical hardware setup is that you have storage and hypervisors in different machines.
15:41
This is the classical setup in this area, while at ours we have both in one box. In reality, we also have storage, but it's on a separate plug-in card: hardware RAID controllers, which have their own CPU and essentially are like a PC, but with ARM architecture and a BBU, a battery backup unit, which means a safeguard against power loss and so on.
16:06
So the idea is that you can decouple things at the data exchange level. Only there, typically, because we will have only two or three replicas, or four in some cases; for cost reasons, you don't have too many replicas of logical volumes.
16:23
And therefore, the idea is that the MARS device, /dev/mars, is not a physical device like your disk; it can be a remote device. This is the basic idea, but whether it's a remote or a local virtual device,
16:45
does not matter anymore. This is the basic idea. For what "does not matter" means, you have to look at certain scenarios, which will be on the next slides. So from a sysadmin view, it's similar to an iSCSI-like network connection.
17:02
It has some features that iSCSI does not have, and conversely iSCSI has features not present here. So it's not a one-to-one replacement, but it's similar functionality. As seen from the MARS level, it should not matter in the future.
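As a toy model of this "local or remote should not matter" idea, here is a sketch in Python. All class and method names are invented; real MARS does this at the kernel block layer, not in userspace. The point is only that the client keeps one stable handle while the backing store behind it is migrated at runtime.

```python
class Backend:
    """A storage box (local or remote); here just a dict of blocks."""
    def __init__(self, name: str):
        self.name = name
        self.blocks: dict[int, bytes] = {}

class ProsumerDevice:
    """Stable device handle; the backend behind it may migrate."""
    def __init__(self, backend: Backend):
        self._backend = backend

    def write(self, block_no: int, data: bytes) -> None:
        self._backend.blocks[block_no] = data

    def read(self, block_no: int) -> bytes:
        return self._backend.blocks[block_no]

    def migrate(self, target: Backend) -> None:
        # copy the data in the background, then switch (heavily simplified)
        target.blocks.update(self._backend.blocks)
        self._backend = target

local = Backend("local-storage-box")
remote = Backend("remote-storage-box")
dev = ProsumerDevice(local)
dev.write(0, b"customer data")
dev.migrate(remote)             # e.g. to decommission the defective local box
print(dev.read(0))              # unchanged from the client's point of view
```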
17:20
So this means this is one of the possible hardware setups you can have in the future, but of course the client, the hypervisor, also needs to run a Linux kernel where you are able to install the MARS tooling, the low-level tooling; the high-level tooling is a different story. And the idea is you can even migrate back and forth during runtime,
17:42
between the local prosumer state, where it's exactly on the same box, as traditional DRBD also does, and the remote prosumer. This means, for example, you have a mounted device, /dev/mars, somewhere, and now you are migrating the data away in the background.
18:04
And afterwards, the customer notices nothing, does not know where his data is. It's a little bit cloudish here. And this means, for example, you want to decommission some storage box because it's defective: you migrate away, then you decommission it,
18:21
and fortunately the hypervisor was somewhere else, meaning you can use this feature in this case. Okay, that's the basic idea. Questions for this? Is it clear? Very clear, mostly, yeah? You also need a mic. In this case, can you imagine the storage boxes as either a physical machine,
18:56
or can you also imagine it, or can you also build it as like some kind of big cluster,
19:02
a Ceph cluster, and then running MARS on top of that? On top of that, it would be theoretically possible, but I would not recommend it, for reasons which are in the architecture guide. These arguments are on a different level than this presentation, because it would be an academic discussion. But I have published it.
19:21
If you read it, just select the architecture guide. There are some chapters about this. I have thought about it, but probably not now here. But if you want to talk with me afterwards, no problem. I will try to answer all your questions. Okay, more questions?
19:45
Okay, so this is clear now, hopefully. This is another variant; there are numerous variants. The idea, which you can already see here, is hardware life cycle management: it means that you get new boxes over time.
20:02
And the typical lifetime is around five years or whatever. So you have old hardware, new hardware, a mixture of whatever. You can even change the architectural model in between, if you have these new features. This means you can have hybrid machines, or you can have a mixture; that means, for load balancing options,
20:23
you have mostly hybrid machines, as we actually have, but some additional machines in order to be able, for load peaks, to migrate them, or to switch only the /dev/mars device but leave the storage where it is. These are the other options, because it's independent.
20:41
You can independently change the location of the /dev/mars virtual device, via a network connection, from where your data is living, whether it's the same machine, the same physical box, or not. This is the basic idea. Okay, so this is an example where you only have hybrids,
21:02
but of course imagine whatever you need for your use cases. This is the basic idea. The scenarios are already here: the planned handover and the unplanned failover, not to be confused with each other. The planned handover is in reality the most frequently used one.
21:21
Importance is a different story: that is the unplanned failover, because there you can lose your job; if you are not able to recover in a certain amount of time, it can be very disastrous for your company and for your job. This is the point here. But in order to upgrade the kernel, to upgrade some software,
21:41
to do whatever you need to do, the planned handover is typically the most frequent operation. So I have to look at both at the same time and support them as well as possible. And the idea is you can do a planned handover independently, of only the storage,
22:00
and leave your mount, your /dev/mars, where it is at the moment; whether it's local or not does not matter. It automatically upgrades to an iSCSI-like connection when necessary, or downgrades to local again, so no network traffic anymore,
22:20
already implemented and tested, but only at the lab level, not for production usage at the moment, but the code is already there on GitHub, and both in parallel. I will show you some examples on the next slide. Okay, more questions? No. Okay, so here are only two scenarios
22:42
out of a plethora of further scenarios which are possible in practice. This is only the unplanned case, which many people are looking at. So here you have one data center in the upper graphic; you have only one server, one replica, which is also possible.
23:02
You don't need the replication all the time, but you can use it for migration, and then you decommission the old one. Now let's assume that your client B, the original one, has a crash for whatever reason, maybe there's just a software problem or whatever,
23:20
your kernel is crashing or whatever is crashing, then you just decommission this connection in some sense, and you mount it somewhere else. It's the same as a power loss or some other incident. Your storage has been alive all the time in this special scenario, but there are other scenarios not on this graphics where only the server crashes
23:41
and the client has to survive, and so on. I have covered all of this, and you can read it in the guides. Okay, this is just explaining the complexity of what I am doing, to give you some glimpse of what this is about.
24:01
Another important scenario which often occurs is true geo-redundancy: you have long-distance replication there, and then the famous excavator destroys your network cabling, all of it in parallel, or the next flood, like in the Ahrtal, cuts all the lines in parallel.
24:22
Which would be a disaster, and it has already happened in real life. So, well, you can try to go via a satellite connection and so on, but if all of Germany wants to do that, you get a problem. So, this is just explaining what the scenarios could be.
24:43
In this case, we have a network outage here, and you typically get a split brain in such a case if you switch over: if you have no access, you cannot shut down the other one. So, there are many scenarios for geo-redundancy you should deal with, or should be able to deal with.
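The split-brain situation mentioned here can be illustrated with a toy sketch: during a partition both sides may write past the last common state, and that divergence, as opposed to a mere backlog on a lagging secondary, is what has to be detected on reconnect. Purely illustrative Python, not MARS code; the histories are just lists of write IDs:

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Length of the shared history of two transaction logs."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def is_split_brain(a: list[str], b: list[str]) -> bool:
    """Split brain: BOTH sides wrote past their common prefix."""
    p = common_prefix_len(a, b)
    return len(a) > p and len(b) > p

site_a = ["w1", "w2", "w3a"]        # kept writing during the outage
site_b = ["w1", "w2", "w3b"]        # was forced primary, also wrote
print(is_split_brain(site_a, site_b))   # True: histories diverged

lagging = ["w1", "w2"]              # a merely lagging secondary is fine
print(is_split_brain(site_a, lagging))  # False: just a backlog to catch up
```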
25:01
And I try to do my best, essentially. And there are many more pictures in the guides I have already mentioned. So, if you are interested in this topic, just go on and read them, and ask me too. Just simply go on.
25:20
More questions? Okay. Yes, so this is the slide for you, if you are interested. So, for the prosumer device, you need to check out the Git branch WIP-prosumer,
25:41
which I will hopefully update during the course of this year. And the documentation is the new version of the MARS user manual, Chapter 5, a new chapter, which already has 20 pages at the moment. And there are many pictures there, many more than in this presentation. So, there you can see
26:02
almost a kind of compendium of scenarios. Not all of them, there are even more, but at least all the major scenarios, hopefully. And if you find a problem there, please talk with me. If you know of a scenario I have overlooked, please, please talk with me.
26:24
So, and there's more documentation, more pages, in the architecture guide, where I have a high-level view on many things, for architects and companies and even for some managers. So, hopefully this could be helpful, also for the open source community.
26:41
So, this is what I have prepared for this presentation, more or less. More questions to this? Now, I hope we have enough time for the main part of this discussion, which is called discussion.
27:05
Ah, you need the mic for the discussion? Yeah, there are many questions, as I expected. So, we will hopefully have enough time. I didn't really understand: you have shown us two possible scenarios,
27:21
and I didn't really understand how your software solves these problems. I mean, with this break, where the connection between two replicas completely breaks down. What should we do then?
27:42
There's a simple answer: nothing. Well, just keep the mic for your reply to my answer, because I have prepared something for this which could explain it, but I'm not sure.
28:01
The CAP theorem. You know it? Okay. It already contains most of the answer to your question, I think, but you need to abstract it down. It means you cannot have all three of consistency, availability, and partition tolerance in parallel, and much more important is:
28:21
"pick any two". Many people believe that you can always get two of them. But in a worst-case scenario, if you have cascading events, it's even zero. So that means typically you have only one or two properties left, and typically, this is already violated
28:40
by having long-distance replication as such. This is not known to many people; this is the mathematical part of it. And then, theoretically, you can have both at the same time with consistency, but typically only local consistency, not the global one. If you interpret it as local consistency, then you already have it with MARS.
29:02
Both in parallel. But global consistency is just not possible. And this is the reason why certain concepts, like a big cluster over long distance, do not work. But this is an academic discussion. And if you have cascading incidents,
29:21
for whatever reason, more incidents, for example your outage lasts for one month or two months, or you have lost the data center forever. Forever. It has happened in the Ahrtal. Then, oh, you need to recover. You have a recovery phase; you need to build up a new data center or whatever.
29:42
It can easily end up with you having nothing anymore after some time, at least in some places. So to protect against all of this is impossible, I would say. There's always some residual risk, as we say. And in order to minimize this,
30:02
well, I do my best, but I cannot help it; I cannot fight against nature, and neither can you. But what I can do, and what I try to do, is do it in the best possible way. But of course, if you have some better experience or whatever, talk with me, I will listen to you.
30:23
Geo-redundancy is not a simple thing. It's really tricky as hell, it's a hell by itself. But many of you want to do it, or have to do it, or if you are really thinking about the next decades, then you probably should do it.
30:42
And gain some experience. And of course, yeah, this is the point of this discussion here, of this presentation. I am not the one who knows everything; I just try to do my best from the viewpoint I have in this company. But of course,
31:01
there are other solutions and other approaches too. But I try to learn where I can. So, was that an answer to your question? Okay. Hello, can you hear me?
31:22
No, I cannot hear you. Is it better now? Better, yes. Yes, okay. Ask the rest of the audience. Yeah, okay. Because I can't hear where the microphone is. I am close to you. But I don't know whether the other side will hear it.
31:41
Okay, anyway. So yeah, thank you for the talk, and thank you for the tool. I have two questions. One question is: you mentioned that it works asynchronously. So unlike, what's it called? DRBD. DRBD, which is synchronous, where it's easy to understand how it works,
32:01
and how applications can then access the storage. Yes. But if you have asynchronous operation, then if you have two clients using different storage nodes, how would they know whether this async replication has actually already happened, and become synchronized, so that both access the same data?
32:25
That's the first question. Yeah, please let me stay with this one and ask your next question afterwards. Please give me the chance to answer this, because the question is very good. I have a solution. This is called the replay link,
32:40
which says, in general: when the connection is healthy, there's only one primary node, and the rest are secondaries, in the secondary role. This is very simple, similar to DRBD, let's say version 9.0, which is not in the upstream kernel. This means you have one primary role
33:02
and whatever number of secondaries, from zero to what you need; you can dynamically create them during runtime with MARS, already in operation for years now. So this means if you switch it, there's a handover protocol already implemented,
33:21
which means: shut down the one side, and when it's stable, then start up the other one. It can take some time in asynchronous mode, because you need to catch up with the backlog. At any time you can force it, like in DRBD with drbdadm primary --force,
33:43
just replace drbdadm with marsadm. So it's very similar in concept. The difference is it may take longer, because it may take some time to catch up if you have a backlog, but only then. Okay, so this is the handover phase,
34:01
but the failover phase is possible even during whatever happens in between, because the incidents might cascade in a way you cannot predict at all. During transfer, anything can happen, and I'm testing all of this; you can even have packets which have not arrived in a consistent way
34:22
in the TCP protocol. That means you have half packets and you cannot check whether they are really correct, but there are already some countermeasures in the MARS tooling, in the internal layers of MARS. And this means I am thinking about whatever I can imagine.
34:41
I cannot do more than that. So yes, the answer is: there are some global, per-resource markers, like the replay link, which says which node should be the intended new primary. And there are other links which tell you which is the current one, and so on.
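A hypothetical handover and forced-failover session with marsadm might look like the following sketch. The resource name "mydata" is made up, and the exact flags may differ between MARS versions, so check the marsadm man page before relying on this:

```shell
# Planned handover (connection healthy): run on the intended new
# primary. MARS runs its handover protocol: the old primary shuts
# down, the backlog is replayed, then this node is promoted.
# In asynchronous operation this can take a while if there is a
# large backlog to catch up.
marsadm primary mydata

# Emergency failover (old primary unreachable): force promotion,
# analogous to "drbdadm primary --force". Any writes that had not
# yet been replicated from the old primary are lost.
marsadm primary --force mydata
```

This mirrors the DRBD workflow in concept; the main difference, as described above, is that the non-forced handover may take longer while the secondary catches up.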
35:00
And this is part of the internal handover protocol. But as a sysadmin, you typically don't need to deal with it. So hopefully this is a good answer. I have a follow-up question to this. Yes, follow-up question, please. Which is: is it possible for the sysadmin,
35:20
maybe using a tool, to detect whether the synchronization has fully happened on the different nodes? The tool supports this. You need to use the macros. marsadm has internal macros in its own macro language, and its documentation is in the man pages:
35:41
the macro names and so on, and how to call them. You can do it. You can build your own tooling on top of it. This is the intention. As open source, everything which is important for you should be available to you as a sysadmin who is responsible for, let's say, monitoring or whatever.
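A sketch of what such monitoring might look like on the command line. The macro names below are illustrative, not authoritative; the marsadm man page documents the real macro language and the available names for your version:

```shell
# Human-readable overview of the local state of all resources
# (the "for humans" level mentioned in the talk)
marsadm view all

# Per-resource macros intended for scripting/monitoring tooling
# (the "for your tooling" level). Macro and resource names here
# are assumptions for illustration only:
marsadm view-replstate mydata   # replication state of "mydata"
marsadm view-sync-rest mydata   # remaining amount until sync is done
```

The two-level split matters for monitoring: parse the machine-oriented macros in your scripts, and keep the human-readable views for interactive debugging.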
36:01
And I have two levels in this. One is the state as known on this host, and the other one is for humans. I am addressing two types of use cases: one is humans, in order not to overload your brain,
36:20
and the other one is for your tooling. There are different levels, and the documentation also tries to follow this. If you find a bug or whatever, please send me a fix, or do whatever you like to do. Talk with me. It's about the actual network transfer.
36:41
Do you have any tooling there, for example using TLS to make it secure, or do you rely on some VPN, or... is this in the documentation? This is a difference to DRBD: I don't have my own type of security. I am assuming that you have a network that is secure in some sense, whatever that means.
37:02
And I don't try to compete with... Ah, okay. It's a different story. Securing replication is typically done at the hardware level, or close to the hardware level, or at the network level. And of course, there are different departments in a big company who are responsible for this.
37:22
But there is a very simple solution in MARS: the MARS port numbers. MARS has three ports by default, and the new tool has four ports, one more for the iSCSI-like traffic. The default is 7777, four times seven,
37:40
and the surrounding port numbers. And it is easy for you, with the ordinary sysadmin tools you already have on your box, to secure this. That is the simplest answer I can give at the moment. Why do I have different ports, while DRBD has only one port per resource?
38:01
I can also explain that, if you are interested. Sure, okay. It's very simple: I am distinguishing the types of traffic. One port is for metadata only, the 7777. The next one is for replication of the transaction log files, and the third is for sync traffic.
38:20
And the reason is very simple: you want to do traffic shaping. Over long distances you need it, and traffic shaping is also a matter for the network department. In our company, or in your company, it may all be in your hands, or wherever; but you have the standard tools from the whole open source stack, from all of Linux, and you can use them.
38:43
Simply use them. You have the different traffic types, and you can build your own setup. I even use some TOS fields in the IP headers, already implemented, so the packets are already pre-classified this way. Just read the docs; it is there.
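As an illustration, shaping the per-port MARS traffic classes with standard Linux tools might look like the sketch below. The port 7777 for metadata is from the talk; the assignment of 7778 and 7779 to log replication and sync is an assumption about the "surrounding port numbers", and the tc setup is a generic example, not taken from the MARS documentation:

```shell
# Sketch: put the three MARS traffic types into different priority
# bands of a prio qdisc on the WAN-facing interface eth0.
tc qdisc add dev eth0 root handle 1: prio

# Metadata (assumed port 7777): highest-priority band
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 7777 0xffff flowid 1:1

# Transaction-log replication (assumed port 7778): middle band
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 7778 0xffff flowid 1:2

# Full-sync traffic (assumed port 7779): lowest band, so a bulk
# resync cannot starve metadata or log replication
tc filter add dev eth0 parent 1: protocol ip u32 \
    match ip dport 7779 0xffff flowid 1:3
```

Since MARS also sets TOS fields, classifying on TOS instead of ports would be an alternative design for the filters.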
39:02
And the new iSCSI-like traffic has a new port number for just this reason, because it has another priority. This is customer traffic, and customer traffic is typically much more important, maybe even more important than metadata exchange. So just adapt your network layer to your needs
39:20
as you like, but of course, you are responsible for it. Sure, okay. So yes, more questions. Mike? I haven't used MARS yet, but I use DRBD.
39:42
Would you say that MARS is also usable for non-geo-replication, so within the same data center? Okay, this is also a good question, because I had some discussions with Philipp Reisner from LINBIT a decade ago, and so on.
40:02
Well, we talked about it, and there are use cases where DRBD is better and use cases where MARS is better. So in order to check this, I would recommend reading both manuals, or the relevant parts of them, and thinking about it.
40:20
But the simple story is: I even have a table in the documentation where you can see which use case class is better suited for which tool. Just read it. If I had time, I could open it here,
40:42
but it's not in my presentation. Just read the user guide; note that this has been moved to the architecture guide, and there is the table: what is better for which use case. I am also thinking about which use cases each one is better for, because both are open source;
41:00
one is in the kernel, the other not yet, and so on. So there is no simple answer to your question, but of course, I am informed about both. It depends on your use case. Just gain some experience. Have a look at the documentation, and try it. And if you have questions, if you run into problems,
41:21
please talk with me. I am keen to support you with your questions as far as I can.
41:43
So, lots of questions until now, probably no more. Well, we are good on time, very good on time, but I can explain much more. So you can use it. You can use my time.
42:08
Yeah, just get the mic, then just ask.
42:25
I have a quick question regarding the upstream implementation on the kernel side. So it's not yet in the kernel? Yes. Only the module? No, neither. Well, you have to ask Linus in the last instance,
42:41
but it's very simple. The user space tool will probably never go upstream, because it is user space, very simply. So I think this GitHub repo will stay, but for a different purpose. I would want to have it in the kernel if I had more time for it. I'm estimating that probably
43:01
for one or two years I would need 70% of my working hours for this to really happen, because there will be discussions on LKML, and so on. I have already talked with some kernel developers, including an upstream developer, so it's not impossible. But the last decision is with Linus, as usual, of course.
43:21
Well, it's an off-topic discussion for here, but yes, this is the short answer; I can tell you more afterwards. Okay, thank you. And how many people are working on this project? One person; it's me. Okay, full time, alone on it? Okay, to be honest,
43:41
20 to 30%, but it varies over time. In the last months it has increased, for whatever reasons, maybe also political ones. There is more MARS work to do now because of the kernel part. If you want the details: the reason is that kernel 5.10
44:02
has a new interface to modules in certain places. But I am very happy that this has happened, because the new interface is, in my opinion, much better than the old one, which is now deprecated. So I want to do it, I am keen to do it, but it takes time. Very simply.
44:21
This means that at the moment I cannot support kernel 5.10 and later at the level you need, but this has high priority, also for the company. Very simply. Okay, and after this, I hope to be able to return to the prosumer device. And there is also a small problem for myself: my health.
44:41
I had a bicycle accident one year ago, and I was in the hospital for two months. Okay, so this also costs some time, and recovering my health also costs some time, but I am alive, I am here, and I want to do it. That's the full story.
45:03
So yes, more in the discussions afterwards, if you like.