Easy Geo-redundant Failover with MARS and systemd


Formal Metadata

Title
Easy Geo-redundant Failover with MARS and systemd
Subtitle
How to Survive Serious Disasters
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
2019
Language
English

Content Metadata

Abstract
The talk describes a simple setup for long-distance replication with minimum effort. The new systemd interface of MARS drastically reduces the effort needed to make an existing, complex solution geo-redundant. Geo-redundancy / mass-data replication over long distances is now much easier for sysadmins to manage. Although systemd has some shortcomings and earns some criticism, it can ease the automation of handover / failover when combined with the new unit-file template generator of the long-distance data replication component MARS. It is very flexible, supporting arbitrary application stacks, virtual machines, containers, and much more. MARS is used by 1&1 IONOS for geo-redundancy of thousands of LXC containers and several petabytes of data, at very low cost.
Hi, my name is Thomas. I work at 1&1, and what I have been doing there for almost ten years now is dealing with disasters and preventing disaster scenarios. First I want to give some introduction about why this is needed: why you need long-distance asynchronous replication — not synchronous replication — for true disaster prevention and disaster management, and what the difference is between long distances and short ones, because there is a fundamental difference, and this difference also shows up in cluster management, in how to manage such systems when you deal with the particular loads and scenarios there. I have implemented a new solution in MARS, which is a very easy and simple interface to systemd, and just using systemd as the cluster manager turns out to be very easy; I hope you will find it useful if you are using MARS or something similar. And of course I will also talk about newer developments: I am about to release a few new features in the coming months, so if there are users here, I also want to hear from you what is in your interest.
OK, an example that went through the press in the last few months: there was a smaller disaster with some satellites, and later you could read in the press that a data-center failure — a failure inside of a data center — was the reason, and the switchover to another data center did not work as expected. So this is one of the cases where you need... well, DR means disaster recovery, and many people think this only covers backup, but the generic term disaster recovery does not only mean restoring from backup; it can also mean what I am doing here. A better term is continuous data protection, which means you continuously have a copy — not a daily backup or whatever, but the data, or almost the most recent data, at all times — and you can switch over in, let's say, a few minutes or an hour, so you survive a disaster in a much shorter time than a restore from backup would take. Because if you only have a backup, and it is a few petabytes of data, it may take a few days, a week, or even much longer to restore your business, and that can be truly disastrous — not only for your data center but for your whole business. And if your company is listed on the stock exchange, it might even be demanded of you.
Speaking of demands: most of the audience is from Germany, so you know the BSI, the Bundesamt für Sicherheit in der Informationstechnik, a German federal authority. Around December they released a new paper — here is the German title; translated into English, the title says it is about the locations of data centers, for those cases where we are talking about critical infrastructure. You know that German legislation has tightened the definition of what counts as critical infrastructure; for example, if you operate a DNS service, you are likely to fall under it once you reach a certain size. At the moment the paper is just a recommendation. There are also other discussions, for example from the German association of electrical engineers and similar bodies, which have their own definitions with only five kilometers of distance; but you can imagine that government authorities, and maybe the legislation, will tighten things in this direction in the future. So it may happen that in the future you have to deal with this even from a compliance perspective — not only because you want to provide a good service to your customers and want to stay online and available, but because of something more binding than a recommendation.
OK — we want to turn the red thing into a green one, and how to do this is on the next slide. Let me go back to basics a little. What you see here is the classical operating-system stack from UNIX, from the 1970s even: we have certain layers. What you typically provide is a service, to your customers or internally for your company, whether it is a critical one or not. And you have certain options for distributing it — potential cut points where it can be distributed. One would be at the application layer, and there are cases where this is even the best choice: for example, mail queues can be replicated at mail level, more fine-grained and application-specific — you can read exactly the metadata of your mails, where they are going and for whom they are. I think MySQL replication is also a good example; you have more fine-grained control over transactions and so on. So there are cases where I would prefer the application layer. But there are other use cases where you need to be more generic, such as replication of whole file systems or of whole virtual machines, and there the question is sometimes whether to do it at the file-system layer or at the block layer. Some people think the file-system layer is a good idea, but it is not a good idea for long distances.
The reason is on this slide: there is a caching layer in between those layers — the page cache and the dentry cache — and one of these caches was introduced into the kernel by myself almost 20 years ago; this was my former working area. If you want to distribute a system at that upper layer, you have to deal with the number of system calls per second, and the numbers on this slide are actually way too low: on a modern server with, say, 32 or 64 CPU threads it is no problem to reach such levels, and you may see millions of syscalls per second, at least in the worst case. Below the caching layer, however, it is not just a 1:100 reduction; I have measured around 1:1000, that is, a 99.9 percent hit rate in the cache — only the cache misses appear at the block layer. So if you have long-distance replication to do, there is a clear answer where to do it: please do not try the file-system layer, do it at the block layer. For long distances there is realistically no other choice. OK, any questions on this? Then let's look at an example.
At 1&1 IONOS I am working in shared hosting Linux. We have multiple applications there, but this one application is running across two data centers, and the distance between the data centers is around 50 kilometers, for historic reasons. The slides are on the internet, you can download them afterwards, you don't need to take a photo. This is just the raw number of customer home directories; the number of domains is slightly less, I think around 6 million. The number of inodes is also an interesting number — we have around 10 billion — and a daily backup of them is quite a challenge, of course, because you have extremely many small files, and the file-size distribution is very interesting: it follows something like Zipf's law, an exponential-looking distribution of file sizes, which is also an interesting observation from my daily work. The total space currently allocated is, I think, almost 5 petabytes by now, and we have a growth rate of around 20 percent per year, which is certainly a challenge to deal with. MARS is a solution not only for replication but also for migration of data in the background, because you can migrate data, or whole file systems, even while they are being modified. As far as I know it is the only open-source component which can do that during operation while keeping strict consistency; I don't know whether it is also possible with Ceph, but between DRBD and MARS it is MARS that can certainly do it — it was constructed for exactly this case. OK.
Why doesn't synchronous replication work? If you don't believe it, just try iSCSI over more than 50 kilometers, or try iSCSI through a network bottleneck: you can configure a small drop rate, say 1 or 2 percent packet loss, just for testing — try it and you will see; it is very simple to test. So you need asynchronous replication, and there you have certain options. The application-specific one is clear: MySQL replication is constructed for this case, done by the developers of that application themselves, so it fits naturally. But for generic layers you have the choice between commercial appliances and open source, and I have made some comparisons: commercial appliances are around a factor of 10 more expensive, even if you get good rebates — around a factor of 10 compared to self-built storage.
So what you typically want is open source — not only because we are at an open-source conference, but because I think it is clearly the best solution. There are two main components which could do it. One is DRBD, but it is not asynchronous; some people think it also has an asynchronous mode, but that is not really true — I can explain later if it interests you: it does some very small buffering, but only in the TCP send buffer, and that is far too small; it does not work in practice. There is a reason why we migrated from DRBD to MARS in our data centers, with the numbers you saw on the previous slide. MARS was originally constructed for this use case: persistent buffering in a file, in a file system, with log rotation — transaction logging like a database, where each write request is one transaction. That means you have anytime consistency: every write request is treated as if it were a transaction. And the limiting factor is not the block layer: if you have a file system like XFS on top, you are not calling sync or fsync all the time, so you lose, say, 10 seconds anyway until the page-flush daemon of the kernel writes out the dirty pages; for typical applications you lose 10 or 20 seconds anyway, depending on your kernel parameters — at least for our standard applications. So the asynchronous replication is not a real problem there; it does not add much to the data loss that occurs in failure scenarios anyway. You have some small data loss there, and if you want I can explain why that is necessarily so — the CAP theorem explains it.
Here is a small example. Assume the application throughput is roughly constant — which is never true in practice, it is also roughly an exponential distribution — but the network is flaky, so sometimes you have packet loss; for example if you are coupling two data centers you will have backup traffic at night, or whatever, so you have load peaks with packet loss, and packet loss typically introduces very erratic behavior into the network. You have no guaranteed throughput; if you want one, you can spend a lot of money on guaranteed separate lines, but there is a cost argument against that. So what does MARS do? When the application throughput exceeds the network throughput, as in this area, it simply buffers in the transaction log, and when the throughput recovers, because the network bottleneck is gone, it catches up. That is the simple idea. And just for comparison: this would be the TCP send buffer size, a few megabytes at most, while in the /mars file system, where the transaction logs reside, you have gigabytes or even a terabyte if you want. So my dimensioning recommendation is: size /mars for, say, a power blackout of two days or a whole weekend, so that you survive it without needing a full resync of all the data. This gives the theoretically best possible throughput behavior.
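To make the dimensioning rule concrete, here is a minimal sketch of the arithmetic, assuming a sustained average write rate of 5 MB/s on the replicated resources; that rate is purely an assumption for illustration — substitute whatever your monitoring actually reports.

    # Rough /mars sizing sketch. The 5 MB/s figure is an assumed average
    # write rate; measure your own (e.g. with iostat) and substitute it.
    WRITE_RATE_MB_S=5
    OUTAGE_SECONDS=$((2 * 24 * 3600))        # survive roughly two days
    NEEDED_GB=$(( WRITE_RATE_MB_S * OUTAGE_SECONDS / 1024 ))
    echo "plan roughly ${NEEDED_GB} GiB for /mars, plus headroom for log rotation"
    # -> about 840 GiB with these numbers, i.e. on the order of 1 TB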
Many people don't know that DRBD becomes inconsistent in exactly those phases. Why? Because if the application throughput exceeds the network throughput, you either get an incident or you have to disconnect, and when you reconnect, all the areas that fell behind have to be caught up — and during that catch-up you are inconsistent, because DRBD cannot remember the order of the requests. MARS can, because the transaction log is sequential and preserves the original order of the data. So even in a disaster, or even in a rolling disaster where multiple events occur, you always have a consistent mirror; it may reflect a state from the past, a former state, but the point is that whatever is there is consistent. Inside one logical volume that you are replicating, it is strictly consistent, like any ordinary block device; between the replicas — data center A and data center B — you have eventual consistency. So there are two consistency models at two levels, independent of each other.
So what is my talk actually about? Cluster management. Of course you can use some proprietary cluster manager; we have our own. What I am presenting here is not really in use at 1&1 yet, only in the lab at the moment, but it may happen that future versions of our internal cm3 cluster manager will use the new systemd interface, will become the front end for it, because we are using systemd anyway for many purposes now. Our own system is much older than systemd — around 20 years old, constructed 20 years ago, when systemd did not yet exist. Well, Pacemaker does not really work here, and there is a simple reason: Pacemaker has a shared-disk model, which was the standard model originally; its predecessors had that model from the 1980s. The model is very simple: you have one disk which supposedly cannot fail — for example from IBM, and of course such disks did fail in reality — and you have clients, and you only switch over the clients. That was the original idea of the original architecture. You are right to shake your heads. My experience is that another group in our company tried to use Pacemaker for this and it did not work as expected, because split-brain handling is not built in and is rather clumsy to deal with. So a better cluster manager for true automatic failover over long distances is missing at the moment; it is lacking. I am using the term cluster manager here in the way it is used internally at 1&1: something that is triggered manually and then does the rest. So it is not fully automatic at the moment, but of course, as an additional layer, an automatic one could be added; then you should implement quorum and similar things, which would be possible with the MARS metadata protocols that are already implemented but not yet used for this. That could be one of my next projects, for example, but it is not on my roadmap at the moment.
Now, what I am proposing is to use systemd. All of you probably already have systemd; you are probably already using it whether you like it or not. There have been discussions in the community about systemd — I don't want to repeat them; it has some merits and it also has some shortcomings — but whatever your opinion about it is, almost all distros ship it now, and whatever you are running, you are already relying on it. The only potential problem many people see is, I think, the monolithic architecture; but from a user's viewpoint it is not really monolithic, because you can install unit files of your own as you like, and that is exactly the interface I am using from MARS. The idea: MARS has had dynamic resource creation and deletion operations from the beginning. If you use it, you know them: 'marsadm join-resource' creates another replica of a resource, 'marsadm create-resource' creates the original resource in the first place — similar commands also exist in DRBD nowadays, if you know it; DRBD has even taken over some parts from MARS — and 'marsadm leave-resource' gives a replica up so you can recycle it for whatever purpose. And I already had a macro processor inside MARS; the 'marsadm view' command uses some macros internally, and this macro processor is capable of doing more or less whatever you want to program with it. That is the idea behind the systemd interface. OK, let's start with an example.
This is an example template which is already online in the MARS repo — a very trivial systemd unit template for mounting your /dev/mars/<resource> block device somewhere. If you already know systemd, you will wonder what these special symbols are — the @ symbol and the percent symbol: these are simply directives for the macro processor. You can see that even the file name in the systemd subdirectory contains a macro; it is a substitution of the resource name. So the template is written only once and can be used for, say, 100 or 1000 resources in your whole system. In this particular example, the first line computes another variable called 'mount' — the mount point. What the template really does is in these two lines: what gets mounted is /dev/mars/<resource-name>, and where it is mounted is the mount point derived from the resource name. I recommend that you consistently use the same name everywhere: for the block device, for the mount point, for the file system, for an iSCSI export, for an NFS export of the file system — whatever you do with it, always the same name, because otherwise you will get confused over time; an inflation of names helps nobody. So that is a built-in convention here. And what does this expression do? It substitutes the dashes with slashes. You know that Lennart Poettering invented the dash convention in his unit files: in a systemd mount unit the slashes of the mount path are, by systemd convention, encoded as dashes in the unit name. This is the inverse operation: convert the systemd unit name back into a mount-point path. So it is a substitute operation — a regular-expression substitution done by the macro processor — and with it you compute the mount point. If you know systemd, I think this is understandable, so I don't need to spend more time on it.
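For orientation, here is a rough sketch of what the macro-expanded result can end up looking like for one concrete resource. The resource name "mydata" and the mount point /srv/mydata are invented for this illustration, and the actual template in the MARS repo uses its own macro syntax — so this shows only the shape of the generated mount unit, not the template itself.

    # Illustration only: a macro-expanded mount unit for an assumed
    # resource "mydata". By systemd convention the unit file name
    # srv-mydata.mount encodes the mount point /srv/mydata.
    cat > /etc/systemd/system/srv-mydata.mount <<'EOF'
    [Unit]
    Description=Mount of MARS resource mydata

    [Mount]
    What=/dev/mars/mydata
    Where=/srv/mydata
    Type=xfs

    [Install]
    WantedBy=multi-user.target
    EOF
    systemctl daemon-reload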
OK, now how do you use this template? It is very simple. Once in the lifetime of a resource you have to create it, that much is clear: typically you have a volume group under the logical volume manager and you create a logical volume with a certain size — our biggest are around 40 terabytes, in certain exceptional cases — and then you create the MARS resource on top of it. After you have created it, /dev/mars/<resource-name> appears at the primary side, and then you put some data into it, for example you make a new XFS filesystem. I should rather have said ZFS here as well, because you can also create a zpool in there, and you can trivially drive zpool import and export operations via systemd templates; I have even implemented that, but not yet published it — I should publish it on GitHub so you can see what it looks like. So you can hand over complete zpools and ZFS instances, whatever you like; there is no limit on what lives inside your block device. You do this only once, and then you have to tell MARS, once, what the start unit name and the stop unit name are. Here you can invent new names which do not yet exist as files in your file system, and the macro processor of MARS will analyze them. Let me go back one slide: this caret symbol before 'unit' means it matches against the unit name you provided and turns it back into a mount-point name. So we have a substitute operator, the @ symbol, and the caret symbol is its opposite: it matches against the given name. You invent a new template name out of the blue — you invent it simply by using it — and the macro processor works out what you mean. That is the basic idea. So if you have 1000 or 3000 resources, you do not need to write 3000 versions of the same systemd unit; you write it only once and the macro processor does the rest. That's it — that is the basic idea behind my presentation.
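Put together, the once-per-resource setup might look roughly like the sketch below. All volume and resource names are invented, and the command for registering the start/stop units is given here as 'marsadm set-systemd-unit', which is my understanding of the new interface — treat it as an assumption and check 'marsadm help' in your version.

    # Host A (future primary). Names and sizes are examples only.
    lvcreate -L 1T -n mydata vg00
    marsadm create-resource mydata /dev/vg00/mydata
    mkfs.xfs /dev/mars/mydata            # or create a zpool inside instead

    # Host B (secondary replica); the initial full sync runs in background.
    lvcreate -L 1T -n mydata vg00
    marsadm join-resource mydata /dev/vg00/mydata

    # Tell MARS once which unit to start/stop on handover
    # (subcommand name assumed, see above).
    marsadm set-systemd-unit mydata srv-mydata.mount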
An important point: MARS, like DRBD, really distinguishes two scenarios. One is the planned handover, the other is the unplanned failover when something breaks down. In the planned case you want to shut down cleanly at the old primary side, and then the umount may take some time; for example if you have XFS instances with quota mounts and quota customers on them, it may take a while to sync all the quota information. There are corner cases where just the umount takes a few minutes — it can happen when you have a few million inodes per filesystem and a few billion in total; it depends on the file system and on what the file-system implementers did there. The mount on the new side may also take some time, for journal replay and similar things; it depends on the use case, on your load, and on your data and access patterns. The handover protocol then ensures that all of the data has arrived at the new site, and the final stage is of course that everything simply gets started there.
Via this method you can even migrate the data to a completely different data center, on a different continent, whatever it is, because the sync that is originally done for creating a new replica runs in the background, with background priority and also with lower network priority; you can even use traffic shaping, which is very different from DRBD — try to traffic-shape DRBD channels, don't try it; with MARS you can. There you see the difference between synchronous and asynchronous replication. And it means the same mechanism can also be used for migrating data to different locations, for hardware lifecycle, when you want to get rid of your old hardware; we did this for a few thousand machines last year — my presentation last year was about this.
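Such a migration can be scripted with the same replica mechanism; a rough sketch, with host and resource names invented, could look like this:

    # Migrate resource "mydata" from old host A to new host C.
    # On host C: create backing storage and join as an additional replica;
    # the initial sync runs with background priority.
    lvcreate -L 1T -n mydata vg00
    marsadm join-resource mydata /dev/vg00/mydata

    # Watch the sync progress until the new replica has caught up.
    marsadm view mydata

    # Then a planned handover to host C, and retire the old replica:
    marsadm primary mydata         # on host C
    marsadm leave-resource mydata  # afterwards, on host A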
Well, these are my final slides already. MARS is GPL, it is a kernel module, it is on GitHub, and the manual has far more than 100 pages — I think almost 200 by now; the number on this slide is wrong. It has been productive since... hmm, something is missing on this slide — OK: the first productive use was 2014, and the first mass rollout was 2015, one year later, so we have had it in mass production for four years now, on enterprise-critical data from which the company earns a large part of its revenue — so it is enterprise-critical usage. We have a large number of servers; the biggest servers have around 300 terabytes in LVM, and the biggest resources are 30 or 40 terabytes in some exceptional cases — in some setups that is a regular size, but typical sizes are only between one and three terabytes per container. The total number of inodes I already mentioned. We concentrate many LXC containers on one hypervisor, typically seven to ten, in some cases twelve or more, depending on their size. And we can dynamically grow, and in some cases even shrink, the sizes, which is on the next slide.
Football is a sub-project of MARS, created last year, originally for hardware lifecycle, but it is also used for load balancing: if you have an overloaded server with, say, 10 LXC containers, and some customers are causing trouble, or there is a DDoS attack on one of them, you simply migrate one of them away to another host. It takes some time, it is not instantaneous, but you can create the replicas in advance. If you are curious about typical times: for one terabyte it is about two hours, and for the very big containers it is typically more than one day; it depends on the network, of course, and on the load — if you are permanently overloading the system it takes longer, because the ordinary writes replicated by MARS take precedence over the migration traffic. That is what you want: the application should feel almost nothing of a migration running in the background.
For this I implemented my own scheduler in the kernel, which has nothing to do with the ordinary kernel schedulers; it exists only for this functionality. It is in production, and the main operation is migrating containers from one cluster to another. The downtime is very low. For hardware lifecycle you do not have live migration anyway, because the hardware changes from a very old IBM blade to newer Dell hardware, from Intel to AMD or whatever, which means you cannot rely on processor-specific functionality anyway. You can also shrink: that is done via a local rsync into a freshly created file system with better parameters — for example the XFS allocation-group count; ten years ago the default was only 4, which is a performance bottleneck, so you increase it to a bigger number when you recreate the file system. And you can expand online, without downtime, simply by using xfs_growfs. So you can grow a file system dynamically: first you do the LVM lvextend and so on, increasing the size of the logical volumes everywhere in the whole cluster, regardless of the number of replicas; then you run the 'marsadm resize' command, which propagates the new size up through the layers; and then you simply grow the XFS file system with xfs_growfs, and you have more space.
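As a sketch with invented names — the point is the ordering: grow the underlying LVs on all replicas first, then let MARS propagate the size, then grow the filesystem on the primary.

    # 1. On every replica (primary and all secondaries):
    lvextend -L +500G /dev/vg00/mydata

    # 2. Once, on the primary: propagate the new size through MARS.
    marsadm resize mydata

    # 3. Grow the filesystem online on the primary (XFS example).
    xfs_growfs /srv/mydata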
And if the whole machine is running out of space because there is none left anywhere, you move one of the containers away — that is the idea of load balancing via Football. So in effect you have a virtual LVM pool spanning your whole data center: if space is missing somewhere, you just migrate your container to some other machine, to some other cluster, provided the networking and the rest of the infrastructure is built for that — in our case it is. And so you do not need a storage network at all with this system; that is the big advantage. Migration traffic only occurs when you really need to migrate data; otherwise we have local storage in most cases. Many people believe you always need a storage network, a separation between storage servers and client servers, but we do not have one, and with petabytes of data that is possible thanks to the online migration MARS provides. Think about it: it is much cheaper — you are not only saving the network and its costs, the storage itself is cheaper — and the performance is much better because you do not have that network bottleneck in between. iSCSI is always a bottleneck; if you ever take a tcpdump of iSCSI traffic, do it and you will turn pale at what happens at the protocol level, because iSCSI has to stay backwards compatible with old SCSI protocols from the 1980s: reserve this resource, request a slot allocation — it is overhead upon overhead. In MARS you have none of that, so the device that appears locally is certainly much more performant.
Now the last slide: future plans. What am I currently doing? I do not have much time for MARS, because my main job is downstream kernel maintenance, which is rolled out to more than ten thousand servers in total, with certain special patches I should not talk about, dealing with security and several other things, and with those billions of inodes and so on. That is my main job, and MARS is more of a side job, whatever you want to call it. But in the last few months I had some time to implement a few new features, and I am not sure whether they are interesting for you. The original MARS is a prototype from 2010/2011, originally designed in the lab, and it used MD5 checksums in the transaction log. These MD5 checksums are very important, I think, because they have saved machines and customer data many times: on very old, flaky hardware, the battery-backed (BBU) caches are ten years old, the batteries go bad, the RAM starts fading, and suddenly you have RAM corruption. In the worst case you lose the whole machine, all the data — it can happen; that is what the geo-redundancy is for. In that case MARS will typically detect it, because the MD5 checksums are wrong, and it refuses to apply the defective log files at the secondary side; you also get an error message. Then the sysadmins come to me complaining that MARS is broken, and I say: no, MARS is correct, your machine is broken, and you do not want to apply this data. The data since, say, yesterday is lost; there is no way around it, because if you applied the defective data, your secondary side would be trashed as well. Please switch over forcefully with 'marsadm primary --force' — and yes, you are told explicitly that you accept this data loss; that is the point.
Unfortunately this MD5 is, in some sense, the performance hog; it consumes a lot of CPU. I have timed several CRC algorithms, and CRC32C is the fastest according to my benchmarks with 4K blocks — only 4K, because each block is compressed individually. Of course that is not the best possible compression, but I want to retain the property that each log entry is a transaction of its own, so I do not use longer compression runs; it is not a bug, it is a feature to have these small compressed blocks. CRC32C is already used in the networking code — the TCP/IP stack uses it in several places — so you are not really losing anything, and you do not need a cryptographically strong hash there: it is not meant to defend against attackers, it is against defective hardware and similar things. Of course, in the future version you will be able to select among them, so if you have plenty of CPU power you can stick with MD5, but the new default will be CRC32C. I noticed that my test suite now runs much faster than before; I have not yet measured the real impact on the IOPS, but there are cases where 40,000 IOPS are possible on a classical disk system without SSDs, with a local RAID-5 — if you want to dive in, I can even explain why, because there are even cases where it is faster than the raw device, and I can explain why that is.
Another project is log-file compression. It may not be used at 1&1, I think, because our data-center links are good enough; but if you have a really long-distance replication — say from the east coast to the west coast, or to Asia, or to another continent — it may pay off; I am not sure. It is compression at two layers: the log file itself can be compressed before being transferred over the network, and an independent option is to compress only at the network transport — but with one primary and multiple secondaries you then have to compress it several times. So I think both options are needed, and I will provide both. Then there are other things: scalability — the number of resources per host is not the best at the moment, because MARS has emerged from a prototype; this has to be improved, and I have some ideas how, for example several control threads per resource, or similar things.
A metadata broadcast across many hosts is on my medium-term roadmap: if we could have one big cluster with thousands of machines at the metadata-exchange level, where each node knows the status of every other node — potentially with slower update intervals, using the Lamport clock algorithms I am already using — then split-cluster and merge-cluster operations would no longer be necessary. You would have true big-cluster operation at the metadata level, while the data I/O path preferably stays local; from the sysadmin's, the operational, perspective it would then look very similar to a big-cluster approach like Swift or Ceph. Well, I think this is something I should get more time for, because at the moment I have none; and if it does not go upstream into the Linux kernel, I fear it will not survive in the long term, so that is something which definitely needs to be done in the next years.
What is also lacking is more tooling integration: at the moment MARS is more or less a bare component, like DRBD, and the support by other projects is not yet the best, so this should be improved — it would be good if OpenStack or several other projects integrated it. The problem is that I have no time for this, so community help would be very welcome; if you want to work on this, please contact me — I will support you as much as I can to integrate it into whatever tooling you already have or would like to have. So this is the end of my talk; let's start with questions, and I can also show whatever you like from the appendix — I have some more slides.
Yes, I think you'll get a microphone — there is a question here.
Question: In the first part you mentioned that the MD5 checksum is very CPU-intensive. Does that already take into account that CPUs usually have dedicated units on the die for these operations, as used within the Linux kernel?
Answer: Yes, there are several hardware acceleration units available, but I fear we do not have them active at the moment, or whatever it is — whether they are already being used internally, I am not sure; I am just using the kernel config as it currently is, and the kernel we currently run is not the newest one, because in practice you need some time to get it stable and so on. That is what my timing says at the moment; once I have published it you can download it, test it, and please give me feedback. But I think there is a reason that MD5 is the slowest: if it is not fully implemented in hardware it will be the slowest by construction, and there is a reason why the networking people, not only in BSD but also in Linux, currently use CRC32C — a traditional, very old reason. Even despite hardware support from AMD or Intel or both, or from ARM, from whatever chipset manufacturer or server vendor, I think the picture will not change very much; of course it can help a lot, but I have noticed that this is one of the bottlenecks of MARS, and it will be relieved, of course.
Question: What would be the main advantages of using MARS over DRBD?
Answer: It is a matter of the application use case. In the MARS manual there is a chapter on DRBD versus MARS, and I clearly recommend: for short-distance replication, where you do not have a switch in between but a crossover cable, use DRBD, because it consumes fewer resources; DRBD is constructed for that case, and MARS is not — at the moment it does not even attempt to address it. So there is a clear separation, with a grey zone in between, of course. For long-distance, asynchronous replication, do not use DRBD. You can buy the DRBD Proxy product, of course, but we have tried it: it buffers in RAM, which is expensive — the most expensive memory you can buy — and the recommendation would even be to have dedicated hosts just for this buffering. MARS does it all on the local host: even on our biggest machines, the 300-terabyte ones, /mars has one terabyte, which is less than one percent of the total capacity, for the transaction logs, and it works perfectly fine. Typically this survives an outage of one day without problems — except if you are doing a full restore of your full backup, of course; there are several corner cases where it can fill up very quickly, but typical usage scenarios like web hosting stay around these sizes, and it is not a real problem there.
Question: Another question — how do I get started? Say I have an existing server that already has LVM and file systems on top, and I would like to enable MARS. Do I need to copy data around first, or can I just enable it?
Answer: Good question. It is described in the MARS manual; there is even a step-by-step instruction for this case, or for a very similar one. You need some spare space in LVM for the /mars filesystem. You could get by with ten gigabytes, but I would not recommend it; 20 is the minimum, 50 is better, take 100 gigabytes to be safe — if you run a serious application, use, say, 200 gigabytes, and if you only want a test setup, 20 gigabytes are enough for a first start.
So you create the /mars filesystem. Your kernel should have the pre-patch — it also works without the pre-patch, but then the performance is worse — and you need to compile MARS as an external kernel module, which is also possible via DKMS, although I would not recommend that; the rest is the step-by-step instruction. There is even a DKMS file posted; I am not using it myself, somebody else contributed it, it is in the contrib directory somewhere — you can try it, and if it has a bug, please improve it or send me feedback, because I have not really tested it. Then you say 'modprobe mars', and then the first command you have to give was already mentioned on a previous slide. If you already have data on the volume — which is the situation when you are migrating from DRBD to MARS — you can do this regardless of whether you have internal or external DRBD metadata; external is a little simpler, but you can use the device directly either way, because the DRBD metadata sits at the end of the block device.
You merely lose a few megabytes inside your block device unnecessarily. So you can migrate directly from DRBD to MARS — and also back: I insist that migrating back to DRBD remains possible, because this is open source and we have to work together. Then you run 'marsadm create-resource' on the logical volume from your volume group, and /dev/mars/<name> appears on your primary a few seconds later, with exactly the same content as your logical volume, one to one; there is absolutely no difference.
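A condensed sketch of that bootstrap, with all names and sizes invented; the authoritative step-by-step procedure is in the MARS manual.

    # One-time per host: a dedicated filesystem for the transaction logs.
    lvcreate -L 100G -n mars vg00
    mkfs.ext4 /dev/vg00/mars
    mkdir -p /mars && mount /dev/vg00/mars /mars

    modprobe mars
    marsadm create-cluster            # on the very first host only

    # Turn an existing LV (e.g. one formerly used by DRBD) into a MARS
    # resource; its content is taken over as-is.
    marsadm create-resource mydata /dev/vg00/mydata
    ls -l /dev/mars/mydata            # appears a few seconds later on the primary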
Except — and now come the exceptions: if your system has a power outage, the device may be inconsistent, and then there is a recovery phase from the transaction log file. This is very similar to MySQL: if a MySQL instance crashes, there is a recovery phase when it starts up; you know this as an experienced admin, because this is how performance is optimized in databases. If you know MySQL, MySQL replication, transaction-log replay and similar things, then you also roughly know how MARS works; and if you know DRBD, then 70 to 80 percent of the commands are very similar too. The details differ, of course, but as an experienced admin you will have no major problems, I think — hopefully. Several people in our company have mastered it, and it is in use by several teams, some of them for purposes quite different from shared hosting, and I know that several people around the globe also use it for long-distance replication somewhere, but I have no exact statistics on user numbers.
Question: How would it look with ZFS? If you have ZFS, would you replicate the devices underneath, or would you use ZFS replication?
Answer: This is a good question, which is addressed in newer versions of the MARS manual. Some people think that making snapshots and then incrementally replicating them is the way to go; it is possible, yes, but there are at least two issues. The first is that you lose time: the snapshots are point-in-time snapshots, and on average the lag is around 50 percent longer than the snapshot interval, because you are replicating an old state while the new state keeps changing in the meantime. If you drive it with scripts in an endless loop, as some people do, each run takes a few seconds, and if you are writing more data than can be replicated, the lag keeps growing, and it may happen that the ZFS pool fills up completely — then you are in a real mess; from a practical viewpoint you should never exhaust the space of your ZFS volume, and these snapshots can become a serious problem. MARS has a similar issue in that /mars with the transaction log files may overflow; this is called emergency mode, and there are explicit means for dealing with it. That is the first thing. The second is that MARS can switch over individually per resource, both planned handover and failover. The unplanned failover is very similar to DRBD: with DRBD you say 'drbdadm disconnect' and 'primary --force'; you just replace drbdadm by marsadm and it means exactly the same thing. So if you are using DRBD you know this, and you will typically have a split brain afterwards, just as with DRBD — there is not really a difference there.
That is with MARS; if you do this with ZFS you have no such safeguards — for example, replication may run in the wrong direction and you will not notice. Both DRBD and MARS protect you against this: there is control functionality in the kernel module and in the marsadm script which protects you from accidentally overwriting your good data with old data, and similar things. There is a table in the MARS manual comparing ZFS versus DRBD versus MARS, three columns; look into it, and if you think something there is wrong, we can discuss it and I will correct it, of course — I am not a big ZFS expert. There are cases where ZFS can be beneficial: if you are just creating snapshot-based backups, it can help and may be the easier solution. But if you want instant failover to the other side —
— we use that not only for emergencies but even for ordinary kernel updates; you know, Spectre and Meltdown, so we typically have a kernel update every one or two months, more or less all the time. How is it done? You reboot the currently secondary side first, so it gets the new kernel, then you simply switch over, and then of course you reboot the old primary side, which is now secondary. The downtime is a few seconds up to about a minute, depending on the XFS mount, transaction replay and similar things, so a kernel update can even be done during business hours if necessary; it is no big problem, you just have a small downtime. Because of the 50-kilometer distance we have no live migration implemented at the moment, and you would not live-migrate between different kernel versions anyway; for practice it is good enough, and if you do it at midnight it is no problem at all — so what is the big problem?
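The kernel-update dance, sketched as commands; the resource name is invented, and in reality you would loop over all resources on the host and let the MARS systemd units start the services after the handover.

    # Host B is currently secondary: reboot it into the new kernel first.
    # When B is back and has caught up, hand each resource over to B:
    marsadm primary mydata            # run on host B (planned handover)

    # Then reboot the old primary A into the new kernel; it comes back
    # as secondary and catches up on the few minutes it missed.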
We use this in regular operation, and there is yet another use case. If you have, say, 10 resources in total on one host, the machines are dimensioned for that, but sometimes there is an overload because customers build endless backup loops: copy the whole directory with 'cp -a' to some temporary space, then create a zip file, and once the zip file is created, copy it once again — a typical customer-written script; you are laughing, you know it too. We have a few million customers, you cannot control all of them, and some simply do stupid things from a sysadmin's point of view. So what do you do? Half of the resources run in data center A, the other half in B — as a short-term measure — and that immediately halves the load; the load goes down. If you have an incident with high load because something is flaring up, because of a DDoS attack or whatever — we have a DDoS proxy and other countermeasures, but still — you have additional CPU power in the other data center, and then you just run a butterfly operation: half of the resources run on one side, the other resources on the other side. That is the idea behind it. So for a sysadmin this handover feature is even a comfortable thing to have, and in certain cases you will start to like it. OK, next question.
Question: So currently the failover, or the handover, with MARS is 99 percent done manually, if I understand it right?
Answer: I don't quite understand your question yet.
Question: We are using a lot of DRBD resources and we control them with Pacemaker and the DRBD resource agents, where we can define on which side the VM or the service runs, and the handover is then done by Pacemaker. We even have HA clusters using DRBD and Pacemaker; it is sometimes critical, but right now everything works fine.
Answer: OK, but you are telling the same story: another team in our company tried the same with MARS, and the experience was very similar. Our problem is that we want to achieve 99.98 percent reliability, and we violate that only in very rare cases; that means an error rate of 1 percent is far too much. So whatever you do with Pacemaker, or any high-level cluster management that tries to automate these things, you have to be very careful, and that is the reason why we currently do it manually.
We have a 24/7 network operations center which supervises everything in the whole company, including those machines, and these people are responsible for pressing the button at the right moment when they really detect an outage or whatever is happening. Of course you lose some time because it is not automatic, but at the moment we operate this way and it is the proven way; the experience with automatic failover via Pacemaker and similar approaches is that, when failing over or failing back, it sometimes produces false-positive alarms and so on, and that rate can be too high. This is just our experience — we have not invested much time into it. So you are right that it should be improved. Probably I should implement another cluster manager for long distances, with a different protocol, something like quorum consensus done in a different way; and MARS already at least potentially supports this through the Lamport clock algorithm it uses internally, because the metadata — the status information — is propagated with eventually consistent protocols and timestamps are compared. Lamport clock means: if NTP is not working correctly and the clocks on both sides differ, and you send a message from A to B, and B notices that the message appears to arrive before it was created — which cannot be right — then the local Lamport clock is simply advanced by that difference, by that delta. So the Lamport clock is a virtual clock which always advances monotonically; it never runs backwards. And this protocol ensures that when you say 'primary --force' on one side while the other side is disconnected, because the network is stalled for some reason, the system may operate in split brain for a while, but once the network is healthy again, it reconnects automatically — in contrast to DRBD, where you have to do it by command.
Question: OK, we have a three-node cluster, basically with two-way DRBD replication; the third node is for quorum only. If one side fails, or the network disconnects, the side with the better connection is still online and keeps the application running — and sometimes, yes, you have the possibility of a split brain, which you have to recover manually. That is the downside of the DRBD failover.
Answer: Yes, that is one of those things — there is even a slide for it. I can explain why it happens: it is the CAP theorem. The CAP theorem explains why this will happen, because once you have a network, it can fail independently of your nodes; the network has its own failures — the P, the partition property. You will have this problem whatever you do; it is fundamental, like Einstein's law of the speed of light — you simply have no chance against it. So that much is clear, and I think the cluster managers are not ready for this; I first have to think about how to implement it correctly, or at least as correctly as possible, because full correctness is not achievable — that is exactly what the CAP theorem tells us: you cannot always get the behavior you want, only the best possible, the best-effort principle. I am not sure how to do it at the moment; I know there are shortcomings. If you want to try it, I would be very happy if you did and gave me feedback on what should be done better, and if you want a good solution we could start working on it together, no problem. I think this is one of the things that is desperately needed by some admins and some use cases — I have heard similar things from many other teams where I have presented — and it is currently not addressed; I do not have much time for it at the moment. It is one of the weak points, but the problem is not MARS-specific; it is the same with DRBD, you are right. So, any other questions? I think we are at the end of the time slot now. OK — thank you for this very vivid discussion, we had a lot of questions; thank you for the interest.
[Applause]