
Self-hosted server backups for the paranoid


Formal Metadata

Title
Self-hosted server backups for the paranoid
Subtitle
Using Borg, SSH, Python and FreeNAS to securely backup Linux servers
Title of Series
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Quarkslab is a French company specializing in information security R&D, consulting and software development. Due to strong data security constraints imposing self-hosted solutions, coupled with limited resources in a fast-growth environment, data safety has been a pain point in our infrastructure. After our backup server failed, we decided to recreate a new backup system from scratch, adapted to our needs and using technologies we were familiar with, to back up 30+ Linux servers. In this talk, we will present how our old backup system failed, the key requirements we learned from this failure, and how we designed and implemented a new backup system based on Borg Backup, borgmatic, SSH, Python and FreeNAS to solve those requirements. We will conclude by listing the shortcomings and improvement points of our approach, as well as comparing our solution to seven important properties every backup system should have.
Transcript: English (auto-generated)
Hello, everyone. So my name is Axel. I work at Quarkslab, which is an information security company. I'm part of the infrastructure team, and one of our roles is to manage all the servers and services of Quarkslab. One of our duties, of course, is to handle backups of all the services. After we had an issue with a backup server, we decided to redo the whole backup system at Quarkslab, and in this talk, I will present the solution we designed and implemented. It can also be applied if you have
a personal infrastructure. You can also back it up using this solution. Everything is on GitHub, and you have the links at the end of the presentation. So basically, at Quarkslab, our infrastructure is composed of different types of servers. You have virtual machines, of course. You also have bare metal servers that are on-premise.
Some project servers, as we call them, can contain sensitive information that we only want a limited number of people at Quarkslab to have access to. And sometimes that doesn't include us, but we have to manage them and back them up as well, and
find a solution to do it. And we also have some servers in the cloud: mostly bare metal servers at hosting providers, which are encrypted as well, that we need to handle. So basically, what we want is to be able to back up all those types of servers. So we looked at existing solutions to manage backups.
And we found that they are quite complex. Often, they try to handle all the cases and be very generic, to be able to back up Windows servers and all kinds of machines. They also often need an agent on the host that runs continuously, and a server, maybe with a database. They provide access control lists, web interfaces to manage the backups, things like that. And we found it quite hard to actually understand every part of each system. What we believe is that we need to understand how everything functions, to be able to debug any issue when something goes wrong. So we decided to go with tools we knew, and we decided to package them together
to be able to answer our needs. So basically, backups rely on two things. To have effective backups, you need to have effective storage. So basically, if you write data to disk, backup data to disk,
you want to be able to get it back later. That's very important: even if the backup software is really advanced, it can't survive a hardware error on its own. So to handle the hardware and storage side of things, we decided to go with two tools, which
are FreeNAS and OpenZFS. FreeNAS is a FreeBSD-based distribution that you can install on a server. The server can have any disks you want; they don't have to be SAS drives or enterprise drives, they can be off-the-shelf consumer hard drives. So basically, it transforms the server into a very powerful solution to effectively store files.
So it offers interfaces like NFS, so you can access and store your data on the server. It also does a really great job of sending emails and automating maintenance, like, for example, disk scrubbing.
So you want to verify that all the data on your disks is intact, and you can use it very effectively to do that. This server relies on the ZFS file system. ZFS is more than just a file system; it also handles the RAID side of things.
So you provide it with multiple disks, and you want redundancy across the data you write to those disks. ZFS can do this. It also offers functionality similar to LVM.
So for example, if you need to create separate partitions, let's say, you can do it quite easily, and resize them. And what it brings with those kinds of virtual partitions, which are called data stores, is the ability to take snapshots of them.
So at a point in time, you can just freeze the contents of the data store where you keep your data. And then you can mount it read-only later if you want, so you can access those files as they were at the time you took the snapshot.
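As a sketch, creating such a point-in-time snapshot and browsing it later can look like this with the ZFS command-line tools (the pool, dataset and snapshot names are hypothetical):

```shell
# Freeze the current contents of the dataset that holds the backups.
zfs snapshot tank/backups@2020-02-01

# List the snapshots that exist for this dataset.
zfs list -t snapshot -r tank/backups

# Snapshots are exposed read-only under the hidden .zfs directory,
# so files can be browsed exactly as they were at snapshot time.
ls /mnt/tank/backups/.zfs/snapshot/2020-02-01/
```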
And another ability is that you can actually send snapshots to other hosts that run ZFS, and FreeNAS does a great job of integrating this. This allows you, for example, to do off-site backups of your data. You just take a snapshot of your data store, and you can just send it
to a remote ZFS server. So you have the ability to self-host a complete backup system. You don't have to rely on existing backup services or on servers hosted by other people. And at Quarkslab, we are very keen on self-hosting everything we can, because we sometimes deal with sensitive information
and we don't trust external entities to handle them for us. So the issue you also have to take care, of course, in hardware is actually the hard drives themselves. So, for example, here, this is a Seagate model. At the 32% failure rate after a few years of work,
which is quite unusual. Seagate was trying a new technology at the time, and they claimed that it was, of course, very safe, but it wasn't really, so they faced a class action lawsuit. So if your server happens to have those disks inside,
then you have a very high probability of failure, and even ZFS won't be able to do anything about it if you lose the majority of your drives. So it's very important to have different types of disks in your server.
So effective storage is one part of it. The other part is having effective backups. Here, we decided to use Borg, because we already used Borg ourselves. For example, I use Borg to back up my personal laptop, and I decided that maybe it was also a good idea to use this tool to back up our servers.
Borg provides a number of advantages. It's open source and ships standalone binaries, so you can just drop it on a server and it works. You don't have to install a ton of dependencies to make it work, which makes it very easy to install on a heterogeneous infrastructure. It handles compression, deduplication, and encryption. And basically, it works at the block level of the files.
So instead of treating each file individually, it treats data as blocks. A file can consist of multiple blocks, and it does the deduplication across those blocks, not across the files themselves. And it's really powerful because, for example,
if you have virtual machine disk images on your system, you can have multiple virtual machines that share the same base image, with small modifications inside them, of course. Borg recognizes the data that is the same in every file, and it allows you to deduplicate it very effectively.
Only the data that is unique across all the files in the file system will be backed up. So the deduplication is very powerful, and of course, there is compression on top of it as well. This means that if your data is highly compressible,
like logs, for example, it will be compressed quite effectively. It's also quite fast, and you can do remote backups over SSH. Here is an example: the last backup of my personal machine, this laptop. You can see that it did a backup of 1.4 million files
in four minutes. There were 350 gigabytes of used data on the drive, and it compressed them to 228 gigabytes just by compressing the files that were highly compressible. And it further deduplicated the data. This is my last backup; of course, I had previous backups. In this last backup of 350 gigabytes, only 247 megabytes had changed
between the last backup and this one. So it only stored 247 megabytes of new data on the disk. And you can see that it is quite effective, because the total amount of data ever backed up on my machine is 5.5 terabytes of combined data. Sorry, not unique data, unique files. Those 5.5 terabytes went down to 3.47 terabytes of compressed data, and then deduplication brought those 3.47 terabytes down to 266 gigabytes. So effectively, on my backup hard drive, only 266 gigabytes are used. But if I were to expand all the backups in time to recover all the files, it would come to a total of 5.7 terabytes of combined data across everything. So the deduplication is really, really effective. And it's very fast: 1.4 million files scanned in four minutes to determine the differences between the previous backup and this one.
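The talk doesn't show the exact commands, but a minimal Borg invocation producing statistics like the ones quoted above could look like this (the repository path and source directories are hypothetical, and Borg must be installed):

```shell
# Create the repository once, with encryption enabled.
borg init --encryption=repokey /backups/laptop

# Create an archive named after the host and timestamp; --stats prints
# the original, compressed and deduplicated sizes discussed above.
borg create --stats --compression lz4 \
    /backups/laptop::'{hostname}-{now}' \
    /home /etc
```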
So we decided to use Borg on the different servers. Like I said, with Borg you can do backups over SSH: the server to be backed up connects to the backup server and basically does its backups
across the SSH connection. The issue we have is that our backup server, which is this server here, is self-hosted in our internal, on-premise infrastructure, because we mostly have self-hosted servers.
The issue is that we have some hosts that are on the internet, like I said: bare metal servers at hosting providers that we need to back up as well. We can connect from the backup server to the external server without any issue, but the external server can't connect back
to the backup server. It can't initiate a connection, because our firewall only allows outgoing connections, not incoming ones. And we don't want to allow incoming connections, for security reasons: we don't want to expose any internal service to the internet directly.
So the issue is that Borg expects the servers that need to be backed up to connect to the server that stores the backups, but here, that connection will be blocked by the firewall. To solve this, we initiate the connection from the backup server itself.
And we do reverse port forwarding using SSH. So this means that the backup server will initiate the connection to the remote server. And then it will, across this initial SSH connection,
it will open a listening port here on the remote server. All data coming into this port will be redirected to the local backup server on any port we want, such as its SSH port. So basically, this allows us to expose the SSH server of the backup server to any external host
without the host having to connect to the backup server. Because we initiate the connection first, we establish a SSH channel between the two servers so the data can flow back across this channel, which is inside the SSH connection that is already established. This also, of course, works for internal hosts.
So we have some overhead, because we use two SSH connections where we didn't need two: the remote server here could connect directly to the backup server, because they're in the same network. But it works for both cases, so it's the most generic approach, and that's the one we implemented.
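As a sketch, the reverse port forwarding described above can be set up with plain OpenSSH (host names and port numbers are hypothetical):

```shell
# On the backup server: open a tunnel to the remote host and expose our
# local SSH daemon (port 22) on the remote side as localhost:2222.
ssh -R 2222:localhost:22 root@remote.example.com

# On the remote host: Borg can now reach the backup server through the
# tunnel, even though the firewall blocks incoming connections.
borg create ssh://borg@localhost:2222/backups/remote::'{now}' /etc
```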
The second issue we have to deal with is that some servers handle sensitive information, and we want all our backups to be encrypted. This means that the encryption key must be stored on the server itself, so it knows which key to use.
The server does the backup across the SSH channel, like I just explained, so the data flows encrypted across the channel and is stored encrypted on the backup server. The issue is what happens if your server dies, say a hard drive dies, because the encryption key was stored on it.
Then you basically have no way to recover your backups, so that's a bit of an issue. We don't have a very good workflow to handle this at the moment. What we do is store a copy of the backup encryption key on the infrastructure team's laptops, so we have a few backups of this key. In case of the server catching fire, we have a backup of the encryption key that we can use directly to recover the data. But the encryption key is not present on the backup server itself
because if someone finds a vulnerability in the backup server and gains access to it, we don't want that person to be able to read the backups, because they contain sensitive information. So we prefer to have the keys stored separately from the encrypted data itself.
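Borg itself supports exporting a repository key so that a copy can be stored away from the data; a sketch with hypothetical repository and file paths:

```shell
# Export the key of an encrypted repository to a file kept off the server.
borg key export ssh://borg@backup.example.com/backups/myhost /secure/myhost.key

# There is also a paper-friendly variant for printing the key:
borg key export --paper ssh://borg@backup.example.com/backups/myhost
```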
And this is also very convenient, because some of the servers we want to back up with this infrastructure, we are not allowed to access the data inside. That's a bit of an issue, because if we had a copy of the encryption key
we would be able to just use it to decrypt the data. So what we do instead is tell the person managing the server how to set up the system on their server, and then this person can store a copy of the key themselves, securely. This means that we, the infrastructure admins, never have access to the decryption key.
But in case of the server catching fire we can provide the person with the encrypted files and they can decrypt the files themselves. So to handle executing the backup process
on all the various servers, we created a small Python 3 script, which basically is in charge, at regular intervals, of connecting to the remote servers. So it triggers the SSH tunnel creation process
and triggers a backup process. This script runs on the backup server itself. It collects all the logs from all the various backups and sends us emails. It sends us emails on error, of course, so if a backup has an issue we are informed directly. But it also sends emails on success,
which is very important, because if you never receive emails on success, you are never sure that the backup process actually executed. The backup server could be down, for example, for whatever reason, and then you would not receive any error message informing you that the backups can't be done, because the server is down. So it's important for us to receive emails when everything is going well, because then we know that everything actually went well. That's the subject line of the emails that are sent to us: it tells us the total number of servers backed up, the number of servers that had issues during the backup, and the total time spent backing up the servers, which is a useful metric if you compare the times between different backups. You can know instantly if, for example, some server is acting weird because it takes much longer than usual to back up. The script also handles storage of all those backup logs, so we can access them whenever we want. And it's really simple: we have full code coverage with unit tests, and it's accessible on GitHub.
So you can use this script if you want to replicate the same setup as us, this concept of backing up remote servers using Borg. Now I'll present... no, sorry. Yeah, so this script runs in a FreeBSD jail,
which is running on the FreeNAS server itself. The recommended way to run code on a FreeNAS server is to create a BSD jail and run the code inside it, so that's what we do. And we automate a few things: the creation and provisioning of the jail, and the provisioning of the servers that need to be backed up
using Ansible. You can find the different Ansible roles and an example playbook that uses those roles on GitHub as well. So if you want to recreate the solution as is, you can very easily do so.
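Running such a playbook is then a one-liner; the inventory and playbook file names below are hypothetical, the real ones are in the Quarkslab repositories:

```shell
# Provision the jail and the hosts that need to be backed up.
ansible-playbook -i inventory.ini backup.yml

# Re-run against a single newly added host only.
ansible-playbook -i inventory.ini backup.yml --limit newhost.example.com
```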
Now I'll present the complete backup process. So basically, you have the backup server here. A cron job runs periodically on the backup server, for instance once a day, but you can use whatever frequency you want.
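As a sketch, the cron entry on the backup server could look like this (the script path, config path and schedule are hypothetical):

```shell
# Run the backup orchestration script every day at 02:00 and append
# its output to a log file.
0 2 * * * /usr/local/bin/backup-script --config /etc/backup/config.yml >> /var/log/backup.log 2>&1
```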
This cron job is in charge of executing the backup script I just talked about. The script reads its configuration file, which basically contains the list of hosts that need to be backed up, how to connect to those hosts, which SSH ports we should use,
and various other information, for example what the subject of the success and error emails should be. So it builds this backup configuration. Then it runs an SSH command: basically, it creates an SSH connection to the host we need to back up.
It does this for each host, sequentially. Of course, we could also parallelize it, but that's a work in progress; it's not done yet. So sequentially, it finds which host needs to be backed up, creates a connection to it, and establishes the reverse
port forwarding using SSH, to be able to connect back, for hosts that are on the internet, for example. So the host that needs to be backed up receives the SSH connection, and the tunnel is established.
In the SSH configuration of the host that needs to be backed up, there are two important things. First, we permit root login, but we only allow commands that are predetermined in advance, so we don't actually expose root access to potential attackers.
Well, we expose it, but they can only run the one command that we defined. Second, we permit user environments, so we can send environment variables over SSH. And you can see that we don't actually specify any command to execute on the server, because it's restricted to executing
only one command specified in advance, and this command is a call to borgmatic, which is a tool that handles Borg configuration. So basically, we tell borgmatic: read this configuration file to know which settings to provide to Borg, which directories to back up, and which server to back up to. If the repository doesn't exist on the backup host already, it creates it, using an encryption key that we store on the host that needs to be backed up (not on the backup server itself). Then it creates a new backup and checks all existing backups for integrity, so we are sure that the backups don't become corrupted over time and that we can actually reassemble all the blocks of all the backups we stored. And then it deletes the backups we don't need anymore, because they have expired, for example. It uses an SSH key to connect to the backup server, which we provisioned in advance using Ansible, though you can also provision it by hand, of course. So basically, the host establishes an SSH connection to the backup server using the mechanism integrated into Borg. The backup server receives the incoming SSH connection established by Borg, and it also restricts it to executing one single command only, which is borg serve.
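A borgmatic configuration along these lines might look like the following sketch (paths, repository URL and retention values are hypothetical, and the exact schema depends on the borgmatic version):

```yaml
location:
    source_directories:
        - /etc
        - /home
    repositories:
        # Backup server reached through the reverse SSH tunnel.
        - ssh://borg@localhost:2222/backups/myhost

storage:
    # Key material stays on this host; a passphrase protects it.
    encryption_passphrase: "change-me"

retention:
    keep_daily: 7
    keep_weekly: 4
    keep_monthly: 6

consistency:
    checks:
        - repository
        - archives
```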
So basically, Borg acts as a remote server, and it handles all the incoming data very efficiently, because it understands what the Borg on the server being backed up is sending. And we work in append-only mode, which means that the remote server can only store new blocks on the backup server.
It can't actually delete any blocks. This is very important, because if someone were to compromise one of our remote hosts and wanted to cause us harm, they could just instruct the host to delete all of its backups, and then we wouldn't have any backups anymore.
By using the append-only mode, the server can't delete any information, so we are always sure to have actual backups of the server on our backup server, even if the remote server is compromised. So we know we can roll back to a state from before it was compromised, if we need to.
And we restrict it to a specific folder, so it can only store backups in that folder. It's the same concern: if someone were to compromise a remote host, they could instruct Borg to erase or overwrite the backups of other servers,
so you don't want this. You want it to be restricted to only one repository, and that's what we do here. And the SSH key used here is the SSH key of this host, which we have registered on our backup server.
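On the backup server side, both restrictions can be expressed in the authorized_keys entry for that host's key; a sketch with hypothetical paths and a truncated placeholder key:

```shell
# /home/borg/.ssh/authorized_keys on the backup server: force borg serve
# in append-only mode, restricted to this host's repository, and disable
# port forwarding and other SSH features for this key.
command="borg serve --append-only --restrict-to-path /backups/myhost",restrict ssh-ed25519 AAAA... root@myhost
```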
Then the connection is established: borg serve is running here, Borg is running here, and they can exchange data, which is encrypted, of course, with the encryption key that is present on the host being backed up. And once the backup is finished, the script sends us an email. So, for example, here the backup went
completely successfully. We don't have any errors. It spent this much time backing up, and we have only successes, no failures or skipped hosts. So, everything I talked about, everything we created (the Ansible roles to set up the backup system, the script that orchestrates all the backups, and the backup process itself) is documented and available on GitHub. It's in the Quarkslab organization, and these are the names of the various repositories we have.
So feel free to ask me any questions. I will also be available outside the room if you have questions that are a bit too long to discuss here. And yeah, feel free to take a look at the scripts if the system seems interesting to you and you want to implement it yourself.
Yes? Have you experienced problems with Borg backup in that it needs read-write access, and that you cannot keep a Borg backup open and write to it at the same time? Okay, can you repeat the question, sorry? Borg doesn't allow access to its own backup by two systems simultaneously, and you also cannot take a Borg backup and access that backup without using very special tricks, OpenZFS and all kinds of other things. It doesn't have read-only access. How did you guys solve that? Borg doesn't have read-only access, you mean? I'm sorry, I don't understand your question. Borg backup seems to need read-write access and doesn't allow read-only access. Oh, that's the append-only mode that we use, basically.
So basically, the Borg server allows new blocks to be stored inside the repository, but it doesn't allow any blocks to be erased. The Borg server has read-write access to the files themselves, but Borg itself only accepts writes of new blocks.
It doesn't allow any blocks that are already present to be erased. Yeah. But if you are restoring a backup... Oh, if you restore a backup? Yeah, you lock up everything. Have you had issues with that? It seems that's not the case, then. No, we never experienced any issues with it, so I don't know, sorry.
Isn't that what you talked about, mounting read-only snapshots? Yeah, so that's handled by ZFS. So basically, I skipped a bit because I didn't have time to present everything, but the data is stored by the Borg server onto ZFS, onto a ZFS data store.
Then we can create snapshots of this data store, which are read-only, but the data store itself, in normal use, remains read-write, and it's only those snapshots that we can mount at a later date to be able to access the backup files as they were at this date.
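Inspecting or returning to such a snapshot is a one-liner with the ZFS tools (dataset and snapshot names are hypothetical; note that a rollback discards everything written after the snapshot):

```shell
# Clone the snapshot elsewhere to inspect it without touching the dataset.
zfs clone tank/backups@2020-02-01 tank/backups-inspect

# Or roll the dataset holding the backups back to yesterday's snapshot.
zfs rollback tank/backups@2020-02-01
```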
This protects us if, for example, there is an issue in our script running in the jail on the backup server: if the script decided to erase all the backup files, we could restore a ZFS snapshot of the data from the previous day, and so recover the data. I don't know if that answers your question. Well, we can talk about it a bit later if you want. Any other questions?
Yeah, so restore is also quite fast. The only thing is that it uses quite a lot of CPU, both for the backups and for the restores, because it has to compute hashes of all the parts of each file to know which ones are different from what was already there, and it also does compression over those parts, things like that. During the restore, it performs the same operations. So it's quite CPU intensive, but you can mount any backup you want using Borg, exposed as a file system, basically a FUSE file system. I didn't notice any slow access to it, as long as you have the right CPU to handle it, basically.
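The FUSE mount mentioned here is built into Borg; a sketch with a hypothetical repository path and archive name:

```shell
# Mount an archive read-only through FUSE and browse it like a normal
# file system, then unmount it when done.
mkdir -p /mnt/restore
borg mount /backups/laptop::laptop-2020-02-01 /mnt/restore
ls /mnt/restore
borg umount /mnt/restore
```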
Time's up. Does your backup server trigger the servers sequentially, or are they backed up in parallel? It's sequential. Okay, sorry, I didn't repeat the question: is the backup process sequential or parallelized?
So basically, it's sequential right now, but nothing prevents it from being parallelized. We provide one configuration file to the script, and the script goes through the configuration file in order. But we also have a mode, added quite recently, where we can instruct the script to only back up one specific host from the configuration file. So you could imagine a case where you have n cron entries, one per server that needs to be backed up, so they can execute at the same time, all using the same configuration file,
but each restricting the backup to one specific host. You would just receive a lot of success or error emails, but that's all. So it can be done. Time's up, so I'll be outside if you have any more questions. Thank you for listening. Thank you.