
Coinboot - Cost effective, diskless GPU clusters for blockchain hashing and beyond


Formal Metadata

Title
Coinboot - Cost effective, diskless GPU clusters for blockchain hashing and beyond
Number of Parts
44
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
How to run clusters for GPU-computing-based blockchain hashing diskless on cost-effective commodity hardware. Running the nodes of a cluster diskless is quite common in HPC environments. The challenges of running diskless in the context of blockchain hashing for cryptocurrencies are different. There are constraints such as running reliably on hundreds of machines with commodity 1 Gbit/s network hardware, or a modest RAM size of 4 gigabytes. This talk will provide insights into the technical approaches that made it possible to run GPU clusters for blockchain hashing diskless, and give an outlook on other potential GPU-based use cases beyond blockchain hashing. I will discuss how some early userspace trickery and state-of-the-art RAM compression are used, how to handle the modest RAM size, and how a neat toolset based on container runtimes helps to easily build boot images and plug-in packages. And how plug-in packages serve as an elegant way of adding further software, such as proprietary GPU drivers, to the computing nodes of the clusters.
Transcript: English (auto-generated)
Welcome everybody. I'm Gunter Miedel, I have a background in systems engineering and I have been taking care of IT infrastructure for ten years. And today I want to talk about Coinboot: cost-effective, diskless GPU clusters for blockchain hashing and beyond.
Yeah, let's start. At the beginning of 2017 there was this emerging cryptocurrency boom. So the company I was working for at the time got a customer who ordered 20 overseas containers packed with computers for cryptocurrency mining.
Mining is cryptocurrency lingo for taking part in the generation of new blocks for a blockchain in exchange for a reward. The most popular cryptocurrency, Bitcoin, is mined with special hardware, as you may know.
And other cryptocurrencies, like for instance the Ether of the Ethereum project, are generated mostly with GPUs. The customer wanted the 20 overseas containers for GPU mining. So we crammed nearly 5,000 nodes and 30,000 GPUs into these containers, split into 240 nodes and 1,440 GPUs per container.
And the emphasis was on minimal total cost of ownership to maximize return on investment.
So commodity hardware was the first choice. The customer ordered not only the hardware but also the software stack to run and operate it, and I came up with a solution. But first, let's take a look at the hardware. I have a video for you showing the hardware and the production facility.
This is the production facility. You see the hardware there.
Apart from the six AMD Polaris GPUs, the hardware is on the low end: 4 GB of RAM, no BMC, no IPMI, 1 Gbit/s Ethernet. The containers you see here are air-cooled.
And they have an electrical power consumption of 250 kilowatts each, so a lot of electrical energy is going in there. We have also produced water-cooled containers. Okay, the video is done. Let's see if I get out of full screen without any harm.
Okay, not working. Ignore the bar at the top. Yeah, so this is the hardware we got. Now to the initial approach. The initial approach for the deployment of the software stack was the following:
create a golden image of the OS plus the configuration and additional software, and deploy it during production on a cheap USB flash drive. After the first container was completed, we switched it on and recognized that around 10% of all nodes did not come up properly.
As we found out, there was a race condition between the initialization of the controller of the USB flash drive and the storage initialization of the mainboard firmware. So sometimes it happened that the USB flash drive was not fast enough
to be recognized by the mainboard as a disk. So a workaround was put in place: loading a bootloader over the network, in this case GRUB. GRUB had some logic to determine whether the USB flash drive had been initialized successfully. If that was the case, booting would proceed from this disk drive,
and otherwise the node was shut down properly. We later switched the node on again and hoped that booting would succeed the next time. So the workaround was working.
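A hypothetical sketch of such GRUB logic; the volume label is an invented example, not the configuration that was actually used:

    # Sketch only: boot from the USB flash drive if it came up in time,
    # otherwise shut the node down cleanly for another attempt later.
    if search --no-floppy --set=usbroot --label COINBOOT_USB; then
        set root=$usbroot
        configfile /boot/grub/grub.cfg
    else
        halt
    fi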
We were busy with the production and shipping of all the containers. We had a working setup, but I got an idea for further optimization: can we drop the unreliable USB flash drives altogether? The USB flash drives we use cost just 5€ for 32 GB,
and they have, as you may have recognized, a lot of issues. The idea of getting rid of the disk comes with some pros and cons. One pro is cutting costs by having no USB flash drive,
no workaround for boot failures anymore, and we could also streamline our production and operations procedures, because the slow USB drives really cause trouble in production and operations. The cons are mostly constraints imposed by hardware and software.
We have only 1 Gbit/s networking. The golden image we have created has a size of 4 GB. The RAM is also 4 GB. And the proprietary GPU driver we need for the cryptocurrency mining is only available for Ubuntu, Red Hat and SUSE Linux. The OS preferred by the customer and by my team was Ubuntu.
And the main conflict is between the size of the golden image and the size of the RAM: if you put this golden image into the RAM, you don't have any memory left for running the system. OK, so the main task in going diskless was to get an OS image that is less bulky.
Less bulky because it has to fit into the RAM, and also to reduce the network load. So I set myself a goal: 200 MB or less of volume to be transferred over the network for booting a node.
Besides this, the image should of course be bootable via network and able to run diskless without any further storage devices. Because I'm quite lazy, as most of you probably are, I don't want to reinvent the wheel. Such a tiny distribution supporting network booting and running diskless should already exist.
I knew that in the HPC domain, diskless has been done for ages. So I searched for distributions in this HPC domain, specially tailored HPC distributions. And I found, for instance, Rocks, OpenHPC and xCAT.
None of them seemed to be a good match. Rocks is CentOS-based. OpenHPC is also CentOS-based. And xCAT, TL;DR, is a very complex project, mostly written in Perl; I didn't want to spend my time on this. It seems to work quite well, but I doubted that I could integrate, for instance, the graphics driver there.
And then there are the new kids on the block, like CoreOS and its friendly fork, Flatcar Linux. Both are known as lightweight and able to run diskless. So I looked at both. But both were 100% above the threshold of 200 MB:
they need 400 MB of image size for booting a node. This is why I would call them lightweight, but still too bulky for my use case. Sorry to the Flatcar Linux and CoreOS people, you do a good job, but you cannot help me in my use case.
So I had to slowly convince myself to create something on my own. I proposed my idea to the project leader of the mining container project. He liked the idea and asked me for an estimate of the effort for a proof of concept, to report to his supervisor.
I said: give me four weeks. To make a long story short, the company had no further interest in looking deeper into this idea. But I was stubborn, curious and eager, and did it as a side project in my free time.
While working on this as a side project, the proof of concept took me four months. Most time was spent on getting rid of the unnecessary libraries, modules, documentation, header files and other stuff that you don't need to run an operating system and get the job done. Taking apart the massive proprietary GPU driver was also one endeavor of this whole project,
and I stripped it down to the bare minimum. The initial commit I did in October 2017; the proof of concept was finished in March 2018. Then I got stuck in a corporate roundabout for a while.
And what should I say, in the end I was able to publish it all as open source in August 2018. What did I get? I got lightweight PXE booting with an image size of 145 MB for kernel 4.15 and an image size of 155 MB for kernel 5.0.
I got diskless worker nodes, configuration via environment variables, a plugin system, and a Coinboot server Docker container. I'm using iPXE with remote logging, which is very handy if you want to debug the booting process.
On all these nodes I got support for legacy PXE and UEFI network boot as well. And, most important for moving fast in the development process, I got testing with Travis CI. On Travis CI I test, of course, the Coinboot server Docker container,
and I'm spawning multiple QEMU VMs in the Travis CI instance to see if network booting is working at all. And, of course, daily builds and releases of the images: as new updates and upgrades come to the distribution, which is based on Ubuntu, I'm rebuilding the image every day.
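The remote logging mentioned above is a plain iPXE feature; a hedged sketch of an iPXE boot script with syslog enabled, where the server address and file names are placeholders, not Coinboot's actual script:

    #!ipxe
    # Send iPXE console messages to a remote syslog server for debugging.
    set syslog 192.168.1.2
    dhcp
    kernel http://192.168.1.2/vmlinuz
    initrd http://192.168.1.2/initramfs
    boot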
Let's have a look at the toolset to build the lightweight OS images. A bootloader needs two files to boot a Linux system:
first, a compressed Linux kernel executable called vmlinuz (why it's called vmlinuz is a long story), and an initial root file system called initramfs. So I was looking for a tool to build these, and I found debirf,
which is also tailored towards running diskless, so obviously a good match. debirf is an abbreviation for Debian on an initial RAM file system. debirf is part of the current Debian and Ubuntu releases. You can easily install it, but sadly the last upstream activity on this project was 10 years ago.
And even though it's in the official releases of Ubuntu, it's broken: the images you create with it do not boot properly. There were some systemd-related patches required, which I have done.
debirf is, in the end, just a nice wrapper around debootstrap, with the possibility to run scripts that adapt the root file system you build; debirf calls this a profile. So I created a profile for debirf (a sketch of such a profile module follows below). With it I customized the early userspace process
to use a compressed RAM disk for the root file system, and to have the capability of loading plugins that extend worker nodes at boot time with functionality. So now let's take a look at the customized early userspace process I created.
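A minimal, hypothetical debirf profile module to illustrate the mechanism; the use of DEBIRF_ROOT as the build-root path is an assumption based on debirf's module convention, and the cleanup steps are examples, not Coinboot's actual profile:

    #!/bin/bash
    # Sketch of a debirf profile module. debirf runs executable module
    # scripts like this one to adapt the root file system before it is
    # packed into the initramfs. ($DEBIRF_ROOT is an assumption here.)
    set -e
    # Slim down the image: documentation is not needed at runtime.
    rm -rf "$DEBIRF_ROOT"/usr/share/doc/* "$DEBIRF_ROOT"/usr/share/man/*
    # Drop all locales except English to save further space.
    find "$DEBIRF_ROOT"/usr/share/locale -mindepth 1 -maxdepth 1 \
        ! -name 'en*' -exec rm -rf {} +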
But at first we have to look at the two types of running diskless nodes. Type A is without centralized storage, where the root FS is in the local RAM of the worker node only. And type B is with the root FS on centralized storage, accessed over NFS or iSCSI, for instance.
And of course we go for A, because B does not play well with our commodity one-gigabit network. So the early userspace has to support a root FS in the local RAM.
So I came up with a two-stage early userspace. The first stage is running /init, an initial boot script based on a minimal BusyBox environment. It creates a zram disk with zstd compression for the root file system.
Then it extracts the final root FS archive onto this compressed RAM disk. After that it pivots root to this file system and hands over to init2, the final stage. /init is what you find in all classical Linux systems for booting,
but I extended it and added a second stage, which is called init2. init2 launches and triggers systemd-udevd to get all devices up and running, for instance the network interface card. Then it downloads and extracts the plugins to the root FS on the compressed RAM drive.
Then it downloads the environment file and adds it to /etc/environment. And after that it hands over to systemd by calling /sbin/init to finalize the boot process. In the future there is a plan to directly mount a SquashFS image in the first stage.
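To make the first stage concrete, here is a minimal sketch of such an /init in a BusyBox environment; the disk size, paths and archive name are illustrative assumptions, not Coinboot's actual script:

    #!/bin/sh
    # Stage one: prepare a compressed RAM disk and pivot into it.
    mount -t proc proc /proc
    mount -t sysfs sysfs /sys
    mount -t devtmpfs dev /dev
    # Create a zram disk with zstd compression for the root file system.
    modprobe zram
    echo zstd > /sys/block/zram0/comp_algorithm
    echo 3G > /sys/block/zram0/disksize
    mke2fs /dev/zram0          # format the zram device (BusyBox mke2fs)
    mkdir -p /newroot
    mount /dev/zram0 /newroot
    # Extract the final root FS archive onto the compressed RAM disk.
    tar -xf /rootfs.tar.xz -C /newroot
    # Pivot root and hand over to the second stage.
    exec switch_root /newroot /init2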
Now some fancy graphs. Let's talk about RAM compression with zstd for the root file system. The RAM drive compression of Coinboot is using the zstd compression algorithm.
Thanks to Yann Collet from Facebook and the other contributors for this excellent project. As you may remember, yesterday there was a lightning talk about zstd given by a colleague, and he showed the use of zstd.
It's a really sophisticated compression algorithm. zstd has been in the kernel since 4.14, I guess, so since late November 2017. It's the fastest high-compression algorithm you can currently get.
It can also be used for a compressed RAM drive with zram. With the root file system on zram with zstd compression you get much less shared memory usage and overall more memory available for the system, in comparison to the root FS on tmpfs, which is the classical approach for diskless nodes.
You may notice that the blue bar is completely missing, or very tiny, in the zram section there. This is the shared memory, which with tmpfs is allocated completely for the root file system.
And if you look at the memory available, the green bar, you have considerably more memory available if you use zram with a zstd-compressed RAM disk.
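If you want to check such numbers on a running node, standard tools are enough; an illustrative example (the values you see will of course differ):

    # Show the zram device: compression algorithm, disk size, and how
    # much compressed data it actually holds.
    zramctl /dev/zram0
    # Compare overall memory; the "available" column is what the
    # graphs above contrast between zram and tmpfs setups.
    free -m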
The next topic: booting is tricky these days. Mainboards come with a wide range of default boot options, for instance legacy BIOS PXE, UEFI PXE, UEFI HTTP boot and so on.
Touching the firmware configuration of thousands of mainboards in production was absolutely not feasible. To cope with that, I came up with the following solution: I am using dnsmasq as the DHCP server, and I configured dnsmasq like this.
It sets a tag based on the data provided in the DHCP request of the client, and based on this tag, different bootloaders get delivered with the DHCP acknowledgement.
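A hedged sketch of such a dnsmasq configuration; the client-arch values for BIOS and 64-bit UEFI are standard, but the address range and bootloader file names are examples rather than the shipped Coinboot config:

    # Tag clients by the firmware type announced in the DHCP request
    # (option 93, client-arch)...
    dhcp-range=192.168.1.100,192.168.1.250,12h
    dhcp-match=set:bios,option:client-arch,0
    dhcp-match=set:efi64,option:client-arch,7
    # ...and deliver a matching bootloader with the DHCP acknowledgement.
    dhcp-boot=tag:bios,undionly.kpxe
    dhcp-boot=tag:efi64,ipxe.efi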
The next topic, as I already mentioned, is configuration via environment variables. I think this is well known from working with containers already. You have one central environment file at the Coinboot server where you can tweak the environment variables for your whole cluster,
and these environment variables are then available at each worker node.
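For illustration, such a central environment file could look like the following; all variable names here are invented for the example, not Coinboot's documented ones:

    # Hypothetical central environment file on the Coinboot server.
    # Every variable set here ends up in /etc/environment on each
    # worker node at boot time. Names below are illustrative only.
    MINER_POOL_URL=stratum+tcp://pool.example.org:3333
    MINER_WALLET=0x0000000000000000000000000000000000000000
    GPU_POWER_LIMIT=95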
Plugins: get all you need. Coinboot plugins extend nodes with functionality. Coinboot plugins are just a set of file system changes packed into a compressed archive. At boot time they are downloaded from the Coinboot server by the worker node and extracted onto its root file system. Plugins are created with Coinboot Maker.
And I also created an experimental way to keep the Debian package manager database in a valid state when using Coinboot plugins. Okay, let's talk about Coinboot Maker. Coinboot Maker is used to build Coinboot plugins. For this, Coinboot Maker takes a Coinboot initramfs
and then runs runc to create a container with this initramfs as the baseline. This container has an overlay with a backing file system, and the file system changes are tracked in this backing file system.
So you just install whatever you want for your worker node there. And when you are done with the changes, the files to be part of the plugin get collected and packed as an archive. This archive is then placed on the Coinboot server, and during boot it will be downloaded and extracted by the worker node to be placed on the root file system.
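Conceptually, a plugin is nothing more than a tarball of the overlay's upper directory; a simplified sketch of the idea, with invented paths and names rather than the actual Coinboot Maker implementation:

    # Build side: pack everything that was written into the overlay's
    # upper directory while installing software inside the container.
    tar -czf gpu-driver.tar.gz -C /var/lib/plugin-overlay/upper .
    # Worker node, at boot time: extract the plugin over the root FS.
    tar -xzf gpu-driver.tar.gz -C /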
Yeah, the Coinboot server. Let me give some short hints on how to use the Coinboot server. It's the central component for booting your cluster.
There is a quick-start guide available under this URL. There is a Docker container which brings all the required services, basically dnsmasq for TFTP and DHCP, and nginx for HTTP. It's all pre-configured.
Just use it. So clone the Coinboot repo and then configure the DHCP range so that it reflects the range of IP addresses you want to hand out to your cluster nodes. Then there is a mandatory environment variable for the Coinboot server IP, which should basically be the IP address
under which the Docker host where you spawn the Coinboot server is reachable. The kernel and initramfs of Coinboot are downloaded automatically when you spin up the container.
If you want a different version of the kernel and the initramfs, you can specify this in the environment file. So let's go: just do a docker-compose up and wait a short amount of time until all the services of the Coinboot server are up and running.
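Put together as a shell session, the quick start sketched above looks roughly like this; the repository URL and the variable name are assumptions, so refer to the quick-start guide for the real ones:

    # Hedged quick-start sketch.
    git clone https://github.com/frzb/coinboot.git   # repo URL assumed
    cd coinboot
    # Set the DHCP range and the Coinboot server IP (e.g. a variable
    # like COINBOOT_SERVER) in the environment file, then start:
    docker-compose up -d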
And then switch on your worker nodes, and then the magic happens. Okay, so I'm already at the summary of my talk. Coinboot can run GPU-based blockchain hashing on GPU clusters with a minimal TCO.
By using RAM drive compression with zstd, diskless worker nodes have more usable memory available. And Coinboot is easily extensible and can run various other number-crunching workloads, of course. So what's next? There's an ongoing transition to a monorepo, because Coinboot consists of a lot of sub-projects and it's a complete mess to work with them.
So I'm working on moving to a monorepo, and of course on getting out of the beta status as well. There's TensorFlow for AMD GPUs coming to Coinboot, and support for NVIDIA GPUs.
There's a plan to use P2P plugin loading with a local BitTorrent swarm, using SquashFS and OverlayFS for the root file system. And there's also a plan to give mining hardware a second life, because at some point in time mining is no longer profitable. And Coinboot will probably be part of a platform that makes GPUs accessible to the machine learning and data science community.
At last, some thanks. I want to thank Julie and Elmo, and Steve Schmeller, who convinced me to apply here;
Esther and Barbazi for the photos; Jaime, Camilla, Komesh and Lucas Reif of Cloud & Heat for their support; Cloud & Heat for the video material; and Subrahmanaiah, Umushanka and Yoshi for being the best unicorn whisperers ever. And now you can ask me questions, or you can ask them later. Thank you.