Netflix and FreeBSD
Transcript: English (auto-generated)
00:12
Okay, my name is Jonathan Looney, I'm a software development manager at Netflix. My team develops and maintains the operating system that runs on the content caches that we use
00:23
to deliver streaming video to Netflix clients around the world. And I'm here to talk to you a little bit about what we do, and then also give you some observations from us using FreeBSD head for the past roughly three years as the basis
00:42
for that operating system that runs on the caches. So let me start off by giving you a little bit of background on OpenConnect. OpenConnect is Netflix's CDN. It's global, so we have boxes spread throughout the world. It is purpose-built for what we're doing, and it's meant to be efficient.
01:04
So this is meant to be a purpose-built, global, efficient CDN to distribute Netflix's content around the world to our streaming video customers. And the OpenConnect CDN at peak delivers more than 100 terabits per second.
01:26
Now that's a very large scale obviously, and that makes it really interesting to work in my group. It also makes it very humbling to work in my group. It makes it interesting because when you're dealing with that much data, that much traffic,
01:41
there are problems of scale that you get to deal with that you might not in other situations. When you're dealing with massive amounts of data, little 1% problems suddenly become very big. Because 1% of a lot is still a lot. But it's also humbling to work with this scale because what we do makes a big impact.
02:06
A few months ago, when I was thinking about the scale of this, I realized that if we made an error which increased TCP retransmissions by 1% of our total traffic, so if the retransmission rate went from 1% to 2%, let's say,
02:21
that's an extra terabit per second of traffic at peak that we're pumping out onto the internet. And that's not an insignificant amount of traffic. And so what we do potentially has very big impacts if we don't get it right. And so it's very important that we take our job seriously and try to get the software right on the OCAs.
02:50
The workhorse of the CDN is what we call the Open Connect Appliance. These are the actual content caches that distribute the streaming video files.
03:03
These run almost exclusively open source software. We try to, as much as possible, use commodity parts. For a long time, the only custom component we had, I think, was the sheet metal enclosure for our boxes, that custom red sheet metal that you see there.
03:25
We since have a few more customized things that we're sticking in there, but it's not custom silicon. It's more along the lines of custom parts to cram more into the same space kind of thing. For the most part, it's commodity hardware.
03:42
And our goal is to produce a very cost-efficient platform. And so we spend a lot of time thinking about just how much we need in these boxes, and how we can manage to produce a cheaper box that will do even more for us.
04:00
You see here one of our appliances. We have a number of different appliances which have different mixes of hardware, different form factors, different network attachments. So we have 10 gigabit, 40 gigabit, and 100 gigabit network attachments. We also have boxes that have hard drives, some that are exclusively solid state media.
04:21
Some are 1RUs, some are 2RUs. So we have different mixes of form factors and storage capabilities and so forth. This one here is a 40 gigabit per second storage appliance with 248 terabytes of storage and a 2RU form factor. When we say 40 gigabits per second, our goal is usually to have it able to do a steady-state peak
04:48
of 90% of the attached bandwidth. So 40 gigabit per second box, we would want it to be able to do 36 gigabits per second. Line rate, 100% TLS encrypted sessions.
05:00
And the reason that we target 90% of line rate is because of some of the control loops that are involved in managing our CDN. When you get above about 90%, it becomes hard to make all the decisions work right to keep it right around your target. The higher you try to go, the harder that is to make it work.
05:23
We are experimenting with making that number a little bit higher, but at the moment we are still targeting 90% of line rate. So for a 40 gigabit per second, 40 gigabit attached box, we would target 36 gigabits per second as our steady-state peak output.
05:43
Our CDN does a number of things, but the biggest thing we do is distribute streaming video, and so the biggest chunk of our work is distributing streaming video. And you'll see here our typical workload is to serve streaming video to a whole bunch of Netflix applications.
06:01
And so we have Netflix applications around the world that contact the CDN to get data. We have the benefit of having Netflix-controlled client applications hitting our CDN. And so we can do a couple of things that are very useful there. One is that we can make the clients smart about how to interact with the CDN.
06:21
We can say, here's three or four choices where you can get this video, and we can make the client applications be smart about figuring out where they should get the video. So if we have a bug on the OCA, for example, and the OCA stops serving content,
06:41
the client can seamlessly decide that they're going to fail over to one that will give them the content they want. Or if there's just a blip in network connectivity between them and the server that they're contacting, again, they can make the decision to switch over to something better for them to get the data. A lot of times they can do this without the user ever really seeing an impact.
07:04
So they're downloading data enough seconds ahead of when the user's going to watch it that all this can happen in the background without the user even noticing that there's been a problem. The other thing we can do, because we control the applications that are contacting us,
07:21
we can have them give us feedback on how the CDN is performing. And so they can say, hey, I downloaded two minutes of this video from this particular OCA, and there was really high latency from that one OCA. Or the latency was fine, bandwidth was fine, but it took you 100 milliseconds longer than normal
07:44
to start sending me data. And we can track all those metrics. We have very detailed, rich metrics from our clients about how the OCAs are performing. And we use this in our software development because at some point in our software development cycle, we do tests with real clients.
08:02
And we will let real clients, one of those several choices they can stream from will be a box running our candidate release for the next software revision. And we'll let them tell us how their experience was. And they can say, hey, it was a lot better, or it was worse, or some things were better, some things were worse.
08:20
And we can track that and decide how we've changed the experience for our clients. And we can use that to either find bugs or verify that we've actually made things better the way we thought we would make them better. And it's very helpful for us because we can use these very detailed metrics
08:41
to understand in great detail how we've impacted clients. This is the typical workflow that we're dealing with inside of our appliances. In some ways, it's a very simple workflow. We get data off the disk, we bring it into memory, we encrypt it for TLS purposes,
09:03
and then we send it back out on the wire to the customer involved. The plain text data box in the diagram, in case you're concerned, is itself usually DRM-encrypted video. And so everything is usually encrypted even in our memory, because we deal with DRM-encrypted files.
09:22
But we do further encrypt those in a TLS session, and we send them to many of our customers, depending on what the client's software can support. And so we do that encryption on the OCA.
09:40
In some ways, this is not complicated. This is not a novel workflow. This is what a lot of web servers do. And so it's not hard to do this. It's also not hard to do this at high bandwidths if you throw enough money at the problem. The trick is to make this work efficiently at as low a cost as possible.
10:05
And one of the big ways that we've been able to do that is by avoiding a lot of kernel to user space copies. And so data stays in the kernel as much as possible in our environment. We have something called async sendfile, which actually is available to anyone using FreeBSD.
10:22
The web server tells the kernel, hey, send this chunk of this file out this socket. And the kernel will return to user space so that the web server can go on to other things while the kernel sends the file out to the users in the background.
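A minimal sketch of that pattern, using FreeBSD's standard sendfile(2) system call; the helper function is illustrative and error handling is elided:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Ask the kernel to send `len` bytes of an open file out a connected
 * socket.  On FreeBSD the kernel pulls the file data into its own
 * buffers and streams it out in the background; the data never
 * crosses into user space, and the caller can go service other
 * connections while the send completes.
 */
static int
send_chunk(int filefd, int sock, off_t offset, size_t len)
{
	off_t sbytes = 0;

	return (sendfile(filefd, sock, offset, len, NULL, &sbytes, 0));
}
```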
10:41
The other thing that we have internally, which we're in the process of upstreaming, is kernel TLS support. With kernel TLS, the user space negotiates the keys with the remote side, then gives the session key to the kernel, and asks the kernel to do the encryption on the data for it, so that all that can happen in the kernel without the data having to go back to user space.
11:05
And the benefit of that is when you combine kernel TLS with async sendfile, you can do TLS encrypted bulk sends of data in the kernel without the data ever having to be copied to user space.
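For a sense of what that interface looks like, here is a hedged sketch of enabling kernel TLS transmit offload on a socket, using the TCP_TXTLS_ENABLE socket option and struct tls_enable roughly as they later landed in FreeBSD; at the time of this talk the interface was still being upstreamed, so treat the details as approximate and the helper function as hypothetical:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/ktls.h>
#include <crypto/cryptodev.h>
#include <stdint.h>
#include <string.h>

/*
 * After user space finishes the TLS handshake, hand the negotiated
 * session key to the kernel.  Subsequent writes and sendfile(2) calls
 * on this socket are framed and encrypted as TLS records in the
 * kernel, so bulk data never has to return to user space.
 */
static int
enable_ktls_tx(int sock, const uint8_t *key, int keylen,
    const uint8_t *iv, int ivlen)
{
	struct tls_enable en;

	memset(&en, 0, sizeof(en));
	en.cipher_algorithm = CRYPTO_AES_NIST_GCM_16;	/* AES-GCM */
	en.cipher_key = key;
	en.cipher_key_len = keylen;
	en.iv = iv;
	en.iv_len = ivlen;
	en.tls_vmajor = TLS_MAJOR_VER_ONE;
	en.tls_vminor = TLS_MINOR_VER_TWO;	/* TLS 1.2 */

	return (setsockopt(sock, IPPROTO_TCP, TCP_TXTLS_ENABLE,
	    &en, sizeof(en)));
}
```

Combined with the sendfile sketch above, that is the zero-copy encrypted send path being described.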
11:23
And so you avoid having to copy things back and forth in memory. And so this lets us be very efficient on our memory bandwidth usage. Our key constraints, as you can imagine from looking at this diagram, are PCI capacity and memory bandwidth.
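As a rough back-of-the-envelope illustration (these numbers are mine, not from the talk): if every payload byte is DMAed from NVMe into RAM, read by the CPU for encryption, written back as ciphertext, and then DMAed out by the NIC, each byte crosses the memory bus about four times. Serving 90 gigabits per second is roughly 11.25 gigabytes per second of payload, so on the order of 4 x 11.25 = 45 gigabytes per second of memory traffic, before any overheads. That is why shaving off even one pass matters.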
11:43
And so we try to get as much as we can out of the memory controllers that we have, the memory bandwidths that we have in each controller, and the PCI links that we have. And the trick is to buy the right amount of those things on your CPU,
12:02
as well as buying the right amount of processing power so that you're getting just the right size of CPU and memory bandwidth and PCI links that you need. And we keep trying to drive efficiencies into the system so that we can make even more efficient use of memory bandwidth
12:21
and the other internal resources we have, because that directly saves us money by letting us buy cheaper parts or less robust parts, or doing more with the parts that we already have. Using FreeBSD and commodity parts, we achieved 90 gigabits per second,
12:44
serving 100% TLS encrypted connections with about 55% CPU use on a 16-core 2.6 gigahertz CPU. That was measured with actual clients, so this is a real-life test. We stuck this in our network, and real clients were downloading real videos from it.
13:04
So there's nothing contrived about it, except that we made sure that all the sessions were TLS sessions. So any clients that couldn't do TLS, we didn't let them come to this box, because we wanted to see what we'd get in the worst-case scenario.
13:20
So that's the one contrived piece of this. Otherwise, it's a realistic scenario. And in this case, it's an Intel 6122 CPU, 1RU format box, 100 gigabits per second attachment, 96 gigabytes of RAM, and 16 terabytes of NVMe.
13:41
And you can see here a graph of bandwidth and CPU. You'll see the bandwidth is hovering right at 90. That's on the top, and the CPU is right around 55, and that's on the bottom of the graph. We're also testing, at the moment, an AMD chip. I think its code name is Naples. I should also mention
14:01
that was a single-socket system that did all that. With Naples, we're testing another single-socket CPU, and I think we're able to get about double the performance out of the single-socket CPU. It's an interesting challenge, because if you know their design,
14:21
the Naples chip is basically NUMA on a chip. So it's a single-socket design, but there's different memory domains, and so all of a sudden, you have to deal with all the NUMA problems that we've avoided up until now by doing single-socket. And so it's been an interesting project to make that work,
14:43
but I think we've got to about double the performance on the single-socket chip. Sorry, what's that? 175. 175, so not quite double, almost double. Okay, so we are also testing other solutions
15:03
that are multi-socket that would get us there as well. Does it suck that the NUMA issues are going away after all the work you've put in? I mean, that's a good thing overall, but there's public announcements about the chiplets. I think that, in general, we view investment in NUMA
15:24
as a good thing, because we learn interesting things. So in the course of doing the NUMA work, we realized things about the way the operating system works and the way we were using memory resources that we hadn't realized before, and so it lets us make things better. So even if we never end up shipping a system
15:41
that requires NUMA, it still is worth doing the work. Oh, we appreciate it a lot. It still is worth doing the work, because you get to learn some interesting things that make other things better. And so it's nice to do work like that, where there's a potential long-term benefit,
16:03
but you also get short-term gains out of it. Speaking of NUMA, before the NUMA work, the AMD system was getting 85. So we took the AMD system from 85 to 175. So I think that's where you're remembering the doubling from. Okay. Yeah, so for
16:20
those that didn't hear, Drew said that with the AMD Naples chip, with none of the NUMA enhancements, we were at 85 gigabits per second, and with some work that Drew did internally, we got up to 175 gigabits per second.
16:42
How much more expensive would the AMD system be? I don't know; I'm not the right person to ask, so I'm sorry. I can't help with that. One more question, and then I'm going to go on. I'm just curious. So if your goal was about 90 for the throughput here, did you have a CPU utilization goal?
17:02
I mean, it seems like 60. Why not try to push 80 and go with a smaller CPU? So there's a couple of reasons. The question was why we would not try to pare down the CPU and more fully utilize the CPU with the same bandwidth.
17:23
And there's a couple of reasons, I think, at least two of them. First of all, we want some headroom in case the unexpected happens. So there are cases when you can have a sudden inrush of clients, and it's helpful to have extra CPU headroom, or you can have something else unexpected happen,
17:41
and it's helpful to have some CPU headroom. The other reason is that CPU growth is not necessarily linear, and so you reach a point where the CPU growth may become exponential instead of linear, and so therefore 55 doesn't mean that you can,
18:00
well, 50% would not mean you could double your bandwidth and get away with it, because you reach points where the CPU growth can become exponential instead of linear. So let's move on to talk about the operating system for a minute.
18:21
The operating system is a lightly customized version of FreeBSD Head, and so one of the questions that I get asked regularly is why do we use FreeBSD? And my answer is that we came for the license, but we stayed for the efficiency. So I wasn't here when we chose this operating system,
18:43
but I've been told by the people who were that we originally chose it because of the license, the permissive BSD license, and so that's the reason that we started off using FreeBSD, but we stayed because of the efficiency. And I originally had performance here, but I decided that didn't really capture the totality of things, because it's not just about performance.
19:02
It is about performance. We're happy with the hardware efficiency that we get, in terms of bits per dollar spent, basically. We're happy with that. But it's about more than just that kind of efficiency. There's also efficiency of development as well,
19:20
and the FreeBSD community has been a great community to work with. It's a very collaborative community. We have lots of good relationships with other developers in the community. We can share code back and forth. We found it to be a very collaborative, welcoming community.
19:41
It's also helpful that it's a complete OS, which makes it somewhat easier to work with, because when you're making changes that require both user space and kernel components, you make them in one repository. You can deal with it in one repository instead of having to combine a bunch of bits together to make an operating system.
20:01
It also has a reasonably compact code base. I know that it's a massive amount of code, but it also is not as bad as I think I've seen elsewhere. And the ports system works well for us for what we do,
20:23
but I will say that I think that it probably does not work quite as well for the general user community, and I think that that's an area where we probably need to, at least if you're talking about desktops and laptops and things like that, it's probably an area we can improve on some,
20:40
but I think we know that, and that's why we have groups that are working on improving the packaging system for FreeBSD. But on the whole, it's a reasonably efficient environment within which we can work and do our work. And so it's both hardware efficiency, or efficiency of performance,
21:01
but also efficiency of development. For those of you that aren't FreeBSD developers, I'll briefly describe the FreeBSD release cycle. Give me 30 seconds. So FreeBSD has a head branch, which is along the bottom. This is the development branch. All code goes in there first.
21:22
Then there's these release branches that come off of that, and so every roughly two years now, the FreeBSD community branches off a stable branch, which is going to form the basis for a major release. And so the stable 11 was the one about three years ago or so that branched off.
21:41
And then off of that one, they also have particular point releases, 11.1, 11.2, and so forth. About nine months ago, they branched off stable 12, and off of that one, we have the 12.0 release, 12.1, and so forth. All code gets committed, almost all code.
22:01
There's a few exceptions. Almost all code gets committed to the head branch and then gets backported to one or more of the stable branches. And that's how you get code into those stable branches that may eventually form part of one of the minor releases. When I first, well, before I started actually,
22:23
so it was probably three years ago, maybe a little more than three years ago, we were running stable 10, and we were tracking the stable branch. And what happens when you're on one of these branches is that you get some of the code going into head, but not all of it. And so each month, when we would do one of our merges
22:42
from the stable branch, we would get some code changes, but we wouldn't get everything going into head. And so after two years on the stable branch, you find that there's an awful lot of changes that you now have to incorporate when you want to jump to the next branch. So just about two years ago, when they branched off stable 11, we had a choice. Which way were we going to go?
23:01
So we wanted to update ourselves from stable 10 to about right here. And at that point, we could choose to go stay on stable 11, or we could choose to stay on head. It was about as much work to get there because we were trying to get ourselves to about this point and once we're there, we could choose which way we wanted to go. And we made the decision to try tracking head.
23:23
And I think there was some decent bit of fear and trepidation about what we were going to find, but overall, the results have been good. And in fact, when we had to make the decision what to do with stable 12,
23:41
there was a fairly short conversation and we decided to keep tracking head. And so there's a lot of benefits that we get out of doing that, and I'll talk about some of them in a few minutes, but we think that there are benefits to us tracking head. And so we've been doing that now for about three years.
24:04
The way we get our code from FreeBSD head is that about every five weeks, we pull down the latest code from FreeBSD head into our master branch, and then we will pull off one of our internal releases from our master branch. So about every five weeks,
24:21
we're syncing code from FreeBSD head into our master branch and then it's getting deployed to our caches. What that means is that a commit to the upstream FreeBSD development branch is usually fully deployed across our CDN sometime between five and 15 weeks, let's say,
24:42
from when it's committed to FreeBSD head. So what that means is that all of you that are FreeBSD committers, your code is being run in production around the world serving 100 terabytes per second of traffic at peak sometime between five and 15 weeks after you commit it.
25:03
And the way we're able to do that is that the code people commit to head is really, really, really good. And so thank you to all of you that are FreeBSD developers and help keep FreeBSD head healthy and high quality.
25:21
And, you know, in case you think your code is not going anywhere until two years later when the release comes out, it is actually going places, and people are consuming it and grateful for what you're doing. So thank you. How often do you hit a major bump in the road?
25:43
Like, I would assume ino64, where you have to say, whoa, whoa, we can't do it in five weeks, you know, let's back off, and stuff. I guess the question was how often do we hit something which makes us miss our five-week cadence. I'm going to guess the answer is about twice a year.
26:01
And sometimes it's internal stuff. Sometimes it is things that are disruptive upstream. ino64 actually wasn't that bad for us for various reasons. I think Warner spent a very short period of time fixing that one. There's been other ones that have been a little more disruptive.
26:21
But the most disruptive was the move to the new OpenSSL, that one. And that's sort of because of our own internal technical debt, because we need the TLS hooks in the OpenSSL library, and we have a lot of that in there locally. That one cost us quite a lot.
26:42
Yeah, there was also one, the epoch change was disruptive for us. But that one was half epoch and half us, because it hit in the middle of the summer. And so we didn't have as many people around to troubleshoot that one. And we had a feeling that that was
27:00
going to take additional testing and so forth. So we purposely punted on that one until we had more people around. So that was another time when we missed our five-week cycle. But I'm going to say it's probably about twice a year that we miss that. Alistair, were you going to add something? I was just going to say [inaudible].
27:21
OK. One more question. Clarifying on this slide itself, it seems like it's intentional that you bring in the merge from head down to the master after you've made a release branch. Yes. I'll get to that on the next slide, I think. The perfect transition. So this is what our typical release cycle looks like.
27:42
It's five weeks of development, then five weeks of testing and deployment, basically. And of course, real life happens. And so sometimes these aren't exactly this way. But this is about what we shoot for, about what the typical process looks like.
28:00
Early in that cycle, we try to do the merge from head. That way, we have as much time as possible to test it and find any sorts of bugs that have crept in. So we try to do that pretty early in the cycle. If it gets too late in the cycle and for whatever reason the merge hasn't hit, we may just punt it to the next cycle
28:23
because, since we're taking relatively fresh code, we do want to test it. And if we don't think we have enough time to test it thoroughly, we will delay until we can do that.
28:40
We typically can get a merge in every five weeks pretty early in the release cycle. Throughout that time, we're adding features. And we are integrating the code from the various features with each other as well as many upstream changes that have occurred. And we're also doing testing throughout that time. We have some regression tests we run.
29:02
And we have some performance tests that we run. We run them during that time frame to try to catch problems as quickly as we can after they hit. And then after the five weeks, our release engineer, very much on time, branches off the next release branch.
29:23
And we start over with developing the next release. But the one we just finished goes into testing and deployment. So these five weeks are overlapping. We usually have two releases in flight, one that's in development and one that's in testing and deployment. The first week or so of the testing and deployment phase
29:43
is reserved for our development team to do testing. It's at this point where we will send, I think, something like 1% of all Netflix clients to a box that's running the development code, so the release candidate code.
30:01
And we'll let them give us those statistics about how things are going. And we'll, of course, also monitor the boxes to see if there's anything unusual that we see on them. After that, we turn it over to our operations team, and they deploy it on even more boxes.
30:21
And they will test it again to make sure that they don't see anything abnormal. And then we do a phased rollout, which takes roughly two weeks or so. There's all sorts of variables that get thrown in there, like holidays, for example. We have a holiday coming up, which is falling right in the middle of one of our deployment cycles, and so it may stretch it out more than two weeks.
30:45
But typically, it takes about two weeks to do that. Yes? What time of day do you do a rollout? Does it matter? Well, we would try to do it during the down cycles. So we try to do it in the off-peak times,
31:02
when the boxes are not utilized. But because we're all over the world, that hits at all different times of day. So all throughout a 24-hour period we have boxes that are downloading the new code, installing it, and we're good. I mean, your off-peaks are very different
31:20
than many people's, right? Sure. I mean, our... So years ago, there was someone... I live in a place where it snows a lot, and if you're familiar with how cable modems work, there's shared bandwidth that people are sharing. And, you know, several years ago, when the bandwidths were not what they are now,
31:43
there was someone that pointed out to me that on the snow days when people were stuck at home, you would always get worse performance from your cable modem because everyone was at home. And I don't think we've done any analysis like that with Netflix per se, but obviously our peaks are when people are at home, right?
32:02
And it's typically when they're at home in the area where they are. And so the peak for the U.S. East Coast is going to be when people in the U.S. East Coast are at home, for example. We used to have to do those analyses, because I worked at an ISP where, you know, you really have to plan
32:22
your maintenance windows. Sorry. One more question. Yeah, sorry. Just back to the hardware, you mentioned the memory and the NVMe sizes. What were they? I didn't catch them. For that one sample 100 gig box I was giving you, I think that it was 16 terabytes of NVMe and 96 gigabytes of RAM, I think.
32:45
But we have all sorts of different configurations to handle different kinds of content, different kinds of traffic. So I mentioned that we do feature development. You might wonder what kinds of features we're developing, because we're getting
33:01
a fully functioning operating system from the upstream FreeBSD community. So what do we add to it? So these are some examples of the kinds of things that we've added to the operating system. We did some NUMA enhancements, as Drew pointed out. Drew did a lot of preliminary work
33:20
on some of that to figure out where the bottlenecks were and come up with some solutions that worked, but had some rough edges, let's say. And then we've worked with Jeff Roberson to smooth off some of those rough edges, let's say,
33:40
and expand that work as well. And I think Mark Johnston has helped with that as well. And, as we mentioned before, it's borne a lot of fruit for true NUMA devices. We've tried it out both with the AMD NUMA-on-chip, as well as with Intel dual-socket systems.
34:06
And it's borne a lot of fruit in both cases in terms of increasing the performance of the system. Async sendfile is a big one. I described that earlier. Kernel TLS is another big one, which I described earlier. Async sendfile is upstream.
34:21
Kernel TLS is not yet, but it's in the process of being upstreamed, and we expect that it will be upstreamed in some form sometime in the next two months or so, give or take. But you'll want to keep watching the mailing list
34:40
to see when that actually materializes. There's all sorts of things that can go wrong between now and then in terms of, again, smoothing off some of the rough edges to make that work. We upstreamed some pbuf allocation enhancements. These are physical buffers for storing data
35:02
that you're getting off of the disk. And they're used heavily by sendfile, and we basically modernized the allocation scheme so that it would not have as much performance contention on multi-core systems. Unmapped mbufs.
35:21
This is something that, again, Drew from my team came up with. It's a way of having mbufs store pointers to physical pages rather than going through the process of mapping those pages. And on top of that, because of the way they're being stored, instead of having 2K per mbuf,
35:41
you end up with the ability to have, I think, 32K per mbuf. So one 32K mbuf instead of sixteen 2K mbufs. So you get a much higher density,
36:01
so you avoid a lot of the long mbuf chains that we used to see before. And each mbuf in a chain takes cache-line hits and requires pulling some memory out of your RAM, and that adds up over time. And that means that you eat into your memory bandwidth,
36:22
and you either have to buy more memory bandwidth in some form, or you sacrifice performance. And so being able to shave off that little bit of memory bandwidth usage, that little bit times many, many, many times, makes a big difference.
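To make the density idea concrete, here is a purely conceptual sketch; this is not the real FreeBSD mbuf API, and every name in it is hypothetical:

```c
#include <stdint.h>

/*
 * Hypothetical illustration of an "unmapped" mbuf: instead of one
 * mbuf per 2K cluster of mapped kernel memory, a single mbuf holds
 * references to several physical pages that are never mapped into
 * the kernel virtual address space.  A 32K payload then needs one
 * mbuf and one chain link to walk instead of sixteen.
 */
struct unmapped_mbuf_sketch {
	struct unmapped_mbuf_sketch *next;	/* chain link */
	uint64_t phys_pages[8];	/* 8 x 4K physical pages = 32K of data */
	uint32_t len;		/* bytes of payload in this mbuf */
};
```

We've contributed some IO scheduling improvements.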
36:43
We've done a lot of work in TCP algorithm research. There are several folks in Netflix that work very hard on TCP research, and they've done a great job at enhancing TCP performance just by tweaking some of the algorithms that we use.
37:01
And we also added TCP logging infrastructure. And the logging in this case helps to debug problems that occur. So it's been very helpful for us in developing our software to be able to get this stream of data out of the kernel about the way that it's processing packets.
37:23
It's been very helpful for us to see how it's processing packets and be able to turn that into a better understanding of what our code is actually doing and not what we think it's doing. So that's very helpful. You'll also see soon, I think in the next couple of months, we should also be upstreaming some TCP statistics improvements
37:44
so that you can get very fine-grained statistics about the TCP sessions themselves. And we're hopeful that will be helpful to the community in the way it's been helpful to us.
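For a sense of the mechanism, FreeBSD already exposes a coarse per-connection snapshot through the long-standing TCP_INFO socket option; a minimal sketch of reading it follows. The fine-grained improvements being described go well beyond this baseline, and the helper function is illustrative:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>

/* Print a few coarse TCP statistics for a connected socket. */
static void
print_tcp_stats(int sock)
{
	struct tcp_info ti;
	socklen_t len = sizeof(ti);

	if (getsockopt(sock, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
		printf("rtt: %u us, cwnd: %u bytes, rexmit: %u pkts\n",
		    ti.tcpi_rtt, ti.tcpi_snd_cwnd, ti.tcpi_snd_rexmitpack);
}
```

The reason that we track head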
38:01
is because we think it lets us stay forward-looking and focused on innovation. I think that downstream users of open-source projects can get stuck in these vicious, what I call the vicious or the virtuous cycle. And in some ways, this is nothing new or novel, but it's an important concept
38:21
that I think underlies and explains why we track head. So the vicious cycle starts with infrequent merges. So you do infrequent merges from the development branch. Now, that can either mean that you're tracking head but only taking merges once every six months
38:42
or a year or something. It can also mean that you're tracking a stable branch, which is not getting frequent merges from head by definition, and then once every two years, you're trying to merge from head when you jump from one stable branch to the next. Those infrequent merges mean that when you do actually do one,
39:02
when you do make the big jump, you end up with a lot of conflicts and a lot of regressions. It can take weeks just to make the code compile. If you've ever done one of these massive moves, it can take weeks just to make the code compile. And then when it compiles, it doesn't work, and there's so many bugs that are overlapping each other
39:21
that it can be really hard to just figure out why things are broken and even just find all of the bugs. And what happens is you end up with more and more people trying to fix that and make that work, and so then you slow down feature development because people can't develop features while they're also helping to stabilize the branch
39:41
that's getting the sync from upstream. So what this means is you have a choice. You can either spend more and more time managing these merges and do less and less feature development, or you can do less and less merges so that you spend less time on them, you hope, but each one then becomes more painful.
40:02
And that eventually degrades to: we're just not taking any more syncs from upstream except for security fixes, because it's too painful. And you end up with projects that get stuck many branches behind, because it's so painful to keep up.
40:20
So then you have what I call the virtuous cycle, which starts with frequent merges. So frequent merges means that every time you're doing one, you get fewer conflicts, fewer regressions, which means it goes much faster, and you can keep staying forward-looking. So instead of fixing the code that you've already written,
40:40
you can look forward and keep writing new code. Because you're running something close to head, you can easily collaborate with other people in the community that are running head. When it's time to upstream your code, it's easy to upstream it because you're already running head,
41:00
so it's easy to take your patch and upstream it. And that makes it easier to keep doing more frequent merges. And so you end up with this cycle where it's real easy to do your work because you're doing frequent merges,
41:22
and there's an incentive to keep doing the frequent merges to keep making your work easy and keep staying close to head. So all of this makes it easier for us to perform frequent merges and iterate quickly on our own development. Now there are a couple of reasons why we do keep local diffs.
41:42
So one of them is that we have information covered under NDA. I think there's like, I want to say like, probably 20 lines of code in the kernel that are covered by an NDA. So some vendor said, here's some information that you can use to get the statistics you want
42:01
or to flip this feature that you want, but you can't share it with anyone else. And so we can't upstream that because we're not allowed to. But I really think it's something like 20 or 50 lines of code. It's an extremely minimal amount.
42:21
We also have features which are still in development and testing. We try very hard to only upstream things that work really well. And so we have features that are in development or in testing, and until we have smoothed off those rough edges to verify that they work correctly,
42:40
we'll hold back on upstreaming them. We frequently are willing to share them with other people that want them, but we usually try not to upstream them so we're sure that they're going to work well. How do you keep track of, I know you said it's limited, but how do you keep track of the information on the NDA, just local comments? Yes, we have local comments in that,
43:02
and I think in the one case that I'm aware of, we actually ended up separating it off in a separate file, which we include, and the separate file is well marked what it is and why it's separate and so forth. There's features that need to be generalized. So sometimes we develop a feature.
43:21
It works really well for us, and it works really well in our workload, on our hardware, but we're not sure how it's going to work for other people's workloads. It doesn't handle anything except for x86-64 and so forth, and we want to generalize that before we upstream it, so that's another reason why we'll keep local diffs.
43:42
But our intention is to upstream as much code as we can. In fact, some of our development just happens upstream already. My developers will just commit the code upstream and pull it back as part of the next merge, because that's the easiest way for them to do that. So some observations from running head.
44:01
So first of all, it makes it really easy for us to collaborate with others. One of the problems you have when you're running something other than head is if you want to collaborate with someone else and send them a diff, you send them a diff from your local tree, and they say, sorry, it doesn't apply to head; that doesn't help me here. And so then you have to take your diff, and you have to recreate it on head,
44:21
send it to them, then they send you back their diff based on your head, and you've got to figure out how to get that somehow into your tree, which you're actually working on, and you end up with a lot of translations between head, from the tree you're working on through head to get to the other person. And that's just not efficient.
44:42
So it's easier to collaborate with others when we're running head. We also get faster bug fixes and features, so there's features that someone else will do, and if we do what we want, they commit them to head, we can easily get them because we're running head. If we were running stable 10 or stable 11, we would have to either wait for them to MFC the code,
45:01
or we would have to MFC the code ourselves, or if it's not something that can be MFCed, we'd have to figure out how to get it ourselves. And so it just makes it harder to benefit from the rest of the community when you're not running head. It's easier for us to upstream code. And not only that, it's better,
45:20
because what we upstream is the same code we're running internally. If we were running a stable branch, to upstream something we would have to port the code to head and then upstream it. But that would not be the same code we're actually using internally. By running head internally, we can upstream the exact same code that we're using
45:43
and let the rest of the community benefit from that without us having to go through the hoops of making the code compile and run and be tested on head. Another observation is that when tracking head, upstream code freezes are more disruptive than helpful for us.
46:05
I know that there's people that consume releases that probably think it's more helpful to have the code freeze and to stabilize the code. For us, it's more disruptive, because what happens is you can't upstream your changes, and so you end up accumulating local patches,
46:22
a backlog of local patches that have to be committed, and you can end up with just an accumulating pile of differences that you don't want to have, that you can't reconcile until the code freeze is over. KPI changes are easy for us to handle.
46:41
We recompile everything for all of our releases. Because we recompile everything, we don't really care about API or KPI stability. And when the API or KPI changes, because we take monthly syncs, it's really easy to fix the 10 lines of code that call that function that need to be changed.
47:02
So it's very easy for us, because we're taking incremental syncs every month rather than taking a huge sync when we go between stable branches. And ABI or KBI changes are mostly a non-issue. Again, we recompile our code with every release. There's a handful of binaries that some vendor produced
47:21
where we don't recompile them every release, and so we do kind of care about stability for that. But a lot of times, we get that through the compat options anyway. And head quality is so high that the bug flow is manageable. We do get bugs from upstream,
47:42
but because the quality is so high and we're taking such frequent syncs, the number of bugs we get with each sync is low enough that it's relatively easy to spot them. It's even possible to bisect them, because there's a limited number of changes. So if all else fails, we can actually bisect the code.
48:03
And there is usually few enough of them that it's relatively easy to manage. Benefits to the FreeBSD project, they get wide deployment of head code, albeit in a narrow use case. I know that we have a relatively narrow use case,
48:22
getting bits from a disk to memory, encrypting it, getting it out on the wire. And on x86, and up until now, Intel x86 64-bit processors. So it's a very constrained use case.
48:41
But that particular use case gets tested very thoroughly and gets deployed very widely, and I think there's a benefit to the project from that. There's also an incentive for us to upstream our code for trying to minimize our differences, and so there's a benefit. There's an incentive for us to upstream our code.
49:01
There's some objections people have to running the development code. So one is that it isn't stable. That's not been our experience. Our experience has been that it is very stable, as I said, head quality is very high. Why should you pay to find the bugs, others will find while testing head? Well, if everyone has that mindset,
49:20
no one's actually going to find bugs. And on top of that, we don't find tons of very highly disruptive bugs. Aren't there more security bugs? The other nice thing is you find the bugs right away. So the person who committed the bug
49:40
still remembers the context of what they did. If you take a big pile of changes, it'll be two years after the commit was made, and you're relying on someone remembering how it works. Okay, so that's Drew. Drew's reminding me of something I meant to say. The other thing about finding the bugs the way we do things is we find them very quickly. Everyone still has context on the bug that was committed,
50:04
why they committed that code, and on top of that, there's not two years of dependent changes that are built up that rely on the buggy behavior and have to be unwound. So it's much easier to fix bugs closer to when they're committed. For security bugs,
50:21
I think many or most of the security bugs are also in the stable branches. And so there's been very few that we've found that are specific to the head branch. And so I don't think that's really a very large concern of ours. Another objection is no one runs development branches.
50:41
That's not true. We're not the first company to do this, but it also is not necessarily a bad thing to do, given how stable it is and the other benefits that we get out of it. We pay a monthly cost to do merges. That's true, but we would pay the same cost,
51:01
or maybe more, even if we did it all at once every two years. Instead, we get to amortize it over time. You get new bugs each month. That's true. We also get new features. And we get a lot more features than we do bugs, I think. So in summary, running FreeBSD Head
51:21
lets us deliver large amounts of data to our users very efficiently while maintaining a high velocity of feature development. We found that FreeBSD head is stable. It's stable enough that we deploy commits from FreeBSD head into our network probably between five and 15 weeks after they're committed. It lets us efficiently deliver a bunch of traffic
51:40
to our users throughout the world, and it lets us maintain our development agenda. So that's the end of my talk. I think that I can keep answering questions until the next speaker comes and kicks me out, probably. But, yeah, go ahead.
52:08
Somebody made a commit. Maybe it's a little controversial. Do you ever say, let's just do a -r? Yeah. So the question was, when we pull from head, do we ever pull from something that's a couple days or maybe weeks old just so that we avoid something
52:21
that we think might be disruptive? And the answer is yes. In general, we pull the latest revision that someone has run on their laptop or, in some other way, verified passes the smoke test for being buildable and usable.
52:42
But there are some cases where we'll try to avoid something that looks like it's not quite stabilized yet by shifting just a little bit to one side. It's not that common that we'll do that, but we have those cases. How much CI do you run on the FreeBSD side?
53:02
Do you run any of the FreeBSD-based stuff on your build just to make sure everything's still hunky-dory? Yes, we run the full FreeBSD regression suite internally. Minus the tests that don't make sense because of our kernel options and things like that. We also run some of our own tests to make sure our own stuff hasn't broken.
53:21
So kernel TLS, for example, has its own test because it's not upstream yet, and we want to make sure we haven't broken that. Is that integrated or in your own framework? No, I believe it's... It should be integrated. I'm not 100% sure that's done yet, but the goal is to have it integrated so that it will go upstream with the kernel TLS changes that are going in.
53:42
Dan. Question from Twitter about the TCP stats. Is it implemented with SIFTR? No, it is not implemented with SIFTR. Probably the closest thing to SIFTR is the TCP logging that we do.
54:01
And the TCP logging doesn't use SIFTR. I don't think it's exactly a superset, but you could probably think of it conceptually as almost a superset of the SIFTR stuff. Yes. Two questions. One, are you looking at the non-AMD64 architectures
54:22
for servers, and two, [inaudible]? I'll answer the first part while you're thinking. We think about non-AMD64 architectures,
54:41
so we've thought about other ones, and we've done some basic investigation, but we've never decided that any of them looked like they were going to give us a better dollar per megabit value, basically,
55:01
than what we could get through AMD64. And so we continue to, I think, periodically think about those things, and we'll have some discussions internally, but we've never made the business case to go much beyond that for the other architectures.
55:22
And I'm not saying there's anything bad about those architectures. It's just simply, for our use case, looking at the entire price for the server we'd end up with and how much we could pump out of that, it just hasn't quite made sense for us to make that switch. I remember the second one. How do you do deployments?
55:43
Do you use package-based? Do you use FreeBSD update? We use NanoBSD, so we build an entire image, which includes the base OS, the packages we want, some internal packages that we ship out, and we package them all up using NanoBSD
56:03
and deploy them that way. Yes, in the back. So the question is about smart NICs, whether we're looking at using smart NICs.
56:21
Yes, we are looking at using smart NICs. We have, I think currently we have NICs deployed that can do packet pacing, and we've looked at leveraging those features. We also have, we're beginning to evaluate whether or not hardware TLS would make sense,
56:42
and so we've looked at some of that. I don't think we have anything we're ready to explicitly talk about in that area, but we are looking at that to see if the total package is there in terms of the cost per dollar,
57:00
or cost per megabit served. Yes. It sounds like, as a project, FreeBSD is doing something really right. And I know you can't know for sure, but do you have a sense for what that is? Why do you think that is? Yeah, so I actually am a FreeBSD committer,
57:21
so I suppose I do have some skin in the game in a sense. I think that one of the interesting facets of these projects is that they build up cultures, and the culture is something that's hard to change
57:41
once it's defined. It's something that, you know, really ends up infusing everything the project does, even if it's not something that they're actively trying to do. And cultures have a tendency to change relatively slowly,
58:02
even if people are trying to change them. And the culture in FreeBSD has been one of, I think, radical commitment to quality, and so you'll have people who will commit things to head, I've done it, I commit things to head, and I would say it's probably half the time
58:21
I get emails back within 24 hours saying you could have made this more efficient if you'd done this, or did you think about this obscure corner case that you're only gonna hit on these three architectures, or, you know, other sorts of things. And sometimes there are bigger things, like hey, there's an atrociously bad bug
58:42
that you didn't realize you put in here, but it's very common that FreeBSD developers scrutinize everything going into head, and will point out even the smallest deficiencies in an effort to make sure that what's being committed is the highest quality code possible,
59:01
and that's just the culture that the project has developed, and it's something that has remained for a while, and I think that's how we end up with a head that is as stable as it is. Yes? That's a good question.
59:21
It depends how you count. I think I have about eight people working for me, and I think that, all told, there was probably another six or so, six to eight people who are working in this area, and so yeah,
59:41
between 12 and 20, let's say.
01:00:00
Yes, that's true. However, in some cases, it takes a team of lawyers to decide whether or not you are just running it on your own infrastructure, and to decide what you need to do. And so we managed to sidestep a lot of that by using this.
01:00:24
I think that was also around the time that GPL 3 was coming out. And there were some concerns about what that was going to mean for GPL code. So I think that also played into it a little bit.
01:00:41
We do use a few GPL packages. But it's really nice that, for the most part, we do not have to worry about the licensing. And it may be that the licensing would be a non-issue. But we just don't have to think about it, which is really actually kind of helpful. Any other questions?
01:01:03
All right. Two more. Colin, first. You mean one two-part question, right? Right. How much of that is TLS?
01:01:20
100%. It was 100% TLS. So Colin's question is, how much of that CPU usage is the actual TLS encryption? The answer was about 50%, so about half of the CPU was doing encryption. When can I watch Netflix on my FreeBSD laptop?
01:01:48
Question is, when can I watch Netflix on my FreeBSD laptop?
01:02:02
And yeah, I think once you get the Chrome browser working there, in theory, it should work. But if it doesn't work, let me know. And I will pass it on to the browser team that makes sure that our browser-based client works.
01:02:22
If it doesn't work in the Chrome browser, let me know. Any other questions? That was the one more, right? Super. Thank you for your time.