Fighting I/O: a story of Firefox startup speed improvements
Formal Metadata
Number of Parts | 64
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/45923 (DOI)
FOSDEM 2011, 26 / 64
Transcript: English (auto-generated)
00:00
I hope you had a satisfying lunch and are ready for another fine afternoon. Let me introduce my colleague, who is going to talk about fighting I/O and improving cold startup. As you know, performance, and startup in particular, is a big focus for Firefox 4. Take it away.
00:21
Can you hear me? Yes. Yep. Before I start, I would like to know how many people in the audience are not Mozilla developers. Awesome. How many of you are developers? Awesome.
00:41
So I'll be talking to you about I/O, which is an unexpected problem on most systems nowadays. I'll explain why we need to address I/O somehow
01:05
and how it has an impact on software. Then I will show you how to actually see what's happening, hopefully, and what can be done about it.
01:25
20 years ago, I had my first real PC, one with a memory management unit, and it was fast at the time. But, well, PCs nowadays are really, really much faster.
01:42
Processor speed by then was like tens of millions of instructions per second. Now you can count in tens of thousands of millions of instructions per second. Memory capacity has more than,
02:01
yeah, 20 years ago you could count in megabytes. Now you count in gigabytes. Memory bandwidth was maybe one gigabit per second. Now it's more like 200 or 300 or 500 gigabits per second. Hard drive capacity has exploded.
02:22
My hard drive by then was 200 megabytes. It was quite big. Now you see terabytes. And throughput is good as well, because by then you had one megabyte per second. Now you have hundreds.
02:41
Hard drive access time also improved. 20 years ago you had a drive that had an access time of 20 milliseconds. Now with SSDs we have 0.1 milliseconds. But that's SSDs.
03:01
In practice, this is not true. Most people don't have an SSD. So we're still stuck with very, very slow access times, on the order of 5 to 10 milliseconds. So what's the problem? That's the problem.
03:20
So on the horizontal axis you have time. This is Firefox startup on Linux. Vertically you have the disk: it's the offset on the disk. What you can see is that you read stuff, go elsewhere, and go back. You're going back and forth on the disk.
03:44
And with really slow access times, it means that every time you have a vertical bar, it's really slow. And even if you zoom in on the other axis, it's not really good either.
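To make the cost of those vertical bars concrete, here is a minimal sketch of a disk cost model, not the tool used for the talk. It assumes a fixed access time for every non-contiguous read plus a transfer time proportional to size; the 8 ms access time, 30 MB/s throughput, 128 KiB chunk size and the function name `read_time` are illustrative assumptions.

```python
def read_time(offsets, size, seek_s=0.008, throughput=30e6):
    """Estimate the seconds needed to read `size`-byte chunks at `offsets`, in order."""
    total = 0.0
    position = None  # byte offset just past the previous read
    for offset in offsets:
        if offset != position:
            total += seek_s         # non-contiguous read: pay one access time
        total += size / throughput  # time to transfer the chunk itself
        position = offset + size
    return total


CHUNK = 128 * 1024  # hypothetical read size

if __name__ == "__main__":
    in_order = [i * CHUNK for i in range(100)]
    # Same chunks, but visited as evens first, then odds: every read needs a seek.
    scattered = in_order[0::2] + in_order[1::2]
    print("in order :", round(read_time(in_order, CHUNK), 3), "s")
    print("scattered:", round(read_time(scattered, CHUNK), 3), "s")
```

The same bytes are transferred in both cases; only the number of seeks differs, which is the effect the graph shows.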
04:00
And even these or these are hurting a lot. I did a little experiment with the data I gathered just before. Instead of taking the I/O as it was,
04:24
well, I did take both the I/O as it was experienced in reality and I also reordered it to see the difference. And on the slowest disk, which has only 30 megabytes per second of
04:40
throughput, the normal I/O takes 2.7 seconds. With the reordered I/O it takes half the time. On a faster disk, around 85 megabytes per second,
05:01
the reordered I/O is 3 times as fast. So it's really critical to avoid, however possible, any seek on the disk. And another point
05:22
is that we don't really have a problem with warm startup. Anything that is CPU bound is not a problem. Why? Because, you see, that's the Firefox startup. On a Core 2 Duo system
05:40
it takes almost 4 seconds to start, and warm startup is much faster, well under 1 second. And on an i7 system the cold startup is not really much different, but
06:00
the warm startup is twice as fast. But it's not really a big difference, because it's under a second. Is it CPU time used? It's wall clock. It's wall clock. So that's the time it takes to start
06:21
on a Core 2 Duo with a slow disk. So you have to know what the problems with I/O are, and the problem is that at the moment we
06:40
have a lack of tools. There are ways to track some kinds of I/O, but it's really hard to get an actual grasp of what's happening. Linux has some tools that give you some idea
07:01
about what's happening, but it's really cumbersome, and I will show you some of the tools. You can't have widespread knowledge of what's happening, and getting relevant startup times is
07:22
hard. The system I used was a virtual machine, which I rebooted 50 times to get average times within some kind of bounds. This is not something you want to do every day.
07:43
I wrote some automation tools to do that, but it's really cumbersome. Tracking I/O is also not as simple as tracking read and write calls. You will find a lot of scripts on the net doing exactly that, and it's wrong, very wrong.
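One way to see why counting read calls overstates disk activity is a toy page-cache model. This is only a sketch under simple assumptions: 4096-byte pages, a cache that never evicts, and no readahead. The class name `PageCacheModel` is hypothetical and does not correspond to any kernel interface.

```python
PAGE = 4096  # assumed page size in bytes


class PageCacheModel:
    """Counts disk reads for a cache that keeps every page it has seen."""

    def __init__(self):
        self.cached = set()   # (filename, page index) pairs already in memory
        self.disk_reads = 0   # pages that actually had to come from disk
        self.read_calls = 0   # what a naive read()-counting script would report

    def read(self, filename, offset, length):
        """Record a read of `length` (> 0) bytes at `offset` in `filename`."""
        self.read_calls += 1
        first = offset // PAGE
        last = (offset + length - 1) // PAGE
        for page in range(first, last + 1):
            if (filename, page) not in self.cached:
                self.cached.add((filename, page))
                self.disk_reads += 1


if __name__ == "__main__":
    cache = PageCacheModel()
    cache.read("some.file", 0, PAGE)  # open, read, close
    cache.read("some.file", 0, PAGE)  # open again, read again, close
    print(cache.read_calls, "read calls,", cache.disk_reads, "disk read")
```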
08:02
It's really simple, because, for example, if you open a file, read from it, close the file, open the file again, read again and close it again, what do you think will happen? You have one access, not two, only one,
08:23
because the system is quite intelligent: it's caching. You hope it does. Another interesting point I discovered is that
08:40
CPU scaling is actually influenced by I/O. Nowadays the CPUs are not running at their full speed all the time, and what's happening is that, if you go back,
09:02
whenever that kind of stuff happens, the CPU is waiting for the disk, which means the CPU is sleeping; it's at its slowest speed. When you need to go back to full CPU speed,
09:20
there is a latency, because the CPU can't really switch from slower speed to faster speed in an instant. So what happens is that if you somehow find a way to have
09:41
your CPU maxed out during the startup of Firefox, it's faster by 10-20%, which is quite impressive and unexpected.
10:00
So to gather the data I showed you before, the graphs of the disk, I used ftrace, which is a kernel tracing facility in Linux. It has the advantage of not needing anything other than a kernel compiled with
10:21
the right flags, and usually distros come with all you need for that. All you need is to mount debugfs, which might or might not already be mounted depending on the distro, and do some fiddling with it.
10:41
So here we just enable the trace on the disk where everything will be traced. Here we say that we want block tracing. Here we say that we want the block I/O complete
11:02
events; we enable them. And then here we get the trace. The output is not really readable. You have a lot of output, and you don't really know what it means exactly, because there's not much
11:21
documentation about the block I/O tracing facility, so you have to guess. I used what I could; I just took what looked like what's happening there. There are
11:40
some files in the events directories. You have many more events than that. You have a format file in each of these event directories, which is supposed to contain the format used by the output, and it doesn't really match.
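The slide itself is not in the transcript, so the following is only a guess at what the steps described above might look like, written as a small script. It assumes debugfs is mounted at `/sys/kernel/debug`, that the kernel was built with block I/O tracing, and that the "block I/O complete" event meant here is the `block_bio_complete` tracepoint; the device name `sda` and both function names are placeholders.

```python
from pathlib import Path

# Usual debugfs location; on some distros it has to be mounted first.
TRACING = Path("/sys/kernel/debug/tracing")


def block_trace_plan(disk):
    """Return the (file, value) writes that switch on block tracing for `disk`."""
    return [
        # Enable tracing on the disk we are interested in.
        (Path("/sys/block") / disk / "trace" / "enable", "1"),
        # Select the block tracer.
        (TRACING / "current_tracer", "blk"),
        # Enable the block I/O completion events (assumed tracepoint name).
        (TRACING / "events" / "block" / "block_bio_complete" / "enable", "1"),
    ]


def apply_plan(plan):
    """Perform the writes. Needs root; not called by default."""
    for path, value in plan:
        path.write_text(value)


if __name__ == "__main__":
    for path, value in block_trace_plan("sda"):
        print(f"echo {value} > {path}")
    # After apply_plan(...), the events can be read from TRACING / "trace".
```

Printing the plan instead of applying it keeps the sketch safe to run without root; the resulting trace output still has to be interpreted by hand, as noted above.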
12:02
Another tool that I used, and I will show you graphs just after that, is SystemTap. SystemTap is a kernel tracing Swiss Army knife. You can do anything with it. You can insert code in the kernel while it's doing its job. You can do anything with it,
12:22
almost. You can trash the system with it if you want; I wouldn't advise it. The big downside is that you really need to know the kernel internals to actually do something. So the graphs
12:40
I did after that required a lot of digging in the kernel code. And, well, obviously it's hard to get the right data out of it, because it also depends on how the kernel is optimized, because the kernel source is not exactly
13:03
designed for that. There are many calls from static functions to other static functions, and sometimes they are inlined, sometimes not, so you can put probes on some and you can't put probes on some others. It's really a thing.
13:22
The URL I wrote there is a blog post I posted a few months ago, maybe a month ago, about the SystemTap setup I used, and also the script to get this.
13:42
There's more than the SystemTap output, but here is a summary of all the I/Os happening on the libxul file, which is the main file containing most of the code in Firefox, during startup,
14:01
and the stripes correspond to some big sections in the file. The pinkish one is relocations; I'll explain it to you later. That's code,
14:20
that's read-only data, that's read-only data, that's read-only data, read-only data, and this is data.
14:42
Here it's something you have to endure on 64 bits: it's .eh_frame, which is used to unwind exceptions, which is actually not used
15:02
in Firefox, but you have to have that; it's in the ABI. So what's happening? The process starts, then you have some reads here at the beginning and at the end. Why? Well, that's kind of an unfortunate
15:23
state of the binaries: to know what to read in the binary, you have to read at the beginning of it and at the end of it, which is pretty weird, but it's the way it is. It could be changed
15:41
by way of the linker, but at the moment you cannot do anything. Actually, most of what is there you cannot do much about, because maybe it runs around here, sorry.
16:06
So after initializing libxul, here it does nothing in the file. That's because it's doing similar things on other libraries. And then here you see reads here
16:21
and here, and what's happening here is that it's doing relocations. Relocations are necessary when you have randomized address spaces, because the library is not necessarily loaded at the same
16:40
address in the address space. This means that you have to change a bunch of offsets to make it work with the address at which the code is loaded. Fortunately,
17:00
you don't have to do that in the code, because the compiler and the linker do a great job at it. But you read a lot of data and you also update a lot of data, and you do it a bit at a time, so you go back and forth.
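As a rough illustration of what applying relocations involves, the sketch below patches a toy in-memory image the way a base-relative relocation would: each listed slot holds an offset, and the load address is added to it. This is a simplification for illustration, not the dynamic linker's actual code; the function name, the image layout and the load address are made up.

```python
import struct

MASK64 = 0xFFFFFFFFFFFFFFFF


def apply_relative_relocations(image, reloc_offsets, load_base):
    """Add `load_base` to each 64-bit little-endian slot listed in `reloc_offsets`."""
    for offset in reloc_offsets:
        # Read the slot: this touches the page of the image that contains it...
        (value,) = struct.unpack_from("<Q", image, offset)
        # ...and write it back adjusted, which dirties that same page.
        struct.pack_into("<Q", image, offset, (value + load_base) & MASK64)
    return image


if __name__ == "__main__":
    image = bytearray(32)                     # toy "library" with four 8-byte slots
    struct.pack_into("<Q", image, 8, 0x1234)  # slot holding an offset into the library
    struct.pack_into("<Q", image, 24, 0x2000)
    apply_relative_relocations(image, [8, 24], load_base=0x7F0000000000)
    print([hex(v) for v in struct.unpack("<4Q", image)])
```

Every relocated slot is one small read and one small write somewhere in the file's data, which is why a large relocation table translates into many scattered accesses.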
17:22
Here it's forward. That's another thing: that's static initializers. So I think it's the next slide. Yeah, these are static initializers.
17:42
These are only examples, and for each of those the compiler will actually create a function. This function will be in the corresponding object file, and the result is that, well, you know, when you link
18:00
you have a lot of object files, and each of these functions is in each of these little object files. So a static initializer is called for each object file,
18:20
and it's done backwards, because GCC developers decided it was going to be backwards. The reason is that there are constructors and there are destructors, so you have static initializers and
18:41
you have destructors on the other hand, and to be safe with object files they have to be run in reverse order from each other. It was decided, unfortunately, that those going backwards are the static initializers.
19:06
So the main problem with that is that it's really easy to get static initializers without knowing, because who would know from that? For example, you could guess here,
19:20
maybe, but really it's something stupid from the compiler, because this is a constant, and for this it will actually create a function that just sets this value, not a function call, anything, just this value
19:41
to this slot. Just a function for that. Here it actually calls something, so you have to know that it will create a static initializer. icegrind is another
20:02
tool. It's actually two tools: one was developed by Taras Glek and one was developed by myself. I took Taras's one
20:20
and I changed it to do what I wanted it to do, so they are both Valgrind plugins. My version tracks all the bytes, the single bytes, that are accessed during the execution of a process,
20:44
only once, so at the end of the run it will tell you what byte ranges have been touched. The one from Taras, you give it a list of sections, whatever you want.
21:01
What we use it for is taking, for example, the output of ld, which will give you a map of all your object files and functions. We list all the functions, and that way we can know which functions
21:21
are called when, in what order, but only once. So what can be seen with icegrind, with my version, the one tracking byte by byte,
21:41
is that the kernel actually reads a lot of data. For example .text: the text section is the code section. The red bar is the size of the section in the file. The green one
22:01
is the readahead, what the kernel actually reads, which is a lot; most sections are read almost entirely. And the blue bars are what's actually needed. You see, for the code, almost nothing is needed, but you still read all that, which is a waste of time.
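The comparison between section size and bytes actually needed can be sketched as follows. This is not icegrind's code or output format: it assumes the touched byte ranges are already available as (start, end) pairs with an exclusive end, and the section boundaries and numbers in the demo are invented.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent (start, end) byte ranges; `end` is exclusive."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def touched_bytes(section, touched):
    """Count how many bytes of `section` (start, end) fall inside the touched ranges."""
    s_start, s_end = section
    return sum(
        max(0, min(end, s_end) - max(start, s_start))
        for start, end in merge_ranges(touched)
    )


if __name__ == "__main__":
    # Hypothetical section layout and access log.
    sections = {".text": (0x1000, 0x5000), ".data": (0x5000, 0x6000)}
    touched = [(0x1000, 0x1100), (0x10C0, 0x1200), (0x5800, 0x5900)]
    for name, (start, end) in sections.items():
        used = touched_bytes((start, end), touched)
        print(f"{name}: {used} of {end - start} bytes touched")
```

Comparing the touched count with the section size gives the gap between the blue bars and the red bars; the readahead figure would have to come from a kernel-side trace.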
22:27
about:startup is something new that's coming in Firefox 4. The blog post actually has the extension. It's a small extension,
22:41
a quite stupid one actually, only displaying the three values. It's tracking when main is called (it's not exactly main, it's the main function in libxul), when the session is restored, which is when all the tabs
23:01
have been initialized but not necessarily loaded from the net, and when the first paint occurs, whatever it means. We also gather actual data from users through addons.mozilla.org pings,
23:21
those that are sent when you want new add-ons or when it wants to know if you are up to date. And a real extension with graphs and stuff like that is a work in progress. So we have a lot of unexpected enemies.
23:44
The file systems, for example: during the course of all this testing I copied Firefox a lot, and it turned out that files were mixed. For example, the libxul file,
24:01
which is 20-something megabytes: you had a bunch of it, a bunch of another file, another bunch of libxul, another bunch of another file, and so on and so forth. The toolchain doesn't really help, as we saw: the compiler doesn't help here, the static and dynamic linkers don't help,
24:22
and CPU scaling doesn't help either. So what can we do about it? We can, for example, do something about that. We can try to do something about that. We can also try
24:41
to do something about that; that's something the linker should do. And we should definitely try to do something about all these, which are basically, most of the time, due to system libraries.
25:03
So what we have to do is, well, avoid fragmentation. We did something for Firefox 4: the SQLite files for Places, for example, were very, very fragmented,
25:20
and we improved that by allocating in bigger chunks. Reduce the number of files: this was done in Firefox 4. Before, we had a lot of different files in components and chrome; they are all grouped in one file now, omnijar.
25:40
Improve the binary layout: we tried some things about that, reordering the object files for example, which is the easiest way to do that without needing a new toolchain. Reduce the size:
26:02
anything you can do to reduce the size will obviously save some I/O. And avoid going back and forth between files because, well, it's killing. So, for example... it's actually sad. This graph is kind of sad.
26:25
This is the 3.6 start time, and it's actually faster to start than 4. This is 4 beta 8;
26:42
I did my data gathering a long time ago. So, without omnijar: we see that omnijar actually gives a good improvement here, but we are still slower than 3.6. That's important, but
27:02
we also have extension packing now: instead of unpacking all the extensions when we install them, we keep them packed when we can. So these are profiles
27:20
I used with the six extensions with the most users, only those that work on 4.0 as well, because not all of them do. So actually
27:40
version 4 is faster with extensions packed. And something stupid I did is trying to reverse the static initializers, those that go backwards. I just hacked the file so that the pointers go forward,
28:02
and the result is actually surprising. I did not expect that much. I did expect some improvement, but not that much: on the order of 10-20%, just by going forward instead of backwards.
28:22
These are the various changes I tried. Unfortunately they won't make it, except one, into 4.0. So here we have the normal target. Here we reduced
28:44
the static initializers, some of them but not all. And here we reordered the binaries, packed relocations and reduced the static initializers.
29:00
So these are file sizes, the libxul size, so it's quite a good reduction. The result in start time is actually disappointing for fewer static initializers, but actually good
29:23
when you put all of those together. You have two blog posts with more data about that. The reason why fewer static initializers are actually slower is that
29:40
when you stop reading here, some of the stuff here... you should stop reading there. Well, some time here you will probably have to read, actually, because here it's cached from the first start, and here you won't see anything for these offsets
30:01
because obviously it's already in the cache. But if you read less here, you have to read them later, which kills sometimes. I'll skip this one. So what's next?
30:20
We'll also try to avoid fsync, which is also killing, because basically it's telling the system to flush anything in the cache that is not yet written. We'll also try to separate hot and cold functions. What will be tried is to actually separate them
30:41
into two libraries: one for the hot functions, those that are actually used at startup, and one for those that are not. Removing dead code, because we have some dead code, and it's taking space and it might actually be read by the kernel
31:01
by mistake. And preloading: this is a small experiment I did. I just preloaded all the library files from the Firefox directory. I just did a cat
31:23
on the whole files, and it's actually faster to start. And the faster times include the amount of time it took to cat them to /dev/null? Yes, they do. So this is a three-line change to our startup script? Exactly. And the improvement is what percent?
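The preloading experiment amounts to reading each library front to back before the program needs it, so the data is already in the page cache when startup begins. The talk did this with `cat` in the startup script; the sketch below does the equivalent in Python. The `*.so` pattern, the chunk size and the function name are assumptions, not details from the talk.

```python
from pathlib import Path


def preload(directory, pattern="*.so", chunk_size=1 << 20):
    """Read every matching file sequentially, discard the data, return bytes read."""
    total = 0
    for path in sorted(Path(directory).glob(pattern)):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)  # sequential reads, no seeking back
                if not chunk:
                    break
                total += len(chunk)
    return total
```

Calling `preload("/path/to/firefox")` (a placeholder path) before launching the browser would play the role of the `cat` in the startup script. Whether it helps on a given machine depends on how much of each file the kernel would have read anyway.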
31:43
OK, I can sense a lot of questions out there, but we're only going to take two, because we're out of time, basically.
32:03
Is the Firefox or Mozilla organization supporting you in this, and how can other open source projects benefit from it? Well, anything that can help really helps. I'm actually... Can you repeat the question someone asked in the back? He was asking what
32:21
people can do about it to help. So yes, he asked: are you getting paid to do this? Yes. How can other projects use it? Ah, OK. Yes, I'm being paid for this; I'm actually contracted by Mozilla.
32:42
And how do other people... well, you can start to use the code we wrote, like icegrind or something like that. You can contact me; there's an address. I can probably give a hand. Any feedback will be helpful,
33:03
from your experience, because you probably have problems as well. If you have scripts, better scripts for SystemTap or ftrace or whatever, that'd be helpful. For the moment it's
33:22
a work in progress, so if you want to give a hand, you can. One more question over there, yeah. For example, the static initializer problem
33:47
is kind of solved in GCC 4.6. It's not really solved, in the sense that it still generates functions
34:00
for stupid things, but it groups them, which takes the data. There are other things happening within GCC, and actually Mozilla is trying to get people to do some things on GCC's side.
34:23
Thank you. That's it. Thanks very much.