
Fighting I/O: a story of Firefox startup speed improvements


Formal Metadata

Title
Fighting I/O: a story of Firefox startup speed improvements
Number of Parts
64
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
Cold startup is the first experience a user has of an application, so you'd rather make it fast. Unfortunately, a lot of different things get in the way, from filesystems to toolchains, even including binary formats. This talk will explore various sides of the problem and introduce some of the techniques implemented in Firefox 4.0 and in the works for subsequent major releases. It will also cover some of the tools developed or used by Mozilla to improve Firefox cold startup, and some upcoming toolchain improvements.
Transcript: English(auto-generated)
After a satisfying luncheon, you're ready for another fine afternoon. Let me introduce my colleague, who is going to talk about fighting I/O and the story of cold startup improvements. As you know, this is a big focus for Firefox 4: performance and startup. Take it away.
Can you hear me? Yes. Yep. Before I start, I would like to know how many people in the audience are not Mozilla developers. Awesome. How many of you are developers? Awesome.
So I'll be talking to you about I/O, which is an unexpected problem on most systems nowadays. I'll explain why we need to address I/O somehow
and how it has an impact on software. Then I'll show you how to actually see what's happening, hopefully, and what can be done about it.
20 years ago, I had my first real PC, one with a memory management unit, and it was fast at the time. But, well, PCs nowadays are really, really much faster.
Processor speed back then was in the tens of millions of instructions per second. Now you can count in tens of thousands of millions of instructions per second.
20 years ago you counted memory capacity in megabytes. Now you count in gigabytes. Memory bandwidth was maybe one gigabit per second. Now it's more like 200 or 300 or 500 gigabits per second. Hard drive capacity has exploded.
My hard drive back then was 200 megabytes. It was quite big. Now you see terabytes. And throughput has improved as well, because back then you had one megabyte per second. Now you have hundreds.
Hard drive access time has also improved. 20 years ago you had a drive with an access time of 20 milliseconds. Now with SSDs we have 0.1 milliseconds. But that's SSDs.
In practice, this is not true for everyone. Most people don't have an SSD. So we're still stuck with very, very slow access times, in the order of 5 to 10 milliseconds. So what's the problem? That's the problem.
So on the horizontal axis you have time, and this is Firefox startup on Linux. Vertically you have the disk: it's the offset on the disk. What you can see is that you read something, go elsewhere, and come back. You're going back and forth on the disk.
And with really slow access times, it means that every time you have a vertical bar, it's really slow. And even if you zoom in on the other parts, it's not really good either.
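The cost of going back and forth can be sketched with a toy model. This is only an illustration, not the measurement method used in the talk: the function name, the 128 KiB chunk size, the 10 ms access time and the 30 MB/s throughput are all assumptions, and the model charges one full access time for every read that does not start where the previous one ended.

```python
import random


def estimate_io_seconds(offsets, chunk_size, access_ms, throughput_mb_s):
    """Estimate the time to read chunk_size bytes at each offset, in order.

    Toy model (an assumption, not a real disk): a read that does not start
    where the previous one ended pays one full access time, and transfer
    time is simply size / throughput.
    """
    seconds = 0.0
    head = None  # byte position just after the previous read
    for offset in offsets:
        if offset != head:
            seconds += access_ms / 1000.0  # seek penalty
        seconds += chunk_size / (throughput_mb_s * 1_000_000)
        head = offset + chunk_size
    return seconds


CHUNK = 128 * 1024  # hypothetical read size
in_order = [i * CHUNK for i in range(400)]
as_issued = in_order[:]
random.Random(0).shuffle(as_issued)  # simulate scattered accesses

slow = estimate_io_seconds(as_issued, CHUNK, access_ms=10, throughput_mb_s=30)
fast = estimate_io_seconds(sorted(as_issued), CHUNK, access_ms=10, throughput_mb_s=30)
print(f"as issued: {slow:.2f}s, reordered: {fast:.2f}s")
```

In this model the reordered sequence pays the access time once, while the scattered sequence pays it on almost every read, so the gap grows with the number of seeks rather than with the amount of data.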
And even these, or these, are hurting a lot. I did a little experiment with the data I gathered just before. Instead of taking the I/O as it was,
well, I did take both the I/O as it was experienced in reality, and I also reordered it to see the difference. On the slowest disk, which has only 30 megabytes per second of
throughput, the normal I/O takes 2.7 seconds. With the ordered I/O it takes half the time. On a faster disk, around 85 megabytes per second,
the ordered I/O is 3 times as fast. So it's really critical to avoid, however possible, any seek on the disk. And another point
is that we don't really have a problem with warm startup. Anything that is CPU-bound is not a problem. Why? Because, you see, that's the Firefox startup. On a Core 2 Duo system it
takes almost 4 seconds to start, and on warm startup it's much faster, well under 1 second. On an i7 system the cold startup is not really much different, but
the warm startup is twice as fast. It's not really a big difference, though, because it's under a second. Is it CPU time used? It's wall clock. It's wall clock. So that's the time it takes to start
on a Core 2 Duo with a slow disk. So you have to know what the problems with I/O are, and the problem is that at the moment we
have a lack of tools. There are ways to track some kinds of I/O, but it's really hard to have an actual grasp on what's happening. Linux has some tools that give you some idea
about what's happening, but it's really cumbersome, and I will show you some of those tools. You can't have a widespread knowledge of what's happening, and getting relevant startup times is
hard. The system I used was a virtual machine, which I rebooted 50 times to get average times within some kind of bounds. This is not something you want to do every day.
I wrote some automation tools to do that, but it's really cumbersome. Tracking I/O is also not as simple as tracking read and write calls. You will find a lot of scripts on the net doing exactly that, and it's wrong, very wrong.
The reason is really simple. For example, if you open a file, read from it, close the file, open the file again, read again and close it again, what do you think will happen? You have one disk access, not two, only one,
because the system is quite intelligent: it's caching, or you hope it is. Another interesting point I discovered is that
CPU scaling is actually influenced by I/O. Nowadays CPUs are not running at full speed all the time, and what's happening is that, if you go back,
whenever that kind of stuff happens the CPU is waiting for the disk, which means the CPU is sleeping, at its slowest speed. When you need to go back to full CPU speed,
there is a latency, because the CPU can't really switch from a slower speed to a faster speed in an instant. So what happens is that if you somehow find a way to have
your CPU maxed out during the startup of Firefox, it's faster by 10-20%, which is quite impressive and unexpected.
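Coming back to the point about counting reads: a minimal sketch of why the number of read calls says little about actual disk accesses. The class below is a hypothetical toy model of a page cache, not how the kernel is implemented; it ignores readahead and eviction, and the 4096-byte page size is an assumption.

```python
PAGE_SIZE = 4096  # assumed page size


class ToyPageCache:
    """Counts read calls separately from pages that must come from disk."""

    def __init__(self):
        self.cached = set()   # (path, page_index) pairs already in memory
        self.read_calls = 0   # what a naive read()-counting script would see
        self.disk_pages = 0   # pages that actually had to be fetched

    def read(self, path, offset, length):
        self.read_calls += 1
        if length <= 0:
            return
        first = offset // PAGE_SIZE
        last = (offset + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            if (path, page) not in self.cached:
                self.cached.add((path, page))
                self.disk_pages += 1


cache = ToyPageCache()
cache.read("libxul.so", 0, 4096)  # first open/read/close
cache.read("libxul.so", 0, 4096)  # second open/read/close of the same data
print(cache.read_calls, "read calls,", cache.disk_pages, "page(s) from disk")
```

A script that only counts read calls would report two accesses here, while only the first one reaches the disk in this model.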
So to gather the data I showed you before, the graphs of the disk, I used ftrace, which is a kernel tracing facility in Linux. It has the advantage of not needing anything other than a kernel compiled with
the right flags, and usually distros come with all you need for that. All you need is to mount debugfs, which might or might not be already mounted depending on the distro, and do some fiddling with it.
So here we just enable the trace on the disk where everything will be traced, we say that we want block tracing, here we say that we want the block I/O complete
events, we enable tracing, and then here we get the trace. The output is not really readable. You have a lot of output, and you don't really know what it means exactly, because there's not much
documentation about the block I/O tracing facility, so you have to guess. I used what I could: I just took what looked like what's happening. There are
some files in the events directories; you have many more events than that. You have a format file in each of these event directories, which is supposed to describe the format used by the output, and it doesn't really match.
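The steps just described could look roughly like the sketch below. The exact commands from the slide are not in the transcript, so every path here is an assumption based on a typical Linux kernel with debugfs mounted at /sys/kernel/debug and block tracing compiled in; the device name `sda`, the choice of the `block_bio_complete` event, and the helper names are hypothetical. The code only prints the plan unless `dry_run=False` is passed, which would require root.

```python
import os

TRACING = "/sys/kernel/debug/tracing"  # assumed debugfs mount point


def block_trace_steps(device="sda"):
    """Return (path, value) writes that set up block I/O tracing (assumed paths)."""
    return [
        # enable tracing on the disk where everything will be traced
        (f"/sys/block/{device}/trace/enable", "1"),
        # select the block tracer
        (os.path.join(TRACING, "current_tracer"), "blk"),
        # ask for the block I/O completion events
        (os.path.join(TRACING, "events/block/block_bio_complete/enable"), "1"),
        # turn tracing on
        (os.path.join(TRACING, "tracing_on"), "1"),
    ]


def apply_steps(steps, dry_run=True):
    """Print the equivalent shell commands, or perform the writes if not dry_run."""
    for path, value in steps:
        if dry_run:
            print(f"echo {value} > {path}")
        else:
            with open(path, "w") as f:
                f.write(value)


apply_steps(block_trace_steps())
```

Under the same assumptions, the trace itself would then be read from the `trace` file in the tracing directory.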
Another tool that I used, and I will show you graphs just after that, is SystemTap. SystemTap is a kernel tracing Swiss Army knife. You can insert code in the kernel while it's doing its job; you can do almost anything with it.
You can trash the system with it if you want; I wouldn't advise it. The big downside is that you really need to know the kernel internals to actually do something. So the graphs
I did after that required a lot of digging in the kernel code, and, well, obviously it's hard to get the right data out of it, because it also depends on how the kernel is optimized, because the kernel source is not exactly
designed for that. There are many calls from static functions to other static functions, and sometimes they are inlined, sometimes not, so you can put probes on some and you can't put probes on others. It's really a pain.
The URL I wrote there is a blog post I published a few months ago, maybe a month ago, about the SystemTap setup I used, and also the script to get this.
This is more than just the SystemTap output, but here is a summary of all the I/Os happening on the libxul file, which is the main file containing most of the code in Firefox, during startup.
The colored stripes correspond to some big sections in the file. The pinkish one is relocations, I'll explain them later. That's code,
that's read-only data, that's read-only data, that's read-only data, read-only data, and this is data.
Here it's something you have to endure on 64 bits: it's .eh_frame, which is used to unwind exceptions, which are actually not used
in Firefox. You have to have it; it's in the ABI. So what's happening? The process starts, then you have some reads here at the beginning and at the end. Why? Well, that's kind of an unfortunate
state of the binaries: to know what to read in the binary, you have to read at the beginning of it and at the end of it, which is pretty weird, but that's the way it is. It could be changed
by the linker, but at the moment you cannot do anything. Actually, for most of what is there you cannot do much about it, because maybe it runs around here, sorry.
So after initializing libxul, here it does nothing in the file. That's because it's doing similar things on other libraries. And then here you see reads here
and here, and what's happening is that it's doing relocations. Relocations are necessary when you have randomized address spaces, because the library is not necessarily loaded at the same
address in the address space. This means that you have to change a bunch of offsets to make it work with the address at which the code is loaded. Fortunately,
you don't have to do that in your code, because the compiler and the linker do a great job at it, but you read a lot of data and you also update a lot of data, and you do it a bit at a time, so you go back and forth.
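What applying relocations amounts to can be sketched as follows. This is a deliberately simplified model, assuming only "add the load address" style relocations on 8-byte words; a real dynamic linker processes several relocation types from the ELF file. The function name, the image layout and the load address are all hypothetical.

```python
WORD_SIZE = 8      # assumed pointer size (64 bits)
PAGE_SIZE = 4096   # assumed page size


def apply_relative_relocations(image, reloc_indices, load_base):
    """Add load_base to each listed word of image, in place.

    image is a list of integers standing in for the words of a library;
    reloc_indices lists which words hold addresses computed at link time.
    Returns the set of pages that had to be read and modified.
    """
    touched_pages = set()
    for index in reloc_indices:
        image[index] += load_base
        touched_pages.add(index * WORD_SIZE // PAGE_SIZE)
    return touched_pages


# A 32 KiB "library" (8 pages) with four pointers scattered through it.
image = [0] * 4096
for index, link_time_address in [(0, 0x1000), (600, 0x2340), (1200, 0x10), (3000, 0x5000)]:
    image[index] = link_time_address

pages = apply_relative_relocations(image, [0, 600, 1200, 3000], load_base=0x7F0000000000)
print(sorted(pages))
```

Even with only four pointers to fix up, four different pages are touched in this model, which is the "a bit at a time, back and forth" pattern described above.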
Here it's going forward; that's another thing, that's static initializers. I think it's the next slide. Yeah, these are static initializers.
These are only examples, and for each of those the compiler will actually create a function. This function will be in the corresponding object file, and the result is that, well, you know, when you link,
you have a lot of object files, and each of these functions is in each of these little object files, so a static initializer function is called for each object file.
And it's done backwards, because the GCC developers decided it was going to be backwards. The reason is that there are constructors and there are destructors, so you have static initializers on one hand and
destructors on the other hand, and to be safe with object files they have to be run in reverse order from each other. It was decided, unfortunately, that the ones going backwards are the static initializers.
So the main problem with that is that it's really easy to get static initializers without knowing it. Because who would know from that, for example? You could guess here,
maybe, but really it's something stupid from the compiler, because this is a constant, and doing this it will actually create a function that just sets this value, not a function call or anything, just this value,
to this slot. A function just for that. Here it actually calls something, so you can tell that it will create a static initializer. icegrind is another
tool. It's actually two tools: one was developed by Taras Glek and one was developed by myself. I took Taras's one
and changed it to do what I wanted it to do. They are both Valgrind plugins. My version tracks all the bytes, the single bytes, that are accessed during the execution of a process,
only once, so at the end of the run it will tell you what byte ranges have been touched. With the one from Taras, you give it a list of sections, whatever you want.
What we use it for is taking, for example, the output of ld, which will give you a map of all your object files and functions, and we list all the functions, and we can know that way which functions
are called when, in what order, but only once. So what can be seen with icegrind, with my version, the one going byte by byte,
is that the kernel actually reads a lot of data. For example .text, the text section, is the code section. The red bar is the size of the section in the file, the green one
is the readahead, what the kernel actually reads, which is a lot: most sections are read almost entirely. The blue bars are what's actually needed, and you see that for the code almost nothing is needed, but you still read all that, which is a waste of time.
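The kind of bookkeeping behind the "blue bars" can be sketched like this. It is not icegrind's code, only a small illustration of turning a set of touched byte addresses into ranges and comparing them with section boundaries; the function names, addresses and section layout are made up for the example.

```python
def merge_touched(addresses):
    """Collapse byte addresses into sorted (start, end) ranges, end exclusive."""
    ranges = []
    for addr in sorted(set(addresses)):  # each byte counts only once
        if ranges and ranges[-1][1] == addr:
            ranges[-1] = (ranges[-1][0], addr + 1)  # extend the current range
        else:
            ranges.append((addr, addr + 1))
    return ranges


def bytes_used_per_section(ranges, sections):
    """Given sections as name -> (start, end), return name -> touched byte count."""
    used = {}
    for name, (s_start, s_end) in sections.items():
        used[name] = sum(
            max(0, min(end, s_end) - max(start, s_start))
            for start, end in ranges
        )
    return used


touched = merge_touched([0, 1, 2, 3, 10, 11, 100, 101, 102, 3])
sections = {".text": (0, 64), ".data": (64, 128)}  # hypothetical layout
print(touched)
print(bytes_used_per_section(touched, sections))
```

Dividing each count by the section size gives the fraction of the section that was actually needed, which can then be compared with what the kernel read.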
about:startup is something new that's coming in Firefox 4. The blog post actually has the extension. It's a small extension,
a quite stupid one actually, only displaying three values. It's tracking when main is called (it's not exactly main, it's the main function in libxul), when the session is restored, which is when all the tabs
have been initialized but not necessarily loaded from the net, and when the first paint occurs, whatever that means. We also gather data, actual data from users, through addons.mozilla.org pings,
those that are sent when you want new add-ons or when it wants to know if you are up to date. And a real extension, with graphs and stuff like that, is a work in progress. So we have a lot of unexpected enemies.
The file systems, for example: during the course of all this testing I copied Firefox a lot, and it turned out that files were interleaved. For example the libxul file,
which is 20-something megabytes: you had a bunch of it, a bunch of another file, another bunch of libxul, another bunch of another file, and so on and so forth. The toolchain doesn't really help: as we saw, the compiler doesn't help here, the static and dynamic linkers don't help,
and CPU scaling doesn't help either. So what can we do about it? We can, for example, do something about that, we can try to do something about that, we can also try
to do something about that; that's something the linker should do. And we should definitely try to do something about all these, which are basically most of the time due to code in system libraries.
So what we have to do is, well, avoid fragmentation. We did something for Firefox 4: the SQLite files, for Places for example, were very, very fragmented,
and we improved that by allocating in bigger chunks. Reduce the number of files: this was done in Firefox 4. Before, we had a lot of different files in components and chrome; they are all grouped in one file now, omni.jar.
Improve the binary layout: we tried some things about that, reordering the object files for example, which is the easiest way to do that without needing a new toolchain. Reduce the size:
anything you can do to reduce the size will obviously save some I/O. And avoid going back and forth between files, because, well, it's killing. So, for example... it's actually sad, this graph is kind of sad.
This is the 3.6 start time, and it's actually faster to start than 4. This is 4 beta 8;
I did my data gathering a long time ago. That's without omni.jar, and we see that omni.jar actually gives a good improvement here, but we are still slower than 3.6. That's important. But
we also have extension packing now: instead of unpacking all the extensions when we install them, we keep them packed when we can. So these are profiles
I used with the six biggest extensions by number of users, only those that work on 4.0 as well, because not all of them do. So actually
version 4 is faster with extensions packed. And something stupid I did is trying to reverse the static initializers, those that go backwards: I just hacked the file so that the pointers go forward,
and the result is actually surprising. I did expect some improvement, but not that much, in the order of 10-20%, just by going forward instead of backwards.
These are the various changes I tried. Unfortunately they won't make it into 4.0, except one. So here we have the normal target, here we reduced
the static initializers, some of the static initializers but not all of them, and here we reordered the binaries, packed relocations, and reduced the static initializers.
So these are file sizes, the libxul size, so it's quite a good gain. And the result in startup time is actually disappointing for fewer static initializers, but actually good
when you put all of those together. You have two blog posts with more data about that. The reason why it is actually slower with fewer static initializers is that
when you stop reading some of the stuff here, well, at some point you will probably have to read it anyway. Here it's cached from the first read, and here you won't see anything for these offsets,
because obviously it's already in the cache. But if you read less here, you have to read it later, which sometimes kills. I'll skip this one. So what's next?
We'll also try to avoid fsync, which is also killing, because basically it's telling the system to flush anything in the FS cache that is not written. We'll also try to separate hot and cold functions. What will be tried is to actually separate them
into two libraries, one for the hot functions, those that are actually used at startup, and one for those that are not. Removing dead code, because we have some dead code, and it's taking space, and it might actually be read by the kernel
by mistake. And preloading: this is a small experiment I did. I just preloaded all the library files from the Firefox directory, I just did a cat
on the whole files, and it's actually faster to start. And the faster times include the amount of time it took to cat them to /dev/null? Yes, they do. So this is a three-line change to our startup script? Exactly. And the improvement is what percent?
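The preloading experiment can be approximated as below. The talk describes a `cat` from the startup script; this is an equivalent sketch in Python rather than the actual change, and the function name, the `*.so` pattern and the chunk size are assumptions. Reading each file sequentially from start to end pulls it into the page cache, so later scattered accesses no longer have to seek.

```python
import glob
import os


def preload(directory, pattern="*.so", chunk_size=1 << 20):
    """Sequentially read every matching file and discard the data.

    Returns the total number of bytes read. The only purpose is the side
    effect of having the files in the page cache afterwards.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total
```

It would be called with the application directory before launching, for example `preload("/usr/lib/firefox")`, where the path is only a placeholder.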
OK, I can sense a lot of questions out there, but we're only going to take two because we're out of time, basically.
Is the Firefox or Mozilla organization supporting you in this, and how can other open source projects benefit from it? Well, anything that can help really helps. I'm actually... can you repeat the question? Someone asked in the back. He was asking what
people can do about it to help. So, yes: he asked, are you getting paid to do this? Yes. And how can other projects use it? Ah, OK. Yes, I'm being paid for this; I'm actually contracted by Mozilla.
And how can other people... well, you can start to use the code we wrote, like icegrind or something like that. You can contact me, there's an address, and I can probably give a hand. Any feedback from your experience will be helpful,
because you probably have problems as well. If you have scripts, better scripts for SystemTap or ftrace or whatever, that would be helpful. For the moment it's
a work in progress, so if you want to give a hand, you can. One more question over there? Yeah. For example, the static initializer problem
is kind of solved in GCC 4.6. It's not really solved, in the sense that it still generates functions
for stupid things, but it groups them together. There are other things happening within GCC, and actually Mozilla is trying to get people to do some things on GCC's side.
Thank you, that's it. Thanks very much.