Fighting I/O: a story of Firefox startup speed improvements - TIB AV-Portal


Formal Metadata

Title

Fighting I/O: a story of Firefox startup speed improvements

Title of Series

Number of Parts

64

Author

License

CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Identifiers

10.5446/45923 (DOI)

Publisher

Release Date

Language

Content Metadata

Subject Area

Computer Science

Genre

Conference/Talk

Abstract

Cold startup is the first experience a user has from an application, so you'd rather make it fast. Unfortunately, a lot of different things get in the way, from filesystems to toolchains, even including binary formats. This talk will explore various sides of the problem, and introduce some of the techniques implemented in Firefox 4.0 and in the works for subsequent major releases. This talk will cover some of the tools developed or used by Mozilla to improve Firefox cold startup, and some upcoming toolchain improvements.

FOSDEM 2011, 26 / 64



Transcript: English (auto-generated)

00:00

I hope you had a satisfying lunch and are ready for another fine afternoon. I'd like to introduce my colleague, who is going to talk about fighting I/O and cold startup improvements. As you know, performance and startup are a big focus for Firefox 4. Take it away.

00:21

Can you hear me? Yes. Yep. Before I start, I would like to know how many people in the audience are not Mozilla developers. Awesome. How many of you are developers? Awesome.

00:41

So I'll be talking to you about I/O, which is an unexpected problem on most systems nowadays. I'll explain why we need to address I/O somehow

01:05

and how it has an impact on software. Then I will show you how to actually see what's happening, hopefully, and what can be done about it.

01:25

20 years ago, I had my first real PC, one with a memory management unit, and it was fast at the time. But, well, PCs nowadays are really, really much faster.

01:42

Processor speed by then was like tens of millions of instructions per second. Now you can count in tens of thousands of millions of instructions per second. Memory capacity has more than,

02:01

yeah, 20 years ago you could count in megabytes. Now you count in gigabytes. Memory bandwidth was maybe one gigabit per second. Now it's more like 200 or 300 or 500 gigabits per second. Hard drive capacity has exploded.

02:22

My hard drive by then was 200 megabytes. It was quite big. Now you see terabytes. And throughput is good as well, because by then you had one megabyte per second. Now you have hundreds.

02:41

Hard drive access time also improved. 20 years ago you had a drive that had an access time of 20 milliseconds. Now with SSDs we have 0.1 milliseconds. But that's SSDs.

03:01

In practice, this is not true. Most people don't have an SSD. So we're still stuck with very, very slow access times in the order of 5 to 10 milliseconds. So what's the problem? That's the problem.

03:20

So on the horizontal axis you have time. And this is Firefox startup on Linux. Vertically you have the disk. It's the offset on the disk. And what you can see is that you read stuff and you go elsewhere and go back. You're going back and forth on the disk.

03:44

And with really slow access times it means that every time you have a vertical bar it's really slow. And even if you zoom on the other axes it's not really good either.
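To put rough numbers on those vertical bars, here is a minimal sketch of a cost model, not a measurement. It charges one access time for every read that does not continue where the previous one ended, plus the transfer time of each chunk. The function name, the 4 KiB chunk size, the offsets and the 30 MB/s throughput are illustrative assumptions; the 10 ms access time is in the range quoted above.

```python
from typing import Iterable


def estimated_read_time(offsets: Iterable[int], chunk_size: int,
                        access_time_s: float,
                        throughput_bytes_per_s: float) -> float:
    """Estimate the time needed to read chunks at the given offsets, in order.

    A seek is charged whenever a read does not start exactly where the
    previous read ended. Transfer time is charged for every chunk.
    """
    total = 0.0
    position = None  # where the previous read ended; None before the first read
    for offset in offsets:
        if position != offset:
            total += access_time_s  # the head has to move: pay one access time
        total += chunk_size / throughput_bytes_per_s
        position = offset + chunk_size
    return total


# Hypothetical access pattern: reads alternate between two regions of the disk.
CHUNK = 4096
scattered = [0, 100 * CHUNK, CHUNK, 101 * CHUNK, 2 * CHUNK, 102 * CHUNK]

as_issued = estimated_read_time(scattered, CHUNK, 0.010, 30e6)
ordered = estimated_read_time(sorted(scattered), CHUNK, 0.010, 30e6)
print(f"as issued: {as_issued * 1000:.1f} ms, ordered: {ordered * 1000:.1f} ms")
```

In this model the same six chunks cost six seeks when read back and forth and only two when read in offset order, so the seek term dominates as long as the chunks are small.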

04:00

And even these or these are hurting a lot. I did a little experiment with the data I gathered just before. Instead of taking the IO as it was

04:24

well, I took both the I/O as it was experienced in reality and a reordered version, to see the difference. And on the slowest disk, which is only 30 megabytes per second

04:40

throughput, the normal I/O takes 2.7 seconds. With the ordered I/O it takes half the time. On a faster disk, around 85 megabytes per second,

05:01

the ordered I/O is 3 times as fast. So it's really critical to avoid, however possible, any seek on the disk. And another point

05:22

is that we don't really have a problem with warm startup. Anything that is CPU bound is not a problem. Why? Because, you see, that's the Firefox startup. On a Core 2 Duo system it

05:40

takes almost 4 seconds to start, and on warm startup it's much faster, well under 1 second. And on an i7 system the cold startup is not really much different, but

06:00

the warm startup is twice as fast, but it's not really a big difference because it's under a second. Is it CPU time used? It's wall clock. So that's the time it takes to start

06:21

on a Core 2 Duo with a slow disk. So you have to know what the problems with I/O are, and the problem is that at the moment we

06:40

have a lack of tools. There are ways to track some kind of I/O, but it's really hard to get an actual grasp of what's happening. Linux has some tools that give you some idea

07:01

about what's happening, but it's really cumbersome, and I will show you some of the tools. You can't have widespread knowledge of what's happening, and getting relevant startup times is

07:22

hard. The system I used was a virtual machine which I rebooted 50 times to get average times within some kind of boundaries. This is not something you want to do every day.

07:43

I wrote some automation tools to do that but it's really cumbersome. Tracking IO is also not really as simple as tracking read and write. You will find a lot of scripts on the net actually doing that and it's wrong, very wrong.
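A minimal sketch of why counting read calls overstates disk I/O, assuming a deliberately simplified page cache that works in 4 KiB pages, never evicts and does no readahead. The function name and the `(path, offset, length)` tuple format are hypothetical; a real trace would have to come from a tool such as strace.

```python
PAGE_SIZE = 4096  # assumed page size


def disk_reads_needed(read_calls, page_size=PAGE_SIZE):
    """Count the pages that must come from disk for a sequence of read calls.

    read_calls is an iterable of (path, offset, length) tuples. A page that
    has been read once is assumed to stay in the page cache afterwards.
    """
    cached = set()
    misses = 0
    for path, offset, length in read_calls:
        if length <= 0:
            continue
        first_page = offset // page_size
        last_page = (offset + length - 1) // page_size
        for page in range(first_page, last_page + 1):
            if (path, page) not in cached:
                cached.add((path, page))
                misses += 1
    return misses


# Open, read, close, then open, read, close again: two read calls.
calls = [("prefs.js", 0, 512), ("prefs.js", 0, 512)]
print(len(calls), "read calls,", disk_reads_needed(calls), "page read from disk")
```

The model only counts pages; it says nothing about where they are on disk, which is what the block-level tracing discussed later is for.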

08:02

The reason is really simple. For example, if you open a file, read from it, close the file, open the file again, read again and close it again, what do you think will happen? You have one access, not two, only one,

08:23

because the system is quite intelligent: it's caching, or at least you hope it does. Another interesting point I discovered is that

08:40

CPU scaling is actually influenced by I/O. Nowadays CPUs are not running at their full speed all the time, and what's happening is that, if you go back,

09:02

whenever that kind of stuff happens, the CPU is waiting for the disk, which means the CPU is sleeping: it's at its slowest speed. When you need to go back to full CPU speed

09:20

there is a latency, because the CPU can't really switch from slower speed to faster speed in an instant. So what happens is that if you somehow find a way to have

09:41

your CPU maxed out during the startup of Firefox, it's faster by 10-20%, which is quite impressive and unexpected.

10:00

So, to gather the data I showed you before, the graphs on the disk, I used ftrace, which is a kernel tracing facility in Linux. It has the advantage of not needing anything other than the kernel compiled with

10:21

the right flags, but usually distros come with all you need for that. All you need is to mount debugfs, which might or might not be already mounted depending on the distro, and do some fiddling with it.

10:41

So here we just enable the trace on the disk where everything will be traced. Here we say that we want block tracing. Here we say that we want the block I/O complete

11:02

events. We enable tracing, and then here we get the trace. The output is not really readable: you have a lot of output and you don't really know what it means exactly, because there's not much

11:21

documentation about the block I/O tracing facility, so you have to guess. I used what I could; I just took what looked like what's happening. There are

11:40

some files in the events directories. You have many more events than that, and you have a format file in each of these event directories which is supposed to contain the format used by the output, and it doesn't really match.
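The commands on the slide are not in the transcript, so the following is only a sketch of what such a setup might look like, written as the list of files to write rather than as the speaker's actual commands. It assumes debugfs is mounted at `/sys/kernel/debug`, that the disk is `sda`, and that the "block I/O complete" event is `block_bio_complete`; all three are assumptions, as are the function names.

```python
import os


def block_trace_steps(device="sda", tracing_dir="/sys/kernel/debug/tracing"):
    """Return (path, value) pairs that would enable block I/O tracing.

    The device name, the debugfs mount point and the choice of the
    block_bio_complete event are assumptions, not taken from the talk.
    """
    return [
        # Enable tracing on the disk where everything will be traced.
        (f"/sys/block/{device}/trace/enable", "1"),
        # Select the block tracer.
        (os.path.join(tracing_dir, "current_tracer"), "blk"),
        # Ask for block I/O completion events.
        (os.path.join(tracing_dir,
                      "events/block/block_bio_complete/enable"), "1"),
        # Switch tracing on.
        (os.path.join(tracing_dir, "tracing_on"), "1"),
    ]


def apply_steps(steps, root="/"):
    """Write each value to its file. Needs root privileges on a real system.

    The root parameter exists only so the function can be tried against a
    scratch directory instead of the live /sys tree.
    """
    for path, value in steps:
        target = os.path.join(root, path.lstrip("/"))
        with open(target, "w") as f:
            f.write(value)


for path, value in block_trace_steps():
    print(f"echo {value} > {path}")
```

After the steps are applied, the trace itself would be read from the `trace` or `trace_pipe` file in the same tracing directory.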

12:02

Another tool that I used, and I will show you graphs just after that, is SystemTap. SystemTap is a kernel tracing Swiss army knife. You can insert code in the kernel while it does its job; you can do anything with it,

12:22

almost. You can trash the system with it if you want; I wouldn't advise it. The big downside is that you really need to know the kernel internals to actually do something. So the graphs

12:40

I did after that required a lot of digging in the kernel code and, well, obviously it's hard to get the right data out of it, because it also depends on how the kernel is optimized, because the kernel source is not exactly

13:03

designed for that. There are many calls from static functions to other static functions, and sometimes they are inlined, sometimes not, so you can put probes on some and you can't put probes on some others. It's really a pain.

13:22

The URL I wrote there is a blog post I posted a few months ago, maybe a month ago, about the SystemTap setup I used and also the script to get this.

13:42

There's more to it than the SystemTap output, but here is a summary of all the I/Os happening on the libxul file, which is the main file containing most of the code in Firefox, during startup,

14:01

and the colored stripes correspond to some big sections in the file. The pinkish one is relocations; I'll explain that to you later. That's code,

14:20

that's read-only data, more read-only data, and this is data.

14:42

Here it's something you have to endure on 64 bits: it's .eh_frame, which is used to unwind exceptions, which are actually not used

15:02

in Firefox, but you have to have it: it's in the ABI. So what's happening? The process starts, then you have some reads here, at the beginning and at the end. Why? Well, that's kind of an unfortunate

15:23

state of the binaries: to know what to read in the binary, you have to read at the beginning of it and at the end of it, which is pretty weird, but it's the way it is. It could be changed

15:41

by way of the linker, but at the moment you cannot do anything. Actually, for most of what is there, you cannot do much about it, because maybe it runs around here... sorry.

16:06

So after initializing libxul, here it does nothing in the file; that's because it's doing similar things on other libraries. And then here you see reads here

16:21

and here, and what's happening here is that it's doing relocations. Relocations are necessary when you have randomized address spaces, because the library is not necessarily loaded at the same

16:40

address in the address space. This means that you have to change a bunch of offsets to make it work with the address at which the code is loaded. Fortunately,

17:00

you don't have to do that in the code, because the compiler and the linker do a great job at it, but you read a lot of data and you also update a lot of data, and you do it a bit at a time, so you go back and forth.
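As a rough illustration of what "changing a bunch of offsets" means, here is a simplified model of load-time relocation. It assumes that every relocated slot holds an address computed as if the library were loaded at address 0, and that the loader simply adds the real load address to each slot. Real relocation formats have more cases; the function name and the sample values are made up for the example.

```python
import struct


def apply_relative_relocations(image: bytearray, reloc_offsets,
                               load_base: int, word_size: int = 8):
    """Add load_base to the little-endian word stored at each offset.

    image stands for the in-memory copy of the library's data, and
    reloc_offsets for the list of slots the linker recorded as needing
    adjustment. Each slot is read, updated and written back in place.
    """
    fmt = "<Q" if word_size == 8 else "<I"
    mask = (1 << (8 * word_size)) - 1
    for offset in reloc_offsets:
        (value,) = struct.unpack_from(fmt, image, offset)
        struct.pack_into(fmt, image, offset, (value + load_base) & mask)
    return image


# A 24-byte "data section" with one pointer at offset 8 that refers to
# address 0x1000 inside the library.
image = bytearray(24)
struct.pack_into("<Q", image, 8, 0x1000)
apply_relative_relocations(image, [8], load_base=0x7F0000000000)
print(hex(struct.unpack_from("<Q", image, 8)[0]))
```

Each slot costs a read and a write of the page that contains it, which is why a long relocation list touches a large part of the file even though no code has run yet.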

17:22

Here it's forward. That's another thing: that's static initializers. I think it's the next slide. Yeah, these are static initializers.

17:42

These are only examples, and for each of those the compiler will actually create a function. This function will be in the corresponding object file, and the result is that, well, you know, when you link

18:00

a lot of object files, you have a lot of object files, and each of these functions is in each of these little object files, so a static initializer is called for each object file,

18:20

and it's done backwards, because GCC developers decided it was going to be backwards. The reason is that there are constructors and there are destructors, so you have static initializers and

18:41

on the other hand you have destructors, and to be safe with object files they have to be run in reverse order from each other. It was decided, unfortunately, that the ones going backwards are the static initializers.
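One plausible reason why the direction matters can be shown with a toy model. It assumes that when a page is missing, the kernel fetches that page plus a fixed number of pages after it, and nothing before it. That is a strong simplification of real readahead and is my assumption, not something stated in the talk; the function name and the window size are also made up.

```python
def disk_requests(pages, readahead_pages=8):
    """Count disk requests for a sequence of page accesses.

    On a miss, the missing page and the following readahead_pages - 1 pages
    are assumed to be fetched in one request and kept in the cache.
    """
    cached = set()
    requests = 0
    for page in pages:
        if page not in cached:
            requests += 1
            cached.update(range(page, page + readahead_pages))
    return requests


# Hypothetical layout: one static initializer per page, across 64 pages.
initializer_pages = list(range(64))

forward = disk_requests(initializer_pages)
backward = disk_requests(list(reversed(initializer_pages)))
print("forward:", forward, "requests; backward:", backward, "requests")
```

Under this assumption, walking the initializers forward lets each request serve the next few initializers, while walking them backward makes every initializer a miss.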

19:06

So the main problem with that is that it's really easy to get static initializers without knowing, because who would know from that? For example, you could guess here,

19:20

maybe, but really it's something stupid from the compiler, because this is a constant, and doing this it will actually create a function that just sets this value, not a function call or anything, just this value,

19:41

to this slot: just a function for that. Here it actually calls something, so you have to know that it will create a static initializer. icegrind is another

20:02

tool. It's actually two tools: one was developed by Taras Glek and one was developed by myself. I took Taras's one

20:20

and I changed it to do what I wanted it to do, so they are both Valgrind plugins. My version tracks all the bytes, the single bytes, that are accessed during the execution of a process,

20:44

only once, so at the end of the run it will tell you what byte ranges have been touched. To the one from Taras you give a list of sections, whatever you want.

21:01

What we use it for is taking, for example, the output of ld, which will give you a map of all your object files and functions, and we list all the functions, and that way we can know which functions

21:21

are called when, in what order, but only once. So what can be seen with icegrind, with my version, the one going byte by byte,

21:41

is that the kernel actually reads a lot of data. For example .text: the text section is the code section. The red bar is the size of the section in the file, the green one

22:01

is the readahead, what the kernel actually reads, which is a lot: most sections are read almost entirely. The blue bars are what's actually needed, and you see that for the code almost nothing is needed, but you still read all that, which is a waste of time.
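The bookkeeping behind such a comparison can be sketched as follows. This is not icegrind's code: it only illustrates, with hypothetical function names and a made-up access list, how individually touched bytes can be merged into ranges and compared with the size of a section.

```python
def touched_ranges(offsets):
    """Merge individual byte offsets into sorted half-open (start, end) ranges."""
    ranges = []
    for offset in sorted(set(offsets)):
        if ranges and ranges[-1][1] == offset:
            # The byte extends the previous range by one.
            ranges[-1] = (ranges[-1][0], offset + 1)
        else:
            ranges.append((offset, offset + 1))
    return ranges


def fraction_needed(ranges, section_start, section_size):
    """Return the fraction of a section's bytes covered by the touched ranges."""
    section_end = section_start + section_size
    used = 0
    for start, end in ranges:
        overlap = min(end, section_end) - max(start, section_start)
        used += max(0, overlap)
    return used / section_size


# Made-up accesses: each byte is recorded once, however often it is touched.
accesses = [5, 1, 2, 3, 9, 10, 2]
ranges = touched_ranges(accesses)
print(ranges)
print(f"{fraction_needed(ranges, 0, 12):.0%} of a 12-byte section was needed")
```

Comparing this fraction with what the kernel read for the same section gives the gap between the blue and green bars on the slide.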

22:27

about:startup is something new that's coming in Firefox 4. So the blog post actually has the extension; it's a small extension,

22:41

a quite stupid one actually, only displaying three values. It's tracking when main is called (it's not exactly main, it's the main function in libxul), when the session is restored, which is when all the tabs

23:01

have been initialized but not necessarily loaded from the net, and when the first paint occurs, whatever that means. We also gather data, actual data from users, through addons.mozilla.org pings,

23:21

those that are sent when you want new add-ons or when it wants to know if you are up to date. A real about:startup extension with graphs and stuff like that is a work in progress. So we have a lot of unexpected enemies:

23:44

the file systems, for example. During the course of all this testing I copied Firefox a lot, and it turned out that files were mixed. For example the libxul file,

24:01

which is 20-something megabytes: you had a bunch of it, a bunch of another file, another bunch of libxul, another bunch of another file, and so on and so forth. The toolchain doesn't really help: as we saw, the compiler doesn't help here, the static and dynamic linkers don't help,

24:22

and CPU scaling doesn't help either. So what can we do about it? We can, for example, do something about that. We can try to do something about that. We can also try

24:41

to do something about that; that's something the linker should do. And we should definitely try to do something about all these, which are basically most of the time due to calls to system libraries.

25:03

So what we have to do is, well, avoid fragmentation. We did something for Firefox 4: the SQLite files for Places, for example, were very, very fragmented,

25:20

and we improved that by allocating in bigger chunks. Reduce the number of files: this was done in Firefox 4. Before, we had a lot of different files in components and chrome; they are all grouped in one file now, omnijar.

25:40

Improve the binary layout: we tried some things about that, reordering the object files for example, which is the easiest way to do that without needing a new toolchain. Reduce the size:

26:02

anything you can do to reduce the size will obviously save some I/O. And avoid going back and forth between files because, well, it's killing. So, for example... it's actually sad, this graph is kind of sad:

26:25

this is the 3.6 start time, and it's actually faster to start than 4. This is 4 beta 8.

26:42

I did my data gathering a long time ago, so this is without omnijar. We see that omnijar actually gives a good improvement here, but we are still slower than 3.6. That's important, but

27:02

we also have extension packing now: instead of unpacking all the extensions when we install them, we keep them packed when we can. So these are profiles

27:20

I used with the six most-used extensions, only those that work on 4.0 as well, because not all of them do. So actually

27:40

version 4 is actually faster with extensions packed. And something stupid I did is trying to reverse the static initializers, those that go backwards: I just hacked the file so that the pointers go forward,

28:02

and the result is actually surprising. I did not expect that much; I did expect some improvement, but not that much: in the order of 10-20% just by going forward instead of backwards.

28:22

These are the various changes I tried. Unfortunately they won't make it to 4.0, except one. So here we have the normal target. Here we reduced

28:44

the static initializers, some of them but not all of them, and here we reordered the binaries, packed relocations and reduced the static initializers.

29:00

So these are file sizes, the libxul size, so it's quite good. And the result in start time is actually disappointing for fewer static initializers, but actually good

29:23

when you put all of those together. You have two blog posts with more data about that. The reason why it is actually slower with fewer static initializers is that

29:40

when you stop reading some of the stuff here, you should stop reading there as well. Sometimes here you will probably have to read, actually, because here it's cached from the first start, and here you won't see anything for these offsets

30:01

because obviously it's already in the cache. But if you read less here, you have to read it later, which sometimes kills. I'll skip this one. So what's next?

30:20

We'll also try to avoid fsync, which is also killing, because basically it's telling the system to flush anything in the cache that is not written. We'll also try to separate hot and cold functions. At the target, what will be tried is to actually separate

30:41

into two libraries: one for the hot functions, those that are actually used at startup, and one for those that are not. Removing dead code, because we have some dead code, and it's taking space, and it might actually be read by the kernel

31:01

by mistake. And preloading: this is a small experiment I did. I just preloaded all the library files from the Firefox directory; I just did a cat

31:23

on the whole files, and it's actually faster to start. And do the faster times include the amount of time it took to cat to /dev/null? Yes, they do. So this is a three-line change to our startup script? Exactly. And the improvement is what, in percent?
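The preloading experiment described above was done with cat in the startup script. An equivalent sketch in Python, assuming the libraries match `*.so` in a single directory (the function name, the pattern and the example path are assumptions), simply reads each file sequentially and throws the data away so that it ends up in the page cache:

```python
import glob
import os


def preload(directory, pattern="*.so", chunk_size=1 << 20):
    """Read every matching file from start to end and discard the data.

    Returns the number of bytes read. The point is the side effect:
    after a sequential read, the file contents should be in the page cache.
    """
    total = 0
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical install location; adjust to the actual Firefox directory.
    print(preload("/usr/lib/firefox"), "bytes preloaded")
```

Reading each file in one sequential pass trades a little extra data transfer for far fewer seeks, which matches the observation that startup was faster even with the preload time included.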

31:43

OK, I can sense a lot of questions out there, but we're only going to take two because we're running out of time, basically.

32:03

Is the Firefox or Mozilla organization supporting you in this, and how can other open source projects benefit from it? Well, anything that can help really helps. I'm actually... Can you repeat the question someone asked in the back? He was asking what

32:21

people can do about it to help. So, yes, he asked: are you getting paid to do this? Yes. How can other projects use it? Ah, OK. Yes, I'm being paid for this; I'm actually contracted by Mozilla.

32:42

And how do other people... well, you can start to use the code we wrote, like icegrind or something like that. You can contact me; there's an address. I can probably give a hand. Any feedback will be helpful

33:03

from your experience, because you probably have problems as well. If you have scripts, better scripts for SystemTap or ftrace or whatever, that'd be helpful. For the moment it's

33:22

a work in progress, so if you want to give a hand, you can. One more question over there, yeah. For example, the static initializer problem

33:47

is kind of solved in GCC 4.6. It's not really solved, in the sense that it still generates functions

34:00

for stupid things, but it groups them, which takes the data. There are other things happening within GCC, and actually Mozilla is trying to get people to do some things on GCC's side.

34:23

Thank you, that's it. Thanks very much.
