Advances in computing and communications mean that we can cost-effectively store every book, sound recording, movie, software package, and public web page ever created, and provide access to these collections via the Internet to students and adults all over the world. By mostly using existing institutions and funding sources, we can build this as well as compensate authors within the current worldwide library budget. At 20 petabytes and growing, the Internet Archive is the 250th most popular website. This talk will discuss the technology and politics of building a large library.
We've got Brewster Kahle right here. He did a very good talk last night, last afternoon, about
distributing the web, and we had a good turnout; it was really awesome, thank you. Right now he is going to be talking about the Internet Archive, which, as you probably know, he is the one that founded, so he knows a lot about it. [unclear] The guy that's sleeping,
snoring very loudly right next to my tent: I know that not everybody is awake yet, and this is all completely great. So I need to talk about the basics. You can think of the Internet Archive as hacking the copyright system, or trying to get institutions to do things that they're not used to doing, or, the way I like to look at it, let's go back to the Library of Alexandria and do it again, and this time a Library of Alexandria that's available to everybody. Can we make
all the books, music, video, web pages, and software ever created by humans available to anybody that wants to have access to it? Can we do this? It turns out that technologically, actually, you can: between the storage that we have on computers now and the Internet in
terms of getting it to people, you can do it. So you say, well, why hasn't it happened? There are a lot of institutional issues in trying to get this all to happen; that has taken a lot longer than I thought, but we're getting there. So what I'd like to suggest is that universal access to all knowledge is within our grasp, and we're getting there, but we need a lot more
help to be able to get there. So, the Internet Archive is a nonprofit library. Is this showing? Yeah. A nonprofit library in San Francisco; please visit
us. We've been around for almost 20 years, and the idea is to try to do the pieces of the Internet that haven't gotten there yet. As people are going and making things available on the net, they're mostly forgetting about the old things and the like. We see ourselves in the tradition
of libraries. I like looking at what people carved in stone. What they carved in stone above the library in Boston was "Free to All," and this was put there by the
robber barons, the capitalists. These were not nice men, right? These guys were all about property and "mine, mine, mine," yet they carved "Free to All" above the library. That was their legacy. And why? Because information serves a different purpose than just selling stuff back
and forth. So this is the tradition that we're in. I'm an engineer, so I go at any problem from an engineer's perspective. You say: OK, if we want all books, music, video, and web pages up there, you have to say, well, how big is it? How hard a problem is it? Where do we get it from? How do we make it all
work? So that's the structure of this talk. Let's start with books. We say we want all books. The biggest book library by far is the Library of Congress, and they say they've got 28 million books, so 28 million books is by far the largest library ever made in the world. And a book is about a megabyte if you have it in Microsoft Word, so 28 million megabytes (mega, tera) is 28 terabytes, and 28
terabytes, that's on hard drives that you can buy at a local store. You can have all of the words in the Library of Congress in a shopping cart for less than you pay in a month's rent. Something has changed; something's happened. We can
actually think about having all of this history easily accessible. Then the question is, you know,
would you want to do it? And the answer is yes, actually. We're getting pretty used to having books on screens, even scanned books. Scanned books, and that's not on
Kindles, but the images of pages. The screens are good enough that you get these beautiful books. But you can also
take it another step. In some places they say, well, we don't have screens, we're not all online; can you print it back out again? So we made a
bookmobile. It's a print-on-demand bookmobile: we put in a satellite dish, a printer, a cutter, and a binder, and kids make their own books, because it's about a euro a book to go and download, print, and
bind a book. So it's actually cheaper to do that than to lend it from a library: a study at Harvard said it costs 3 dollars just administratively to lend a book. So for small books you can actually make things available, as long
as people don't yell at you. In
India we went and made a couple of them.
This is the first day at the Library of Alexandria in Egypt, an engineer working with the
kids, a happy kid with his own
book. And we did it even in Uganda; this is the
first book this girl has ever owned. So we can take not only our books and music and video and make them available to us, but we can make them available even another step out there, which is pretty
cool. There are Rube Goldberg machines, these oddball things that go and do
it on demand, and they can make books,
and they come out. But I think the real way things are going, as we all know, is more in
the area of screens, and the
screens are getting so good that we can actually do beautiful books that are a pleasure to read, and go and take our
books and make them available in lots of different formats. My favorite on here is in the bottom right: it is a little talking machine for the blind and dyslexic. It talks a little bit. They now have access to millions of books that they never had before.
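As an aside, the storage estimate given in the books part of the talk (28 million books at roughly one megabyte each) can be checked with a short Python sketch. The function names and the 8 TB drive capacity in the usage lines are assumptions made for illustration; the talk itself only gives the book count, the per-book size, and the 28 terabyte total.

```python
import math

MEGABYTE = 10**6    # decimal units, as commonly used for drive capacities
TERABYTE = 10**12


def collection_size_tb(num_books, bytes_per_book=MEGABYTE):
    """Size in terabytes of a text-only collection of num_books books."""
    return num_books * bytes_per_book / TERABYTE


def drives_needed(size_tb, drive_capacity_tb):
    """Whole drives needed to hold size_tb (capacity is a hypothetical input)."""
    return math.ceil(size_tb / drive_capacity_tb)


if __name__ == "__main__":
    size = collection_size_tb(28_000_000)  # Library of Congress figure from the talk
    print(f"{size:.0f} TB of text")
    # 8 TB per drive is an assumed capacity, not a number from the talk.
    print(drives_needed(size, drive_capacity_tb=8), "drives")
```

This counts only the words; scanned page images take far more space, and the talk gives no per-book figure for those.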
OK, so now you're convinced, maybe, that it is a good thing to have it up there, and we can go and have the storage to
be able to have it up there. So how do you get it done? Well, we've been doing these different things, like putting scanning centers up. This is
at the Library of Alexandria. The guy doesn't look too happy, but nonetheless they've scanned about 170 thousand Arabic books there, and they're continuing along. Then we
designed and built our own scanner, called the Scribe. We made the scanning
centers; this is the one in San Francisco, where they're doing microfilm down the center. We've got it so that it's fairly efficient to basically turn the pages. People say you should use robots, and we've tried the robots, and they don't work very well: they tear books, they're expensive, they don't work very well. I think they could work well, but the investment hasn't been made, either by us or by anybody else. So we've been doing it basically by
hand and getting beautiful books done. These are rare books
from Korea. These are biology books out of China, by working with the Chinese Academy of Sciences. And we've now set up 33 scanning centers in 8 countries where libraries are doing this as well. Google has already done actually a lot more than
we have, about 10 times more than we have, but they have a lot more money than we do, and they locked it up. They basically took even the public domain and made it property again, and this is wrong. If there is a sin in our world, it's locking up the public domain. The public domain is small enough as it is; we should be arguing, maybe, about what's in copyright. So if they're the Microsoft, we're the Linux, and we're digitizing pretty fast. I've been going around and asking different places: can we get everything ever written
in a particular language? So I got to meet with folks in Greece, but they're kind of busy imploding. There's Iceland, and we got a yes out of parliamentarians, we got yeses out of the heads of libraries, and there was one person, out of 300 thousand people in Iceland, there was somebody that decided they were in charge of the "no" department, so they said no, and it ground to a halt. But Bali said yes, so we basically started
working with the Balinese to go and digitize everything ever written in Balinese. We just want to do it all: let's go and see if we can get all languages. It turns out the way the Balinese write is not on paper but on palm leaves; they scratch it into the palm leaves. It was
completely cool. And so these are the priests that work with us to go and digitize these things by photographing them. They're just completely beautiful, and so
now we've gone and digitized and photographed everything written in Balinese. [unclear]
When we asked them, how do you read your palm leaves, they said, well, most people don't read
them; the priests do [unclear].
[unclear] go online. So what I'd like to do is
give a round of applause for the Balinese, being the
first [unclear]. So if we should do this with Turkish or Dutch or Danish materials, we can basically go and do this in such a way that their businesses still work and it still comes out right. So, scanning centers: we're doing about a thousand books every day in the
scanning centers all over the world. We've got
about 3 million free e-books that are public domain, and we have modern books that are available for the blind and dyslexic, "modern" meaning probably in copyright. We're also doing a lending system. OK, here's hack number one on
how to go and get things available to people even though they're in copyright. We try to buy books from publishers so that we can lend them to one person at a time. The publishers in
general have said no so far, so we've been digitizing books, and we still lend them to one person at a time. So where Google got into trouble by getting into lawsuits and the like, we've gotten no lawsuits. And so the way that this works is you can
go to openlibrary.org, click on a book, say this
"HTML5 for Beginners," which is actually a book that we bought, and you see that it's checked out by somebody, so then you have to get on the waitlist. But if you go to a less
popular book, like this history of Mayflower descendants from the Boston Public Library, surprise, nobody's checked it out. OK, so you can go and say, OK, I want to borrow this book. You have a
choice of formats, and then you
borrow this book. And another thing that's cool about this is it's borrowing it from the Boston Public Library. So these real libraries are digitizing books that are in copyright, non-rights-cleared books, digitizing them and lending them, just like we are, as a library. And this
has been going on for 4 years, and it's been just fine. So it's a mechanism of trying to be respectful of those that are trying to make money off of this stuff, but still having access and having it happen. And we've been trying out this whole approach of how far you can go in working with publishers but not working for them, and basically building a library system and making it work. So we've been able to get books by the hundreds of thousands, the more current books, and
make them available. OK, books. Let's go on to another
media type: music. This is an area that has more lawyers than
business people, it seems. It's just an area where people like to sue each other, the whole music area, so we had to be a little more careful about how we've gone about this. We first started with rock and roll bands that wanted to be distributed. It turns out the
Grateful Dead started a tradition of allowing people to record their concerts and then trade them on cassettes with other people, as long as no one
made any money. That's been a key thing that I found in all of this: no one made any money. And so as this moved online, the bands were up for being distributed online. So we have some level of permissions, and usually it's the fans going and asking, is it OK to put your concerts on the Archive? And somebody has to say yes; maybe it's the drummer, you know, or somebody in that community says yes. This is a lot less than what lawyers would like, with deals, signatures, and all that stuff, but it's OK here, and if it ever becomes not OK we take it back down again. But, you know, that's only happened once. We now have 6 thousand bands up there and 130
thousand concerts, and everything the Grateful Dead has ever done. So the idea of getting music up there and out there is starting to work as well.
There are also collections that were on old websites before MP3 was a format that was standardized, in the bad old
days of AIFF. Anybody remember AIFF? Yeah, yeah, bad news. Anyway, there were these sites that were trying to go and do this and distribute music. One of them was the Internet Underground Music Archive, and they died a long time ago, and so we now have them up. We've got lots of netlabels that are using us for free hosting.
And we're really working along with lots of different record producers now to start to go and
do the digitization. There are some awesome engineers, actually here in Amsterdam, that have been doing our CD digitization software to try to help get all of this stuff to work well. We're starting to get
donations of LPs, 78 rpm records, and the
like, and starting to do mass digitization of these different
formats. So why do this? Well, we haven't sort of put everything up on the net. We'd like to maybe do 30 seconds and then point to Amazon.com or something like that, but so far what we've done is made it available to researchers and the like, in listening
rooms, so that on campus you can have
full access to it. So we're starting to get better at music, and getting better at the whole area of
music. We've now added to our collections other audio recordings that are freely hosted on the Internet. The idea of having infinite storage and infinite bandwidth, forever, for free, is for some communities a very compelling offer, so we basically make things available with this. So even audio is doable. Moving images:
most people think of movies as Hollywood
films, and we're not that good at collecting that stuff yet, so we'll see. We've been doing old
films that haven't been particularly distributed, like through Hollywood: those old films you saw in high school when they had a substitute teacher, when they'd wheel in this projector and they'd show you why to be a typesetter, or, you know, these old "Are You Ready for Marriage?" films. Anyway, we digitized these and made them available, and
people love them. I'm not quite sure why, but they're there. And we've been making these things available, and people have been uploading things since long before YouTube. YouTube really ran away with the whole area of
video hosting; they own it. But there are still people, about a thousand a day, putting things up on the Internet Archive because they want them maybe more permanent, or for some other reason. So we've been doing
digitization, even VHS tapes,
which almost all have rights problems. We find that if we do [unclear] on DVD, nobody gets mad at us.
So we're getting better at that digitizing.
Even television is doable. We started out archiving 20 channels of television in the year 2000: Russian, Chinese, Japanese, Iraqi, Al Jazeera, BBC, CNN, 24 hours a day, DVD quality. The idea is to at least hold on to it. And so we've
got a 9/11 collection, but we're now starting to lend television.
So if you go to archive.org you can
basically go and search what people said, if it's in U.S. television news, because that's the only thing where we've got the
closed captions, the transcripts of what people said. You can type in and say, I want to see things about
Edward Snowden, and find all of the clips that have Edward Snowden in them, and you can then take those clips and put them in your blog, or request a DVD of the whole program if you want to make a documentary. And this is working, even though it's an enormous amount of material; the publishers, the networks, are happy about this. So we're actually able to make steps toward making things widely available. We want everyone to be a Jon Stewart research department, like at Comedy Central, where they go and say: here's what a politician said before, now here's what they said now, and, you know, it doesn't match, that type of thing, to have people
think critically about television.
So even moving images are doable. So if you do this kind of
thing that we're doing, going and offering free hosting, sometimes you attract attention from people you don't
like. So the FBI gave us one of these nasty letters called a National Security Letter. A National Security Letter is when they demand information
about patrons of the Internet Archive, users of the Internet Archive, and we can't even say to anybody that we've ever gotten this request. So we got one of these things, and we got lawyers, the Electronic Frontier Foundation (hurrah for the Electronic Frontier Foundation), and asked: what can you do,
what do you do about this? They said, well, you have to comply. Well, can I talk to my board about it? No. Can I talk to anybody about it? No. Can I ever talk to anybody about it? No. What happens if I don't do it? Jail. Is there anything we can do? No. Really? Well, you can sue the United States government. So we sued the United States, and won. [unclear] Of
the hundreds of thousands of these letters sent out, there have been only 3 organizations that have publicly gone and pushed back on the government, and they've all been libraries. What's great about being
a library is you're allowed to go and say no. There's a long history of people being rounded up for what it is they read, and bad things happening to them, and people remember this. And so where Google doesn't have that, or at least publicly hasn't said no, the library's sort of role makes it so that it's not an embarrassing thing to do. So we find that being a library is a good thing. Software: there's a
lot of software out there, and we're getting better at going and reproducing this
software by running emulators in the browser. This is really kind of taking the C emulators of old Apple or Commodore or Atari
software and cross-compiling them with Emscripten into JavaScript, and it runs in your browser. So you click, and it actually boots an old IBM PC in your browser [unclear]. It turns out this is very, very popular, as I guess a lot of people spent a lot of their early days playing games. But anyway, they're now back, so Oregon Trail and all these other games are very popular. What we're probably best known for is crawling the World
Wide Web. So how many people have used the Wayback Machine? Yeah. So the Wayback Machine is a way you can see the Web as it was. So many people are pouring their lives into the web, but web pages only last on average 100 days, so we go through and we try to
archive them. We started when it was pretty small; it is now friggin' big. We archive about a billion pages every week to be able to create this Wayback Machine. This is what [unclear] looked like in 1996:
pets.com, [unclear], all old web
design. And I thought I'd look up what Chaos Communication
Camp looked like, so this is the Chaos website from 1997, but it actually looks a whole
lot like the current one, so
there's a little retro thing going on, so it's not quite as dramatic. There's another thing that this
has been used for. A user came back and said [unclear] there's been a change, and that our web collection is the only place they can show it. This is a press release by the United States government of the President being on an aircraft carrier saying "mission accomplished" about the war in Iraq, and it says that the president announces combat operations in Iraq have ended. Then a couple of days later they changed it, and they put in "major combat operations have ended." They didn't make any notice that they changed the press release. It's George
Orwell, right, to be able to go and rewrite your press releases from the past. [unclear] So this is an example of why we want something like a Wayback Machine. Another thing is we have a web
archiving tool that's a subscription-based service that a lot of companies, libraries, and museums pay us to do, which is helpful to keep the lights on, and we now have 17 hundred curated collections, like the
Japanese disaster, where people come together to go and say: you should archive this
because soon it'll be offline, or archive these things, they'll soon be offline.
So it's been a community project to be able
to go and work together to build up
what the key things about an event are, to make sure that we have really done it well. So even the web is doable. The
World Wide Web collection of ours is about 10 petabytes of data; it's growing a few petabytes every year. We have about 450 billion pages, and we get about 600 thousand
people a day using it. It's a database of 450 billion pages that gets queried about 2 thousand times a second, and so that's sort of what the Wayback Machine is, and it's been much more popular than we thought it would be. But even rare books and letters:
you can go and do these things by photographing them and making them available. Next up for
us is personal digital archives. How are we going to do this stuff that's splintered in all sorts of places, so people don't even have it? It's not boxes in people's basements anymore; it's not even hard drives that you have. It's these Flickr sites and these other places where you've gone and put your memories, and, guaranteed, these guys are going down. Even the rich companies like Google: Google Video, remember Google Video? It used to exist; there were 6 million videos on it, but they took it away. Yahoo Video is now gone. GeoCities is famously gone.
Apple Computer, the most valuable company in the whole world, couldn't figure out how to run 200 terabytes of MobileMe on a continuous basis, so we archived that and are making it available. So don't count on these places; they don't have your best interest in mind. They'll turn it off whenever they want to. So how do you go and preserve stuff, the physical and the digital? If we want to build a
Library of Alexandria version 2, well, what's the lesson of the Library of Alexandria version 1? What's it best known for? Burning, right? It's best known for not being here anymore. So how do we go and
make it so that we don't do that again? Well, let's have multiple copies. If we put multiple copies in multiple places that will have different fault modes, then I think we have a better shot at it. So we gave a copy back in 2002 to the new Library of Alexandria, and this is actually [unclear] their first floor. If you get to Alexandria, go; it's completely great, [unclear] and a
library. And so this is what it looked like in 2002. XS4ALL
gave us free space, and they have for the last 10 years. Thank you, XS4ALL, in
Amsterdam. This is what it looked like in 2008. This is what the Wayback
Machine looked like: we came up with this idea of doing our data centers in shipping containers. Sun ran with that, and they gave us one of these. So the whole Wayback Machine was this. So if you get asked how big it is, well, you have that now: the Web is
8 feet by 8 feet by 20 feet. For several years, when you used the Wayback Machine you were actually using this shipping container that sat
outside on Sun's campus, which is pretty great. We've now made these prettier machines, because we bought this church,
and so there are these blinking lights inside. That is how we've been able to scale up. We've also started to scale up
physical collections, because a lot of libraries are throwing things away. So we basically
take our ideas of how to do this really compactly: we use shipping containers inside warehouses. So we now have books or music or video there in boxes, protected by a box, protected by a shipping container, protected by a warehouse, protected by a nonprofit. So the idea is to try to have
layers of protection against certain types of attack. Is that enough? Well, OK, we've got
a copy in an earthquake zone, the Middle East, and a flood zone. What can go wrong? So I think we need some other copies in other places, and more participation toward keeping our whole society and cultural heritage. A couple of other things that we're worried about at the Internet Archive are ways of trying to be sustainable in the era of corporations. Corporations are basically getting real strangleholds on certain types of things, like end-user access, so we've been doing free public Wi-Fi hotspots and the like. So every time we get a building, we
put free Wi-Fi with no passwords and stuff up, and that causes people to use the free Wi-Fi in some cases and get mad in other cases, both of which are perfectly fine with us. We've
even gone and tried to apply open-source ideas to housing. Housing has become a real problem because of the debt burden, so we're starting to try to transition housing to be able to support
nonprofit workers with debt-free housing. These are ideas we're really trying to figure out. If you're going to go and build a new housing system for people that are working in the open-source world, then you're
going to need an organization like a financial institution, so we started a credit union, and then tried to work with some Bitcoin companies. Oh, did that make regulators mad. Anyway, so the idea is to not just
try to make the data sustainable and copy it forward, but how do you go and get the communities around these materials to be able to have lifetimes that work as well? And Bitcoin: people have been donating Bitcoin. If you've donated Bitcoin,
thank you very much. We pay our employees partially in Bitcoin, and that's all a working project that needs
help. There's a lot; everything needs help, but here are a few of my top ones. We're redesigning the Wayback Machine and could really use some help on trying to make this new and different and neat. How do you do search at a level that we can actually do? Searching 450 billion pages is too much for us, but maybe [unclear] search and the like. We're trying to get Elasticsearch to be able to make all of our stuff
more accessible, and the full text much easier to use. We're trying to build software that can be distributed, so that people can archive CDs, LPs, and books in a distributed way, to go and participate in bringing these things online. There are a couple of programmers in Amsterdam; we're looking for more programmers to help. And the distributed Web is sort of a
new idea: that there is some mechanism we can build together to go make a next-generation Web that we don't have to archive by taking snapshots; we can actually archive working websites so they live on for
tens or hundreds of years after the original administrators are gone. So in
conclusion: universal access to all knowledge, it's possible to do. I think it could be one of the greatest things humans have
ever done. I think it could be one of the things that our generation gets to go and offer the world, kind of like the man on the moon or the Library of Alexandria in the ancient days. We can pull this together: we have the technology to be able to, and we've got the political will to live in an open
environment, as long as we don't lose that. And we have to act pretty fast, I'd say, because most kids have turned to screens instead of books or old materials to be able to find out information. They're learning from whatever it is they can get a hold of, and the best that we have to offer is not on the net yet. So we need to be bolder than we've been to go and make universal access to all knowledge happen. Thank you very much.
