Software as a first-class citizen in web archives

Video in TIB AV-Portal: Software as a first-class citizen in web archives

Formal Metadata

Software as a first-class citizen in web archives
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

The Web contains all kinds of information today. Web archives preserve this data and make it available in the long term. However, access is usually only provided by a URL and a timestamp. Hence, there is no deeper meaning attached to archived resources, although collectively they can represent entities, such as software. Moreover, documentation and source code available at different points in time can even represent different versions of a piece of software. Treating such entities as first-class citizens in web archives enables reliable and permanent references to software, which are normally hard to manage.

So, this is the last talk of the day, and first of all I really appreciate that so many of you are still here. Thank you. To start, a few words about me: I mainly work on an EU project that deals with web archives, but what I am presenting today is joint work carried out in another project. I would like to talk about software as a first-class citizen in web archives.
We have already heard quite a bit about how to archive software and how to make software more sustainable. There was CRAN in the keynote this morning, which is basically an archive for R software, and then we heard about good practices for creating sustainable software, which make that software easier to archive. However, there are lots of other types of software, and not everyone follows these good practices. So I would like to propose this as a kind of universal solution, and I'd like to start by talking about web archives.
So what is a web archive? I hope many of you are already familiar with web archives. For those who are not: web archives are maintained by a couple of organizations. Here are a few of them; there are actually many more. First of all there is the Internet Archive, which is basically the organization that hosts the biggest web archive today. What these organizations do is basically crawl the web and store and archive everything they encounter. That is of course web pages, but also the images on those pages, the files that are linked from there, scripts, basically anything. All of that is stored in big data files called WARC files, and that format is actually standardized. That is what a web archive is. It is quite simple: archives of the web, with a standard, and in the end just big files.
Now you need to get access to those files, and for that there is a really great tool called the Wayback Machine. It is developed by the Internet Archive and is basically the main tool to get access to web archives. The way it works is that you enter a URL, in this case the New York Times website, and you get a calendar view of all the archived versions of this website. Then you can just click on one of them, and what you get is the archived version of that website; basically, the Wayback Machine replays the archived version. This one is the New York Times homepage from 2012, and it shows the result of the election in 2012, when Obama was elected president, re-elected actually. So that is the Wayback Machine.
As I just said, you enter a URL there and then choose the timestamp. So the identifier of such a resource looks like this. This is really the URL of an archived resource in the Wayback Machine, and this is how it can be identified: there is the prefix, in this case of the Internet Archive, then you have the timestamp, and last the URL.
However, there are quite a few challenges with this scheme. One of the big challenges is that if the URL changes, you cannot find the resource again. If the URL of the New York Times homepage changed, which is probably quite unlikely, you would need to know the new URL to find it in the archive. So there is no logical object, no container: if you want the New York Times homepage, you need to enter the URL. The search capabilities are also quite limited. There is no site search on the Wayback Machine, so there are quite a few limitations. And the timestamps are random: they only represent the times when the website was crawled and do not carry any particular meaning.
Something more desirable, I guess, would be something like this, where instead of the URL and the timestamp you put the object and some kind of identifier for the state you are interested in. For example, you want to see the website of Obama at the election in 2012, or, in the case of software, you may want to see the website of Mathematica at version 5.2. These are just examples, of course. This does not really exist yet, but it is probably what we want.
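The addressing scheme described above (archive prefix, then timestamp, then original URL) can be sketched in a few lines of Python. This is only an illustration and not part of the talk: the prefix `https://web.archive.org/web/` and the 14-digit `YYYYMMDDhhmmss` timestamp are the form the Wayback Machine uses publicly, while the function names `wayback_url` and `split_wayback_url` are made up for this example.

```python
from datetime import datetime

# Public replay prefix of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web/"
STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def wayback_url(url, captured_at):
    """Build prefix + timestamp + original URL (hypothetical helper)."""
    return f"{WAYBACK_PREFIX}{captured_at.strftime(STAMP_FORMAT)}/{url}"


def split_wayback_url(archived_url):
    """Split an archived URL back into (capture time, original URL)."""
    if not archived_url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine replay URL")
    rest = archived_url[len(WAYBACK_PREFIX):]
    # The timestamp ends at the first slash; everything after is the URL.
    stamp, _, original = rest.partition("/")
    return datetime.strptime(stamp, STAMP_FORMAT), original
```

Note that nothing in such an identifier says what the resource is or why this capture time matters, which is exactly the limitation discussed in the talk.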
One way to go about this is to do what you do on the live web: when you surf the web and do not know the URL of a website, you use a search engine like Google or Bing, enter the term you are interested in, and get the URL. This is an example of such a search engine for web archives. It was developed by us and is called Tempas. If you are interested in Mathematica, you enter that term and select the time span of interest, here the years from 2009 to 2013, and you get all the matching URLs, ranked by how many links point to each URL in a specific year.
What is interesting here is that at the second and at the fourth position you see the Mathematica homepage, but with different URLs. If you look at the years, you see that the first one is from 2010 and the last one from 2013, but all of them are linked in all years; below each hit you can see the years in which there are links to that page. You can also hide non-archived results. If you click on the hits, you see that the first Mathematica page was only archived in 2009 and 2010, while the link down here is only archived from 2010 on. The reason is that the URL changed. So if you are interested in the Mathematica website before the change, you cannot go to the current URL; you need the old one. And as you can see in the URL of this search result, this is already much closer to what I showed on the slide before.
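The ranking idea mentioned above, ordering URLs by how many links point to them in the selected years, can be illustrated with a toy sketch. The talk does not describe the actual implementation of the search engine, so everything here is an assumption: the input format of `(url, year, inlinks)` tuples, the function name `rank_by_inlinks`, and the choice to sum the link counts over the selected time span.

```python
from collections import defaultdict


def rank_by_inlinks(link_counts, first_year, last_year):
    """Rank URLs by links pointing to them within [first_year, last_year].

    link_counts: iterable of (url, year, inlinks) tuples (hypothetical format).
    Returns a list of (url, total_inlinks, years_with_links), best first.
    """
    totals = defaultdict(int)
    years = defaultdict(set)
    for url, year, inlinks in link_counts:
        if first_year <= year <= last_year and inlinks > 0:
            totals[url] += inlinks
            years[url].add(year)
    # Highest total first; ties are broken alphabetically by URL.
    ranked = sorted(totals, key=lambda url: (-totals[url], url))
    return [(url, totals[url], sorted(years[url])) for url in ranked]


# Made-up counts for an old and a new URL of the same homepage.
records = [
    ("old.example.org/mathematica", 2009, 120),
    ("old.example.org/mathematica", 2010, 80),
    ("new.example.org/mathematica", 2010, 60),
    ("new.example.org/mathematica", 2012, 90),
    ("new.example.org/mathematica", 2015, 500),  # outside the time span
]
for hit in rank_by_inlinks(records, 2009, 2013):
    print(hit)
```

With these invented numbers the old and the new URL both appear in the result list, each with the years in which it was linked, which mirrors the situation of the homepage whose URL changed.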
It is much closer to what is desired: we have Mathematica in there and then a time span, and that points to the resource. So why is this actually interesting? There are mainly two reasons.
The first one is that archiving software, especially scientific software, is really crucial, because scientific software is used everywhere; we saw that in the earlier talks already. In certain disciplines, like computer science, the software may really be the object of the research, the thing the paper talks about. In other disciplines, like the humanities, it is at least a tool that is used, and it may be crucial for understanding the results as well.
The other reason why we should think about software in web archives is that archiving the software applications themselves is not always possible. As said before, if the developers follow these good practices, you can of course do that, because the source code is available and everything is free. But there is commercial and proprietary software as well, and there are web services, which can never be archived. There are also legal issues: are you allowed to archive software without having the license? Does that mean you need to buy all the software that is used in articles? And what can you do with the archived software: are you allowed to provide it publicly?
To see even more clearly why this is useful, I looked at a common definition of software. This definition seems to be widely used and dates from 1987 already. It defines software as a comprehensive term used to identify all of the non-hardware components of a computer or communication system. Software includes computer programs, data that is used by these programs, and any paper or computer-based documentation that describes computer systems and how to use them, and it determines what the computer does and how it does it. So basically the components of software are not just the program itself but also the data and the documentation, and then there is the purpose of the software: what the software does and how it does it.
Now you should ask yourself which of these things you actually need. Of course, if you want to execute the software and run it, then you need the program. But in many other cases that is not necessary. If you read a paper in which a software is mentioned, it is often enough to understand what the software is. If I tell you that Microsoft Excel is a spreadsheet, most of you understand what it is, even those who do not know Excel. So often a short description is all you need to understand the purpose of the software and maybe how it achieves this. To understand the features of the software, the documentation is enough: if you want to look up whether the software has a certain feature, you can just read the documentation; you do not need the actual program. And if you read results in a paper and want to know how those results were computed, then the code or the algorithms used in the software are what you need.
So we were interested in which of these things are actually available on the web, on the website of a software or on websites that talk about the software. We did a little analysis, in which we found that around 60 percent of all the software websites that we analyzed (these were mathematical software websites; I will come to that) actually link to some sort of documentation. So even without having access to the actual program, you can read the documentation for 60 percent of the websites. On around 30 percent of the websites there is even source code available, although we did not only analyze open-source software websites; we analyzed quite a few others as well, but still that share provides some sort of source code. Websites also provide artifacts. As artifacts we consider anything that can be downloaded, so that may be the actual application, the program itself, maybe datasets, anything like that.
What was quite interesting is this: on the x-axis you see how often the software was mentioned in articles, and the number of websites that provide artifacts is much higher for the highly referenced websites than the other way around. That probably means that if you provide some sort of artifact that can be downloaded, your software is more likely to be mentioned in papers.
But this analysis was basically done on the current web; we just looked at the current homepages. What we actually want is to allow readers to understand previous results that are reported in scientific papers, so we need to go back in time. Coming back to the point that I showed earlier, it would be great if we could somehow tell the web archive: I would like to have the website of this software at a certain version. But that is hard to do, because it is not so easy to connect the version of a software to a timestamp or date. What we can look up is the website of the software as it was when it was used in a certain publication. So our goal was to interlink software and publications with web archives.
To do that, we started with a software catalog, swMATH, a catalog for mathematical software. Because we do not know at what time a software was used for a publication, our best guess is the publication date, so basically the year of publication. That is of course in most cases probably not correct, because the experiments were done before the paper was published, but it is the best guess that we can make, and something we are working on improving.
So again, this is swMATH, the catalog that we started with. It is quite big: they have more than a thousand records, and for each of these software records it lists all publications in which the software is described or mentioned; there are more than 110 thousand articles right now. They are actually already following a publication-based approach: they start with publications and look at which software is mentioned in them. That is exactly what we had before: whenever software is mentioned in a publication, that makes it scientific software. They then create a record for the software and manually add items like the URL, descriptions, and things like that.
So we started from here. We got all these websites and all the publications, or at least a list of publications, and we did some analysis, which I will show later on. Then we connected swMATH with the web archive. The way we did this is that swMATH has now integrated a new link. There is the URL of the website that corresponds to the software, and below that they added a new link. When you click on that link, you get a little icon behind each of the publications, which shows you whether an archived version is available for that publication. If it shows a gray icon, like this one here, there is no archived version for it. If you click on the icon, you go to the website of the software in the Wayback Machine, but framed in this thing that we call a time portal, where you actually see that this is the software that you opened in swMATH before, and that this is the state of the software website for this publication, or at least for that publication year.
What we find is that you can actually get a lot from this website. If you do not know what Singular is, you immediately see that it is a computer algebra system, and you see the current version here. Probably this is not exactly the version that the author used, but it is already close: if this is version 1.8, you know the authors did not use version 10 or something like that. So this already helps a lot. What we also do is add this bar here. It is generated automatically for each software website, and it automatically detects links on the website that are software-specific, like links to documentation or links to artifacts, so that the user can directly go to these things. As another feature, you can also switch to the live website and compare the two.
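The automatic detection of software-specific links described above could look roughly like the following sketch. The talk does not say how the links are detected, so the keyword lists in `CATEGORIES`, the class `LinkCollector`, and the function `classify_links` are purely hypothetical; a real system would need something more robust than substring matching.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists; the talk does not specify the actual rules.
CATEGORIES = {
    "documentation": ("doc", "manual", "tutorial"),
    "source": ("source", "github", "repository"),
    "artifact": ("download", ".zip", ".tar.gz", ".exe"),
}


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def classify_links(html):
    """Assign each link to the first category whose keyword it contains."""
    parser = LinkCollector()
    parser.feed(html)
    result = {name: [] for name in CATEGORIES}
    for href, text in parser.links:
        haystack = f"{href} {text}".lower()
        for name, keywords in CATEGORIES.items():
            if any(keyword in haystack for keyword in keywords):
                result[name].append(href)
                break
    return result
```

Links that match none of the keywords, such as a news page, are simply left out of the result.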
What is maybe more interesting is the URL of this time portal. You can see that it is much closer to what I showed before, the desired state. We do not have the website's URL in there anymore. In this URL we just have the software and a publication ID, and these identify the record in the Wayback Machine and the web archive. Instead of showing you a URL at a timestamp, we show that this is the software Singular in this publication, so it is much more software-centric, so to speak. As I just said, we have this bar where you can find the software-specific resources and features. And we also assign a certain meaning to the timestamps: they are not just random times anymore, times when the crawler happened to capture this website, but they actually get a meaning, in this case the publication year. We try to pick the state in the middle of that year, but that needs to be improved.
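Picking "the state in the middle of the publication year" can be sketched as a nearest-capture lookup. This is a minimal sketch under stated assumptions: captures are given as 14-digit Wayback-style timestamps, 1 July is used as the middle of the year, and the function name `closest_capture` is invented for this example.

```python
from datetime import datetime

STAMP_FORMAT = "%Y%m%d%H%M%S"  # 14-digit capture timestamp


def closest_capture(timestamps, publication_year):
    """Return the capture closest to the middle of the publication year.

    The middle of the year is assumed to be 1 July; the talk only says
    that a state in the middle of the year is chosen.
    Raises ValueError if no captures are given.
    """
    target = datetime(publication_year, 7, 1)
    return min(
        timestamps,
        key=lambda ts: abs(datetime.strptime(ts, STAMP_FORMAT) - target),
    )
```

If there is no capture in the publication year itself, this simply falls back to the nearest capture before or after it.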
Now another question that we asked is what has been archived so far. It is nice that we have these URLs, but they are only helpful if the websites are actually in the archive, available in the Wayback Machine. So for each software we started from the top publication that mentions the software, based on the number of citations; that is here on the x-axis. Here we are looking at how many of these software websites are actually archived, and it is around 50 percent. The red area shows that many were crawled, but some pages are not allowed to be archived because of robots.txt. Still, around 40 percent are really archived, really available, which is not too bad. About half of that is also available in the year of the top publication. So around 40 percent are available in general, and half of that is really available in the year of the top publication; that is something that needs to be improved.
Another thing that is quite interesting is how many of the websites have changed since the time they were mentioned in the top publication. As you can see, it is all dark blue: almost all of these websites have changed. That shows the need for creating archives of these websites, because as the software evolves, the websites evolve as well: the documentation is updated, features are updated, and things like that. So we really need to create those archives to be able to look at the software as it was at some time in the past.
The diagram on the right-hand side shows, if the website was not archived in the particular year of the top publication, when a version was archived instead. What is pretty good is that it is almost always very close to that date: one year before or at most two years after.
But of course we think that we can do better than this, and there are some ideas that we would like to implement in the near future. One of them, at the core of these ideas, is to create so-called micro archives that comprise all the websites for a particular software: not just the homepage but also maybe discussion boards that talk about the software, maybe repositories, maybe GitHub pages, all of that. We want to provide this as an archiving-on-demand feature, so that an author who is using some software can click a button, maybe on swMATH or on a repository, and say: I want to archive this software now. The micro archive is then created automatically for this particular software at this particular date, when the author uses it. Ideally, the author would then also be provided with some handle, maybe a DOI, that can be put into the paper as a reference and points to that archive.
Those micro archives could actually also be used as landing pages for software. We just heard about this software journal where really short papers are basically used as placeholders for the software. I was wondering whether we could instead use micro archives, which have all the websites that belong to a software, as landing pages that can be referenced instead of such papers, because those websites typically already have most of the information that you are interested in: they have links to documentation and the other things that we analyzed.
Once we have those micro archives, we can derive metadata from them automatically. We can find all the versions, which should be quite easy because version numbers typically have a pretty unique format. Once we find a version on the website, we can assign that version of the software to a certain crawl date, and we can then label snapshots and assign metadata to them. For instance, especially in open source, the list of authors or contributors is often quite long. Instead of having someone add that to a database manually, we could derive it from the archived websites and also keep track of how many authors were added and how many were removed, even if the software is not available in a repository like GitHub, which supports that anyway. And of course we could think about generalizing this approach to entities other than software, because, as shown in the beginning, this is not only applicable to software but to persons or companies as well.
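The idea of deriving versions from archived snapshots and assigning them to crawl dates can be illustrated as follows. This is a sketch only: the snapshot format of `(crawl_date, page_text)` pairs, the regular expression `VERSION_RE` (which only matches dotted numbers after the word "version"), and the function name `first_seen_versions` are assumptions made for the example, not something described in the talk.

```python
import re

# Assumed pattern: the word "version" followed by a dotted number, e.g. "version 1.8".
VERSION_RE = re.compile(r"\bversion\s+(\d+(?:\.\d+)+)", re.IGNORECASE)


def first_seen_versions(snapshots):
    """Map each version string to the earliest crawl date that mentions it.

    snapshots: iterable of (crawl_date, page_text), with crawl_date as an
    ISO string 'YYYY-MM-DD' so that sorting is chronological.
    """
    first_seen = {}
    for crawl_date, text in sorted(snapshots):
        for version in VERSION_RE.findall(text):
            # Keep only the first (earliest) crawl date per version.
            first_seen.setdefault(version, crawl_date)
    return first_seen
```

The earliest crawl date is only an upper bound for the release date, since a version can be released some time before the crawler next visits the website.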
So, some conclusions. The web actually provides access to a lot of software quite comprehensively, and not only to the software as the executable program itself but to a lot of additional material around it, like documentation, descriptions, and metadata. Already 50 percent of the software websites are archived, not all of them at the time of the publications in which they are mentioned, but at least they are archived. Web archives are growing, so in the future websites will be captured much more frequently, and hopefully we will then have a state for each software. As I said, we are also working on on-demand solutions, where authors or editors or maybe the publishers in the end can click a button that automatically triggers the archiving.
If you are interested in more details, there are two related papers. One is specifically on the analysis that I showed; it was published at TPDL last year. The other one is on Tempas, the search engine that I showed earlier, and it will be published next month at the Web Science Conference. If you want to try it yourself, you can just go to swmath.org, that is the catalog; there are the links that I just showed, and you can try them. Thank you very much.