
Software as a first-class citizen in web archives


Speech transcript
So, this is the last talk for today, so first of all I really appreciate that so many of you are still here. Thank you. As a start: my name is Helge Holzmann, I am from L3S, and I am mainly working on an EU project that mainly deals with web archives. What I am presenting today is actually joint work with [unclear], in the project [unclear]. I would like to talk about software as a first-class citizen in web archives.
We have already heard quite a bit about how to archive software and how to make software more sustainable. There was CRAN in the keynote this morning, which is basically an archive for R software, and then we heard about good practices for creating sustainable software to make it easier to archive that software. However, there are lots of other types of software; not all software is R, and not everyone is following these good practices. So I would like to propose web archives as a kind of universal solution. I would like to start by talking about web archives.
I hope many of you are already familiar with web archives. For those who do not know about them: they are maintained by a couple of organizations; there are a few here, but there are actually many more. First of all, there is the Internet Archive, which is basically the organization that holds the biggest web archive today. What these organizations do is basically this: they crawl the web, and they store and archive everything they encounter. That is of course web pages, but also the images on web pages, files that are linked from there, scripts, basically anything. All of that is stored in big data files called WARC files, which are actually standardized. That is what a web archive is; quite simple, just archives of the web. You have a standard, but in the end these are just big files.
Now you need to get access to those files, and for that there is this really great tool called the Wayback Machine. It is developed by the Internet Archive and is basically the main tool to access web archives. The way this works is: you enter a URL, in this case the New York Times website, and then you get this calendar view of all the archived versions of this website. You can just click on one of these entries here, and what you then get is the archived version of that website; basically, the Wayback Machine replays this archived version. This one is from 2012, the New York Times homepage, and it shows the result of the election in 2012, when Obama was elected president, or actually re-elected. So that is the Wayback Machine.
As I just said, you enter a URL there and then choose the timestamp. So basically the identifier of such a resource looks like this. This is really the URL of a resource in the Wayback Machine, and this is how it can be identified: there is the prefix, in this case of the Internet Archive, and then you have the timestamp plus the URL.
However, there are quite a few challenges with this. One of the big challenges is that if the URL changes, you cannot find that resource again. If the URL of the New York Times homepage changed, which is probably quite unlikely, you would need to know the new URL to find it. So there is no kind of logical object as a container; if you want the New York Times homepage, you need to enter the URL. Also, the search capabilities are quite limited: there is no site search on the Wayback Machine, so there are quite a few limitations. And the timestamps here are random; they only represent the times when the website was crawled, without any particular meaning.
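The identifier scheme described above (archive prefix, then timestamp, then original URL) can be illustrated with a minimal sketch. It assumes the public replay prefix `https://web.archive.org/web/` and 14-digit `YYYYMMDDhhmmss` timestamps; the function names are hypothetical and only serve this illustration.

```python
from datetime import datetime
from typing import Tuple

# Assumed replay prefix of the Internet Archive's Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(url: str, when: datetime) -> str:
    """Build a replay URL: prefix + 14-digit timestamp + original URL."""
    return f"{WAYBACK_PREFIX}{when.strftime('%Y%m%d%H%M%S')}/{url}"


def parse_wayback_url(replay_url: str) -> Tuple[datetime, str]:
    """Split a replay URL back into its capture time and original URL."""
    if not replay_url.startswith(WAYBACK_PREFIX):
        raise ValueError("not a Wayback Machine replay URL")
    rest = replay_url[len(WAYBACK_PREFIX):]
    # Only the first slash separates the timestamp from the original URL.
    timestamp, _, original = rest.partition("/")
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S"), original


if __name__ == "__main__":
    link = wayback_url("http://www.nytimes.com/", datetime(2012, 11, 7, 12, 0, 0))
    print(link)
    print(parse_wayback_url(link))
```

Note that the timestamp in such an identifier only says when the crawler captured the page, which is exactly the limitation raised in the talk.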
Something that would be more desirable, I guess, is something like this, where instead of the URL and the timestamp you put the object and something like an event or a version. For example, you want to see the website of Obama at the election 2012, or, in the case of software, you may want to see the software Mathematica at version 5.2. These are just examples, of course; that does not really exist yet, but that is probably what we want.
One way to go about this is to do what you do on the live web. When you surf the web and you do not know the current URL of a website, you use a search engine like Google or Bing, enter the term that you are interested in, and then you get the URL. This is an example of such a search engine for web archives. It is developed by us at L3S and is called Tempas. What it can do: if you are interested in Mathematica, again, you enter the term, select the time span of interest, in this case from 2009 to 2013, and you get all the URLs.
This is ranked by how many links point to that URL in a specific year. What is interesting here is that on several positions, among them the second and the fourth, you see the Mathematica homepage under different URLs. If you look at the timestamps, you see that the first one is from 2010 and the last one from 2013, but all of them are linked in all years; below each hit you can see the years in which there are links to that page. [unclear]
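The ranking idea mentioned here, ordering candidate URLs by how many links point to them in the years of interest, could look roughly like the sketch below. This is not Tempas's actual implementation: the link records, the function names, and the choice to sum in-links over the whole interval are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical link record: (year of the crawl, linking page, linked URL).
Link = Tuple[int, str, str]


def rank_by_inlinks(links: Iterable[Link], candidates: Iterable[str],
                    start: int, end: int) -> List[Tuple[str, int]]:
    """Rank candidate URLs by the number of in-links observed in [start, end]."""
    wanted = set(candidates)
    counts: Counter = Counter()
    for year, _source, target in links:
        if start <= year <= end and target in wanted:
            counts[target] += 1
    # Most-linked first; ties are broken alphabetically for stable output.
    return sorted(((url, counts[url]) for url in wanted),
                  key=lambda item: (-item[1], item[0]))


def linked_years(links: Iterable[Link], url: str) -> List[int]:
    """Years in which at least one link pointed to the given URL."""
    return sorted({year for year, _source, target in links if target == url})


if __name__ == "__main__":
    demo = [
        (2008, "a", "http://old.example/m"),
        (2009, "b", "http://old.example/m"),
        (2010, "c", "http://old.example/m"),
        (2011, "d", "http://new.example/m"),
        (2012, "e", "http://new.example/m"),
        (2013, "f", "http://new.example/m"),
    ]
    print(rank_by_inlinks(demo, ["http://old.example/m", "http://new.example/m"], 2009, 2013))
    print(linked_years(demo, "http://old.example/m"))
```

Because the count is tied to link years rather than crawl times, the same software homepage can surface under both its old and its new URL.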
If you click on the results, you see that this first version of the Mathematica page was only archived until 2009 or 2010, while this link down here is only archived from 2010 on. The reason for that is that the URL changed. So if you are interested in the Mathematica URL before 2010, you cannot go to the current URL; you need that other one. And as you can see in the URL of this search result, that is already much closer to what I showed a few slides before, much closer to what is desired: we have "Mathematica" in there and then a time interval that points to the resource.
So why is this actually interesting at all? There are mainly two reasons. The first one is that archiving software, especially scientific software, is really crucial, because scientific software is used everywhere; we saw that in the earlier talks already. In certain disciplines, like computer science, software might really be the object of research itself, the thing that the paper talks about. In other disciplines, like the humanities, it is at least a tool that is used, and it might be crucial to understand the results as well.
The other reason why we should think about software in web archives is that archiving the software applications themselves is not always possible. As I said before, if the developers actually followed these good practices, then you can of course do that, because the source code is available and everything is free. But there is commercial, proprietary software as well, and there are web services, which can never be archived. There are also legal issues: are you allowed to archive software without having a license? Do you need to buy all the software that is used in articles? And what can you do with the archived software; are you allowed to provide it publicly? [unclear]
To see even more why this is useful, I had a look at the usual definition of software. This definition seems to be widely used; it is by [unclear], from 1987 already. It defines software as a comprehensive term used to identify all of the non-hardware components of a computer or communication system. Software includes computer programs, data that is used by these programs, and any paper- or computer-based documentation that describes computer systems and how to use them. It determines what the computer does and how it does it. So basically the components of software are not just the program itself; it is also the data, the documentation, and the purpose of the software: what does the software do and how does it work?
Now you may be asking yourself which of these things you actually need. Of course, if you want to execute the software and rerun it, then you need the program. But in many other cases that is not necessary. If you read a paper and there is a software mentioned, it is often enough that you can just understand what the software is. If I told you that Microsoft Excel is a spreadsheet program, most of you would already understand what it is, even if you did not know Excel. So often a short description is enough to understand the purpose of the software and maybe how it achieves this. To understand the features of the software, the documentation is enough: if you want to look up whether the software has a certain feature, you can just read the documentation; you do not need the actual program. And if you read results in a paper and want to know how these results were computed, then the code or the algorithms that are used in the software are what you need.
So we were interested in which of these things are actually available on the web, on the website of the software or on websites that talk about the software. We did a little analysis, in which we found that around 60 percent of all the software websites that we analyzed (these were all mathematical software websites; I will come to that) actually link to some sort of documentation. So even without having access to the actual program, you can read the documentation on 60 percent of the websites. And on around 30 percent of the websites there is even source code available, although we did not only analyze open-source software websites; we analyzed quite a lot of others as well, but still about a third provide some sort of source code. Artifacts are available, too. As artifacts we consider anything that can be downloaded here, so that may be the real application, the program itself, maybe datasets, anything like that.
What was quite interesting, actually: on the x-axis you see how often the software was mentioned in articles, and the number of websites that provide artifacts is much higher for the highly referenced software than the other way around. That probably means that if you provide some sort of artifact that can be downloaded, your software is more likely to be mentioned in papers.
But this analysis was basically done on the live web; we just looked at the current homepages. What we actually want is to allow users to understand previous results that are reported in scientific papers, so we need to go back in time. Coming back to the point that I showed earlier: it would be great if we could somehow tell the web archive "I would like to have the website of this software at a certain version". That is a bit difficult, because it is not so easy to connect the version of a software to a timestamp or date. But what we can look up is the website of the software as it was when it was used in a certain publication.
So our goal was to link software and publications with web archives. To do that, we started with a software catalog, swMATH, a software catalog for mathematical software. Because we do not know which version of the software was used in a publication, our best guess is just the publication date, so basically the year of publication. That is of course in most cases probably not correct, because the experiments were done before the paper was published, but it is the best guess that we can make, and we are working on improving it.
So again, this is swMATH, the portal that we started with. It is quite big: they have more than a thousand records, and for each of these software records it lists all publications in which the software is used, described or mentioned; there are more than 110 thousand articles listed right now. They are actually already following a publication-based approach: they start with publications and then find which software is mentioned in these publications. That is exactly what we had before: whenever a software is mentioned in a publication, that makes it scientific software. Then they create a record for the software and manually add items like the URL, descriptions, things like that.
So we started from here: we got all these websites, and we got all the publications, or at least a list of publications. We did some analysis, which I will show later on in this talk. We then connected swMATH with the web archive, and the way we did this is that swMATH has now integrated a new link. There is the URL of the website that corresponds to the software, and below that they added a new link. When you click on that link, a little icon appears behind each of the publications. The icon shows you whether there is an archived version available in the year of the publication; otherwise it shows a gray icon, like this one here, which shows that there is no archived version for that year. If you click on that icon, you go to the website of the software, but in the Wayback Machine, framed in this thing that we call a time portal. There you see the website of the software that you had opened in swMATH before, in the state of the software in this publication, or at least in that publication year.
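The linking step described above, going from a publication to the software website as archived around the publication year, can be sketched as follows. This is an illustration only, not the actual swMATH or time portal code: the `Publication` record, the function names, and the choice of 1 July as the target date are assumptions. The sketch relies on the Wayback Machine redirecting a requested timestamp to the closest available capture.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Publication:
    pub_id: str  # hypothetical identifier of the publication record
    year: int    # publication year, used as best guess for the software state


def mid_year_timestamp(year: int) -> str:
    """14-digit Wayback timestamp for 1 July of the given year (assumed target)."""
    return f"{year:04d}0701000000"


def archive_links(homepage: str, publications: List[Publication]) -> Dict[str, str]:
    """Map each publication ID to a replay URL of the homepage in that year."""
    return {
        pub.pub_id: f"https://web.archive.org/web/{mid_year_timestamp(pub.year)}/{homepage}"
        for pub in publications
    }


if __name__ == "__main__":
    pubs = [Publication("pub-1", 2009), Publication("pub-2", 2012)]
    for pub_id, link in archive_links("http://software.example.org/", pubs).items():
        print(pub_id, link)
```

Using the publication year is only a proxy for the version that the authors actually used, as the talk points out.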
What we find is that you can actually get a lot from this website. If you do not know what Singular is, you immediately see that it is a computer algebra system, and you see the current version here. Probably this is not the version that the authors used, but it is already close: if this is version 1.8, you know the authors did not use version 10; they used something around that. So this already helps a bit. We also add this bar here, which is generated automatically for each software website. It automatically detects links on the website that are software-specific, like links to documentation or links to artifacts, and the user can directly go to these things. As another feature, you can also switch to the live website and compare it to the archived version.
What is maybe more interesting: if you look at the URL of this time portal, you can see that this is much closer to what I showed before, the desired state. We do not have the website URL in this URL anymore; we just have the software, [unclear], and a publication ID that identifies this record in the Wayback Machine, in the web archive. And instead of just showing you the URL here, we show that this is the software Singular in this publication. So it is much more software-centric, so to speak. As I just said, we have this bar where we point the user to resources for specific features. And we also assign a certain meaning to the timestamps: it is not just a random time anymore, the time at which the crawler happened to capture this website, but it actually gets a meaning, in this case the publication year. We try to pick the state in the middle of that year, but that needs to be improved.
Now, another question that we asked is: what has been archived so far? It is nice that we have these views, but they are only helpful if the websites are in the archive, in the Wayback Machine. So we started from the top publication that mentions the software, based on the number of citations; that is here on the x-axis. We are looking at how many of these software websites are actually archived. There are actually around 50 percent archived. The red part covers websites that were crawled, but where some pages are not allowed to be shown, for example because of robots.txt. But then there are still around 40 percent really archived, really available in the Wayback Machine, which is not too bad. About half of that is also available in the year of this top publication. So around 40 percent are available in general, half of that is really available in the year of the top publication, and that is something that needs to be improved.
Another thing that is quite interesting is how many of the websites have changed since the time when they were mentioned in this top publication. As you can see from the dark blue bars there, almost all of these websites actually changed. That shows the need for creating archives of these websites, because as the software evolves, these websites evolve as well: the documentation is updated, features are updated, and things like that. So we really need to create those archives to understand or reproduce the software at some time in the past.
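The coverage figures discussed above boil down to two checks per software website: is there any capture at all, and is there a capture in the year of the top publication? A minimal sketch under stated assumptions: captures are given as 14-digit Wayback timestamps, and the data layout and function names are hypothetical.

```python
from typing import Dict, Iterable, List


def archived_in_year(timestamps: Iterable[str], year: int) -> bool:
    """True if any capture timestamp (YYYYMMDDhhmmss) falls in the given year."""
    return any(int(ts[:4]) == year for ts in timestamps)


def coverage(captures: Dict[str, List[str]], top_pub_year: Dict[str, int]) -> Dict[str, float]:
    """Share of sites archived at all, and archived in their top publication year."""
    total = len(captures)
    if total == 0:
        return {"archived": 0.0, "in_publication_year": 0.0}
    archived = sum(1 for stamps in captures.values() if stamps)
    in_year = sum(
        1 for site, stamps in captures.items()
        if archived_in_year(stamps, top_pub_year[site])
    )
    return {"archived": archived / total, "in_publication_year": in_year / total}


if __name__ == "__main__":
    captures = {
        "site-a": ["20100315120000", "20120101000000"],
        "site-b": ["20140101000000"],
        "site-c": [],
        "site-d": [],
    }
    years = {"site-a": 2012, "site-b": 2012, "site-c": 2012, "site-d": 2012}
    print(coverage(captures, years))
```

Detecting whether a site changed since the publication would additionally require comparing page content between captures, which this sketch leaves out.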
This diagram on the right-hand side just shows, if the website was not archived in that particular year of the top publication, when it was archived instead. What is actually pretty good is that it is almost always very close to that date: one year after, one year before, or at most two years after. [unclear] So that is the only issue, and we can do better there.
Here are some ideas that we would like to implement in the near future. One of them, at the core of these, is to create so-called micro archives that comprise all the websites for a particular software: not just the homepage, but maybe also discussion boards that talk about a certain software, maybe repositories, maybe GitHub pages, all that. And to provide archiving-on-demand features, so that an author who is using some software can click a button, maybe in swMATH or in other repositories, and say "I want to archive this software now". This micro archive would then automatically be created for this particular software at this particular date when the author used it. Ideally, the author would then also be provided with some handle, maybe a DOI or something like that, that they can put into their paper, and that reference would point to that archive.
So actually those micro archives could be used as landing pages for software. When we just heard about this software journal, where basically really short papers are used as placeholders for the actual software, I was wondering whether we could use micro archives, which have all the websites that belong to a software, as landing pages that can be referenced instead of such short papers. Those websites already have most of the information that you are typically interested in: they have links to documentation, they have links to downloads, all these things that we had in the analysis.
Once we have those micro archives, we can automatically derive metadata from them. We can find version numbers, which should be quite easy because they typically have a pretty unique format. Once we find that on the website, we can assign a certain version of the software to a certain crawl date, and we can then label snapshots and assign metadata to them. For instance, especially in open-source software, the list of authors or contributors is often quite long. Instead of having someone add that to a database manually, we could derive it from the archived websites and also keep track of how many authors were added and how many were removed, even if the software is not available in a repository like GitHub, which supports that anyway. And of course we could think about generalizing this approach to entities other than software, because, as shown in the beginning, this is not only applicable to software but to persons or companies as well.
Some conclusions: the web actually provides access to a lot of software quite comprehensively; not only to the software as the executable program itself, but to a lot of additional material around it, like documentation, descriptions, metadata, all that. Already 50 percent of the software websites are archived; not all of them at the time of the publications in which they are mentioned, but at least they are archived. And the archives are growing, so in the future crawls will be much more frequent, and hopefully we will have a state for each software. But, as I said, we are working on these on-demand solutions, where authors or editors, or maybe the publishers in the end, can click this button and automatically trigger the archiving.
For more details, there are two related papers. One is specifically on the analysis that I showed; it was published at TPDL last year. The other one is on Tempas, the search engine that I showed earlier, and it will be published next month at the Web Science Conference. Thank you very much. If you want to try it yourself, you can just go to swmath.org; that is the catalog, and there are these links that I just showed, which you can try. Thank you.
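The future-work idea from the talk of finding version numbers on archived pages and assigning them to crawl dates could be prototyped along the following lines. This is a rough sketch, not something the talk presents as implemented: the regular expression, the function names, and the sample page texts are all assumptions for illustration.

```python
import re
from typing import Dict, Optional

# Assumed pattern: a keyword such as "version", "release" or "v",
# followed by a dotted number like 5.2 or 3.1.6.
VERSION_PATTERN = re.compile(
    r"\b(?:version|release|v)\s*(\d+(?:\.\d+)+)", re.IGNORECASE
)


def find_version(page_text: str) -> Optional[str]:
    """Return the first version-like string found in the page text, if any."""
    match = VERSION_PATTERN.search(page_text)
    return match.group(1) if match else None


def label_snapshots(snapshots: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each crawl date to the version detected on that archived page."""
    return {crawl_date: find_version(text) for crawl_date, text in snapshots.items()}


if __name__ == "__main__":
    # Hypothetical extracted text of two archived captures of a homepage.
    snapshots = {
        "20040101": "Welcome to our homepage",
        "20050601": "Mathematica Version 5.2 is now available",
    }
    print(label_snapshots(snapshots))
```

A real pipeline would need to extract text from the archived HTML first and handle pages that mention several versions at once.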

Metadata

Formal metadata

Title Software as a first-class citizen in web archives
Series title 2nd Conference on Non-Textual Information: Software and Services for Science (S3), May 10-11, 2017 in Hannover
Part 7
Number of parts 13
Author Holzmann, Helge
License CC Attribution 3.0 Germany:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner specified by them.
DOI 10.5446/31033
Publisher Technische Informationsbibliothek
Year of publication 2017
Language English

Content metadata

Subject area Computer science
Abstract The Web contains all kinds of information today. Web archives preserve this data and make it long-term available. However, access is usually only provided by a URL and a timestamp. Hence, there is no deeper meaning attached to archived resources, although collectively they can represent entities, such as software. Moreover, documentation and source code that is available at different points in time, can even represent different versions of a software. Treating them as first-class citizens in web archives enables reliable and permanent references to software, which is normally hard to manage.
