
Inconsistent XML as a barrier to reuse of Open Access Content

Speech transcript
Thanks for the useful introduction. I actually like the concept of having questions already during the introduction and also throughout the talk, and once I find the URL of my presentation you can come to those questions at any time. There was a Firefox button somewhere when I tested this, but I can't find it right now. Can someone please point me to where that browser is? I just need some browser, actually. And then I have PowerPoint, which I don't use. Here's an example browser. I want to find my talk, so I just type "JATS" into some search engine, and you can do the same thing. If you scroll down, this is the thing you want. Then we have the same 2013/2014 mixture that was just mentioned, so I just click on the 2014 issue here. You can do the same thing, but if you don't like Google for some reason, you can just go to the English Wikipedia directly and type in that string. Now I'm going to make the view a little larger so that you can actually read it; otherwise, you all have your laptops there, so you can follow the presentation there.

So, we're talking about inconsistent XML as a barrier to reuse of open access content, and we have a Q&A page on
which you're welcome to post. It's very easy, and you don't need to register; it's like any other Wikipedia page. Whenever you have questions, please go there.

In order to warm you up a little more after the coffee break, I have a few questions. For the first one, I estimated the number of people in the room as roughly 153. If you don't agree, it's like Wikipedia: you can just edit it. So, a few more questions. How many of you have ever had trouble figuring out the best way to mark up content? Be honest. All together, that looks like 80%. I'll type the figures in right here so that later on we have a record. Who has submitted a comment on the NISO site on the JATS standard? Let's say 5%, really roughly. If you know better, you can just go in and change things. How many of you have produced JATS content that is consumed by more than one set of users? OK, maybe 13.5%. And how many of you have tried to get content out of JATS-based repositories? OK. And who has tried that with repositories that contain content from more than one publisher? OK, these are the brave hands, three percent or so.

I'm doing this demonstration because I think in principle science could proceed in a similar fashion. Whenever data are being recorded (in this case I was recording for myself, because I'm interested in that), they should be public. Of course they should normally be acquired with some more precision than what I have done, but the good thing is that whenever someone knows better, they can go and fix it. Keep this as a mindset.

Now I'll tell you where I come from. The first thing is obviously Wikipedia, but there is more: Wikipedia has a number of sister projects, one of which is Wikimedia Commons, which is the media repository. Here I'm actually trying to get content into that from a JATS repository that you all know, PubMed Central. So we have a bot here that uploads video files. You can see some of the files it has uploaded
over the last few days, for instance this one. It has a file name, and it has a video, obviously. It has some description, some information on the date and the licensing, some provenance information, a file history that is all time-stamped, and so on, and at the bottom some categories. All this is created automatically by the bot on the basis of the XML available from PubMed Central. Running this bot is actually what I wanted to do, and in the course of doing that we have run into many, many problems due to inconsistent markup, and that's what I'm going to talk about. My background is in biophysics; I'm an experimental biophysicist, so before I started working on this bot I had no idea what XML was about. Now, with a handful of JATS, I'll tell you how to improve your system. OK. So the kinds
of problems that we found are roughly categorized by this slide, so to speak. They come in flavors: licensing, media types, keywords, and something we haven't looked at ourselves, references, which is something that people mentioned to us when we said that we were giving this talk here.

So, licensing information. What the bot does is spider PubMed Central for openly licensed articles. What "open" means is different for many people. There is an Open Definition, actually, which is quite clear, and the short version is: if you cannot use it on Wikipedia, it's not open. We want to use those materials on Wikipedia, so we are interested in the real open-access stuff, and most of the open access subset of PubMed Central actually doesn't fall into that; most of the open access subset is not open according to that criterion. In order to determine whether a certain article is open access or not, we look at the tags that are there for licensing. Ideally they would have a license URI included in the license tag. I may mix up the tag and attribute terms because I'm not an expert in that, but it's all written down in the paper in a rather correct fashion, and you can take it from there rather than from my words. So that's how it's supposed to look. The problem is that there are all sorts of variations. A very common one, for instance, is that there is no URI, but there is some human-readable text that may or may not make sense to a human reader, even irrespective of whether they actually know something about licensing. But there are also more interesting cases, like contradictions between the human-readable and machine-readable licenses. What we have here is an example where the machine-readable part points to a Creative Commons Attribution license, and the text here says that it does not permit commercial exploitation, but the Creative Commons Attribution license does permit that.
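
A minimal sketch of the kind of check a bot would need for the contradiction just described, written in Python with only the standard library. It assumes the license URI sits in an `xlink:href` attribute on the `license` element and the human-readable statement in `license-p`; the sample record, the function name `license_report`, and the phrase list are hypothetical illustrations, not the actual code of the bot discussed in the talk.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample record: a CC BY URI next to a non-commercial statement.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta><permissions>
    <license license-type="open-access"
             xlink:href="http://creativecommons.org/licenses/by/3.0/">
      <license-p>This article is distributed under a Creative Commons
      Attribution License, which does not permit commercial exploitation.</license-p>
    </license>
  </permissions></article-meta></front>
</article>"""

# Illustrative phrases only; real license statements vary far more widely.
NONCOMMERCIAL_PHRASES = ("not permit commercial", "non-commercial", "noncommercial")


def license_report(xml_text):
    """Return the license URI, the license text, and a contradiction flag."""
    root = ET.fromstring(xml_text)
    lic = root.find(".//license")
    if lic is None:
        return {"uri": None, "text": "", "contradiction": False}
    uri = lic.get(XLINK_HREF)
    # Collapse whitespace so phrases split across lines still match.
    text = " ".join("".join(lic.itertext()).split())
    text_forbids_commercial = any(p in text.lower() for p in NONCOMMERCIAL_PHRASES)
    uri_forbids_commercial = uri is not None and "-nc" in uri.lower()
    contradiction = (
        uri is not None and text_forbids_commercial and not uri_forbids_commercial
    )
    return {"uri": uri, "text": text, "contradiction": contradiction}


if __name__ == "__main__":
    print(license_report(SAMPLE))
```

A record with no `license` element at all comes back with `uri` set to `None`, which corresponds to the "license information missing altogether" case; deciding what to do with a flagged contradiction remains a human policy question.
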
So the human-readable text and the machine-readable text contradict each other. How are you going to handle that in an automated fashion? To give you an idea of the scale: the bot has imported some 15,000 media files from PubMed Central, from the supplementary materials, which are of course normally neglected, and at that scale things really become a problem. Then there are cases where the human-readable and the machine-readable versions agree with each other but contradict the journal's license statement. That is something that is really hard to pick up for a bot that only looks at the XML of the article itself. I won't go through all of these examples; there are links to most of the things I mention, and we can dig deeper at any time. You are really welcome to pose questions. Then we have license information that is outside the license element, and so on. We have contradictory URIs, and we have cases where the license information is missing altogether. All this is sometimes even mixed, and then you have typos on top of that, which was actually my title slide, if you look at the contents.

There are a number of consequences from that. First, PMC has recently, about a year ago, offered a search-by-license feature, but it makes use of basically the same information: if the information were more reliable, then that feature would be more reliable. You can continue that chain of thought. And it makes reuse cases like my bot's difficult. That's why, for the remainder of the talk, I want you to be in the mindset of always having that little bot in your mind that is crawling PubMed Central, or maybe crawling another JATS repository to which you're contributing, either by providing XML, by working on the documentation, or even by discussing standards. If you make the life of that bot difficult, it will be a sad bot.

There are other license-related issues that have not been discussed in detail that much. For instance, there are some publishers, F1000Research as
an example in PubMed Central, who use different licenses for different parts of the article, specifically for the text in the main body of the article and for the supplementary data. They have two predefined licenses to do that. As long as we keep signaling the license at the article level, we run into problems. What about licenses that change over time? It's also not very obvious how to handle them, and if there is a good solution, it's not well documented.

So, media types. Currently there are attributes being used to signal the media types: one is called mimetype and one is mime-subtype. The problem is that in the example given in the JATS documentation they are combined as one, like application/pdf, so there's no separation between the two, but the PMC tagging guidelines require both. So there are contradictions, and it is not entirely clear where the media type should go: should it go on the supplementary material, or should it go on child elements like media or graphic? So if you want to run a bot that searches for those materials, it's really hard.

OK, some data. Here is a graphic of the kinds of supplementary file formats that have been submitted to and indexed by PubMed Central in one month, that is May 2013. The gray areas are not interesting for us. We have three media types that are interesting for us: green is video, blue is audio, and purple is images. We don't import images right now, but we're working on a system that will, and that's why we are already keeping track of images. You see that there is a variety of file formats and so on. So that's the kind of content with which we work. But there is a complication in that Wikimedia Commons only accepts stuff in open
file formats, which is basically none of those, so we have to convert all the videos to Ogg, and we will soon start converting them to WebM, which are both patent-unencumbered formats. You see here that we have some problems with formats which are not accepted by the Wikimedia community. Another slide that may be interesting to publishers in the room: we have here, for that
same month, May 2013, a list of publishers by how well they signal the media type of the stuff in their XML. Red and green have the same meaning that you know from traffic lights, and the gray ones basically mean Microsoft Office formats, between which we did not differentiate. You see there are quite a few publishers that have quite a large red area, and that means that for some reason the file types have not been indicated properly. That's just one month; the sample size is 23,000 articles. We could in principle go through all
of the open access subset of PubMed Central, but this is just illustrative. If anyone is interested in doing these statistics at large scale, we would be interested in collaborating. There are mismatches between the declared type and the real type, as indicated here. There are also mismatches between the file name extension and the media type: we have various videos that end in .doc, for instance, or .txt or something like that, which is not what you would normally expect. While working on the bot we have signaled some of those problems to PMC, and they were quite responsive: they have improved the handling of the MIME types, they are now checking the MIME types themselves, and they also support HEAD requests. So there is quite some progress, but they said they do not consider retroactively fixing the XML any time soon. So new XML may be checked for more things of the kind that I mentioned, but whatever problem is in the database already will stay there, at least for the near future, and this may be a problem. Then there are standardization efforts: as Bruce already alluded to, the Supplemental Journal Article Materials working group has produced some recommendations which are worth considering in this regard.

Next kind of problem: keywords. These are actually used to put the files into categories on Wikimedia Commons, so that people who are writing articles on Wikipedia about a certain topic have a chance to find those files. So they are quite an essential component of the bot as well. Here JATS has two main options: you can either use the keyword groups, or you can go through the article categories, subject group and subject. There's no real prescription as to what you should do, so everybody does it basically how they like it. There are even inconsistencies between different journals by the same publisher, or sometimes between different articles in the same journal. The problem is actually that there is
no real standard, so it's really hard to find a way to extract subject information from those XML tags.

While we were talking about these licensing, media type, and keyword issues, someone approached us on Twitter and said that a few years ago they were going through PubMed Central trying to harvest reference information, and they gave us a few examples, so I'll just mention them. Here's a tweet:
citations 50 and 60 in that document [unclear].
Then there is even one where they have three
papers that share the same DOI, which actually belongs to a different paper. That makes it really hard to reuse that. So there are a few more
along those lines, and basically the quality control has failed at some point. It would be nice if there were more of that on ingestion into repositories like PubMed Central, or by other means, and if there is a problem, it would be nice to actually be able to fix it. We'll come back to that. Another issue that they noticed is the markup of citations where a range of references is cited, like references 1, 2, 3 or 15 to 21. There are actually multiple ways to do this, two of which are given here: you can put all those three IDs into one tag, or you can separate them out, and then there are other ways to do that. Again, there is no consistency across publishers. The PMC tagging guidelines actually do say you should do this one, but nobody cares. Then there are other problems. There's an example here where the year of publication is actually
well, the year of publication was actually 7942. That's quite advanced into the future, right? That's how the paper of the future looks. That's basically due to some issue of extracting numbers from the title. And of course you have all those problems with misspellings and with the transcription of citation titles, especially if they contain something more complicated than plain words, and so on. So these are the kinds of problems that other people have discovered when looking into references. So we have discovered problems while looking into licensing and media types, others have while looking into references, and we could probably continue that list; we just haven't looked.

OK, so the remaining part of my talk will focus on some recommendations. I know it's late in the afternoon and it's quite technical, but nonetheless I think it's important. We have a table of contents for this slide. We actually intended this slide to be quite succinct, but that wasn't possible because there are so many things that make reuse difficult. I'll just walk you through the table of contents; we can dig deeper into any of those, but this is just to give you an overview.

License tagging as recommended in the PMC tagging guidelines should be amended, we think, such that the URI is always in the same place. At the moment the URI can be in different places, and this introduces additional variety and potential sources of error. We think it would be better to have it always in one place and to specify how that should look. Once this is implemented on ingest, we think, why not do something similar for the JATS standard? The NISO Open Access Metadata and Indicators initiative has been mentioned by Bruce as well, so it would be nice to first see the recommendations of the working group and then see how they work
in practice, how they work with JATS and how they work with different kinds of use. Then public domain works, like the CC0 waiver or any other kind of public domain work: technically, that's not a license. So when we're discussing licensing information, we should be aware that there's quite a lot of open content that doesn't technically, legally fall under that heading. We should say that, for the purposes of JATS, public domain counts as a license, and we should prescribe somehow what to do in those cases. Then the PMC style checker, which is kind of the bodyguard at the ingestion into PMC, should be a bit more strict, at least for people like us who actually want to reuse the content. Then there are some loose guidelines on how to handle multiple licenses, which should also be more standardized. We say that if there is a license statement somewhere, it should always be in a license element, or attribute; I may well mix those up, but the important point is that the signaling of the license information should be consistent.

Another interesting thing: funder mandates have so far mainly been about making stuff free to read. There is now some discussion about legal aspects and licensing, but there is basically no discussion about technical aspects. If something is free to read and openly licensed but not available in a suitable technical form, that actually is a barrier to reuse, and it is not actually available. So I would really encourage funders in the room, or those of you who talk to funders, to make that point to them. If that were actually implemented, the funders would have an easier time keeping track of what they funded, so that's actually in their own interest. Then, for the license metadata already contained in PMC, we think it would be best if that could be fixed, even retroactively, and if the license data could be exposed much more prominently than it is right now.

For the media types, there is quite some variation in media
types, and it's not easy to come up with unique or one-size-fits-all solutions, but at least there should be something like best-practice guidelines. And of course, once we discover that a certain .doc file is actually a video, it would be nice to adapt the file extension to actually reflect that. As for the keywords, it would be nice to have keyword tags available at the level of something like a figure or a supplementary video, maybe even a paragraph, whatever sub-article level. At the moment that's not foreseen. Then, in order to fix all those things, it would be neat if we had some public version history. The version of record says that the video file I was talking about originally had the .doc extension. We can keep a record of that, but we can fix it, so that it has the .avi or whatever extension in the next or some other version, and we can go back to the original version if that is necessary for whatever reason. So it would be really nice to have a public version history for PMC, and ideally for science in general. Just imagine a public version history of lab notebooks
of [unclear] or something; that would be really neat. And then it would be nice if that public version history were not just maintained by PMC, but if there were some mechanism by which someone could submit patches, like pull requests. OK. Then we have some more philosophical explanations about why we think that there should always be one way to tag things. We have some other recommendations, like: PMC should open-source the code for the tagging guidelines and the style checker, for instance. Some recommendations for NISO: work on a better platform to actually keep track of the comments that have been submitted on certain initiatives, and so on. And the final but really important recommendation: keep reuse in mind. Think about the bot crawling around PubMed Central searching for audio and video content. If you want further detail, you can first go through that page, and second, we have a whole paper now that covers the reuse aspects once more in some detail.

The idea here is that these databases are not dead ends. We put our best research content out there to be aggregated in databases like PubMed Central, but the perspective of those who come there, especially with automated tools, has not really been on the mind of database designers that much. I think it should be. This is actually the case for JATS, and there are other examples, for instance the Protein Data Bank, which had similar problems. Initially the emphasis was on getting content in, and the problems with reuse came later, so they went in and actually cleaned it up, which took quite some effort. From this project, at least, I think we should probably reconsider JATS-related procedures, and once you're done with that, or have at least started thinking about it, there are probably some other databases suffering from similar problems. So I just want to start the discussion around that. Once we have reuse
cases in mind, we certainly cannot cover all uses, that's clear, but there should be some use cases that are intrinsically associated with the tool that we're trying to develop. I'll just give a few examples here. One would of course be the little bot crawling around PubMed Central. Another example would be species disambiguation: the LINNAEUS paper proposes "an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy". That's the kind of thing that you can do once the markup is correct. Citation analysis: "the set of citing sentences for a given article can be used as a surrogate for the actual article in a variety of scenarios"; it contains information whose use appears to be important, and that would be another one. And this can often be done in an automated fashion: "we propose improving the formulation of full-text queries by using the open access literature as a proxy for the literature as a whole".

Once we have got into the mindset of caring about reuse, we could as well start signaling reusability. That's what we're working on at Wikipedia. On Wikipedia they often have those small numbers somewhere in the text that indicate an inline citation. If you click on them, it brings you to the reference section, where you have all the usual metadata. We're working on a system that will look somewhat like this: it will contain an indication of the licensing, like how open that reference actually is, what I can do with it, whether I can read it, whether I can reuse the images or files in there, and then some Wikimedia version of where that information can be found. In this case the full text is available from Wikisource, which is a sister project of Wikipedia, all the images are on Wikimedia Commons, another sister project, and all the metadata is on Wikidata, another sister project, so it all fits into the Wikimedia universe. We're not the only ones thinking about such a system, so
PLOS is also working on implementing a similar system, closely associated with the NISO guidelines on how to do that, and we also try to be compatible with the NISO guidelines, but as long as they're not out we can't do that. We're kind of waiting for them to come out, because we're ready to deploy, and once it's on Wikipedia it is likely to influence the standard. Then the question would of course be what happens if Wikipedia does it, and maybe PLOS and another publisher, and that's a discussion to have. OK.

Finally, I'm here as a volunteer. I do all this Wikimedia work as a volunteer, investing my own time, but I do actually work sometimes, and this is what I'm working on: use cases and best practices for markup of biodiversity literature. I'm working at the Natural History Museum in Berlin, and we're interested in getting all the articles with species descriptions and other taxonomic literature marked up in a way that allows for semantic integration. I'm writing some reports for the European Commission on best practices and specific use cases of markup of the biodiversity literature, and if you have suggestions, they would be welcome. I have to submit by the end of the month.

Now I'll check whether any of you have actually submitted any questions to the question-and-answer page. I notice that there are some videos and there are some
images. OK, so some interactivity. Now I'm looking forward to your questions and comments. Thank you.

Nice paper. I have a small correction. I don't know if it's in JATS 1.0 or not, but by JATS 1.1 we absolutely allow keywords and other such metadata on anything that's got a title, essentially: figures, appendices, sections, boxed text. If you want to put all kinds of keywords on figures, you can do it; it's there.

OK, then I encourage you to basically fix my presentation. It was already fixed? So they already fixed it while I was still talking. Thanks. That's the kind of thing you can do: I haven't touched my presentation, but it's already fixed. I would like this to be possible for things like PMC as well, with some oversight, of course, in terms of quality. If someone has made a mistake somewhere, you should be able to fix it this way.

My other question, if I may ask a second: are you using the taxonomy stuff that Terry Catapano and those people presented at JATS-Con a couple of years ago, the taxonomy extension to JATS?

Yes, I'm fully aware of that, and I'm using that for my work. [unclear] Any other questions or comments?

[unclear] I'll comment on your recommendations, also by saying I enjoyed your paper the second time around equally as much as I enjoyed it the first time, although I don't remember all the details of the recommendations from the first time around. On your recommendation number 6, "one way to tag things": part of why JATS is so successful is that it leaves latitude for people to tag things in more than one way, depending on what their local needs are. I'm all in favor of providing best practices and policy recommendations too, but that's one where I raise my eyebrows and say maybe this isn't going to work for everyone.

Yes, and we also had a similar discussion, because, as we noticed from the previous talks, this is a discussion
conference, so that is one item, and we want you to think about that. It may be that the exact phrasing that we have here right now is not the best, but having best practices or guidelines would certainly be helpful. There are also many publishers who do not deliberately introduce these errors. They are just looking for some guidelines on how to actually do it, and if there's too much variation, they just say, "OK, I don't know, there seem to be five different ways, I'll just pick one," and they have no real reason why they picked this one over the other ones. If there were some guidelines, they could make the whole corpus more coherent, and it would also be easier to use automated tools to get the content out.

A quick question to the audience in terms of best practices: who uses the PMC style checker on their JATS XML? I would have liked to see more hands go up. How many of you use somebody else's style checker? Not enough. I would certainly want beta tests of style checkers.

Debbie Lapeyre, Mulberry Technologies, again. Best practices is a wonderful idea, but when I set out to write best-practices guidelines for the Tag Libraries, I soon discovered that that was not actually what JATS users wanted, because that would tell them how to do it, and they didn't want that. What they wanted was common practices, so they could go through and say, "Oh good, we're on the list, we're not doing anything too weird, fine." So the Tag Libraries for JATS have a number of common practices. The other way we try to steer people is with examples. Of course, my point there is: if you don't think the examples we have are steering people in the right direction, send new examples. But you have put a whole lot of time into this presentation.

Yes, it may actually make sense to digest this from different perspectives: JATS, PubMed Central, maybe some individual journals. Some of the recommendations certainly make sense at the
PMC level, maybe even only at the open access subset level, because some of those are really specific to the kind of reuse that is only possible with open-access stuff, so they may not make sense for JATS as a whole, but some of them may. I leave that to you; I'm not an expert on that. I just discovered problems with reuse.

Well, I was using you as an excuse to make another plea for samples, because every time I get a microphone in front of me, I try.

OK. Well, in the paper, almost all of our problem statements have an example at the end, and it's a long paper, so you have loads of examples, and we have a few new ones in the presentation, actually.

That's nice, but that's not what I need. What I need are examples, nice tagged examples, of how people might do it and how you think it ought to be done, or how people are doing it right.

Most of our examples are bad.

Yes, most of your examples are bad examples, and that's not a good idea. [unclear] When we tag certain things, do you think it is something that would benefit from more restrictions? For example, the license, where the license URIs go: there's no reason that I can think of to allow a lot of variation.

And I would add that I think this exchange is very interesting, because I think it illustrates the entire problem, which is that the benefits are not necessarily won by the people who put in the work. So you have a disparity across the community between the allocation of responsibilities and who's going to reap the rewards, and that's a tough problem. I think what [unclear] suggests is actually really important, because I'm a big believer in carrots and sticks, and what I think we need to do is illustrate the benefits you get from correctly tagging MIME types and from correctly and specifically indicating how licensing works, and then the people who don't do it right are not going to
benefit from it, and you show the ones that are doing it right what they're gaining. So I'm a big fan of the carrot-and-stick style of system, especially when the carrot gets big enough that you can use it as the stick.

Laura Randall from PMC. I just want to make a quick comment about who works on the style checker: there's a whole group of us who work on that. Anyway, as to your paper, I think it illustrates a really great example of what Tommie talked about this morning: one size does not fit all. What you need for your project is very clear; what PMC needs for what we're doing is also very clear; they don't coincide all the time. So some of the things that your recommendations are suggesting won't work for us, and some of the things that we want to do won't work for you, and that can be extended to other publishers, obviously, beyond just PMC, beyond archives or anybody else. So I think it's important that we focus on the reusability and the inconsistency issue, and not so much on what PMC needs to do to make your bot happy, because I'm all for a happy bot, but PMC still needs to work and we still need to do what we need to do. So I think it's a really good illustration of "one size doesn't fit all," and we do need to keep that in mind as a community.

Yes. On the other hand, it also illustrates that there are certain things where we can actually do things to make the bot and PMC happy at the same time. I think examples like the file ending, where it's a video file and it's signaled as .doc, are the kind of thing where a fix is of benefit to everyone. So this is a basis for discussion, some suggestions. We're fully aware that you will not change the world just to make that one bot happy, maybe for a few hundred bots, but not this one. It's a basis for discussion, and I'm open to whatever road the discussion takes.
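
The closing example, a video file signaled as .doc, can be checked mechanically. The following is a minimal sketch in Python, assuming a small hand-written table of container signatures; the table `CONTAINERS`, the function names, and the choice of formats are illustrative assumptions, not the checks that PMC or the bot actually run. The declared type is the value a record would give by combining its mimetype and mime-subtype attributes.

```python
import os

# Illustrative table (an assumption for this sketch): container name ->
# (leading magic bytes, plausible file extensions, plausible declared MIME types).
CONTAINERS = {
    "ogg": (b"OggS", {".ogg", ".ogv", ".oga"},
            {"video/ogg", "audio/ogg", "application/ogg"}),
    "webm/matroska": (b"\x1a\x45\xdf\xa3", {".webm", ".mkv"},
                      {"video/webm", "audio/webm", "video/x-matroska"}),
    "pdf": (b"%PDF", {".pdf"}, {"application/pdf"}),
    "ole2": (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", {".doc", ".xls", ".ppt"},
             {"application/msword", "application/vnd.ms-excel",
              "application/vnd.ms-powerpoint"}),
}


def sniff_container(head):
    """Guess the container format from the first bytes of a file."""
    for name, (magic, _extensions, _mimes) in CONTAINERS.items():
        if head.startswith(magic):
            return name
    return None


def find_mismatches(filename, declared_mime, head):
    """List disagreements between extension, declared type, and real content."""
    container = sniff_container(head)
    if container is None:
        return []  # unknown content: nothing can be claimed either way
    _magic, extensions, mimes = CONTAINERS[container]
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in extensions:
        problems.append(f"extension {extension!r} does not match {container} content")
    if declared_mime.lower() not in mimes:
        problems.append(f"declared type {declared_mime!r} does not match {container} content")
    return problems


if __name__ == "__main__":
    # An Ogg video that was delivered with a .doc file name.
    print(find_mismatches("movie.doc", "video/ogg", b"OggS\x00\x02"))
```

Only the first few bytes of each file are needed, so a check like this could run at ingestion time without decoding the media.
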

Metadata

Formal metadata

Title: Inconsistent XML as a barrier to reuse of Open Access Content
Series title: JATS-Con 2013
Part: 09
Number of parts: 16
Authors: Mietchen, Daniel; Maloney, Chris; Dagsson Moskopp, Nils
License: CC Attribution 3.0 Unported. You may use and modify the work or content for any legal purpose, and reproduce, distribute, and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they have specified.
DOI: 10.5446/21791
Publisher: River Valley TV
Release year: 2016
Language: English
Production place: Washington, D.C.

Content metadata

Subject area: Computer science
Abstract: In this paper, we will describe the current state of some of the tagging of articles within the PMC Open Access subset. As a case study, we will use our experiences developing the Open Access Media Importer, a tool to harvest content from the OA subset and automatically upload it to Wikimedia Commons. Tagging inconsistencies stretch across several aspects of the articles, ranging from licensing to keywords to the MIME types of supplementary materials. While all of these complicate large-scale reuse, the unclear licensing statements required us to implement text mining-like algorithms in order to accurately determine whether or not specific content was compatible with reuse on Wikimedia Commons. Besides presenting examples of incorrectly tagged XML from a range of publishers, we will also explore past and current efforts towards standardization of license tagging, and we will describe a set of recommendations for generators of content on how best to tag certain data so that it is both compatible with existing standards, and consistent and machine-readable.
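
The licensing problem described in the abstract can be illustrated with a minimal sketch. This is not the Open Access Media Importer's actual code; the function names `find_license_uri` and `is_commons_compatible`, the regular expression, and the fallback order are assumptions made for illustration. The sketch first looks for a machine-readable license URI in the `xlink:href` attribute of a JATS `<license>` element (or of a link nested inside it), and only falls back to scanning the free text of the license statement when no such attribute is present.

```python
import re
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute used in JATS.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical pattern for Creative Commons URLs found in free text.
CC_URL = re.compile(
    r"https?://creativecommons\.org/(?:licenses|publicdomain)/"
    r"[a-z\-]+(?:/\d+(?:\.\d+)*)?/?",
    re.IGNORECASE,
)


def find_license_uri(article_xml):
    """Return a license URI from a JATS article string, or None if none is found."""
    root = ET.fromstring(article_xml)
    for lic in root.iter("license"):
        # 1. Well-tagged case: the URI is an attribute of <license> itself,
        #    or of an element nested inside it (e.g. an <ext-link>).
        for element in lic.iter():
            href = element.get(XLINK_HREF)
            if href and CC_URL.search(href):
                return href.strip()
        # 2. Fallback: scan the human-readable license statement.
        match = CC_URL.search(" ".join(lic.itertext()))
        if match:
            return match.group(0)
    return None


def is_commons_compatible(uri):
    """Rough check: CC BY, CC BY-SA, and CC0 are accepted; NC and ND variants are not."""
    if not uri:
        return False
    lowered = uri.lower()
    if "creativecommons.org/publicdomain/zero" in lowered:
        return True
    match = re.search(r"creativecommons\.org/licenses/([a-z\-]+)", lowered)
    return bool(match) and match.group(1) in {"by", "by-sa"}
```

With consistent tagging, only step 1 would be needed; the text-scanning fallback in step 2 is the kind of workaround that inconsistent license markup forces on reusers.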
