
Free your papers, researchers!


Formal Metadata

Title: Free your papers, researchers!
Part Number: 156
Number of Parts: 169
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this

Content Metadata

Abstract
Ryan Lahfa - Free your papers, researchers! Research is financed from public money and researchers publish papers. But papers are often unavailable to everyone unless you pay for them, which seems wrong! What can we do as developers? Well, we can help researchers to open their papers! And how do we do it? With Python, of course! Dissemin is a website using the Django framework which aims to promote a global Open Access policy. It offers researchers a way to legally deposit their papers in a repository (Zenodo, for example). We will quickly see how the research world works, and what the challenges are in assisting researchers to make papers available to everyone!
Transcript: English(auto-generated)
Can you start? One, two, one, two, okay. So, hi everyone, I'm going to introduce you to a tool for researchers to free their papers. But first of all, what is research exactly? The systematic investigation into and study of materials
and sources in order to establish facts and reach new conclusions. This is kind of confusing. So, what exactly is a paper? So, who was there at the keynote on gravitational waves? Okay, so we could say this is really a breakthrough, what was discovered on that day.
And naturally, there was a paper published for it. So, we can all take a look at the full text here, which is filled with scientific details and mathematical formulas. But what do we do with papers? Well, researchers read them to inform themselves about what's going on in their field.
Or people will write their thesis with them. We could also, as developers, use them to write software, for instance, machine learning systems or databases. But there is a catch. Research is financed from public money and published through companies like Elsevier or organizations like ACM or IEEE.
And these publishers decide to keep the papers behind a paywall, so you have to pay for them. So, this is kind of closed source. But it creates a problem. Well, students can access those papers because their school pays for a subscription.
But we would like to have open access. Open access is really important because other people who are not students cannot access the papers and have to pay a ridiculous amount of money for only 10 pages, which is not really fair.
So, it's time for the guessing game. And people who are students in this room cannot play, unfortunately. So, publishers edit journals and those journals contain papers. In order to access their content, we have to pay a subscription. And the most famous one is Nature.
And in your opinion, how much do we pay for a subscription per year for only one journal? Does anyone have an opinion? 50,000? No, too much. So, sorry? Too low. So, we pay over 10,000 per year.
And you can also have journals peaking at over 25,000 dollars per year. This is insane. But what is really interesting is how much profit they make. So, according to right research, Elsevier has a profit margin of around 31.7%,
which is what? Which is what? So, let's compare it to a big company. What was Google's profit margin in 2008? Any idea? Well, 30.6%, which is less than the publisher.
Okay. So, let's summarize all this. So, researchers keep up with the state-of-the-art findings in their field. They get new insights and ideas by reading these papers and understanding them. So, they write down these ideas in a new paper
and then they submit it to a publisher. There is a system called peer review, which tries to filter out which content can be considered scientifically correct and a breakthrough. Finally, if the paper is accepted, science has progressed.
But in fact, there is a third party which we often forget, which is the publisher. And the publisher gets money at each step of the process. But there is a problem. Why is it important that publishers don't gain too much money from that?
Because we have subscriptions which are becoming more and more expensive, as you've seen. The price is really high. And the fact that we are giving away our research for free for the sake of peer review makes them more profitable. So, why shouldn't students from developing countries
also have access to these crucial papers which are behind the paywall? Because they don't have the money to pay for it? And why can't people who are not students, but maybe developers or people from any kind of field, access these papers? Because they are not students, they don't have universities to pay for this subscription?
Well, this could be you if you are a developer and you cannot access papers. You could run into a situation where you need them, but they are not available. So what can we do to improve open access? Well, we created Dissemin, which is a tool to give control back to researchers.
And we would like to promote a global open access policy with it. So how do we do that? We fetch your papers from different sources. Jubilee Core, BASE, all kinds of repositories, we fetch them. And we check the policies on these papers. Can we open them, can't we open them?
Can we open only the preprint version? And then we tell you what you can deposit legally. So what does it look like? Well, this. This is a page where you can deposit a paper. So you can see what you can deposit and what you can't deposit.
And this makes it pretty simple for a researcher to actually open their papers. When it's done, your paper is free and accessible by everyone. So this is great, but to give you more insight, who is behind Dissemin? Dissemin was an initiative from a group of students of the École Normale Supérieure in France.
We are a non-profit organization participating in many open access related projects. We worked with Wikipedia on an open access bot. We will be at OpenCon. So maybe you are telling yourself, but this is a Python talk. Where is Python in this story? Well, Dissemin is of course written in Python
using the Django framework. So we're using PostgreSQL to store papers and their metadata. So we're encountering some challenges I wanted to share with you. First of all, the papers. We have metadata for more than 15 million papers, and we are still getting more and more metadata
from new sources, but we have a problem. How do you fit this amount of data in your store? Well, we kept PostgreSQL, and we used its powerful JSON field. So PostgreSQL and JSON field.
Well, implementing a JSON field in a model is just a matter of importing the field and using it in your model, which is really awesome. But what's awesome is how PostgreSQL handles them. Well, you have indexing for free, which works on subfields. This is super efficient, and can be your NoSQL store for a while
if you don't want to bother. You can avoid very complex joins, and you can access subfields in queries directly without having to fetch the whole record. So the JSON field in PostgreSQL is really a good solution when you don't want to implement a NoSQL store.
The second challenge is that we need search, and it has to be fast, so that our researchers can get feedback really easily. So we tried to keep PostgreSQL for this kind of task. We looked in this search on Gynails, but it was not sufficient for the amount of data we had.
So enter Haystack. Haystack is a Python library for Django that provides awesome search tools. First of all, multiple backends. So we can have Elasticsearch, Solr, Whoosh, Xapian. We can even combine them: we can configure a master Elasticsearch with a backup Solr, which is really cool.
We have faceting, we have real-time indexing, which is important when we're getting new papers. We want them to be indexed right away. And we're still working to make this faster, because it's really hard to maintain all this metadata coming from many sources.
And the third challenge is the hardest, in my opinion: how to prevent duplicate papers, because we have so many sources which provide slight variations in the metadata, so we need to have a solution for that. So we tried the fingerprinting technique.
We have a function which takes a paper and reduces it to its minimal form: we remove the diacritics, lowercase everything, sort the authors list, simplify the title, and finally, we compute a hash on it and store it. Then, if we have a similar fingerprint in the database, we can just merge the papers
and get more and more metadata on the paper. So this technique is working more or less, but we always have some cases where we don't have the title, because some sources don't require you to enter a title for a paper, which is absurd. Anyway.
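The fingerprinting steps described here can be sketched in a few lines of standard-library Python. This is only an illustration under assumptions, not Dissemin's actual code: the function names `simplify`, `fingerprint` and `merge_by_fingerprint`, the record layout (dicts with `title` and `authors` keys) and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import re
import unicodedata


def simplify(text):
    """Strip diacritics, lowercase, and collapse anything that is not a letter or digit."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", without_marks.lower()).strip()


def fingerprint(title, authors):
    """Hash the simplified title plus the sorted, simplified author names.

    Hypothetical helper: the real algorithm may use more fields.
    """
    names = sorted(simplify(name) for name in authors)
    key = simplify(title) + "|" + "|".join(names)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def merge_by_fingerprint(records):
    """Group records sharing a fingerprint, keeping the first value seen per key."""
    merged = {}
    for record in records:
        entry = merged.setdefault(fingerprint(record["title"], record["authors"]), {})
        for field, value in record.items():
            entry.setdefault(field, value)
    return merged


records = [
    {"title": "Observation of Gravitational Waves",
     "authors": ["Ånders Müller", "Zoë Smith"], "doi": "10.0000/example"},
    {"title": "  observation of gravitational-waves. ",
     "authors": ["zoe smith", "Anders Muller"], "abstract": "Example abstract."},
]
papers = merge_by_fingerprint(records)
```

With these two sample records, both variants reduce to the same key, so `papers` holds a single entry carrying both the `doi` and the `abstract`. A sketch like this still fails in the case mentioned above, where a source provides no title at all, and it does not handle name-order variations such as "Smith, Zoë".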
To close on the challenges, we have many more challenges around machine learning to disambiguate authors' names, performing title cleaning from LaTeX markup, and infrastructure scripts: we already have Vagrant for development, and we would like to get Ansible or anything to push to production in a more efficient way.
We would like more deposit interfaces and sources to support more universities and more use cases, and our GitHub repository is filled with interesting issues, and we need your help. So, to close also on open access, we have many projects around this domain, like a proxy for digital object identifiers,
an open access bot for Wikipedia, a crawler for repositories, an OAI-PMH protocol implementation, which is a protocol to fetch papers in an efficient way, and a bit of inspiration from another lightning talk,
which has been done by Lassie from the crawler team. I want you to do something at the end of this talk, which is, if you're a developer interested in open source, clone Dissemin, run it using Vagrant or anything, try it out and deposit fake papers for fun, take an issue and submit a pull request if you can,
and if anything goes wrong, blame us. And if you're a researcher interested in open access, you could talk about Dissemin to every one of your peers, you could persuade them to open their papers, because this is really important. And most of all, you should open your own papers if you have them.
And if anything goes wrong, complain to us also. So thank you for this talk, and thank you, Python, it was really a great conference. If you have anything, you can contact us. Thank you very much.
Do you have any questions? Hi, so I'm interested in how you are funding yourselves, because I would imagine that going against companies like Elsevier, Springer, and so on isn't a trivial task,
especially for a couple of students. So like we said, we are a non-profit organization. We receive some donations, but we don't have so many costs, so we don't have the need for a lot of funding. We're going to get some funding from repositories
in France, but to be honest, we don't really need funding, so we don't have a problem going against Elsevier or Springer. Does that answer your question?
Did you think of storing the papers on already open storage, like arXiv or HAL in France? So why did you choose to store them in your database?
So as far as I know, arXiv is only a store for preprints, right? Not for, go ahead. Yes, so we offer the possibility to store preprint, postprint, and published versions.
The thing is also, I don't know if arXiv and those kinds of repositories support the way we are fetching the policies so we can tell you what you can deposit or not. Right, so we are also trying to get papers
from arXiv and other repositories, so we are trying to unify all repositories. We are not storing anything. We just use Zenodo, which is another repository, financed by CERN. So we think we are different.
Hi, me again. I would be interested in your faculty's position on that because it seems, you know, in the research business there are multiple problems that scientists are complaining about. That's one of them. So I'm more interested in, are you interacting
with your faculty or do you set your goals and priorities by yourselves inside the student group? So your question is about how we prioritize what we do? So if you are working with your faculty, with professors, assistants, and so on, or if this is just your student project.
So I'm not really a student, but half of the contributors in the Dissemin group are students. We have, I think, a professor now. The prioritization is based on what the use cases of universities are and what they need to make their repositories better
and also how we can promote open access in a more efficient way. So you could say this is my student project in the sense that I work on it because I find it really interesting, but for other people it's really important and crucial
and I understand that. So we are trying to prioritize some issues in our issue tracker, but if you have a really huge use case and you really need it, just send us an email and we will talk with you to see how we can make this happen.
Hi, great talk. I would like to know more about how you handle duplicates. Do you use only the title for the fingerprint? No, no.
For the duplication problem, we are using a lot more data. I don't have the algorithm in my head, but I'm pretty sure we're using the title and the authors list, which we sort so that the ordering is deterministic. We are trying to simplify all data
which could be removed by one repository but kept on another repository. To be honest, as this is open source, I can only suggest you take a look at it, and I can give you the file afterwards.
Why did you decide to build something new rather than use an existing open access tool such as DSpace, EPrints or Fedora Commons? For using what? Sorry. For using an existing tool like DSpace, which has been around for about a decade. So I'm not really aware of what DSpace does exactly,
but someone already asked himself the question. So I think we found a lot of flaws in DSpace which we didn't really want to keep. So it's the same reason for
whenever you create a new tool, it's because the older one is not sufficient at some point, we think. Any other questions? Let's say I'm a researcher and I put up the final version of one of my papers
but I don't have the right to do it. Are you taking a legal risk because you are hosting it or not? Well, our terms and conditions specify that you must have the right
to deposit the paper on the website. Yeah, but let's say I don't care about the limitation and I just do it because I think it's the right thing to do. So I don't know if this has really happened before, but if it did happen,
we wouldn't be able to detect it automatically or find it ourselves. So until we get an email from someone saying to us, oh, you're hosting a paper which should not be open, and we would maybe have a cease and desist letter or anything like that, and we would remove the paper, unfortunately.
Anybody else?