Jumping the Paywall

Video in TIB AV-Portal: Jumping the Paywall

Formal Metadata

Jumping the Paywall
How to freely share research without being arrested
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

This presentation will identify and address two safety-critical problems permeating research today: lack of total free access to scholarship, and the looming threat of apprehension for trying to facilitate said free access. In other words, this presentation will explore how to procure access to all-too-often restricted content sequestered behind extortionate academic paywalls, and how to then safely freely disseminate said content without being apprehended.
Our next talk is about the deconstruction of academic paywalls.
I'm sure everybody here knows and has seen those specific kinds of paywalls. And as you also might have noticed, those academic paywalls unfortunately differentiate themselves from other paywalls on the web not by making themselves transparent for the disadvantaged, as one might expect due to the importance of the proliferation of knowledge, but rather by even more exploitative pricing. Which is why I'm very excited for Storm Harding and his talk, Jumping the Paywall. Thank you so much, so welcome, please.
Thank you guys, thank you. So thank you for coming. Hi everyone, I'm Storm, I'm from the internet. Today I'm here to talk to you guys about paywalls. Paywalls, very basically, you're probably familiar with the concept: that's when we have some piece of knowledge, some piece of information, and we cannot get to it without paying. Today we'll be talking about how we jump those paywalls. We're going to break it down into the theoretical approaches that academics have taken to this problem, then we're going to talk about practical solutions for how we extract content from pay castles, and then how we remove any potential traitor tracing or watermarking of that content.
So before we start, we of course have this disclaimer, where any particular tenses I may use, any particular tenses you may hear, are not indicative of any kind of injunction to action, for we're always operating in a purely imaginative framework.
So with that in mind, let's start by looking at the theoretical overtures, at how academics have grappled with the problem of paywalls. This guy, Gary Hall, in 2009 proposed what we'll term a legalist approach to dealing with paywalls. Here you can see some of the approaches that he proposed, such as asking for permission to share an article after it's been published, or adopting a don't-ask-don't-tell policy where, if you publish an article in a journal, you could then, as it says on the slide, put it online yourself. This last point is particularly interesting to me, and what we're going to be focusing on throughout the talk is: if we can put articles online for free, why should that permissibility be restricted to the figure of the author? Or in other words, why do we need to repeat the so-called dry chants of the legalist? In other words, why do we need to engage in legal discourse when we talk about an ethical act?
And so onwards: later on, Striphas and McLeod developed this policy that any strategy, as you see up here, for contesting the law should proceed through more than just legal channels. And so that's what we'll be exploring today: extra-legal or non-legalist modes of intervention in the copyfight.
And then finally, I'm assuming most of you guys are familiar with Aaron Swartz and his Guerilla Open Access Manifesto, which we'll be more or less adhering to today. And a final note: this is not a talk in defense of copyleft, as these talks also often are. Instead, today, when we jump the paywall, we view copyleft as in fact a much more malignant enemy than traditional copyright. The reason for this is that copyleft presents a sense of acceptability; it makes copyright palatable. This seems contradictory. The problem is that copyleft does not question intellectual property itself. It merely changes the directive from "thou shalt not", the traditional copyright injunction, to "thou shalt", the permissibility of copyleft. What it doesn't question is why anyone should dictate to that "thou" in the first place. So what we're going to be doing today is challenging the notion of intellectual property at its core. Copyleft does not do that; in fact, copyleft entrenches it all the further.
And finally, things like open access, which are again fundamentally reformist, nonsensical injunctions that again propagate intellectual property. Here we see, as some of you may be familiar with, PLOS, the Public Library of Science. On the one hand they claim that they stand for unrestricted access and unrestricted reuse, but in the next paragraph of their mission statement they say that they apply the Creative Commons Attribution license. Licensing, of course, is inherently a restriction. The particular terms of the license do not matter; what matters is that someone feels they have the right to set a license in the first place, and that is what we're fighting against today. And thus we reject copyleft, we reject open access, and we embrace the copyfight.
That was kind of the theoretical standpoint that I'll be coming from. Now let's talk about how we actually liberate content. Here's a set of rules to follow. The first is, when pirating, always steal books from the library; never check those books out. The question is why, right? We've all been brought up to be good citizens who borrow books from the library, return them on time, and pay our fines. What this does is create a convenient tracking database, which can then of course be correlated with your online distribution activities.
So let's say you're fond of a particular forensics journal that a local community college library has, and you always borrow it and return it on time, but while you're doing that you also scan a copy and post it somewhere else. Let's say Elsevier, one of the owners of the journals, then decides to start checking library records: who checked out this particular journal and these particular issues, which then went online? Now of course you may be thinking, "But other sites may be keeping records too, right?" The difference is that your library record is usually, unless you took the precaution to use false identification to register, linked to your real identity, and can lead to source neutralization.
That's one of the main problems that we'll be talking about today in particular case studies: source neutralization, which is when the adversary neutralizes the source of content, in other words by shutting them down, by sending a lawsuit their way, by arrest, sometimes through grimmer circumstances such as suicide.
So like I said, don't use the library unless you have to, and if you do have to, always steal from the library. Going further, then, into alternative digital vectors, there's of course Library Genesis. The current mirror is at .io, though as you may or may not know, the actual URL bounces around in a round-robin sequence, so it may change to .org and so on. So exactly how big is Library Genesis? The most recent studies show that it contains 38 million academic articles totaling 28 terabytes, which constitutes thirty-six percent of all indexed academic articles that have a DOI, or digital object identifier, which is kind of like an ISBN for journal articles.
So that's one of our main resources. A related resource is Sci-Hub. Sci-Hub is a round-robin sequence of .edu and .ac.uk proxies, to which you feed an article DOI or other URI, like the URL, and then it goes and gets the article.
So basically what Sci-Hub does is what we've been doing throughout the 90s, where you go and find public .edu proxies that have access to particular journal subscriptions, and then you write a basic scraper to collect all the content and distribute it. Sci-Hub automates this task by automatically mirroring any particular article that you access on the LibGen archives. Okay.
So these are the two main resources that we should use in lieu of dangerous physical meatspace libraries. There's also been growth lately in crowd-sourced resources. The subreddit Scholar has a request section and a fulfillment section, where you can post a DOI that you want and someone can find it. On Twitter there's also the recent hashtag #icanhazpdf, where you make a tweet with, again, the DOI or the URL of the article you want, and anyone who has access will send you the link.
There's a couple more, more or less obvious, resources that we nonetheless should not overlook, such as Google Scholar, which oftentimes leads to open versions of articles that are otherwise paywalled on other mirrors. You should also always check the personal pages of any particular author, because sometimes they put articles online there.
And again, going back to the dreaded library: if everything else fails and you have access to a university, go and try to procure the article from there, but be sure to use open-login terminals. Some of these may be non-obvious. For instance, if you're faced with just a basic catalog screen, sometimes tapping something like the Windows key and then right-clicking and going to the desktop lets you basically escalate privileges to obtain access to terminals that don't require registration to view articles in an educational setting.
And so now what we should talk about is this last resort. Let's say that we can't find what we're looking for online and we actually have to trek out to at least a local Wi-Fi hotspot that has .edu access. We need to practice good operational security when we're actually on adversarial territory.
So you may be familiar with the case of Aaron Swartz. If not: Aaron was essentially arrested for downloading a few articles from JSTOR, which is a particularly popular academic database, from a server rack at MIT. What Aaron did was he went to a particular server closet over and over again, plugged in his own hard drives, and then liberated a few million articles.
What led to Aaron's arrest was that he went back to the same closet. In other words, the admins noticed regular activity coming out of this random server rack, and then they set up CCTV surveillance in that space. So the first rule, as always, is to never return to the same feeding hole, to always pick a different source if you're practicing actual OPSEC in the vicinity.
Another particular item to keep in mind is: do not create any record of your existence. If the particular facility that you're accessing requires swipe-through or smart card access, you can social engineer access into the facility by, for instance, taping black electrical tape over the magstripe and then complaining to a security guard that your card just doesn't seem to work, and they will more often than not just swipe you in.
So by now, we should have procured some articles. Before we actually start sharing them, we now need to engage in content defanging, or removing all of the actual poison that the venomous publishers inject into these articles before we can share them, to again prevent target neutralization.
There's three basic types of so-called bad things that publishers can put in articles: content protection, watermarks, and metadata. So let's go through how we can potentially deal with each one.
Let's start with content protection. Content protection is, very basically, stuff that prevents you from doing stuff to the article: sometimes things like copying it, printing it, or reading it. And again, there are very many easy-to-use tools that we can deploy to defang content protection. One such tool is Advanced PDF Password Recovery Pro, which can also brute-force passwords to PDFs if they're not just content protected but also password protected.
And again, this would work for very basic protection. For more advanced protection, such as Adobe's more recent LiveCycle program, which requires connecting to a server in order to get a temporary certificate to view the article, what we can in fact do is spoof the server to localhost. I'm not going to go through that because there exist very detailed guides. The point to take away is that this is very easily done if you just look up how to do it.
So that was content protection. That's usually the overt form of content fanging, or in other words of protecting content: it's very obvious when you cannot copy a particular article. A much more insidious, latent form is usually watermarks. Watermarks function by, once again, the content protection being embedded into the actual article, and these can be things like marginal notes, like you may recall, which would say that this was downloaded from wherever, at whatever time, from whatever IP. So this will be the first kind of watermark that we're looking at, and again this is relatively straightforward: these are things that you can see in the marginalia of an article.
So let's get rid of it. A basic tool to use would be, at first glance, Briss. Briss is a cropping tool where you can input a PDF, select the margins, and crop out the potential watermark. So here we have a censored article, where on the left we have, before Briss, the censored watermark marginalia on the left-hand side, and then after, seemingly without it.
The problem, though, is that Briss performs what is in fact known as a non-destructive crop, which means all it does is adjust the actual margins; it does not delete the content. So even if we download Briss and crop out the margins, forensic examiners will still be able to retrieve the watermark, which will be outside the printable margins but will still be embedded in the PDF.
Instead, what is necessary to defang it, then, is to entirely reprint the article, not simply crop it, and select the margins within the printer parameters. Once we do that, we find that actually printing it gets rid of the marginal watermark more than Briss does.
So that was the very basic kind of watermarking that we can encounter. There are other, much more sneaky watermarks that publishers may potentially put in. The first is known as natural language watermarking, or NLW. The way that NLW functions is, instead of adding extraneous information into the article, it modifies the actual syntax of the article itself. A very basic example, which you see up here: one iteration of an article would say "I ate a green cupcake yesterday", and another one would say "Yesterday I ate a cupcake that was green", or "Yesterday I ate a green cupcake", and so on and so forth. Once any given number of sentences are modified, the particular tracking algorithm can then deduce which version of the article was watermarked, or which source it came from. And of course the flip side of this is that it is very trivial to defeat, by performing a simple difference analysis between two copies of the article.
There's then a potential third kind of watermarking that we should be conscious of as well, which is spatial watermarking. The way that this functions is by modifying the actual spacing between sentences, between words within a sentence, between lines, between pages, between page numbers, and so on. And again, the good thing is that once we get rid of the content protection, this is again very trivial to remove, by dumping most of the article into plain text, which will get rid of the particular spacing that is preserved in PDF files.
Finally, the third kind of component that publishers often use to track you is metadata. Metadata is, again, basically data about data. In our instance it's things like who the article's author is, the time that it was generated, the particular time zone it was generated in, and the mysterious UUID field, which we'll talk about in a second.
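The "simple difference analysis between two copies" mentioned for natural language watermarking can be illustrated with a short Python sketch. This is not the speaker's tooling, only a minimal illustration using the standard library's `difflib`; the function name `differing_spans` and the two sample strings are made up for the example.

```python
import difflib


def differing_spans(copy_a: str, copy_b: str) -> list[tuple[str, str]]:
    """Return (text_in_a, text_in_b) pairs for every run of words that differs."""
    # Splitting on whitespace compares wording only; spacing is ignored.
    words_a = copy_a.split()
    words_b = copy_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    spans = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            spans.append(
                (" ".join(words_a[a_start:a_end]), " ".join(words_b[b_start:b_end]))
            )
    return spans


# Hypothetical sample text: two copies that differ in one rephrased sentence.
copy_a = "We describe the method. I ate a green cupcake yesterday. Results follow."
copy_b = "We describe the method. Yesterday I ate a cupcake that was green. Results follow."

for in_a, in_b in differing_spans(copy_a, copy_b):
    print(repr(in_a), "<->", repr(in_b))
```

Because the comparison works on whitespace-separated words, it reports only differences in wording. Spacing differences of the kind described for spatial watermarking are deliberately not visible to it.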
So if you're using something like Adobe Acrobat, they again ostensibly have a metadata scrubbing tool built in, and here on the screenshot you can see they claim this will discard document information and metadata. This is what is known as a lie. If we open the metadata of a PDF that has been scrubbed with Acrobat's own scrubber, we find that it still has UUID parameters, which we can view if we open it in a basic hex editor. These are again a list of bad things, and remember, our goal is to get rid of bad things to share the good thing, which is the knowledge.
So what is this particular unique user ID? Adobe's XMP specifications, which lay out the metadata that Acrobat uses, don't actually conveniently tell us what it is. They say that that's up to the printer: the PDF printer can set its own UUID parameters. Best practices in the RFC specs that are out there dictate that they should be at least partially from a random number generator, but earlier versions of UUID used the MAC address. In fact, this is how, for instance, the author of the Melissa virus eventually got caught: the UUID used to spread the Melissa virus in some Word documents matched some other random files that someone had uploaded online at one point, which turned out to be the friend of the guy who wrote the virus. In other words, the UUID is dangerous. Adobe's specifications do not dictate that you need to use the latest UUID implementation, which is a random number generator, so in theory any potential PDF printer that you use could be using UUIDs that will again allow traitor tracing. So in other words, they need to be taken care of. If you're editing your document in something like Adobe Acrobat, they will not be taken care of, even if you select the scrub tool, which means that you need to go into the document in a hex viewer and actually remove or modify the parameters there.
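To make the point about UUID versions concrete, here is a minimal Python sketch of what looking at these fields in a hex editor amounts to. It is an illustration under stated assumptions, not the speaker's method. It scans raw bytes for UUID-shaped strings and reports the UUID version of each (UUID stands for universally unique identifier in RFC 4122). A version 1 UUID combines a timestamp with a 48-bit node field that is typically the generating machine's MAC address, while a version 4 UUID is random. The function name `describe_uuids` and the sample bytes are made up; the sample only imitates the `xmpMM:DocumentID` and `xmpMM:InstanceID` properties of XMP. Metadata streams in real PDFs may be compressed or encrypted, in which case a raw byte scan like this finds nothing.

```python
import re
import uuid

# Canonical 8-4-4-4-12 hexadecimal UUID text, matched in raw bytes.
UUID_PATTERN = re.compile(
    rb"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def describe_uuids(raw: bytes) -> list[dict]:
    """Find UUID-shaped strings in raw bytes and report version and node."""
    found = []
    seen = set()
    for match in UUID_PATTERN.finditer(raw):
        parsed = uuid.UUID(match.group().decode("ascii"))
        if parsed in seen:
            continue  # report each distinct UUID once
        seen.add(parsed)
        info = {"uuid": str(parsed), "version": parsed.version, "node": None}
        if parsed.version == 1:
            # Only version 1 carries a node field (often a MAC address).
            info["node"] = ":".join(
                f"{(parsed.node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8)
            )
        found.append(info)
    return found


# Synthetic sample standing in for an uncompressed XMP packet.
sample_v1 = uuid.uuid1(node=0x001122334455)
sample_v4 = uuid.uuid4()
sample = (
    f"<xmpMM:DocumentID>uuid:{sample_v1}</xmpMM:DocumentID>"
    f"<xmpMM:InstanceID>uuid:{sample_v4}</xmpMM:InstanceID>"
).encode("ascii")

for entry in describe_uuids(sample):
    print(entry)
```

For a file on disk, the same function could be called on `pathlib.Path(name).read_bytes()`. A version of 1 in the output is the case the talk warns about; a version of 4 indicates a randomly generated identifier.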
And of course, this was all talking about if we want to modify potential metadata: we would open it in a hex editor and change the time zone, for instance. If we simply wanted to wipe the data, in other words we didn't want to insert spoofed data but simply wanted to erase it all, we could use a very easy tool known as the Metadata Anonymisation Toolkit, where you feed it an input and it produces a cleaned output. The problem with simple wiping, of course, is that your adversary will then know that the data was erased. In other words, they will know that you are privy to the modifications. So if you have the actual time to go in and start modifying values instead of just erasing them, that will potentially lead the adversary down a wild-goose-chase rabbit hole.
So at this point, what we've done so far is we've found sources where we can procure articles, and we've discussed how we can remove protection from the articles. Now how can we finally share them? The first, very fundamental principle would be not to use your own IP, not to use the IP of any university you may be affiliated with, and to use Tor, but of course not to use Tor from your university network, because then it would of course be obvious if you're the only person using Tor at that given time and there's a Tor upload that matches that timestamp. In other words, do not use your own network at all.
The second thing to do is to wait. Let's say you purchase a book from Amazon at five o'clock on Friday and then you put it on LibGen at 5:01 on Friday, and let's say you do this over and over again. Amazon may very easily then conduct time correlation attacks, because LibGen of course preserves the file upload date and time. So the second thing to do, other than not using your own network, is to wait before you upload stuff, to be able to spoof time correlation attacks. And you may further be able to spoof these by, again, modifying the data within your document. So if you downloaded something on Wednesday the fifth, you could change the metadata to say it was downloaded on Tuesday the fourth, or even earlier, or potentially even later.
And then finally, you could use various file hosting solutions which are more or less friendly to the type of content that we want to share. Some of these are the following. Okay, and that's pretty much it.
Now what we do is open it up to questions, but the last thing I want to say is: remember that this is serious business. This is why we started off with a formal disclaimer, because people are getting arrested for effectively simply sharing information. So be safe and be careful when you guys do this, and remember, we are at war. At the time that I'm speaking right now, Elsevier, one of the biggest publishers, has filed a John Doe lawsuit in New York against Sci-Hub. LibGen is also under attack, in that, for instance, the High Court has recently blocked it in the United Kingdom. So these are very serious issues. I may have addressed some of them glibly as a way of getting them across, but keep in mind this is very serious business. Thank you guys. Any questions?
Okay, thank you very much. We do have time for questions. I would ask everybody to please line up at the mics. We do have a question already, please go ahead.
I do have a question. Your injunction to steal books from the library is very strange. In particular, it violates most ethical positions and the golden rule, and it ignores the fact that librarians are very protective of patron privacy, both on a historical basis and in individual cases. If you heard Brewster's talk a few days ago, he talked about the National Security Letter which the Internet Archive got and fought. Many libraries have a long tradition of resisting law enforcement demands for patron records. So I think you're deeply misguided in suggesting that people steal books from libraries.
I'm sorry, I'm deeply misguided? Okay, so that wasn't really a question, I suppose, but I will respond to it in kind anyway. To address your first point, about the fact that many librarians are protective of user privacy: librarians can be served with orders where they're not allowed to state that they have received orders to turn over loan records. That's a fundamental factor of at least US law. However, even if that were not a fact, putting trust in another entity increases the entropy, in that if you're trusting the librarian not to hand over the records, that bridge does not need to be there if you simply take the book in the first place. In other words, you are putting yourself needlessly at risk.
Incidentally, note that you can also read the book in the library and not borrow it, and create no record of the book, or you can photocopy it with your phone without any record of your being there at the library, and not deprive other people of this library resource.
Okay, let's back up and take your points one by one. So yes, you can read a book in the library, assuming that you have physical access to the library. What we are fighting for is making knowledge globally accessible to people who do not have the privilege to be in a particular building. Second of all, to address your second point about taking a photocopy or taking a photo with your phone: yes, you can do that if you want very crappy low-resolution images. If you didn't want to take the book from the library, a more prudent solution would be to use their fancy scanners. But to go even further than that, you seem to be assuming that in the action of taking a book from the library, that book would otherwise not have been taken out. But what of traditional patrons? They too take books out from the library. The difference is that when we do it, we put the books online for anyone to then see, and then we can bring them back to the library, as opposed to a general patron who takes the books out, reads them for themselves, a fundamentally selfish act, and then brings them back. So I don't particularly see the problem, unless you're assuming that we won't put the books back when we take them, or that we won't put the books online, which were the two fundamental imperatives of this mission.
Yes, in order to steal a book you need physical access to the library. Instead of stealing the book, you could read the book, you could photocopy the book. The quality of the photocopy that you make is unlikely to be noticeably different for usability. Of course, you could steal the book and deny...
I think we do have some more questions. Thank you anyway. Please go ahead over here.
You have a massive quantity of PDFs you want to change the metadata on. Do you have any recommendations for tools to batch process?
So the question is kind of whether your intention is to actually modify the metadata or simply to wipe it. If you simply want to wipe it, that's a lot easier to do: you could simply batch process using the Metadata Anonymisation Toolkit, which can batch process and wipe out the data. If you actually want to go through and spoof the data in every single one, at this moment there are unfortunately not yet any tools available to automate that. They're being worked on vaguely in people's free time, but unfortunately I don't know of any to do mass-scale spoofing at this point.
Thanks.
Okay, thank you very much. We have another question on the other side of the room.
You mentioned two potential problems to solve, the thing with the green cupcake and the thing with the spacing. Have you seen either of those problems in the wild, or have you heard of it?
Yeah, that's actually a very good question. I've looked at, I think by now, something like 20 major publishers of articles. None of them use these systems presently, but these systems do exist, and as such it's only a matter of time before they're widely adopted. In the wild, I know on the slide it says seven, but by now I've done more like 20, and I have not actually found that in the wild anywhere yet. So these are at this point only theoretical attacks and counter-attacks, but I think it's good that we start thinking about them early, as long as they also don't scare us into inaction.
Okay, thank you very much. Are there any more questions? Please get up to the microphone if you can. Right here, yes. Okay, go ahead.
It's a bit off the original purpose of the presentation, but in terms of complementing this data liberation strategy with a strategy that also embraces the authors of the publications that are to be liberated themselves: I think a lot has been done historically, for quite a while, with the social convention of a pre-publication or pre-formatted copy. There is a law in the UK that requires researchers to do that, and I believe to some extent you can put it on your university website, and there are academia.edu and ResearchGate, so I guess two platforms that are helping to get around it. I'm also just a bit concerned that maybe the guerrilla tactics on the one side are possibly not going to win the favor of authors. It's different when you come to journals; I mean, I don't think many people get as pissed off when journals are pirated...
I'm sorry, we are very short on time. Do you think you can answer that?
So just very briefly, to address your question: there are things like academia.edu and ResearchGate, which are again legalist modes of praxis, where authors can voluntarily put content online, and we should absolutely use those. But my point today is that we shouldn't limit ourselves to simply those kinds of legalist modes of attack. In other words, we should certainly use the authors if they're willing to join us, but we shouldn't restrain ourselves to their consent.
Okay, great. So I think we have time for one very quick question, please.
I simply wanted to add to the previous speaker: I read a lot of academic literature, and US universities normally put things online nowadays on the open internet, whereas in Germany the problem is very strong, and publishers don't allow you to put anything on the internet if you want to publish it in a magazine. So I think this problem should be addressed at a future conference. Thank you.
Thank you. That was more like an annotation than a question, so I think we're good. Then thank you very much, Storm Harding, for a great talk.

