Journalists are researchers like any others

Formal Metadata

Title
Journalists are researchers like any others
Subtitle
How we built Datashare
Title of Series
Number of Parts
490
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
We are not journalists, but we are developers working for journalists. When we receive leaks, we are flooded by the huge number of documents and by the many questions journalists have while trying to dig into them. Among others:

* Where do we begin?
* How many documents mention "tax avoidance"? (see the count sketch below)
* How many languages are in this leak?
* How many documents are CSV files?

Journalists have more or less the same questions as researchers! So to help them answer all these questions, we developed Datashare.

In a nutshell, Datashare is a tool to answer your questions about a corpus of documents: just like Google, but without Google and without sending information to Google. It extracts content and metadata from all types of documents and indexes them, then detects people, locations, organizations and email addresses (the extraction and entity-detection steps are sketched below). The web interface exposes all of that so you can get a complete overview of your corpus and search through it. Datashare also lets you star and tag documents.

We did not want to reinvent the wheel, so we used components that have proven to work well. How did we end up with Datashare from a heterogeneous environment? Initially we had:

- a command-line tool to extract text from huge document corpora
- a proof of concept of NLP pipelines in Java
- a shared index based on Blacklight / Ruby on Rails and Solr
- open-source tools and frameworks

Issues we had to fix:

- UX
- scalability of Solr with millions of documents
- integration of all the tools into one
- maintainability and robustness as the code base grows
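To make the extraction step concrete, here is a minimal sketch of pulling text and metadata out of an arbitrary document. It assumes Apache Tika as the extraction library; the class name and command-line argument are placeholders for illustration, not Datashare's actual code.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical example: extract text and metadata from one document with Apache Tika.
public class ExtractionSketch {
    public static void main(String[] args) throws Exception {
        Path document = Path.of(args[0]);
        AutoDetectParser parser = new AutoDetectParser();        // picks a parser from the detected MIME type
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the default write limit
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(document)) {
            parser.parse(stream, handler, metadata, new ParseContext());
        }

        // Metadata such as the MIME type answers questions like "how many documents are CSV files?"
        System.out.println("Content-Type: " + metadata.get("Content-Type"));
        // The plain text is what gets indexed and fed to the NLP pipelines.
        System.out.println(handler.toString());
    }
}
```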
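The entity-detection step can be sketched the same way. The example below assumes Stanford CoreNLP as the NLP pipeline (one possible choice, not necessarily the one used in Datashare) and prints the people, locations and organizations it finds in the extracted text.

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreEntityMention;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

// Hypothetical example: run named-entity recognition on text extracted from a document.
public class NerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        String extractedText = "John Doe wired money to Acme Ltd in Luxembourg."; // placeholder text
        CoreDocument document = new CoreDocument(extractedText);
        pipeline.annotate(document);

        // Each mention comes with a type such as PERSON, LOCATION or ORGANIZATION.
        for (CoreEntityMention mention : document.entityMentions()) {
            System.out.println(mention.text() + "\t" + mention.entityType());
        }
    }
}
```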
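Finally, once documents are indexed, a question like "How many documents mention 'tax avoidance'?" becomes a single count query. This sketch assumes an Elasticsearch-style index reached through the Java high-level REST client; the index name "documents" and the field name "content" are invented for the example and are not Datashare's real schema.

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.CountRequest;
import org.elasticsearch.client.core.CountResponse;
import org.elasticsearch.index.query.QueryBuilders;

// Hypothetical example: count indexed documents whose text mentions "tax avoidance".
public class CountSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            CountRequest request = new CountRequest("documents"); // index name is made up
            request.query(QueryBuilders.matchPhraseQuery("content", "tax avoidance")); // field name is made up

            CountResponse response = client.count(request, RequestOptions.DEFAULT);
            System.out.println(response.getCount() + " documents mention \"tax avoidance\"");
        }
    }
}
```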