Journalists are researchers like any others

Formal Metadata

Title
Journalists are researchers like any others
Subtitle
How we have built datashare?
Series title
Number of parts
490
Author
License
CC Attribution 2.0 Belgium:
You may use, adapt, copy, distribute, and make publicly available the work or its content, in unchanged or adapted form, for any legal purpose, provided that you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Release year
Language

Content Metadata

Subject area
Genre
Abstract
We are not journalists, but we are developers working for journalists. When we receive leaks, we are flooded by the huge number of documents and by the many questions journalists have as they try to dig into the leak. Among others:
* Where to begin?
* How many documents mention "tax avoidance"?
* How many languages are in this leak?
* How many documents are in CSV?

Journalists have more or less the same questions as researchers! So, to help them answer all these questions, we developed Datashare.

In a nutshell, Datashare is a tool to answer all your questions about a corpus of documents: just like Google, but without Google and without sending information to Google. That means it extracts content and metadata from all types of documents and indexes them. Then it detects any people, locations, organizations, and email addresses. The web interface exposes all of that to give you a complete overview of your corpus and let you search through it. Datashare also lets you star and tag documents.

We didn't want to reinvent the wheel, so we used assets that have been proven to work well. How did we end up with Datashare, starting from a heterogeneous environment?

Initially we had:
- a command-line tool to extract text from huge document corpora
- a proof of concept of NLP pipelines in Java
- a shared index based on Blacklight / RoR and Solr
- open-source tools and frameworks

Issues we had to fix:
- UX
- scalability of Solr with millions of documents
- integration of all the tools into one
- maintainability and robustness as the code base grew
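
As an illustration only, the kind of questions listed in the abstract can be sketched over a tiny in-memory corpus. This is not Datashare's code or API: the `Document` class, the helper names, the sample documents, and the regex-based email detection are all assumptions made for the example, and a real deployment would rely on a search engine and NLP pipelines rather than on Python dictionaries.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical record holding extracted content and metadata."""
    path: str
    content_type: str
    language: str
    text: str


# Deliberately simple pattern; real email detection is more involved.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_index(docs):
    """Build a toy inverted index: token -> set of document positions."""
    index = {}
    for i, doc in enumerate(docs):
        for token in set(tokenize(doc.text)):
            index.setdefault(token, set()).add(i)
    return index


def count_mentions(docs, index, phrase):
    """Count documents containing the exact phrase (word sequence)."""
    words = tokenize(phrase)
    if not words:
        return 0
    # Use the index to narrow down to documents containing every word.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    needle = " " + " ".join(words) + " "
    # Then confirm the words appear next to each other, in order.
    return sum(
        1 for i in candidates
        if needle in " " + " ".join(tokenize(docs[i].text)) + " "
    )


def find_emails(docs):
    """Return the sorted set of email addresses found in the corpus."""
    return sorted({m for doc in docs for m in EMAIL_RE.findall(doc.text)})


# Made-up sample corpus, for illustration only.
docs = [
    Document("a.pdf", "application/pdf", "en",
             "A memo on tax avoidance. Contact jane@example.org."),
    Document("b.csv", "text/csv", "fr", "nom,montant\nDupont,1000"),
    Document("c.eml", "message/rfc822", "en",
             "Avoidance of tax is discussed; write to bob@example.com"),
]
index = build_index(docs)

mentions = count_mentions(docs, index, "tax avoidance")
languages = Counter(doc.language for doc in docs)
csv_count = sum(1 for doc in docs if doc.content_type == "text/csv")
emails = find_emails(docs)

print(mentions, len(languages), csv_count, emails)
```

The inverted index only narrows the candidate set; the phrase check then confirms that the words appear next to each other, which is why the third sample document (it contains both words, but not as a phrase) is not counted.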