Large-scale data extraction, structuring and matching using Python and Spark

Zitieren

EuroPython

Kayal, Deep

Formale Metadaten

Titel

Large-scale data extraction, structuring and matching using Python and Spark

Serientitel

EuroPython 2017

Anzahl der Teile

160

Autor

Kayal, Deep

Lizenz

CC-Namensnennung - keine kommerzielle Nutzung - Weitergabe unter gleichen Bedingungen 3.0 Unported:
Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben

Identifikatoren

10.5446/33699 (DOI)

Herausgeber

EuroPython

Erscheinungsjahr

2017

Sprache

Englisch

Inhaltliche Metadaten

Fachgebiet

Informatik

Genre

Konferenz/Talk

Abstract

Large-scale data extraction, structuring and matching using Python and Spark [EuroPython 2017 - Talk - 2017-07-14 - Anfiteatro 1] [Rimini, Italy] Motivation - Matching data collections with the aim to augment and integrate the information for any available data point that lies in two or more of these collections, is a problem that nowadays arises often. Notable examples of such data points are scientific publications for which metadata and data are kept in various repositories, and users’ profiles, whose metadata and data exist in several social networks or platforms. In our case, collections were as follows: (1) A large dump of compressed data files on s3 containing archives in the form of zips, tars, bzips and gzips, which were expected to contain published papers in the form of xmls and pdfs, amongst other files, and (2) A large store of xmls in the form of xmls, some of which are to be matched to Collection 1. Problem Statement - The problems, then, are: (1) How to best unzip the compressed archives and extract the relevant files? (2) How to extract meta-information from the xml or pdf files? (3) How to match the meta-information from the two different collections? And all of these must be done in a big-data environment. Presentation – https://drive.google.com/open?id=1hA9J80446Qh7nd8PMYZibtIR1WjMkdLXfDgwUlts7JM The presentation will describe the solution process and the use of python and Spark in the large-scale unzipping and extraction of files from archives, and how metadata was then extracted from the files to perform the matches on