Large-scale data extraction, structuring and matching using Python and Spark

Speech transcript
Let's see whether this is audible and understandable, because the question I asked earlier indicated otherwise, so I hope I have better luck this time. I would like to start with a vote of thanks to the organizers, because this has been a very enriching experience: the conference has been great, the talks have been great, and the energy is so invigorating. The same goes for the Python community; it's great to have people developing an already awesome language. So, hi everyone. My talk today is about the large-scale extraction, structuring and matching of data (I can't even remember that without looking at the heading, to be honest), but what I'm really trying to say is this: how did we manage to make sense of more than 100 million files at the place where I work?

My name is Deep Kayal, and here is a quick introduction of who I am: I work as a machine learning engineer at a company called Elsevier in Amsterdam. For those of you who do not know Elsevier, it began more than a hundred years ago as a publishing company, but it is now also moving into information analytics, deriving useful insights from scientific articles to help people in, for example, health care and education. One example is giving doctors the links between diseases and drugs that have occurred together in the scientific literature, so that if they are treating you for an illness they know which drugs are possible candidates. You need a pile of data to derive insights like that, and, as with any information analytics company, everything we do relies on data.

Sometimes, when you work with data, you have a ton of unpreprocessed, unstructured data and you are essentially trying to make sense of it. Sometimes you have the problem commonly known in the literature as record linkage: you have one production-quality database, your good data, and one data dump, a pile of data that someone has hand-delivered to you, over which you have no control, and which you somehow want to make sense of. We had that problem, and we keep having it repeatedly, because we want to enrich our production data as much as possible, and as data quality improves over time such enrichments will keep happening. I'm sure it is the same for all of you who work in the data industry.

So the problem was this. We specifically had many Hive tables into which we had extracted the relevant parts of scientific papers, mostly bibliographic information like the title, the abstract, the publication year and the DOI. A DOI, if you don't know it, is a digital object identifier, which acts like a primary key for scientific publications; it is mostly one-to-one, not always, but mostly. This was our good data: we knew exactly what it looked like, although its quality is of course still improving over time. And then we had a data dump which was, so to say, all over the place, to be honest, because we had no control over it: it was delivered to us, and is still being delivered to us, in pieces by third-party vendors, and they do their own magic. They do all sorts of things, like shipping zipped archives of PDFs instead of more structured content, and so on, so we didn't really know what we were looking at. And it was an even bigger problem than that, because there were over 100 million files in the pile, so it was really hard to get any kind of summarizing picture just by taking a look at it manually. So when we got this problem to solve, it was pretty daunting to start with, to be honest, because there were many things going on at the same time and we did not know where to begin.
But like all problems, you shouldn't just try to hack through it ("hacker" is a fancy term these days); you should approach it smartly and scientifically: you break the problem down into atomic tasks, you solve each task as best you can, and then you combine the solutions at the end, much like you would in a divide-and-conquer algorithm. In our case the relevant questions were: how do we untangle this mess and actually see what is in those archives? Once we do know what is in the archives, how do we make sense of it and extract any useful information; if we know that an archive contains, say, an XML or a PDF, what can we do to extract the title from it, so that we have matchable and meaningful information? Then, using the information we just extracted, how do we best match it against our production database? And finally, the recurring question running through all of these atomic subtasks: how do we do it at scale, because we are talking about tens of millions to hundreds of millions of files.

The tech stack, of course, was Python, which is our main programming language, and Spark for the processing. Python integrates surprisingly well with Spark, so if you haven't already tried it for your own problems, please do; it's really great.

Let's look at the first question, how to make sense of what is in the archives. Here you again have to be scientific, because there is no other way to generalize and automate such a process: every compressed archive is inherently different. You could pass the workflow on to other people, but you can't really pass on the code as such, because it will not work for everything. By taking slices of the data and looking at them manually, or by talking to the people who made the archive if you have access to them, you can make some reasonable assumptions about the archive, code against those, and then generalize, which is how you would probably write code anyway. Our data dump, when we took a look at it, seemed to be nested compressed archives: there were zips inside the zips, gzips, tars and so on, and what was inside them was mostly PDFs or XMLs, as you would expect for scientific papers.

So let's start with a very small example of how we distribute this data on Spark. If I don't go into the optimization details right now, it's because the talk is only 30 minutes, but I would be happy to talk about them outside.
Let's just look at this command: binaryFiles, called on the Spark context sc. The Spark context, by the way, is how you talk to a Spark cluster; initializing a Spark context means you have told the resource manager that it should await commands. So you pass your, say, magnificent thousand files to the Spark cluster through this binaryFiles command, and Spark is now ready to process those files.

The next thing you need to do is tell Spark what to do with them. For that you write, in Python, these little helper functions: a function that extracts information, and you can see that it takes a variable x. When you call binaryFiles, Spark creates what we know as an RDD, which is essentially a set of key-value pairs: the key is the name of the file, for example 1.zip, and the value is the binary content of that file. So when you see x being split into x[0] and x[1], x[0] is the name of the file and x[1] is the raw byte content. You then push that through the zip, tar and gzip libraries to actually get at the contents of the archive, and you read everything that is XML (that was our specific case, but you could choose to do the same for other types of content). Out of that you build a dictionary, which is again keys and values: the key is the path to that particular file, wherever it sits inside whatever zip, and the value is the content of the file dumped into the dictionary.

Now, if you ran only this extract function on Spark, you would get a list of such dictionaries, and you don't want that, because it is not very parallelizable. What you actually want is a list of tuples, a list of key-value pairs, so you have to flatten the dictionary, and that is what the flatten function does.

Once you have these functions and have tested them, you push them through Spark's map function, which distributes the work (there are other functions that distribute even more efficiently, but let's not go into that right now), you call flatMap with the flatten function to flatten the contents, and then finally you save everything you did into what is known as a sequence file. A sequence file, if you do not know what it is, is how Hadoop stores data efficiently: instead of storing one unit of information per file, it stores, say, one massive file containing a thousand of your files and another massive file containing another thousand, and in that way it can effectively distribute all your files across the different nodes and executors it has.
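A minimal sketch of what that load, extract, flatten and save pipeline might look like in PySpark. The bucket paths, the function names and the handled archive types are illustrative assumptions, not the exact code shown on the slides:

```python
import gzip
import io
import tarfile
import zipfile

from pyspark import SparkContext

sc = SparkContext(appName="archive-extraction")

def extract(x):
    """Turn one (archive_name, raw_bytes) pair into a dict mapping the path of
    every XML found inside the archive to its decoded content."""
    name, payload = x[0], x[1]
    found = {}
    if name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(payload)) as zf:
            for member in zf.namelist():
                if member.endswith(".xml"):
                    found[name + "/" + member] = zf.read(member).decode("utf-8", "ignore")
    elif name.endswith((".tar", ".tar.gz", ".tgz")):
        with tarfile.open(fileobj=io.BytesIO(payload)) as tf:
            for member in tf.getmembers():
                if member.isfile() and member.name.endswith(".xml"):
                    found[name + "/" + member.name] = tf.extractfile(member).read().decode("utf-8", "ignore")
    elif name.endswith(".xml.gz"):
        found[name[:-3]] = gzip.decompress(payload).decode("utf-8", "ignore")
    return found

def flatten(d):
    """Turn the dict produced by extract() into (path, content) tuples, so the
    result stays a flat, parallelizable pair RDD rather than a list of dicts."""
    return list(d.items())

# binaryFiles yields an RDD of (file name, byte content) pairs
archives = sc.binaryFiles("s3://my-bucket/vendor-dump/*")          # hypothetical path
xml_files = archives.map(extract).flatMap(flatten)
xml_files.saveAsSequenceFile("s3://my-bucket/extracted-xml-seq")   # hypothetical path
```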
So what you have essentially produced at the end is this sequence file, with a key, which is the name of the file, and a value, which is the whole content. I hope you can see it. With that you have achieved your first task of making sense of it all: out of the messy dump you have managed to pull out the contents of the files, and you can still trace every piece of content back to its source.

The next problem is how to extract meaningful information out of it, and here the approach is the same: you have to dig into the data, do some sampling and make assumptions that are good enough for your use case. For our use case we were trying to understand what kind of bibliographic information was in there, meaning titles, abstracts, DOIs and so on, and we were mostly looking at files that represented scientific articles, so XML files and PDFs. XMLs are relatively simple, because they are structured content anyway: you can just use Python's XML libraries inside your own function. With PDFs we had a harder time, because PDFs really aren't structured. It is an ongoing area of research and it hasn't been solved: there are many tools which claim to structure PDFs, but none of them works with the accuracy you would like. What we used was a tool called CERMINE, built, I believe, by a university in Poland, which uses machine learning to make sense of a PDF from both the textual information and the visual information, for example by comparing fonts: a header is in a bigger font, a section heading in a smaller one, and so on. That way it splits the PDF into a more structured format.

I'm not going to walk through the PDF part, because it is a bit more complicated, but let's see what we can do with an XML. Here you have an example XML from which, say, you want to extract the title, and you can see that the title sits within a tag that is conveniently called "title". That is mostly the case with these well-formed tags, because the title of a scientific article can only be called so few things, and the abstract of a scientific article can only be called so few things, so you can quite easily write a few rules that extract the very general fields like the title, the abstract and the DOI. It is of course much harder if you want to extract something like a section or the figures, but we didn't want to do that; what we wanted to extract was the overall metadata, and that was pretty simple to do.

So you write your own small parser for the title: you use the lxml library, you push your XML through it as a string, and you find everything that is called "title" or "citation title". That is what we learned by looking at the data, and most of our cases were covered by it; anything that isn't gets handed to another function later on. When you pass this XML through such a general parser, which is very easy to write, you actually get the title correctly extracted. You then write similar parsers for, for example, the abstract, the DOI, the journal, the volume, the issue, the page numbers and so on.
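For illustration, a title parser along those lines might look like the following minimal sketch, assuming lxml and the two tag names mentioned above (the exact tag list and the fallback behaviour are assumptions):

```python
from lxml import etree

def parse_title(xml_string):
    """Pull out the first element named 'title' or 'citation_title',
    ignoring XML namespaces, which differ from publisher to publisher."""
    if isinstance(xml_string, str):
        xml_string = xml_string.encode("utf-8")
    try:
        root = etree.fromstring(xml_string)
    except etree.XMLSyntaxError:
        return None   # malformed XML would be handled elsewhere in the real pipeline
    hits = root.xpath("//*[local-name()='title' or local-name()='citation_title']")
    return hits[0].text.strip() if hits and hits[0].text else None

print(parse_title("<article><title>A cure for malaria</title></article>"))
# -> A cure for malaria
```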
You then put all of these together in one meta function that takes your sequence file. Remember, the sequence file was a key and a value, where the key was the file name and the value was the content, and that is exactly what we do here: fname is the file name, which is x[0], and the content, which is the whole file itself, is x[1]. We want to process that content, and that is what all these parsers do.

Once we have the processed content back, we have to make Spark understand what it is getting. What Spark understands is a Row in a DataFrame, a table essentially. A Row, for us, would be the collection of the values you just saw: the file name, the DOI, the volume and so on. So you tell Spark: I am expecting the file name as the first value of the row, the DOI as the second value, the volume as the third, and so on. You create that structure, which in our case is called a Row, and when you get all your parsed information back, you push it through the Row so that Spark understands all of your information.

Finally, you want to do this not just for one file but for your entire repository, which is represented by the sequence file. So you use the sequenceFile command, which gives you an RDD, and then you push your parse-contents function through it with a map, which gives you back Rows. Once you have mapped it and collected it as a DataFrame, you have something like this, which is essentially a table of exactly what you wanted: from very unstructured data you have managed to make something queryable.
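A sketch of that step under the same assumptions as before; the field names, the generic parse_field helper standing in for the individual parsers, and the path are all illustrative:

```python
from lxml import etree
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("metadata-extraction").getOrCreate()
sc = spark.sparkContext

def parse_field(xml_string, tag):
    """Generic stand-in for the per-field parsers: first matching tag wins."""
    if isinstance(xml_string, str):
        xml_string = xml_string.encode("utf-8")
    try:
        root = etree.fromstring(xml_string)
    except etree.XMLSyntaxError:
        return None
    hits = root.xpath("//*[local-name()=$t]", t=tag)
    return hits[0].text.strip() if hits and hits[0].text else None

# The Row schema Spark is told to expect, one field per extracted value
Record = Row("filename", "doi", "title", "abstract", "volume")

def parse_contents(x):
    """Run every parser over one (filename, xml_content) pair from the sequence file."""
    fname, content = x[0], x[1]
    return Record(fname,
                  parse_field(content, "doi"),
                  parse_field(content, "title"),
                  parse_field(content, "abstract"),
                  parse_field(content, "volume"))

rdd = sc.sequenceFile("s3://my-bucket/extracted-xml-seq")   # hypothetical path
df = rdd.map(parse_contents).toDF()
df.show(5, truncate=False)
```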
A quick recap of where we are: we started off with a data dump where everything was all over the place; now we know what we are dealing with, and we have extracted some meaningful information out of it. The final task, and the most important one I guess, without which the whole exercise falls apart, is the matching: how do we match two things? There are many ways to match: approximate matches and exact matches. Joins are a good way to match, a very standard thing to do in SQL, if you have keys you can trust. Approximate matching is something people do more and more, with techniques like locality-sensitive hashing, or LSH, which is a very popular technique for approximate matching. But let's not go into that, because I am just trying to give you a hint of what you can do with Python and Spark, so let's see what we can do with exact matches only.

Exact matches, as I said, rely a lot on how you preprocess the data: if you have one title which is missing a dot at the end and another which has the dot, an exact match will not find it. So we had to do the preprocessing carefully to avoid such false negatives. The first step for that is to normalize the content. Here we have a quick normalizer which goes through the title and strips out the stop words; in natural language processing there is this notion of a stop word, which is a frequently occurring word like "if", "for", "the", "when", and we first get rid of those. It then gets rid of all non-alphanumeric characters like dashes, percent signs or question marks, and lower-cases everything. That becomes our match key. With that function defined, we load the table and push it through Spark again, telling Spark: please take my title and apply the normalize function to it. This is very similar to pandas, I think; if you have used pandas, you know what this does.
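A minimal sketch of such a normalizer as a Spark UDF, applied to the df built in the previous sketch. The stop-word list is a tiny illustrative one, and whether whitespace was kept or squeezed out of the real key is an assumption (a later audience question suggests it was removed):

```python
import re
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# A tiny illustrative stop-word list; the real list would be much longer
STOP_WORDS = {"a", "an", "and", "for", "if", "in", "of", "on", "the", "when"}

def normalize(text):
    """Build the match key: lowercase, drop non-alphanumeric characters,
    drop stop words, and squeeze the remaining tokens together."""
    if text is None:
        return None
    text = re.sub(r"[^0-9a-z]+", " ", text.lower())
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return "".join(tokens)

normalize_udf = F.udf(normalize, StringType())

# Add a norm_title column next to the raw title, pandas-apply style
df = df.withColumn("norm_title", normalize_udf(F.col("title")))
df.select("title", "norm_title").show(3, truncate=False)
```

In the same way the abstract column can be normalized into a norm_abstract key, which is used as an additional join key below.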
What this does is take the title and make another column, called norm_title, which normalizes your title in the way you described. If you look at "The management of acute kidney problems" and quickly browse its norm_title, you see that the "The" has disappeared, because we removed the stop words. This is a much better matching key than your raw text.

The next thing you do to complete your match is to join on the different keys, followed by a union of everything, so that you end up with one match file. You join on things like the DOI; of course you have to check beforehand that the DOI, or whatever key you are joining on, is non-empty, because if you join two things which are empty you get tons of false matches. So you join things on, for example, the DOI, you join on the normalization of the title that you just did, and you can similarly normalize the abstract and join on that. All of these joins are inner joins, because you want a one-to-one match. Finally you do a big union of all of them and then you take a distinct, because you don't want duplicates.
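Sketching that join-and-union step with two tiny stand-in DataFrames; the column names (pui, filename, norm_title, norm_abstract), the toy rows and the output path are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Toy stand-ins for the two real tables: 'prod' is the production side keyed by
# an internal PUI, 'dump' is the side parsed out of the archives, keyed by file name.
prod = spark.createDataFrame(
    [("pui-1", "10.1000/xyz123", "managementacutekidneyproblems", "someabstracttext")],
    ["pui", "doi", "norm_title", "norm_abstract"])
dump = spark.createDataFrame(
    [("1.zip/article.xml", "10.1000/xyz123", "managementacutekidneyproblems", "someabstracttext")],
    ["filename", "doi", "norm_title", "norm_abstract"])

def match_on(key):
    """Inner-join the two tables on one key, keeping only non-empty keys so that
    empty values do not produce a flood of false matches."""
    left = prod.filter(F.col(key).isNotNull() & (F.col(key) != ""))
    right = dump.filter(F.col(key).isNotNull() & (F.col(key) != ""))
    return left.join(right, on=key, how="inner").select("pui", "filename")

matches = (match_on("doi")
           .union(match_on("norm_title"))
           .union(match_on("norm_abstract"))
           .distinct())

matches.show()
matches.write.mode("overwrite").parquet("s3://my-bucket/matches")   # hypothetical path
```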
So finally what you have is this nifty little file with a PUI on one side (don't worry about the name, it is just the primary key we use, so the primary key of one of your databases) and, on the other side, for example the file name where that record is located inside the huge mess of things, which is the primary key of your messy data dump. So what you have just managed to do is match everything against everything else, and now it is ready for enriching.

In summary, what we managed to do here was this: we had some production-quality good data, we had a pile of data lying around that we wanted to use for enrichment and had very little clue about, and in the end we managed to match together everything that could be matched and make those matched pairs available for our products to use. We did that by effectively breaking down a huge problem into smaller, more tractable sub-problems, solving each of them by itself to the best of our abilities, and doing everything with Python and Spark. So that, to conclude the whole talk, is what you can do with Python and Spark in a relatively short amount of time, just to give you a hint of everything. Feel free to reach out to me now, outside, or over e-mail at the aforementioned address; we are always looking for people to work with us on problems like these, and on more machine-learning- and NLP-oriented problems, at Elsevier. So yeah, thanks a lot.

[Q&A]

Moderator: Thank you. Do we have any questions? I'll start with one of my own: which version of Spark are you using?

Speaker: 2.1. Spark 2.1 even has the LSH I mentioned, for example, so it also lets you do approximate matching quite flexibly.

Audience: If I understood your presentation correctly, you chose RDDs over DataFrames?

Speaker: Yes, we had to choose RDDs as a first step, because DataFrames assume structured data, right? We didn't really have CSVs or anything like that; we started off with a pile of files we had no clue about, a file embedded within an archive within a zip inside a repository.

Audience: So you couldn't use the Catalyst optimizer directly?

Speaker: No, we basically had to do it ourselves: we had to repartition the data ourselves and do the memory management ourselves. Good question, actually.

Audience: And which library did you use to talk to Spark?

Speaker: PySpark, the PySpark package from last year; I don't remember the exact version, but it's the standard one.
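Since the LSH support in Spark 2.1 came up in that first answer, here is a minimal, self-contained sketch of what approximate title matching with MinHash LSH could look like. The toy frames, the tokenization, the number of hash tables and the distance threshold are illustrative choices, not values from the talk:

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import HashingTF, MinHashLSH, RegexTokenizer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative frames; in practice these would be the production table
# and the parsed dump from earlier.
a = spark.createDataFrame([("pui-1", "The management of acute kidney problems")],
                          ["pui", "title"])
b = spark.createDataFrame([("1.zip/article.xml", "Management of acute kidney problems.")],
                          ["filename", "title"])

# Tokenize the titles, hash them into sparse term-frequency vectors,
# then fit a MinHash model over those vectors.
pipeline = Pipeline(stages=[
    RegexTokenizer(inputCol="title", outputCol="tokens", pattern="\\W+"),
    HashingTF(inputCol="tokens", outputCol="features"),
    MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5),
])

model = pipeline.fit(a)
a_hashed, b_hashed = model.transform(a), model.transform(b)

# approxSimilarityJoin keeps pairs whose Jaccard distance is below the threshold
lsh = model.stages[-1]
pairs = lsh.approxSimilarityJoin(a_hashed, b_hashed, 0.5, distCol="jaccard")
pairs.select("datasetA.pui", "datasetB.filename", "jaccard").show(truncate=False)
```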
Audience: Thank you. How many nodes do you use at your company?

Speaker: I think it was an x-large Amazon cluster with about eight nodes, nothing fancy.

Audience: And how hard is it for beginners to get started with this?

Speaker: It's maybe a day's worth of work, to be honest, so not very hard. Optimizing it is a bit harder; if you want something where you can quickly optimize things like the memory management, I would also suggest reading about Flink, which has a better memory optimizer.

Audience: Just a minor comment: when you generate a key like that, removing the stop words and the whitespace, you end up with a very long run-on string, like the acute-kidney example you mentioned. Why did you choose not to collapse the whitespace into an underscore instead? Then you could recover the original.

Speaker: That's a good point. We actually did much more than this; this was just a quick example to show you. We ended up using a tagger to remove everything that wasn't a noun phrase, because if you have a title, the assumption is that it summarizes or introduces some novel concept, and that concept is probably not an article or an adjective; it is much more likely a noun phrase. You might have a title like "A cure for malaria", and in those cases, especially for longer titles, it is far more useful to remove everything that is not a noun phrase. What we found worked best for us was not just removing the stop words but removing everything that wasn't a noun phrase and keeping just the nouns. So there were many preprocessing variants we tried, and we ended up with the one that worked best for us; this was just an example, but you are completely right, your suggestion would probably have worked as well.

Audience: Sure, and it's also possible to do something similar at the word level.

Speaker: Yeah, another good question, thanks.
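For illustration only, since the talk does not name the tool that was used: one way to keep just the nouns inside noun phrases is a simple chunker, for example with NLTK:

```python
import nltk

# Required NLTK resources (downloaded once)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

# A classic noun-phrase chunk grammar: optional determiner, adjectives, nouns
chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def noun_phrase_key(title):
    """Keep only the noun tokens that sit inside noun phrases."""
    tagged = nltk.pos_tag(nltk.word_tokenize(title.lower()))
    tree = chunker.parse(tagged)
    nouns = [word
             for np in tree.subtrees(filter=lambda t: t.label() == "NP")
             for word, tag in np.leaves()
             if tag.startswith("NN")]
    return " ".join(nouns)

print(noun_phrase_key("A cure for malaria"))   # -> "cure malaria"
```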
Audience: I just wanted to mention to the audience that PySpark was released on PyPI a few days ago, so you can now simply pip install it and use it directly.

Speaker: Thanks, that's good to know.

Audience: Could you tell us what reliability, what match rate, you got out of this?

Speaker: Good question. I deliberately avoided numbers in this presentation, because when you give people a figure like an accuracy, a lot depends on the configuration, on how you did things and on what you consider a valid match. But I can mention the throughput we got: I think we were crunching around 70 million files in about 5 hours, on a relatively small cluster, nothing fancy. And we got around a 50 percent match rate with just the digital object identifier, which, as I mentioned, is like a primary key for publications. We couldn't get more than that from the DOI alone, because a lot of the DOIs were missing in the data dump we got; when we added the title and the abstract it went up to around 76 to 80 percent. So we could match about 80 percent of the dump to our own data in order to enrich it. Is that the kind of thing you were looking for?

Audience: Yes, thanks. And how do you handle multiple languages? The presentation assumed English-language content.

Speaker: Absolutely. The short answer is that in the XML there is a specific attribute on a tag which gives the language, and in most scientific content that comes to us you can actually fish that attribute out and see what language the document is in. So no, we didn't just extract English; we extracted all the possible languages of the things in the dump before we matched. It's not that complicated, actually; it was relatively simple, given that the data already carries an attribute in the tag like lang equals "en" or "de" or something like that.
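A small sketch of what reading that language attribute could look like with lxml; the attribute names checked here (xml:lang, then a plain lang) are a reasonable assumption rather than the exact logic used:

```python
from lxml import etree

def detect_language(xml_string):
    """Return the language declared in the document, e.g. 'en', or None."""
    if isinstance(xml_string, str):
        xml_string = xml_string.encode("utf-8")
    root = etree.fromstring(xml_string)
    # Try the namespaced xml:lang attribute on the root first, then a plain
    # lang attribute, then any such attribute anywhere in the document.
    lang = (root.get("{http://www.w3.org/XML/1998/namespace}lang")
            or root.get("lang"))
    if lang is None:
        hits = root.xpath("//@xml:lang | //@lang")
        lang = hits[0] if hits else None
    return lang

print(detect_language('<article lang="en"><title>Example</title></article>'))  # -> en
```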
Moderator: We have time for two more questions.

Audience: Was there any advantage in using PySpark for the joins here, compared to using a relational database?

Speaker: I doubt we could have used a relational database directly, because we did all of the processing of the dump natively in Spark anyway. But we did compare what it would take for a single-machine program to do this, and the throughput was something like 20 thousand files an hour, which is significantly less; as I mentioned, we did roughly 70 million files in 5 hours, as opposed to 20 thousand files an hour. So yes, it's a good thing to benchmark it against a database, but as I said, we had a pile of data, and the first task was in fact to extract information meaningfully out of it into something structured at all. We didn't even have a second database to start with: we had one database to begin with and needed to build the other one. Thank you.

Moderator: Last question.

Audience: What would be your recommendation for doing this in a streaming fashion, maybe with Flink, for instance?

Speaker: Flink, for sure. I think I just mentioned this to someone else as well: Spark does streaming too, there is Spark Streaming, as you probably already know, but it streams in
mini-batches. The memory optimizer for Spark is good, but in my opinion it is still not the best. Flink is an up-and-coming project which is actually the other way around: where Spark streams in mini-batches, Flink is streaming-first and treats batches as small pieces of a stream, and it has better streaming processing and better memory management. So if you want to begin a streaming project and do it the proper way, try Flink as well; I would at least recommend reading about it and benchmarking it.

Audience: Why not Storm?

Speaker: Sure, but I haven't used Storm myself, I'm sorry, so I don't really have any experience with it. I have worked with Flink, and as I said, Flink and Spark are kind of similar in what they let you do; you can do machine learning on both, and so on. I haven't used Storm, so I don't have a benchmark to compare against. Sorry.

Moderator: Thank you, that's all the time we have for questions. Thanks again.

Speaker: Thank you.

Metadata

Formal metadata

Title: Large-scale data extraction, structuring and matching using Python and Spark
Series title: EuroPython 2017
Author: Kayal, Deep
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify and reproduce the work or its content in unmodified or modified form for any legal and non-commercial purpose, and distribute and make it publicly available, provided that you credit the author/rights holder in the manner specified by them and that you pass on the work or this content, including in modified form, only under the terms of this license.
DOI: 10.5446/33699
Publisher: EuroPython
Year of publication: 2017
Language: English

Content metadata

Subject area: Computer Science
Abstract: Large-scale data extraction, structuring and matching using Python and Spark [EuroPython 2017 - Talk - 2017-07-14 - Anfiteatro 1] [Rimini, Italy]
Motivation - Matching data collections with the aim to augment and integrate the information for any available data point that lies in two or more of these collections, is a problem that nowadays arises often. Notable examples of such data points are scientific publications for which metadata and data are kept in various repositories, and users' profiles, whose metadata and data exist in several social networks or platforms. In our case, collections were as follows: (1) A large dump of compressed data files on s3 containing archives in the form of zips, tars, bzips and gzips, which were expected to contain published papers in the form of xmls and pdfs, amongst other files, and (2) A large store of xmls, some of which are to be matched to Collection 1.
Problem Statement - The problems, then, are: (1) How to best unzip the compressed archives and extract the relevant files? (2) How to extract meta-information from the xml or pdf files? (3) How to match the meta-information from the two different collections? And all of these must be done in a big-data environment.
Presentation - https://drive.google.com/open?id=1hA9J80446Qh7nd8PMYZibtIR1WjMkdLXfDgwUlts7JM
The presentation will describe the solution process and the use of Python and Spark in the large-scale unzipping and extraction of files from archives, and how metadata was then extracted from the files to perform the matches.
