NLPeasy - a Workflow to Analyse, Enrich, and Explore Textual Data
Formal Metadata

Title: NLPeasy - a Workflow to Analyse, Enrich, and Explore Textual Data
Series: EuroPython 2020 (talk 75 of 130)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
DOI: 10.5446/49928
Language: English
Transcript: English (auto-generated)
00:06
...this talk is going to be about NLP, and it's called NLPeasy, as we can see already. You're going to present to us an easy workflow. I'm really curious to see what that is. Yeah, sure.
00:20
So everything is ready for this, please start your session. Cool. Yeah, so thanks for having me here at EuroPython 2020. As you already said, I will talk about a workflow to analyse, enrich, and explore textual data.
00:41
You said NLP Easy, but I like to pronounce it NLPeasy, you know, easy peasy, language squeezy. Maybe first about me quickly: my background is in mathematics. I did a PhD in probability theory, then I spent a couple of years as a postdoc in machine learning at the University of Stuttgart,
01:03
and then I came back to Zurich, where I'm now a managing consultant with D1 Solutions. My projects there are mostly data science, machine learning, AI, some infrastructure, visualizations, coaching of data science teams, stuff like that. And a little bit in my free time,
01:22
I'm doing a couple of open source projects. The last one, PlotVR, I actually presented last year at EuroPython in Basel, and today I'm proud to present NLPeasy. A little bit about D1, maybe for a minute:
01:43
we are a consultancy with over 50 data professionals based in Zurich. Most of our clients are in Switzerland, a couple of them also abroad. We cover many parts of the data pipeline, or the data journey, of a company,
02:03
so it can be business consulting at the top level, then also data architecture, so how data should be set up in a company. We do a lot with data experience, so we can help you with Power BI or Tableau dashboards. We also have some award-winning visualizations
02:22
of things. We help with data management, so the pipeline of how data goes through a company. And yeah, machine learning and AI, that's what I'm here for today. We do smaller and bigger projects with that.
02:40
As for the NLP projects that we have done, just to give you a little bit of an idea of how NLPeasy came about, what were the precursors of it? One project that we are quite proud of is the product solution advisor for Bossard,
03:01
that's a company that sells screws and nuts and bolts, where we combined Elasticsearch and Neo4j. Then for a health insurance company, we actually abused word2vec for non-textual data on the claims. We also have a POC
03:22
where we ingest documents into Azure Cognitive Services and set up a platform for that called Hawkeye. We have done some customer feedback analysis with spaCy's syntactic dependency parsing, and other things. And finally, now this NLPeasy,
03:42
and yeah, we are proud that I could show a couple of those projects at national and also some international conferences. So what is the background of NLPeasy? In my experience, NLP obviously is a big thing.
04:01
It might be the next big thing. There has been big progress in recent years with respect to methods. It started, say, ten years ago with word2vec, which was really a game changer for NLP. And now, in the last couple of years, the deep models that are out there,
04:25
these are really important. And one extremely nice thing about the developments of the last years is that there are many pre-trained models. So you don't need to spend, as they say, a transatlantic flight's worth of energy to train your big BERT model
04:45
or something like that. You can just download it and start using it. That's the methods, on the one hand. On the other hand, there is abundant data. There is lots of textual data in corporations: you might have customer relationship management entries,
05:02
mails, documents, maybe customer reviews. There's text everywhere. And until now, for the standard data scientist, these kind of weren't that accessible. But everybody knows they are really important, because there are many use cases:
05:21
text classification, sentiment analysis, named entity recognition, and so on. But why aren't data scientists using these as a standard tool in their toolbox? I think there are a couple of things behind that. One thing is that NLP obviously is harder than, say, standard machine learning.
05:43
It's much higher-dimensional than what your usual machine learning methods as a data scientist are capable of handling. You also need some specialized preprocessing to convert those words into something that you can do machine learning on.
06:02
And one thing also is that NLP experts usually assume that the text is the only thing you're looking at; they want to extract everything from the text. And I think in most corporate situations
06:21
and exploratory situations, that's not the case. The texts are just one part of the data and you might have other things that NLP experts then call metadata. So you might have a longer list of columns and one or two of them are texts.
06:41
That's one reason why these NLP methods or packages usually don't fit that nicely into the standard data scientist's workflow. Also, the methods and models have a reputation of being really hard to use. That might be true or not. And some other standard tools
07:01
are cumbersome for textual data. If you want to plot something, okay, you would go for ggplot or Seaborn, but how do you use text with them? It's difficult to grasp results with texts there. Power BI and Tableau have some interfaces to text, but they don't show it too nicely.
07:22
And your SQL servers, obviously they can handle texts, but are they really equipped to search in them, and stuff like that? So NLPeasy is something like a vision, and it's actually a package that you can download, that tries to help you out with those things if you are not that big into NLP yourself.
07:47
So what is NLPeasy? Let's see. In the end, NLPeasy basically is a package. You have your data in some kind of data frame
08:02
where each record corresponds to a document, maybe with other information, and then you funnel it through NLPeasy. NLPeasy can help you with regexes, with spaCy, with VADER, stuff like that. That's one part: it will enrich your documents using other really cool giants
08:24
on whose shoulders we stand. For spaCy, maybe you listened yesterday to the talk about 15 things about spaCy. I love spaCy as well, it's a really cool thing, so if you're interested, please look at that talk.
08:45
That's one thing: enrich your documents with NLP methods. But the other thing is, how do you get access to the results? There we asked ourselves: how do you usually work with textual data in your everyday work?
09:02
You go to Google, you search for things. So our idea was: it needs to be something like Google, something that can search. And that's where we ended up with Elasticsearch. So one possibility with NLPeasy is to ingest everything into an Elasticsearch database.
09:23
That might sound a little bit too big if you're not accustomed to Elasticsearch, but we actually help you a lot with that, because we can start it for you on a Docker daemon that you have running.
09:41
We will see that. It's Apache License 2.0, you can install it and submit pull requests, and there's a demo Python notebook and so on. So how does it work? I'll go through a demonstration in just a bit, but first let me show you the basic gist.
10:04
Basically, you connect to an Elasticsearch server using just one line, and it might start it on your Docker daemon if you don't have something running already. Then you need to get your data.
10:20
NLPeasy cannot help you with that. And obviously, all the tools that you use for preprocessing your data, please use them on this data as well. For instance, here I scraped abstracts from the Neural Information Processing Systems (NeurIPS) conference. Then you start with NLPeasy: you set up a pipeline.
10:42
First, you say a couple of things about the columns you already have there: the message and the title, and maybe you have a date column like the year. Then you add some enrichment steps, for instance regexes. This one here parses LaTeX math expressions
11:01
out of the message column and puts them into a math column. VADER sentiment calculates a sentiment on the message, and spaCy enrichment does a lot of things: we use a spaCy model to extract entities and part of speech, and you can also go into dependency parsing
11:22
and stuff like that. Then you just ingest it, and it writes everything to Elasticsearch. That's really nice: you have it in a database that's well equipped for textual data. But for exploratory analysis, that's not the best thing yet. It would be much better
11:40
if you could just look at the data. And that's what we actually do for you as well: with one command, in Kibana, which is the graphing interface to Elasticsearch, it will set up lots of visualizations for you and put them into a final dashboard.
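As a sketch, the whole flow looks roughly like this, following the project's README; exact parameter names may differ between versions:

```python
import pandas as pd
import nlpeasy as ne

# Connect to a running Elastic stack, or else start one on the local Docker daemon
elk = ne.connect_elastic(dockerPrefix='nlp')

# One row per document, e.g. the scraped NeurIPS abstracts
nips = pd.read_pickle('data_raw/nips.pkl')

# Declare what is already there: text columns and a date column
pipeline = ne.Pipeline(index='nips', textCols=['message', 'title'],
                       dateCol='year', elk=elk)

# Enrichment stages
pipeline += ne.RegexTag(r'\$([^$]+)\$', ['message'], 'math')  # LaTeX math into a 'math' tag column
pipeline += ne.VaderSentiment('message', 'sentiment')         # numeric sentiment score
pipeline += ne.SpacyEnrichment(cols=['message', 'title'])     # entities, part of speech, ...

# Run the enrichments and write the documents to Elasticsearch
nips_enriched = pipeline.process(nips, writeElastic=True)

# Generate Kibana visualizations for all columns and combine them into a dashboard
pipeline.create_kibana_dashboard()
elk.show_kibana()  # open Kibana in the browser
```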
12:01
Usually that takes lots of clicks in Kibana; here it happens automatically for you. Maybe one more thing about how these different visualizations come about. In the beginning, you say: I have a couple of text columns
12:21
and one date column. After that, the pipeline knows message and title are text, and year is a date. Then you add this regex extraction that acts on the message column and outputs a math column. So the math column now, for each record, is a list of extracts.
12:42
So, a list of tags. That's not the same thing as text; it's more like a factor with its categories, something like that. Then, if you add the VADER sentiment, it knows: oh, there's a numeric column sentiment. And if you do spaCy enrichment, we add lots of columns here.
13:00
A couple of those are numeric, and a couple of those are tags. We will see that. And now, if you generate the dashboard, the text columns go into an overview on the one hand and into a nice word cloud on the other. Your numeric columns get into histograms
13:21
and your tag columns get into bar charts. Good. So let's see whether the demo gods are willing today. I set up a small data set here. So basically I scraped the list of sessions
13:44
at EuroPython 2020. That's a standard Beautiful Soup thing that you can do; I don't want to get too much into that. I just have here all of the talks with their title, the URL, and the list of authors
14:02
and the authors' profiles. Now, usually in pandas, if you had something like that, you would search for it using a cumbersome expression. Okay, you see I searched for NLP, but it might not be so easy to find where it actually matched.
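For illustration, a tiny made-up frame and the kind of search expression meant here; the data and column names are hypothetical:

```python
import pandas as pd

# Hypothetical mini version of the scraped talks table
talks = pd.DataFrame({
    'title': ['NLPeasy - analyse and explore textual data', 'Making pandas fly'],
    'url': ['https://ep2020.europython.eu/talks/a', 'https://ep2020.europython.eu/talks/b'],
})

# The cumbersome pandas way to full-text search, and it will not show
# you where inside the text the match actually occurred
hits = talks[talks['title'].str.contains('NLP', case=False, na=False)]
print(hits)
```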
14:21
So that's one thing. Another source that I took: during the voting phase of EuroPython, I also scraped all of the proposed talks there. About half of them actually did win.
14:42
This was much easier to get, since all of the abstracts were on that page. So here again, some Beautiful Soup scraping, and you see you now have interesting data: a title, a subtitle, an author, a list of keywords, the type of the talk, the Python level, the domain level, the abstract, and so on.
15:05
For instance, here you see the proposal types, there are different ones. Actually, there are a couple of duplicates, or twins, so we dropped them.
15:22
And now, we see here that all of the proposal titles appear among the talk titles, so the titles were not changed. That means we can just do a join here, and then we know whether a proposal did win or not.
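Under the hood, that join is a one-liner in pandas; `proposals` and `talks` are the two frames scraped above, and the column names are assumed:

```python
# Titles were not changed between proposal and accepted talk,
# so membership of the title is enough to flag the winners
proposals['won'] = proposals['title'].isin(talks['title'])
proposals['won'].mean()  # roughly half of the proposals made it
```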
15:40
Good. So that's the preparation part. Now, what happens with NLPeasy? As I told you, we import nlpeasy as ne. Then we start a new Elasticsearch, and you actually see there was no Elasticsearch found here:
16:01
it tried to connect to something, but there was no container running on my machine with that prefix. So it started an Elasticsearch and a Kibana here. I can actually click on this thing and go to it.
16:21
And then you see, okay, it's here. Good. You do need to have Docker installed, but the chances that you have Docker installed are better than the chances that you have Elastic installed. So I think that's quite nice. It also helps that you can have separate Elasticsearch services for separate projects.
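A minimal sketch of that step, assuming the `connect_elastic` helper as in the README; the prefix names the Docker containers, so distinct prefixes give you separate stacks per project:

```python
import nlpeasy as ne

# Tries to reach an Elastic stack with this prefix; if none is running,
# it starts Elasticsearch and Kibana containers on the local Docker daemon
elk = ne.connect_elastic(dockerPrefix='europython')
elk.show_kibana()  # jump to the Kibana UI in the browser
```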
16:43
So now let's look again at the, whoops, at the columns here. That's not bad.
17:02
We have title, subtitle, and so on. Title, subtitle, abstract, these are all texts. We have some tag columns like the author, the keywords, the type, and whether it did win. And we also pass the link to our Elastic stack right there.
17:23
And now we add, for instance, a regex here that should find all of the HTTP links in the abstracts. Actually, that doesn't work really nicely yet, but what the heck. We also add a spaCy enrichment; it takes a little bit of time to load this model
17:42
because it's something like, I don't know, 500 megabytes big. But the pipeline has not run yet, just so you know. One important thing here is that we also want to extract all the vectors, the spaCy vectors. These are maybe not as good as fastText vectors or BERT tensors or something like that,
18:02
but they're good enough for now. That's why we also need to go for at least the mid-sized English model here. And then we also add a VADER sentiment, okay? So let's hit it, and you see it takes a little while to process all of these documents.
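The vectors mentioned here ship with the mid-sized spaCy model; a quick standalone check of what gets extracted:

```python
import spacy

# The mid-sized English model comes with word vectors (the small one does not)
nlp = spacy.load('en_core_web_md')
doc = nlp('Natural language processing made easy')
print(doc.vector.shape)  # (300,) -- the average of the token vectors
```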
18:24
But they are now already ingested into Elasticsearch, so that's nice. Let's also create the dashboard. Here you see there are a couple of things inside Kibana that, if you start using Elasticsearch
18:43
or do something with it, you would need to understand. But in the end it's okay, because here we just set it up for you, so you don't really need to understand it. You see, it takes a little bit of time,
19:01
but then it's there, okay? And for the analysis later, we take only the proposals that won. So now let's go to the Elastic stack here, to the Kibana interface. And you see, we now have a dashboard here. Okay, let's dismiss this guy.
19:23
You have a nice interface here with the results, 152 of them. You have all of the authors, all of the keywords, the type. You have, for instance, the word cloud for the titles. You can see the entities in the abstracts,
19:44
the named entities that spaCy extracts. For instance, Python is by far the biggest entity that it finds, then API, Django, and so on. You have the sentiment, okay, that's cool: most of our abstracts are really nicely written. So let's see, if we now want to search for NLP...
20:05
NLP here, ah, we see there are four results. And what is really nice in this interface: it already highlights for you what these things are about. And it also gives you the information here.
20:22
And you can then say, I only want to see the ones that did win; now there are only three. So it's really fun to dig into it. You might also check here: this is my abstract in the table that's now ingested into Elasticsearch,
20:40
with all the additional variables that are not even visualized in the Kibana dashboard. So you see, in a couple of minutes you have it set up. But then you can also do more things. For instance, if we now look at these results, there are lots of things you can go for.
21:03
Let's do a hierarchical clustering on the spaCy word vectors for all of these documents. So here we pull out a variable. You see, for all of the 150 documents,
21:23
you have the word vectors here. They are 300 entries long. We use NumPy to stack them. Sorry, there are only 77 here, because I'm only looking at the ones that did win. And we do a clustering, and ta-da:
21:43
We can quickly show and see a first grouping of all of the talks based on the vocabulary that they are using in their abstracts.
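A sketch of that clustering step with SciPy; the frame name and the `vec` column holding the spaCy document vectors are assumed:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# 'docs' is the enriched frame of the 77 winning talks; assume the spaCy
# document vectors landed in a 'vec' column (one 300-dim array per talk)
X = np.stack(docs['vec'].values)   # shape (77, 300)
Z = linkage(X, method='ward')      # hierarchical clustering
dendrogram(Z, labels=docs['title'].values, orientation='left')
plt.tight_layout()
plt.show()
```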
22:01
Let's see. For instance, here, apparently 'Making pandas fly' is right next to my talk. And there's also a Pythonic full-text search talk. That's nice, no? Okay, and then I also tried to look at
22:23
all of the proposals again, not only the ones that did win. You can do something like t-SNE on that and visualize which of the talks did win, the green ones, and which didn't, the red ones here.
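And a sketch of that t-SNE picture with scikit-learn, again assuming a `vec` column and the boolean `won` column from the join earlier:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Project the 300-dimensional abstract vectors onto 2D
X = np.stack(proposals['vec'].values)                # all proposals this time
xy = TSNE(n_components=2, random_state=0).fit_transform(X)

colors = np.where(proposals['won'], 'green', 'red')  # accepted vs. not
plt.scatter(xy[:, 0], xy[:, 1], c=colors, s=12)
plt.show()
```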
22:41
And you see they're kind of intermixed in this t-SNE visualization. I also tried to train a random forest on it first; it didn't work nicely. Then I even tried to use BERT; it didn't go very nicely either. So probably it's not so easy to predict
23:04
which abstract will win and which won't. You also see that there are often pairs next to each other, and they are probably not independent: if one gets chosen, maybe the other one will fail.
23:23
So yeah, one was chosen over the other, something like that. Okay, that's the end of my demonstration. The demo gods were very helpful. You already saw the similarity clustering that we had.
23:43
Here we did it on customer reviews for restaurants in Zurich, from TripAdvisor. If you are from the Zurich area, you probably know most of them. You will see that Hiltl and Tibits are really similar in their way,
24:02
and so they're nicely in a cluster here. You have the beer places more over here, the very expensive ones over there, and so on. It's really nice that you can do stuff like that with NLPeasy. And you can also use the sentiment score on those restaurant reviews in Kibana.
24:22
Just by clicking, if you also have the geo coordinates in the Elasticsearch documents, you can set up such a geo view and overlay the sentiment directly.
24:40
It's really nice to work with those things. You can also do network visualizations. For instance, this is a kind of insider or whistleblower platform in Switzerland regarding financial news, and we used just the entity recognition
25:03
to link people and organizations, and you really see how this unfolds. One more thing: you can actually go to MyBinder and just start it, because we set it up for you there. MyBinder actually starts up Docker containers
25:23
for you, with two gigs of RAM, and we start up a Kibana and an Elasticsearch server and forward the ports over the URL. This works really nicely. Here, please wait maybe a couple of hours; I don't quite have it in the master branch yet.
25:43
But it will be there in just a moment. There are also other setups: what I showed you was Jupyter running locally, opening two Docker containers. But you can obviously also use Kibana and Elastic running on their own. So yeah, thanks.
26:01
My time is basically up. NLPeasy is open source, so please go ahead and look it up, pip install it. If you have PRs, they're welcome. The package is still under development; I do it mostly on my own time, so I can't invest too much into it.
26:23
But there are some upcoming features, like adding more stage plugins, for BERT or for cleaning. Also better support for incremental work: you run your pipeline a first time,
26:40
ingest it into Elasticsearch, and maybe add some more things later, and so on; supporting the usual data science workflow would be really important there. Also more stable APIs and documentation; there is some documentation already, but it could be better. And support for integrating the pipeline into a real ETL system.
27:02
That would be something cool. So if you're interested in NLPeasy or other projects, or, yeah, we are hiring, please contact me; here's the mail address. And I'll be available in the EuroPython Discord channel now. So thanks a lot. Yeah, thanks a lot for showing all of this to us.
27:21
Thank you.