From an old-school data managing company to data analytics with Python
Formal Metadata
Title: From an old-school data managing company to data analytics with Python
Number of Parts: 160
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifier: 10.5446/33718 (DOI)
Transcript: English (auto-generated)
00:04
Thank you very much. So, I'm Stefan. This is Hendrik. Welcome to our talk. Yeah, it's our first time at EuroPython, and we really, really like the conference, and we are also very new to Python, and we would like to share some of our experiences or projects,
00:22
so we will talk about an implementation project, and how we will achieve our visions. One thing that's very important, we're talking here about old-school data management, and so we don't think old-school is bad, because if we remember the party yesterday,
00:40
and if you think about when the party started, it started when the DJ started to play old-school hip-hop songs, so we really like old-school, but we also would like to share some of our new ideas with you. Some words about us. So, we both wanted to join the community, and that's why we are both here. I'm more
01:02
responsible for products. Hendrik is our Python guy, our data analyst, and, yeah, he is doing some research in Karlsruhe at the KIT, where he's doing his master's, and he has a lot of experience in machine learning, neural nets, cognitive systems, and he's doing a lot of research on event detection in big data streams.
01:27
We are working for SOBIS, so just a few words about our company: we are a small company from Germany. Our core business is, at the moment, web applications, collaboration systems, and mobile solutions, and, yeah, within the next 20
01:42
minutes or 30 minutes, we would like to share our journey: how we discovered Python and why we are using Python. I will start a little bit with our vision, because at the beginning we will talk about one of our products that has existed for 20 years, and with Python we are able to implement these visions, and
02:01
then Hendrik will show you how we do this, and how we use Python to cluster communication and detect events, and another reason why we are here is we are the new guys, so everything is new, and we like it. If you have any feedback for us, it would be a pleasure if you talk to us later, because we can
02:21
learn a lot. I would like to start, and this is the only, let's say, sales slide, to tell you a little bit about our project, our solution. We are taking care of collaboration and communication in large projects, and the idea is that in this solution, the industry is taking care of
02:42
correspondence, documents, everything you need in these big projects, and they are sharing this information. This product has existed for 20 years now, and it all started with Lotus Notes Domino. Does anyone know this? Yeah, okay. So, unfortunately, today, not a lot of
03:02
companies are using Lotus Notes anymore, but 20 years ago, we started with this product using Lotus Notes. Let's say eight years ago, we switched to Java web applications, and I've learned a lot this week; we all love Python, but Java is a little bit strange. But in 2009, we decided, okay, we have to go to
03:23
a new technology, because all of our customers left Lotus Notes, so we had to make this decision, and this was also the time when we decided, okay, we have to implement some agile development methodologies, etc. And two years ago,
03:40
we finally discovered Python, because Python really helped us to implement our visions in our products. And before we get into this vision, what we had at that time, maybe a little bit more about the challenges that we have. In the industry, for example, our customers are building these kinds of big plants,
04:03
and these projects take two years, five years, ten years, fifteen years. We have a lot of communication and a lot of complexity. That means we have a lot of information that we have to take care of, and we have a lot of data that we have to take care of. And this is getting even worse when we
04:23
have a look at how many people are working in these big projects. So, we have different disciplines, we have external persons who are working in this project, engineers, commercial guys, sales guys, project management teams from different suppliers, consultants, etc. So, a lot of people are working together.
04:42
And when we have a look at communication in these projects, it's really a mess. We have thousands of mails; today we also have messengers; we have a lot of different systems where this information is stored. And it's really, really a big challenge for the industry to take care of this
05:01
information. And this is getting even worse when we're talking about portfolios. The big players in the industry are taking care of thousands of projects at the same time. And we have to find a way to manage this kind of communication and information. And there are a lot of
05:20
solutions out there that are trying to solve this problem. And what are we doing? Not just us, but everybody in the industry: we are trying to manage our project communication. In IT departments, we are using Slack or Jira today. But in these kinds of projects, people are still used to managing communication and data in folders. We have a
05:42
folder structure, and we organize this information. We have tools like correspondence metadata, and we have controlling possibilities with all this kind of manual work. So, you can define favorites, tags, reports, filters, full-text search, etc. I think you know that from your own solutions or from different kinds of solutions. The challenge
06:03
that we have here is that everything is manual. You have to classify everything manually, you have to organize your data manually in these kinds of systems. And this is something we call search content. So, if users who are using these kinds of systems want to
06:22
search for specific content or information, it's always like: yeah, I know the topic, I know that there is something in there, and I'm going into these kinds of systems, and I want to find it. We are doing that the same way in the solutions that we provide for the industry. So, everything is this kind of search content, where you
06:41
have to look for this information. But we had a vision. I mean, we all know Facebook and these cool technologies. So, is this kind of manual classification of information still state of the art today? Do we manually have to classify information? Do we manually have to classify correspondence? This was one question that we
07:04
asked ourselves. And the other question was, we can manage this data, but we always manage this data by, let's say, text, who sends it, who created a document. But we never use the core, the information that is in the document or correspondence itself.
07:22
And our vision was always, for the last years: can we change the way these people work, to support them by providing content and information to our users? So, our question was, how can we implement the possibility that our application provides
07:40
content? Can't we present these kinds of concepts to this specific audience? We always asked ourselves this, but we never found an answer to these questions, and we never had the technologies to do these kinds of things. So, what we did, we talked to our customers, like all the other companies do too. And when we did that, we got a lot of
08:04
information back, and we summarized it. So, we developed our vision. The challenge was that we could not implement our vision with our existing tools like Java and all these things. And just to summarize our vision: we wanted to
08:21
implement these cool features like recommendation engines. Can't we drill the information down to what a user needs at a specific time? Can't we use this project correspondence and communication data to identify domain experts in a project? If you're working in a big company, and we all have
08:43
all this information available, can't we profile users? Can't we tell our community in a company: okay, we have these kinds of experts? Can't we detect this automatically? Can't we identify trends and risks in projects? So that, if a project manager opens the program
09:03
in the morning and the application tells him, yeah, welcome back to your project, something important happened. And please have a look at that. And we also said, okay, can't we implement things like clustering and event detection? So, when we have a lot of correspondence, a lot of information, can't we implement an
09:22
automated process that brings this information together? And that's what we are going to show you now, because now we will show you how we implemented what we call machine learning as a service, which allows us to identify topics and clusters in
09:40
correspondence. And of course, we did that with Python, and that's why we're here. Hendrik will now show you how we did that in detail. Thank you. A warm welcome from me, too. I will show you how we solved some of the problems Stefan identified just now. I will talk about the task of identifying topics, hot topics,
10:05
and events in social stream data. As you can see, communication within projects, emails, and correspondence is just a social stream. So, what are topics, after all? Topics are basically labeled clusters. And clusters are points in space which belong
10:23
together due to their similarity. In the picture on the right side, you can identify three distinct clusters, the red one, the green one, and the blue one, and some outliers outside of the clusters. You can think of the points as communications or emails or tweets, anything which is a
10:44
communication, basically. And if you manage to put a label on a cluster, you have basically identified the topics. So, maybe the green cluster is concerned with Order 66, the red one is concerned with project management, and the blue one is concerned with invoices.
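As a rough illustration of "topics as labeled clusters" (not the speakers' actual pipeline, which uses their own similarity measures), a minimal sketch with scikit-learn could look like this; the documents and the number of clusters are made up:

```python
# Minimal sketch: topics as labeled clusters of short "correspondence" texts.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "please execute order 66 immediately",
    "order 66 has been confirmed by the commander",
    "project management meeting moved to friday",
    "updated project management plan attached",
    "invoice 4711 is overdue, please pay",
    "new invoice for milestone 2 attached",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)          # documents as TF-IDF vectors (points in space)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# "Label" each cluster with its highest-weighted terms.
terms = vectorizer.get_feature_names_out()
for c in range(3):
    top = [terms[i] for i in km.cluster_centers_[c].argsort()[::-1][:3]]
    members = [d for d, lab in zip(docs, km.labels_) if lab == c]
    print(f"cluster {c}: label={top}, {len(members)} documents")
```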
11:01
So, what are hot topics, after all? Hot topics are basically communications that belong together which grow exceptionally in a distinct period of time. It's similar to a
11:20
trend. A trend evolves over time. You could see EuroPython as a trend if you monitor the Twitter stream: it builds up and holds until shortly after EuroPython, as messages with the hashtag EuroPython are more frequent in this time
11:41
period. In contrast, we identify events in streaming data also as exceptional cluster growth, but in a shorter time period, or as the building of a new cluster, when there is a communication which is not similar to any other
12:00
communication and has to be put into a new cluster. I won't speak about noise, because this is another hard topic and I could fill more time with it.
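To make the idea concrete, here is a minimal sketch of "exceptional cluster growth", not the speakers' actual algorithm: flag a cluster as hot when its message count in the current time window greatly exceeds its historical average, and report an event when a cluster appears for the first time. The growth factor and the toy counts are invented:

```python
# Sketch of hot-topic / event detection via exceptional cluster growth.
from statistics import mean

# messages per past time window for each cluster, e.g. produced by the clustering step
history = {
    "invoices": [2, 3, 2, 2],
    "project":  [1, 1, 1, 1],
}
current_window = {"invoices": 3, "project": 6, "order-66": 4}

GROWTH_FACTOR = 3.0   # "exceptional" = more than 3x the usual volume (made-up threshold)

for cluster, count in current_window.items():
    past = history.get(cluster)
    if not past:
        print(f"event: new cluster {cluster!r} appeared with {count} messages")
    elif count > GROWTH_FACTOR * mean(past):
        print(f"hot topic: cluster {cluster!r} grew from ~{mean(past):.1f} to {count} messages")
```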
12:27
So, what information do we have available to identify these? Basically, we know our participants, we know our content, and we have the metadata, which is manually added to our communication and ordered by our customers. So, we can build a communication model. This is a social stream graph. People talk to other people, send messages to each other, and those are maybe tagged with the aforementioned
12:44
metadata. So, we compare messages and text to each other, but we also identify groups of people who belong together, as they are highly communicative within this group, and outliers, of course, who talk less with other
13:04
people, and so on. So, what is this graph, or each communication, built of? This is the atomic model; we call it a social stream object. Each communication consists of a sender, a content, depicted by
13:22
the edges here in the graph, and a set of one or more receivers. Basically, it's a hypergraph, as T1 depicts only one message which is sent to three people. If you compare it with big streams like Twitter, you would have a big audience: every Twitter user who is able to see your message, of course.
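A minimal sketch of such a "social stream object" as described, one sender, some content, and a set of one or more receivers, might look like this; the field names are a guess, not the speakers' actual schema, and the aggregation into a weighted sender/receiver edge list stands in for the social stream graph:

```python
# Sketch of the atomic communication model and a simple social stream graph.
from dataclasses import dataclass
from collections import Counter

@dataclass(frozen=True)
class SocialStreamObject:
    sender: str
    receivers: frozenset      # one message can go to several people (hypergraph edge)
    content: str
    tags: tuple = ()          # optional manual metadata

t1 = SocialStreamObject("alice", frozenset({"bob", "carol", "dave"}),
                        "kickoff minutes attached", tags=("project",))
t2 = SocialStreamObject("bob", frozenset({"alice"}), "re: kickoff minutes")

edges = Counter()
for obj in (t1, t2):
    for receiver in obj.receivers:
        edges[(obj.sender, receiver)] += 1    # edge weight = number of messages sent

print(edges.most_common())
```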
13:40
So, what do we do? Basically, the hardest part of it all is cleaning and normalization of the data. Of course, you have a lot of noise in any communication data. This is a machine learning problem, and
14:01
for example, to remove footer or reply lines and stop words from emails or other communication, we utilize neural networks, specifically a multi-layer perceptron, to remove those lines automatically from emails; it is trained on our company data.
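As an illustration of this kind of line-level noise filtering, here is a minimal sketch with a scikit-learn multi-layer perceptron. The training lines and labels are toy examples, not the speakers' company data, and their real feature set is presumably richer than plain character TF-IDF:

```python
# Sketch: classify email lines as "keep" (content) or "drop" (footer/reply noise).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

lines = [
    "please find the revised piping layout attached",
    "we need the updated invoice by friday",
    "the pump specification still has open questions",
    "best regards, john doe, senior engineer",
    "> on monday, alice wrote:",
    "this email may contain confidential information",
]
labels = ["keep", "keep", "keep", "drop", "drop", "drop"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),   # character n-grams for short lines
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0),
)
clf.fit(lines, labels)

email = ["the pressure test is scheduled for week 32",
         "kind regards, jane"]
cleaned = [line for line, tag in zip(email, clf.predict(email)) if tag == "keep"]
print(cleaned)
```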
14:23
Then we compare the textual similarity, the structural similarity, that is, who sends correspondence to whom, and the metadata
14:41
we have within our communication. The similarities, or ties, are basically relatively simple. They are the term frequency, inverse document frequency (TF-IDF) based cosine similarity between the correspondences or the clusters, and we also have bit vectors which depict the sender and
15:01
receiver sets within a cluster and a correspondence. And we use normalized tag mutualities between the different correspondences. Most algorithms are suited for fast streaming data, and email and company correspondence can be seen as very slow stream data. We expect one
15:24
value at the end, so we build a linear combination of the different similarities, generalized over as many similarity measures as we can gather. The weight lambda here is hard to infer; from our system domain
15:46
it seems that the structure, who sends whom an email, is much more beneficial for clustering than the actual content of the information.
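A rough sketch of such a combined similarity is below. The exact features and the weight lambda are the speakers' domain knowledge; the values, the participant-overlap function, and the sample correspondences are invented for illustration:

```python
# Sketch: combine textual (TF-IDF cosine) and structural (participant overlap) similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def participant_similarity(parties_a, parties_b):
    """Overlap of sender/receiver sets (a stand-in for the bit-vector comparison)."""
    a, b = set(parties_a), set(parties_b)
    return len(a & b) / len(a | b) if a | b else 0.0

corr_a = {"text": "pump specification rev 3 attached for review",
          "parties": {"alice", "bob"}}
corr_b = {"text": "comments on the pump specification revision",
          "parties": {"bob", "carol"}}

tfidf = TfidfVectorizer().fit_transform([corr_a["text"], corr_b["text"]])
text_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]       # TF-IDF based cosine similarity
struct_sim = participant_similarity(corr_a["parties"], corr_b["parties"])

lam = 0.7                                                     # structure weighted higher, per the talk
combined = lam * struct_sim + (1 - lam) * text_sim
print(f"text={text_sim:.2f} structure={struct_sim:.2f} combined={combined:.2f}")
```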
16:04
So, how do we do this? Well, we evolved from a Java company, and Java is not so good for data science. You just need too much time to build boilerplate code. You can't experiment fast on some new data or algorithm. So, yeah,
16:21
I would call it resting in this case. But, yeah, Python delivers awesome libraries which we utilize for exactly this purpose. So, we have Jupyter and Pandas for quick experiments on data or trying to implement some
16:40
machine learning algorithms. We have spaCy, an awesome, fast natural language processing library which we utilize to do lemmatization. Does everybody know what lemmatization is? Okay. Basically, we try to get the word stems, the base words from each word, to get a normalized representation of the word, which is really fast.
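A small lemmatization sketch with spaCy; the model name is an assumption (any installed pipeline with a lemmatizer works, e.g. after `python -m spacy download en_core_web_sm`):

```python
# Sketch: reduce each word to its lemma (base form) and drop stop words/punctuation.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The engineers were reviewing the revised specifications")

lemmas = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
print(lemmas)   # e.g. ['engineer', 'review', 'revise', 'specification']
```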
17:02
We use Flask to expose our services to our other solutions, and we use scikit-learn to implement, for
17:21
example, multi-layer perceptrons and support vector machines to identify noise in correspondences. And the results of our research are stored in MongoDB, which is a very good and very fast NoSQL database. But I guess everybody knows this.
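For illustration, a minimal sketch of persisting such findings with pymongo; the database, collection, and document fields are made-up names, and a local MongoDB instance is assumed:

```python
# Sketch: store and query clustering results in MongoDB.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumes a local MongoDB instance
findings = client["mlas_demo"]["cluster_findings"]

findings.insert_one({
    "cluster_id": 2,
    "label": ["pump", "specification"],
    "members": ["corr-0815", "corr-0816"],
    "detected_at": datetime.now(timezone.utc),
})

for doc in findings.find({"label": "pump"}):         # array fields match on contained values
    print(doc["cluster_id"], doc["members"])
```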
17:41
So, what's our workflow? It's slightly different. We came from normal iterative, incremental work, and now we have to do research, as some of the solutions just don't exist yet. You have to experiment to get the solutions right. So, basically, we
18:03
begin with a Jupyter notebook, do some work, try something out, and from there we go to a design, an implementation, tests, of course, and a partial deploy of a business functionality, you could say. But due to our inexperience with Python, between Jupyter and design
18:24
and between design and implementation, there are often some hiccups. Of course, you have to adapt to the Python rules and constructs, or it would be really ugly: you would be building Java solutions in Python. So, the cycle is a
18:41
little bit shorter, and you jump back, yeah, between implementation and the Jupyter notebook more often than wished for. So, how do we interface with our existing solutions? We build Java applications with their own
19:03
quite sophisticated security measures, authorization features, authentication, and their own object-relational mapper, like Hibernate, for example, a quite modified Hibernate, which makes things harder, and a multitude of databases which our clients want to be supported.
19:22
So, we have to expose our solution in other ways. And this would be, basically (Piazza stands for our peer system and Alpha analytics), an expose and control API, to revert control back to our other
19:43
solutions, a resource API which represents the findings our algorithms produce, and of course the processing and the state management. These are basically the simplest parts of the application, and we call it simply MLAS, Machine Learning as a Service.
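As a minimal sketch of how such a service could be exposed with Flask, which the speakers mention using; the endpoints and payloads below are hypothetical, not the actual MLAS API:

```python
# Sketch: a tiny read-only resource API plus a control endpoint.
from flask import Flask, jsonify, request

app = Flask(__name__)

# stand-in for results produced by the clustering / event-detection jobs
TOPICS = [{"cluster_id": 0, "label": ["invoice", "payment"], "size": 42}]

@app.route("/topics", methods=["GET"])
def list_topics():
    """Resource API: read-only access to the findings of the algorithms."""
    return jsonify(TOPICS)

@app.route("/jobs", methods=["POST"])
def start_job():
    """Control API: let the calling (Java) application trigger a new analysis run."""
    params = request.get_json(force=True)
    return jsonify({"status": "accepted", "project": params.get("project")}), 202

if __name__ == "__main__":
    app.run(port=5000)
```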
20:02
So, what are the challenges we had? The challenges are basically security concerns. We handle highly confidential material in the case of plant building or chemical sites, and, yeah, how do we
20:25
implement the security, or how do we guarantee that the security between our former systems and the new machine learning systems is maintained? The services therefore must be designed specifically, and we have to
20:44
adhere to some security standards within the industry. Interfacing was a problem because if we want to analyze data within a former database then we have to access
21:02
the database directly, which would break the security rules. So, we decided on a loosely coupled system, for example just read, don't write, of course. Then, we are relatively inexperienced with the scientific cycle, and so process iterations within iterations may be
21:26
not the best thing we can do and we have to gather more experience. And the last concern is an ethical concern, privacy.
21:40
The information we build can of course be misused to spy on workers, co-workers, how they work, and so on. So, it's a consumer problem: what do we really want to expose from the things we find in the customer's data?
22:03
So, both of these processes are not finished, and, yeah, advice would be welcome. With that, I want to thank you for listening, and if you have any questions we would be happy to answer them.
22:26
Okay, we have almost eight minutes for questions, and then we have the coffee break. So, great on time, excellent. So, any question?
22:41
Yes. Can you, specifically, thank you very much for the talk and for this tremendous transition from Java to Python, and specifically I'm interested in what exact tool set
23:01
from the Python standard library, or maybe something else. You don't hear me, right? Could you repeat the last part? I've understood Python standard library, but. Basically, what were the key things in your decision
23:24
to switch to Python from Java? The key thing why we didn't do Java, or the main point, was the search for natural language processing libraries, and we compared them;
23:42
we have speed constraints, and the fastest natural language processing library out there at the moment would be spaCy, and every millisecond added here or there would greatly slow down our solution space, of course,
24:02
and after researching spaCy a little bit in experimentation, the choice wasn't hard, because of course we need a multifunctional language on the other hand. So, data science tools like R, even if it sounds like a pirate language, which is cool, wouldn't come to mind,
24:24
and Python has great capabilities for interfacing with other technologies, of course. So, those were the basic reasons why we chose Python. Any other question?
24:42
Yes. Do you share some of your code open source in some way or on GitHub or whatever?
25:03
Basically, we share the code for the similarity measures and the tools, but not exactly the streaming code, unfortunately. The reason behind this is not that we won't share, that we don't want to share,
25:21
but there are customer-specific meta tags included which would give hints about processes within customers and customer projects, and we can't just do this. Any other question?
25:42
Yes. There were some hiccups, of course, and a little bit of a bumpy road. The thing is, the first things were not so hard.
26:01
The experimentation with Jupyter was easy, and Python is really a language which you can learn quite easily but master only with difficulty. So, yeah, we try to get more experience with this; right now my solution space seems a little Java-like, I guess,
26:24
but I have to forget some of the Java things and throw them overboard in my mind to get better systems.
26:42
Yeah, if I can, I actually won't go back to Java. Okay, but I have to admit we are still using Java for the product, so we are not only Python now; we're using both worlds. Any other question?
27:01
We have three more minutes for some questions. If not, I can ask a question myself. Very fascinating project and a great success story for Python, obviously. Yay, Python. Can you give us an idea of the type of scale
27:22
that you're working at? Like, I don't know, I imagine you have tons of email messages or chat messages coming in. If you can give a rough idea. You want to know how many, or? Yeah, how many messages go through your system.
27:43
It's project specific, but our bigger projects have 200,000 to 500,000 communications, and this is not the whole picture: there are also documents involved, which are quite large, and we search the document space as well,
28:02
which isn't mentioned here, because for the clustering process you don't need to inspect the documents at first. And this is across many projects, but about the number of projects, Stefan may be better able to answer. I would say in an average project, you will have about 6,000 to 10,000 emails a month,
28:23
so it's not really big data, but you have a lot of additional information that we evaluate, like actions or other messages, and an average customer of ours has about, let's say, 400 to 500 active projects, so that's the average amount of data.
28:43
Any other question? We have time for one question. Well, I can also ask the last question myself. Well, so you mentioned privacy issues
29:01
and I could imagine some people, especially workers in those companies, could be a little bit nervous. Do you get, I don't know, any feedback or any resistance from anybody, or are people pretty happy, do they see the advantages? Yeah, in the end it's the decision of a customer or company
29:23
whether they're introducing these kinds of systems, but we got some feedback. If you remember the slide where I said we talked to our customers: during these kinds of workshops we had a lot of controversial conversations, so you found every opinion. People who like Facebook also like these kinds of systems, for example,
29:42
but you also found engineers or project managers who say: nope, I don't wanna have such a system, because I can still think for myself. But this kind of tool, it's just a tool that should support your daily work, so it doesn't force you to do anything differently than before. So you find every kind of opinion regarding these tools.
30:03
Excellent, well thank you very much. Let's thank the speakers again.