From an old-school data managing company to data analytics with Python

Speech transcript
Thank you very much. So now we come to our talk. This is actually our first time at EuroPython, and we really, really like the conference. We are also very new to Python, and we would like to share some of our experiences from our projects. We will talk about an implementation project and how we achieved it. One thing is very important: we are talking about old-school data management, but we don't think old school is bad. If you remember the party yesterday, and think about when the party really started: it started when the DJ began to play old-school hip-hop songs. So we really like old school, but we would also like to share some of our new ideas with you. Some words about
us: we also want to join the community, and that's why we are here. I am mainly responsible for products, and Hendrik is more the Python guy and data analyst. He is doing research and has a lot of experience in machine learning and cooperative systems, and he does a lot of research on event detection in data streams. Just a few words about the company we are working for: we are a small company from Germany, and our core business at the moment is web applications and collaboration systems. Within the next 20 minutes or so we
would like to share our journey: how we discovered Python and why we are using Python. I will start with a little bit of motivation, because at the beginning we will talk about one of our products, which has existed for 20 years. We had visions for it, and now we are able to implement them; Hendrik will show you how we do this and how we use Python to cluster communication and detect events. Another reason why we are here: we are the new guys, so everything is new, and we like it. If you have any feedback for us, it would be a pleasure if you talked to us later, because we can learn a lot. I would like to start, and this is the only slide of this kind, by telling you a little bit about our project, or our solution. We take care of collaboration and communication in large projects. The idea is that with this solution the industry manages correspondence, documents, everything you need in such a project, and shares information. This product has existed for 20 years now, and it all started with
Lotus Notes. Does anyone know it? Yeah, OK. Unfortunately, even today a lot of companies are still using Lotus Notes. Over 20 years ago we started this product using Lotus Notes. About 8 years ago we switched to Java web applications. As we have learned this week, we all love Python, but Java isn't so strange either. In 2009 we decided that we had to move to a new technology, because all of our customers had left Lotus Notes, so we had to make a decision. This was also the time when we decided to make some major changes in how we develop. Two years ago we finally discovered Python, because Python helps us to implement our visions in our products. Before we get to the vision we had at that time, maybe a little bit more about the challenges that we
have in the industry. For example, our customers are building these kinds of plants, and these projects take 2 years, 5 years, 10 years, 15 years. We have a lot of communication and a lot of complexity, which means a lot of information and a lot of data that we have to take care of. This gets even worse when we look at how many people are working on the project. We have different disciplines and many different people working in the project: engineers, commercial guys, sales guys, project management teams from different suppliers, consultants, etc. So a lot of people are working together. When we have a look at the communication in this project,
it's really a mess. We have thousands of e-mails, we have messages, we have a lot of different systems where this information is stored, and it's really a big challenge for the industry to take care of this information. This gets even worse when we talk about portfolios: the big players in the industry take care of a great many projects at the same time. We have to find a way to manage this kind of communication and information. There are a lot of solutions out there that try to solve this problem. And what are we doing? Just like everybody in the industry, we are trying to manage all project communication. I mean, in our own
department we are using Slack today, but in this kind of project people are still used to managing communication and data in folders. We have folder structures to organize this information, we have correspondence management, and we have controlling possibilities, all with this kind of manual work. You can define favorites, tags and reports, there is a full-text search, etc. This is the same in our solutions and in other kinds of solutions. The
challenge that we have here is that everything is manual. You have to classify everything manually, and you have to organize the data manually in this kind of system. This is what we call "search content": people using these systems want to search for specific content. It's like: I know the topic, I know that there is something in there, so I go into the system and I want to find it. We do it the same way in the solutions that we provide for the industry: everything is this kind of search content, and you have to look for the information. But we had a vision. We all know Facebook and these cool technologies, so is this kind of manual classification of information still state of the art? Do we really have to classify information and correspondence manually? This is one question that we asked ourselves. The other question was how we manage the data. We always managed the metadata, say, the tags, who sent or who created a document, but we never used the information that is in the document or the correspondence itself. Our vision over the last years was always: how can we change the way these people work, and support them by providing content instead of making them search for it? So our question was: how can we implement the possibility that our application provides content, that it presents content to a specific audience? We asked ourselves this for years and never found an answer, although we knew that there are technologies to do this kind of thing. So what we did
was to talk to our customers, like all other companies do. When we did that, we got a lot of information, which we summarized, and from that we developed our vision. The challenge was that we could not really implement it with our existing tools like Java and all these things. Just to summarize, our vision is this: we want to implement cool features like recommendation engines. Can we show only the information that the user needs at a specific time? Can we use this kind of project correspondence and communication data to identify domain experts in a project or in the industry? If you are working in a big company, this information is not always available in the user profiles. Can we tell users who the experts are in their community or in their company? Can we detect these experts automatically? How can we identify trends and risks in projects, so that when the project manager opens the program in the morning, the application tells them that something important happened in a project and asks them to have a look at it? And we also asked whether we can implement things like clustering and event detection: when we have a lot of correspondence and a lot of information, can we implement an automated process that brings this information together? Hendrik will now show you how we implemented what we call machine learning as a service, which allows us to identify topics and clusters in correspondence. Of course we did that with Python, and that's why we are here. Thank you.
Welcome from me too. I will show you how we can solve some of the problems that were identified just now, and I will talk about
the task of identifying topics and hot topics in what we call a social stream. As you can see, the communication within projects, e-mails and correspondence, is just a social stream. So what are topics? Topics are basically labeled clusters, and clusters are points in space which belong together due to their similarity. In the picture on the right side you can identify three distinct clusters, the red one, the green one and the blue one, and some outliers outside of the clusters. You can think of the points as communications, e-mails or tweets, which are basically communications too. If you manage to put a label on a cluster, you have basically identified a topic. So maybe the green cluster is concerned with order 66, the red one with project management, and the blue one with invoices. So what are hot topics, after all? Hot topics are basically communications that belong together and grow exceptionally in a given period of time. It's similar to a trend. A trend evolves over time: you could see EuroPython as a trend if you monitor the Twitter stream. It begins slightly before EuroPython and holds until slightly after EuroPython, as messages with the hashtag EuroPython, so it spans a longer time period. In contrast, what we identify in streaming data is also exceptional cluster growth, but in a shorter time period, or the building of a new cluster, when there is a communication which is not similar to any other communication and has to be put into another cluster. I won't speak about labels here; that is another topic that would need more time. So what is the information we have available to build on? Basically, we know all participants, we know the content, and we have the metadata, which is manually put into the communication by our customers. So we can build a
communication model. This is a social stream graph: people talk to other people and send messages to each other, and those messages may be tagged with the aforementioned metadata. So we compare messages and their texts to each other, but we also identify groups of people who belong together because they communicate heavily within the group, and of course outliers who talk only little with other people, and so on. So what is this graph built from? Basically, each communication is the atomic model; we call it a social stream object. Each communication consists of a sender, a content, depicted by the edges here in the graph, and a set of one or more receivers. It is basically a hypergraph: the picture depicts only one message, which is sent to three people. If you compare this with streams like Twitter, you would have a big audience, because every Twitter user is able to see your message. So what are we doing, basically, for the hot topics?
First of all, cleaning and normalization of the data. Of course you have a lot of noise in any communication, and this is itself a machine learning problem. For example, you need to remove forwarded or reply lines from e-mails or other communication, and stop words. We utilize neural networks, specifically a multilayer perceptron, to remove those lines automatically from e-mails; it is trained on a lot of company data. Then we compare the textual similarity, we compare the structural similarity, that is, who sends correspondence to whom, and other similarities such as the tags and the metadata we have within our communications. The similarities at the moment are basically
relatively simple: a term frequency-inverse document frequency (TF-IDF) based cosine similarity between the correspondences or the clusters; a measure that compares the sender-receiver sets within a cluster and a correspondence; and a normalized measure of the tags shared between the different correspondences. Most algorithms are designed for streaming data, and e-mail and company correspondence can be seen as a very small kind of stream data. So we built a linear
combination of the different similarities and generalized it to the many similarity measures we can gather from the data. One thing we learned, which may be specific to our system domain, is that the structure of who sends an e-mail to whom seems to be much more beneficial for clustering than the actual content. So how do we do this? As we evolved from a Java company: well, Java is not
so good for data science: you need too much time for the boilerplate code, and you can't experiment quickly with some new data or algorithm. I would call it rusty in this case. So yes, Python
delivers us some awesome libraries which we utilize for just that. We have Jupyter and pandas for quick experiments and for trying to implement some machine learning algorithms. We have spaCy, a fast natural language processing library, which we use for lemmatization. Does everybody know what lemmatization is? Basically, we try to get the base word for each word, to get a normalized representation of the words, and spaCy is really fast at this. We use Flask to expose our services to our other solutions, and we use scikit-learn to implement, for example, multilayer perceptrons and support vector machines for identifying noise in the correspondences. The results of our research are stored in MongoDB, which is very good and very fast, but I guess everybody knows this. So what is our workflow? It is slightly
different now. We came from normal iterative and incremental work, and now we have to do research, because you have to experiment to get the solutions right. So basically we begin with a Jupyter notebook, do some work, try something out, and from there we go to design, implementation, test of course, and possibly deploy our business functionality, you could say. But due to our inexperience with Python, between Jupyter and design, and between design and implementation, there are often some hiccups, because you have to adapt to the Python rules, in contrast to what we were used to with the Java solutions. And from implementation we go back to the notebook more often than not. So how do we interface with our existing solutions? We have built Java applications
with quite sophisticated security measures, authorization features, authentication, and their own object-relational mapper, like Hibernate for example, a quite heavily modified Hibernate to make things harder, and a number of databases which our clients want to be supported. So we have to expose our solution in other ways. Basically, there is our existing system, and our analytics exposes a control API, to remote-control it from our other solutions, and a result API, which provides the findings of our algorithms, plus of course processing and state management. Basically this leads to what the other parts of the application require: loosely coupled machine learning as a service. So what are the challenges we had? The
main ones are basically security concerns: how do we handle highly confidential material, in the case of plant building or chemical sites? How do we implement the security, and how do we guarantee that the security between our normal systems and the machine learning systems is upheld? The services therefore must be designed specifically for this, and we have to adhere to security standards within the industry. So interfacing was a problem too: if we want to analyze the data with, say, a MongoDB database, we would have to access the database directly, which would break the security rules. So we decided to build a loosely coupled system with an export, for example, where we just read and don't write, of course. Then we are relatively inexperienced with the data science cycle, so our process iterations are maybe not the best we can do, and we have to gather more experience. The last concern is the ethical one, privacy. This information could of course be misused for spying on workers and co-workers, how they work and so on. So it is a real problem: what do we really want to expose from the things that we find in our customers' data? This whole process is not finished, and advice would be welcome. With this, I want to thank you for listening, and if you have any questions, we would be happy to answer them.
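The similarity measures described in the talk (a TF-IDF based cosine similarity on the text, a comparison of sender-receiver sets, and a linear combination of the individual measures) can be sketched roughly as follows. This is a minimal standard-library sketch, not the speakers' code: the `SocialStreamObject` fields, the use of Jaccard overlap for the participant sets, the whitespace tokenizer, and the weights are all assumptions made for illustration.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class SocialStreamObject:
    """One communication: a sender, a set of receivers, and its text (hypothetical model)."""
    sender: str
    receivers: frozenset
    text: str

    def participants(self):
        return self.receivers | {self.sender}


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    docs = [Counter(text.lower().split()) for text in texts]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in doc)
    # Smoothed IDF, so terms that occur in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in doc.items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Overlap of two sets; an assumed stand-in for the structural measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(text_sim, structure_sim, w_text=0.4, w_structure=0.6):
    """Linear combination of the measures; the weights are illustrative only."""
    return w_text * text_sim + w_structure * structure_sim


messages = [
    SocialStreamObject("anna", frozenset({"ben", "carl"}), "invoice for pump delivery"),
    SocialStreamObject("ben", frozenset({"anna", "carl"}), "pump invoice attached"),
    SocialStreamObject("dora", frozenset({"emil"}), "schedule for site meeting"),
]
vectors = tfidf_vectors([m.text for m in messages])


def similarity(i, j):
    """Combined similarity of messages i and j."""
    text_sim = cosine(vectors[i], vectors[j])
    structure_sim = jaccard(messages[i].participants(), messages[j].participants())
    return combined_similarity(text_sim, structure_sim)


print(similarity(0, 1), similarity(0, 2))
```

Weighting structure above text mirrors the speakers' remark that who talks to whom seemed more useful for clustering than the content; actual weights would have to be tuned for the domain at hand.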
OK, we have almost 8 minutes for questions, and then we will break for coffee. So, great timing, excellent. Any questions? Yes.
Thanks. What specifically made you adopt Python for this tremendous transition from Java? I'm specifically interested in what exactly it was: something from the Python standard library, or something else? Could you repeat the last part? I understood "Python standard library". Basically, what were the key things in your decision to switch to Python from Java? The key thing why we didn't do it in Java, the main point, was the search for natural language processing libraries. We compared them. We have speed constraints, and the fastest natural language processing library out there at the moment would be spaCy. Every millisecond added here or there would greatly slow down our whole solution. So after researching a little bit and experimenting with spaCy, the choice wasn't hard, because of course we also need a multifunctional language for the other parts. Data science tools like R, even if it sounds like a pirate language, would also come to mind, and Python has great capabilities for interfacing with other technologies, of course. Those were the basic reasons why we chose Python. Any other questions? Yes.
Are you going to share some of your code, or open your solution in some way or other? Basically, we could share the code of the similarity measures and the tools, but not exactly the streaming code, unfortunately. The reason behind this is not that we don't want to share, but that there are customer-specific metadata and texts included, which would give you hints about processes within customers and customer projects, so we cannot share this. Any other questions?
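Since the streaming code is not public, the idea mentioned in the talk, that a communication either joins its most similar cluster or opens a new one when nothing is similar enough, can only be illustrated generically. The following is a minimal sketch under stated assumptions, not the speakers' algorithm: the function names, the token-set Jaccard similarity, and the threshold value are hypothetical choices made for illustration.

```python
def token_jaccard(a, b):
    """Jaccard overlap of two token sets (an assumed, deliberately simple similarity)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_stream(texts, threshold=0.3):
    """Assign each incoming text to its most similar cluster, or open a new cluster.

    Returns the list of cluster indices, one per text, in arrival order.
    """
    clusters = []  # each cluster is represented by the set of all tokens seen in it
    labels = []
    for text in texts:
        tokens = set(text.lower().split())
        scores = [token_jaccard(tokens, cluster) for cluster in clusters]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            # Similar enough: the message joins the existing cluster.
            clusters[best] |= tokens
            labels.append(best)
        else:
            # Not similar to anything seen so far: this opens a new cluster,
            # which is one of the signals the talk describes for a hot topic.
            clusters.append(set(tokens))
            labels.append(len(clusters) - 1)
    return labels


stream = [
    "pump invoice attached",
    "invoice pump reminder",
    "site meeting schedule",
    "meeting schedule update",
]
print(assign_stream(stream))
```

Counting how many messages each cluster gains per time window would then give the growth signal the talk associates with hot topics; that step is left out here.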
Yes, in the end, yes, but with some hiccups, of course. It was a little bit rough; one of the first things where we were not so good was experimentation with Jupyter. And Python is really a language which you can learn quite easily but which is quite hard to master. So we are trying to get more experience with it. Right now my solutions seem to be a little bit un-Pythonic, I guess; I have to forget some of the Java things in my mind to get better. If I can, I won't go back to Java, actually. But I have to admit we are still using Java for the products, and Python too, so we use both. Any other questions? There are 3 more minutes for questions. Otherwise I can ask a question
myself. A very fascinating project and a great success story for Python. Can you give us an idea of the type of scale that you're working on? You have tons of e-mails and messages, chat messages coming in, and you build this graph; how many messages go through your system? It's project-specific. Of course, in a bigger project you can have 100 thousand to 500 thousand communications, but this is not the only thing, because there are also documents, which are quite large. We also research the document space, which I didn't mention; for the clustering process you first of all need to inspect the documents. About the number of projects, my colleague may be better able to answer. I would say in an average project you will have about 6 to 10 thousand e-mails a month. So it's not really big data, but you have a lot of additional information that you evaluate, like actions on messages. An average customer of ours has about 400 to 500 active projects. That's the average amount of data. Any other questions? We have time for one question. Then I can also ask one myself. You mentioned privacy issues, and I could imagine that some people, especially workers in those companies, would be worried. Did you get any feedback or any resistance from anybody, or are people pretty happy because they see the advantages? In the end it is the decision of the customer company whether to use this kind of system. But we got some feedback. If you remember the slide where I said we talked to our customers: in doing this kind of research we had a lot of controversial conversations. You hear every opinion. People who like Facebook like this kind of system, for example. But you also hear engineers or project managers say: no, I don't want to have such a system. We think about it this way: it's just a tool that should support daily work, so it doesn't force you to do anything differently. So you find every kind of opinion regarding this. Thank you very much. Let's thank the speakers again.

Metadata

Formal metadata

Title From an old-school data managing company to data analytics with Python
Series title EuroPython 2017
Author Hain, H.
Gramlich, S.
License CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license
DOI 10.5446/33718
Publisher EuroPython
Release year 2017
Language English

Content metadata

Subject area Computer science
Abstract From an old-school data managing company to data analytics with Python [EuroPython 2017 - Talk - 2017-07-14 - Anfiteatro 2] [Rimini, Italy] Our mission is to manage a huge amount of communication and document data in large-scale industry projects by providing web-based project management systems. The increasing amount of communication creates the desire for a GPS helping us and our customers to navigate through the communication stream. Our R&D projects are focusing on topics like clustering, event detection, and network analysis (who knows who, domain experts). Traveling the wild side of NLP, data science, and analytics, we stumbled across amazing Python tools supporting us in our goal to navigate the project communication and therefore supporting our clients in project and risk management, avoiding wrong turns. We would like to share some of our approaches to answer our research topics and challenges: One of the challenges, amongst others, is to utilize and adapt up-to-date clustering algorithms for social stream data and to expose them as reentrant services. Another one is to tailor them for the current application domain, improving clustering precision by parametrization and other means. Furthermore, the integration of a Python-based analytics system into an existing Java-based application environment and ecosystem is required. In addition, we would also like to share some of the "traffic jams" we experienced during our travel, starting as a traditional Java/SQL-focused company that integrated Python into its development portfolio
