
The Shogun Machine Learning Toolbox



Formal Metadata

Title The Shogun Machine Learning Toolbox
Title of Series EuroPython 2014
Part Number 40
Number of Parts 120
Author Strathmann, Heiko
License CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
DOI 10.5446/19980
Publisher EuroPython
Release Date 2014
Language English
Production Place Berlin

Content Metadata

Subject Area Computer Science
Abstract Heiko - The Shogun Machine Learning Toolbox
We present the Shogun Machine Learning Toolbox, a framework for Machine Learning, which is the art of finding structure in data, with applications in object recognition, brain-computer interfaces, robotics, stock-price prediction, etc. We give a gentle introduction to ML and Shogun's Python interface, focussing on intuition and visualisation.
-----
We present the Shogun Machine Learning Toolbox, a unified framework for Machine Learning algorithms. Machine Learning (ML) is the art of finding structure in data in an automated way and has given rise to a wide range of applications such as recommendation systems, object recognition, brain-computer interfaces, robotics, predicting stock prices, etc. Our toolbox offers extensive bindings with other software and computing languages, Python being the major target.
The library was initiated in 1999 and has remained under heavy development ever since. In addition to its mature core framework, Shogun offers state-of-the-art techniques based on the latest ML research. This is partly made possible by the 21 Google Summer of Code projects (5+8+8 since 2011) that our students successfully completed. Shogun's codebase has >20k commits made by >100 contributors, representing >500k lines of code. While its core is written in C++, a unique technique for generating interfaces allows usage from a wide range of target languages -- under the same syntax. This includes in particular Python, but also Matlab/Octave, Java, C#, R, Ruby, and more. We believe that users should be able to choose their favourite language rather than us dictating this choice. The same applies to the supported operating systems (Linux, Mac, Windows). Shogun is part of Debian Linux.
Features of Shogun include most classical ML methods such as classification, regression, dimensionality reduction, clustering, etc., most of them in different flavours.
All implemented algorithms in Shogun work on a modular data representation, which makes it easy to switch between different sorts of objects, for example strings or matrices. Common ML tasks and data IO can be carried out under a unified interface. This is also true for the various external open-source libraries that are embedded within Shogun.
Code examples are provided for all implemented algorithms. The main and most complete set of examples is in the Python language. In addition, in order to push usage of Shogun in education at universities, we recently started adding more illustrative IPython notebooks. A growing list of statically rendered versions is readily available from our [website]; they implement a cross-over of tutorial-style explanations, code, and visualization examples. We even took this up a notch and started building our own IPython notebook server with Shogun installed in the cloud (try the cloud button in notebook view). This allows users to try Shogun without installation via the IPython notebook web interface. All example notebooks can be loaded, interactively modified, and executed. In addition, using the Python Django framework, we built a collection of interactive web demos where users can play around with basic ML algorithms, [demos].
In the proposed talk, we will give a gentle and general introduction to ML and the core functionality of Shogun, with a focus on its Python interface. This includes solving basic ML tasks such as classification and regression, and some of the more recent features, such as last year's GSoC projects and their IPython notebook write-ups. ML material will be presented with a focus on intuition and visualisation, and no previous familiarity with ML methods is required.
## Key points in the talk
* What are the goals in ML?
* Example problems in ML (classification, regression, clustering)
* Some basic algorithm ideas
* Focus on Visualisation, not Maths
## Intended Audience
* All people dealing with data (data scientists, big-data hackers) who are looking for tools to deal with it
* People with a general interest but no education in Machine Learning
* People interested in the technology behind Shogun (swig, cloud notebook server, web-demos)
* People from the ML community (scipy-stack)
* ML scientists/Statisticians
Keywords EuroPython Conference
EP 2014
EuroPython 2014
Hello and good morning to everyone. The first speaker of the day will be Heiko Strathmann, and he will be speaking about the Shogun Machine Learning Toolbox.
I am quite surprised that it is this full here, given the time of the day. So yes, as said, my name is Heiko, and this is what I will be talking about. I can't really go into details, so this is a high-level overview of the project. First I will tell you a bit about what it is about, its size and what to expect. Then I am going to tell you a bit about what machine learning is about, and then about the machine learning features that we have in Shogun and what you can do with them. I can also talk a bit about technical details, the things that are going on under the hood, and I will close with some remarks on the nice community that has evolved around the project recently.
Since this is an open source project, everything that people tell you about these things comes from their own perspective, so here is some background about me. I first got distracted and pretended to be a musician, and then finally started studying computer science and machine learning in London. For those of you who know these words, these are my research topics. I have a particular interest in open source, so I joined Shogun in 2010 to bring together my open source and machine learning interests, and I am kind of guiding the project along. So what is machine learning about? How many of you are familiar with this term or what it means? OK, so
quite a few. So these are very high-level examples of applications, of what you can do, things that I have come across so far. Machine learning is the science of patterns and information. What does that mean? It is kind of an abstract thing and it involves a lot of mathematics, but what can you do with it? It is quite useful for automating things, for example for recognising things. One project I worked on last year was about detecting broken airplane wings. There was a company that injected ultrasonic sound waves into the airplane wing; the sound waves travel through the material, get reflected and come back, and you record them. If there is a little crack or something in the wing, you can see this in the reflection. You want to do this in an automated way, since currently humans do it, so we developed some tools to do this automatically.
Another one: I like going in the sun, and then I go to the skin doctor and they tell me off for getting sunburned. There was this device that took photographs of my skin and then scanned through them. What it actually does is look for characteristic patterns in these photographs that might indicate that something dangerous is going on.
Another nice example: two years ago I was in India and was using my credit card, and the card got blocked immediately. I called the bank, saying that I needed money, and they said they thought this was fraud: their computer system had told them that this was likely to be fraud.
There are more examples. You might want to predict things, rather than recognise or detect them. Another project I worked on was predicting how the HIV of a group of patients reacts to a certain treatment, whether it is resistant or not. We took the DNA of the HIV viruses of the individual patients and put it into our pattern recognition and machine learning algorithms, which then say, for example, not to give this patient this particular treatment. This is all learned from data.
There are more things in science, where we look a lot at brain scans and things like this. But there are also commercial interests: think of companies like Google, Amazon or Netflix, which recommend you things that you might like.
Some people are sometimes confused here: machine learning is very related to computational statistics, and there is a lot of exchange. For me it is really the same thing, though maybe machine learning wants to automate things, while statistics is more about understanding a certain process. All these buzzwords that are around, big data, or deep learning, which is a nice one currently, are kind of related. And obviously you can use all this stuff to build robots. OK.
So here is a bit about Shogun. This is our latest version. It is an open source project that has been public since 2004, so that means it has been public for 10 years. We have about 8 core developers who spend their time every day developing, and about 20 regular contributors, so it is quite a big project actually, given that it is all coming from the community. The original background is academia: people at universities developed this, and I work at a university. But we are getting more and more into applied areas. What really boosted the project is that a few years ago we started doing the Google Summer of Code. So far we have done 29 projects; that is 29 times 3 months of full-time work, so we got quite some impact from that. I will say this now and a couple more times: we are doing a workshop this weekend, Sunday and Monday, which is free, so feel free to drop in.
Here is a bit more about the size of the project. Who is familiar with Ohloh, the website? It crawls GitHub to get statistics. We have quite a few commits and quite a few contributors. I really like these comments here: we have a very low number of source code comments, a well-established, mature codebase, whatever this means, and 163 years of effort. And yet we get quite a few commits.
Here is the number of lines of code. I could talk about exponential growth and stuff, but I won't. We are at about half a million lines of code. This one is the number of commits per month. You can see the Summer of Code months, but even in winter we still have on average about two commits a day, so it is quite an active project. This is just to give you an idea of what we are talking about.
OK, machine learning. I don't know whether you can really see this well. OK, so the
most classic textbook machine learning can be categorised into supervised learning, unsupervised learning and some other categories. All these textbook algorithms we have implemented. This one is supervised learning. This is learning from data that somebody has labelled: somebody gives you some information that they know about the data, and you try to come up with a characterisation of the data that also works for data you have not seen yet. So you take all the scans of the DNA that you have so far, where you know whether the treatment was effective, and then you try to predict this for new patients. If you open a textbook, these are the methods you come across: support vector machines, which were popular a couple of years ago, Gaussian processes, and regression, which is something that big companies can be very interested in because you can parallelise it. All these things are implemented in Shogun. The other class of algorithms, of which we have quite a few, is unsupervised learning.
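To make the supervised setting described above concrete (learn from labelled examples, then predict labels for unseen ones), here is a minimal sketch in plain Python. It is not Shogun's API and not one of the methods named in the talk: it uses a simple nearest-centroid rule as a stand-in, and the data, labels and function names are made up for illustration.

```python
def train_nearest_centroid(examples, labels):
    """Compute one mean vector (centroid) per label from labelled training data."""
    sums, counts = {}, {}
    for x, y in zip(examples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def predict(centroids, x):
    """Return the label whose centroid is closest (squared Euclidean distance) to x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))


# Hypothetical toy data: two measurements per patient, label = treatment outcome.
train_x = [(1.0, 1.2), (0.8, 1.0), (3.0, 3.1), (3.2, 2.9)]
train_y = ["effective", "effective", "resistant", "resistant"]

model = train_nearest_centroid(train_x, train_y)
print(predict(model, (0.9, 1.1)))  # an unseen patient close to the first group
print(predict(model, (3.1, 3.0)))  # an unseen patient close to the second group
```

The split into a training step and a prediction step is the part that carries over to real methods such as support vector machines; only the rule learned from the labelled data changes.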
This is different: you just get a bunch of data with no information attached to it, and you try to come up with a characterisation of the process that generated the data. Again, these are textbook kinds of algorithms, like the clustering algorithm k-means: we have a bunch of points and assume, say, three clusters. What are the clusters? How can I characterise them? Can I use them for labelling things? We also have quite a few latent models, if you know what that means, which basically try to find a low-dimensional representation of your information in order to describe it in a more efficient way, for communication or for understanding.
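The k-means idea mentioned above (a bunch of points, an assumed number of clusters, and the question of what the clusters are) can be sketched in a few lines of plain Python. This is a minimal illustration of Lloyd's algorithm, not Shogun's implementation; the points and the choice of starting centres are assumptions made for the example.

```python
def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm: alternate an assignment step and a centre update step."""
    centers = [tuple(c) for c in centers]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre.
        assignment = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each centre moves to the mean of the points assigned to it.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centre if a cluster ends up empty
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Hypothetical 2D points forming three well-separated groups.
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (10.0, 0.0), (9.8, 0.2)]
initial = [points[0], points[2], points[4]]  # assumed starting centres

centers, assignment = kmeans(points, initial)
print(assignment)  # cluster index per point
print(centers)     # one centre per cluster
```

No labels are used anywhere: the cluster indices come out of the data alone, which is what distinguishes this from the supervised case.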
These are all textbook methods, but we also have quite a few researchers implementing their work in the toolbox; that is, for example, what I do. So some of this stuff is actually not available elsewhere. All these things were hot topics in machine learning recently, or still are, and they are all in there. To get a feeling for what is in there, have a look at our collection of IPython notebooks on the website. They are quite nice: they are kind of tutorials about the methods and what you can do with them.
When you do machine learning in practice, you run into all sorts of problems, like importing data, preprocessing and all these things. You can do this with the toolbox. You can have different representations of data, like dense matrices, sparse data, strings, collections of documents, data streams. This is quite nice and kind of a unique feature of the toolbox: we can handle all of these under a unified framework.
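The idea of handling dense, sparse and string data under one framework can be illustrated with a small sketch. The classes below are hypothetical and are not Shogun's classes; the assumption is only that every representation offers the same `dot(i, j)` operation, so that an algorithm written against that operation never needs to know how the data is stored.

```python
from collections import Counter


class DenseData:
    """Examples stored as full lists of numbers."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def dot(self, i, j):
        return sum(a * b for a, b in zip(self.rows[i], self.rows[j]))


class SparseData:
    """Examples stored as {index: value} dicts; zero entries are omitted."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def dot(self, i, j):
        a, b = self.rows[i], self.rows[j]
        return sum(v * b[k] for k, v in a.items() if k in b)


class StringData:
    """Examples are strings, compared through their character counts (a toy similarity)."""
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def dot(self, i, j):
        ca, cb = Counter(self.rows[i]), Counter(self.rows[j])
        return sum(ca[ch] * cb[ch] for ch in ca)


def gram_matrix(data):
    """An 'algorithm' that only relies on len() and dot(), not on the storage format."""
    n = len(data)
    return [[data.dot(i, j) for j in range(n)] for i in range(n)]


dense = DenseData([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [4.0, 0.0, 1.0]])
sparse = SparseData([{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 4.0, 2: 1.0}])  # same data, sparse form
strings = StringData(["ACGT", "AACG"])

print(gram_matrix(dense))
print(gram_matrix(sparse))
print(gram_matrix(strings))
```

The dense and sparse objects hold the same numbers, so `gram_matrix` returns equal results for both, and the same function runs unchanged on strings.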
Different data types, preprocessing tools, methods to evaluate your algorithms and to tune their parameters: all of that comes included to make your life easier. OK, that is already all I am going to say about machine learning. Here are some technical features that might be interesting for you guys. Shogun is written in C++, which some of you might find surprising at a Python conference, but we provide automated interfaces to a bunch of languages, and I will talk about
this in a minute. The reason why we wrote it in C++ is that we can expose the framework to a lot of languages, and since we are doing quite low-level things, we can write efficient code, handle the memory manually and these kinds of things. We use quite a few cutting-edge libraries for linear algebra and numerical computation, like Eigen, and we recently started using ViennaCL to do computation on the GPU. If you want to get a grasp of the interface, have a look at our class list, a Doxygen-generated class list.
OK, and now here is one of the nice things. We don't believe that it is good to tell users what programming language to use, but obviously, since a lot of Python users use it, many of them from research, we want to have an interface to it. We use SWIG. Does anyone not know what SWIG is? OK. That is the magic behind it. What we do is write our C++ classes, define a bunch of typemaps that convert C++ types to, say, Python types, and keep a list of classes that we want to expose. Then we press a button, and at compile time SWIG generates interfaces to all these languages. So whenever I implement a new algorithm, I press a button and then I can use it from Python; I am going to show you an example in a minute. This is quite neat, because we have interfaces to Python, Octave, Matlab, Java, R, Ruby and C#, and it is all the same interface, with certain syntactic changes.
In C++ it looks like this. If you know C++ code: you have pointers and templates, you define an instance of a class and a bunch of other instances, and then you call methods on these classes. Now we go to Python and do the same thing, but rather than a pointer to a matrix I pass in a NumPy array. It is really the same interface: define a bunch of instances, call a bunch of methods on these classes, and then get the prediction of a trained support vector machine. In Python the first index is 0, but if I go to Octave the first index is 1. In the other languages not much else really changes, and it is the same code that is running under the hood. That is quite neat, I think.
Finally, another thing that we love Python for is IPython notebooks. As I said, we use this
mainly for documentation, quite a bit. Another thing: we set up a web service where you can try Shogun without installing it. We are running an IPython notebook server in the cloud; you can connect to it with your GitHub account and then run our example notebooks. At the moment this might be broken, as we broke it this week. We also have a bunch of interactive web demos, like OCR recognition, built with Django, at this link here. OK, I am not going to talk about this much: we do testing with Buildbot, so we have quite a few builds, which is also quite nice: Fedora, FreeBSD, Windows, Mac, so Shogun should actually work. In the last two minutes I will
talk a bit about our community. This is really the nice thing about the project, for me at least: meeting all these people. We have a quite active mailing list and a very active IRC channel. This guy here [unclear] actually got back to us. We get all sorts of people, like this guy from Spain here, who is quite active. There are a few people sitting somewhere in Russia, just writing awesome code; it is kind of hard to talk to them because they don't speak English well. And this guy lives in Mumbai, works 26 hours a day and does really good stuff. So it is nice talking to all these people. Have a look at the contact page on the website.
Then the Summer of Code, as I mentioned before: this really boosted us. I assume everyone here knows about it. We currently have projects running, and I am mentoring three, so I don't really sleep these days, but that is quite cool. So if you are interested in machine learning, either in mentoring a project or in joining as a student, get back to us.
There are a few future changes. We just founded a non-profit association, because we want to be able to take donations, as many open source projects do these days. We are moving our license away from the GPL, and we are aiming at using Shogun for educational purposes but also in industry. We also organise workshops: here we have YouTube footage of last year's workshop. The next one is on Sunday and Monday. There is a hands-on session on Sunday where you can learn how to use Shogun from a practical perspective, and a bunch of talks, mostly science stuff. See the website if you are interested. It is free, and you can just come along and grab a beer or coffee with us.
So, last slide. This is quite intense stuff, so we always appreciate any kind of help. You can use the toolbox, find issues and give us feedback. You can fix bugs; we have hundreds of open issues on GitHub. If you are a super good C++ software engineer, you can help us with design problems that we have within the framework. You can write Python examples and notebooks; this is actually quite cool, it is super fun to write these examples. You can help with the documentation. We need help with the website, which is in Django; I don't know Django and how to use it, so we need people to help us. And if you have a machine learning algorithm that you want implemented, get back to us. Thank you. If you have questions: yes, please.
Yeah, so the question is: what is the difference between Shogun and other toolkits like scikit-learn or Orange? I guess scikit-learn is the most similar one; they are actually quite similar projects. The thing is, if you use Shogun you are not bound to Python; that is kind of a big difference. And as the project is written in C++, we can do things with memory that the Python people have more trouble doing. You can have huge data structures in memory and treat them really efficiently, so we have some really large-scale examples, with millions of examples, that run on a single machine. But otherwise there is also quite a bit of overlap. We take a lot of inspiration from the scikit-learn website, for example, which I think is brilliant, and from the whole way they document things. I know a few of the guys and I quite like the project. I think it is good to have a bit of diversity.
More questions? So the question is whether we have used machine learning to improve Shogun itself. That is a good one. Machine learning unfortunately cannot write memory-bug-free code for us. But sometimes I do a bit of data mining on the number of classes and how they evolve, these kinds of things. It is really more for visualisation and for marketing.
Yes? So the question is what I mean by large-scale. If you look at this example here, these ones are quite neat. This was done on a laptop. It is an example from bioinformatics, about splice site recognition. A splice site is a place in your DNA: the DNA is transcribed and cut into pieces before it is translated into protein, and you want to predict where this happens. There is a data set here with about 15 million examples, and the dimension of the feature space representation is 100 million. So that is quite big, and this runs on a laptop in a couple of hours. This works by the magic of defining data streams, streaming files from the network and then feeding them into the algorithms. But it is actually meant to run on a single computer; it is not distributed. OK, thanks guys.
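The streaming idea from the last answer (examples are fed to the algorithm one at a time, so the full data set never has to sit in memory) can be sketched as follows. The talk does not say which learning algorithm or file format was used; the perceptron update, the `label index:value` text format and all names below are assumptions chosen to keep the illustration short.

```python
def stream_examples(lines):
    """Lazily parse 'label index:value index:value ...' lines, one example at a time."""
    for line in lines:
        label, *pairs = line.split()
        features = {int(i): float(v) for i, v in (p.split(":") for p in pairs)}
        yield features, int(label)


def train_perceptron(make_stream, passes=5):
    """Online perceptron: weights are updated per example and stored sparsely in a dict."""
    weights = {}
    for _ in range(passes):
        for features, label in make_stream():
            score = sum(weights.get(i, 0.0) * v for i, v in features.items())
            if label * score <= 0:  # misclassified (or undecided): nudge the weights
                for i, v in features.items():
                    weights[i] = weights.get(i, 0.0) + label * v
    return weights


def predict(weights, features):
    score = sum(weights.get(i, 0.0) * v for i, v in features.items())
    return 1 if score > 0 else -1


# Hypothetical data; any iterable of lines works the same way, for example an open file.
lines = ["+1 0:1 7:1", "-1 3:1 9:1", "+1 0:1 5:1", "-1 3:1 5:1"]

weights = train_perceptron(lambda: stream_examples(lines))
print(predict(weights, {0: 1.0}))
print(predict(weights, {3: 1.0}))
```

Only the non-zero weights are kept, so the memory used depends on the features actually seen rather than on the nominal dimension of the feature space.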