PySpark and Warcraft Data

Transcript
Hello everyone, thanks for having me. This conference is amazing. Today I will talk to you about how much data I got out of the World of Warcraft API, and how I tackled that surprising amount of data.
My name is Vincent Warmerdam. I work at a company called GoDataDriven, where we build clusters and data-driven applications for clients; some of the models we build even end up being used by radio telescopes. I also often give open training sessions, in case you are interested. I originally started using Scala and Spark for work, but over time this became a hobby: I am a huge gaming fanboy, and I should be clear up front that I am in no way affiliated with Blizzard. Everything you see here exists because I like gaming and I like analyzing data. Again: I am in no way affiliated with Blizzard whatsoever.
Tonight I will first give you a brief description of the dataset, and I will explain why this task is a big technical challenge. Then I will go on to explain why Spark is an excellent solution to this challenge, show some code, and show how to get set up with your own Spark cluster. After that I will show you some surprising conclusions from the data, and if there are no more questions and there is enough time, I will also give you a brief demo of a Spark cluster I have running right now.
And if you are not that interested and you would rather think about other topics, fine, but just remember one thing, because I think it alone makes this talk worthwhile: if you know Python and you want to do big data, Spark performs, it scales, and it plays well with the current Python data stack. Note that the Python API is currently a little more limited than the Scala one, but the project is gaining abstractions quickly, so you can expect way more in the future. For those of you who haven't heard about it yet: there is this game called World of Warcraft. It is amazing, millions of people play it, and there are all these orcs and humans, and they are always fighting.
The game basically always looks something like this. The most important part of the game, however, is not necessarily the daily fighting itself. For much of the player base, the reason people play is the loot: some monster will drop, say, a very shiny sword, and that is the whole reason why you play. At the beginning the loop is very simple: you keep getting stronger, which means that you can fight stronger monsters, which means that you can get stronger equipment, which you can use to fight stronger monsters, and so on; it becomes a recursion. All of which is fine, but the items are one of the main parts of the game, and they play the main role in what I will show you. What you can also do, which is an interesting part of the game, is take professions: you can collect flowers and make potions, or you can collect a bunch of iron ore and make goods you can use or sell. World of Warcraft has a huge auction house that you can use to trade virtual goods for virtual gold, and you can use that gold to get better, faster, stronger gear, which you then use to collect more items, and so on. This auction house was one of the things that was opened up through Blizzard's API (which nowadays works a bit differently due to the Warlords of Draenor expansion, but about a year ago, when I was looking into it, I could see every single auction that was open at that moment). So the data that I have is not the actual sales; it is a snapshot of all the listed prices at a moment in time. So what does this data look like?
Well, this dataset is extremely cool. I checked last week, and we still have about ten million people playing this game. There are about a hundred-plus identical instances of this game: every server that you can play on in Europe is an exact copy of another world, yet people on it behave slightly differently, maybe because they differ in economic background. This is interesting because all the economic laws that we have in our normal life should also work there, and in real life it is very hard to get a perfect measurement of an economy: in the real world there are different prices for a packet of milk in every region, and it is very hard for me to measure that, but in World of Warcraft I have perfect measurements. So this experiment is actually very interesting. These slides will go online, so you can also read the full description, but basically, for every auction I have an ID of a product (you can just type it into Google and get an actual picture of the product), the current bid price and the buyout price (it is an auction house just like eBay, so the current bid can differ from the price you would pay to buy it outright), the quantity of the product (you might not sell just one flower but a whole stack), the owner of the product, and the server the product is on. There are a lot of questions you can think of when you have data like this; you can start formulating nice hypotheses to test. Do basic economic laws like supply and demand hold? Is there such a thing as an equilibrium price? There are all these different servers, and economically you would argue that if the price for an item is ten gold on one server, it should be about the same on another server, since people are playing essentially the same game. And is there a relationship between how many pieces of a certain item are listed and the price? It is very hard to do any of this research in real life, considering the problem of collecting the data, but here the data is already there: World of Warcraft is a very nice experiment. Then comes the downside. The API gives snapshots every two hours, and that adds up to gigabytes of data very quickly. If I were to analyze one snapshot, it would probably fit in memory on one machine, so I could use something like pandas. But if I want to do weeks' worth of data, then we start hitting limits. So what to do? It is not trivial; we cannot just open this in Excel. A possible approach, if you are thinking in terms of file formats, is an easy one: drop JSON for something like CSV, and that will save you a bunch of megabytes. I also tried things like HDF5, which was better and technically works, but these approaches won't scale: they work for today, but in a week, when there is more and more data, they fail. The problem is that this approach scales vertically: you buy a bigger server every time the data expands. So we have a problem.
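To make that single-machine dead end concrete, here is a minimal sketch of that first attempt; the file name and column names are made up for illustration:

    import pandas as pd

    # One two-hour snapshot still fits comfortably in memory.
    snapshot = pd.read_json("auctions_snapshot.json")      # hypothetical file name

    # Swapping formats (CSV, HDF5) shaves off megabytes but only scales
    # vertically: the file, and the machine, just have to keep growing.
    snapshot.to_hdf("auctions_snapshot.h5", key="auctions")

    print(snapshot[["item", "bid", "buyout", "quantity"]].describe())  # hypothetical columns

This is fine for one snapshot; it is the weeks of snapshots that break it.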
People call this a big data problem, and the best definition of a big data problem I have heard came from a guy working on Hadoop: you are dealing with a big data problem whenever your data is simply too big to fit on a single disk. That seems like a sensible definition.
My personal way of thinking about how to tackle this problem is to think about what you would do with buildings. If I want a bigger building, the non-scaling version is to build a bigger, more expensive one; an alternative is to just use many small ones. The idea behind big data is similar: instead of one big server that holds the full dataset, you split the data into small bits, spread those bits among a cluster of peers, and analyze all the data in parallel. So let's take the many-small-ones approach. What does that look like? There is this thing called Hadoop, which I am sure you have heard of before, and the idea is that you have a distributed file system. How it works, in layman's terms (I won't go into mechanical detail here): on a computer cluster you have one name node, or master node, that basically keeps track of where all the files are. Every file is split into chunks, and each chunk is replicated across the entire cluster, so if a single node goes down you will always have the data left and can still do the analysis. Since the master node's only job is to keep track of where all the files are, you can connect as many slave nodes as the data and the situation demand, and these can work in parallel; the only thing they ask the master node is where the data is. And this scales: every time I have more data, I just add more slave nodes. Horizontal scaling, just like that. The idea is then that we write MapReduce code, so that we bring the code, the analysis, to the data, instead of bringing all the data to the analysis.
On top of that you can write MapReduce jobs, and this is where the new technology, Spark, comes in. Spark can run on top of a setup like this and can express all the queries MapReduce can, but the main idea is to try to always keep the computation in memory.
If you go to the website you will see these very nice performance benchmarks (this was, I think, a linear regression benchmark) where, out of the box, Spark can be up to 100 times faster than MapReduce if the data fits in the memory of the cluster, and about 10 times faster on disk, among other optimizations.
You can try this out very easily if you download Spark, install it locally, and run some Spark jobs. Because Spark is written in Scala, it is built upon the JVM, and if you run a query against it you will notice that there is a Java process taking up all the CPUs you have: even if you just run Spark locally on one machine, it will try to do as much parallelizing for you as possible. This comes out of the box; if you follow the API, you don't have to think about threads, and it will probably just parallelize for you. That is very nice, especially because I don't really have time to figure out how Spark works internally. It gives me a set of patterns, and as long as I am aware of them, I can do lovely, lovely analyses on a lot of data.
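As a minimal sketch of that out-of-the-box parallelism (assuming a plain local install of Spark 1.x; in the bundled pyspark shell the context sc already exists for you):

    from pyspark import SparkContext

    # local[*] tells Spark to use every core on this one machine.
    sc = SparkContext("local[*]", "try-spark-locally")

    # A trivial parallel job: the JVM workers square a million numbers
    # spread across all your cores, with no thread handling on our side.
    squares = sc.parallelize(range(1000000)).map(lambda x: x * x)
    print(squares.take(5))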
The API, as it turns out, is not too difficult. You can declare a variable called text_file; you then tell Spark that there is a text file located somewhere. It can be on HDFS, it can be on the local disk, it can also be on S3; all of these approaches work. From then on, all the commands you run are basically done in a functional style. Suppose this were a text file and I wanted to do a word count; here is what the code would look like. From the text file I do a flatMap to split out every word, then I map every word into a tuple of the word and the value one, and then I reduceByKey, basically summing everything up. The Spark docs have tons and tons of examples like this word count. I believe Spark held the world record for the fastest sort at that moment, set maybe a year or two ago. So it is a functional style, which is not how I always write code, but it works.
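Spelled out in PySpark, that word count looks roughly like this (the path is a placeholder; hdfs://, file:// and s3n:// style paths all worked at the time):

    text_file = sc.textFile("hdfs:///some/text/file")   # or local disk, or S3

    counts = (text_file
              .flatMap(lambda line: line.split())       # every word becomes a record
              .map(lambda word: (word, 1))              # word -> (word, 1) tuple
              .reduceByKey(lambda a, b: a + b))         # sum the ones per word

    print(counts.take(10))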
Spark has nice features like being fully lazily evaluated, and it is super fast because it uses distributed memory: only when it needs to does it spill to disk, and it doesn't load the entire dataset if it doesn't have to. So we have distributed memory, it scales by literally just adding nodes, it has very good Python bindings, and as of recently it also has support for SQL statements and a thing called the data frame. What they have done is build a distributed data frame: basically like pandas, but on a cluster. It runs on top of the Hadoop technology stack, you have a connection with S3, so if you are running a huge web store you can use this and have your models run there too. It has even got machine learning libraries that work in parallel: even if you have an algorithm like linear regression, which usually only works on one machine because you are doing the (X'X) inverse calculation, in Spark, if you use gradient descent methods, it will just work in parallel.
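For that machine learning claim, a hedged sketch with the MLlib API of that era (Spark 1.x); the three data points are invented just to show the shape of the call:

    from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

    # LabeledPoint(target, [features]); say, price against quantity listed.
    points = sc.parallelize([
        LabeledPoint(10.0, [1.0]),
        LabeledPoint(9.0, [2.0]),
        LabeledPoint(7.5, [4.0]),
    ])

    # Gradient descent distributes across the cluster where the closed-form
    # (X'X)^-1 X'y solution would not.
    model = LinearRegressionWithSGD.train(points, iterations=100, step=0.01)
    print(model.weights, model.intercept)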
These machine learning libraries just work, and Spark works with the Hadoop setup your company has already invested in, so it is very easy to get started. One of the coolest things about Spark is its resilience, and that it is lazily evaluated: you run operations on immutable data, and before it runs a single operation, Spark first builds a directed acyclic graph of all the operations you mentioned on the data and then internally tries to optimize the query, which means you don't have to do a lot of that yourself. There is also multi-language support: there are even bindings for R nowadays (I have been able to play with RStudio in the web browser connected to Spark), and if you like Scala, Scala is supported and there is lots of stuff there. And of course users can enjoy Hadoop and MapReduce underneath.
So how do we get a cluster? The cool thing about Spark is that there is a commercial company, Databricks, which is heavily invested in building most of the tools around it. If you download Spark and just go to the ec2 directory under the root directory of the distribution, there is a command there with which you can set up a cluster on Amazon. Here, for example, I have a permissions file against which I am saying: these are my credentials, I want to be in this region, whenever Spark clusters I want this many machines, I want them to be of this type, and go launch it. That one command will start the whole cluster for you (it takes about fifteen minutes, give or take), and you can tear the cluster down just as easily. Remember there is also S3, so any data that you have can sit in S3, and if you log into a machine, everything described above is set up for you. In terms of getting started this is way easier than anything you would usually do with, say, Ansible scripts and other ways of provisioning: spark-ec2 does this for you, out of the box. Then, if you want an IPython notebook connected to Spark, that is also very easy: you basically point to the pyspark library, which comes on the master node, provisioned for you, then you say "my Spark master is here" and create a Spark context, and this makes sure that all the commands run on the cluster. Then you can start with simple commands. All of this will also be on my slides (I am going through it very quickly), but starting Spark clusters is something basically anyone can do: it is just one command. Next, say I have a huge file, for example 40 gigabytes of World of Warcraft auction data, sitting on S3. I tell the Spark context: read the file from S3 here, and make 30 partitions of it, so it takes the entire file and cuts it up into 30 pieces across the cluster. And again, because Spark is lazily evaluated, what will happen is that the first time I ask for something like a count, only then will it pull all this data in from S3, which is why, if I run it a second time, it is actually quick: what Spark will do, if you tell it to, is keep data in memory that you will use later on. This gives you a huge performance boost over something like Hadoop, which really does require you to write to disk first before you can look at the data again. Also note that if you just run a transformation, it doesn't really do anything yet: it just creates the operation graph. The things that actually run against the cluster are called actions, and only when you run one of those does the work actually happen and the memory actually fill. That is the old-school Spark API, the RDDs; the newer data frames are getting more and more attention. You can take any text file you want, describe a lambda to parse it, for example with json.loads, then declare that there is a Row structure in it, and from that moment I have a distributed data frame, distributed across the whole cluster, which means that many pandas-like operations become usable on big datasets.
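Pieced together, the setup he walks through looks roughly like this; the cluster name, key names, bucket, master address and column layout are placeholders, and the spark-ec2 flags are the ones documented for Spark 1.x. The resulting auctions data frame is reused in the sketches below:

    # On your laptop, from the ec2/ directory of the Spark download:
    #   ./spark-ec2 -k mykeypair -i mykey.pem -s 8 --region=eu-west-1 launch wow-cluster
    # ...about fifteen minutes later, on the provisioned master node:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row
    import json

    sc = SparkContext("spark://master-ip:7077")   # point at the master spark-ec2 reported
    sqlContext = SQLContext(sc)

    # Point lazily at ~40 GB of auction snapshots on S3, in 30 partitions.
    raw = sc.textFile("s3n://my-bucket/wow-auctions/*.json", 30).cache()
    raw.count()                      # first action: data is pulled in and cached
    raw.count()                      # second run: served from cluster memory

    # The newer API: parse each line into a Row and get a distributed data frame.
    auctions = sqlContext.createDataFrame(
        raw.map(lambda line: Row(**json.loads(line))))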
Let me now go through a couple of very simple queries, just to give you a feel for it; if you know pandas, this should feel very familiar. For context: these queries were run on AWS, on a cluster with a master and eight slave nodes, each of which has about 7.5 GB of memory, and the file all these queries analyze is about 20 gigabytes.
I will give you a moment to figure out what this does, but if you have used pandas, it should feel very familiar. This is a distributed data frame, there is a column in it, and I group by that column; the grouping then comes with a couple of built-in aggregation functions, basic ones like sum and mean, and what you do is specify a dictionary to say: for this group I want the sum of the buyout value. I can also tell Spark to collect the result and convert it to a pandas data frame: if you are doing a huge computation and the result is small enough to fit on one machine, this is a fairly simple way to do the big data on the cluster and the small data locally again, which is a very common use case. It is also nice because it lets us do things Spark doesn't necessarily do out of the box yet, in easy ways. Here is a slightly more complex query, but it should still feel rather familiar. Suppose I have an item (this is the item id); I can filter on it, and I can also say that the buyout has to be above zero; then I can take all the rows and count them, and I can calculate the mean buyout. So here I am counting this one item across all realms, trying to see how many of these items are on a server and what the mean buyout is, and then I show the first results. There is even support for more complicated things: you can take the functions that Spark already has, import them, and put those into the aggregate function. This way you can also give the resulting columns your own custom names, which is useful if you want to do more pipelining after the aggregation. This is what that looks like: take this query, notice I am grouping, then aggregating, then filtering, and at the very end I say take only the first five. And this is what the actual plan looks like for that query: these are all the operations it is going to do. You can see it says things like partitions, aggregate, exchange, filter, limit, map; Spark will figure out the best way to do this computation, in memory where possible, and this plan is also shown to you through the Spark UI, which automatically comes along when you install Spark. There is even some support for user-defined functions. Python is not a statically typed language the way Scala is, so the trick right now is that you define a user-defined function by handing Spark a lambda and then telling it what type should come out; the function can then be used to create a new column. This is, for now, the Python way to use user-defined functions in Spark, and it feels a bit verbose, but if you want a performance boost out of it you do need the types, so it makes sense that it forces you to provide them.
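Reconstructed as hedged PySpark 1.x sketches; column names like buyout, item and quantity follow his description of the dataset, and the item id is a placeholder:

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # groupBy with an aggregation dictionary, then pull the small result local:
    per_item = auctions.groupBy("item").agg({"buyout": "sum"})
    pdf = per_item.toPandas()                   # big computation, small pandas result

    # Filter one item id, count the rows, and take the mean buyout:
    one_item = auctions.filter(auctions.item == 12345)          # placeholder id
    one_item.count()
    one_item.agg(F.mean("buyout").alias("mean_buyout")).show()

    # Ask Spark for the plan it optimized for you (also visible in the web UI):
    per_item.explain()

    # User-defined functions: hand Spark a lambda plus the output type.
    per_unit = F.udf(lambda buyout, qty: buyout / float(qty), DoubleType())
    auctions.withColumn("unit_price", per_unit(auctions.buyout, auctions.quantity))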
Now, all these clusters cost money; a common reaction to this part of my talks is that we are not just buying one computer, we are renting many, and you might get the impression that this is expensive. That is not necessarily the case: big data can be expensive, but not really here. If you pick the right topology, transferring data from S3 to a machine within the same region is free; you don't pay any transaction costs for that. Storing, say, 40 gigabytes on S3 costs a small amount per gigabyte per month, usually around a euro or so in total, so the storage of the data is nothing. The actual cost is the CPU time of the machines I am renting, so let's be generous with the estimate. Say I am only going to be using this cluster for six hours in a work day, I have eight or so machines, and the machines I have been using cost about seven cents per hour each: after one whole day of crunching data, it will cost me a total of about fifteen dollars, max. Being able to throw machines away when you don't need them and spin them back up later makes all the sense, and the start-up time is about fifteen minutes. So if you have a lot of data, tens of gigabytes, and you only need to analyze it, say, once a week for a recommendation batch script or something like that, there is no need to keep a permanent Hadoop system anymore; with this technology you are way better off tearing the cluster down. You do have to be willing to put all the data on S3, so this is not a likely solution for a bank, but if you just want to get a dataset analyzed it is very workable; I mean, I am willing to spend fifteen euros on my hobbies. OK, let's talk about a few results, because I think that is the reason most of you are here anyway. All these queries have been done with some form of Spark, and I should apologize a little: these figures are from about a year ago, so take them as a snapshot in time. These were the most popular items back then, around the Warlords of Draenor expansion: stuff like Netherweave Cloth, of which supposedly something like a million listings were up across the entire World of Warcraft, an item called Golden Lotus, and Spirit Dust. These are all items that you can collect with professions, by being a herbalist and so on, and that leads to a question: if you can take professions, which one pays best? The main use case for collecting these items is to sell them on the auction house and get gold back. Doing this analysis properly is actually a bit tricky, because at different character levels you collect different items, and analyzing every single slice of experience levels is a lot of work; so I just looked at the items you can collect at levels 10 to 20 and computed the mean gold you can get. For skinning it is 2.6 million, for herbalism 2.3, and for mining only 1.5.
Now again, these are early-level items, so take this with a grain of salt, but these are some very quick things you can just do with Spark. Another thing that turns out to be fun: you can look at the buyout value of all products, and because I know the owner of every listed product, I can ask who holds the value. If you sum it per user, rank the users, and look at the top slice, it turns out that 1% of Warcraft owns about 25% of the auction house value, which is an interesting result. The query that does this groups by owner, sums the buyout value per user, sorts those sums, and then basically takes the top of the list. In a way this is just a few lines of code in Spark, even though you are handling gigabytes of data.
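That concentration number is, as he says, only a few lines; one hedged way to get it, with column names carried over from the earlier sketches:

    from pyspark.sql import functions as F

    # Sum the listed buyout value per owner on the cluster, then rank locally.
    per_owner = (auctions.groupBy("owner")
                 .agg(F.sum("buyout").alias("gold"))  # alias avoids version-specific names
                 .toPandas()
                 .sort_values("gold", ascending=False))

    top = per_owner.head(int(len(per_owner) * 0.01))   # the richest 1% of sellers
    print(top["gold"].sum() / per_owner["gold"].sum()) # ~0.25 in his data, reportedly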
Another thing I thought seemed interesting: players can sell items in stacks, so you can list one item, or five, or twenty items in the same box on the auction house. So I was wondering, taking Spirit Dust as an example: which stack sizes appear more often? It can be anything from 1 to 20, and it turns out 20 is kind of popular, 5 and 10 are kind of popular, and anything else doesn't really happen as much. What might be interesting to know, whether you are a consumer or a seller: does the average price per item depend on whether I sell in stacks or not? If the market priced things perfectly, it shouldn't, and that is actually what we see in this one picture: if I look at stack sizes 1 through 20 and how the per-unit price is distributed, it doesn't really matter. It also doesn't matter much whether you are on the Alliance auction house or on the Horde one: the price distributions per item just shift a little from realm to realm, and if I look at the medians they also move around a little, but nothing too significant. Being part of the Alliance or part of the Horde does not really matter for prices. That is not something that will shock the economists here, but if you are a World of Warcraft player it is useful knowledge. Also keep in mind that I didn't check whether these things actually got sold; the only thing I can see is that they are listed at this price. Then another interesting thing, and this was my main interest in this data: for every server and side I can calculate the mean buyout price, and I can calculate the number of items that are around. Logically, if there is a lot of Pixie Dust around, you would expect prices to drop, whereas if the item is rare it becomes more expensive. Turns out it is very hard to find an item that actually behaves this way. If you look at this graph, every red dot denotes a Horde auction house and every black dot an Alliance auction house, one of each per server; this axis is the market size and that one is the mean buyout. It seems like there are a couple of situations where there is little volume in the market but very high prices, however those are only a few points, and they might as well just be outliers. So now we get to the more complicated stuff: does basic economics hold in all of this? One useful thing you can do is calculate regression coefficients per group. If you have a linear regression, let's say that my y variable, the thing I want to predict, is the price, and my x variable is the size of the market. If the coefficient, the slope itself, is positive, then a bigger market comes with a bigger price; if it is negative, that means that when the market goes up, the price drops. So if I calculate this number for every product, for every server, for both factions, Horde and Alliance, then I should be able to filter out the values that are negative, because a negative slope means that the quantity on offer pushes the price down. In the end, not one single item had this characteristic. I might have made a mistake, that is quite possible, but it also makes you wonder what that actually means, and that is the odd part of this part of the talk.
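A hedged sketch of that per-item slope hunt, simplified to treat every auction row as one observation; the faction column and the use of numpy's polyfit are my assumptions for illustration, not necessarily his exact method:

    import numpy as np

    def slope(pairs):
        # pairs: [(quantity_on_offer, buyout_price), ...] for one group.
        qty, price = zip(*pairs)
        if len(set(qty)) < 2:
            return None                      # cannot fit a line through one x value
        return np.polyfit(qty, price, 1)[0]  # leading coefficient = the slope

    slopes = (auctions.rdd                   # drop to the RDD API for the custom fit
              .map(lambda r: ((r.item, r.server, r.faction),
                              (r.quantity, r.buyout)))
              .groupByKey()
              .mapValues(lambda pairs: slope(list(pairs))))

    # Supply and demand predicts negative slopes: more on offer, lower price.
    slopes.filter(lambda kv: kv[1] is not None and kv[1] < 0).count()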
That is kind of where the analysis stands. So yeah, there is still lots of work I should be doing on this, but all in all, being able to poke at this amount of data on my own is already something. So, quick conclusions. Spark is easy to pick up: if you know pandas, it already feels very familiar, it is easy to get started, and there is more coming. There are things I haven't talked about: there are streaming tools for real-time work (not available in Python yet, but that might come), and there is GraphX, so you can have graph algorithms working for you. And the machine learning algorithms will just work, even on bigger and bigger datasets, so there is less need to downsample your data before it goes into a machine learning algorithm. Some final things to keep in mind. Don't forget to turn the machines off: the whole economic benefit assumes that you turn your machines off. Don't be like me and leave the machines on for over a week because you have gone on holiday to Turkey; my boss was not amused. Also, this is not really meant for multi-user setups: if you have a cluster with many analysts on it, what Spark will do is go to the resource manager and say "give me all these resources, those are mine", and that is that. So if you have a notebook open, it is hogging the Spark cluster's resources, and if you leave it open over the weekend, you are going to come back on Monday to colleagues who had no resources to use. Keep that in mind: a Spark cluster like this is for a single user. And then the main point of all of this: if your dataset is simply too big for the tools we have right now, that is where Spark starts to shine. Pandas is still way preferable for smaller data at the moment in terms of flexibility, and Spark is getting there but is still younger. So please test it for yourself before you commit: if you have a dataset that is, say, forty gigabytes or bigger, a very small benchmark is a very good way to decide. Now there are two things I can do: let you ask questions, or give you a quick demo of Spark, live, in real time. Let's do the demo.
So what I have here is a distributed data frame that has already been preloaded; this is the World of Warcraft dataset, live, right now. What I also have is a URL: this is my Spark UI. This was a job I just ran, and the UI visualizes it. If I now group and count things on the distributed data frame, I should be able to see that some of these executors are working and how much memory is being used, and this is the distributed job being handled in real time on the cluster. And if I want something a little more interesting, I think I have an example right here: grouping and summing over the buyout values. That obviously takes a little longer, but again, this is distributed work happening live; you can cache this result, and there are plotting tools as well. I am doing this through the REPL, but you can also set up a notebook, and I am going to try to see if I can get some plots in there as well.
I think that is mainly it. Most of the images in my presentation come from The Noun Project; credit where credit is due. And that is all I have. Thank you.
Q: We have a few minutes for questions. Thanks for the example, but I'd like to know: would you also recommend this for small datasets? I would love to use it on my clients' data.
A: So you are asking whether I am advocating this for everything? No. If your dataset is small, pandas is definitely fine; you don't have to operate a cluster for that stuff, and for small data pandas is the easiest way to do it.
Q: And did you try to combine it with pandas, in fact?
A: In fact, yes. I recently committed a patch to Spark so that you can get parts of your distributed data frame back as a pandas data frame. The downside is that at that point you only have a local data frame again, so you lose the distributed machine learning support.
Q: OK. And is there already support for scikit-learn, for training models on Spark?
A: Not yet. I am not sure if a team is working on it, but I think it is more something you can expect around version 1.5; I would want to double-check that, so don't quote me on it, but my guess is it will take a bit more time.
Q: One more question: since Spark has this pandas-like syntax, does it also support joins?
A: Oh yeah, definitely, just like in pandas; I think it is called join rather than merge, and you can also pick the type of join you want (a small sketch follows below). Mind you, if you are doing a very nasty left outer join, that can blow up the memory at some point. It scales in the sense that if the problem gets bigger you add more machines and it scales roughly linearly, and Spark does all sorts of things to make sure the problem gets no bigger than it has to be, but you can think of joins that will simply always be painful; without the right keys it becomes a brute-force approach. It will probably still work; it just gets more expensive. And on the economics of that: renting when you need it is still usually cheaper than buying and maintaining the hardware yourself in the lab.
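For completeness, the join he mentions might look like this in the DataFrame API; items_df is a hypothetical small lookup table of item ids to names:

    joined = auctions.join(items_df,
                           auctions.item == items_df.item_id,
                           "left_outer")                # join type is your choice
    joined.select("item", "name", "buyout").show(5)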
Q: OK, thank you.
A: Thanks everyone. I'll be around afterwards to answer questions in person.

Metadata

Formal Metadata

Title: PySpark and Warcraft Data
Series Title: EuroPython 2015
Part: 123
Number of Parts: 173
Author: Warmerdam, Vincent
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use and modify the work or content for any legal, non-commercial purpose, and reproduce, distribute and make it publicly available in unchanged or changed form, provided that you credit the author/rights holder in the manner they specify and pass on the work or content, including in changed form, only under the terms of this license.
DOI: 10.5446/20222
Publisher: EuroPython
Release Year: 2015
Language: English
Production Place: Bilbao, Euskadi, Spain

Content Metadata

Subject Area: Computer Science
Abstract: Vincent Warmerdam - PySpark and Warcraft Data. In this talk I will describe how to use Apache Spark (PySpark) with some data from the World of Warcraft API from an iPython notebook. Spark is interesting because it speeds up iterative processes on your hadoop cluster as well as your local machine. I will give basic benchmarks (comparing it to numpy/pandas/scikit), explain the architecture/performance behind the technology and will give a live demo on how I used Spark to analyse an interesting dataset. I'll explain why you might want to use Spark and I'll also go in and explain when you don't want to use it. The dataset I will be using is a 22Gb json blob containing auction house data from all world of warcraft servers over a period of time. The goal of the analysis will be to determine when and if basic economics still applies in a massively online game. I will assume that everyone knows what the ipython notebook is and I will assume a basic knowledge of numpy/pandas but nothing fancy. The dataset has been chosen such that people who are less interested in Spark can still enjoy the analysis part of the talk. If you know very little about data science but if you love video games then you should like this talk.
Keywords: EuroPython Conference, EP 2015, EuroPython 2015
