Harnessing other languages to make Ruby better
Formal Metadata
Number of Parts | 65
License | CC Attribution - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/37607 (DOI)
Ruby Conference 2014, 44 / 65
Transcript: English (auto-generated)
00:19
Harnessing other languages in Ruby, ostensibly to make it better, but really just because it
00:25
seemed like a crazy idea. There's a method to the madness. I work for a company called Intellectual Software. I've been programming for a really long time, but that's just because I'm old. And I've been doing Ruby since pretty much 2010, and it was
00:47
mostly Java before that. Our company is basically split between these two places. I'm in a nicer place, and the only reason I mention that is because I wanted to brag
01:03
about how far I've come. And this is a map centered on San Diego, and the other red dot is Cape Town. And basically, the closer you get to the tip of the leaves, the closer you are to the other side of the world. So, I'm pretty much close to winning,
01:24
I think. It's not actually the other side of the world, because I did actually check, and there is an island near Antarctica called Kerguelen Island, also known as Desolation Island. So, if you want to go there just for the bucket list, and the cool thing is, you
01:47
could say you've been there, and I also have a theory. Anyway, the background to this talk is really the reason why I would even think of using another language. It's not because
02:05
I'm part of the theme of talks about Ruby not being dead, or because I think it's dead. I remember when Ruby came around, people were saying Java was dead, and it's still very much alive and Hadooping. So, it all started in a startup, like a lot of things
02:25
do with the crowd here. It was a challenge, and it was accepted. I probably sometimes think I would like to just work for a bank and not have had that huge challenge, but anyway. The challenge, or the value proposition, in our case it was to bring meaningful
02:45
insights and intelligence to market research. The companies that we were dealing with are very enterprisey, and they have systems that date back to before me. Literally,
03:02
one of the file formats that we have to deal with is older than myself. It's one of those binary formats where you check one bit and it lets you know about the next bit. It's basically a segfault format. The only problem with this challenge was, like a
03:25
typical bootstrapped startup, you've got to deliver something really, really quickly. It doesn't have to be the full product, but you've got to continuously justify your existence through repeated demonstration of value. Particular problems with
03:40
this startup were, first, that the data is highly dimensional, meaning that SQL is not the obvious friend here, if you imagine any number of columns being a requirement. There are lots of caveats to that,
04:02
like cubes and OLAP and stuff like that, and I'm going to skirt over that. Data is very loosely structured, comes from all over the place, all over the world, in all sorts of different formats. There are no standards, it's kind of a best-effort thing. And finally, there needs to be real-time interrogation of the data,
04:23
and I say that because high dimensionality and real-time interrogation means caching strategies, operating off disk, lots of standard approaches kind of fall away. But probably like many people in this conference, our initial solution, given the
04:45
perspective, was Rails. Twitter was on Rails. This was 2010, and to be honest, the analytics stuff was not like a huge chunk of what we needed to do. We had other stuff we needed to do to be in an enterprise. We needed authentication,
05:05
we needed attachments, so we had things like Paperclip, Devise, CanCan, all that stuff. I mean, that just saved us a huge amount of time. And I'm not going to bash Rails because I don't know how we would deliver on time without it.
05:23
And besides, I can always pull this quote out. Performance is a nice problem to have, it means you're growing, it means you've proved your product. And that was basically my attitude as well. Of course, around about 2010 is kind of
05:42
incredible as well, because anyone remember that? Those guys said the same thing. It does eventually become a problem you really have to deal with. So three years later, I was looking around quite desperately. In
06:12
2013, we had done a lot of incremental improvements, and we were doing most of our analytics in Ruby. We went from not doing that much, to doing a little bit
06:21
more, to actually doing quite a considerable amount, probably an embarrassing amount, of real-time analytics in Ruby. We started off with just plain Ruby: enumerables, arrays. We used big decimals to kind of speed things up. And that sounds weird, but we need to do basically two types of
06:44
operations. We need to do a lot of set theory stuff like intersections and unions. So we're using millions of little layer statements and then aggregations inside those. And doing that stuff with a nice thing like
07:01
big decimals, it was a hack, but in one week we had something that was quite fast. And that approach is basically called vectorization, not to be confused with the other things that vectorization can be used to describe. In the data sciences world, it
07:22
means a very specific thing. And who of you were at Chris Seaton's deoptimizing Ruby talk? Basically, he gave a very good overview of why Ruby can be quite slow for certain types of things
07:44
like the thing on the left. There are a lot of things Ruby needs to do every time it evaluates left times right. It needs to check the type; if it's an integer, it needs to go to Fixnum; it needs to check for
08:07
monkey patching; it needs to do all these things. And basically it's competing against one CPU opcode in C, which is mul. So imul, fmul. So even having one extra opcode there
08:21
means you're half the speed. And the traditional approach in dynamic languages is to vectorize this, which is to say, I'm going to create an object that represents a lot of things, like an array, or a matrix, or something like that. And when I multiply by another one of those, I can go straight down to C and do that quickly.
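The contrast described here can be sketched in Python using only the standard library. The `Vector` class and all names below are hypothetical, and `map` with `operator.mul` merely stands in for the C-level loop that a library such as NArray or NumPy would run. The point is that the multiplication is expressed once per vector rather than once per element in user code.

```python
from array import array
import operator


class Vector:
    """A hypothetical fixed-type vector: one object standing for many numbers."""

    def __init__(self, values):
        # A typed array of doubles, so every element is known to be a float.
        self.data = array("d", values)

    def __mul__(self, other):
        # One call multiplies everything; map() drives the loop inside the
        # interpreter instead of in hand-written per-element code.
        return Vector(map(operator.mul, self.data, other.data))

    def tolist(self):
        return self.data.tolist()


def scalar_multiply(left, right):
    # The element-at-a-time style: each iteration goes through dynamic dispatch.
    return [l * r for l, r in zip(left, right)]


prices = [1.5, 2.0, 4.0]
quantities = [2.0, 3.0, 0.5]

looped = scalar_multiply(prices, quantities)
vectorized = (Vector(prices) * Vector(quantities)).tolist()
```

Both forms give the same numbers; the vectorized form also reads closer to how the calculation would be written down on paper.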
08:45
And that also is quite cool because it looks better as well. It looks a bit more symbolic, it's how you would think, it's how you would write the stuff down on paper. So, kind of a win-win. Phase two was basically, we needed to get rid of our
09:01
horrible big decimal hack because, amazingly, when you add two big decimals together, the cost does not scale linearly. I have no idea why. Makes no sense. We forked, contributed to and refactored a gem from Tyler McMullen called bitset, which is,
09:23
really native to how these things work, like bit fields. And that got us our set operations going really quickly. I don't really think that they could be done that much more quickly. I don't know about things like pipelining and stuff like that, so maybe there's one tiny order of magnitude left.
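A minimal sketch of why bit fields suit set operations, assuming each set of row ids is stored as one integer whose bit i marks membership of row i. The helper names are hypothetical and this is not the API of the gem mentioned in the talk.

```python
def to_bitset(row_ids):
    """Pack an iterable of non-negative row ids into one integer bit field."""
    bits = 0
    for row_id in row_ids:
        bits |= 1 << row_id
    return bits


def from_bitset(bits):
    """Unpack a bit field back into a sorted list of row ids."""
    row_ids = []
    position = 0
    while bits:
        if bits & 1:
            row_ids.append(position)
        bits >>= 1
        position += 1
    return row_ids


segment_a = to_bitset([0, 2, 3, 7])
segment_b = to_bitset([2, 3, 5])

both = segment_a & segment_b        # intersection in a single bitwise AND
either = segment_a | segment_b      # union in a single bitwise OR
count_both = bin(both).count("1")   # size of the intersection
```

Each set operation becomes one bitwise instruction per machine word, which is the property that makes this representation fast.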
09:46
And this basically left us in the position where aggregation was kind of dominating on the calculation side. And this didn't take us long to do, and it kind of deferred having to really look at the performance problem
10:01
and allowed us to concentrate on that, I think. Until that aggregation did become a problem. We were all this time looking for the silver bullet that would allow us to do everything in a clean and performant way. GSL was not totally the answer, but it did allow us to do vectorized aggregation and statistics pretty well.
10:26
It's built on a C library, the GNU Scientific Library. And that's built on top of other stuff, like BLAS and LAPACK, which in turn date back to 1979 and 1992, and practically that just means that no one wants to touch them because they're perfect.
10:45
Doing just those tiny, small things. So it's fairly solid, but again, it allowed us to optimize a particular pathway and buy us some more time.
11:01
And then we pitched for some work with a certain US supermarket chain that won't be named, and it was the biggest data we'd ever seen. It basically included every single receipt or shopping tray that had ever gone up in the last four years.
11:23
All stores in the United States. And so you can imagine it's a huge amount of data. And then they want to ask questions like, you know, if you buy Oreos, what type of toothpaste do you like? And then they started talking about real-time stuff.
11:40
So you can't even do the stuff in the background. So it was kind of horrible because it was like two or three orders of magnitude more than anything else. We decided that we needed to pitch for it anyway, and that we were going to hack it for now. NArray is a gem written by Masahiro Tanaka.
12:02
And it's kind of like a Python NumPy clone, and we used the parallel gem as well. And this is going to make people squirm, but basically we were just trying to see if it was possible. It was possible. We did manage to do some types of analyses that they wanted to do.
12:21
But around about 2012, 2013... this chart is on a logarithmic scale. It's approximate, and I've kind of worked this out with a lot of assumptions. But it's approximately linear on a logarithmic scale, so it's pretty useful.
12:43
But that's kind of how you want your performance line to go when you're trying to improve it. And that looks kind of good, but there's a hidden sacrifice happening between 2012 and 2013. Basically, while we kept performance linear, our abstraction power started to flatline.
13:05
We weren't making our programmers any more productive, which is counter to the Ruby way of doing things. And the worst part was, as we were optimizing all these little code paths, we were also increasing the complexity to the point that very few people understood it.
13:24
We started to have people that weren't allowed to cross the road without help, you know, just in case they got hit by a bus, because they were the only people that knew it. So we knew we had to do something. Again, there is an obvious kind of question there.
13:43
Isn't there a database service or some kind of thing? Can't you do something? It may be possible that you can. We had a lot of people look at this. And I think Apache Spark now is potentially something that could deal with this.
14:00
Certainly at the time, we didn't think so. So the answer is no, in this case. So we asked ourselves, you know, we chose Ruby to get the product out the door as quickly as possible. What would happen if we didn't use Ruby, or if we were to start again in another language?
14:26
And this is a thought experiment. And particularly looking now at data is the most important thing that our company does. And I picked two of the options.
14:41
R. R is like the granddaddy of open source statistics. It's got a lineage dating back pretty much to 1976 through S. But some of the core concepts came in about 1997.
15:06
And that's a really long time ago. And the ecosystem that's built up there is really big. And one thing R is really, really good at is munging data. I don't know if I have to explain the term munging. I'm not sure where it came from. I tried to find out last night.
15:22
But it's definitely a thing. I don't know where it appeared from. But basically, when you're dealing with... It's generally people using R or something like that, in a sort of lab-based environment, or generally working at financial institutions.
15:41
You're dealing with data that comes in any format. With any number of considerations or standards. And you often need to pull it apart, reshape it from long to wide. Change the dimensionality, re-invert it. Get it to merge with something else that's only loosely the same.
16:00
That's munging. It's basically just having a whole bunch of tools to pull your data apart and reassemble it. To answer questions. And R is fantastic at that. You also have to look at the JVM. Because you haven't done your due diligence unless you have.
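The long-to-wide reshaping mentioned above can be illustrated with a small standard-library sketch. The sample rows, the field names, and the `long_to_wide` helper are made up for illustration; R and pandas provide far more general tools for this.

```python
from collections import defaultdict

# Long format: one row per (respondent, question) pair. Sample data is invented.
long_rows = [
    {"respondent": "r1", "question": "age", "answer": "34"},
    {"respondent": "r1", "question": "city", "answer": "Cape Town"},
    {"respondent": "r2", "question": "age", "answer": "51"},
    {"respondent": "r2", "question": "city", "answer": "San Diego"},
]


def long_to_wide(rows, key, column, value):
    """Pivot long rows into one dict per key, with one field per column value."""
    wide = defaultdict(dict)
    for row in rows:
        wide[row[key]][row[column]] = row[value]
    return [{key: k, **fields} for k, fields in sorted(wide.items())]


wide_rows = long_to_wide(long_rows, "respondent", "question", "answer")
```

After the pivot there is one record per respondent with an `age` field and a `city` field, which is the shape most aggregation code expects.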
16:20
Scala... I couldn't go back to Java. I came from Java and I tried to write one class. The other day I did a loop. And I thought, never again. So it would have to be a recent Scala. Scala tends to focus more on how much data you can get through it.
16:45
Rather than what you can do with the data. Apache Spark is built on Scala. And other people tend to prefer Scala. But generally the problem is the size. Not the complexity of the analysis.
17:02
And on the plus side there are many shared ideas with Ruby. You can certainly write Ruby-esque Scala. Mixins in Ruby are kind of analogous to traits in Scala. Seq is not too far away from Enumerable. So you can write Ruby in Scala. You can also write code in Scala.
17:21
You can write Java in Scala. I'm pretty sure you can write anything in Scala. One of my favorite things about Scala is it comes with a free kitchen sink. And of course Clojure. I know a lot of you have probably experienced the same thing as me. Just people won't stop going on about it.
17:40
But then when I was doing Java people couldn't stop going on about Ruby. So I thought, well maybe that's a good thing. It does have some interesting libraries for data munging. One of them is Incanter. It's based loosely on R but it's one company's attempt to make something useful for them. So it's certainly not a huge project.
18:03
And the rest of it is dominated by Datomic and a roll-your-own approach which does seem quite pervasive in Clojure. It comes with a free hammock and a poster of Rich Hickey. And then of course someone said I should look at Haskell because it comes with a free beard.
18:23
I kind of wanted to because I don't think I could do that by myself. I need language help. But you only get the beard if it compiles. So the answer kind of surprised me. And this picture is relevant.
18:43
Because I didn't expect Python. I would expect quite a few Ruby people to think like I did. Which is the languages are similar enough to mean that if you know Ruby why on earth would you look at Python? And it's more of a cultural thing than anything else.
19:03
Ruby has a slight tendency to do things one way and idiomatic Python is slightly more explicit. So why Python? I mean the first point is mainly the main one.
19:21
The size of the scientific community. That was kind of a surprise. It's big. And kind of like Facebook and Twitter there's a network effect there. Once the libraries start to accrete and libraries get built on those libraries.
19:42
And those libraries in turn enable new things to be done. Suddenly there's lock-in. And there's value in the ecosystem, not just the library by itself. And there's incredible depth there now. It's basically been going on for about ten years and where it is now is very impressive.
20:03
And then of course if there is this difference then the similarity is a good thing. It's not that different. Bundler is better than pip install but it's kind of the same thing. There's a lot of stuff like we were looking for vectorization.
20:23
We weren't going to move everyone over to Python, but for the people that would need to do any Python, it wouldn't be that difficult. So I'm going to take a quick look at what Python has to offer.
20:48
NumPy is kind of the bedrock of the scientific Python stack. It's an array computing library that is pretty much all about vectorization.
21:00
It goes back to 1995. It's on its third rewrite. So the first two iterations were architectural considerations. We need to do this again. You can be sure the architecture is pretty solid. They are talking about a fourth rewrite. But it is a solid library.
21:26
And on top of that, there's one of my favorite libraries. Pandas, which used to stand for panel data analysis or something like that to do with analyzing survey data.
21:41
It's now just got pandas everywhere, like actual with bamboo and stuff. So kind of just pandas now. It's built on NumPy. It completely relies on it. But it ports a lot of what's good in R into Python. And what's good in R is the data frame and the series, which if any of you ever looked at, you'll see an example of that later.
22:06
It is very, very, very fast. It kind of takes vectorization further. It gives you higher level abstraction tools. And wherever NumPy doesn't help out, the stuff is Cythonized, which is like a Python DSL for generating C code.
22:26
Or it's actually written in C where even the Cython is not fast enough. So it's been around for long enough that a lot of stuff has been optimized. And it is the Munger extraordinaire of data. It is a hugely cynical data analysis library that can pretty much look at anything.
22:44
And you get a couple of bonus extras, which we are not immediately interested in, but it's nice to know they're there. SciPy is linear algebra, fast Fourier transforms, clustering analysis. There's IPython notebooks, which I'm going to skip over because it's relevant to the end of the talk.
23:03
SymPy, which is if you want to actually solve algebraic problems. If you want to cheat during your high school math, this is a good library to know. The Natural Language Toolkit, for things like analyzing Twitter feeds for whether people hate you or like you.
23:20
And machine learning with scikit-learn. So I just wanted to reinforce the strength of the community thing. Because it's particularly the scientific aspect more than anything else that I'm talking about here. Not Ruby versus Python. I've got some git commit charts here.
23:44
This isn't like super fair and I'm hugely grateful to NArray and GSL. Our business kind of relied on them. But it's important to bear this in mind. That's total commits. What's even more astonishing is the contributor count for pandas.
24:04
It's just 310 people through a combination. Well, mostly pull requests. And that's a huge, huge number. If you look at issues open and closed, you can see pandas again. 4,681 closed issues.
24:22
I know it's a lot of outstanding issues. I would be quite scared of a thousand open issues. But that's just because there is a huge number of people using it. And that is quite close to Rails. And that's just for one library. And it's worth diving into those issues.
24:43
Just break them down by their GitHub tags. There's 450 closed issues for time series stuff. Now, if you think dealing with time series is easy, just, I would just pause. And if someone says, why can't we just do this, then I just show them this chart.
25:02
Because this looks like pain that has been fixed by someone else. The 90, the figure on the right, that's just dealing with CSVs. CSVs in the data world are sort of the simplest, most ubiquitous form of exchanging data.
25:22
90 closed issues for CSV. Pandas is just about the fastest CSV parser I know of. It's certainly orders of magnitude faster than Ruby. And it's incredibly cynical. It will handle stuff written in a Mac on system 7 in some terrible format.
25:42
You know, it just does everything. Quoting, currency, date formats, missing values, all that stuff. It does it. And it does it quickly. So, that's awesome. But, obviously, I'm not going to rewrite, well, we weren't going to rewrite the application in Python.
26:07
A, because we have a lot of Ruby programmers. And B, because we like Ruby. We just want some of this goodness. So, the problem then became, can we get the flexibility of something like Pandas with the speed of something like NumPy?
26:27
With a Ruby API that feels local and natural. And as a bonus, scales horizontally. Because that's increasingly becoming something we definitely need to do. Is be able to farm that out to cheap Amazon instances when we need to.
26:45
And for inspiration, we didn't have to look that far. The most obvious thing which I think everyone would have had exposure to is active record scopes. You're effectively writing SQL in Ruby. It's deferred, it's composable.
27:02
But it ends up running SQL. And it's a fairly simple model to get your head around. Of course, this is a little bit different. We're trying to talk about a general purpose language going off to another general purpose language. There be dragons, potentially.
27:22
Getting Ruby to run Python is kind of the same thing as getting Python to run Ruby. It just depends on who's the boss of the API. If that makes sense. And think about active record. Active record speaks SQL.
27:41
SQL is the boss of the API. We decided that Ruby should be the boss of the API. And that the Python side should understand Ruby. So, that means transforming Ruby code into data. Sending data on the wire.
28:00
And transforming that data into Python. So, it's a simple pipeline. And I've got "Python or other" because the beauty of sending Ruby over the wire as data is that we don't need to lock into Python. This isn't a Python love fest going on.
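The pipeline above (code becomes data, data goes on the wire, the other side turns it into calls) can be sketched in a few lines. This is an illustrative assumption, not the speaker's protocol: the JSON encoding, the `BACKEND_OPERATIONS` table and the `handle` function are all hypothetical names.

```python
import json
import statistics

# Sender side: the operation is described as plain data, not executed.
request = ["mean", [2.0, 4.0, 9.0]]
wire_payload = json.dumps(request)

# Receiver side: a table mapping operation names to local implementations.
BACKEND_OPERATIONS = {
    "mean": statistics.mean,
    "sum": sum,
    "max": max,
}


def handle(payload):
    """Decode a payload and run the named operation on its argument."""
    name, argument = json.loads(payload)
    return BACKEND_OPERATIONS[name](argument)


result = handle(wire_payload)
```

Because the payload is only data, a receiver written in any language could implement the same table, which is the "Python or other" point made in the talk.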
28:21
There's a practical use that we have for it. And it's completely feasible for us to utilize any language. Including Haskell. So I can get my beard. So, I don't know if you've heard this term. There's usually someone who knows.
28:41
Generally, everyone knows someone who's a bit of a Lisp fanatic. And they generally talk about this with a sort of dark religious zeal on their face. There's an XKCD about that. There's also an XKCD about lessons from Lisp. And as usual, Randall Munroe nails it.
29:01
Lisp just keeps coming back to haunt us. So, code as data is, I mean, very much kind of a founding principle of Lisp. And I hope to kind of demonstrate that.
29:24
There are two key lessons from Lisp that we used for building this product. The one is the use of S-expressions, which I'll go into next. And the second thing is immutability.
29:41
Immutability is absolutely key. I'm not gonna dwell on that that much, but you know ActiveRecord is doing it. So, it is important. You can't go and as you're chaining on stuff, you can't actually affect any of the previous scopes.
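A minimal sketch of the immutability point, assuming a scope is just an ordered record of deferred steps. The `Scope` class and its methods are hypothetical and only mimic the chaining style; they are not ActiveRecord's implementation or the speaker's API.

```python
class Scope:
    """A hypothetical deferred query: each call returns a new Scope."""

    def __init__(self, steps=()):
        self._steps = tuple(steps)  # tuples cannot be mutated in place

    def where(self, condition):
        # Build a new Scope instead of appending to this one.
        return Scope(self._steps + (("where", condition),))

    def limit(self, count):
        return Scope(self._steps + (("limit", count),))

    def steps(self):
        return list(self._steps)


base = Scope().where("country = 'ZA'")
recent = base.where("year = 2014")
top_ten = recent.limit(10)
```

Chaining onto `base` leaves `base` untouched, so earlier scopes can be reused and composed safely.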
30:04
So, S-expressions. They're super simple. Basically, they're parentheses. Lisp is all about parentheses. S-expressions start and end with a parenthesis. The first argument is a function and the other arguments are optional.
30:25
And they're data. And there's one special function. There are a couple of special functions in Lisp. There's a bit of a debate whether you need three magic functions or seven or eleven.
30:40
But you don't need that many. One of them is quote and unquote. And effectively, if I were to quote that function, I could say that it's data. And unquote turns it back into code. But I'm not gonna dwell on that. It's not that important. I think an example is probably best.
31:00
I think a lot of people would have seen reverse polish notation. It's quite a similar concept. Function defined first, argument second. I've basically got the S expression on the left and the Ruby on the right.
31:23
And the other thing about S expressions. This is basically the most important thing about S expressions. And the second most important thing is that they can be nested. And it takes a little bit of time to read if you're not familiar with that. You have to read from inner to outer.
31:42
Not from left to right. But the bonus is you don't have to worry about operator overloading. It's incredibly simple to create. It's incredibly simple to consume. And it's just data. So, it's a tree. So, we can use very, very simple algorithms to transform this stuff.
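Because a nested S-expression is just a tree, a very small recursive function is enough to consume it. The sketch below represents S-expressions as Python lists; the `FUNCTIONS` table and `evaluate` are hypothetical names chosen for illustration.

```python
import operator

FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def evaluate(expression):
    """Evaluate a nested S-expression given as Python lists."""
    if not isinstance(expression, list):
        return expression  # a plain value (an atom)
    head, *arguments = expression
    # Inner expressions are evaluated first, which is why they read inside-out.
    values = [evaluate(argument) for argument in arguments]
    return FUNCTIONS[head](*values)


# (* (+ 1 2) (- 10 4)) corresponds to (1 + 2) * (10 - 4)
nested = ["*", ["+", 1, 2], ["-", 10, 4]]
answer = evaluate(nested)
```

The first element names the function, the rest are arguments, and nesting needs no precedence rules.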
32:02
And just the final example is if there was such a thing as active lisp, that's what your active record scopes would kind of break down to. And you can see it kind of inverts the call tree on its head. You end up with the thing you do last first. So, you have to read from inside out.
32:23
And so, this is basically what we send over the wire. We have Ruby objects that work pretty much like active record scopes. And we end up sending the thing on the right over the wire.
32:43
There are limitations, obviously. We don't run into that many of them, luckily. But an example of a limitation is this, which is kind of an example from our API.
33:05
I'm basically here trying to calculate Z-scores from some kind of column in a table. And a Z-score is basically a measure of the number of standard deviations from the mean.
33:20
And you can see the second-last line. It's quite readable. You basically take what you're looking at, you subtract the mean, and you divide by the standard deviation. And that pmap is parallel map. Does exactly what you think. It runs on multiple Python back ends.
33:43
And we can send huge chunks of data at that and they get farmed out to Amazon instances. And the results get returned. If I were to change this to have a little bit of Ruby stuff in that block, I've got a divisor lookup.
34:00
Instead of dividing by the standard deviation, I'm looking up something in a hash. This isn't going to work because that map block is actually not being executed in the Ruby side multiple times. It's just being executed once to be turned into data to go to the Python side.
34:21
So there's a bit of a cognitive problem that you have to be aware of. And it just means you have to have a powerful enough API that you don't really need to do this that often. And for us, we've been using this in production for about six, seven months. It isn't a problem.
34:40
But we do get some benefits as well. Compilers do optimizations on abstract syntax trees, which are pretty much S expressions. And so can we. And we can do things like automatically shard stuff that we can tell can be parallelized.
35:05
Even though the Ruby client is not aware of that. We can also do things like you may ask to load a big CSV or some other data source. And you only end up using a few columns. We can back-propagate those few columns all the way back to source so that we only look at that data.
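A sketch of that column optimization as a tree rewrite might look like the following. The node shapes (`:col`, `:load_csv`, `:select`) and the file name are assumptions chosen for illustration.

```ruby
require "set"

# Collect every column name referenced anywhere in the tree.
# [:col, name] is a hypothetical node shape used for illustration.
def referenced_columns(sexp, found = Set.new)
  return found unless sexp.is_a?(Array)

  if sexp.first == :col
    found << sexp[1]
  else
    sexp.each { |child| referenced_columns(child, found) }
  end
  found
end

# Rewrite [:load_csv, path] nodes so they carry only the needed columns.
def push_down_columns(sexp, columns)
  return sexp unless sexp.is_a?(Array)
  return [:load_csv, sexp[1], columns] if sexp.first == :load_csv

  sexp.map { |child| push_down_columns(child, columns) }
end

query = [:mean,
         [:select,
          [:load_csv, "weather.csv"],
          [:div, [:col, "rainfall"], [:col, "rainy_days"]]]]

needed = referenced_columns(query).to_a.sort
p needed  # ["rainfall", "rainy_days"]
p push_down_columns(query, needed)
```

The client code that built `query` never mentions column selection; the rewrite adds it to the source node afterwards.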
35:26
And that's quite cool. And the other thing I mentioned before is we can target multiple backends. So that's the kind of theory. And I should probably show you this now because it's all been abstract until now.
35:41
Okay. So what you're looking at here is, how many people are familiar with IPython? Okay. Not that many. It is fantastic. It blew my mind. I might do a lightning talk on it because first of all, it's not limited to Python. It's basically what Donald Knuth was going on about with literate programming.
36:06
And it's a real attempt at it. This is a mix of code, REPL, and Markdown in one document that can be exported and run elsewhere. And if you think about it from an academic perspective, shipping a paper off with your data and your code and the actual paper itself in one thing is mind blowing.
36:25
Anyway. We're using the iRuby kernel, which was written by Daniel Mendler, and we've also contributed to it. I'm just going to fire this up. This is us just creating our basic data structure.
36:44
And this breaks down into an expression. And this is effectively what we send over the wire. And you can toggle that and you can see that's what gets rewritten as it gets turned into Python.
37:02
Do something a bit more advanced. This is us doing a little bit more dimensional stuff. I was going to regionalize this for the U.S., but I ran out of time. So, this is kind of two dimensions. Country and city. It's a hierarchical data structure with two measures. Rainfall and rainy days.
37:21
And we need to do things like extract a dimension out of that, kind of OLAP-y, and potentially get the mean of these things. And this is all happening over the wire on the Python side. And you can see this breaks down into a Ruby expression and this is what we send over the wire.
37:45
And this then gets rewritten and turned into this Python. And that works very well.
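The rewrite step can be pictured as a small code generator that walks the expression and emits Python source. The sketch below is written in Ruby for consistency, although in the talk's system the rewriting happens on the Python side. The operator names are assumptions, and the emitted `groupby(...).mean()` calls follow pandas conventions, which is also an assumption, since the talk does not name the Python library.

```ruby
# Hypothetical emitter: turns an S-expression into a Python source string.
# The supported operators and the pandas-style output are assumptions.
def to_python(sexp)
  return sexp.inspect unless sexp.is_a?(Array)

  op, *args = sexp
  case op
  when :table    then args.first.to_s
  when :group_by then "#{to_python(args[0])}.groupby(#{args[1].to_s.inspect})"
  when :mean     then "#{to_python(args[0])}.mean()"
  else raise ArgumentError, "unknown operator: #{op}"
  end
end

expr = [:mean, [:group_by, [:table, :weather], :country]]
puts to_python(expr)  # weather.groupby("country").mean()
```

Because the emitter controls the output, it is free to insert extra backend steps that the Ruby-facing API never exposes.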
38:01
And another advantage we have is that with all of these things, we actually get runtime information. So, we know the mean took 6.4 milliseconds, the group by 5.5. There are things here that are not in the code above, like set index. This is what happens when we rewrite stuff.
38:21
We express stuff. We keep the Ruby API actually cleaner than the Python stuff we're using because we can. And then finally, error handling. This stuff could be a nightmare, but it's actually really, really nice to deal with.
38:42
This is kind of a similar thing, except I've tried to group by state instead of country. So, that's wrong. Obviously, the Ruby side, the Ruby expression will remain the same. But if I want to see what happened on the other side, you can see I had a fail on set index.
39:08
And for some reason, I don't have tool tips. But there is a stack trace that you can't see here, an invisible stack trace. And that allows us to, we can see both on the Ruby side and on the Python side where stuff went wrong.
39:26
And it's, yeah, this is basically our sort of IDE for working with the data, but we work with this type of code just in Ruby and actually still hosted in Rails.
39:41
Yeah, and it's worked pretty well. And that's it. And I didn't have to use my one slide. This was just in case. I don't know how we're doing for time. I don't know. I suspect not that well.
40:01
But are there any questions? I actually don't. The project is not open source yet. We're still working on it and we're kind of pushing it through into our clients
40:21
and when we're comfortable with that, we may look at that. But the other thing that you should look at is the iRuby notebook because that is cool. I really think it's a hidden gem. And don't let the IPython thing put you off; they're actually renaming their project to Project Jupyter so that the Python gets removed from the name and other people don't shiver.
40:45
Anyone else?