Analysing Big Time-series Data in the Cloud
Formal Metadata

Title: Analysing Big Time-series Data in the Cloud
Series: NDC Oslo 2016 (talk 57 of 96)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported
Identifiers: 10.5446/51696 (DOI)
Transcript: English(auto-generated)
00:05
All right. Looks like it's nine, so it's time to start. Thanks, everyone, for coming. I'm impressed. This is a bit too early for me. So if I say something that doesn't make sense, then it's probably because of the early hour.
00:23
So I'm going to be talking about — well, when I looked at it yesterday, I realized I had put up a very boring title. The topic will actually be much more fun than the title may suggest. I'm going to be talking about UK house prices, which
00:44
is a fun topic. And we are going to do some time series and data analysis. And it's going to be reasonably big. I have some sort of larger data sets that I'll show you as well. And the other bit that you'll see
01:02
is various very interesting F# community projects that happened in the area of data analytics and working with big data. My name is Tomas Petricek. I work with fsharpWorks, which is sort of an F# consultancy.
01:21
And we do training and help people with F#. First of all, I have to say this reminder: F# is a general-purpose language. And I think there's this misconception that F# is only good for data analytics
01:44
and some sort of science-y stuff. That's not really the case. And I think the other talks here at NDC did a really nice job of demonstrating this, because there was a talk about doing user interfaces with F#.
02:00
There's lots of interesting libraries for that. There was a talk about doing web development with F#. So really, the language itself isn't restricted to some particular domain. It works great in many areas, and I'd say you can use it really nicely in lots of areas
02:22
where there are libraries for it. So there are great F# web libraries; there's lots of other interesting things. But I will be talking about the data analytical part, which is partly because I've actually been working on this. I spent some time at BlueMountain Capital, which
02:41
is a hedge fund in New York, and we did some of this work when I was there. So that's why I'll be talking more about the data analytical part. But that doesn't mean that F# isn't useful anywhere else. It just means I actually spent some time working on this.
03:00
And so it's an area where I contributed, and I know some things about it. So why would you use F# for this kind of work, when you are working with data and doing some calculations? There are different answers for why you would use F#
03:20
in different settings — lots of them will be similar, but you might give a different answer for why you would use F# for the web. For analytical components, the main thing is that F# is actually a nice programming language which is efficient, and it leads you to more correct code.
03:45
And it lets you do things faster. And I think you'll see some of the reasons why that's the case in the talk. And if you're doing some sort of logic, some analysis,
04:01
then you might use something like R or Python, and those languages have very rich libraries — we'll see that you can actually use R from F#. But they're not compiled, and they don't have static typing, which are really useful things when you're working with lots of data and you want to explore it quickly.
04:25
I think the stuff I'll be showing in the talk is actually a nice example of what you can get when lots of people in the community collaborate. And we started this F# Software Foundation, which is now a US nonprofit.
04:43
So you can join and help us make F# great. Not great again — it's always been great. One nice thing about the F# Software Foundation is that it brings together all the people who
05:00
are involved in it. So that involves the language design, which is mostly done in Microsoft Research. It involves the various editor and tooling authors, including Visual Studio, including Atom and VS Code, including Emacs.
05:20
It also includes the various people who are using F# for commercial projects. They sometimes open-source some of their libraries, which is the case with BlueMountain Capital. And it also brings together the broader open-source community. So you'll actually see all of these involved in the samples.
05:41
And what I want to show you here, I'll start with just local small data sample. The UK actually has an open government data website where you can download UK house prices. And they have one month download, which is like 20 megabytes.
06:01
But you can also download prices from 1995 until today. And it includes every single house sale in the UK. So we'll start by looking at the local data set. And I'll switch to Atom. So I'm using Atom and a project called
06:22
Ionide, which is the F# integration. And right here, I'll just start by loading some of the F# data analytics libraries. So FsLab is the library I'll be using for data analytics, which brings together
06:41
lots of other components. And this is my directory with a file. So what I'm going to do here is just load the CSV file. So that's the data — CSV, the NoSQL format of the future.
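In code, that setup looks roughly like this (a minimal sketch; the package path and CSV file name are illustrative):

```fsharp
#load "packages/FsLab/FsLab.fsx"   // pulls in Deedle, XPlot and friends
open System
open Deedle
open XPlot.GoogleCharts

// Load the monthly UK price-paid CSV into a Deedle frame
let prices = Frame.ReadCsv("pp-monthly-update.csv")
```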
07:03
And this file actually contains all the sale information for houses in the UK in April 2016. And we are going to look at some of the interesting things
07:20
there. There's lots of different columns in the file — half of them, I have no idea what they mean. So what I'm going to do first is actually just get a subset. And this data structure that I loaded, the data frame, is inspired by data frames in R.
07:43
And it's really just like a CSV file: a two-dimensional, table-like structure. And I can do lots of operations on it. One of them is that I can select just the columns I care about. So I care about things like price, postcode, and town.
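Selecting the subset could look like this (the column names are my guess at the data set's headers):

```fsharp
// Keep only the columns used in the rest of the demo
let interesting = prices.Columns.[ ["Price"; "Postcode"; "Town"; "Date"] ]
```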
08:05
And one nice thing you can see here is that what we've added to Ionide is the ability to display anything you load as HTML. So when I load the frame, I actually get it here as a table.
08:20
And I can even scroll through the table and find all the different house prices. So let's say — I live in Cambridge, which is a nice place except for the house prices — maybe what I want to do is look at Cambridge and find where I could possibly afford something.
08:45
And we are going to be using the normal F# pipe. So the way you work with these frames is pretty much the same way you work with arrays or sequences. And there's lots of operations in the Frame module,
09:00
like filterRows, where I can go over all the rows. It gives me the key and the data in the row, and then I can say: let's have a look at cases where the town is Cambridge.
09:23
And this filters out some of the — oh, I did something wrong; I didn't run this line. And now we've filtered it down to only Cambridge. There is actually some noise, so the other thing I need to do here is to only look at cases where the date is 2016.
09:45
So if the year is 2016, that's what I want to see. And then the next operation I have on the frame, you can do lots of different things. And I just want to sort it by the price.
10:01
So if we take this and say sort by price, now with like four lines of code, I actually can iterate over this. And there are some weird things at the beginning. So the house for 5,000 pounds, that's probably not a real house. That's more like a dog shed or something.
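Those few lines, reconstructed (the Date parsing and the exact casing of the values are assumptions about the data set):

```fsharp
let cambridge2016 =
  interesting
  |> Frame.filterRows (fun _ row ->
      row.GetAs<string>("Town") = "CAMBRIDGE" &&
      row.GetAs<DateTime>("Date").Year = 2016)
  |> Frame.sortRows "Price"
```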
10:23
But if you scroll down to the end, you can see for, what is this, 2 million pounds, you can buy a beautiful house in Hills Road. So this is a sort of nice way of interactively exploring
10:41
the data. Because I can write my code, run it immediately, see the results. And thanks to some of the nice new tooling here, you can even sort of explore the results we get. One more thing I want to do here is I actually want to do some aggregation.
11:01
So I'm going to take this, and I'll just filter out some of the prices that aren't really correct. So I'll take all the houses, keeping the price and the town. And the type and duration columns tell me something about the kind of sale it is.
11:23
So duration 'F' means — the UK system, I don't quite get it, but 'F', freehold, is the normal thing. It means you actually own the house. And I think this is all right.
11:41
So we only get some of the houses from the data set and assign it to a new frame. You can see here that I'm using the usual functional style where I just clone or apply some transformation, get a new frame. But when I do this, it isn't actually
12:01
copying all the data always. Because it's all immutable, so it can share a lot of the data in between the individual values. So now I've cleaned it. And I'll show you one more interesting operation, which is, how can I aggregate this?
12:24
And there's a useful function for doing that where I can say, aggregate all the data by town and apply some sort of folding operation or aggregation on one of the columns. So what I'm doing here is I'm just saying,
12:41
aggregate it by town and give me the average price per town. And if I run this, you can see the result: I get average prices by town.
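A sketch of that aggregation, assuming the cleaned subset from a moment ago is bound to a hypothetical name cleaned:

```fsharp
// Group rows by town and average the Price column within each group
let avgByTown =
  cleaned |> Frame.aggregateRowsBy ["Town"] ["Price"] Stats.mean
```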
13:04
and plot the 20 most expensive places. But I only want to do it for places where there are more than 10 or so sales, so that one fancy house somewhere in the countryside doesn't skew the data for the area. So I'll need to count how many sales there are
13:23
for each town and put these two data sets together. And to do that, I can use indexing, which is a built-in feature here, where if I say index the data by town,
13:41
then you can see that it transforms the structure so that I have the town on the left in bold, which is like the key. So I'll have two frames which share the same key, and then I can nicely put them together. So this is average prices.
14:02
We've got average prices, and I'm going to add counts where I'll just replace mean with count. And now I've got counts. And what I can do next is to say, take average prices and add a new column named count,
14:25
which we get by taking the column from the counts frame. So this is the merged frame — where, if I run it, if I run this as well... oh, I have to run all of it.
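Reconstructed, the counting and merging might look like this (Stats.count is composed with float so it can serve as the aggregation; indexRowsString gives both frames the shared town key):

```fsharp
let avgPrices =
  avgByTown |> Frame.indexRowsString "Town"
let counts =
  cleaned
  |> Frame.aggregateRowsBy ["Town"] ["Price"] (Stats.count >> float)
  |> Frame.indexRowsString "Town"

// The frames share the town key, so the new column aligns automatically
avgPrices?Count <- counts?Price
```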
14:43
All right. Now I have a frame where I have key, which is the name of the town, and then I have the price and the number of sales in the place. So I'm just going to wrap this up by using a cheat that I did before.
15:05
First of all, we take all the data and filter out only towns where there's more than 100 house sales, sort it by price, and take the last 20. And you can see the results here.
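The 'cheat' is roughly this pipeline (Frame.takeLast is assumed to be available in the Deedle version used):

```fsharp
let top20 =
  avgPrices
  |> Frame.filterRows (fun _ row -> row.GetAs<float>("Count") > 100.0)
  |> Frame.sortRows "Price"
  |> Frame.takeLast 20
```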
15:22
And then the next thing I can use here is XPlot, which is a charting library, and this actually has a nice Ionide integration as well. So it takes the chart and shows it right in my F# Interactive.
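The map itself might be built like this with XPlot.GoogleCharts — Chart.Geo and the options are my guess at the exact chart used on screen:

```fsharp
top20?Price
|> Series.observations                  // (town, average price) pairs
|> Chart.Geo
|> Chart.WithOptions (Options(region = "GB", displayMode = "markers"))
```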
15:41
And if you look at this, you can see that the most expensive houses are obviously around London. And I think it's sort of amazing how clustered it is around London these days. And if you look at the average price in London, it's something like one million pounds.
16:01
So that's tens of millions if you convert it to Czech koruna. And everything else around it is significantly cheaper. So moving to London is probably a tricky problem. All right, so that was one example of what you can do.
16:23
And I did show you a couple of new things here. The main one is the F# Interactive on the right, which is a new feature in Ionide. And Ionide is an F# plugin for Atom and Visual Studio Code.
16:42
I was using Atom, but what I was showing is hopefully coming to Visual Studio Code soon as well. Atom has a more flexible extensibility model, so you can hack it in lots of interesting ways — like, you can actually insert HTML output
17:00
in your Atom F# Interactive. The other interesting thing is that it's very open, and Ionide actually supports a lot of the great F# community tooling. So you have support for the package manager,
17:21
which I'm not going to say anything about — it's a dangerous topic; I don't want to get shot later on. But it also supports things like FAKE, which is a really nice F#-based build system. So you get an integration with lots of other great tools.
17:42
And the new fun thing we've been adding to Ionide is this way of adding HTML formatters. And the plan here is that when you define this formatter, you'll be able to use it in Ionide, which is what I've been doing.
18:01
There's also a project called FsLab Journal, which lets you generate nice reports — HTML reports with text and code and outputs — so it will work there. And I'm also talking with the people working on the F# integration for Jupyter notebooks, and we want to make the same model work there as well.
18:24
And it's really simple. When you have some interesting object structure you can format as HTML, all you have to do is to say fsi.AddHtmlPrinter for your table object here, and then you say: here are some styles and scripts
18:43
I want to include, and here's my HTML. And it gets formatted.
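A minimal sketch of such a printer (the Table type and the markup are illustrative, not the talk's exact code):

```fsharp
type Table = { Rows : string list list }   // hypothetical type to format

fsi.AddHtmlPrinter(fun (t:Table) ->
  let cell c = "<td>" + c + "</td>"
  let row r = "<tr>" + String.concat "" (List.map cell r) + "</tr>"
  let html = "<table>" + String.concat "" (List.map row t.Rows) + "</table>"
  // first element: styles/scripts to include; second: the HTML body
  seq [ "style", "<style>td { padding:4px }</style>" ], html)
```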
19:00
So far, I was looking at just one month of the data, which is alright, but it's not really big data — it's 20 megabytes. So what I want to do next is to show you another interesting project. And that's something that we jokingly call BigDeedle. Deedle is the data frame library that I was using here. So all the Frame-dot-something operations,
19:25
they're coming from this Deedle library, and the library gives you all sorts of tools for doing data transformations, data explorations, like the kind of things I was showing in the last demo. And in the previous demo, everything I was doing
19:44
was actually done over data that was loaded in memory, but we have another sort of backend for the library that lets you do all the, or some of the operations
20:00
over a virtualized data source where you don't actually load the data into memory. So what I'm doing here is that I just referenced this library, or this DLL, which contains a provider for Deedle that loads the house prices on demand.
20:21
And over here, I can actually disable and enable logging so that you'll see what data sets it is loading. And I'll just need this helper to create date-times, because creating a date-time takes, like —
20:40
it's very complicated, and then this is another helper that I'll use later on. And what I'm doing here is that I just call the provider to give me the data frame that represents all the house prices in the UK for the last, what is it,
21:02
20 years, roughly 20 years. So all the house sales over 20 years. And this is exactly the same sort of frame that I was using before. So I can do the same things I was doing before, like pick only the columns that I actually understand.
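In outline (HousePrices.GetFrame stands in for the talk's demo provider and is not a public API; dt mirrors the on-screen helper):

```fsharp
// Helper for constructing the DateTimeOffset keys used by the virtual frame
let dt y m d = DateTimeOffset(DateTime(y, m, d))

// Ask the BigDeedle provider for a frame covering ~20 years of sales
let houses = HousePrices.GetFrame(dt 1995 1 1, dt 2016 6 1)
let cols = houses.Columns.[ ["Price"; "Town"; "Street"; "Duration"] ]
```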
21:22
So I'm going to do that. And when I select this and press Alt+Enter, it actually downloads the data here. So if I — let's see — what I can actually do here is that I can even scroll through this table.
21:44
So if I actually succeed at clicking on this little thing here, I can scroll down, and you can see the logging coming in as I scroll through it. I just picked some place in the middle. It needed to download the data for the second and third of May,
22:02
because that's where the scroll bar sent me an event; but then it just loaded the data for 2003, which is where I actually am right now. So what I'm doing here is just scrolling through roughly four gigabytes of CSV file
22:23
with UK house prices. And I can filter it, and it just does it without downloading all the data, because I'm just removing some columns.
22:41
I make it a bit smaller. So what else can I do with it? I can, for example, look at some column. So if I look at just the street, I can access the column, and it just picks that one column and gives me a single data series with the names.
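Column access keeps things virtual; a sketch:

```fsharp
// Still lazy: only the visible chunks of the column get downloaded
let streets = houses.GetColumn<string>("Street")
```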
23:05
And we'll see, you can do a lot more things with it later on. So now I've been working with the frame, with the data set, and you can nicely explore it here in the IDE,
23:20
but it's never actually trying to load the entire data set. So in this case, it's roughly four gigabytes, but it doesn't really matter. If it was four terabytes, whatever you can put in your storage, it would still work the same. So let me disable the logging.
23:42
And typically, when you have data in this structure — well, the first thing I wanted to do was just explore it, to see some interesting parts of it. And now I'm going to select some small range
24:01
and do a local calculation over it. So the typical pattern is I just want to understand the data and before I even try to run something over the entire data set, I want to test it on some small subset locally. So I'm going to take the houses and from the rows, I'm going to do some filtering
24:22
and just take one month of data, from April 2010. So the idea is I want to compare the data I had for 2016
24:40
with a similar month in 2010. And let's just save this. And one of the functions — the helper that I defined earlier, materialize — is just a helper that actually forces the download of the data.
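A sketch of the slicing, with a guess at what materialize does (the talk defines it off-screen):

```fsharp
// Hypothetical version of the talk's helper: copy the rows into an in-memory frame
let materialize (df:Frame<'R,'C>) =
  df.Rows |> Series.mapValues id |> Frame.ofRows

let april2010 =
  houses.Rows.[dt 2010 4 1 .. dt 2010 5 1] |> materialize
```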
25:00
So now you can see this will be running for a bit because it needs to issue a couple of requests to the data source to actually download all the prices for the one month period. And now we have it. I'll say a bit more about how this actually works later on.
25:21
So now we've actually downloaded one month of data and we can play with it locally as before. And I'm just going to copy the same code I wrote before where we take the data and aggregate it. So this is exactly the same thing I was writing before.
25:43
And now if we do, if we take the last or the 20 most expensive places and plot it, then you can see the difference between the chart I was showing earlier somewhere.
26:02
If I can scroll through this and find it — where was my chart? There it is. So this was how the 20 most expensive places in the UK looked last month, with the big red blob in London.
26:22
And this is how it looked six years ago. And you can sort of see, if I zoom in, there's a bit more diversity — there's actually orange; it's not just green and red. And the average house price in London was half of the one this year.
26:41
So over six years it basically doubled. All right. So here, I think what was really interesting is that instead of using the local data set, which you can load from a CSV file,
27:00
I was using this virtual big data set which was loaded on demand, and the tooling and the libraries are all actually able to cope with it, because they only access the bits that are needed. I could even show it here in the output in the grid,
27:22
because the grid also only asks the data frame to give me the data that you can see on the screen. So as I scroll through it, it asks the frame, give me this range and that works because I can always load
27:41
one part of it, one range. So this is the name of the library, and I'll give you all the links later on. Deedle is basically an exploratory data frame and time series library, which means that it lets you really easily drill into the data
28:01
and see what it looks like — and with tools like Ionide, that's actually so much nicer. It has this in-memory data source, which is very well tested; it's what BlueMountain is using. And the virtualized data source is something that's newer. We've been working on it,
28:22
and it lets you do very similar operations but over data that's not actually in memory. Now this is something, if you wanted to use it, it's not tied to one specific data source. So it's not like you have to have Azure or you have to have Cassandra or you have to have whatever.
28:42
You just have to implement two interfaces. One is basically defining the addressing scheme. So how do I map the keys to some actual offset in the frame and what's the next address and so on. There's samples on the internet
29:00
that do this for partitioned data — when you store your data partitioned by month or by day, there's a nice example showing that, and this is what I'm doing here. And then you have to define this virtual vector source, which is the thing that actually loads the data. So it takes a little bit of work to actually do this,
29:21
but it doesn't matter what data source you're using. Here I was using one example with the houses; that's just pulling the data from a REST-based API. So I have a REST-based API on Azure, written using Suave, that lets me say:
29:43
give me data for this day. And that's all it exposes. And that's all I needed to actually be able to show it in the frame. So the one thing that I was doing here was that I always worked with it locally.
30:02
So even though I have this big data set, I only downloaded like one month of the prices to my machine and then did some calculation there.
30:20
And this is very nice because you can actually play with it and write your computation and see if it does what you want it to do. In the true F# Interactive spirit, you always spend a lot of time writing your code in F# Interactive, running it, testing if it works,
30:42
and only when you actually wrote it and it does what it's supposed to do, you'll run it on some larger data set or you'll turn it into an application that you can put in production. So here, we've done the interactive bit, but now we want to do some calculation
31:00
over larger chunk of the data set. And for this, I'll actually need to reset my REPL.
31:20
All right. The reason why I need to reset my REPL is I'll be doing some cloud computations and I'll be able to send some code to the cloud, but I don't want to send it all the data that I already loaded in my REPL
31:41
because it would be in scope. So what I'm using here is another really cool project called MBrace — don't worry about the names; I'll have a summary slide and give you enough time to take a picture later on. So MBrace is a library that lets you do
32:02
the same sort of interactive programming style that you're used to from F#, but run it in the cloud rather than locally. And I actually started my cluster in the morning, so assuming Azure hasn't consumed all my money,
32:21
this will work. If you want to test it, you can do it on a local cluster, but then it sort of misses the point. It's no fun when you have clusters spinning on your machine — like 10 different machines, but all on my little laptop.
32:41
So, this object here, cluster, is the connection to the cluster and I have my houses here as well, the rest is pretty much the same. And what I'm going to do is I'll start,
33:01
this is a function, so I wrote this function before, which just loads the houses, gets a range over one, like between two dates, gets only the sales in a specific town,
33:22
which are freehold — meaning you actually own the house — and then it just calculates the average price. And rather than running it locally, I want to run it in the cloud. So what you do first is you say cloud. And this is using F# computation expressions
33:44
to basically wrap the function and change how it's executed so that rather than running locally, we can ship it off to the cloud. And now I can define my function
34:00
and we have a cloud function here. So what I'm going to say is — what do I want to do? I just want to do it for, let's just look at April. So I say dt 2010 4 1 to dt 2010 5 1, for London.
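A sketch of the cloud-wrapped helper (the body is my reconstruction of what the talk describes; it assumes open MBrace.Core and the houses frame from before):

```fsharp
let averagePrice (lo:DateTimeOffset) (hi:DateTimeOffset) town = cloud {
  let sales =
    houses.Rows.[lo .. hi]
    |> Frame.filterRowValues (fun r ->
        r.GetAs<string>("Town") = town && r.GetAs<string>("Duration") = "F")
  return sales?Price |> Stats.mean }

let londonApril2010 = averagePrice (dt 2010 4 1) (dt 2010 5 1) "LONDON"
```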
34:33
Now this expression represents a cloud computation, and what you can do with it is say cluster.CreateProcess,
34:45
and this will create a process. It will basically take the code that you have in your F# Interactive, connect to the Azure cluster in this case, send the code there, and start it there.
35:02
So it prints some logging, and it created a work item for the cluster. Here I'm using Azure, but they also have an AWS backend for MBrace, which is a fairly new thing. And I can then check the status — and it's done already.
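Running it then looks like this with MBrace's process API:

```fsharp
let proc = cluster.CreateProcess londonApril2010
proc.Status   // poll what the cluster is doing
proc.Result   // block until the average price comes back
```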
35:23
So this was fast and I can access the result. So this is the average house price for London in 2010. Now this wouldn't be all that interesting. We just processed one month but what we can do with the cluster
35:40
is that I can say: let's go over the years from 1995, which is the first year, to today. And here I'm building a list of cloud computations, so we're going to say: calculate the average price
36:05
for London, in this case, and I need to change my years. So I'm just looking at April, because I don't want to burn all my Azure credits, and then we are going to return the year together with the average price.
36:25
And so this is average London. And now, before doing CreateProcess, I can say Cloud.Parallel — if I know how to type 'l'. Cloud.Parallel. And what this does
36:43
is that it takes the list of cloud computations, which is like 20 processes, and it distributes them over the entire cluster. And if I look at my computation status, it is running.
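The fan-out, roughly:

```fsharp
let avgLondon =
  [ for y in 1995 .. 2016 ->
      cloud {
        let! price = averagePrice (dt y 4 1) (dt y 5 1) "LONDON"
        return y, price } ]
  |> Cloud.Parallel          // distribute the ~20 computations over the cluster
  |> cluster.CreateProcess
```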
37:02
And I can also look at the cluster here — I've actually written a little formatter for the cluster — so here are all my eight machines in Azure. There's the CPU usage, the network (because they're actually fetching the data), and the active work items.
37:20
So it evenly distributed the work across my entire cluster. I'm going to do the same thing for Cambridge, so I'll just do some copy-paste here. This is the nice thing about doing things in an interactive way: you don't have to be clean;
37:40
you can just copy and paste things, and clean it up later when you know it works and does the right thing. But for now, we'll just start the same sort of computation for Cambridge. And when it starts, we can check: London is still running,
38:02
Cambridge computations are running too, and here's my cluster. You don't actually need that strong an internet connection to ship the work to the cluster — I was even able to run it in my hotel, where even checking Twitter is a bit of a problem.
38:25
So when this eventually finishes, we'll get all the data, and what I want to do then is just to compare the price changes in London and in Cambridge. So I'll write the code for that while it's running:
38:45
we'll take the Cambridge prices and the London prices — this is .Result — draw a line chart, and then add labels
39:03
saying this one is Cambridge and the second one is London. Alright, so how's my cluster doing? Zero work items, so it looks like everything's completed — yes, it is — and we can get the data and draw a chart here.
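Assuming the Cambridge run produced a handle avgCambridge the same way, the chart is just:

```fsharp
[ avgCambridge.Result; avgLondon.Result ]
|> Chart.Line
|> Chart.WithLabels ["Cambridge"; "London"]
```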
39:23
So this is the interesting growth of average prices in London — although you can see that it actually went down, for the first time in a long time, in the last month. In Cambridge it's still going up.
39:44
You could fit some model to it, and it would tell you that next year you won't be able to afford anything anymore. Cool, so this worked. And what's really interesting here
40:00
is that we took the same sort of interactive style that lets us easily explore data locally, and we used it to run the same code — in this case on an Azure cluster rather than locally. But you don't lose any of the nice exploratory style,
40:21
which is, I think, what makes data analytics in F# really powerful and really efficient. You write the code, run it locally, and test that it works — and then you can just say: cluster, run this, and check what's happening in the cluster.
40:42
And the amazing thing here is that the transition from the local code to the cluster code is really just a matter of adding this workflow. And if you went to the — well, no, this is actually now
41:01
in C# as well, so you don't have to go to the 'future of C#' talk to see this; this is the present in C#. In F#, I think the new feature in 2008 was this async workflow, which lets you —
41:20
it's like async/await in C# — it lets you run code without blocking. And this was done as a more general-purpose feature. So these days, when cloud becomes a thing, what the MBrace project is doing is just taking the same programming model,
41:40
the same constructs, but defining another computation that, rather than doing things asynchronously, does things in the cloud. And I think this is the really amazing power that you get from F# here: without changing the language, they were able to build a library that uses the same principles for something
42:03
that's really important, really cool today. So the project is MBrace — you can find more information on the website — and it basically does what I was showing here: it takes this nice data scripting approach but brings it to the cloud.
42:24
It has the cloud computations, which is the curly-brackets thing that I was showing. It also has data-flow support. So if you have lots of data that you want to process, you can do it as I was doing here,
42:41
with the frames, but you can also just use the MBrace programming model, where they have nicely optimized data-flow streams that you can process across lots of different machines in the cloud.
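That data-flow support is MBrace's CloudFlow API; a tiny self-contained sketch:

```fsharp
open MBrace.Flow

// Count words across the cluster; OfArray is just the simplest source
let counts =
  CloudFlow.OfArray [| "ndc"; "oslo"; "ndc" |]
  |> CloudFlow.countBy id
  |> CloudFlow.toArray
  |> cluster.Run
```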
43:01
So that's MBrace. And I have one more sample, which moves away from the house prices to some financial data; I just want to show you a few more things that you can do with this. So let me close some of the things —
43:21
and here, I think everything I have here I already loaded. Now, this time I'm going to be using another implementation of the virtual frame. The previous one was fairly simple, and it used a REST-based API as the source.
43:42
Here I'm using Azure tables, but again, it's just an implementation of an interface, so you could use anything you want. You could use your own data sources that are local in your own little cloud, if you have one,
44:02
or you could build one for AWS, whatever. I do have my cluster — it's still there, hopefully. And this frame, WDC, is, I think, all the trades
44:24
in New York, or on Nasdaq, or somewhere, for Western Digital Corporation. And it's actually, I think, every single trade of the company's stock
44:43
over some number of years. So you can see that even downloading the first and the last day takes a little bit of time. At the beginning there are not that many of them,
45:02
but if you scroll down a bit, it needs to download the second day. And if you scroll down a bit further, you'll see that there are a lot of trades every second. So now I just scrolled, and it's logging that it needs to get
45:21
the second day of data. And you can scroll through it; you can explore it interactively. If I wasn't connected over Wi-Fi to an Azure data center in New York, the scrolling would be faster.
45:41
And as I was saying before, with these virtualized frames you can actually do quite a lot. So I can, for example, take the ask price as a series, and it just gives me one sort of view of the data — but you can even do some calculations with it.
46:02
So for example, if I wanted to see what's the difference between the ask and bid price, then I can just say: get one column, subtract it from another column, and you get the difference. And this is still keeping the —
46:23
it knows that it's still virtualized so it's not getting all the data and evaluating anything. It only does it when I actually want to see it. You can do all sorts of calculations here
46:40
like multiply it by something. You can even use built-in F# functions like round. It still understands that this is just an operation applied to a virtual frame. I can also add this as a new column — this is the only place where I can actually
47:02
mutate the frame where now I have the bid and ask and the difference and so on. The next thing that I want to show is doing some other interesting calculations over this.
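The spread calculation might look like this (column names assumed; in Deedle, arithmetic and functions like round lift over whole series):

```fsharp
let spread = wdc?Ask - wdc?Bid        // still a lazy, virtual series
wdc?Spread <- round (spread * 100.0)  // adding a column is the one mutating step
```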
47:23
So what do I want to do? I want to take just one day of data so I'm going to say rows. Well actually, I do have a trick here so I don't have to write this.
47:41
Here I'm saying: take this date range, which is the first day in my data set, and take the differences. And if you're doing some financial calculations, then in Deedle, if you have a time series, you can do other interesting operations on it.
48:03
So what I want to do here is just take a moving mean, which is fairly basic, but you can see there's a lot more interesting stuff. And then I think — oh, I need to do one more thing,
48:23
which is that the keys in the series are actually DateTimeOffset values, and our charting library can only deal with DateTimes. So I need to transform the keys so that they're DateTimes,
48:43
but that's something that will probably get fixed soon. And I can draw a line chart of the moving average over one day — and I did something wrong; what did I do wrong? Let's try this. Oh yeah, and now my Atom hangs
49:05
for a bit, because the charting library actually has a tough time rendering all my data points. But this is the moving average of the prices, which is one sort of thing you can do.
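That pipeline, roughly (firstDay is a hypothetical name for the sliced day of ticks, and the window size is arbitrary):

```fsharp
firstDay                     // one day of tick data, sliced as shown earlier
|> Stats.movingMean 1000
|> Series.mapKeys (fun (k:DateTimeOffset) -> k.DateTime)
|> Chart.Line
```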
49:20
The other thing you can do is various kinds of re-sampling of the data — so this is not what I wanted; prices, then sampling — so here I'm again taking the prices,
49:40
and I'm calling this Series.sampleTimeInto function, which will basically split it into regular one-minute chunks and calculate the averages over each minute.
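The re-sampling step (Series.sampleTimeInto is Deedle's API; Direction.Forward is an assumption about the exact call):

```fsharp
let perMinute =
  prices
  |> Series.sampleTimeInto (TimeSpan.FromMinutes 1.0) Direction.Forward Stats.mean
```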
50:03
And the next thing you want to do here — so, in F# there are quite a lot of libraries for doing numerical computations; there's Math.NET and so on. But equally, if you know R, then in R
50:21
there are, like, thousands of packages for pretty much everything that has to do with numbers. So it would be nice if I could use some of the financial libraries that people have written for R. And in FsLab, one thing that's there
50:41
is this thing called the R type provider, which actually imports all the R libraries that you have installed in your local R installation and lets you access them from F#. So, one of the packages is the stats package — that's a built-in one —
51:03
and if you're doing finance, one of the many financial packages is quantmod. And so I can use some of the R functions on my values. So what I can do is say R-dot,
51:23
and now I actually see all the R functions that the R type provider imported. One of them is this Delt, which basically computes returns on prices: it takes the current price, subtracts the previous price,
51:42
and calculates how much you would have gained over the period. So we can apply this to our values, and I think this gives me a result which is some R value, and I can convert it
52:04
into just a sequence of numbers that I can then get back into my F# world as a series. So if I run this, it actually invokes R on the data I got from the remote storage.
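A sketch of the round-trip through R (quantmod's Delt via the R type provider; the unwrapping into a Deedle series is my reconstruction):

```fsharp
open RProvider
open RProvider.quantmod

let returns =
  R.Delt(perMinute |> Series.values)   // call into the local R installation
   .GetValue<float[]>()                // unwrap the R vector into floats
  |> Series.ofValues                   // back into the Deedle world
```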
52:24
And here I'm running R locally. I got all zeros, which is just because my formatting is actually set up to show only two decimal places — so if I multiply it, I've got some returns here.
52:42
So here I was again doing stuff locally, on data that I got from some remote storage — which is actually what people at BlueMountain do all the time, because you want to test some strategies locally. And the last bit I want to show you is that this works in the cloud as well.
53:05
And again, I'll need to reset my F# Interactive so that I'm not pushing all the local data into my cloud. I'll need to reload some of this again,
53:21
and I need my cluster, and I need this as well. I'm not going to write everything from scratch here, because we don't have that much time left. But the first thing here is that I have a function,
53:44
meanMinuteReturns, which is calculating — and I need to open my R provider — so this is calculating average returns over one minute. It's using the Deedle functions for doing things
54:03
like sampling, and then it's calling R to do the actual calculation. And using the cloud blocks that I was showing before, I can run this over three different months.
54:27
So here I'm looking at the first quarter of 2015, and when I run this, it will actually take all the code I wrote here, send it to the cloud, and start it there.
54:42
And now I have this Q1, which is my handle for the process that's running in the cloud. I can see it's running, and if I look at my cluster, it created all the work items and it's running there.
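That run could be expressed like this (meanMinuteReturns is the talk's helper; its signature here is assumed):

```fsharp
let q1 =
  [ for m in 1 .. 3 ->
      cloud { return! meanMinuteReturns (dt 2015 m 1) (dt 2015 (m+1) 1) } ]
  |> Cloud.Parallel
  |> cluster.CreateProcess

q1.Status   // watch the work items drain from the cluster
```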
55:01
When it eventually finishes, I'll show you a chart. But before that happens, a few important points. One thing that's really nice in this demo I'm running is that we actually —
55:23
I have the data stored in an Azure data center somewhere, and I created the MBrace compute cluster so that it's in the same data center. So rather than downloading the data and doing some calculations locally,
55:41
I'm actually shipping the code to the same place, where it can run much faster, because the data access isn't downloading anything — it's just asking a machine across the corridor or something for the data. It's still running, because we're actually processing
56:00
a fairly large chunk of the data set, but I think you can see it's getting closer to the end, so let's leave it running for a bit. The other interesting thing I was doing here was using the R type provider. I actually didn't show you any other type providers — this is probably my first talk where I haven't done that —
56:23
but lots of people here were showing other type providers, for things like JSON. I was only doing the R type provider, which is using the very general F# mechanism to import all the R functions.
56:40
And I think the general theme here is that with F# it's really easy to integrate with lots of other environments. If you're pulling data from somewhere — from JSON or CSV — you get nice typed access to it; if you're calling R, you get nice typed access to it as well. And let's see what my cluster is doing —
57:02
it's still running. The other interesting thing that I had to do here, which I didn't show you because it's boring, is that when I created my cluster, I actually ran another cloud computation there that installs R on all the machines in the cluster, and then it installs all the dependencies.
57:23
So you can actually do lots of weird things to these clusters, including running arbitrary code that installs stuff there. This is the slide, and I'll tweet it, with references to all the things —
57:40
all the libraries, all the websites where you can learn more about what I was doing here. fsharp.org is the best place to get started. Ionide is the F# plugin for VS Code and Atom that has this nice integration with HTML formatters.
58:01
FsLab is the data science components, all in one place, and MBrace is the scalable computing project. I was using the Deedle library for doing the exploration, the R type provider for calling R and accessing all the different statistical functions,
58:24
and XPlot is the library for the charting. So let's see if this has finished — no, it's still doing some stuff. I'll leave it running, and you can probably find me later if you want to see the results.
58:41
It needs a few more seconds to complete. If you want to remember just one URL, it's fsharp.org, where you can find links to all the other things. And I'll be around if you want to chat more — or, to see the pretty charts, you can come to the FP Lab at 1:40 in room 10.
59:05
It's, like — if you enter the building, just go straight, and then it's somewhere up there, hidden. This is where all the functional people go at 1:40 to chat.
59:21
In this room there are going to be more FP sessions — I think we've done all the F# ones, but there are lots of exciting Elixir talks. And I'm part of fsharpWorks, which does consulting and F# training. So if you want to learn more, or if you want to help with implementing
59:40
all the weird interfaces that I carefully designed so that only I can implement them, you can talk to us. Thank you. We probably don't have that much time for questions, but you can just find me here and chat later.