
Algorithmic Trading with Python


Formal Metadata

Title
Algorithmic Trading with Python
Part Number
90
Number of Parts
169
Author
Iztok Kucan
Joris Peeters
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Iztok Kucan / Joris Peeters - Algorithmic Trading with Python. Have you ever wondered what technologies are used in a systematic trading system that utilises computer models and accounts for the majority of trading on the stock market? This is a look behind the scenes at Winton Capital Management, one of Europe's most successful systematic investment managers. In this talk, we'll run through an overview of Winton's trading infrastructure, including data management, signal generation and execution of orders on global exchanges. The talk will mainly focus on how Python gives researchers fine-grained control over the data and trading systems, without requiring them to interact directly with the underlying, highly-optimised technology.
Transcript: English (auto-generated)
Now we have two speakers, Kucan and Peeters, about algorithmic trading with Python. Very interesting. Thank you. Hi. This talk is on algorithmic trading with Python. Just to clarify some terms: by
trading I mean buying and selling financial instruments on financial exchanges. By algorithmic I mean there is a computer program running some kind of an algorithm that decides what to buy and what to sell in these markets. At Winton Capital we manage about 35 billion dollars using a platform primarily, or largely, constructed of Python. We also use a lot
of Python for research and data analysis around those activities. The talk is going to go roughly as follows: we'll do a quick company overview, we'll have a little bit of an overview of our research activities and the trading pipeline itself, and then Joris is going to go into quite a bit of detail
about how and where we use Python. My name is Iztok Kucan, I'm the head of core technology at Winton, and Joris is the head of a very exciting new project we have, the data pipeline project, and particularly the heavy use of Python there. If you've come across Winton in the past, you may have seen us called a
quant fund, an algo trading outfit, a hedge fund, a commodity trading advisor; all of those are valid, but I think a lot of us would much rather be described as an investment management company that uses the scientific method to
conduct investment. What do we mean by scientific? Well, a heavy use of empiricism: hypothesis testing, experiment construction, and statistical inference in how we derive the strategies that we then trade upon. We have around 100 researchers, which is about a quarter of the company, typically with a
background in academia: academics, ex-academics, or postdocs. They are organized in teams, and a lot of the activity is peer-reviewed, so it's a fairly open process in how we arrive at the signals. Another quarter of the company is in engineering, which again is a fairly empirical discipline itself.
Geographically, we're primarily a UK company, with roughly 400 staff in the UK, mainly in London and some in Oxford, but we're expanding globally, with four offices in Asia and two in the US. A lot of those offices are not just sales offices; many of them are actively growing, so for example, we have a
new data labs outfit in San Francisco looking at esoteric data. Okay, so this is a Python conference, so what about Python and Winton? Winton has been active for about 20 years, and for the initial few years, the systems were far simpler
than what they are now, and effectively ran on an Excel spreadsheet. Of course, gradually C++ extensions started creeping into that Excel spreadsheet, and eventually those things were taken out of Excel and formalized as a set of objects called the simulation framework. That was, and remains, the core modeling tool and also the execution tool for our trading
systems. But we found that as the framework gained flexibility, we needed Python to start combining these objects in a more flexible way. For example, if I want to do a delta series and then a volatility series, I would be using the same two objects as I would if I wanted to do a volatility series and then a delta on the volatility, but I want to combine
them in a different manner, so Python was quite useful for that. As soon as we started using Python in that manner, it became very attractive for us to start writing strategies in Python themselves, and from then on it never really stopped. So over the last 10 years we've adopted Python for constructing the
trading platform, but also increasingly in data analysis and in research. So I'm starting to separate these two terms, research and investment technology, with quite a strict distinction between what is exploratory activity and what is trading activity. Exploratory activity, research, is
looking at things that may lead to something, or often will not. The research itself is conducted along three lines: core research, which is research into signals and, let's call them, market behaviors; data research, which is research into data and the properties of that data; and then, in an extended sense, deriving data analytics like
volatility profiles, volume profiles and correlations from that data directly. And, as I said before, we now have a data labs section in San Francisco which looks at esoteric, speculative data sets like satellite
imagery or the deep dark corners of the internet. Once signals are derived, we transfer them into the investment technology section. Now, this is a much more rigorous exercise where we have a quite static trading pipeline, and the key there is that you can do things in a very repeatable, very reliable, very secure manner with some sign-off, and the pipeline
itself is composed of roughly four stages; let's call them data management, signal generation, order management, and post-trade monitoring. Now, Python is used a lot in research, but it's also used extensively in the data management and signal generation parts of the trading pipeline.
With data management, typically the things we do in the trading pipeline are to obtain large sets of data, clean them, and transform them into the things we need. We use things like versioning to make sure that we can repeatedly see data as it changes, and Python underlies all that architecture.
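To make that versioning idea concrete, here is a minimal, purely illustrative sketch (a toy analogue, not Winton's implementation) of an append-only store that supports "as-of" reads, so a backtest can always see the data exactly as it looked at an earlier point in time:

```python
import datetime as dt

# Toy sketch of the versioning idea described above -- not Winton's implementation.
# Every write is kept; a read can ask for the data "as of" some earlier time.
class VersionedStore:
    def __init__(self):
        self._versions = {}  # key -> list of (written_at, value) tuples, append-only

    def write(self, key, value, written_at=None):
        written_at = written_at or dt.datetime.utcnow()
        self._versions.setdefault(key, []).append((written_at, value))

    def read(self, key, as_of=None):
        versions = self._versions[key]
        if as_of is None:
            return versions[-1][1]                      # latest version
        eligible = [v for t, v in versions if t <= as_of]
        if not eligible:
            raise KeyError(f"no version of {key!r} as of {as_of}")
        return eligible[-1]                             # newest version not after as_of
```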
For the signal generation part of the pipeline, we also use Python extensively: Python still drives the simulation, which is a time series transformation engine, and, increasingly, Python is also interfaced to a data storage engine called the time series store. Joris is going to go into that in a bit more detail.
Right, so I'll give a bit more detail about how we actually use Python, some low-level detail, and where exactly it sits in our stack. The main reason we use Python, really, is because it presents quite a friendly face to research.
Our low-level code is typically all in C++, so the execution or simulation platform is not something you want a researcher to write. Instead we expose all our code through APIs that are typically in Python; there are a few other options, but Python is definitely the main choice. And it's not just for research: because it's such a nice, programmatic interface, we use it for monitoring, typically served as a web service as well,
and directly in signal generation. The reason we chose Python is that it's extremely well known and very easy to learn; if you don't know Python, it's probably not too long before you do. And it comes with a lot of support for data analysis and visualization, so it's quite nice, as a researcher,
just to get all of that, batteries included. So, this is a fairly large-scale overview of our trading pipeline. There are a few core principles to it. The whole thing is event-driven, so something happens which causes something else to happen. In this case, for example, we get our data from Bloomberg;
as soon as the data is there, we automatically construct our equities prices and our futures prices, and once that's done, all our strategies automatically kick off. That kind of event-driven flow sits really at the core of the Winton technology these days. And then we have, as Iztok mentioned, the simulation, which sits at the bottom right there. So whilst Winton is pretty much a graph, a real-time graph,
with all services just sitting there listening to stuff happening, we also have the simulation, which is kind of like an in-memory, offline graph. It's really designed to do time series analysis: we spin up a trading system, that will kick off one of these simulations,
you run it, you can tear it down, you can serialize it, and that's the other main technology that we have. So I'll give a bit more detail about both the real-time graph, which we call COMET, and the simulation used to test and backtest strategies, which is written in C++. First, the simulation. It's written entirely in C++,
and it's been going on for about 10 years now, I think, pretty much right after we moved away from Excel. If you just ignore the left-hand side, it's similar in concept to things like TensorFlow that have appeared these days: essentially it's a graph, it's very well optimized, and in our case it's strongly typed, so you can't just fit anything into anything;
it's strongly typed data. There's an example of a graph there, quite a simple one: two data series feed into something like a formula, which can be the sum of these two series, and then you calculate that thing. Now, that's all running remotely on a calculation server; typically these things can cover thousands or tens of thousands of assets,
so you don't want to run that on your local machine; we run it on big calculation servers. But on the left-hand side we expose the Python client, so any user, any researcher, can launch or spawn off any of these simulations, connect to it, and have full control over the remote simulation. There's actually an example there on the left, a real Python script:
the first thing it does is start the remote session, which is going to cause one of these simulations to be constructed and launched on the server; then it constructs two time series, constructs a formula, and then calculates it. That's the only thing you have to write, and you have full control over the simulation. This means researchers don't need to know any C++;
everything they need really is in the simulation, it comes with it: it comes with trading systems, it comes with universes, all the kind of stuff we need, and it essentially gives them fairly high-level control over anything they need to do.
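A session along the lines just described might look roughly like the sketch below. Every name here is invented for illustration; only the sequence of steps (start a remote session, build two series, combine them with a formula, calculate) comes from the talk.

```python
# Hypothetical sketch of the client workflow described above -- the `simclient`
# module and every name in it are assumptions, not Winton's real API.
import simclient  # assumed Python bindings to the remote simulation

# Launch a simulation on a remote calculation server and connect to it.
session = simclient.start_remote_session(host="calc-server.example.internal")

# Construct two time series nodes; locally these are thin proxies for
# objects that live in the remote graph.
series_a = session.TimeSeries("equities/ABC/close")
series_b = session.TimeSeries("equities/XYZ/close")

# Combine them with a formula node, for example the sum of the two series.
total = session.Formula("a + b", a=series_a, b=series_b)

# Trigger the remote calculation; results come back as a pandas Series.
result = total.calculate()
print(result.tail())
```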
A little bit about the technology; I'm not going to go too deep here. The Python bindings are extremely lightweight, so they don't know anything about the simulation per se. As soon as they launch a simulation, they get everything they need from that simulation: they populate your Python client with all the objects, the classes are dynamically generated, the objects are spawned into your namespace, and if you create new objects, they are created both on the remote client and on the local client. Essentially it gives you full control locally,
as if you were doing it remotely. It's very friendly in Python, so all the data is returned as pandas Series, DataFrames, all that kind of stuff. One thing you can't do with the simulation bindings, though, is this: you can control the graph, but you're limited to the things like formulas, value-based series, universes, or particular trading systems that technology has implemented in C++.
What you can't really do this way is take a completely outlandish trading system that you want to try and plug it into this graph; if it doesn't fit this kind of formula or data shape, then you're stuck, and for that we designed embedded Python. What you can actually do is, in Python,
write one of these objects that runs directly in the simulation graph. From then on, anybody can launch it remotely and run your trading system, and you don't have to write any C++; you can just contribute your Python code and everybody can run it. It wouldn't normally go into trading; this is more intended for rapid prototyping:
a researcher can pretty much build their trading system in Python and test it in the simulation, which means they can backtest the system from 1970 to now without having to write C++. There you might often have to wait a month or two for technology to actually implement it, which is not a really good turnaround. So they can just build their thing, run it, test it, and if we're happy with it, then we can still implement it in C++ afterwards.
That's kind of the idea; it's definitely aimed at rapid prototyping, although some of it is actually in trading as well. The technology there: unsurprisingly, the C++ executable hosts a Python interpreter. We use Boost.Python to do the marshalling, and all the data is exposed as NumPy, so we use the NumPy C API for performance.
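As an illustration of the kind of object a researcher might contribute, here is a sketch of a simple Python trading-system node. The class shape and method name are assumptions; the only part taken from the talk is that such objects run inside the C++-hosted interpreter and exchange data as NumPy arrays.

```python
import numpy as np

# Illustrative sketch of a researcher-written node that an embedded interpreter
# could host inside the simulation graph; not Winton's code.
class MovingAverageCrossover:
    """Goes long (+1) when a fast moving average is above a slow one, else short (-1)."""

    def __init__(self, fast: int = 20, slow: int = 100):
        self.fast = fast
        self.slow = slow

    def compute(self, prices: np.ndarray) -> np.ndarray:
        # `prices` arrives as a NumPy array handed over from the C++ side;
        # assumes len(prices) >= self.slow.
        fast_ma = self._rolling_mean(prices, self.fast)
        slow_ma = self._rolling_mean(prices, self.slow)
        return np.where(fast_ma > slow_ma, 1.0, -1.0)

    @staticmethod
    def _rolling_mean(x: np.ndarray, window: int) -> np.ndarray:
        # Trailing mean; the first `window - 1` points use the partial mean so far.
        csum = np.cumsum(np.insert(x, 0, 0.0))
        out = np.empty_like(x, dtype=float)
        out[window - 1:] = (csum[window:] - csum[:-window]) / window
        out[:window - 1] = csum[1:window] / np.arange(1, window)
        return out
```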
Yeah, essentially you've got full control through this embedded Python for making your own Python trading system available in the C++ backend; it's extremely powerful. So that's the simulation. As Iztok mentioned, we have this problem that we need to shift lots of time series back and forth. There's an enormous amount of time series to be saved, we have hundreds of thousands of assets,
and you need to be able to very quickly load and write these to a database. Things like SQL are way too slow, because we do so much historical backtesting: we have to load all the data for 300,000 securities from the 1970s to now, in memory or distributed, and then write the results back. So, at the time when we started this there wasn't really a good alternative, so we built our own versioned data store,
which I'm not going to go into in too much detail, but it's a columnar format, so it's super effective for storing lots and lots of time series, and they're typed. It's backed by MongoDB, but that's kind of an implementation detail; anything that can map a key to a binary blob would have worked.
And what's really key is that it's immutable data. Once you've written your data frame, if you do it in Python, all you want is to get exactly that data frame back; we don't want the data to change. If you've written something it can never change, and you always get it back exactly like that. That's really at the core of our strategies: obviously, if you're testing a strategy,
you don't want the data to change underneath you. You want reproducibility, you want to know exactly what you did, and you want to be able to do that forever in the same way. So this time series store was really revolutionary; it actually opened up a lot of possibilities. The technology there follows the same pattern again: we tend to do something low-level in a really optimized way,
in C or C++, and then we expose all kinds of high-level libraries to make it more accessible to users. So the store here is backed by MongoDB; there's a C library that sits on top of it, which deals with this columnar storage so that we can store data very effectively, and then we build very thin libraries on top of that: C++, C# and Python here,
and we're building a JVM one as well. Essentially these can be accessed by different kinds of technologies: C++ would typically be the simulation, but a researcher might use the Python library, and rather than having to deal with this kind of low-level columnar storage, a researcher can just put data frames in there. They get translated into C arrays, and then given back to you
as data frames. Yeah, a small implementation detail about how we've done this: we used CFFI, the C foreign function interface, and the nice thing is that it's such a friendly Python interface; you give it a data frame and you get a data frame back, and you don't need to know about any kind of table formats or type conversions and all that kind of stuff.
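The general shape of that thin wrapper, with CFFI doing the marshalling, might look something like the sketch below. The C function, its signature and the library name are invented for illustration; only the layering (a simple C interface underneath, pandas objects on top) comes from the talk.

```python
import numpy as np
import pandas as pd
from cffi import FFI

ffi = FFI()
ffi.cdef("""
    // Hypothetical C89-style entry point: fills caller-allocated arrays and
    // returns the number of points written, or -1 on error.
    int ts_read(const char *key, double *values, long long *epoch_ns, int capacity);
""")
lib = ffi.dlopen("libtsstore.so")   # hypothetical low-level C library

def read_series(key: str, capacity: int = 1_000_000) -> pd.Series:
    values = np.empty(capacity, dtype=np.float64)
    stamps = np.empty(capacity, dtype=np.int64)
    n = lib.ts_read(key.encode("utf-8"),
                    ffi.cast("double *", ffi.from_buffer(values)),
                    ffi.cast("long long *", ffi.from_buffer(stamps)),
                    capacity)
    if n < 0:
        raise IOError(f"failed to read {key!r}")
    # Hand the C arrays back to the user as a friendly pandas object.
    return pd.Series(values[:n], index=pd.to_datetime(stamps[:n]))
```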
COMET transforms: like I told you in the beginning, Winton is essentially a graph of services and simulations. Services are sitting there waiting for stuff to happen; they react to inputs, they produce outputs, and the next thing is going to listen to those outputs and go on again.
This is what we call the COMET transform system. It's microservice-based: they sit waiting on a topic on a bus, which is Kafka actually. There's an example there, and it's super simple: we get the data from Bloomberg, we write it to the store, and we announce that we've done that.
The equities transformation is going to pick that up and write to the store, and then as soon as that's done, our strategies kick in. We've got loads of strategies, so there might be five strategies waiting for the equities prices; they're all going to kick off simultaneously, distributed, and as soon as they're done, we can start going into execution. This is the next-to-last slide.
A little bit about the technology behind the COMET transforms, bringing it all together a bit: all the red things are where we use Python, and everything that's not red is low-level and exposes Python as its external API. All our events are posted on Kafka, and we use protobufs throughout for the communication; it's really nice because it's strongly typed and you can evolve your schema incrementally.
Our service stack is currently in C#, a proprietary service stack that essentially deals with getting and translating the protobufs, but the COMET transforms themselves are Python, hosted by the C# stack in Python interpreters. So anybody can write anything and become part of the graph that is Winton, just by writing some Python code.
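A stripped-down transform of that shape could look roughly like this. The topic names, the JSON payload, the `run_strategy` placeholder and the choice of the kafka-python library are all assumptions; per the talk, the real events are protobuf messages handled by the proprietary C# service stack.

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # kafka-python; library choice is an assumption

# Illustrative shape of an event-driven transform: wait on a topic, do some work,
# announce the result on another topic. Names and payloads are invented.
consumer = KafkaConsumer(
    "prices.equities.available",
    bootstrap_servers="kafka.example.internal:9092",
    group_id="example-strategy",
)
producer = KafkaProducer(bootstrap_servers="kafka.example.internal:9092")

def run_strategy(price_key: str) -> str:
    # Placeholder for the actual work -- e.g. launch a simulation via the sim
    # bindings, write the resulting signal to the store, and return its key.
    return f"signals/example/{price_key}"

for message in consumer:
    event = json.loads(message.value)                  # real system: decode a protobuf
    signal_key = run_strategy(event["price_key"])
    producer.send(
        "signals.example.available",
        json.dumps({"signal_key": signal_key}).encode("utf-8"),
    )
    producer.flush()
```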
That Python interpreter might run a strategy that launches a simulation using the sim bindings, as I explained in the beginning. That simulation can host your own trading system that you've written; that would be embedded Python. The simulation will read and write its data from the version store, which is our efficient way of storing time series,
and that in turn can be read through the Python store library, so anybody can read the data that's been written by the simulation; everybody has access to it through the Python store libraries. Whilst all this is quite complicated, and there are a lot of technologies going around, the theme is always relatively the same:
low-level code that is really optimized tends to be written in C or C++, and the implementation details are quite proprietary; it can be protobuf, it can be Kafka, but as a user you're only exposed to well-chosen APIs that we've defined. They're quite flexible, they're programmatic because of Python, so you can do anything you want, but it is tailored and it's accessible,
and by providing that as the interface, it's still extremely performant, and we find that this works really well. So roughly, as Iztok said, it's all good; we think this works really well and we're quite happy with the system. Python is used throughout Winton: if you're a researcher or if you're in the business, you wouldn't see anything else but Python, you'll just see Python,
and you don't even need to know that there's any C code under there. It's really the primary interface for data management and signal generation, because it gives such fine-grained control that you don't really need anything else. There's no need to go into C or C++; you can, and that's what technology does if it needs to go really fast, but as a researcher you typically don't need it.
So you can define all your own data transformations, you can do whatever you want with the data, you can store data, you can retrieve data, and you're guaranteed it will never change, because there's the time series store, which as discussed is backed by very low-level C/C++ code that is implemented and owned by technology. And the main reason we're doing this
is because it's so great for analysis, visualization, rapid prototyping and maintainability. Because it's such a programmatic interface to all the underlying code, you can write web services, you can write monitoring systems, and everybody can essentially start contributing to them in Python, which means we have an enormous view of what's actually going on in Winton's trading systems.
Yeah, so it's all good, and that's also all I had, so thank you. Yes, one minute left. Okay, oh, many questions. Okay, so I just start from here.
Hi, thanks for the talk. So you're using C and C++ code because you're in high-frequency trading, or is this legacy code? No, neither actually,
so we are in low-frequency trading, and it's definitely not legacy. So even though it is low-frequency, we do continuous historical back-testing, so it means even though we might just trade one new data point, we want to be able to very quickly test the simulation all the way from 1970 to now. So yeah, so that's the main reason.
It has to go fast because we test the whole of history, but we do trade over periods of months. So common tools like pandas and similar don't meet the requirements? Sorry? So tools like pandas and similar don't meet those requirements, like for huge back-testing?
I think we found that it probably doesn't cover all our needs. The trading systems that researchers contribute are in pandas, and a number of them do go into trading. So things like the tracking error control: it does run on pandas, and it does run in trading.
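For context, a tracking-error style calculation is exactly the kind of thing plain pandas handles comfortably. The sketch below is a generic illustration only, not Winton's control.

```python
import numpy as np
import pandas as pd

# Generic illustration of a tracking-error calculation in plain pandas -- the sort
# of computation that needs no C backend.
def annualised_tracking_error(portfolio_returns: pd.Series,
                              benchmark_returns: pd.Series,
                              periods_per_year: int = 252) -> float:
    # Active return is the difference between portfolio and benchmark returns;
    # the tracking error is its annualised standard deviation.
    active = (portfolio_returns - benchmark_returns).dropna()
    return float(active.std() * np.sqrt(periods_per_year))
```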
It doesn't actually have C backends, but we find that if things need to go really fast, and we need that kind of speed, then the C implementation is still considerably faster, to the extent that it's worth doing. Hi, have you open-sourced your time series store, and if you haven't, why not? Open-sourced which one?
Your time series store for data. No, we haven't open-sourced that, for no particular reason. It's only quite recently that we started looking into open source, and I think this is actually on the list of potentially being open-sourced. There's nothing particularly trading-specific about it. It is very generally applicable,
so yeah, that might come up. More questions? Your slide suggests Python 2, probably 2.7. It is exactly 2.7, yeah. Why not 3, and what is the incremental cost of migrating to 3? Well, there's a lot of code;
there's an enormous legacy code base in Python 2. Upgrading it is not trivial, because we have all the C extensions right now, but it is being actively pursued. All the new code that we're going to start developing will be in Python 3, and then we should gradually migrate all the existing stuff. The problem is there's not really an enormous business case right now. It's a lot of effort,
and we don't necessarily get a lot back at this very moment, but we definitely realize that, especially as Python 2 support is going to be dropped, we will have to have moved to Python 3, so that's going to be our main reason. And obviously there are a lot of features that would be considerably better, especially the multiprocessing and stuff; that's for me personally, at least. So yeah, we need a good business case to move, really.
Hi there. Thanks for the presentation. Which exchanges do you trade on, and how many bytes of historical data do you have? Which exchanges? We trade on all the exchanges, but that's more Iztok's area. I'm sure it's about 20 or 30 different exchanges.
I won't go into this, but American equities, European equities, Asian equities, futures, FX now, and fixed income as well. How much data do we have? Depending on how we describe it, typically we ingest probably about a billion numbers a day. We have petabyte-class total capacity,
but that's a rough measure. How much we need is a different story, but that's how much we have. Thank you for your great talk.
I have two questions. Is there any authentication or authorization system? Can some researchers see only a few machines, or something like that? Yes. And how does it work? Does the API do it? We have our own proprietary authorization system.
It's basically token-based, and then we have SQL Server, so the SQL side is backed by Microsoft; we get the authentication there. And then we've got the Mongo database, which is backed by certificates, so it's certificate-based authorization. Okay. And my second question is: when a researcher wants data,
I guess it goes to the microservice and does some kind of scan operation. Is all your data going through HTTP, and if it is, how is it so fast? Because millions of events can be sent.
So a researcher would actually go directly to the store: they make a direct connection to Mongo. They wouldn't necessarily have to be mediated. They can be, and we're actually considering building high-performance services in the middle, gRPC-based or something, but right now the library that we expose to researchers, which sits on top of Mongo, makes a direct connection to Mongo,
and that's why it's so fast. So how does the authentication work? By the certificates. By the certificates in MongoDB? Yes. Okay, thank you.
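For reference, certificate-based (X.509) authentication to MongoDB from Python generally looks like the sketch below. This is a generic pymongo example, not Winton's setup; the hostname and certificate paths are placeholders.

```python
from pymongo import MongoClient

# Generic pymongo sketch of certificate-based (X.509) authentication -- hostname
# and certificate paths are placeholders, not Winton's configuration.
client = MongoClient(
    "mongodb://ts-store.example.internal:27017/",
    tls=True,
    tlsCertificateKeyFile="/etc/ssl/research-client.pem",  # client certificate + key
    tlsCAFile="/etc/ssl/internal-ca.pem",                   # CA that signed the server cert
    authMechanism="MONGODB-X509",
)
db = client["timeseries"]
print(db.list_collection_names())
```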
More questions? Just a fairly, barely semi-related question, but have cryptocurrencies or anything like that crossed your radar yet? Crossed the radar and then left it, I guess. It's not something we do yet. Thank you.
More questions? We still have some time. I have a question, then, if nobody else does: why is there a Kafka thing and a protobuf? What's the use of this one? No other reasons than just this one?
Kafka is the message bus there; you see the message bus at the top there. We use Kafka to back all our events. We have pub-sub style events, which means any consumer can connect to any event that happens, so we needed a pub-sub message bus, essentially. We chose Kafka, and we put protobufs on the wire
because they're strongly typed and actually fairly compact. So if we need to send lots of data over it, then Kafka plus protobuf is actually a really good combination. Essentially, Kafka sits right at the core of Winton: all the events go through Kafka and everybody can chip in. Can you give an example of such an event?
So is it a trade, or what is it? It's at different scales. One event that can be announced is where the Bloomberg service says: I'm done, I've actually downloaded the Bloomberg data, go find it in the store. But at a lower level, we actually do send every single piece of information across Kafka as well.
So that's before this: all the data that we ingest, that we download, is streamed over Kafka, and then, depending on who's interested, it can be stored in Mongo, it can be stored somewhere else, it can be transformed, we can run tests on it. So all the data goes over the bus as an event itself as well, yeah.
Thank you again. I just wanted to know, why aren't you using any event-driven infrastructure such as Apache Storm or something like that?
It looks like it would be a perfect solution. It's possible, yeah. We are actually investigating things like Storm, Spark, Flink, all of them. Do you have something to say about that? We're a company that's 20 years old, so there's a lot of technology that comes on the radar that of course you would immediately like to have,
but you can't, because it takes time to migrate and you need the business case to migrate as well. Something being new and sexy is not a business case. And of course, having 35 billion under management also means there's a lot of risk; making a small mistake on such an investment just so that you can get sexy new technologies is, again, not something that's very easy to justify.
So we do like adopting new technology, but we have to be cautious at the same time. Hi. So I'm interested in how you are testing systems like this, because I could argue that there are a thousand things that can go wrong.
Yeah. It's a distributed system. It's a real-time distributed system. So what's your approach to testing? Well, as I said initially, we gain so much by the immutability of data; that's definitely one thing. If you know that your data is not going to change, then you don't have race conditions about whether this might need to write before that reads.
So immutability is definitely one of the core principles. And then everything is strictly event-driven, strictly a DAG, which means that because everything is defined by the events, you can write extremely good tests. The whole history can be reconstructed from the events of Winton. Furthermore, all the simulations are run every day for the entire history.
So, for example, for our simulation today we'll compare against the simulation up to the previous point, let's say yesterday, and we'll make sure that every single data point in the entire history of the simulation is the same. The incremental daily step is also usually human-verified; there's still some human interaction, not because it's needed, but because it's a sign-off process, so there's a checkpointing process with a human.
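That daily check amounts to a full-history regression test. A minimal sketch of the comparison (illustrative only, not Winton's code) could look like this:

```python
import pandas as pd

# Illustrative sketch of the daily regression check described above: re-run the
# whole history, require every previously produced point to be unchanged, and hand
# back only the genuinely new points for sign-off.
def new_points_if_history_unchanged(previous_run: pd.Series,
                                    todays_run: pd.Series,
                                    tolerance: float = 1e-12) -> pd.Series:
    overlap = previous_run.index.intersection(todays_run.index)
    diffs = (todays_run.loc[overlap] - previous_run.loc[overlap]).abs()
    changed = diffs[diffs > tolerance]
    if not changed.empty:
        raise RuntimeError(f"History changed at {list(changed.index)}")
    return todays_run.loc[todays_run.index.difference(previous_run.index)]
```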
More questions? Yes. Just say if you're tired. No, I can go on forever. Thank you. There was a slide with C# usage and C++
and Python all together, the main one, and the other one probably with the service usage as well. Anyway, the question is, yeah, exactly, how the communication is done between the components, between the different languages and so on. Sorry, can you repeat that? So once again, the question is
how the communication is done between the libraries in the different languages. The communication between the libraries, or in the company? Between the libraries in C++, Python and C. I don't know the details for all of them; I do know that in Python we use the C foreign function interface.
Essentially, what we always aim for is a fairly simple, C89 I think, interface to the C library, which encompasses all the logic, and then the other libraries are built on top of that. They tend to be fairly high-level and just do the mediation, the marshalling of data.
So all the logic tends to be in the lowest layer, and then all the other layers just represent the data in something that is useful for the language itself. Does that answer it, roughly? Okay. No more. Oh, there is still one question.
We are also at the booth, by the way, if you have more questions afterwards. Yeah, maybe this is then the last question. Hello. I would like to ask if you save some precomputed data over the history, save some signals.
Do you take the source data and compute something from it over all of history? Or do you re-compute everything every time, every day? Yeah, so that's what Iztok alluded to. In order to make sure, and this kind of fits in with your question, in order to make sure that nothing has actually gone wrong in the meantime, that no kind of bug has been introduced,
we rerun everything from the beginning of history, pretty much, up to yesterday. We check that everything is exactly the same, and only then do we allow the newly generated points to go through. It gives us an enormous amount of certainty that nothing has gone wrong. So don't you then have problems with the immutable data, that it can change because you improve your algorithm
or find some error in the algorithm? It can happen, and then we re-baseline, essentially. So if we do introduce a change, it has to be in a controlled fashion. The only thing we want to avoid is uncontrolled change. But of course, if there's an improvement, then we will re-baseline the system. Thank you.
Okay, so, a very nice talk, I think. Very interesting. Thank you.