
Formal Metadata

Title: Datasets
Number of Parts: 11
License: CC Attribution 3.0 Germany. You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Earlier I told you that the data has already been prepared for you to train both supervised and reinforcement learning models.
Now let's review how exactly the data was prepared. At the bottom, you can see each of the 10 recorded days. It's easy to see that Saturdays and Sundays were already dropped: two days were dropped here and two days were dropped here. So this one is most obviously a Monday, and consequently we can work out which day each of the others is.
It's Monday, Tuesday, Wednesday, Thursday, Friday, and every one of these days looks like this: it has nine folds, nine consecutive folds, structured in a walk-forward manner. This means that your training set grows iteratively, absorbing every test subset.
So you train the model here, you predict for some short period into the future, then you add that additional period to the training set, retrain, and perform inference on the next short period, and so on.
Then you calculate your metrics based on the average performance of your model across all of the test subsets.
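As a minimal sketch of this walk-forward evaluation (the fold boundaries and the placeholder model here are illustrative assumptions, not the actual course setup):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def walk_forward_score(X, y, n_folds=9, model_factory=lambda: RandomForestClassifier(n_estimators=100)):
    """Split one day into consecutive folds; train on everything seen so far, test on the next fold."""
    fold_edges = np.linspace(0, len(X), n_folds + 1, dtype=int)
    scores = []
    for k in range(1, n_folds):
        train_end, test_end = fold_edges[k], fold_edges[k + 1]
        model = model_factory()
        model.fit(X[:train_end], y[:train_end])        # training set grows with every step
        preds = model.predict(X[train_end:test_end])   # predict only the next, unseen fold
        scores.append(accuracy_score(y[train_end:test_end], preds))
    return np.mean(scores)                             # report the average over all test folds
```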
In total there were 400,000 samples and 144 features. The feature sets were the following. First, there is the raw 10-level limit order book data with price and volume for the bid and ask sides; it looks like this. Similarly, there are features describing the state of the limit order book and exploiting past information, like calculated total volume, level volume, average price, and others.
Here are the features. The basic limit order book state is the 10 levels of limit order book data: these 10 levels are simply ask price, bid price, ask volume and bid volume, the same as you see on this image. The other features are calculated from this.
There are time-insensitive features like the spread and the mid-price: the spread is the best ask minus the best bid, and the mid-price is the best ask plus the best bid divided by two. There are price differences between the N-th level and the best ask, and the same for the best bid. For each level, we compute and normalize the price and the volume for asks and bids, that is, how many shares there were on that level at that time. Accumulated differences track how each level grows with time. These features are time-insensitive: they are independent of the timestamp.
Time-sensitive features are the differences between the previous ask and the current ask divided by the time that passed between those events, the same for bids, and the same for volumes. Intensity is how fast the price changes.
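A minimal sketch of how such features could be computed from LOB snapshots (the argument layout and feature names are assumptions for illustration, not the exact dataset schema):

```python
import numpy as np

def lob_features(ask_price, ask_vol, bid_price, bid_vol, prev_ask_price=None, dt=None):
    """ask_price, ask_vol, bid_price, bid_vol: arrays of length 10, one value per LOB level."""
    feats = {
        "spread":    ask_price[0] - bid_price[0],          # best ask minus best bid
        "mid_price": (ask_price[0] + bid_price[0]) / 2.0,  # (best ask + best bid) / 2
        "total_ask_volume": np.sum(ask_vol),
        "total_bid_volume": np.sum(bid_vol),
    }
    # Time-insensitive: price differences between the N-th level and the best quote
    for n in range(1, len(ask_price)):
        feats[f"ask_diff_{n}"] = ask_price[n] - ask_price[0]
        feats[f"bid_diff_{n}"] = bid_price[0] - bid_price[n]
    # Time-sensitive example: price derivative between two consecutive events
    if prev_ask_price is not None and dt:
        feats["ask_derivative_1"] = (ask_price[0] - prev_ask_price[0]) / dt
    return feats
```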
These are just indicators, zeros and ones, true or false: is the value at this timestamp larger than at the previous timestamp? All of that together creates our set of 144 available features. First do a feature importance study, then decide which features to take.
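One possible way to run such a feature importance study, sketched with scikit-learn (the function and its parameters are placeholders, not part of the provided materials):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rank_features(X: pd.DataFrame, y: np.ndarray, keep: int = 40) -> list:
    """Fit a random forest on the 144 features and return the `keep` most important feature names."""
    forest = RandomForestClassifier(n_estimators=200, random_state=0)
    forest.fit(X, y)
    importances = pd.Series(forest.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False).head(keep).index.tolist()
```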
The target concept is important for supervised learning. For RL and DRL there is no such concept as a target; there is a reward function instead, and the agent tries to maximize it. There are five targets presented in the data. Each target is the mid-price at some particular future point in time. Take a look at this image. Imagine we are here.
This is what our model has. What we want to guess is the future state of the market, so we can predict the future mid-price. Knowing the future mid-price, we will know what to do with our current shares. If we predict that the mid-price will rise relative to the current one, then we buy here to sell later. If, on the contrary, the predicted mid-price is expected to drop in the future, we sell here, go short, and buy back later, earning on the difference while we wait for the mid-price to change, so that we can earn at least something. The situation is complicated by the exchange fee, or transaction fee: every time we buy or sell, we pay some amount of money to the exchange. So to take any action, the predicted mid-price has to differ from the current one by at least the transaction fee plus whatever we want to earn.
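Written out as a rough condition for going long (treating the fee as a fixed amount per trade is a simplifying assumption; real fee structures are often proportional):

```latex
\hat{m}_{t+h} - m_t > \text{fee} + \text{margin}
```

where \hat{m}_{t+h} is the predicted mid-price at horizon h, m_t is the current mid-price, and margin is whatever we want to earn; the mirrored condition m_t - \hat{m}_{t+h} > fee + margin would trigger a short.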
Those targets in the data set are future mid-prices at the current timestamp plus some horizon. Since events arrive irregularly, we will most likely fall somewhere between two snapshots of the exchange, and we will see the exchange in the state of its latest event. To summarize, there are five targets in the data set for supervised learning. Each target corresponds to a different time horizon: there are closer future mid-prices and more distant future mid-prices.
It means target number one is the mid-price one millisecond (or one event) ahead of the current timestamp, target number two is two ahead, and then three, five and ten events or milliseconds ahead. What is the best horizon? I don't know. The horizon of the first future mid-price you are going to predict must be larger than the delay time: the total delay of your model inference, network delivery to the exchange, and the internal exchange delays for placing the order and performing the transaction.
Say that delay is 100 milliseconds, for example. You take this as the basis, and all the derived horizons must be equal to or larger than it, simply because you cannot act on events closer in the future. The labels which you can find in the data set represent the percentage change of the mid-price: how much the mid-price changed compared to the previous state of the LOB, 1%, 10%, 100%. Here m_j is the future mid-price at the next one, two, three, five or ten events, and m_i is the current mid-price.
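As a formula, the relative change used for labelling can be written as follows (this is reconstructed from the description above; the dataset's own paper may use slightly different notation):

```latex
\ell_i = \frac{m_j - m_i}{m_i}, \qquad j \in \{i+1,\, i+2,\, i+3,\, i+5,\, i+10\}
```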
The extracted label, the percentage change of the mid-price, is a continuous value. It was recalculated into a categorical value. A threshold of 0.002 was selected for the percentage change. Every change that is equal to or greater than this threshold gets label one.
If the percentage change falls in the range from minus 0.002 to plus 0.002, the label is two; it means flat, the mid-price is not expected to change.
If the percentage change is smaller than or equal to minus 0.002, this is the other side of the price movement, label three: the price is expected to go down. Essentially, the authors generated five horizons in the future, from the less distant to the more distant, and at each of the horizons they generated just the categorical expected direction of the price movement: will the price stay the same (label two), go up (label one), or go down (label three)?
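A minimal sketch of that thresholding, assuming the continuous percentage change is already computed (label encoding 1 = up, 2 = flat, 3 = down, as described above):

```python
import numpy as np

def categorize(pct_change: np.ndarray, threshold: float = 0.002) -> np.ndarray:
    """Map continuous mid-price percentage changes to categorical direction labels."""
    labels = np.full(pct_change.shape, 2)       # 2 = flat by default
    labels[pct_change >= threshold] = 1         # 1 = price expected to go up
    labels[pct_change <= -threshold] = 3        # 3 = price expected to go down
    return labels
```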
I have described the data: what kind of features we have there and what kind of targets we have there. Now you are pretty capable of implementing a hard-ruled trader and a simple random forest supervised trader. For the hard-ruled trader: buy one share at 10:30 on each day, sell this share at 6 p.m. the same day; the next day, do the same; then calculate your net worth.
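A minimal sketch of that hard-ruled baseline, assuming a DataFrame of quotes indexed by timestamp (the column name `mid_price` and the fee handling are assumptions for illustration):

```python
import pandas as pd

def hard_rule_net_worth(quotes: pd.DataFrame, fee: float = 0.0) -> float:
    """Buy one share at 10:30 and sell it at 18:00 on every trading day; return the total profit."""
    profit = 0.0
    for _, day in quotes.groupby(quotes.index.date):
        buy  = day.between_time("10:30", "10:31")["mid_price"].iloc[0]   # first quote after 10:30
        sell = day.between_time("17:59", "18:00")["mid_price"].iloc[-1]  # last quote before 18:00
        profit += sell - buy - 2 * fee                                   # pay the fee on both legs
    return profit
```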
For the random forest, I will show you the article and the starter code. Here is the article; I am going over it right now. It uses LOBSTER data for Tesla, and the depth of the order book is 10, the same as ours. It calculates the direction of change of the price, the mid-price, the spread, and other things. You will have different features. He was predicting the mid-price, and he used two different time horizons, at 10 seconds and at 20 seconds. He also used a walk-forward methodology, exactly the same way your data has been split.
So just use other features. Your goal will be to adapt his random forest to your data set. That means: upload the data set, read it, parse it, prepare it. The code is super simple: a scaler, a random forest classifier, reading the files, collecting the dates, limiting the options.
Here he limits the time range and sets the name of the ticker, and here he runs the model. There is a look-back period, used for label generation. The code collects the times, the features and the labels; you don't have to define and generate them, you just read them from the data.
Then he fits the random forest: here you put your X's, here you put your y's. You will have to do this for five targets for every split. Then he gathers the results, calculates the hit ratio, and saves the result to a CSV file.
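Putting the adaptation together, a hedged sketch of what that adapted pipeline could look like for this data set (the file name, column names and number of folds are assumptions, not the actual starter code):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

def run_experiment(csv_path: str = "lob_dataset.csv", n_folds: int = 9) -> pd.DataFrame:
    data = pd.read_csv(csv_path)
    feature_cols = [c for c in data.columns if c.startswith("feature_")]   # the 144 features
    target_cols = [f"target_{h}" for h in (1, 2, 3, 5, 10)]                # the five horizons
    edges = np.linspace(0, len(data), n_folds + 1, dtype=int)

    rows = []
    for target in target_cols:
        for k in range(1, n_folds):                  # walk-forward: the training set keeps growing
            train, test = data.iloc[:edges[k]], data.iloc[edges[k]:edges[k + 1]]
            scaler = StandardScaler().fit(train[feature_cols])
            model = RandomForestClassifier(n_estimators=100, random_state=0)
            model.fit(scaler.transform(train[feature_cols]), train[target])
            preds = model.predict(scaler.transform(test[feature_cols]))
            hit_ratio = (preds == test[target].to_numpy()).mean()          # share of correct directions
            rows.append({"target": target, "fold": k, "hit_ratio": hit_ratio})

    results = pd.DataFrame(rows)
    results.to_csv("hit_ratios.csv", index=False)    # gather the results and save them to a CSV file
    return results
```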
Thank you.