Chronos: Time series forecasting in the age of pretrained models
Formal Metadata
Title: Chronos: Time series forecasting in the age of pretrained models
Number of Parts: 18
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/69365 (DOI)
AutoML School 2024, Part 4 / 18
Transcript: English(auto-generated)
00:05
It's always a pleasure to visit these AutoML events, the AutoML summer school, the AutoML conference, and I'm excited to share our recent work on time series forecasting with you today. So today we'll talk about a few things. We'll start by just giving an overview of what forecasting is, what traditional forecasting models do, and how they work.
00:25
We will then move on to our work on Chronos and pre-trained models for time series forecasting, which is a new class of models that recently started appearing and has shown some pretty impressive results. Finally, we'll talk about some future directions of our work in the world of time series forecasting, and we will see how AutoML can play a very important role
00:46
moving forward to getting better time series forecasting methods. So first, let's talk about some basics and cover some background material. In general, when I talk about time series, it just means that we have something that we measure over time, usually at regular intervals.
01:04
Such data comes up in a lot of different applications in many different industries. So, for example, in energy you can think about the energy consumption of different people, and at different hours of the day we have different energy consumption, and it's important for the energy company to know how much energy will be consumed at different times to make sure that they can satisfy the demand and
01:23
everyone can do what they need to do with the energy. Also comes up very frequently, for example, in retail where you want to maybe sell different products and you want to make sure that we have enough items in stock. Also other cases like healthcare, traffic, weather, all these different domains have these cases where we measure something over time and we care about how things are changing over time and what happens in the future.
01:46
When it comes to different machine learning tasks for time series, there are also lots of different cases. So, you know, maybe in general when we think about machine learning, time series is more like a footnote where it's like, okay, we mostly worked with tabular data, but there's also time series, but it's not that different from tabular data,
02:01
so don't worry about it. But in reality, there's a lot of stuff related to time series and also lots of different machine learning tasks that people care about. The first one that will be the focus of this talk is forecasting, where we want to answer the question of what will happen in the future given the past. But there are also many different ones, for example, anomaly detection, where we want to find something that goes wrong,
02:21
over time, classification, where you want to look at an entire time series and decide which class it belongs to, maybe the heartbeat time series of a healthy patient versus a sick one, clustering, where you want to find groups of related time series, and also imputation, where we have not observed the data for some time
02:41
and you want to figure out what happened in this time interval. But today we will mostly talk about forecasting and about machine learning models for forecasting. In general, the main question that we want to answer in time series forecasting is what will happen in the future given the past. Back to this example that we considered before about demand forecasting: maybe you have a company
03:01
that sells some products, you want to know how many items of each product will be bought in the future to make sure that you have enough items in your inventory to satisfy this demand. This also applies to all the other cases, maybe you want to predict the stock price, maybe you want to predict the energy consumption, maybe you want to predict the weather, the temperature, the amount of precipitation for the future.
03:20
All of these different problems can be formulated as a time series forecasting problem. And the way we usually approach them in practice is by, like, you know, using the language of probabilistic machine learning, we often talk about probabilistic forecasts. It means that our prediction tells us something about the distribution of the future time series values given the historic ones.
03:42
And so in the most general case, people reason about the entire conditional distribution of the future values. But in many practical cases, we don't really need to go all the way to these complex, you know, high dimensional distribution objects. Often it's enough to just look at a point forecast. And the point forecast just means that we predict a single time series value for each future time step.
04:02
So in case of sales, maybe you will say, I expect to sell 200 items tomorrow, 100 items the day after tomorrow and so on. And this will be a so-called point forecast, which, for example, can be the mean of this conditional distribution of the future values. Another way to talk about these forecasts, if you want to talk about the probabilities and the range of possible outcomes,
04:24
is to consider quantile forecasts, which kind of tell us about the range of possible outcomes. So instead of just, you know, predicting a single value in our forecast, we might predict different quantiles separately. So, for example, in demand forecasting, we can say, we predict that there is a 90
04:41
% chance or 80% chance that the sales tomorrow will be between 190 and 220 items. And this allows us to reason about the range of possible outcomes. This is also very important in these different planning problems to kind of look at this uncertainty in predictions and use that to make the decisions that respect all the different possibilities in the future.
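To make the point-forecast versus quantile-forecast distinction concrete, here is a small illustrative sketch (the sample numbers are made up; only numpy is assumed):

```python
import numpy as np

# Hypothetical sample scenarios for tomorrow's sales, e.g. drawn from a
# probabilistic model's predictive distribution.
sales_scenarios = np.array([195, 210, 188, 202, 219, 205, 191, 214, 198, 207])

point_forecast = sales_scenarios.mean()            # single-value forecast
lo, hi = np.quantile(sales_scenarios, [0.1, 0.9])  # 80% central interval

print(f"point forecast: {point_forecast:.0f} items")
print(f"80% interval: [{lo:.0f}, {hi:.0f}] items")
```

The interval between the 10% and 90% quantiles is exactly the kind of "sales will be between 190 and 220 items with 80% chance" statement mentioned above.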
05:03
So now, you know, after we have a look at this general problem formulation for forecasting, when talking about the conditional distributions, let's see how this is often done in practice. In general, there exist two main classes of models for time series forecasting. The first one are so-called local models.
05:21
And the very famous examples are statistical models, also called econometric models, such as ETS, ARIMA or Theta. And these models essentially work in the following way. You fit a single model for each time series in your data set. So maybe if you want to forecast sales for 100 products, you would need to fit 100 models, each model for a single time series in your data set.
05:45
And this model usually captures some very simple patterns in the data. One such pattern would be seasonality, which means that time series have this, you know, this tendency to, you know, repeat themselves over time. So, for example, the sales next Tuesday are probably similar to the sales from previous Tuesday, but they might be different from the sales on Fridays.
06:02
So which is why you always have this like sine wave-like behavior in time series that is called seasonality. You can also have other things such as trends, where things slowly change over time. Maybe your company is becoming more successful, you sell more products over time, so you have a slow trend that goes up, that keeps going up. And some other things like autoregression, where, you know, the value tomorrow is similar
06:21
to the value yesterday, or the average of the last couple of values. Lots of different patterns, but they're all quite simple, and usually we just directly specify them by hand in these models. So, for example, in ETS we have trend and seasonality directly put in there: we have some parameter corresponding to seasonality, and another parameter corresponding to the trend.
06:43
We fit a few of these learnable parameters from a single time series, and we get our forecast for the forecast horizon in the future. And these models, even though they're very simple, they have been used for a very long time, they're still very strong baselines in practice. So, you know, if you talk about lots of practitioners in the industry, they're still using these models in practical forecasting tasks.
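As an illustration of this per-series "local" workflow, here is a toy sketch using a seasonal-naive forecaster; it is deliberately simpler than ETS/ARIMA/Theta, but the fit-one-model-per-series pattern is the same (all names and data are hypothetical):

```python
import numpy as np

def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast by repeating the last observed seasonal cycle.
    A toy stand-in for local models such as ETS/ARIMA/Theta."""
    last_cycle = np.asarray(history)[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(last_cycle, reps)[:horizon]

# Local workflow: one model (here, one call) per time series.
weekly_sales = {
    "product_a": [12, 15, 14, 30, 45, 50, 20] * 4,  # weekly seasonality
    "product_b": [5, 5, 6, 9, 12, 14, 7] * 4,
}
forecasts = {
    name: seasonal_naive_forecast(series, season_length=7, horizon=7)
    for name, series in weekly_sales.items()
}
print(forecasts["product_a"])  # repeats the last week's pattern
```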
07:05
These models are also sometimes interpretable, but they have some limitations. The first one is, of course, that they have limited flexibility, which means that, you know, if there are some very complex patterns in the data, maybe you have some additional features, maybe you have some complex dependencies. Say, if a user bought a product two weeks ago, they probably won't buy it again soon, but after some time
07:24
it breaks and you have to buy some new ones, and this kind of gets, you know, combined and compounded in some complicated ways. These models won't be able to capture that. This is one big problem. And the other important practical limitation is that these models have a pretty slow inference time, because like I said, every time you want to do a forecast, you have to fit a new model.
07:41
So if you know you have a new product that you want to forecast for, you have to fit a new model, which takes some time. And this is, in general, pretty slow, which is one problem with these models. These two limitations led to a different class of approaches that became much more popular in the last maybe 10, 15 years in the machine learning community, usually called global models. And the main idea here is that given this collection of multiple related time series, so again, maybe sales
08:05
for different products in your shop, you would fit a single model, usually based on some machine learning algorithm. So either some deep neural network, or maybe CatBoost, LightGBM, all of these, you know, famous ones, but it would anyway fit a single model on all the time series.
08:20
And the idea is that this model, you know, has a single set of parameters that are found in such a way that the model on average gives accurate predictions for all the time series in your training set. Some examples of such models are DeepAR, TFT, and PatchTST. Usually these models are based on some sequence neural network models, for example, LSTM or transformers, also convolutional neural
08:43
networks, and they're trained, like I said, on all the available time series, and then at prediction time, they take the new time series one by one and generate the predictions with just a single forward pass. So these models are now much more flexible, they can capture much more complex patterns than the simple local models.
09:00
And also the inference for them is much faster, because we don't have to retrain the model each time we want to make a prediction. You know, once the model is trained, you can just use it to predict many time series with just one forward pass, which is much faster than having this local model. But of course, this also comes with new limitations. First of all, we have to train this model, which takes time. And you also need to have quite some data to train these models.
09:22
If you only have a single time series, you probably won't be able to fit a complex neural network model on just a single time series to generate good predictions. So this really only makes sense if you have, you know, a large catalog of products that you want to forecast. So that's why these models, even though they're quite powerful, they're not applied in all the domains.
09:43
You know, they only make sense if you have lots of data and you don't mind waiting for a long time to train these models. And, you know, kind of looking at these limitations of the two main classes of models, local models and global models, we can maybe ask the question, can we get the best of both worlds? Is it possible to have some model that we don't have to train for every new data set, for every new time series?
10:05
Wouldn't it be nice if we could just take an off-the-shelf model, give it a new time series and immediately get the predictions? That would actually make sense, you know: can we get an accurate forecast in a zero-shot way, without doing any dataset-specific training? This is, of course, inspired by the huge success in NLP recently, where we have these
10:24
foundation models, pre-trained models, that are trained on a huge amount of data. And they can make good predictions for new tasks, for new things that you give to these models. And the question is, can we replicate the success of these large language models in the world of time series forecasting?
10:43
You know, we are not the first people to ask this question, and certainly not the last; a lot of people have asked it. And the first thing a lot of people tried was to actually go to the LLMs from the NLP world and see if we can somehow turn them into time series forecasters. And maybe, you know, the implicit assumption is that these models are so smart, they know everything in the world.
11:02
Maybe if we ask them nicely enough, they will tell us the future. So this is what people actually tried, and you can ask this question in different ways. You can just try to convert your numerical time series values into text, as you can see here for LLMTime, give it as a prefix and then say, please continue this sequence. And then the model will maybe output the future forecast.
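A minimal sketch of this serialize-to-text idea; the exact formatting LLMTime uses depends on the tokenizer, so this simplified comma-separated encoding and the `serialize_series` helper are purely illustrative:

```python
def serialize_series(values, decimals=1):
    """Turn numeric time series values into a plain-text prompt,
    in the spirit of LLMTime-style prompting (simplified sketch)."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)

# Serialize the history and ask the LLM to continue the string.
history = [23.4, 24.1, 25.0, 24.7, 23.9]
prompt = serialize_series(history) + ", "
print(prompt)  # "23.4, 24.1, 25.0, 24.7, 23.9, "
```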
11:22
You can also maybe add some more context around it and say, hey, I'm trying to forecast the temperature in this location, and the last week's temperature was like that, what will the temperature be tomorrow? And then the answer is, I don't know, 78 degrees Fahrenheit, I assume. So this is another way, which also doesn't change anything about the LLM, it just plays with how you represent the data.
11:44
But the models have been trained on text before. And then, going more in the fine-tuning direction and somehow adjusting the architecture of the models, people have tried to add some special tokens representing numerical values, maybe doing some kind of fine-tuning, maybe parameter-efficient fine-tuning, where you, you know, maybe tune some input and output layers.
12:02
But still start with a language model and just make some small modifications to turn it into a forecaster. So, you know, people have tried lots of things, but it wasn't super successful. And, you know, for a few reasons. First of all, these models were huge because all these LLMs, they have billions of parameters. Maybe tens of billions, hundreds of billions of parameters.
12:22
And they're massive. So even if you just want to run inference once and generate a single sequence, you often need multiple state-of-the-art GPUs to even be able to do that, which is expensive and slow. And also, at some point you have to do some kind of special fine-tuning for your data set.
12:40
And, you know, you have to train this. So for these fine-tuning models on the right-hand side here: once you have to do this fine-tuning, tweak the architecture, use all these tricks, you kind of lose the zero-shot capability of these models. Because the whole idea was to have a single model that I don't have to train, just give me the predictions. But if I now have to train it again, it's just a different training procedure.
13:01
And if these predictions are not as good as what a traditional deep learning model like DeepAR would give me, then what's even the point of this entire thing? So, looking at these different limitations and problems with these models, we decided to try a different approach and also use large language model architectures, but treat them completely as time series forecasting models.
13:22
So on a high level, if you think about it, language prediction and time series forecasting are somewhat similar problems. So this is just the overview of the model architecture. We'll get back to this later and look at it in details. Just, you know, the high level idea here is that we can turn LLMs into forecasters by just working with time series right away.
13:43
So the first observation that we have to make is that language modeling and time series forecasting are somewhat similar problems on a conceptual level. In both cases we have some sequence and we want to ask the question: how does this sequence continue? In case of language we have a sequence of tokens or words, and you want to figure out what the future tokens are.
14:02
In case of time series forecasting we have a sequence of real-valued numbers, and we want to generate its continuation into the future. So it seems very similar on the surface, but of course there are some important differences, and it's not immediately obvious how to work with that. For text, we just turn the text that we have into discrete tokens.
14:25
We have some fixed vocabulary. We fit some tokenizer on the training corpus, and then each group of characters corresponds to one token, which is essentially one number between one and the vocabulary size, and that corresponds to a part of the sentence. And we feed these discrete tokens into the LLM architecture.
14:43
And this spits out the continuations into the future. And this way we can generate the text. For time series forecasting it's not immediately obvious how to achieve the same. Because time series values are continuous. And all these tokenizers that we have for text don't just work with continuous time series values. So of course, you know, how can we deal with that?
15:04
What are the options? Well, it turns out we can take a very simple approach, where we first scale the time series and then just discretize it by putting it into buckets. Which is actually quite similar to what was mentioned earlier today in the TabPFN presentation, where essentially, you know, we scale, discretize and turn it into buckets.
15:23
And specifically, this consists of two steps. First, for scaling, it's necessary to put all the time series into a similar range of values. Because if you think about it, maybe one time series is the temperature, which has numbers like 10, 20, 30, and the other time series is world population, which has numbers in the billions.
15:40
It's very difficult for the model to kind of work on all these different scales at the same time. It's much more convenient in deep learning when we just like put everything into the same range. There are many ways to do this. One simple approach that was successful before in time series forecasting that we use in this model. Is called mean scaling. Where essentially we compute the mean absolute value of the time series over the history.
16:02
And then just divide every time series value by this. And this kind of puts it into a reasonable range of numbers. After we apply the scaling, we now just do quantization, or bucketing, or discretization, whichever way you want to call it. Which essentially just means we look at a large enough interval, for example from minus 15 to 15.
16:22
We divide it into evenly sized buckets. So here we have 4,000 buckets in total covering this interval from minus 15 to 15. And we just see into which bucket each value of the time series falls, and we convert the continuous value of the time series that we're discretizing into the ID of the bucket. So, you know, I'm not sure if you can see this.
16:42
Yeah, so if this time series value here corresponds to bucket number 2400, it just means we replace this continuous value by the integer 2400. We do the same thing to all the time series values, and we now have a sequence of integers which encodes our time series.
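The scaling and quantization steps just described can be sketched roughly as follows; the bucket count and range follow the numbers from the talk, but the details of the actual Chronos tokenizer may differ:

```python
import numpy as np

# Illustrative constants from the talk: ~4,000 buckets on [-15, 15].
N_BUCKETS = 4000
LOW, HIGH = -15.0, 15.0

def tokenize(series):
    """Mean-scale a series, then map each value to the id of its bucket."""
    scale = np.abs(series).mean()     # mean scaling over the history
    scaled = series / scale
    # evenly sized buckets covering [LOW, HIGH]
    edges = np.linspace(LOW, HIGH, N_BUCKETS + 1)
    token_ids = np.clip(np.digitize(scaled, edges) - 1, 0, N_BUCKETS - 1)
    return token_ids, scale

series = np.array([120.0, 135.0, 150.0, 142.0, 160.0])
tokens, scale = tokenize(series)
print(tokens)  # integers in [0, 3999] encoding the series
```

The returned `scale` is kept around so that generated tokens can later be mapped back to the original units.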
17:01
And that's it essentially. End of presentation. Almost: once we turn it into a sequence of discrete tokens, we can just use the existing LLM architectures, or anything that can generate text, to also turn it into a time series forecaster. What happens is, we take these context tokens that we obtained after scaling and quantization.
17:21
We feed them into the language model, and this language model gives us the categorical distribution over the next token in the sequence. And each token essentially corresponds to some continuous value among these buckets, maybe the center of the bucket specifically. And we have this distribution over here, so we know what the true value of the token is at training time, maybe 2350.
17:44
We look at the log probability assigned by the model to this token, and the negative log likelihood is our loss function, the training objective: cross entropy, exactly as in language models. And so essentially the conceptual picture here is that we have our time series data set of continuous sequences.
18:01
We turn them into tokens using scaling and quantization, and then we just take this data set and train on it as if it was a regular language model. So everything is the same, you know, we can use the same libraries, like Hugging Face Transformers, Hugging Face Accelerate, whatever your favorite ones are, maybe some different ones, for training these models. And after training, this language model can now generate tokens.
18:24
These actually correspond to time series values. So at prediction time we can just take the history that we have observed, the blue tokens here on the left, feed them into the language model and generate future continuations. So essentially each sample from the model is one possible time series trajectory.
18:41
The model spits out a sequence of these tokens, one for every value in the forecast horizon. We then do a lookup from the buckets that we had before, and then we scale it back up by the scale of the historic time series values. And we end up with one single trajectory of the future time series values. So it's one possible continuation of the time series.
19:02
And we don't have to do this just once, because it's not a deterministic process. You know, every time you query an LLM it will give you a different continuation. You can repeat this process maybe 100 times or 1,000 times, and this way you get many possible future trajectories. And then you can just turn these many trajectories into a probabilistic forecast.
19:21
So for example you can compute the average value for each time step. You can compute the quantiles. You can ask all the different questions that you want to ask about the future, and this will be your probabilistic forecast. So putting everything together: we don't have to modify the LLM architecture at all. All we change is how we represent the data.
19:43
By turning it from continuous values into discrete tokens, training the LLM in the usual way, we can do probabilistic forecasting this way, right out of the box. We don't have to make any special modifications to the LLM architecture: this purple box in the center is exactly the same LLM architecture that you can use for text generation.
20:02
Just now trained in a different way, completely from scratch on time series data. So not treating it as a language model, but, you know, we can think of this as having a special language, our language of time series, and we have sequences from this language that we obtained using our quantization scheme. Of course, it's not the end of the story.
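A rough sketch of the decode-and-aggregate step described above (bucket ids back to values, then statistics across sampled trajectories), with random integers standing in for actual model token samples:

```python
import numpy as np

# Same illustrative bucket setup as in the tokenization step.
N_BUCKETS, LOW, HIGH = 4000, -15.0, 15.0
centers = LOW + (np.arange(N_BUCKETS) + 0.5) * (HIGH - LOW) / N_BUCKETS

rng = np.random.default_rng(0)
scale = 141.4  # scale computed from the history during tokenization

# Stand-in for the language model: 100 sampled token trajectories,
# each 5 steps into the forecast horizon.
sampled_tokens = rng.integers(2100, 2200, size=(100, 5))

# Decode: bucket id -> bucket center -> rescale to original units.
trajectories = centers[sampled_tokens] * scale

point_forecast = trajectories.mean(axis=0)                # per-step mean
q10, q90 = np.quantile(trajectories, [0.1, 0.9], axis=0)  # quantile bands
print(point_forecast.round(1))
```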
20:22
It's also very important to have training data, because if you don't have training data you cannot train the model. And this was one of the larger challenges to overcome throughout this project. So we started by collecting a large collection of time series data sets, the open-source ones where you can train the models.
20:40
And actually, you know, with good licenses and everything, but covering a very large number of domains. So again: finance, retail, energy, transportation, some kind of mixed time series, all the different ones. We ended up with almost 1 million time series, with almost 100 billion observations.
21:03
Which is pretty large by time series standards, but it's still tiny by the standards of large language models that are trained on human text, where there are many orders of magnitude more data than what we have here. So to overcome this problem, and see if we can somehow enrich the data set, we also explored two ways to increase the diversity of the data.
21:23
And increase the effective size of the training set. The first one was an augmentation method for time series, which essentially combines different time series from the data set to give us some more complex patterns. We call it TSMixup. You know, it has some math here on the left, but it's actually a very simple procedure.
21:41
That is also very easy to understand. First you pick a number between one and three uniformly. Then you select that many time series from your data set, maybe from different data sets: maybe one time series from the energy domain, one time series from the retail domain. You then sample weights for each of these time series, in a way that the weights sum up to one.
22:00
For example from a Dirichlet distribution. And then we just take this convex combination of the time series. You can see an example here on the right. First we have two time series with different seasonal periods, then one time series with a trend. And when we sample different weights, shown by these black arrows here, we end up with new time series that again have new seasonal periods.
22:21
And also have a trend, in case the trend time series was included. Essentially, by just mixing and matching individual time series, we end up augmenting our training set and increasing the amount of training data available to us. The second part is synthetic data generation.
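Going back to TSMixup for a moment, the three steps just described fit in a few lines. A minimal sketch; the Dirichlet sampler is my assumption for "weights that sum up to one":

```python
import numpy as np

rng = np.random.default_rng(42)

def ts_mixup(dataset, length=64):
    """Convex combination of 1-3 randomly chosen series from the dataset."""
    k = rng.integers(1, 4)                                # pick k uniformly from {1, 2, 3}
    idx = rng.choice(len(dataset), size=k, replace=False) # select k distinct series
    weights = rng.dirichlet(np.ones(k))                   # non-negative, sums to one
    return sum(w * dataset[i][:length] for w, i in zip(weights, idx))
```

Because the weights are a convex combination, the mixed series always stays inside the envelope of the series it was built from.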
22:41
Where we use Gaussian processes to generate new time series. The idea is the following. We start with a kernel bank, where we have a few dozen different kernels: maybe a linear kernel, an RBF kernel, some periodic kernels, sigmoids, etc., a bunch of them. We sample different kernels from the kernel bank.
23:01
And then with a combination of these kernels we sample from a Gaussian process. This again results in new time series that have some patterns: they can be periodic, they can have trends, they can have combinations of these different properties. And this way we again get new time series that are somewhat representative of what we think real-world time series should look like.
23:23
We know that real-world time series have seasonality and trends, which is what the kernels in the kernel bank essentially incorporate. This way we are able to generate more synthetic data and effectively increase the size of our training set.
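A toy version of this kernel-bank idea can be written with plain numpy. The kernels and lengthscales below are illustrative assumptions, not the bank from the paper; the essential steps are the same: pick a random subset of kernels, combine them, and draw one sample path from the resulting Gaussian process.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(128, dtype=float)[:, None]  # time grid, shape (128, 1)

# A tiny "kernel bank": RBF for smooth variation, periodic for seasonality.
def rbf(a, b, ls=20.0):
    return np.exp(-((a - b.T) ** 2) / (2 * ls ** 2))

def periodic(a, b, period=24.0, ls=1.0):
    return np.exp(-2 * np.sin(np.pi * np.abs(a - b.T) / period) ** 2 / ls ** 2)

kernel_bank = [rbf, periodic]

# Sample a random subset of kernels, sum them, and draw one GP sample path.
chosen = rng.choice(len(kernel_bank), size=rng.integers(1, len(kernel_bank) + 1),
                    replace=False)
K = sum(kernel_bank[i](t, t) for i in chosen) + 1e-6 * np.eye(len(t))  # jitter
synthetic_series = rng.multivariate_normal(np.zeros(len(t)), K)
```

Summing kernels superimposes their patterns (e.g. seasonality on top of a smooth trend), which is what makes the generated series look plausible.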
23:45
Now, after we created all these data sets, we trained several model variants, using the T5 encoder-decoder architecture. We also had some experiments with other architectures, but this was the main one that we released in the end, based on the encoder-decoder T5 architecture.
24:00
It's essentially just a transformer that first encodes the historic values, a simple transformer with some special details about the biases, etc., but essentially just an off-the-shelf NLP architecture that we tried here. The model also comes in different sizes: we trained different ones to see if model size has an effect on the performance in the end.
24:21
But in the end it was just these five models, based on the T5 architecture, trained exactly using the strategy that I described so far. Before we move on to the evaluation and talk about how well this performed in practice, and how well this did in the benchmarks that we considered, I of course want to mention that we are neither the first nor the last to train pre-trained models for time series forecasting.
24:45
Almost right before we released ours, and right after we released ours, a few more papers came out from different groups: from Salesforce, Google, CMU, Nixtla, some other open-source packages. Many of them also ended up being open sourced after we released Chronos. In general it's a pretty active field.
25:02
Where lots of models are being released. In our paper we tried to benchmark against everything that was available at the point of submission. An important difference between our architecture and all these different ones that I mentioned here is that for Chronos we really just used out-of-the-box LLM architectures, without any changes to the architecture itself.
25:23
We only changed how we represented the data, how we trained it, and how we enriched the data set. All these other models have some time-series-specific design choices, so you can think of them more as scaling up existing deep learning architectures, like DeepAR or the Temporal Fusion Transformer, just scaled up and trained on bigger data sets.
25:41
Which is quite different from Chronos, where we really just asked the question: can we use exactly the NLP architectures for language models and use them as forecasters? In the evaluation we compared Chronos to all these different competing methods, starting with the pre-trained models that I showed on the previous slide.
26:01
LLMTime, ForecastPFN, Lag-Llama, Moirai and TimesFM, everything that is open source and easy to use. We also compared, of course, to task-specific global models, which are trained specifically for each task. This mostly includes the global deep learning models that I showed before, like PatchTST and DeepAR.
26:21
But it also has one example called GPT4TS, which is a representative of the class of models where people take an actual existing GPT-2 architecture and fine-tune it on time series forecasting tasks. They use the language model initialization, but again, it's more of a task-specific model, because you have to fine-tune it on each data set to actually get any predictions.
26:43
And of course we also look at local models: statistical approaches, simple baselines like Naive and Seasonal Naive, and also the classic forecasting models, AutoETS, AutoARIMA, AutoTheta. We ran quite a few comparisons in our paper, but I will show one here. This is quite a large figure.
27:01
So let's take a second to process what's going on here and what these plots show. First of all, we evaluate two types of forecasting performance, or forecast accuracy. Remember I mentioned before that we can have probabilistic, or quantile, forecasts, where we predict the range of possible outcomes: for example, we predict the 10th quantile, the 20th quantile, etc.
27:23
This tells us about the different ranges of outcomes, and this is what we have here on the left, measured using the quantile loss. On the right we have MASE, which is a point forecast metric, where we only consider a single prediction from each model, like the median of all these samples.
27:40
This would be one single number per step of the forecast horizon. These are the two main types of tasks that people have in time series forecasting: probabilistic forecasting on the left and point forecasting on the right. Now, the way we actually report the numbers here is that first, for each of these 29 data sets, we run the baseline model, Seasonal Naive, which is actually a pretty good baseline.
28:02
That in many cases shows relatively strong results. We compute the error that this method achieves and set it to one. So a score of one means our method achieves exactly the same score as Seasonal Naive across all data sets. A score higher than one means our method has a worse error than Seasonal Naive.
28:20
So that's bad. And a score less than one means the method is better than Seasonal Naive. Essentially, the lower the better, and one is the baseline that everyone agrees is a reasonable first thing to try on any data set. Then, looking at individual methods, we have different classes of models here. First we have the local models, shown in blue.
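Before going through the figure, the metric and the scoring scheme just described can be sketched in a few lines. A minimal illustration (not the exact evaluation code from the paper): the pinball loss for one quantile level, the Seasonal Naive baseline, and the relative score where values below one mean "better than the baseline".

```python
import numpy as np

def quantile_loss(y, y_pred, q):
    """Pinball loss for a single quantile level q (lower is better)."""
    diff = y - y_pred
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def seasonal_naive(history, horizon, season_length):
    """Repeat the last observed season into the future."""
    season = history[-season_length:]
    reps = int(np.ceil(horizon / season_length))
    return np.tile(season, reps)[:horizon]

def relative_score(model_error, baseline_error):
    """Error normalized by the Seasonal Naive error; < 1 means better than the baseline."""
    return model_error / baseline_error
```

A per-data-set relative score of 0.8, say, means the model's error is 20% lower than Seasonal Naive on that data set.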
28:42
These are the baselines and statistical models. We have task-specific models shown in pink, where we train a separate model for each data set. And then in purple and green we have the pre-trained models, which did not do any data-set-specific training or tuning: we just take the same model with the same weights, do a forward pass, get the predictions, and use that as our final forecast.
29:05
Once we understand what's going on here, we can look at the figures and actually see some patterns and understand the results. First of all, we see that the local models, shown in blue, are in general less accurate than the pre-trained models and the task-specific models.
29:21
Which kind of makes sense. They're definitely better than the baseline most of the time, you see it's usually less than one, but they're not as good as the task-specific models and the pre-trained models. Now, when we look higher up the list, we see that in general the task-specific models, like TFT and PatchTST.
29:41
They often achieve the best results. On the left, for probabilistic forecasting, TFT is really strong; on the right, for point forecasting, PatchTST and NHiTS are the best models. And we see that right after them we have Chronos (Large). This essentially means that we can take this Chronos model, trained in our way, just off the shelf, without any fine-tuning.
30:00
And just use the same model to generate predictions, and it ends up producing forecasts that are extremely close to being as good as the forecasts of the best state-of-the-art deep learning models trained on these data sets. Which was quite a surprising result. I think when we started this project we did not expect to reach this point. Maybe if we could be as good as ARIMA, that would be a huge win.
30:24
Because if you can do ARIMA without training, just with one forward pass, that's great. It turns out we can not only do as well as ARIMA, we can do much better than ARIMA, and almost as well as the best state-of-the-art models. These are all zero-shot data sets, so the Chronos models have not seen any of these data sets.
30:41
Not any time series from these data sets at training time. This is all proper zero-shot evaluation on new, unseen tasks, and it works really well. In general, the larger the model, the better its accuracy, which is, I think, a sad but not very unexpected finding that people have also seen in other cases: for NLP models, typically the larger the models, the better they perform.
31:02
In terms of the ranking against other pre-trained models, we find that Chronos usually has a consistent edge over other pre-trained models like Moirai and LLMTime. I think those are the ones actually in here.
31:20
In the paper you might be able to see even more recent plots, with some more models in the latest version. But in any case, we see that the Chronos models also have a small, consistent improvement over these other pre-trained models, which overall was a very encouraging result. Essentially, the main takeaway message here is that we can have one pre-trained model that doesn't need any data-set-specific tuning.
31:43
Which works almost as well as the best task-specific models for these data sets. Which was very nice. Also important to mention here: the Chronos models themselves are completely open source. Everything about the models you can find online and download for free, with very permissive licenses.
32:02
Both the training code, the inference code, and also the code for evaluation, so you can reproduce these benchmarks if you want. Also the model weights, the actual training data sets and the evaluation data sets. It's all available on GitHub and Hugging Face, so it's very easy to get started and see for yourself. Also, if you want to try out Chronos.
32:21
I think one of the easiest ways is to use AutoGluon, which is maybe my biased opinion, because I'm also developing the time series forecasting capability in AutoGluon. The Chronos models are in there, in all the different model sizes, and it's very easy: you can just give it a data frame with your time series, and you will get the forecasts in a single line of code using AutoGluon.
32:41
It's in there. Also a very nice recent development: the Chronos models ended up picking up adoption quite quickly. Right now we have more than 16 million downloads on Hugging Face, and this is a top-three ranked model on Hugging Face overall, across all modalities, which was quite a huge surprise for us. Maybe, I don't know, someone out there just has a server.
33:02
And they just keep hitting download, download, download. But in any case, a very nice surprise: how quickly we went from essentially no time series pre-trained models at all one year ago, to a top-three model of all models on Hugging Face being a forecasting model. So I think the field itself is growing very fast, and probably a lot more models will come out.
33:22
Because it's a relevant problem that people care about. So at this point maybe you will ask: okay, great, we've seen very nice results. Does this mean we are now done? Have we solved time series forecasting? I think no, that's not the case. What we have done is find a very powerful recipe.
33:41
Which essentially means that if we have lots of data and a large model, we can get an accurate predictor out of that. Which is not very surprising. We knew that this holds for text, we knew that this works for images, we also knew this works for tabular data. But now we know this also works for time series forecasting, which may be not very unexpected.
34:01
But very nice nonetheless. Maybe the immediate question after this is: can we just follow what people have done in other fields such as NLP, scale up the data set size, crawl everything we can crawl, train bigger models, and end up with one huge model that is the best forecaster of all time?
34:22
One that solves all the forecasting problems? Maybe not. I'm sure some people are trying to do that right now, and I think there's definitely some room for improvement over Chronos, because even now, Chronos and all these existing models are still very small; they are just getting started. Maybe there are some small architectural tweaks to do.
34:42
Maybe there's something special about the data, maybe there's something about scaling the model size that will lead to some improvements. But I don't think this will completely solve time series forecasting, where in the future there will just be a GPT-4 for time series that everybody uses. I think it's going to be very different, and I think that AutoML has a very important role to play.
35:01
In this future of time series forecasting and pre-trained models. The main reason why I'm convinced this is the case is the fact that these pre-trained models for time series forecasting are actually tiny by LLM standards. The LLMs have sizes in the gigabytes, or tens or hundreds of gigabytes.
35:20
They're huge; you need many GPUs, whole servers of GPUs, to run them. But these models you can fit on a single consumer-grade GPU, and probably in the future we'll be able to make them even smaller, so it's going to be super cheap to just run them on your laptop. Which is a huge difference from having these massive models that kind of need to have all the knowledge of all the text in the world. They need to know, I don't know, what each city means.
35:41
Or what happened in different years, in order to generate plausible text. But for time series forecasting, I feel there's a lot less knowledge to put into these models, so likely in the end they will be much smaller, and they are smaller nowadays. This of course opens a lot of really cool opportunities for us, as researchers and people who develop things around these models, just because they're so small.
36:02
So even if in the end we end up with one small pre-trained model that works very well, maybe today we have some model that is the best model ever, there are still many things we can do around this model that will lead to better forecasts. If you just take an off-the-shelf model like Chronos nowadays, of course it's very convenient: you don't have to do anything.
36:21
You just take the model, apply it, and get your forecasts. As a software engineer you're very happy, because it's much easier to integrate into your workflow. But if you care about pushing the accuracy as high as possible, there are lots of things we can do. First of all, things like pre-processing, like scaling the data; there are different ways to do this.
36:41
There are some non-linear transformations like Box-Cox transformations, filtering outliers, power transforms, lots of different things on the pre-processing side that can change the model behavior and improve the forecast accuracy in the end. There are also things about the model, or post-processing of the model, that we can do. For example, we can fine-tune the model. This tends to help most of the time, and some preliminary results that we have in the Chronos
37:04
paper also show that even a small model that is fine-tuned is consistently much better than the largest model used in a zero-shot way. So fine-tuning is important, and it has to be part of the workflow if you want to increase your accuracy. There are also lots of other techniques or tricks, like calibration and conformal prediction.
37:21
Essentially, you just take the model as a black box and look at how it behaves on new validation data. For these models, all your data is validation data, because they were trained on other data sets. Now we can use these techniques with these models and improve their performance further: they're better calibrated, they achieve better forecast accuracy, which is great. And all these different parts, this is AutoML, right?
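Two of these pieces are simple enough to sketch. Below, `boxcox` is the classic power transform mentioned among the pre-processing options, and `conformal_interval` is a simplified split-conformal-style recipe of my own illustration (not a method detailed in the talk): take residuals on held-out data, compute their quantile, and widen the point forecast by it.

```python
import numpy as np

def boxcox(x, lam):
    """Box-Cox power transform for positive data; lam = 0 reduces to the log transform."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x ** lam - 1.0) / lam

def inv_boxcox(y, lam):
    """Inverse transform, to map forecasts back to the original scale."""
    y = np.asarray(y, dtype=float)
    return np.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)

def conformal_interval(point_forecast, val_residuals, alpha=0.1):
    """Turn a point forecast into a (1 - alpha) interval using held-out residuals."""
    qhat = np.quantile(np.abs(val_residuals), 1.0 - alpha)  # calibration quantile
    return point_forecast - qhat, point_forecast + qhat
```

Both treat the forecaster as a black box, which is exactly what makes them usable on top of a frozen pre-trained model.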
37:41
Even if the model itself is pre-trained, we don't have to worry about training the model or finding its hyperparameters; we can just do things around the model, which is what we need to do, and this will be AutoML. And of course, that's only if we have a single model. I don't think there will be a single model that works best all the time; I think we have seen this over and over again.
38:02
Where for different tasks we need to use different models. And since these models are really tiny for time series forecasting, and probably the same holds for TabPFN, which I think was mentioned to be a small model compared to LLM scale, we can have many different models, and then, as an AutoML system, we can try out different models, combine them together, try out different pre-processing techniques.
38:21
Do some kind of adaptation with fine-tuning or calibration, do some kind of ensembling. And for time series there are lots of ways to do ensembling, from boosting to stacking to linear ensembles, lots of different things to try. We end up in the AutoML world, essentially, where we have a huge toolbox of different things to try and lots of pipelines to build, and we want to automate it, because as a user, people really like this.
38:44
I click one button, I get my predictions. Right now with these pre-trained models you can get this behavior: you click one button, you get the predictions. But if you really want to maximize the performance, a lot of things need to happen when you click this button: maybe you have to evaluate a few models, a few different pre-processing steps, a few of these adaptation methods.
39:02
And this is where AutoML can step in and help. In general, just because these models are really small, and having more models or ensembling them is always better than relying on a single model, I think that AutoML will have a big role to play here. Then, of course, so far we've only been talking about univariate forecasting.
39:24
Which is only a subset of real-world forecasting tasks. In univariate forecasting we just have the historic time series values: maybe I have the sales for each day over the last couple of years, and now I want to forecast the sales into the future.
39:41
This is of course fine, but in many cases it's not all the information that I have. For example, I might have promotions, there might be holidays that happen on different days each year, maybe there are some additional features, maybe the weather forecast: if I'm selling ice cream, the weather forecast probably plays a big role in the ice cream sales that I expect over the next week. So all this exogenous information, these different features.
40:03
They are very important, and they also have to be taken into account by forecasting models. With existing forecasting methods it's simple: you just train on everything. But then you have to train, which is not zero-shot. And if you want a zero-shot model that can process all of these different exogenous features.
40:20
In a zero-shot way, that's a pretty big challenge for which we don't have a good answer yet. There are some ideas out there in the literature, maybe with synthetic data generation to do some kind of in-context learning, training this way on available data sets, but it's hard; we don't have the answer for that yet, and this is an important gap that follow-up work has to address.
40:41
The main challenge is that the number of covariates and their types are not known in advance. It's not like every data set with covariates has exactly three covariates, two categorical ones and one real-valued; it's always a different number and different types, so the models have to handle that, and it's not clear how to do it. Now, besides these standard situations with covariates.
41:03
You also have multi-modal forecasting problems. Before, I think people didn't even try to touch this, because with simple models like DeepAR, if you give them some text or images, it just wouldn't work. But now that we have very large models, maybe trained on more complicated data sets.
41:21
Maybe there is a way to also take into account textual or image information about the time series that will help us make a better forecast. For example, let's say my time series is the temperature of some part of a machine at a factory, and you can say: okay, the temperature cannot go above, I don't know, 100 or 500 degrees Celsius.
41:42
Or it would just break. If you put this in as text, the model can take this information into account and adjust its forecast. Or maybe for sales you can say: these are sales of a certain product which has these properties, people tend to buy it on sunny days, or something like that. Then we are just providing this domain knowledge in the form of text.
42:00
This can also guide these forecasting models in some way to give you better forecasts. Right now people have to do feature engineering to pull this information into the models, but maybe as we move to this space of foundation models and pre-trained models, we can somehow fuse these modalities of text and time series and end up with better forecasting models.
42:21
And this is something that, once we released Chronos and talked to different people working in different domains, came up immediately. The first thing they ask is: oh, can we talk to the model? Can I just describe my business problem and tell the model what to forecast and how to adjust it? We're not there yet; it's not possible. But I don't see why we cannot get there in the future. Again, of course, the big challenge is that the data sets.
42:42
Where we have time series data combined with other modalities are very scarce, so we have to do something about that, but it's very likely that in the next couple of years we'll have some solution. And finally, among these other problem types, we also have multivariate forecasting, where we don't just want to forecast a single time series; maybe we want to forecast multiple time series at the same time.
43:03
And we also want to take into account their interactions. A typical example would be some kind of factory process, where maybe you have temperature, pressure, the speed at which something is moving, all of these things happening at the same time, and very often they influence each other, so you have to look at these interactions to make a good forecast.
43:22
Currently, with the univariate models, you can of course just look at each dimension one by one, but that's probably not the best thing you can do, and there is some room for improvement. Again, we need to find a way to handle varying numbers of dimensions and all the different interactions between them, but that's also an important research question for us to figure out.
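The "one dimension at a time" workaround just mentioned is easy to sketch: wrap any univariate forecaster and apply it per dimension. A minimal illustration; the naive forecaster below is just a placeholder for a real model, and the whole point of the open question is that this approach ignores the cross-dimension interactions.

```python
import numpy as np

def naive_forecast(series, horizon):
    """Placeholder univariate forecaster: repeat the last observed value."""
    return np.full(horizon, series[-1])

def forecast_multivariate(history, horizon, forecast_fn=naive_forecast):
    """Forecast each dimension independently; interactions between dimensions are ignored."""
    return np.stack([forecast_fn(history[:, d], horizon)
                     for d in range(history.shape[1])], axis=1)
```

`history` here is a (time, dimensions) array, e.g. temperature and pressure side by side; the output has shape (horizon, dimensions).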
43:42
And finally, last but not least, I want to mention this: I think one of the big bottlenecks right now in the world of time series, not just forecasting, anything related to time series, is the lack of high-quality data sets and benchmarks. Even for Chronos, a large chunk of the work we had to do was about collecting and curating the data.
44:01
Finding these high-quality data sources that we can actually use. And in general, it's not something that one team can do once and then we're done. I think people will have to find ways to share these data sets and benchmarks, which will also be very important for all the different modalities that we want to look at.
44:22
Not just multimodal and multivariate forecasting, but also cases like time series classification, time series clustering, or time series anomaly detection: having high-quality annotated data would be important to enable these new use cases. And of course.
44:41
It's not just about collecting a huge data set and putting it on Hugging Face; it's also about doing science about the data for time series forecasting. There are some important questions we can ask. For example, how can we quantify whether our data set is diverse enough, whether it captures the different patterns present in real-world time series? Do we know if it's a high-quality data set?
45:02
Or if it's just garbage? Maybe somebody puts up a data set with, I don't know, one billion time series, and you ask: is it a good data set? If it's always the same sine wave repeated one billion times, it's not very useful. But how can we somehow mathematically quantify the quality of these data sets and how useful they are for training these models?
45:20
This would also be important to answer. Synthetic data is a big area where you might get improvements to existing models. What we have seen for Chronos, where we also did some ablations in the paper, is that the addition of these synthetic data sets really improves the performance: when about 10% of the data is synthetic, this works better than only real-world data.
45:40
We have also seen in some proof-of-concept works that we can train fully on synthetic data and still get decent performance, better than the statistical models, not quite as good as the largest Chronos model right now, but still decent. So in general it is possible that for time series we can write high-enough-quality synthetic data generators, and this would allow us to completely sidestep.
46:01
This problem of how to collect a huge collection of data sets with covariates, if you can just have a good synthetic data generator. But of course, how to do that is a very big open question for which we don't have the answer. Maybe something like TabPFN again, returning to these ideas with structural causal models, but for time series it's even more tricky.
46:20
Because I think for TabPFN we have a structural causal model, but it's only for a single instance. In the case of time series we could also have dependencies across different dimensions and in time. So maybe not just the weather today influences my sales of something; maybe the weather two weeks ago also has an effect on what's happening right now. So the space of these possibilities.
46:41
Grows exponentially: you have lots of dimensions, lots of features, lots of ways in which they can interact with each other. How do we write good data generators that capture all this diversity and result in models that generalize to real-world time series data sets? So, a lot of work to do in this space.
47:01
And of course, the question of how to evaluate these models. I think a lot of what people are doing has some limitations, and we as a community have to figure out how to do that better, what the best practices are, and make sure they're shared, which is very important for us to measure progress in a consistent way and understand which ideas are good and which ones are not.
47:21
And where to do more investigation. Before we conclude, I want to give a huge shout-out to all the co-authors and all the people who worked on and contributed to the Chronos project, mentioned here. Thanks everyone; you can also check the names in the paper. Finally, as a short summary, hopefully I was able to excite you.
47:41
About the field of time series forecasting; maybe this is something you will explore at some point in your projects and think about more. Right now I think we are at a good stage, where we have seen that there is this huge new field of pre-trained zero-shot forecasting models, and they already show very promising results that are comparable to the best things people were able to produce in the last decade.
48:02
But now, of course, we can move further and get even better. We have looked at the Chronos model, which essentially turns LLMs into time series forecasters by formulating forecasting as a next-token prediction problem. And finally, in the last part of the talk, we have seen that there are lots of important, exciting open questions.
48:21
In the field of time series forecasting, and very likely AutoML has a lot to say and a lot to contribute to answering these questions. Thank you very much.