CatBoost - the new generation of Gradient Boosting
Formal Metadata
Title: CatBoost - the new generation of Gradient Boosting
Series: EuroPython 2018 (21 / 132)
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers: 10.5446/44984 (DOI)
Transcript: English (auto-generated)
00:05
Hi, yes, my name is Anna. I'm going to talk to you about CatBoost today. CatBoost is a gradient boosting library. It's open source. A few days ago, I think on the 18th of July, we had our first birthday, so we are one year in open source.
00:21
One day before that, we reached 3,000 stars, and I was really, really happy about that. So the project is growing. We have made many releases, we are working on the project actively, and many people are starting to use it. So first of all, let's start with the problems
00:41
that we are trying to solve using this library. So gradient boosting is a machine learning technique that works well for heterogeneous data. There is homogeneous data, like images, sound, video, text. For all this data, you need to use neural networks to get a good result.
01:01
Yeah, sure. Yes, okay, so for those kinds of tasks you need to use neural networks. And there is structured data: for example, for predicting a credit score (whether the person will pay back the credit), you have a table of data where each row is a person and each column is a feature,
01:21
and those features do not have that much internal structure between them. In an image, two pixels near each other have a lot of internal structure, and here you do not have that situation. For this type of data, gradient boosting usually gives the best results. The next thing is that it's very easy to use. You can actually use a gradient boosting model as a black box model.
01:41
You give to gradient boosting your data, it trains the model and gives you a good result. You cannot do that with neural networks because you really need to be an expert to build a good architecture. And it also works well if you have not a lot of data, and that is something that happens many times in life,
02:00
that you do not have huge amounts of labeled data, but gradient boosting will still give you good result. Because of this whole set of reasons, it's used in production in many companies. It can be used in finance for predicting credit scoring, for example. It can be used in recommendation systems for understanding what are the best songs
02:22
that the person will like, or it can be used for sales predictions. It is also used heavily on Kaggle. There are many machine learning competitions, and for this type of data, the winning solutions in many cases are based on gradient boosting. Now, neural networks are very powerful,
02:42
and it would be really cool to use neural networks together with gradient boosting, and that is something that we are doing at Yandex. So, Yandex is a very large Russian company. We do search, we have taxi, we have self-driving cars. We have many, many different things; it's like the Silicon Valley of Russia. And what we are doing for many tasks,
03:01
we are using together neural networks and gradient boosting. For example, if you have a query and you want to select images to show to the person, then you first calculate neural features on these images, then you combine these neural features with other knowledge that you have about this image.
03:21
For example, about the site on which the image was, and then these features you give to gradient boosting. So, it is a good idea to combine those methods. They are not contradicting. They can work very well together. Gradient boosting is an iterative algorithm. It works like that. It usually builds decision trees.
03:41
First, it builds one decision tree; the training error is still large. Then it builds another decision tree so that the training error decreases, and it does this hundreds or thousands of times until the training error is very small and you can already capture complicated dependencies in your data. That is about gradient boosting. Now about CatBoost.
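Roughly, the iterative scheme described above looks like the following sketch. This is a simplified illustration using scikit-learn decision trees and a squared-error objective (where the gradient step amounts to fitting the residuals), not CatBoost's actual implementation:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def boosting_sketch(X, y, n_trees=100, learning_rate=0.1, depth=3):
        # Start from a constant prediction; each new tree is fit to the current
        # residuals (the negative gradient of squared error) and added to the ensemble.
        y = np.asarray(y, dtype=float)
        prediction = np.full(len(y), y.mean())
        trees = []
        for _ in range(n_trees):
            residuals = y - prediction                    # training error left so far
            tree = DecisionTreeRegressor(max_depth=depth)
            tree.fit(X, residuals)                        # the next tree reduces that error
            prediction += learning_rate * tree.predict(X)
            trees.append(tree)
        return trees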
04:00
The main thing, the main reason why you should be interested in CatBoost, is on the slide, and it's the quality comparison. There are several gradient boosting libraries in open source; the main competitors are LightGBM and XGBoost, and there is also the H2O algorithm. And as you can see on the slide, CatBoost wins over
04:22
all of these libraries on this set of publicly available data sets, by different amounts, sometimes by really a lot, like on Amazon, and that is the comparison after parameter tuning. You have the algorithm; you need to tune its parameters to get the best quality. This is after parameter tuning.
04:40
We also have a comparison without parameter tuning, and CatBoost with no parameter tuning beats all the other algorithms with parameter tuning on these data sets in all cases except one, where LightGBM outperforms CatBoost by a little. So that is the quality comparison. It's a good reason to try the library.
05:00
Now I will deep dive into the differences between cat boost and other libraries. The first difference is the kind of trees that cat boost is building. Different libraries or different algorithms build different kinds of trees. LightGBM builds trees node by node, and can get very deep, not symmetric trees. XGBoost builds trees layer by layer,
05:23
and the trees cannot get very deep, but they are not symmetric. CatBoost builds symmetric trees. An example of a symmetric tree you can see on the slide, and that is not an error on the second level: the feature is the same, it's 'weight', in all the nodes in this layer. On the next layer, if the tree were deeper,
05:42
there would be four nodes with the same feature. We observed that this type of tree helps a lot with hyperparameters, because when you are using this type of tree, the resulting quality does not change by a lot
06:01
when you are changing the hyperparameters. So the algorithm is stable to hyperparameter changes, and because of that, it gives very good results from the first run. So you don't really need to put a lot of effort into parameter tuning. You can make sure that the algorithm has converged,
06:20
so you had enough iterations, and then you just get the first model that the algorithm has given to you. Yeah, okay. That is about hyperparameters, and the next thing is about prediction.
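As a quick aside, that "train it with the defaults and just give it enough iterations" workflow is only a few lines of Python. This is a minimal sketch on synthetic data, not code from the talk:

    from catboost import CatBoostClassifier
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # No parameter tuning: defaults plus enough iterations to converge.
    model = CatBoostClassifier(iterations=500, verbose=100)
    model.fit(X_train, y_train, eval_set=(X_val, y_val))
    print(model.predict(X_val)[:5])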
06:40
With this type of trees, the prediction can be done very fast, and I will talk about this later. So that is the first difference. The second difference is the type of data that we are able to work with. There is this numerical data, like height or weight, numerical features, and it is clear how to work with numerical features if you are using gradient boosting on decision trees. You put a feature into the tree,
07:02
and if the value is less than something, then you go to the one side. If it's greater than this value, then you go to another side. So this is the way to use numerical feature in decision trees. It is not that obvious how to use categorical features in decision trees. Categorical feature is a feature with discrete number of,
07:20
with a discrete set of values, where the values are not necessarily comparable with each other by greater or less. An example of that would be occupation, or there could be high-cardinality categorical features, like user ID, for example. And the high-cardinality categorical features are the ones
07:44
where it is the hardest to work in an optimal way. So what do we do with categorical features? The first thing that we are doing is very simple. It's one-hot encoding, and that is something that other libraries also do.
08:00
What is this? Instead of one categorical feature, like occupation, that the person is a manager or a cook or an engineer, you have three binary features. Is the person the manager? Is the person a cook? Is the person in PR, or what was the third thing I said? So instead of one categorical feature,
08:23
you have many binary features. You could do this during the pre-processing, but if you do this during the pre-processing, then your dataset grows, you have a very large dataset, and also the training time will grow by a lot. So the good way to have one-hot encoding
08:44
is to let the algorithm do one-hot encoding for you. So you just say, this is the categorical feature, please do one-hot encoding. And the algorithm does it for you. It will be better in terms of speed, and it also will be better in terms of quality.
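In code, "telling the algorithm which columns are categorical" is the cat_features argument, and one_hot_max_size controls up to which number of distinct values one-hot encoding is applied. A small sketch with made-up data:

    import pandas as pd
    from catboost import CatBoostClassifier

    # The raw string column goes in as-is; no manual encoding during pre-processing.
    df = pd.DataFrame({
        "occupation": ["manager", "cook", "engineer", "cook", "manager", "engineer"],
        "age": [34, 41, 29, 52, 38, 45],
    })
    target = [1, 0, 1, 0, 1, 1]

    model = CatBoostClassifier(
        iterations=50,
        one_hot_max_size=10,   # one-hot encode categories with at most 10 distinct values
        verbose=False,
    )
    model.fit(df, target, cat_features=["occupation"])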
09:00
There are details of the algorithm that allow for that. I don't have enough time to explain everything. So the first thing is one-hot encoding. Other libraries also do that. And then we have this whole set of things that we are doing with categorical features that are more sophisticated. And these things give a very large boost in quality.
09:21
So one-hot encoding we do for features that have a small number of values. For high-cardinality features, we do the following: we calculate statistics based on the label values of the objects with this category value. The simplest thing you could do is the following. Let's say you have the data set
09:41
that we have on the slide. And there is this categorical feature that is occupation with two possible values, software development engineer, SDE, and PR. Now instead of one categorical feature, this categorical feature, we introduce one numerical feature that is equal to average label values
10:03
of all objects with this categorical feature value. Instead of SDE, we will have three divided by four: there are three ones and one zero, so the average label value is three divided by four. This is called target encoding.
10:22
This could work, but the problem is it doesn't work. It doesn't work because it leads to overfitting, because it leads to target leakage. An example where you can understand that would be, let's say you have the single object with a category value. In this data set, you would have only one SDE.
10:40
Then this SDE will have a label value one, and your new numerical feature value will be just equal to your label value. So during training, the algorithm remembers that it has a very good feature that is equal to target, and it makes all decisions based on that. During the prediction, you will not have this magic feature that is equal to label.
11:01
Because of that, you should not do this. So what are we doing? We are doing the following. We make a random permutation of all the data, and now the data is permuted, and you are looking at the object with some categorical feature value, the i-th object. Now you are calculating the same averages, but not including this object.
11:22
You are only looking at the objects before this one in the permutation. For this object, the new feature value will be two divided by three, because there are three objects with categorical feature value SDE before this one. Now, what else can we do? We can also use priors.
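A rough illustration of this ordered, permutation-based statistic in plain Python. It is a simplified sketch of the idea rather than the library's exact formula, and it already includes the prior that is introduced next, so the first object in the permutation is well defined:

    import numpy as np

    def ordered_target_statistic(categories, labels, prior=0.5, prior_weight=1.0):
        # Encode each object's category using only the objects that come before it in a
        # random permutation (plus a prior), so its own label never leaks into the feature.
        rng = np.random.default_rng(0)
        order = rng.permutation(len(labels))
        label_sums, counts = {}, {}
        encoded = np.empty(len(labels))
        for i in order:
            c = categories[i]
            encoded[i] = (label_sums.get(c, 0.0) + prior_weight * prior) / (counts.get(c, 0) + prior_weight)
            label_sums[c] = label_sums.get(c, 0.0) + labels[i]
            counts[c] = counts.get(c, 0) + 1
        return encoded

    print(ordered_target_statistic(["SDE", "SDE", "PR", "SDE", "SDE"], [1, 1, 0, 1, 0]))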
11:41
For the first object, you don't have any objects before it, so it would be zero divided by zero. To avoid these problems, we introduce priors. We add a prior to the numerator and the denominator, as you see in the formula here, and it gives a boost in quality to try different priors
12:01
and to try to find out which prior is a good one for this particular feature. So what we are doing, we are calculating those averages. We are enumerating different priors. What else you could do? You could try different random permutations, but you cannot use two random permutations to train one model,
12:22
because it will lead to target leakage in the same way as the average over the whole data set. What you can do, and what we are doing: you can train several models simultaneously. We are training four models simultaneously, and on each iteration, when we are selecting the tree structure, we flip a coin and select one of those models.
12:41
Each model has its own permutation. So we select one of these models and one of these permutations. For this model, we select the tree structure. Then, after that, we give this tree structure to all four models that we have, and then we calculate leaf values based on one more permutation. This gives a good boost in quality,
13:03
and the important thing is that you cannot do this during pre-processing. That is something you can only do if you are writing this inside the library. The next thing you can do is to look at feature combinations. What are categorical feature combinations? Let's say you have two categorical features,
13:22
a pet and a color, and the new categorical feature that is a combination of those two features will have the following values: blue cat, black cat, blue dog, black dog. So it's a new categorical feature that is a combination of features. The problem here is,
13:40
if you have several categorical features, the number of possible combinations grows exponentially with the number of features. So you cannot really calculate those statistics for each combination. What we do inside the algorithm is enumerate combinations in a greedy fashion. So we do not enumerate all of them, but we try to enumerate only some of the best ones.
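To see why you cannot just pre-compute statistics for every combination, here is a tiny back-of-the-envelope sketch with made-up cardinalities; the greedy, level-by-level enumeration described next is how the library avoids this blow-up:

    from itertools import combinations
    from math import prod

    # Hypothetical categorical features and their numbers of distinct values.
    cardinalities = {"pet": 4, "color": 12, "city": 1000, "user_id": 100000}

    total = 0
    for size in range(2, len(cardinalities) + 1):
        for combo in combinations(cardinalities.values(), size):
            total += prod(combo)   # distinct values of that combined feature
    print(total)                   # already billions of category values to track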
14:02
When we are building the next tree, we first, on the first level, we try only combinations of size one, and then we, on the next level, we are trying combinations of size two by adding features to the feature that we have already selected.
14:20
And we also calculate other statistics, like the frequency of the category in the data set; that also helps. That is the big thing about categorical features. Now, the next thing that we are doing that gives a boost in quality is called ordered boosting. Classical boosting is prone to overfitting, and that means that the resulting model will lose quality.
14:42
And that is because when you are building the tree, when you are building the leaf value, the leaf value is the estimate of the gradient of all the objects that would be in this leaf. And this estimate in classical boosting is biased because you are making this estimate
15:00
on the same objects that you have built the model on. It is easier to see if you look at the error: if you estimate the error in the leaf on the same objects that you have built the model on, then the error will be smaller, so it will be biased. The same thing happens with gradients. To overcome this problem, we use the same idea
15:22
that we have used for categorical features. We use these random permutations. Now you have your random permutation, and when you are building the tree structure, then for each object you are making the estimates based on a model that has never seen this object. You are making the estimate based only on the objects before the given one in the permutation.
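In the library this behaviour corresponds to the ordered boosting scheme, which, as far as I know, is exposed through the boosting_type parameter (a minimal sketch; on small data sets the library may already pick this mode by default):

    from catboost import CatBoostClassifier

    # Request the ordered scheme explicitly instead of classical ("Plain") boosting.
    model = CatBoostClassifier(
        iterations=500,
        boosting_type="Ordered",   # per-object estimates use only preceding objects
        verbose=False,
    )
    # model.fit(X_train, y_train) as usual; nothing else changes.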
15:42
And that gives a boost in quality in case you have a small data set or a noisy data set. If you know that there might be overfitting, it really helps. Okay, now I have told you about the main algorithmic things that we have in the library. Now let me tell you about the modes
16:01
that the algorithm is working on. There are three main modes, classification, regression, and ranking. And those three modes are in all gradient boosting libraries. First one is classification. And there are binary classification and multi-classification. Binary classification problem is, for example, if you want to predict if the person
16:20
will pay the credit or not. In your training data set, you will have labels, one if the person has paid the credit, zero if the person has not paid the credit, or you might have probabilities there. And for multi-classification, multi-classification is when you have more than two possible answers. For example, if you want to predict weather for tomorrow,
16:40
you want to predict type of clouds, and there are like six or nine possible types of clouds, and for that you can use multi-classification. The regression is when you want to predict numerical value. For example, if you want to predict taxi drive duration or to predict dollar exchange rate.
17:00
Those are regression problems. And there is also ranking. And ranking is a little bit more tricky. An example of a ranking problem would be: for this particular city, say Edinburgh, give me the top N hotels in this city. Let's say that your input data has ratings, so for some of the hotels you have a rating,
17:20
and those hotels are your training data set. For prediction, for the other hotels you do not have the ratings; you need to predict the ratings first, and then rank the hotels and select the top N of them. How would you solve this problem? One way to solve this problem would be to solve it using regression.
17:41
That means you really will try to predict a rating for each hotel, and then you will sort the hotels by rating and select the top N. But you don't need to do this. In this case, let's say in city A, all the hotels are really good. In city B, all hotels are really bad. And what you're trying to force your algorithm to do
18:01
is to say that all the hotels in one city are worse than all the hotels in the other city. And that is not cheap. And you don't need to do this to find the top N here and the top N there. You don't need to compare them with each other. So you don't need to learn the real rating.
18:22
And because of that, what you are doing is grouping the objects into groups, like here you would group the hotels by city, and you are trying to rank objects only inside each group. That is ranking. We use ranking a lot at Yandex, because we have search, we have ads,
18:40
we have recommendations from music and video. We have very many places where we need ranking. And because of that, we have many very powerful ranking modes, which XGBoost and LightGBM do not have. The first one, let's say it's just ranking, is in case if you have something like ratings in your data set,
19:00
or relevance, as when you have a search query and documents, and then for each document an assessor writes a number which is the relevance of this document. That is ranking. We have two modes for ranking. They are called YetiRank and YetiRankPairwise, and the difference between them is that the first one is really fast,
19:20
and the second one is really powerful, but slow. In most cases, we use YetiRankPairwise. The next mode is the pairwise mode, and that is the mode that you use if you do not have any ratings. You only have pairs of objects, and for each pair, you know that the first object is better than the second object.
19:41
Third is better than fourth, and so on. So you only have pairs as the input. We have two modes for PairWise ranking, and the difference between them is the same. And we also have three other modes. One is the mix between ranking and classification, which might be useful if you want to do ranking, but you have the zero and one labels as the input.
20:04
Another is ranking plus regression, and one more is specific for the task when you want to select top one best candidate. That is also a ranking task, but there might be the case then when you are not really interested in top N,
20:22
you are only interested in the top one. So we have this whole set of ranking modes, and if you are interested in ranking, I would strongly recommend you try them. They work really well, and they are used in production in different services at Yandex. Okay, now what else?
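As an aside, here is roughly how such a grouped ranking task is passed to the library: relevance labels plus a group id per row, and one of the ranking losses such as YetiRank. A sketch on made-up data:

    import numpy as np
    from catboost import CatBoost, Pool

    # Each row is an object (say, a hotel) with a relevance label; group_id says which
    # group (city or query) it belongs to, and objects are only ranked inside a group.
    X = np.random.rand(8, 3)
    relevance = [3, 1, 2, 0, 2, 2, 1, 0]
    group_id = [0, 0, 0, 0, 1, 1, 1, 1]

    train_pool = Pool(X, label=relevance, group_id=group_id)
    ranker = CatBoost({"loss_function": "YetiRank", "iterations": 100, "verbose": False})
    ranker.fit(train_pool)
    scores = ranker.predict(train_pool)   # higher score means ranked higher within its group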
20:40
We have talked about the algorithm, how it works. We talked how to use it, which modes are there. Now about the speed. Important things are CPU speed, GPU speed, and prediction speed. CPU speed, when we just released, CatBoost was really slow, and everyone told us that, and we worked a lot to make speed ups,
21:02
and currently, the situation is the following. On most of the data sets, we will be two to three times slower than LightGBM, so LightGBM is the fastest one, and with XGBoost, we might be the same or about two times slower, something like that. So the difference is not that big,
21:21
but we still are a bit slower than other libraries. With GPU, the situation is completely different. We are very, very fast in GPU. We are about 20 times faster than XGBoost and two to three times faster than LightGBM. And the important thing is that CatBoost is super easy to use, as opposed to LightGBM.
21:43
You just pip install the library, there is a flag, task_type equals GPU, and you use it, so it's really fast and it's really easy to use. An important thing about GPU is that the GPU speed-up grows with the amount of data.
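That flag is literally one constructor argument (a minimal sketch; it needs a CUDA-capable GPU to actually run):

    from catboost import CatBoostClassifier

    model = CatBoostClassifier(
        iterations=1000,
        task_type="GPU",   # train on GPU instead of CPU
        devices="0",       # which GPU device(s) to use
    )
    # model.fit(...) as usual; nothing else in the code changes.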
22:01
The more data you have, the bigger the speed-up. For very large data, if you have millions of objects, you will get a speed-up of up to 40 to 50 times. That will be on the newest GPUs; on older GPUs, the speed-up will be about four times.
22:22
About prediction time. We care about prediction time also, and CatBoost prediction time is 30 to 60 times faster than XGBoost and LightGBM. It's not that everyone cares about speed up, but we care and we are proud that we are so fast.
22:41
And also, I wanted to mention a few other things, and that is how to explore your model. If you want to understand what your model is doing, you need to look at feature importances, so which features are the most important ones. You can look at feature interactions, which pairs of features work really well together.
23:00
There is also per-object feature importance: for this object, which features are the most important? For that, we are using SHAP values, and there is a SHAP library that shows the visualization for that. So each feature has an importance; it might have a positive importance,
23:20
because of this feature the predicted cost grows, because of that feature the cost is lower, and so on. There are different plots you can look at to understand more about your features. There is also a way to understand which objects are the most important ones. So let's say you have this object,
23:41
and you want to see which objects in the training data set are responsible for this result. There is an influential-documents tool for that. And there is also a way to understand whether a feature's influence is statistically significant. For that, we have feature evaluation.
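All of these explorations are available from the trained model object. A rough sketch on synthetic data (the string values for the type argument are what I believe the current API accepts):

    from catboost import CatBoostClassifier, Pool
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    pool = Pool(X, y)
    model = CatBoostClassifier(iterations=100, verbose=False).fit(pool)

    importances = model.get_feature_importance(pool)                       # per-feature importance
    interactions = model.get_feature_importance(pool, type="Interaction")  # pairs of features
    shap_values = model.get_feature_importance(pool, type="ShapValues")    # per-object contributions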
24:04
We have tutorials for all of that, and I recommend you look into these tutorials. There is also a bunch of other functionality in the library. Besides training, there is a lot of visualization.
24:21
You can look at how your error changes during training using a Jupyter notebook, using the separate CatBoost viewer, or using TensorBoard. You can also try training on a data set with missing feature values. You can use cross-validation. We also have visualization for cross-validation
24:41
that is also running inside Jupyter notebook. So there is a bunch of stuff to try. I just want to mention a few important parameters. So if you want to tune your algorithm, if you want to get the best quality, these are the parameters that you want to tune.
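The slide itself is not part of the transcript, but based on the talk and the Q&A, the parameters in question include the number of iterations together with the learning rate, the tree depth, and bagging_temperature and random_strength. Checking one candidate setting with the built-in cross-validation looks roughly like this (a sketch on synthetic data):

    from catboost import Pool, cv
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

    params = {
        "loss_function": "Logloss",
        "iterations": 500,
        "learning_rate": 0.05,
        "depth": 6,
        "bagging_temperature": 1.0,
        "random_strength": 1.0,
        "verbose": False,
    }
    results = cv(Pool(X, y), params, fold_count=5)   # per-iteration metrics as a DataFrame
    print(results[["test-Logloss-mean", "test-Logloss-std"]].tail())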
25:04
And we have tutorials and documentation on how to tune parameters for quality. We also just published a tutorial on how to change parameters if you want to get the most speed. I encourage you to look into this documentation.
25:27
Yes, and here are a few links. The last one is the CatBoost tutorials repository on GitHub. And we have also published there a tutorial with homework.
25:41
So if you want to learn how to use gradient boosting, you can try this tutorial with homework. And do the task there, answer the questions. And we have tutorials for all the functionality that I told you about. So if you run through these tutorials,
26:01
you will learn everything. And with that, I am ready to answer the questions. Questions from the audience.
26:23
By the way, can I also ask the question? How many of you have heard or used Cat Boost before this talk? Okay, that's half, half, okay. Hi, thanks very much for the talk. So you mentioned at the start that it's possible
26:42
to combine neural networks and gradient boosting. Yes. Can you give us some example applications? Yes, so first of all, we have a tutorial how to use neural networks on text together with gradient boosting. The idea is the following. You train a separate neural network. For example, a neural network that will say,
27:04
that will compare the image with text. And then from that you get numerical features like distances. And those distances you combine with other features. Like you have the image and you have a lot of other information about this image. Not only what you see in the image,
27:21
you have information from the site, from how many people clicked there and so on. And then you combine this all together and then you put this into gradient boosting. Okay, thank you. I have a question about the categorical features you mentioned. How did you use those statistics when you do prediction?
27:43
Because you don't know the true value. I'm not sure and this is the question for you. So how we use those statistics during training? During prediction. So what we are doing, we are reading the training data set and for each category that you have seen,
28:03
you have a value that is calculated based on all the training data set. Then you write this value to the hash table and when you are predicting, what you are doing is you are adding this test object to the end of your training data set
28:21
and then the feature value for this object will be the average based on all the objects before given one, that means on all the training data set. And this is the value that you have saved in the hash table. So basically you use the average value of the training data set. Yes, exactly.
28:43
Thanks a lot for the talk and for the library. I was just wondering, why would you use scikit-learn anymore if you have something like this? Like do you have any application? No, scikit-learn has a lot of stuff including gradient boosting.
29:00
Gradient boosting libraries like CatBoost, LightGBM, and XGBoost work better than scikit-learn's gradient boosting. So if you want to use gradient boosting, you'd better go with a different library than scikit-learn, but there is a bunch of stuff in scikit-learn that is very useful.
29:22
Yeah, that's true, but for anything like classification or regression and all of this, you would only use CatBoost at Yandex, right? So we use CatBoost for many, many different tasks. I don't think we are training any scikit-learn classifier
29:40
for our production purposes. Wow, thank you. But we not only use gradient boosting, there are other also useful algorithms. There are neural networks, there are, I don't know, linear models, nearest neighbors, we use all of that. But gradient boosting is a very important algorithm.
30:06
Hi, thanks. I was just wondering, do you have any tools within the library to extract the prediction path of a particular instance similar to a tree interpreter for a random forest? Could you repeat, please?
30:20
So when you make prediction for a given instance, is it possible, do you have some tools to extract the path this particular instance took to make that? Oh, like for each, what path in each tree? So we are currently working on providing the JSON model, and that will be the way to discover the paths.
30:48
I have a quick question. How well do you integrate with scikit-learn? Because I've been using XGBoost for ranking problems, and it's insanely hard to not use the basic XGBoost thing if you want to do cross-validation
31:01
or anything with ranking. So how well do you integrate with the rest of the ecosystem, or do I have to use CatBoost all the way? Yeah, so we are trying to integrate as best as we can. We are very much compatible with XGBoost, so if you are already using XGBoost, it will not be that hard to switch.
31:21
There are some methods like, I don't know, there was some method that... so we do not support all the methods in scikit-learn, but we know about one of them that does fail, and we plan to fix it soon, and if you see anything that doesn't work, you can just open an issue on GitHub,
31:41
and we will fix it. Thank you for your talk. Do you have any experience about applying cat boost to natural language processing use case or data sets? Yes, so for natural language processing, what we do, we usually use neural networks,
32:02
and we sometimes do, on top of what we are doing when using neural networks, we do use on top of that some gradient boosting. For example, we have the dialogue assistant that generates the answers.
32:21
You ask something, then there is this neural generator for the answers of the assistant to you, and it generates many answers, and then on top of that, there is a re-ranker that is based on gradient boosting that extracts some features from each of the answers and then re-ranks them,
32:41
but the core is usually based on neural networks. Thank you for the talk. In production, we use neural networks and like GBM a couple of months ago, we tried to use cat boost,
33:01
and basically our use case is we create word embeddings, some other features, and one-hot encoding, and then we pass that to LightGBM, and we spent a lot of time fine-tuning LightGBM, but when we tried to substitute it with CatBoost, the results weren't better,
33:21
so how easy is it to fine-tune CatBoost? Do we need to spend, again, two months on grid search and so on? Any ideas? It should not be like that; there is a set of parameters that you can change to try to improve the quality,
33:43
and those parameters are listed here. It's here. So you can try to fine-tune these parameters. One of them, the number of iterations, you don't really need to fine-tune; you just need to find the point of convergence,
34:01
and these other ones, you probably have to fine-tune. With depth, the situation is the following. You don't need to enumerate through all the depths. You only, you basically try the depth six, that is the default one, and for some data sets, it is important to have bigger depth. So you try six, you try 10, and then if 10 is better, then you need to fine-tune
34:22
between eight and nine, something like that. So those are things to fine-tune. Yeah, and one more thing about the one-hot encoding that you mentioned: it might lead to slow training
34:41
for CatBoost. So if it is possible for you to not do it during pre-processing and to give the algorithm the possibility to do it for you, then it will probably be better. And one more thing about word embeddings: it usually doesn't work perfectly if you give the word embeddings to gradient boosting
35:02
raw, like this, as 300 numerical features. Usually it's better to try to find some distances. Yeah. Hi. I don't do ML as a part of my day job,
35:22
but I've used cat boost as a key ingredient in many of the ensembles in machine learning contests with great success, so thank you for that. So my question is related to the previous question. Do you have any tips on hyperparameter optimization? So I mean, the usual recommendation is grid search,
35:42
but usually I use something a little more intelligent when it comes to optimizing hyperparameters for neural nets and things like that. Anything along those lines for CatBoost? Yeah, there is hyperopt, which probably works better. I would recommend this library. Yeah, I incidentally use hyperopt for neural nets.
36:01
Okay. Is that what you'd recommend for parameter tuning on cat boost as well? Yeah. It's not particularly for cat boost, it's like for any. Yeah, it's quite generic. Yeah. Thank you.
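For what it's worth, a minimal sketch of what that looks like with hyperopt; the search space and metric here are just an example, not a recommendation from the talk:

    from catboost import CatBoostClassifier
    from hyperopt import fmin, hp, tpe
    from sklearn.datasets import make_classification
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    space = {
        "depth": hp.quniform("depth", 4, 10, 1),
        "learning_rate": hp.loguniform("learning_rate", -4, -1),
        "l2_leaf_reg": hp.loguniform("l2_leaf_reg", 0, 3),
    }

    def objective(params):
        model = CatBoostClassifier(
            iterations=300,
            depth=int(params["depth"]),
            learning_rate=params["learning_rate"],
            l2_leaf_reg=params["l2_leaf_reg"],
            verbose=False,
        )
        model.fit(X_tr, y_tr, eval_set=(X_val, y_val))
        return log_loss(y_val, model.predict_proba(X_val)[:, 1])   # hyperopt minimizes this

    best = fmin(objective, space, algo=tpe.suggest, max_evals=30)
    print(best)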
36:20
Hi, thanks for a nice talk and for open sourcing the library. I have two quick questions. One of the bullet points you showed previously was that the library handles missing values, which essentially means you can feed in NaNs without imputation. Which is very useful, yes. And the other question is, do you support sparse matrices as input for the.
36:41
Yes, that is one thing that we do not support yet. Okay. And we are currently working on that. Okay, so there's some. Yeah, the plan for the next few releases is the following. We will be adding the sparse matrix support. We are adding distributed training on Spark.
37:02
We are also working on that. And we are adding multi-classification on GPU. It will come really soon. And we are adding the JSON model and improvements in our package. That is the plan for the next releases. But sparse data is a very large thing to do,
37:22
so it's not yet ready. But we are working on that. Perfect, thanks. Hi, I found the quantile regression option in gradient boosting quite useful for my applications. So do you have something similar in cat boost, which could provide the prediction intervals?
37:42
Do you have what? The quantile regression option? Yes, we do have a quantile regression. So you could derive prediction intervals based on that? We do not provide prediction intervals, no.
38:00
So there is a loss function for quantile regression, but there are no prediction intervals. OK, thank you. We have time for a few more questions. If anyone have the microphone, yeah.
38:20
And I am very open to feature requests now. How will it handle imbalanced data?
38:43
So there is always a problem with imbalanced data. We have the possibility of re-weighting the objects, the scale_pos_weight parameter, the same as in XGBoost and LightGBM. But that is the only thing that we are doing specifically
39:01
for imbalanced data sets. So it depends. We know some imbalanced data sets where it works really well, like Amazon, which was on the slide, and some data sets where it does not work that well. Because your tree structures are balanced, right?
39:22
Yes. It's very sensitive to imbalanced data. I don't think that this tree structure is worse than other tree structure for imbalanced data sets.
39:42
So we know that there is this problem with imbalanced data sets, and we are trying to figure out how to fix this. Other libraries also have the same problem. OK, thank you. We have time for one last question.
40:10
Thanks for the talk. I have a question about those parameters there. What do bagging temperature and random strength represent? Yes. OK, that is a very good question.
40:20
The bagging temperature, so when you are selecting the tree structure or when you are selecting the next tree, you are doing some bagging. So what you can do, you could do Bernoulli sampling. We select some objects and do not select some other objects. But you also can do other sampling. What we are doing, we are, by default, in regression and classification, we
40:42
are sampling from exponential distribution. And what we want to do there, we want to balance between having no sampling at all and having sampling from exponential distribution or heavier. And this bagging temperature changes that. So if it's set to 0, then all the weights are equal to 1.
41:02
If it's set to 1, then you have sampling from the exponential distribution, and in between there is this balance. The random strength is one more parameter. When we are selecting the tree structure, we are trying to put a split in the tree; for each possible split that we try to put in the tree, the split gets a score.
41:22
And this score is how much this split improves my ensemble. What we are doing, we are adding to this score a normally distributed random variable. And this random strength is the multiplier for this variable.
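Both of these are just constructor parameters (a minimal sketch; the values shown are the kind of thing you would search over, not recommendations):

    from catboost import CatBoostClassifier

    model = CatBoostClassifier(
        iterations=500,
        bagging_temperature=1.0,  # 0 = no sampling (all weights equal 1), 1 = exponential weights
        random_strength=1.0,      # multiplier for the random noise added to split scores
        verbose=False,
    )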
41:41
And this helps a lot to overcome overfitting. That is one more surprising hack that helps to improve the quality. OK, I want to thank Anna again for this enlightening talk. And thank you very much for the audience. Thanks.