
Predict COVID-19 Spreading With C-SMOTE


Formal Metadata

Title
Predict COVID-19 Spreading With C-SMOTE
Title of Series
Number of Parts
30
Author
License
CC Attribution 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Data continuously gathered monitoring the spreading of the COVID-19 pandemic form an unbounded flow of data. Accurately forecasting if the infections will increase or decrease has a high impact, but it is challenging because the pandemic spreads and contracts periodically. Technically, the flow of data is said to be imbalanced and subject to concept drifts because signs of decrements are the minority class during the spreading periods, while they become the majority class in the contraction periods and the other way round. In this paper, we propose a case study applying the Continuous Synthetic Minority Oversampling Technique (C-SMOTE), a novel meta-strategy to pipeline with Streaming Machine Learning (SML) classification algorithms, to forecast the COVID-19 pandemic trend. Benchmarking SML pipelines that use C-SMOTE against state-of-the-art methods on a COVID-19 dataset, we bring statistical evidence that models learned using C-SMOTE are better.
Transcript: English (auto-generated)
Good morning, everyone. I am Alessio Bernardo, and I am a PhD student at Politecnico di Milano. In this presentation, I'll present a case study applying a continuous SMOTE version, called C-SMOTE, to predict the spreading of the COVID-19 pandemic. This is the agenda of the presentation: I begin with the main challenges to face when applying machine learning to data streams, then I'll present the case study and the experimental evaluation that we did, and in the end I'll finish with the conclusions and some directions for future work. Traditionally, standard machine learning techniques use a static batch of data that does not evolve over time,
and they split it into training and testing sets: they use the training set to train a model and the testing set to test it. With data streams, instead, where the latency of each new sample is very low and a new sample becomes available one at a time, the only thing these techniques can do is to append the new sample to the end of the batch, re-split it, and retrain the model from scratch. So with data streams these approaches are unfeasible. Streaming machine learning models, by contrast, are able to continuously incorporate the new data without retraining their models from scratch. Moreover, there is another difference: they first use the new incoming sample to test the model and only then to update it. This approach is called prequential evaluation, and it solves the problem that a testing set is not available, because it is unfeasible to split the new data into training and testing sets. So the first difference between the two approaches is the stream unboundedness. In a batch, we have all the data available at the same time, so we can split it into training, testing, or evaluation sets, and we can also know all its characteristics: for example, the number of samples, the imbalance ratio, the class distribution, et cetera. With data streams, instead, we have only one sample available at a time, so everything is different: we can only know the characteristics of the data seen so far, not the final characteristics of the data.
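To make the test-then-train idea concrete, here is a minimal sketch of prequential evaluation using the river Python library; the synthetic SEA stream and the Naive Bayes learner are illustrative assumptions, not the setup used in this work.

```python
# Minimal prequential (test-then-train) loop with the river library.
# The SEA synthetic stream and Gaussian Naive Bayes are illustrative choices.
from river import datasets, metrics, naive_bayes

model = naive_bayes.GaussianNB()
accuracy = metrics.Accuracy()

for x, y in datasets.synth.SEA(seed=42).take(10_000):
    y_pred = model.predict_one(x)  # 1) test on the new sample first...
    accuracy.update(y, y_pred)     # 2) ...update the evaluation metric...
    model.learn_one(x, y)          # 3) ...then train on that same sample

print(accuracy)                    # running prequential accuracy
```

Note that no separate testing set is ever materialized: every sample is used first for testing, then for training, and can be discarded afterwards.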
For these reasons, streaming machine learning was introduced. It is able to maintain models online, which means, first, that since we don't have the entire dataset available, an online model needs to incorporate one sample at a time, on the fly. Moreover, the stream is unbounded, so it must be possible to update the model at any time: it is impossible to retrain it from the beginning for every new sample. Another consequence of the stream unboundedness is that the model needs to manage memory properly: it is unfeasible, indeed impossible, to keep all the samples in memory, otherwise an infinite amount of memory would be required. In fact, after the prequential evaluation phase, the samples are discarded. Finally, since streams can evolve over time, they can present some form of non-stationarity, so an online model must be dynamic and adapt to it.
In particular, this form of non-stationarity is called the concept drift phenomenon: the stream evolves over time, and the statistics of the data are subject to gradual or even sudden changes, without any warning. The risk is that a model trained up to a given point no longer fits the new incoming data. So it is not enough to build a model able to incorporate one sample at a time on the fly; we also need to adapt the model built so far to manage the concept drift. This is the reason why a streaming machine learning model must be dynamic. Here is an example of a concept drift occurrence: as long as the distribution doesn't change, the accuracy is high. When a drift occurs, if the model doesn't adapt to the concept drift, the accuracy drops; instead, if it is able to adapt and manage the drift, the accuracy stays high.
In order to manage the drift, streaming techniques use a change detector in their architecture. A common change detector is the adaptive sliding window, ADWIN for short. It is a variable-length window of recently seen items, and its length is recomputed online according to the changes observed.
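For illustration, here is a minimal change-detection sketch with river's ADWIN implementation (the drift_detected property assumes a recent river version); the value stream is made up so that its mean shifts halfway through.

```python
# Minimal ADWIN sketch with the river library; the stream is made up:
# its mean shifts from ~0 to ~1 after 500 samples.
import random
from river import drift

adwin = drift.ADWIN()
stream = [random.gauss(0, 0.2) for _ in range(500)] + \
         [random.gauss(1, 0.2) for _ in range(500)]

for i, value in enumerate(stream):
    adwin.update(value)       # feed one observation at a time
    if adwin.drift_detected:  # ADWIN cut its window: distribution changed
        print(f"Drift detected around sample {i}")
```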
Concept drift, unfortunately, is also related to data stream imbalance: streams evolve over time, and the data distribution can change as well. In the batch setting this is a well-known problem, and there are a lot of techniques to rebalance a batch before training a model.
But in data streams it is still a big challenge, due to the stream unboundedness. One of the most powerful and most famous techniques able to rebalance a dataset is SMOTE. For each minority class sample (in the figure, the orange squares), SMOTE finds its k nearest neighbours, randomly chooses one of them, and then uses it to create a synthetic sample. But, as already said, the drawback of this technique, and of all the others, is that they all work on a finite starting batch of data; data streams, being unbounded, make them unfeasible to apply.
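The interpolation step just described can be sketched as follows with NumPy and scikit-learn; the data and parameter values are illustrative.

```python
# Sketch of SMOTE's core step: for one minority sample, pick one of its
# k nearest minority neighbours and interpolate a synthetic sample.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))  # minority class samples (illustrative)

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)  # +1: a point is its own neighbour

x = X_min[0]
_, idx = nn.kneighbors(x.reshape(1, -1))
neighbour = X_min[rng.choice(idx[0][1:])]  # random neighbour, skipping the point itself

gap = rng.random()                      # interpolation factor in [0, 1)
synthetic = x + gap * (neighbour - x)   # new synthetic minority sample
```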
For this reason, in our previous work, we proposed a solution called C-SMOTE. The intuition was to continuously rebalance the stream as soon as a new sample comes in input. Moreover, being a meta-strategy, it is a sort of module that can be prepended to any SML-minus learner, i.e., any algorithm that is not natively able to rebalance a stream in the presence of concept drift. To do so, C-SMOTE uses the ADWIN change detector to keep a window of samples belonging to the same concept, and then uses them to apply SMOTE.
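A simplified, self-contained sketch of this idea follows. It is not the authors' implementation: the smote_sample helper is hypothetical and, for brevity, interpolates a random pair of minority samples instead of using the k nearest neighbours.

```python
# Simplified sketch of the C-SMOTE meta-strategy described in the talk:
# keep an ADWIN-managed window of samples from the current concept and use
# SMOTE-style interpolation on it to rebalance the base learner's training.
# Features are assumed to be dicts of numbers, as in river.
import random
from collections import Counter, deque
from river import drift

def smote_sample(window, minority):
    """Hypothetical helper: interpolate a random pair of minority samples."""
    pts = [x for x, y in window if y == minority]
    a, b = random.choice(pts), random.choice(pts)
    gap = random.random()
    return {key: a[key] + gap * (b[key] - a[key]) for key in a}

class CSMOTESketch:
    """Simplified sketch, not the authors' code."""

    def __init__(self, base_learner, window_size=1000):
        self.base_learner = base_learner
        self.adwin = drift.ADWIN()               # detects concept changes
        self.window = deque(maxlen=window_size)  # samples of the current concept
        self.seen = Counter()                    # real + synthetic counts per class

    def learn_one(self, x, y):
        self.adwin.update(y)
        if self.adwin.drift_detected:  # concept changed:
            self.window.clear()        # drop the old-concept samples
            self.seen.clear()
        self.window.append((x, y))
        self.seen[y] += 1
        self.base_learner.learn_one(x, y)
        # Rebalance: synthesise minority samples until class counts roughly match.
        if len(self.seen) == 2:
            minority = min(self.seen, key=self.seen.get)
            majority = max(self.seen, key=self.seen.get)
            while self.seen[minority] < self.seen[majority]:
                x_syn = smote_sample(self.window, minority)
                self.base_learner.learn_one(x_syn, minority)
                self.seen[minority] += 1
```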
About the COVID-19 case study: we started from a dataset with 59 attributes, 54 numerical and 5 nominal. The most important attributes are, for example, the country, the date, the numbers of new and total cases and deaths, and also some population markers: for example, the population of each country, the number of hospitals, the number of hospital beds, and so on. On this dataset we applied a pre-processing phase.
We added some date attributes, for example the month, the year, the day of the week, the week of the year, and whether that day was a holiday or not. Most importantly, we added a label that is 0 if today's new cases are less than or equal to yesterday's, or 1 if today's new cases are greater than yesterday's. So the task is to predict whether there are more or fewer cases than yesterday, that is, whether the pandemic is spreading or contracting.
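For illustration, the label could be built with pandas as follows; the column names (country, date, new_cases) are assumptions, not necessarily those of the original dataset.

```python
# Illustrative pre-processing and label construction with pandas;
# the column names ("country", "date", "new_cases") are assumptions.
import pandas as pd

df = pd.read_csv("covid.csv", parse_dates=["date"])
df = df.sort_values(["country", "date"])

# Date attributes added in pre-processing.
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year
df["day_of_week"] = df["date"].dt.dayofweek
df["week_of_year"] = df["date"].dt.isocalendar().week

# Label: 1 if today's new cases exceed yesterday's, else 0 (per country).
yesterday = df.groupby("country")["new_cases"].shift(1)
df["label"] = (df["new_cases"] > yesterday).astype(int)
```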
This is a challenging task due to the pandemic's non-stationarity. In general, when the pandemic spreads, we may simply forecast an increasing trend and ignore early signs of decrement; the other way around, when the pandemic contracts or is stable, we may ignore early signs of increment. Moreover, due to the concept drift occurrences, the classes can also swap: all the samples labeled as minority or majority class before a concept drift occurrence can become majority or minority class after it. Now I'll show you some examples. This, for example, is the COVID-19 spreading in Italy
from February 2020 to February 2021. In particular, the figure on the top shows the number of daily COVID-19 cases, while the other shows the increments or decrements of cases with respect to the day before. Here, the red lines represent the concept drift occurrences, i.e., the points where there is clearly a change in the distribution. We can notice that there are several points at which the distribution changes: it is not stationary. In particular, we can also see from this table that for each concept drift the ratio of increments to decrements changes and swaps, which clearly shows that this phenomenon is not stationary and is not easy to detect. I also show you some data about the most important countries in the world: the concept drifts are always different, and the ratio between increments and decrements differs as well. In general, considering all the data,
the minority class here is label 1, that is, the increment of the pandemic. About our research question: since models trained with traditional machine learning techniques on a balanced static batch can achieve higher performance than those trained on an imbalanced dataset, we wanted to know whether, on the COVID-19 stream, which as we saw is imbalanced and subject to concept drift, continuously rebalancing the stream before applying an SML-minus model could improve the performance or not.
To investigate this research question, we stated two hypotheses. The first one is that the minority class results of the SML-minus models pipelined with C-SMOTE are better than those obtained without it. The second one is that the minority class results of at least one SML-minus model pipelined with C-SMOTE are better than those achieved by the state-of-the-art techniques, i.e., the ones natively able to rebalance a stream in the presence of concept drift, which from now on we will call SML-plus models. We first compared C-SMOTE pipelined
with five different SML-minus models against the respective SML-minus models alone. In particular, we tested the Adaptive Random Forest, the Hoeffding Adaptive Tree, Naive Bayes, the k-nearest neighbour, and the Temporally Augmented Classifier techniques. Then we also selected four SML-plus models from the literature and compared them with C-SMOTE. In particular, we selected the Adaptive Random Forest with resampling, the RebalanceStream method, which uses multiple models trained in parallel to rebalance the stream, and the oversampling and undersampling online bagging methods, which combine random oversampling or undersampling with online bagging by tuning its Poisson λ parameter. To evaluate the results achieved in our tests, we used the recall and the F1 measure of both the minority and the majority class, and we also used the G-mean as an overall metric.
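For illustration, these metrics could be computed with scikit-learn as follows; the label arrays are made-up numbers.

```python
# Per-class recall, per-class F1, and G-mean (geometric mean of the
# per-class recalls) with scikit-learn; the arrays are made up.
import numpy as np
from sklearn.metrics import f1_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 0, 1, 1, 0, 0, 1, 0, 0, 1])

recalls = recall_score(y_true, y_pred, average=None)  # ordered [class 0, class 1]
f1s = f1_score(y_true, y_pred, average=None)
g_mean = np.sqrt(recalls[0] * recalls[1])

print(recalls, f1s, g_mean)
```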
We performed a one-tailed Student's t-test with significance level 0.05 to determine whether there were significant differences between the results obtained with and without C-SMOTE. In particular, the null hypothesis (the light green one) checks whether the average results achieved by C-SMOTE pipelined with an SML-minus model are equal to the average results of the SML-minus model alone or of the SML-plus models. Then we defined the first alternative hypothesis (the green one), which checks whether the average results achieved by C-SMOTE pipelined with an SML-minus model are greater than the results achieved by the SML-minus model alone or by the SML-plus models. The last one is the second alternative hypothesis (the red one), which checks whether the results achieved by C-SMOTE pipelined with an SML-minus model are less than the others.
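A minimal sketch of such a one-tailed test with SciPy (the alternative argument assumes SciPy 1.6 or later); the two groups of results are made-up numbers.

```python
# Illustrative one-tailed Student's t-test (alpha = 0.05) with SciPy;
# the two groups of per-run results are made-up numbers.
from scipy import stats

with_csmote = [0.71, 0.69, 0.74, 0.72, 0.70]     # e.g. minority-class recall
without_csmote = [0.62, 0.65, 0.60, 0.63, 0.61]

# Alternative hypothesis: results with C-SMOTE are greater (one-tailed).
t_stat, p_value = stats.ttest_ind(with_csmote, without_csmote,
                                  alternative="greater")
if p_value < 0.05:
    print("Reject the null hypothesis: the C-SMOTE results are greater.")
```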
Starting from the comparison between C-SMOTE pipelined with the SML-minus models and the SML-minus models alone, we can see, in the minority class part of the table and also in the G-mean (the overall metric), that C-SMOTE is able to improve the performance of the baselines. So we can say that the first hypothesis is accepted. About the second one, there is a little more to say. For example, comparing C-SMOTE with the RebalanceStream algorithm (the first row), only the Hoeffding Adaptive Tree model pipelined with C-SMOTE is able to outperform RebalanceStream in the minority class performance and in the G-mean. The situation is similar with the undersampling online bagging technique (this row): again only the Hoeffding Adaptive Tree outperforms it, but with the positive difference that the G-mean is always improved with all the baselines. Comparing the two remaining algorithms, we can see that C-SMOTE is able to improve the minority class performance and the G-mean using all the baselines tested. So, finally, we can say that the second hypothesis is also accepted. About the conclusions,
the minority class results of the C-SMOTE pipelines are in most cases better than both those of the SML-minus models alone and those of the SML-plus algorithms. Moreover, we saw that the gain in recall on the minority class is bigger than the loss in recall on the majority class. This means that the improvement in the ability to correctly forecast the decrements or the increments when the infection is spreading or contracting is larger than the error introduced. This is, of course, relevant, because it means that C-SMOTE can identify signs of decrement during the spreading periods of the pandemic and signs of increment during the contraction periods. About future work, we aim at analyzing the C-SMOTE trade-off between the improvement in predictive performance and the amount of time and memory needed to train the models, and, even more, at trying to improve the C-SMOTE performance, in particular on the majority class. Thank you very much for your attention.