TIB AV-Portal

Raha: A Configuration-Free Error Detection System


Formal Metadata

Title: Raha: A Configuration-Free Error Detection System
Title of Series:
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date:

Content Metadata

Subject Area

Abstract

Detecting erroneous values is a key step in data cleaning. Error detection algorithms usually require a user to provide input configurations in the form of rules or statistical parameters. However, providing a complete, yet correct, set of configurations for each new dataset is not trivial, as the user has to know about both the dataset and the error detection algorithms upfront. In this paper, we present Raha, a new configuration-free error detection system. By generating a limited number of configurations for error detection algorithms that cover various types of data errors, we can generate an expressive feature vector for each tuple value. Leveraging these feature vectors, we propose a novel sampling and classification scheme that effectively chooses the most representative values for training. Furthermore, our system can exploit historical data to filter out irrelevant error detection algorithms and configurations. In our experiments, Raha outperforms the state-of-the-art error detection techniques with no more than 20 labeled tuples on each dataset.
Hi, I am from TU Berlin, and I am very happy to present our recent project, Raha, a configuration-free error detection system. Before I start, let me mention that this project is a collaboration between TU Berlin and MIT. Error detection is a very important problem in data cleaning. Given a dataset like this one, the goal is to detect the values that are wrong. Here, our example is very small and easy, and you can easily identify these four values as dirty values, because, for example, the capital of Spain is Madrid. But in the real world you have much larger and dirtier datasets, and the question is how we can detect the dirty values in such a dataset. One way to do this task is to ask a domain expert. Indeed, the naive approach is to ask a domain expert to go over the dataset and mark the dirty values. But, as you can imagine, this task would be tedious, and it is not scalable for large datasets. That is why the main motivation, the main goal of our system, is to reduce the human involvement.
But asking a human is not the only option; there are also algorithmic approaches. Traditionally, there are a lot of algorithms that can detect data errors for us. For example, here you have one of the rule violation detection algorithms. These algorithms usually take a dataset and a set of configurations as input. For a rule violation detection algorithm, the configuration could be a data constraint such as a functional dependency, and the algorithm then marks the data values inside the dataset that violate it as dirty values. There are also other algorithms: we can have an outlier detection algorithm that takes statistical parameters, we can have a pattern violation detector that takes a pattern in the form of a regular expression, and we can also have a knowledge base violation detection algorithm that queries a knowledge base and marks the values that violate its entity relationships.

So one question is: if we have such a nice gallery of error detection algorithms, why is the problem still challenging? The reason is that there are two important challenges. The first challenge is algorithm selection. You have these algorithms on the shelf, but which of them should be run on your dataset? If you only run one of them, you will have poor recall, because each algorithm is usually a specialized expert that detects only one kind of data error. If you run all of these algorithms, you will have poor precision, because not all of them are a good choice for every dataset. The second challenge is algorithm configuration: even if we know which algorithm is best for a dataset, we still need to configure it. For example, for an outlier detection algorithm, you have to provide some statistical parameters. These are the two challenges that we want to address in this paper.
To address these challenges, we formalize our problem as a classification task. Given a dirty dataset, the goal is to train one classification model per data column. We do it per data column because the data values inside a column come from the same domain and are therefore comparable. The goal is to use these trained classifiers to predict the labels of the data values: for example, here one value is a correct capital, but "123" is not. As you can imagine, we have to address two important research questions. First, we need to featurize our data values, because a classifier can only understand feature vectors, not raw values. Second, we need to develop a sampling approach, because the classifier needs some data values of the dataset to be labeled by the user.

Let us start with featurization. Our solution for featurizing a data value is to use the existing error detection algorithms themselves; we can leverage them because they are designed to detect data errors. Basically, we run these algorithms on the dataset: whenever an algorithm marks a data value, we put a 1 inside the feature vector of that value, and whenever it does not mark the value, we put a 0. For example, here a value has a 1 as its first element because the first algorithm marked this value in the dataset, while the second algorithm did not mark it, so there is a 0 inside its feature vector. But, if you remember, I told you that configuring these algorithms is very challenging, so the question is: if we want to run these algorithms, how should they be configured? Our basic idea is to use a broad but limited range of configurations for these algorithms: we simply run each algorithm with many different configurations.
Concretely, we have different algorithms, and each algorithm can take several configurations. So, first of all, we define one error detection strategy as the combination of one algorithm and one configuration. This means we have (number of algorithms) times (number of configurations) error detection strategies. Let me give you some examples of such strategies.

Here you can see three strategies. The first one is an outlier detection strategy: "does the value appear less than 2 times?" Here, "2 times" is the configuration; the parameter could also be 3 times or 4 times. The second one is a pattern violation strategy: "is the value non-alphabetical?" The configuration could also be "is the value non-numerical?" or any other kind of pattern. And the last one is a rule violation detection strategy: "does the value violate one functional dependency?", where the functional dependency is the input configuration. When we run these strategies on our dataset, each outputs a different set of values. For example, here the first strategy outputs all values that appear less than 2 times in the dataset; again, four values are output by this strategy. As you can see, not all of the strategies are completely accurate, but that does not matter: as long as a strategy marks values following one consistent logic, it still gives us useful features to compare the data values with each other. Then, by concatenating the outputs of the different strategies, we can create a feature vector for each data value. Since we run all the algorithms with a wide range of configurations, we get a very expressive feature representation; in practice, we can have thousands of features for each data value.
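The featurization step can be sketched as follows. This is a minimal illustration, not Raha's actual code; the strategy functions, their names, and the example column are my own assumptions.

```python
import re
from collections import Counter

def outlier_strategy(column, threshold):
    """Mark values that appear fewer than `threshold` times in the column."""
    counts = Counter(column)
    return {i for i, v in enumerate(column) if counts[v] < threshold}

def pattern_strategy(column, pattern):
    """Mark values that do not fully match the given regular expression."""
    return {i for i, v in enumerate(column) if not re.fullmatch(pattern, v)}

def featurize(column, strategies):
    """One 0/1 feature per strategy: did that strategy mark this value?"""
    marked = [s(column) for s in strategies]
    return [[1 if i in m else 0 for m in marked] for i in range(len(column))]

column = ["Madrid", "Madrid", "123", "Berlin"]
strategies = [
    lambda c: outlier_strategy(c, 2),             # frequency threshold of 2
    lambda c: pattern_strategy(c, r"[A-Za-z]+"),  # alphabetic values only
]
print(featurize(column, strategies))  # [[0, 0], [0, 0], [1, 1], [1, 0]]
```

Note how "Berlin" gets the vector [1, 0]: the outlier strategy marks it (it appears once), while the pattern strategy does not, so even imperfect strategies separate the values in feature space.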
That covers featurization; the next challenge is sampling. As I told you, we need to sample some data values to be labeled by the user, because we want to train our classifiers. A very basic approach would be to just randomly sample data values from our dataset. This works, but it is inefficient, because each label given by the user yields only one training data point. A more interesting question is: can we get more than one training data point per user label? The answer is yes, based on the cluster assumption from machine learning: two data points are likely to have the same class label if they lie inside the same cluster. Therefore, we cluster the data values in each column based on their feature representation. Let me show you this process on a simple example.

Remember that we featurized the data values of our table. For the sampling, we first cluster the data values inside each column; here we have two clusters per column, a blue one and a red one. Then we ask the user to label one data value per cluster. Let us say the user labels one value as correct and the value "123" as dirty. Strictly speaking, we now have user labels for only these two values, but the good thing is that we can also propagate the user labels through the clusters and obtain some noisy labels. This boosts the number of training data points we have: here the clusters are small, but in a large dataset the clusters are much bigger, and one user label then gives us many more training data points via label propagation. You may ask: what about the values that we do not have a label for? The answer is straightforward: once we have labels for a subset of the values, we train the classification model to predict the labels of the rest.
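The label propagation idea can be sketched as below. This is a hypothetical, dependency-free simplification: Raha clusters each column's feature vectors hierarchically, whereas here a "cluster" is just a group of identical feature vectors.

```python
def propagate_labels(features, user_label):
    """Group rows with identical feature vectors, ask the user to label one
    representative per group, and copy that label to every group member."""
    groups = {}
    for i, f in enumerate(features):
        groups.setdefault(tuple(f), []).append(i)
    labels = {}
    for members in groups.values():
        lab = user_label(members[0])   # one user question per cluster
        for i in members:              # propagated (possibly noisy) labels
            labels[i] = lab
    return labels

features = [[0, 0], [0, 0], [1, 1], [1, 0]]
# Pretend the user marks value 0 as clean (0) and values 2 and 3 as dirty (1).
labels = propagate_labels(features, lambda i: 0 if i == 0 else 1)
print(labels)  # {0: 0, 1: 0, 2: 1, 3: 1}
```

One user answer for value 0 also labels value 1, because both share the feature vector [0, 0]; this is exactly the "one label, many training points" effect described in the talk.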
After that, we should decide how many clusters we want inside each column; the number of clusters per column is a parameter here. Since we want to be completely configuration-free, we do not want to ask the user to provide the number of clusters. Instead, we use hierarchical clustering models. We iteratively split the clusters inside each data column, and in each iteration we draw one tuple whose cells cover as many unlabeled clusters as possible across the data columns, and we ask the user to label the values inside this tuple. In the next iteration, we re-cluster each data column, and we continue this iterative process until the user's labeling budget is exhausted.
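This iterative budgeted loop can be illustrated for a single column as below. Everything here is a simplification I am assuming for the sketch: Raha actually samples whole tuples across columns, and the single-link clustering with Hamming distance stands in for its real hierarchical clustering.

```python
def hamming(a, b):
    """Distance between two 0/1 feature vectors."""
    return sum(x != y for x, y in zip(a, b))

def cut_into(features, k):
    """Single-link agglomerative clustering, cut at k clusters."""
    clusters = [[i] for i in range(len(features))]
    while len(clusters) > k:
        best = None                       # closest pair of clusters so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(hamming(features[i], features[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] += clusters.pop(b)    # merge the closest pair
    return clusters

def sample_values(features, budget, user_label):
    """Each iteration: one more cluster; label the most-unlabeled one."""
    labels = {}
    for k in range(2, budget + 2):
        clusters = cut_into(features, k)
        target = max(clusters, key=lambda c: sum(i not in labels for i in c))
        lab = user_label(target[0])       # one user question per iteration
        for i in target:                  # propagate inside the cluster
            labels[i] = lab
    return labels

labels = sample_values([[0, 0], [0, 0], [1, 1], [1, 1]], budget=2,
                       user_label=lambda i: 0 if i < 2 else 1)
print(labels)  # {0: 0, 1: 0, 2: 1}
```

With a budget of two questions, three of the four values end up labeled; the remaining value is left for the trained classifier to predict.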
So far, we have tried to optimize the effectiveness of our solution and also to minimize the user involvement. That may raise the question of how good our efficiency is, because we are running a lot of error detection algorithms with a large set of configurations. In our system, we also try to optimize the efficiency.

One step in our system is designed to optimize the runtime. As you remember, we run all the algorithms with all the configurations we have, and this can be time-consuming, because we may have thousands of algorithm-configuration combinations. The goal of this step is to filter out the algorithms and configurations that are not going to be promising for us. To do that, we use historical data: if the user has some datasets that have already been cleaned, we can compare our new dirty dataset to the previously cleaned datasets and then run only those error detection strategies on the new dataset that had a good performance beforehand on the previous datasets.
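This historical filtering step could look roughly like the sketch below. The dataset profiles, the Jaccard similarity, the F1 threshold, and the strategy names are all illustrative assumptions, not Raha's actual implementation.

```python
def filter_strategies(new_profile, history, threshold=0.5):
    """Keep only strategies whose past F1 score on the most similar
    previously-cleaned dataset exceeded `threshold`.
    `history` maps dataset profiles to {strategy_name: past_f1}."""
    def similarity(p, q):
        # Jaccard similarity over simple profile tokens (an assumption).
        p, q = set(p), set(q)
        return len(p & q) / len(p | q) if p | q else 0.0
    best = max(history, key=lambda p: similarity(new_profile, p))
    return [s for s, f1 in history[best].items() if f1 >= threshold]

history = {
    ("city", "country"): {"outlier<2": 0.8, "regex-alpha": 0.3},
    ("price", "date"): {"outlier<2": 0.2, "regex-alpha": 0.9},
}
print(filter_strategies(("city", "capital"), history))  # ['outlier<2']
```

The new dataset's profile overlaps with the ("city", "country") dataset, so only the strategy that worked well there survives the filter, and the unpromising one is never run.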
We evaluated our error detection system on the typical publicly available datasets; we used most of the well-known datasets in the community. We also have seven baselines, as you can see here; we used the recent error detection systems as our baselines. We use the typical evaluation measures of precision, recall, and the F1 score that combines these two measures, and we also report the runtime and the number of labeled tuples. We ran more than ten experiments on different aspects of our solution, but here, because of the time limitation, let me just show you two charts.

The first chart, on the left, shows the performance of Raha against the standalone error detection systems; these standalone systems are error detection algorithms that are configured by the user. On the right-hand side, you can see a comparison of Raha to the error detection aggregators; these aggregator tools combine the outputs of different standalone error detection systems. You can see that Raha quickly converges and outperforms all the baselines; it converges at around 20 labeled tuples. The reason comes down to two things: the first is our very expressive feature representation, and the second is our clustering-based sampling approach with label propagation.
So, traditional error detection is painful for the user: you have to provide rules and parameters, or you have to provide a lot of training data. Raha relieves the user from the tedious task of configuring algorithms, and it also needs a very small amount of training data: with only around 20 labeled tuples, it outperforms all the other baselines. We have many more experiments in the paper. I would like to thank my co-authors, and I would like to encourage you to check out our paper and our published open-source tool, and to send us your feedback. Thank you.