DeepBase: Deep Inspection of Neural Networks
Formal Metadata

Title: DeepBase: Deep Inspection of Neural Networks
Title of Series: SIGMOD 2019
Number of Parts: 155
License: CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/42951 (DOI)
Language: English
SIGMOD 2019, talk 82 of 155
Transcript: English (auto-generated)
00:01
My name is Thibaut, and this is DeepBase: deep inspection of neural networks. This is joint work with Kevin Lin, Ian Huang, Michelle Yang, Carl Vondrick and Eugene Wu, and all the work was done at Columbia University. So it's pretty clear that nowadays neural nets are popular, and people use them to solve all kinds of crazy new tasks,
00:21
like playing competitive video games, or driving cars. But of course not everything is so easy with neural nets. In particular, neural nets are notoriously hard to debug. And indeed, if you look inside a neural net, you will find this. The decision-making process of the system, so the internal logic, is embedded into thousands, sometimes hundreds of millions, sometimes billions of weights,
00:42
which are virtually impossible for a human to understand. This is why it has become almost a cliché to compare neural nets to black boxes, and black boxes are almost impossible to debug. Another problem that motivates this talk is that sometimes neural nets do the right thing, but for the wrong reason. To explain this, let me show you an example.
01:00
This here is a screenshot of VQA, Visual Question Answering, a system from Georgia Tech. The idea is that you show a picture to a model, you ask a question, and the model will give you an answer. So in this case, I show the picture of those two bears, ask what animal that is, and the system tells me that it's a bear; the blue bar here tells me that it has 99% confidence. It's pretty cool. So to illustrate the point,
01:23
let's poke the system a little bit. The questions that are interesting here are counting questions: "how many". So we can ask how many bears there are in the picture. And the model replies: two. We see that it has high confidence, 66.9%. I think it's pretty awesome that we got that far, and it's actually pretty spectacular. So you can say, OK, it seems that the model knows how to count.
01:43
So let's continue and probe the model a little bit further. We can ask how many rocks there are in the picture. This is a slightly harder question, because there are those two rocks here in the back, and you could ask: is it one big rock, or two separate rocks? It's kind of hard to say. And indeed, the model replies: somewhere between three and four, perhaps two.
02:01
So I don't think that's ridiculous, and it's pretty close to what humans would actually answer. Now let's continue and push the model even further. How many stroopwafels are there in the picture? In case you wondered, a stroopwafel is a typical Dutch cookie. It's delicious. You've probably never heard of it, but don't worry, because neither has the model. And here the answer is,
02:21
two. So at first glance, this result seems absurd, but actually it makes perfect sense. It makes perfect sense because two is the majority class. It turns out that most questions that start with the keywords "how many" have the answer two. So the model was actually very smart. It did not get the question. It completely missed the point. But it thought, maybe I'm going to try my luck with the majority class.
02:42
The problem here is that we lose confidence in the predictions of the model. If it pulls this kind of trick, how can we trust it? Let's go back to the first question here for a second. How many bears are there in the picture? Do you think the model replied two because it counted two bears? Or did it reply two because it tried to trick us? And this is a real screenshot. So the point that I would like to make here is that neural nets are really sneaky and they will pick up whatever
03:03
correlation they can to maximize their scores. Sometimes those correlations make no sense at all. Sometimes those correlations should never be picked up. So because those problems are so important, there's been a huge push in the machine learning community to try to better understand what happens inside those models. So this is a very fast growing field of research and many approaches are being developed concurrently.
03:25
But we identified a recipe, a very popular pattern, a very popular class of analysis. And in fact, it is so popular that we decided to give it a name: deep neural inspection. So all of those papers here are deep neural inspection papers. The idea behind deep neural inspection is to map a subset of the model,
03:41
for instance a hidden unit, a whole layer, or a selection of hidden units, to a function that we know and understand. We want to make statements like: it seems that those two hidden units here are in charge of recognizing the body of the bears. And we're going to do this with statistical methods, as I'll show you in a minute. So let me give you a few examples. The most popular example of deep neural inspection is perhaps Andrej Karpathy's blog post on recurrent neural nets.
04:03
I think anybody who's working with neural nets has seen it before. The idea is that Andrej Karpathy trained a model to predict the next character, so as to generate realistic C code. And what he realized is that there is one hidden unit that keeps track of the open and closed parentheses. It's the parenthesis neuron. Of course, this was a long time ago, 2015, which at machine learning speed is ancient history.
04:26
There have been many other things since. So just two months ago, at ICLR 2019 in New Orleans, a team from MIT and QCRI studied neural machine translation models, and they identified hidden units that are particularly sensitive to very specific classes of grammatical structures.
04:43
Like verbs in English: that's hidden unit 1902. Or articles in German. In the computer vision literature, same thing. At the same conference, another team from MIT looked at GANs, generative adversarial networks. Those are models that can produce images, and the team identified hidden units that are in charge of specific objects.
05:02
So here it's the trees-and-bushes hidden unit. If you ablate the hidden unit, all the trees and bushes will disappear from the picture. If you force its output, you will see some little bush and some little forest popping up here and there. It's very cute. To be clear, this is still a nascent field of research and there are many open questions. In fact, it's quite controversial.
05:22
What do these discoveries actually mean? How well do they generalize? How can we use them? What happens if we change the model? And so on and so forth. Nevertheless, what is beyond doubt is that this kind of approach is very popular. In fact, I can cite at least 10 papers that were published between the time our paper got accepted at SIGMOD and now. I think eight or nine of them are actually here at the bottom of the slide.
05:43
So we decided to go through the code bases of those approaches, and we made three discoveries. First discovery: those approaches vary in complexity, from a few hundred lines of code to thousands of lines of code, but very often they do the same thing over and over again. Second discovery: few of those code bases are optimized.
06:02
Some are, and they run very fast, but very often they take hours and hours to run, because the first-class citizen here is correctness. It's experimental code. It's not made to be used in production pipelines. Finally, third finding: all of those code bases are completely application-dependent, so they will not generalize. Therefore, we thought that there would be space for a declarative system, and this is DeepBase.
06:23
So DeepBase is our attempt to make deep neural inspection faster and easier. Before I describe the system, let me take a step back and say a few words about neural nets and neural net interpretation. Let's start with the basics. Neural nets 101: a neural net is basically a function that takes something as input and produces an output.
06:42
In this case, this is a character-level language model, which takes a character of C code as input and predicts the next character. The little white bubbles here are the hidden units. You can think of them as subroutines. On the slide there are four, but in practice there are thousands of them. And our aim is to understand what they do. The simplest way to do this is to run the model on a test set and keep track of, so log, the output of those hidden units.
07:04
We obtain time series that look like this one. In this case, we can look at it and make guesses. It seems here that this hidden unit outputs a low value for everything that's between parentheses and a high value everywhere else. So perhaps we found the parentheses neuron. It's actually exactly what Karpathy did in his blog.
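As an illustration of this logging step, here is a minimal sketch of how one could capture such hidden-unit time series with a PyTorch forward hook. The model, shapes, and names are hypothetical stand-ins, not DeepBase's actual implementation.

```python
import torch
import torch.nn as nn

# A stand-in for the character-level model; any nn.Module works the same way.
model = nn.LSTM(input_size=128, hidden_size=512, batch_first=True)

logged = []  # will hold one (seq_len, hidden_size) tensor per forward pass

def log_hidden_units(module, inputs, output):
    # For an LSTM, output[0] has shape (batch, seq_len, hidden_size): one
    # value per hidden unit per character, i.e. the time series we inspect.
    logged.append(output[0].detach().cpu()[0])

handle = model.register_forward_hook(log_hidden_units)
model(torch.randn(1, 100, 128))   # 100 stand-in encoded characters
handle.remove()

unit_42 = logged[0][:, 42]        # the time series of one hidden unit
```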
07:21
Of course, things get more interesting when you look at not one, but several neurons at once. So in this case, it's a bit harder to do things by just looking at the time series. So we need something more systematic. So what is quite popular in the machine learning literature is to use correlations. You would generate an artificial time series that would give you one every time you're between parentheses and zero everywhere else.
07:41
It's called a one-hot encoding; it's a type of binary encoding. And then you look for correlations. You can compute the correlation between this time series here and the output of every hidden unit. That gives you a kind of score that tells you to what extent each hidden unit seems to replicate the logic of this parenthesis function. So to be clear, it's just a correlation, not causation, but it's a starting point for analysis.
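As a sketch of this scoring step, the following computes the Pearson correlation between a one-hot parenthesis signal and every hidden unit. The data here is random stand-in data, purely for illustration.

```python
import numpy as np

# Random stand-in data: 512 hidden units logged over 100 characters, plus a
# one-hot signal that is 1 between parentheses and 0 everywhere else.
activations = np.random.randn(100, 512)
parens_signal = np.random.randint(0, 2, 100).astype(float)

def correlate_all(signal, acts):
    """Pearson correlation between one signal and every hidden unit."""
    signal = signal - signal.mean()
    acts = acts - acts.mean(axis=0)
    denom = np.linalg.norm(signal) * np.linalg.norm(acts, axis=0)
    return (signal @ acts) / denom

scores = correlate_all(parens_signal, activations)
best = int(np.argmax(np.abs(scores)))
print(f"Hidden unit {best} has correlation {scores[best]:.2f}")
```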
08:01
Another type of analysis that's popular in the literature is to look at groups of hidden units that work together to encode a signal. The way we detect those is to use a classifier. Suppose I have those two time series here at the top, and I wonder if together they encode the parenthesis signal. The way I can check this is to fit an SVM or a logistic regression classifier
08:23
and see to what extent I can predict the value of this parenthesis signal. If I can, then it means that those two hidden units together encode this signal. It turns out that everything I've shown you so far is an instance of the same recipe: deep neural inspection.
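A minimal sketch of this decoder-style analysis, assuming scikit-learn and random stand-in activations; the real analyses use the actual logged time series.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random stand-in data: two hidden units logged over 1,000 characters, plus
# the parenthesis signal we suspect they jointly encode.
pair = np.random.randn(1000, 2)           # activations of the two units
signal = np.random.randint(0, 2, 1000)    # 1 inside parentheses, 0 elsewhere

# Fit a linear decoder on one part of the data and test on the rest; a high
# held-out score means the two units together encode the signal. An SVM
# (sklearn.svm.SVC) could be swapped in the same way.
clf = LogisticRegression().fit(pair[:800], signal[:800])
print(f"Decoding accuracy: {clf.score(pair[800:], signal[800:]):.2f}")
```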
08:40
So now let's look at the general form. It's based on three steps. First, you take your model and your data set, run the model on the data set, and log some behaviors. In everything we've seen so far we logged the output of the hidden units, but you can look at other things, like the gradients, or the activation of the gates if you use LSTMs. Second step, you formulate some hypotheses about what you think the model is doing, and you materialize those hypotheses as the time series here.
09:05
Final step, you use a scoring method, most likely some measure of statistical dependency, to create a mapping between the hidden units of the model, at the bottom here in gray, and the hypotheses, here in red. DeepBase is a tool to make this process systematic and hopefully easier.
09:21
To see how it works, let's look at an example. This is an ACL 2017 paper in which Belinkov and his team, from Harvard and MIT, I think, asked themselves the following question: how much grammar does a translation model know? More exactly, they were interested in part-of-speech tags, and they wanted to know if the model was particularly sensitive to, say, nouns or adjectives or adverbs or verbs, and so on and so forth.
09:45
So they ran the following analysis. They took six models for six different pairs of languages. They ran them on a corpus called the TED corpus, which has 2.3 million words, and they recorded everything: those are the time series here in the middle. Then they ran a part-of-speech tagger on the data, and they checked to what extent
10:00
they could reconstitute the part-of-speech tags by looking only at the activation of the hidden units. The part-of-speech tags are the red time series here; the first one, for instance, gives you one for every verb and zero everywhere else. They found that, indeed, it is possible to a large extent to predict the part-of-speech tags by looking at the activation of the hidden units.
10:20
Nevertheless, this varies based on which language you use and which part-of-speech tag you look at, which indicates that, indeed, the model does have some sensitivity to specific types of grammatical constructs. So suppose I wanted to reproduce this analysis with DeepBase, how would I do it? It's quite simple. First of all, we load the data set, then we load the models.
10:41
In this case, I load two models, English to French and English to Arabic. Then I load the part-of-speech taggers. This is an array, and every tagger in it, tag_VB and so on, is a Python function. In DeepBase, we call those hypothesis functions, because they encode the hypotheses that we want to verify about the model.
11:01
They can be any Python function, as long as the dimension of the output is compatible with the dimension of the layer that we inspect. Finally, I specify the score that I want, in this case the logistic regression score. I call DeepBase's inspect function, and the system works for a little while and produces a table that gives me the different scores for the different hypotheses and models.
11:22
Then, of course, it's possible to export these scores to our favorite plotting tool. What's nice here is that the analysis runs in five lines of code. Okay, there are some helper functions; if I include that code, it may be a few dozen lines of code. But we are pretty far away from the thousand-ish lines of code of the initial code bases.
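As a rough illustration of what such a script might look like, reconstructed from the talk: the deepbase module and every function name below are hypothetical placeholders, not the system's verbatim API.

```python
# Illustrative sketch only: names are hypothetical, not DeepBase's real API.
import deepbase as db

dataset = db.load_dataset("ted_talks")                    # evaluation corpus
models = [db.load_model("en_fr"), db.load_model("en_ar")]

def tag_VB(tokens):
    """Hypothesis function: 1.0 for every verb token, 0.0 elsewhere."""
    return [1.0 if t.pos == "VERB" else 0.0 for t in tokens]

scores = db.inspect(models, dataset, hypotheses=[tag_VB],
                    score="logistic_regression")          # the score table
scores.to_csv("scores.csv")                               # export to plot
```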
11:42
In everything we've seen so far, we only used text models. But of course, it's possible to do the same type of analysis for images. This is a paper from CVPR 2017 in which the authors ask the following question: how do object detection neural nets work? They ran the following experiment. They took 65 convolutional neural nets.
12:02
They ran them on a bank of 63,000 images called the Broden dataset, and they recorded everything. So now it's three-dimensional surfaces, not time series anymore, because we look at convolution layers. You can think of the X and Y axes as pixels in the image, and the Z axis as the response of a convolutional filter.
12:20
It's kind of like the same thing, but we added one dimension: it's not in the time domain anymore, it's in space now. The authors asked crowd workers to annotate objects, textures and materials, for instance tables and chairs and wood, and so on and so forth. And they checked to what extent the visual concepts, so the masks, actually overlap with the activations.
12:45
And they did this with something called intersection over union, which is kind of like the Jaccard distance. They found that, indeed, there are hidden units in the model which are responsible for specific object classes. So here again, if I wanted to do this analysis with DeepBase, I could do it with only a few lines of code.
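A minimal sketch of that intersection-over-union score between a thresholded activation map and a concept mask, on random stand-in data; the top-5% activation threshold is an illustrative assumption, not necessarily the paper's.

```python
import numpy as np

# Random stand-in data: one convolutional filter's response map and a
# crowd-sourced concept mask (e.g. the pixels labeled "tree").
activation = np.random.rand(112, 112)
concept_mask = np.random.rand(112, 112) > 0.5

# Binarize the activation map (keep the top 5% of responses) and compare it
# to the concept mask with intersection over union.
active = activation > np.quantile(activation, 0.95)
iou = (np.logical_and(active, concept_mask).sum()
       / np.logical_or(active, concept_mask).sum())
print(f"IoU: {iou:.3f}")
```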
13:04
Also, as a sanity check, we downloaded the code base of that system, and we compared its output to ours. The results are here. Every bubble on this chart represents a visual concept, so an object or a texture. On the X axis, we have the score computed by NetDissect.
13:22
On the Y axis, we have the DeepBase score. And we find that there is a very strong correlation. There is a little spread here, which we explain by the fact that the score is actually non-deterministic: it's an approximation based on sampling. OK, so, optimizations. A nice feature of DeepBase is that it optimizes the computations for you.
13:40
To see how it does this, let's first review the naive approach. First, you would materialize all of the activations from the model, so a big set of time series. Then you would materialize all of the outputs of the hypothesis functions, and then you would compute all the correlations. The problem is that if you do this, you're going to incur a lot of pressure on main memory. We can do a very simple back-of-the-envelope calculation.
14:03
If I wanted to run VGGNet, which is a very popular image analysis model, and record the output of all of its layers, that's about 900 megabytes per image. And in the Broden dataset, we have 63,000 images. So that's a lot of data. If this is not a data management problem, I don't know what is.
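A quick back-of-the-envelope check of those figures:

```python
# Back-of-the-envelope calculation, using the figures quoted in the talk.
per_image_gb = 0.9          # ~900 MB of activations per image for VGGNet
n_images = 63_000           # images in the Broden dataset
print(f"~{per_image_gb * n_images / 1000:.0f} TB")  # roughly 57 TB in total
```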
14:22
One obvious trick that we can use to alleviate this is to partition the data: we partition the data into blocks, and for each block, we update the score matrix. If we have a GPU, we can do this very fast, in parallel. Another trick that we can use is early stopping. The idea is that we compute those correlations, maintain some confidence bounds around them, and stop the processing when we reach a certain target confidence, like 95%.
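A toy sketch of that early-stopping idea, using a Fisher z-transform to get an approximate 95% confidence interval around a running correlation; the block size, stopping threshold, and data are illustrative assumptions, not DeepBase's actual algorithm.

```python
import numpy as np

# Process the data in blocks and stop once an approximate 95% confidence
# interval around the running correlation is tight enough.
rng = np.random.default_rng(0)
unit = rng.standard_normal(900_000)                 # one unit's time series
hypo = 0.3 * unit + rng.standard_normal(900_000)    # a correlated hypothesis

block, seen_x, seen_y = 50_000, [], []
for start in range(0, len(unit), block):
    seen_x.append(unit[start:start + block])
    seen_y.append(hypo[start:start + block])
    n = start + block
    r = np.corrcoef(np.concatenate(seen_x), np.concatenate(seen_y))[0, 1]
    # Fisher z-transform: approximate 95% CI for a Pearson correlation.
    z, half = np.arctanh(r), 1.96 / np.sqrt(n - 3)
    lo, hi = np.tanh(z - half), np.tanh(z + half)
    if hi - lo < 0.01:                              # tight enough: stop early
        print(f"stopped after {n:,} samples, r = {r:.3f}")
        break
```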
14:41
To check how well these optimizations work, we created a benchmark. The starting point for the benchmark is the parenthesis neuron experiment: we basically replicated that experiment with our own version, but applied it to hundreds of different grammar rules.
15:02
This was inspired by an EMNLP 2016 paper. The models that we inspect are SQL auto-completion models: they take a window of 30 characters as input and predict the next character. The hypotheses that we use in the benchmark are the grammar rules of SQL, binary encoded.
15:20
For instance, we have one hypothesis function that gives you one for every table name and zero everywhere else, one hypothesis function that gives you one for every equality predicate and zero everywhere else, and so on and so forth. The nice thing with this setup is that we can arbitrarily scale the number of hypothesis functions that we use to probe the model. By default, we have 142 of those.
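A hypothetical binary-encoded hypothesis function in this spirit might look like the following; the regex-based tagger is a simplistic illustration, not the benchmark's actual implementation, which would derive labels from the SQL grammar.

```python
import re

# Hypothetical hypothesis function: returns 1.0 for every character inside
# a table name and 0.0 everywhere else (tables assumed to follow FROM/JOIN).
def tag_table_names(query: str) -> list[float]:
    labels = [0.0] * len(query)
    for m in re.finditer(r"(?:FROM|JOIN)\s+(\w+)", query, re.IGNORECASE):
        for i in range(m.start(1), m.end(1)):
            labels[i] = 1.0
    return labels

print(tag_table_names("SELECT a FROM users JOIN orders"))
```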
15:42
We run the model on 29,000 records, so that's about 900,000 characters, and the model has 512 hidden units. Here on the left side, I show the runtime of DeepBase as we vary the number of hypothesis functions. In the middle, we vary the number of records, and on the right side, we vary the number of hidden units.
16:01
The blue curve here, on top, shows the runtime of DeepBase unoptimized. It's basically vanilla Python and NumPy, and it's very close in spirit to the scripts that you will find online. We see that in almost all cases, the analysis takes over one hour, which makes sense because there are no optimizations. The red line here shows the impact of early stopping.
16:24
The idea is that we materialize the full time series, but we stop computing the correlations when we have 95% confidence. We see that in most cases, the analysis then runs in less than 10 minutes. And finally, the green line here in the middle shows DeepBase with all of the optimizations enabled.
16:41
And we see that we are basically insensitive to the number of records in the database. Of course, this was just a glimpse of our experiments, and I invite you to read the full paper for more results. To conclude, we presented deep neural inspection, which is a broad class of machine learning diagnostic methods. And we presented DeepBase, a declarative system to make this easier and faster.
17:03
For DeepBase, this is just the beginning of the story. Currently, we're trying to make it work for bigger models and for bigger collections of models. And indeed, everybody talks about using smaller, more interpretable models, but it does not seem that models are getting any smaller. In particular, I'm thinking about the natural language processing community.
17:20
Currently, there is almost a war raging over who is going to build the biggest language model. Those models ingest a lot of data, but they also produce a lot of data. To understand what they do internally, we're going to need a lot of math, but also a lot of systems work. So I think there are a lot of very exciting research challenges here for the people in this room. Thank you very much.