Machine learning for patient stratification from genomic data

Formal Metadata

Title
Machine learning for patient stratification from genomic data
Series Title
Number of Parts
34
Author
License
CC Attribution 3.0 Unported:
You may use, modify and copy, distribute and make the work or content publicly available in unchanged or modified form for any legal purpose, provided that you credit the author/rights holder in the manner specified by them.
Identifiers
Publisher
Publication Year
Language

Content Metadata

Subject
Genre
Abstract
The possibility to collect genomic profiles (gene expression, mutations, ...) from cancer patients paves the way to automatic patient stratification from genomic data, for example to predict survival, risk of relapse, or response to a therapy. The stratification rule itself is usually estimated automatically on retrospective cohorts of patients with both genomic information and outcome, using a regression or classification algorithm. This estimation problem is, however, challenging from a statistical point of view, since candidate genomic markers (expression, mutations, ...) usually far outnumber the patients in the cohorts. In this talk I will illustrate the difficulty of estimating such genomic signatures and present a few methods we have developed in recent years to improve their estimation, in particular the use of gene networks and of permutations to learn signatures from gene expression or somatic mutations.
Keywords
Lecture/Conference
Transcript: English (automatically generated)
Thanks for the introduction, and thanks for coming here. So, yes, sorry about that.
I think I don't even have the logos of all the institutions you mentioned, because you mentioned Inserm and I forgot that logo, sorry. Now, the reason why there are so many logos here is just that, physically, I work in the same place, in the same street as all of them. But let's say that ENS is more in the math and computer science domain.
Institut Curie is a hospital and research center doing biology, and mine, Mines ParisTech, is computational biology. So I try to work at the interface of these domains. All right, so my talk may be a little bit different from many of the talks this week. I'm not a biologist for sure. What I would like to discuss is more some difficulties we have when we deal with genomic data
and how we try to overcome some of them. And more particularly, I will talk of how can we do inference, how can we learn from genomic data, because it seems to make sense nowadays when you want to study biology to generate data.
We generate lots of data, and we say, well, from the data, we're going to learn something. And I will discuss this problem of learning from data and why it's not so obvious. And to be concrete, since this is a session about health, I will motivate the problem with health applications,
in particular in personalized or precision medicine, for example, in cancer. So there is a view that nowadays we can improve the way we manage the disease and we treat patients by capturing lots of data about each individual. So someone with a cancer, we can now sequence the person, read the genome of the tumor, read the genome of the person, measure proteins, do imaging,
so we can collect lots of data about each individual. And by looking at the data, we can observe that there is a big diversity among cancers: no two persons have the same cancer at the molecular level. So maybe by analyzing all these very precise data, we can predict or suggest specific treatments, specific ways to approach the disease,
and for example, give different drugs or different treatments to different persons based on their molecular profiles. Okay, so there has been some success in that without going to, you know, all the full genome sequencing. It's well known that some markers, like the expression of some proteins,
are sometimes predictive of the response to some treatments. So this is used in the hospital already. And what we would like to do overall is to go a bit further: not look at one or two proteins, but maybe look at the full genome, the full images, the full proteome, and end up with, let's say, algorithms that would take these data as input
and make a suggestion of a treatment to give. Okay, so the difficulty here is that we don't know what should be the algorithm. You know, if, I mean, if I give you all these data and ask you what drug should you give, it's not obvious. We have some knowledge, but it's very partial,
and so we would like to automate the process and design algorithms or programs that would do something automatically. And the way science proceeds these days is to say: maybe you can just collect lots of data about many people, observe, you know, how the disease behaves
or the response to different therapies, and try to find associations, correlations between what we measure and the fact that some drugs work or not, for example. And then if we capture some association, maybe we can say that based on what we have observed in large cohorts of patients, on clinical trials, for example, then we can suggest that, for example,
people who express a particular protein or people with particular mutations tend to respond well to the blue drug, and therefore we can suggest that the people should be given the blue drug. Okay, so it makes sense to have this approach. And so this is typically a machine, you know, what we call machine learning approach or statistical approach, where we are not sure to understand why the blue drug would work,
but just from empirical observations, we can find some association between what we measure as input and what the output is, right? So I'm going to talk about this process of how to design an algorithm that would decide what to suggest as output from the input,
so what drug should we give from, let's say, the genomic information. And more precisely, how this algorithm can be designed by analyzing lots of data where we have collected information about patients and the corresponding response. Now, statistically speaking, it's a very standard and classical problem
called regression or classification. And for example, in a textbook on statistics or machine learning, you often see this kind of picture, which is an abstraction of the problem. The problem we have to solve is the following. We have collected data about a number of patients with cancer,
let's say we have given them some blue drug, and we have also observed that sometimes the drug works and sometimes it doesn't. Then what we want to learn, or to infer, is a rule that would predict the effectiveness of the drug from the input, right? So mathematically, or visually, you can imagine that you have points
where each point would be a patient. The position of a point, I plot it in two dimensions here, just as if I had measured two values, like the expression of two genes, so the x-coordinate would be one gene, the y-coordinate would be a second gene. Then you can plot the patients, and then they have colors that would be, let's say,
whether they respond or do not respond to the drug. And then the statistical question, or the inference process, is: from this picture, can you learn a rule that could then be applied to predict the color of patients? And this picture is quite obvious, and your brain does that all the time. That's why we call it machine learning, in reference to the brain's capacity to learn.
We observe that there is a trend to have black dots here on the upper right and white dots on the lower left, and therefore that maybe there is a rule like this one that separates the responders from the non-responders. And so if you can design this line, then in the future when we see a new patient, we could predict the color of the patient from its position
and then decide that maybe we should give the drug to this person and not to the other one, right? So this is a well-defined and well-understood problem, and this picture here is a solved problem in mathematics, in statistics; it was solved 100 years ago. Logistic regression or more recent, fancier approaches,
decision trees, for example, solve this problem exactly. Now, when I say it's solved, it's solved especially when you have, let's say, 19 points in two dimensions. You see the 19 points; they are here. In 2D, it's easy.
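As a minimal illustration of that two-dimensional setting, here is a short sketch with made-up data (not the cohort on the slide), using scikit-learn's logistic regression to fit such a separating line:

```python
# Sketch: fit a linear decision rule (logistic regression) on a toy 2D cohort.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 19 "patients", each described by the expression of 2 genes.
X = rng.normal(size=(19, 2))
# Toy labels: responders tend to sit in the upper-right corner.
y = (X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=19) > 0).astype(int)

model = LogisticRegression().fit(X, y)
print("slope (beta):", model.coef_, "intercept:", model.intercept_)
# Predict the "color" of a new patient from its position in the plane.
print("prediction for a new patient:", model.predict([[1.0, 0.5]]))
```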
Now, if you apply this approach to genomic data, there is a problem, which is that the genomic data are not really 19 patients in two dimensions, because in genomics, typically, for each patient you don't measure two genes; what we want is to measure a lot of things. This is why we talk of genomics, right? It's not only two proteins.
It's more we can measure all the proteins, all the genes, all the mutations in DNA, images, et cetera. So, you can imagine that now a point, a patient, is not a point in 2D, but it may be a point in one million dimensions if we measure the mutations, for example, right? So, you have conceptually the same problem. As before, you have a cohort of patients,
let's say black and white in the sense of responders and non-responders, but now each point is not a point in 2D. It's a point in one million dimensions. And when you think of the process of finding this line that separates, you know, the two types of points, in 2D it's easy because it's super simple to have a line that separates these two things,
but in higher dimensions, there are an infinite number of ways to separate 19 points in one million dimensions, right? And suddenly, you can show even mathematically, statistically, that the problem becomes ill-posed. There is no easy solution to separate. I mean, there are too many solutions,
in a sense, to separate the black from the white, and it leads to a process that we call overfitting, in a sense that it's very easy to separate the ones we see, but it's very hard to ensure that the rule we have found will be good for future patients. Because, in a sense, there are so many ways to separate the black from the whites that you pick one of them, but there is no reason why it should be a good one
and why it should be the one that predicts the thing. Okay, so it's a long introduction for something well-known, which is that doing statistics in what we call high dimensions is hard when we don't have enough points, and the standard situation in genomics is really this one, is that we typically, we don't have 19 patients, but sometimes we have 100 or 1,000,
sometimes 10,000 patients. It sounds like a lot when you are at the hospital and say: I've sequenced 10,000 people; you know, that's already a massive investment, it takes time, and these are people. But mathematically, 1,000 points in a million dimensions is not good news, right? It's really the situation which is hard. All right, so is it just a conceptual issue?
Well, I think not only, because there are signs that something doesn't really work in many of the genomic studies we do or publish. Just to give you... Which cancer was it, these 19 patients? Oh, it was not a cancer; this one is just a theoretical model
to illustrate the problem. Now, these are real data, so I will talk more precisely about some cancers later. This could be breast cancer, where we can have public data with 2,000 samples and the response to some treatment. Can I ask a naive question about this?
So actually, I thought that if you just stayed within two dimensions, the more patients you have, the better; you get the curve better. So couldn't the machine, not you, run over patients
with the same disease, just two genes at a time, and then segregate them by this? Everybody who responds is either here or here; with two genes, you will know where to put them. So why do you need to do it in a million dimensions? Well, if you do that, I mean, either you just decide by yourself
that you will just focus on two particular genes, and then you have one rule. Or the machine goes through every pair of genes. So you can do that, but then you still hit the same problem: suppose you pick two genes among a million, then you have a million times a million divided by two ways to do that.
So you have many of these pictures, like 10 to the 12 such pictures, and among them many will show a perfect separation, just because you try so many; it's a question of multiple testing. But there will be a gray area: if the endpoint is, say, relapse within five years, there will be patients
at four and a half years, or close to five, so there will be a lot of gray area. No, less than five. Yeah, yeah, but what I am saying is that, for purely mathematical reasons, if you try 10 to the 12 2D plots, you will have maybe 10 to the 9
with perfect separations, just because you try so many. And so then the question is, what do you do with that? If you say, I try all of them and I pick the best one, then this is overfitting. It will look like you have found two magic genes that separate the patients, but when you try a new patient, it will not work, and you will ask yourself why. It's because you did not correct for multiple testing, because you had too many things to try.
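To make that argument concrete, here is a toy simulation (entirely synthetic data, with the number of genes scaled down so it runs in seconds): scanning thousands of two-gene plots of pure noise and keeping the best-looking one gives a rule that looks great on the training cohort and fails on new patients.

```python
# Sketch (fake data): best-of-many 2-gene rules overfit a tiny cohort.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_train, n_test, n_genes = 19, 1000, 100
X_train = rng.normal(size=(n_train, n_genes))   # pure-noise "expression"
X_test = rng.normal(size=(n_test, n_genes))
y_train = rng.integers(0, 2, n_train)           # labels unrelated to the genes
y_test = rng.integers(0, 2, n_test)

best_acc, best_pair, best_clf = 0.0, None, None
for i, j in combinations(range(n_genes), 2):    # 4,950 two-gene plots
    clf = LogisticRegression(C=1e6, max_iter=1000).fit(X_train[:, [i, j]], y_train)
    acc = clf.score(X_train[:, [i, j]], y_train)
    if acc > best_acc:
        best_acc, best_pair, best_clf = acc, (i, j), clf

print("best pair of 'magic genes':", best_pair,
      "training accuracy:", round(best_acc, 2))
print("accuracy on new patients:",
      round(best_clf.score(X_test[:, list(best_pair)], y_test), 2))
```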
So we have a problem, and you're going to solve it? Exactly. But before that, I will just tell you that it's not only a theoretical problem. For example, let's take breast cancer.
This is one of the applications that people have looked at the most: can we use, not all genomic data, but, for example, gene expression data? So now we can measure the expression of 20,000 genes, so we can map each patient into 20,000 dimensions and ask: can we separate... Here I will just show
not a drug, but the risk of relapse, which is important when someone has breast cancer, if you can evaluate precisely the risk of relapse, then you can inform whether you should give chemotherapy or not, so it can really impact the treatment. So people have been very excited about using genomic information for that,
to replace what has been done in the clinics for many years, which is to make an estimation of the risk of relapse from the age of the person or the size of the tumor, which works to some extent, but is not super precise. Now there are even products that are used in the hospital; you know, people use something called MammaPrint, a test that has been designed from this kind of analysis.
It's a bit of a controversial issue, because some people say it brings a lot, and some people say that at the end of the day, the genomic information doesn't bring a lot. And if we look, for example, at some publications, there have been some, you know, competitions to assess as objectively as possible
whether it works or not to use genomic data. Some people have said: let's collect lots of data, and lots here is typically 1,000 or 2,000 samples, and ask people to predict the risk of relapse, typically, you know, through a competition on the web, et cetera. And this is just one example of a competition that was a big one, where people have tried
to predict the risk of relapse either from genomic data, or using the age of the person, and the size of the tumor, and a few markers. And here you have a summary of the results, I will not go through all the details, but vertically you have the performance, so how well it works, right, this is a score,
it's a concordance index, whatever it is, so the higher the better. Higher means that you have a good prediction on new samples, not on the ones you see but on future samples. And the different columns here are the performance of different models that were tried by different teams. And the very bad news here, for me
or for the community, is that when you take the teams who just use the good old clinical factors, so the age of the person, the size of the tumor, they reach some level of performance, let's say 62. So this is the state of the art of 1999, before we even sequenced the human genome.
Now a few teams said, let's replace that by genomic data. We have 20,000 genes, that's wonderful, and here they are. So there were not too many teams here, there were two teams, but on average they reached 60%. So they lost performance compared to the good old clinical factors. This is disturbing, I would say.
Now many teams said, well, you know, maybe we should combine them because we have the clinical factors, and let's add in addition to that the molecular data. And then it goes up again at a level which in this case is the same as clinical factors. So this is an almost multi-billion dollar industry to design genomic tests,
and say we will use these tests to predict the relapse. But when you look at that, I wonder... So this is why I say there is some controversy. Some doctors say that using a genomic test doesn't bring much compared to what we know; what I would say is: yes, it brings something, and the devil is in the details, because even using only clinical factors, some hospitals may be using this model here, which doesn't work.
Some other hospitals may be using this model. So there are also different ways to train the models; there is not a single way to do it, it's not only a Cox regression. But overall, the information content of the genomic data doesn't seem to be very strong in this case. Something that we observed, so, we and others have studied these data:
what we observe is that, in fact, the performance here increases with the number of samples. So what you said is correct: when we have more samples, it works better. And what we observe is that with 2,000 samples, we are still increasing. So it doesn't mean that in the future it will stay so low. But it's clear that currently,
even though it seems to be a lot to have 2,000 samples, and maybe we can reach 10,000 or 50,000 now, there is a limit in the performance which is not due to the data. The data are here, they are nice; it is due to the inference process. Maybe it could be much better if we could train on a billion patients,
but we will never have that. And for the moment, there is some limitation, not for biological reasons, not for medical reasons, but for statistical reasons: the inference process is limited. Yeah, so if you look just at the clinical data, you just said they may be using different tests or different analytical approaches.
Haven't they used statistical... I would have thought they would have used statistical approaches to eliminate the bad approaches. Yeah, sure. So this plot is just a raw plot of the results of a competition on the web. So it's just what happens in real life.
You give some data to 50 teams, and you ask them: build me a predictive model. So one guy will say, well, it's easy, I make a Cox regression model on these data. Another one will say, well, I will do the same, but I will transform my data first. Someone else will say, I will do a survival random forest, et cetera. And everybody believes they have a good model,
you know, on the data they do cross-validation, they test it. But now, when you test it on new data, which is the performance here, you observe this spread. Maybe, I mean, these ones are very bad, so you can suppose that, you know, these were students or people who didn't know much about it and made a mistake, maybe. But let's say overall,
this level here is the standard level in this case. Of course, ideally, what you'd want to do is combine the best clinical approach with the best molecular one, you know. Yes, sure. But something I should say is that I think there is also a problem in this kind of reasoning. So for example, in this publication, the conclusion, surprisingly, was not that,
you know, clinical is good enough. The conclusion was, look, many teams try to combine clinical and molecular and prior knowledge. And look who won the challenge. It's this guy, right? And the conclusion was, therefore, there is some information in the molecular data that is not present in the clinical data.
Now, you know, if you look from here, in terms of distribution, they don't seem different. It's just that more teams were here. So that's a case where concluding that this approach is better because the winner is here doesn't really make sense, right? If you had more teams here, you would also reach the 60s.
In the same way. Is there any significant difference between just molecular and molecular plus prior knowledge? For just molecular it's hard to say because, I mean, in these data there are just two points. So there is no significant difference, right? I mean, here we do statistics among teams.
You know, it's a way to do science now, open science: we give the data to many people and do statistics, assuming that people are all equally good. What about just prior knowledge? What is prior knowledge? Sorry, so prior knowledge here is using the molecular data,
but asking someone who knows about the genes what he thinks are the good genes. So for example, it could be, instead of using all the genes, I pick only the genes which I know are involved in cell cycle. Or, you know, I use a network or this kind of thing. So if you just use prior knowledge without the molecular aspect.
So, I mean, here the prior knowledge is prior knowledge used to analyze the molecular data. And it's not better than just the machine working on the molecular data, even though it's only two points. If you compare molecular... This one? Yeah, sorry. So you mean without the clinical factors, if we don't use the size of the tumor,
then it's very hard to get a good model, right? I mean, compare these two things: here, I just blindly train a model in 25,000 dimensions, and here, in addition to that, I say maybe I should focus on a few genes which I believe are important, and you ask doctors.
And you don't do better. No, it's not better, right? So you have all these disturbing facts which sometimes contradict the intuition of why it should work, but they are here. Now, very quickly, a second disturbing fact, and on this one I will spend less time. What is being predicted?
What specific feature of the cancer is being predicted? So in this case, sorry, the goal was to predict the risk that there is a relapse of the cancer. So this is breast cancer metastasis? Yes, that's correct. Within a certain time? Within a certain time, yeah. Well, yes and no.
I mean, this thing is called a survival model. So, you know, for each person, we know when the cancer was diagnosed, and then the person is followed. And the stage at diagnosis? The time of diagnosis, and what stage it was when it was diagnosed, whether it was
an early stage, and how big it was: that's the crucial point; this crucial parameter beats everything else. Yeah, sure, sure. So it would be enough for the molecular data to predict this stage. Exactly. So, I mean, you're completely right that the reason why clinical data are good is that they have been used forever, because doctors have observed that, of course,
when I said the size of the tumor, it's not a joke that the size of the tumor is the most predictive factor in this case. We know that if you have a small tumor compared to a big one, the risk is different. So you should use that. And what I just say is that you cannot replace it by the gene expression, even though there is signal in the gene expression. But in this case, you know, there is not,
I mean, the summary is that there doesn't seem to be more signal in all the genes, at least in this plot, than in just using what we call the stage of the cancer. The size: it may involve the number of cells of a particular class, whether they are counted or not; some of the cells within that size may not be affected. Yeah, no, sorry, when I talk about the size here,
it's really how big the tumor was, in centimeters. But if you look at the tissue, tumors may have a different composition, different cells, on which the test may or may not work; this parameter is crucial, and these parameters are not used here. Maybe they are hidden in the molecular data. No, that would be a histological analysis; it's not involved in that.
It's not here. I mean, among the clinical data, you have something like the expression of three proteins. But there are pathologists; a doctor, or somebody who looks at the microscope, can see that, and it is part of the clinical data. Except that it's super-summarized, right? They look at the image, they do many things,
and they summarize that in five numbers: does it express the estrogen receptor, does it express the progesterone receptor, et cetera. If you're thinking of histology, you look at the microscope and you see the shape of the cells. Yeah, sure, sure. As you said, this is done in the clinical data, but it is summarized in what we call the grade of the tumor, if you want.
And the grade itself is a mixture of the number of cells in mitosis, et cetera. But here is just a number between one and four. Okay. So second observation, and this one also is completely not new, but it's still disturbing, is that in the early days of genomics, researchers said, let's look at the genomics.
So let's do what I showed previously. We have molecular data, so we can fit a model to predict the risk of relapse, and maybe we don't need all the genes; maybe a few genes should be sufficient. If you're a doctor or a biologist, of course, nobody believes that all 20,000 genes are useful to predict the risk of relapse, right?
So people have said, well, we can use nice methods to select genes and build what are called molecular signatures, meaning a subset of the genes whose expression would be enough to have a good prediction of the risk of relapse. And so in the early 2000s, several teams did the same thing, basically.
They said, let's look at breast cancer, let's look for the good genes which allow us to predict the risk of relapse, let's publish it, and let's make a product. You know, I mentioned MammaPrint: this is this one, it is a product in the clinic now. What's disturbing, and it's not new, everybody knows that in the field, is that, for example, you have two different teams
focusing on the same problem, listing their magic 70 genes or 76 genes, and when you compare the list of genes, there are three in common, right? Is it a lot or not? Well, this is what you expect if you randomly pick 70 genes out of 20,000. On average, you get three in common, right? Yeah, but this one is more specific
for the lymph node metastasis. Yes. So this one is more general, so maybe that's... So that's a good point. So I mean, there are many reasons why it could be, it's not exactly the same problem, right? It's a different kind of metastasis. It's not the same technology, it's not the same cohort, so there are many reasons why,
you know, it should not be the same. Now, so I will not talk more about that because I want to talk about something else, but we and others still were surprised by that, and so some people say, let's, for example, look just at this cohort, here there are 300 patients, and let's simulate if there is enough power in the data,
statistical power to still discover good signatures. For example, instead of comparing this paper with this one, we just take this paper and we randomly split the cohort in two sub-cohorts. So you have 300 patients, you randomly cut it in two parts, 150 versus 150, and you train two signatures. So here is the same cohort, the same technology,
the same endpoint, and you compare the genes. And what do you get? You get three in common, okay? So basically, and whatever method you use to select the genes, you can do a T test, you can do complete methods, but you never get more than that just because, and it has been also investigated
and published by other teams. So you compare by expression or by sequencing of the genes? Here it's expression. But not sequencing genes, not the application? This is a very old paper, this is microarray. Yeah, it was in 2005, so gene expression. They just imagine you have these matrices with 300 patients and 20,000 genes,
and just randomly split the two, estimate two signatures. In the same cohorts, you get two different signatures. In fact, there were some papers showing that, for example, if you randomly pick 70 genes and build a model on that, then you get the same performance idea, right? So what it means, and it's not really a surprise nowadays,
is that you should not focus too much on the 70 genes here, or on the 76 genes here, because you can pick any other subset of 70 genes and, more or less, it will work the same, right? So these are not the magic genes. Now, what has also been observed and quantified is that if you don't have 300 patients like here, but you increase that to 500, 1,000, 2,000,
then this number increases, right? Because it's very statistical here. It's because of the correlation structure among the genes, it's just a hard problem to identify with certainty some genes which have predictive signals. And the statistical problem becomes easier with more samples.
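A toy numerical version of that split experiment (entirely simulated data, not the real cohort): rank genes by their association with the outcome on two random halves of a 300-patient cohort and count how many of the top 70 agree.

```python
# Sketch (simulated data): select the "top 70 genes" on two random halves of a
# small cohort and count how many genes the two signatures share.
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 300, 20000, 70
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:200] = 0.2                     # weak true signal in the first 200 genes
y = X @ beta + rng.normal(size=n)

idx = rng.permutation(n)
halves = (idx[: n // 2], idx[n // 2:])
signatures = []
for h in halves:
    # Unnormalized correlation score of each gene with the outcome on this half.
    score = np.abs((X[h] - X[h].mean(0)).T @ (y[h] - y[h].mean()))
    signatures.append(set(np.argsort(score)[-k:]))   # top-k genes
print("genes in common between the two signatures:",
      len(signatures[0] & signatures[1]))
```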
And here, in terms of how many samples you have to fit the model, how many dimensions you have, and the correlation structure, it's just usually extremely hard. And empirically, you get this kind of thing. All right, so this was a long introduction, just to justify the fact that there are challenges here.
And in short, it's not enough to say: I will sequence or do gene expression on my cohort, I will find good genes, and then I will make a good model. It's a bit more complicated than that. So what I will now very quickly discuss, and I'm sorry I spent too much time, but I think it's important,
are some more mathematical or statistical ways to try to make some progress on this problem, which is not biological here. It's: how do you get extra knowledge from these data, which sometimes have a very strange geometry or topology?
So we'll discuss two things. They are very, very standard, but I will illustrate them on some things we did, even though it's a broader topic. One is showing that what we call regularization in learning is important and can be adapted to the problem. And the second is that maybe, instead of taking the raw data,
typically when you have a patient you say, I measure the transcriptome so it's a vector of numbers. Maybe it's not a good representation and maybe doing something else like I will explain what this is, changes the geometry and makes the inference process easier, right? The key point is that things here
require some assumptions, some prior knowledge but if you don't do them, then it will not work. You cannot just trust the data. We are not in the setting where there are enough data to automatically learn everything. So it really makes a difference to do something specific on the data. And of course these are not definite answers. I mean, these are just things we did but many things can be done.
Okay, so regularization and representation. So regularization, what is it? Maybe many of you are not familiar with statistics and machine learning but it's just a concept that you remember when I showed the first picture with the points and the line.
Let's try to formalize a bit the process here. So the process is that we have observations which are vectors. So here I have, you know, look at the data. So this is a real data, a zoom in on some real data. So now we have samples which will be patient. So each, imagine this is a matrix of numbers representing gene expression values.
So each row here is a person, a patient. Each column is one gene. So you have 20,000 columns, let's say a few hundred rows. And you have a response here which is what I call the Y variable which here we take as binary. Imagine that this is one if the patients relapse
and zero if they don't relapse. This one, it's not important, but this one is breast cancer. So remember, when you observe these data, this is another way to say we have points in 20,000 dimensions, with black and white points.
Let's fit a line to separate them. A line, mathematically, is just a linear function. So here the goal will be: let's try to infer a linear function in this space. So, if x represents a patient, I write beta transpose x, that is, the sum of the beta_i x_i, to denote a linear function. And the goal is to infer the beta.
So beta would be say the slope of the line. So the direction of the hyperplane in this case to try to separate let's say the plus one samples versus the zero samples, the two colors. Okay, I showed in 2D that it's easy to find a line and I said it's a solved problem. Here it's not solved in high dimensions
because there are many, many hyperplanes, many, many lines that can separate just a few points from a few other points in high dimension. So how do we solve that in practice? Well, the standard state of the art is to say, let's regularize and regularizing means that we write an objective function.
So we do some optimization, an optimization over the possible lines, saying that we're looking for the beta, the direction of the hyperplane, that minimizes a sum of two terms. I will not detail what the two terms are, but the first term is just a term that measures how well beta separates the two classes.
So this is a measure of how well you separate the two things; this is the data-fitting part. But at some point you introduce a criterion which is not justified. Yeah, which is by no means justified; it's just a choice that has to be made. Right, so this would be: let's try to separate the data. And if you do only that, I said it many times, it doesn't work
because it's an ill-posed problem. There are many ways to have a perfect separation of these points in high dimension. So the solution that machine learning and statistics have found is to say: let's penalize this by a second term which depends only on beta. This one does not depend on the data. Right, it's some kind of prior penalty
that we put on any classifier, any line. And if we minimize the sum of both, then, if we choose the penalty correctly, it ends up being a well-posed optimization problem that has a unique solution and decides what this line is. Okay, so if you have never seen that before, it may be a bit abstract, but let's be concrete.
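In symbols, the objective being described is roughly the following (a reconstruction in standard notation, not copied from the slides; here l is a loss such as the logistic loss, and lambda is the weight of the penalty):

```latex
% Regularized estimation: data-fitting term plus a penalty on beta.
\min_{\beta \in \mathbb{R}^p} \;
  \underbrace{\sum_{i=1}^{n} \ell\!\left(y_i,\; \beta^{\top} x_i\right)}_{\text{fit to the data}}
  \;+\; \lambda \,
  \underbrace{\Omega(\beta)}_{\text{penalty on } \beta}
```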
What are these penalties, typically, which are implemented in all the software we use every day when we do statistical analysis? Maybe the two most standard penalties are just norms. Right, so beta is a vector, and the standard penalties used are just the Euclidean norm squared, if you want,
or what's called the L1 norm. So the Euclidean norm is the sum of the squared coefficients, and the L1 norm is the sum of the absolute values of the coefficients. So these are two norms over vectors. Here you can see, in a space of two dimensions, that the light blue shapes are the unit balls
of these two norms. So the L2 norm is just a circle, and the L1 norm is more diamond-shaped here. And typically these two things are used as penalties for regularization. And if you do that, it has names: it's called ridge regression,
ridge logistic regression, support vector machines; all these standard methods involve these two penalties. Now, what you see is that these are pretty generic. It's not about biology here. There is no notion of gene whatsoever, but these are already quite good at
ensuring that in high dimension you have a well-posed problem and it does something. And just a last comment on that. So they have names. The L1 norm is often called the Lasso regression and the L2 norm is called the Ridge regression. Even though they look a bit the same they are just two norms. When you use them in practice they end up with very, very different models.
And in particular for geometric reasons it's known that if you penalize with the L1 norm which is non-differentiable at some points then the solution of the optimization, so your final model, will be sparse. Meaning that the vector of weights will have many, many zeros.
And you can control the number of zeros with the regularization parameter. Whereas if you use the L2 norm as penalty, it will not be sparse. So if you translate that in terms of genes, it means that the L1 penalty will lead to gene selection, because at the end of the day your model, even though it's a model in 20,000 dimensions, will contain just a few non-zero components; these would be the selected genes. With the L2 penalty, the weights are non-zero everywhere, so there is no gene selection.
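A small sketch on synthetic data (sizes chosen arbitrarily) showing this difference between the two penalties with scikit-learn:

```python
# Sketch (synthetic data): the L1 penalty (lasso-style) zeroes out most weights,
# the L2 penalty (ridge-style) keeps all 20,000 of them non-zero.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20000,
                           n_informative=50, random_state=0)

l1 = LogisticRegression(penalty="l1", solver="liblinear").fit(X, y)
l2 = LogisticRegression(penalty="l2").fit(X, y)

print("non-zero weights with L1 penalty:", int(np.sum(l1.coef_ != 0)))
print("non-zero weights with L2 penalty:", int(np.sum(l2.coef_ != 0)))
```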
So, since there are probably other non-mathematicians in the audience: doesn't it worry you that you get very different results when you change this arbitrary, or semi-arbitrary, penalty function?
I mean, if you converged to the same answer, you would say, well, it doesn't matter so much how you choose the penalty. Yes, so indeed. If we were in the easy situation with many samples in small dimensions, what we would observe is that the choice of the penalty is not important, and probably you would not even need a penalty.
So here, something I didn't say is that there is a weight lambda that you fix. That balances how much penalty you put compared to how much you believe in the data. So in easy situations, all the penalties converge to the same solution which would be the ones you get without penalties. And indeed, as you say,
in the situation we are in in genomics, where we need to penalize, then you get very different results; when you look at the vectors, they are very, very different, and yet there is no statistical way to say one is better than the others. And here we enter the territory where we have to make choices, or prior assumptions,
and typically assume that we believe that a good model should be sparse. Well, there is an unavoidable error here; it depends on how many parameters you have, something like the square root of n, and this error grows with the dimension relative to n. So I think it is a big problem with all these methods, yeah, there you go: ad hoc models,
and sometimes it's a miracle that they work; there is no reason why they should work. Yeah, I mean, so a key question, no, the first question is: okay, in genomics, should you use the L1 or the L2 one? I'm just trying to see which one will fit better, which one separates better, yes? But none of them may work; maybe neither one is sufficient.
You have to invent the right one. Yes. You take them from textbooks because people use them; it's absolutely arbitrary. Yeah, yeah, yeah. And in fact, you know, I just put L1 and L2, but you can imagine any norm. They are almost random; you choose two random norms. Exactly; you have to invent something better. Yes. So, for example, what I want to stress is that, in addition to that,
the reason, I mean, whether L1 or L2 or something else will work, is sometimes very non-intuitive. So let's take the example of our breast cancer prognosis, right? We have breast cancer, we want to fit a model, and I say: you need to regularize. Take a textbook and it says, well, there are two nice ways:
you can do lasso regression, you can do ridge regression, et cetera. And the lasso will do a selection of genes, a signature. So I think there is a consensus among biologists and doctors that, probably, if there were a true model, like if we had enough data, probably the true model would be sparse,
because we don't believe that all the genes have predictive signal; it should contain a few things. This is a nice idea. So if you translate that into, you know, our inference problem, it would suggest that L1, if you had to choose, is maybe the better one, because L1 will lead to a sparse model and L2 will not. And that's what everybody did.
Like, you know, I mentioned MammaPrint, the gene signature with 70 genes. It was done with something that was not the lasso, it was even simpler, you know, a one-by-one selection, but it was based on the assumption that there is no need to have all the genes; a few should be enough. Now, if you test that on data, surprisingly, and I will not explain the details,
you can do experiments where you plot some performance of your model. So, you know, you have data, you train on a subset of the data, and then you test on some other subset to evaluate the generalization performance. And we can compare models which are based on a few genes with models based on all the genes with ridge regression.
And if you plot the performance as a function of how many genes you have in your model, so typically, you know, you compare lasso and ridge and you say: I can make a signature with 70 genes, which is the standard in this field, and I get some performance; or maybe I can add more genes, but then I need to regularize with L2,
if I do it correctly. And what you observe is that, in fact, in this case, the performance just increases with more genes, right? And it's also a disappointing observation: focusing on 70-gene signatures gives you some signal which is not too bad, but the performance would be better if you kept all the genes.
So this thing is not intuitive, because if we had converged to a true model, myself, I don't believe we should need all the genes. But something happens here, which is just that, because we don't have many samples and because of many things we don't understand, empirically it is the other regularization that gives better performance. And probably this has to do with the fact that,
you know, I said that it's very hard to be consistent in the gene selection: sometimes when you pick 70 genes on some data, you end up with 70 other genes on other data. And this lack of robustness is probably related to the fact that the performance is not optimal, and that, because it's hard to choose the good ones, it is better to just keep all of them
and just apply a penalty that allows you to learn in high dimension, even though you are not sparse. The next question is: some genes may not look specific to cancer, but they may be involved in the production of ribosomes, say, and therefore may be very sensitive. Maybe a lot of genes are not directly related,
but are highly expressed, and so they will give this effect, not because they're so important but because they are expressed in large quantity. Yeah, sure. But, you know, this is not clear. The only thing that we can test, me at least, is the performance. The expression, is it absolute expression or relative to the healthy tissue?
So what do you put in? If you put in normalized numbers, you get a completely different result. In this case, these are, let's say, the absolute expressions, even though absolute means that they have been processed by some pre-processing. Yeah, normalized by the control, right? Yeah. I need to intervene: from now on, we'll ask only short questions, okay?
No, otherwise we will never finish. Short question. If I went into your data and mislabeled a few patients, wouldn't that... From data with a little bit of noise in labels of patients, wouldn't that look...
You mean the performance would decrease or... Yeah, sure. So, I mean, the noise in the labels, in general, is something... I mean, all the methods that we work with are supposed to run with noise in the data. So in a sense, the picture, when I had the black and the white,
was not the correct one here. You know, if I come back here, we balance the sum of two terms: one is how well we fit the data, and the second is how good we are with respect to our prior knowledge or prior hypothesis. And here, the fit to the data does not impose that you make no error. It says, well, we can accept errors, and sometimes it's better to make a few errors
on the training set if we get a smaller penalty. I'm not talking about the errors of your algorithm. It's a long question already; ask him later. But you're trying to fit the mistakes. No, I don't say it's good to fit the mistakes, but we all know that, you know,
the mistakes are part of what I call the noise in the training data. But we can follow up over coffee. Okay, so I understand I have to speed up a bit, and I will not take long questions anymore. So what I wanted to say is that, you know, here, something happens; it is needed to do that.
It's sometimes counterintuitive to know what penalty works or doesn't work. And so some people, our group and many others, have said that maybe if we are confronted with genomic data, for which we know a lot of things, we know the genes, we know ribosomes, the cell cycle, et cetera, maybe it's possible to use this prior knowledge to design different penalties, a bit less naive than that,
because this one, you know, with gene one and gene two, is isotropic in all directions. And so things that we, and others, have proposed are, for example: if you know a gene network, if, as prior knowledge, you know that some genes are connected; this one is a picture of a protein network, yeah.
But, you know, for my purpose, I just say that you have genes and connections that could be physical interactions, that could be pathways, that could be many things. Could this knowledge be put in the prior so that you can drive the selection of models, let's say, to be compatible or coherent with what you believe happens in the cell?
And if you're correct, this may be a way to help the inference process, right? In a sense, you could reduce dimension, focusing on what you want. You said that here, let's say you have a graph and you want to use that to constrain your model. How can you do that? Well, these are, you know, a few examples of penalties where the structure of the graph enters
a definition of a penalty. So all these ones are functions of beta. So beta would be a candidate model. Then for each candidate model, you can quantify something. So this is to replace the L1 or the L2 norms, right? And then you could say, maybe I can use these penalties, put it in the optimization,
and this gives me my inference model, where at the end of the day I will try to fit a model that explains the data and is small in terms of these penalties. So what are these penalties? I won't have time to detail all of them; maybe I will just illustrate the weirdest one. I mean, I know there is a heterogeneous level of mathematics in the room,
but for many people this one looks a bit strange. Like, if you have a beta, I say that a penalty, a norm for beta, would be the supremum of alpha transpose beta over all alphas such that alpha i squared plus alpha j squared is at most one for every pair of connected genes i and j. What is that? It's a bit ugly, it doesn't seem to make sense
for many people. But in fact, this one can be illustrated as follows. So this is the penalty I was referring to. It's just a variational form of a norm whose unit ball is a convex hull in high dimension. So what do I mean? Imagine that here we live in a space whose dimension is the number of genes. So it's not two, it's 20,000.
And what we do is that we design a shape in 20,000 dimensions in this case, just by saying that when you have a gene network, so you take all pairs of genes which are connected. For example, suppose you have five genes here, you take the pair one, four, these are two genes, they are connected.
And therefore, what you do in this case is that in the high dimensional space, you just draw a circle with unit radius, in the subspace corresponding to dimensions one and four. Right, so you are in high dimension, but you can focus on just two dimension, dimensions one, four, you draw a circle. And then you do that for gene one, four,
then you do that for all the pairs, two, four, you draw another circle, four, three, you draw another circle, and you end up with many two dimensional circles. And then you take just a convex hull, and the convex hull is just the smallest shape, convex shape that contains the circle. So this picture just shows what would happen if you have two circles in 3D.
Here you have a horizontal circle, a vertical circle, and the convex hull is like the Chinese lantern that fits the circles. But not more. Right, so if you do that, then this defines a norm, like you can take this as the unit ball of a norm,
and this equation is just the definition of this norm. Right, and so now you can put this norm in place of the L1 norm, or of the L2 norm that I mentioned before. And if you do that, if you analyze the situation a bit, what happens is that, because the shape is non-differentiable at some specific places, on the circles, you can show that it will lead to a selection of genes.
So when you minimize, you estimate the model beta, and the solution beta will have many zeros, just like the lasso. But among the non-zeros, so the genes which are selected, you will see that many of them are connected on the graph. And the reason being that the solution will be on some of the circles which correspond
to the fact that two connected genes are non-zero. Right, so this is just a way to change the penalty a bit, by putting in prior knowledge and pushing the solution towards not just, let's say, a selection of 70 genes, but hopefully a selection of 70 genes that tend to be connected on our network.
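In symbols (a reconstruction in my own notation, not copied from the slides), for a gene network with edge set E the penalty sketched above reads:

```latex
% Graph-structured norm in variational form: supremum of alpha^T beta over
% vectors alpha whose restriction to every connected gene pair has Euclidean
% norm at most one. Its unit ball is the convex hull of the unit circles drawn
% in the 2D subspaces spanned by each connected pair of genes.
\Omega(\beta) \;=\; \sup\Big\{ \alpha^{\top}\beta \;:\;
    \alpha_i^{2} + \alpha_j^{2} \le 1 \ \text{ for every edge } (i,j) \in E \Big\}
```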
Right, we change the penalty a bit, we put in a little bit of knowledge. Hub genes may influence this a lot, right? Because, you know, some genes have huge numbers of connections, so they may carry a lot of weight and influence the result, yeah. Yeah, that's a very good point. So here, I mean, a variant would be to have weights, et cetera, but this is a very important question
to which we don't have any satisfactory answer. Just to illustrate, let's take back our breast cancer data and compare, for example, what would be a signature, so a selection of, I think, just 60 genes here. If you use a lasso, so this is, let's say,
a standard and state-of-the-art method to say I have, I start from all the genes, I select 60 genes, here they are. So you have many genes, and then if you map them back to a network, to analyze them, to try to say what is the signature, then you would observe that a few of them tend to be connected. So here, you have your ribosomal proteins, by the way,
and they show very strongly as six connected genes, so you can start to get some interpretation, like it seems that ribosomes may be involved, et cetera, but many other ones are not connected, so it's a bit harder to interpret. Now if you replace that, and you get performance of 61, if you replace that by, you know, just changing the norm from L1 to this convex shape,
you get different signatures, and it's not a surprise, but of course the signature is more connected, because you have put the knowledge of the graph in the penalties, so these are the 60 other genes, and suddenly you have bigger connected components, like your ribosome proteins are extended
to a bigger subnetwork or pathway. You see up here a second big component, which are cell cycle genes, and a few other ones, right? So here, you know, it's hard to say it's better, because you put the knowledge of the graph, and therefore it's mathematically obvious that you would get more connections, it's what you designed, so it's hard, you know,
to assess whether it is really good news that you observe that; you had to observe it. What we can observe is that sometimes it's hard, but sometimes the performance slightly increases. Here, you know, I don't put the error bars, et cetera, but you tend to have some slight increase in the performance, which may be a good sign that, you know,
choosing different penalties could have some impact, and in particular here, using some prior knowledge may drive you to more realistic models, which may be more stable and also lead to better performance. All right, there are other penalties, but I will not discuss them, because I just want to say one word about,
you know, I said there are two parts in my talk. One is regularization, so I hope I explained what it means and that there is the possibility to define penalties. The second one is to say, you know, maybe we are just naive: because someone measured 20,000 numbers, we put them directly into our model, right,
because, as you say, the geometry here... we say these are numbers, therefore these are vectors, and we use them in our model. There is a very strong assumption here, and a very strong naivety, in believing that this is a good geometry for what we are doing. Well, there is a bit more to it, right? You use that representation
because the data come as a table. Yeah, yeah, that's right. And, for example, I will just illustrate on one concrete case why we know that it's very naive. So, for example, you know, I've been showing this image many times; I said these are real data, gene expression data, but in fact these are not the data that come out of the machine.
For example, here, if we talk of microarrays, it's the same for sequencing. What comes out from the machine is first preprocessed by many things, because we know that you have technical effects, you have batch effects, you have, you know, sequencing differences in sequencing depth, et cetera. So, before coming to this matrix, in fact there is a lot of things going on. And one thing, for example,
that is maybe one of the most standard preprocessing steps, for people working on expression microarrays, and it's implemented in the RMA package in R, which is one of the most popular ones, is the following: if you do some biological experiments where you measure gene expression over the same sample
many times, on different days of the week, with different people, you get different numbers, right, for many reasons. The weather changes, you know, it's very hard. So, typically, these are what we call unwanted variations, right?
These are variations due to technical effects, and you don't believe it's biology. So, to remove these effects, someone has to do some normalization as a preprocessing step. And typically, one thing which is done is called quantile normalization. It's a transform where, imagine that here each box plot is one sample, the distribution of gene expression over one sample.
Usually, you don't take the raw values; you normalize them so that at the end of the day, when you have many samples, you more or less have the same values, though not in the same order, right? It's not the same genes which get the same values, but if you look at the distribution of values within each sample, it ends up being the same. This is called quantile normalization, right?
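A minimal sketch of quantile normalization itself (a generic implementation, not the RMA code; ties are handled naively):

```python
# Sketch: quantile-normalize an expression matrix (rows = genes, columns = samples)
# so that every sample ends up with the same distribution of values; only the
# within-sample ordering of the genes is kept.
import numpy as np

def quantile_normalize(expr):
    ranks = np.argsort(np.argsort(expr, axis=0), axis=0)  # rank of each gene per sample
    target = np.sort(expr, axis=0).mean(axis=1)           # common target distribution
    return target[ranks]                                  # reassign values by rank

rng = np.random.default_rng(0)
raw = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 5))  # 1000 genes, 5 samples
norm = quantile_normalize(raw)
# After normalization, every column has exactly the same sorted values.
print(np.allclose(np.sort(norm, axis=0), np.sort(norm[:, :1], axis=0)))
```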
Now, what this says is that, in fact, what you kept from the original data, the original signal, is not really the values. It's just the relative order of the genes, because when you move from the raw data to the data you will use in your models, it's just the order that has been kept,
and then, from the order, you define the new values, which are constrained to follow a particular distribution, right? So, what can you say from that? Well, a few things; I will be short now. One question is: here I say you transform your data
so that at the end they all have the same distribution. First, it is not clear what the final distribution should be; it's a bit arbitrary, right? You could impose it to be a Gaussian distribution, a uniform distribution, an empirical average. There are many possible choices, and there is no clear reason why you should pick one or the other. So, for example, these are different possible distributions.
So, something we worked on with a student of mine is to say that maybe this thing could be optimized, and I don't have time to explain, but it's possible mathematically, don't look at the equations, but not only to say I first transform and then I fit a model, but you could say I transform and I fit a model
and I optimize both the transformation in terms of the target distribution and the vector beta that defines the model, right? So, in short, it's a joint optimization of a model. Sorry, I changed from beta to w, but it's just a linear model, and the parameter of the transformation.
So, it's possible to do. The interesting byproduct is that, when you do that, you need to clarify what the link is between the target distribution and your optimization problem, and in fact there is a simple link, which goes through a new representation of each sample.
So, eventually this is one patient, one vector of expression. I said the information you keep from this sample is not the values, in fact; you will change the values. It's the relative order of the genes, so it's what's called a permutation in mathematics, a ranking of the genes. And in our case, the way we represent this is through what's called a permutation matrix: a binary matrix that indicates the position of each gene, where each column corresponds to a gene and each row corresponds to a rank. And you can show that if you replace a vector of expression by this permutation matrix, then this corresponds exactly to optimizing both the model and the target distribution.
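As a small illustration of that encoding (using the convention just described, rows for ranks and columns for genes), here is how one sample could be turned into its permutation matrix; the code and variable names are only a sketch.

```python
import numpy as np

def permutation_matrix(x):
    """Encode an expression vector x (one sample) as a binary permutation
    matrix P with P[r, g] = 1 iff gene g has rank r in this sample
    (rank 0 = smallest value). Only the ordering of x is kept."""
    p = len(x)
    order = np.argsort(x)            # order[r] = index of the gene with rank r
    P = np.zeros((p, p))
    P[np.arange(p), order] = 1.0
    return P

x = np.array([0.3, 2.1, 1.4, 0.9])   # toy expression vector for 4 genes
P = permutation_matrix(x)
f = np.array([-1.0, 0.0, 0.5, 2.0])  # some target quantiles, sorted
x_normalized = P.T @ f               # each gene receives the quantile of its rank
```

With this encoding, the quantile-normalized sample is just P.T @ f, so choosing the target distribution f and fitting the linear model can indeed be phrased as one problem over the pair (w, f).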
So, I don't want to go into the details of that, but for me the important lesson here (and somehow it works) is that what we have done, and what we do sometimes without realizing it, is that when we use gene expression data, we don't really use the values that were measured: we first transform gene expression into a permutation. So, we move to a discrete setting, right? The space of permutations is called the symmetric group. And then, when you have, let's say, a thousand samples,
it means you have a thousand permutations. And then you learn from that. So, it's a new representation. And one way to do it is to do what I presented, meaning that from the permutation, we remap to the space of vectors, and then we make a linear model. But maybe you don't have to do that. Maybe you can directly say, what about designing algorithms or methods or ideas
that directly work on the symmetric group? So, it raises the question of, can you learn on the symmetric group? And, in fact, many people do that and are excited by that, right? So, in short, even though we believe
that we work with vectors of numbers that were measured, there is in fact something in between that happens in many cases, which is working on the symmetric group. And just as an idea, maybe it's possible to start directly from there and design new approaches that work in that space. So, here the question is: suppose you want to make a linear model, you need to define what a linear model over the symmetric group is. Of course, mathematics has a lot of tools for that, but it's not completely obvious. It's not just a straightforward sorting problem, no? No, I mean, maybe it can be formalized that way, but it's not obvious.
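Just to show that this is not an abstract wish, one classical ingredient for learning directly from rankings is a similarity between samples that depends only on their relative orders, for example the Kendall tau correlation, which can then be plugged into a kernel method. This is only one possible route, not necessarily the one on the slides, and the toy data below are random.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.svm import SVC

def kendall_kernel(X):
    """Pairwise Kendall tau between samples (rows of X): a similarity that
    depends only on the relative order of the genes within each sample."""
    n = X.shape[0]
    K = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            tau, _ = kendalltau(X[i], X[j])
            K[i, j] = K[j, i] = tau
    return K

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))               # 20 "patients" x 50 "genes"
y = rng.integers(0, 2, size=20)             # toy binary labels

K = kendall_kernel(X)
clf = SVC(kernel="precomputed").fit(K, y)   # an SVM that only ever sees rankings
```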
Okay, so I think I will just stop here, because I still have a lot of material.
I could talk more, but obviously I don't have time to. But just as a teaser, maybe the last thing I want to mention quickly is that I talked a lot about gene expression data, but there are many other data; for example, we can look at these types of data, which are more discrete. Again this is for cancer (here the picture is for breast cancer), and these data don't indicate the expression of the genes but the mutations in the DNA. So you could say: now I look at the genome of the tumor, I compare the tumor to the normal sample, take the difference and observe that some genes are mutated in the tumor, so
maybe just looking, for a given patient, at which genes are mutated can help me predict the risk of relapse or the response to a therapy. But here again we have the same problem: we want to use that as input to fit a model, it's again high dimensional, and there is something more here, which you can see just from this picture: these are not just random vectors in high dimension, it's a binary matrix with 99% of zeros. When you take two samples at random they are basically orthogonal; they have no mutated gene in common, maybe one, p53, but overall very few. So again there is a very strange, complicated geometric structure, which means that if you directly take that and fit a model, a Cox regression if you want to predict survival, or something else, it just doesn't work well.
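To make the "basically orthogonal" remark concrete, here is a toy simulation; the 1% mutation rate and the matrix size are made up, only the order of magnitude of the effect matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_genes = 200, 20000
# Hypothetical binary mutation matrix: about 1% of genes mutated per patient.
M = (rng.random((n_patients, n_genes)) < 0.01).astype(float)

overlap = M @ M.T                                   # shared mutated genes between patients
shared = overlap[~np.eye(n_patients, dtype=bool)]   # off-diagonal entries only
print("average number of shared mutated genes:", shared.mean())
# Each patient has ~200 mutated genes out of 20,000, yet two random patients
# share only a couple of them: the profiles are nearly orthogonal.
```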
And so some people, we and others, with Andrei Zinoviev in particular, have investigated whether it's possible to change the representation, to say that instead of having a big binary vector we should represent our sample by something else. So we proposed something using a network again; I don't have time to explain, but to make a long story short, it's possible to replace the original representation by another one.
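I don't know exactly which transformation is on the slide, but a common way to bring a gene network into this setting is to smooth each binary profile by diffusing the mutations over the network, so that two patients mutated in neighbouring genes are no longer orthogonal. Here is a rough sketch of that general idea, with a made-up chain network; it stands for the principle, not for the speaker's specific method.

```python
import numpy as np

def network_smooth(M, A, alpha=0.5, n_iter=50):
    """Diffuse binary mutation profiles M (patients x genes) over a gene
    network with adjacency matrix A (genes x genes), in the spirit of
    network propagation: F <- alpha * F @ W + (1 - alpha) * M, where W is
    the degree-normalized adjacency matrix."""
    W = A / np.maximum(A.sum(axis=1, keepdims=True), 1.0)
    F = M.copy()
    for _ in range(n_iter):
        F = alpha * F @ W + (1 - alpha) * M
    return F

# Toy example: 3 patients, 5 genes, a small chain network 0-1-2-3-4.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
M = np.array([[1., 0., 0., 0., 0.],   # patient 1: gene 0 mutated
              [0., 1., 0., 0., 0.],   # patient 2: gene 1 mutated (a neighbour)
              [0., 0., 0., 0., 1.]])  # patient 3: gene 4 mutated (far away)
F = network_smooth(M, A)
# Patients 1 and 2 now have similar smoothed profiles, patient 3 stays apart.
```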
Maybe this is the only thing I can show: at the end, you can test empirically, when you have different ways to represent your data and you fit a model, how well they work. Overall these are different cancers, the performance is not great, but the only thing I want to show is that in some cases just changing the representation, from the raw data to another one, or to yet another one, has an impact on the performance. So here it is the same data and the same learning model; the only thing we change is whether we start from the big binary vector like this, from a transformation of it, et cetera, and sometimes there is a significant difference in how well it works.
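The protocol behind such a comparison is simple to mimic: same data, same learning algorithm, only the representation changes, and you compare cross-validated performance. The sketch below uses random placeholder labels and a logistic regression with AUC instead of the survival model of the slide, just to show the structure of the experiment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 200, 2000
M = (rng.random((n, p)) < 0.01).astype(float)    # raw binary mutation profiles
y = rng.integers(0, 2, size=n)                   # placeholder outcome

representations = {
    "raw binary profile": M,
    "total mutation load": M.sum(axis=1, keepdims=True),
    # "network-smoothed": network_smooth(M, A),  # e.g. the transform sketched above
}

model = LogisticRegression(max_iter=1000)
for name, X in representations.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    print(f"{name}: cross-validated AUC = {auc:.2f}")
# With random labels everything sits near 0.5; on real data, the interesting
# output is how the ranking of representations changes for a fixed model.
```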
Just to say that this is also a domain with high dimension and too few samples, and your initial choices of representation or regularization matter, right; they matter because they change the final performance of your model, so it's really worth thinking about it. I don't have any secret recipe; it's all very empirical and based on assumptions we try, but it's a field where I think there is still a lot of room for improvement in, for example, how to represent a profile of mutations. Alright, I will stop here, and I'm really sorry to be so long, but in conclusion my message is just that there are challenges in trying
to extract knowledge from genomic data, and many of these challenges are not biological or medical, they are mathematical and statistical. I mentioned regularization and representation as two ways, not independent of each other of course, to try to do something about it. More importantly, I think there is really a subtle interplay between knowledge in biology on the one hand and how we can put that into a mathematical or computational framework on the other, and the intuition for why one representation works or not is often a mix of biological reasons and purely mathematical reasons. Thank you.
We have time for one or two questions, that's it. So, for biologists it's obviously very worrisome that the mathematics you do with the data gives you different answers, and the question I'd like to ask, and I'm not sure if there's an answer, is: is the reason why the processing, the mathematics, is so critical that the quality of the data isn't good, or is it that the complexity of the problem is very high? Well, I guess it's a mixture of both, though I insisted not on the quality of the data; I insisted on the fact that, even
if the data were perfect, you are still, for example, in too high a dimension with just a few samples, or equivalently (this relates to your second hypothesis) the model is too hard to learn, meaning that learning even a linear model in high dimension is hard, independently of the noise. Now, the fact that you have noise doesn't help. I think, more or less, I consider these as additive problems: having noise makes the problem hard, having high dimension makes the problem hard, and the difficulty is somehow the sum of the two. So it's easier if you have less noisy data and if you
have more data and lower dimension. I happen to routinely deal with, you know, LASSO models, and my experience with the data is that prediction doesn't work, you look at which cases are hard to predict, there are usually a few bad apples, you throw those out, everything works beautifully, and then if you zoom into the bad apples you can justify it. Usually it's obvious, you know, an error in the class label: this guy actually died, didn't survive. I'm just wondering if in your experience there's something similar...
Yeah, so I would say this happens all the time, and I completely agree with you; in fact, even in all that I presented, I didn't talk about some other pre-processing, which is typically that we will remove outliers, this kind of thing. Now, this being said, I don't really agree on the, I mean, you didn't
claim that, but on the idea that once you do that, then the problem is solved. Then we enter the world of the other problem, which is that, okay, even if there is no outlier and no noise in the data, we still have to learn a big model, and LASSO in high dimension is not robust, for example: if you select genes, the selected genes will just not be the same even after you remove the outliers.
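That instability is easy to reproduce with a toy experiment: fit the same LASSO on two random halves of a simulated high-dimensional dataset and compare the selected features. This sketch is mine, not the speaker's experiment.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 5000                        # few samples, many "genes"
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:10] = 1.0                         # only 10 truly relevant genes
y = X @ beta + rng.normal(size=n)

def selected_genes(idx):
    """Indices of the genes with non-zero LASSO coefficient on a subsample."""
    model = Lasso(alpha=0.1).fit(X[idx], y[idx])
    return set(np.flatnonzero(model.coef_))

half1, half2 = np.split(rng.permutation(n), 2)
s1, s2 = selected_genes(half1), selected_genes(half2)
print("selected on half 1:", len(s1), "| on half 2:", len(s2),
      "| in common:", len(s1 & s2))
# The two gene lists typically overlap only partially, even on clean,
# outlier-free simulated data: that is the high-dimensional effect.
```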
But I fully agree with you that, practically speaking, it is very important not to blindly apply these techniques to the data, but to check what the data are. You always do a PCA, you visualize your data, and so on. Okay, thank you. Thank you.