The proteome in context
Formal Metadata
Number of Parts: 34
License: CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers: 10.5446/50903 (DOI)
Transcript: English (auto-generated)
00:16
Good morning, thanks for showing up early. We are a technology laboratory and we work with proteins,
00:25
we try to measure proteins. And it is an interesting question right now, and I'm actually pleased to be here because it also forced me to think about some things which we normally don't think about: how do genomes and cell fate relate to each other, and what role do proteins
00:46
have to play in this equation. And I think they have a large role to play and I'll try to give first a little bit of general consideration and then I will point to a few issues where we
01:00
think proteins can contribute. I think we live in a very interesting phase. This was a very interesting meeting to me to see some of these very intricate, extremely complicated mechanisms like trafficking that have been studied for decades and then there's the other world
01:21
where people generate a lot of data, and I think we need to somehow bring these two worlds together. This is kind of the topic of my talk. So we are a technical laboratory and I don't want to talk about techniques. I just want to give kind of a status report to calibrate the expectations and the capabilities. So we have focused in the field of protein
01:47
measurements for a long time on exhaustively measuring as many proteins in a sample as possible, and I can say this has now been achieved. We can probably measure virtually any protein in a sample. There are a few glitches here and there, but that works well. If we want to relate
02:03
genotype and phenotype, it is not sufficient to measure one proteome; we need to work with cohorts, because only then can we start to see how a system reacts to specific changes, for instance genomic changes. So there has been work ongoing to get to a stage where genomics
02:23
has been for a long time, and that is to have a large number of samples, conceivably hundreds of samples, to have many analytes, preferably all of them, in this case proteins, analyzed, and to measure them reliably and quantitatively to generate a data matrix that has no or very few missing values.
02:43
So this has been a huge challenge. I don't want to talk about why this has been a huge challenge but recently in the last few years we and other groups have developed massively parallel mass spectrometric techniques that are kind of the equivalent to next generation sequencing which allow us to sequence many peptides at a time and basically cover, reproducibly cover
03:07
a subset of the proteome. We cannot cover the whole proteome at this time in many samples, but we can do a certain subset, several thousands of proteins, with very high consistency and quite fast. So this is the kind of experimental basis that I will take off from.
03:26
So we can say that each sample, let's say a cell extract, a tissue extract, a biopsy sample, is converted into a single digital file by a mass spectrometric technique which I'm not going to further discuss. This is fast: if we could take, for instance,
03:41
in the clinic a biopsy in the morning and have the results in the evening, we can do about 20 such samples a day with one machine. This is quite fast also in view of genomics but we cannot do the whole proteome, we can do about 5,000 proteins in a sample. But with very good CV,
04:02
with very few missing values if you have a cohort, and the measurements are quite precise, so you can think of it conceptually as about 50,000 western blots per sample. So this is kind of what we're doing.
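As a rough illustration of the kind of cohort-scale data matrix and quality summary described here, a minimal Python sketch follows. The cohort size, noise level, and missing-value rate are invented for illustration only; this is not the actual acquisition or analysis pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_proteins = 200, 5000          # hypothetical cohort of the size discussed

# Simulated log10 protein intensities: each protein has its own mean abundance,
# measured with modest noise across all samples of the cohort.
true_means = rng.uniform(2, 8, size=n_proteins)
matrix = true_means + rng.normal(0, 0.1, size=(n_samples, n_proteins))

# Inject a small, arbitrary fraction of missing values, as in a real cohort matrix.
matrix[rng.random(matrix.shape) < 0.02] = np.nan

# Per-protein coefficient of variation (linear scale) and missing-value fraction.
linear = 10 ** matrix
cv = np.nanstd(linear, axis=0) / np.nanmean(linear, axis=0)
missing_fraction = np.isnan(matrix).mean(axis=0)

print(f"median CV across proteins: {np.median(cv):.2%}")
print(f"median missing values per protein: {np.median(missing_fraction):.2%}")
```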
04:22
This is also applicable to modifications and protein interactions. So this is where I would like to start, and this is just kind of the technical base, without explaining how this works; I'm happy to do this if someone is interested. So now, if you read the literature and these large-scale genomic efforts, we read that thousands of genotypes
04:46
can be measured in a cohort. Sometimes these are international consortia and for instance cancer versus control is one of the most dominant areas where this is applied. And of course we have also a lot of quantitative phenotypes from imaging, from the clinics
05:05
and from diagnostic tests. And so the big question, I think this is one of the questions this session intends to address, is how do we make a link, how do we make projections from the genotypic variants that exist in a population or in a cohort
05:23
towards a phenotypic space and this in the clinical sense of course is healthy or disease but it could be any, it's a general problem of course, could be any phenotype. So we would like to predict phenotypes from their molecular origin which mostly is based on genomic data.
05:44
So if you want to make predictions, this is of course one of the hallmarks of science, and there are many fields, engineering, physics, which have developed very high-level skills in making predictions based on models. So I use just this clock as an example, as a system which is
06:04
precisely understood and where we can make fairly precise predictions. So we know, if we know the state of the system at a particular time, this clock, and if we know some of the parameters then we will be able to say for virtually any time in the future where the clock
06:22
will be, because that's basically the purpose of a clock. We can also predict quite precisely how the system reacts if we change something in it. For instance, if we make the pendulum a bit longer, we can predict what effect this has on the readout up here. So this works well for
06:45
systems with a moderate number of parts and a basic model of how they interact. This is of course a rather straightforward system, and I would just say there are equations, first principles, into which we can plug data, and we can play in a computer with
07:02
parameters, and then we will get a fairly precise outcome, assuming that this system operates under idealized conditions, for instance as regards air friction and so on.
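To make the clock example concrete: for an idealized small-angle pendulum the period follows from first principles, T = 2*pi*sqrt(L/g), so the effect of lengthening the pendulum on the readout can be computed directly. A minimal sketch under those idealized assumptions (no friction, small amplitude):

```python
import math

def pendulum_period(length_m: float, g: float = 9.81) -> float:
    """Period of an idealized (small-angle, frictionless) pendulum."""
    return 2 * math.pi * math.sqrt(length_m / g)

L = 0.994                        # roughly a one-second (half-period) pendulum
T0 = pendulum_period(L)
T1 = pendulum_period(L * 1.01)   # make the pendulum 1% longer

# A 1% longer pendulum swings ~0.5% slower, so the clock loses time predictably.
seconds_lost_per_day = 86400 * (1 - T0 / T1)
print(f"T0={T0:.4f} s, T1={T1:.4f} s, clock loses ~{seconds_lost_per_day:.0f} s/day")
```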
07:27
So in biological systems we also try, of course, to get to predictions, to generate predictive models, but we have seen over the last few days, and this was actually discussed quite extensively, for instance on Tuesday evening in the discussion, that this is very difficult to achieve. Statements were made that this is usually equated with systems
07:44
biology, and statements were made that this has been disappointing. And I would agree with that, but there are some success stories where really quite predictive models have been established in biology: oscillators, the bacterial motor, one of the
08:01
toy, really well-studied problems, cell cycle regulators. But I would also agree that this has been disappointing from the point of view of generality, of being able to make general predictions. So good predictability has been achieved in these relatively limited examples, but they are relatively simple. They are not generalized models; they would have a hard
08:26
time explaining a perturbation somewhere else in the cell, how, for instance, the bacterial chemotactic motor acts if the cell receives two independent signals. This is hard to predict from these models, and they are presently not scalable to really
08:43
complex systems like trafficking or other situations that were discussed in this conference. So while there have been successes, how this approach, a very highly mechanistically driven approach based on understanding the wiring and the components of a system, can be scaled
09:04
to larger systems is a huge challenge. And here I just want to make the point how big the challenge is if we want to go from something very confined like a bacterial motor to making statements about a whole cell. So here a while back we worked together with the group
09:25
of Jürg Bähler to do basically a molecular inventory of a cell, which has also been featured quite prominently in this symposium; these are S. pombe cells. So we tried to use
09:41
what were at the time the best methods we had available to precisely quantify, basically count, the RNA molecules, the transcripts, and the proteins in S. pombe cells that were grown at different conditions, basically a starved condition and an exponentially growing condition.
10:02
And this was, I mean to us, certainly to me, these were astounding numbers. So we could see that a gene produces between 30 and about a million copies of proteins per cell. So it covers an enormous dynamic range and of course the question will be how is this dynamic range
10:23
maintained and how is it regulated. The median protein copy number is about 4,000, not quite 4,000, and the median mRNA copy number per cell was about two and a half. These were to me extremely surprising numbers, because this basically means it operates entirely in a stochastic domain, so in a population you must have
10:42
cells where some have none, some have maybe three, some have five, and if we then make predictions from, let's say, transcript measurements, if we put weight on an increase from two to four transcripts, we have to ask of course what that means, because we operate essentially in a stochastic domain. But here, at the protein level, we don't, because
11:02
there are not going to be cells present that have zero protein while others have 10,000. They will always have some variation around these 4,000, but around the mean; there are very few cells that have none. So this already indicates that we probably operate in a conceptually quite different domain when we work with transcripts versus proteins.
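To illustrate the stochastic-domain argument with a toy calculation: if, purely for illustration, copy numbers per cell were Poisson-distributed around the quoted medians (about 2.5 mRNA copies versus about 4,000 protein copies per gene), the expected fraction of cells with zero copies differs enormously. The Poisson assumption is made only for this sketch, not as a claim about the measured distributions.

```python
import math

def poisson_zero_fraction(mean_copies: float) -> float:
    """P(0 copies) for a Poisson-distributed copy number with the given mean."""
    return math.exp(-mean_copies)

for label, mean in [("mRNA (median ~2.5 copies/cell)", 2.5),
                    ("protein (median ~4000 copies/cell)", 4000.0)]:
    print(f"{label}: fraction of cells with zero copies ~ {poisson_zero_fraction(mean):.2e}")
```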
11:21
The total number of RNA molecules in these cells was about 40,000, which is basically not many, and means that from an energetic point of view it is very cheap to make them, whereas there are almost 100 million protein molecules in each one of these cells, which means to maintain
11:45
those, to control those is very expensive for the cell. And another issue which is oftentimes not really considered is that the protein concentration in these cells is more than 300 milligrams per milliliter. It is about a third of what crystallographers achieve when they squeeze the
12:03
proteins into a crystal to do X-ray diffraction. It's an enormously high concentration, actually it's a miracle or it's astounding that the proteins do not crash out because any biochemist knows if you want to extract these proteins from the cell and you do in vitro experiments, if you go
12:21
beyond, let's say, 10 milligrams per milliliter, the proteins tend to largely precipitate out. So how the cell maintains this concentration and can carry out its functions is actually an astounding feat, which I think is not so often considered. And so most of the in vitro biochemistry, reconstitution experiments and so on, are done at concentrations about two orders of magnitude
12:45
below what actually happens in the cell. So that's just to indicate that if we talk about cells, we talk about a very complex system, and these classical mechanistic models have a very hard time reaching any kind of comprehensive prediction of this system. So now, given this
13:07
situation, and coming back to the question posed, or understood to be posed, for this morning, we can ask: how are we doing at predicting phenotypes from genetic
13:22
variation, which can now be very precisely measured? And the answer is: we're not doing well at all, and this is of course not just our group, this is in general; we have great difficulty making this link. So we can ask a few simple questions which we should be able to answer but can't, and if anyone here would venture
13:44
to say that they generally have an answer to these questions, that would be very good to hear. So we cannot, for instance, predict accurately what the effect of any inherited or somatic mutation is on the phenotype. We could take a particular background genome, and we should have a model where we introduce a mutation anywhere, let's say in
14:05
coding sequences, and predict what effect that has; this has not really been achieved. We do not know how two or more mutations combine: do they cancel each other out, do they synergize, are they neutral? We do not know how the same mutation affects
14:23
different individuals which is a huge issue in medicine particularly in this emerging field of personalized medicine and we also don't know how copy number variations in an individual are processed. So these are seemingly simple questions which we should be able to answer
14:42
and I believe that before we can answer any of these questions, or at least have a path to answering them, it is presumptuous and maybe too early to really go into the clinic and try to make statements about genotypic variability and its clinical outcome, with the exception of course of some Mendelian traits where it is very clearly understood
15:04
what the molecular basis is and how this translates into phenotype; most clinical phenotypes are not so simple. So to summarize this part, I think we are operating in an interesting time in the life sciences, and I try to summarize this in this
15:26
graph here. So we have an x-axis, which indicates the data that is available, and the y-axis indicates the amount of first principles or theory that is available to a field. We have certain fields like engineering and health technology, biotech or biomedical technology,
15:45
people who make a device for instance to monitor heart rate or blood pressure, which are in a very comfortable position because it's essentially engineering: there is a lot of theory, a lot of first principles, thermodynamics, electrodynamics and so on, which are used very widely
16:01
and work extremely well. So for them it is relatively straightforward, with a limited amount of data, to get to a predictive model. In biology we don't have this luxury; we have very few first principles, I come to this in just a second, but we now have increasingly more data, and of
16:20
course we also heard about completely different types of data that exist and are generated by cell biologists with imaging and labeling, where you can follow a specific molecule exactly, where it goes, how its amplitude varies and so on. So this is also of course enormously dense data, so we operate now, in life science and medicine, in a space where we have
16:43
a lot of data available. But how these data in this genotype-to-phenotype space relate to each other is to me a big question, and I'll come back to this: I think simply correlating data is not going to work. So we have to find a way
17:02
to translate these data into predictive models, and this is the topic that I'm going to address in the following part of my talk. We incidentally have whole classes of scientists, or people who have very highly influential and important roles in society, doctors, lawyers, CEOs,
17:25
politicians, who neither have a theory or first principles for how the system actually works, nor do they have data with which they can actually do tests and experiments. I mean, a doctor cannot basically clone the patient and do experiments, or a
17:45
treatment of some type on some and not on the others, so they basically have to accumulate an empirical base which they apply. So at least in the life sciences we are now in a domain where we can use empirically acquired data from strategically designed experiments
18:05
that help us to make hopefully accurate predictions. So what are the first principles that we use in biology? We do not have models like the physicists do, where you can vary parameters and it simulates and makes an accurate prediction. We have some principles
18:23
that we can apply for instance we have Mendel's law, Mendelian inheritance, we have the principle of Avery, DNA as a transforming principle. This of course is now taught to every undergraduate student. We have the one gene, one protein, one function notion from Beadle and Tatum.
18:44
We have the central dogma. We have Linus Pauling's, I think, extremely important insight, the idea of a molecular disease: that a particular mutation in a particular gene leads to a change in the protein, a change in the structure of that protein, and that manifests itself as a
19:04
complex phenotype, sickle cell anemia. We have the notion that proteins only function if they have a three-dimensional structure, and of course we have the most recent principle, which I think is a fundamental principle that we need to consider in this genotype-to-phenotype
19:23
relationship: basically that of a modular biology, that molecules do not act by themselves but act in modules or complexes, or however one would want to call that. So we try to come up with a concept that would integrate many of these principles and is experimentally addressable.
19:45
So we call this the proteotype model, and this is the notion that we are pursuing experimentally: that if we were able to measure this term, or this entity, the proteotype, we would have
20:01
an extremely informative entity that is fundamental to the translation of genotypic variability into phenotype. So how do we define this proteotype? We define it as the composition of the proteins in a cell, basically the inventory, and the way they are organized
20:23
in modules. So this addresses many of the principles that are monuments in life science research; especially it addresses the notion of Lee Hartwell and colleagues of a modular biology, and it addresses the notion that genetic variation happening
20:45
somewhere in a gene affects the structure and the function of these modules, which is Linus Pauling's principle. So we would postulate that if we are able to measure this proteotype,
21:00
we would be able to make a useful link between genotypic variation and phenotype. Specifically we would predict, and I will then expand on some of these points a little further, that this proteotype, so the composition as well as the organization of the proteins, is the result of complex processes in multiple layers that are poorly understood. We know that there are
21:26
transcriptional models, there's models that predict how RNA interference or microRNAs affect gene expression, there's models that define translational control and of course we have protein kinases that affect protein phosphorylation and I think for all of these
21:45
levels there is information, but everyone would be hard pressed to integrate this in a computer, in a model, into a comprehensive predictive system, and we think that the cell knows how
22:00
to integrate the control events at each one of these levels and basically generates one entity, the proteotype, which is the result of control events at various levels. We would further assume that the proteotype indicates the response of a cell; we know of course that cells react to external perturbations or genetic ones, I'll come back to this,
22:25
and that the cell knows how to integrate them, how to react, and if we can measure the reaction then we would learn some biology. So we would further assume, this is basically the principle
22:41
of Beadle and Tatum and also of Linus Pauling, that the proteotype determines the biochemical state and is therefore very close to defining phenotypes. Further, we do not ask what a protein or gene product does; this is of course largely known. We know that a kinase phosphorylates certain residues, a ubiquitin ligase ubiquitinates
23:05
certain residues, a protease digests certain proteins; we know that, we can measure that in vitro. But the question we try to address is how the proteotype, or the system, responds to alterations. So it's not just that we want to say a certain element has a certain
23:24
function; we would like to see how the system reacts if this function is changed. And then we would represent this as a system, this proteotype, which has different levels of resolution; eventually there will be higher-level resolution all the way down to crystal
23:43
structures or atomic level but at the moment I think we have to assume that certain areas of biology are known in great detail and can be represented in very dense, with very dense data and others are not. And I think the whole discussion about trafficking is one field
24:03
where this enormous amount of prior information has been accumulated and we'd like to integrate this data into a larger representation at the level of the proteins. So this is kind of what we try to achieve and the considerations that basically indicate
24:23
why we believe it will be very hard to make inferences or predictions about how genomic variability, or for that matter environmental insults, affect the cell if we simply do genomic measurements. So now, in the following, I would like to expand on a few
24:43
of these principles with actual data. The first question I would like to address is: how does a simple genomic perturbation affect the proteotype? So now we go into an experimental design where we introduce into an otherwise invariant genotype, in a specific protein, specific
25:08
mutations, and these mutations are derived from medicine; basically they have been associated with specific disease phenotypes, particularly cancer. We ask what effect these mutations
25:23
have if they are in the same gene but of different type: how do they affect the modularity, the composition of the respective protein module, and what effect do these changes in this module have on the cell's protein landscape. So the experiment is that we
25:49
express mutated forms of a protein and determine the effects on interactions and function. We use the protein kinase DYRK2; we use a protein kinase because it is easy to measure
26:05
the response to these mutations on this protein, because the function of this enzyme is to phosphorylate proteins. We can simply measure whether it phosphorylates different proteins, or none, or additional ones if it is mutated, and we then measure the effects
26:23
on this protein. So how did we end up with DYRK2? We have a computational postdoc in the group who developed a system which she calls the domino effect, where she tries to combine the genomic mutations from cancer genomic data. This is a massive amount
26:42
of data, more than 10,000 complete genomes from cancer tissue and normal adjacent tissue, and she tried to distill this down into mutations which very likely have an effect on protein function. And she does that by basically statistical arguments, looking at the
27:03
likelihood that a molecule is mutated at a particular site, and then she uses prediction tools that ask whether this mutation likely changes the conformation of a protein, or an interaction of the protein that is affected, and she came up with what she calls
27:21
hot spot mutations in about 156 genes, for which we have a fairly high likelihood that these mutations, if they were introduced into an otherwise invariant background, would induce a change in either protein folding or protein interaction. Yes, so the mutations have all been found in
27:56
genomic data, that is, to be mutated in patients that have a certain type of cancer. So now we
28:03
know that many of these mutations are not known to be significant, there are tens of thousands of them, and she simply tried to categorize them into those which likely have an effect on the protein, on the basis of, first of all, frequency of occurrence, and secondly, that they occur in the folded protein at residues which indicate either that the protein
28:25
might be disturbed or that an interaction of the protein with something else might be changed. So, before I come to the experimental data, this was simply based on
28:43
structure predictions: you predict the structure of the protein and then paint the mutation onto this protein, and if it is in a region that is predicted to be interactive, it is assumed to have an effect. These predictions are of course not very precise, and I'll come then to some experimental data.
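The filtering logic described here, recurrence across patients combined with mapping onto structurally or functionally sensitive residues, can be caricatured roughly as in the sketch below. The data structures, thresholds, and example mutations are hypothetical placeholders, not the actual "domino effect" implementation.

```python
from collections import Counter

# Hypothetical inputs: observed missense mutations (gene, residue position) across
# tumour genomes, and residues predicted to sit in folded cores or interaction interfaces.
mutations = [("GENE_A", 382), ("GENE_A", 382), ("GENE_A", 275),
             ("GENE_B", 175), ("GENE_B", 175), ("GENE_B", 175)]
interface_residues = {("GENE_A", 382), ("GENE_B", 175)}

def hotspot_candidates(mutations, interface_residues, min_recurrence=2):
    """Keep recurrent mutations that also fall on structurally sensitive residues."""
    counts = Counter(mutations)
    return [site for site, n in counts.items()
            if n >= min_recurrence and site in interface_residues]

print(hotspot_candidates(mutations, interface_residues))
# -> [('GENE_A', 382), ('GENE_B', 175)]
```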
29:04
Yes, how do you distinguish variants, because variants can be polymorphisms or causal? By the frequency.
29:21
So now, this is one of these roughly 160 proteins that came out, which we then followed up. It is a protein kinase called DYRK2, an interesting enzyme: it acts as a module, and this is the kinase itself; one of the proteins it binds to is a ubiquitin ligase, so it seems to sit at an intersection of protein phosphorylation and
29:46
ubiquitination, and then there are two other proteins here. So the core is a tetrameric complex that we want to study; it has a number of disease-associated mutations, it is a tetramer, and some of the subunits have, as individual proteins, been
30:05
crystallized, so the structure is known. In this genomic data set there are 81 mutations that have been mapped to this protein; some of course have no disease association, some do. Maria then filtered this down to a
30:26
small number of modifications that we now test. There is a truncation where the C-terminal tail is cut off, an event that happens relatively frequently in cancer; we have two point mutations at the site where the truncation happens; we have a mutation
30:43
which is in the activation loop of the kinase and is therefore likely to affect its activity; we have a mutation in the catalytic site which is thought to render it inactive; and there is a mutation here in a region where we don't know what it is doing. So all these mutations, which are labeled here, are not
31:06
dreamed up; they have been occurring frequently in patients with certain diseases. We introduced them, this is the work of a postdoc, Martin Mehnert, who generated cell lines that express the mutated form of the respective protein, and then we measure the
31:26
interactions around the core. So we see that each mutation, even though it may just change a particular residue somewhere in the protein whose function you don't really know, each one of these mutations has
31:43
a different effect on the module. The way we read this graph is that this is the mutation, this happens to be the one which renders the kinase inactive, and these colors are the interactors, the three interactors of the tetrameric core
32:02
module. We see that if we inactivate the kinase by mutation, there is a substantial fall-off, or a reduction, in the binding of some of the subunits; there are some mutations which have very little effect, but each one does have an effect on the interactions, and so they all perturb
32:22
the module in some way, some more, some less, and as expected the biggest change occurs with the truncation mutant where the C-terminal part is missing, which is fairly plausible. So what we know so far is that the DYRK2 mutants which have been derived
32:40
from clinical information show significant but varying impact on the assembly of this kinase core complex. How do you measure interactions? We use two methods: one is affinity purification mass spectrometry, where we tag the proteins and pull them out with their interactors, and the other is called BioID, where we express a modified protein which basically labels chemically
33:07
the surrounding proteins. The results of the two are not quite identical, but they largely converge.
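One back-of-the-envelope way to express how far the two interaction readouts "largely converge" is the overlap of their interactor lists, for instance as a Jaccard index; the protein names below are hypothetical placeholders, not results from the talk.

```python
# Hypothetical interactor sets recovered for the same bait by the two methods.
apms_hits = {"SubunitA", "SubunitB", "SubstrateX", "ChaperoneY"}
bioid_hits = {"SubunitA", "SubunitB", "SubstrateX", "NeighbourZ"}

shared = apms_hits & bioid_hits
jaccard = len(shared) / len(apms_hits | bioid_hits)
print(f"shared interactors: {sorted(shared)}, Jaccard overlap: {jaccard:.2f}")
```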
33:29
And the related question: is this in the context of wild-type proteins, so is this expression on top of wild type, or a knock-in? This is
33:42
expression on top of the wild-type protein; I'll come to a knock-in in a second. So the message so far would be that oftentimes we would say, okay, a mutation affects this gene,
34:05
the gene is maybe eliminated or somehow modified, and we would then like to make an inference from this mutated gene to phenotype. What I'm trying to show here is that each mutation, even at this level of the organization of this protein with its core module, has different effects. Does it act as a dominant negative on the
34:24
endogenous one? No, it's just expressed on top. But doesn't it affect the endogenous activity, because it will take the substrate? No, it doesn't, because the endogenous one is basically transparent to these techniques. It could be that there are titrations, or that
34:42
if you express a tagged protein it mops up the interactors and changes the equilibrium, yes. It seems to me that what one is doing is allowing nature to do the mutagenesis, and then there was a phenotype that came out because we had the patients. You're remarking that this would have been equivalent to my actually starting from scratch with my mutagenesis and my protein and then getting my readouts, right? Yes. Right, so it just allowed evolution to do the experiment for me.
35:01
So these are mutations that are filtered, and of course we don't say that these mutations cause cancer, but they are statistically associated with certain tumors. What I'm trying to do here is to say that these mutations, which have occurred through evolution or selection in the cancer, even though they affect the same
35:25
gene and are usually lumped together, have an intricate and idiotypic effect on this module, and if you believe that Linus Pauling's theory of molecular disease was that the mutation affects the structure and the function of a molecule, then it means that each one of these
35:44
has an idiotypic footprint. That seems to me the same experiment I would have done in the laboratory if, let's say, I'm doing a screen of mutants; I mean, I could just drive mutagenesis and select. Yes, of course,
36:03
but we try to make arguments, we try to find ways to use the genomic variability which is associated with disease, like cancer, and to help
36:20
make the link from these mutations, or explain their relation to the phenotype. Of course you could take any gene, mutagenize it, and see what happens, but that's not quite the question we ask here, because there would be no phenotype. Why? That's not true. What would you take? Well, let's suppose I'm working on
36:41
a platform where I have a readout which is going to be a certain phenotype; it's not disease, but it's a failure to internalize. I do my mutagenesis, I accelerate it because I do it in the lab, I get the mapping onto the protein, and then I cluster, you know, the properties. Yes, of course; this has been done. Here you let nature do the
37:04
experiments for you. Of course there's a big difference, which is that in nature there is no... When you are in the lab you work in an isogenic background. Not necessarily. Not generally; it depends. If you do genetics that's generally what you do, and here
37:22
that's the first thing you do; and second, you can start to work on your mutation exactly when you have isolated the phenotype, which means that you don't let the system evolve further. Generally, in the cancer you don't have one mutation, you have many mutations:
37:42
the first is the one that might have caused the cancer, the second one is the one that allowed it to survive the first mutation, so you have a very long evolution of a complex pattern. I understood that, but I still conceptually don't understand the difference; to me the difference is that in one case you allowed millions of years
38:04
to do this, whereas in the other one I just did it accelerated. So okay, maybe because I'm doing it accelerated I have less time to do the more widespread thing, or maybe I'm doing less subtle phenotypes. Yes, so I agree: if you work
38:24
with yeast or with flies, of course this has been done a lot, even with mice; there have been huge consortia that basically do mutagenesis and see what phenotypes arise. And I would just try to make the point here that if you then say the gene was mutated, this is not sufficient granularity to eventually make a mechanistic link, because different
38:47
mutations, even in the same gene in a particular genetic background, have very different effects on the modularity and, as I will show now, also on the function. But I think it has a much more fundamental effect, which is that when you do that in the lab you are looking for a
39:04
phenotype. I have done experiments, for example, where I'm looking at a particular pathway, I do, let's say, a perturbation in a gene, eliminating it, and then we actually see compensations in the system, and sometimes we don't actually see
39:21
a readout on the phenotype; the compensation is such that you don't see a readout. But since I know the module, as you are pointing out, I'm looking, let's say, at the level of expression of some other proteins, and then you see there was compensation, that is, there was no phenotypic readout because the system compensated. Right, but this is
39:42
generally a rare case; in general, when you do experiments in the lab, you go for the strong phenotype, and what is remarkable is that when you look in nature you never have the same mutations. Okay, let me continue and we can discuss later.
40:07
Yes, but I think the principles that I am trying to elaborate here also apply to mutations that are generated by random mutagenesis. But the point is, it's not unexpected, because you're looking at a very specific function.
40:22
Yes, so I'm just trying to say that various mutations which have been selected and, through statistical arguments, been associated with the phenotype affect the protein
40:43
differently at the level of organization. And now we ask what this does at the level of its function, which is to phosphorylate other proteins. So we can find on this protein a number of phosphorylation sites; we measured them and quantified them, and we can see that again, for each mutation, the few phosphorylation sites
41:04
on the protein that are measurable have a different pattern. So not only do the mutations that have been selected and expressed affect the wiring, basically the modularity,
41:20
they also affect the phosphorylation state of the protein itself. And then we also carried out a knock-in experiment and a study to see how these mutations, with presumably perturbed modules, affect the overall phosphorylation landscape, the phosphorylation pattern, associated with this protein. So the experiment was
41:41
to take cells, knock out the intrinsic kinase with CRISPR-Cas, knock in the mutant forms of this kinase so that they are expressed, and then we isolate proteins from these cells, purify phosphopeptides, and analyze these phosphopeptides in a mass
42:01
spectrometer. So we quantify about 1,200 or so phosphorylation sites in all these mutants, and by simply clustering these, the phosphorylation patterns look quite similar. That means that these subtle mutations in this protein do not radically change the overall protein phosphorylation landscape, which of course is expected, because in the same cell hundreds of other kinases are active at the same time.
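The clustering step mentioned here can be sketched as hierarchical clustering of per-mutant phosphosite intensity profiles. In the sketch below the mutant labels are generic placeholders and the matrix is random data standing in for the roughly 1,200 quantified sites; it only illustrates the kind of computation, not the actual analysis.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
mutants = ["knockout", "C-term-truncation", "kinase-dead",
           "activation-loop-mut", "point-mut-A", "point-mut-B"]   # placeholder labels
profiles = rng.normal(0, 1, size=(len(mutants), 1200))            # log2 phosphosite intensities (placeholder)

# Correlation distance between mutant profiles, then average-linkage clustering.
dist = pdist(profiles, metric="correlation")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=2, criterion="maxclust")
print(dict(zip(mutants, clusters)))
```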
42:25
But when we start to look more closely at which phosphorylation sites are affected, and we focus on this panel over here, we see again the various mutants, now knock-in mutants, there is no more wild-type kinase,
42:42
and we see that there is a set of about 30 to 40 phosphoproteins and phosphorylation sites which change in response to the various mutations, and these phosphorylation patterns again change idiotypically, dependent on the type of mutation. We see that the
43:04
complete knockout has the strongest footprint here, this is the second to last; we have the deletion mutant, where the C-terminus is deleted, which is similarly strong but not as strong as the knockout; and then we have the kinase-dead mutant,
43:20
which is the third one from here, which is again similar to the knockout but not identical; and then we have the other mutations, which affect different residues and have a detectable footprint on the phosphoproteome, affecting specific proteins but not as strongly as the absence of the kinase. So we can of course then look at what
43:43
these proteins do, and this would then provide a link between the activity of this protein, or its modified form, the effect of a mutation of this protein, and specific phosphoproteins. And if we assume that phosphorylation is responsible for modulating activity, we can say that specific mutations in this single protein, DYRK2,
44:06
mutations which have come through selection, basically through evolution, affect various areas of cellular physiology. For instance, some map to
44:23
methyltransferases, an epigenetic complex; we have this protein here, a scaffold protein and activator of GTPases, which people here probably know a lot about; nuclear pore proteins, which we also heard a lot about yesterday; and cell
44:42
cycle regulating proteins. So what we conclude from this is that if we take a number of mutations that have been selected as being related to disease, and if we introduce these mutations in this protein in a cell in an otherwise isogenic background, they affect both the
45:04
organization of this module and its function, and they do it in a highly modulated way, and the function of this protein complex, which is a kinase, affects different parts of the cell's physiology. So this points to a lot of complexity in how these mutations
45:25
mechanistically affect physiological processes. This is basically what we conclude from this, and the overall conclusion is that the complexity of the cellular response to a simple genomic perturbation, one mutation, is beyond the reach of mechanistic models, because
45:43
we have no good way to predict a priori which parts will be touched by, for instance, a kinase or a ubiquitin ligase or a protease that is mutated. Okay, so I see I'm of course getting very late; I won't get through everything, so we'll see.
46:07
Yeah, that's fine. So I won't get through the third part, which is actually also... well, anyway. So now I would like to extend this to a
46:23
different situation: we have basically had one background, one mutation, and asked what happens; now we would like to go to a more natural situation where we have a number of genetic variants in a population, and we would like to ask to what extent we can use this natural variation to make linkages, eventually mechanistic linkages,
46:46
for predicting a phenotype. So this is usually discussed controversially, because there are a lot of people, like the author of a by now very famous article, who espouse the idea
47:02
that we don't need to have any hypothesis anymore, we don't really need to understand mechanisms, correlation is enough. So the idea is that if you pile up enough data, measure enough genomes, do genome-wide association studies with a large enough cohort size, we will not need to understand the
47:25
underlying mechanisms, we can simply make correlations and make statements. This is widely used, also in clinical circles. I would like to show that this is probably, well, almost certainly an underestimation of the problem, and that correlation
47:42
will not be enough. So how do we show that? If I may, yes: it's used as markers, or biomarkers; usually it's not meant to be a mechanistic model. It's just that if you have a hundred thousand people with this, this, and this mutation and they are at
48:04
this disease, or at this stage of the disease, you can say there are good chances if you find someone with the same variants. But it's only biomarkers; it's risk, it's not even a marker, it's risks. And I think we know from these now very
48:20
large studies, I mean in some of these large GWAS studies hundreds of thousands of individuals have now been genotyped, and what of course comes out is that the larger the cohort, the more genes show a small signal in these Manhattan plots, and so they produce additional genes, or mutations in genes, associated with a complex disease,
48:45
but they make a very, very small contribution, and how these contributions can be used clinically to make a risk assessment is actually very difficult. So I think, and this is a philosophical point, that if one wants to do a risk assessment or make a treatment decision, one eventually needs to
49:00
know something about the mechanism. Yes, so we would now like to explore how we can use systematically collected data sets in populations, together with mechanistic insights and prior information, to learn something about the system. So this
49:22
is the outline, and this is the system we use. We use this system because we think that before we can make any headway in a system which has controlled, known genomic variability, we will have a very hard time going to an
49:41
outbred human population. So this is an interesting collection of mouse strains which has been generated by an international consortium; we certainly have not contributed to that, we are the beneficiaries of it. Two mice, a C57BL/6 mouse and a DBA mouse, were crossed, then an F1 generation and an F2 generation were generated,
50:07
and out of that, strains were derived; each one of these strains is inbred and genetically identical within itself, and the genomes of these
50:22
strains have the property that alleles from either the one parent or the other have been distributed across these strains, and there are about 180 strains of that. So it's a terrific resource, because we know the genetic variability: it is limited to the alleles that are present, but the distribution of the alleles is of course different from strain to strain.
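The design of such a recombinant inbred panel can be caricatured as each strain carrying a fixed mosaic of the two parental alleles along its chromosomes. A toy simulation follows; the strain count, marker count, and switch probability are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_strains, n_markers = 40, 500

# Each strain is a mosaic of 'B' (C57BL/6-like) and 'D' (DBA-like) alleles: simulate
# recombination as occasional switches along the chromosome, then treat it as fixed (inbred).
genotypes = np.empty((n_strains, n_markers), dtype="<U1")
for s in range(n_strains):
    allele = rng.choice(["B", "D"])
    for m in range(n_markers):
        if rng.random() < 0.02:              # arbitrary per-marker switch probability
            allele = "B" if allele == "D" else "D"
        genotypes[s, m] = allele

print("strain 0 mosaic:", "".join(genotypes[0, :60]))
print("'B' allele frequency at the first 5 markers:",
      (genotypes == "B").mean(axis=0)[:5].round(2))
```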
50:40
So we have used these strains in an experiment; this is an early phase, there is now a much larger data set which I don't want to discuss, but we selected 192 proteins which are relevant for metabolism and selected them for quantification across this
51:03
cohort. They cover some metabolic pathways. We took 40 of these strains, which were fed either normal food or food that makes them fat, so there is an external perturbation: we have a genetic axis, which is the genotype, which is known, and we have an
51:23
environmental, diet axis. We did this for 40 strains, so for 40 strains we measure, in duplicate, under two conditions, high fat or low fat, close to 200 metabolic proteins, and then we want to see how this data set can be used to learn something about the
51:44
genetic effect and the environmental effect on the behavior of these pathways. So this is just showing that the data look good; this is the data table. This is what I said at the beginning: we now have the ability to measure precisely and quantitatively a
52:04
number of proteins across cohorts. This is by today's standards a somewhat low number, but it's focused and, to make the point, it's efficient. So we can now, using QTL mapping, quantitative trait locus mapping, make a link between the presence
52:26
of a particular allele at a locus and the abundance of a protein; this is referred to as a protein QTL. So we assume that one allele causes the protein to be more highly
52:40
expressed than the other allele, from the other parent, and since we have a sufficient number of these measurements, we can relate the presence of a particular allele to the abundance of a protein; this is referred to as QTL mapping. What we see here is that we identify, from these 192 proteins, 44 QTLs, so these
53:04
are loci at which an allele affects the abundance of a protein. Some act in cis, meaning the locus affects the product of that same locus, and some act in trans: the locus affects the abundance of a protein that is encoded by a different locus.
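The QTL mapping step described here amounts to testing, for each marker-protein pair, whether strains carrying one parental allele differ in protein abundance from strains carrying the other. A minimal sketch with simulated data follows (real analyses add covariates, kinship or mixed models, and multiple-testing control, none of which are shown):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_strains = 40
alleles = rng.choice(["B", "D"], size=n_strains)               # genotype at one marker

# Simulated protein abundance with a genuine allele effect plus strain-to-strain noise.
abundance = 10 + 0.8 * (alleles == "B") + rng.normal(0, 0.5, n_strains)

t, p = stats.ttest_ind(abundance[alleles == "B"], abundance[alleles == "D"])
effect = abundance[alleles == "B"].mean() - abundance[alleles == "D"].mean()
print(f"allele effect: {effect:.2f}, p = {p:.1e}")
# Repeating such a test over all marker/protein pairs, and checking whether the marker
# lies near the protein's own gene, yields the cis- and trans-acting pQTLs discussed above.
```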
53:21
This is not super interesting or super remarkable by itself, but when we also measure eQTLs, that is, for the transcripts, which are also measured in these mice, we see that we have a rather similar number of QTLs, of links between an allele and
53:41
the protein, but they have a different behavior. And the different behavior is that protein QTLs are more likely to act in trans than the transcript QTLs, which means that transcript regulation by genetic means is less diverse in
54:02
the cell than protein regulation. I won't get into the effects of the environment. And now we try to learn something, using these data plus prior knowledge about these pathways, that may be interesting
54:25
biochemically or actually clinically. So one of the QTLs maps to an enzyme that is the last enzyme in the degradation pathway of branched-chain amino acids, like lysine or isoleucine; these are degraded in a stepwise manner,
54:50
exactly as the Beadle and Tatum principle suggests, and each step here produces a metabolite as an intermediate product. So now we have a QTL, a genomic locus, that affects the abundance
55:05
of this respective protein, this enzyme, and it is either high or low. So now we can correlate the presence of the enzyme here, which we take as a surrogate for its activity, with the metabolites up here. So we basically do something
55:25
like with a water hose, where we say we constrict the water hose, we have less of this enzyme, and we ask whether the metabolites up here pile up, as would be assumed, and if there is a lot of the protein, a lot of activity down there, we would assume
55:43
that the metabolites up there decrease in abundance because they are processed. Okay, so this just shows that we can do this: the enzyme level is inversely correlated with these metabolites, which are also measured by mass spectrometry. This is exactly the principle of the water hose, constricted or open.
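The "water hose" argument can be written down as a simple correlation, across strains, between the abundance of the terminal enzyme and the upstream intermediates. The numbers below are simulated with a built-in inverse relationship, purely to show the computation, not to reproduce the measured data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_strains = 40
enzyme = rng.normal(1.0, 0.2, n_strains)                        # relative enzyme abundance per strain
metabolite = 2.0 - 1.2 * enzyme + rng.normal(0, 0.1, n_strains) # upstream intermediate piles up when enzyme is low

rho, p = stats.spearmanr(enzyme, metabolite)
print(f"Spearman rho = {rho:.2f} (p = {p:.1e})  # negative: less enzyme, more intermediate")
```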
56:01
We also see that two metabolites up here correlate very nicely, so if one is high the other is high, which means the enzyme down here constricts the whole pathway. So we have now made a link between a genetic locus, the allele that controls whether the enzyme level is high or low, and the abundance of metabolites;
56:27
this is a mechanistic link, because we explain it by the enzyme activity that is present here. Now, interestingly enough, we can find literature that says that this
56:44
intermediate product here, aminoadipate, a small molecule that is generated upstream of this enzyme, has been found in a large cohort, the Framingham Heart Study, to be a biomarker for diabetes risk. So this is of course now an interesting case, because it
57:08
allows us to make the statement that through systematic measurements in genetically perturbed animals we are able to find a link between a genomic variant in a particular gene
57:21
and an enzyme abundance, and this enzyme abundance affects the activity of a metabolic pathway, the degradation of branched-chain amino acids, and if this activity of the pathway at the bottom is low, the intermediates pile up, and they are
57:42
found to be a risk factor for a complex disease. Yes, do you know if the change is due to the DBA background or the Black 6 background? Because as far as I remember, DBA is more susceptible to diabetes. Yes, so there is a whole range of
58:09
disease-related phenotypic measurements in these mice, and this is amazingly complicated. For these BXD mice, about 300 phenotypes have been measured,
58:23
including some disease phenotypes, and many of these phenotypes are quantitative, so you can assign a numeric value to them, and in every case I've looked at, the parents, the DBA or the Black 6, are somewhere in the middle. So you can basically make a plot
58:44
of the numerical phenotypes from strain 1 to 180, and the parents are always somewhere in the middle, and they create offspring, through the reorganization of the alleles, that are far outside the range of the parents. So this is of course outside
59:04
Mendelian inheritance, and this is actually the case for all of these quantitative phenotypes that were measured. Yes. Okay, so I want to summarize this part.
59:20
I think the correlation of proteotype and genomic measurements in genetic reference panels indicates very complex relationships between genetic constitution and eventually expressed information. In various specific and simple cases where there is a lot
59:46
known about the mechanism, we can use this prior information and relate it to the data set that was generated; it is not a super big data set that I showed, but it is a rather substantial one, and we can then reach
01:00:00
a somewhat mechanistic understanding. And I think we need to find ways, this is I think a big challenge for the future, to systematically integrate large-scale data and mechanistic data of the kind generated by many of the biologists here, who have worked for years on a very complex biological system, to
01:00:22
then use these general principles as background and determine how they are modulated, how this background is modulated in a specific case, in a specific genetic background or under specific conditions; and that can certainly be elaborated with large data sets. So my conclusion clearly is:
01:00:44
correlation is not enough. It is a useful tool, but if we think we can simply use correlation of large data sets to get mechanistic, biologically meaningful insights, I think this will not work. I was planning to, but will now skip, showing how the cell processes gene dosage
01:01:05
effects, and I don't have time to do this; I would just like to summarize what it does, maybe in one picture. So basically, we collected a panel of cell lines which from the sequence are
01:01:24
essentially identical or very similar: these are HeLa cell lines, which are very frequently used in laboratories, 100,000 publications, but they are genomically unstable, and so sequence-wise they are similar but
01:01:42
the genomic landscape is very different. So here we map copy number variation in these cell lines, which have been collected from various laboratories where people do experiments, and we see that although these cells have the same name, and they are used in laboratories to do experiments, they are substantially different, not
01:02:02
in the sequence, but in the copy number variation, namely the ploidy of genes at specific alleles. And I just want to draw your attention to this picture here, these are two of these chromosomes, where hot colors always indicate a high number of ploidy and green a low number of ploidy, and
01:02:21
we see that there are very large blocks of chromosomal regions which are amplified in these cells or not amplified, so it's kind of a green and red block pattern; when we go to the transcripts this gets already somewhat diffused, and if you go to the proteins it gets very diffused, so the
01:02:40
effects of this increased or decreased ploidy are interpreted by the cell in extremely complicated ways, and what I do not have time to show is that the organization of the proteins that come out of these increased-ploidy regions into modules is a big buffer; it is one of the
01:03:01
most dominant factors in how the cell modulates the abundance of proteins that are induced by a higher number of copies of a particular gene. So if a protein is known to go into a complex and the other subunits are not also augmented,
01:03:24
that protein is buffered down and basically degraded. So that is one mechanism, not the only one, by which these copy number variations are interpreted by the cell, or lead to a very refined and actually strongly buffered landscape at the level of the proteins, and therefore at the physiologically relevant level.
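The buffering effect described here is often summarized as an attenuated copy-number-to-protein relationship for complex subunits compared with proteins that act alone. A toy comparison (slopes, copy numbers, and noise are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
cnv = rng.choice([1, 2, 3, 4], size=200)       # gene copy number across cell lines (toy)
log2_cn = np.log2(cnv / 2)

# A protein that acts alone follows its copy number almost proportionally; a complex
# subunit whose partners are not co-amplified is buffered (attenuated slope).
free_protein = 1.0 * log2_cn + rng.normal(0, 0.2, cnv.size)
complex_subunit = 0.3 * log2_cn + rng.normal(0, 0.2, cnv.size)

for name, y in [("non-complex protein", free_protein), ("complex subunit", complex_subunit)]:
    slope = np.polyfit(log2_cn, y, 1)[0]
    print(f"{name}: protein change ~ {slope:.2f} x log2(copy-number ratio)")
```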
01:03:45
So with that I would like to finish and try to show that, on the topic of this morning, we can now measure, with amazingly effective tools, a lot of genomic variability from very large
01:04:04
cohorts, thousands or tens of thousands of individuals, we have lots of phenotypic information and the way we bridge this, I believe, needs to involve proteins, not just the abundance of proteins, but also their modularity, and I think if we can make more headway into
01:04:22
basically defining by measurements this quantitative proteotype, we will be in a much better position to link genotypic variability to phenotypes. So these are the collaborators whose work I showed. The last part, which I largely skipped,
01:04:43
is the work of Yansheng Liu, together with Wolf Hart, who is a colleague at ETH; the DYRK2 project is the work of Martin Mehnert, the postdoc; and the BXD project is the work of Evan Williams and Yibo Wu, two postdocs, and we work with the group of Johan Auwerx at EPFL, who created and maintains these BXD mice.
01:05:05
Thank you for your attention.
01:05:51
In each module, either in a complex man-made machine or in biological machines, it
01:06:01
is possible to distinguish the plant, the basic process or manufacturing plant, and the control system, and engineers distinguish between those two components. It is relatively possible, let's say, if not easy, to understand a module and the
01:06:20
control systems, if we can work them out. This goes back also to what I tried to show this morning. Once you have identified the modules and the regulatory systems, as I said, it's not hugely complex. It's possible to break down the complexity of the overall system, the cell or the organism, into
01:06:42
modules and the regulatory system, and the coordination among them. So this will be, I think, a way in which we could maybe try to predict complex systems: by breaking down the complexity into modules and into the regulatory layers.
01:07:00
This is what the engineers do, basically. Yes, I agree with that, and I think this is certainly the goal. The problem is that we are now reasonably good, not perfect, but reasonably good at determining the modules that actually do the work, but about the control systems we don't really know enough. I mean, we have
01:07:25
transcription models, which is one level of control, we have microRNA control, we have translation control, we have phosphorylation control. I think the analysis needs to start from a function and understand what the control machinery is, the control layer on that function, and this is not done; it's very neglected in cell biology.
01:07:46
Yeah, I think this is true. However, it's also very complicated, because it's not a single level of control that controls the system, it is many that contribute to the control of a system. And so that's why I think that we should work towards figuring out these control mechanisms, of course, but in the
01:08:04
meantime, for the foreseeable future, we are limited to, or better off, doing measurements and basically taking the point of view that the cell knows what control systems to use and how these control systems are used to control a particular process.
01:08:22
And if we can make a readout that reflects all levels of integrated control, then we would be able to make a better prediction. So the surrogate for having a theory is to let the cell do the work and to do measurements which are close to determining the phenotype.
01:08:42
I agree on the goal, but not on the method. Okay, we can discuss it. One more question. I wanted to ask whether you think of this proteotype as quite stable, aside from the non-genetic variation,
01:09:00
like transcriptional noise or epigenetic events; do you think this proteotype is quite stable, or a dynamic configuration? So it is quite stable, it seems. I mean, we are not able to make measurements, of course, at the single-cell level, so we always measure aggregates over a certain number of cells, which can be good or bad; we can discuss that, but maybe not here.
01:09:28
But under specific conditions, the proteotype is actually quite stable. But it is also strongly reactive; it always reacts. I mean, that's what I tried to show with these mutations.
01:09:41
Even a mutation somewhere in the protein, a single amino acid exchange, has an effect on the proteotype which is actually noticeable. This is quite remarkable. It's a very sensitive readout, but it is inherently quite stable, through mechanisms like, for instance,
01:10:03
the buffering of transcriptional variability at the level of the modules that really matter for the function. So this is what I had to skip over. Okay, we have to stop now for the break.