Some Thoughts on the Evolution of Proteins

Formal Metadata

Title
Some Thoughts on the Evolution of Proteins
Series title
Number of parts
340
Author
License
CC Attribution - NonCommercial - NoDerivatives 4.0 International:
You may use, copy, distribute, and make publicly available the work or its content in unchanged form for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner they have specified.
Identifiers
Publisher
Release year
Language

Content Metadata

Subject area
Genre
Transcript: English (automatically generated)
As you know, proteins govern the complex chemistry in living cells, and they also form the building blocks of cellular structures and particles. You also know that genes contain the information to make proteins, and that proteins are synthesized
initially as long chains, polypeptides, typical proteins being of the length of one to several hundred residues of amino acids. And by folding into a specific, unique three-dimensional structure, the linear information that's contained
in the genes in the base sequence of the DNA and that linear sequence which is also contained in the amino acid sequence is converted to a three-dimensional code that reflects
the information that was in the linear form. If I could have the first slide, and we could dim the lights. This first slide is taken from James Watson's textbook of Molecular Biology of the Gene,
his first edition. It's one of my favorites because it really contains the essence of molecular biology and illustrates what I was just saying. Here we have a gene on the DNA within a large chromosome, in this case coding for a protein, of course through an intermediary messenger RNA, for a hexokinase enzyme.
We see that this folds into a specific structure which then, by means of complementarity, can fit the substrates and lead to catalysis so that the information within the gene is
transmitted to a specific action in the cell, thus governing a particular reaction. This particular portion of the slide shows us basically what we call the folding problem
in biology. The primary sequence of the protein determines its structure, and as was shown by Chris Anfinsen in the 50s and 60s in a large number of experiments, the primary sequence determines the final structure so that, in principle, one ought to be able to predict
how a protein is folded and possibly what its function is simply by knowing the primary sequence. However, it will probably be in the next century before we can attain that goal. Now, modern proteins are fairly large, and they often carry out more than one function.
For example, on the previous slide, that protein, the hexokinase, recognizes two different molecules and transfers a phosphate group to the sugar, and different parts of the
chain fit different groups on the substrates so that there are patterns of amino acids in the first part of the protein that, for example, might recognize the ATP molecule or even sub-parts of that molecule, for example, the phosphate portion versus the adenine ring
structure. Other portions of the protein further along towards the carboxy end would recognize other groups. So that a modern protein really consists of a number of segments, at least conceptually, that have different small functions, and when you put them together,
you get the complete protein function. This suggests, then, that proteins early on may have been quite small and simple. And so the premise of my talk is that early proteins were indeed
short, perhaps 20 to 30 residues. We don't really know the length. And that at some time early on, there was a large random population of these molecules, of these proteins. For example, if they are averaging 30 in length, there would be 20 to the 30th random polypeptides
existing. Now, the vast majority of these would have no specific folded structure and would have, consequently, no function. But if we compare to the immune system, where
10 to the 8th or 10 to the 9th total antibodies can fit almost any molecule, we could assume that a small fraction of this total, perhaps a million or even a hundred thousand, we don't know, again, the exact numbers, might by chance
have had specific structures. And of those that had specific structures, a few might also have particular recognition capabilities of the lock-and-key sort to recognize small molecules. And all we have to do is postulate, and here is the big hurdle really, where
I really don't deal with this in my lecture. At some point in the earliest cells, we have to postulate that the useful functional structures, these small structures, became associated with genes, with small genes, which coded for them, and a selective mechanism
was developed early on so that those cells that had the right collection could in some way be selected and reproduced. So the big hurdle really is that first operating
cellular system, which could be subjected to selection and evolution. So what we're saying then is that the earliest cells perhaps consisted of a few hundred thousand or a million or so small genes coding for small peptides with partial functions, and
that these partial functions could perhaps aggregate together to form larger structures having a more complete enzymatic or structural function. Now at some point nature must have devised a way of producing this combinatorial joining together, much as in the antibody
system. And on the third slide here, we come to the exonic theory of genes, which even exists in the modern time. We know that most genes in nature exist in pieces, exons
with intervening intron sequences. So that here I've listed a large number of genes that might exist. These could have existed, of course, early on and then been perfected, but the modern exonic structure of genes is perhaps a remnant of the early, more disorganized
structure of genes. So we could assume then that new genes could be derived from these various pieces existing in other genes simply by a combinatorial selection in which a new
gene might borrow this exon, this exon, and perhaps that one to produce a new structure, a new enzyme or protein. So the notion here is that if there are millions of different
types of proteins in nature spread among the different organisms, there may be a more fundamental set of small functional pieces from which those proteins are all derived, made by various combinations. If we could somehow discover or catalog or make a library
of these patterns or motifs reflecting these early small proteins, we would learn a lot about evolution and about the structure of proteins. Now, there have been two events in the past, oh, I'd say a dozen years, which make it feasible to gather data on
this question. One is, of course, the very rapid advance in sequencing, DNA sequencing, so that now if one looks at the databases, the protein database is of the order of 15,000 or 20,000
proteins and the DNA database is approaching 100 million bases. There's a lag, of course, in converting some of those DNA sequences into translated products because in general it's not known what those proteins are. So what I've been working on in the past year
or so is some computer programs designed to examine the database to pick up the smaller motifs that exist inside groups of related proteins. And in the past year I have written
a program that operates quite well on a personal computer, and I should mention that the development of the personal computer in the past dozen years is the other event that is so important because it puts within the hands of every scientist the ability to test out ideas.
Okay, here is an example of my program which finds amino acid patterns in proteins. I've taken 33 reverse transcriptases present in the database and examined them for patterns
using this program. This pattern Y.DD, that's tyrosine, dot, aspartic acid, aspartic acid, is found exactly in 28 of the sequences and partially in the other 5. The program
in effect within a few seconds can locate the high incidence of this particular pattern in this group of 33 sequences. And in fact it also found two other patterns which I don't show you. Then it is able to align the sequences against that pattern and to find
the best alignment of the remaining sequences. These numbers refer to the scores, the higher the score the better this particular alignment matches those of the rest of the block. And the next best alignment is over here, that's the next best score. And these are the
residue positions within the sequences. And the name of the amino acid. Now up at the top you can see that we have in fact, although we originally searched just for three amino acids and found a conserved pattern, this is embedded in a sequence of high conservation that extends about 20 residues. A plus indicates partial conservation
of the amino acids in that column. In other words these are evolutionarily similar amino acids. A star indicates a much higher level of conservation but not exact conservation and if a letter is printed that means that all of them in the column match that letter.
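A search of this kind, for three amino acids separated by variable spacing, can be pictured as a single tallying pass over the sequences. The following is a minimal sketch under stated assumptions, not the program from the talk: the names `cell`, `decode`, `count_patterns`, and `frequent` are hypothetical, each spacing is assumed to take one of 10 values (0 to 9 intervening residues), and each pattern is counted at most once per sequence so that a count can be compared directly with the number of sequences. With 20 choices at each of three positions and 10 values for each of two spacings, the tally array has 20 × 20 × 20 × 10 × 10 = 800,000 cells.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 standard residues, one-letter code
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
MAX_GAP = 10   # assumption: 0..9 residues may lie between neighbouring pattern positions


def cell(a1, gap1, a2, gap2, a3):
    """Map the pattern (a1, gap1, a2, gap2, a3) to one index in the flat tally array."""
    idx = AA_INDEX[a1]
    idx = idx * MAX_GAP + gap1
    idx = idx * 20 + AA_INDEX[a2]
    idx = idx * MAX_GAP + gap2
    idx = idx * 20 + AA_INDEX[a3]
    return idx


def decode(idx):
    """Turn an array index back into a dotted pattern such as 'Y.DD'."""
    idx, a3 = divmod(idx, 20)
    idx, gap2 = divmod(idx, MAX_GAP)
    idx, a2 = divmod(idx, 20)
    a1, gap1 = divmod(idx, MAX_GAP)
    return AMINO_ACIDS[a1] + "." * gap1 + AMINO_ACIDS[a2] + "." * gap2 + AMINO_ACIDS[a3]


def count_patterns(sequences):
    """One pass over all sequences; returns the array of 800,000 tallies."""
    counts = [0] * (20 ** 3 * MAX_GAP ** 2)
    for seq in sequences:
        seen = set()                          # assumption: one count per sequence
        n = len(seq)
        for i in range(n):
            if seq[i] not in AA_INDEX:
                continue
            for g1 in range(MAX_GAP):
                j = i + 1 + g1
                if j >= n:
                    break
                if seq[j] not in AA_INDEX:
                    continue
                for g2 in range(MAX_GAP):
                    k = j + 1 + g2
                    if k >= n:
                        break
                    if seq[k] in AA_INDEX:
                        seen.add(cell(seq[i], g1, seq[j], g2, seq[k]))
        for c in seen:
            counts[c] += 1
    return counts


def frequent(counts, min_count):
    """Patterns whose tally reaches min_count, as (pattern, count) pairs."""
    return [(decode(i), c) for i, c in enumerate(counts) if c >= min_count]


toy = ["AAYQDDKK", "MMYLDDWW", "YPDDGG"]      # made-up sequences, not real data
print(frequent(count_patterns(toy), min_count=3))   # [('Y.DD', 3)]
```

Positional information is deliberately not stored during the tally; once a frequent pattern is picked out, one would go back to the sequences to locate it and align on it.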
That's the highest level. And we can see that in some positions there's little if any conservation of the residues in that particular column. Now how does the program actually work? There have been many attempts in the past by other computer experts to
devise a way to find such patterns. And they all seem to be stuck on the same approach, namely doing all the pairwise alignments and then examining those pairwise alignments
to see if there were common patterns between the different alignments. And of course this goes by the square of the total number of sequences. And so as the number of sequences goes above perhaps five or six, the computation time becomes lengthy and also it's not a totally
automatic method. Now it occurred to me that one could simply look for a three amino acid pattern, amino acid one, two and three, there are twenty possibilities for each of those positions, and allow a distance, a variable distance between the amino
acids, for example ten, here and here, so that the total number of possibilities in this case is eight hundred thousand. One then forms an array in which there are eight hundred thousand cells in the computer. And then you go through the sequences only once so that the time is only linearly dependent on the number of residues in the
sequences. You go through once and you simply catalog every pattern that fits these criteria into an individual cell in the array. So that here for example we have F, Q, E,
or is that an S? I guess that's an S. The cell for that pattern might be here, and since we have one occurrence we would put a one there. Now as we continue to search, if we find that pattern again we'd put a two here, and if we find it again we'd put a three here, and so on, incrementing each time, and so on. Another pattern might go
to this cell, another to this cell, and so on. So that you have a matrix or an array of eight hundred thousand cells, and many will be empty of course, but some will have higher numbers. For example if you have the 33 reverse transcriptases, certainly a pattern
that occurs, and you can show this by calculation, a pattern that occurs let's say 20, 25 times precisely in that group of sequences not only would be rare, but in fact would
be highly significant. It would virtually never be found unless it was there by purpose. So in the initial filling of this array we don't worry about the position, the positional information. Once the array is full we simply go back through and very rapidly examine
the cells in the array to ask if there are any entries that are high that approach that of the number of sequences. And those then we select out for examination, we go back, find out where that pattern occurred in each sequence, perform the alignment, and
then look to see if there are additional conservations surrounding it, which add more significance to the finding. It's very simple and it's relatively insensitive to the
answer to, and search out nearly all of the patterns. We've had considerable interest in the DNA methylases from our previous work, and one of the first systems we examined were 15 adenine methylases. We found a pattern proline-proline tyrosine occurring
in the majority, and again a region about 12 in length of high conservation. And even those two sequences that don't have the exact proline-proline tyrosine have a very close highly conserved match to that with reasonably high scores. I've refined the
program so that one can select out this region, this block, and form a profile, shown here. Here again I show the pluses, ending with the PPY here. I've selected out the
block of high conservation, which is, let's say, diagnostic of the DNA adenine methylases, and along the top I show you the 20 amino acids in the one-letter code. And the numbers simply refer to how well a particular amino acid fits the column; alanine here, for example, gives a score of 93, whereas
arginine gives a score of 13. So alanine fits this particular column in the profile better than does arginine. And glutamine fits very well, for example, or aspartic.
Glutamic and aspartic fit very well, and so on. And here we can see that tyrosine gives a very high score. Phenylalanine, which is a close match to tyrosine, also gives a high score here, but others are very low. Now with this numerical profile, one can
then search a protein database, and again I can do this on my own small computer within a matter of five or six minutes, searching through about 14,000 proteins to find those which give the highest scores against that profile, producing the highest level of match.
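As a rough illustration of searching with a numerical profile, the sketch below builds one score per amino acid per column from an aligned block and slides that profile along each database sequence. The scoring scheme is an assumption: it uses plain column frequencies scaled to 0..100, whereas the profile described in the talk evidently also gives credit to similar residues (phenylalanine scoring well in a tyrosine column). The function names and the toy block are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block):
    """block: equal-length aligned segments. Returns one {residue: score} dict per column.

    Assumption: score = percentage of segments carrying that residue in the column.
    """
    profile = []
    for col in range(len(block[0])):
        column = [segment[col] for segment in block]
        profile.append(
            {aa: round(100 * column.count(aa) / len(column)) for aa in AMINO_ACIDS}
        )
    return profile


def best_window(profile, sequence):
    """Slide the profile along the sequence; return (best_score, start_index)."""
    width = len(profile)
    best = (-1, -1)                            # stays (-1, -1) if the sequence is too short
    for start in range(len(sequence) - width + 1):
        score = sum(profile[i].get(sequence[start + i], 0) for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best


def search(profile, database):
    """database: {name: sequence}. Returns (score, name) pairs, highest score first."""
    hits = [(best_window(profile, seq)[0], name) for name, seq in database.items()]
    return sorted(hits, reverse=True)


# Made-up block and database, for illustration only.
block = ["NPPY", "DPPY", "NPPF", "SPPY"]
profile = build_profile(block)
print(search(profile, {"candidate": "AANPPYAA", "unrelated": "GGGGGGGG"}))
```

Sorting by best window score puts the sequences that most resemble the block at the top of the list, which is how an unannotated protein could stand out.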
Now obviously it's going to pick out any DNA adenine methylases that are in that collection. And I point out here one finding: there is a, I can hardly read it from this angle, hypothetical protein 4 from E. coli, which actually gives the highest score of any in this entire
library of sequences. And prints for that particular sequence, it's an unknown protein that was sequenced in the course of analyzing an operon devoted to cell division proteins.
Just on the basis of this high score, one might suspect that that is possibly a methylase, or a protein that binds to methylated adenine, possibly in DNA. Now it turns out there's another highly conserved pattern or motif in the DNA adenine methylases,
and that is phenylalanine dot glycine dot glycine, F dot G dot G. When I picked this particular sequence out of the database and searched it for that pattern, it also had that one. So I'd like very much to write to the authors and tell them
that they have an unknown methylase; whether it's related to cell division, I don't know. But it points up the fact that, by collecting motifs and distilling the essence of the motif into a set of numbers, one can now search the ever-increasing database
of unknown proteins to find possible matches. And this will allow us, hopefully, to identify most of the protein sequences, for example in the human genome, as they're obtained. I just want to give a few more. Okay, here's the F dot G dot G that I just mentioned.
It turns out if you just lump all of the methylases together, the DNA cytosine methylases, the DNA adenine methylases, the N4 cytosine methylases, into a large collection, they have in common this pattern. A few have variants, but still with reasonable scores.
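Checking a collection of sequences for a dotted pattern such as F.G.G, where each dot stands for any residue, is straightforward with a regular expression. This is a minimal sketch: the function name `pattern_hits` and the toy sequences are made up for illustration, and only exact matches are reported, not the scored variants mentioned above.

```python
import re


def pattern_hits(pattern, sequences):
    """pattern uses '.' for any residue, e.g. 'F.G.G'.

    sequences: {name: sequence}. Returns {name: [0-based start positions]}.
    """
    # A lookahead is used so that overlapping occurrences are all reported.
    regex = re.compile(f"(?={pattern})")
    return {
        name: [match.start() for match in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Made-up sequences, not real methylases.
toy = {"a": "MKFAGSGLL", "b": "FQGAGFTGNG", "c": "AAAA"}
hits = pattern_hits("F.G.G", toy)
print(hits)                                    # {'a': [2], 'b': [0, 5], 'c': []}
print(sum(1 for positions in hits.values() if positions), "of", len(toy), "contain the pattern")
```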
And there is circumstantial evidence that this particular pattern or motif is associated with the binding of the S-adenosyl methionine co-factor, the methyl donor that all of
these enzymes use. So we identify this with the common function of all of these proteins. Here I simply show another example, the 18 DNA integrases from various temperate phages
or viruses which can integrate their DNA into the chromosome, and also from some of the transposases which also integrate their elements into DNA. And they share in common, I should mention that this was found initially, this match was found by a graduate student
by simply taking the sequences home every night and looking at them hour after hour. He finally noticed a similarity in these otherwise non-homologous proteins and came rushing in to point out his discovery. I obtained this in about 10 seconds with my program, and
it's a motif of about 35 or so residues. Okay, now it occurred to me that the ability to find a common pattern in a large group of sequences might be useful in aligning sequences which have homology. And
so over the past year I've developed such a program. I might mention that those that make their living by the development of these analytical programs, again, were stuck in the
use of the pairwise alignment procedure, and none had succeeded in really going beyond about five or six sequences using a fairly heavy duty computer, because you have the 10-dimensional problem. If you have, for example, 10 sequences of 200 in length each,
you're in 10-dimensional space where the axes are 200 in length. And there are, in effect, 200 to the 10th power comparisons that must be made. And this, of course, would tax most computers.
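To put rough numbers on this comparison, the short calculation below uses the figures just quoted (10 sequences of 200 residues each). The variable names are illustrative, and the counts are order-of-magnitude estimates of problem size, not measured running times.

```python
n_sequences = 10
length = 200

# Cells in the 10-dimensional comparison space: 200 to the 10th power.
full_lattice = length ** n_sequences

# Number of pairwise alignments, which grows with the square of the number of sequences.
pairwise_alignments = n_sequences * (n_sequences - 1) // 2

# Residues visited when every sequence is read exactly once.
single_pass = n_sequences * length

print(f"{full_lattice:.3e}")    # 1.024e+23
print(pairwise_alignments)      # 45
print(single_pass)              # 2000
```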
However, by the approach that I've devised, one simply makes a single pass through the sequences, collects the patterns that are in common, and then aligns at those points, and so you, in effect, divide the sequences into a number of aligned regions separated by unaligned regions. And I show here the application to, and that might be focused
a little bit better. There are five serine proteases, and this alignment is produced very quickly by, I'm not going to go into a description of the program, but you can
see that the sequences are broken into sections where they have a high correspondence. This, down here, is more or less the consensus sequence. A capital letter means that all are the same. A small letter means that there's 50 percent or more agreement, and a dot
means essentially below that level of agreement. So here we have a highly conserved region separated by a region of variable length, which has low homology, and then a region of high homology again, another spacer of low homology, then high homology, lower homology
over here, high homology, lower, high, and so on. I've marked in pink those residues which are involved in the catalytic site. You can see that they're fairly far spread, but they occur within a region. So you can imagine that this high homology, this
high conservation here reflects a requirement to preserve the structure of that region so that the histidine can be properly positioned in the folded protein. And here we have a D and a, you can't see that over there, that are in highly conserved regions. Here
is the serine, which is the primary catalytic residue in a very highly conserved region here. The blue refers to structurally highly conserved regions in this collection of proteins.
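The consensus notation used on this slide (a capital letter where all sequences agree, a small letter where 50 percent or more agree, a dot otherwise) can be sketched as follows. The function name `consensus` and the toy segments are assumptions for illustration, and the sketch expects already aligned segments of equal length, with no handling of gaps.

```python
from collections import Counter


def consensus(aligned):
    """aligned: equal-length aligned segments. Returns the consensus line as a string."""
    line = []
    for column in zip(*aligned):
        residue, count = Counter(column).most_common(1)[0]
        if count == len(column):
            line.append(residue.upper())       # all sequences agree
        elif 2 * count >= len(column):
            line.append(residue.lower())       # 50 percent or more agree
        else:
            line.append(".")                   # below that level of agreement
    return "".join(line)


# Made-up aligned segments, not taken from the slide.
print(consensus(["GDSGGP", "GDSGGP", "GDAGGS", "GDTGGK"]))   # GDsGGp
```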
The crystal structures of three of these have been determined. This simply means that if you overlay the alpha carbon backbones of these three at that place, they would not deviate by more than one angstrom. I'm going to just go quickly through
some other examples here. Here I've lumped together six additional serine proteases, kallikreins, with the five that I just showed you. And you can see that, again, they align fairly well, and there are regions, the same regions as I showed before, that
are highly conserved here. Particularly this one that has the, well I'm having trouble seeing it. So that one can extend this to a large collection of serine proteases to pick out those regions which are most highly conserved. My sand has run out, so here are ten synthases going from bacteria to man, a billion years of evolution. You
can see, again, that there are highly conserved regions important in the structure or catalysis separated by regions of variable length where it's apparently not important to conserve
the structure. And 20 cytochromes, this is 11 cytosine methylases. And I might mention that no other approach will allow the alignment of these automatically because of the large variation in the non-aligned regions. And I think I'll end there. Thank you.