
Using NLP to Detect Knots in Protein Structures


Formal Metadata

Title
Using NLP to Detect Knots in Protein Structures
Title of Series
Number of Parts
141
Author
Contributors
License
CC Attribution - NonCommercial - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Proteins are essential components of our bodies, with their function often dependent on their 3D structure. However, uncovering the 3D structure has long come at the cost of months of hard work in the lab. Recent advances in machine learning and natural language processing have made it possible to build models (e.g. AlphaFold) capable of predicting a protein's 3D structure with the same precision as experimental methods. In this talk, I will explore an even more specific application of language models for proteins - the detection of a knot in a protein's 3D structure solely from its amino acid sequence. Knotting in proteins is a phenomenon that can affect their function and stability. Thanks to NLP and interpretation techniques we can try to uncover why and how proteins tie themselves into a knot. In this research, we rely on many Python-based tools, from Biopython to PyMOL and the Hugging Face Transformers library.
Transcript: English (auto-generated)
So hello everybody, I'm Eva, and even though this is a Python conference, I will not be talking today so much about Python itself; I would like to tell you more of a story behind the research I'm working on, and it deals with knotted proteins.
So proteins, those tiny things that are in all our bodies, the key players that make our bodies work. If I give you the example of hemoglobin, that's something you all probably know: it's a protein that goes from our lungs to all other parts of our body and carries the oxygen, so it helps us breathe.
You know that proteins are quite important to us, and it's honestly pretty fascinating to study them, so let's have a look at what a protein looks like. When a protein is being created, it looks like a long string of tiny beads, where the beads, the building blocks of the protein, are called amino acids, and there are
20 different types of them. You can imagine them as 20 different colors of beads, and from these beads you're making the whole protein. As soon as the protein is finished, it doesn't stay in its linear form; it immediately folds into some 3D structure, and in its 3D structure it stays for its whole life.
And the 3D structure is crucial for the protein, because it determines the function of the protein. So if you look at the structures here, for example, at the orange one, it looks maybe a bit like a U-shape, like a pocket, so its function might be that some other molecule comes and sticks to the pocket.
Or the blue thing there, it's like a tunnel, so it sits somewhere in a cell membrane, and it guards which things can slip through in and out of the cell. Let's have a look at the folding itself. When you have this protein sequence somewhere, naturally, maybe in a cell, it can fold into
the 3D structure like immediately, in milliseconds. But when you're a scientist and you have this amino acid sequence, which is quite easy to get with experiments, how long do you think it will take you to get the 3D structure? We can do a little warm-up activity where you show your guess with your hands.
Will it be hours maybe, days, months? You can show it, if you have any guess. Yeah, great. Most of you are right, it's maybe somewhere around there. Because if you know what you're doing, if you have some protein you've maybe worked
with already, it may take you months, but usually it's more like years. And there's a joke, it's a pretty bad joke, because it says that every protein structure in a database took a PhD life, so it's not that great, right?
This difficulty in obtaining the 3D structures of proteins is very nicely demonstrated by the sizes of the protein databases we have. If you have a look at the UniRef database, which is a database of all protein sequences, it holds something like 300 million sequences. If you want to work with that, you usually cluster the database, so you end up
with something like 50 million sequences. But if you want to have a look at the 3D structures, which is the important thing because you want to know the function of the protein, you're at something like 190 thousand records.
So it's not that much, right? And this disproportion between getting the protein structure and the protein sequence is really nicely demonstrated by the sizes of the databases, and this plot shows that it will definitely not get better in the near future. So maybe it would be really good for us if we had some prediction tool
that would help us get the 3D structure of the protein: we plug in the protein sequence and it predicts the 3D structure for us. Luckily, we have something like that, and there's a really nice competition about this. It's called CASP, and it runs every two years.
And the guys from the competition publish some protein sequences with unknown structure, and basically anybody can join this competition with their models, and they try to predict the structure of these protein sequences, and then the organizers come, they get the real structure, and they see who's doing the prediction the best.
In the past years, the leaderboard looked something like this. On the x-axis you have all the models that joined this competition, and on the y-axis you have how well they're doing the prediction. And then 2020 came, and there was one tool that completely rolled over all the other tools.
It was called AlphaFold, and the best thing about it is that it was so good that its precision was almost at the level of what we were able to achieve with our experiments, with those costly experiments which run for years.
So that's a great thing to have. And let's now have a look inside at how AlphaFold works. Not surprisingly, it's a machine learning model trained on all available 3D protein structures, and there's a pretty nice schema which we can split into roughly three parts.
The first part deals with encoding the input, the protein sequence. I actually didn't mention one thing, which is pretty important and interesting about proteins, and that is: if you have two protein sequences that are quite similar, it's very probable that their 3D structures will look very similar.
So on this example we have hemoglobin protein sequence from human, mouse and fish, and if you have a look at the human sequence and the fish sequence, maybe only like half of the sequence is the same, but the 3D structure looks almost the same, and it's because the function is the same.
So if you go back to the schema, that's exactly what's happening in the first part. You have your protein sequence in the input, and what you're doing is that you're looking through a database of protein sequences and trying to find something that would be similar to your input sequence, and what you hope for is that you can maybe extract from there some patterns
and get something that might be useful for modeling the 3D structure. This input then looks like a matrix of all pairs of amino acids in the sequence and their potential contacts, and then, a logical thing, you also look for some already existing 3D structure that could help you model your input protein.
Then the second part is the processing of this input information. You may have noticed the word Evoformer there; it sounds a bit like a transformer, and that's almost what's inside. And the third part is the modeling itself,
so the output is the x, y, z coordinates of each amino acid in the input protein. Now, if you would like to predict your brand new protein's 3D structure, it's honestly not so easy, because this whole thing is really big,
it's computationally demanding, and one structure may take several hours to compute. So people started thinking, can we maybe do it somehow faster? It's great we have this thing that can predict a 3D structure in an hour, when we were doing the experiments for years,
but still, there might be something that will do it even better. The idea is that we spot the slowest part, which was this database search for similar proteins, and replace this slow part with some language model that would capture the idea of the whole protein universe,
understand proteins the same way the database search does, and match faster, right? So here comes ESMFold, which did exactly this thing. It took basically the same scheme as AlphaFold, but they replaced the slow part with their own language model trained on proteins,
and now we can do the prediction in seconds, so that's great. That was a somewhat longer introduction to proteins, how they work and their 3D structures; now let's shift a bit and talk about large language models. So what is that, a large language model?
In simple words, it's a type of AI that is somehow able to understand and generate text. Usually these models are trained on a huge dataset, for example all articles from Wikipedia, something like that, and they're trained in a very specific manner.
On the input you give them a sentence. For example, here we have "There was a king who had 12 beautiful daughters," and in the sentence you hide one word, so the word "daughters" will be masked. And what you want from your model is to predict the probability distribution of this hidden word.
So what this model does is that it predicts a couple of words with some probabilities, where "daughters" scores the highest probability, so that's exactly what you want. And what you hope for is that this model can somehow capture the idea of the text, of the language. In code, this masked-word game is only a few lines, as in the sketch below.
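As an illustration (this is not a slide from the talk), the masked-word prediction can be reproduced with the Hugging Face fill-mask pipeline and a generic English BERT:

```python
from transformers import pipeline

# Predict the hidden word and its probability with a pretrained masked language model.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for guess in unmasker("There was a king who had 12 beautiful [MASK]."):
    print(guess["token_str"], round(guess["score"], 3))
```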
However, training these types of models is extremely demanding, because you have to have a huge dataset and the model itself is really big, so usually only big companies can afford this. But when you have this model trained, it's really good, and you can use it for subsequent tasks. So, for example, a classical thing: you have movie reviews and you're classifying them into positive and negative.
What you can do is take one of the already trained models, basically cut off the last layer, which was predicting the masked word, replace it with a layer that will be predicting positive or negative, and only fine-tune this model on your specific task.
And what you hope for is that the model already understands the text, so it's easy to just slightly adjust it and it will output whatever you want. Nowadays, this fine-tuning has become really easy because we have, for example, Hugging Face. For those of you who don't know it, a little advertisement: it's a really cool thing.
It's a platform for data science and machine learning where you can easily share your models and datasets, and fine-tuning has become this simple: the slide shows basically the whole code for fine-tuning a model. In the first lines, you just download a model that is already available through Hugging Face,
you just load your dataset, you train it, and that's it.
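A rough reconstruction of that kind of "whole code" slide (not the exact code shown in the talk) could look like this, using the public IMDB movie-review dataset for the positive/negative example:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Download a pretrained model and give it a fresh two-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Load labelled movie reviews and tokenize them.
dataset = load_dataset("imdb")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256),
    batched=True,
)

# Fine-tune on a small subset; the pretrained weights do most of the work.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="review-classifier", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset["train"].shuffle(seed=0).select(range(2000)),
)
trainer.train()
```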
Now, you might wonder, how is this natural language related to proteins? It's a bit different, right? Not really, because you can imagine that a protein is somehow a language too. It's pretty easy, because each amino acid has its letter code,
so you can imagine that each amino acid, each building block of the protein, is a word, and you can then do with the protein the same thing as in normal language. You can mask some amino acids and train a model that will be predicting which amino acid is missing. People have already tried it, so we have, for example,
the ProtBert model, which is trained on a whole database of protein sequences. If you spotted the word BERT inside this name, yes, it relates to the BERT architecture for normal language; it's the same thing, just trained on proteins. And another example is the ESM model, which is something I already mentioned when we were talking about predicting the 3D structure; it's behind the faster tool. Now that these models are available through Hugging Face, it has become really easy for basically anybody to play with proteins, so even you can try it pretty easily, for example like the sketch below.
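For instance (the toy sequence is made up; Rostlab/prot_bert is the publicly shared ProtBert checkpoint):

```python
from transformers import pipeline

# ProtBert reads a protein as a "sentence": one-letter amino-acid codes separated
# by spaces, so each residue is a word, and any residue can be masked.
protein = "MKTAYIAKQRQISFVKSHFSRQ"                      # toy sequence
sentence = " ".join(protein).replace("Q", "[MASK]", 1)  # hide the first Q

unmasker = pipeline("fill-mask", model="Rostlab/prot_bert")
for guess in unmasker(sentence):
    print(guess["token_str"], round(guess["score"], 3))  # which amino acid fits here?
```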
And finally, we're getting to something I'm playing with, and that's proteins with knots. I know it sounds a bit weird, but it's a pretty fascinating thing, and it's built on a whole mathematical field from around the 18th century, knot theory, which categorizes different types of knots
and tries to distinguish them from each other. So in this picture we have different examples of knots, from the simplest one, the unknot, through the three-one knot, which is something you probably tie on your shoelaces, and then many others. And maybe just to explain the terminology here:
the first number refers to the number of crossings in the knot, and the second number is there just to distinguish different types of knots with the same number of crossings. How does this work with proteins? It's basically the same thing, but proteins are a bit more complicated, so you maybe don't see at first glance whether the protein is knotted or not.
Most of the proteins are unknotted, meaning that if you pull the protein from both ends, it will untie, but there exist some knotted proteins. Most of them have the simplest three-one knot, but there are some other guys, like the six-one knot or maybe even a double three-one knot in bacteria.
Even though knotted proteins have been studied for quite some time, we're still not 100% sure what the purpose of the knot is. There are some ideas: maybe the knot protects the protein from degradation, or, for example, we know that in some proteins the knot creates the active site
of the protein, which is responsible for the function of the protein, so it's a very important part. But what we'd really love to know, and still don't, is whether there exists any amino acid pattern that would be responsible for the knotting.
And that's where our research comes in. Our idea is that we take protein sequences and build a model on them which will try to classify whether a protein is unknotted or knotted, and then we'll try to somehow interpret the model and see if we can extract
from it some patterns of the knotting. Now, to train a machine learning thing, you need some data, right? So one would say we're very lucky: we have this AlphaFold tool that will predict many, many protein structures for us, so we can just look through this database
and see if we have some knotted structures, take them, take some unknotted ones, and build a dataset from that. Easy task, okay? It's not that easy, because if you do it with this simple approach, you will end up with a dataset that's completely biased, and the model will learn absolutely nothing about the knotting, so you have to be a bit smarter.
As in every machine learning project, the dataset building took us like 80% of the time, and when you already think you're finished, somebody comes along with "well, we maybe forgot about these things," so you have to rebuild the database, and you're back at the very beginning.
So after like half a year, we finally got something, and we ended up building the database almost manually, because we wanted to be really sure that the proteins we have there are really knotted, so we manually curated these proteins and took only protein
families we were sure about, for which there already existed some experimentally determined structure that was knotted. With this approach, we got quite a nice dataset, which had something like 200,000 proteins. It was nicely balanced, so it was really good for us.
And I already mentioned that these protein language models are now available, so we tried to use one of them, ProtBert, and fine-tune it on this dataset. Before we actually started the fine-tuning, we were curious how much the model already knows about knotting.
It has already been shown that these models have some understanding of protein features. For example, if you get protein embeddings, that is the model's inner representation of the protein, the model already knows, for example, from which kingdom the protein comes, or, for example,
which structure it possesses. So we tried it for our knotting problem, and we saw that the model already has some understanding of the knotting, even before it was fine-tuned, so that was a good thing for us. Getting such an embedding is roughly as simple as the sketch below.
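A minimal sketch of that embedding step (mean-pooling ProtBert's last hidden layer; the exact pooling used in the research isn't stated in the talk):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Rostlab/prot_bert")
model = AutoModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = " ".join("MKTAYIAKQRQISFVKSHFSRQ")   # toy sequence, space-separated residues
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape: (1, sequence length, 1024)

embedding = hidden.mean(dim=1).squeeze(0)       # one vector per protein
print(embedding.shape)                          # torch.Size([1024])
```

Projecting such vectors with PCA or UMAP and colouring them by knotted/unknotted is one way to check whether the two classes already separate before any fine-tuning.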
We then took our dataset, trained the model, and got pretty nice results, with an overall accuracy around 98%. It was actually a bit suspicious. We were worried, for example, that the model might have learned only to recognize the biggest protein family, so we also checked this, but it looked completely fine,
so we then moved on to the interpretation part, where we tried to see which patterns in the sequence the model thinks might be responsible for the knotting. We tried a couple of things, but the best working one was our custom technique,
where we patch over parts of the sequence, and for each patch we see how much the model's prediction drops. So basically, if you cover part of the protein sequence, how much does the knotting score drop? With this approach, you can find the place where the drop is biggest,
and with this patch you can basically break the knot, at least in the model's eyes. You can then aggregate all those patches and see if you can extract some biological meaning from them. The idea is roughly the sketch below.
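A minimal sketch of that occlusion idea (not the authors' exact code): slide a fixed-size patch over the sequence, replace it with a neutral residue symbol, and record how much the predicted knotting probability drops. The predict_knot_prob callable is assumed to wrap the fine-tuned classifier.

```python
def occlusion_scores(sequence, predict_knot_prob, window=10):
    """Return (patch start, score drop) pairs, biggest drop first.

    sequence: plain one-letter amino-acid string
    predict_knot_prob: any callable mapping a sequence string to a float in [0, 1]
    """
    baseline = predict_knot_prob(sequence)
    drops = []
    for start in range(len(sequence) - window + 1):
        # Cover one window of residues with a neutral symbol and re-score.
        patched = sequence[:start] + "X" * window + sequence[start + window:]
        drops.append((start, baseline - predict_knot_prob(patched)))
    return sorted(drops, key=lambda item: item[1], reverse=True)
```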
We tried it for one particular protein family, where we observed that the model primarily focuses on the end of the knot core, and we were also able to find some pattern in there, which is pretty interesting, and we observed that this pattern is closely related to the function of the protein. But we would like to continue with this. We tried the interpretation only with one family,
so it would be good to extend it to some other families, and, what's a bit of a crazy idea but might work someday, we would like to create our own knotted protein as a completely artificial design. You might ask, is this somehow useful?
And honestly, this is a bit of a tricky question for us, but I will try to give you some examples. For example, there's research that says that improper formation of a knot might be somehow related to obesity. And another thing: we know that some knotted protein families, for example the SPOUT family, are very important in our body,
and that the knot is really tightly related to their function. One of those functions can be related to resistance in bacteria, so maybe a deeper understanding of the knotting would help us develop some new targets for antibacterial drugs. And to conclude this talk, just remember: proteins are cool,
especially knotted proteins are cool, and research on proteins is something really neat. As a last thing, I would like to thank my team who helped me with this project, and also thank you for listening.
So if I understand you, at this moment you just distinguish whether it's knotted or unknotted,
and nothing more, like you don't try to distinguish the knots, like what kind... Different types of knots, not yet. But we might want to try that in the future. And basically, all your samples, if I understand, are completely... Like, most of the samples you were trying
are at this moment artificially made? Like, most of the knotted proteins are not present in the... Artificially predicted, yeah. And are you able to see, like, maybe... Okay, we can discuss later. It's fine.
How much is the research on modeling of proteins limited by post-translational modifications, since I guess it's hard to predict them, and this influences the structure and the catalytic activity as well?
Do you know how far you can get without considering this, or are there already options to consider it? Like, no idea about post-translational modifications at this stage. Okay. Are there plans for the future? From my side, probably not,
because I'm not, like, a lab person, but maybe somebody else has this idea. Okay. Well, it was just a matter of interest. It was a really interesting presentation, thanks for your talk.
You were talking about a solution based on BERT, the BERT library. Have you tried any other deep learning based solution? Yeah, we tried other transformer-based things, like this ESM. It worked very similarly, so we just stuck with this one.
And we also tried to use a simple convolutional network. It honestly worked pretty well. A bit worse, but still really well, so it looks like this knotting problem is somehow quite simple to distinguish. And what we also tried to do is to take just the embeddings from the BERT model and then train some small thing on top of it.
It also worked pretty well. But we just stuck with this thing. Really good talk. I was interested in the LLM stuff. So you said you would train it on a protein dataset afterwards. Have you tried just dropping in, say, a GPT-3
and seeing if it can discern the different proteins just as is? We were thinking about it. I think some guys would try something like that. But yeah, connect to those guys over there.
Hi, very nice talk and very interesting research. My question is regarding the knotted proteins and the problem of finding which protein is knotted and which is not. How did you... Which technique did you use to distinguish between these two sets?
There's a Python package. It's called Topoly. It's done by guys from Poland. They've been doing this research for quite some time, so they already know how these things work. And what you do is that you take your 3D structure and you run this tool on it, and it will output which type of knot you have there, roughly like the sketch below.
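As a hedged illustration of how such a check can look with the Topoly package (the exact function and output format may differ between versions; the file name is made up):

```python
# pip install topoly
from topoly import alexander

# Detect the knot type of a 3D structure via its Alexander polynomial.
result = alexander("predicted_structure.pdb")
print(result)  # e.g. something like {'3_1': 0.92, '0_1': 0.08}, i.e. most likely a trefoil
```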
Okay. But you start from the protein sequence and then you move to the 3D representation by some kind of molecular dynamics modeling? How did you get the 3D representation of the protein? By AlphaFold. Ah, by AlphaFold, so it's predicted. Okay, thanks.
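The structures discussed here were predicted with AlphaFold; as a rough sketch of the same sequence-to-structure step using the faster ESMFold route mentioned earlier (assuming the fair-esm package with its ESMFold extras installed, loosely following its README):

```python
import torch
import esm  # fair-esm; the ESMFold extras are heavy and a GPU is strongly recommended

model = esm.pretrained.esmfold_v1().eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy sequence
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)      # predict 3D coordinates, returned as PDB text

# The written file can then be handed to a knot-detection tool such as Topoly.
with open("predicted_structure.pdb", "w") as handle:
    handle.write(pdb_string)
```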
We do not have any questions on the Discord. If we have any questions in here, we still have time. Oops, sorry. Any more? No? Then thank you, Eva, for your great and nice talk. Thank you.