RISCy Business: Development of a RNAi design and off-target prediction software
Formal metadata
Part | 83
Number of parts | 119
License | CC Attribution 3.0 Unported: You may use, modify, copy, distribute and make publicly available the work or its content, in unchanged or modified form, for any legal purpose, provided you credit the author/rights holder in the manner they specify.
Identifiers | 10.5446/20035 (DOI)
Production place | Berlin
EuroPython 2014
Transcript: English (automatically generated)
00:15
Okay, let's start. So, who are we? I'm a technician. I have been working for 10 years
00:22
in a plant research institute which is nearby, about 200 km from Berlin. Our focus is to search for genes which are involved in plant-pathogen interactions, and we use barley and powdery mildew as a model system. Here you can see barley plants, young
00:43
barley plants, which are infected with powdery mildew. It is a major problem out in the field, so you have a lot of losses due to this pathogen. We try to find genes which are involved in these interactions by reverse genetics. So, how can we do this? So,
01:03
you might know that the genomes of higher organisms are very large. For instance, the barley genome is almost twice the size of the human genome, and you have about 30,000 genes. So, how do you find genes which are involved only in plant-pathogen
01:22
interactions? How can you do this? We use a tool called RNA interference, which was discovered in the 90s. People had petunia plants which are light blue, and they wanted to make them dark blue. And they thought, okay, we know the
01:41
gene for the blue color, so let's add more of it to the plant and then we have petunia plants which are dark blue. They did this, and what actually happened was exactly the reverse. They found that the flowers were white, or that they had only stripes
02:01
of blue color. They scratched their heads and repeated the experiments, but it always turned out the same, and it took quite some years to find out that there is a mechanism behind this, which is called RNAi. So, how does RNAi work? So, if double-stranded
02:22
RNA enters the cell, which is very unusual. People think that this mechanism is used to defend against RNA viruses. So, if double-stranded RNA enters the cell, it's recognized by an enzyme called Dicer, which you can see here. Dicer cuts the double-stranded
02:43
RNA into small pieces about 21 base pairs long, which are called small interfering RNAs or siRNAs. Then another enzyme detects these small RNAs, takes them, and removes one of the strands, which is important for later. So, only one of the two strands stays
03:04
and builds the so-called RISC complex. This RISC complex then attaches to the target, which is the mRNA, shown here. The binding of the RISC complex blocks translation, and you have no protein. So, if we go back to our blue color gene, you have the
03:25
natural blue color gene in the plant, which gives the mRNA. So, here you have the natural gene in the plant, and the people added more of the blue color gene here, so the double-stranded
03:43
RNA. This was recognized by the plant. It bound to the mRNA of the natural gene and blocked the translation for the blue color. So, you have no color anymore. This is how it works. With this you can efficiently silence and knock down genes, and it has worked very well for many
04:02
years. People use it in humans, in worms, in plants, in many species. We also use it: we have developed a testing system where we can use this tool. It's very efficient and very cool, I think. So, that's why we call it reverse genetics. If you don't know
04:22
what to do, you switch off gene by gene and look at what happens. Okay, so people use it, but there are two problems with it. The first problem is that you might hit other genes by mistake, so-called off-targets. The other problem is that not every silencing is efficient.
04:41
So, we wanted to make software to predict these problems and also to use this tool for designing good and efficient constructs. So, we wanted to make software which can
05:02
predict off-targets, which can predict efficiency, and we also wanted to test this software. We made this software some years ago, but in the last year we wanted to validate it with real experiments and see whether the prediction is true. And, of course, we want to have nice software which runs on Windows, which everybody can use, and which combines all these
05:24
features. So, what is an off-target? Imagine that you want to make a construct because you want to knock out a gene. You can do this artificially in the lab: you make a double-stranded RNA construct, and then you have to check that you only target
05:44
the gene you want and not another target as well. As I said, you have about 30,000 genes in the barley genome, so the chance that you hit another target is quite high. As you can see here, the double-stranded RNA is again split into small RNAs, and then
06:01
we use a short read aligner called Bowtie, driven from Python, and Bowtie tries to find all the matches of the small RNAs against a large database, for instance the barley genome or the human genome, whatever. And if you designed a good construct, you should only find
06:22
hits on your target. But now and then you might have an off-target, shown here, which of course has far fewer hits, but they might be efficient enough to cross-silence, to have an off-target effect. So, you might hit not only the gene you want but also other genes, because in barley, for instance, it's very common to have gene
06:43
families which have the same or a very similar sequence, and it might be that you also hit another gene from the same family, which you usually don't want. So, it's important to predict this, to know about it, and maybe to change your construct design, for instance. Then the question is how much similarity is required for being
07:05
an off-target, and we used a model called the molecular clock, which says that about every one million years in evolution, one nucleotide is changed. So, if you have a common ancestor with a sequence like this, then after one million
07:24
years you usually have one base exchange. Here, for instance, a C turns into a G, and here an A turns into a T, and so on. After two million years we have another change, and we used this to design our experiments. So, we made 15 constructs.
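The idea of deriving test sequences from an ancestor by accumulating substitutions can be illustrated with a short Python sketch. This is only an illustration under assumptions: the ancestor here is a random 500-base stand-in, the substitution positions are chosen at random, and the substitution counts are hypothetical. The talk does not say how the positions in the real constructs were chosen.

```python
import random

BASES = "ACGT"

def substitute(seq, n_changes, rng):
    """Return a copy of seq in which n_changes distinct positions
    have been replaced by a different base."""
    positions = rng.sample(range(len(seq)), n_changes)
    out = list(seq)
    for pos in positions:
        out[pos] = rng.choice([b for b in BASES if b != out[pos]])
    return "".join(out)

def percent_identity(a, b):
    """Percentage of positions at which two equal-length sequences agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

rng = random.Random(0)
# Hypothetical stand-in for the real target sequence.
ancestor = "".join(rng.choice(BASES) for _ in range(500))
# Hypothetical series: 0, 10, 20 and 50 substitutions in 500 bases.
series = {n: substitute(ancestor, n, rng) for n in (0, 10, 20, 50)}
for n, seq in series.items():
    print(n, percent_identity(ancestor, seq))
```

Because each substitution hits a distinct position and always changes the base, the identities come out as 100, 98, 96 and 90 percent by construction.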
07:48
We made them as synthetic constructs. We used one target which we know works: we have worked on it for 10 years, and we have very good, efficient constructs and targets, all tested. So, we used our best candidate and made 15 constructs,
08:06
synthetic constructs, with decreasing similarity to the target. At zero million years you have no change, so you have 100% identity. First we went down in smaller steps, 98%, 96% and so on to 90%, and then in larger steps. Then we ran these through
08:26
the short read aligner, Bowtie, and you expect the number of hits to decrease along with the identity. In the best case, at 100%, you have 408 hits from the siRNAs to the target, and it decreases to zero at the end. So, you don't
08:48
hit your target anymore. Okay, so we made these constructs and tested them in our system. Unfortunately, we have no time to explain how it works, so you have to trust me. Here,
09:02
you can see all the constructs. 100% means the construct has no effect, it is at the level of the control. So, the control is 100% here. The further they go down, the stronger the effect, so zero would be the best. The red bars mean that the construct is significant. We usually do about five experiments; more we
09:26
cannot afford because it's expensive and very laborious. After five experiments we do statistics, and then we see what stays significant and what does not. So, red means significant, and blue means no effect, not significant. And you can see here,
09:42
it's very interesting that down to four million years, to 92% identity, you still have an effect. So, the silencing works, you knocked down the gene. But from 90% on, down to 50 million years or 30%, you have no effect anymore, you lost it. Yet at
10:06
90%, you still have 59 hits from the alignment. So, you have 59 siRNAs which match your target. So, why don't you have an effect anymore? This raises the suspicion, which people have also had for a long time, that not every siRNA is efficient. So, you might
10:23
have siRNAs which are not efficient, and as you can see here, most likely it's like this: you have these 59, but they are simply not efficient enough. When we started to make the software, we used very basic rules to estimate which siRNAs are
10:41
efficient, because people did not know exactly what's going on and what determines this efficiency. So, we used things like A and U at the beginning, the GC content, and temperature. But in the last few months we wanted to re-implement the software to have a better prediction of siRNA efficiency and also to confirm it with our experiments. In the last
11:06
couple of months, many research papers appeared which tested this efficiency extensively, especially in humans, and it turned out that the thermodynamic properties are very important, as is the structure of the RNA. So, things like minimum free energy, self-folding,
11:24
free ends, or target accessibility. So, we used this, and the first very important thing is the strand selection. You have the siRNAs here, and then they bind to an
11:42
Argonaute protein, and one strand is removed, if you remember. So, how does this happen? It happens like a zipper: the less stable end is more easily opened, one strand is taken up into the complex, and the other strand is removed.
12:01
So, this goes by the minimum free energy. If it's larger, so less stable, the duplex is opened from this side. But now you might have the problem that the wrong strand is removed. The remaining strand might not be complementary to your mRNA anymore. You might have gotten the wrong strand, so this siRNA is completely useless. This we included in
12:21
our software, to make a strand selection and to remove all those siRNAs which are not targeting anymore. For all the calculations we did afterwards, we first ran the siRNAs through this strand selection, so that we work only with those which are really hitting the real target.
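A strand-selection filter of this kind can be sketched in Python. Note the assumptions: instead of a minimum free energy calculation, this sketch uses a crude proxy (the number of A/U bases among the first four bases at each 5' end, since A-U pairs are weaker than G-C pairs), it ignores the 3' overhangs of real siRNA duplexes, and all function names are hypothetical. It is not the method implemented in the software described in the talk.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA string written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def weak_bases_at_5prime(strand, n=4):
    """Count A/U bases among the first n bases of a strand.
    Crude stand-in for 'this end of the duplex is easy to open'."""
    return sum(base in "AU" for base in strand[:n])

def antisense_is_kept(sense, n=4):
    """True if the antisense strand (the one complementary to the mRNA)
    has the weaker 5' end and is therefore assumed to stay in the complex."""
    antisense = reverse_complement(sense)
    return weak_bases_at_5prime(antisense, n) > weak_bases_at_5prime(sense, n)

def keep_targeting_sirnas(sense_sirnas):
    """Filter out siRNAs for which the wrong strand would be kept."""
    return [s for s in sense_sirnas if antisense_is_kept(s)]

# Two made-up 21-mers, written as the sense strand (same sequence as the mRNA).
good = "GCGCAUGCAUGCAUGCAAUAU"   # antisense 5' end is A/U-rich
bad = "AUAUUGCAUGCAUGCAUGCGC"    # antisense 5' end is G/C-rich
print(keep_targeting_sirnas([good, bad]))
```

Only the first sequence survives the filter, because only there does the strand that can pair with the mRNA have the more weakly paired 5' end.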
12:43
The second thing which seems to be important, and many articles have been published on it, is the accessibility of the target site. As you can see here, this is the primary structure of the RNA, but in the cell it's not really like this.
13:03
You have a lot of folding. You can see here that you have complementary pairings and loops and such things, and this is the secondary structure, and this is, sorry, this is actually how it looks in the cell. This is a very small
13:22
RNA, and you see how much is folded and how much is going on. So, it's not a flat line. And you can already see here that in some places it's easier to target than in others. If you have, for instance, a folding, a pairing here, it's very difficult for the siRNA to access
13:40
because it's already double-stranded. So, these parts of the loop, for instance, might be much easier to access. You can calculate these things with a tool called RNAplfold, which we also call from Python, and which calculates the local base pair probability. It takes the siRNA and the target, part of the target. We do this only
14:05
on part of the target mRNA because it's much easier to calculate than on the whole structure, and it calculates the probability, so how accessible it is for the siRNA to reach the
14:20
target. You can do this for each base, but also as an average over a stretch. So, you can do this for the 1-mer, the 2-mer, and so on up to the whole siRNA, so the 21-mer. And people found that the results seem to be reasonable and also interpretable if
14:41
at least eight bases of the siRNA can access the target. I guess if it's fewer, like two or three bases, the energy might not be enough for the rest of the siRNA to stick. So, people found that it's good to look in the data at whether
15:02
there are at least 8-mers of the siRNA matching the target. And this is what we did. So, we took our constructs and checked each of these lengths, 1-mer, 2-mer, and so on, for each construct. For simplicity, I will show only the 8-mer average because, as I said, we found the same: it looks
15:27
the most reliable if you use 8-mers, and you have four constructs here which were tested. Here you can see the sequence position, so you have 500 base pairs
15:41
of target sequence, and here the accessibility is shown. Zero means this site, this position of the target, was not accessible, and one means it was very accessible. If you remember, the zero-million-year to the four-million-year constructs had an effect, and the five-million-year construct did not. You can see here that the zero-million-year construct,
16:06
the one, the two, the three, and the four have quite a lot of accessible sites. So, this is the last construct with an effect, and it still has quite a few accessible sites, whereas for the five-million-year construct they completely disappear. So, this fits our data. So, with
16:22
this, you can say how efficient the siRNAs are that you make with this construct. If you look more closely at the first construct, the zero-million-year one, you can actually see that you have clusters. For instance, from zero to 100 base pairs,
16:47
you don't have so many accessible sites. From 100 to 200, you have some. From 200 to 300, you also have some. From 300 to 400, you have none or only very few. And from 400 to 500,
17:01
you have quite a lot. This brought us to the idea of making five more constructs, which we call window constructs, where only these 100 base pairs match the target. So, from 1 to 100 you have a stretch which matches the target, and the rest is random sequence, so it will not match the target. Then from 100 to 200, 200 to 300, 300 to 400,
17:26
and 400 to 500. We created these constructs and tested them in our system, and the results look quite interesting. I have to say that I just managed to look at the results before I went to the conference. We're not finished with this yet. But it already looks
17:44
quite interesting. We got the strongest and most significant effect with the 400 to 500 base pair construct, which also corresponds very well to the accessible sites, of which you have a lot. And we have no significant effect with 300 to 400, where there are also no accessible sites.
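The window constructs described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a random stand-in for the 500-base target and 0-based, half-open window coordinates; the function names are hypothetical.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    """Random DNA string of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))

def window_construct(target, start, end, rng):
    """Keep target[start:end] unchanged and replace everything
    before and after it with random sequence of the same length."""
    left = random_sequence(start, rng)
    right = random_sequence(len(target) - end, rng)
    return left + target[start:end] + right

rng = random.Random(1)
target = random_sequence(500, rng)  # hypothetical stand-in for the real target
windows = [(0, 100), (100, 200), (200, 300), (300, 400), (400, 500)]
constructs = {}
for start, end in windows:
    constructs[(start, end)] = window_construct(target, start, end, rng)
```

One design caveat: random flanking sequence can still produce chance matches, to the target or to other genes, so in practice the flanks would themselves need an off-target check.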
18:03
And for 200 and 100, it's similar. The only one that puzzles us a little is the 1 to 100 window construct. But as I said, we are still looking at the data. We also have to find and define a threshold here, because what is known from human does not fit the plant. Okay, with that I already want to come to an end.
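The role of such a threshold can be made concrete with a small Python sketch that counts accessible positions per 100-base window. The accessibility profile and both threshold values below are made up for illustration; a real profile would come from a folding tool, and the talk leaves the appropriate threshold for plants open.

```python
def accessible_sites_per_window(accessibility, threshold, window=100):
    """Count positions whose accessibility (0..1) reaches the threshold,
    separately for each consecutive window of the given size."""
    counts = []
    for start in range(0, len(accessibility), window):
        chunk = accessibility[start:start + window]
        counts.append(sum(value >= threshold for value in chunk))
    return counts

# Made-up accessibility profile for a 500-base target.
profile = (
    [0.05] * 100                  # positions 0-99: nothing accessible
    + [0.6] * 10 + [0.05] * 90    # 100-199: a few accessible sites
    + [0.7] * 8 + [0.05] * 92     # 200-299: a few accessible sites
    + [0.05] * 100                # 300-399: nothing accessible
    + [0.9] * 30 + [0.05] * 70    # 400-499: many accessible sites
)

print(accessible_sites_per_window(profile, threshold=0.5))
print(accessible_sites_per_window(profile, threshold=0.8))
```

With the lower threshold three windows contain accessible sites, while with the stricter one only the last window does, which shows how strongly the choice of threshold shapes the prediction.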
18:28
So, we created a software called si-Fi, which can be downloaded and which runs on Windows at the moment. At the moment, I'm making it cross-platform with PyQt. I switched
18:40
the GUI framework. You can have a custom database, you can find the efficient siRNAs, and you can find off-target genes. And these are just screenshots of how it looks. So, I think it's very useful. And I wanted to give this talk just to show that also
19:00
non-programmers can do quite useful things with Python and, yeah, that it's accepted by the community. Soon we will publish an article about it. And yeah, I hope you liked it. How fast is this library? I have no idea how long such a sequence is.
19:44
Can you handle that? The database? Yes, if you have some piece of RNA and want to do something with it, how much data is there? It's a lot of data. And is this done in a batch process? Yes, it's run as a batch process, and Bowtie uses database indexing to do this very fast.
20:04
I don't know the details exactly, but it does it very efficiently. One check by the software takes about five seconds, so it's quite fast in the end. Hello. I have two questions, actually. I see you are using Biopython, right?
20:27
Do you use that to do your BLASTs? No. We tried BLAST at the beginning, but it performed very badly for short sequences. You have a lot of misses, actually; it does not find all the siRNAs which are actually there. So, if you look at this graph,
20:47
at the first graph, you can see the blue line. This is what you get when all the siRNAs are aligned. With BLAST, you will not have this line; you have gaps in between. It cannot find all the siRNAs, and that's why we switched to Bowtie. Biopython I just mention here because
21:01
I use it for some handy tools like reverse complementing and opening files, FASTA files and such things. So, you're not using it at all for, because you were saying in the beginning you're trying to find the off-targets as well, right? That's also part of the software. Yes, it's also part of it. This we do with Bowtie.
21:21
Okay. So, in Python I split the query sequence into 21-mers, or whatever length you like, and then I take these sequences and use Bowtie to find them in my database, which is also built with Bowtie. And Bowtie gives me, for instance, the hits which you can see here. So, this is actually,
21:41
it's good, this is the target you want to hit, but you may also have off-targets. So, this also comes up. This is another gene, and here in this region you might have off-targets. So, this is done through Bowtie. Okay. Maybe I can just show this. You know, if you have a query and a target which are the same
22:05
and you split the query, you will find each of these substrings on the target here. So, you have eight hits. But if you have a mismatch in between, so an X here, you'll find only three. You know, it's like this. And for this we use Bowtie.
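The substring example above can be reproduced with a few lines of Python. This is a toy sketch that uses exact string matching in place of Bowtie, with a made-up 12-base sequence and a k-mer length of 5 chosen so that the counts come out as eight and three; real siRNAs would be about 21 bases long.

```python
def kmers(seq, k):
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def count_hits(query, target, k):
    """Number of query k-mers that occur exactly somewhere in the target."""
    return sum(kmer in target for kmer in kmers(query, k))

target = "ACGTTGCAAGCT"    # made-up target
identical = target         # query identical to the target
mismatch = "ACGTTGCTAGCT"  # same query with one base changed (A to T at index 7)

print(count_hits(identical, target, k=5))  # 8: every k-mer is found
print(count_hits(mismatch, target, k=5))   # 3: k-mers covering the mismatch are lost
```

A single mismatch removes every k-mer that overlaps it, so the hit count drops much faster than the percent identity does.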
22:23
And I was just wondering whether those target site predictions are based on the secondary structure in some way. You do this? Yes. Okay. I just started to work with it, and we are trying to find thresholds, and maybe we will also go for machine learning and things like this. But as I said, I'm not a professional programmer.
22:44
I need some time to develop all this. Okay, thanks. You said that you are using Bowtie as a short read aligner, probably against some short reads. What I didn't get, maybe I misunderstood it: where did the short reads
23:05
come from? From what? Yes. So, in the software here you can paste your query sequence. We call it the query sequence. And then in Python I just split it. Here you can choose the size. No, that wasn't what I meant. I mean, if you want to run Bowtie, you sort of,
23:26
what are you running it against? Against a barley genome or some short? Yes, you can. This is a very big advantage of si-Fi. With all the tools which exist online and also for download, you cannot use a custom database. They're usually for
23:40
humans. As siRNA is very often used in human, everything is adapted for human. But this is useless for us; we need to check against the barley genome or whatever. So, you can make a customized database and check against it. This is very important because, you know, the sequencing of the barley genome is ongoing, and
24:03
next week you might have a completely different sequence than last week. So, it needn't be a published reference genome assembly? Yes, it can also be your own. Sometimes you have projects where you have confidential data. That's why we also don't go online at the moment.
24:24
And not some kind of short read library that you require? No, but you can do this if you like. You can download them and paste them and use them. We wanted it to be very flexible at this point because all the others are not. Very interesting. Thank you.