Probing the dynamics of complex microbial communities using DNA tridimensional contacts
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 34 | |
Author | ||
License | CC Attribution 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/50902 (DOI) | |
Publisher | ||
Release Date | ||
Language |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
| |
Keywords |
Transcript: English(auto-generated)
00:15
So first, I want to thank you for this invitation. It's the first time I come here, although I used to live next door.
00:20
So it's in Gif-sur-Yvette and Orsay. So yeah, I'm a local, but I never set foot in the institute. So I'm very pleased to be here today. And I want to thank you for this opportunity. So today, I'm going to talk mostly about a technique that we have developed in my team, my team being mostly interested in the functional folding of genomes, mostly
00:42
bacterial genomes, yeast genomes. But by working on these topics, we ended up finding out that we can actually exploit the quantitative measurement of physical collisions between DNA molecules to solve genomics or metagenomics limitations that exist in these fields. So we are not coming from the genomics or metagenomics field.
01:01
But part of my team is now working on these approaches by developing new methods. And so I will show you today how this works. So something common to all ecosystems in the world, whether it is the gut of a mouse or the mangrove in the French Caribbean, is the fact
01:20
that they are colonized by complex, heterogeneous populations of microorganisms, mostly bacteria, but also yeast, worms, and a lot of different species. And these communities of organisms have important roles in many fields. For instance, production of oxygen, recycling of matter, depollution, bioenergy, et cetera, so they
01:42
can have some industrial importance or just biological relevance. And therefore, it is quite interesting or important for a lot of people to try to decipher the content of these communities, to find out which bacteria are there, what their genomes are, in order to understand the maintenance
02:01
and the equilibrium of the ecosystem. I'm going to switch here, because there are more people here than there. So to understand these ecosystems, a lot of people are interested in describing them as precisely as possible, to the full extent of their complexity. So you have many species coexisting together.
02:22
You can also imagine you have phages, viruses, mobile elements that can be exchanged between some of these species. So it's quite important to be able to describe the structure to the full extent of its complexity. And this is where the field of metagenomics emerged, at the convergence between genomics, sequencing,
02:41
so very technical fields, how do you explore genomes from the structure down to the sequence, so high throughput sequencing, and the question, what is the diversity of microbes in the wild? So metagenomics is a field that consists in sequencing DNA from the environment
03:01
and trying to find out what species coexist with each other and how they co-evolve or coexist over time. So these studies provide some hints about the genetic content of an ecosystem and also some hints, therefore, about the balance or the imbalance of the community. And then you can actually work on many things
03:22
with these approaches. The way it is done so far is by extracting DNA from the population, from the ecosystem you are looking at. So you basically just extract DNA molecules that belong to different genomes, different organisms. And you do genome sequencing on this mix of DNA molecules.
03:41
So you sequence a lot of short reads or now even longer reads. So you end up with a big picture of the DNA present in the population. To improve things a bit, because you don't know which molecule belongs to which, you can actually try to extend these short DNA molecules into longer DNA molecules that are called contigs.
04:03
And these still don't represent full genomes, because of the limitations of current assembly programs and also because some of these species share some identical sequences. So the program just fails to isolate the individual molecules from the original population. And because it's a very complex problem,
04:21
you have to imagine some of these communities contain hundreds of species, therefore hundreds of genomes. And so it's quite impossible to assemble these DNA sequences into full genomes that provide you a good insight into the original community. So what people used to do is try to pool these contigs, which are stretches of DNA of a few thousand bases,
04:44
into pools of contigs that group these DNA molecules according to, for instance, covariance in different experiments. You look at the covariance between the amounts of these contigs you obtain after sequencing. So you can pool these guys together. You can look at the GC content, codon usage,
05:00
different heterogeneous information present along these molecules that helps you to pool these contigs according to specific features. It's a very imperfect process, and usually what you end up getting is many more pools of contigs than you have original sequences in the community. So here, for instance, you have six bacteria.
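As an illustration of the composition and co-abundance features just mentioned, here is a minimal sketch, assuming each contig comes with its sequence and a read coverage value per experiment. The contig names, sequences, and coverage values are hypothetical toy data, and real binning tools combine many more signals than these two.

```python
from math import sqrt

def gc_content(seq):
    """Fraction of G or C bases in a contig sequence."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def pearson(x, y):
    """Pearson correlation between two coverage profiles (one value per experiment)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

# Hypothetical toy data: (sequence, coverage across 4 experiments) per contig.
contigs = {
    "c1": ("GCGCGGCCAT", [10, 40, 5, 20]),
    "c2": ("GGCCGCGTAC", [11, 38, 6, 22]),
    "c3": ("ATATTAAGCT", [50, 5, 30, 2]),
}

gc = {name: gc_content(seq) for name, (seq, _) in contigs.items()}
same_pool = pearson(contigs["c1"][1], contigs["c2"][1])   # co-varying contigs
other_pool = pearson(contigs["c1"][1], contigs["c3"][1])  # anti-correlated contigs
```

Contigs with similar GC content and strongly correlated coverage (here `c1` and `c2`) would be candidates for the same pool.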
05:21
In the end, you will end up with nine communities of contigs, so there is a discrepancy between the number of pools of contigs and the number of genomes you expect in the end. So there is actually a strong inability in this field to reach a comprehensive genomic structure of complex communities.
05:40
And therefore, this limits the investigation of the dynamics and equilibrium of these ecosystems. So my team used to work on, I mean, is working on genome folding. And so what we noticed, and I will show you what I mean by that, is that obviously each sequence, DNA sequence in 1D
06:00
has a unique 3D signature in the wild. So we can actually exploit this 3D signature to go back to the 1D structure. Okay, so here is how we do it. So we use mostly an assay which is called the chromosome conformation capture assay, which has been developed by Job Dekker 15 years ago,
06:23
that aims at trapping physical collisions between DNA segments along the genome according to their collision frequencies inside the population of cells. So how you do that is you freeze the folding of DNA in each cell by adding a fixating chemical agent such as formaldehyde
06:41
that is going to generate covalent bridges between proteins and proteins, and between proteins and DNA. And therefore, if you have a DNA molecule inside a cell that is folded like this with a protein complex here, you will generate covalent bridges between these proteins and the DNA. And therefore, you will freeze the folding of this molecule in this cell.
07:02
So if you have billions of cells or millions of cells, you have a population of frozen structures like this in your mix where you add this chemical agent. Then the trick was to digest these DNA molecules. Therefore, you end up with restriction fragments, so short pieces of DNA fragments
07:20
that are again fixed with each other according to their frequency of collision inside this population. And then when you add an enzyme called ligase that is going to re-ligate these two restriction fragments together, you will end up with a molecule that is chimeric with respect to the original genome
07:42
but that reflects the fact that these two fragments, red and black, were actually close to each other in the 3D space. So then what you do is you purify all the DNA molecules and you end up with a library of DNA molecules where restriction fragments,
08:01
so DNA segments that were close to each other in 3D, are ligated together. How big is a segment, a restriction fragment? So it depends on the enzyme you use, but they can be between, let's say, 50, I mean 10 base pairs up to two KB, three KB. So if you use a frequent cutter, it would be mostly between 20 base pairs and 200 base pairs. If you use a six cutter,
08:20
an enzyme that recognizes six base pairs, it would be between 500 base pairs and three KB. It also depends on the GC content of the genome, so there are a lot of parameters there at play, but usually it's between one KB and one KB, let's say, on average. So for a long time, the limitation in this field
08:41
was to quantify the respective amount of these events, because if you are able to have a global overview of the respective amount of all the re-ligation events between all the DNA segments in the original population, then you would have a global overview of the average folding of the genome. That was the theory. And that was actually solved
09:01
with the advent of high throughput sequencing, like Illumina sequencing, where basically you just plug sequencing adapters to the edges of these DNA molecules here, and you sequence one end of the molecule and the other end, and now you only count how many times you found in the library, for instance, the green fragment re-ligated with the blue fragment
09:22
and actually with all of the other fragments of the genome. And therefore, you have the contact, I mean, the respective contact frequencies of re-ligation between all the restriction fragments of the genome with each other. And this allows you to generate heat maps, contact maps,
09:40
that reflect this respective ratio. So this is an example for the bacterium Vibrio cholerae, which contains two circular chromosomes, so approximately a four megabase genome. So here are the chromosomes represented here under the linear form. And when you plot the contact between all the DNA fragments
10:03
along chromosome one with themselves, you end up with this heat map here. So that's an intrachromosomal contact map. This is the intrachromosomal contact map of chromosome two. And then you can also plot the contact between chromosome one and chromosome two. And here, you have the interchromosomal contact map.
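The counting step that turns sequenced ligation products into such a heat map can be sketched as follows. This is a minimal illustration, assuming each read pair has already been aligned and reduced to two genomic coordinates; the function name, the fixed-size binning (instead of true restriction fragments), and the toy coordinates are assumptions made for the example.

```python
def contact_map(pairs, genome_length, bin_size):
    """Count ligation events between genomic bins.

    pairs: iterable of (pos1, pos2), the mapped positions of both ends
    of each sequenced ligation product.
    Returns a symmetric matrix (list of lists) of contact counts.
    """
    n_bins = (genome_length + bin_size - 1) // bin_size
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1  # keep the map symmetric
    return matrix

# Hypothetical toy data: 5 read pairs on a 3 kb genome, 1 kb bins.
pairs = [(100, 900), (150, 1200), (2500, 2900), (100, 2950), (1100, 1900)]
cmap = contact_map(pairs, genome_length=3000, bin_size=1000)
```

The diagonal entries collect contacts within a bin, and off-diagonal entries collect contacts between distinct bins, which is what the color scale of the heat map displays.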
10:22
So these heat maps are quite representative of what you get when you do this on any species so far, whether it is mammalian cells or bacteria or yeast. You have first a very strong diagonal that reflects the fact that a DNA molecule is a polymer. And therefore, two DNA fragments that are close to each other along this polymer
10:42
are going to be re-ligated more frequently than two DNA fragments that are far apart from each other. Therefore, you have this strong, sorry, this strong contact along the diagonal. But for the ligation, it is not only about being close, the ends have to coincide. To ligate, the sticky ends must coincide more or less, yeah?
11:02
Yeah, it's easier. When the fragments are close to each other, it's the same. So the contact frequency follows more or less a power law. The more you increase the distance between two DNA restriction fragments, the more it is going to drop, very quickly. And this is actually a log scale.
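A power-law decay of the form contacts ≈ A · s^(-α), with s the genomic distance, becomes a straight line on a log-log plot, so its parameters can be estimated by a linear fit. The sketch below assumes that form; the function name and the toy values are hypothetical, and real contact data would be noisy and typically fitted over a restricted distance range.

```python
from math import exp, log

def fit_power_law(distances, contacts):
    """Least-squares fit of contacts = A * distance**(-alpha) in log-log space.

    Returns (A, alpha). Distances and contacts must all be positive.
    """
    xs = [log(d) for d in distances]
    ys = [log(c) for c in contacts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return exp(intercept), -slope

# Hypothetical toy data generated with A = 1000 and alpha = 1 (distances in kb).
distances = [1, 10, 100, 1000]
contacts = [1000.0, 100.0, 10.0, 1.0]
prefactor, alpha = fit_power_law(distances, contacts)
```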
11:21
But the ligation depends on the sequences of this end. Yeah, so you have some biases at this level, a few biases which are kind of characterized and they are not strong enough to kind of affect the outcome of this result. You may have issues if you have very high GC-rich genomes. Actually, the sequencing is going to be affected as well.
11:43
Like the Illumina sequencing may be problematic on highly GC-rich genomes too. So there are biases, but overall it will not affect the general trend, which is that two fragments close to each other along this polymer are going to be frequently in contact and two fragments far apart like here
12:01
are not going to be frequently in contact. And so here the color scale reflects the fact that, I mean, rare contacts are in white, frequent contacts are in red. It also depends on the stage of the cell cycle, yeah, how they are positioned. Yes, so that's, yeah, exactly, yeah. So this will modify, for instance, the width of this diagonal. If you go into metaphase, you will increase the width.
12:22
If you are doing replication, it will look like this. So indeed there are some functional events that you can actually identify in this contact map. And this is something we are working on in different species. For instance here, so the chromosomes are circular,
12:41
meaning this position here is adjacent to that one. That's why you have these contacts at the edges of the maps. And also related to functional events, you can see here, for instance, you have contact, it's very weak, but you have contact between the middle of this chromosome two and the middle of chromosome one. And this corresponds to the termination
13:01
of replication position of these two chromosomes. And you can actually ask why do these chromosomes see each other in space. You also have one specific contact here between the origin of chromosome two and a position on chromosome one. And that relates to an original control of the firing of replication on chromosome two by the progression of the replication fork
13:21
along chromosome one, which seems actually mechanically related. So this is something we are working on in collaboration with Didier Mazel at Pasteur. But today what I want to insist on is that in all genomes you have this very strong diagonal that reflects the fact that chromosomes are polymers and therefore in the space,
13:41
in the cellular space, behave as relatively individualized entities. So the question we asked at some point when we were working on this was, okay, so what happens if we do the same experiment on a mix of species? Is a DNA fragment from genome one going to be in contact by accident,
14:02
by experimental procedure biases, and be re-ligated frequently with a fragment of genome two or genome three, or are these re-ligation events going to be sufficiently rare? And explain that. You don't understand it. When you re-ligate, the cut sequences must be compatible, right?
14:20
How can different species ligate to the different parts of the genome? Just because they are in, so you have- No, the way the different sequences, they are not complementing, they won't- No, because the restriction enzyme, it cuts the same site, so you have a cohesive end. I suppose, so the enzyme, it's a six-cutter, it depends on the- It depends on the enzyme.
14:40
So if you use a cohesive end, it will re-ligate very efficiently. If you use a blunt cut, it's not so efficient. So the question we ask is whether or not these restriction fragments are going to be re-ligated often with each other from the different genomes, and therefore blur the signal.
15:01
So what we did was to do a preliminary experiment where we took three different species. Again, Vibrio cholerae, Escherichia coli, and Bacillus subtilis. We grew them independently, and we mixed them together, the cells together, and then we performed this genome 3C experiment directly onto the mix of species.
15:22
So then we take the paired-end reads, and we align the reads against the reference genomes of these three species. And then we look at the contact pattern between these genomes and within these genomes. And when we did that, we were quite pleased to see that actually there is very little background between the different species.
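One simple way to quantify this background is the fraction of read pairs whose two ends map to different species. The sketch below is only an illustration: the function name is an assumption, and the counts are hypothetical toy numbers, not the values measured in the experiment.

```python
def inter_species_fraction(contacts):
    """Fraction of contacts linking two different species.

    contacts: list of (species_a, species_b, count) tuples, where each
    read pair is assigned to the species its two ends align to.
    """
    total = sum(count for _, _, count in contacts)
    inter = sum(count for a, b, count in contacts if a != b)
    return inter / total if total else 0.0

# Hypothetical toy counts for a three-species mix.
contacts = [
    ("B. subtilis", "B. subtilis", 900),
    ("E. coli", "E. coli", 1500),
    ("V. cholerae", "V. cholerae", 1400),
    ("E. coli", "V. cholerae", 120),
    ("B. subtilis", "E. coli", 80),
]
noise = inter_species_fraction(contacts)
```

A low value indicates that spurious ligations between fragments from different cells are rare compared with genuine intracellular contacts.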
15:41
So that means that it's very rare, I mean, it's relatively rare, that one fragment from species one is going to be re-ligated by accident with a fragment of species two. And so re-ligation may occur just because when you lyse the cell, and you just process the cell biochemically, you may just disrupt the genome and have DNA fragments floating in the solution
16:02
that will just re-ligate spontaneously with each other. So this doesn't happen very often. Therefore, you have well individualized squares in the main diagonal that correspond first to the genome of Bacillus subtilis. Then you have this square here that is the genome of E. coli. And you can actually see that there is also
16:21
a tiny square there that corresponds to a plasmid that sees the genome of E. coli a lot. And so that tells you that this plasmid probably shares the same cellular space as the genome of E. coli. And here you have again the two chromosomes of Vibrio cholerae that also see each other in 3D very often, and therefore you can actually
16:41
pool them together inside the same cellular space. So we can deconvolve, in a sense, the genomes of these three different species from this heat map. Why is E. coli different qualitatively than the others? So the coverage may be at play here, and it was a very preliminary experiment.
17:00
So there may be some differences in the amount of cells we put in; even though we thought we put a similar amount of cells, maybe we just made some little mistakes. And also, Bacillus is a Gram-positive bacterium. These are Gram-negative bacteria, and it seems to be slightly easier to extract DNA from Gram-negative bacteria than from Gram-positive ones.
17:22
So therefore the coverage of bacillus is lower in this experiment than these two guys here. So you're attributing this spread of the... Oh, the thin line here? So it's the diagonal is much more... Right, the diagonal is also a matter of color scale here. It's the same color scale for the entire map, and therefore you kind of blur.
17:41
And it was, again, a very early experiment. So bacillus doesn't look like this. We did work on that, I will show you maps. It's much nicer than this one. Okay, so, okay, let's keep that. So what does the thickness of the diagonal mean?
18:01
So it means that it can be interpreted somehow as a condensation level. It's not totally true, but you can actually visualize it as a kind of measure of condensation of the chromosome. So if condensation is high, two DNA fragments that are far apart are going to be more frequently connected with each other
18:23
than if it's very decondensed. That's one way to see it. So what we did then was to do the same kind of experiment with 11 species, therefore 11 genomes, and these are yeast genomes. So each genome now is composed of multiple chromosomes.
18:41
So we pooled all these species together inside a mix, and we did the metagenomic assay, no, sorry, the meta3C assay, as we call it, directly on the mix. And so when we aligned the reads against the reference genomes of these species, we were pleased to observe that there are indeed 11 big squares in the main diagonal
19:01
that correspond to the genomes of these species. Some of them are not as well covered as others. Again, you can see differences, but this is not very significant, this is not relevant. But again, here we aligned the genomes against, we aligned the reads against the reference genome, so we kind of cheat, we know the answer. So what we wanted to do was to design an assay
19:23
that would allow us to start- Excuse me, but for these, you don't have any common sequences or any common plasmids or any common- No, here there are no plas- We don't have plasmids. But each of these big squares is composed of smaller squares that correspond to the individual chromosomes present in each species.
19:42
And the dots also, just so you know: in yeast, you heard Eva yesterday, she probably talked about that, centromeres are colocalized in the nucleus, and therefore the DNA next to the centromere is colliding with the other DNA quite often. Therefore, you can see clusters of centromeres,
20:01
which are these dots in the contact maps. In the case of cerevisiae, you would have the two micron plasmids. Yes, but we don't align it. So there may be a plasmid there, but it depends on the strain, I guess. But for this strain, certainly, yeah. Is there a cerevisiae there? Yes, okay. Can you count the number of chromosomes and it will be-
20:23
Yes, so I will show you what we do now, which is actually exactly that. So what we wanted to do was, because this is kind of, I mean, it's nice, but it's kind of useless if you want to explore something from the wild, where you don't know the content in genomes: you cannot align the reads against reference genomes,
20:41
if you don't have the reference genomes. So what we wanted to do is to design a protocol, a computational protocol, where we would start from the reads, so only the paired-end reads, and we would like to see whether we can go down to the genome of all of these species. So we start here only with the raw sequences, which are the paired-end reads
21:00
that reflect the contact frequencies between all DNA in the population. So what we can do is, we can first try to increase the size of these small DNA reads, so we can use standard assemblers like IDBA-UD that are going to increase the size of the chunks of DNA we have. Therefore, we have a set of contigs that we can work on.
21:24
The other information we have, besides the DNA sequence in this set of reads, is the paired-end information, which corresponds to the contacts between the DNA molecules, from the design of the 3C experiment. So now what we can do is, we can also align the contact information
21:41
on this set of contigs. And that gives us a network of contact between this set of contigs. So we have this network, which is pretty big, and so a nice way to analyze network is to use the Louvain algorithm, which is going to partition the network
22:01
so as to optimize the modularity of the partitioning, and look for communities of contigs that see each other very much inside this big network, and don't see the contigs of the other communities very often. And so we work with this Louvain algorithm,
22:21
we design a protocol that allows us to actually segregate these contigs into 11 communities of contigs. Just by using the standard algorithm, we ended up getting 11 communities, which is actually the number of yeast species
22:40
that we originally put into the mix. We also had a contaminant, like an E. coli genome here, 1%, which is probably coming from one of the yeast cultures we did at the same time. So that's okay. So most of these contigs just see each other within the same community, and they don't see the contigs of the other communities very much.
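To make the partitioning step concrete, here is a simplified sketch of the local-moving phase on which Louvain-style modularity optimization is built: each contig is moved to the neighbouring community that gives the largest modularity gain, until no move helps. This is not the full multi-level Louvain algorithm, nor the protocol used in the talk; the function name and the toy network (two groups of contigs sharing many contacts internally and one contact between them) are assumptions for illustration.

```python
from collections import defaultdict

def local_moving(edges):
    """Greedy modularity optimization (first phase of Louvain, single level).

    edges: dict {(contig_u, contig_v): number_of_contacts}, undirected,
    without self-loops. Returns {contig: community_label}.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    degree = {n: sum(nb.values()) for n, nb in adj.items()}
    two_m = sum(degree.values())          # twice the total edge weight
    community = {n: n for n in adj}       # start with one community per contig
    comm_degree = dict(degree)            # summed degree of each community
    improved = True
    while improved:
        improved = False
        for node in sorted(adj):
            current, k = community[node], degree[node]
            links = defaultdict(float)    # contacts from node to each community
            for nb, w in adj[node].items():
                links[community[nb]] += w
            comm_degree[current] -= k     # take the node out of its community
            best = current
            best_gain = links.get(current, 0.0) - comm_degree[current] * k / two_m
            for comm, w_in in links.items():
                gain = w_in - comm_degree[comm] * k / two_m
                if gain > best_gain:
                    best, best_gain = comm, gain
            comm_degree[best] += k
            if best != current:
                community[node] = best
                improved = True
    return community

# Hypothetical toy network: contigs a1..a3 and b1..b3, one weak link between groups.
edges = {("a1", "a2"): 50, ("a1", "a3"): 50, ("a2", "a3"): 50,
         ("b1", "b2"): 50, ("b1", "b3"): 50, ("b2", "b3"): 50,
         ("a3", "b1"): 1}
communities = local_moving(edges)
```

On this toy input the three `a` contigs end up in one community and the three `b` contigs in another, mirroring the idea that contigs from the same cell see each other far more than contigs from different cells.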
23:03
Can you please repeat the nature of the edges in your interaction network? The nodes are the contigs, and the numbers of contacts between the contigs are going to be the edges.
23:21
So the edge, you put an edge between this contig and another contig in case of what? Well, we know that the number of contacts is going to reflect the co-localization in the same cell. So we just segregate, yeah, and then we just... And so that works relatively well,
23:42
but here we have pools of contigs, and if we take one of these communities, we look at the contact between these contigs, this is what we get. So here we have 1,000 contigs, and of course, we don't sort them according to their position along the genome because we don't have the reference genome. So when we just align the contigs,
24:00
the reads against the contigs, and we generate the contact map, this is what we get. It's a very messy contact map compared to what I showed you before. And so here I will introduce another algorithm we designed at the same time that allows us to reorder these contigs according to their contact frequencies, and therefore try to re-scaffold the genomes of these species.
24:28
So this is a program that we call the GRAAL, that was designed by a PhD student in my group, and that, this is like an interlude, that aims at scaffolding contigs
24:43
based on their contact in 3D. So the situation we have here is the same as if we had the contact frequency between a non-fully assembled genome. Most genomes are not fully assembled, and therefore when you align the reads of a contact experiment against the reference genomes,
25:01
you have this messy contact map where, because you don't know here that the green contig is adjacent to the green one here, you will not position these two guys next to each other along the axis, and therefore you will have this strong inter-contig contact here in the contact map that is actually incongruent with what we know
25:22
about polymer physics, the polymer properties of DNA. So same for the red chromosome here, which is split into three contigs here, there, and there. You will have this incongruent contact signal at, I mean, long distance in the contact map. So what you want to do is to solve this puzzle. So here it's quite simple. You can solve it by hand.
25:41
So for instance here, you can take the green contig there. You look at the contact made by this position, and you see it makes a lot of contact at the white position here with this guy. So you will position this edge next to that one. And now you have a longer scaffold, and it looks good on the map, okay?
26:01
But you may have long distance interactions. Yes, so you may have long distance interactions which are maybe functional, for instance, like the centromeres in yeast, but these long distance functional contacts are always much weaker than the very strong polymer behavior of the chromatin. So these diagonal contacts are always one, two,
26:23
I mean, usually two folds, two log higher than the contact you see at longer distances. And then what you want to do, for instance, with the blue guy, you take this contig here, and you see it sees a lot this edge there, and therefore you will have a red guy. Anyway, you get the idea.
26:41
You just re-sort the contigs until you get only squares in the diagonal, and you minimize the signal outside the main diagonal. So this is easy to do by hand, but it's actually more tricky to do with real data, with very big, complex genomes. So what we did was to design a program
27:02
that uses these contacts, which are very quantitative contacts, and polymer physics predictions, to explore the space of genome structures and identify the genome structure in 1D that reflects the 3D data with the highest likelihood. So that's what I did.
27:22
So again, this is, for instance, the power law. We noticed that if you have the decrease of contacts according to the genomic distance along the genome, and you know the contact frequency between two DNA fragments, you can actually guess with a very high accuracy
27:41
the distance that separates the two loci you are looking at. So here, for instance, you have a contact around 100, and then you can say, okay, so these two guys are, no, sorry, contact would be like 50, and then you say, okay, if I have a contact of 50 between two segments, it means they are separated with a certain likelihood by 100 KB in 1D.
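Reading a distance off the decay curve amounts to inverting it. The sketch below assumes a pure power law, contacts = A · s^(-α), and picks hypothetical values of A and α so that a contact of 50 maps to 100 KB as in the spoken example; the actual curve shown in the talk is not given in the transcript.

```python
def distance_from_contact(contact, prefactor, exponent):
    """Invert contact = prefactor * distance**(-exponent) to get the distance."""
    return (contact / prefactor) ** (-1.0 / exponent)

# Hypothetical parameters: 5000 contacts expected at 1 kb, exponent 1.
estimated_kb = distance_from_contact(50, prefactor=5000.0, exponent=1.0)
```

In practice the estimate comes with an uncertainty, which is why the talk speaks of a likelihood rather than a single value.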
28:04
So Herve designed this program, which starts only with the 3D data, and is initialized with a polymer model, because at first, we don't have contigs big enough to compute this distribution, which changes depending on genomes slightly.
28:21
So at first, we start with a polymer model, the data, and the original genome, that is, the set of contigs that is not fully scaffolded, that has been published or that we have generated. And then it's going to screen through the genome structures in 1D, trying to improve the likelihood of the structure
28:43
by fitting the 3D data on it, and computing a new measurement of the fit between the data and the 1D genome structure. So we are going to sample the space of genome structures in 1D, and compute each time the new likelihood.
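The idea of proposing small changes to the 1D order and scoring each one against the contact data can be sketched with a toy stochastic search. This is a deliberately simplified stand-in, not the GRAAL program: it assumes a Poisson likelihood with a power-law expectation, only swaps pairs of bins, and keeps a change only when the likelihood improves. All names, parameters, and data are hypothetical.

```python
import random
from math import log

def log_likelihood(order, contacts, prefactor, exponent):
    """Poisson log-likelihood (up to a constant) of contacts given a 1D order."""
    position = {name: i for i, name in enumerate(order)}
    total = 0.0
    for (a, b), count in contacts.items():
        distance = abs(position[a] - position[b])
        expected = prefactor * distance ** (-exponent)
        total += count * log(expected) - expected
    return total

def reorder(order, contacts, prefactor, exponent, n_iter=2000, seed=0):
    """Swap random pairs of bins, keeping only swaps that raise the likelihood."""
    rng = random.Random(seed)
    order = list(order)
    best = log_likelihood(order, contacts, prefactor, exponent)
    for _ in range(n_iter):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        candidate = log_likelihood(order, contacts, prefactor, exponent)
        if candidate > best:
            best = candidate
        else:
            order[i], order[j] = order[j], order[i]  # revert the swap
    return order, best

# Hypothetical toy data: 6 bins whose true order is A..F, with contacts that
# follow 100 / distance exactly.
true_order = list("ABCDEF")
contacts = {(true_order[i], true_order[j]): 100.0 / (j - i)
            for i in range(6) for j in range(i + 1, 6)}
shuffled = ["D", "A", "F", "B", "E", "C"]
start_ll = log_likelihood(shuffled, contacts, 100.0, 1.0)
final_order, final_ll = reorder(shuffled, contacts, 100.0, 1.0)
```

A sampler that sometimes accepts less likely structures, as the talk's description of oscillating around a solution suggests, would be less prone to getting stuck than this greedy version.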
29:00
So this is the likelihood on the y-axis, this is again here, and this is the increase of likelihood according to the iteration of events. So we basically make small changes in the 1D structure, and for each small change, we compute the likelihood, and then we take the best likelihood structure,
29:20
I mean the most likely structure, and then we converge towards the situation where we are actually oscillating around one 1D genome structure. And this is just a movie that illustrates this process. So this is a genome that was published as a set of 76 contigs, or scaffolds, so 76 big pieces of DNA.
29:41
It's a fungus, and we know that this species doesn't have 76 chromosomes, so there are more chromosomes, sorry, more contigs published than the number of chromosomes. And when you look at the 3D data, you can see clearly that indeed, something is wrong with this contact map, it doesn't look right. And here, we have illustrated the 76 contigs
30:02
that were published. So what we want to do is to reorder these DNA fragments according to the 3D contacts, and converge towards the most likely 1D structure. So the first part of the program is actually splitting these big contigs into smaller pieces, because we actually assume that there are some assembly errors in the first place that were published.
30:23
And so therefore, we cut these big DNA fragments into pieces of 10 KB, and we just start to re-sort them according to their collision frequencies. So the first part of the movie is just doing that and randomizing it for...
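The splitting step itself is simple bookkeeping. A minimal sketch, assuming contigs are described only by a name and a length; the function name and the example contig are hypothetical.

```python
def split_contig(name, length, bin_size=10_000):
    """Cut a contig into consecutive pieces of at most bin_size bases.

    Returns a list of (contig_name, start, end) with end exclusive.
    """
    return [(name, start, min(start + bin_size, length))
            for start in range(0, length, bin_size)]

# Hypothetical 25 kb contig cut into 10 kb pieces.
pieces = split_contig("contig_1", 25_000)
```

Each piece then becomes an independent unit that can be moved during the search, which is what lets the procedure undo assembly errors inside the published contigs.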
30:42
So we just split these DNA fragments into tiny pieces, and then here's the likelihood, and that's the number of contigs, and we start to explore 1D structures, improving the likelihood. And in the end, we are starting to converge towards a state where we have 11 big chunks of DNA.
31:04
And we have a contact map now that accordingly represents something that looks like a yeast or fungi genome. So here, when we stop the program, so sorry, not 11, but seven chromosomes. Yes, that didn't go down to the... Okay, it stopped.
31:23
Yeah, it stopped. One of them should be, but it doesn't matter. Well, anyway. So when we stop the program, which is oscillating forever around the same structure, we actually get this genome structure, seven chromosomes.
31:40
We have contacts that look very much like inter-centromeric contact, which is totally expected for fungi at this stage. And like this dot here, we can actually identify the centromere by looking at the position of these dots along the sequence, et cetera. And so we can actually scaffold this genome pretty efficiently. And this is actually the real genome. Now we checked it independently.
32:03
And we can do that actually for other species. And so this is just an example of unpublished data. It doesn't just work on small genomes, like this fungus; this is a parasitoid wasp. This is a collaboration with Jean-Michel Dreyzen in Angers, where when we do the 3D contact
32:22
between the chromosomes of these wasps, we have this very messy contact map. This is a published genome made out of thousands of contigs. You can see them here. Each of the contigs is making a square, but then you have all these inter-contig contacts that we want to solve. And the program is also able to do that
32:40
pretty efficiently. So I'm just gonna go quickly. But again, we are just re-sorting the chunks of DNA according to their collision frequencies. And this gives us a relatively robust scaffold in the end. So it takes a bit longer, but here.
33:07
So when we do that, we end up getting relatively nice scaffolds that actually correspond to what is expected from cytology data. So we were able, by exploiting the collision frequencies
33:21
between these DNA fragments, to re-scaffold these genomes in a way that matches the expected cytological data. So now our collaborator can look at their favorite genome sequences, such as viral sequences that contribute to the parasitoid process of the wasp.
33:41
The wasp disrupts the host immune system to send the eggs inside the caterpillar it is infecting. So these viral sequences are important for the reproduction of the wasp. Anyway, so going back to this metagenomic assay,
34:01
we ran this program on these 1,000 contigs. And to cut it short, we end up with something that looks like a yeast genome contact map. And when we compare this contact map with the reference genome, we have 95% of the reference genome covered
34:21
by this heat map. So in one single experiment, we were able to actually isolate 11 genomes from this mix of species and re-scaffold them with approximately 90% accuracy compared to what was published. So that's what was quite promising. And here I'm going to show you the real experiment.
34:41
This is not a controlled mix. It is done on the gut of a mouse from the Pasteur Institute animal facility. It's not a very exciting sample because again, we were not microbiologists at this time. But so we took one single feces sample from this mouse
35:01
and we performed the meta3C assay, the contact assay, directly on this single feces sample. So we were able therefore to assemble these DNA reads into contigs. And here it's a very big network. We have 400 contigs, sorry, 400,000 contigs.
35:21
It represents approximately 600 megabases of DNA, and the N50 is around 4 kb. So the contigs are relatively small. Again, when you look at the contact map of these contigs, we can use the paired-end information to bridge these contigs according to their contact frequencies. And we have this network of contigs
35:42
that contains 45 million contacts. So it's a very complex network. And this is what it looks like when it's not sorted. What we did again was to partition this network using a Louvain clustering algorithm.
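For readers unfamiliar with the N50 figure quoted for this assembly, here is a short sketch of how it is usually computed. The function name `n50` and the contig lengths are illustrative assumptions, not data from the talk.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half
    of all assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("empty assembly")


# Hypothetical contig lengths in base pairs (26,000 bp in total).
toy_lengths = [10_000, 6_000, 4_000, 3_000, 2_000, 1_000]
toy_n50 = n50(toy_lengths)  # 10,000 + 6,000 already exceeds half of the total
```

An N50 of a few kilobases therefore means that half of the assembled DNA sits in contigs no longer than that, which is why contacts are needed to group them.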
36:00
And when we sort this network using Louvain, we were able to get nice communities of contigs that present more contacts with each other than with contigs outside. So each of these squares corresponds to a set of DNA sequences that collide more frequently in space
36:20
compared to the other ones. Okay, I'm gonna skip that. Okay, so what we did next was to investigate the content of these communities. And first, what we can do is to align the genome annotation against the DNA present in all of these communities.
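To make the partitioning step more concrete, here is a minimal sketch of the first, local-moving phase of the Louvain method on a toy contact network. It is an illustration only: the full method also aggregates communities and repeats, the talk does not say which implementation was used, and the function name `louvain_local_moves`, the contig names, and the integer weights are assumptions.

```python
from collections import defaultdict


def louvain_local_moves(edges):
    """Local-moving phase of Louvain on an undirected graph without self-loops.

    edges: {(u, v): integer contact weight}. Returns {node: community label}.
    """
    adj = defaultdict(dict)
    for (u, v), w in edges.items():
        adj[u][v] = adj[u].get(v, 0) + w
        adj[v][u] = adj[v].get(u, 0) + w
    degree = {n: sum(nb.values()) for n, nb in adj.items()}
    two_m = sum(degree.values())
    community = {n: n for n in adj}   # every node starts in its own community
    comm_degree = dict(degree)        # summed degree of each community

    improved = True
    while improved:
        improved = False
        for node in sorted(adj):
            current = community[node]
            comm_degree[current] -= degree[node]   # take the node out
            links = defaultdict(int)               # weight to each community
            for nb, w in adj[node].items():
                links[community[nb]] += w

            def gain(c):
                # Modularity gain of joining c, scaled by a positive constant.
                return links.get(c, 0) * two_m - comm_degree[c] * degree[node]

            best = max(sorted(set(links) | {current}), key=gain)
            if gain(best) <= gain(current):
                best = current                     # move only if strictly better
            comm_degree[best] += degree[node]
            if best != current:
                community[node] = best
                improved = True
    return community


# Two groups of contigs with many internal contacts and one weak bridge.
toy_edges = {
    ("c1", "c2"): 10, ("c2", "c3"): 10, ("c1", "c3"): 10,
    ("c4", "c5"): 10, ("c5", "c6"): 10, ("c4", "c6"): 10,
    ("c3", "c4"): 1,
}
labels = louvain_local_moves(toy_edges)
```

On this toy network the two triangles end up in separate communities, which is the behaviour the squares on the sorted contact map reflect.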
36:41
So the communities are sorted by size. So of course, we have small communities of a few contigs, start from one to one of the contigs. But most of the DNA is actually being pooled into the large communities of contigs that contain more than 1,000 contigs. So these contain more than one megabase of DNA. And so of course, since most of the DNA is present in the large communities,
37:03
most of the gene annotation is going to correspond to the large communities. So therefore, this is what you see here. Most of the genes are actually present in the biggest communities. This is not very interesting. If we look at bacterial essential genes, again, most of the bacterial essential genes
37:22
are present inside these large communities, suggesting that these large communities correspond to bacterial DNA, which is, again, not totally unexpected. The interesting thing was when we look at phage-related DNA or plasmid DNA or conjugative DNA, we actually see an enrichment as well inside the smaller communities,
37:42
as if, of course, the phages are in contact with the bacterial genomes, the bacterial genes, because some of the phages are integrated in the bacterial genomes, like prophages, or some phages are maybe infecting the bacterial genomes. But also, some phages behave as individualized entities that in space are being actually trapped on themselves,
38:01
and not with another molecule. So that was quite interesting. So we decided to analyze these two sets of communities independently. First, the large ones, and then the smaller ones. And so what we observe, for instance, if we take one of these large communities, so this is 68,
38:22
again, we have this messy contact pattern. We can improve this pattern by working with the program we designed, GRAAL. In the end, we end up with a scaffold of approximately four megabases of DNA, a single scaffold that presents this kind of pattern. So you have here the large diagonal,
38:41
and interestingly, you also have a secondary diagonal. And I will show you why this is interesting. It's because at the same time, we are working on Bacillus subtilis. And this kind of 3D folding of a bacterial genome is typical of bacteria like Bacillus subtilis. So first, this origin of replication is positioned
39:01
at the edge of the crossing between this secondary diagonal and the main diagonal. You have more DNA around the origin of replication compared to the termination of replication, which corresponds to a bacterium that is actively dividing. So you have more DNA at the origin than at the terminus of replication.
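The origin-versus-terminus observation can be turned into a simple number. This is a sketch under assumptions that are not from the talk: coverage is binned along a circular scaffold, the origin bin is known, and the terminus is taken as the diametrically opposite bin. The function name `ori_ter_ratio` and the coverage values are hypothetical.

```python
def ori_ter_ratio(coverage, ori_bin, window=2):
    """Mean coverage around the origin divided by mean coverage around the
    terminus, assumed to lie opposite the origin on a circular genome."""
    n = len(coverage)
    ter_bin = (ori_bin + n // 2) % n

    def local_mean(center):
        # Average over 2 * window + 1 bins, wrapping around the circle.
        idx = [(center + k) % n for k in range(-window, window + 1)]
        return sum(coverage[i] for i in idx) / len(idx)

    return local_mean(ori_bin) / local_mean(ter_bin)


# Hypothetical read coverage per bin, highest at the origin (bin 0).
toy_coverage = [200, 190, 175, 160, 140, 120, 105, 100, 105, 120, 140, 160, 175, 190]
ratio = ori_ter_ratio(toy_coverage, ori_bin=0)
flat_ratio = ori_ter_ratio([100] * 14, ori_bin=0)
```

A ratio clearly above 1 is what one would expect from a population that is actively replicating, while a flat profile gives a ratio of 1.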
39:21
And the secondary diagonal reflects the fact that the two chromosome arms, even though they are bridged there, are actually colliding with each other more frequently because they are being bridged by condensins, as we actually showed in Bacillus subtilis. So when you start replication, you bridge the two chromosome arms,
39:42
and they are actually maintained in close contact with each other during the cell cycle. And this is what is reflected by the secondary diagonal. So this is actually typical of the 3D structure of a bacterial genome like Bacillus subtilis. So this is another community, another one, and so on.
40:01
This is more like the E. coli bacterial chromosome, where we don't have the secondary diagonal. So this is like Bacillus, this is like E. coli. But we get DNA structures that all look like the 3D structures of scaffolded bacterial genomes. So in one single experiment, we are able to recover dozens of bacterial scaffolds,
40:21
which is actually, I mean, until now you could not do that in this field. You needed to have multiple samples and to do a lot of correlation analysis. But here in one single experiment, we end up getting hundreds of bacterial genomes relatively complete. At the same time, what we did was to analyze the content of these small communities,
40:42
which were the phages, I mean, the ones that look like phages. So here we took 82 small communities that contain mostly genes that are phage genes. And we reprocessed the data a little bit. And we looked at the sequences we obtained
41:01
for these 82 communities. And so these are different examples of these 82 communities. So one of them, for instance, is a 235 kb genome that looks like a phage genome, it's a phage. These guys here correspond to two molecules of 25 kb,
41:21
which is also a good size for a phage genome. This is 50 kb, like a Caudovirales genome, et cetera. So we have genomes of phages in this community of species that we can now actually try to correlate with the potential hosts, which are the bacterial genomes.
41:41
So we have, on one hand, bacterial genomes, and on the other hand, we have the phage genomes. And so now what we can do is to plot the contacts in 3D between the phage and the bacterial genomes. And this is what we do here. We have the 82 phage sequences and the 100-something bacterial sequences. And this is what we get now.
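The phage-versus-bacteria contact matrix can be sketched as follows. This is a minimal illustration with assumptions that are not from the talk: raw contact counts are divided by the size of the bacterial bin, and a host is called only when it holds a clear majority of the normalised signal. The function name `assign_hosts`, the `min_share` threshold, and all counts are hypothetical.

```python
def assign_hosts(contacts, bin_sizes, min_share=0.6):
    """contacts: {(phage, bacterial_bin): contact count}
    bin_sizes: {bacterial_bin: size in kb}, used to normalise raw counts.
    Returns {phage: (best_bin, share)} for phages with a dominant host."""
    density = {}
    for (phage, host), count in contacts.items():
        density.setdefault(phage, {})[host] = count / bin_sizes[host]

    calls = {}
    for phage, row in density.items():
        total = sum(row.values())
        if total == 0:
            continue                      # no contact signal at all
        best = max(row, key=row.get)
        share = row[best] / total
        if share >= min_share:
            calls[phage] = (best, share)  # confident call only
    return calls


toy_contacts = {
    ("phage_1", "bin_A"): 400, ("phage_1", "bin_B"): 20,
    ("phage_2", "bin_A"): 30, ("phage_2", "bin_B"): 30,
    ("phage_3", "bin_A"): 40, ("phage_3", "bin_B"): 20,
}
toy_sizes = {"bin_A": 4000, "bin_B": 2000}
calls = assign_hosts(toy_contacts, toy_sizes)
```

In this toy data `phage_3` contacts both bins at the same density, so no host is called for it; a phage contacting several closely related bacteria would look similar.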
42:01
We have a good overview of the infection spectrum, global infection spectrum in the community. So this is quite important because now what we can do is to do this kind of experiment over time and look at the propagation, for instance, of a virus in different bacterial species,
42:21
closely related bacteria. We can also look at genes of interest, for instance, an antibiotic resistance gene, which is present in one of the species. And then if we treat one of the communities with, or we give, an antibiotic to a patient, then we can look at the propagation of this phage, or sorry, of this gene, the antibiotic resistance gene,
42:42
inside the population of bacterial species and see if at some point it will induce some dramatic stress in the population of bacteria. We can look at the activation of prophages depending on stress, et cetera, et cetera. So it opens a broad range of studies,
43:01
which until now were quite difficult to address because there was no good way to bridge the virome, so the phage sequences, to the microbiome, which is the bacterial sequences. And also we did not have the full genomes of these bacteria, so it's difficult to look at interplay between metabolic networks in these complex communities.
43:24
So this is, I don't know how long, what time is it? Five more minutes. Okay, sorry. So this is just another little observation is that we also have prophages inside the bacterial genomes, so they don't correspond to this individual,
43:42
necessarily correspond to these communities of phage sequences, but we can actually also observe some phage sequences in the bacterial chromosomes that correspond to prophages, most likely. And so here in red, this is the annotation of the phage sequences, and you can see clusters of phage annotation in this bacterial chromosome
44:02
that actually correspond to this weird pattern in the 3D contact. And so these are typical of phage sequences inside these bacterial genomes, maybe differences in AT composition, which affect the outcome of this experiment. And what we showed is that, for instance,
44:20
in Bacillus subtilis, this is a phage here, SPβ, which also presents this weird pattern. So the AT composition of this sequence is different from the rest of the chromosome, so the restriction is going to be affected, and therefore the contact pattern is going to be affected. And what we showed is that if we stress this bacterium, the phage now is actively dividing,
44:41
and is actually very prominent in the contact data. So this is, again, in a controlled experiment on Bacillus subtilis, and we can actually also observe the increase in sequence coverage of this prophage here. And so that tells us that this prophage is actually actively dividing in this bacterium,
45:01
and is actually probably lysing the bacteria. When we go back to our unknown data from the wild, we can actually observe similar patterns. For instance, here we have a prophage that doesn't seem to be activated. The coverage is actually flat compared to the rest of the chromosome. But here, this guy there seems to be actively dividing.
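The flat-versus-raised coverage comparison can be expressed as a ratio. This is only a sketch: the function name `prophage_coverage_ratio`, the bin coordinates, and the coverage values are hypothetical, and a raised ratio is a hint that would need checking, since other effects can also change local coverage.

```python
from statistics import median


def prophage_coverage_ratio(coverage, start, end, flank=4):
    """Median coverage inside the prophage bins [start, end) divided by the
    median coverage of `flank` bins on each side of it."""
    inside = coverage[start:end]
    outside = coverage[max(0, start - flank):start] + coverage[end:end + flank]
    if not inside or not outside:
        raise ValueError("prophage or flanks fall outside the coverage track")
    return median(inside) / median(outside)


# Hypothetical coverage per bin; the prophage occupies bins 4 to 6.
flat = [50, 52, 49, 51, 50, 50, 51, 49, 50, 52, 50, 51]
raised = [50, 52, 49, 51, 150, 160, 155, 49, 50, 52, 50, 51]
ratio_flat = prophage_coverage_ratio(flat, start=4, end=7)
ratio_raised = prophage_coverage_ratio(raised, start=4, end=7)
```

A ratio near 1 matches the flat case described above, while a ratio well above 1 matches the prophage that seems to be actively dividing.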
45:22
So maybe this prophage is active, this one is inactive. And so we wonder if we may actually guess the activity state of the prophage sequences inside the bacteria, again, in this single experiment. So that would be also interesting. So this is the experiment we are doing now,
45:42
along with others, either with wild-type mice, or with mice with controlled mixes of species, like 13 species or more. And we can, for instance, treat some of these mice with antibiotics. We can change the composition of the gut by adding, for instance,
46:01
a species with an antibiotic resistance gene, et cetera. And so now the idea is to sample the feces of these mice at different times, and track in 3D, in 4D actually, so in 3D over time, the propagation of genes, the activation of prophages, and arrive at a comprehensive picture
46:23
of the dynamics of the population. And so this is where I'm going to stop. This is just very preliminary data where we observe that the lines here correspond to phages, let's say the five phages present in this bacterium,
46:41
which is one of the 120 bacteria present in the gut of these mice. So this is one of the bacteria, and the phage behavior is represented with these lines, the five phages present there. And actually one of the phages seems to behave very differently from the rest of the DNA fragments, and is over-amplified, so that may point, when we treat with antibiotic, to
47:03
the activation of this prophage that may play a role in the lysis of the bacteria or the dynamics of the population. So now we are trying to integrate all of this data. It's quite complex. I mean, for us it's complex because we are not used to working on these things, but we try to integrate all of this data together to arrive at this comprehensive picture.
47:23
So again, what I showed you was that these contact frequencies can be used, of course, to analyze the biology of the chromosome, but we can also develop these alternative approaches that help us to bridge the microbiome with the virome and get this improved understanding of ecosystems,
47:43
complex systems over time. So these are the people who did the work. Most of the experiments were done by a postdoc in my lab, Martial, who got a CNRS position two years ago, Axel, an engineer in my team, and Lyam, a PhD student, as well as a former PhD student for the GRAAL program,
48:02
and Julien, a physicist we work with at UPMC, who didn't contribute to this work but is a very good collaborator as well. I thank you for your attention. Thank you. Thank you.