
Geodata on IPFS


Formal Metadata

Title
Geodata on IPFS
Title of Series
Number of Parts
295
Author
Contributors
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Storing and distributing large data sets is still challenging. There are issues like corrupted files, bandwidth limitations or just unreliable networks. This talk is an introduction to peer-to-peer technologies and [IPFS]. You will learn how these technologies can help you with the previously mentioned problems and why they are a great fit for your geodata. Such a peer-to-peer network doesn't necessarily need to be a public one that everyone can participate in; it may also be used to distribute the data between trusted parties. Currently, there is a Go and a JavaScript implementation of IPFS available, the latter of which even runs in the browser. IPFS is fully open source and licensed under the MIT licence (soon MIT+Apache2 dual licence). [IPFS]: https://ipfs.io/
Keywords
Thumbnail
Transcript: English (auto-generated)
Okay folks, good afternoon. Welcome to the last session of this day. We'll start with Volker Mische, who doesn't really need an introduction.
The floor is yours. Happy to be here. It's my tenth FOSS4G, so I'm happy to be here again after ten years. So, let's get started. Oh, and the mic, sorry. Okay, I'm normally always faster than I think I am.
So, welcome everyone. This talk is about geodata and IPFS. I'm happy to be here; it's my tenth FOSS4G. I also welcome everyone on the live stream. But let's get started. So, really quickly about me: I'm Volker, and I work as a software engineer at Protocol Labs.
I normally do everything open source. You might know me from GeoCouch or Noise. I mostly code in JavaScript, but I still like to code in Python. And the most important thing is: what is the goal of this talk?
The goal is that you learn about new technologies to distribute and publish your dataset in a different way. And the nice thing about this talk is it concentrates really on the concepts, not really on the projects. So I'm not selling a product or project. I'm selling an exciting future.
So, let's start with the concepts. The most important one: IPFS and similar systems are decentralized systems. Often people also say distributed systems, which is totally fine; I use the terms interchangeably.
But I like the term decentralized more because it gives the hint that things were different. Because one could argue, if you store it in AWS or Google Cloud, they also use a distributed system in the backend. So it's a distributed system you store your stuff on, but it's not decentralized. And having such a system means that there's no single point of failure.
It will always kind of work. It might not work completely well, but it keeps working. So it's more robust and more fault-tolerant. And another nice thing about it is that it's harder to intercept. And the fun thing is, in my talks about IPFS, I always use the example of Skype.
Because like 10 years ago, Skype was a distributed system. And when Microsoft bought it, they changed it and made it like a typical server client architecture. And in my previous talk, as in last year, I said, well, potentially people from Microsoft could listen to your talks, what you do on Skype.
And luckily enough, like two weeks ago, it was in the news that they're actually doing it. So, yeah. So that's also one thing. And you can distribute your data in a better way, but I come to this later on. So it's kind of like the topic of the talk.
The other concept is content addressing. This is really core. Content addressing is kind of the opposite of location addressing: content addressing is about which data you want, as opposed to location addressing, which is about where the data actually is. This sounds pretty abstract, but there's a really nice analogy for it.
It's not from me, it's from a colleague, but I steal it because it's so good. So, content addressing is like an ISBN number. I read a great book, and I tell Skyler: hey, you need to read this book as well. I just tell him the ISBN number, and he will be able to get the book.
He could go to a library, he might get a physical copy from me, or he might order it online. And when we meet again at the next FOSS4G, we can talk about the book; as long as we've read the thing with the same ISBN number, we know we read the actual same thing.
That totally makes sense. In a location-addressed world, what would this look like? I would tell Skyler: hey, I read this great book, and you can find it in Augsburg, where I live, in the university library; it's the building down before the hill, on the second floor, down the hallway, the second shelf, the third book from the top.
Of course he could go there, but success is not certain. It could be that he doesn't have access, or the book is simply not there. Or, even worse, there's a different book. So he reads that book and wonders what taste in books I have.
In this example you can see that content-addressable systems totally make sense. So, yeah, you should just use them. But the reality is, we are so used to location-addressed systems. The World Wide Web is a location-addressed system. And there you have exactly the same problems as I just described with the library.
A website might be just gone, or it might have changed since your last visit, and you can't tell. Also, the W3C, which makes a lot of the web standards, leans really heavily on location addressing. So even in more modern standards, like JSON-LD, you have this context element, which points to a schema.
But even this is a URL, so it's a location. And even in this working group, people asked whether they could make it content addressable, but they just don't really seem to want content-addressable systems in their World Wide Web. Or they don't get the concepts; I don't know what the problem is, but they are really set on location addressing,
and don't see the advantages, which I come to later on, of content-addressing systems. And also if you think about the OGC: many of the old standards are XML and XML schemas, and if you ever tried to actually fetch those schemas to validate your data, you might not have been successful. So, the next concept is, of course, how to make your data content addressable.
Because you can't just assign an ISBN number to it. This combines with the concept from before, the decentralized system: we need something without a central authority. With an ISBN number, some authority assigns the number, everyone agrees on it, and everything's fine.
This won't work on the web, for example. Instead, you do hashing, and I'll quickly introduce the idea of hashing for those who are not familiar with it. It's pretty simple; it's a bit of math like you learn at school: you have a function, and it has an input. In our case, the input is data, and the output is a long number.
It's as simple as that. And the nice thing about it is: if you put in the same data, the same number comes out. And if you put in different data, a different number comes out. Now my question is, how can this work?
Because the data is kind of unlimited, and a long number is limited, so how can this even work? That's why there's the asterisk in the corner of the slide: because it is only very likely that the number is different. And there's another asterisk, because it's likely enough. That's the short version of hashing.
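The function-with-input idea just described can be sketched in a few lines of Python. SHA-256 stands in for the hash function here; actual IPFS identifiers are multihash-encoded CIDs, but the principle is the same:

```python
import hashlib

def content_id(data: bytes) -> str:
    # The "long number" the talk mentions, rendered as a hex string.
    return hashlib.sha256(data).hexdigest()

a = content_id(b"satellite scene, tile 32UPU")
b = content_id(b"satellite scene, tile 32UPU")
c = content_id(b"satellite scene, tile 32UPV")

# Same data in, same number out; different data in, a different number out
# (different only with overwhelming likelihood, hence the asterisks).
print(a == b)
print(a == c)
```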
OK, so we have this system going. Now, coming back to content addressing: it's location independent, but what's the big deal about it? What are the advantages? Let's say you work at some institution and you are on a shared research project,
and you download some satellite imagery, and probably your colleague downloads the same data. You both get it from AWS and everything works. But how nice would it be if, when your colleague has already downloaded it, you just get it directly from your colleague's computer instead of over the web again? It would be way faster and pretty simple.
And this works if you address the data by content, so you only care about it. You want this piece of data, not where you exactly get it from. It doesn't matter if it's on the internet, or if it's from your neighbor's machine, or if it's from the Wi-Fi within the organization, it doesn't matter.
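A tiny sketch of the "it doesn't matter where it comes from" point: a fetch helper tries several sources, colleague first, and accepts whichever copy hashes to the requested identifier. The source names and dict-based stores below are made up for illustration:

```python
import hashlib

def content_id(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def fetch(cid: str, sources) -> bytes:
    """Try each source in turn; accept the first copy whose hash matches."""
    for get in sources:
        try:
            data = get(cid)
        except KeyError:
            continue  # this source doesn't have it; try the next one
        if content_id(data) == cid:
            return data  # verified: exactly the bytes we asked for
    raise LookupError("no source could provide " + cid)

colleague_cache = {}  # hypothetical: your colleague's machine on the LAN
remote_store = {}     # hypothetical: the far-away download server

scene = b"some satellite imagery"
cid = content_id(scene)
remote_store[cid] = scene

# The colleague doesn't have it yet, so this falls through to the remote store.
data = fetch(cid, [colleague_cache.__getitem__, remote_store.__getitem__])
```

Because every copy is re-hashed before being accepted, this is also where the integrity check the talk mentions later comes for free.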
The other thing is sneakernet. For those not familiar with the term, it means that you deliver data with your sneakers, so, for example, you send over a hard drive, which still happens. It happened to me recently: I ordered some data from the Bavarian survey, and it was 400 gigabytes. And the department is the Department of Surveying, Digitalization and Broadband.
And they, of course, sent it on a hard drive. And if you want to plug in that hard drive, you need to change your whole workflow. Many workflows start with downloading the data from some location, and the scripts then work automatically. But if you don't start with downloading the data from there, you need to change paths and things.
But if you want to know more about this, I gave a talk last year; it's recorded on YouTube. So, I'm done with the concepts. Now I come to some other things. Did I skip something?
No. Okay. So, another project where content addressing would make sense is STAC. You may have heard of it. It's currently the newest, hottest thing in Earth observation: the SpatioTemporal Asset Catalogs. To keep it short, it's really a metadata catalog, which is really trimmed down for searchability and discoverability.
It's mostly JSON. And it also uses URLs, so it's location addressed. That's fine. But what I did in the Earth observation data challenge was trying to find out: could I use the existing standard and make it work with content addressing?
And the people behind the standard are really smart people, because what they did is put the information about what the link is about in as a relationship. So, the relationship is encoded in the metadata. They, for example, say: it's an item, it's the children, it's the parent.
And this makes it possible to use content addressing, because the links are not inferred from the URL. What I mean is: often, if you have a parent-child relationship, the child is just basically one directory deeper. Or you have an item, so you have slash item. So basically the URL encodes the information about what type something is.
In this case, it's not the case. So, we can just use content-addressable links, and we're done. And if you want to know more about this, just come tomorrow, same room, 11 o'clock, to the Earth observation data challenge session. I'll give a short talk about my work there. All right. How am I on time?
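The point about the relationship living in the metadata can be sketched like this: a STAC link object carries its meaning in the `rel` field, so the `href` can name content instead of a location without losing anything. The `cid_for` helper and the `ipfs://` href form below are illustrative assumptions, not the actual encoding used in the challenge:

```python
import hashlib
import json

def cid_for(obj) -> str:
    # Hypothetical content identifier: hash of the canonical JSON bytes.
    canonical = json.dumps(obj, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()

child_item = {"type": "Feature", "id": "scene-001"}

# Location-addressed link: the relationship is in "rel", the target is a URL.
link_by_location = {"rel": "child",
                    "href": "https://example.com/catalog/scene-001.json"}

# Content-addressed link: same "rel" semantics, but the href names the bytes.
link_by_content = {"rel": "child", "href": "ipfs://" + cid_for(child_item)}
```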
I forgot to start my timer. Okay, yeah, that's okay; I'm pretty much on time. Good. All right. So, on to the next thing. As I said, the talk is about distributing data, and I think it would be great to distribute the Sentinel imagery with such a system.
First of all, because I've gotten criticism in the past for talks on this topic, I want to emphasize that the current system works. And that's great, because getting so much data distributed is not easy. So, I don't want to bash the current system.
I just want to point out that there might be better or more efficient ways of doing it. So, what could be better? One thing: if you receive the data and it's corrupt, it might not actually be the original data; during the transport it got corrupted somehow.
You unzip the file, and then you find out: oh, this is corrupt. You need to get the data again, and so on. If you use a content-addressable system with hashes, you get integrity checking for free, because it means: you have some data, you build the hash, and that's the identifier of the data. Now, you retrieve the data.
You can just hash the data in the same way, and if the identifiers match, you know it's exactly the data you wanted, that it's consistent and not broken. So, you have it basically built in. The other thing is cost-effectiveness. Please note that with cost-effectiveness, I don't mean: hey, we can do all this way cheaper and everyone will be happy.
That's not what I'm saying. What I want to say is that you could spend the costs on different parts of the system and get a better system out. For example, multi-point distribution. What I mean with it is that currently you have, for Sentinel, you have the central instance like the Copernicus hub, or whatever it's called these days,
and you have the DIASes, and this is where you get the data from. How great would it be if, for example, the DLR in Germany would just get the data from the Copernicus hub, and then they also have a copy, and they also publish it, and then France or Italy could either get it directly from the European Space Agency or from the DLR,
and again, this is the point of having a content-addressed system: it doesn't matter where the data comes from. They build a huge network, and you just basically get it from the fastest place, or the nearest place, or the cheapest place; it doesn't matter where it comes from. The other thing is that you would know the number of copies.
The background story, from what I've heard from people storing the data: as the Sentinel imagery is quite a lot of data, you normally don't keep all of it on hard drives or SSDs. At some point, you put it in long-term storage,
and of course, having the data readily available is much nicer for the end user. So what you do is, for example, store Germany or Europe forever on hard drives, and the rest of the world you store for three months, and afterwards you put it into long-term storage. But what could now happen is that
everyone does it the same way, Germany and France and Italy, so you have like ten copies of the last three months, and then everyone puts the rest into long-term storage. What if, for example, Germany takes the first quarter, Italy takes the second quarter, and so on? Basically, for the same cost, you would have more data readily available. Of course, you could also arrange this with contracts or policies and so on,
but that's an insane amount of overhead if you can instead just solve it with technology. So, I haven't talked about IPFS yet, and we have five minutes to go. IPFS is exactly such a system that does those things. Quickly: it is a decentralized system to publish, share, and distribute data easily.
It is like a virtual file system, hence the FS: it stands for InterPlanetary File System, so it really has directory structures and files and so on. There's an implementation in Go and in JavaScript. The exciting thing about the JavaScript one is that it works in a browser,
and hopefully there will be a Rust implementation soon as well. And of course, this is an open source conference, so it is fully open source under the MIT license. Really quickly about the rough architecture: it's not really one monolithic thing, but it's composed of several parts.
The networking layer is libp2p. That's kind of the lowest level, and you can also use it outside of IPFS. For example, for all the blockchain fans out there, it might be used for Ethereum 2.0.
They use libp2p and all the other stuff. Then on top of it, there is IPLD, which is basically just hashes, links, and blocks. Nothing more. And this is what I work on, so all day long I only have links and blocks in my mind. And on top of that is IPFS, which is really just the file system layer
with directories and files. The nice thing about it is that in IPLD, for example, you can store structured data. So when I did the prototype for the STAC catalog, which is JSON, I just used the IPLD part. I didn't use any of the other parts. So, I come to the end of my talk and tell you my long-term vision.
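The hashes-links-blocks layering can be sketched with a toy in-memory block store, where a link is nothing more than the hash of another block. Real IPLD uses CIDs and codecs such as DAG-CBOR; this only shows the concept:

```python
import hashlib
import json

store = {}  # toy block store: hash -> raw block bytes

def put_block(obj) -> str:
    """Serialize an object, store it under its own hash, return the hash."""
    data = json.dumps(obj, sort_keys=True).encode()
    h = hashlib.sha256(data).hexdigest()
    store[h] = data
    return h

def get_block(h):
    return json.loads(store[h])

# A "file" block, then a "directory" block that links to it purely by hash.
file_hash = put_block({"name": "scene.tif", "size": 123})
dir_hash = put_block({"entries": [{"link": file_hash}]})

# Following a link is just another lookup by hash.
resolved = get_block(get_block(dir_hash)["entries"][0]["link"])
```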
I'd like large datasets to be distributed over a decentralized system. I don't really care that much whether it's IPFS or some other system, but I think it could just be a way more efficient thing to do. And my goal is that the European Space Agency, for example, is using IPFS. I gave myself, as it takes a long time
to convince them: ten years. I'm now two years in, so I have eight more years to go. So, if any of you in the audience, on the live stream, or watching the recording know someone from the European Space Agency or connected agencies, and think that this is a good idea or that they should at least look into it, please get in touch with me. I'm happy to talk about it, educate them,
and think about plans or visions and so on. All right, thanks for your attention. Thank you, Volker. So, we have quite a bit of time for questions. Please wait for the microphone before you ask.
Could you talk maybe for a couple of minutes about what sort of applications do you foresee being able to use this? Like, how would this connect to applications that people would use it for?
Okay. Good question. So, for applications: currently, if you want to use it, there is, for example, a FUSE binding for Linux, so you can use this really as a file system.
The main API currently for IPFS is an HTTP API. It's kind of funny: you have new protocols, peer-to-peer and everything, and then you use HTTP for it. But that's currently the easiest way of doing it. But in the future, what you would really do
is you use the implementations like the JavaScript one or the Go one or hopefully the Rust one and you're basically embedded in your application. So, let's talk about like, when we meet here again in ten years, there will be a talk from the GDAL people talking about how they implemented a node within GDAL so they make the replication easily possible with everyone else.
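For reference, the HTTP API just mentioned is served by a local daemon, by default on port 5001, with endpoints such as `/api/v0/cat` that return the content behind a given identifier. The sketch below only builds the request URL; actually sending it (recent daemon versions expect a POST) assumes a running daemon:

```python
# Assumed default API address of a locally running IPFS daemon.
API = "http://127.0.0.1:5001/api/v0"

def cat_url(cid: str) -> str:
    # /api/v0/cat streams back the bytes behind a content identifier.
    return API + "/cat?arg=" + cid

# With a daemon running, something like
#   urllib.request.urlopen(urllib.request.Request(cat_url(cid), method="POST"))
# would fetch the data; without one, we can at least see the request shape.
print(cat_url("QmExampleCid"))
```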
So you really would want to have a local node running in order to exchange data. But, until we get there, you would use the HTTP API for it. Yeah. Do you think the universities would like to be data providers too?
And can they cooperate about having, say, one quarter of the data for each cycle or whatever? Yeah, good question. I don't know if universities could be part of this. But also, what I forgot to mention in the talk is that
when I talk about this whole distributed system, it doesn't necessarily mean that it needs to be a public one. For example, for the Sentinel distribution, it could be that the delegations use an internal kind of cluster, a network of servers, to distribute the data, and the whole public distribution is a separate thing.
But of course, the ideal case would be that it would be public and everyone could participate. Universities could then also participate, or everyone who has servers available, but it could also be separate. So I guess in the long-term vision it would rather be that they use it only internally at first, and then that's the next step.
Did this roughly answer your question? Okay. Yeah. Anyone else wishing to participate or comment on this? What's the difference between IPFS and Dat?
That's a full talk on its own. It's very similar technology: Dat is also about hashes and decentralization and content-addressable systems, so basically those concepts apply as well. This is why I wanted to stress the concepts so much, because you could use other systems as well.
The difference, I would say, is the focus of the project: Dat is more about sharing archives, giving some archives to someone else, and replicating those archives,
while IPFS is more about a global, universal network of things; that's the general idea, I would say. And from a technology perspective, I would say that IPFS takes the more academic, build-from-the-ground-up approach, making sure that all the primitives are sound
and that it works and so on, which means it's not that usable yet, I would say, for the end user. Dat is in this regard radically different; I would say it's very usable compared to other systems. It works well, but I, for example, like the primitives that IPLD provides
more than the primitives that Dat has. Still, if you used it today, Dat would, I would say, work better than IPFS. All right, I think we'll close it here. The session will resume at half past four. Thank you, Volker. Thank you.