
Jubako, a new generic container format


Formal Metadata

Title
Jubako, a new generic container format
Subtitle
A new file format to store contents all together
Title of Series
Number of Parts
542
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Jubako is a new generic "container" file format. It allows you to store content and metadata in one file. You can read your data from the file in an efficient manner, without uncompressing or extracting the data from the file. We will also see Arx, a file archive format (equivalent to tar/zip) which uses Jubako. Jubako is to storage what XML is to serialization: it specifies how you store things, but it is up to you to define what you want to store. Jubako supports: - Arbitrary content - Arbitrary metadata - Selective compression (some content may be compressed while other content is not) - Efficient random access (no need to extract the data to read it)
Transcript: English(auto-generated)
Well done, everyone. Thank you. We start with a small introduction to have a bit of context about Jubako. I'm Matthieu Gautier, I'm a freelance developer. My main client is the Kiwix project, and there I'm the lead developer of libzim.
What is Kiwix? Kiwix is a project to provide content where the internet is not available. The question we try to answer, and have answered, is how to distribute static websites. For example, in case you don't know, all of Wikipedia in English
is 95 gigabytes: 6.5 million articles and media. To do that, we use the Zim format. It's an archive format for web content. Content is selectively compressed, so you can compress textual content,
and leave images or videos uncompressed. You can do random access without initial decompression, so you can access the content inside your archive directly. It works well and is pretty efficient, but there are a few flaws in the design, and the archive format is really tied to web content and to Kiwix.
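The "random access without initial decompression" idea can be illustrated with a toy index in Rust (a sketch of the concept only, not the real Zim or Jubako on-disk layout): if the metadata records each content's offset and length, one content can be read with a single seek, without touching anything else in the archive.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};

// Toy archive: contents concatenated in one buffer, plus an index of
// (offset, length) pairs kept separately from the contents themselves.
// Hypothetical names, for illustration only.
fn extract_one(pack: &[u8], index: &[(u64, usize)], i: usize) -> std::io::Result<Vec<u8>> {
    let (offset, len) = index[i];
    let mut cursor = Cursor::new(pack);
    // Jump straight to the wanted content; the other contents are never read.
    cursor.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; len];
    cursor.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    let pack = b"hello world!".to_vec();
    let index = [(0u64, 5usize), (6, 6)]; // "hello", "world!"
    assert_eq!(extract_one(&pack, &index, 1)?, b"world!".to_vec());
    Ok(())
}
```

In a real format the slice would additionally be decompressed per content or per cluster, but the key property is the same: reading one entry costs one seek, not a full-archive decompression.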
You cannot add other metadata. The question I try to answer is: could we reuse the good ideas of the Zim format, and do better and more generic? Here is Jubako. Jubako is the Japanese name for stacked bento boxes,
and like those boxes you can compose it the way you want, depending on your needs. Jubako is a new format, independent of the Kiwix project, taking the good ideas of the Zim format but making them generic. Jubako is a meta-container. It tells you how to store things, but it's up to you to decide
what you want to store and how you want to organize it. There is a reference library written in Rust. The features of Jubako: mainly read-only archives. There is selective compression, so you can compress content or not.
No initial decompression is needed, and you can do random access on the archive. It's configurable, so you can decide which properties you want on the entries. There is an extension system, so your users can download an archive,
and later download extra content to add to the archive you provide. It's embeddable in another file, and it's composable, so you can compose different kinds of entries together in the same container. There are checksums, and a few features to come:
signature and encryption, direct access to uncompressed content, content deduplication, modification, diff and patch between archives, and overlays. Let's have a quick tour of the internal structure. Jubako containers are organized around packs.
There are three kinds of packs: the manifest pack, the content pack and the directory pack. Each pack can be stored individually on the file system, or they can be put together in one file, and then you distribute this file to your users.
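The three-pack layout just described can be sketched with plain Rust types (hypothetical names, not the actual jubako crate API): a manifest that references the other packs, which may live in separate files or in one combined file.

```rust
// Toy model of Jubako's pack layout. Content packs hold raw (possibly
// compressed) bytes; the directory pack holds entries and metadata.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum PackKind {
    Content,
    Directory,
}

#[derive(Debug, Clone)]
pub struct PackRef {
    pub kind: PackKind,
    pub id: u32,
}

// The manifest pack is the entry point: opening a container means
// opening the manifest and resolving the packs it lists.
#[derive(Debug)]
pub struct Manifest {
    pub packs: Vec<PackRef>,
}

impl Manifest {
    // Ids of all packs of a given kind referenced by this container.
    pub fn packs_of_kind(&self, kind: PackKind) -> Vec<u32> {
        self.packs
            .iter()
            .filter(|p| p.kind == kind)
            .map(|p| p.id)
            .collect()
    }
}

fn main() {
    let manifest = Manifest {
        packs: vec![
            PackRef { kind: PackKind::Directory, id: 0 },
            PackRef { kind: PackKind::Content, id: 1 },
            PackRef { kind: PackKind::Content, id: 2 },
        ],
    };
    // One directory pack, two content packs (e.g. a base pack plus an
    // extension pack downloaded later).
    assert_eq!(manifest.packs_of_kind(PackKind::Directory), vec![0]);
    assert_eq!(manifest.packs_of_kind(PackKind::Content), vec![1, 2]);
}
```

Keeping the manifest as a simple list of pack references is what makes the extension and composition features possible: adding content is just adding a pack to the list.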
The manifest pack is the main pack. This is the pack you try to open when you want to open a Jubako container, and it's mainly a list of all the other packs of the container. The content pack is the pack which contains the raw content, compressed or not, without any metadata.
And the directory pack is where you store the entries, and the entries can point to content in the content pack. This is the configurable part of Jubako. Inside the directory pack, there are entries with a specific schema.
So you have to define the schema, and the schema is a series of properties and their types. The content is just a property: it's a link to content in the content pack, so you can have entries that point to several contents or to no content at all. And each schema can contain variants.
It's a kind of union or enum, as in C or Rust. And you can have different kinds of entries inside one directory pack. Which use cases? Why would you like to use Jubako?
The first use case is file archives. There is Arx, which is the equivalent of tar, and here we have one kind of entry with three variants: file, symlink, and directory. All three variants share two common properties,
and, for example, the file variant adds a pointer to a content, the symlink adds the target link, and the directory just stores the pointer to its first entry and the number of entries in the directory. So it's the same kind of organization and tree structure as a file system.
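The schema just described, shared properties plus per-variant fields, maps naturally onto a Rust enum, the same union/enum idea mentioned above. This is a hypothetical sketch of Arx's entry schema, not its actual types:

```rust
// Toy model of an Arx entry: common properties shared by every variant,
// plus variant-specific data, like a tagged union in C or an enum in Rust.
#[derive(Debug)]
pub enum EntryData {
    // A file points to a content stored in a content pack.
    File { content_id: u32 },
    // A symlink stores its target path.
    Symlink { target: String },
    // A directory stores the index of its first child entry
    // and how many entries it contains.
    Directory { first_entry: u32, entry_count: u32 },
}

#[derive(Debug)]
pub struct Entry {
    // Properties shared by all three variants.
    pub name: String,
    pub parent: Option<u32>,
    pub data: EntryData,
}

impl Entry {
    pub fn kind(&self) -> &'static str {
        match self.data {
            EntryData::File { .. } => "file",
            EntryData::Symlink { .. } => "symlink",
            EntryData::Directory { .. } => "directory",
        }
    }
}

fn main() {
    let entries = vec![
        Entry { name: "src".into(), parent: None,
                data: EntryData::Directory { first_entry: 1, entry_count: 1 } },
        Entry { name: "main.rs".into(), parent: Some(0),
                data: EntryData::File { content_id: 0 } },
    ];
    assert_eq!(entries[0].kind(), "directory");
    assert_eq!(entries[1].kind(), "file");
}
```

The directory variant pointing to a contiguous range of child entries is what gives the flat entry list its file-system-like tree structure.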
There are no Unix properties for now, mainly because Arx is pretty young and I don't want to bother with them while testing Arx and Jubako, but they are easy to add. It's a file archive, so we can compare Arx with tar
to see how Jubako and Arx perform. If we take the Linux source code, the full Linux source code is more than one gigabyte, and with both tar and Arx compressing it, the source code comes to about 130 or 140 megabytes.
Creation time: Arx is a bit faster than tar. And extraction time: Arx is a bit slower, but both tools are pretty close.
What is interesting is when we try to list the content of the archive: tar takes almost the same time as extraction, because to list the content of a tar archive you need to uncompress all the content, while Arx is much faster because the list of entries
is kept separate from the contents themselves. Then, extracting only one content from the archive, what I call dumping: when you dump a third of all the entries,
you can see that Arx is really, really much faster. In the same way, extracting one entry from tar takes about the same time as listing the contents, because you need to uncompress all the contents of the tar archive,
while with Arx you can locate the content and do a direct access to it without uncompressing the other contents. Another thing we can do is mount the archive directly on the file system,
and if you mount the archive and do a diff of the content between the original source and what is mounted: a diff between two plain directories takes a bit less than a second, with Arx it's four and a half seconds,
and for tar it's an estimation, it would take something like ten hours to do the comparison. You can do something even more interesting with a mounted file system, with the mounted Linux source: compiling the kernel.
If you compile the kernel on a plain file system, it takes a bit more than half an hour, and if you compile the kernel using the mounted Arx archive, it's a bit less than an hour. What is interesting here is that the compilation is made with -j8,
so there are eight processes, and the Arx FUSE file system is single-threaded, so there is a huge bottleneck for now; if we move to a multi-threaded FUSE file system, this could be even better. Another use case is Jim,
a kind of equivalent of the Zim format. There are only two variants, and here we store the entries as a plain list, there is no tree structure. And the Jim binary just integrates a small HTTP server looking up the entries.
We could also replace, for example, RPM and DEB with Arx, or with things based on Jubako, so you could download your package and not extract it on the file system, just open it directly.
And even devel or debuginfo content could be put in a specific content pack of the same archive, which could simplify the management: you would not need different packages for the different subtypes of content of your packages.
OCI containers are based on tar; you need to extract them on the file system before running a container, so you could just use Arx for them, or you could even put the different layers in different content packs, and then the whole image would be one Jubako container.
File formats: almost all file formats are in fact containers for other content, so you could use Jubako to organize the content you want to store
the way you want for your own project and your own file format. Websites: Jubako is written in Rust, so we could run it in Wasm, and then you could load your Jubako archive in the browser once and just open it directly in the browser.
Backups: Jubako is almost incremental by design. If you reuse the content packs of the previous backup, it is incremental, and you can decide which properties you want to add; for example, you can add a checksum on each entry
to do a comparison between the contents stored in the backup and what you have on the file system. Embedding resources: Jubako can be embedded in an executable program. Even more: you can download this presentation at this address,
and you will have a file, and this file is an Arx archive, so you can just use the arx tool to list the content, extract or mount the archive, and you will have access to all the files of this presentation. It is reveal.js, so it's HTML content,
but the same file is also a Jim archive, so you can just use the jim tool to serve the content and open a browser to localhost. And the same file is also a program, so if you make it executable, you can run the program itself
to mount, extract or serve the content. What is interesting is that the content is not duplicated between Arx and Jim: it is both an Arx and a Jim archive, but these are just two views of the same content. There is no duplication; it's not two archives put together,
it's really one archive with two kinds of views of the same content. And the last line is the exact command used to serve this actual presentation.
Conclusion: this is a new way of thinking. We could use the archive directly instead of extracting it, so we can rethink our tooling around using the archive directly. It's generic, it's a common base
that can adapt to different usages. But it's pretty new: expect maybe some crashes, and maybe some changes in the specification. Thank you.
How does it compare to CramFS? Can you repeat the question? How does your format compare to CramFS?
I don't know it. I know about SquashFS. The thing is that Jubako is not a file system. Arx is an archive to store files, but Jubako is not. So Jubako is probably more generic than CramFS or SquashFS.
And Arx compared to SquashFS is about half as fast as SquashFS. On size, Arx is better,
but on performance, it's slower. Are you implementing this in other languages? Let me repeat the question:
could we re-implement this in other languages? You could. The specification is language-agnostic, but I've only implemented the reference library in Rust. The specification is public.
Zip is pretty small. But Zip is slower than Arx in almost any kind of operation, and it is bigger than Arx also.