
rav1e - 0.3.0 and after


Formal Metadata

Title
rav1e - 0.3.0 and after
Subtitle
What we did so far and what we will do in the future
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.

Content Metadata

Abstract
rav1e is an open-source AV1 encoder. We'll see what makes it fairly unique besides the choice of Rust as its main development language. We'll see what we did in the past releases, which design choices we took, and what we plan to do in the next two releases. By February we will have releases 0.2.0 and 0.3.0 out. I'll present what's coming in releases 0.4.0 and 0.5.0. This will include some performance evaluation and a description of some of the features that are currently unique to rav1e.
Transcript: English (auto-generated)
And our next talk is related to the previous one; this one is about the encoder. Please welcome Luca. So, hi everybody. We just had a talk about decoding, and now I'm presenting an encoder. I'm a contributor to both dav1d and rav1e, among other open source software. Here are the contacts if you want to ask me questions afterwards. So, I will talk about AV1, and a lot about rav1e. I will probably mention Rust a couple of times; I won't be too preachy. I will talk a little about memory and performance profiling on Linux. If you were at the previous talk, you probably already know what I'm going to say.
This won't go into many details because I don't have enough time, and it's mostly about roadmaps. So, interrupt me anytime, ask me questions, and I will try to answer. I hope you will have fun.
First, rav1e: an AV1 encoder written in Rust. Strange, we know. It does have lots of architecture-specific assembly, because we are importing it from dav1d. And since we are using Rust, if you want to try it, it's quite easy to do that: you just use cargo and you have it. If you are more traditional, you can enjoy it from GStreamer, FFmpeg, and pretty much any software that can consume our C or Rust API. So, rav1e has to be fast, featureful, and safe.
How far are we? Let's see. Jean-Baptiste already presented what AV1 is and who is involved, so I can skip all of this because I know that all of you were here before. So, the summary for AV1: we are pretty well positioned on the decoding side because of dav1d. We have support for all the browsers, again mainly because of dav1d. And that part is more or less a solved problem, aside from 10-bit support, but work is being done. And we have hardware, so all is good, right? Well, encoding is a different story. Encoding is normally hard. If we consider past history, it took about seven years for H.264 to get the great encoder that it has today. H.265 had another good project; to become a good competitor for even H.264, it took about the same time, and in that case it managed to leverage a good deal of experience because it shared a bit of code with the previous one. And HEVC is much harder, much more complex than H.264, right? So, what happens with AV1, which is a lot more complex, with lots more features?
Well, what do we have here? Open source-wise, we have libaom and SVT-AV1. Both come from a lot of previous code. One is the inheritor of libvpx, so most of its structure is from there. SVT is a whole family of encoders: you have SVT-whichever-codec you can think of. Again, a long tradition. And they are putting a lot of effort into getting AV1 ready and producing something that is as amazing as AV1 should be, because AV1 in itself, at least on paper and partially in practice, is a really good codec. So, what happens? Well, libaom is, well, slow. It's really slow. It's where all the experiments happened, and because of how it's managed, we could say that it's some kind of graveyard, because even the code that didn't make it into the specification sort of lives inside, lingering. SVT-AV1 is blazing fast. It's really fast. It needs lots of hardware, a lot, and obviously there are trade-offs. So, currently, SVT-AV1 could be a good solution, at least if you have enough silicon to sacrifice to the SVT god. If you don't, well, sour grapes.
rav1e, what's the plan with rav1e? rav1e is completely new. It comes from a different kind of experience, because most of the Daala team is now working on rav1e. So, it's from scratch, written in Rust, but we have some background, so to speak, and we focus on something completely different. We want to explore, we want to leverage the experience from Daala. So, the focus is on getting different solutions, trying different paths, using different algorithms, and trying to get the best perceived quality. Speed is a concern, memory footprint is a concern, but the main focus is to experiment more and see what we can do.
And obviously, initially it was quite fast because it was quite tiny, and we want to stay fast and get even faster. So, first of all, we want code that is readable: not too many lines of code, and not something that was sort of smart to write once and that the future you is going to complain about to the past you for your choices. Speed is a concern, but we don't want to get speed just by throwing more hardware at it. We want something that is fast regardless of the kind of hardware you have. Compact, meaning you can run multiple instances without requiring way too much memory. And we would like to make sure that real-time encoding could be a thing,
batch VOD encoding could be a thing, and everything in between those two extremes could be feasible, so there is a lot on our plate. When I say that rav1e is lean, I mean that if we consider libaom, it is large, it's really large, and the code is lots of C, lots of C++ (because of the way tests are done in libaom), plus some assembly, so you can get lost just because there are way too many lines of code to search and sift through. rav1e, considering all the optimizations, is nearly a fifth of that, and if we consider just the Rust code, so no assembly optimizations, it's about 55K lines of code, so fairly tiny. If we aggregate the two projects, dav1d and rav1e, so that we have something functionally similar to libaom, we are still half its size, even if we count all the assembly that we are using. So if you care about AV1 and you want an idea of how it works, you can take the two projects together, just the C code on one side and the Rust code on the other, and stay within about 100K lines of code,
so it's still quite less to have to read to figure out what's going on, and both code bases are sort of easy to read compared to others. So we want to be fast. How you can get faster? Well, as I say, our first focus is
to get better algorithms, even the theory behind has to change before you can actually get something that is fast, but also you can just look at what you did and try to figure out if you can not do some work.
And another easy way to, much easier way to be faster, just leverage what the CPU provides. CMD is available pretty much everywhere. Using CMD is something that gets you good results
and does not require as much as intellectual effort as rethinking all the algorithm that you are going to use. Another item that is important is to be careful about how you use the memory. Cache locality is something that is going to kill you or save you, depending on how your code is laid out.
Ah, one online question: is parallel encoding a thing in rav1e, as in distributed across several machines? I will answer it at the end. It will be. So, last but not least, multi-threaded processing. We throw more hardware at the problem, and since in many cases we do have multiple cores in our machines, that can be useful depending on your use case.
What do I mean by algorithmic improvements? Well, some of it is sort of easy. We have lots of processing that consists of applying some kernel to an image, and in many cases the intermediate results can be reused. The concept of integral images lets you lay out what you are doing so that the intermediates do not have to be recomputed over and over, and that sped up the loop restoration process a lot.
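To illustrate the idea (this is a generic sketch, not the actual rav1e code), an integral image stores at each position the running sum of everything above and to the left, so any box sum a filter kernel needs can be answered with four lookups instead of being recomputed for every block:

```rust
/// Minimal integral-image sketch (illustrative only, not the rav1e implementation).
/// `img` is a `width * height` plane in row-major order.
fn integral_image(img: &[u16], width: usize, height: usize) -> Vec<u64> {
    // One extra row and column of zeros keeps the lookups branch-free.
    let w = width + 1;
    let mut sums = vec![0u64; w * (height + 1)];
    for y in 0..height {
        let mut row_sum = 0u64;
        for x in 0..width {
            row_sum += u64::from(img[y * width + x]);
            sums[(y + 1) * w + (x + 1)] = sums[y * w + (x + 1)] + row_sum;
        }
    }
    sums
}

/// Sum of the box with corner (x0, y0) inclusive and (x1, y1) exclusive,
/// answered with four lookups regardless of the box size.
fn box_sum(sums: &[u64], width: usize, x0: usize, y0: usize, x1: usize, y1: usize) -> u64 {
    let w = width + 1;
    sums[y1 * w + x1] + sums[y0 * w + x0] - sums[y0 * w + x1] - sums[y1 * w + x0]
}
```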
Rate-distortion optimization: this is where we are spending most of the time. So what can you do in that case? Well, this kind of code is like walking a tree: you make decisions and you decide where to go. If you prune it properly, because you know that going further down is not going to lead you to anything useful, you are going to save a lot of time. So we did a lot of work to get early-exit conditions set up, so we are not doing work that we are going to discard anyway. And this kind of work is something that we keep doing all the time.
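As a toy illustration of that pruning idea (the cost model and threshold below are invented, not rav1e's real RDO loop), the search refuses to descend into a subtree once it knows the subtree cannot beat the best candidate found so far:

```rust
// Toy sketch of early exit in a recursive partition search.
// The cost model and the 1.25 threshold are made up for illustration.

const MAX_DEPTH: usize = 4;

/// Hypothetical rate-distortion cost of coding a block as-is (no split).
fn rd_cost_no_split(block_id: u64, depth: usize) -> f64 {
    // Stand-in cost model, only here so the example runs.
    ((block_id % 97) as f64 + 1.0) / (depth as f64 + 1.0)
}

/// Returns the best cost for this block, pruning subtrees that cannot win.
fn search_partition(block_id: u64, depth: usize, best_so_far: f64) -> f64 {
    let no_split = rd_cost_no_split(block_id, depth);

    // Early exit: if coding the block unsplit is already far worse than the
    // best candidate the caller has, splitting further is unlikely to recover
    // the difference, so skip the whole subtree instead of evaluating it.
    if depth == MAX_DEPTH || no_split > best_so_far * 1.25 {
        return no_split;
    }

    // Otherwise try the split: four children, each bounded by the best cost
    // known so far, so the children can prune even more aggressively.
    let mut split_cost = 0.0;
    for child in 0..4u64 {
        split_cost += search_partition(block_id * 4 + child, depth + 1, no_split);
    }
    no_split.min(split_cost)
}

fn main() {
    println!("best cost: {}", search_partition(1, 0, f64::INFINITY));
}
```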
SIMD: we love SIMD, everybody loves SIMD. Not everybody wants to write SIMD code (well, a good number of people don't), but anyway, we like it. So, how do you do SIMD in a Rust code base? Two ways. One is using std::arch, which is part of the standard library; its intrinsics are somewhat like the C intrinsics, but arguably better performance-wise. On the other side, assembly is good. The people who work mainly on dav1d love assembly, and we can share it and use it.
And since we are using Rust, the compiler itself is going to help us much more compared to C, because the Rust language gives the compiler more information, and through that the compiler can produce better auto-vectorized code. That is helping us a lot, even more if people want to use AVX2, because you can just enable it and then the compiler will produce fairly good AVX2 code for your normal loops. So that part is good.
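A sketch of what that looks like in practice (the kernel below is illustrative, not rav1e's actual code): plain iterator code the compiler can auto-vectorize, plus an AVX2-enabled clone selected at runtime through std::arch feature detection:

```rust
/// Sum of absolute differences over two equally sized slices, written as
/// ordinary iterator code so the auto-vectorizer can do its job.
fn sad_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter()
        .zip(b.iter())
        .map(|(&x, &y)| u64::from(x.abs_diff(y)))
        .sum()
}

/// The same code compiled with AVX2 enabled, so the auto-vectorizer can use
/// the wider registers. Calling it is unsafe because the caller must make
/// sure the CPU actually supports AVX2.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sad_avx2(a: &[u8], b: &[u8]) -> u64 {
    sad_scalar(a, b)
}

pub fn sad(a: &[u8], b: &[u8]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked that AVX2 is available on this CPU.
            return unsafe { sad_avx2(a, b) };
        }
    }
    sad_scalar(a, b)
}
```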
Multithreading. Multithreading and Rust are sort of a sweet story: when you're writing multithreaded code in other languages, you will end up making mistakes, and you will end up spending lots of time debugging them. In Rust, you cannot make those mistakes. If it compiles, it usually runs, unless you made logic mistakes, but in that case it's your fault. And what can we do with that? Well, another question.
If... okay, so that one will wait. So I was saying, multithreading: we can do that. Rust makes it much easier, but Rust also lets you have something that is sort of magic, because Rust abstractions really are zero-cost most of the time. And as I said, if we are using iterators, the compiler is going to auto-vectorize them already. What happens when you use something that takes your serial iterator and runs it in parallel?
Well, you get parallelism almost for free. What does "almost free" mean? This is our main loop. It's a bit of a mouthful, but basically we work on tiles, and for each tile, we encode it. Simple, right? Okay, so this is serial: you get the list of tiles, each tile gets processed, and that's it. But the tiles are independent, so we want to do that in multiple threads. That's it: just a single line changes and everything runs in parallel, and we don't have to think much. Well, we have to think a little. We have to make sure the data types we are using are thread-safe, and the closure (and by closure I mean this thing) must not mutate what we are working on.
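Roughly, the change looks like this; the `Tile` and `encode_tile` names are stand-ins for illustration, and the parallel iterator comes from the rayon crate that rav1e uses:

```rust
// Stand-in types: real tiles carry planes, coefficients, and encoder state.
use rayon::prelude::*;

struct Tile {
    id: usize,
}

struct EncodedTile {
    id: usize,
    bytes: Vec<u8>,
}

fn encode_tile(tile: &Tile) -> EncodedTile {
    // Placeholder for the real per-tile encoding work.
    EncodedTile { id: tile.id, bytes: Vec::new() }
}

fn encode_serial(tiles: &[Tile]) -> Vec<EncodedTile> {
    tiles.iter().map(encode_tile).collect()
}

fn encode_parallel(tiles: &[Tile]) -> Vec<EncodedTile> {
    // The one-line change: `par_iter` instead of `iter`. rayon distributes the
    // closure over its thread pool; the closure must not mutate shared state
    // and the item type must be `Sync`, which the compiler enforces.
    tiles.par_iter().map(encode_tile).collect()
}
```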
And that's it. That's how we get lots of multi-threading goodness with minimal effort. We are doing even a bit more work, because we are not that lazy: in future releases you will have an alternative API based on channels, so people who are used to Go, or people who are writing Rust, will have an API that is much more streamlined and much easier to use. And this is how it looks now.
So our API has a send_frame call that is used to feed the encoder with frames, and a receive_packet call that pulls the encoded packets out of the encoder; sort of simple. And this is the effect of rayon: all of this part is running in multiple threads, more or less in an optimal way. We still have work to do on moving this other part onto different threads, so we don't have this kind of large gap that is fully serial. But this is how rayon has already been improving our situation.
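For reference, the send_frame / receive_packet flow looks roughly like this; the exact Config setup differs between rav1e releases, so treat it as an outline rather than copy-paste code:

```rust
// Minimal sketch of the pull-based API: feed frames in, pull packets out.
use rav1e::prelude::*;

fn encode_one_gray_frame() {
    let cfg = Config::default();
    let mut ctx: Context<u8> = cfg.new_context().expect("valid configuration");

    // Feed the encoder; an all-default (gray) frame is enough for the sketch.
    let frame = ctx.new_frame();
    ctx.send_frame(frame).expect("queue has room for one frame");
    // Tell the encoder no more frames are coming so it can drain its queue.
    ctx.flush();

    // Pull encoded packets until the encoder reports it is done.
    loop {
        match ctx.receive_packet() {
            Ok(packet) => println!("got a packet of {} bytes", packet.data.len()),
            Err(EncoderStatus::Encoded) => {}          // progress made, nothing to emit yet
            Err(EncoderStatus::LimitReached) => break, // everything has been drained
            Err(e) => panic!("encode failed: {:?}", e),
        }
    }
}
```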
How are we doing that, and what are we doing? I said I would mention some of the tools we are using; since I covered that in the morning, I will compress it. But mainly, we try to keep all our code as good as possible. We try not to use too much memory, and we try to see whether the new coding tools we are implementing have a strong or a small impact on the overall speed. To see what I mean regarding measuring,
this is what happened to the allocations. We were using way too much memory and doing way too many allocations, in my opinion. In 0.1 it was 6K, quite a lot; the kernel has to work a bit. In 0.2 we managed to cut that in half, which also caused a speed increase. Two days ago, I ran the numbers again and we got even further below that. So again, something that is useful, and we do this kind of analysis more or less all the time. To give you a comparison, this is SVT-AV1. As you can see, it's allocating quite a bit of memory.
I mean, one gigabyte on our side versus six gigabytes for the same content. Now you see what I mean when I say that you have to be resource-conscious. Speed-wise, we keep improving; we try to keep improving. This is what you see at our top speed. It's not something you can write home about yet, because a bit more than three FPS is still not exactly great, but compared to about one FPS, well, we are doing well.
We are improving, and we will keep improving. That was about speed; now about specific features. Since I said that rav1e is focusing on different algorithms, we did work on RDO biasing. Basically, we have our decision tree, and we try to shift the decisions based on how the future will behave for each block. So if something will stay the same in the future, we try to bias towards it, so the encoder will decide to keep the block even if, by the metrics you can apply to just the single frame, it might not be considered that interesting. Chroma-luma balance is something that goes against common sense in coding, because if you consider YUV, you always say that luma is more important than chroma. Well, that is not always true, because once you start to quantize the two, you can reach a point at which the chroma differences caused by quantization are something you are going to perceive more than the luma differences caused by quantization. So you can try to strike a balance: within your bit budget, you spend a little bit more on chroma and get better perceptual results. Last but not least, AV1 has the concept of per-frame quantizer deltas: in every frame, for each block, you can move your quantizer a little up or down. You can optimize a lot with that and get better results without using many bits to signal that kind of change. Since we like to have pictures: RDO biasing. The tree is always the same, but this part is going to change, so you are not going to spend a lot on this chair, even if it's something that you can predict quite well within the picture, because it's going to be covered. The tree, on the other hand: you want to spend a little more on it in the past, so you are not going to spend a lot on it in the future. That is the concept, quite simple; the implementation is a bit gory.
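To make the idea concrete, here is a deliberately simplified sketch (the weighting and the names are invented for illustration; the real propagation model is more elaborate): the distortion term is weighted by how much future frames are expected to reuse the block, so areas that are about to be covered get fewer bits:

```rust
/// Plain rate-distortion cost: distortion plus lambda-weighted rate.
fn rd_cost(distortion: f64, rate_bits: f64, lambda: f64) -> f64 {
    distortion + lambda * rate_bits
}

/// `importance` estimates how much future frames reuse this block (for
/// example from lookahead motion analysis). Important blocks get their
/// distortion weighted up, so the search is biased toward spending bits
/// where they keep paying off; soon-to-be-covered areas are not.
fn biased_rd_cost(distortion: f64, rate_bits: f64, lambda: f64, importance: f64) -> f64 {
    let weight = 1.0 + importance; // made-up mapping from importance to weight
    weight * distortion + lambda * rate_bits
}

fn main() {
    let (d, r, lambda) = (100.0, 40.0, 1.5);
    // A block the future keeps referencing versus one that gets covered up.
    println!("plain  : {}", rd_cost(d, r, lambda));
    println!("reused : {}", biased_rd_cost(d, r, lambda, 0.8));
    println!("covered: {}", biased_rd_cost(d, r, lambda, 0.0));
}
```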
Block importance, again, same idea. If the future is better, we're going to spend more bits. If the future is grim, we are not. And this is how we visualize the whole thing. How much time do we have? Minus two minutes.
Oh, great. So trust me, everything's great. Keep it for next year, maybe? Yeah. So, what to expect? We started with 0.1 at VDD in Tokyo in December. We got 0.2 about a month ago, and then 0.2.1, in which we managed to land different kinds of improvements with some trade-offs, so we are overall about 1% better with a small slowdown. For 0.3, which will appear in the next week, possibly, we did more work on multi-threading, wrote more SIMD code, and worked on code paths so the compiler will auto-vectorize them for us. Fewer bounds checks, so the safety from Rust is not going to slow us down. And about a sixth fewer memory allocations, so it's more compact. We're working on the RDO biasing so it works better, but that causes a slowdown at the high speeds.
We are implementing more coding tools, so now we have fine directional prediction and intra block vectors. We are giving more features to the user: if you want to use switch frames and experiment with them, we have them. If you want to use rav1e to make still pictures (AV1 also has an image format beside the video format), that part is now working. If you want to get crazy and put the encoder in the browser, a little bit of work will appear soon so it will be quite easy to do that. Further in the future, the channel-based API should be complete by 0.4, so better thread usage and an easier usage model for you. We are going to do a lot of work on the rate control: since this is one of the weakest points of most encoders, we're going to try to make it fast and overall useful, so doing a two-pass encoding is not going to be a daunting task. And the API is going to be expanded.
So, to answer the initial question: rav1e is going to support chunked encoding, and the chunks can be encoded on different nodes. After the whole process, you will have a way to aggregate the whole thing, not just the packets you are producing but also the rate control information, so you can have multiple passes across multiple nodes. This should happen in 0.4. The other question from the network: how do you track subjective quality over time? So you can see the questions. We have Are We Compressed Yet, on which we spend a lot of CPU time doing multiple encodings with multiple settings over a large corpus that gives you good coverage, and we get lots of quasi-objective results. We don't have any kind of group of volunteers that keeps watching the same movie many times to tell us which looks better and which does not. If somebody wants to volunteer for that, they are welcome. I'm sorry, we'll have to stop there because it's already 2:30. I'm finished. If you have any more questions, you can ask Luca. That's it. Your email is on the website. Thank you.