
State-of-the-art image generation for the masses with Diffusers


Formal Metadata

Title: State-of-the-art image generation for the masses with Diffusers
Number of Parts: 131
License: CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
The talk "State-of-the-art image generation for the masses with Diffusers" will explore the diverse applications of the open-source Python library Diffusers in the image and video generation space. The talk will showcase how Diffusers, based on diffusion models, enables fast and high-quality image and video generation, making it accessible to a wide range of users. The presentation will cover various use cases, including image inpainting, image editing, and scene composition, demonstrating the capabilities of Diffusers in enabling users to create and edit photo-realistic images with minimum effort. The audience will gain insights into the potential of Diffusers in revolutionizing the way images and videos are generated and edited, making it a must-attend session for anyone interested in the latest advancements in this field.
Transcript: English (auto-generated)
We had a lot of PIO3 and then we had a lot of language models and how they play a role in shaping our careers, but I think it's now time to move on to more fun stuff, as I like to call it. Before I actually start my talk, I would like to have a quick show of hands.
How many of you are into using AI for different stuff like content generation, coding, and things like that? Have you ever used an AI to generate the header image for your blog, with Midjourney, DALL·E 3, and so on? So I think my talk is going to fit right in then, because I'm going to talk about image
generation, but for free. To use Midjourney and DALL·E 3, you have to pay and so on, but my talk will show you how you can do similar stuff for free. Of course, you need to have access to a decent GPU, but things like Google Colab do give you a free GPU, so that should be enough. But I'm also going to cover video generation a bit, so that's going to be exciting.
So a little bit about myself. I work on diffusion models at Hugging Face. As a matter of fact, I get to maintain an open source Python library that's into image generation, video generation, and so on. So it's called Diffusers. How many of you know about Hugging Face, by the way?
Oh, that's a really good number, thank you. I'm very, very deep into open source. I'm very proud to say that it was my open source contributions and my open source profile that got me into Hugging Face, so I'm very proud and very grateful for that. I'm big time into cricket. We recently won the T20 World Cup, so yeah.
And my personal site is up here. And a couple of things before I start. So it's my first time presenting at EuroPython and my first EuroPython as well. I'll share my slides after my talk. And if I'm going too fast or too slow, please do stop me.
I have a very thick skin, so I won't mind. And there will be enough time for us to discuss Q&A and things like that. And moreover, I'll be around, so feel free to grab me for any questions related to careers at Hugging Face or what it is like to be working at Hugging Face and so on. I have worked on language models for a fair bit, so I can tackle those questions too.
About the talk: I'll introduce image generation in a jiffy. I will keep it mainly in the context of diffusion models, because diffusion models are the powerhouse behind things like Midjourney, DALL·E 3, Stable Diffusion and so on. So it's safe to say that other things such as generative adversarial
networks that used to do image generation, they are kind of dead. So it's safe to say that. And of course, I'll be talking about the Diffusers library, which I get to dearly maintain at Hugging Face, and the kinds of potential it opens up. And of course, I'm going to talk about the Pythonic aspects of Diffusers, and Q&A. There will be enough time for us to do Q&A.
So, diffusion models for text to image. This is Stable Diffusion 3, the latest and greatest from Stability AI. It's an open model. I hope we all can agree that a transparent sculpture of a duck made out of glass, these creatures cannot really exist in reality. But you know, these models, these systems, they have got this tremendous
creative ability to come up with something that resembles our input prompt, like a transparent sculpture of a duck. And then we have got PixArt-Sigma: a cute little panda acting as a mad scientist. And then I've got an astronaut popping out of nowhere in a jungle.
Again, I think we all can agree that an astronaut in a jungle, not really. But yeah. And then we have got this very famous video from Sora, from OpenAI, that lets you generate videos, like long actual videos resembling some kind of input description.
I mean, leave aside the generation part: we as human beings will definitely have a hard time parsing the description on the right-hand side. It's a very long, nuanced kind of description that we are seeing here. Very minute details; as human beings, we would have a hard time, let alone an AI doing
it for us. But here we are. We are in 2024. It's absolutely possible. So yeah. If you haven't checked that one out, Sora, I definitely recommend checking that video out in particular. And then, diffusion models in a jiffy. I'm not going to go into the math because it's very
math-heavy. Even I get scared all the time whenever I'm reading a paper on diffusion models. So I'm going to try to keep it as simple as possible. So the basic idea behind diffusion models is to refine a noise vector so that it becomes a realistic image over a period of time.
And if I had to show you an infographic, this is how it would look. So we are starting with a complete piece of noise, and then we are slowly, slowly refining it to become a realistic image. And here's another way to take a look at it. So you see, it's very sequential in nature.
It's not one shot, unlike other image generation systems such as generative adversarial networks. We are slowly denoising a noise vector until it becomes a realistic image. In code, that loop looks roughly like the sketch below.
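To make that concrete, here is a minimal sketch of that denoising loop written with Diffusers building blocks, along the lines of the library's quicktour; the checkpoint name and the number of steps are illustrative assumptions rather than anything from the slides.

```python
import torch
from diffusers import DDPMScheduler, UNet2DModel

# An unconditional diffusion model: a UNet that predicts noise, plus a scheduler
# that knows how to remove a little of that predicted noise at each step.
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")  # illustrative checkpoint
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")
scheduler.set_timesteps(50)

# Start from pure noise and gradually denoise it into an image, step by step.
sample = torch.randn(
    1, model.config.in_channels, model.config.sample_size, model.config.sample_size
)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample                     # predict the noise at this step
    sample = scheduler.step(noise_pred, t, sample).prev_sample   # remove a bit of it
```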
And then, when you condition this whole denoising process with text, it gives you things like this. So again, I hope we all can agree that these cute little farm monsters cannot really exist in reality, but here we are. So yeah. And diffusion models give you a lot of flexibility. Let me try to establish how.
So, text to image. Imagine you are an artist. Getting some kind of image just from natural language supervision should feel very liberating, right? Because you may be stuck in some kind of writer's block, unable to get the ideas out from the vague description in your mind.
But these systems can be really helpful in getting you out of your writer's block. So text to image is very liberating, and we can do things like the transparent sculpture of a duck. But let's say your generated image is not so good. It's probably not following the input prompt in the way you would have expected it to.
So you want to condition the generation process with something more. Let's say a pose. Let's say you wanted the generated image to follow a particular pose. And here we are: we can make it dance however we want to. So that's it. And then let's say you wanted to edit a particular part of an image,
but with natural language supervision, which is known as the task of image inpainting. So that's also possible. Let's say you wanted to replace the castle that we are seeing on the left-hand side with something else. You basically let the model imagine, and that's it. Roughly, that looks like the sketch below.
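As a rough sketch of how that inpainting workflow looks with Diffusers, assuming you already have the input image and a mask marking the region to replace; the checkpoint, file names and prompt here are hypothetical, illustrative choices.

```python
import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

# An inpainting-capable checkpoint (illustrative choice).
pipe = AutoPipelineForInpainting.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = load_image("castle.png")       # the original image (hypothetical file)
mask = load_image("castle_mask.png")   # white where the castle should be replaced

# The model only "imagines" new content inside the masked region.
result = pipe(prompt="a majestic waterfall", image=image, mask_image=mask).images[0]
result.save("edited.png")
```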
So image inpainting is another task that we get to do with diffusion models. And things like this were not really possible with other image generation systems, such as GANs. So that's why I fixed my core tooling to diffusion models, because they are very, very flexible. Now, some real use cases before I set the stage and move on to the more Pythonic aspects.
So interior designs. Let's say you are visiting a friend and you really like their interiors. Now you wanted something similar, but not exactly the same. So you could perfectly, you know, capture an image of the interiors and ask a diffusion model to generate something similar, but not exactly the same. That's absolutely possible.
So you could imagine companies like IKEA doing it all the time. And they are using it at production, so that's good. And then fashion branding. Like lots of e-commerce businesses are using it already. Amazon is using it, that I know for a fact. So that's pretty cool, right? So you can do things like virtual try-ons and add-ons,
and things would just work. Like different poses, different sketches, and so on, with some vague description, and it will just work. And then my favorite piece: it's like extending creativity. So OpenAI actually did a formal study with lots of artists from all across the globe to see how systems like DALL·E are helping them out.
systems like Dali, they are helping them out. And it turned out that systems like Dali are incredibly useful at, you know, mitigating the writer's block that artists often, you know, run into. So that's really, really cool. And here are some, you know, shifting gears a bit.
Here are some examples of popular text-to-image generation systems, like DALL·E 3. Then we have Imagen from Google. Then we have the Stable Diffusion family from Stability AI. So that's good. But not all of these are open. So in order to be able to use DALL·E 3 and Midjourney, you have to pay OpenAI and Midjourney.
Imagen, hell, you can't even use it other than through the Google Cloud API. So you'll again have to pay to use it. But the entire Stable Diffusion family of models, that's open. And being open, what does it mean? What does it entail? So that's what brings me to my next slide. We want to be able to study the risk factors, like are there any
safety threats? Are there any equity concerns that we want to get away from? We want to be able to properly evaluate the safety measures before we end up deploying these models in our business applications. And finally, we want to be able to build on top of them, right? Because we are at a Python conference.
We are seeing so many open source Python libraries and so on. So we want something more in the open. We want something more out there in the wild so that other developers can potentially discover bugs and help us fix them and improve them in the long run, right? And Stable Diffusion from Stability AI, I think, is a great example.
So yeah, which I think is a perfect opportunity for me to introduce the Diffusers library. It's a Python library that's primarily maintained at Hugging Face, with contributors from all across the globe. Now, we have got two broad objectives with Diffusers. So how many of you are aware of the Transformers library?
That's good. It has more than 100,000 stars. It's only the second library to have done that in the context of machine learning; the only other one is TensorFlow. So it's a pretty big one, safe enough to say. So we have two broad goals. One is providing open and responsible access to state-of-the-art pre-trained diffusion models.
And second is we want to democratize the ecosystem of diffusion models by making them as easy to use as possible for developers. And I'll get to the second point in a bit. So this, our favorite astronaut, popping out of nowhere in a jungle. So this is all the code that you need in order to generate it.
And a free GPU that you will get from Kaggle or Google Colab. This is all the code. Or if you have a MacBook with an M1 or M2 chip, you could use your MacBook, with MPS, to generate this right now. This is all the code that you need in order to get from "astronaut in a jungle, cold color palette" to the image
that we are seeing on the right hand side. So it's just four lines of code. Since I care about things like indentation and clean code, I indented it, but you can imagine putting it all up there in just four lines of code. Two lines for importing libraries. One line to specify which model you are going to use.
We are using Stable Diffusion XL here. And then another line to initialize the entire system. And then another line to define the input prompt. And voila, then we are ready. All done, right? Roughly, those four lines look like the sketch below.
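As a hedged sketch of what those lines look like (this mirrors the standard Diffusers text-to-image usage; the checkpoint is Stable Diffusion XL and the prompt is the one from the talk):

```python
import torch
from diffusers import DiffusionPipeline

# One line to pick the model, one line to set the whole system up on the GPU.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "Astronaut in a jungle, cold color palette, detailed, 8k"
image = pipe(prompt).images[0]
image.save("astronaut.png")
```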
And then, if you are striving for photorealism: I'm sure this cactus cannot really, really exist, but then again, here we are. Again, just four lines of code and voila, you are done. And this is another model. This is not the Stable Diffusion XL model; it is built on top of things that are very similar to Stable Diffusion XL, but it's not exactly the same. So that's all possible. And videos: here we are making our dear Darth Vader surf a wave out of nowhere.
So videos, these are extremely challenging, right? Because videos are not just some random collection of frames. They have to be ordered, right? They have to maintain some kind of spatiotemporal coherence. They need to maintain the temporal order of the frames.
They need to maintain the spatial aspects of the frames. So it becomes like 10x more complicated than just doing image generation. But it's the same lines of code. We are not doing anything special to solve the dynamics problem. We are not doing anything special to solve the spatiotemporal coherence problem.
It's all the same lines of code. You are just changing the model, and done. So if you were thinking the APIs would change if you had to complicate your problem a bit, no. We have taken care of that for you. Now you do your part, we will do ours. So, yeah. This is all possible; a rough sketch of the same pattern with a video checkpoint is below.
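To illustrate the point that only the checkpoint changes, here is a rough sketch with a publicly available text-to-video checkpoint; the model name is an illustrative assumption, and the exact shape of the returned frames can vary a bit between Diffusers versions.

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

# Same API as text-to-image: only the checkpoint is different.
pipe = DiffusionPipeline.from_pretrained(
    "damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16
).to("cuda")

frames = pipe("Darth Vader surfing a wave").frames[0]  # frames of the first (and only) video
export_to_video(frames, "darth_vader.mp4")
```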
So, you know, just having a pose of Darth Vader and some language supervision and having it generate a static 2D image sounds a little boring. How about we actually make Darth Vader dance? So this is all possible. And the reason why I'm showing you this is because this system, Text2Video-Zero, lets you take any text-to-image model and convert it
so that it becomes text to video. So you do not need to perform any kind of additional training to do like text to video generation. You are basically taking an existing text to image model. You are modifying it in clever ways and done. We can make our Darth Vader dance.
So that's good. Actually, for real. So, yeah. There are many more tasks that are supported by our library. Image-to-image synthesis. Image editing with natural language constructs. So I am not very good at Photoshop and things like that. So I want to have a system that just takes, you know,
edit instructions in plain language, plain English, and just gets it done. So that's also possible: image editing with natural language instructions. And then we have got image-to-video translation. And Diffusers is not just a library for images and videos. Audio is perfectly supported as well.
So that's it. We have got many more tasks; feel free to check them out if some of them seem relevant to you. Now, coming to the more Pythonic aspects of things. I'd like to start with configuration management. So I'll have to clear out a couple of confusions here.
So usually when I say a machine learning model, it's just a single opaque model, right? But a diffusion model, it's not a single model. It's not. So we have got a couple of text encoders, we have got the actual diffusion model, then we have got some kind of decoder, and then we have got some kind of noise scheduler.
So you do not have to know the nitty-gritty of all these components, but just know that a diffusion model is not just a single model. It is composed of multiple different models, right? And let's get the chronology, let's get the order out here. So, a cat looking like a tiger: from this text prompt
to get to that image, the flow would look something like this. The prompt will pass through the text encoders; we may have two text encoders, we may have three. For example, Stable Diffusion 3 has three text encoders, not just one. And then you will do some computation
and the results will flow through the actual diffusion model and it's like sequential in nature like we saw previously. It will also be accompanied with something called scheduler. Again, we do not have to worry about the details for now. And then the computations will flow through a decoder and then we will have our image.
So it's like not just a single model but a series of different models. So not single model, that's the most important bit that we need to remember. Now, the text encoders may have different sizes. The diffusion model that we are using, it can have different architectures. The scheduler does not have any parameters so it's probably just fine.
And then the decoder, it can also have different sizes, right? So which means different sizes, different architectures, all of them can lead to widely different configurations, right? So managing the entire configuration can be very hard. So we keep model parameters like the actual weights of the model
that we usually distribute, very separate from their configurations. So just to give you an example: here I am loading some model from a pre-trained checkpoint and then immediately accessing a config object, so that you can verify and investigate all the model configuration in isolation.
And it's going to print something like this. Again, the nitty-gritties are not important. The point I wanted to establish here is it's perfectly possible to initialize an object from an existing configuration. Let's say you do not want to initialize the model weights from a pre-trained configuration but you only need the configuration, just the configuration and you want to initialize it randomly.
That's absolutely possible. So we have an API for that, from_config, and that's done. So the way you would think about it in your mind, that's exactly how we would do it in code. So that's one. And then it's also possible to reuse an existing configuration and override some custom entries, like the one highlighted in yellow.
So that's the custom configuration argument that I'm passing. So that's also possible. A minimal sketch of these configuration APIs is below.
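For example, here is a rough sketch of what these configuration APIs look like; the checkpoint name and the overridden argument are illustrative assumptions.

```python
from diffusers import UNet2DConditionModel

# Load pretrained weights; the configuration travels with the model.
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
config = unet.config   # inspect the configuration in isolation
print(config)

# Initialize a fresh, randomly initialized model from just the configuration...
random_unet = UNet2DConditionModel.from_config(config)

# ...optionally overriding individual entries, similar to the custom argument mentioned above.
smaller_unet = UNet2DConditionModel.from_config(config, sample_size=64)
```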
And we promote reusability like hell, and I'm going to show you how. Reusing existing components of, say, a text-to-image generation pipeline to do other things is absolutely possible. And you would want to do it to save memory. You wouldn't want to initialize the different components again, because they take a whole lot of memory. And as we saw, it's like three text encoders, a separate decoder, a separate diffusion model; we're talking about 20 GB of GPU memory. It's very, very expensive. GPU memory is a precious commodity,
and you would rather promote reusability as much as possible. So here I'm first initializing a text-to-image model, and then I'm reusing the components of the text-to-image pipeline to initialize another image-to-image pipeline, and that perfectly works. That works through from_pipe, and that's the API that we have.
And then, again, similar to the configurations, if you wanted to reuse some existing components and pass in some custom components, that's also possible. So you have all the flexibility that you need with reusability. We will take care of the rest for you. A rough sketch of that reuse pattern is below.
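As a hedged sketch of that reuse pattern (the checkpoint name is illustrative; from_pipe is available in recent Diffusers releases, and building a pipeline from pipe.components is an older equivalent):

```python
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

# Load the text-to-image pipeline once: this is the expensive part.
text2img = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Reuse its components (text encoders, UNet, decoder, scheduler) for image-to-image
# instead of loading a second full copy of the same weights.
img2img = AutoPipelineForImage2Image.from_pipe(text2img)
```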
And then, if you wanted to swap out different pipeline components: like we saw, it will have a text encoder, it will have a diffusion model, it will have a separate decoder. So if you wanted to initialize a pipeline with all the components from a pre-trained checkpoint but keep your component of choice
as your custom one, that's also possible. Like the one highlighted in yellow, that's possible. So we have a very clear separation of concerns. And DiffusionPipeline is our class, the entry point to Diffusers. It encapsulates all the logic for doing the entire diffusion process, and it involves several components
like we saw a couple of slides ago. And we can swap out any of these components, given that we can ensure compatibility between them. So I feel that to be a very liberating point for me as a machine learning practitioner. And then, I mean, the point I wanted to establish
with clear separation of concerns is that all the components that we are seeing, the UNet, the text encoder, the scheduler, the decoder, all of them are swappable. And the rest of the components will load from the existing checkpoint as is. So that's pretty convenient; a small sketch of swapping one component is below.
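For instance, a hedged sketch of swapping just the scheduler while everything else loads from the checkpoint as is (the checkpoint and the scheduler choice are illustrative):

```python
import torch
from diffusers import DiffusionPipeline, EulerDiscreteScheduler

repo = "stabilityai/stable-diffusion-xl-base-1.0"

# Build one custom component from the checkpoint's own scheduler config...
scheduler = EulerDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")

# ...and pass it in; every other component loads from the checkpoint unchanged.
pipe = DiffusionPipeline.from_pretrained(
    repo, scheduler=scheduler, torch_dtype=torch.float16
)
```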
And then, we strive to be as explicit as possible rather than implicit. So by default, all our pipelines and models are loaded on CPU until and unless you do the explicit device placement onto a GPU or any other accelerator you may have, such as your MacBook. And all of the models are kept in float32 numerical precision by default. And as I mentioned, users need to do the device placement
or any kind of typecasting explicitly. And this is what I mean by that. So if you have used PyTorch, the development experience should feel very familiar: we are doing the downcasting explicitly and we are also doing the device placement explicitly, roughly like the sketch below.
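A minimal sketch of that explicit downcasting and device placement, mirroring the usual PyTorch idioms (the checkpoint name is illustrative):

```python
import torch
from diffusers import DiffusionPipeline

# Nothing happens implicitly: pipelines load on CPU, in float32, unless you say otherwise.
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,   # explicit downcasting
)
pipe = pipe.to("cuda")           # explicit device placement
```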
And we strive to be simple over easy. Diffusion models can be very computationally expensive to run, like five different components, oh my God. But we do not perform any kind of optimization until requested, and performing them is very simple, thanks to the API design. And this is exactly what I mean. So this checkpoint, this is Stable Diffusion 3 Medium.
If we were to pop it onto a CUDA device, it's gonna take at least 19 GB of GPU VRAM, even on a modern GPU like an A100 or H100. 19 GB of GPU VRAM, that's a lot. That's a lot. That's not a consumer GPU anyway, right? But if we call this little method,
enable_model_cpu_offload, it gets down to 12 GB. So, simple over easy. We have all the things that you need in order to succeed, but you need to know about them. And they are as simple to use as possible; the sketch below shows the gist.
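A minimal sketch of that optimization, assuming the Stable Diffusion 3 Medium checkpoint mentioned in the talk (exact memory numbers will depend on your setup, and the checkpoint requires accepting its license on the Hub):

```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
)

# Instead of pipe.to("cuda"), keep components on the CPU and move each one to the GPU
# only while it is actually being used; this trades a little speed for a lot of VRAM.
pipe.enable_model_cpu_offload()

image = pipe("a photo of a cactus in a terrarium").images[0]
```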
All right. And minimal abstractions. So we do not tend to pile on abstractions. We do not have as many abstractions as you may imagine; we keep our abstractions to a bare minimum. All our model classes extend from framework primitives. And this is what I mean by it. So if you have worked with PyTorch before,
you must know something called nn.Module. And all our modules inherit from nn.Module. And we also support JAX as a backend. So we are a multi-backend thing. We do not just support PyTorch, we also support JAX and Flax. So that's good. And for Flax models, we use a similar native framework primitive.
And this way, we keep the implicit contracts between us and the core frameworks to none. So that's pretty good. And Diffusers is not just an inference tool. We try to ensure accessibility, like memory friendliness, optimization, and things like that. If you wanted to train your own diffusion model,
that's also possible. We've got a plethora of training examples for you to do that. And then we have got building blocks for you to do your research, to do your experiments. That's all possible. And we provide customization at a pipeline level and also at a component level. Because we hear from developers from time to time and we like to act fast, listening to our community.
So we are very much a community-driven library. I welcome you all to check it out. And if you have any issues, please let us know. So, our philosophy document: it may feel a little weird to have all these philosophies written down, but we have an entire philosophy document, which might be helpful for you to think through some things, like if you spotted anything
that's not sitting right with you, if you're feeling like it's a little anti-Pythonic in nature. There may have been a very real reason to have it that way. So we have our little philosophy document, and that's about it. I think I have enough time for Q&A. So if you want to get access to the slides, feel free to scan it.
And as a matter of fact, this one is also generated with Stable Diffusion 3. So I'll open up the floor for questions and answers. So please take it away.
Thank you very much for your awesome presentation. I have a question. Is it possible to enhance the text to image process by my personal photo library, for example? Yeah, it's possible. So we have something called subject driven generation in our documentation.
So if you head to our documentation, we have an entire section on doing subject driven generation. So let's say you wanted to have your dog and if you wanted to have it rendered, I don't know, on top of that mountain, it's absolutely possible. So you just need to know the right techniques to do that. And it's all there in the documentation. So if you are unable to sort of follow it through,
feel free to shoot me an email or just open an issue. We'll be more than happy to help you out. Okay, thank you very much. Good. Also, thanks for an awesome presentation. Thank you. And my question is related even to the slide that you are showing here. You have this text, thank you. And probably everyone who played with image generation
noticed that that's something that they struggle with. If you have some sort of text that you want to include in the image, you have an awesome picture generated and then the text is misspelled. Is there a way to improve the accuracy of the text itself, or are there any hybrid models that actually create the text as a separate layer?
Yeah, that's a good question. So I think you are referring to the problem of diffusion models misspelling a lot of stuff. Like, you wanted to have a placard saying some complex piece of text and it's unable to do so. So I must say, with Stable Diffusion 3 the problem has been reduced a lot. And then we have another upcoming model called AuraFlow,
I spilled the beans a little there. So AuraFlow is definitely gonna be better at spelling out textual content in the generated image. But if you wanted utmost flexibility, then I welcome you to check out a framework called AnyText. It's built on top of Diffusers and it enhances the text-rendering capabilities by like 10x.
It's very good. It's called AnyText. Yeah, thank you. Thank you. Hello. Hi. Thanks for the talk. I was wondering, is it possible to use this technology to enhance images that were shot at high ISO?
Images that? So you imagine if you're a photographer and you are taking images on high ISO, you will have lots of noise in the images. So is it possible to actually reduce the amount of noise or remove it? Yeah, so we have something called upscalers. So if you have like an image that you feel like
is not up to the mark where you would have wanted it to be, so you could just pipe it through the upscaler pipeline and sort of refine it over a period of time. So you make a single pass, then you have some output. Probably you are not that satisfied. Then you can feed it back to the upscaler again and then just see what happens.
And you also have little knobs and tricks, like how much denoising to apply, at what point in time to apply the denoising, and things like that. Yeah, so long story cut short, it's possible. Search for upscalers. Okay. Yeah, upscalers in Diffusers. Is it also able to work with high resolution images? Yeah, so the images that I showed you,
it's like 2048 by 2048. So by default, we are talking about high definition images. Thanks. Thank you. Hi, thanks for the great talk. I was just thinking about video generation. Can you maybe just elaborate some more on that?
Is it frame to frame? Is the video generation like constrained on the previous frames that were generated? I recently saw the video of some, like generated video of some gymnasts making strange moves. As I understand, this is because no motion information is included in the video generation process.
But how can we constrain it to better generate videos? Text to video, it's not a solved problem. Not nearly as solved as text-to-image generation. So when it comes to videos, there are a lot of things that are at play. So the previous frame, it's just one part of the play.
Second, you'll have to care about the motion dynamics and so on. You have to care about the motion creativity and so on. And then you'll have to care about the spatiotemporal coherence. Like, okay, you maintained the previous-frame consistency, you maintained the motion consistency, but how do the frames look individually? And if they're looking as expected on an individual level,
are they temporally well connected? So there are multiple things that are at play here. And there are also lots of trade-offs. When it comes to short videos, like 10-second videos and so on, it's probably okay with things like text to video,
you can get away with that because you do not have to worry about the motion dynamics a whole lot if you have like a very short video. But the moment you, you know, raise your bar to very, very long context videos, we have got a lot of roads to cover. So I would say better constraining, better frame consistency and things like that.
These are still very active areas of research. I'll give you one framework that's being used these days. So what you do is: first, you predict some kind of a motion path. Like, for 10 seconds, you have some kind of an arbitrary motion path, and then you have your first frame, and then you sort of interpolate it
with respect to the motion path that you had initially predicted with the language supervision that you have in your mind. So that's like the stepping stone that people follow these days, but there's a lot to be done, so yeah. Thank you. Thank you.
Hello. Hi. Thanks for the talk. My question is about security. I would like to know if it's possible for a model to end up executing code on the CPU. And if you download a model from Hugging Face, is that possible? You wanted a model that can execute code on CPUs?
I don't want that. I would like to know if it's possible that you download some malware included in your- So we have got something called agents, code agents, within the Transformers library. A language model itself cannot run code, but you can augment it with some kind of agentic approach so that it can understand and call up an agent appropriately and make it do the stuff that you are looking to do. So you would probably want to look for something called Transformers Agents, which can run code in the way you are expecting it to.
On Hugging Face, are there teams that are assigned to the security side of things? Do you have scans of uploaded models to ensure that they do not contain malware? Yes, yes. The good news is yes. So we have got a hub scanning tool
that runs on all the public models that we have on the Hugging Face Hub platform. And if it finds any vulnerability, it's gonna let you know. So I think two years back, there was a vulnerability discovered with pickle, which used to be like the de facto tool in the machine learning community for serializing and distributing models.
And as soon as that was found out, we invented our own file serialization thing called safetensors. And we also ended up building our own model scanning tool. So yes, long story cut short, yes. Okay, thank you. Thank you.
All right, looks like, oh, sure, go ahead, please. So yeah, great talk. I was wondering if you had any thoughts on why diffusion models took off in the imagery and vision domains, but not in text
and why text is currently LLM, autoregressive based? Yes, yes, I was reading all the mathematical foundations of it just last night, so yes, I'm in the right place. So text, by nature, is very discrete. The thing is, the core formulation of diffusion models
is based on score matching. And if you want to compute the gradients of the score that we end up backpropagating, they do not have very good characteristics in the discrete space. The characteristics only work and tend to behave when you have a continuous space, such as videos, such as audio and such as images.
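For context, and not something that was on the slides: the "score" here is the gradient of the log-density of the data,

$$\mathrm{score}(x) = \nabla_x \log p(x),$$

which is only well-defined when $x$ lives in a continuous space; for discrete tokens there is no gradient to take, which is roughly why plain score-based diffusion carries over less directly to text.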
So long story cut short, if you are operating on the discrete modality, the score gradients are not going to be as useful as they would have been for continuous modality, so. Okay, that's cool, yeah. I think I'll be around, so feel free to grab me for more questions if you have any,
but otherwise have a great conference. Thank you.