
Efficient Machine Learning in Hardware


Formal Metadata

Title
Efficient Machine Learning in Hardware
Number of Parts
18
License
CC Attribution 3.0 Germany:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Transcript: English (auto-generated)
So, let's talk to the ML community about hardware; that's sometimes a different perspective. I thought about how to arrange the presentation, and after my conversations yesterday I decided to make the first half a little more tutorial style and then go a bit more into detail. It's probably helpful for you to have a better understanding of what is actually happening in the hardware and what we have to take care of.
You are working on the left-hand side, so to speak: you would like to optimize given ML models or even to find ML models. We are looking at how to bring that down to hardware. Shown here is the next generation from NVIDIA, the Blackwell architecture, the GB200. That's a multi-chip solution and very high performance, but it's probably not well suited for an embedded system: you will not see that kind of GPU in a smartphone, since you would need a cooler, and even in a car it's probably not a good fit. I have also shown here the kind of device we are interested in. We are focusing on this video capsule, and I have it with me here: you can swallow that capsule, your GI tract gets analyzed, and you would like to perform cancer detection. No one wants to swallow a GPU from NVIDIA, so you really need ultra-low-power inference solutions.
That's one part of the story I would like to talk about today: how to get there, and what is actually needed to get a chip ready that is capable of dealing with a number of ML models implemented on it. We need a lot of interaction, not only to feed the application-domain data back to the model, but also to bring that back to the architecture. Here is some kind of abstract architecture: we see some CPUs, a quad-core CPU for example, which could be an ARM CPU or a RISC-V CPU; we have some NPUs, neural processing units, that is, AI accelerators; and here we probably have a smaller NPU. We probably have an always-on part of the chip that senses the data and watches for activity at the input, and if some activity is found, that NPU gets activated to do some kind of classification task, for example, and only if needed do you switch on the CPUs, perhaps to communicate with the outside world. So you have multiple wake-up levels to be integrated, and the device has to be ready to deal with a fairly wide class of neural networks. We are not interested in a one-to-one relation where the design is optimized towards a single neural network; no, we have to optimize towards a family of neural networks. Before getting there, I would like to first discuss some complexity issues.
You probably know this slide already; it shows the complexity growth. In blue, what I learned yesterday to call deep learning version 1.0, the classical CNN-based architectures, and in red, version 2.0, the foundation models and large language models. For years a lot of people talked about Moore's law. Moore's law tells you that chip complexity doubles every two years; actually, in 1975 Gordon Moore corrected it a little to 18 months, but later on it was corrected again to two years, so let's stay with that one. A lot of innovation is actually based on the idea that we have exponential growth in chip complexity; otherwise it would not be possible to have smartphones, for example. Now let's take a look at the blue ones, the DNN area: there we have a doubling every three to four months, and if we go further to the foundation models and extrapolate, then we have a doubling in complexity every one to two days. Of course, in Tübingen we have a lot of ML colleagues, and they always told me: Oliver, we just buy a new cluster, and a new cluster again, and then it will somehow work. But it's probably not a good idea to always follow that path; it would be better to look forward to smart deep learning, or to move away from, let's say, brute-force approaches.
You probably already know this complexity discussion: GPT-4 is a model with about 1.76 trillion parameters; actually it is a composition of eight models of 220 billion parameters each. It was reportedly trained on 25,000 NVIDIA A100 GPUs over 90 to 100 days, and the energy cost of the training was about 53,000 megawatt hours, so it's quite high. On the right-hand side you have a table with some other models, taken from the web page cited here, and you also see numbers for what is expected for GPT-5. It is known that for GPT-5 at least H100, but probably even better H200 or GB200 cards will be used, and rumors talk about 100,000 GPUs for that task. That is probably not a really sustainable approach. Given these training cost estimates, it is not possible for us anymore to train that kind of model: the hardware costs, but also the energy costs, are tremendous, and that's really a problem. With respect to sustainability, if you just pick GPT-3 in the middle here, the energy consumption to train GPT-3 was 1,287 megawatt hours, and the carbon dioxide footprint is also shown; it is really significant. But so far I'm only talking about training; in the end, inference is also a problem. If you make some estimates, and these are older estimates, with 270 million queries per day and four to five Q&A turnarounds per session, then the energy cost of one month of inference would be 18,720 megawatt hours, so it's much more than just the training. We have to discuss that, and if you go from GPT-3 to the more powerful GPT-4, it does not seem to get better; it's a problem. And what about the chips? The chips probably also need to get larger and larger, and the largest chips in existence are actually not built by NVIDIA but by a company called Cerebras.
It's a single chip you see here on the right-hand side, and it's the largest AI chip ever built, according to the advertisements of Cerebras, but it's actually true: we are talking about four trillion transistors, and it's built in the same TSMC process node that NVIDIA uses. Inside the chip they have memory and compute together, in-memory compute, and that provides very high performance. But if you think everything is getting better, look at what is happening: that single chip has a power consumption of 15,000 watts. And you need a large number of chips: the chip sits on a card, the card needs to be cooled somehow, and the card has a power consumption of 23,000 watts, even more. They claim they can really deal with large models this way, and they can build cluster-scale compute with up to 2,048 of those modules; you can calculate the overall power consumption yourself. They can train GPT-3 in a day, which is cool, but very costly, and that's a problem. And the utilization is very low: usually you would like at least 80% of the manufactured chips to be ready for operation, and here they are talking about 10 to 15 percent, so it's very expensive to go in that direction. So those were some remarks on complexity; some of this has probably already been mentioned this morning. Next, machine learning for automated driving; I have some slides in that direction, just for motivation.
Of course we are talking about real hardware. Here is a car; there are a lot of sensors equipped on such a vehicle, but that's not our part today. We would like to take a look inside: inside we have an architecture of probably multiple central compute units, which are built of several of the chips we just discussed. I don't want to teach automated driving, but I would like to highlight that automated driving is more than just perception, and that's very often misunderstood. We basically have four basic blocks in a chain: one is perception, of course; the next is prediction; then planning; and the last one is control.
Perception means real environment perception, segmentation, to see, for example, where the street is that you can drive on, where obstacles are, where other vehicles are, and so on. That is probably solved quite well by now, but of course we have to take care of adverse environment conditions, like difficult weather; today it may look like California, but in winter it can look quite different in Germany. A very difficult task is prediction: to predict what other dynamic objects will do. It's comparatively easy for other vehicles, but not that easy for pedestrians, for example; having a behavioral prediction model for pedestrians is still very difficult. Based on that, we can start to do the planning, and the planner decides on a trajectory, and then we can start to control the vehicle, to do the vehicle dynamics. That means you have a lot of functions to be implemented, and not a one-to-one relationship, and that's important here.
That's shown on the next slide: we have a large number of machine learning models running, plus classical software running on the system, and everything needs to be mapped to and executed on that kind of hardware architecture. This means we need a runtime environment for machine learning models. We have to consider when to read the sensor signals, when the signals can be delivered, in which period the classification task for perception, for example, has to run; all of that is scheduled in fixed periods. And if we run different networks on the same chip, we don't want to run them in a batch; we have to schedule them. Then we have to consider where the data is in memory, how it is mapped to the memory, and whether the data is already available when I would like to access it. There are a lot of important questions that need to be solved, and they need to be considered with respect to energy demand: we would like to minimize energy consumption under latency constraints. Now let's take just a single task, say an object detection task.
Such a task will be executed multiple times with different algorithms to do some kind of plausibility checks; that's very important. But let's take just the single task, and what a lot of people do is consider ImageNet as a starting point, so we are talking about a resolution of 224 by 224, something like that. In the vehicle you would probably like to go for full HD or maybe 4K resolution, and that significantly impacts the complexity, even when considering just a single function. It's shown here with real networks, MobileNet for example, ResNet, and VGG, and you see some complexity considerations for an ImageNet frame in this area. But that is the consideration of a single image; that's the status if you write your paper and want to report results on ImageNet data. In the vehicle I don't focus on single images, I have a video stream, I have to deal with 30 frames per second, and then of course the complexity increases, and I probably also need some kind of object tracking along the temporal axis. If I have full HD, the complexity is even higher, and I have multiple cameras. So the conversation in the papers is the one on the left-hand side, and the real stuff is on the right-hand side, and we have to consider that for multiple tasks, not for a single task, and the multiple tasks may also interfere with each other.
In the end we have solutions like this one: in a lot of vehicles the trunk is full of compute. Even the vehicles in San Francisco, if you would like to take a ride with the automated taxis, are equipped with a lot of compute power. A conservative number was 4,000 watts; I heard that with the cooling it's even more, 5,000 to 6,000 watts. And if you have an electric vehicle, probably a good one that consumes only 40 kilowatts, then a considerable share of the energy is spent just on the autonomous driving compute. So you have to think about it: you would like to take an autonomous ride and get to the final destination without worrying about the driving range, and so it's really necessary to end up with a very energy-efficient solution in the vehicle. Then there is a chart here from 2020 about some ML accelerators.
On the x-axis is peak power and on the y-axis peak performance. The best architecture would be high performance with low power, but there are not that many architectures up there, so we have to think about how to really deal with that problem. There are also a lot of academic architectures; MIT's Eyeriss architecture is a very interesting one, and those are very power efficient.
If we take a look at how to implement the system in a vehicle, then of course we have to think about what functionality will be deployed on central compute, shown on the left-hand side, and what will be implemented on sensor-specific compute platforms. The idea is to put more intelligence into the sensors, but you don't want large compute power in the sensor, because of cooling the sensor and so on; you need sensor-specific solutions. In that green area you can do a lot: you can adapt your hardware together with your model, you can really apply co-design, AI-system hardware/software co-design, to co-optimize the neural networks and engineered models together with the underlying hardware architecture. That's quite interesting, and it's actually not possible on the left-hand side: COTS stands for commercial off-the-shelf platforms, which are more or less black boxes; you have to take the resources that are available and then use the vendor-specific flows, like those from NVIDIA.
We have, for example, a cooperation project with Mercedes-Benz, and they are running more and more into the trouble that they rely completely on NVIDIA for their level-3 support; but if they would like to sell those cars in China, they have to switch away from NVIDIA and find alternative solutions. Then of course you have to think about retargeting: how to retarget your model from one platform to the other, and that's not easy. From my perspective that is actually the case here, because bringing a deep model down to a platform, doing the deployment, takes a lot of effort; you have to do a lot of manual tasks. So retargeting is still an open question, and a lot of people are working in that area as well. Probably you can also build a single-package solution consisting of multiple chiplets, a solution composed of commercial off-the-shelf chiplets: you buy an NVIDIA chiplet and can put some sensor-specific chiplets around it, for example, and also some safety-critical cores, because in every vehicle you have cores doing the safety-critical stuff.
Here is a slide from Mercedes about the tera-operations per second expected for the different levels of autonomous driving, and we see that for level 2+ up to level 4 we are actually at tera-scale compute. They are also thinking about a chiplet approach, and in a chiplet approach you can actually compose, for example, a GPU from NVIDIA with a CPU from AMD and an NPU from SiMa.ai or someone else; but of course you have to convince all the players that the interfaces are compatible. That's an idea currently under discussion, and then you can go into different regions with different alternative setups: if some kind of blacklisting happens, you kick out this GPU and substitute it with another one, for example. So that's the automotive domain.
The other domain is more or less edge computing. Here I would like to go to a well-defined system, for example a sensor-specific solution. It is very often the case that companies would like to make their sensors more intelligent, more smart, and then it's good not to rely on central compute but to bring that kind of compute power into the sensor device itself. Then you know quite well which task needs to be executed, and you can really do co-optimization and co-design of the sensor and compute hardware together with the models that need to be executed. That was the graphic I showed before; think about the capsule: if you have to swallow the capsule, then we have to look at devices like this one, and we need a very different regime; we are talking about video inference in less than 10 microwatts. That needs to be developed somehow, and then you are very sensor-specific in doing the job. That's what we are actually doing, and I have some slides at the end that show a little bit of what we can do.
To do that, we need to trade off between flexibility and area and power efficiency. On one end there are very flexible devices like an NVIDIA GPU; that's the reason it is used for training: you have a very well-explored flow to get from PyTorch or TensorFlow or something else down to the device, it's all very well settled. On the other end, you can take a single neural network and synthesize hardware dedicated to that network; then of course you have the highest area and power efficiency, but at the cost of a very inflexible design: you can probably change some weights, but you cannot change the architecture in the end. So you need something in between; the Tesla chip sits a little bit in between, but you can also think about how to combine, for example, a systolic array like this one with some kind of SIMD architecture, I will discuss that later on, to come up with a best-fitting device. Then I have flexibility, and we can deploy a family of neural networks.
And for doing that, here in a nutshell on a single slide is what we are doing at my chair: a lot of co-optimization techniques, describing architectural templates, and hardware-aware neural architecture search; we will have the tutorial on that later on. That means combining the neural architecture search space with the search space provided by the hardware, because your hardware model will also constrain your neural architecture search space: probably not all operators are supported, or operators are supported with, let's say, different performance, and that needs to be considered. For doing that we also need some kind of performance model of the hardware architecture, either of the hardware architecture you are developing, or of an existing commercial off-the-shelf platform, where it is very difficult to do a performance estimate without running your model on the hardware. At the end you of course have to run it, but inside a neural architecture search loop it's good to have a very efficient performance model that does the job. And if I have found a good solution, then I need to deploy it; a hardware architecture without deployment capabilities is useless, you need a tool flow to deploy your neural network down to the hardware.
So those were two examples, two application domains: one was the automotive domain, and the other the sensor edge domain. This morning I wondered whether I should dive directly into some papers we have worked on, but I would rather bring something in between, so I have selected some slides from my lecture about operators and hardware primitives and how we can bring them together; then you probably have a better understanding. The idea is: to consider the neural networks I would like to implement, I have to consider the operators they need, the topology of the neural network, and also the hardware that is to be used.
Of course we need to support convolutions, and here I'll go very quickly: this is the loop nest of a 2D convolution, and in the middle of the loop nest you find the actual operation, a MAC operation. You see that the operation has four memory accesses, and the whole operation is a multiplication plus an accumulation of the result. That means you need hardware like this: a multiplier and an adder, with the output probably fed back to the input to do the accumulation. If that is executed in integer arithmetic it is very easy; if it is floating point, you have to do a little bit more.
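To make the structure concrete, here is a minimal Python sketch of such a loop nest (my own illustration, not the slide's code; the shapes, stride 1, and absence of padding are assumptions):

```python
import numpy as np

def conv2d(inp, weights):
    """Direct 2D convolution as a loop nest; the innermost line is the MAC.

    inp:     (C_in, H, W) input feature map
    weights: (C_out, C_in, K, K) filter kernels
    """
    c_in, h, w = inp.shape
    c_out, _, k, _ = weights.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1), dtype=inp.dtype)
    for co in range(c_out):                   # output channels
        for y in range(h - k + 1):            # output rows
            for x in range(w - k + 1):        # output columns
                for ci in range(c_in):        # input channels
                    for dy in range(k):       # kernel rows
                        for dx in range(k):   # kernel columns
                            # The MAC: two operand reads plus a
                            # read-modify-write on the accumulator.
                            out[co, y, x] += inp[ci, y + dy, x + dx] * weights[co, ci, dy, dx]
    return out
```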
For small hardware it is also worth thinking about reducing the complexity of convolutions a little. Depthwise separable convolutions are very popular here; they were introduced by the MobileNet network, which is very successful in a lot of embedded domains. What does it mean? If we start with regular convolutions, all channels are considered together: the convolution is applied across all channels and the results are accumulated over all channels. The idea is to do the convolution separately for each channel and then to combine the results by applying a one-by-one convolution across the different output channels. Hopefully I end up with the same result; actually not really, but the results in the papers are very good. And if you compare the complexity numbers, how many operations need to be executed, here we are at about 1,000 and there at about 35,000, so just thinking a little about the operators can be very helpful, and if the final accuracy is at the same level, then of course it is fine.
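As a back-of-the-envelope check, the multiply counts of the two variants can be computed with the standard formulas; the concrete shape below is an arbitrary assumption for illustration, not the slide's example:

```python
def conv_ops(h, w, c_in, c_out, k):
    """Multiplies for a regular convolution over an h x w output."""
    return h * w * c_in * c_out * k * k

def dws_conv_ops(h, w, c_in, c_out, k):
    """Depthwise (k x k per channel) plus pointwise (1 x 1) multiplies."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# Illustrative shape (an assumption):
h, w, c_in, c_out, k = 8, 8, 16, 32, 3
print(conv_ops(h, w, c_in, c_out, k))      # 294912
print(dws_conv_ops(h, w, c_in, c_out, k))  # 41984, roughly 7x fewer
```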
That is all part of a search space description: to substitute convolutions by depthwise separable convolutions, to deal with bottlenecks like the ones we have just seen, with inverted bottlenecks, or with concat operators where we can concatenate the result multiple times to enlarge the number of output channels. Fully connected layers I don't have to spend much time on; it's basically a matrix multiplication we have to apply. Just a word on the batch norm layer: it could be difficult in hardware, but not at inference time. For training it's complicated, I have to calculate all that stuff, means and standard deviations and then the square root; but at inference a lot of those numbers are constant, so I just need to keep the constants in memory and then it can be applied quite well. Of course we have to take care of the batches: if I have to consider mini-batches at inference, or if I would like to consider the temporal axis, for audio for example, then I have to think about some kind of moving window to calculate the statistics.
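Because those batch-norm statistics are constants at inference, they can be folded into the preceding convolution's weights and bias once, offline. A minimal sketch of that standard folding (variable names are mine):

```python
import numpy as np

def fold_batchnorm(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer into the preceding conv/linear layer.

    y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
      = (gamma/sqrt(var+eps)) * w * x + (gamma/sqrt(var+eps)) * (b - mean) + beta
    w: (C_out, ...) weights; BN parameters are per output channel.
    """
    scale = gamma / np.sqrt(var + eps)                # one constant per channel
    w_folded = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_folded = (b - mean) * scale + beta
    return w_folded, b_folded
```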
Last but not least, of course much more exists, but I have limited time here; let's take a look at transformers. There we have the self-attention mechanism, actually multi-head attention, and a lot of the operators are well known: we have the linear transformations for the query, the key, and the value, that's quite good. Then we have to build the scores, and then we need to scale; scaling means we have to compute a square root, but if you choose well it's fine. In the paper they selected d_k = 64, and the square root of 64 is 8, very good, it's an integer: division by 8 is just a shift by 3. That maps quite well to hardware. But the killer here is the softmax; we as hardware people don't like the softmax, and one can think about operators different from softmax to find more efficient hardware solutions. Dealing with a really large context is also difficult, because you have to think about the memory consumption to hold all the data for the matrix multiplications.
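A small sketch of the scaled dot-product at the core of this; for d_k = 64 the 1/sqrt(d_k) scaling is a division by 8, i.e. a right shift by 3 on integers (the float version below simply divides):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention; shapes (T, d_k) with d_k = 64 assumed."""
    d_k = q.shape[-1]                  # 64
    scores = q @ k.T                   # (T, T) score matrix
    scores = scores / np.sqrt(d_k)     # sqrt(64) = 8 -> '>> 3' on integers
    return softmax(scores) @ v         # softmax: the hardware-unfriendly part
```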
So that's the operator point of view; many more exist, but that's it in a nutshell. Now to the hardware design principles; here I would like to slow down. One idea is to just take the neural network, consider the operators, here for example the convolution with its MAC components inside, and simply connect the MAC components as laid out in your neural network model, probably with some kind of interconnect on the edges in between. That could be a solution, but then you need really huge hardware: nothing is reused, you have a one-to-one implementation of your neural network in hardware, it's not flexible, and if you have a billion MAC operations, then you need a billion MAC units for the implementation. So that's not the right approach.
We need some kind of parallelism. One option comes from what we just mentioned: to apply some structural parallelism. If I build a scaled dot product, it's more or less a linear chain; I can do better, due to the commutativity of the operations, and it ends up as a number of trees to be executed: at the top level the multiplications, and then an adder tree at the end, and I can have multiple such trees, all executed in parallel, for example. Or I can remove that parallelism and just do the accumulation in temporal order; then of course the first one has higher parallelism than this one. Now I somehow have to orchestrate this, I have to have an idea of which parallelism to apply.
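As a tiny illustration of the two extremes, here is a sequential accumulation next to a balanced reduction tree; same result, but depth n versus depth about log2(n) (a toy example of my own):

```python
def dot_sequential(a, b):
    """Linear MAC chain: depth n, one accumulator (temporal reuse)."""
    acc = 0
    for x, w in zip(a, b):
        acc += x * w
    return acc

def dot_tree(a, b):
    """Multiply in parallel, then reduce pairwise: depth ~log2(n)."""
    products = [x * w for x, w in zip(a, b)]        # all multipliers at once
    while len(products) > 1:
        if len(products) % 2:                       # pad odd element
            products.append(0)
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]

assert dot_sequential([1, 2, 3, 4], [5, 6, 7, 8]) == dot_tree([1, 2, 3, 4], [5, 6, 7, 8])
```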
Now let's take a look at energy consumption; that's one very important aspect if you are thinking about building a hardware architecture. Here we have normalized numbers: let's start from the convention that executing an operation on the ALU takes one energy unit. That's the starting point, because I need to do the computation anyway; but where the data comes from, that I can influence, whether I can apply local data access or have to fetch the data from outside memory. And that's shown here in relation to the energy cost: it's very cheap to access data that is directly available in the register of our MAC unit, in the processing element that does the work; that's more or less the same cost as the operation, that's fine. If I have multiple processing elements, where a processing element usually contains the MAC unit, and I build some kind of array of processing elements to perform convolutions, for example, in a very high-performance way, then sending data from this one to that one takes two times the energy of the calculation at a processing element; accessing a buffer is approximately six times that energy; and accessing DRAM is 200 times more expensive in terms of energy. That's the reason why a lot of people think about how to orchestrate the data through the memory hierarchy, to move the data so that it is available at the right point in time, and to reuse the data: if I know I have to reuse the data, then it's not a good idea to always access DRAM; it's better to keep the data in a register. So that's one aspect, just considering where to take the data from.
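Those normalized factors can be turned into a toy cost model. The multipliers below are the ones just quoted (register about 1x, neighboring PE 2x, buffer 6x, DRAM 200x); everything else is an illustrative assumption:

```python
# Normalized energy per access, relative to one ALU/MAC operation.
ACCESS_COST = {"register": 1.0, "neighbor_pe": 2.0, "buffer": 6.0, "dram": 200.0}

def mac_energy(reads):
    """Energy of one MAC: the operation itself plus its operand fetches.

    reads: memory levels the operands come from,
           e.g. ["register", "register", "buffer"].
    """
    return 1.0 + sum(ACCESS_COST[level] for level in reads)

# Same MAC, very different cost depending on where the data lives:
print(mac_energy(["register", "register", "register"]))  # 4.0
print(mac_energy(["dram", "dram", "dram"]))              # 601.0
```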
The next aspect is quantization. Quantization is also very related to energy cost, and that's shown here on the right-hand side: the energy cost of operations with respect to different data types. At the top we have integer add operations in 8, 16, and 32 bits, and you see the energy cost in picojoules. The next one is floating-point add; floating point is much more expensive than integer. Then multiplication: multiplication costs more than addition, of course, and floating point even more. But again, getting data from memory is always expensive, so we should probably also reduce the data word lengths that need to be kept in memory. And for the computation itself it's good to think about whether I should calculate with 8-bit integers, always or only sometimes, and when it's actually needed; the quantization approaches apply this a little differently, and that's quite interesting. Let's start by thinking about going from floating-point 32 to integer. Usually you have some input values and some weights, and here at the crossing points you have your MAC operation that multiplies the weight with the input and accumulates; the accumulation can be passed this way or that way, both are possible. Then we have to think about whether it's good to have all data in 8 bits. There are theoretical concepts behind that: with 8-bit integer inputs and 8-bit weights, the memories are also used much more efficiently. But it's not good to do the accumulation in 8 bits, because then at the end you will not see any difference between results; a lot of significant digits are simply shifted out. So the idea is to do the accumulation in 32 bits, and at the end to requantize to 8 bits to bring the result back to memory. That's a very good approach: the multiplication takes 8 by 8 bits, the accumulation takes 32 bits, and all memory accesses are in 8 bits, and that keeps the energy consumption shown on the right-hand side quite low. That's the approach that NVIDIA, for example, is following. There are more data types, but I don't want to spend time on them here; if you have questions afterwards, I can go into it.
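Here is a sketch of that int8-multiply, int32-accumulate, requantize pattern, as used in typical int8 inference stacks; the per-tensor scale handling is simplified and of my own choosing:

```python
import numpy as np

def quantize(x, scale):
    """Map float values to int8 with a per-tensor scale (simplified)."""
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def int8_dot(x_q, w_q, x_scale, w_scale, out_scale):
    """int8 x int8 multiplies, int32 accumulation, requantize to int8."""
    acc = np.sum(x_q.astype(np.int32) * w_q.astype(np.int32))  # wide accumulator
    result = acc * (x_scale * w_scale) / out_scale              # rescale
    return np.clip(np.round(result), -128, 127).astype(np.int8)

x = np.random.randn(256).astype(np.float32)
w = np.random.randn(256).astype(np.float32)
y = int8_dot(quantize(x, 0.05), quantize(w, 0.05), 0.05, 0.05, 0.5)
```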
Now to reusing data: how can we orchestrate temporal and spatial reuse? Let's start with memory accesses; we learned that memory accesses are expensive, but we still have to get the data to the compute. In ResNet-18, for example, we have 1,800 million MACs; in MobileNet we still have several hundred million MACs, and that results in corresponding numbers of memory accesses if you don't do anything. A very simple approach is to put a memory hierarchy in between to enable reuse of the data. Of course this memory will be smaller than the one behind it; it's like a cache, or actually more like a scratchpad memory: we have to know when data is available, and it's very important to do that efficiently.
Then spatial reuse: we could of course unroll a loop. If we unroll the inner loop of our loop nest four times, then we have a parallelized structure at the end, four MAC operators running in parallel. If we do that, we will see that some of the operands are the same across those MACs, the same input data or the same weight, depending on the operation you are considering; and if the same weights are needed everywhere, then of course you can reuse the data by just connecting them directly. That's the baseline for discussing dataflow in a machine, and there are now well-established terms for orchestrating dataflows.
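A sketch of that four-way unrolling: the four MACs of one step share the same weight, so in hardware one weight register could be broadcast to four parallel MAC units (toy example, my construction):

```python
def conv1d_unrolled(inp, w):
    """1D convolution, inner loop unrolled 4x over output positions.

    All four MACs in one step use the same weight w[k]: a single
    weight register could feed four parallel MAC units.
    """
    n_out = len(inp) - len(w) + 1
    assert n_out % 4 == 0, "toy example: output length divisible by 4"
    out = [0] * n_out
    for k in range(len(w)):
        wk = w[k]                      # fetched once, reused 4 times per step
        for i in range(0, n_out, 4):   # unrolled by 4
            out[i]     += inp[i + k]     * wk
            out[i + 1] += inp[i + 1 + k] * wk
            out[i + 2] += inp[i + 2 + k] * wk
            out[i + 3] += inp[i + 3 + k] * wk
    return out
```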
A lot of architectures follow these ideas. One idea is to do nothing, no local reuse: everything is accessed directly all the time. Then at each processing element you will probably sometimes have the same input activations, sometimes not; you will sometimes have parts of the same partial sums to be propagated, sometimes not. It is basically useless to consider the no-local-reuse approach. One option is to think about the weights; it's called weight stationary. That's a reference pattern: I can reorder the loops in a way that the weights become stationary. Here you see the memory accesses, the addresses used to access the weights in memory, and you see that the same weight address is taken multiple times, four times here, then we jump to the next one and again access the same value several times. If you use this pattern, it's good to have a register in your processing element that keeps your weight value; then you can always access the internal register to serve that pattern efficiently, and a lot of architectures follow that idea. But that's not all you can do: at the end you have a MAC unit with the weight held in between, and you can access the same value multiple times; and for the partial sums you have to think about whether to accumulate internally or across the different MAC units, both are possible. Next, output stationary: just by reordering the internal loops of our convolution operator, I get a different pattern where I access the outputs multiple times. Exploiting that is very easy: I put an output register into the processing element, and that output register is fed back to the MAC unit; then of course I have a more sequential operation here, but I can also think about some parallelization, and a lot of architectures use that pattern too. And last but not least, input stationary: I can reorder the loops again so that I access the inputs multiple times, and then I have an internal register for the input to get some reuse. If you build your architecture, you have to think about which reuse mechanism to apply; you can also combine the different mechanisms, and that's a very important aspect for increasing data reuse and decreasing energy demand.
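To make "reorder the loops so a value stays put" concrete, here is the same 1D convolution written output-stationary and weight-stationary; in hardware the held value would sit in a PE register (again my own minimal illustration):

```python
def conv1d_output_stationary(inp, w):
    """Outer loop over outputs: each out[i] stays in one accumulator."""
    out = []
    for i in range(len(inp) - len(w) + 1):
        acc = 0                      # lives in the PE's output register
        for k in range(len(w)):
            acc += inp[i + k] * w[k]
        out.append(acc)
    return out

def conv1d_weight_stationary(inp, w):
    """Outer loop over weights: each w[k] is fetched once and reused."""
    n_out = len(inp) - len(w) + 1
    out = [0] * n_out                # partial sums travel instead
    for k in range(len(w)):
        wk = w[k]                    # lives in the PE's weight register
        for i in range(n_out):
            out[i] += inp[i + k] * wk
    return out

assert conv1d_output_stationary([1, 2, 3, 4, 5], [1, 0, -1]) == \
       conv1d_weight_stationary([1, 2, 3, 4, 5], [1, 0, -1])
```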
Here is the very well-known classification of parallel computer architectures according to Flynn. SISD is the standard CPU, a single instruction stream on a single data stream. In machine learning architectures we very often have the SIMD approach, which means applying the same operation to different data. Then there is the MISD, multiple-instruction single-data architecture, where multiple instructions are applied to a single data stream at different points in time, somewhat like a pipelined version of execution; that's also a baseline for the systolic array, and you can have multiple instances of it. And of course, at a higher level of granularity, I can deal with multiple cores, and then we are at MIMD on the right-hand side. SIMD processing, having a single MAC operation applied to multiple data, is very easy: if I have four data words I can access, I can directly apply a single SIMD operation to them. Systolic arrays are shown here; we can think about going for a weight-stationary or output-stationary approach in the processing elements, and then about when to pass data around, when to propagate the data, and what is executed internally. That needs to be orchestrated: if I map my operator, I need to know how many processing elements I have; if I have fewer than needed, then I have to go into the temporal domain to complete the operation.
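The SIMD idea in one line: the same MAC applied lane-wise to several data words at once; NumPy's vectorized operations stand in for the SIMD lanes here, purely as an illustration:

```python
import numpy as np

# Four data words, one shared operation: a 4-lane SIMD MAC.
acc    = np.array([1.0, 2.0, 3.0, 4.0])   # accumulators, one per lane
inputs = np.array([0.5, 0.5, 0.5, 0.5])
weight = 2.0                               # broadcast to all lanes

acc += inputs * weight                     # one instruction, four MACs
print(acc)                                 # [2. 3. 4. 5.]
```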
I will skip ahead a bit. Coming to NVIDIA, there were a lot of questions. This is an H100 GPU, in the fully equipped version; it depends on whether you have a PCIe variant and so on. If you take a look into a single one, you see that a lot of shader cores are available; the shader cores work according to the SIMD principle, applying a single operation multiple times on different data. But you also have tensor cores, and that's quite interesting: the idea is to implement matrix-matrix multiplication in an efficient way. In each of the 144 streaming multiprocessors you have four tensor cores. Let's take a look into the tensor cores; the mapping of the four tensor cores onto this construct looks like this, and in principle in each step, I won't say clock step, an entire matrix-matrix multiplication is executed in parallel on the data available there: you have a cyan matrix A and a violet matrix B, and the result is in green. Then of course you have to think about how large my matrices are and how they map onto the tensor cores. Here I can access the data very efficiently, I have local registers; but if I have to address a matrix in one of the other streaming multiprocessors, I'm running into trouble, and that's a problem that needs to be orchestrated as well. That's the execution model for floating point 16. If I go to floating point 8, that's good: I can take the 16-bit engine and divide it by two, and then I have two times the throughput, as shown here; it's actually the same hardware, just used with a smaller word length, but two words can be executed at once.
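The tensor-core principle, a whole small matrix-matrix multiply-accumulate per step, can be sketched as a tiled GEMM. The 4x4 tile below is an arbitrary illustrative choice, not the H100's actual MMA shape:

```python
import numpy as np

TILE = 4  # illustrative tile size; real tensor-core MMA shapes differ

def tiled_matmul(a, b):
    """C = A @ B computed tile by tile: each inner update is 'one step'
    of a tensor-core-like unit doing a full tile multiply-accumulate."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2 and n % TILE == m % TILE == k % TILE == 0
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                # One 'step': a TILE x TILE matrix multiply-accumulate.
                c[i:i+TILE, j:j+TILE] += a[i:i+TILE, p:p+TILE] @ b[p:p+TILE, j:j+TILE]
    return c

a = np.random.randn(8, 8).astype(np.float32)
b = np.random.randn(8, 8).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4)
```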
If I want to go to an automotive solution, NVIDIA provides the Jetson Orin platform. There you'll find a number of GPUs; inside they are what we have seen so far, with the tensor cores and shader cores, but also a lot of other stuff to deal with all the sensor aspects you will have in your car. This is the Tesla chip, and here they followed a somewhat different approach, a systolic-array approach: we have two systolic arrays of 96 by 96 MAC units, but also a GPU, a SIMD approach, and a number of CPUs; that's how this system is orchestrated. The block diagram looks like this: you see one of the MAC arrays of 96 by 96, and at the end you have SIMD lanes. What's interesting here is the address generation; they put a lot of effort into designing the appropriate address generation unit, to always have data there at the right point in time, and that's a very beneficial aspect of that kind of Tesla architecture. And here is a Qualcomm version; many more exist, and we have to deal with that, with very different architectures: we have to know how large the MAC arrays are and how large the memories are.
If you take that into account, we know how we can modify, how we can optimize, our neural network for implementation; that means hardware-aware NAS has to take that into account. Now we would like to optimize, and I have a slide from a colleague at a university who showed a bit of the optimization potential. He tried to efficiently implement large language models, and he started with algorithmic optimization; that means performing some parallel decoding, rethinking how to modify the transformer so that parts of it can be executed in parallel, which is a manual task, he did not show hardware-aware NAS. Then pruning and quantization are applied, then kernel optimization, since of course I would like to implement a kernel in an efficient way with respect to the underlying architecture, and at the very end an optimized mapping to hardware. And you see in his work that without optimization you lose two to three times performance at each level. That's also the idea we follow, a little differently: we have some kind of system orchestration at the beginning, we need a tool to schedule the design and to do some memory considerations; then, and that's what we will show afterwards, we do hardware-aware NAS with implicit or explicit pruning and some quantization; then you can do the kernel optimization and optimize the hardware mapping. We put a lot of effort into the deployment aspects, but we don't have time to show that today, though maybe I can steal some time later.
then so system orchestration now we have sir yeah the entire network of new networks and together with classic is off best to consider and then you have to think about okay yeah how I should map that on a device modern hardware architectures of a slicing capabilities that means okay I
can assign your models to different slices I can also assign memory locations to decrease the interleaving problem to decrease in feelings between the different parts and then of course I have to all the schedule
the design with respect to the periods I have to execute in a task in a given period I have also stores to consider that for doing that we have
For doing that we have a formalization — this is the project's work — that describes data access patterns. It is not good to just rely on single data words; we would like to describe the data access pattern, because if you know your operators and your model, then you know which data will be accessed next, and that can be described and used for optimization. The starting point is the sensors, whose data rates are fixed: if you have a camera, its parameters tell you — sorry, this one is the slower one with just 30 frames per second, and this one is the faster one with 60 frames per second — that I need to execute my model downstream at a frame rate of 60 frames per second, otherwise it would not be possible to keep up. Here I have a LiDAR sensor, which has other patterns, and here inertial sensors like a gyro, an accelerometer, and a compass (a magnetometer), which only need to be accessed at one frame per second.
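Here is a small sketch of how one might write down these sensor-side timing constraints. It is my own illustrative formalization, not the project's actual tool; the LiDAR rate in particular is an assumed value, since it was not given in the talk.

```python
# Illustrative formalization of the sensor-side timing constraints:
# each source fixes the rate at which the downstream model must
# consume its data.
from dataclasses import dataclass

@dataclass
class SensorStream:
    name: str
    rate_hz: float  # frames or samples per second

    @property
    def period_ms(self) -> float:
        return 1000.0 / self.rate_hz

streams = [
    SensorStream("camera_slow", 30.0),
    SensorStream("camera_fast", 60.0),
    SensorStream("lidar", 10.0),   # assumed rate, not given in the talk
    SensorStream("imu", 1.0),
]

# The perception model must keep up with its fastest camera producer.
deadline = min(s.period_ms for s in streams if "camera" in s.name)
print(f"model must finish within {deadline:.1f} ms per frame")
```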
Then I have the different AI functions of my network, and I can map them to different devices: here a neural processing unit, here a GPU, here a CPU, and so on. Each function again provides data access patterns, and we can take into account whether they are compatible or not. You see a row-by-row delivery of the data here, and there a blocked, tiled way of accessing the data, for example when applying a convolution. Depending on how compatible these patterns are, I can derive the memory size needed for the buffering in between, and that has to be done for all the different parts of the system. Then I can optimize the entire architecture: I can find optimized memory sizes in the middle, for example an appropriately sized streaming buffer, such that the downstream component does not have to wait for the complete completion of the first component — if data is already available block-wise, it can start execution earlier, in a pipelined manner. And that is only possible if these data access patterns are available at the system level.
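To show what such a buffer-size derivation can look like, here is a sketch of the standard line-buffer argument for the row-streaming-into-convolution case mentioned above. The exact formula used in the project's tool may differ; this is only the textbook version.

```python
# If the producer delivers an image row by row and the consumer is a
# KxK convolution, the consumer can start as soon as K-1 full rows plus
# K pixels are buffered -- no need to store the whole frame.

def line_buffer_words(width: int, k: int) -> int:
    """Minimum buffer between a row-streaming producer and a KxK
    convolution consumer, in pixels/words."""
    return (k - 1) * width + k

full_frame = 640 * 480
needed = line_buffer_words(width=640, k=3)
print(f"{needed} words instead of {full_frame} -> "
      f"{full_frame / needed:.0f}x smaller buffer")
```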
The next step — which we show in the tutorial and the hands-on — is the actual hardware-aware NAS, and very important here is the search space definition. We have a hierarchy for defining it: at the top there is a micro-architecture, or macro topology, of our network in mind, which consists of a number of blocks. Each block can have different characteristics — we can put in a residual block or just a linear block, and that is selected automatically. Inside, you can also choose between different operators: standard convolutions, depthwise convolutions, bottlenecks, and so on — all the operators you would like to see there. Each operator can then be refined further with respect to, for example, kernel size, stride, number of channels, and so on. This defines the neural network search space; you do the same for the hardware, and then you can put them together. That is the idea of what we are proposing.
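As an illustration of such a hierarchical search space, here is a small Python sketch. The encoding, the parameter names, and the concrete value choices are my own assumptions, not the actual tool's format.

```python
# Hierarchical joint search space: a macro topology of blocks, a choice
# of block type and operator per block, per-operator refinements, and
# an analogous hardware-side space.
import random

search_space = {
    "num_blocks": [2, 3, 4],
    "block": {
        "type": ["residual", "linear"],
        "operator": ["conv2d", "depthwise_conv", "bottleneck"],
        "kernel_size": [1, 3, 5],
        "stride": [1, 2],
        "channels": [16, 32, 64, 128],
    },
    "hardware": {
        "mac_array": [(4, 4), (8, 8), (16, 16)],
        "feature_memories": [1, 2, 3],
    },
}

def sample(space):
    """Draw one random candidate from the joint NN/hardware space."""
    n = random.choice(space["num_blocks"])
    blocks = [{k: random.choice(v) for k, v in space["block"].items()}
              for _ in range(n)]
    hw = {k: random.choice(v) for k, v in space["hardware"].items()}
    return {"blocks": blocks, "hardware": hw}

print(sample(search_space))
```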
But of course you also need performance estimates, and you have to take care about what performance means: for the ML community, performance probably means accuracy; for the hardware community, it means latency — and beyond latency also energy, area, and memory footprint. You need those metrics as well to evaluate your network; that is very important.
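A minimal sketch of such a multi-metric evaluation follows. How the real tool combines the metrics is not shown in the talk; a weighted scalarized cost is just one common choice, and all weights below are illustrative placeholders.

```python
# Joint evaluation over the metrics named above: accuracy for the ML
# side; latency, energy, area, and memory footprint for the hardware side.
from typing import NamedTuple

class Metrics(NamedTuple):
    accuracy: float      # fraction, higher is better
    latency_ms: float    # lower is better
    energy_uj: float     # lower is better
    area_mm2: float      # lower is better
    memory_kb: float     # lower is better

def cost(m: Metrics, w=(1.0, 0.01, 0.001, 0.1, 0.001)) -> float:
    """Weighted scalar cost; weights are illustrative placeholders."""
    return (-w[0] * m.accuracy + w[1] * m.latency_ms
            + w[2] * m.energy_uj + w[3] * m.area_mm2
            + w[4] * m.memory_kb)

a = Metrics(0.93, 4.0, 8.2, 0.5, 64.0)
b = Metrics(0.94, 5.0, 12.0, 0.6, 96.0)
print("pick", "a" if cost(a) < cost(b) else "b")
```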
That is the setup of HANNAH; I will not spend time on it here, you will see it later. Last but not least: what can we do for edge devices? We would like to have well-tailored devices, which means I would also like to find an appropriate platform for the task — that is what Paul is doing. We have architecture templates, and a template is selected with respect to the family of neural networks you would like to implement, with respect to the operators needed in that family, and with respect to, for example, the convolution sizes: if a convolution size does not fit the systolic array, that array is probably not a good choice. Then, via a generator-based approach, you get an architecture and a deployment tool to bring a neural network down to that architecture. You can make modifications in a loop to end up with a design space, find the best-suited solution at the end, and then tape out a chip for it. Everything I discussed before is applied in this idea — all these different optimization steps.
Here you see the macro architecture: there is a MAC array in the middle with some local memory. You can decide the size of the MAC array — eight by eight, sixteen by sixteen, four by four, or even more — and you also have to decide how many registers to put inside the MAC array for pipelining, and how many local memories you would like to support, or whether there is just a global memory (not shown here). Here, for example, we have three feature memories, which means you can deal very efficiently with residual paths: you have an input memory, a memory for the input feature map, and a memory for the output feature map. In a linear path their roles alternate — what is the input feature map this time is the output feature map next time — and if I have to support a residual path, I need the additional feature memory to keep the data for the residual path alive. All of that can be configured in an automated way, and that is the cool stuff here.
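Here is what such a template configuration could look like as a record; the field names and defaults are my own illustration of the generator-based approach, not the actual generator's interface.

```python
# Illustrative configuration record for the accelerator template:
# MAC array size, pipeline registers, and feature memories, where the
# third feature memory is what makes residual paths cheap.
from dataclasses import dataclass

@dataclass
class AcceleratorConfig:
    mac_rows: int = 8
    mac_cols: int = 8
    pipeline_regs: int = 2       # registers inside the MAC array
    feature_memories: int = 3    # in/out feature maps + residual copy
    local_weight_memory: bool = True

    def supports_residual(self) -> bool:
        # A residual block needs input, output, and a kept-alive skip
        # tensor simultaneously -> at least three feature memories.
        return self.feature_memories >= 3

cfg = AcceleratorConfig()
print(cfg, cfg.supports_residual())
```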
Here, for example, is a solution for the parallel execution of 1D convolutions — that was for an audio use case. For our audio chip we just need an eight-by-eight MAC array, and then you can run eight 1D temporal convolutions in one clock cycle, which is quite efficient. Here is a description of that architecture, and you can compose a larger system out of different instances of it.
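A quick sanity check of that claim (my own arithmetic — in particular, the assumption that one filter of eight taps maps onto one row of the array is mine, not stated in the talk):

```python
# An 8x8 MAC array computes 64 multiply-accumulates per cycle, so eight
# 1D temporal convolutions with 8 taps each fit in a single clock cycle.
mac_rows, mac_cols = 8, 8
taps_per_conv = 8  # assumed filter length matching one array row
parallel_convs = (mac_rows * mac_cols) // taps_per_conv
print(parallel_convs, "1D convolutions per cycle")  # -> 8
```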
Then you also consider the neural network for that audio use case: we wanted to apply a class of temporal-convolution, ResNet-based architectures with early-exit possibilities, and that is shown here — I will speed up a little. You need models to determine the latency and the power of your device based on the architectural templates; then you describe which parameters can be mutated and how they are constrained — how the MAC array sizes are constrained, for example, what the strides are, and so on — and at the end you get that kind of design space. What is interesting here: we explored three different neural networks in parallel. Blue stands for the audio use case of keyword spotting, orange for voice activity detection, and green for wake-word detection. The area needed for the hardware implementation is shown by the diameter of the circle, the x-axis gives the power consumption in microwatts, and the accuracy is on the left-hand side. And you see the yellow one here: a very low-power solution, which is very helpful, because we start with voice activity detection
first, before going to the next layer. I will pick up some points here. We started with a manual design: a TC-ResNet architecture was chosen with a word width of six bits — six bits were enough to quantize the weights — and eight bits for the features, and the MAC array size was eight by eight. You end up with an accuracy of 93 percent at a power consumption of 8.2 microwatts. If you then apply this kind of architecture search — the co-design together with the hardware — you can see that you can halve the power consumption just by applying the automated tool, with more or less the same accuracy; or you can even push the accuracy a bit higher at a slightly increased power consumption.
That is quite interesting, and it was a showcase that what we are thinking about is easily possible. It is shown in this chip, which in the end was built with an approach based on wake-up levels. The idea was an always-on detector: if the driver walks up to the parked vehicle, they can say "hey, it's me, please open the trunk", for example. You have to do voice activity detection, you have to detect the driver, you have to detect the keywords — and that must not require a high power consumption, because otherwise, if you park for two weeks with a lot of pedestrians walking around, your battery ends up empty. It is a macro architecture: you have the always-on unit at level one, then levels two, three, and four, and at the end a CPU that can communicate with the rest of the vehicle.
That was the tape-out of the chosen architecture: the processor itself could run at one gigahertz, but the accelerator was clocked at just 250 kilohertz — it could go up to 250 megahertz, but that was not needed; 250 kilohertz was just enough for real time, and that is quite good. Here is the evaluation board at the end. As a next step we would like to use the same flow for video inference, and all the tooling you will see afterwards will help us there. That was too many slides, I think — and now I hand over to Eric and Adel. Thanks a lot for listening; a lot of hard work went into this.