Boosting simulation performance with Python
Formal Metadata
Number of Parts | 130
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/50102 (DOI)
EuroPython 2020 | 107 / 130
Transcript: English (auto-generated)
00:06
First on the list, we have Eran Friedman, who just finished his Master of Science in computer science, in the field of computational geometry. He is a Python developer working at Fabric,
00:24
formerly known as CommonSense Robotics, involved in system architecture and development. He's going to be talking about boosting simulation performance with Python. So, Eran, thank you for joining us. Thank you for having me. Thank you for the introduction.
00:42
So where are you streaming from? Sorry? Where are you streaming from? Oh, okay, I'm streaming from Tel Aviv, Israel. Nice. Yeah. Excellent. How's the weather there? Really hot. Really hot. It's cool, yeah. Excellent. Well, I will turn this over to you then,
01:03
and I'm going to turn this. Thank you very much, and show the screen. Okay, so hi again, everyone, and thank you for being here. You're probably here because you run
01:20
any kind of simulation or integration tests at your work. Now, how many of you would like to spend less time waiting for them to finish and have more time for coding, or for solving bugs if you write some? So I'm glad you're here. Today, you will see how you can use the discrete event simulation approach to simulate your system,
01:42
and how it will allow you to simulate hours of your system in minutes or even in seconds. So before I talk about how we run our simulation, let me tell you what we do in Fabric and what we simulate. Before I show you the video, I need to switch the sharing mode.
02:05
Okay. In Fabric, we build a fulfillment warehouse for online orders. Most of the work is done by robots. We have two types of robots. The first type is called lift robot. You can see it now in the video.
02:21
And the second type is called ground robot, which moves on the ground, on the floor. Together, they cooperate and help us to fulfill the orders. It works like that. The lift robot takes totes from the shelving units, puts the totes on the ground robot. The ground robot brings the totes into picking stations
02:41
where the items are picked and later delivered to the customers. Let me just exit this mode. So my name is Eran. I have worked at Fabric for about four years.
03:01
Now I mainly focus on the development of this cute robot. But before that, I was involved in different areas of the system. One of them is the simulation infrastructure, which I will present to you today. We'll start by seeing why simulations are so important. Then we'll see how to use
03:22
the discrete event simulation approach and how to do it in Python. Then I'll talk about some challenges we encountered and how we dealt with them. And finally, how to turn the multi-threaded simulation into a distributed, multi-process simulation. So first, what exactly do we simulate?
03:42
So usually the term simulation means a tool that imitates the behavior of a system. In our case, that is not exactly so. Let's take a look at this very simplified layout of our system. We have the backend, which is pure software. It manages the activity of the system.
04:01
It manages the orders from clients, the stock, and the motion of the robots; it sends commands to the robots and receives telemetry back from them. So in the simulation tool that I will talk about, we simulate only the robots. We run the system, the backend, just as it runs in production,
04:20
but instead of communicating with the real robots, it communicates with virtual robots. This decoupling of software and hardware is extremely important today, when we all work from home due to the coronavirus and access to the hardware is very limited. This tool has several more uses.
04:42
First, it is used as a testing tool. When developers write new code, as long as the code doesn't run on the robot, this is one of the options to test it. It is also used as part of our regression tests in the continuous integration system.
05:00
Also, in a complex system, it's difficult to know how a new algorithm or optimization will affect the KPIs of the system. So this is the place to evaluate it before running it in production. Again, the robots and the hardware are very limited and very expensive,
05:21
and this decoupling of software and hardware allows us to run as many simulations as we need in the cloud. We use this tool to evaluate a new warehouse before investing the money in construction. We can run the system on a new layout and see what KPIs we can reach and how many robots are needed to reach these KPIs,
05:41
for example. In simulation, it is very easy to inject failures into the robots and, by that, to improve the reliability and robustness of the system. We also have an integration lab in our offices
06:02
where we can test the code with the real robots, but it's not as big as the production warehouse, so simulation is the only place where you can run it on big setups before running it in production. So, we saw what we simulate
06:21
and what uses we have for this simulation tool. Now, let's talk about how we run the simulation. The approach we are using is called discrete event simulation. In this approach, continuous operations are modeled by instant events. For example, if we want to simulate an elevator, then the events can be "door is open",
06:41
"elevator arrived", "button pressed", and so on. The simulation also maintains its own clock, and it immediately moves from one event to the next event, and that's how the time can run faster than real time. In our simulation, we do it a little bit differently. We treat the time as the events,
07:01
and we divide the time into time ticks, and in each time tick, we calculate the new state of the robots. So we simulate the operations of the robot, which are moving, turning, and passing a tote from one robot to another. Let's take an example. Let's see the move operation of the robot. Let's say that the robot can move
07:21
at two meters per second, and we choose to have 10 time ticks in a second. So at first, the robot is located at x = 0, and assuming it got a move operation, the next time tick will be at 0.1, and the robot will calculate the new state, which is 20 centimeters.
07:41
Then again, the next time tick will be at 0.2, and the robot will calculate the new location, which is 40 centimeters. Notice that in this approach, the robot was never between 20 and 40 centimeters. It immediately moved from one state to the next. And in reality, the robot sends telemetry to our backend only a few times a second anyway.
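The time-tick arithmetic of this example can be reproduced in a few lines of Python. This is only a sketch of the calculation; the names `SPEED_M_PER_S`, `TICKS_PER_SECOND` and `positions_after` are made up for illustration and do not come from the talk's code.

```python
from fractions import Fraction

SPEED_M_PER_S = 2        # robot speed from the example: 2 meters per second
TICKS_PER_SECOND = 10    # 10 time ticks per simulated second


def positions_after(n_ticks):
    """Return (sim_time_s, position_m) for each of the first n_ticks time ticks."""
    tick = Fraction(1, TICKS_PER_SECOND)   # 0.1 s, kept exact to avoid float drift
    step = SPEED_M_PER_S * tick            # 0.2 m travelled per tick
    return [(float(k * tick), float(k * step)) for k in range(1, n_ticks + 1)]


for sim_time, position in positions_after(3):
    print(f"t={sim_time:.1f}s  x={position * 100:.0f}cm")
```

The robot's state only exists at tick boundaries (20 cm, 40 cm, 60 cm), which is exactly the "never between 20 and 40 centimeters" behavior described above.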
08:04
So the behavior looks the same for the backend. It is discrete anyway, and we don't lose any information by doing so. So this is the idea of discrete event simulation. Now, to implement it in Python, we use the SimPy library. SimPy is an open source library.
08:21
It is a framework for discrete event simulation. It is very simple and well-documented, and there are a lot of examples on the web. It is also lightweight. I mean that it doesn't try to help you to simulate your components. It just gives you the framework
08:40
of how to implement the discrete event simulation. Before we see how to do it in code, let's understand the idea of SimPy. To understand SimPy, you need to be familiar with three objects. The first one is the environment, the second one is the process, and the third one is the event.
09:00
The environment is the main object that manages the whole simulation. It has the simulation clock, and it has an event queue. A process represents a component we want to simulate. So in this example, we have two processes, one for robot zero and one for robot one. Now at first, the processes
09:21
add the initial events to the queue. So we have two events, one for robot zero and one for robot one. When we start the environment, it takes the first event from the queue and runs it, so it calculates the new state of the robot. Before it is done, it adds the next event of that robot to the queue. And then again, it takes the first event from the queue,
09:42
which now belongs to robot one. Again, it calculates the new state and adds the next event of robot one to the queue. And then again, it will take the next event from the queue, which again belongs to robot zero, but this time it is at time 0.1.
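The queue mechanics described here can be illustrated with the standard library alone. The following is not SimPy's implementation, only a minimal sketch of the same idea: a priority queue ordered by event time, a clock that jumps to the time of each popped event, and each event scheduling its robot's next one. The names `run`, `last_tick` and `TICKS_PER_SECOND` are assumptions made for this sketch.

```python
import heapq

TICKS_PER_SECOND = 10   # assumed granularity: one event per robot every 0.1 s


def run(robot_ids, last_tick):
    """Process events in time order and return the (clock, robot_id) trace."""
    clock = 0.0
    trace = []
    # Queue entries are (tick_number, robot_id); integer ticks avoid float drift.
    queue = [(0, robot_id) for robot_id in robot_ids]
    heapq.heapify(queue)
    while queue and queue[0][0] <= last_tick:
        tick, robot_id = heapq.heappop(queue)
        clock = tick / TICKS_PER_SECOND       # the clock jumps straight to the event
        trace.append((clock, robot_id))       # here the robot would compute its new state
        heapq.heappush(queue, (tick + 1, robot_id))   # schedule this robot's next event
    return trace


for clock, robot_id in run([0, 1], last_tick=1):
    print(f"clock={clock:.1f}  run event of robot {robot_id}")
```

The trace follows the order from the slide: robot zero and robot one at time 0, then robot zero again, at which point the clock moves to 0.1.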
10:02
So it will update the clock to 0.1. So this is the very basic idea of how SimPy works. Now let's see how to do it in code. We'll see a very simple example of using SimPy. In this example, we'll conduct a race of robots.
10:22
Let's say that a robot can move somewhere between two and four meters per second. Okay, so let's go over the code and then we will also run it. So here we define that we have three robots in the race. The race is going to last for 30 seconds
10:41
and we choose to have two time ticks in a second. So we'll have a time tick every half a second. Here we implement a very simple robot. We only implement the move operation. As you can see, it is a Python generator. So at first, the location of the robot is zero. Each iteration of the while loop is a time tick.
11:01
So it will calculate the new location. To make it interesting, I use the random function. Notice that I pass one and two to the function, because we said that the robot moves between two and four meters per second and we have two time ticks in a second, so it is one to two meters per time tick. Then the robot will print the simulation time,
11:22
the robot ID and the new location. And it will tell the environment that the next time it wants to run is in half a second. So here we initialize our environment. We register the SimPy processes in the environment and run the environment. Let's now run the race.
11:41
I'm going to run the code. Remember that the race is about 30 seconds. And of course, it is not going to take 30 seconds. I use the time command, which will print the time it took for the program to run. So as you can see, it lasted less than one second.
12:02
So I saved each of you about 29 seconds. Okay, so an important point to be aware of is that all the SimPy processes, all the components we are simulating, run in the same thread.
12:21
As you could see, it uses Python generators, and the environment runs one event at a time. I'll talk about it again in a few slides. Also, the parameters that affect the runtime of the simulation are obviously the number of components. The more components we are simulating, the more calculations we have to do
12:42
and therefore the slower the simulation. And also the time tick granularity. The finer the granularity, again, the more calculations per second and the slower the simulation. The model I described so far is called "as fast as possible". It means that the simulation tries to run as fast as it can.
13:01
It immediately moves from one event to the next event. SimPy also allows running in a real-time mode, where it tries to follow the real time. It will run an event and, before moving to the next event, it will wait until the time of the next event arrives. Now, why would we want to do this? I mean, at first thought you would think that we always want to run as fast as we can,
13:23
but you may want to do some manual testing in your system like a REST call or whatever, or you also may want to combine real hardware with your simulation. So these are good reasons to run it in a real time mode.
13:42
Using the discrete event simulation approach has several benefits. The most obvious one is that it makes development more efficient. When a developer finishes writing code and tests it, they get faster feedback. And you also get a shorter CI.
14:01
But as you can imagine from the previous slide, when I talked about the parameters that affect the runtime, if we run the simulation with many, many robots, then it may be that the simulation will even run slower than real time. But this is still an advantage, because that way the results of the simulation, the KPIs,
14:20
will still be realistic. I mean that in every time tick, the robot will get the chance to do the calculations, to calculate the new state. That's how the time will not run too fast. For the same reason, it's also deterministic.
14:42
It doesn't matter if you run it on a strong machine in the cloud or on your private laptop: the runtime of the simulation may be affected, but the result will be deterministic. It is also agnostic to profiling and debugging; we will still get the same results. Last, using this approach will also allow you very easily
15:04
to simulate any date or any time of the day. Like you can run the system like it is the weekend or any special time that is interesting for you. In SimPy, for example, you just need to provide the initial time to the environment and it is that simple.
15:21
And then this bug wouldn't happen. We wouldn't have panicked before the millennium bug if we had had this approach. Okay, so far I showed you how to simulate robots using discrete event simulation, but recall that at the beginning, I mentioned that we want to run our system,
15:42
our backend, together with the simulated robots. Our backend is a multi-threaded system. There are several threads that get messages and react to them. The messages can be telemetry from robots, input from the user, orders from customers and such.
16:02
Can you think what the problem is with running the backend together with the robots? The problem is that the robots may advance the time too fast, and the backend wouldn't have the time to do its work like it would in real time.
16:20
SimPy has support for event-driven processes, but as I mentioned before, all the SimPy processes run in a single thread, so it would change the behavior of our backend. We already had a similar experience when we used the gevent monkey patch, which makes your threads cooperative
16:43
and runs the system like it is one thread. It did improve the performance of the system, but later we found out that we had some bugs that we couldn't see in simulation. Therefore, SimPy's solution for event-driven processes is not good enough for us, so we came up with our own solution.
17:02
In simulation, we create another SimPy process which, in every time tick, holds the time until the event-driven threads have done their work. It does that by calling join on the thread's queue, and the join function waits until the queue is empty.
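One detail of the standard library's `queue.Queue.join()` is worth spelling out: it does not return when the queue merely becomes empty, but when `task_done()` has been called once for every item that was put. The consumer therefore has to call `task_done()` after handling each message. The talk does not show this part of the code, so the following is a minimal standard-library sketch with made-up names.

```python
import queue
import threading

events = queue.Queue()
handled = []


def worker():
    """Event-driven thread: handle messages as they arrive."""
    while True:
        message = events.get()
        handled.append(message)
        events.task_done()     # join() only counts an item as finished after this call


threading.Thread(target=worker, daemon=True).start()

for i in range(5):
    events.put(f"telemetry {i}")

events.join()                  # blocks until task_done() was called for all five items
print(handled)
```

Because `join()` waits for the handling to finish and not only for the item to be taken off the queue, the caller can rely on all side effects of the handler being complete when it returns.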
17:21
That's how we make sure that the event-driven thread will have the time to handle the events. Let's see an example in the code. Okay, so in this example, we'll have one event-driven thread, which listens to the queue, gets a message
17:40
and prints it to the screen. And a robot, which in every time tick will send a message to the event-driven thread. So let's go over the code. This time, we'll have one time tick every second. For now, I'll just show you the problem, so we ignore this class. We'll see it later.
18:02
So this is a simple implementation of an event-driven thread. What it is doing is listening to the queue, printing the message to the screen, and that's it. Here, we implement another simple robot. In each iteration, it adds a message to the event-driven queue, increments the counter,
18:22
and tells the environment that it will run again in one second. So again, we initialize the environment. This time, to see the problem, we use the regular Python queue. We start the event-driven thread. We register the robot in the environment and run it.
18:41
So let's run it. Just remember, we are going to run the simulation for 50 seconds, and in each second, the robot will send a message to the event-driven thread, which should be printed to the screen. So run it. As you can see, we don't see any message on the screen,
19:02
and this is exactly the problem that I described. The robot did send 50 messages, but the event-driven thread didn't have the time to handle them. So let's see how we solve it. We inherit from the Python queue, and in simulation, we add another SimPy process
19:21
that in each time tick calls join on the queue and then again tells the environment that it will run again in one second, in the next time tick. So let's use this queue this time. Let's run the example again, this time with our queue.
19:45
And as you can see, it solved the problem. The usage of join helps the event-driven thread to handle the messages. So as you could see in the example, the event-driven thread is not really aware of SimPy,
20:04
and that's what I meant when I said that we run our backend in simulation just as it runs in production. It isn't aware whether it is production or simulation. With the exception that the backend cannot call the default time functions,
20:21
because in simulation, they are not relevant, right? We have a different clock. So we have all that functionality in our own module, and the backend just calls this module, and this module knows whether it is simulation or production and calls the right functions. Last, in simulation, we print the simulation timestamp in the log,
20:42
because when you are debugging the simulation, you care more about the simulation time. Now, eventually, we also moved to microservices, just like everyone else. And then, again, we wanted our system to run in simulation
21:02
just like it runs in production. So it means that this time, we don't want a multi-threaded simulation, but a distributed simulation. We said that SimPy doesn't support a multi-threaded simulation, so for sure it doesn't support a multi-process simulation. So we came up with this solution.
21:21
In simulation, we run another service called the barrier server, and the responsibility of this service is to sync the time of the other services, to prevent one service from running faster than the other services. All the other services look the same as I described so far,
21:41
the same as the multi-threaded simulation. Each one of them has its own local SimPy, and all of them pick a shared time tick, and it works like this. At the beginning, they initialize SimPy. Each service does its work, and once the shared time tick arrives,
22:00
it sends a ready message to the barrier server. The barrier server holds that message until it gets the message from all the other services. Once it has got them, it sends them the approval, and then they can move to the next time tick. That's how we prevent one service from running faster than the other services.
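The ready/approval protocol can be sketched with the standard library. In the talk, the services are separate processes that talk to the barrier server over a message queue; in this sketch, threads and `queue.Queue` objects stand in for them so that the example stays self-contained. The local SimPy work of each service is reduced to a log entry, and all names are assumptions.

```python
import queue
import threading

NUM_TICKS = 3   # number of shared time ticks to run


def barrier_server(ready_inbox, approval_inboxes):
    """Hold every 'ready' message until all services have reported, then approve."""
    for _ in range(NUM_TICKS):
        ready = set()
        while len(ready) < len(approval_inboxes):
            ready.add(ready_inbox.get())
        for name in ready:
            approval_inboxes[name].put("approved")


def service(name, ready_inbox, approval_inbox, log):
    """Do the local work for a tick, report ready, and wait for the approval."""
    for tick in range(1, NUM_TICKS + 1):
        log.append((tick, name))    # stand-in for running the local simulation
        ready_inbox.put(name)
        approval_inbox.get()        # time holds here until everyone is ready


def run(names):
    ready_inbox = queue.Queue()
    approval_inboxes = {name: queue.Queue() for name in names}
    log = []
    threads = [
        threading.Thread(target=barrier_server, args=(ready_inbox, approval_inboxes))
    ]
    threads += [
        threading.Thread(
            target=service, args=(name, ready_inbox, approval_inboxes[name], log)
        )
        for name in names
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return log


for tick, name in run(["orders", "stock", "motion"]):
    print(f"shared tick {tick}: {name} finished its work")
```

No service can start tick 2 before every service has finished tick 1, so the tick numbers in the log never go backwards, whatever the thread scheduling.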
22:21
Notice that from the moment a service sends the ready message to the barrier server until it gets the approval, the time holds for it. It waits for the other services to reach the next time tick too. So I finished the slides. I think we have a few minutes for questions, and then I'll just sum up the talk.
22:44
Awesome, thank you very much. So I do have a couple of... well, I have one question so far. If anyone has any questions, please post them in the Q&A here on Zoom, or you can also post them
23:02
in the Parrot track chat room over on Discord, which I'm keeping an eye on as well. Anyway, so Ruud van der Ham asks, are you familiar with the other Python DES package called salabim? Yeah, I heard about it. I think it is quite new, maybe 2017.
23:20
So before we started, it didn't exist. But anyway, I checked it. It looks pretty similar to SimPy. I think it also uses generators, and it also has the notion of an environment, I think. But anyway, I didn't really try it. We already used SimPy, and I also didn't see many comparisons between them.
23:43
So if someone is familiar with it, I would like to hear about it on Discord. I also have another question, actually also from Ruud van der Ham. He asks, how does the messaging with the barrier service work?
24:02
So this depends on your implementation, on how your services communicate. In our case, we use a message queue for this. But you can do it with REST or sockets or any other messaging.
24:24
All right, excellent. So it looks like, that's all I have at the moment. So once again, thank you very much.
24:41
And yeah, if anyone wants to chat with Eran afterwards, you can look up his room, which I believe is boosting SIM performance in Discord. And Eran, did you want to do a quick recap? Yeah, first, thank you.
25:02
So yeah, what we saw in the talk: we saw just how important the simulation is, especially for a hardware company like ours. And the discrete event simulation approach has some more benefits. Again, if you want to do it in Python, you can do it with SimPy.
25:21
Also, there is the salabim library. If you want to run the simulation together with your system, then you may suffer a time leak. You just need to make sure that all the components are tied to the time somehow, to the clock. And finally, the extension of the simulation
25:42
into a distributed simulation was really straightforward for us. And it took us a couple of days to do it. So that's it, thank you very much for listening. Hope you enjoyed.