
Debugging Your Code with Data Visualization


Formal Metadata

Title
Debugging Your Code with Data Visualization
Number of Parts
132
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
Let's face it. Sometimes our code just isn't working how we expect it to work. When this happens, we fall back to our trusty tools to help us debug: pdb, the logging module, or even simple print statements. But sometimes that just isn't enough; we still can't figure out why our code is broken. We need something more. There are a number of fantastic libraries in Python for creating data visualizations. These libraries are commonly used for analyzing and visualizing large sets of data, but they don't see as much usage when it comes to helping us write and debug our code. What if we could combine the two? What if we could leverage data visualization tools to understand what our program is doing? Sometimes, by seeing a visual representation of what our code is doing, it becomes much easier to understand why our code is not working. We don't need to create publication-quality graphs and charts; we just need to generate quick, one-off visualizations to understand what our code is doing. What we want is the "data visualization" equivalent of print() statements. To help illustrate this point, I will walk through some of the hardest bugs I've had to track down while working on projects such as the AWS CLI and boto3, the AWS SDK for Python. For each bug, I'll show you how I was able to leverage data visualization techniques to troubleshoot and fix these bugs. Come learn how to debug more efficiently by leveraging data visualization.
Transcript: English (auto-generated)
Hi, everyone. Welcome. Thanks for being here today. My name is James; I'm a software development engineer at AWS, and I work on Python tooling there. Some of the things I work on include Boto3, which is the AWS SDK for Python. I also work on the AWS CLI, and another project is AWS Chalice, which is a serverless
microframework for Python; we just did a tutorial on it earlier this week at EuroPython, and there's a link that we tweeted out if you're interested. I'm going to be using some of those projects as examples throughout the talk, and I'm going to show you some techniques for debugging when some of the
more traditional techniques like print statements and log statements don't really cut it. So at first you might think: is debugging really that important, and how much time do we actually spend debugging? I tried to find an answer to this, and it really depends on who you ask and what study you look at. I found one that put it as high as 49.9 percent of our time spent debugging, which seems
pretty high to me. Another study put it at 30 to 40 percent of our time, but what I would really say is that we spend more time debugging than we expect. That's certainly the case for me. There was also an interesting study I saw that looked at novice
computer science students, comparing how well they did in a course, which was their proxy for being a good programmer, versus how well they did at debugging. The interesting result was that those weren't necessarily the same thing: just because you were good at programming and got a good grade in the class didn't necessarily mean you did well when specifically tested on debugging, although
there was a much stronger correlation the other way. So my takeaway from this study is that debugging, while related to programming, is a separate skill that's worth studying and practicing and surveying the field to see what other techniques there are to make us more effective debuggers, because that will ultimately make us more efficient developers. So in other words, I think debugging matters.
And when you look at the traditional forms of debugging, things that we've probably all done before, there's print and trace style debugging, where you just annotate your source code with print statements to dump variables and show which code paths you take. That's probably the thing we do the most; I do that a lot. There's also post-mortem debugging, which we don't really do a whole lot of in Python.
And then there's interactive debugging: if you're familiar with some area of the code and you know that's where your problem is, you can set a breakpoint, use the breakpoint() built-in in 3.7, or use the PyCharm debugger or whatever your IDE provides, and really step through your code and see exactly what's going on. So in this talk, what we're going to look at is some techniques for when print and
trace debugging aren't enough. We'll look at some data visualization of what our program is doing to help us get a better understanding of our code. In that same study, one of the things they cited was that a lot of what prevents people from being good debuggers is the difficulty of getting a mental model of what the code is doing.
A lot of the time you either don't have a mental model of how the code's supposed to work, or your idea of how it's supposed to work is actually wrong, and that's where a lot of bugs come from. I think data visualization helps with that: it gives you a quick overview of what your code is doing and helps you troubleshoot faster. Now, I'm going to show you three visualizations that start with a fairly easy one
and then go up in order of complexity. And for each one, I'm going to give you a real-world example of where I used the technique to help troubleshoot a bug I was working on. But one thing to keep in mind is that, just like the log or print statements you add to your code, this isn't really meant for production code. So when you're done with it, you throw away the changes that you've made, and
hopefully you remove your print statements before you commit and push your code. And the visualizations as well, they're not meant to be perfect. Really the intent is just to help us figure out what's going on because the real goal is to fix the bug or to verify your design or some other thing other than the visualization. So with that in mind, there's a lot I want to show, so we're just going to jump into
the diagrams here. So the first one, the simplest one, is a sequence diagram. If you haven't seen one before, this is an example of what it looks like. You read it left to right, top to bottom. In this example, there are two requests being made from a browser: you can see that we send a request from one object to another, and then there's a second POST where the browser sends a request to the web server, the web server sends a request
to the database, and you get a response back. You can draw these by hand, or in Keynote or PowerPoint or Visio, whatever you use for your diagrams. But you can also generate them programmatically, or you can generate them by writing them out like this. So here, this is a DSL for creating a sequence diagram.
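A rough reconstruction of what such a .diag file looks like (the labels here are made up, not the ones from the slide):

```
seqdiag {
  browser -> webserver [label = "POST /blog/comment"];
  webserver -> database [label = "INSERT comment"];
  database -> webserver;
  webserver -> browser;
}
```

Saved as, say, example.diag, this is the kind of file the rendering tool described next gets pointed at.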
And the way this works is you can see it's the same diagram as before, but you have a browser with an arrow to the web server and an optional label, and then the same thing in reverse. To render it, the tool that I like using is called seqdiag. You can just pip install it, and you get a command-line utility that you can point at this file, and it generates a PNG file that you can
open up and see what this visualization is, which is this thing. Okay, so that's the idea behind a sequence diagram. Let me give you an example of where that's helped out before. So who here is familiar with Expect: 100-continue in HTTP? Okay, I certainly wasn't until I started working on the SDK. The idea behind Expect: 100-continue is that instead of sending an entire HTTP request
in one go, you break it up into the headers and the body. You take your headers, add an Expect: 100-continue header, send that to the server, and ask: what do you think about this request? The server takes it, looks it over, and decides: is this the right host? Do you have access? Is the URI correct?
Check the auth header if you want. If everything looks good, it sends a 100 Continue back, then you can send the request body, and then you get your normal response. The server can also send a 400 and say you're not authorized to access this URI, or it could say the request looks fine but needs to go somewhere else and send you a redirect. And the reason this is helpful is that if you have a large request body,
you can avoid sending a large payload to a server that's just going to end up rejecting the request in the first place. So Expect: 100-continue is really useful, and we use it when we do uploads to Amazon S3. The problem that we had was that httplib, requests, and urllib3 don't support Expect: 100-continue, and there's not a clean way to add support to those libraries directly.
And the last time I checked, I don't think requests or urllib3 support it, and it didn't seem like a great idea to add it directly to those libraries in the first place. So we decided to add it to our own code, because we thought it was useful enough functionality and it prevents a lot of issues, so it's good for us to have. But there was a bug. Occasionally the Expect: 100-continue flow would return a 500, so
that's normally fine; it'll go through our retry loop and eventually succeed. But some of the error responses weren't actually errors. And of course, like all hard bugs, it very rarely happened, and there wasn't a clear set of steps that reproduced the issue. We were trying to figure out what we could do. So the plan that I had, after trying the print statements and looking at log messages and not being able to figure it out, was that
maybe we could generate a sequence diagram and see if there's any additional insight we could gain. So the idea is: we're gonna annotate our source code where interesting messages are sent, we're gonna generate that .diag file that we saw on the previous slides, and then we're gonna generate an image from our diagram and see if anything stands out. So I'm gonna be showing some code here; there's a lot of it, and I wouldn't expect you to take it all in.
It's mostly just showing what we annotate so you can see the changes before and after. So this is one of the classes we have, an AWS HTTP connection. We're setting some stuff with the expect header, setting a flag, calling the parent class, and then resetting the state. But the point here is just that, when you first start debugging, you add print statements. So you might start with something like this.
And you say, okay, if the expect header is set, we're gonna add a print statement that says so; otherwise, we don't. But let's try to create a diagram from this, so we wanna generate a sequence diagram instead of using print statements. You might think at first that this is a lot of work, but it's really pretty straightforward. What I did was create a debug.py module that you just drop right in the root directory of your project.
So we're gonna throw this away anyways. And we just have two functions here, add sequence and render sequence. And all it's really doing is just storing the string from the DSL that we saw before. So just where is the object coming from, and to, and an optional label. And then what direction the arrow goes. And just to show you how you can use this in a REPL, there's no magic to it.
Let's say we import debug, since it's just sitting in the root directory of the repository. If I create two objects, A and B, I can then add a sequence from A to B saying sending request, and then from B to A saying receiving response. And then if I call render sequence with /dev/stdout instead of a file name, you can see it generates a sequence diagram. Right, so far pretty straightforward.
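For reference, a minimal sketch of what such a debug.py might look like; this is a reconstruction of the same idea, not the exact code from the slides:

```python
# debug.py -- throwaway helper dropped into the project root while debugging.
# It just accumulates seqdiag DSL edges and writes them out on demand.
_EDGES = []

def add_sequence(from_obj, to_obj, label=None, direction="->"):
    # Record one edge of the sequence diagram using the objects' class names.
    edge = f"{type(from_obj).__name__} {direction} {type(to_obj).__name__}"
    if label is not None:
        edge += f' [label = "{label}"]'
    _EDGES.append(edge + ";")

def render_sequence(filename):
    # Dump everything recorded so far as a .diag file (or to /dev/stdout).
    with open(filename, "w") as f:
        f.write("seqdiag {\n")
        for edge in _EDGES:
            f.write(f"  {edge}\n")
        f.write("}\n")
```

The point is that it's barely more code than a print statement, which is why it's cheap enough to throw away afterwards.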
And the way we add this to our code is that instead of the print statements we saw before, we just change them to these message things. And you notice the first line I'm doing is import debug, and then we just say message equals debug.add sequence, so I don't have to keep typing that over. But instead of print statements, it's just a little bit more structured. We're saying we're sending a message from our self to self.host, and
we're starting expect 100 continue. And I annotated that all throughout the source code. You're gonna see a bunch of source code fly by here. The point, though, is that the green lines being added are almost the same as what you'd do with a print statement, just with a little bit more structure. And so I annotated the HTTP connection class, and went into requests and urllib3. And for the most part, we didn't really change any code.
There's maybe one section, I think right here, where there's a conn, or self._new_conn, where requests is trying to be fancy. We just broke up the lines and said, let's grab the connection so that we can call add sequence on it, but it's still the same logic. So for the most part, we're just doing a little bit more work than print statements.
Now, that was add sequence. To do render sequence, we just have a driver script that demonstrates the issue. And if you haven't seen code that works with the AWS SDK, all we're doing is creating a client for S3 and then calling put object, so we're uploading something to S3. I'm skipping ahead here just so we can save time. But if we sleep for 60 seconds, it's an interesting question:
what should happen? Because there will be an idle connection timeout, so the server should probably close the connection because you're not doing anything. And then we make another put object request; what should happen? So we let that run, then we call debug.render_sequence, and we generate a sequence diagram. And this is what we actually get. We'll zoom in in a second; I just like seeing the shape of the diagram at first.
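A rough sketch of a driver script along those lines (the bucket and key names are made up):

```python
import time

import boto3

import debug  # the throwaway module from earlier, sitting in the repo root

s3 = boto3.client("s3")

# First upload: exercises the Expect: 100-continue path on a fresh connection.
s3.put_object(Bucket="my-bucket", Key="test-key", Body=b"hello world" * 1024)

# Sleep long enough for the server's idle timeout to close the connection.
time.sleep(60)

# Second upload: what happens when the pooled connection has gone stale?
s3.put_object(Bucket="my-bucket", Key="test-key", Body=b"hello world" * 1024)

debug.render_sequence("sequence.diag")  # then render with: seqdiag sequence.diag
```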
And I wanna stress, this is from real data. So this isn't drawn by hand or anything; this is just running our code with that script and then looking at the sequence diagram. And this is what we get. So zooming in here, if you've used requests before, you know that under the hood it uses urllib3, which has a connection pool. And so we see that the first thing we do is check out a connection from the connection pool. And it's the first time we've created it, so
we create this _socket object, which is just a socket, and it gets returned back. And then, just like we saw on the slide, we're gonna start the expect 100 continue flow. That was the first message statement we saw on the slides. So it goes through this, and you can see the diagram's not perfect, but you get the idea. It's sending headers, we wait for the 100 continue response, and you can see it's sending it to my bucket name, so <bucket>.s3.us-west-2.amazonaws.com.
And then after that, it's gonna send the body once we get a 100 continue. We get a 200 OK, and then we put the connection back. At this point, we're sleeping for 60 seconds, right? Then we wanna see what happens when we create a new request. So we do the same thing again: we call get connection, we create a connection here, and then we go through the same flow, which we're gonna skip.
It's the same as before: we send the headers, wait for the response, and then send the body. But the key insight here, the thing that stood out once this diagram was generated, was a difference from how I thought the connection pooling worked: you have a single connection instance but multiple socket objects, so the lifetimes aren't the same. If a connection closes, then instead of creating a brand new HTTP connection object,
we instead create a new socket object and associate it with the existing connection instance. Compare that to reusing the same socket: if we just continually make requests and don't sleep, this is the diagram we'd see. So now that we know the lifetimes are different, after all this analysis the fix is pretty straightforward:
it's just that when you close the connection, you have to reset the extra state that we added. Which is a little bit different from, say, the file API: if you close a file, there's not really an expectation that you're gonna open that same exact file object later; you'd just create a new file object. So it was a little bit different, but once we got the insight that we're reusing the same connection instance even when it's closed, it was much easier to track down the bug.
And so we won't have time to dig into the details of exactly how that worked. But essentially, the connection was in a bad state: we assumed it was a terminal failure state, and we didn't reset our state appropriately. So that was the first visualization. It was really straightforward, because seqdiag did all the heavy lifting of generating the diagram. Now we're gonna look at the second one
of the three, which is visualizing internal state, and we do this a lot. This is probably the technique that I use the most. The idea is that as things are happening in your code, you just log whatever the internal state of some object is, and you can generate some interesting graphs. So to again give a use case where this was really,
really helpful: we're gonna try to download a file. That seems like a pretty straightforward thing to do at first. If you wanted to download, say, a video file, and it's big enough that you can't just hold the whole thing in memory, say 100 gigabytes or something, the easiest way to do it is to take the first chunk of the remote file, download it, and write it to your local file.
And then you would take the next chunk of the remote file, download it, write it, and you'd keep going, and eventually you would have the entire file downloaded, right? And that works, and that's pretty straightforward. But what we do in the CLI is we take the file, and then conceptually we break it into chunks. And for each chunk, we then have a thread pool that's assigned to individual chunks of the file, and they download the file in parallel.
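A rough sketch of that fan-out; fetch_range(offset, size) is a hypothetical stand-in for the ranged S3 GET, not the CLI's actual code:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB chunks, an arbitrary choice for the sketch

def download_file(path, total_size, fetch_range, workers=8):
    write_lock = threading.Lock()
    with open(path, "wb") as f:
        def download_chunk(offset):
            data = fetch_range(offset, CHUNK_SIZE)
            with write_lock:          # serialize access to the shared file handle
                f.seek(offset)        # jump to wherever this chunk belongs
                f.write(data)
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for offset in range(0, total_size, CHUNK_SIZE):
                pool.submit(download_chunk, offset)
```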
And so whenever they're downloading a file, the writes can come out of order, and we just seek to wherever the offset is in the file, and then write out the data. And so at some point in time, your file could look like this. There's just whatever data we have downloaded, we just write it out to that location. But eventually, you will get your entire file written just the same as we've
done it sequentially. And if we actually look at data here that plots the I/O writes, this is using matplotlib, and this is actual data showing where the S3 writes are happening within the file. The file is laid out vertically, so the beginning of the file is y equals 0, and the y max is the end of the file. And you can see each thread is writing at the same time to different parts of
the file. So you can see how this kind of parallelism can sometimes give a nice boost in performance. That by itself is pretty non-trivial to write, but there was an interesting use case we had where we wanted to download to a non-seekable stream.
I should ask, has anyone used the AWS CLI before? Okay, cool. So this is a cp command where we're downloading from a bucket to standard out. That way we can just pipe it to, say, gunzip, and then pipe it somewhere else and send the data along, and we don't need a local buffer. So if you have a large, say, I don't know, 500 gigabyte file or
something, you don't need 500 gigabytes of temporary storage to download the file to, you can stream it directly to a pipe. So the problem that we had was that here you can't do this, because you have to write data sequentially, right? There's no seek on the pipe. So one thing you could do is you could just stream it, just like we were doing before, and very early versions of the AWS CLI did that.
But we wanted to see if there was something better we could do, because let's still assume that downloading is a bottleneck, that writing to disk locally is much faster than downloading remotely. So the initial idea that we had, that we implemented, was let's just have a buffer, right? And you do the same thing with your thread pools. But instead, all you're gonna do is hold onto it in memory.
And if it's not the next contiguous block, you just keep buffering it. Then eventually, once the missing block comes along and you have a contiguous range, you can take that and flush it to disk, right? So we implemented this, and it worked. But during development we noticed a pretty big problem, which is that in the worst case, you buffer the whole file, right?
Networks are lossy, networks have congestion, and sometimes things come out of order. So if you're waiting on one part, you could potentially buffer a 500 gigabyte file in memory. We wanted to fix this without any huge performance hit. So we ended up implementing this sliding window algorithm,
which is very similar to what TCP does, if you're familiar with that. And you just limit the range of what your file can download to. So the way this works is, while you have parts that are downloading, you only allow them to download from that range. And it's not until something on the left side of the window comes in that it then unlocks that contiguous range.
And so you take that, you flush it out to disk or whatever your pipe is. And then that moves over, and the window slides over, and then that's the new part you can download, and that's your new max. So that was the sliding window. And we're not gonna look at the code. It gets a little more involved because there's threads, and you have to have semaphores and the appropriate locks. But essentially, there's an acquire, which is on the right-hand side
of the window, and then there's a release. And if the release is on the left side of the window, the window slides. And again, the point here is I wanted to show you the green highlights, which is how we annotated the source code to be able to verify that this algorithm was working how we wanted. And we'll look at what the T function is in a second. But for the acquire, there's only one trace that we're interested in.
And then for the release, there were the three parts that we talked about. One: if I try to release something in the middle of the window, we just put it in a buffer. Two: if I release something that's on the left side of the window, I can start flushing the buffer. And three: potentially there's a chain reaction, because once I unlock something on the left-hand side, any other contiguous range can then be released. So there's three events there. And again, the thing is,
all this function is doing is just printing some internal state: the sequence number, so where in the window we are, the time, and whether it was an immediate release or a chain release. And it's all just using the logging module. We have this tag tracer that's a quick shorthand so we can dump structured data as JSON, and then the T function again,
because we just want something really quick to add to our code. But ultimately, what you get is this: log messages that have the time, the thread, and then a JSON payload. We can take that, parse it, and generate plots of our internal state over time to see what's happening. And what we wanna see is sequential behavior, because we have to write the data in order. We should see something that starts in the bottom left and goes up and to the right.
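A minimal sketch of that idea, with illustrative names rather than the CLI's actual helpers: a T function that dumps a JSON blob of internal state through the logging module.

```python
# trace.py -- throwaway structured tracing via the logging module.
import json
import logging
import time

logging.basicConfig(
    filename="transfer.log",
    format="%(asctime)s - %(threadName)s - %(message)s",
    level=logging.DEBUG,
)
logger = logging.getLogger("trace")

def t(tag, **state):
    # One-letter helper so it's cheap to sprinkle through the code under test.
    state.update(tag=tag, time=time.time())
    logger.debug(json.dumps(state))

# e.g. inside the window code: t("acquire", seq=42) / t("release_chain", seq=40)
```

And a quick throwaway parse-and-plot script over that log, plotting sequence number against time with one colour per tag:

```python
import json
from collections import defaultdict

import matplotlib.pyplot as plt

events = defaultdict(list)  # tag -> list of (time, sequence number)
with open("transfer.log") as f:
    for line in f:
        payload = json.loads(line.split(" - ", 2)[2])
        events[payload["tag"]].append((payload["time"], payload["seq"]))

for tag, points in events.items():
    times, seqs = zip(*points)
    plt.scatter(times, seqs, s=10, label=tag)

plt.xlabel("time")
plt.ylabel("sequence number")
plt.legend()
plt.savefig("window.png")  # should climb up and to the right if writes are in order
```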
That's unlike the earlier diagram, where we had multiple S3 writers going at the same time. So this is what the actual plot looks like. We'll zoom in on the details in a second, but you can see visually, at least, this is a good confirmation that our algorithm is working how we want. I mean, we had unit tests and functional tests and integration tests, so we knew it was working, but in terms of the behavior, was this more or
less what we expected, some sort of linear, up-and-to-the-right behavior? If we zoom in a little bit, these are the four trace events that we added, the ones we saw in the previous slides: the acquire, which is the blue one, and then the three releases. We'll pick a specific event and walk through it. The release always happens sequentially,
but say something wants to be released, and we'll pick this data point here. The release is requested, but it's not ready yet. We aren't able to release it until later: if you trace along the same y value, you can see where the release actually occurs. That gap is the delay, and it only happened because that black dot there was what we were waiting for.
So that's the left side of the window. And you can see immediately above that, the reds, those are the chain reaction. Those are the other things that could be released cuz they were just pending. And then immediately after that, you see the blue dots can immediately acquire the next part of the file, and that's the sliding window in action. So this was a good visualization for us to verify that yes, this is in fact the correct mental model. This is how it's working. It does seem to be working how we expect.
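For reference, a minimal sketch of a sliding-window gate along these lines; this is a simplified Condition-based toy, not the CLI's actual semaphore-based implementation:

```python
import threading

class SlidingWindow:
    def __init__(self, size, flush):
        self._size = size              # how many chunks may be in flight
        self._flush = flush            # called with each chunk's data, in order
        self._next = 0                 # left edge of the window (next seq to flush)
        self._pending = {}             # seq -> data that arrived out of order
        self._cond = threading.Condition()

    def acquire(self, seq):
        # Block a downloader until its chunk index falls inside the window.
        with self._cond:
            self._cond.wait_for(lambda: seq < self._next + self._size)

    def release(self, seq, data):
        # Hand back a finished chunk; flush any contiguous run from the left edge.
        with self._cond:
            self._pending[seq] = data
            while self._next in self._pending:        # chain reaction of releases
                self._flush(self._pending.pop(self._next))
                self._next += 1
            self._cond.notify_all()    # the window may have slid; wake up waiters
```

Each worker would call acquire(seq) before downloading chunk seq and release(seq, data) when it finishes, so at most size chunks ever sit in the buffer.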
And we could look at different window sizes. So the previous graph was window size of ten. This is a window size of 100. And really the only trade off you're making is just you can potentially stall and have a bigger buffer size, right? So here, I think the window size is 100, and it looks like the stall is maybe 50 slots. So as long as we're okay with a stall of 100,
then this is a decent window size. But as we start to look at bigger window sizes, you can see the total time doesn't really change; you're just buffering more memory. So you don't really gain a whole lot from having a bigger window size. So the takeaway here was that by generating log statements that just dump internal state, you can then parse them and start to create really interesting plots to verify the behavior of your algorithms, to make sure they're doing what you expect,
make sure you have the correct mental model. So the third one that we're gonna look at is creating a custom diagram. So far we've been leveraging other tools to do a lot of the heavy lifting: matplotlib, and also seqdiag. But there are times when you have some sort of visualization in mind, and no existing tool really creates exactly what you want.
So for this, we're gonna generate our own custom diagram. And to keep using the CLI, so I don't have to explain new concepts, we're gonna look at uploads now. There was a problem very, very early on, I think in version 1.0, five years ago. So it's not really that interesting anymore, since it's long been fixed, but it's interesting in terms of the visualization.
But there was a case where there was a high CPU usage. So the thing would work, but it seemed like it was consuming unnecessary CPU usage. And we were trying to figure out what caused it. And the way that uploads work is similar to downloads. There's a producer-consumer pattern. And you would take something from the queue. So you'd have a worker thread, pull something from the queue. And then in some interesting cases, it would take work from the queue.
The worker thread would look at it and decide it can't do anything with it. So it would put it back on the queue. And then there would be times where we thought it was just cycling. So it'd take something from the queue, look at it, decide it can't do anything, put it back, take another task, decide it can't do it, and put it back. And it would just keep spinning like that. And here's what the actual worker thread looks like. It looks like a lot of code, but the gist is grab something from the queue,
which is gonna be a reference to a function. If we can call it, we invoke it and put the result on the result queue; if not, we put it back. And all I'm adding here are these log.debug statements. Instead of internal state, we're just logging when certain events happen. So we're interested in: when do we get an item from the queue? When do we start to invoke it?
When is it finished? When do we put it back? That's the queue put-back in the middle there. And really, again, this is the main takeaway, just the things that we highlighted: it's maybe five or six lines of code, and we're just emitting when certain events happen. The way we parse this is we have a log file that has a number of these events. But even looking at the raw log, it's kind of hard to tell what's going on.
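For reference, a rough sketch of that worker loop and the kind of debug events it emits; can_run() and the task callables are placeholders, not the CLI's actual classes:

```python
import logging

logger = logging.getLogger(__name__)

def worker_loop(task_queue, result_queue):
    while True:
        task = task_queue.get()
        logger.debug("queue get")
        if task is None:                   # sentinel value: shut this worker down
            break
        if not task.can_run():             # e.g. waiting on parts that aren't done
            logger.debug("queue put back")
            task_queue.put(task)           # give it back and try the next task
            continue
        logger.debug("invoke start")
        result_queue.put(task())           # the task is just a callable
        logger.debug("invoke done")
```

Each log line then carries the timestamp, the thread name, and one of these messages.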
If you notice, there's a bunch of threads here, thread two to thread seven, each calling get at certain points, and it's really hard to see the state transitions or the state diagram here. So we wanted some sort of visualization to help confirm some of the suspicions we had. What we do is take the timestamp, the thread, and the message, and put those in a namedtuple. So we go through each line and parse those out just by splitting on the hyphens.
And then from there, we group them by thread, because we want to see a visualization of each thread over time. So we have things like what time it started and what time it ended, and then based on whether it's an invoke start or an invoke multi start, we pick the event type. And once we have that, we're ready to generate our image. And I know that went really fast, because the thing I want to focus on is just how to use Pillow,
which is what I use to generate the image. If you haven't heard of it, it's a fork of the Python Imaging Library. When I had used it before, I really only thought it could do image resizing, generating thumbnails, rotations and cropping, that kind of stuff. But there's what I think is almost a hidden module, ImageDraw, which has primitives for drawing lines and rectangles
and circles and basic shapes, so you can generate an image from scratch. And that's what I did here. There's a lot of code here, and the slides will be available online, but there's really just four steps. We go through each thread; we're just drawing a bunch of rectangles, and we figure out what color to make each rectangle based on the event type. So if it's this put-back thing, where you have a worker thread taking something from the queue
and then immediately putting it back, we color that red so it stands out. Otherwise, we draw a rectangle whose width is just the end minus the start timestamp, so we know how long it took. And then at the end, we just call image.save, which saves the image to a file. And that's going to generate a visualization for us. So if we look at that, this is what we get.
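A rough reconstruction of those four steps with Pillow's ImageDraw module; the event kinds, colours, and scaling are illustrative, not the original script:

```python
from collections import namedtuple

from PIL import Image, ImageDraw

Event = namedtuple("Event", ["start", "end", "kind"])  # start/end in seconds

COLORS = {"upload": "gray", "multipart": "black", "put_back": "red"}
ROW_HEIGHT = 20      # pixels per thread row
SCALE = 200          # pixels per second of wall-clock time

def draw_timeline(events_by_thread, filename):
    threads = sorted(events_by_thread)
    t0 = min(e.start for evs in events_by_thread.values() for e in evs)
    t1 = max(e.end for evs in events_by_thread.values() for e in evs)
    img = Image.new("RGB", (int((t1 - t0) * SCALE) + 1, ROW_HEIGHT * len(threads)), "white")
    draw = ImageDraw.Draw(img)
    for row, thread in enumerate(threads):      # one row of rectangles per thread
        for e in events_by_thread[thread]:
            x0 = int((e.start - t0) * SCALE)
            x1 = int((e.end - t0) * SCALE)
            y0 = row * ROW_HEIGHT
            draw.rectangle([x0, y0, x1, y0 + ROW_HEIGHT - 2],
                           fill=COLORS.get(e.kind, "blue"), outline="black")
    img.save(filename)

# Hypothetical usage with made-up timestamps:
draw_timeline({
    "Thread-2": [Event(0.0, 1.5, "multipart"), Event(1.5, 1.6, "put_back")],
    "Thread-3": [Event(0.2, 2.0, "upload")],
}, "threads.png")
```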
And the way we read this is left to right for time, and then each row represents a thread. So we can see what the thread is doing at each point in time. And there's a legend there on the right, so we can see the gray is the non-multi-part upload, the black is a multi-part, and the red is a put-back. And if that's a little hard to follow, the main thing we're interested in is the red part, because that's the part we think might be having issues.
So just zooming in here: the black part is a multi-part upload, the gray part is an upload part, and the red is that put-back. What's nice about this diagram is that if you take a vertical line and just trace it, you can see what each thread is doing at a given point in time. So you can see how many multi-part uploads are happening at any given point in time. For historical reasons, we had a limit of three, which we can't get into,
but this was an interesting visual confirmation of that: any vertical line is always touching at most three black rectangles. And if we zoom in here, you can see these small red squares here,
which are the put-back events. That was where we could clearly see that at one point there are one, two, three, four, five, six threads just spinning on the CPU: taking a task from the queue, putting it back, and continually doing that. In this diagram the spin is pretty short, but there were other diagrams, which I don't have anymore, that showed much longer spin cycles.
And so it was very easy to visually verify at a given point in time, here's what all the threads are doing, and they are potentially using a lot of CPU. And the main takeaway from this, which I know we went pretty fast through, was just that we can use Pillow based on some state events that we emit to generate our own custom diagrams. And it's maybe 10, 15 lines of code to do that.
Okay, so wrapping up, there were three diagrams we looked at. We looked at the sequence diagram, which shows us the flow of messages between objects, and we can use the seqdiag tool to generate those for us. We also used matplotlib to graph internal state; that was where we periodically flush a JSON document with the internal state of the object we're interested in,
and then generate visualizations and time-series plots from that. And finally, when none of those work, we can use Pillow, which gives us basic primitive shapes to draw our own visualizations; in the example I used, that was based on events happening over time.
maybe give one of these a try, and maybe they'll work for you. I know this stuff has been really helpful for me throughout the years, and I hope it's equally as helpful to you. Thanks, everyone. Thank you, James. So we have time for a few questions.
Anybody want to come up to the microphone? No? Then I have a question. Yeah. So, have you considered using vector representations of those,
the last type of diagram, for example, to be able to maybe integrate more text and zoom in? I have not considered it, but that's definitely something worth looking at. I don't know how much more involved that would make the generation, but yeah, I think that's kind of the point with a lot of these:
we use whatever is the first thing we can get that will generate a visualization that gets us pretty close. But if we were going to put this in a blog post or documentation or something, we'd look at creating an actual vector image and cleaning it up, definitely. Great, thanks. Just a small question.
When you test this code, do you test it in production or in a test environment? Because you don't want to put your code in the repository, you said. So what do you do for the actual testing part? Yep, so for the testing part, it's usually not in production; it's usually some sort of test environment that we use. And sometimes for this code, especially the bugs that take a long time to reproduce,
we'll have to leave it running for a day or overnight or something, because it takes a lot for the bug to show up, and you generate a lot of data. The other thing that's a little bit different here is that these are client-side libraries, so it's a little different from, say, a web application where you have a prod stage and a dev stage; with client libraries, it's not quite analogous.
Yeah, I guess you have a server that is only designated for tests. Yeah, you could do that. So if this was for a web app, you could also have a percentage of your... a small percentage of your traffic redirected to these test machines or these machines that have a little bit more intense monitoring on it.
Other questions, maybe? Sorry, I'm gonna ask a very general question, because I want to learn something I can take away and apply to my own debugging.
What's your suggestion? Because, as that gentleman mentioned, you definitely don't want to check this into the repo, right? So what can we do, or what's your suggestion? So I think one option would be, if you do have a test environment or the ability to shift a percentage of traffic, you could do this.
The other thing, though, is that a lot of the stuff I showed doesn't have a whole lot of runtime overhead. It's mostly that we didn't want the cognitive overhead of having these trace things sitting in the code. But just like with refactoring, if you do something once, then again, and then definitely a third time, you start to think that if we keep doing this, it's worth putting proper code in there to do these traces.
For a lot of these things, we fixed the issue, verified it, and haven't had to come back to them. But if we kept doing it, then yeah, we would try to clean it up and keep it in the production code base. With the logging module, you can set the log level high enough that you're not actually emitting the debug messages, and it's only if you turn on the appropriate log level that you would see these traces get emitted.
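A minimal sketch of what that looks like; the logger name here is hypothetical:

```python
import logging

logger = logging.getLogger("mypackage.trace")

logging.basicConfig(level=logging.WARNING)   # normal runs: debug traces are suppressed
logger.debug("this trace is not emitted")

logging.getLogger().setLevel(logging.DEBUG)  # flip on only when you need the traces
logger.debug("now it is emitted")
```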
Thank you again, James.