Learn Python automation by recreating Git Commit from scratch
Formal Metadata
Number of Parts | 115 | |
License | CC Attribution - NonCommercial - ShareAlike 4.0 International: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this | |
Identifiers | 10.5446/58803 (DOI) | |
EuroPython 2021 (40 / 115)
Transcript: English (auto-generated)
00:06
And we're back. You're sitting in Brian. I hope you've had a nice talk. And if you don't know about it, we'll be having sprints tomorrow and on Sunday. So feel free to sign
00:20
up and sign up your projects, sign up to do it, everything. Do we have Matteo here? We do. Hey. Hello. I think your sound is a bit low. Maybe good. Maybe good.
00:41
Okay, like that? Yeah, that's good. Awesome. And you're going to... What are you going to teach us? You're going to teach us automation, right? How you can use Python to automate your daily life and your daily tasks. And we will use
01:03
the git commit command as an example. We will create a 90-line script to remake the git commit command. I need this in my life, Matteo. Let's do it. Okay. Hello, everyone. Welcome to this talk. I'll start by introducing myself. My name is
01:23
Matteo Bertucci. I am a first-year student in a French engineering preparatory class, Polytech Marseille. I've been enjoying Python for more than two years now and use it daily to automate various aspects of my life and have fun on larger projects. In my free time, I'm also part of the staff of the largest online Python community, Python Discord.
01:46
We aim to foster a welcoming atmosphere for newer and older Python enthusiasts. I highly recommend that you join us at discord.gg/python after the event if you want to continue the discussion or just chill out.
02:02
This is my first time presenting a talk and I'm very glad to be here. This talk will go in two parts. In the first one, we will explore how the git database, the core file storage of a git repository, is structured. In the second part, we'll write a 90-line script to replicate how the git commit command works.
02:21
We'll go over some simple string manipulation and byte handling techniques and how to quickly make a Python utility script. During the last part, we'll discuss in more detail how we got to the script and why. Every command I will use in this part is meant to be reproducible at home.
02:44
I highly recommend that you try those commands, and even deviate from the main path, try your own stuff, and have fun. So what is the git database? Well, in each git repository, you have a hidden .git folder that contains all the data git needs to remember.
03:04
It will contain remotes, hooks, logs, and what we are interested in today, objects. As you probably already know, git gives each commit a 40-character hexadecimal identifier. This identifier is actually the hash of the commit.
03:22
But what is a hash, you may ask? Well, a hash is produced by a hash function. You pass a block of data of any length to a hash function and it will always return the same hash for the same data. Git uses the SHA-1 algorithm, where SHA stands for Secure Hash Algorithm.
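As a small illustration of that property, here is a minimal sketch using the standard `hashlib` module. The helper name `sha1_hex` is made up for this example, and it hashes the raw bytes only; the exact bytes Git feeds to SHA-1 are described later in the talk.

```python
import hashlib


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of `data` as 40 hexadecimal characters."""
    return hashlib.sha1(data).hexdigest()


first = sha1_hex(b"example blob\n")
second = sha1_hex(b"example blob\n")
other = sha1_hex(b"another blob\n")

print(len(first))       # 40
print(first == second)  # True: same data, same hash
print(first == other)   # False: different data, different hash
```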
03:44
The thing is, commits are just the visible part of the iceberg, as we will see later on. What you need to remember for now is you can give any blob of data to the git database and it will give you back a unique 40-character identifier that you can use later to access the data again.
04:04
Let's start with an example. We will start by initializing an empty repository using git init. Using the find command, we can see that the folder .git/objects, the git database itself, is empty. Then, using git hash-object, we can store arbitrary data in the database.
04:25
Here, using the echo command, we will provide the string "example blob" followed by a newline. Do note that we use the -w flag to actually save the object to the disk. Here, we got a hash starting with bf.
04:41
Using the same command as previously, we can see that a new object exists in the database, starting with the same prefix bf. And finally, using the command git cat-file and the -p flag to get human-readable output, we can check that we've indeed stored the string "example blob".
05:04
Something I would like to point out too is the structure of the database. As you may have noticed, the blob is actually stored in the subfolder called bf, which are the first two characters of the hash. Cool. Now that we know about the database itself, we can look at what it actually contains.
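The folder layout described above could be sketched like this with `pathlib`. The function name `object_path` and the placeholder hash (`bf` followed by zeros) are assumptions for illustration only; the real hash from the demo is not reproduced here.

```python
from pathlib import Path


def object_path(objects_dir: Path, hex_hash: str) -> Path:
    """Map a 40-character hash to <objects_dir>/<first two chars>/<remaining 38>."""
    return objects_dir / hex_hash[:2] / hex_hash[2:]


# Placeholder hash: only the "bf" prefix comes from the talk.
example = object_path(Path(".git/objects"), "bf" + "0" * 38)

print(example.parent.name)  # bf
print(len(example.name))    # 38
```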
05:25
Let's create a test repository for that. So we'll start by running a few commands to have a reproducible environment. Since commits include metadata, such as the name of the author, we would get different hashes if your metadata isn't the same as mine.
05:40
We can start by creating a new repository, creating a src folder, and three bogus files: a README.md, a LICENSE, and a script. We'll add those files to git, and we can immediately see that three new objects have been added to the database, starting with 00, b0, and c9.
06:01
Using git hash-object once again, we can double-check that each file indeed maps to the content of the repository. Something that you may have noticed too: the name of the file isn't included anywhere in those objects. The objects inside the git database are all anonymous. Another object will have to give meaning to them.
06:27
So let's now make an example commit. Let's call it "example commit" because it is very original. And look at the content of the database once again. As you can see, we now have three new objects: 27, 76, and 84. Let's investigate what those are.
06:43
We'll start by looking at the commit itself, which, as we could see in the terminal, starts with 76. As you can see, the structure is really quite simple. We have three fields: the reference to the tree, the author, and the committer identity. They are followed by a newline and the commit body.
07:01
As a side note, the author and the committer will always be the same unless you use something like the --author flag on the git commit command. Most of you have probably noticed that the tree ID is also part of the database. Let's see what it contains.
07:21
It's also quite simple, with only three lines. Each line starts with the mode of the file, which is an integer saying what the file actually is, followed by the type of the entry, so either blob or tree, and its hash and name.
07:41
The last entry points to another tree, which just contains the script file. Git will use nested tree objects to represent larger directory structures. As a side note, this representation isn't actually how the tree object is stored on the disk, as we will see in a few minutes.
08:06
So let's now write some code. We can start by opening a file, and yeah. As it turns out, objects aren't stored in plain text to save space.
08:22
They are compressed, or deflated if you prefer using the right term, using zlib. We can use the built-in zlib library to inflate them. You can also notice that there is a header at the beginning of the file. It contains the object type, blob here, the size, a null byte, and then comes the actual content.
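Here is a minimal sketch of inflating an object and splitting off its header. To stay self-contained it compresses an in-memory example instead of opening a file under `.git/objects`, and the function name `parse_object` is hypothetical.

```python
import zlib
from typing import Tuple


def parse_object(raw: bytes) -> Tuple[str, int, bytes]:
    """Inflate a stored object and split it into (type, size, content)."""
    data = zlib.decompress(raw)
    # The header ends at the first null byte: b"<type> <size>\0<content>"
    header, _, content = data.partition(b"\0")
    type_, _, size = header.partition(b" ")
    return type_.decode(), int(size), content


# Stand-in for the bytes read from an object file on disk.
stored = zlib.compress(b"blob 13\0example blob\n")

print(parse_object(stored))  # ('blob', 13, b'example blob\n')
```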
08:46
The file hash can be very simply calculated by using the hashlib.sha1 function on the whole content of the file. I would also just like to stop here for a second and talk about byte handling in Python. When you create a string of characters, the computer must store it somehow in memory.
09:08
For that, we created tables mapping each character to a chain of bytes in memory, like ASCII. Python uses Unicode for strings, with UTF-8 as the default encoding, which allows most special characters from foreign languages to be used.
09:29
The problem with strings is that they don't like random bytes, as we saw previously with the blob that was still zlib-compressed.
09:41
For that purpose, two other types exist: bytes and bytearray. The former is immutable, just like str, while the latter is mutable; you will usually use a bytearray when you need to modify part of the array at runtime. We won't need it today. There are two main ways of creating a bytes object.
10:02
You can either prefix a string literal with the letter b, which sadly means you can't combine it with an f-string, or convert back and forth from a string object using the encode and decode methods. They both take an encoding parameter, which is usually UTF-8.
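The two ways of creating a bytes object, plus the mutable `bytearray`, might look like this. The sample strings are arbitrary and chosen only for illustration.

```python
# 1. A bytes literal: prefix the string with b.
raw = b"blob"

# 2. Converting back and forth with encode() and decode().
text = "héllo"
encoded = text.encode("utf-8")     # str -> bytes
decoded = encoded.decode("utf-8")  # bytes -> str

print(len(text), len(encoded))  # 5 6 ("é" takes two bytes in UTF-8)
print(decoded == text)          # True

# bytes is immutable; bytearray can be modified in place.
buffer = bytearray(raw)
buffer[0] = ord("g")
print(bytes(buffer))  # b'glob'
```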
10:23
With all that knowledge in mind, we can create our function. We will start with some imports and by creating a path constant, which is our database folder. If you don't know what pathlib is, it is basically a fancier version of os.path that uses classes, which makes it much easier to work with.
10:42
This function will take in the type of the blob as a string and its actual content as bytes, and return the hash as a hexadecimal string. Next, we can construct the blob of data we are actually going to store, which follows the git model. We will start by putting the type, a space, the file size, a null byte represented by a backslash and a zero,
11:07
which is just a shorthand for a backslash, an x, and two zeros, and the actual content of the blob. We use the legacy C-style formatting here because we cannot combine a b-string with an f-string, as I've said previously.
11:22
What happens here is that each %s for string and %d for digit is replaced by the corresponding element in the tuple, like so. We can now use the hashlib function to compute the hash and zlib to compress the blob. We will use hash_ to avoid colliding with the hash built-in function, like we did with the type parameter.
11:46
As you may remember, the objects are stored in a subdirectory named after the first two letters of the hash. We can represent that by using the objects path, a forward slash, and hash_. We make sure this folder exists, write the compressed data down, and return the hash.
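Putting those steps together, a function along these lines would do the job. This is a sketch under assumptions, not the speaker's exact code: the name `write_object` and passing the objects folder as a parameter (rather than a module constant) are choices made so the example can run against a temporary directory.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path


def write_object(objects_path: Path, type_: str, content: bytes) -> str:
    """Store `content` as an object of the given type and return its hex hash."""
    # Header + content: b"<type> <size>\0<content>", built with C-style formatting.
    blob = b"%s %d\0%s" % (type_.encode(), len(content), content)
    hash_ = hashlib.sha1(blob).hexdigest()

    # Objects live in a subfolder named after the first two hash characters.
    folder = objects_path / hash_[:2]
    folder.mkdir(parents=True, exist_ok=True)
    (folder / hash_[2:]).write_bytes(zlib.compress(blob))
    return hash_


with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / "objects"
    hash_ = write_object(objects, "blob", b"example blob\n")
    stored = (objects / hash_[:2] / hash_[2:]).read_bytes()
    restored = zlib.decompress(stored)

print(restored)  # b'blob 13\x00example blob\n'
```

Writing to a temporary directory keeps the sketch from touching a real `.git` folder while experimenting.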
12:07
Now, we can write a quick small function to take any path as a parameter and store the content of the file in the database using the type blob. This is quite simple: we just open the file in read-binary mode, as we want to handle stuff like images which aren't text, and call write object.
12:29
Easy right? Now it's time for the less easy part, writing a tree or a folder if you prefer. Each line in the tree object will be the mode of the file, a space, its name, a null byte, and the hash stored in raw binary.
12:46
We will also create a constant that will hold all of our ignored folders. We don't want to be actually committing the git database to the git database, or the __pycache__ folder, or even our own commit implementation. Then we can list every subfile or subfolder in the target folder and sort the array, as it is required by git for some reason.
13:09
I will do a quick stop here and talk about recursion. Let's take this example. Our goal is to make an example function to take two arguments, the first one being the level of nesting of the list returned, and the second one being the length of each list.
13:27
We'll start with a simpler version. Here we just want to generate a list of duckies of a certain length. It is quite simple: we just make a for loop, append the string that many times to the list, and return it.
13:41
Let's do a second version. We now want nested lists. The second argument will be a boolean saying if the list should be nested or not. If the second argument is set to false, we can just have the same logic as before. But if it is true, we could use two nested loops to represent this nested structure, right?
14:02
But that would be just repeating the same code once again. So what if we instead make the function call itself to return a non-nested list, with the second argument set to false, and add it to a larger list? This way you will have the nested structure you want. Now we can move on to our final step, generating an arbitrary amount of nested lists.
14:23
Well, we use almost the exact same logic. Instead of using a boolean, we use an integer saying how many levels we still have left, and decrease it by one each time we want to call the function. If we are on the last level, one, we can just return our final list.
14:45
If that isn't clear, here is a visualization of what happens when you call the function with the arguments 3 and 3. The function will first call itself three times with the arguments 2 and 3. Each of those calls will then call itself again with the arguments 1 and 3.
15:03
And those last calls will each yield a list of three duckies, creating this nested structure of 27 duckies. Now the question you may be asking is: why the heck did I talk about this crazy technique? Well, if you look at it this way, our list is quite similar to folders, don't you think?
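The final version of that recursive function could be sketched as follows. The names `duckies` and `count` are hypothetical, and `count` exists only to check the total of 27 mentioned above.

```python
def duckies(level: int, length: int) -> list:
    """Return `level` levels of nested lists, each list holding `length` items."""
    if level == 1:
        # Last level: a flat list of duckies.
        return ["ducky"] * length
    # Otherwise, each element is the result of a recursive call one level down.
    return [duckies(level - 1, length) for _ in range(length)]


def count(nested: list) -> int:
    """Count the strings in an arbitrarily nested list (also recursive)."""
    return sum(count(item) if isinstance(item, list) else 1 for item in nested)


print(duckies(2, 2))         # [['ducky', 'ducky'], ['ducky', 'ducky']]
print(count(duckies(3, 3)))  # 27
```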
15:24
We can have an arbitrary number of nested folders, and we want to handle that properly. So let's get back to our code. Here we are. We have a list of all the child files of the starting folder. We can iterate over each file.
15:41
If it is part of the ignored paths, we just move on to the next iteration using the continue keyword. If it is a directory, we call the write tree function once again, making a recursive call. This means that if there is a directory inside another one, inside another one, we will still be able to handle them.
16:00
If it is a file, we simply call write blob. As you may have noticed, I store every time the hash of the new object and a mode value. Once they are saved to the database, we can create the raw bytes object from the 40-character hash, which yields a 20-byte array, generate the line as we saw previously, and add it to the list.
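The line format for tree entries can be sketched like this. The function names are made up, the hashes are placeholders rather than real object IDs, and the plain sort by name is a simplification: Git's actual ordering compares directory names as if they ended with a slash.

```python
from typing import Iterable, Tuple


def tree_entry(mode: str, name: str, hex_hash: str) -> bytes:
    """Build one tree line: b"<mode> <name>\\0" followed by the 20 raw hash bytes."""
    raw_hash = bytes.fromhex(hex_hash)  # 40 hex characters -> 20 bytes
    return b"%s %s\0%s" % (mode.encode(), name.encode(), raw_hash)


def tree_body(entries: Iterable[Tuple[str, str, str]]) -> bytes:
    """Join all entries, sorted by name, into the content of a tree object."""
    ordered = sorted(entries, key=lambda entry: entry[1])
    return b"".join(tree_entry(*entry) for entry in ordered)


# 100644 is a regular file, 40000 a directory; the hashes are placeholders.
entries = [
    ("40000", "src", "c9" * 20),
    ("100644", "README.md", "00" * 20),
    ("100644", "LICENSE", "b0" * 20),
]
body = tree_body(entries)
print(len(bytes.fromhex("c9" * 20)))  # 20
```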
16:24
Once all of your files are processed, we can stick all the lines together, write the object, and return the hash. This is the last function that will interact with the git database. It is the one that stores the git commit. It is really quite simple. We write down the tree of the current folder, we create the commit according to the template we saw earlier,
16:44
and a few constants we just defined, and encode it into a bytes object, write it, and return the hash. The reason I'm using an f-string here is just to make my life easier when templating. It is totally safe to do so, since all the characters inside the commit object will always be valid UTF-8 characters,
17:03
or at least we can assume that they will be. This assertion isn't true for other objects, and we could have run into a decoding error. We're close to the end of the script, so all that's left to do is to create our script interface. We want to be able to call our script on the command line by providing a commit message as an argument.
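The commit template described earlier (tree, author, committer, a blank line, then the message) might be filled in like this. The author identity, the timestamp, and the function name `commit_body` are hypothetical values for illustration; the talk fixes these as constants so that hashes are reproducible.

```python
# Hypothetical constants; any fixed identity and timestamp would do.
AUTHOR = "Example Author <author@example.com>"
TIMESTAMP = "1627000000 +0000"


def commit_body(tree_hash: str, message: str) -> bytes:
    """Fill in the commit template with an f-string and encode it to bytes."""
    text = (
        f"tree {tree_hash}\n"
        f"author {AUTHOR} {TIMESTAMP}\n"
        f"committer {AUTHOR} {TIMESTAMP}\n"
        f"\n"
        f"{message}\n"
    )
    return text.encode("utf-8")


body = commit_body("00" * 20, "Example commit")  # placeholder tree hash
print(body.decode("utf-8"))
```

The resulting bytes would then be stored as an object of type commit, the same way blobs and trees are.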
17:25
Each of those arguments, including my_commit.py, will be placed in a list in sys.argv. One noteworthy thing is how the shell handles parameters. It normally splits them at spaces, unless you put quotes around them, like in this simple example.
17:45
We don't want to put quotes around, because we simply don't have any reason to do so. We don't have any special flags, we don't have many parameters. For that, we will just have to join them with spaces. A good practice to follow when creating a script is having your main code in an if __name__ == "__main__" branch, with all the underscores.
18:06
It is called the main guard. You may have already seen it. To understand what it really does, we need to understand what the __name__ dunder represents. We can write two little scripts, with start being run and imported being imported.
18:21
As you can see, inside imported, the value of __name__ will be "imported" itself. But in start, it will be the string "__main__". What this dunder actually holds is always the name of the module, except for the entry point, in which it will be the string "__main__".
18:42
The goal of this weird-looking branch is this: imagine that I want, for any reason, to make a new script that will rely on those little functions we just created. We will want to import our script, and this branch here will evaluate to false, preventing our script from actually triggering a new commit, because that's simply not what we want.
19:04
We just want to access the utility functions. So let's get back to our code. We will add an import sys at the end of our file, or beginning, sorry, and add our main guard. We'll start by checking that we actually have a second argument. A good rule of thumb to follow when making a command-line tool is that errors should be handled gracefully.
19:26
Imagine you want to use this tool you built two months ago, and you get a cryptic IndexError. Unless you dive into the source code, you won't be able to understand what actually happened. That's why error handling is so important. If there is an error, we follow the Unix standard by writing the script name followed by a colon and the error message.
19:46
After that, we exit with a non-zero code, signaling to the initiator of the script that it failed. We can simply create the message from the arguments, make a commit, and print a message. We are finally done with our script.
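A sketch of that script interface is shown below. The names `build_message` and `main` are assumptions, and the commit step is replaced by a print since the real function is not reproduced here. The call to `sys.exit` is left as a comment so the snippet can be pasted into an interactive shell without closing it.

```python
import sys


def build_message(argv: list) -> str:
    """Join every argument after the script name with spaces."""
    return " ".join(argv[1:])


def main(argv: list) -> int:
    """Return the exit code: 0 on success, non-zero on error."""
    if len(argv) < 2:
        # Unix convention: "<script name>: <error message>" on stderr.
        print(f"{argv[0]}: missing commit message", file=sys.stderr)
        return 1
    message = build_message(argv)
    # Stand-in for the actual commit step.
    print(f"would commit with message: {message}")
    return 0


if __name__ == "__main__":
    # Only runs when executed directly, not when imported by another script.
    exit_code = main(sys.argv)
    # A real command-line script would finish with: sys.exit(exit_code)
```

Keeping the logic in `main` and returning an exit code also makes the behaviour easy to check from another script, which is the reuse argument made for the main guard.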
20:01
Let's start our final test. Here is our original commit, starting with the hash 76. We will delete our .git folder, reinitialize the database, and run our script. As you can see, we have the same commit hash, meaning that the database is the same.
20:23
Our job has been successful. In reality, creating this script took me less than an hour, from research to the actual finished product. I would just like to take a minute to discuss how we structured this script and why I chose Python to do so. On this screenshot, you can see our entire code zoomed out.
20:44
Each little part of our script is divided into a more or less tiny function. This way, if I ever write another script that needs to access the Git internals, I can just snipe some functions from the script and save time. This is the main reason we wanted the main guard: to be able to reuse our utility functions.
21:02
Similarly, it will be easier to navigate if you ever go back to your code and have to change it. Modularity is more often the key than it isn't. Code documentation is also important, both through docstrings and actual comments. I would also recommend that you follow a code style such as PEP 8 to have a common style across all of your files.
21:25
Typing is also a good way of communicating what a function expects and spits out. All of these tools combined allow you to still understand your code weeks after you wrote it, which in my opinion is quite important for small scripts or even larger projects.
21:43
The last thing I would like to stop on is why I chose, out of all the possible languages, Python. Well, simply because we are at EuroPython, of course, it is as simple as that. Nah, I'm just kidding. For real, Python is an excellent choice for quick scripting in my opinion because it is fast to write.
22:04
You don't have to stop and wonder what kind of computer science madness you are going to use to make the compiler accept your types or anything. Duck typing is quite powerful here. Additionally, having an interactive shell is quite powerful to explore your data, like we did at the beginning with the git blob.
22:27
Plus, it is a language we are all more or less familiar with here. All of this makes Python a very powerful language in my very own opinion. That's it from me. Thank you for listening to me today at EuroPython 2021.
22:41
You can scan this QR code to get access to the final code, or we'll post the link in Matrix in a few minutes. Thank you so much, Matteo, for the talk. I have one question. Why did you build this in the first place? What were you trying to automate?
23:08
Well, this is more of an example script. It doesn't do anything useful, you might say. It is just a way to have fun with the git database, because I feel like looking at the internals of an existing tool is always very interesting.
23:29
I often do that with git, Docker and stuff like that. And it is also quite short to do. We could do it in 90 lines, which is quite good for a talk in my opinion.
23:46
The best way to learn about something is to write a client for it. Awesome. Thanks again for coming. We'll be moving to the break now since there's no more questions.
24:01
Thanks again Matteo. That's a pleasure.