Python Unplugged: Mining for Hidden Batteries
Formal Metadata
Number of Parts | 131
License | CC Attribution - NonCommercial - ShareAlike 3.0 Unported: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers | 10.5446/69495 (DOI)
EuroPython 2024, part 28 / 131
Transcript: English (auto-generated)
00:04
It's actually not a tutorial, so you don't really have to type along or anything like that. But if you want to have the same slides on your phone because the screen is too small or you can't see enough, or for the remote people out there who are not here in person, maybe it's nice
00:20
to follow along as well. So I can always go back to the slide and, yeah, show you this, if you want to rewatch some of that. Okay, where am I? I'm Torsten, I'm a polyglot software developer, architect, thingy; we can choose our titles ourselves, right?
00:41
So I write not only in Python but in a bunch of other languages. I've been a professional dev for about twelve and a half years, I would say. But I wrote my first code on the Amiga in QuickBASIC, back in the day: a really crappy text adventure based on a pen-and-paper game.
01:02
And I reproduced the crappy DVD logo, I don't know if you know that, at least the older people here maybe: the one which moved across the TV screen from edge to edge and never hit the corner. That's something I always wanted to see, so I wrote it myself. Yeah, basically that. Besides that, I'm a proud dad,
01:24
I'm a husband, also proud, ask my wife about that. I love reading and table tennis, and I love extreme metal because it helps me calm down. Might be surprising for some people, but still. What problem am I trying to solve here? I want you to learn the possibilities of the built
01:43
in standard library of Python. Python has a huge ecosystem, built in as well as on PyPI and so on, but I want to show you what is possible inside of Python itself without ever using pip install and stuff like that, or if you want to use uv or stuff,
02:00
one of the modern alternatives. Most of those built-in tools are super awesome for simple cases, and you don't always need external libraries like pandas and so on. But obviously it's really for the simple cases: as soon as you need some speed-up on large data, the stuff written in pure C is faster.
02:25
A bunch of those tools are useful in day-to-day work. I would say most of them, maybe not all, but at least most of them. And let's see: this talk is aimed at beginners, and maybe a little bit at intermediate Python developers, so hopefully most of you learn something new,
02:44
at least some of the tools. And, yeah, what am I not trying to solve? I won't solve any huge performance issues here. When you have that as a problem, most likely pandas or whatever tool you're using, or, when you're a web developer, FastAPI and so on, are the better solution.
03:04
Yeah. I won't solve any LeetCode problems here or something like that. So sometimes the solution I'm writing here is not the perfect solution; it is just there to show off the tool, not to give you the perfect solution. I basically have a bunch of chapters.
03:22
I hope we get, or we pretty likely get, through one, two, three, so fetching data, cleaning data and processing data. And maybe we can go into the miscellaneous and additional stuff, and it's very unlikely that we get to some vanilla Python server implementation later, but we can talk about that after the talk if you want.
03:43
What's the story? The first draft of this talk was like: okay, this is a tool, and this is a tool, and that goes on for like 45 minutes, and I thought it was a little bit boring. So who's familiar with Terry Pratchett? Yay. For those who are not familiar with Terry Pratchett,
04:02
he was a fantasy author who wrote very funny fantasy, and I tried to go along a little bit with the Terry Pratchett world. So in this case we are working on a project in Ankh-Morpork, which is a large city in this world, and there's the library of the Unseen University, and it has a Librarian who wants
04:22
to finally implement some digitalization magic. As a first step we already have some, yeah, magic funnel, let's call it an API in our world, sending us book data one by one. And our job is to clean the data, work on the data, save it in some Excel or CSV file, and because, yeah,
04:43
I had to have some goal for that. I created a full generator for random books with random titles and all that, but to be honest I didn't end up using a lot of it, so it was more like a procrastination project.
05:03
But it's very sophisticated and there is a lot of random data in there. Yeah, it was still fun. So we are working with this kind of book. In the example book here, for example, we have some random title,
05:20
authors, maybe the book was lent to someone, maybe it has been lent out for some time, and, yeah, we're working with that. Okay. First chapter: fetching data. And, yeah, I know it's the wrong fandom universe, but it was a good chance to put Data in there and, yeah, I loved it.
05:42
Yeah, everyone knows stuff like that. We use the requests library, which is one of the most popular libraries you can install and use, and a lot of other packages actually have it as a dependency. And what you can do with that: I want to fetch a book, and it requests it from some server
06:01
and gives us back some JSON. But it turns out we don't really have to do it like this, especially for our simple cases. We have urllib.request, and for normal POST and GET requests we can actually use that. And it's almost the same: we open a URL, we read the data, and then we can return it via json.loads,
06:24
which is another library already built into Python. So, for example, in this case we are fetching a book from our server, which we don't run today because of time constraints, but still, and this is the dictionary
06:41
you might get out of it. I move it to the top because we are going to continue at the bottom. Yeah. The first thing I want to show you belongs a little bit more in a talk about typing in Python, but I think it's still a good thing to show this
07:02
because I've already met a bunch of developers who didn't know about it or thought it was something totally different, but it really can help us write code. It helps the IDE give us the correct type hints, but it also has the positive side effect of giving us something to test
07:20
with various static code checkers, which check how well we are doing our job, because in the end we are human and we make mistakes, and static code checkers help us make fewer mistakes, or at least remove one category of mistakes. This is what it looks like. It still functions the same as a dictionary.
07:41
You can still work with it as always, but, yeah, it's now typed, and we put it in there after the arrow, so we actually know what this API gives us back. So a good idea for external APIs is to type the API so you really know what you get back.
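A minimal sketch of such a typed API response, using `typing.TypedDict` (the talk later calls it the "type dict"). The field names and the `fetch_book` stub are assumptions for illustration; the book data in the talk's repository may well look different.

```python
from typing import TypedDict


class Book(TypedDict):
    # Hypothetical fields; the real API response may differ.
    title: str
    authors: list[str]
    lent_by: str
    lent_since: str


def fetch_book() -> Book:
    # Stand-in for the real API call: returns a fixed example book.
    return {
        "title": "An Example Title",
        "authors": ["A. Writer"],
        "lent_by": "Somebody",
        "lent_since": "some time ago",
    }


book = fetch_book()
print(type(book))     # at runtime a TypedDict value is a plain dict
print(book["title"])  # editors and type checkers know this is a str
```

The annotation after the arrow is what lets an editor or a static checker flag a misspelled key or a wrong value type before the code runs.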
08:01
You can also specify that some keys are not always present, and you can do it the other way around: say that none of them are really required except those we explicitly mark as required. This is what it looks like in Vim. I'm a neckbeard Linux user, so of course I'm using Vim, and, yeah, basically it helps us get
08:22
to know what data we are working with. Okay, let's continue. This, for example, is the Librarian of the Unseen University library. Nobody knows his name, but, yeah, if we knew it, it would be easier to turn him back into a human. That is one thing from the story.
08:42
There are a lot of books and we only have limited resources, and this means we need to work on books in batches. There are plenty of solutions, and we can approach that from various angles, but let's first look at how we might have done it before Python 3.12, and it is existing code.
09:00
You find it in numerous places on Stack Overflow, and even ChatGPT advises you to use something like this. So, for example, in this case we have our batch function. It takes some list. It takes a size, and it gives us back a list of tuples of fixed size, which means if I have ten books, I always get only two at a time.
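The slide's code is not part of the transcript, so the following is only a reconstruction of the kind of list-slicing helper described here; the function name and sample data are assumptions.

```python
def batch(items: list, size: int) -> list[tuple]:
    """Split a list into tuples of at most `size` elements."""
    return [tuple(items[i:i + size]) for i in range(0, len(items), size)]


books = [f"book {n}" for n in range(1, 6)]
batches = batch(books, 2)
print(batches)
# [('book 1', 'book 2'), ('book 3', 'book 4'), ('book 5',)]
```

Note that this version needs `len()` and slicing, so it only works on sequences that are fully in memory, and the last batch may be shorter than `size`.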
09:24
Yeah, it works basically like this. I don't even know if it's actually necessary. I really just took this from ChatGPT to show you that it really is something you are advised to use, and in the end we're returning this
09:40
and we have some function. Of course we can improve this, and one of the tools that is really useful in this case is islice. I don't know if many people know about islice, at least in the beginning, but it allows us to work with generators of unknown size,
10:02
so not only with lists but actually with everything which generates values for us. And just because I think most people don't know it: we have an iterator. An iterator is basically an object which has __next__ and __iter__ as dunder methods, and an iterable is a subset of that,
10:25
containing just the __iter__ method, and this is the reason why I've used it here. This means I can take anything which has an __iter__ method and turn it into an iterator, and then I go through the batches and I can work with that.
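A minimal sketch of the `islice`-based batching described here, not necessarily the code from the slide. It accepts any iterable (anything providing `__iter__`), turns it into an iterator (which provides both `__iter__` and `__next__`), and pulls `size` items at a time. The `count(1)` input is an assumption, chosen to show that an endless source works.

```python
from collections.abc import Iterable, Iterator
from itertools import count, islice


def batch(iterable: Iterable, size: int) -> Iterator[tuple]:
    iterator = iter(iterable)  # works for anything that has __iter__
    # islice consumes up to `size` items; an empty tuple ends the loop.
    while chunk := tuple(islice(iterator, size)):
        yield chunk


endless_batches = batch(count(1), 3)  # count(1) never stops
first = next(endless_batches)
second = next(endless_batches)
print(first, second)  # (1, 2, 3) (4, 5, 6)
```

Because `batch` is itself a generator, only one chunk exists in memory at a time.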
10:41
This is preferable to the previous solution, because imagine we have a generator generating values, and I mean endless values, like it never stops. It will count to infinity, and in this case we can still work with that without creating a huge array, an infinitely large array, and, yeah, that is the first step.
11:04
I would say it's a little bit better already, at least before Python 3.12. But in Python 3.12 there's batched, and it does exactly what we want here, and we don't have to write it ourselves, and we don't have to use Stack Overflow or ChatGPT or whatever tool comes next.
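A short sketch of the built-in `itertools.batched`. The `try`/`except` fallback is an assumption added so the snippet also runs on interpreters older than 3.12; it is not part of the talk.

```python
from itertools import islice

try:
    from itertools import batched  # available from Python 3.12
except ImportError:
    # Assumed fallback for older interpreters, for illustration only.
    def batched(iterable, n):
        iterator = iter(iterable)
        while chunk := tuple(islice(iterator, n)):
            yield chunk


result = list(batched(range(5), 3))
print(result)  # [(0, 1, 2), (3, 4)]
```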
11:25
Okay, let's start using our batched function, or not ours but actually the built-in batched function. We now wanna fetch data from the library, which only has this API that streams books to us one after another, and this is the first step.
11:42
We fetch the books and we work on that in batches and do stuff with the book in batches and yeah, one of the things we could do, for example, we wanna save the data first. As a good data scientist we wanna save the raw data maybe first before cleaning it up because we might mess up the cleaning situation.
12:05
I think that is one of the tools many people likely know, which is csv, which is already built in, so no pandas is really necessary in this case, or Polars if you're on that side. And what we can reuse now are the keys of the TypedDict,
12:22
so we already have our field names for our CSV or Excel file. Next step: we write the header to our file, and then we can work in batches. For example, in this case we write the rows, taking 10 at a time. Keep in mind that in most cases we really have an I/O issue,
12:42
not so much a computation issue when we are working with data. Summary, okay, we started really small and we might have learned something new and I think many of the tools were already known but still I wanted to have a baseline. Obviously, I think you already knew CSV
13:01
at least knew that it exists. Obviously, I think, json as well, because I would say most people knew it already. TypedDict, at least some people, or most people, might have heard of, but that you can really use the keys, for example for writing some CSV, might be new,
13:22
and, yeah, I have all of those examples in more detail, with far more comments, in the code folder inside the repository. Remember, I had the link at the beginning. Okay, then I showed you urllib.request. It's basically a very simple way to fetch and send data via HTTPS.
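A minimal sketch of such a fetch with `urllib.request` and `json`. The URL is a placeholder and `fetch_book` is a hypothetical helper; the talk's book server is not public, so the function is defined but not called. The last lines only construct request objects, without sending anything, to show how the HTTP method is picked.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, used only as a placeholder.
BOOK_URL = "https://example.invalid/api/book"


def fetch_book(url: str = BOOK_URL) -> dict:
    """GET a URL and decode the JSON body into a dictionary."""
    with urlopen(url, timeout=10) as response:
        return json.loads(response.read())


# A Request with a body becomes a POST, one without a body stays a GET.
get_request = Request(BOOK_URL)
post_request = Request(
    BOOK_URL,
    data=json.dumps({"title": "An Example Title"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
print(get_request.get_method(), post_request.get_method())  # GET POST
```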
13:42
By default it is restricted to GET and POST, so no PUT or DELETE request to some API, and which one you get is determined just by whether I'm sending data: if I am, it's a POST; if not, it's a GET. And it's really not as bad as one might think, considering the vast number of modules meant to replace it, like requests, HTTPX, the async ones, like really a lot,
14:02
and it's definitely worth a try if you only have a few non-async requests you wanna do and don't wanna have a super large Docker image, for example. Then, this might be closest to something new: islice from the itertools module.
14:23
It basically works like this. Many people have already seen this, even in beginner talks, I would say, but it also works on generators of unknown size, and this makes it really nice to work with. Caveat: it cannot go backwards or use negative indices. So, for example, with normal slicing we can do this to reverse some list,
14:42
or to reverse a string when we do it with a string, but we cannot do it like this with islice; we cannot go back, because it only has some kind of pointer into some generator. And finally, we learned about batched.
15:01
It finally arrived not so long ago, which is why I think many people don't know it yet. It works on all kinds of iterable stuff, so if you can do "for x in y", it works on that, and it gives us tuples of the size we give it. So in this case, for example, we have a range of five,
15:22
numbers from zero to four, excluding the five, and we always wanna have three numbers, and we get this back. Does this work? Yeah, it works, cool. So: cleaning data. Yeah, I know, wrong universe again, but I like Data and it was an opportunity I had to take.
15:42
Duplicate books. So we want to fetch data from the library, but it is a magical library. That means the books will duplicate just randomly, because it's magical, and they will do it like this. And the orangutan Librarian of the Unseen University wants a list of unique books as CSV, obviously, because that's what orangutans want, right?
16:03
And as it is not a LeetCode talk, there will be no obvious solution like "if p is in np", everyone knows that already, obviously, and also no super clever text similarity function or deduplication algorithms or stuff like that. It really is about showing you the tools and not really about having a solution for deduplication.
16:22
Okay, this is our current version. We have some stuff we wanna save, and now we know that we have some duplicates, and I just brought in some function to filter duplicate books, but the filter magic is up to you; you really have to find something for yourself,
16:40
because I already have the books there. It's not really about the algorithm here, but what we now find out, and this helps us use our next tool, is that the books always duplicate in pairs. So they are always siblings next to each other, and that makes it a lot easier for us. So we have, in this case, a very naive approach.
17:02
We have the last book, we go through a book generator, and every time our current book is different from the last book, we write it to our CSV. And we have those None things here to make sure we always have some kind of a book here and don't get an error.
17:21
What we could do instead is use the pairwise tool, which I think I haven't seen used that much, to be honest, and it really does what the name says. It takes all the things inside a generator or a list or a tuple or whatever you choose, even a string, and it always gives us two siblings.
17:42
So if you take the alphabet, it gives us (A, B), (B, C), (C, D) and so on, and we can use that to write only the non-duplicates to our file, and, yeah, this should work, right? I'm not so sure, because in this case, for example, we have three books.
18:01
We only get one and two, and two and three, and we're always saving the first of those tuples. So in this case, the three will never be saved. That's bad. What we could do is add more code, and more code is always good because we are paid by the lines of code, I think.
18:20
Not. And, yeah, I could add more stuff and do an additional check and save it, and now it works, at least as intended, but to be honest I'm not really a fan, because it adds more complexity, and more complexity is always bad because we have to maintain it. And maybe we are switching jobs and it's not a bad thing
18:40
because it's the next person's problem, but, yeah, most likely we still have to maintain our own code. What we could also use is array destructuring, which, basically, JavaScript had first, but still: we can take the full generator, put it in a list, and defeat the full purpose
19:02
of a generator, which is that we can work lazily on some values. So in this case, yeah, with an infinite generator, it will create an infinitely large list, which is not what we might want. Obviously there is, again, something we have in the tool belt, which is called chain,
19:21
which chains different iterables and gives us first all the elements of the first iterable, then all the elements of the next iterable, and so on. You can put arbitrarily many iterables in there. In this case, I have a tuple with just one element, I put just one None in there, and I have our standard generator,
19:43
which is misspelled, great find, note to myself, and I will fix that later in the repository. And in the next step, we really check if they are different, and as it now really goes through all the books
20:01
and always writes the second book if it differs from the first book, we really get all the books and can finally have some unique CSV. Okay, what new modules did we learn about here? pairwise: the signature is pretty basic. We have an iterable, something we can iterate over,
20:23
and we get back another iterable, but it is nested, so we have tuples of size two, exactly two. It uses an iterator, so if you put in, for example, range(99999), it still won't crash our application even if we have very little memory.
20:42
So, yeah, and it creates a new iterator like that until it is, yeah. Another example: we can take a string, which is also iterable because we can iterate over the characters, and we always get two characters. It is lazy, which is why I have put a list call here,
21:02
so I directly get a result, but if I call it without the list, it does nothing at first. Okay, then we learned about chain. chain basically chains together different iterables, and there it is. And in this case, from hello and world,
21:21
it makes hello world, yay. The advantage: it's lazy, it doesn't create a large object like, for example, in this case, where I put together two lists or tuples or whatever we have in there by unpacking them. With chain we will have a smaller object
21:40
which basically just points to the right iterator we have there. There is also a somewhat hidden feature, because it's a class method: we can flatten a list of stuff, for example a list of lists with numbers inside. For flattening, I've already seen a lot of people
22:02
who have written their own code, but we don't have to do that. We can really do stuff like that, and in the end we get one long list with the things inside, and I think having this in your tool belt is actually useful because it makes the code a lot smaller.
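Both uses of `chain` in a short sketch with made-up sample values. The class method mentioned here is `chain.from_iterable`.

```python
from itertools import chain

# Chaining two iterables (strings iterate over their characters).
greeting = "".join(chain("hello ", "world"))
print(greeting)  # hello world

# Flattening one level of nesting with the class method.
nested = [[1, 2], [3], [4, 5, 6]]
flat = list(chain.from_iterable(nested))
print(flat)  # [1, 2, 3, 4, 5, 6]
```

`chain.from_iterable` flattens exactly one level and consumes the outer iterable lazily, so the outer container can itself be a generator.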
22:20
Yeah, and for the sake of completeness, no monkey thing here, but still: there's also something called ChainMap, and a ChainMap works similarly to chain, but it works on mappings. For example, we have dictionaries here, A, B and A again, and in this case, it goes through
22:41
every one of those mappings and gives us the first hit. In this case, A would give us one and not the three from the end, and so on. And attention: it is a different behavior than you have with dict A and dict B when you unpack them inside a new dict and create a completely fresh dict,
23:02
where the order is completely the opposite. And again, full examples, and much more complex examples with typing and everything, are in the repository. Okay, this is more, to be honest, I'm always honest, an excuse to show you grouping of context managers
23:22
because I think it's a feature more people could use because of the nesting. I don't like it if the code gets too nested, and in this case, just if you didn't know it yet, you can put multiple context managers together. Context managers are the stuff where you put with some stuff as some stuff,
23:42
for example, with open file as my file, and you could put arbitrarily many here and group them together and then, in one step, read from one CSV file and write to another CSV file, which is very nice to your memory. And, yeah, so again with the removed duplicates in this case.
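A self-contained sketch of grouped context managers with the built-in `csv` module, reusing the keys of a `TypedDict` as CSV field names as mentioned earlier in the talk. The `Book` fields, file names and sample rows are assumptions; the sketch works in a temporary directory and uses the naive last-row comparison to drop adjacent duplicates.

```python
import csv
import tempfile
from pathlib import Path
from typing import TypedDict


class Book(TypedDict):
    # Hypothetical fields for illustration.
    title: str
    author: str


FIELDS = list(Book.__annotations__)  # TypedDict keys double as CSV header

workdir = Path(tempfile.mkdtemp())
raw_path = workdir / "raw_books.csv"
unique_path = workdir / "unique_books.csv"

# Write a small raw file with adjacent duplicates to work on.
with open(raw_path, "w", newline="") as raw_file:
    writer = csv.DictWriter(raw_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows([
        {"title": "Book A", "author": "X"},
        {"title": "Book A", "author": "X"},
        {"title": "Book B", "author": "Y"},
    ])

# Two context managers grouped in one with statement: no extra nesting.
with open(raw_path, newline="") as source, \
        open(unique_path, "w", newline="") as target:
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=FIELDS)
    writer.writeheader()
    last_row = None
    for row in reader:  # rows are read and written one at a time
        if row != last_row:
            writer.writerow(row)
        last_row = row

print(unique_path.read_text())
```

Both files are closed when the `with` block ends, and because the reader is consumed row by row, the whole input never has to fit in memory.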
24:05
Okay, another thing: we have a library, and some of the books, I mean, it's a magical world, haven't come back to our library for, I don't know, 300 years or 500 years or something like that, and we want to ignore those books
24:21
because they are lost to us. And for example, for this I wrote some function called hide_lost_books, and in this case we have our book here again, and as we learned in the beginning, we have the TypedDict to really know what is inside an object, so we're using that here.
24:42
We are putting this here. Again, it's not a typing talk but I think it's always nice to have those little helpers for us as developers so I put that in here just because I can. So what we want to do is, the year looks really wonky to be honest
25:01
which means it's always looking like this, because it's a magical date, and I just came up with it to make my life more complicated in this case, and we really want to filter out all those books which were lent out too long ago. How does that look?
25:21
We have a function here which extracts the year, which is basically the very, very much hated regex, but this one is really simple. It really just says: give me a number. And this should be easy, right? I want to have a number, and this should be correct, right? But I don't trust myself, and I don't really want to try all kinds of numbers by hand, that's boring, and we have computers
25:42
for automating stuff, for example testing stuff. pytest is an awesome tool for testing and you should use it, but if you have really small scripts and small programs, you can also use doctest, and it is actually pretty easy to use. We just need some documentation
26:01
and we all have documentation in our code anyway, right? Most of the time. And, yeah, let's start with the same function but add some minimal documentation. In this case, some explanation and, afterwards, how we would call it and what we expect the result to be. We can call it like that, by the way,
26:21
really simple: we give the full path and run it with the doctest module, which is inside the Python standard library, and it tells us, yeah, we did not do well, because we got just the first digit. What we can now do is fix it, because we now know about this problem, and if we fix it and if we have this running of the test
26:43
in our pipeline, our deployment pipeline, we make sure that this error will never happen again, even because of some small change, maybe some kind of refactoring. What we now do is fix it, because we say: okay, it needs to be at least one digit but it can be more digits,
27:01
which already makes our life a lot easier, and we can add a bunch more tests, for example a negative number, and as I correctly assumed, we didn't catch negative numbers so far, and now we can handle that as well. We are confident enough; it's not 100%, there might still be edge cases,
27:21
but we are confident enough that this works now, and we can go on filtering out our books. So in this case, we are extracting the year and comparing it to some arbitrary value. Obviously, we should put that in some constant at the beginning of the file, because with magic values, who knows in two years why we chose minus 300? But still, yeah.
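A sketch of the final state described above: a regex that accepts an optional minus sign and one or more digits, documented with doctests, plus a named constant instead of a magic value. The date format in the examples and the names `extract_year`, `is_lost` and `LOST_BEFORE_YEAR` are assumptions; the talk's actual magical date format is not in the transcript.

```python
import doctest
import re

# Hypothetical cut-off: books lent before this year count as lost.
LOST_BEFORE_YEAR = -300


def extract_year(magical_date: str) -> int:
    """Extract the (possibly negative) year from a magical date string.

    The date format below is made up for this sketch.

    >>> extract_year("Year 1987 of the Something")
    1987
    >>> extract_year("Year -350 of the Something")
    -350
    """
    # -? allows a minus sign, \d+ requires one or more digits.
    match = re.search(r"-?\d+", magical_date)
    if match is None:
        raise ValueError(f"no year found in {magical_date!r}")
    return int(match.group())


def is_lost(magical_date: str) -> bool:
    return extract_year(magical_date) < LOST_BEFORE_YEAR


if __name__ == "__main__":
    doctest.testmod()  # silent when every example passes
```

The same examples can also be run from the command line with `python -m doctest -v` followed by the path to the file.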
27:43
Then another thing we use is yield from. I mean, who has heard of yield? Okay, great. Who has heard of yield from? A little bit fewer, nice. So basically, it does what you would expect it to do,
28:01
it takes something you can iterate over and yields its values, but it also takes into account that we have some magic inside of Python: for example, generators can also receive values. They not only can give us values, they can actually receive data and work with that data if we want them to, and yield from actually takes care of, yeah,
28:23
just forwarding the data from whatever uses hide_lost_books into whatever is inside our iterable, which takes away a lot of the work we would otherwise have to do ourselves. Also, we want to keep track of the lost books because, yeah, customers,
28:41
they change what they want very often, and in this case, they not only wanted to hide the lost books but actually wanted a list of them afterwards. What we can do is use global state, because why not, because it's the easy way, and in my opinion it's really bad, because global state can mess with us and our code
29:01
and it's, I would say never but let's be cautious, almost never a good idea and yeah, we could also use that inside the function. So we have a local variable and in the end, we yield none to show off hey, now we gave you all the books which are there
29:20
and now I yield you all the lost books. I'm not a fan of that either, because it messes with the return type: I always have to take care of Book and None and not only Book, and I want to keep it simple, which is funny because this does not look simpler, to be honest, but what this does is basically,
29:42
it says that our function yields us books, so it generates one book after another; it expects no data, so this is the sending thing I talked about; and it also returns something, so it can yield and return stuff. Yeah, this is what it looks like: we append it here and then we return it
30:02
at the very end of the books, and this is how we would use it. So we have some iterable in our hide_lost_books, I have hidden away, funny, because it's hide, hidden away the code, and I iterate over it here, I have the non-lost books,
30:20
I could totally consume the full generator just so we have the return value at the end, and this is how we get the return value. It looks a little bit wonky. That's because of how Python works: when it's iterating through stuff, at the end it raises an exception called StopIteration, so it's not really an error,
30:41
but yeah, this is how it's done, and I don't think we're gonna change that soon. But inside that exception we can get the return value, and that allows us to get the generated items as well as other stuff. Yeah, summary: generators yield values; generators also always return a value
31:02
inside of StopIteration; by default it's None, or whatever we defined in the header. Is it a good idea? No, to be honest, most likely not, as it's kind of surprising behavior, and you really want your code to be non-surprising. If you have some new people working in your team, or maybe someone fresh from university
31:21
or fresh from school or anything like that, they should have easy code to work with, because that makes it easier for them to be helpful very early. But why did we just see it? Why did I show you this? Maybe just to waste your time? But no: sometimes in a one-off script it's a little bit faster, sometimes it's okay to have a hacky solution,
31:41
maybe you're taking part in Advent of Code and just want a one-off solution nobody needs to maintain afterwards. And it's still good to know that generators can be a little bit more complicated than just yielding values. The middle argument I already talked about: it allows us to really send stuff into the generators
32:02
and maybe adjust the behavior, for example. Obviously that's a little bit out of scope, I would say, but still I wanted to show it, because that would have been the first question I would have asked if I had seen it. Yeah, but how would I do it? Because I said I wouldn't do it like this
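The return-value mechanism described above can be sketched roughly as follows. This is only an illustration, not the code from the slides: the Book record, its fields and the sample titles are assumptions made up for the example, and hide_lost_books is written the way the talk describes it (yield the books that are there, return the lost ones).

```python
from collections.abc import Generator, Iterable
from typing import NamedTuple


class Book(NamedTuple):
    # Hypothetical stand-in for the talk's book records.
    title: str
    lost: bool


def hide_lost_books(books: Iterable[Book]) -> Generator[Book, None, list[Book]]:
    """Yield the books that are not lost and return the lost ones at the end."""
    lost_books: list[Book] = []
    for book in books:
        if book.lost:
            lost_books.append(book)
        else:
            yield book
    # The return value travels inside the StopIteration exception.
    return lost_books


library = [Book("Mort", False), Book("Eric", True), Book("Sourcery", False)]

generator = hide_lost_books(library)
visible_titles = []
while True:
    try:
        visible_titles.append(next(generator).title)
    except StopIteration as stop:
        # stop.value holds whatever the generator returned.
        lost_titles = [book.title for book in stop.value]
        break
```

A plain for loop would swallow the StopIteration and with it the return value, which is why the sketch drives the generator with next() by hand.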
32:20
and it's the perfect time to learn a new thing: we use an object to track the data, and in this case I want to show you NamedTuple. This has already been in the standard library under a different name for a long time, without the uppercase N and T, but it allows us to have some kind of container of data which is immutable, so we cannot change it,
32:42
but it's also easier to access than with index zero and one: we can give the fields names, which makes it a lot nicer. It's relatively cheap: we only transport objects, which means we actually only pass references to those objects and don't really copy all the data, and we then yield this container
33:01
with always the most current version of the data. This is what it looks like: we have this BookMeta thing (I didn't come up with a better name), where we have the current book we are yielding and the lost books as a list. And as you can see in the lower part, we have an object containing the lost books;
33:21
we still add them here, but instead of just yielding one book after another we're yielding a BookMeta, which contains the book as well as a reference to the lost books. Yeah, okay, chapter three: summaries and key data. Now the wonderful librarian wants us
33:42
to get some data, to get to know what dimensions we're working with here: how much stuff we have, what kind of data, and so on. As the first step he wants us to create an index of all the words in the titles, so maybe we can search for them or stuff like that. It's very simple, so we have for example the counts:
34:00
"for" 123 times, "a" 42 times, "magician" 99 times, and so on. I would call the function something pretty basic like most_common_words_in_title, because he wants to know all the words in the titles, and we use a dictionary as a container to save those words and their counts. Then we go over all the words of our book here,
34:24
in our title, so we split the title and then go over every word in it and count it up. And obviously we have to take care of the case where we don't have the word yet, so we have to create a new key with an initial value inside our word counter here, so that we don't get a KeyError.
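A minimal sketch of the manual counting just described, assuming for simplicity that the function receives plain title strings rather than the talk's book objects (the sample titles are made up for the example):

```python
def most_common_words_in_title(titles):
    """Count how often each word occurs across all titles."""
    word_counter = {}
    for title in titles:
        for word in title.split():
            # Create the key first, otherwise the += below raises a KeyError.
            if word not in word_counter:
                word_counter[word] = 0
            word_counter[word] += 1
    return word_counter


titles = ["The Colour of Magic", "The Light Fantastic", "Equal Rites"]
counts = most_common_words_in_title(titles)
```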
34:43
And this one I think many people already know: it is defaultdict. But I think it's very important, it saves a lot of time and code, and this is the reason why I still put it in here. It basically says: okay, I'm a dictionary, and whenever you ask me for any kind of key
35:00
I use this function, in this case int, to create some value for you if it's not there already. In this case int without any parameters gives us back zero, and this makes our code a little bit more Pythonic, because we can now just add stuff up without really taking care
35:20
of whether we have already counted the word or not. But now the librarian updated his assignment, and I'm so glad customers in the real world always know what they want in the first iteration, but bear with me in this case. So instead of the most common words in the title, he wants to know them for all the data,
35:41
so what we are doing now is adding all the parts where we have strings inside. I just took three, but you can imagine that there are more columns in our data. We split them and do it like this, and actually this works, it's fine to be honest. But I wanna use this opportunity to show again the combination with chain,
36:00
so we chain all the words, and also to show you a little bit more about our word generator. In this case we have a word generator which actually creates a chain of words, or rather a nested chain of words, which is not easy to work with, because in this case we would always have to go
36:21
through the whole generator but also through the inner generators. That brings us back to chain.from_iterable, which flattens the whole generator back to one word after another. That makes it a lot easier to count, which we do with Counter from the collections module. So Counter basically counts, surprise.
36:42
And it gives us back a mapping from whatever it counts to the count of it. You can convert it to a dictionary or, yeah, just work with it, and you don't have to count yourself. Obviously, we remember what I said about surprising behavior.
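The combination of a nested word generator, chain.from_iterable and Counter might look roughly like this. The dictionary rows and the column names title, author and shelf are assumptions for the sake of the example; the talk's real data has its own columns.

```python
from collections import Counter
from itertools import chain

# Hypothetical rows: each book has a few text columns.
library = [
    {"title": "Equal Rites", "author": "Terry Pratchett", "shelf": "Magic A"},
    {"title": "Sourcery", "author": "Terry Pratchett", "shelf": "Magic B"},
]

# One inner chain of words per book, so this generator is nested.
words_per_book = (
    chain(book["title"].split(), book["author"].split(), book["shelf"].split())
    for book in library
)

# chain.from_iterable flattens the nesting into one word after another,
# and Counter does the counting in a single pass.
word_counts = Counter(chain.from_iterable(words_per_book))
```

Nothing is materialized along the way except the Counter itself, so memory use grows with the number of distinct words, not with the number of books.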
37:00
This might not be surprising, but when you're a junior you don't know what's happening here, because there are so many concepts in this one. At least I wouldn't: if I were fresh to Python I would say, whoa, iterable, what's all that? For that reason I would really advise pushing concepts out into functions
37:21
which each contain a single concept. words_in_book: okay, I can get that now. And I can even go one step further and add generate_words_from_library, so we have another concept here. And if I wanna be really extreme, I can even alias chain.from_iterable as flatten,
37:40
because it actually flattens. To be honest, I'm not sure if I would really do that here or rather put a comment there, but just so you know, you can do it like this. Let's add some more data, and no data picture this time, sorry. Obviously this is simple: for the min length here we just take the min of everything,
38:03
and this is trivial; I would say it's the same for max, the maximum is also trivial. Then we maybe want to know which family lent the most, because we wanna have some statistics on which family chose to be really open to libraries, and we are also interested
38:22
in the largest bookshelf in the library, for example, and there are a lot more ideas we could use, but the basic gist is that we have a bunch of different metrics we wanna extract. Okay, let's start with the family that lent the most. Basically we expect that it gives us back a string and an integer: the family name
38:41
and how many books. We skip over the books that are not lent by anyone, because when they're not lent by anyone we can just skip them and ignore them. You could also reverse this check with if book.lent_by and then put all the stuff inside of that,
39:01
but I really like to have as little indentation as possible. Yeah, then we split the name and take the family name. Obviously we have a lot of people here from different countries, and a name does not always consist of just two parts, but for that we can use maxsplit
39:22
and really say: okay, we just assume the first thing in your name is your first name and the rest is your last name. We don't care about middle names, because this is the Discworld, and in this case we just assume that it works like this. Yeah, again we can do the whole defaultdict thing,
39:42
and you know that already, so we are using it here as well: we use the defaultdict to count by family name. But what we can also do, since that's a lot of code, we can really... wait, let me go to...
40:04
It's grayed out, but I hope you can see it anyway: we can use groupby, and groupby allows us to give it a key function to, yeah, group our collection by some data. This helps us, for example,
40:21
because the grouping needs to be done in some way. Obviously, or not obviously but maybe surprisingly for you, groupby only groups consecutive items, so if the key changes back and forth it will generate multiple tuples for the same key, and that makes it a little bit weird. Just bear with me: so we have to sort it first,
40:40
which totally defeats the purpose of a generator again, but still I wanted to show you groupby, and for now we're just fine with totally consuming the generator. In the end we are again using a NamedTuple as the return value, because a NamedTuple is always nicer to read than a plain tuple: now we can use biggest_lender.family
41:01
and biggest_lender.lent_books instead of index zero and one to access it. And yeah, yes, this sorted part consumes our iterator; this is not so nice, to be honest, but still we need it for groupby, because in the end groupby really groups together values in tuples like this,
41:23
and always the same key in the same place, and as soon as the key changes it starts a new tuple. So, yeah, if you ever use groupby, this is a footgun, and I've shot myself with it in the past. Footgun, for the non-native English speakers: it's a gun which points directly at your feet, and shooting it is not a good idea.
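The groupby footgun described above can be reproduced in a few lines. This is a sketch under assumptions: the lender names are invented, family_name is a hypothetical helper that follows the talk's "first word is the first name, the rest is the family name" rule, and BiggestLender with its fields family and lent_books is named after what was said on stage.

```python
from collections import defaultdict
from itertools import groupby
from typing import NamedTuple


class BiggestLender(NamedTuple):
    family: str
    lent_books: int


def family_name(lent_by: str) -> str:
    # Assume the first word is the first name and the rest is the family name.
    return lent_by.split(maxsplit=1)[-1]


# Hypothetical lent_by values, one per book; None means not lent by anyone.
lenders = ["Sam Vimes", "Esme Weatherwax", "Sybil Vimes", None,
           "Granny Weatherwax", "Sam Vimes"]
families = [family_name(name) for name in lenders if name]

# Without sorting, groupby starts a new group every time the key changes.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(families)]

# Sorting first puts equal keys next to each other, but consumes the input.
sorted_groups = [(key, len(list(group)))
                 for key, group in groupby(sorted(families))]
biggest_lender = BiggestLender(*max(sorted_groups, key=lambda pair: pair[1]))

# The defaultdict variant needs no sorting and makes a single pass.
counts = defaultdict(int)
for family in families:
    counts[family] += 1
```

With this sample data the unsorted call produces one group per book, because the two families alternate, while the sorted call produces one group per family.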
41:43
Yeah, so the librarian changed the (did I get closer to this?) changed the requirements again, because it's the only one, right? "I want the total lenders, the unique family names and the number of unlent books." We have these three functions to extract the data; because it's not so complicated
42:01
we won't go into that further. But we also want to have one statistics object in the end, again a NamedTuple, because it has a small footprint, it's easy to use and it's immutable, and I like immutable data because it cannot break somewhere in between. Yeah, then we have our function
42:21
which is called gather_statistics. Basically what we can do is just call our functions, put the results in there, and everything's fine, right? Yeah, but the generator is consumed in the very first function, and that just might not be what we really want. So maybe it's a good idea to just pre-generate
42:41
all those books, put them in a list, and then pass the list to all three functions. Then it works, it should work actually, but the advantage of our generator is completely gone now, and we have our huge list, which might have billions of books inside.
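The consumed-generator problem can be shown with a small sketch. The dictionary rows and the helpers total_lenders and unlent_books are hypothetical stand-ins for the talk's statistics functions:

```python
def generate_books():
    # Hypothetical data source; imagine this streaming a huge library.
    yield {"title": "Mort", "lent_by": "Sam Vimes"}
    yield {"title": "Eric", "lent_by": None}
    yield {"title": "Pyramids", "lent_by": "Sybil Vimes"}


def total_lenders(books):
    return len({book["lent_by"] for book in books if book["lent_by"]})


def unlent_books(books):
    return sum(1 for book in books if not book["lent_by"])


books = generate_books()
lenders_from_generator = total_lenders(books)  # consumes the whole generator
unlent_from_generator = unlent_books(books)    # sees an empty generator

# Materializing the books first gives correct results,
# at the cost of holding every book in memory at once.
book_list = list(generate_books())
lenders_from_list = total_lenders(book_list)
unlent_from_list = unlent_books(book_list)
```

The second call on the generator silently returns 0 instead of failing, which is what makes this mistake easy to miss.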
43:01
So it really doesn't solve our problem. There are two tools which might help us: functools.reduce and itertools.tee can help us here. I have to say tee is more like a fake solution, but I wanted to bring it in here because this guy has a reduced teacup and I needed to get it in there.
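As a rough illustration of why tee only looks like a solution, here is a sketch with made-up lent_by values. itertools.tee splits one iterator into several independent ones, but it has to buffer every item that one of them has seen and the others have not, so consuming them one after another ends up holding all the data in memory anyway.

```python
from itertools import tee

# Hypothetical stream of lent_by values, one per book.
lent_by = (name for name in ["Sam Vimes", None, "Sybil Vimes", None])

# Two independent iterators over the same underlying generator.
for_lenders, for_unlent = tee(lent_by, 2)

# The first pass runs to the end; tee buffers every item for the second pass.
total_lenders = len({name for name in for_lenders if name})
unlent_books = sum(1 for name in for_unlent if name is None)
```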
43:22
Let's start with reduce. So we have our functions here. It's getting kind of crowded, sorry for that. Then we have some mutable object, in this case a dataclass. I think dataclasses are well known by now; five years ago it was like, what? And this really is there to accumulate our data.
43:43
So we have a set of our lenders, we have a set of our families and we have the unlent books. Which basically means that in this case we always have just one of each item, because that is what sets do: a set only contains one of each item. Then we have our gather_statistics function again,
44:02
which gives us back the statistics from before, and this is what we adapt now. First step: we need a statistics reducer, which is a complicated name, but in the end it's basically taking a bunch of stuff and at the end we have one thing, a different thing. And in this case our statistics accumulator
44:21
is the thing we always take from the last run and give to the next run of this statistics reducer. And the book is in this case the part where we iterate through all the books, so this is always a different one. What we're doing here: if the book is lent by someone, we add that to the lenders and the families sets,
44:41
and if it was not lent, we increase the unlent books count by one. Then we continue with the accumulator for the next run of the statistics reducer. This allows us to get all the data we want by going through all the items just once, which makes it pretty nice to use,
45:02
and in the end we call it like this and then use the result here. Why do I use different objects in this case? I really like to have the... am I getting close? I really like to have the stuff divided,
45:22
so we have mutable data inside, which is fine because it makes things easier, but immutable data outside, because that almost always makes stuff easier. How much time is left? Because we are pretty... No more. No more? Oh no. I should have learned to rap, but I'm a metalhead, so that makes it a little bit hard.
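The single-pass reduce approach from this part of the talk could be sketched like this. It is a reconstruction under assumptions, not the slide code: the Book record, the field names and the sample data are invented, while the split between a mutable dataclass accumulator inside and an immutable NamedTuple result outside follows what was described.

```python
from dataclasses import dataclass, field
from functools import reduce
from typing import NamedTuple, Optional


class Book(NamedTuple):
    # Hypothetical stand-in for the talk's book records.
    title: str
    lent_by: Optional[str]


@dataclass
class StatisticsAccumulator:
    # Mutable on purpose: it is updated once per book inside the reducer.
    lenders: set = field(default_factory=set)
    families: set = field(default_factory=set)
    unlent_books: int = 0


class Statistics(NamedTuple):
    # Immutable result handed to the outside world.
    total_lenders: int
    unique_families: int
    unlent_books: int


def statistics_reducer(accumulator, book):
    if book.lent_by:
        accumulator.lenders.add(book.lent_by)
        accumulator.families.add(book.lent_by.split(maxsplit=1)[-1])
    else:
        accumulator.unlent_books += 1
    # The returned accumulator is passed into the next run.
    return accumulator


def gather_statistics(books):
    accumulated = reduce(statistics_reducer, books, StatisticsAccumulator())
    return Statistics(
        total_lenders=len(accumulated.lenders),
        unique_families=len(accumulated.families),
        unlent_books=accumulated.unlent_books,
    )


def generate_books():
    yield Book("Mort", "Sam Vimes")
    yield Book("Eric", None)
    yield Book("Pyramids", "Sybil Vimes")
    yield Book("Jingo", "Sam Vimes")


statistics = gather_statistics(generate_books())
```

The generator is walked exactly once, and only the two sets and one counter are kept in memory, not the books themselves.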
45:41
I'm so sorry. No problem, yeah. But we're pretty close to the end, so it's okay. No, it's okay, it's okay. I'm so sorry, it seems really interesting, but if you want to know more or have any questions, you can find him after his talk, and you can also reach him on Discord.
46:02
Yep. And thank you for your talk. You can give him a round of applause. Thank you.