Things I wish I knew before starting using Python for Data Processing

Formal Metadata

Title
Things I wish I knew before starting using Python for Data Processing
Title of Series
Part Number
55
Number of Parts
169
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Miguel Cabrera - Things I wish I knew before starting using Python for Data Processing. In recent years, one of the ways people get introduced to Python is through its scientific stack. Although this is not bad, it may lead to learning solely one aspect of the language while overlooking other idioms and functionality included in Python, as well as some basic good practices of software development. I will share some useful tricks, tools and techniques, and software design and development principles that I find beneficial when working on a data processing / science project. ----- In recent years, one of the ways people get introduced to Python is through its scientific stack. Most people who learned Python this way are not trained software developers, and many times it is their first contact with a programming language. Although this is not bad, it may lead to learning solely one aspect of the language while overlooking other idioms and the standard and common libraries included in Python, as well as some basic good practices of software development. This may become a problem when a data science project moves from an experimentation phase to integration with a technical environment. In this talk I share some useful tricks, tools and techniques, as well as some software design and development principles that I find beneficial when working on a data processing / science project. The talk is divided into two parts. One is Python centered, where I will talk about some powerful Python constructs that are useful in data processing tasks. This includes parts of the collections module, generators and iterators, among others. In the other I will describe some general software development concepts, including SOLID, DRY, and KISS, that are important for understanding the rationale behind software design decisions.
Transcript: English (auto-generated)
talk of this session. It's titled, Things I Wish I Knew Before Starting Using Python for Data Processing. So ladies and gentlemen, please welcome our next speaker, Miguel Cabrera. So welcome to my talk.
I remember this room being smaller last year. I don't know. So my talk uses a clickbait title, but I won't show any ads today. So my name is Miguel Cabrera. I'm gonna talk to you about some things I learned
in the last few years working with Python that I would have preferred to know before I started using it for data processing. So, quick introduction. I'm Miguel, I'm from Colombia. I live in Berlin. I work for a company in Munich called TrustYou. We do data processing for hotels,
and as I said, I've been doing Python for just a couple of years, so this is more like a beginner-to-beginner talk. However, I think if you're starting with Python, and in particular in the data science area, you're gonna take some good stuff from this talk.
That's my contact information. So, the priors for this talk, where I think you are. You are relatively new to Python. You have used Python mostly through the scientific stack: NumPy, SciPy, and so on.
Your work, or your desired work, has the word "data" in it, or machine learning. You are not necessarily a trained software engineer. So if you have years of software engineering experience, you're probably gonna get bored in this talk.
So it's your opportunity to walk away. I can give you two minutes now. No, just kidding, but you can leave if you want. So who's who? I wanna know: who wants to be, or who is, a data scientist? Please raise your hand. Okay. Data analysts? Data engineers? Yeah.
Machine learning developers? That's cool, okay. I hope I'm not gonna bore you. Software developers? Yeah, so you've been warned. It's really basic, really high level.
So if you already have experience, you might get a bit bored. This is really basic. Any other title? Okay. So the agenda: we're gonna talk about some basic concepts and practices, in particular object-oriented programming. Then I'm gonna talk about some goodies of the collections module.
Then something about iterators and iterables. This was a collection of things I wanted to show; I had many more things prepared, but because of time I had to pick the ones I like the most. As they are different things, there are different levels of abstraction,
so be warned that we're gonna switch from a really high level to some code, and we're not gonna go in depth into any of those. However, I will give you some pointers and some directions where you can get more information. So this is like a meta-talk, so to say. But let's start with a story. This talk, as I said, is based on my experience,
but it's also based on the experience of my colleagues and my interaction with them. So let's talk about David. David just graduated from the university. He's a math PhD. He mostly uses R and MATLAB, and he comes to work for a company where he has to do mostly Python.
He starts writing code to classify some documents, text documents, for example. He uses NLTK and scikit-learn, which I assume you are somewhat familiar with. The thing is, he writes a really nice IPython, sorry, Jupyter notebook with the code. It runs, it has nice graphs, and so on.
This is a random image, so don't try to get anything from it. And then my boss tells him, okay, you have to integrate the code into our code base. So he has to go from a Python notebook to a really big project with a lot of dependencies,
a lot of scripts and so on, and of course he is lost. So he tries his best to do that without knowing what's going on, and he ends up writing what some people call spaghetti data science code, which is the same as spaghetti code, but for data science.
So if he has to integrate the code, it's gonna be really bad for him, and if someone else has to integrate the code, that person's gonna hate him forever. Almost forever. So how do we prevent that from happening when we start doing data science
and we actually have to integrate data science and machine learning, scikit-learn code, into our code base? The thing I can say is: go back to the basics. In a nutshell, I think data engineers or data scientists have to become more like software developers and meet in the middle. And how do you do that? Well, first we have to make a distinction
between code and software. I take this from a talk by Daniel Moisset at PyData Berlin this year, and I really like how he put it. So code is something that runs on a computer. When you write a script or a Python notebook, you're probably writing code.
Code might not have tests, follow any convention, or have documentation. You just write the code and it does the job, and that's okay. Software, on the other hand: some people think it's just the program text inside a deliverable. Some people think it's the whole thing, including all the scripts, deployment scripts,
testing, documentation; even customer support or technical support is part of the software. And you want to create software that is maintainable, testable, deployable, and all the "-ables" that you can put in. So the question is, what's the way to
transform my code into software? So let's go back to Python, to the basics. What is Python then? If I'm gonna work in Python, what is Python? The important thing here, which I got from the Python documentation, is this: Python is an object-oriented programming language.
So as a data scientist, you should know what an object is and how to use objects. So the first piece of advice I'm gonna give you is: as a Python data scientist or data engineer, learn how to use objects.
And for that, I'm gonna give you a really quick and dirty introduction to what objects are. So, objects, three main concepts: the objects, the data they have, called attributes, and the operations on that data, called methods. That's it in a nutshell. And how does an object look in Python?
Well, before going into that, I just want to draw a distinction between cookie cutters and cookies, sorry, classes and objects. So what is a class, and how is it different from an object? A class is kind of like the template that you use to create such objects. In the case of cookies, it's the cookie cutter; if you're from the UK, you're gonna call it a biscuit.
And you take the cookie cutter to create many, many, many cookies, and you eat them, hopefully, afterwards. Not all of them, because that's gonna be bad for you. So in Python, this is how an object looks. It has a name, a class, sorry; that's the template for creating cookies,
or the cookie cutter in this case. It has an init, a constructor function, that is called every time you instantiate it. And it has some data, that's the attributes, and some methods. Right, now you're already an expert in object orientation in Python.
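A minimal sketch of what such a class might look like. The `Cookie` name comes from the talk; the `flavor` and `eaten` attributes and the `eat` method are made up here for illustration and are not the speaker's slide code.

```python
class Cookie:
    """The cookie cutter: a template for creating cookie objects."""

    def __init__(self, flavor="plain"):
        # Called every time the class is instantiated.
        self.flavor = flavor   # attribute: data held by the object
        self.eaten = False     # attribute with a fixed starting value

    def eat(self):
        # Method: an operation on the object's own data.
        self.eaten = True
        return f"Ate a {self.flavor} cookie"


# Each call to the class creates a new, independent object.
first = Cookie("chocolate")
second = Cookie()
message = first.eat()
```

Eating `first` does not change `second`: both were cut from the same template, but each object carries its own attributes.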
If you want to create a cookie, you just instantiate the Cookie class. One of the key concepts in object-oriented programming is that you can subtype, so you can extend one class to make it do something special. In this case, an alfajor is just a type of cookie
that is eaten in Spain and also in South America. With this example, I extend the Cookie class and I just add additional attributes. So, who's familiar with scikit-learn? Just raise your hand if you use Jupyter or the IPython notebook. Yeah, so, not so much.
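Returning to the alfajor example for a moment, here is a sketch of extending a class with an extra attribute. The `filling` attribute and its default value are assumptions chosen for illustration.

```python
class Cookie:
    def __init__(self, flavor="plain"):
        self.flavor = flavor


class Alfajor(Cookie):
    """A special kind of cookie that adds one attribute."""

    def __init__(self, flavor="plain", filling="dulce de leche"):
        super().__init__(flavor)   # reuse the parent initializer
        self.filling = filling     # attribute only alfajores have


alfajor = Alfajor(flavor="vanilla")
```

An `Alfajor` is still a `Cookie`, so any code that accepts a cookie can be handed an alfajor.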
But when you're working in IPython or Jupyter and you're calling scikit-learn, for example, you're writing statements. Many of those statements look like this, and what you're doing there is actually creating objects, calling objects, and interacting with objects. So you have to be aware of what you're doing
and how you can use that to your advantage. So how do I write good object-oriented code, now that I know how to write code? That's a really tough question and I don't plan to answer it today, but I'm just gonna give you some tips. There are some basic ideas in the object-oriented world, and actually in the programming world, that you want to know.
One is DRY: don't repeat yourself. So if you're writing scripts and you feel you're repeating code, copying the same code from one file to another, you might want to create an object out of it and reuse it. One of the key goals of using object orientation is to reuse things. KISS: always keep it simple.
Don't try to put a lot of things inside objects. And also, use the SOLID principles. These are really abstract principles. I'm not gonna go into details, but basically the first one is that one class should do only one job, and, well, I'm gonna skip the rest because of time,
but my recommendation is to check them out, and to check the link below. It's really important, and really nice, to know these concepts. So the first thing I think is important, if you're gonna start doing serious data science with Python, is to learn object-oriented
programming in Python and master it. So the next thing that you have to learn, once you already know how to organize your code in objects: there are things called conventions. Conventions are like table manners for developers. There are some things you want to do so you don't annoy other people
with what you're doing, what you're eating, or what you're programming in this case. When data scientists try to integrate their code, one of the things that annoys people the most is that they have no idea of conventions, and you always have to send the code back to be fixed, or you fix it yourself. Why conventions?
Well, conventions are important because readability counts. And they are small details, actually. Things like: do you use spaces or do you use tabs? What are the indentation rules? How do you organize the code in a file? PEP 8 is the de facto standard, so you should learn it.
There are some resources online for you to check. This is a nice, user-friendly way of learning it: pep8.org has examples of the right versus the wrong way of doing things. There are many details in these conventions, and you might think, oh, so many things to learn.
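To give a flavor of the kind of detail PEP 8 covers, here is a small right-versus-wrong sketch. The function is a made-up example, not taken from the talk or from pep8.org.

```python
# Not PEP 8 style: camelCase names, 2-space indentation, no spaces
# around operators, and a compound statement on one line.
#
#   def computeMean(Values):
#     if len(Values)==0: return 0.0
#     return sum(Values)/len(Values)


# PEP 8 style: snake_case names, 4-space indentation, spaces around
# binary operators, one statement per line.
def compute_mean(values):
    """Return the arithmetic mean of values, or 0.0 if there are none."""
    if not values:
        return 0.0
    return sum(values) / len(values)


mean = compute_mean([1, 2, 3, 4])
```

Both versions compute the same thing; the second is simply easier for the next person to read.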
I just want to code. Well, you can help yourself with your editor. In this case it's Emacs, which is what I use, sorry for the vi guys or users of other editors. You can probably configure yours to help you not only with checking that you're following the convention of your company or, if you're using PEP 8, following PEP 8, but also to help you detect
things that might go wrong. For example, in this case, a variable is never used, and your editor can help you detect such things. There are other topics that I would have loved to mention or go into in more detail, but I don't have the time because I want to show you more cool stuff.
Project structure. Testing: there are no tests in a data science project; generally, they don't come with tests. Versioning and branching, namely, learn how to use your source control. Code reviewing. In general, the software development life cycle. There are some books that I recommend you read.
They're really general. They're not specific to one language, but if you want to move closer to this side of the data science area and become more of a software developer, those are good books to start with. Also, I was reading the descriptions on the EuroPython website, and there are some talks that I think are relevant
and that probably talk about these issues. If you go to any of them, please tell the speaker that I sent you there, and he's probably gonna buy me a beer for that or something. So let's go into code right now. It's been a really theoretical part
and you're probably bored and want to see some code. So let's do it. So, the tips and tricks I would have loved to know before starting, in a code sense. In particular, in a nutshell, the collections module. It's incredible, when you start using Python
from the data science perspective, how little you know about things that are in the standard library, and one of those things is the collections module. Let's start with a basic thing: counting. Counting is kind of like the basic building block for many statistical algorithms. From basic naive Bayes to Word2Vec,
they are based on counting. However, I don't think data scientists know how to count properly in Python. Let's see. Let's start. How do you count in Python? The first attempt: you use dictionaries. Who has written such code to count stuff? Oh, you apparently know your stuff.
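A sketch of what that first attempt with a plain dictionary might look like. The word list is made-up sample data.

```python
words = ["spam", "egg", "spam", "ham", "spam", "egg"]

counts = {}
for word in words:
    # Ask for permission first: is the key already in the dictionary?
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1
```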
Let's see. So the actually more Pythonic way is this one, where you don't ask for permission but for forgiveness. I think that's something like what it means. Who has written something like this? But we can do better.
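The "ask for forgiveness" variant could be sketched as follows, again with made-up sample data: try the update and handle the missing key only when it actually happens.

```python
words = ["spam", "egg", "spam", "ham", "spam", "egg"]

counts = {}
for word in words:
    try:
        counts[word] += 1      # assume the key exists...
    except KeyError:
        counts[word] = 1       # ...and recover the first time it does not
```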
Let's use collections.defaultdict. Who's familiar with defaultdict? Yeah, that's good, some of you. So it's basically the same. With defaultdict you basically pass a function that generates the default value. In this case it's int,
and by default it will be zero. So I don't have to do any check. But let's use Counter. Who's familiar with Counter? A few of you. So Counter is really cool. It's just like a defaultdict that is already prepared for counting. And it's for free.
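Before moving on to `Counter`, here is a sketch of the `defaultdict` version just described, with made-up sample data. `int()` returns `0`, so a missing key starts at zero.

```python
from collections import defaultdict

words = ["spam", "egg", "spam", "ham", "spam", "egg"]

counts = defaultdict(int)   # int() produces the default value 0
for word in words:
    counts[word] += 1       # no membership check, no try/except
```

One thing to keep in mind: merely reading a missing key, such as `counts["bacon"]`, inserts it with the value `0`.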
And that's how you use it. I just pass the list of items, an iterable, and I get the counts. However, it comes with some extra goodies: you can get the most common values and do some set operations on counters. And I find that pretty cool.
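A short sketch of `Counter` and the extras mentioned above, using made-up sample data. `most_common`, `+` and `&` are part of the standard `collections.Counter` API.

```python
from collections import Counter

words = ["spam", "egg", "spam", "ham", "spam", "egg"]

counts = Counter(words)             # count directly from any iterable
top_two = counts.most_common(2)     # the two most frequent items

other = Counter(["spam", "bacon"])
combined = counts + other           # add counts item by item
common = counts & other             # intersection: minimum of each count
```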
But remember, Counter is a class, and I just mentioned that you can take classes, extend them, and add your own behavior. For example, say you want to calculate the probability of some items. I can extend the Counter class, add a normalize function, and I already have the probability mass function for that.
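A sketch of that idea. The class name `ProbabilityCounter` and the exact shape of `normalize` are assumptions; the talk only says that a normalize function is added to a `Counter` subclass.

```python
from collections import Counter


class ProbabilityCounter(Counter):
    """A Counter that can report relative frequencies."""

    def normalize(self):
        # Divide every count by the total to get a probability mass function.
        # Assumes the counter is not empty.
        total = sum(self.values())
        return {item: count / total for item, count in self.items()}


pmf = ProbabilityCounter("aabc").normalize()
```

Everything `Counter` already offers, such as `most_common`, is still available on the subclass.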
Easy peasy. If you want to override the initializer to normalize as soon as you have all the items in the counter, you can do that also. So when you're counting things in Python, which you want to do when you're doing statistics,
and you use things like pandas or scikit-learn, probably sometimes when you're building features, you should totally check out the Counter class. And there's a really nice article by Trey Hunner about how counting has historically developed in Python, and it's a really good read.
Namedtuples. So namedtuples are a thing that I discovered recently, and they're kinda hidden, more or less. Who's familiar with namedtuples?
Most of you, some of you, yeah. So the thing is that when writing code, people use a lot of dictionaries, lists, tuples, and when you start integrating that into a large code base, you see that code, you see it's a dictionary, and you don't know what to expect.
So it makes the code hard to read. In this example, if I remove this, you have no idea what I'm talking about, what pt is in this case. You might guess from the context, maybe. So just by using namedtuples you can make the code clearer. Namedtuples are basically sort of like an inline class generator, with the particularity that the attributes are read-only.
So they are basically a nice struct in Python; if you're familiar with C, that's more or less the equivalent. And so you can create classes on the fly. It has cool methods also. If you really need to use a dict, you can transform it into a dictionary, and you can create one out of an iterable.
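A sketch of those namedtuple features. The `Point` type and its fields are assumptions made for illustration; `_asdict` and `_make` are the standard namedtuple methods for converting to a dictionary and building from an iterable.

```python
from collections import namedtuple

# Creates a small class on the fly, with named, read-only fields.
Point = namedtuple("Point", ["x", "y"])

pt = Point(x=1.0, y=5.0)
distance_sq = pt.x ** 2 + pt.y ** 2    # clearer than pt[0] ** 2 + pt[1] ** 2

as_dict = pt._asdict()                 # convert to a dictionary
from_iterable = Point._make([3, 4])    # build one out of an iterable

# Attributes cannot be reassigned.
try:
    pt.x = 2.0
    read_only = False
except AttributeError:
    read_only = True
```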
So, and I think it's a nice way when you're writing code to organize it and to create sort of like domain classes that represent things in your code and your ontology. In this case, we worked a lot with hotels, so I created a hotel base and a total distributor
and I actually inherit from it and add a method to calculate something. Then I pass instances of this class around my code, and that makes it, in my opinion, more readable. So let's go to the more needy part
and this is really interesting because it's really confusing. For me it was, and actually I remember that during my first interview for the company I'm working for right now, I was asked something about iterators and iterables, and I think I answered correctly,
but it was out of luck. Later I discovered, oh, I got it right, but I didn't know why. So let's talk about this. When you see code like this,
you're probably familiar with it: it's how to iterate through a list. But what is happening underneath? Why can you do this? How does it work, and how can you write your own classes that have the same behavior?
I was confused and was looking for what the difference is between iterators and iterables. Is it a list, a dictionary, and so on? And I found this nice article by Vincent Driessen, and I use this graph from him. You should totally check it out, and we're gonna start kinda exploring the concepts
using this graph. So two concepts, maybe abstract for you right now: iterable and iterator. An iterator, sorry, an iterable is something that you can call the iter method on, and it will return an iterator. And an iterator is something that produces a value
when you call next on it. Abstract? Okay, let's go into more detail. A comprehension, for example, produces a container. That container can be, for example, a list, a dictionary, a tuple also. A container is something where you can check whether something is inside it.
That's more or less where the name comes from. In this case, I check that a number is in the list. In this case, it's a set. And a container is typically an iterable, so you can go through all the elements one by one. In this case, this is a list,
and I call the iter function on it, which gives me x and y, which are iterators. If I look at the types, one is a list and the other is a list iterator. And I can call next on those iterators and obtain the items from the list one at a time.
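The steps just described can be sketched as follows. The variable names `x` and `y` follow the talk; the list contents are made up.

```python
numbers = [1, 2, 3]

is_member = 2 in numbers     # a container supports membership tests

x = iter(numbers)            # ask the iterable for an iterator
y = iter(numbers)            # a second, independent iterator

first = next(x)              # each iterator keeps its own position
second = next(x)
other_first = next(y)

type_names = (type(numbers).__name__, type(x).__name__)
```

Advancing `x` does not affect `y`: the list is the iterable, and each iterator only remembers how far it has walked.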
Now, when you write this code, underneath, at the bytecode level, that's what's happening. Python gets the iterator from the list and starts getting the values. So the for loop is more or less syntactic sugar in some way.
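As a rough illustration of that syntactic sugar, a `for` loop behaves approximately like the hand-written loop below. This is a Python-level sketch of the protocol, not the actual bytecode, and `manual_for` is a hypothetical helper name.

```python
def manual_for(iterable):
    """Collect items the way a for loop would, using iter() and next()."""
    collected = []
    iterator = iter(iterable)          # step 1: get an iterator
    while True:
        try:
            item = next(iterator)      # step 2: ask for the next value
        except StopIteration:          # step 3: stop when it is exhausted
            break
        collected.append(item)         # the loop body
    return collected


result = manual_for([10, 20, 30])
```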
So in a nutshell, an iterable is any object that can return an iterator. That includes containers like lists and dictionaries, and files. If you want an object to behave like that, you have to implement the __iter__ dunder method. Some of those things might not be finite.
They can just generate values forever, and we'll see an example of that. There's a module in Python called itertools that has a lot of functionality for working with iterables, iterators, and generators. So how do I implement my own iterator?
For kind of pragmatic reasons, you can implement both the iterable and the iterator in one class. So you have an __iter__ method that returns self, and then you implement the next method. In Python 3, it's a dunder method, __next__. In this case, it's just code that reads a file
and then iterates in inverse order. So it starts from the last line and goes up. When there are no more lines to return, it raises StopIteration, which is the exception that is raised to stop the for loop. You can use it easily.
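A sketch of such a class, written for Python 3. The class name `ReverseLines` is an assumption, and reading the whole file into memory up front is a simplification; the speaker's slide code may differ. The temporary file only exists so the example runs on its own.

```python
import os
import tempfile


class ReverseLines:
    """Iterable and iterator in one class: returns a file's lines last to first."""

    def __init__(self, path):
        with open(path) as handle:
            # Simplification: the whole file is read into memory here.
            self._lines = handle.read().splitlines()
        self._index = len(self._lines)

    def __iter__(self):
        return self                      # the object is its own iterator

    def __next__(self):
        if self._index == 0:
            raise StopIteration          # tells the for loop to stop
        self._index -= 1
        return self._lines[self._index]


# Write a small throwaway file to iterate over.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")

reversed_lines = [line for line in ReverseLines(tmp.name)]
os.remove(tmp.name)
```

Because `__iter__` returns `self`, an instance can only be walked once; a second loop over the same object yields nothing.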
You instantiate it and then you do the same as if it were a list. So now we've covered the green part of this graph. We know that we can get iterables from things like lists and dictionaries and files, and from there we can get iterators that will produce values in a lazy way.
But there's another way to get an iterator, and it's by using generators. Who knows what a generator is? Fewer of you, okay. So let's start. You can get a generator from a generator expression
or from a generator function. Both, as I said, give you generators. Before the generator expression, let's start with the non-generator version. It's basically a list comprehension: I'm creating a list of 10 numbers
and then I create a list of the squares of those numbers. If I check the type, it's a list. These are only 10, but what if it's a billion numbers? Probably I won't have enough memory to store them in my RAM, or on my disk. You can do the same with generator expressions,
and this is not a tuple, although it looks like one. It creates a generator object that will produce the squares, in this case from the list of numbers, in a lazy way. So each time I call next, it will calculate the square and return it.
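The contrast between the two forms can be sketched like this. The variable names are assumptions.

```python
numbers = list(range(10))

squares = [x * x for x in numbers]        # list comprehension: built at once
lazy_squares = (x * x for x in numbers)   # generator expression: nothing computed yet

kind = type(lazy_squares).__name__        # "generator", not "tuple"

first = next(lazy_squares)                # each square is computed on demand
second = next(lazy_squares)
rest = list(lazy_squares)                 # consumes whatever is left
```

Only the brackets differ, but the list holds all ten results in memory while the generator holds none until asked.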
Think about it as a factory of items, and the factory uses, in this case, the square function, multiplying x by itself. So if I want to do the same
as I do with a list, I generate the squares and the lazy squares, and I can print the items, and each one will only be generated when the for loop internally calls the next function. Before that, those numbers don't exist.
A generator function is the same idea, but it uses a magic word, yield, that works in a nice way. When you call the function fib, you will also obtain a generator.
Then when you call next, what happens is that the code is executed until this yield returns the value back to the program, and it will continue only after the next, well, the next next is called.
In this case, I'm calculating the Fibonacci sequence, which I'm sure you're familiar with, and I can just call next and next and next. Something to be aware of here: this is an infinite generator. You see the while True there; it won't stop.
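A sketch of such a generator function. The name `fib` comes from the talk; starting the sequence at 1 is an assumption, since the slide is not part of the transcript.

```python
def fib():
    """Infinite generator of Fibonacci numbers (starting 1, 1, 2, ...)."""
    previous, current = 0, 1
    while True:                # never ends on its own
        yield current          # hand a value back and pause here
        previous, current = current, previous + current


sequence = fib()               # calling the function only creates the generator
first_values = [next(sequence) for _ in range(6)]
```

Execution resumes right after the `yield` each time `next` is called, so the local variables survive between calls.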
If I put this into a for loop, it will go on forever. It will keep generating numbers of the sequence. I can use one of the functions of itertools to obtain just a subset of that,
and in this case, I just get the first three using a for loop. You can also implement your iterables using the yield keyword, namely in the __iter__ method: instead of returning self, you just make it a generator function.
In this case, I'm reading a file, for example, from HDFS, a distributed file system. Imagine just one server located somewhere, and imagine I pass a source that has the method open, and I just start iterating through it.
This open method might even return an iterable or a generator, for example. I do something with the line, and then I pass it back, and I can just call that with a for loop and process each line. So that's more or less iterables and iterators.
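The iterable described above might look roughly like this. Everything here is a hypothetical sketch: `FakeSource` stands in for a real remote file system client (the talk only says the source has an `open` method), and `LineReader` and the path are invented for illustration.

```python
import io


class FakeSource:
    """Hypothetical stand-in for a remote file system client with open()."""

    def __init__(self, files):
        self._files = files

    def open(self, path):
        # A real client would stream from a server; here we use memory.
        return io.StringIO(self._files[path])


class LineReader:
    """Iterable whose __iter__ is a generator function instead of returning self."""

    def __init__(self, source, path):
        self.source = source
        self.path = path

    def __iter__(self):
        with self.source.open(self.path) as handle:
            for line in handle:
                # Do something with the line, then hand it back to the caller.
                yield line.rstrip("\n")


source = FakeSource({"/data/sample.txt": "first line\nsecond line\n"})
reader = LineReader(source, "/data/sample.txt")

lines = []
for line in reader:
    lines.append(line)
```

Because `__iter__` opens the source again on every call, the object can be looped over more than once, unlike a bare generator.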
So you might think: hey, this is supposed to be related to data science and data processing and so on. What is the relation? Well, sometimes you cannot load all your data into memory. And if you're working in the big data field, that's probably your situation.
As I said, you might not have enough memory to store all the data you want. That will happen when you use a list. So you can work in such cases by using data streaming. You get data streaming through lazy evaluation, which is what I just showed: generating or processing things
as they become available or needed. And you can create such in-memory data processing pipelines using iterables by chaining them. So in my example, I just showed you this class that gets lines from a server
and just sends them on after some processing, maybe splitting something. I can create another one that takes that and checks whether a line is, say, a Python comment or some other kind of comment, and passes it on. So it's kind of like a stream that gets processed and then sent forward.
And you can chain them: first create one object with the source and pass it as the input of the other, for example. And you can just call it in a for loop. And inside, you're gonna be processing in kind of a streaming fashion.
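A rough sketch of chaining two iterable objects, one feeding the other. The class names are hypothetical, and dropping comment lines is an assumption about what the second stage does with the comments it detects.

```python
class Lines:
    """First stage: yields each raw line with surrounding whitespace removed."""

    def __init__(self, raw_lines):
        self.raw_lines = raw_lines

    def __iter__(self):
        for line in self.raw_lines:
            yield line.strip()


class WithoutComments:
    """Second stage: takes any iterable of lines and skips Python comments."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            if not line.startswith("#"):
                yield line


raw = ["# setup\n", "x = 1\n", "  # note\n", "print(x)\n"]

# Create the first object with the source, then pass it as input to the second.
pipeline = WithoutComments(Lines(raw))

result = []
for line in pipeline:
    # Each line travels through both stages before the next one is read.
    result.append(line)
```

No intermediate list is ever built: the second stage pulls one line at a time from the first.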
So think about Inception, the movie. You're going through different levels. The first level does something. The second level does something. They are sending data up. You don't have to write an object for this. You can just take a generator function and replace the whole thing with functions. I just did it this way as an example of how I like to do it.
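The same kind of pipeline can be sketched with plain generator functions instead of classes. The function names and the three stages are illustrative assumptions.

```python
def read_lines(raw_lines):
    """Level one: clean up each raw line."""
    for line in raw_lines:
        yield line.strip()


def drop_comments(lines):
    """Level two: skip lines that are Python comments."""
    for line in lines:
        if not line.startswith("#"):
            yield line


def split_words(lines):
    """Level three: split each remaining line into words."""
    for line in lines:
        yield line.split()


raw = ["# header\n", "a b c\n", "d e\n"]

# Each level wraps the one below it; nothing runs until iteration starts.
stages = split_words(drop_comments(read_lines(raw)))
result = list(stages)
```

One trade-off with this style: `stages` is a generator, so it is exhausted after a single pass, while an object with its own `__iter__` can be iterated again.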
There's a talk at EuroPython which will go into more detail. It's on Friday. Probably you didn't get the whole idea, since it wasn't as clear as I wanted, but you can for sure get more information in that talk.
So to finalize, some conclusions or closing remarks. Data scientists, engineers, developers, you name it: you should, in my opinion, start with the collections and itertools modules, which are basic and are your best friends. You should learn iterables, iterators,
and build your data processing pipeline using them. Use object-oriented programming for organizing your code. That will help you not only to make your code more maintainable, but when you go to integration and you're working in large teams, you will have a better time
getting your code into the code base. And finally, you're gonna have to start moving to be more a software engineer instead of being just a scientist or a data juggler. You will have to become more of a software developer
when you want to get your solution into an existing product. Some credits, the images I use, I base most of my talk in a couple of articles,
in particular in ideas coming from Radim Trejurek, he's the creator of a library called Jensim, and he also, I really like how he, in this article I'm linking here, how he talks about data processing pipelines using such iterators and iterables.
As I said, I work for TrustYou. We are hiring. If you want to know more about the company I'm working for, we have a small table where you can get some goodies. Just drop by and talk to us. And if you want to talk to me about the talk after the Q&A session, you are also welcome there.
So questions, comments, remarks, or if you want to trash my talk, you're welcome too. Thank you. I think everyone is hungry, so I don't expect many questions.
We have time for questions anyhow, if you want. If we don't have questions, we thank you again. And have a good lunch.