Python & Spreadsheets
This is a modal window.
Das Video konnte nicht geladen werden, da entweder ein Server- oder Netzwerkfehler auftrat oder das Format nicht unterstützt wird.
Formale Metadaten
Titel |
| |
Untertitel |
| |
Alternativer Titel |
| |
Serientitel | ||
Teil | 5 | |
Anzahl der Teile | 48 | |
Autor | ||
Mitwirkende | ||
Lizenz | CC-Namensnennung - Weitergabe unter gleichen Bedingungen 3.0 Unported: Sie dürfen das Werk bzw. den Inhalt zu jedem legalen und nicht-kommerziellen Zweck nutzen, verändern und in unveränderter oder veränderter Form vervielfältigen, verbreiten und öffentlich zugänglich machen, sofern Sie den Namen des Autors/Rechteinhabers in der von ihm festgelegten Weise nennen und das Werk bzw. diesen Inhalt auch in veränderter Form nur unter den Bedingungen dieser Lizenz weitergeben. | |
Identifikatoren | 10.5446/33207 (DOI) | |
Herausgeber | ||
Erscheinungsjahr | ||
Sprache |
Inhaltliche Metadaten
Fachgebiet | ||
Genre | ||
Abstract |
|
DjangoCon US 20175 / 48
2
5
6
14
16
23
26
30
31
32
34
39
43
48
00:00
TabellenkalkulationEntscheidungstheorieAggregatzustandFontFundamentalsatz der AlgebraDemo <Programm>TaskTypentheorieCodeSoftwareentwicklerSpieltheorieDreizehnEin-AusgabeKonfiguration <Informatik>VerzeichnisdienstWeb-SeiteBaum <Mathematik>Lokales MinimumSoftwareentwicklerProgrammbibliothekFontDemo <Programm>CodeSummierbarkeitDatenbankSchnittmengeMultiplikationsoperatorProgrammiergerätWort <Informatik>TeilbarkeitQuick-SortPi <Zahl>DatenverwaltungFormale SpracheAppletZahlenbereichSpezielle unitäre GruppeDienst <Informatik>Prozess <Informatik>Natürliche ZahlSystemprogrammPunktMereologieRechnernetzDivisionComputerspielWiederherstellung <Informatik>Arithmetische FolgeKartesische KoordinatenProgrammierungNotebook-ComputerInternetworkingMomentenproblemTouchscreenSchreib-Lese-KopfTermFundamentalsatz der AlgebraDatentypHypermediaRechter WinkelCASE <Informatik>RechenschieberFormation <Mathematik>Deskriptive StatistikVersionsverwaltungDreiecksfreier GraphVerschlingungKategorie <Mathematik>Basis <Mathematik>Kontextbezogenes SystemInformationTabellenkalkulationProjektive EbeneElektronische PublikationUmwandlungsenthalpieNormalvektorSpieltheorieSelbst organisierendes SystemAggregatzustandDifferenteTwitter <Softwareplattform>Spezifisches VolumenMultiplikationMetropolitan area networkZellularer AutomatIntranetsinc-FunktionComputeranimation
08:20
TypentheorieTabellenkalkulationKonfiguration <Informatik>CodeKernel <Informatik>Strom <Mathematik>Ein-AusgabeVerzeichnisdienstVisuelles SystemGruppenkeimGeschlecht <Mathematik>Normierter RaumSteuerwerkCodierung <Programmierung>IndexberechnungVerschlingungHochdruckLokales MinimumWeb-SeiteSpieltheorieViereckMailing-ListeTupelGefrierenTabellenkalkulationTypentheoriePunktAttributierte GrammatikObjekt <Kategorie>GraphUmwandlungsenthalpieProdukt <Mathematik>MAPElektronische PublikationAnalysisResultanteOffene MengeProjektive EbeneZellularer AutomatBeanspruchungMailing-ListeDatensatzInformationAusdruck <Logik>SoundverarbeitungMultiplikationsoperatorEin-AusgabeKonstruktor <Informatik>DifferenteCodeCASE <Informatik>DialektErgodentheorieFokalpunktQuick-SortGeradeFunktionalProgrammbibliothekSoftwareentwicklerDatentypSchnittmengeDateiformatGauß-FehlerintegralKategorie <Mathematik>ProgrammierumgebungProgrammiergerätSpiegelung <Mathematik>MultiplikationBridge <Kommunikationstechnik>MusterspracheLogarithmusWort <Informatik>MereologieFigurierte ZahlVerzeichnisdienstSelbst organisierendes SystemComputerspielBildschirmmaskeTeilmengeTermHyperbelverfahrenGemeinsamer SpeicherClientKünstliches LebenDemo <Programm>TouchscreenAusnahmebehandlungUnrundheitLesen <Datenverarbeitung>Computeranimation
16:40
Mailing-ListeTupelEin-AusgabeZellularer AutomatHochdruckKernel <Informatik>DialektDivisionLokales MinimumTypentheorieAttributierte GrammatikObjekt <Kategorie>PROMWeb-SeiteKonvexe HülleTabellenkalkulationDemo <Programm>FontHyperlinkVererbungshierarchieViewerInformationInklusion <Mathematik>LoopSummierbarkeitZahlenbereichDemo <Programm>Zellularer AutomatDatensatzLesen <Datenverarbeitung>TabellenkalkulationEinfügungsdämpfungObjekt <Kategorie>ProgrammbibliothekDifferenteKonfiguration <Informatik>Quick-SortRelationale DatenbankDatentypInformationMultiplikationsoperatorHalbleiterspeicherTypentheorieZeichenketteOffene MengeAttributierte GrammatikTupelCAN-BusEindringerkennungMixed RealitySchnittmengeFunktionalRegulärer GraphLoopElement <Gruppentheorie>StandardabweichungFontGeradeViewerResultanteBitKategorie <Mathematik>Generator <Informatik>OrdinalzahlCASE <Informatik>Mathematische LogikHochdruckRuhmasseE-MailPlastikkarteDivisionMomentenproblemDatenbankRechter WinkelMereologieForcingEndliche ModelltheorieEreignishorizontAutomatische HandlungsplanungWort <Informatik>Reelle ZahlComputeranimation
25:00
Data DictionarySummierbarkeitSpieltheorieDivisionMailing-ListeDemo <Programm>Cantor-DiskontinuumPROMLokales MinimumAnalogieschlussTabellenkalkulationE-MailGebäude <Mathematik>Funktion <Mathematik>UmwandlungsenthalpieZellularer AutomatKernel <Informatik>SchlüsselverwaltungEuler-DiagrammRuhmasseFrequenzProzess <Informatik>DreizehnLemma <Logik>SichtenkonzeptStabHIP <Kommunikationsprotokoll>Total <Mathematik>StatistikSpezialrechnerOffene MengeOvalInklusion <Mathematik>Funktion <Mathematik>Objekt <Kategorie>DatensatzE-MailStichprobenumfangDatenverwaltungOrdnung <Mathematik>Prozess <Informatik>DivisionCodeAutomatische IndexierungMailing-ListeZellularer AutomatHochdruckCASE <Informatik>MultiplikationsoperatorHyperbelverfahrenTabelleMereologieFrequenzTypentheorieQuick-SortSelbst organisierendes SystemGebäude <Mathematik>EchtzeitsystemBeobachtungsstudieTabellenkalkulationMetadatenParametersystemSchnittmengeGradientUmwandlungsenthalpieZahlenbereichProgrammierungResultanteEindeutigkeitSoftwaretestPlotterDesign by ContractTopologiet-TestProgrammschleifeRechter WinkelElektronische PublikationData DictionaryCall CenterDemo <Programm>Natürliche ZahlRegulärer GraphEindringerkennungMinimumTotal <Mathematik>BitSystemaufrufComputeranimation
33:20
TabellenkalkulationVisuelles SystemEin-AusgabeFunktion <Mathematik>ProzessautomationAttributierte GrammatikZellularer AutomatOrdnungsreduktionTwitter <Softwareplattform>TabellenkalkulationVisualisierungQuellcodeGeradeNichtlinearer OperatorValiditätTUNIS <Programm>BrowserMakrobefehlArithmetisches MittelGravitationARM <Computerarchitektur>IdentitätsverwaltungProgrammbibliothekElektronische PublikationInterface <Schaltung>MultiplikationsoperatorBitCodeStabilitätstheorie <Logik>TypentheorieObjekt <Kategorie>NeuroinformatikZellularer AutomatDesktop-PublishingDatenstrukturSummierbarkeitBimodulEin-AusgabeTaskGruppenoperationQuick-SortTwitter <Softwareplattform>RandverteilungTabelleDokumentenserverAnalysisKanalkapazitätRechter WinkelPunktSpannweite <Stochastik>RechenschieberDateiformatResultanteSchlussregelSpielkonsolePivot-OperationDatentypFunktionalNotebook-ComputerSichtenkonzeptMailing-ListeFunktion <Mathematik>DatensatzRechenbuchProgrammfehlerComputeranimationVorlesung/Konferenz
39:21
JSONXML
Transkript: Englisch(automatisch erzeugt)
00:21
Here are my slides. As you can see, I'm very fancy, and I've made very fancy elaborate slides with white and black text. Because as a man who talks about spreadsheets, of course, I'm very concerned about fancy things. So the talk here, Python and Spreadsheets, State of the Union, August 2017, it's called
00:44
that for a reason. And just for information purposes, my name is there, my Twitter handle is also on every slide. So if you have questions you want to tweet at me later, then that's a thing you can do. I'm usually pretty available. So this is called State of the Union, August 27, because I gave this talk initially in 2013, or version of it in 2013, when I first
01:04
started using the library I'll be talking about. So right now, I'm a QA specialist for a startup in Houston called Decisio Health. I used to be an accountant, and then I got an MBA, then I ran away to China, and was a college instructor there for a few years, and I also taught here in the US. The slides
01:23
in the iPad Notebook I'm using for this talk will be available at the link there at my GitHub, DjangoCon 2017. It's not available at the moment, but it will be soon. Note to self, when I turn to look at the screen, turn my body, and not my head, because if I just turn my head, the volume goes down. See, that's not good.
01:43
This is what happens when you practice this talk a few times, you realize these finer details. So basic outline of what the talk's going to be. We're going to talk about how I got here, sort of my secret origin story. I'll demonstrate some of the fundamentals, some of the fundamental data types that OpenPy Excel gives you, and that's because
02:03
there are a few things there that are a little not obvious, but then once they're explained, you recognize how helpful they are. Then we'll do a basic demo of some basic things you can do with Python and spreadsheets, and then we'll look at some of the problems that you'll run into trying to use spreadsheets with code. There are certain things that are ... There are some situations where using a spreadsheet
02:23
with code is straightforward and simple, but then there are a lot of situations where it's not, and so I just want to highlight some of those. Again, the pain of my life as a recovering accountant can hopefully benefit you at some point. So, da, da, da, secret origin of Cojo. I used to be a professional spreadsheet fighter,
02:41
which is also what's known as an accountant. Lots of spreadsheets all the time. You'd be terrified by how many large organizations are running spreadsheets, but I'd always had an interest in code and learning to program, but I never really needed to as someone who wasn't a professional programmer, so I decided in late 2012, so you get more serious about teaching myself to code and becoming a professional developer, and I went
03:02
pro, i.e., got my first development job in December of 2015, so that's sort of how I got to this point, and so my role in the Python community is such that I've always had an interest in trying to make some sort of a contribution to the community, but as someone who didn't come from a traditional CS background or a coding background, I knew
03:23
I wasn't going to start just making contributions by writing awesome code on day one or even early on, so I thought, well, I can, you know, I've got decent personal skills, interpersonal skills, so I can maybe help grow the community, and so I felt one of the best ways to grow the Python community was not by converting people who used other
03:43
languages into Python developers, but by taking people who weren't developers and bringing them into the Python community by showing them how Python can benefit them, and for one reason, if you're already developing, say, in Java or C++ or what have you, you already have your built-in biases, but at the same time, there are more people who
04:02
aren't programmers than there are people who are programmers to convert, so it's just a bigger growth factor. Also wanting to look at some solutions that are not obviously people who aren't developers, so there are certain things, if you're a developer, you think about things in a certain way. When I first gave this talk, one of the solutions
04:22
to this problem that someone suggested was, well, just put it in a database. Well, if you already know SQL and you have access to databases, you wouldn't have the problems that I'm discussing in this talk, so trying to come up with some solutions that are useful to people who aren't already developers. So who's the talk for? We've got two sets
04:42
of people the talk is geared at, and again, I'm trying, well, not again, but if you've read the description, you might have seen that I'm trying to make this talk as beginner-friendly as possible. So we've got two general categories of people. One, people who are using spreadsheets on a regular basis but want to sort of step their game up, want to be able to do some different things with spreadsheets, in most cases, Excel, so they want to be able to
05:03
do some things outside of the norm with the spreadsheet, and then the next set of people are people who are already Python developers, but they keep being given spreadsheets, they keep being confronted with spreadsheets, and it's like, well, what do I do with these? There's got to be some better way to handle these, so hopefully, this will benefit both sets of people. So the code here is not going to be very advanced. Again,
05:23
partially because I wanted this talk to be as beginner-friendly as possible, but also because the really interesting things code-wise are going to be based on your specific application. So pulling data out of the spreadsheet, writing data back to the spreadsheet, those things are fairly straightforward. There are some other things that the library will
05:41
help you do, but the really interesting things are going to be based around your specific use case. So I don't know what that is, and I can't demonstrate that code, so I'm showing fairly basic code for a couple of basic applications to give you some ideas of what can be done. So now it's time for a demo. This demo is sponsored by Jupyter
06:02
Notebooks, Jupyter Notebooks, purveyor of fine intranet notebooks since like five years ago. Not that long ago, but still very helpful. So step one, has this readable to people in the back? Yes? Okay. I got to kind of. Is that too big? Is that? Yes,
06:27
all right. I think I've gotten rid of all those, yeah. So the first thing you have to do is you have to know your data, and for that, we will take a look at the actual
06:40
spreadsheet file. It's almost impossible to try to work with code programmatically. We work with a spreadsheet programmatically if you don't know what's in the spreadsheet. And so here, what we've got, for my simple example, let me try to, I want you to do what I say. I want you to do what I mean, not what I say. These spreadsheets are trouble
07:02
already. There we go. I'm going to blame my own mistakes on the spreadsheet. That's some data. We have what is some simulated timesheet data. So what we have here, we have an employee number, and the employee number identifies each employee, so that's the same
07:23
employee, this is a different employee. An employee number, the cost center that that employee worked for, so there's sort of this specific work group. We also have their division within the company, so your company is broken into multiple different divisions, and so each division will have multiple cost centers in it. Then who is that employee's manager, and of course, these are fake simulated names. And then the date that they worked,
07:47
so again, this is the timesheet data, so we're seeing here on this date, this employee with this name and this employee number. Can you all see the cell moving fairly well? All right. So this employee worked one hour on a project, and so this is simplified
08:03
data just for the purposes. If you work in a professional services firm, so I've worked in accounting firms and for engineering companies and things of that nature, one of the things that they look at is what's called utilization, so how much time do you spend working on an actual billable project versus what's called overhead time, where you're at work
08:23
and you're doing things, but you're not working on a specific client project that they can bill for. For this example, everybody's working on a billable project, and so we're not doing that level of analysis. But that's what we've got here, so we've got basic timesheet data, so this employee worked one hour on a project, and then they worked three hours on another project the next day, so on and so forth. So I've got
08:42
about 10,000, not 10,000, 1,000 rows of data here. Examples that I've worked on before had 10, 15, 20,000 rows of data. So this is the basic data, and again, if you're going to be, it's just like anything else in programming. If you're going to be writing code to manipulate data, you need to be familiar with that data. So here's the
09:00
basic data we've got, and then we start with some basics, reading a file and getting some basic data types. So the library that I'm using here is called OpenPyXL. OpenPyXL is a library, a Python library, that lets you read from and write to .xlsx files, and for those not familiar, the older .xls file format
09:23
is what Microsoft Word used up until, I believe, 2007. And after that, they switched to the .xlsx format, and that extra x means that it's based around XML, and it is compatible with the Open Document Organization's formats. And so,
09:46
at the time in 2013 when I started looking at this, OpenPyXL was one of the few libraries that would actually work with that file format, and that's what I had to deal with. So again, very straightforward stuff here. Importing OpenPyXL, by show of hands, how many of you would consider yourselves beginner or novice programmers in early stages?
10:06
Okay. And so, how many of you might know some people who would be beginner or novice? Not you, yourselves of course, but you might know some other people who might be beginner or novice programmers that might maybe benefit from hearing things described in that way. Okay, so be sure to tell a friend and share this with them.
10:25
So I'm trying to step through this in a fairly straightforward way, again, so that you could explain it to someone who's not an experienced developer, and they could actually get some benefit from this. So you're importing OpenPyXL, and then from OpenPyXL, importing Workbook. And this gets you a Workbook object, and we'll talk about that distinction
10:45
between Workbooks and Worksheets and that sort of thing later. And then this line of code, Workbook equals OpenPyXL.loadWorkbook, this loadWorkbook function, this is the name of the file that we're using. So the spreadsheet that we just saw, this spreadsheet, this is pyxl underscore demo underscore django condit xlsx.
11:05
That should be, that's wrong. That should be true. I was playing with this earlier. The distinction here is this data only equals true is for situations where you have a formula in a spreadsheet, which lots of spreadsheets have formulas in them, if that data only is equal to true, what that's going to do is that's going to give you the result of the formula
11:25
in all cases as opposed to bringing you back the formula itself. So that's that, and let me do this. I'm just going to run this to make sure I've got everything behaving as it should. All right, and so this is all sort of fairly
11:44
straightforward Python stuff, except for this data only equals true. This is something specific to OpenPyXL. So now we're going to talk about this distinction between Workbooks versus Worksheets or spreadsheets and tabs. Most people tend to use the terms sort of
12:01
in the spreadsheet. So in Accounting Nerd Talk, a spreadsheet is a workbook. So Excel describes them as a workbook. So a workbook is the actual file itself. So what we see here is part of a workbook. The individual tabs here, I have clean data, which is what we're using,
12:21
then kpop draft roster, which is a whole different thing. kpop is one of my things. So each of these tabs is known as a worksheet, and the multiple worksheets make up a workbook. And so that is important because when you're using, so most people will say like a spreadsheet or a tab in a spreadsheet,
12:44
but Excel defines them as a workbook, which is a collection of individual worksheets. And that's important because when you're accessing the data, you need to tell the, you've got a workbook file, so you want to open that file. That's what we're opening up here in this cell. But then you need to know which worksheet you need to get out of that workbook.
13:00
And there are different objects that have different properties to them. And so here, we've got this WB is the workbook that we've opened, which is the entire thing. And for those of you who are, again, newer to Python or who have friends who are newer, the DIR function, the directory function in Python is helpful because it will give you a list of the
13:21
attributes that are available for a particular object. Now, for a lot of basic Python constructs, if you're already familiar with these, maybe you don't care. But if you are using a new library like this, like OpenPy, Excel, it has some different data types that you're not going to be familiar with because you haven't seen them before. So using the DIR function on them is helpful. And so we see a lot of the different attributes. And we'll notice here a couple
13:45
sheets. And so this will show you the sheets that are available. Let's see. Copy worksheets so you can copy a specific worksheet. Create sheet, which we'll see a little bit later. Get sheet by name so you can go to a, you can grab a specific worksheet out of
14:03
a workbook. Or if you don't know what sheets are available, get sheet names. And so that will tell you, okay, what sheets are actually here in this workbook. So we'll look at those. Give me a round. So that's that. And then so I'm here, I've got the workbook,
14:22
and I print workbook dot sheet names because I want to know what sheet names are there. Now you will notice, I'm trying to make sure it's high enough on the screen so people can see it. You'll notice the workbook dot sheet names, when we looked at the actual spreadsheet itself, we saw two worksheets, clean data and kpop draft roster. But when I print the
14:44
workbook sheet names, I get three. Japan spending, clean data, and kpop draft roster. So the Japan spending worksheet is actually hidden. As I was working on this talk initially, I was in Japan, and I was trying to figure out where all my money had gone. And so I started, I was like, well, I shouldn't have spent that much money. So show sheet,
15:04
Japan spending. So that's a hidden worksheet. I point that out because if you're in a situation where you're trying to hide something from someone and you want to hide a sheet, well, if they just look at it, they won't see it. But if they get a list of the sheet names, it's still visible there. So those are the worksheets we have available to us. The one we want is the clean data worksheet. And so here, I'm going to create this
15:24
variable called demo worksheet. And I want to get sheet by name, and then pass it the worksheet name. And so now, this WB is a workbook object. And actually, let me do that here. A live coding portion. Notice this WB, it's a workbook object. Again, which is not something
15:43
that natively exists in Python. So open pyxl.workbook.workbook. So it's a workbook object. And it has those different attributes that we saw before. That's why I had the DIR. So I ran the directory function on it. And again, here, I've created this demo worksheet, which holds a specific worksheet. And we see that it is a worksheet object
16:06
that has a specific set of attributes that we can see with the directory function. So those are the different things you can do with it. And the two we're going to focus on most here are, well, so we've got cell. So you can get a particular cell
16:26
from worksheet. But we're going to look at columns and rows. So each worksheet, again, depending on how familiar you are with the worksheet, columns go up and down,
16:41
rows go side to side. In this case, each row represents a particular record similar to in a relational database. So those are the different attributes we've got for the worksheet object. And these workbook and worksheet objects and the cell objects we'll see later, there are a lot of different attributes, a lot of different options. There's clearly not time to go through all of them. But I'm just going to try to go over
17:02
some of the highlights. So we've got this worksheet now, this demo worksheet. And so what we want is we want the data out of this. So we're going to grab it by rows. And so demo worksheet.rows. So one of the rows here. And you'll notice it returns this generator object. So generator object worksheet.cells by row. And so what this does
17:25
is it gives us a generator object. Instead of reading every row out of this spreadsheet, it creates a generator object. And if you are not familiar with generators, a simple way to think about them is that a generator object is something that has a collection of items. But instead of
17:44
giving you all those items at once, it will give them to you one at a time. And this helps to save memory. So instead of having all 10,000 rows of this spreadsheet, you have this generator object that gives you one row at a time as needed. And that helps to save memory. But I point that out because if you say, oh, okay, well, show me the rows, you're not going to
18:03
just get plain rows or plain text or things of that nature. You're going to get this generator object that's going to give you a row at a time. And so what you get is when you print
18:27
out this generator object again, so this demo.worksheet. So for row and demo.worksheet.rows, so I want to see what these rows are like. And you'll notice that each row, I've also got it printing the type. And so you've got a generator object, but what it returns, each
18:43
object it returns is a tuple of cells. And so you've got this cell, clean data, A1. Clean data, A2. So cell A1, cell B1, cell C1 of this clean data worksheet. And so A1, B1, C1, that's going to correspond to the first row. And then the next one
19:04
goes to the second row and so on and so forth. And so you have a tuple. So this generator object is giving you tuples. And each element of that tuple is a cell, which is again, so tuple, standard Python data construct, a cell is not. A cell is another open pyaccell
19:20
data construct. And we'll see why that's important and useful in just a moment. So we move on, we'll take a look at cells. And I just want to show you some of the differences about cells. So for cell in next, demo worksheet.rows. So again, demo worksheet.rows is a generator object. If you have a generator object, like I said,
19:43
it gives you one new item at a time. To get one item at a time, you can use this next function. And so that's just pulling one item out of that generator at a time. And so I'm having it print out just cell and then the cell name. So here, cell.column, tells me what column the cell is in. Cell.row tells me what row the cell is in. Using some
20:03
string formatting here to make that look sort of nice. And then I'm printing the cell itself, which gives me information about the cell. Then the type of cell, just to demonstrate that it's this different data type that's not native to Python, it's provided by the library. So cell A1, cell A1 from clean data, it's of the type cell. And
20:22
the value in it, that last line, is going to show you what's actually in that cell. When you look at a spreadsheet, that's what you're after. You're after the cell.data. And so we see A1, the value is employee num. B1, it's call center. C1, it's division.
20:40
So row one, these are the headers, and so on and so forth until we move on. So I just popped off that top row just to demonstrate that. And so we've got these cell values and types, and now OpenPy is also smart enough to to try to take the data that's in a cell, the values that are in a cell, and convert them to
21:04
the appropriate Python data type. So here, we're looking at demo worksheet E1.value, and so I also do this to demonstrate that instead of grabbing things just by rows, you can go to a specific cell if you want to. So in this case, we're going to this demo worksheet, which is the clean data worksheet, and we're grabbing cell E1 specifically,
21:24
and we're getting its value. And we're doing the same thing to cell E2. So if we take a look at the spreadsheet, cell E1 is date worked, E2 is that first date. And so we print these, and these two cells, 180 and 181, are basically the same thing, just shown differently.
21:43
So that's the value, and then that's the type. So the value here is this date worked, and it's a string. The value here is that first date, so it's showing it, it's displaying it here, but it's also showing you that it's a date time by date time object. And so hyphen knows that, hey, this is a date. And so that's useful.
22:05
So cell attributes, again, so why is there a cell object? Just like we had the workbook and the worksheet object, we have these cell objects that have all these different attributes. And the primary reason for that is because when you're looking at a spreadsheet, it's a cell object, when you're looking at a spreadsheet, there's a lot more going on
22:23
than just what's in the cell, the cell itself, than just the value. There are other attributes. So is it bold? What sort of styling is going on? What other things are happening with the cell? And so these cell objects contain all those other attributes.
22:41
And the one that we're going to use the most here is the value attribute. So we'll be doing cell.value to get the actual value. That's the actual data we're wanting to work with. However, there's other data in the cell, so if you need to know if a cell is colored a certain way or uses a certain type of font or something like that, you can grab
23:03
that information and do things with it as well. But for our purposes, we'll be focusing mostly on that. When you look at cell styles, there's not enough time for me to do a demo on the cell styles, but there are all sorts of things you can do working with styles. OpenPyXL documentation is, of course, on Read the Docs. Thank you, Air Culture,
23:23
for making things like this available for us. So there's a lot of style information, and while that might not seem like the most important things when you're dealing with data, one, you might be in a situation where the spreadsheets that you are given are being styled in a certain way. So maybe a number that's a loss is red and a number that's, you know,
23:42
a profit is green or something like that. You can actually make use of that information. So you can actually pull that out. And if you're needing to write your results to a spreadsheet, you can write them in that fashion as well. So making beautiful spreadsheets has been left as an exercise for the viewer. So I'll let you all do that on your own. So example one, aggregating timesheet info. What we basically want to do here is take this
24:04
timesheet, and instead of having these individual lines for each day, what we want is we want to see, okay, how much time did each employee work in this month? So this is June of 2017. So we want to see, okay, how much time did each employee work in this month? So we want to aggregate that information. A lot of the Python here is just, it's not
24:27
particularly impressive, but I wanted to point out the spots where we use regular Python mixed with things that are specific to OpenPy XL. So here, you can use a for loop versus a set
24:41
comprehension. I used a set comprehension because I wanted to be able to, I wanted a set of the employee IDs. I didn't want every occurrence of an employee ID because you'd have multiple. So I used a set comprehension here. Trey Hunter, who is here at the conference, does an exceptional talk on what he calls comprehensible comprehensions. And so you can
25:02
take, if you see Trey, ask him and tell him, I told you that. Is Trey here? He's probably in that room. So if you see Trey, ask him about comprehensions and tell him that Kojo told you to ask him. So he'll do a better job of explaining comprehensions than I will. But what you would do with an for loop, in a lot of cases, you can do with comprehension. So here, I'm creating this set comprehension of employee IDs just so I
25:24
have a list of the unique employee IDs. And that's what this looks like here. And then, I'm using that set that I called employee IDs one. And I'm using that to create a dictionary that takes the hours, I'm using list comprehensions here,
25:43
list and set comprehensions here. I want all the hours for that employee. I want a set comprehension of the call center. Each employee should only work in one call center. So here, I've got a set comprehension of one call center. A set comprehension of divisions, again, each employee should only work for one division. Each employee should only have one manager. So this is, again, the part about knowing your data. So I've got set comprehensions
26:03
that are building those things. You'll notice here, I've got row six dot value for row and demo worksheet dot rows. So in this case, demo worksheet dot rows is, again, that generator object that's giving us a row to time. So here, I'm saying I want
26:26
row index six for hours. If we take a look back at our spreadsheet, we see one, two, three, four, five, six, seven because spreadsheets, and this is where it gets a little tricky.
26:41
We'll see this later. Python indexes from zero. So if you have a list of four items, Python will count them as zero, one, two, three. Spreadsheets index from one. And so we'll see in the code a little bit later, we have to make that adjustment. So here, index zero, one, two, three, four, I can't use my key, zero, one, two, three, four, five, six.
27:01
So I'm getting the hours there, but I'm using that row dot value, row six, row index six, which is going to be a cell object in the row, then dot value. So I'm pulling the value out of that, and that's what's giving me my hours. Doing the same thing to get the cost center, the division, and the manager. I do QA where I work now, and I also used to be an auditor, so
27:25
I tend to try to, you want to test things and things of that nature. So I've got a little assertion here, because I know there should only be one cost center, one division, and one manager for employees, so I just do that there. And then I build this employee aggregate object. And again, so a lot of this is regular Python here. I've got some of the specifics
27:46
to OpenPy Excel. And then I print this employee aggregate object, just a pretty printed, a dictionary, just so it's clear. So what you end up with is an employee ID as the key,
28:00
and then the cost center, the division, and then the number of hours for that employee. And so this lets me take this spreadsheet, turn it into a dictionary, which could also be used as a JSON object. And I know from some filtering of the spreadsheet, some manual filtering of the spreadsheet, that I should have 49 employees. And so that's what's happening here.
28:23
Now, here we've looked at reading data from a spreadsheet, and then processing it and trying to get into something in Python. The next thing becomes, what if you've already got a Python program that's running, and you want those results to be written to a spreadsheet? Well, you can do the same thing with OpenPy Excel. So first, you need to create
28:41
a workbook. And I've given it the very creative name of output book. And so I've created this new workbook object called output book. And then you need to create, well, let's see, you can create a specific sheet. Here, I'm creating a sheet called output sheet. So output book dot create sheet, which is this create sheet, is a method that belongs to the
29:02
workbook object. And I'm giving it a name here, aggregate time. And I'm also giving this argument a zero. The zero argument means it's going to be the zeroth item in the workbook. Otherwise, by default, when you create a workbook, it will have a sheet one of the first object. So here, I'm saying make this the first item in the sheet that we see.
29:22
And then when we look at output book, we see that it's a workbook type object. Then I decide to build a header because I don't want to just write the raw data to the spreadsheet. I also want some sort of a header so the spreadsheet looks sort of organized when I give it to someone else so they can understand it. And so I'm just building a header here basically by just copying the values out of the demo worksheet. And so
29:44
again, I'm just accessing those cells directly. And then I'm printing the header to make sure it's what I want. And so I've got that same header. And then for output data, I build this table. It's a list of lists. And I move through this employee aggregate.
30:02
And I'm creating new row. I'm building these new rows. So I want the row with the employee, the cross center, the division, the manager, and the number of hours. But here, it's going to be the number of hours that were aggregated from the earlier dictionary that we saw. And then now, I'm assigning those values that are in this output data construct.
30:26
I'm writing them to the output sheet. And I've got nested for loops here because I'm writing them by row and by cell. And here, we see this row index that I've got here
30:44
and the column number. But I've got to use plus one because the indexes that come from Python start with zero. On the spreadsheet, they start with one. So that writes that stuff. And so here is the output data construct that I built. And so you see the first list is the header. And the next list is the aggregate numbers for each employee. So
31:05
this employee worked 160 hours in a month, so on and so forth. And now, I can save that. So outputbook.save. And then I give it a file name. And so this is the file name
31:21
of the file that it's being written to. And so this is the result. And now, this doesn't
31:49
take a huge amount of time. But it's a small amount of data. So there's this. So I've got this aggregate time sheet with those times that I can check the totals. So the total there
32:02
is, well, you can't see it at the bottom. It's 5,386 hours. And if I go back to the original spreadsheet, I've got the same total, 5,386 hours, which is here at the bottom, perhaps visible from the front row. And again, this isn't the most complicated of things.
32:28
But it gives you an idea of if you have a lot more spreadsheets to work with, 10 or 100, or you have a lot more data. The last thing we do is I can take that aggregate time
32:40
object that I created, and I can write it out as a JSON file. And so then what I end up with is, and so I get that as a JSON file that I can then use to configure something else or
33:07
to do other processing if I'd like. So the problems that you'll run into, so that reading the data, writing the data, that sort of thing, not terribly complicated.
33:21
The problems that you run into is that often a spreadsheet is going to be used as a visual medium. And so someone wants a spreadsheet to look nice. And so that spreadsheet might not make sense to code. If the only spreadsheet you ever get is one that looks like this, then you'll be fine, because you've got fairly well-structured data. But the reality is,
33:42
someone, the boss, wants the spreadsheet to look a certain way, and so it's been laid out. They've tried to do desktop publishing with it or whatever, and then you've got to sort of go through it. In those situations, you might be able to access individual cells, or you may be able to convince them to maybe change some of the formatting with the idea that, hey, we can speed this up by literally 100 times. So if you might have visual input
34:04
or you might have a visual output requirement, that's where Steinling can help you. You can make new friends by helping teach your coworker how to automate some of their simple tasks with Python. Again, the Python here that I demonstrated wasn't terribly complicated. The comprehensions were probably the most complicated thing. And so you can
34:21
teach them to read data and do some things with it and write the data back out. That's what I've got. So I am transition on Twitter. If you have questions or the slides and the code will be available in this GitHub repository very shortly. With the Python CSV module, you're able to use DictReader and actually get
34:46
named columns in and out. Does OpenPyXL support that? I'm not sure if it supports the name column. Here, I read things in by rows, but you can also do the same thing by columns. Okay. You mentioned that visual spreadsheets would not be a good candidate for this sort of
35:05
approach. Are there any types of actual data structures in spreadsheets that would not be good for programmatic analysis this way? Data structures in spreadsheets. Let's see. So if you have a lot of computations happening in macros,
35:24
and that's something I had not worked with very much, but if you've got a lot of macro calculations going on, that might be a little tricky. You should be able to grab either the data that's going into that macro or the results of those macro calculations. So that's probably what you'd want. There's a whole different approach that involves being able to run Python
35:45
inside of a spreadsheet. I might add that to this talk and update it later, but thus far, I've focused on just the files themselves. Is there any way to make pivot tables in it? I know you could do pandas. You could do a pivot table in pandas and write it statically to the Excel sheet, but is there any way to? I believe there is.
36:04
If you're looking to, because in my mind, all this stuff kind of gets rid of VBA. If you go to the OpenPy Excel documentation, I believe there is a pivot table function that's there. Off the top of my head, I can't recall. I personally am sort of a love-hate
36:20
person. Okay, yeah. Just before I ask my question, I've worked a lot with POI, and if you make a pivot table and then make it use a range, you can actually use something like this to populate your pivot table. So you can just sort of have a pivot table as a question for you.
36:41
Is streaming and stability, so have you noticed any bugs or any stability problems? And then when you're dealing with large amounts of data, is there any streaming interface that you're familiar with, and anything you'd comment on that? Not familiar with the streaming interface. So I haven't used this recently with huge amounts of data, so I couldn't really speak to
37:04
that. I think the fact that it's not pulling in at the time. I've seen large spreadsheets that have caused Excel itself to sort of slow down and run slowly, and so I haven't run this with those same, because those spreadsheets, unfortunately, were visually formatted, but I think the fact that this is creating this generator object and returning the data in small
37:22
pieces at a time would help alleviate some of that issue, but I haven't actually had a chance to test it. I need to create a fake thing with a bunch of data and try that. Hi. So your Excel was pretty nicely, obviously, formatted. Would it deal with cells or rows that are merged in the middle of the sheets? For example, a merged cell of Monday,
37:43
days of the week, and so on? It has some capacity to deal with that, but again, that's sort of a knowing your data type of thing, and I haven't specifically tried things with merged cells, but you can access individual cells, so that might be a situation where you might need to access an individual cell, because I'm not sure how
38:02
OpenPyXL sort of views that. Visually, just like with the hidden worksheet, you can't see the hidden worksheet, but OpenPyXL can see it in a list of sheet names, so I'm not sure how OpenPyXL sees those, because the merging is just a visual thing. It's not an actual data thing. Have you ever used XLRD or XLWT,
38:22
and how does it compare to... This is one of the more common questions I get with this talk. I've used those a little bit, but when I started using this library at the time, this is 2013, so those two libraries, XLRD and XLWT, they wouldn't work with XLSX files, and so I believe now they do, but at the time they wouldn't,
38:44
and so I played with them a little bit, but I was like, oh, well, I either have to take every spreadsheet, convert it to an XLS file, and use this, or I can just use a library that supports it natively. Thank you. I'm going to reveal my ignorance real quick, but what tool were you using to run your Python in a browser and trust the output?
39:04
That was Jupyter Notebooks, so this talk is sponsored by Jupyter Notebooks, purveyor of fine, so a Jupyter Notebook. All right, seeing none. Thank you, Kojo.