A documentation workflow loved by both Data Scientists and Engineers
Formal Metadata
Number of Parts | 637
License | CC Attribution 2.0 Belgium: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers | 10.5446/52460 (DOI)
FOSDEM 2021 | 135 / 637
Transcript: English (auto-generated)
00:06
Hello. Thanks for joining us today. My name is Colin Dean. I'm generally wearing a top hat at conferences and a scarf as well. However, I've let my hair grow out during the global pandemic, and my hat no longer fits on my
00:22
head, so I just have to go with this scarf for today. Lawyers generally say it's a good idea to say and display this, so the views expressed herein are my own and do not necessarily represent the views of my employers or associated organizations past, present, or future. I work for Target, an American retailer.
00:43
We're based out of our tiny Pittsburgh office. Like most other tech workers, I'm working from home during the pandemic, so I'm recording this from my home office in beautiful Wilkinsburg in the Commonwealth of Pennsylvania. I run Code and Supply, a Pittsburgh-based organization of thousands of software professionals. We run
01:03
some conferences that you may have heard of even outside the U.S., such as Abstractions and Heartifacts. I'm also president of a nonprofit network builder that is pivoting into the wireless ISP space in 2021. So here's the task and the meat of the presentation here. Our manager came to us one day and
01:26
told us about an upcoming off-site meeting in which we'd be asked to present about the work that we've been doing. A part of that necessitated a write-up detailing how our product worked at a high level. Multiple people would be consuming it. We knew that we needed a well-designed
01:42
document that could be easily read when printed, but also take advantage of the various aspects of the hypermedia that we were going to use to distribute it digitally. We'd most likely deliver a PDF, but a long web page was possible, too. We knew that we needed to provide good content, but it also needed to be well-summarized and be navigable.
02:03
Then, we hit a big change. There was a development pause. Our product development was paused, so we needed to document everything. It may not be our team that continues development on the product that we've been working on. With that development,
02:24
our scope grew and we knew that we had to go a lot more in-depth to capture the knowledge of our team for posterity's sake. For example, what we'd intended to be the depth of a quick tour of a house turned into being an owner's manual documenting every quirk and hidden compartment.
02:45
There were seven of us, normally co-located in the same office. We each had different concerns and interests to be recorded in the document. The engineers needed to show how the product works and how it was deployed as production software. The data scientists needed to show the business logic
03:03
behind that software, including equations and citations with many acronyms, initialisms, and other terms likely going to be in a glossary. We needed something that we could easily use and to which we could contribute simultaneously.
03:20
After all, we had just a handful of days to complete this document. We needed a workflow that would get out of the way and let us focus on content while still looking professional. It's a good practice to figure out things that you value about a system before you start building it. When I'm building software,
03:42
after the initial idea comes out, but before I start engineering a system out of it, I think about quality attributes, the base facts of thinking architecturally. We assumed that our content would be a lot of text and diagrams and equations. We wanted a markup format
04:02
that was easy to read and preferably didn't require a special program to read. That is, we wanted it to be reviewable in our own code review tool. And we wanted not to be able to control styling within the document, except for perhaps some semantic markers calling out special sections.
04:22
We wanted to be able to use LaTeX if we needed it, because we knew that we might want to draw something with TikZ, the drawing language built in, or some other esoteric functions of LaTeX. Lastly, we needed it to be
04:41
easy to use for both humans and computers. One command should build the document, and we should treat that document as a build artifact that is versioned and archived. The key idea is this. Treat documentation as source code. This concept of treating documentation as source code is probably not novel to
05:01
seasoned documentarians, but for many for whom documentation is all too often an afterthought, like engineers and data scientists for the most part, this concept is absolutely life-changing. The things that we wanted to avoid caused us to eliminate
05:23
Microsoft Word, Apple Pages, just straight up LaTeX, our Atlassian Confluence wiki, and virtually every less well-known text format. We considered some of the newfangled tools like mdBook, but found that they didn't have the quality of PDF output and the integration with the niceties
05:43
of the Pandoc ecosystem. That's right, we really wanted to avoid this. The team really despised working in Word and strongly wanted to avoid authoring in our Confluence wiki system, which we treat as ephemeral and quickly out of date. So, we came up with this solution.
06:05
We built a solution based on these components in about two weeks, while also writing prose. This got us simple text, easily read, reviewed, and agreed upon through our established systems of consensus and code review,
06:22
and produced the same way every time. We could even automatically notify stakeholders whenever we released a new version. The biggest benefit though, LaTeX typesetting without LaTeX, or really LaTeX when you needed it. Our final document had very little
06:42
raw LaTeX outside of some equations and a couple of diagrams that were simple enough to be quickly redone in TikZ instead of leaving them as PNG or SVG. So, let's talk about Pandoc, the universal document converter. Pandoc was started a while ago and hit 1.0 in 2008 with the 2.9 release
07:02
coming out just as we were starting to work on this. Now in February of 2021, we're at 2.11, maybe nearing a 2.12 release. It's written in Haskell, but it supports Lua for writing plugins that process the internal format that Pandoc uses
07:22
to represent dozens of document formats it can read and write. You can also write some plugins in other languages, and we'll talk about that later. First, some Pandoc basics. Pandoc is available in virtually every package manager, and there are downloadable installers available on pandoc.org.
07:42
Pandoc attempts to figure out the input and output formats based on the file names, but oftentimes it's better to be explicit. This is an example output of running Pandoc on an early version of these slides, which are written in Markdown and use Pandoc to produce the code which comprises
08:02
the slide. Break down chapters and sections into separate files whenever you're actually building out a document in this. Pandoc easily concatenates input, as you
08:23
see here in the command invocation. Note the use of filters, the choice of PDF engine that enables XeLaTeX in order to use Unicode and some other implementation-specific features. Note the choices about the table of contents, section numbering, list of figures, tables, and
08:43
bibliography. Command line options can pass metadata into the document as well, or you can use YAML in the front matter of a Markdown document. This metadata and some other options easily
09:02
go into the YAML as you can see here. Pandoc supports a configuration file containing the defaults. I use this configuration file in almost all of my newer projects in order to shorten the makefile. Both reveal.js and PowerPoint are supported first class for output.
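As a rough illustration of the kind of invocation described above, the following Python sketch assembles a Pandoc command line as an argument list without running it. The file names, the title, and the helper name `build_pandoc_command` are assumptions made for this example and are not taken from the speaker's build. The flags themselves are standard Pandoc options, with `--citeproc` available from Pandoc 2.11 onward.

```python
import shlex
from typing import List, Optional


def build_pandoc_command(
    chapters: List[str],
    output: str,
    bibliography: Optional[str] = None,
    title: Optional[str] = None,
) -> List[str]:
    """Assemble a pandoc argument list; nothing is executed here."""
    cmd = [
        "pandoc",
        "--from=markdown",           # be explicit rather than rely on file names
        f"--output={output}",
        "--pdf-engine=xelatex",      # XeLaTeX for Unicode support
        "--toc",                     # table of contents
        "--number-sections",
        "--filter=pandoc-crossref",  # external filter, installed separately
    ]
    if bibliography is not None:
        # Built-in citation processing (Pandoc 2.11 or newer).
        cmd += ["--citeproc", f"--bibliography={bibliography}"]
    if title is not None:
        # Metadata can also live in YAML front matter instead.
        cmd += ["--metadata", f"title={title}"]
    # Pandoc concatenates the input files in the order given.
    return cmd + list(chapters)


if __name__ == "__main__":
    # Hypothetical chapter files, shown only to print a sample command.
    print(shlex.join(build_pandoc_command(
        ["01-intro.md", "02-architecture.md"],
        "paper.pdf",
        bibliography="refs.bib",
        title="Example Paper",
    )))
```

Passing a single chapter file to the same helper gives the quick one-section build mentioned later in the talk.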
09:21
This slide deck, as I said, is written in Markdown and transformed into a reveal.js presentation via Pandoc. I will advocate for using a build system to build whatever it is you're going to be building with Pandoc. I strongly suggest avoiding writing a script.
09:43
I have a maxim: any sufficiently advanced build script eventually simply re-implements make. Some plugins that we want to talk about, these are some very common ones. We've used citeproc and crossref extensively, and I believe in recent
10:01
versions citeproc is actually bundled into Pandoc itself. Include code pulls in files, which is great for pulling in snippets from an external file that might change over time. I'm making extensive use of panpipe in a workshop that I built. I can put code examples directly into the document, write those
10:22
examples to a file during the document's build process, effectively using Pandoc as a build system, and then subsequent code blocks later on in the document can execute on the files written to disk. This way, I'm always up to date on my code examples, and the code is always checked during the build.
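The idea of having code examples checked during the build can be sketched without the actual filter. The Python below is a stand-in for illustration only: it is not panpipe and does not call Pandoc. It assumes examples live in fenced `python` blocks, pulls them out of Markdown text with a regular expression, and confirms that each one at least compiles. The function names are hypothetical.

```python
import re
from typing import List

# Three backticks, built programmatically so this example can itself
# sit inside a fenced block without confusing a Markdown parser.
TICKS = "`" * 3
FENCE = re.compile(
    rf"^{TICKS}python[ \t]*\n(.*?)^{TICKS}[ \t]*$",
    re.DOTALL | re.MULTILINE,
)


def extract_python_blocks(markdown: str) -> List[str]:
    """Return the source of every fenced python block, in document order."""
    return FENCE.findall(markdown)


def check_blocks(markdown: str) -> List[str]:
    """Return error messages; an empty list means every block compiled."""
    errors = []
    for index, source in enumerate(extract_python_blocks(markdown), start=1):
        try:
            compile(source, f"<block {index}>", "exec")
        except SyntaxError as exc:
            errors.append(f"block {index}: {exc.msg}")
    return errors
```

A build could fail whenever `check_blocks` returns a non-empty list, which gives a weaker guarantee than executing the examples but catches stale or mistyped snippets early.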
10:42
These two together are fantastic for ever-changing data and source code. You can write plugins in Haskell, Lua, and Python, and a few other languages as well. I strongly recommend using Lua, which doesn't require installing or compiling anything additional.
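To make the plugin idea concrete, here is a minimal sketch of a JSON filter in Python using only the standard library. The speaker recommends Lua; Python is used here purely for illustration. The sketch assumes Pandoc's JSON representation of a document, in which each element is an object with a `t` (type) field and usually a `c` (contents) field, and it rewrites every `Strong` inline as `SmallCaps`. The function names are hypothetical.

```python
import json
import sys
from typing import Any


def strong_to_smallcaps(node: Any) -> Any:
    """Walk the JSON AST recursively, returning a copy with Strong -> SmallCaps."""
    if isinstance(node, list):
        return [strong_to_smallcaps(item) for item in node]
    if isinstance(node, dict):
        converted = {key: strong_to_smallcaps(value) for key, value in node.items()}
        if converted.get("t") == "Strong":
            converted["t"] = "SmallCaps"
        return converted
    return node  # strings, numbers, booleans, None


def main() -> None:
    """Filter entry point: read the AST on stdin, write the result to stdout."""
    document = json.load(sys.stdin)
    json.dump(strong_to_smallcaps(document), sys.stdout)
```

Saved as an executable script that ends with a call to `main()`, something like this could be passed to Pandoc's `--filter` option, which pipes the document through the program as JSON.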
11:00
The Lua engine is built right into Pandoc. Images can be in a variety of formats. I tend to convert everything to PDF and include the PDF version in the markdown, because I know that PDF is going to look right versus SVG, because SVG output unfortunately rarely converts fonts to paths.
11:21
In this lower example, I redrew a diagram using TikZ, and it looks a whole lot better than the PNG derived from a screenshot of a PDF. Using the citeproc filter, we can easily enable bibliographic references using a BibTeX format like you see at the bottom there, and then
11:42
referencing the key in the actual prose on top. Git was originally written to manage the Linux kernel source code. Linus Torvalds himself wrote it, if you didn't know that. But it was really popularized in the 2000s and the early 2010s by
12:01
GitHub. GitHub is a widely used public website with an on-premises version called GitHub Enterprise. That's what we use at our company. You can also use GitLab or Gitea or a host of other systems. In fact, Code and Supply uses GitLab and Meta Mesh uses Gitea.
12:23
Next, I'll tell you about how my team used this powerful tool to create a workflow that enabled us to collaborate without having to pass around a file. We used these four primary tools. Pandoc, Git, GitHub, and Drone CI. Drone CI
12:40
is a common open source continuous integration platform. For those of you who are familiar with GitHub, you will certainly recognize this workflow. Note how changes move around the system. You'll see that you have something in your working copy, you can make some changes to it, compile
13:02
it with Pandoc, make some fixes, eventually you've reached a state that you like, commit, push up to GitHub or GitLab, whatever system you want to use, have your CI system pull down the changes that you made, validate it and push a message back into your
13:22
version control system, your code review tool, and then you can keep up this loop until you've reached a desirable version, compile it, save some build artifacts, and then you've got a release that you can push out. It's the exact same workflow as releasing software, only instead of releasing a
13:42
binary or a package, you're simply releasing a document as a PDF or HTML file or whatever other format you'd like to use. There are some Markdown specific text editors out there. I'm a fan of MacDown, but all too often just come back to using Vim with a few plugins for easily editing
14:04
Markdown. The style of writing where you write one sentence per line, regardless of the language, makes it so much easier to suggest small changes to documents during your review process. When you limit relevant content
14:22
to a single file, you can definitely extract a single chapter or just a few chapters into a PDF. This is great for a summary PDF or for test building only one file instead of doing the whole large document. Our final paper, which was about 45 pages,
14:41
takes around 12 seconds to compile on my 2019 MacBook Pro. Compiling only one section takes about two seconds. Whenever you're committing with Git, you want to still use Git the same way you would with source code. You want to use Git commits to tell a story about the changes. Now, because you're
15:04
writing prose, you're describing prose, you don't want to write the prose all over again, so it becomes an exercise in summarizing the content change as small as you can. Next, we'll
15:20
talk about code review and pull requests and merge requests, so I suggest some further reading or watching after searching for "Colin Dean code review" on your favorite search engine. I recommend using some automation like CODEOWNERS in GitHub or a pull request bot or some other automatic assigning system in order
15:41
to ask for a review by your team. Every time I've not set CI up on a repo using what we call my white paper template, someone has managed to merge to master some minute change that breaks the PDF build, and it's been difficult to track down.
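The CI check that guards against this can be as small as running the build and failing loudly when it fails. The sketch below is an assumption about what such a step might look like, not the team's actual Drone CI configuration. `run_build` is a hypothetical helper, and the build command is whatever the project uses.

```python
import subprocess
import sys
from typing import List


def run_build(command: List[str]) -> int:
    """Run the document build and return its exit code (0 means success)."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the tool's own error output so a reviewer sees the cause.
        sys.stderr.write(result.stderr)
    return result.returncode
```

A CI script could end with `sys.exit(run_build(["make"]))` so that a broken PDF build marks the pull request as failing before it is merged.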
16:02
I know what to look for now because of more than a year of experience with this system. For example, XeLaTeX doesn't like Greek letters inside of formatting inside of equations, which is apparently an older practice that was required in earlier versions of LaTeX and is no longer required.
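On the review side, the one-sentence-per-line style mentioned earlier pays off because review tools show line-based diffs. As a small illustration with invented sample sentences, Python's `difflib` shows that changing one sentence touches exactly one line when each sentence sits on its own line.

```python
import difflib
from typing import List


def changed_lines(before: str, after: str) -> List[str]:
    """Return only the removed and added lines of a unified diff."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    ))
    # The first two entries are the '---' and '+++' file headers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


before = "Pandoc reads Markdown.\nIt writes PDF.\nWe version the output."
after = "Pandoc reads Markdown.\nIt writes PDF and HTML.\nWe version the output."
for line in changed_lines(before, after):
    print(line)
```

Had the same three sentences been wrapped into one paragraph line, the diff would report the whole paragraph as changed, which makes small suggestions harder to review.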
16:24
I highly recommend using a good code review tool like GitHub's pull request review system; GitLab has a great merge request system, too. Use Gerrit or some other good review system for editing content. Use it as it was intended
16:44
to be used and you'll be in good shape. Of course, this was not without some pain. This was the new thing. Almost all new things have some kind of pain to them. Some things were easy to fix, others were harder, and of course some were yet unresolved. Some users had problems installing
17:02
large dependencies on our corporate network. Some users found minute differences between Pandoc Markdown and CommonMark to be frustrating. Some wanted to just use LaTeX or Microsoft Word. We never really established a reliable way to make changes and see them automatically. I've since found
17:23
that Evince on Linux is by far the best PDF reader for this workflow, as it automatically reloads PDFs when they've been changed, returning to the exact page that you were looking at. Also, in the production of this system, I was all too often a single point of failure. Anytime you're building
17:43
a new system and you're kind of using it as you're going, you're going to hit a problem like this where the person building it is slowing everybody down. Lastly, authors less comfortable with TeX equations were frustrated by the tooling to preview their equations quickly. They did eventually
18:03
find LaTeX and MathJax to be useful for that. Authors who wanted to deviate, of course, had to own their own sections entirely. It was entirely up to them. But the greatest risk of
18:22
additional transformation tools whenever somebody wanted to use LaTeX or Word or something else? Transformation output overwriting other versioned files bit us really hard. In a subsequent use of this system, one person so strongly preferred to use LaTeX and
18:42
another preferred to use RMarkdown and both converted to RMarkdown to be versioned. This went okay, but it had some caveats. I recommend avoiding this entirely. Store the original document in the repository and let the build system handle
19:02
conversion from the original format to something that works with Pandoc as like a pre-processing step. You don't have to take my word for it, though. "Leveling the playing field for contributions, great for collaborating and building documents with all the features of LaTeX," said
19:22
one of my co-workers, a doctoral guy who absolutely loves LaTeX and uses LaTeX for himself all the time. He also said, "I miss having the fine control of figures and subfigures and positioning," and that's something that Pandoc doesn't really handle super well. The current use of our system, though,
19:44
is growing significantly. Our leadership was quite impressed, not only with the document that we produced, but also us telling them about the system that we built to produce it. We've since expanded its use to several other teams, including one fork
20:00
of my original repository, used as the basis for a documentation spanning several teams. Content is brought in via Git submodules, where each repository is also its own separate document. One executive decided to use Pandoc to write a book after seeing what we'd done.
20:21
That's awesome, because the basis of this system was what was used to write the first edition of the book, A Friendly Introduction to Software Testing by Bill Laboon. Many tools were involved in the creation of this. This is just some of them here. We could not have possibly done this without Pandoc, obviously. Make was such an important
20:42
tool in it, as was Git. I do want to call out here Tectonic. Tectonic is a newer implementation of LaTeX in Rust, and it's fantastic to work with. A long-term vision for this is to have a single source for this document system that gets turned into a bunch of easily consumable formats,
21:03
creating a searchable library of sorts, kind of like an internal archive.org for our internal scientific documents. And then there's styling, of course. Our documents use mostly out-of-the-box Pandoc and LaTeX styles with a few customizations. I'll probably leave Computer Modern as the font, though, because it just conveys a
21:24
certain air of respectability. So, when you get a chance, go check out Pandoc at pandoc.org. I'm sure it'll become a part of your documentation workflow. Here's some references and attributions,
21:42
icons, of course, by Font Awesome, and you can't really click on the link here, but check out the slides later to read my friend's book, A Friendly Introduction to Software Testing. These slides are available at my own GitHub profile in the document workflow
22:01
folder, or you can see the rendered version at speakerdeck.com. That's all folks. Thank you so very much for watching. I'm Colin Dean.
22:30
So, what questions do we have? We have a lot of questions. The one that, the latest one that I saw, and, well, you need to know that
22:41
I'm coming from an organizational, architectural viewpoint always, was how do you share this with the wider organization? Are people assumed to be able to pull it from your publishing flow, or in what form do you share it further? So, when you say, okay, the
23:02
white paper is finished, how do you share that? So, we have as a part of our continuous integration, continuous deployment system, we would build the document on every pull request, every push to master, and then we could do tagged releases, you know, git tag
23:21
on a certain commit, push up that tag to our CI/CD system. It would build a release copy of the PDF and push that out to GitHub releases, and GitHub releases became like the master download location for the file. Those who wanted to watch
23:42
the repository could watch just the releases, and they would automatically get an email anytime there was a new release. They could open that email, click on a link, and go download the latest version of the paper. And that was an easy thing for everyone to adopt, to look for things in GitHub releases.
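The behaviour described in this answer, building on every pull request and push and releasing on a tag, can be sketched as a small decision function. This is a hypothetical illustration and not the team's CI setup. It assumes the CI system exposes the full Git ref of the build and that the default branch is named `master`, as in the talk.

```python
def classify_ref(ref: str) -> str:
    """Map a Git ref to the kind of document build it should trigger."""
    if ref.startswith("refs/tags/"):
        return "release"  # build the PDF and publish it as a release asset
    if ref == "refs/heads/master":
        return "build"    # build and keep the PDF as a versioned artifact
    return "check"        # pull requests and other branches: build to validate only
```

Keeping this decision in one place makes it easy to see that only tagged commits ever produce a published copy of the paper.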
24:02
Yeah, we don't know how many people actually consumed it that way, because we as a development team would often just send the PDF by email to the people who are going to get it to make like damn certain that they were going to get it, rather than relying on this
24:21
slew of GitHub emails that virtually everybody gets. I don't know how much GitHub email our executives get, and that's who the primary audience for this document was at the time we started writing it. While your talk was running, there were
24:42
several questions back and forth about alternatives to Pandoc and their pros and cons. I'm not going to repeat that. But if people want to find you with questions and the merits, where do they find you again? Can you say that or type it in the chat here? Yeah, I'm at Colin Dean
25:02
just about everywhere on the internet. There are very few services where I've not managed to preserve my namespace. So if you see, especially the picture of me with like a top hat and a scarf, it's probably me. That's definitely something that I do. And once you get to the hairdresser,
25:22
it will fit again. So another organizational question is, you say one of the most important takeaways is to definitely treat docs as source code. Now, if this is not something that people do intuitively, because once they do
25:42
this, you don't have to advertise this idea, but if it isn't happening, how do you get there? I like to explain the benefits of it as an architectural decision, both an architectural and a collaboration decision, and ultimately
26:02
collaboration is an architectural feature from a certain point of view. So you can build it in and then that's it? No other way? You can. It's not something that needs to be built in from the start. It is best if it is, but we don't have that
26:21
luxury all the time. So there are ways that you can. Because it's source code, you're infinitely modifying it. You can add things to the source code that enhance documentation and the documentability of the project, of the code base as time goes on.
26:41
We started with a very simple one-page markdown file documenting this particular project that we were working on. That eventually grew unwieldy, so we turned it into multiple markdown files. And then it turned into a Hugo static site generator
27:02
website that we had some information on. And then when we realized, oh, we need to write something more like a document, then we decided to go down this route of producing a very nice looking, scientific looking PDF. And there's ways that we could take the
27:22
information that we wrote for this and either integrate it into the Hugo site, which is still relatively small, mostly usage instructions and a little bit of architectural information, but not a whole lot of the explanation of the business logic of the application. All of that's in the paper. We'd like to eventually merge these two things and choose either Hugo or
27:42
this Pandoc system. And it's probably going to fall on the Pandoc system. And there's so many static site generators out there that use Pandoc, we can assuredly find one that will work for our purposes and remove Hugo from the equation. You've got the question also, what have you done with Table of Contents? Have you also used
28:02
Pandoc for collections of documents? So Pandoc has a Table of Contents feature. It's suddenly you have a table.