Automated Documentation Proofreading

Video in TIB AV-Portal: Automated Documentation Proofreading

Formal Metadata

Automated Documentation Proofreading
igor: Making Documentation Easier
Title of Series
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Release Date

Content Metadata

Subject Area
Making documentation easier and better by automating tests for errors in language, formatting, and usage. Few people like to work on documentation. There are numerous rules for wildly-varying documentation formats, many rarely used and hard to remember. An automatic proofreader to check for errors ranging from spelling to meeting all the arcane formatting rules of the different toolchains would relieve much of the stress. Not only will this encourage improving the documentation, it helps to prevent errors in the first place, and detect those that have slipped through already. Clean, consistent files are easier to maintain, expand, and convert to new formats. The automated proofreader, named "igor" after a famous lab assistant, helps the writer focus on improving the content of their document.
So, my name is Warren Block. I'm here to talk about automated documentation proofreading, and if that's not the talk you were expecting, you're in the wrong room. I know the other talks you could attend. I'd like to thank you for flying wblock Airlines. Exits are at the rear and the front of the room. The question we start
with is: why is documentation hard to write? Well, there are a lot of reasons, but one of the big ones is rules. Rules, rules, rules. There are so many different types of rules. There is a particular type of rules for text files, and those are pretty lax: spelling, punctuation, line wrap. That's fairly easy. There's nothing being strict except your readers, and if you disappoint them, you let them down, but they probably won't complain. They may go somewhere else. Then you get to mdoc, which is far, far more complicated. There are many more rules with this. It's not particularly strict, but to get the output you want, you have to comply with those rules. And of course for FreeBSD we use SGML currently for much of our documentation, and there the rules get a lot weirder, a lot more complicated, a lot more involved. And there are two sets of rules. There are the toolchain rules, which say you must use this type of formatting, this type of tags, and there are the FreeBSD Documentation Project rules, which say you should conform to the standards, you should use this type of wording, you should indent with two spaces per line, that type of rule. And those are informal rules that are not enforced by anything so far. Another problem that makes documentation hard to write is that existing documentation is inconsistent. Some is really well done, and I'm not talking about the writing style, I'm talking about formatting, tags, the complete package. Others, not so much. And for me at least, and I'm sure many other people, the way you start working on this stuff is not from a blank slate. You don't read the mdoc man page and jump in writing from scratch. You find one that's similar to what you're looking to do and copy it. The problem being, if the one you take to copy was not done very well, it gets you started on the wrong foot. And the reason we have those inconsistent examples is that at present there isn't really anything out there looking at them. Here's another example: DocBook
SGML. That might help a little. There's an error in this paragraph. The toolchain doesn't consider it an error; moreover, it doesn't report it as an error. Who sees it? [unintelligible] That is supposed to be a closing paragraph tag. I don't know what it is with those; we had and still have many of them. They're hard to see: your mind or your eyes are expecting to see a closing para tag, so they zoom right over them. So the toolchain doesn't care. Why does it not help us? It's a computer; it's supposed to automate things, it's supposed to remember stuff like that.
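The error described above, an opening paragraph tag typed where a closing one was meant, can be caught with a simple line-by-line check. The Python sketch below is an illustration only and is not taken from igor; the function name `find_unclosed_paras` is hypothetical. It reports a `<para>` that opens while an earlier one is still open. This is a heuristic: it would also flag legitimately nested paragraphs (for example inside a list within a paragraph), so a real checker would need more context to avoid false positives.

```python
import re

# Matches <para ...> and </para>, but not <parameter> or similar tags.
PARA_TAG = re.compile(r"<(/?)para\b[^>]*>")


def find_unclosed_paras(lines):
    """Return (line_number, text) for each <para> opened inside an open <para>."""
    problems = []
    para_open = False
    for number, line in enumerate(lines, start=1):
        for match in PARA_TAG.finditer(line):
            if match.group(1) == "/":
                para_open = False
            elif para_open:
                # Second opening tag with no close in between.
                problems.append((number, line.strip()))
                # Reset so that one mistake produces one report.
                para_open = False
            else:
                para_open = True
    return problems


if __name__ == "__main__":
    sample = [
        "<para>First paragraph.<para>",
        "",
        "<para>Second paragraph.</para>",
    ]
    for number, text in find_unclosed_paras(sample):
        print(f"{number}: possible missing </para>: {text}")
```

Resetting the state after a report is a design choice made here to keep the output short, in line with the talk's later point that true errors should not be lost in noise.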
And now we come to the question of why worry. It's documentation. Even if it was a program: if it builds, ship it, good enough. I mean, it's secondary, right? Consistency in the documentation encourages quality in the documentation and in the programs themselves. You want to have that high, consistent level of quality, or the end user and other users will be disappointed by varying levels of quality. And of course also for maintenance: if the documentation is all consistent, it's easy to maintain, it's easy to modify, it's easy to understand, both for existing documentation writers and new ones. And then of course, if your documentation is consistent, it's easier to convert to other formats. An example of that: right now we have DocBook SGML, and we really, really want to get that into DocBook XML, but not all those documents are consistent, and the conversion can fail. And of course there's talk of [unintelligible]; if our man pages are consistent, automated conversions can apply to that. And there are other future formats that may not have even been invented yet. I personally kind of like AsciiDoc, which is sort of a step in an opposite direction, or a different direction, than most are going. But the point of that is, if all your documents are consistent, they can be automatically converted. And again, another part of why you shouldn't just ship it is entropy. Problems accumulate if there's nothing validating these documents other than the toolchains, which may or may not tell you. For an example of that, the FreeBSD Porter's Handbook is 16 thousand lines of DocBook SGML, about 50 thousand words. It's a book. To fix whitespace problems alone, that would be indentation, tabs where there should be spaces, tabs mixed inside content, or whitespace at the end of a line, required an 8 thousand line commit on a 16 thousand line document, and then another 4 thousand line one. Now, it's not 12 thousand lines, because some of those overlap, but
that's appalling. If that was a program, it wouldn't compile. And that type of entropy, that noise, accumulates because you have many, many different people contributing small pieces, rewriting small pieces. We need to avoid that, because that particular document was hard to work on. The indentation levels were wrong, it had many problems, and because it was so hard to work on, it discouraged fixing problems. And this book is one that, I mean, every ports committer reads, or, let's be honest, parts of it, right? You just flip to the page you need and use that. So what can we do? Well,
let's make things easier for writers. We want to encourage writing. The best program in the world is useless unless you can tell the user how to use it, and that's what documentation is for. Let me say that again: the best program in the world is useless if the user can't figure it out. So you can write the best thing there is, but that work may be lost. The other thing we need to do is make it easy for people who rarely work on documentation, because that is about everybody. When was the last time you worked on a man page? For some people, last night; for some people, half an hour ago; but for many people, 6 months ago or a year ago. And it's just impossible to remember the right formats and rules for man pages when you work on them rarely. Most people work on mdoc or DocBook rarely, and the same thing applies. Maybe you just want to do a patch to fix one part of the book. If you haven't looked at it in 6 months or a year or forever, those rules will not come back to you. I tell people I paint my house every 5 years. At the beginning of it I'm an amateur, at the end I'm a professional, and then 5 years later I'm an amateur again. It's the same thing. So if we can make it easier, we can encourage programmers to document their work. Like I said a couple of days ago, programmers don't write man pages because man pages themselves suck to write; I can't be more specific than that. So let's make them easier, and we can encourage programmers to document their work. And we can try to avoid that situation where they fixed a long-standing problem or added a long-wanted feature and it's just not in the man page. Let's fix that; let's make it easier. We can also encourage end users and occasional users of a particular program to contribute to the documentation, and that is tremendously important. Every programmer will tell you: there are ways my program was used that I never anticipated, and there are end users who tried to do things I thought were obvious and yelled at me later because they weren't. So we can get that
feedback from users and get it in there, those common experiences, which is valuable. And finally, we can make it easier for writers to expand and improve the documentation we have, which is the point. We want to be clear, we want to be concise, and we want to be thorough, and if you have to worry about "well, I forgot to put a blank line on line 17", that distracts from trying to be creative and improve the documentation, what it's saying, what it's talking about. OK, so what can
automated proofreading do? Because clearly there are some things it can't; there are some things the human has to be responsible for. Well, first of all, it can remember. I suffer from CRS. Does anybody else have that? CRS: can't remember stuff. A computer, I mean, that's what it's for: make a program where all the stupid little rules that the human has trouble remembering can be checked by a program. It can also help us find errors, either errors that the toolchain would find that we want to catch first, or errors the toolchain just ignores, of which there are many. And it can help us comply with standards. We have two kinds of standards, like I talked about before, and the FreeBSD Documentation Primer has a list of what it calls standards. They're rules, but they're effectively suggestions, because there's nothing to enforce them. And this is sort of a contentious thing: at the end of a sentence, use two spaces. I didn't realize that was contentious, but it is. And it doesn't really matter, because that's what the Documentation Primer says to use, so if you want to get your stuff in there, use two spaces. The problem we had was that there was nothing to check that. With an automated proofreader we can help encourage people to comply with those formal and informal rules by saying "hey, this is not in compliance". And I've seen this happen: people will fix those, not because they agree, but just to get the checker to shut up. "Leave me alone, get out of my face." [unintelligible] And we can also use this to help keep mistakes out of the tree. Database people will tell you it's far easier to keep mistakes out by rejecting them than to try and find them afterward. That goes back to the entropy example: noise and errors accumulate, and it becomes such an ugly job to fix those that it doesn't get done. So if we keep them out in the first place, we avoid that. And the
end result of all this is that we can let the writer concentrate on the message. That's what we want the human to do: the hard work of telling you what this program does or how to use this feature. So what tasks can be automated? Well,
for all files, we can check simple things. We can check spelling, but we do this in a different way, because honestly I didn't want to write a spelling checker that would be able to spell-check any random text file. C source files have certain rules for spell-checking. So what we did was look through FreeBSD man pages, text files, and DocBook source files for existing misspellings, and made a list of known misspelled words. That tends to work, because you have a fairly stable population of people committing to FreeBSD source or the documentation, and it doesn't change that much. A few people start new, a few people leave, but the same people tend to make the same mistakes, so we can catch those. I won't point fingers, but it is very consistent. I've gotten to the point where, when I look at a file, I can sometimes tell from the consistent misspellings who did it. The other thing we have a problem with is repeated words, and that's the way people write on a computer, right? You'll be writing a sentence, you pause to think, and you end up retyping the last word. So you get lots of "is is", and they're very, very common. When you're reading, that pauses you. It's like a verbal pause, "of of", and it breaks your train of thought. We can catch those very easily, because we don't care what they are. We just look for one word, and if the next word after it is a duplicate, we say "hey, here's a duplicated word". And bad phrases. I want to be careful about this, because I call these things bad phrases. There's "to for"; those are in there occasionally. Some people seem to think those come from non-native English speakers, but in my experience people who learned English as a second language speak it better than the people who learned it as a first. I think it's another one of those pause things, where you're writing and you pause to think and then you write something odd. And then we can test for style, and what we have right now is a fairly simple
minded thing. It tests for usage of words and suggests other options, and there are reasons for all of these, and I'll show some of those later. Also, on man pages it checks to see if there are examples. As we said earlier this week, a trivial example is better than no example. There are certain things we can test
for from the FreeBSD guidelines: sentences should always begin on a new line. Ours don't always, but we can check that. We can also check that the document date was updated when nontrivial changes were made. This is kind of what started this whole thing, because I had Glen Barber say "yeah, you should change the document date", and I knew I had to change something, but it never occurred to me, and I'll bet money that 6 months down the road I will forget that again. You don't normally change it, and it should be the last thing changed before committing. And then the structure: mdoc(7) says there are 8 minimum macros you need for a man page. Does anybody know what they are? I don't, except that I'm looking at them here. Those are the 8 minimum macros, in that order. Many of our man pages meet that; many do not, and some of them that don't are in the contrib section.
For DocBook SGML, we can test a fair amount. We can test for whitespace, like I talked about earlier. We can test for indentation, which is actually fairly tricky. We can check tag usage style for certain tags: if you put a programlisting tag inside a paragraph, it leaves a huge, huge gap in your output document. [unintelligible] in HTML, but it looks bad. Let's not do that; let's warn the user. Title capitalization: we have standards for that, but we don't enforce them. It's actually the Associated Press style that is checked for there. And so finally, the result of this is igor, the lab assistant. You may have seen Igor in movies from some years back. Some of the design goals with this: it must be quick and easy to use. It shouldn't require setting anything up. It should auto-detect the type of the input file; you shouldn't have to tell it. It should handle multiple files and compressed files, like say a directory of man pages, if you want to check everything in it, which I've done. There are plenty of errors in there. And I don't want to have to decompress each one and feed it to something; it should do that for me. We should also test for conformance with the FreeBSD Documentation Primer, because we can encourage those standards. And sometimes when you're modifying a man page or another file, you only want to test for errors in the section that you're editing. Many of our files, like the Porter's Handbook, have other errors, but you don't want to take on the whole job of fixing everything at once. So we should be able to test just for whitespace or just for indentation. And it should avoid false positives, so true errors are not lost in the noise.
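Two of the design goals above, handling compressed files and detecting the input type without being told, can be illustrated in a few lines. The Python sketch below is not igor's code; the names `read_lines` and `guess_type` and the detection rules are assumptions made for illustration. It recognizes gzip input by its two-byte magic number and guesses the document type from the content.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip file


def read_lines(path):
    """Return the text lines of a file, decompressing gzip input on the fly."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


def guess_type(lines):
    """Guess the document type from its content (assumed, simplified rules)."""
    for line in lines:
        if line.startswith((".Dd", ".Dt", ".Sh")):
            return "mdoc"
        if "<para" in line or "<!DOCTYPE" in line:
            return "docbook"
    return "text"
```

Checking the content rather than the file name means a compressed man page and a plain one go through the same code path, so the user never has to decompress anything by hand.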
The implementation: well, it's written in Perl, but whatever. And I'm serious about that. I dare you to write something better so I don't have to work on this anymore. I will also contribute to yours; I don't care what it's written in. But it is regular expressions all the way down, so you may want to factor that into your choice of language when you do that. All it does, really, is apply a bunch of tests, usually regular expressions, to each line. And what does it look like?
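The approach described above, a set of regular-expression tests applied to each line, can be sketched briefly. The Python code below is an illustration and not igor itself, which is written in Perl. The misspelling list and all function names are made up for the example. It covers three of the checks mentioned in the talk: trailing whitespace, repeated words, and a list of known misspellings, and it marks each problem with brackets in the context of its line.

```python
import re

# Illustrative entries only; the talk does not list igor's actual words.
KNOWN_MISSPELLINGS = {"recieve", "seperate", "occured"}

REPEATED_WORD = re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE)
TRAILING_SPACE = re.compile(r"[ \t]+$")
WORD = re.compile(r"[A-Za-z']+")


def mark(line, start, end):
    """Return the line with the span start..end wrapped in brackets."""
    return line[:start] + "[" + line[start:end] + "]" + line[end:]


def check_line(number, line):
    """Yield (line_number, message, context) for each problem on one line."""
    match = TRAILING_SPACE.search(line)
    if match:
        yield number, "trailing whitespace", mark(line, match.start(), match.end())
    for match in REPEATED_WORD.finditer(line):
        yield number, "repeated word", mark(line, match.start(), match.end())
    for match in WORD.finditer(line):
        if match.group().lower() in KNOWN_MISSPELLINGS:
            yield number, "misspelling", mark(line, match.start(), match.end())


def check_lines(lines):
    """Apply every test to every line, in order."""
    for number, line in enumerate(lines, start=1):
        yield from check_line(number, line)


if __name__ == "__main__":
    sample = ["This is is a seperate test.", "Trailing space here. "]
    for number, message, context in check_lines(sample):
        print(f"{number}: {message}: {context}")
```

The repeated-word test does not care which word is repeated, as the talk notes, so it will also flag the occasional legitimate repetition such as "that that"; a human still makes the final call.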
Can everybody see that? I should turn off the lights. Don't worry about this; we'll have a better one in a second. Here you can see we're running igor with the -D flag, which means ignore the document date. We're just checking these files for existing errors; we're not making changes. This is a list of compressed man pages, and we're piping the output into less. Here it shows the name of the file, and then in this one, for example, on line 102 it found the bad phrase "to for", which is highlighted with brackets, with the rest of the line in context. And then spelling errors, and trailing whitespace, a space and a tab at the end of a line, and repeated words. I didn't believe these were that common until I coded it and ran it over some stuff. You'd think these would be rare; they're not. And here is one of the man page mdoc things, where a section header DESCRIPTION has been used but section SYNOPSIS has not been defined. That's not required by the toolchain, but mdoc(7) says it is. Now, this particular output I find exceptionally hard to read, but it's plain ASCII, which you may be able to incorporate into an editor, so you can highlight a section of documentation, say "show me the errors in this section", and then grab those line numbers and error messages out. Here, stealing the idea from [unintelligible], we use ANSI highlights, color highlights, and that's what the -R flag is for, which corresponds to the -R in less, which is able to display those. This is the same output but with the ANSI color highlighting. The error messages are highlighted in different colors so they don't all blend together. It turns out there aren't enough ANSI highlight colors that are visible to give each of the different errors its own color, but this helps keep them from all blending together. For whitespace problems we use reverse video, and I think it helps tremendously. And like we talked
about earlier, this is the style analysis, and it is very simple-minded. It checks for word frequency, and it says "you" was used 512 times. Well, "you" and "your" are informal style; don't use them if you can avoid it. Try to be formal and objective. [unintelligible] And it checks for various uses of other words: "simply", and "basically". That's how I feel when somebody says "basically" to me: they're saying "I'm going to dumb it down for you". [unintelligible] It also points out use of the Latin "e.g." and "i.e.", which tend to be used in academic and scientific environments and might be constructively replaced with the actual English words. It turns out that many people use those incorrectly.
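A word-frequency style check of the kind described above needs very little code. The Python sketch below is an illustration, not igor's implementation; the watch list and the advice strings are assumptions built from the words mentioned in the talk, and `style_report` is a hypothetical name. It counts how often each watched word or abbreviation occurs and reports only those that appear.

```python
import re
from collections import Counter

# Hypothetical watch list based on the words mentioned in the talk.
STYLE_WORDS = {
    "you": "informal; prefer a formal, objective style",
    "your": "informal; prefer a formal, objective style",
    "basically": "can read as talking down to the reader",
    "simply": "can read as talking down to the reader",
    "e.g.": 'consider "for example"',
    "i.e.": 'consider "that is"',
}

# The abbreviations are tried first so their periods are kept.
TOKEN = re.compile(r"\b(?:e\.g\.|i\.e\.|[a-z']+)", re.IGNORECASE)


def style_report(text):
    """Return {word: count} for each watched word that occurs in text."""
    counts = Counter(token.lower() for token in TOKEN.findall(text))
    return {word: counts[word] for word in STYLE_WORDS if counts[word]}


if __name__ == "__main__":
    sample = "Basically, you simply run it, e.g. on your file."
    for word, count in style_report(sample).items():
        print(f'"{word}" used {count} time(s): {STYLE_WORDS[word]}')
```

The counts are only suggestions: the check has no idea whether a given "you" is appropriate, which matches the talk's description of the analysis as simple-minded.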
For DocBook we have another issue, where we have many translators, and they care about whitespace, or rather they don't care about it: they don't want to see whitespace commits, because those do not affect the translation. The -z option here only checks for whitespace problems, so you can take a book like the Porter's Handbook, feed it into this, get a huge output that shows all the various whitespace problems, fix those, commit it, and the translators are unaffected. And if anybody's looked at that for whitespace, it is horrible. This is the flip side of that
previous one. This is kind of the opposite: everything but whitespace, and these are the ones that translators would care about. It checks for things like no comma after "e.g.", because typically there is a pause after that: "for example," or "that is," there's a pause after that. Bad capitalization: these words should be wrapped in filename or command tags, in which case it wouldn't complain about them. And spelling errors, and again "e.g." and "i.e.". Here is an open paragraph without a closing tag. We look for the very first opening tag, which means you started a new paragraph but there was no closing tag on the previous one, which means it was one of those slashes left off. And finally, where is
it? It's in ports as textproc/igor, thanks to Glen Barber, and I have it on my website there, which I will show in another slide in just a second. Lessons learned: well, optimize regular expressions and short-circuit whenever possible. DocBook SGML indentation is decidedly nontrivial; I could use another word for that. Syntax highlighting is good for whitespace, and on that web page, which I will come back to in a second, there is a syntax highlighter for whitespace for the nano editor, which nobody uses except me, I think, and which will show you whitespace at the end of lines. Use it on your document; you'll be appalled. And finally, advertising, because a program that will help you with proofreading documentation does you no good unless people know about it, and that's partly why I'm here. I want to let people know about this. Let's make it easier to improve our
documentation. And for the future: well, there could be a rewrite. Nobody's volunteered yet; it's still early. The DocBook indentation testing may be affected by a switch over to XML. And advanced language analysis: instead of looking at words, have it look at sentences, have it look at paragraphs and say "this paragraph right here is unclear". We could do that. I can't, but somebody could. And other languages, and by that I mean non-English languages: we can have the spelling checker look for other words. The markup will still be in English, so all we have to do is add those other misspelled words, and that's it. I want to
thank you for coming, and I want to particularly thank my mentor, Glen Barber, and thank him personally.