
docx2tex: Word 2007 to TeX

Formal Metadata

Title
docx2tex: Word 2007 to TeX
Title of Series
Part Number
4
Number of Parts
33
Author
License
CC Attribution 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Place
Cork, Ireland

Content Metadata

Subject Area
Genre
Abstract
Docx2tex is a small command line tool that uses standard technologies to help users of Word 2007 to publish publications where typography is relevant or only papers produced by TeX are accepted. Behind the scenes, docx2tex uses common technologies to interpret Word 2007 OOXML format without utilizing the API of Word 2007. Docx2tex is planned to be published as a free open source utility that is accessible and extensible by everyone. This paper has been originally written in Word 2007 and then converted to TeX using docx2tex.
Transcript: English(auto-generated)
So, welcome everybody. My name is Krisztián Pócza, I'm here with Mihály Biczó, and our PhD supervisor is Zoltán Porkoláb. We are full-time software developers and we do our PhD work in Hungary at Eötvös Loránd University.
So, the topic of our presentation is a conversion tool that converts documents from Word 2007 to TeX. The short agenda of our talk can be seen on the display.
Okay, so I will give an overview of this tool and the motivation; we will talk about features, benefits, applications, and use cases. I will not go deep into technical details. I will have a demo session and I will show docx2tex while working.
And I will speak about the availability and license of docx2tex.
This part will be the marketing bullshit part of my presentation and the last part will be the demonstration. So, as the name suggests, docx2tex is a very small tool, a command line tool that is able to convert the Word 2007 Office Open XML docx format to TeX.
Open XML is an Ecma International standard, so we can trust it and we can use it as a source.
Okay, and this is a completely open source product, so docx2tex is completely open source. Why did we create this small tool? Well, Word is not very good at typography; it's very bad, and it produces very bad printouts compared to TeX.
But it has advantages also: it's very much easier to use than TeX, and it supports user collaboration and teamwork very well. So we have track changes support, we can upload it to a SharePoint server, fill fields; it has very, very sophisticated features.
So we created a tool that converts from docx to TeX to help users of Word 2007.
Okay, the thing is that there are many, many existing solutions to the problem of conversion. Word2TeX and Word-to-LaTeX are commercial products and they are not open source.
There is a tool called rtf2latex2e, and there are many, many Word plugins. We don't like them: they are complex, not free, not open source.
I've read somewhere that OpenOffice can read Word documents and can export to TeX, I think; I haven't tried it. So every solution has a shortcoming, and that's why we created docx2tex.
We support the most important features. That's the first version; there are a lot of to-dos, but it will be, I think, very useful. The most important to-do is to add some sophisticated math formula support.
We can handle math formulae, but not as well as expected, and we do not handle drawings; we may convert drawings to .xv.
Okay, these are the features of docx2tex. We can convert standard text and support some basic styling, sections, verbatim text, simple tables, different break types,
numbered and bulleted lists, multilevel lists, captions of different parts, cross references to these parts, and image formats. We use ImageMagick to convert from PNG, JPEG, EMF, and BMP to EPS, and we handle special characters.
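The image step mentioned above can be illustrated with a short Python sketch. This is not the docx2tex code (which is written in C#); it only assumes that ImageMagick's `convert` program is on the path and that each image becomes an EPS file of the same name. The function name `build_convert_command` and the directory names are hypothetical.

```python
from pathlib import Path

# Input formats named in the talk (JPEG files may use either extension).
SUPPORTED = {".png", ".jpg", ".jpeg", ".emf", ".bmp"}


def build_convert_command(image, out_dir, convert_exe="convert"):
    """Return the ImageMagick command line that would turn `image` into EPS.

    `convert_exe` is an assumption: it stands for the configured path to
    ImageMagick's `convert` program.
    """
    src = Path(image)
    if src.suffix.lower() not in SUPPORTED:
        raise ValueError("unsupported image format: " + src.suffix)
    dst = Path(out_dir) / (src.stem + ".eps")
    # ImageMagick picks the output format from the target file extension.
    return [convert_exe, str(src), str(dst)]


if __name__ == "__main__":
    print(build_convert_command("word/media/image1.png", "latex"))
```

To actually run the conversion, the returned list can be passed to `subprocess.run(cmd, check=True)`; whether EMF input works depends on the local ImageMagick build.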
We also process the content of text boxes, but we do not follow the layout of Word.
LaTeX is much better at layout handling than Word, so we trust the layout engine of LaTeX. Okay, and we have basic math support. Okay, we use it for scientific publications, but it can be used for books and articles; it
can be used by publishing companies for materials being published, and for educational purposes also.
So we can teach everybody the OOXML format. Okay, this is our article writing workflow. We decide that we should write an article. Okay, we read the conference homepage. They say that the only accepted format is LaTeX.
It would be a pain to write a paper in LaTeX, and it would also be a pain to write it in Word, but it's a bigger pain in LaTeX. So we assign different parts of our article to different authors. The authors work separately.
And when all the authors are finished, we merge the work. Okay, then we start the collaborative work. We use the track changes function of Word to review and correct the contents of the article.
And once the result is accepted, we apply docx2tex and convert the result
to TeX format, and we do some special formatting, special work, and we submit the article. Okay, this is a picture of the track changes feature of Word. I also already presented that in the morning. I may show it after that.
Now, I will speak about the technical details very briefly. This is a sequence diagram, a UML sequence diagram, of a conversion that docx2tex can do.
So, the first step is OOXML de-packaging. OOXML is a bunch of XML and graphics files zipped together.
So the first thing is that we de-package it using the standard .NET API. It's the System.IO.Packaging namespace that is responsible for that.
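The de-packaging step can be tried out without .NET, because a docx package is an ordinary zip archive. The following Python sketch uses the standard `zipfile` module instead of System.IO.Packaging; the helper names and the toy package are illustrative assumptions, not part of docx2tex.

```python
import io
import zipfile


def list_parts(docx_bytes):
    """Return the sorted names of all parts stored in the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return sorted(package.namelist())


def read_main_document(docx_bytes):
    """Return the text of word/document.xml, the main part with the body."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.read("word/document.xml").decode("utf-8")


def make_toy_package():
    """Build a tiny stand-in package in memory (placeholder content only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<w:document/>")
        package.writestr("word/media/image1.png", b"not a real image")
    return buffer.getvalue()


if __name__ == "__main__":
    data = make_toy_package()
    print(list_parts(data))
    print(read_main_document(data))
```

With a real file, `open("draft.docx", "rb").read()` would supply the bytes; a real package also contains further parts such as relationship and style files.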
Okay, and we start the core XML engine that processes the OOXML content. And we have several methods, classes, and helper functions to do, for example, numbering, image conversion, image handling, image resizing, and styling.
Okay, and when we have created the pure TeX code, we run it through a simplifier, a beautifier engine that breaks lines at, for example, 72 characters.
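A line breaker of this kind can be sketched in a few lines of Python. This is only an illustration of the idea using the standard `textwrap` module, not the beautifier of docx2tex; the decision to leave comment lines alone is an assumption made here, and verbatim environments would need the same protection in a real tool.

```python
import textwrap


def break_lines(tex_source, width=72):
    """Re-wrap every over-long line of TeX source at `width` characters."""
    result = []
    for line in tex_source.splitlines():
        # Short lines are kept as they are. Comment lines are kept too:
        # wrapping them would move the tail of a comment into live text.
        if len(line) <= width or line.lstrip().startswith("%"):
            result.append(line)
        else:
            result.extend(
                textwrap.wrap(
                    line,
                    width=width,
                    break_long_words=False,  # never split a control sequence
                    break_on_hyphens=False,
                )
            )
    return "\n".join(result)


if __name__ == "__main__":
    print(break_lines("word " * 40))
```

A single newline inside a paragraph is just a space to TeX, so breaking lines this way does not change the typeset result for ordinary text.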
And when it's okay, then we save the result to disk. Okay, license and availability. The license is BSD. We like BSD because GPL is cancer, BSD is okay.
The source can be downloaded from CodePlex. CodePlex is the SourceForge of Microsoft. There are about 5,000 projects and one of these projects is docx2tex.
This is the URL. You can download it. You can download the binary and the source code also. Okay, the demonstration begins. So, I have two things to demonstrate.
Okay, this is the draft version of the article we published. So, this is the article. It will guide you through the different parts of Office Open XML.
Some examples will be shown. Okay, so let's convert this with docx2tex. Okay, this is the binary of docx2tex.
Here is the config file. This is the path to ImageMagick and this is the line length, set to 72 characters. Okay, this will run. I start the command line.
For example. Okay. Okay. So, this is the binary:
docx2tex, the draft, and the output will be in the LaTeX directory. docx2tex won't stack with the output. Okay. It's a bit slower because I am running it for the first time after booting the machine.
And it has to load the .NET Framework. It needs .NET Framework 3. That's a part of Windows Vista.
Okay, this is the output. We have a TeX file and we have several images that will be embedded. So, for example, DVI and PDF. Okay, let's compile it.
Okay, we use the MiKTeX distribution, and compile it once more; I have references. Okay, compile it to convert it to PS.
Okay, and to PDF. Okay, this is the resulting PDF.
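The compile steps shown in the demo (two LaTeX runs so that references resolve, then PostScript, then PDF) can be written down as a small Python sketch. The exact programs used in the demo are not named in the talk; `latex`, `dvips`, and `ps2pdf` are assumed here as the usual tools for that route, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(tex_file):
    """Return the command lines for the TeX -> DVI -> PS -> PDF route."""
    tex = Path(tex_file)
    dvi = tex.with_suffix(".dvi")
    ps = tex.with_suffix(".ps")
    pdf = tex.with_suffix(".pdf")
    return [
        ["latex", str(tex)],               # first run writes the .aux file
        ["latex", str(tex)],               # second run resolves cross references
        ["dvips", str(dvi), "-o", str(ps)],
        ["ps2pdf", str(ps), str(pdf)],
    ]


def run_all(tex_file):
    """Run every step in order and stop at the first failure."""
    for command in build_commands(tex_file):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in build_commands("draft.tex"):
        print(" ".join(command))
```

Going through DVI and PostScript fits the EPS images that the conversion produces; a direct `pdflatex` run would need the images in a different format.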
Okay, so you can see it converted the standard text, converted sections, a numbered list, another numbered list. Yes. Okay.
There are pictures. Special characters. Verbatim text. Okay. And different styles, italic styles.
Okay. So, this is the plain conversion. This is the result of the plain conversion. I have the modified version of this article. Okay, this is the LaTeX version and this is the final version. We applied the TUG style sheet.
It has two columns. Okay. So, let's compare the results. This and this. Okay, compare by content. Okay.
The red text marks the differences. So, there is a big difference at the beginning of the document, of course. But inside the document you cannot find too many differences. So, you don't have to do too much work after the conversion.
Okay. Okay. This is the TeX file that I've converted now. And this is the modified TeX file.
So, not much difference. Okay. I would like to show one other example. We have created this document for demonstration and testing purposes only.
So, we have a little from every supported feature of docx2tex.
Here are math formulae inside the document. I will show its conversion also. And here are some that are a bit complex.
Now, it's fine.
All right.
Okay, here is an error. So, the error is because we don't support double subscripts in the math formulae; this is a to-do. For example, I run it twice.
Yes.
Okay, and this is the result. You can see the picture, the list.
These are references to the table: table one, table one. Okay. Okay. This is a continuous numbered list. You can find different formatting.
Okay, special characters. These are references to subsections and sections. Okay, and these are math formulae. Okay, this is a more complex math formula.
And these are also math formulae. Okay. Let's look into the source of the Word document
and the resulting TeX document also. Okay. Okay. So, it's a zip that contains XML files. Okay.
Ctrl+Page Down, and Total Commander goes into the zip file. Okay. Here we can find, in the word directory, a document.xml. This is the main XML file that contains the whole text. And inside, you cannot find,
but you did not manage to unzip it. Okay. Okay.
Okay. So, for example, this is a paragraph. w:p is a paragraph. This is a run. This is a text. So, this is a... Okay, this is another run.
Another run, with italic text, with the sentence. So, it's a very, very easy XML format and it's not a big deal to analyze it and convert it to any other format. Okay. This is a picture.
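The paragraph and run structure described here can be walked with any XML parser. The Python sketch below is an illustration only, not the docx2tex implementation: it reads `w:p`, `w:r`, `w:t`, and the `w:i` italic property from a hand-written sample, escapes TeX special characters, and emits `\textit{...}` for italic runs. The sample paragraph and all function names are assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Characters that need escaping in ordinary LaTeX text.
SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape(text):
    """Escape TeX special characters one character at a time."""
    return "".join(SPECIALS.get(ch, ch) for ch in text)


def run_to_tex(run):
    """Convert one w:r element; italic runs are wrapped in \\textit."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    tex = escape(text)
    props = run.find("w:rPr", NS)
    # Simplification: any w:i element counts as italic (w:val is ignored).
    if props is not None and props.find("w:i", NS) is not None:
        tex = r"\textit{" + tex + "}"
    return tex


def paragraph_to_tex(paragraph):
    """Concatenate the converted runs of one w:p element."""
    return "".join(run_to_tex(r) for r in paragraph.findall("w:r", NS))


SAMPLE = (
    '<w:p xmlns:w="' + W + '">'
    '<w:r><w:t xml:space="preserve">This is a </w:t></w:r>'
    "<w:r><w:rPr><w:i/></w:rPr><w:t>test</w:t></w:r>"
    '<w:r><w:t xml:space="preserve"> at 100%.</w:t></w:r>'
    "</w:p>"
)

if __name__ == "__main__":
    print(paragraph_to_tex(ET.fromstring(SAMPLE)))
```

Escaping character by character avoids re-escaping the braces that the backslash replacement itself introduces.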
This is a picture reference. Inside, here we can find the picture information. These are also converted. And I show, for example, math formulae. A math paragraph: here begins the math.
This is b. Okay. And m:d means brackets: b, a in brackets, equals... Here, x.
So, it's very, very easy to interpret this XML file, at least with an application. Okay. And the result is that TeX file.
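The math markup can be handled in the same recursive way. The Python sketch below covers only three Office Math elements (`m:r` text runs, `m:d` delimiters, `m:sSub` subscripts) on a hand-written sample; it is an assumption-laden illustration, not the converter of docx2tex. In particular it always emits parentheses for `m:d` and ignores the delimiter properties.

```python
import xml.etree.ElementTree as ET

M = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def tag(name):
    """Return the fully qualified ElementTree tag for an m: element."""
    return "{" + M + "}" + name


# Property elements carry formatting only and produce no TeX here.
PROPERTY_TAGS = {tag("rPr"), tag("dPr"), tag("sSubPr"), tag("ctrlPr")}


def children_to_tex(node):
    return "".join(to_tex(child) for child in node)


def to_tex(node):
    if node.tag in PROPERTY_TAGS:
        return ""
    if node.tag == tag("r"):  # run: plain math text
        return "".join(t.text or "" for t in node.findall(tag("t")))
    if node.tag == tag("d"):  # delimiter: assumed to be ( ... )
        inner = "".join(children_to_tex(e) for e in node.findall(tag("e")))
        return r"\left(" + inner + r"\right)"
    if node.tag == tag("sSub"):  # subscript: base in m:e, index in m:sub
        base = children_to_tex(node.find(tag("e")))
        sub = children_to_tex(node.find(tag("sub")))
        # Bracing the base keeps a nested subscript from becoming x_{1}_{2}.
        return "{" + base + "}_{" + sub + "}"
    return children_to_tex(node)  # containers such as m:oMath and m:e


def omml_to_tex(xml_text):
    return "$" + to_tex(ET.fromstring(xml_text)) + "$"


SAMPLE = (
    '<m:oMath xmlns:m="' + M + '">'
    "<m:d><m:dPr/><m:e><m:r><m:t>b+a</m:t></m:r></m:e></m:d>"
    "<m:r><m:t>=</m:t></m:r>"
    "<m:sSub><m:e><m:r><m:t>x</m:t></m:r></m:e>"
    "<m:sub><m:r><m:t>1</m:t></m:r></m:sub></m:sSub>"
    "</m:oMath>"
)

if __name__ == "__main__":
    print(omml_to_tex(SAMPLE))
```

Emitting `x_{1}_{2}` for a nested subscript would make TeX stop with a "Double subscript" error, which is why the base is wrapped in braces in this sketch.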
Yes. The input encoding is Latin-2. We are using Latin-2. So, it could be made configurable, for example. And I will show some math.
Yes. Here is some math that docx2tex created. Here is a reference. This is the ID, a reference number that is given to the different parts. Okay. This is another math formula.
So, this is the resulting TeX file. Okay. So, I have five minutes. The source code is written in C#, in C# 2.0/3.0.
And we have used Visual Studio 2008. And you can download the source code from CodePlex also. All right.
And thank you for your attention. I'm sure we have some interesting questions. Is it possible to use Mono instead of C# and the Microsoft toolchain?
Ah, interesting question. It depends on Miguel de Icaza, the creator of Mono. We use the features of .NET Framework 2. I don't know if Mono supports it; maybe.
For the docx de-packaging we use System.IO.Packaging. That's a part of .NET 3. You have not tried it? No, no. But we may try it.
No, sir. I'm curious: since the Open XML specification behind docx is probably pretty complex, have you come across parts that you couldn't figure out how to translate to LaTeX? We have read the draft of the standard.
But mainly we created example documents and, based on these, created the conversion tool. The standard is many, many hundreds of pages long.
So we didn't try to read it through. But when we didn't understand something, we used it. One thing you might consider is annotating the TeX output with comments that would allow you to recover the docx format.
The reason I raise this issue is that I'm primarily motivated by my mathematics colleagues who are collaborating with people in other fields who use Word. I want to be able to convert it to TeX, make their changes, and send the revised document back in Word format, not in TeX format.
Interesting. I think since they're just changing the prose, this should be relatively straightforward to support with additional comments. So as long as you don't change the structure when you're in the LaTeX environment, the round-tripping should work. I think what you mean by round-tripping,
most of what you're doing when you're converting a Docx thing to a structured system, as we know it, is actually ignoring all the formatting stuff, all the change stuff. That's why that's been so key.
Absolutely everything has to say how you import changes in it. Lots of detail for formatting and massive information like that. To carry all that along the user travel. I'd like to pull that and then bring it back again. On a related note, in your sample documents,
do you use the traditional, naive Word user's approach, "force it to look the way I want", or the better one, "use a defined style for my chapter and heading elements"? Do you have examples of both, and do you find differences in how hard it is to convert?
We do a very basic conversion. We leave as much of the work as possible to LaTeX.
Actually, in the source code, I saw a couple of places where I think you actually ended up turning visual formatting in Word into visual formatting in LaTeX, like four spaces in front of a math formula. Yes, yes. Was that left over from the previous version?
Or do you actually intend to have that kind of visual formatting carried over, rather than trying to abstract from... No, that was a part of the Word document, and in the LaTeX document it can maybe be omitted.
So you mean there were four spaces in Word, so you tried to keep the four spaces? We may delete it. Are the fonts in docx and TeX the same? Excuse me? Could you repeat, please?
Are the fonts in the document...? Oh, fonts, fonts. The same? No, we use the standard fonts. We do not convert the fonts. That may be pretty hard work. Last question. Related to the fonts, are there handlers to flag errors or warnings?
The problem is, if something goes wrong during the translation, does the user get information about it? Uh-huh. Interesting question. It's hard to...
So docx2tex parses the XML and converts everything that it can. The error can be that it leaves out something. We do not process bad content,
so we do not bring a lot of bad content to the TeX file. We may omit something that is in the Word document. I'm sorry, but we have to give our next speaker a chance to give his entire talk.