Exporting a Plone site to Word, results and lessons learned
Formal Metadata
Title |
| |
Title of Series | ||
Number of Parts | 72 | |
Author | ||
Contributors | ||
License | CC Attribution 3.0 Germany: You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor. | |
Identifiers | 10.5446/54743 (DOI) | |
Publisher | ||
Release Date | ||
Language | ||
Production Year | 2020 |
Content Metadata
Subject Area | ||
Genre | ||
Abstract |
|
Plone Conference 2020 (59 / 72)
Transcript: English (auto-generated)
00:00
Welcome everybody. I am pleased to introduce our next speaker on this track. You know him well. He's a long-time member of the community and you may have seen him yesterday already at another talk, Fred van Dijk.
00:23
And he's gonna tell us about exporting a Plone site to Word, with results and lessons learned. Take it away. Thank you, Fulvio. So welcome to my talk. Yesterday I used the same project we did
00:43
for the last one and a half years to explain some nifty tricks we did with collective.collectionfilter. And now I'm going to focus on another part of this project, which I'll explain further. I'd like to ask you a question. You can post it in Slack or on Slido,
01:01
which I now have open, and which helps a lot to get at least some feedback, because you're talking into the void while doing the presentation. The question is: would you like to hear more later in this talk about the project difficulties and middle-management stuff, or would you like to see a lot of code and nitty-gritty export details with python-docx,
01:22
which is the module we use? Just throw some stuff in the Slack channel or the Slido channel. So yes, to introduce myself quickly, I'm Fred van Dijk. I'm working for Zest Software from Rotterdam, the Netherlands. We've been working remotely for customers for many years, but now we're of course fully remote.
01:41
My direct colleague is Maurits van Rees, who you will probably know as our semi-new second release manager. And this talk will focus on the export functionality, for which Maurits did most of the difficult technical work and research, while I was nagging him,
02:01
tinkering, gluing and polishing stuff later to deliver it to the customer. This is part two of a talk I did already in Ferrara last year, where I explained more about the project and we were still struggling and working on the site structure. I will point that out a little bit in this talk
02:21
in the first part, but then I will move to the export to Word, because that's where the fun started, end of last year, beginning of this year. So we'll have a lot of details if you're interested in those, and I'll do a conclusion with lessons learned and some generic stuff. So the scope of the projects we did,
02:41
we are working and providing support for the Flemish Environment Agency, which is a kind of, in parts, a little brother of the European Environment Agency, but of course the Flemish Environment Agency is focused on Flanders, the Dutch speaking part of Belgium. And they do a lot of the lower level
03:02
executive work to manage water and also manage air pollution, air cleanliness in Flanders. They also have some committees and one of those committees is the CIW, which organizes the water management. And then you have to think about water flow through all of the country.
03:21
Where are the sewers? How do we check for pollution? Should we watch out for pesticides in the water? Everything they do, the executive part. So one of those committees, which is called the CIW, organizes a kind of,
03:42
they have to write a water management plan every six years. And every six years, they write a large plan. They collect information from lower government bodies. They have a feedback round and then they make a new plan of, okay, how should we manage the water flow in smaller rivers, in ponds?
04:01
Where should we work on the sewers? It's a very, very detailed, very big plan. They have now made a new website for that. So every six years, they have to do this. We manage their normal publication websites, but they asked us two years ago: a new plan is coming up, can we do this more digitally?
04:21
They tried it once, like 10 years ago. And now we were like, can we do this again? The website is now live. If you want, you can visit it. It's SGBP, which is the Dutch abbreviation for, don't be scared, stroomgebiedbeheerplannen.
04:44
Somebody asks if I'm online. I think so. So maybe you should check the streaming. So it's already live. They are now running the consultation round and they're waiting for feedback
05:01
on the plans that have been written. So the project for us is now partly finished. We've already done this. And the big question was, can we have a website with all our plans? And can we also, because this is a huge plan and it's organized around water basins. So this is Flanders and a large part of the plan
05:21
is organized around water basins, where probably a small river is flowing through here. And this is a kind of technical way to delimit the different areas. Can we also export this whole website to a Word document or to any document with a linear text flow? Because we have to present it to government
05:41
and higher government has to approve this plan. So we have some experience with exporting website content to another format. The export we have done a lot before is PDF export. That has some challenges, mainly a pixel-perfect layout.
06:00
PDF is like, okay, write it to a virtual document. And the main issue we had there, with another large environmental website project, was for example a table of contents, indexes, other stuff. Most of these add-ons, we have some add-ons
06:20
that provide this, like collective.sendaspdf, which we created. A very large one is eea.pdf, which was used on the environment agency website. But they all depend on an underlying tool called wkhtmltopdf, which says you first generate your whole site
06:40
in one huge HTML document. And only then, for example, such a tool can generate indexes and table of contents and other things for you. And we had that nasty experience. So we were checking, is the grass greener on the other side, on exporting something more semantic, structural,
07:02
and then let the other tool do the formatting and layout, which is something you can have with Word. Well, I'll spoil the summary. The grass isn't exactly greener on the other side, but we did manage it. So I spoiled the end, but I'll talk you through the rest. So in this project, in 2019 and 2020,
07:22
we focused on the structure of the website because you need one huge tree and not a forest. And the problem with the default Plone site is that you have a folder item and you have a page item and you can have multiple pages in a folder and subfolders. And that doesn't really nicely translate to one single tree
07:40
where you can run through all the branches and every branch is a heading in the linear document. I will show you now what we've created here. So we've created one new content type, which is called a section. And actually this matches the folderish document,
08:01
which many Plone integrators also use. And that really helps us. So let's go dive into one. So water basin specific part, we will now go to one water basin. And this water basin is structured around introduction, who is who, pressures, which are all ecological and pollution series,
08:24
the situation of the water basin and the plan, the vision and actions they want to take. And then I can somehow scroll through it, see some, this is linked to some documents and here we have a menu. So I can go to, for example, introduction. Okay, introduction.
08:41
Thank you. Then we have some specialties about this one. It's about polders. I don't even know the English translation, but the Netherlands is full of polders. We have canals. And as you can see, I can quite easily go through the structure. The trick is that this, for example,
09:02
I will jump back to the main IJzerbekken. We only have three views here, which is a text view with a subsection where it will generate menus. There is a view that says, okay, I'm on this section, but all my children's sections should be text blocks.
09:22
And you normally use this one at the end, at the leaves of the whole content tree, or you can make a longer menu. So I will show you what happens. This one is now set to subsection menu. So it generates all the subsections, which are in here as items here. I can show them.
09:40
If you go to the contents, you see we have five subsections here. And the view in this case just shows the five items. And it uses the icons, which are actually an extra field on the section as menu items. So I can go to Canis Marking and Canis Marking again, we see this same similar thing.
10:01
This text here is when I go to edit. It's the main text. That's the rich text. There's this image, which is used as the pictogram for the upper tree. And here below we can dive in, for example, here, and we can dive even further. And what you could do here is, for example,
10:21
create three sections and then say, display, show them as long items. I won't do it like here now. So for example, here you see four sections. These are now rendered as a menu. But if I would switch the view to a menu, it would render them like a longer menu.
10:41
And if I would switch them to text blocks, it would become one big story. And with these three views on only one content item, the section, we can create this whole website. At least we can create all these plans here.
11:02
And we have this necessary condition to generate one long linear list. So the thing I've already skipped here is that for editors, it was a bit confusing at first, because this linearity of the whole website
11:21
demands that they shouldn't insert their own navigation. They should use the navigation, which I just showed you, which is either this listing view at the leaves of the whole site structure, or they should use these blocks. And things go a bit wrong if you, for example,
11:43
let's go to here, pressures. If you would here start building your own navigation in the text section of a section, if people would start here, and look for other nice info, yada, yada,
12:02
and they would put a link on it, and this would get inserted into the Word document, which is really strange to read for somebody who reads the document version of the website. So that was a kind of finding a balance between having nice pictures by using the images
12:22
on the content types, and by pressuring the content editors, please don't build your own fancy HTML navigation and other stuff. And that was a struggle, because it was like, oh, but if I can't express myself in the website and make it fancy and make it online, then yeah, just forget about the Word export.
12:41
That's not too important for me. But another person in the organization would say, look, we need this Word export, do generate it. Okay, so that's the whole system, which is underneath here. We need this in Word, and also another extra fancy one was can we exclude subtrees from the export?
13:02
The idea was that in this whole tree of nodes, of sections, you could at one time say, oh, here's a nice graph, you could say, look, this section here, that's very interesting as background information, and we might have some stuff from specialists here,
13:23
but can you please exclude it from the exported Word document? So what we did is you can say here, it's core or it's background information. If you would flip the switch, then it would switch everything behind it also to background information, and it wouldn't get included into the Word document.
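The subtree exclusion described here can be sketched as a depth-first walk that stops at any section marked as background. This is only an illustration: the `Section` class and its `background` flag are hypothetical stand-ins for the site's content type, and the sketch skips a flagged subtree during the walk, whereas the talk describes the flag being propagated to everything below it.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Section:
    """Hypothetical stand-in for the site's section content type."""
    title: str
    background: bool = False  # assumed flag: True means background information
    children: List["Section"] = field(default_factory=list)


def exportable(section: Section) -> Iterator[Section]:
    """Yield sections depth-first, skipping background subtrees entirely."""
    if section.background:
        return  # neither this section nor anything below it is exported
    yield section
    for child in section.children:
        yield from exportable(child)


basin = Section("Basin", children=[
    Section("Introduction"),
    Section("Specialist notes", background=True,
            children=[Section("Raw data")]),
    Section("Vision"),
])
exported_titles = [s.title for s in exportable(basin)]
```

Because the check happens before descending, "Raw data" is dropped even though its own flag is not set.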
13:44
Unfortunately, it was a kind of functional requirement, but in the current live website, it's not used, but it works. Okay, now to the meat of the thing and the proof of the pudding, let's go to IJzerbekken, and now I can say under the actions,
14:03
export section as a Word document. There we go, this is a nice little trick where we don't have any async support. Now, one of the Zope threads is actually generating the Word document. We poll the site root for the status,
14:21
we combine that with the user who created it. Oh, and now it says info document has been created in the folder document exports. And the last trick to do this, so I created one this afternoon to be sure it was generated, but the trick here is that we finalize the Word export as a normal file content item
14:42
in the document exports here. Okay, now we can have this item. Is it really a document? I will show it to you, open it with Word, and here it is. This is the whole IJzerbekken section from this part of the website
15:01
with the whole structure there. Let's see, do we have questions? Yeah, I'll answer that one later, Paul.
15:20
So this is the whole document. There's one thing, we are not responsible for the layout in large parts, and for example, the table of contents is this nice little trick where you say right click to update, compute, and Word will compute it. And here we have our whole document. I will now go into some more details
15:41
and I will come back to this document. I've stored another version locally. So our thinking was, okay, so you've seen now it works on the live site. We thought in the project, let's first generate the basic structure, we had that,
16:01
and then let editors create more content, catch errors they have when they generate the Word version, and they hopefully do that regularly. And then we can catch all the minor details and the caveats and then optimize and specialize it. The problem one was that editors didn't start inserting content until like three months before the website had to go live
16:22
because they were dependent on all kinds of other external agencies. And the second problem was that, of course, because of COVID-19, we didn't really have any contact anymore with the editor. So after we finished the basics, it was a lot of remote work and fixing things. So how did we pull this off?
16:40
How do we generate it? We are using two modules, python-docx and Beautiful Soup, both well-maintained projects, although python-docx in particular has stalled in functionality. There are many pull requests in the GitHub repo, but the maintainer is conservative about adding them; the last official release was in 2019. Beautiful Soup is beautiful.
17:03
And as I will show later, we use an example from the internet for a rich text parser to parse the actual contents in every section. python-docx has a special trick that it first creates an empty Word document. And that's kind of like the Normal.dot or the Normal.dotx that you have in Word.
17:21
And on the basis of that empty document, it will allow you to iteratively add text, add elements, add headings, paragraphs, tables, page breaks, et cetera. You can find this in the python-docx documentation. There's a huge user guide here with all kinds of things, and it looks very extensive, and it is very extensive for the basic things.
17:43
So here's a small example of how you would do this in python-docx. So now the question is, okay, how do you actually do this for the Plone website where we only have sections? So we create a document object from a template.
18:02
We find all sections in the whole site and build them as a tree. And we collect all of those UIDs that are on there also in Plone. And then we kind of build a content tree. Then we start adding to the document, the first page, the document info, a table of contents.
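The step of turning all sections into one ordered tree with a heading level per section can be sketched roughly as below. The input tuples, the `linearize` helper and the sorting by path are assumptions made for illustration; a real site would order siblings by their position in the parent rather than alphabetically.

```python
from typing import Dict, List, Tuple

# Hypothetical search results: (path, UID, title) for every section found.
sections = [
    ("/plan/basin/vision", "uid-3", "Vision"),
    ("/plan/basin", "uid-1", "Basin"),
    ("/plan/basin/intro/polders", "uid-4", "Polders"),
    ("/plan/basin/intro", "uid-2", "Introduction"),
]


def linearize(
    items: List[Tuple[str, str, str]], root: str
) -> Tuple[List[Tuple[int, str]], Dict[str, str]]:
    """Return (level, title) pairs in document order plus a UID -> path index."""
    root_depth = len(root.strip("/").split("/"))
    # Sorting on path segments puts every parent directly before its children.
    ordered = sorted(items, key=lambda item: item[0].split("/"))
    headings: List[Tuple[int, str]] = []
    uid_index: Dict[str, str] = {}
    for path, uid, title in ordered:
        depth = len(path.strip("/").split("/"))
        headings.append((depth - root_depth + 1, title))  # export root = level 1
        uid_index[uid] = path  # kept so internal links can be resolved later
    return headings, uid_index


headings, uid_index = linearize(sections, root="/plan/basin")
```

Each `(level, title)` pair could then be fed to a call such as `add_heading(title, level=level)`, followed by the parsed body text of that section.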
18:21
Then we loop over all the sections and we insert first the heading, which is dependent on the level it was in, which is calculated by this big main loop first. And then for every section, we do a rich text pass over the text field and we create the actual content part. While we do all this,
18:42
we keep some separate lists of things we've inserted and other stuff. And we write then at the end of the document, we write some indexes. For example, we could write an index for images used or for references to external files. And we have a special list for error logs. And then I have now in my presentation, I have a long, long list of many, many, many, many details
19:04
of all the things we found that were issues and we had to solve before we could get this to a stable part. Who wants to see a bit because I've got about 10, 15 minutes left now. So we were warned by this
19:21
because we found this page on the internet from someone, I hope it's still live. Yeah, somebody who had used Python and beautiful soup to actually parse HTML content and generate something out of it. And he had this big warning like, okay, I switched to PDF.
19:40
Well, PDF is where we came from and the grass wasn't green with PDF, but we still went. So actually docx is low-level XML and it's a huge spec, which is used by OpenOffice and others. It should be standardized, but still they do things differently in the code. And now I can switch to what we have here.
20:01
So this is our main, maybe too big. Yeah, this is fine. So here we have some special stuff to do the JavaScript, et cetera, tokens, get time outs, the create doc. We create a document. We create this whole document contents here.
20:24
And then here is the main loop where we first find all contents in the website. We parse out all those UIDs because we have to make internal links. We have support for generating the root or generating somewhere a subtree. And then we kind of have a parser where we add the heading
20:44
and we feed into the parser every section it's, you see, it's from our roots. We feed the object text. We also feed the object text raw which is the unprocessed rich value of the text field for some stuff.
21:00
Then we handle the attachments, the warnings, et cetera. And we write a Word document and we're done. Of course the devil is in the details, which is in the rich text parser, which you see here. Here's some recognizable stuff to parse HTML. And this is where my previous text warning comes from.
21:23
And this is actually the rich text parser part where, really, the devil is in the details to fix: see tags, see images, see lists, see other items. So this is very quickly an overview of the code. One thing I'd like to show in the website
21:42
is the story about the template, because one of the difficult parts we found out is that when you base the docx export on the template and you want to insert into the document, let's see, where is my Word?
22:02
There is my Word. If you want to, for example, have these, these are styled for the organization one. And if you want to style these correctly, there's a mismatch between the Normal.dot from the organization and the normal template which is from python-docx. And those internal names in the XML are important.
22:22
So what did we do? Site settings or content settings. So for example, in a special folder in the website, we can upload an organizational Normal.dot. And for each of these templates, and we can have several,
22:43
you can choose which one to use, we can map a title and a footnote anchor, which is the docx kind of fixed ID of a heading to the internal organization created template. So we could, with this, we could experiment with a number of templates
23:03
and our webmaster for the project could upload his own documents. So here you have the list of templates where we could add multiple, and here our webmaster could experiment and choose a different one. And we could also here set some margins
23:21
and other stuff for the images, which I'll hopefully have some time for. So now I'm going to very quickly go through many, many details that we ran into. You can write low-level XML, and that was needed because you have this add_heading, add_paragraph
23:40
and add other stuff, but you don't have, for example, a table of contents, and you need to insert some raw XML in the document stream, which you can do with OxmlElement. I've already explained the problem with styling of the headers is that you need to say, I have a heading one, heading two, heading three, heading four, but maybe in your template,
24:01
that's not called that way. And docx is actually a zipped archive of a number of XML files. And the headings are in a separate XML. Internal links. Internal links, and links to appendix documents. What we did was if somebody in the website,
24:23
let's go, for example, who cares? Here, if somebody creates a link in here to another document, we have to check if it's inside the website and create an internal link in the Word document. But if it's a link to any external website, we decided to have a footnote inserted
24:44
with then in the word document, the link to the website. So for example, this links to Strom Chobitnifo. Okay, that's somewhere else, a document that's in the management plan and not in the specific parts. Okay, close it. So these links, and I will now jump
25:01
to the Word document, are all visible here. So here you see, for example, footnote five, then there is here somewhere a link, footnote five, to another part. And we have to try to see if it's, and here you see we had still an error in this one where it still links to a PNG. So somebody put a link to a PNG
25:20
and it doesn't recognize in this case that it was the same website. That's probably because I'm running this now in development mode and I actually generated it this afternoon. So here you see, okay, interactive map, interactive map. So here you see all these references. What we did as an extra requirement, if somebody uploads a PDF
25:40
or another kind of document in the website, we collect those and at the end, and as you see this is only one water basin Word document, we create a huge... Yeah, here's the list of added documents.
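The link handling described above (internal links, footnotes for external links, a collected appendix for uploaded files) can be sketched as a small classifier. This is only an illustration under assumptions: the site root, the list of file extensions, and the function name `classify_link` are hypothetical and not taken from the project's code.

```python
from urllib.parse import urljoin, urlparse

# Hypothetical site root; the real export would use the portal URL.
SITE_ROOT = "https://example.org/plan"

# Assumed set of extensions treated as uploaded "appendix" documents.
APPENDIX_EXTENSIONS = (".pdf", ".doc", ".docx", ".xls", ".xlsx")


def classify_link(href, page_url, site_root=SITE_ROOT):
    """Return (kind, absolute_url); kind is 'internal', 'appendix' or 'external'."""
    # Resolve relative hrefs against the page they appear on.
    absolute = urljoin(page_url, href)
    target = urlparse(absolute)
    root = urlparse(site_root)
    root_path = root.path.rstrip("/")

    same_site = target.netloc.lower() == root.netloc.lower() and (
        target.path == root_path or target.path.startswith(root_path + "/")
    )
    if not same_site:
        # External links would become footnotes in the Word document.
        return "external", absolute
    if target.path.lower().endswith(APPENDIX_EXTENSIONS):
        # Uploaded files would be collected in the list at the end.
        return "appendix", absolute
    # Everything else inside the site would become an internal Word link.
    return "internal", absolute
```

Comparing hosts this way also shows how a mismatch can slip in: if the document is generated on a development host while the content links to the production host, links to the same site are classified as external.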
26:01
So in those, one, two, and three, these are all PDFs which get linked to somewhere in the documents and which are also in the final Word document. This was also the case for all the footnotes and all the other stuff. We had to generate those using low-level OxmlElement calls
26:22
because they are not really available as a high-level concept in python-docx. Parsing of the rich text field is tricky. Editors could do all kinds of things and TinyMCE is not that restrictive in the HTML it outputs. So we stripped much of the TinyMCE formatting.
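Stripping editor markup before conversion can be sketched with the standard library alone. The allowed tags and attributes below are assumptions chosen for illustration, and `clean_html` is a hypothetical helper, not the project's actual filter.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist; the real export limited editors to a small set of styles.
ALLOWED_TAGS = {"p", "h2", "h3", "ul", "ol", "li", "a", "strong", "em", "img"}
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"}}


class WhitelistCleaner(HTMLParser):
    """Re-emit only whitelisted tags and attributes; keep all text content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # drop the tag itself, but its text still passes through
        allowed = ALLOWED_ATTRS.get(tag, ())
        kept = [(k, v) for k, v in attrs if k in allowed and v is not None]
        attr_text = "".join(f' {k}="{escape(v, quote=True)}"' for k, v in kept)
        self.parts.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        # img is a void element, so it never gets a closing tag.
        if tag in ALLOWED_TAGS and tag != "img":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # The parser has already decoded entities, so re-escape the text.
        self.parts.append(escape(data, quote=False))


def clean_html(html):
    cleaner = WhitelistCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.parts)
```

A whitelist is used rather than a blacklist because the converter to Word only needs to understand the handful of elements that survive.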
26:40
If you look in the website and I start edit, I edit this page, then you will see we've limited the layout to only two headers. Blocks are only for graphic links. So we tried to limit the amount of stuff that editors could do. Listings, in-section headers,
27:00
there are only two subheadings, but they are not in the table of contents. If you don't want them in the table of contents, you can't use add_heading and you have to create a sort of faux heading. And I'll get back to this in the lessons learned. The Volto blocks engine would have been ideal to minimize all this messy HTML, to limit the horrible stuff editors can do
27:23
in a normal TinyMCE rich text widget. So image scaling. Never insert an image at full size
27:44
directly into the Word document with add_picture, because then you may add a blob of four or five megabytes. We kind of pushed through 150 DPI, which means your image only needs to be about 1,000 pixels wide. And we used an image scale to convert all images
28:00
to this one, at a maximum of 1,000 pixels. And in the website, in the image insert one, we only use two sizes, half width and full width. Image alignment. And this is from one of the Slido questions. Yes, a lot of the python-docx limitations are not python-docx limitations, but are actual Word limitations.
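The image-scaling rule mentioned a moment ago (150 DPI, a cap of 1,000 pixels, and only half-width or full-width images) comes down to a little arithmetic. The sketch below assumes a printable text width of 6.5 inches, which is not stated in the talk; `target_pixel_width` is a hypothetical helper.

```python
import math

TARGET_DPI = 150          # print resolution mentioned in the talk
MAX_PIXELS = 1000         # upper bound for any scaled image
TEXT_WIDTH_INCHES = 6.5   # assumption: printable width of the Word page

WIDTH_FRACTIONS = {"full": 1.0, "half": 0.5}


def target_pixel_width(size, original_width, text_width=TEXT_WIDTH_INCHES):
    """Pixel width to scale an image to before inserting it into the document."""
    needed = math.ceil(text_width * WIDTH_FRACTIONS[size] * TARGET_DPI)
    # Never exceed the cap and never upscale a smaller original.
    return min(needed, MAX_PIXELS, original_width)
```

With these assumed numbers a full-width image needs 975 pixels, so a multi-megabyte original a few thousand pixels wide can be reduced considerably before it is embedded.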
28:20
Word doesn't have any concept of float left or right. And python-docx only allows you to insert an image as an inline shape. And when you want to align an image in Word on the right, you would have to first convert it to a floating shape. Let's see which document I now have open.
28:40
So here I did the trick because this is, there's actually another document which I just opened, which is this one, which still has the ugly one. The only way to fix this was to have, this is the first output from the Plone website, from our export. And what we did was we ran a macro and the macro converts the inline shape to a floating shape.
29:04
And we could pass left or right info. So we should, we could improve this later, but then the macro runs for four or five minutes. And then you finally get this one. And there you see it's either half width or it's not totally,
29:21
but then you will have to do some manual readjustment as an editor, or the image was, let's see here, we have full width images. And that limited it and made it somewhat useful, because what actually happens is, if you align an image to the left or the right, then Word is actually creating a floating shape, calculating it from the margin and moving it with a kind of relative absolute positioning in this line.
29:44
So that was one of our hardest tricks that we couldn't really do here. Escape your strings. If you insert a footnote with a URL that contains an ampersand, then that's an invalid character in XML unless it is escaped. So we had a long search for why our export broke.
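The ampersand problem is easy to reproduce outside Word. The sketch below builds a raw text-run fragment by hand, which is an assumption about how such XML might be assembled; `text_run_xml` is a hypothetical helper, while `escape` is the standard-library function from `xml.sax.saxutils`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# WordprocessingML main namespace used by docx document parts.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def text_run_xml(text):
    """Build a raw <w:t> fragment with the text escaped for XML."""
    return '<w:t xmlns:w="{ns}">{text}</w:t>'.format(ns=W_NS, text=escape(text))


url = "https://example.org/search?basin=rijn&year=2021"

# Escaped text parses and round-trips to the original URL.
assert ET.fromstring(text_run_xml(url)).text == url

# The same URL pasted in unescaped is not well-formed XML.
try:
    ET.fromstring('<w:t xmlns:w="{}">{}</w:t>'.format(W_NS, url))
    unescaped_parses = True
except ET.ParseError:
    unescaped_parses = False
```

Escaping at the single point where strings enter raw XML avoids having to remember it for every footnote, link, and caption separately.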
30:02
Something that was actually rather easy, and this is from my presentation from yesterday, was where we have dynamic images. So we have these nice Highcharts images in the website. These are actually a separate content type. And this content type uses a backend export server
30:22
to generate PNGs and SVGs for every graph. I would have loved to have inserted the SVG into the Word document because of the size. The problem is python-docx doesn't support it. And the only workaround I've found
30:42
is that you can convert SVG to some obscure Microsoft vector format from 15 years ago. And that one should be able to be inserted into the docx stream. But we didn't have any time for it. So we converted all our graphs to PNGs and then inserted it.
31:01
I've already shown you the trick of asynchronously generating the document. It could take four to five minutes. We didn't choose plone.app.async, but we made a kind of trick where we poll a background generation. So that was a lot of things. I've skipped some stuff, but I hope you get the idea that it is doable.
31:21
It is workable. You've seen it working, but there are a lot of limitations. We managed to export the whole structure to Word. Volto would have been a good match if we had had more budget and time, and if Volto had been a bit further along when we had to make this choice halfway through 2019.
31:40
Volto matches because you have this folderish document, which matches our section. And for every feature, you could put that into a separate output function. More lessons learned. If I would start now again, python-docx would scare the shit out of me because it's in maintenance mode
32:01
and there hasn't been a release since early 2019. But it's the same situation as a lot of our PDF export stuff because that depends on this nice tool, which also hasn't seen a release for the last two, three years. But still, I think this project was great even though people didn't use all of the functionality.
32:24
This could save government bodies a lot of work if they would be able to handle the kind of tree structure you need there. And Word as an intermediary format is also great because you can create a semantic export and then you can let editors do the tinkering
32:40
and the other stuff as a kind of DTP. I shouldn't say this, do desktop publishing in Word, but you can export all these problems to Word. Then the final remarks are, should we open source this code? We want to, it's not a legal problem, but we didn't have the time yet.
33:00
We only, this project is like two, three months now open and done. But also we have a lot of restrictions in here, which I think, okay, should we just dump this code into the collective and then let other people suffer or should we first clean it up and explain more? It's still fragile. We had the webmaster of this project post edit a lot of content to fix some of the export issues.
33:24
So that's it. If you want to see the website again, it's sgbp. I'll post a link later on Slack. It's not easy to pronounce in English. It's a Dutch site, but that's it. Thank you for your attention.
33:40
Back to Fulvio. Thank you, Fred. Yeah, I want to see the claps in Slack now, but there are some questions in Slido and- Yeah, I'm picking them up as well. Why don't you pick, can you see them? Yeah, I'll move to my laptop screen here.
34:04
Would this work for iterative documents? A question from Paul. Yes, well, we didn't have, we didn't have to, we couldn't give them the support this year to have external agencies also log in for a section of the website, but you could activate iterate on the section
34:21
and then have a section. It would become a bit more problematic if you would have whole subtrees that are versioned, but for individual sections, you could just use normal workflow on it and say, okay, let an editor post this and have a kind of a check before publish
34:42
from a final editor. Yes, a lot of the limitations of python-docx are actually limitations of Word and the whole docx format. python-docx just has to struggle with that. And the maintainer, I think, did a hell of a job to block all kinds of experimental pull requests
35:01
from other people, but it kind of, let's see. Ah, pp.server (Produce & Publish). Yes, well, we also looked at that two years ago, but that project depends on an external service to convert all kinds of things. And there are also a few other options
35:21
where you kind of first generate your whole website into a kind of intermediary format, like a huge HTML file that you can feed into wkhtmltopdf for PDF. But it got very complex. You have to run, I think, also an office server, and we went for, because we had this very structured thing
35:43
and we saw some merits in using python-docx to generate the document as one big stream. So we did consider using pp.server or other similar solutions, but we chose this one.
36:01
Yes, thank you. That's the name, Armin. It's the EMF format, which Microsoft invented like 25 years ago and which was a kind of precursor to modern SVG stuff. So what we found out is that you could first convert SVG to EMF, and, according to a pull request on the python-docx GitHub repo,
36:23
you should be able to insert EMF as a shape into the stream and then you would have a vectorized image in Word. I think those were the questions. All right. Thank you, Fred. I'm not sure if anybody can hear me,
36:42
but I just wanna remind everybody that you can join the Jitsi channel by clicking the join face-to-face button in blue down below in the center column below the video window in LoudSwarm.
37:01
Yes, I'll move there too. Then we can discuss if this is useful for other people. I talked with people in Ferrara and we should continue talking about this online. Great. Okay, thank you very much. Thank you, Fulvio. Bye-bye. Have a nice remainder of the Plone Conf.