XSweet, a toolkit under development by the Coko Foundation, takes a novel approach to converting .docx (MS Word) data. Instead of trying to produce a correct and full-fledged representation of the source data in a canonical form such as JATS, XSweet attempts a less ambitious task: to produce a faithful rendering of a Word document's appearance (conceived of as a "typescript"), translated into a vernacular HTML/CSS. What comes out of such a process, and what doesn't, is interesting. And while the results are barely adequate for reviewing in your browser, they might be "good enough to improve" using other applications.
One such application would produce JATS. Indeed, it might be easier to produce clean, descriptive JATS or BITS from such HTML than to wrestle into shape whatever nominal JATS came back from a conversion processor that aimed to do more. This idea is tested with a real-world example.
XSweet - XSLT-based tool for document conversion from MS Word – it works!
My adventures converting a Word document via HTML into BITS (JATS)
(A
Project conceived and motivated by Adam Hyde / sponsored by
Open-source, reusable, hackable XSLT 2.0
Prototype design and implementation by Wendell Piez and Alex Theg
Libraries of XSLT transformations to be applied in combination
Demonstration pipelines in XProc (XML Calabash or any XProc toolkit)
Components run on any platform or stack supporting XSLT 2.0
For example,
Made available to the public on an open source basis (MIT license)
One task supported: exposure of contents of Word documents
("
Produce HTML from your Word document
Work well on arbitrary Word
(Keep fiddly settings to a minimum, ideally none)
Deliver something plain, generic, and useful
Complement other tools (lightweight mix/match alternative)
Simultaneously offering a hackable, extensible open source toolkit (hey why not?)
Standards basis (XSLT) for portability / reusability / dependency management
XSweet does nothing like a complete job in document conversion (by design)
Several challenging real-life problems are set aside for other processes
Its HTML represents, as literally as possible, properties expressed in the Word document
As HTML/CSS equivalents of the formatting as designated in the Word
So
<p class="Header1">Paragraph Style "Header 1"</p>
<span style="font-weight: bold; font-size: 24pt">24pt, bold face</span>
Our name for such an application (or profile, or style) of HTML: "HTML Typescript"
By analogy to role of typescript in print-oriented workflows
Typescript wasn't print, but it emulated print for certain practical purposes
We need something analogous for web-based workflows, and HTML is a
A typescript isn't finished - it's a
HTML Typescript comes flat - no structure
(Like WordML sources)
Until you add structure in a later step
Typically documents are loaded with formatting info
(Because that's what we had in the Word doc)
E.g.:
Such formatting usually (not always) "says something" – "latent semantics"
HTML like this isn't finished either: there's much work to be done still
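How formatting can "say something" is easy to show in miniature. The following is only a sketch in Python, not part of XSweet (which is XSLT): it guesses that a paragraph is a heading when its whole content is one bold span at a large point size. The function names, the 18pt threshold, and the sample markup are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def parse_style(style):
    """Split a CSS declaration string into a {property: value} dict."""
    props = {}
    for decl in (style or "").split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip()
    return props

def looks_like_heading(p, min_pt=18.0):
    """Guess from looks alone: one bold span, large type, no other text.

    The 18pt threshold is an arbitrary assumption for this sketch.
    """
    spans = list(p)
    if len(spans) != 1 or spans[0].tag != "span":
        return False
    # Any text outside the span means the formatting is only local emphasis.
    if (p.text or "").strip() or (spans[0].tail or "").strip():
        return False
    props = parse_style(spans[0].get("style"))
    size = props.get("font-size", "")
    if not size.endswith("pt"):
        return False
    try:
        points = float(size[:-2])
    except ValueError:
        return False
    return props.get("font-weight") == "bold" and points >= min_pt

# Hypothetical sample in the flat, formatting-laden style described above.
SAMPLE = """<div>
<p><span style="font-weight: bold; font-size: 24pt">Chapter One</span></p>
<p>Body text with <span style="font-weight: bold">one bold word</span>.</p>
</div>"""

root = ET.fromstring(SAMPLE)
flags = [looks_like_heading(p) for p in root.iter("p")]
print(flags)
```

A rule like this is a guess about one author's habits, which is why it belongs in a later, document-specific step rather than in the extraction itself.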
Not to drop anything coming across
To "look the same"
Following the rule that looks are everything
To do as little as possible to improve anything
Only to show what's there
Looks tolerably okay in a browser
Shows "the same document" as the Word doc, now in more transmissible form
Most salient properties are exposed as HTML+CSS
Noise is removed or dampened
"Does its best and doesn't worry"
Extensible to new properties as requirements are discovered
Is also … XML (i.e. syntactically well-formed) ...
… thus suitable for further improvement ...
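Because the output is well-formed, any XML parser can read it. As a rough illustration (Python standard library, not XSweet code), the sketch below takes an inventory of paragraph classes, which is one plausible first step in analyzing a document. The sample markup, the `Normal` class name, and the absence of a namespace are assumptions; real output may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample; element and class names are illustrative only.
SAMPLE = """<body>
<p class="Header1">Paragraph Style "Header 1"</p>
<p class="Normal">First paragraph.</p>
<p class="Normal">Second, with <span style="font-weight: bold; font-size: 24pt">24pt, bold face</span>.</p>
</body>"""

def class_inventory(xml_text):
    """Count how often each paragraph class occurs in the document."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return Counter(p.get("class", "(none)") for p in root.iter("p"))

inventory = class_inventory(SAMPLE)
for name, count in inventory.most_common():
    print(name, count)
```

The same inventory could be taken with XSLT or XQuery; the point is only that no HTML-specific tooling is needed.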
Dovetailing with several projects including JATSKit, XML jellysandwich, Pause Press
Needing BITS examples
Having a Word manuscript (
Project demonstrates end to end production of BITS from a "real" (monograph) example
Followed by demonstrations of its production
(Including, experimentally, XML in the browser …)
Mark Scott, the author of
Fairly typical Word document by a "non-technical user"
(A writer and writing instructor who knows his way around a text)
Code turns out to be quite clean
Some specialized content types make this text interesting
Scholarly monograph albeit w/o formal referencing of sources
Some sections comprise
Links: WordML
Running XSweet gives me only a rough-and-ready HTML
To produce BITS from this, more effort is required
Producing structures including
Epigrams, in sequences
Sections and (untitled) subsections
Overall sequence of sections (Parallel or nested?)
These were handled by analysis + XSLT
Analysis established consistent
Such as consistent use of blank lines, text indents
Given the analysis, XSLT did the work
E.g., producing JATS
Is custom-fitted XSLT really easier than hand conversion? (sometimes)
Level of effort: more than an hour, less than a day
(Caveat: already had some HTML → JATS XSLT available)
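The kind of rule that such analysis yields can be sketched briefly. This is a Python stand-in, not the XSLT actually used: it assumes, hypothetically, that blank paragraphs mark section boundaries in the flat HTML, and wraps each run of non-empty paragraphs in a `sec` element. Inline markup is flattened to plain text to keep the example short.

```python
import xml.etree.ElementTree as ET

def group_into_sections(body):
    """Wrap runs of non-empty <p> in <sec>, splitting at blank paragraphs."""
    out = ET.Element("body")
    current = None
    for p in body.findall("p"):
        text = "".join(p.itertext())
        if not text.strip():
            current = None  # a blank paragraph closes the open section
            continue
        if current is None:
            current = ET.SubElement(out, "sec")
        new_p = ET.SubElement(current, "p")
        new_p.text = text  # inline markup is dropped in this sketch
    return out

# Hypothetical flat input: blank paragraphs stand in for the blank lines
# the author used between sections.
FLAT = "<body><p>One.</p><p>Two.</p><p> </p><p>Three.</p><p/><p/><p>Four.</p></body>"

result = group_into_sections(ET.fromstring(FLAT))
print(ET.tostring(result, encoding="unicode"))
```

For this input the result holds three `sec` elements; consecutive blank paragraphs produce no empty sections. The rule is only as reliable as the author's consistency in using blank lines.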
The pipeline works
But it is only as good as the analysis
What does our analysis cover?
One document? A set of (related) documents?
This aspect of the problem is irreducible
Range of possible structures and content types
Range of ways of representing them
For example, we have "epigrams" (indicated by paragraphs with leading white space),
which become
(How often will
Meanwhile I am all set up to convert more manuscripts by Mark Scott
What was not a problem
Keeping everything even peculiar stuff (okay I was lucky)
Mapping from HTML Typescript into JATS (BITS)
Editing, production … everything subsequent to conversion
An "XML Early" workflow
Principles of scale apply
Yet even a small operation can benefit from positive externalities
(Commodity software, standards, open source toolkits)
For even "fine-grained" work
Conclusions re: XSweet
Good enough!
Probably (already) roughly comparable to other inexpensive / low-end approaches
Requires extension / work / tuning to be as good as "state of the art" applications or service providers
(But maybe that is what makes them state of the art)
Beautiful BITS format!
Given an XML/XSLT/JATS toolkit, subsequent editing (in BITS) is a breeze
Meanwhile we started looking at productions via JATSKit stylesheets
HTML, version available at
(Should have EPUB soon / already)
Also - XML in the browser?
Even more experimental SaxonJS version
Yes, starting halfway up the slope is a big help
Producing good BITS from HTML is a breeze. From Word
Certain kinds of redundancy / sloppiness are okay and even useful
Set aside your "engineer's pride"
The will to execute is always more important than the means
"Love Your Data" remains the only way to produce good work
(Which can't be automated)
Rube Goldberg's Self-operating napkin (Wikimedia Commons)