X-Git-Url: https://git.stg.codes/stg.git/blobdiff_plain/bfec9cc7ab5a396f7662090b208691ec59a69f1b..2f1753cc3e240fa497a87873ed19fe3f11e22331:/doc/help/xslt/roundtrip/specifications.xml diff --git a/doc/help/xslt/roundtrip/specifications.xml b/doc/help/xslt/roundtrip/specifications.xml deleted file mode 100644 index 85db866e..00000000 --- a/doc/help/xslt/roundtrip/specifications.xml +++ /dev/null @@ -1,1420 +0,0 @@ - -
- - Round-Tripping Specifications - - Bob - Stayton - - Sagehill Enterprises - - - - Steve - Ball - - Explain - - - - - 1.8 - 2008-05-22 - SRB - Updated for current implementation. - - - 1.7 - 2008-02-22 - SRB - Added edition. - - - 1.6 - 2007-10-19 - SRB - Added keyword. - - - 1.5 - 2007-01-05 - SRB - Reduce emphasis on WordML, add support for OpenOffice. - - - 1.4 - 2005-11-11 - SRB - Added bibliography. - - - 1.3 - 2005-10-31 - SRB - Added mediaobjectco, imageobjectco, programlistingco, areaspec, area, calloutlist. - - - 1.2 - 2005-10-13 - SRB - Version prior to using revhistory. - - - - - This document specifies how DocBook elements are mapped to paragraph and character styles in a word processor. The specifications are used to write conversions between DocBook XML and word processor XML formats, such as Microsoft's WordProcessingML (WordML), OpenOffice's OpenDocument and Apple's Pages. - -
- Introduction - Microsoft Word 2003 introduced WordProcessingML (WordML), an XML vocabulary for Word documents. Since then, other popular word processors have become available that use XML as their data representation, namely Apple's Pages and OpenOffice. Once a Word (or OpenOffice or Pages) document is available as XML, it becomes possible to convert a word processing document to DocBook and vice versa using XSL transformations. Such conversions enable the following: - - - DocBook content creators write in their familiar word processing application, rather than learning a new XML editing application. - - - DocBook XML documents can be styled for output using the typesetting features of the word processor. - - - Word processors have a simple, flat data model; documents consist of paragraphs (and tables) and paragraphs contain text and character spans. All word processors allow styles to be associated with paragraphs and spans. - This specification describes how DocBook elements map to a set of paragraph and character styles. It defines a specific set of style names for which a Word style template can be created. The style names are also used in XSLT template match patterns for conversion. Although originally targeted at MS Word, the system has subsequently been extended to use other word processors, notably Apple's Pages and OpenOffice. -
-
- Project goals - The goal of this project is to enable a word processor, such as (but not limited to) Microsoft Word, to be used with DocBook files. The specific goals include: - - - Enable authoring of basic DocBook documents in the word processor. - - - Enable importing of basic DocBook XML documents into the word processor. - - - To meet these goals, the project provides a toolkit that can be immediately put to use. The kit includes: - - - Templates for Microsoft Word, Apple Pages and OpenOffice with formatting styles attached to the style names. - - - XSLT stylesheets that convert a word processing document that is authored with the corresponding template into a DocBook XML file. - - - XSLT stylesheets that convert a DocBook document into a word processing document that can be opened in a word processor. - - -
- Why basic DocBook? - This project will never be able to support all DocBook elements and structure. Take, for example, the address element. This element can be used both as a block element for metadata and as a phrase-level element in a block parent, such as the affiliation element. To make matters worse, it can itself contain phrase-level markup, such as personname. No word processor allows character styles to be nested, so that markup cannot be represented. - The project will initially focus on a basic set of commonly used DocBook elements in order to create a useful editing environment that utilises a word processor with DocBook. - One problem facing this conversion project is the sheer number of DocBook elements, over 400 in DocBook 5.0. To support DocBook structural models, several of the elements require more than one paragraph or character style. This would lead to a very long and unwieldy list of styles in the word processor interface. That would make authoring less efficient and discourage users. - Accordingly, this project assumes that authors who need the full set of DocBook elements and structures will use an XML authoring tool that better supports them. This project is focused on authors who wish to write basic DocBook documents using a word processor. Because Microsoft Word is so widespread, it is hoped that this project will help a lot of new DocBook users get started with familiar tools. They can then graduate to more advanced tools as their needs develop. -
-
-
- Project Non-Goals - The following goals are not in the scope of this project: - - - Support of versions of Word that do not feature reading/writing WordML (XML). That is, all versions prior to Word 11 (Office 2003). - - - Support of arbitrarily defined styles. This system may expect certain styles to be defined in a particular fashion (in particular, those defining the title of components and divisions). - - -
-
- Mapping elements to styles - Although WordML, OpenDocument and DocBook are all XML, there are several challenges when trying to convert between them. - The basic problem in mapping paragraph/character styles to DocBook elements is that word processor documents support far less structure than DocBook. DocBook permits nesting of elements within other elements, providing multiple levels of context for each element. - Word's only structural feature is the outlining mode. In Word outlining, certain paragraph styles are assigned outline levels. When a user applies those styles, they effectively create logical structure in the Word document. Unfortunately, Word itself attempts to automatically determine which paragraphs are headings, rendering this method unreliable. - Instead of relying on Word's built-in outlining mode, this system uses only the names of paragraph styles to determine document structure. Certain heuristics are applied to build the DocBook element structure from the (relatively flat) word processing structure. Titles and other features are used to mark the beginning of a structure and all paragraphs following that are included in that structure until the beginning of the next structure is found. That is, the beginning of one structure marks the end of the previous structure. - Problems may arise when a structure should end, but there is no word processor feature that marks the endpoint. In that case, an empty paragraph is used to mark the end of the structure. - Nesting of block elements is another commonly used feature of DocBook. It is not possible to use Word's outline mode for blocks if it is being used for components and sections. So in this specification, nesting of block elements is indicated by adding a number suffix to a style. So a paragraph with style orderedlist2 is considered to be contained within a preceding paragraph with style orderedlist1 or itemizedlist1. 
Where appropriate in the word processor, paragraph indent levels are used to visually indicate nesting of blocks. - Nesting of inline DocBook elements is particularly difficult to support because word processors do not nest character styles. That means a nested inline would require a separate character style to indicate the parent-child relationship. Given the large number of combinations possible, a prohibitively large number of character styles would have to be created. In this project, nesting of character styles is not supported. Nested inlines being imported from DocBook will be converted to a sequence of single-name character styles, where possible, or rejected. - In many cases, DocBook structure can be derived from the flat sequence of paragraphs based on sibling relationships. For example, when a paragraph styled as para is followed by a paragraph styled as itemizedlist1, the conversion to DocBook will output a para element and then start an itemizedlist element, with the second paragraph as its first listitem. All itemizedlist1 paragraphs that follow without interruption are inserted into the same itemizedlist element. - Some combinations of elements cannot be supported (at least not with the techniques described in this document). An example is informalexample and its permitted content: there is no title to mark the beginning of the element, there is no marker for the end of the element, and there are too many parent-child combinations to reasonably define style names. - The design principles used in this project for selecting paragraph/character style names are as follows: - - - Where Word (or OpenOffice or Pages), by default, has a style or feature that corresponds directly to a DocBook element then that style or feature will be used (and documented in this document). 
For example, the Normal paragraph style maps to a DocBook para element, and a Word table (w:tbl) maps to a DocBook table. (In some cases Word may possess a feature that does not function in an acceptable manner; lists are an example. In these cases the feature is to be avoided, and a workaround provided.) - - - Paragraph and character style names will match DocBook element names as much as possible. This will enable authors to learn DocBook element names and help debug problems with conversion. - - - A style may indicate a parent-child relationship, but the paragraph for such an element may only occur after a paragraph that denotes the beginning of the parent structure. In this case the element name is used as the style name. For example, a personblurb paragraph may only occur after an author, editor or othercontrib paragraph. If a paragraph occurs without the appropriate preceding paragraph, then an error is signalled. - - - Some styles may also indicate a parent-child relationship, but either the parent structure is ambiguous or the paragraph starts the parent structure. For example, chapter-title indicates that the paragraph is a title element whose DocBook parent is a chapter element. - - - Some style names are simplified to make them easier to use in the word processor. For example, a paragraph in an orderedlist requires three elements in DocBook: orderedlist, listitem, and para. The paragraph style name in Word is shortened from orderedlist-listitem-para to just orderedlist1 (for a first level list). In the case of lists (see below), the list level is appended, which is why this example becomes orderedlist1. - - - Style names with a number suffix indicate a nesting level, as described above. - - - Style names with continue indicate that the paragraph is part of the preceding element. For example, a para paragraph is used for a single paragraph para element. This causes any preceding list to be closed. 
If a list item in the preceding list is to contain more than one paragraph, then the subsequent paragraphs in the word processor document must use the para-continue style. - - - Character styles map to elements that are children of the element for the paragraph, hence there is no need to encode parent-child relationships. For example, a surname character style in an author paragraph becomes a surname child element of the author element. - - - Empty paragraph and character styles are ignored. This can be useful to end structures. - - - The first paragraph style in the word processor document is used to define the root element of the DocBook document. For example, if the document starts with book-title, then the DocBook document will have a book element as its root element. All the rest of the document content will be contained in that root element. - - - Sequential structures are coalesced into a single parent element. For example, a sequence of itemizedlist1 paragraphs becomes a single itemizedlist element with several listitem element children. - - DocBook to Paragraph/Character Styles - - - - - - - - DocBook element - - - Style(s) - - - Comments - - - - - - - - Components and sections - - - - - - book/info/title - - - book-title - - - - - - - - book/info/subtitle - - - book-subtitle - - - - - - - - book/info/titleabbrev - - - book-titleabbrev - - - - - - - - chapter/info/title - - - chapter-title - - - Assigned Word outline level 1. - - - - - chapter/info/subtitle - - - chapter-subtitle - - - - - - - - chapter/info/titleabbrev - - - chapter-titleabbrev - - - - - - - - appendix/info/title - - - appendix-title - - - Assigned Word outline level 1. - - - - - preface/info/title - - - preface-title - - - Assigned Word outline level 1. - - - - - article/info/title - - - article-title - - - Assigned Word outline level 1. 
- - - - - article/info/subtitle - - - article-subtitle - - - - - - - - article/info/titleabbrev - - - article-titleabbrev - - - - - - - - bibliography/info/title - - - bibliography-title - - - Assigned Word outline level 1. - - - - - bibliography/bibliodiv/info/title - - - bibliodiv-title - - - - - - - - biblioentry/title - - - biblioentry-title - - - Metadata elements after the biblioentry-title paragraph become part of the biblioentry. - - - - - glossary/info/title - - - glossary-title - - - Assigned Word outline level 1. - - - - - index/info/title - - - index-title - - - Assigned Word outline level 1. - - - - - part/info/title - - - part-title - - - - - - - - section - - - - - - Unnumbered section elements are translated into their equivalent numbered paragraph style. Sections 6 levels and deeper are reported as an error. - - - - - sect1/info/title - - - sect1-title - - - Assigned Word outline level 2. - - - - - sect1/info/subtitle - - - sect1-subtitle - - - - - - - - sect2/info/title - - - sect2-title - - - Assigned Word outline level 3. - - - - - sect2/info/subtitle - - - sect2-subtitle - - - - - - - - sect3/info/title - - - sect3-title - - - Assigned Word outline level 4. - - - - - sect3/info/subtitle - - - sect3-subtitle - - - - - - - - sect4/info/title - - - sect4-title - - - Assigned Word outline level 5. - - - - - sect4/info/subtitle - - - sect4-subtitle - - - - - - - - sect5/info/title - - - sect5-title - - - Assigned Word outline level 6. - - - - - sect5/info/subtitle - - - sect5-subtitle - - - - - - - - simplesect/info/title - - - simplesect-title - - - - - - - - simplesect/info/subtitle - - - simplesect-subtitle - - - - - - - - bridgehead - - - bridgehead - - - - - - - - - Metadata elements - - - - - - abstract/title - - - abstract-title - - . 
- - - - abstract/para - - - abstract - - - - - - - - affiliation - - - affiliation - - - - - - - - address - - - address - - - - - - - - author - - - author - - - - - - - - date - - - date - - - - - - - - edition - - - edition - - - - - - - - legalnotice - - - legalnotice - - - - - - - - pubdate - - - pubdate - - - - - - - - publisher/publishername - - - publisher - - - - - - - - publisher/address - - - publisher-address - - - - - - - - revhistory/revision - - - revision - - - - - - - - - Block-level elements - - - - - - para - - - para, Normal - - - Any Word paragraph with style Normal will also be converted to a para element. - - - - - formalpara/title - - - formalpara-title - - - - - - - - formalpara/para - - - formalpara - - - - - - - - simpara - - - simpara - - - - - - - - note/title - - - note-title - - - - - - - - note/para - - - note - - - Consecutive paragraphs with style note after the first note are to be treated as part of the same note element. That is, consecutive notes are coalesced. The note may or may not have a title. - - - - - caution/title - - - caution-title - - - - - - - - caution/para - - - caution - - - Consecutive cautions are coalesced. - - - - - warning/title - - - warning-title - - - - - - - - warning/para - - - warning - - - Consecutive warnings are coalesced. - - - - - important/title - - - important-title - - - - - - - - important/para - - - important - - - Consecutive importants are coalesced. - - - - - tip/title - - - tip-title - - - - - - - - tip/para - - - tip - - - Consecutive tips are coalesced. - - - - - itemizedlist/listitem/para - - - - itemizedlist1 -itemizedlist2 -itemizedlist3 -itemizedlist4 - - - - A number suffix indicates a nesting level within other lists. - - - - - orderedlist/listitem/para - - - - orderedlist1 -orderedlist2 -orderedlist3 -orderedlist4 - - - - - - - - - listitem/para[position() != 1] - - - para-continue - - - This paragraph is included in the immediately preceding listitem. 
- - - - - example/title - - - example-title - - - All content following the title is included in the example element. The end of the example content is marked by a caption paragraph or an empty paragraph if there is no caption. - - - - - figure/title - - - figure-title - - - All content following the title is included in the figure element. Metadata must immediately follow the title. The end of the figure content is marked by a caption paragraph or an empty paragraph if there is no caption. - - - - - informalfigure/mediaobject/imageobject/imagedata/@fileref - - - informalfigure-imagedata, caption - - - The content of the imageobject-imagedata paragraph is taken as the URI for the image. Metadata may immediately follow the paragraph. - - - - - mediaobject/imageobject/imagedata/@fileref - - - imageobject-imagedata, caption - - - The content of the imageobject-imagedata paragraph is taken as the URI for the image. May be followed by a caption style paragraph. Metadata may immediately follow the paragraph, before the caption, if any. - - - - - table - - - Word table, caption - - - - - - - - table/title - - - table-title, caption - - - Metadata may immediately follow the paragraph. - - - - - informaltable - - - Word table - - - A table with no title immediately preceding it. - - - - - caption - - - caption - - - - - - - - literallayout - - - literallayout - - - Inside a literallayout paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). - - - - - programlisting - - - programlisting - - - Inside a programlisting paragraph in Word, lines should be separated by line break (Shift-Enter) rather than paragraph break (Enter). Tabs are not supported. - - - - - blockquote/title - - - blockquote-title - - - Must immediately precede a blockquote paragraph in Word. 
- - - - - blockquote/para - - - blockquote - - - - - - - - blockquote/attribution - - - blockquote-attribution - - - Must immediately follow a blockquote paragraph in Word. - - - - - bibliomisc - - - bibliomisc - - - - - - - - - Non-DocBook elements - - - - - - xi:include - - - xinclude - - - The content of the paragraph becomes the value of the href attribute. - - - - - - Inline elements - - - - - - emphasis - - - emphasis - - - - - - - - emphasis/@role="bold" - - - emphasis-bold - - - - - - - - emphasis/@role="underline" - - - emphasis-underline - - - - - - - - footnote - - - Word footnote - - - - - - - - link - - - link - - - In Word, hyperlink properties identify the DocBook linkend. - - - - - releaseinfo - - - releaseinfo - - - - - - - - surname - - - surname - - - Character style. Must occur in an appropriate parent paragraph, such as author or editor. - - - - - firstname - - - firstname - - - Character style. Must occur in an appropriate parent paragraph, such as author or editor. - - - - - orgname - - - orgname - - - - - - - - keyword - - - keywordset/keyword - - - Paragraph style. Consecutive keyword elements are merged into a single keywordset parent element. Words (phrases) within a paragraph separated by commas become individual keyword elements. - - - - - citetitle - - - citetitle - - - - - - - - city - - - city - - - - - - - - contrib - - - contrib - - - - - - - - country - - - country - - - - - - - - email - - - email - - - - - - - - fax - - - fax - - - - - - - - honorific - - - honorific - - - - - - - - jobtitle - - - jobtitle - - - - - - - - lineage - - - lineage - - - - - - - - orgdiv - - - orgdiv - - - - - - - - otheraddr - - - otheraddr - - - - - - - - othername - - - othername - - - - - - - - phone - - - phone - - - - - - - - pob - - - pob - - - - - - - - postcode - - - postcode - - - - - - - - shortaffil - - - shortaffil - - - - - - - - state - - - state - - - - - - - -
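The section rows of the table above follow a regular pattern: a section at nesting depth N uses the sectN-title style, sect1-title to sect5-title carry Word outline levels 2 to 6, and deeper nesting is an error. The following Python sketch illustrates that rule only. The function name `section_title_style` and the use of `ValueError` to report the error are assumptions made for this example; the project itself performs the conversion with XSLT stylesheets.

```python
# Hypothetical helper: the name and the exception type are assumptions,
# not part of the round-trip stylesheets.
MAX_SECTION_DEPTH = 5  # the table defines sect1-title .. sect5-title only


def section_title_style(depth: int) -> tuple[str, int]:
    """Return (paragraph style, Word outline level) for a section title.

    `depth` is the nesting depth of the section, starting at 1 for a
    section directly inside a component such as a chapter.
    """
    if depth < 1:
        raise ValueError(f"section depth must be at least 1, got {depth}")
    if depth > MAX_SECTION_DEPTH:
        # The specification reports sections 6 levels and deeper as an error.
        raise ValueError(f"section nested {depth} levels deep is not supported")
    # Outline level 1 is used by component titles, so sections start at level 2.
    return f"sect{depth}-title", depth + 1
```

For example, `section_title_style(3)` yields the style name sect3-title together with outline level 4, matching the corresponding table row.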
- - Proposed Additions - not yet implemented - - - - - - - - DocBook element - - - Style(s) - - - Comments - - - - - - - variablelist/varlistentry/term - - - - variablelist1-term -variablelist2-term -variablelist3-term -variablelist4-term - - - - A variablelist in Word should be a sequence of alternating paragraphs styled as variablelistN-term and variablelistN. - - - - - variablelist/varlistentry/listitem/para - - - - variablelist1 -variablelist2 -variablelist3 -variablelist4 - - - - Consecutive paragraphs are coalesced. - - - - -
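The sibling-based heuristics described in this section (a para paragraph closes any preceding list, consecutive itemizedlist1 paragraphs are coalesced into one itemizedlist, and para-continue adds a paragraph to the preceding listitem) can be illustrated with a minimal Python sketch. It is not the project's XSLT implementation. The function name `paragraphs_to_docbook`, the representation of the document as (style, text) pairs, the default root element and the restriction to first-level lists are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Style names such as itemizedlist1 or orderedlist2: element name plus level.
LIST_STYLE = re.compile(r"^(itemizedlist|orderedlist)(\d)$")


def paragraphs_to_docbook(paragraphs, root_name="article"):
    """Build a DocBook tree from a flat list of (style, text) paragraphs.

    Hypothetical sketch: only para, Normal, para-continue and first-level
    list styles are handled; anything else raises ValueError.
    """
    root = ET.Element(root_name)
    current_list = None  # list element currently being filled, if any
    current_item = None  # most recent listitem, target for para-continue
    for style, text in paragraphs:
        match = LIST_STYLE.match(style)
        if not text.strip():
            # An empty paragraph produces no output and ends the open structure.
            current_list = current_item = None
        elif match and match.group(2) == "1":
            name = match.group(1)
            # Consecutive paragraphs of the same list style share one parent.
            if current_list is None or current_list.tag != name:
                current_list = ET.SubElement(root, name)
            current_item = ET.SubElement(current_list, "listitem")
            ET.SubElement(current_item, "para").text = text
        elif style == "para-continue":
            if current_item is None:
                raise ValueError("para-continue without a preceding list item")
            ET.SubElement(current_item, "para").text = text
        elif style in ("para", "Normal"):
            # A plain paragraph closes any preceding list.
            current_list = current_item = None
            ET.SubElement(root, "para").text = text
        else:
            raise ValueError(f"style not handled by this sketch: {style}")
    return root
```

Calling the function with a para paragraph, two itemizedlist1 paragraphs and a final para paragraph produces a root element with three children: para, itemizedlist (holding two listitem elements) and para. Nested levels such as itemizedlist2 would need a stack of open lists, which is omitted here to keep the sketch short.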
-
- Attributes - Attributes are a feature of DocBook XML that has no direct counterpart in Word. - XML attributes are encoded in Word comments (annotations). Some dummy text (just a space, using a character style that includes the hidden property) anchors the comment. Within the comment text, character types are used to indicate attribute names and values (these must be paired). This approach keeps the attributes separate from the main body and allows multiple attributes to be encoded. - A disadvantage of this approach is that a paragraph may be related to more than one element, but the attributes are associated with only one element (by default the parent). For example, a section may have an attribute as well as the title child element, but only a single paragraph (with paragraph style sect1-title) represents both elements. Any attribute defined in a comment would be associated with the sect1 element. - Pages does not have annotations, so the character styles attribute-name and attribute-value are used. -
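The requirement that attribute names and values be paired can be illustrated with a short Python sketch. It assumes that the text carrying the attributes has already been extracted as a sequence of (character style, text) runs using the attribute-name and attribute-value styles mentioned for Pages. The function name `pair_attributes`, the run representation and the error handling are assumptions made for this example, not part of the stylesheets.

```python
def pair_attributes(runs):
    """Collect attributes from (character_style, text) runs.

    Hypothetical sketch: every attribute-name run must be followed by an
    attribute-value run before the next name; otherwise ValueError is raised.
    """
    attributes = {}
    pending_name = None  # name seen but not yet matched with a value
    for style, text in runs:
        if style == "attribute-name":
            if pending_name is not None:
                raise ValueError(f"attribute {pending_name!r} has no value")
            pending_name = text.strip()
        elif style == "attribute-value":
            if pending_name is None:
                raise ValueError(f"value {text!r} has no attribute name")
            attributes[pending_name] = text
            pending_name = None
        # Runs in any other style (for example separating spaces) are skipped.
    if pending_name is not None:
        raise ValueError(f"attribute {pending_name!r} has no value")
    return attributes
```

Because several name/value pairs can appear in the same sequence of runs, the sketch returns a dictionary, reflecting the statement above that multiple attributes can be encoded.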
-
-