Determining if a paragraph is actually a heading
Overview
This document discusses the use of the w:outlineLvl element in WordprocessingML, as defined by the ECMA-376-1:2016 standard, and contrasts it with how document processing tools such as Pandoc infer heading levels from style names.
Outline Level in ECMA-376-1:2016
According to Section 17.3.1.20 of ECMA-376-1:2016, the w:outlineLvl element specifies the outline level associated with a paragraph. The outline level is an integer from 0 to 9, where 0 is the top-most level of the outline and 9 means that no outline level applies. The element does not change the appearance of the text; it is used when generating the Table of Contents, plays a role in organizing document structure, and can drive application-specific behavior. Notably, if the element is omitted, the content is assumed to be at outline level 9 (no level).
Usage in Document Styles
Below is a snippet of XML demonstrating the use of w:outlineLvl in a paragraph style definition:
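As an illustration, the sketch below embeds a small style definition of this kind in Python and reads its outline level with the standard library. The XML is an illustrative example rather than output from a particular application, and the helper name outline_level is hypothetical.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Illustrative paragraph style that declares an outline level of 0.
STYLE_XML = f"""
<w:style xmlns:w="{W}" w:type="paragraph" w:styleId="Heading1">
  <w:name w:val="heading 1"/>
  <w:basedOn w:val="Normal"/>
  <w:pPr>
    <w:outlineLvl w:val="0"/>
  </w:pPr>
</w:style>
""".strip()


def outline_level(style):
    """Return the style's own w:outlineLvl as an int, or None if it is not set."""
    node = style.find("w:pPr/w:outlineLvl", NS)
    if node is None:
        return None
    return int(node.get(f"{{{W}}}val"))


style = ET.fromstring(STYLE_XML)
print(outline_level(style))  # 0
```

The helper only looks at the style's own w:pPr; it does not follow w:basedOn.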
Pandoc’s Approach Using w:name
Pandoc, a document conversion tool, interprets document structure differently. It uses the w:name element within style definitions to determine the header level. This approach is tied not to the w:outlineLvl element but to the naming convention of styles.
Below is a snippet of XML generated by Google Docs, demonstrating an example where w:outlineLvl was not used in a paragraph style definition:
Here’s how Pandoc might parse the header level from a style name:
For broader accessibility, here is the same function represented in pseudocode suitable for a Python environment:
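A minimal sketch of such a function follows. It is written for this discussion and is not Pandoc's source: the name header_level_from_style_name is hypothetical, and the case-insensitive comparison and the rejection of level 0 are assumptions of the sketch.

```python
from typing import Optional


def header_level_from_style_name(name: str) -> Optional[int]:
    """Return the header level encoded in a style name such as "heading 2".

    Returns None when the name does not follow the "heading N" convention.
    """
    prefix = "heading "
    lowered = name.lower()  # assumption: match the prefix case-insensitively
    if not lowered.startswith(prefix):
        return None
    rest = lowered[len(prefix):]
    # Accept plain ASCII digits only, so int() below cannot fail.
    if not (rest.isascii() and rest.isdigit()):
        return None
    level = int(rest)
    return level if level > 0 else None  # assumption: levels start at 1


print(header_level_from_style_name("heading 1"))  # 1
print(header_level_from_style_name("Title"))      # None
```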
Pandoc extracts the style name and, if it starts with the prefix "heading ", interprets the number that follows as the header level. This method relies heavily on naming conventions and may miss the intended structure when document authors do not follow them.
Contrast in approaches
The fundamental distinction between the ECMA-376-1:2016 standard and Pandoc’s approach centers on the source of information used to structure documents:
- ECMA Standard: The standard defines the w:outlineLvl element to specify the outline level, embedding the document’s structural semantics directly into its markup. This keeps the document’s structure explicit and consistent with its markup. In practice, there are producers, such as Google Docs, that do not emit w:outlineLvl.
- Pandoc: Depends on the w:name element and requires a specific naming convention ("heading x") to deduce the structure. This approach is less robust against variations in style names, but more robust against inconsistencies among word processing applications that do not use w:outlineLvl.
Proposed solution for comprehensive interpretation
Illustrative example of potential limitations
Consider the following XML snippet, which demonstrates a scenario that Pandoc might not adequately cover due to its reliance on naming conventions:
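One such scenario, sketched under assumptions: a custom style whose name does not follow the "heading N" convention but which still declares an outline level. The style name "Chapter Title" and the styleId are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical custom style: the name carries no "heading " prefix,
# yet the style declares outline level 0.
CUSTOM_STYLE_XML = f"""
<w:style xmlns:w="{W}" w:type="paragraph" w:customStyle="1" w:styleId="ChapterTitle">
  <w:name w:val="Chapter Title"/>
  <w:pPr>
    <w:outlineLvl w:val="0"/>
  </w:pPr>
</w:style>
""".strip()

custom_style = ET.fromstring(CUSTOM_STYLE_XML)
style_name = custom_style.find("w:name", NS).get(f"{{{W}}}val")
outline_node = custom_style.find("w:pPr/w:outlineLvl", NS)

# A purely name-based rule finds nothing here...
name_matches = style_name.lower().startswith("heading ")
# ...although the markup itself states the outline level.
declared_level = int(outline_node.get(f"{{{W}}}val"))

print(name_matches)    # False
print(declared_level)  # 0
```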
Proposed strategy
A robust solution to ensure comprehensive structural interpretation would analyze the properties of styles together with their inheritance. By following the styles that a style derives from (w:basedOn), one can merge in the properties of the base style. The merged analysis would then check for either:
- A style name that starts with "heading " in either the derived or base style.
- The presence of a defined w:outlineLvl within the w:pPr element of either the derived or base style.
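The two checks above can be sketched as a walk along the w:basedOn chain. This is a minimal sketch under stated assumptions, not a definitive implementation: the function names index_styles and heading_level are hypothetical, outline level n is mapped to heading level n + 1, the derived style is consulted before its base, and an outline level of 9 is treated as "keep looking" rather than as an explicit override.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def _val(node: Optional[ET.Element]) -> Optional[str]:
    """Return the w:val attribute of a node, or None if the node is absent."""
    return None if node is None else node.get(f"{{{W}}}val")


def index_styles(styles_root: ET.Element) -> Dict[str, ET.Element]:
    """Map each styleId to its w:style element."""
    return {s.get(f"{{{W}}}styleId"): s for s in styles_root.findall("w:style", NS)}


def heading_level(style_id: Optional[str], styles: Dict[str, ET.Element]) -> Optional[int]:
    """Return a 1-based heading level for a style, or None if it is not a heading."""
    prefix = "heading "
    seen = set()  # guards against cyclic w:basedOn references
    while style_id in styles and style_id not in seen:
        seen.add(style_id)
        style = styles[style_id]
        # Check 1: a style name of the form "heading N".
        name = (_val(style.find("w:name", NS)) or "").lower()
        suffix = name[len(prefix):]
        if name.startswith(prefix) and suffix.isascii() and suffix.isdigit() and int(suffix) > 0:
            return int(suffix)
        # Check 2: an explicit w:outlineLvl below 9 inside w:pPr.
        outline = _val(style.find("w:pPr/w:outlineLvl", NS))
        if outline is not None and outline.isascii() and outline.isdigit() and int(outline) < 9:
            return int(outline) + 1
        # Neither check matched: continue with the base style, if any.
        style_id = _val(style.find("w:basedOn", NS))
    return None


# Illustrative styles part: "Derived" inherits its outline level from "Base".
STYLES_XML = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Base">
    <w:name w:val="Section Base"/>
    <w:pPr><w:outlineLvl w:val="1"/></w:pPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Derived">
    <w:name w:val="Section Fancy"/>
    <w:basedOn w:val="Base"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Body">
    <w:name w:val="Body Text"/>
  </w:style>
</w:styles>
""".strip()

styles = index_styles(ET.fromstring(STYLES_XML))
print(heading_level("Derived", styles))  # 2
print(heading_level("Body", styles))     # None
```

Checking the derived style first means the nearest definition wins, which mirrors how inherited paragraph properties are usually resolved; a stricter reading could stop at the first w:outlineLvl found, even when its value is 9.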