Determining if a paragraph is actually a heading

Overview

This document discusses the use of the w:outlineLvl element in WordprocessingML, as defined by ECMA-376-1:2016, and contrasts it with how document processing tools such as Pandoc infer heading levels from style names.

Outline Level in ECMA-376-1:2016

According to Section 17.3.1.20 of ECMA-376-1:2016, the w:outlineLvl element specifies the outline level associated with a paragraph. Its value is an integer from 0 to 9, where 0 is the top-most outline level and 9 indicates that no outline level is applied. The element does not change the appearance of the text; it is used when generating the Table of Contents and may drive other application-specific behavior. If the element is omitted, the paragraph is assumed to have an outline level of 9 (no level).

Usage in Document Styles

Below is a snippet of XML demonstrating the use of w:outlineLvl in a paragraph style definition:

<w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
    <w:next w:val="Normal"/>
    <w:qFormat/>
    <w:rsid w:val="00856A25"/>
    <w:pPr>
        <w:keepNext/>
        <w:pageBreakBefore/>
        <w:numPr>
            <w:numId w:val="11"/>
        </w:numPr>
        <w:spacing w:before="480" w:after="240"/>
        <w:jc w:val="center"/>
        <w:outlineLvl w:val="0"/>
    </w:pPr>
    <w:rPr>
        <w:rFonts w:ascii="Arial Black" w:hAnsi="Arial Black" w:cs="Arial"/>
        <w:bCs/>
        <w:caps/>
        <w:color w:val="FFFFFF" w:themeColor="background1"/>
        <w:kern w:val="32"/>
        <w:sz w:val="32"/>
        <w:szCs w:val="32"/>
    </w:rPr>
</w:style>
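
As a rough illustration of the rule described above, the following Python sketch reads w:outlineLvl from a style definition and falls back to 9 when the element is absent. The function name outline_level and the shortened sample style are assumptions made for this example; only the standard library's xml.etree.ElementTree is used.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the w: prefix in WordprocessingML (transitional) documents.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}
NO_OUTLINE_LEVEL = 9  # value assumed when w:outlineLvl is omitted


def outline_level(style_xml: str) -> int:
    """Return the w:outlineLvl value of a style, or 9 if it is not set."""
    style = ET.fromstring(style_xml.strip())
    node = style.find("w:pPr/w:outlineLvl", NS)
    if node is None:
        return NO_OUTLINE_LEVEL
    # Attributes are namespace-qualified, so w:val must be looked up by full name.
    value = node.get(f"{{{W_NS}}}val")
    return int(value) if value is not None else NO_OUTLINE_LEVEL


# Shortened, hypothetical version of the style shown above.
with_level = f"""
<w:style xmlns:w="{W_NS}" w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:pPr>
        <w:outlineLvl w:val="0"/>
    </w:pPr>
</w:style>
"""

# Same style with the w:outlineLvl element removed.
without_level = f"""
<w:style xmlns:w="{W_NS}" w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:pPr/>
</w:style>
"""

print(outline_level(with_level))     # 0
print(outline_level(without_level))  # 9
```

This sketch looks only at the style itself; it does not follow w:basedOn or paragraph-level overrides.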

Pandoc’s Approach Using w:name

Pandoc, a document conversion tool, interprets document structure differently. It uses the w:name element within style definitions to determine the header level. This approach is not directly tied to the w:outlineLvl specification but rather to the naming convention of styles.

Below is a snippet of XML generated by Google Docs, demonstrating an example where w:outlineLvl was not used in a paragraph style definition:

<w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1" />
    <w:basedOn w:val="Normal" />
    <w:next w:val="Normal" />
    <w:pPr>
        <w:keepNext w:val="1" />
        <w:keepLines w:val="1" />
        <w:pageBreakBefore w:val="0" />
        <w:spacing w:after="120" w:before="400" w:lineRule="auto" />
    </w:pPr>
    <w:rPr>
        <w:sz w:val="40" />
        <w:szCs w:val="40" />
    </w:rPr>
</w:style>

Here’s how Pandoc might parse the header level from a style name:

getHeaderLevel :: NameSpaces -> Element -> Maybe (ParaStyleName, Int)
getHeaderLevel ns element
  | Just styleName <- getElementStyleName ns element
  , Just n <- stringToInteger =<<
              (T.stripPrefix "heading " . T.toLower $
                fromStyleName styleName)
  , n > 0 = Just (styleName, fromInteger n)
getHeaderLevel _ _ = Nothing

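For broader accessibility, here is a rough Python rendering of the same logic. It is a sketch rather than a line-for-line port: get_element_style_name is assumed to be defined elsewhere, standing in for Pandoc's getElementStyleName.

def get_header_level(namespaces, element):
    # get_element_style_name is assumed to exist elsewhere; it stands in for
    # Pandoc's getElementStyleName and returns the style name or None.
    style_name = get_element_style_name(namespaces, element)
    if style_name is None:
        return None

    prefix = "heading "
    style_name_lower = style_name.lower()
    if not style_name_lower.startswith(prefix):
        return None

    # In this sketch, whatever follows the prefix must be a positive decimal number.
    header_level_str = style_name_lower[len(prefix):].strip()
    if header_level_str.isascii() and header_level_str.isdigit():
        header_level = int(header_level_str)
        if header_level > 0:
            return (style_name, header_level)

    return None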

Pandoc lowercases the style name, checks that it starts with the prefix "heading ", and interprets the number that follows as the header level. This method relies heavily on naming conventions and may not capture the intended structure if document creators do not adhere to them.

Contrast in approaches

The fundamental distinction between the ECMA-376-1:2016 standard and Pandoc’s approach centers on the source of information used to structure documents:

  • ECMA Standard: The standard defines the w:outlineLvl element to carry the outline level, embedding the document’s structural semantics directly in its markup. When the element is present, the structure is stated explicitly and does not depend on style names. In practice, some producers omit w:outlineLvl, as the Google Docs example above shows.
  • Pandoc: Depends on the w:name element and requires a specific naming convention ("heading x") to deduce the structure. This approach is less robust against variations in style names, but more robust against inconsistencies between word processing applications that do not write w:outlineLvl.

Proposed solution for comprehensive interpretation

Illustrative example of potential limitations

Consider the following XML snippet, which demonstrates a scenario that Pandoc might not adequately cover due to its reliance on naming conventions:

<w:style w:type="paragraph" w:customStyle="1" w:styleId="Heading1Mod">
    <w:name w:val="Heading 1 but modified"/>
    <w:basedOn w:val="Heading1"/>
    <w:next w:val="Normal"/>
    <w:qFormat/>
    <w:rsid w:val="00B2425D"/>
    <w:pPr>
        <w:pageBreakBefore w:val="0"/>
        <w:numPr>
            <w:numId w:val="0"/>
        </w:numPr>
    </w:pPr>
    <w:rPr>
        <w:sz w:val="28"/>
    </w:rPr>
</w:style>
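
To make the limitation concrete, here is a small Python sketch of a strict name-based check applied to the style name above. It mirrors the Python rendering given earlier in this document, in which everything after "heading " must be a number. Whether Pandoc itself rejects this name depends on how its stringToInteger helper treats trailing text, which is not shown here. The function name heading_level_from_name is an assumption for this example.

```python
from typing import Optional


def heading_level_from_name(style_name: str) -> Optional[int]:
    """Return N for names of the form 'heading N' (case-insensitive), else None."""
    prefix = "heading "
    lowered = style_name.lower()
    if not lowered.startswith(prefix):
        return None
    # Strict rule assumed here: the whole remainder must be a positive number.
    suffix = lowered[len(prefix):].strip()
    if suffix.isascii() and suffix.isdigit() and int(suffix) > 0:
        return int(suffix)
    return None


for name in ("heading 1", "Heading 1 but modified"):
    print(name, "->", heading_level_from_name(name))
# heading 1 -> 1
# Heading 1 but modified -> None
```

Under this strict rule, the derived style is not recognized as a heading even though it is based on Heading1.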

Proposed strategy

A robust solution to ensure comprehensive structural interpretation would involve analyzing the properties of styles and their inheritance. By examining styles that derive from others (w:basedOn), one can effectively merge properties from the base style. This merged analysis would involve checking for either:

  • A style name that starts with "heading " in either the derived or base style.
  • The presence of a defined w:outlineLvl within the w:pPr element of either the derived or base style.
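
The two checks above could be combined along the following lines. This is a minimal Python sketch under several assumptions, not a definitive implementation: the function names are hypothetical, w:outlineLvl is tried before the style name (the text does not fix an order), an outline level n is mapped to heading level n + 1, and the sample styles are shortened versions of the ones shown in this document plus one invented style (SectionTitle).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def w(name: str) -> str:
    """Build the namespace-qualified form of a w: attribute name."""
    return f"{{{W_NS}}}{name}"


def level_from_name(style: ET.Element) -> Optional[int]:
    """Heading level from a w:name of the form 'heading N', else None."""
    node = style.find("w:name", NS)
    text = (node.get(w("val")) or "").lower() if node is not None else ""
    if not text.startswith("heading "):
        return None
    suffix = text[len("heading "):].strip()
    if suffix.isascii() and suffix.isdigit() and int(suffix) > 0:
        return int(suffix)
    return None


def level_from_outline(style: ET.Element) -> Optional[int]:
    """Heading level from w:pPr/w:outlineLvl (0..8 -> 1..9), else None."""
    node = style.find("w:pPr/w:outlineLvl", NS)
    val = node.get(w("val")) if node is not None else None
    if val is None or not (val.isascii() and val.isdigit()):
        return None
    level = int(val)
    # Assumption: 9 means "no level", so it is not treated as a heading.
    return level + 1 if level < 9 else None


def heading_level(style_id: str, styles: Dict[str, ET.Element]) -> Optional[int]:
    """Walk the w:basedOn chain until a style yields a heading level."""
    seen = set()  # guards against cyclic w:basedOn references
    current: Optional[str] = style_id
    while current in styles and current not in seen:
        seen.add(current)
        style = styles[current]
        level = level_from_outline(style)
        if level is None:
            level = level_from_name(style)
        if level is not None:
            return level
        parent = style.find("w:basedOn", NS)
        current = parent.get(w("val")) if parent is not None else None
    return None


STYLES_XML = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="Normal"><w:name w:val="Normal"/></w:style>
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:basedOn w:val="Normal"/>
  </w:style>
  <w:style w:type="paragraph" w:customStyle="1" w:styleId="Heading1Mod">
    <w:name w:val="Heading 1 but modified"/>
    <w:basedOn w:val="Heading1"/>
  </w:style>
  <w:style w:type="paragraph" w:customStyle="1" w:styleId="SectionTitle">
    <w:name w:val="Section title"/>
    <w:basedOn w:val="Normal"/>
    <w:pPr><w:outlineLvl w:val="1"/></w:pPr>
  </w:style>
</w:styles>"""

root = ET.fromstring(STYLES_XML)
styles = {s.get(w("styleId")): s for s in root.findall("w:style", NS)}

for sid in ("Heading1Mod", "SectionTitle", "Normal"):
    print(sid, heading_level(sid, styles))
# Heading1Mod 1
# SectionTitle 2
# Normal None
```

Here Heading1Mod resolves through its base style's name, and SectionTitle resolves through its own w:outlineLvl despite having a non-conforming name. One simplification to be aware of: a derived style that explicitly sets w:outlineLvl to 9 does not stop the walk in this sketch, although a full property merge would let that value override the base style.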