How I Converted Thousands of Markdown Files to DITA XML Using TypeScript

This story happened in mid-2023. I finally found enough free time to write about it.

Manually converting Markdown to DITA isn’t fun

I used to work for a company that used DITA XML for its documentation. For reasons I cannot discuss due to NDAs, a few of their products used Markdown (MkDocs) for their documentation instead of DITA.

Running both systems in parallel seemed fine at first. However, it eventually caused problems when we started integrating those products with the rest of our portfolio, which used DITA for documentation. We needed to reuse content and keep it synchronized, which meant manually recreating most things. After a few weeks of manually copy-pasting content, my manager decided we should migrate everything to DITA XML.

How should we handle the migration?

At first, my manager suggested using ChatGPT or writing a script to pass everything to the GPT-3.5/4 API. I was not convinced by these options.

First, we were dealing with confidential company information. We did not have company-approved ChatGPT licenses, and getting approval from the InfoSec department would take weeks or months. We did have authorization to use the GPT-3.5 and GPT-4 APIs, but I still did not think that was the best solution. Processing large volumes of XML could cause our API costs to skyrocket, especially with GPT-4. Additionally, I knew my coworkers would not want to use a command-line interface multiple times a day just to do their jobs. There had to be a better way.

After searching online and brainstorming with the newly launched Bing Chat, I came up with an interesting solution: creating a desktop app that transformed Markdown to DITA.

The logic was straightforward:

Load the Markdown files into the application.
Transform the files to HTML.
Manipulate that HTML into structural DITA XML using standard developer tools.

Analyzing the source Markdown

Before writing any code, I needed to check the source Markdown to see what I was dealing with. I had never looked at our Markdown source repositories before; all I knew was that “it’s just Markdown.”

After getting access to the repositories, I analyzed the files in our most active projects (the ones we were currently converting to DITA manually). After a couple of days of analysis, I made some interesting discoveries:

It wasn’t just Markdown: We were using MkDocs with heavy extensions for admonitions, collapsible sections, and content snippets.
Tons of raw HTML: Authors were using embedded HTML tags as workarounds for Markdown’s styling limitations, like raw HTML tables or inline breaks <br> to force layouts.
Massive files: We had articles stretching over 3,000 lines. Fortunately, most of these were just giant HTML tables.
Consistency helped: Fortunately, every Markdown file contained exactly one H1 header. This consistency proved very helpful later on.

With this information, I established a few clear requirements:

Good performance: The tool needed to process hundreds of these large files quickly.
Extensibility: I needed a Markdown library that either supported our exact MkDocs syntax out of the box or allowed me to add custom syntax rules easily.

Selecting the tech stack

I wanted to choose between Go and JavaScript, as those were the languages I knew best. I also needed a framework that made prototyping fast and easy, since it had been a few years since my last major development project.

Initially, I wanted to use Go exclusively, specifically combining GoldMark with Wails. I love Go’s syntax and wanted to squeeze every bit of performance out of the conversion engine. However, that plan hit three roadblocks early on:

Due to our custom markdown, GoldMark struggled to generate HTML correctly, producing artifacts and wrapping standard Markdown inside code blocks for no obvious reason.
I could not figure out how to create custom syntax rules easily, and the documentation was sparse.
Wails was more difficult to configure than I expected. Prototyping would take too long.

As a result, I chose the JavaScript and TypeScript ecosystem instead:

Markdown-It: It is well-documented, supports custom rules easily, and is highly performant.
Cheerio: I needed a way to manipulate the resulting HTML files and transform them into XML. At the time, I did not realize that Node.js and other backend environments lack native DOM manipulation mechanisms. I chose Cheerio because it was the easiest tool available. This decision would cause issues later when building the desktop app, though I eventually found a workaround.
Neutralinojs: I needed a lightweight, cross-platform framework because my coworkers used both macOS and Windows. Neutralinojs was simple and mature enough for this task.
Bulma CSS: I used this purely because I was already familiar with it.

Building the core

Since I had more experience with backend development than frontend, I decided to build the backend logic first: a JavaScript library to transform the Markdown to XML. Afterward, I would build the desktop app.

After reading the Markdown-It documentation, I realized the easiest approach was to create new renderer rules to output XML instead of HTML. Renderer rules would handle about 70% of the work; I just needed to handle the custom MkDocs extensions and raw HTML tags separately.

First, I created an abstract base renderer class to transform generic Markdown elements into generic DITA elements:

import markdownit from "markdown-it";

export abstract class BaseDitaRenderer
{
  protected md = new markdownit({
    html: true,
  });

  constructor() 
  {
    // ...
  }
}

Then I added the following rules to the base renderer:

Blockquote (>): Transformed into <lq>.
Code Block (indented and fenced): Transformed into <codeblock> while escaping the internal content.

// Indented code block
this.md.renderer.rules.code_block = (tokens, idx) => `<codeblock>${this.md.utils.escapeHtml(tokens[idx].content)}</codeblock>`;

Inline Code: Transformed into <codeph> while escaping the content.

this.md.renderer.rules.code_inline = (tokens, idx) => `<codeph>${this.md.utils.escapeHtml(tokens[idx].content)}</codeph>`;

Bold: Transformed into <strong>. The script processed these later to align with our DITA usage.
Italic: Transformed into <cite>. We did not use DITA’s typographic elements, only semantic ones. Visually, <cite> was closest to italics, though we had to manually fix the tagging later to use the correct semantic elements.
Link: Transformed into <xref>.
Image: Transformed into <image> with a break placement.

this.md.renderer.rules.image = (tokens, idx) =>
{
  const token = tokens[idx];
  const escape = this.md.utils.escapeHtml;
  // attrGet returns null when the attribute is missing.
  const srcValue = token.attrGet("src") ?? "";
  // Markdown-It keeps the raw alt text in token.content; the "alt" entry
  // in token.attrs stays empty until the default image rule fills it in.
  const altValue = token.content;

  return `<image placement="break" href="${escape(srcValue)}" alt="${escape(altValue)}"/>`;
};

Strikethrough: Stripped out the <del> and <s> tags entirely while keeping the text inside them.

Extending the core

With the base renderer in place, I created specific renderers for each DITA topic type we supported (Concept, Reference, and Task). Each renderer extended the base class and added rules exclusive to that topic type.

The concept and reference renderers functioned similarly:

They intercepted header tokens. If the engine hit the opening of the file’s single H1 element, it injected the required XML prolog, the DTD doctype declaration, and the opening root DITA tag.
Any subsequent headers (H2, H3, etc.) were provisionally mapped to <section>\n<title>.
When closing the H1 tag, the engine appended the opening body tag, like <refbody> or <conbody>.
Finally, the tool checked for the presence of the XML prolog in the output string. If it was missing, it meant the file did not have an H1 and was structurally invalid, prompting an error.

export class ReferenceRenderer extends BaseDitaRenderer
{
    constructor()
    {
        super();

        this.md.renderer.rules.heading_open = (tokens, idx) => tokens[idx].tag === 'h1' ? `<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE reference PUBLIC "-//OASIS//DTD DITA Reference//EN" "reference.dtd">\n<reference id="topic-id-placeholder" xml:lang="en-us">\n<title>` : `<section>\n<title>`;

        this.md.renderer.rules.heading_close = (tokens, idx) => tokens[idx].tag === 'h1' ? `</title>\n<refbody>\n` : `</title>\n</section>\n`;

    }

    toDitaReference(markdown: string, eventLogger: simpleLogger): string
    {
        try 
        {
            const xml = this.md.render(markdown);

            // Only the H1 rule emits the prolog, so a missing prolog means the file has no H1.
            if (!xml.includes(`<?xml version="1.0" encoding="utf-8"?>`))
                throw new Error("NoHeaders");
        
            // Close the elements that the H1 rules opened.
            return `${xml}\n</refbody>\n</reference>`;
        } catch (error)
        {
            eventLogger.logError(`Unable to convert document to DITA XML. Verify your file is properly formatted and try again.\n${error}`);
            return ``;
        }
    }
}

The task topic renderer used the same logic but added extra rules to handle strict DITA task structures:

Transformed all top-level ordered lists (token level 0) into <steps>.
Transformed the <li> elements directly inside those top-level lists (token level 1) into <step>.

export class TaskRenderer extends BaseDitaRenderer
{
    constructor()
    {
        super();

        this.md.renderer.rules.heading_open = (tokens, idx) => tokens[idx].tag === 'h1' ? `<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE task PUBLIC "-//OASIS//DTD DITA Task//EN" "task.dtd">\n<task id="topic-id-placeholder" xml:lang="en-us">\n<title>` : `<title>`;

        this.md.renderer.rules.heading_close = (tokens, idx) => tokens[idx].tag === 'h1' ? `</title>\n<taskbody>\n` : `</title>\n`;

        this.md.renderer.rules.list_item_open = (tokens, idx) => (tokens[idx].markup === '.' && tokens[idx].level === 1) ? '<step>' : this.md.renderer.renderToken(tokens, idx, {});

        this.md.renderer.rules.list_item_close = (tokens, idx) => (tokens[idx].markup === '.' && tokens[idx].level === 1) ? '</step>\n' : this.md.renderer.renderToken(tokens, idx, {});

        this.md.renderer.rules.ordered_list_open = (tokens, idx) => tokens[idx].level === 0 ? '<steps>\n' : this.md.renderer.renderToken(tokens, idx, {});

        this.md.renderer.rules.ordered_list_close = (tokens, idx) => tokens[idx].level === 0 ? '\n</steps>\n' : this.md.renderer.renderToken(tokens, idx, {});
    }

    toDitaTask(markdown: string, eventLogger: simpleLogger): string
    {
      // Same as reference renderer...
    }
}

Applying the preliminary fixes

The first tests were promising: the renderers successfully generated strings that resembled DITA XML. However, some pieces were missing. To fix this, I added a series of pre-processing adjustments that ran directly on the raw input files before rendering.

Collapsible elements

We used the non-standard MkDocs ??? syntax for collapsible sections. Because Markdown-It did not recognize this syntax, it skipped these blocks entirely. To fix this, the script converted them into regular ## headers before rendering:

element = element.includes("???") ? element.replace(`???`, "##").replaceAll(`"`, "").replaceAll(`**`, ``).trim() : element.trim();

This introduced a side effect: to handle the indentation MkDocs uses for collapsible blocks, the script had to trim every line in the file. While this worked fine for regular text, it could break code blocks that relied on indentation. I added a warning to the logger so writers knew to double-check those files after conversion.
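To make the side effect concrete, here is a minimal, dependency-free sketch of such a line-by-line pass. It is not the original code: the names flattenCollapsibles and PreprocessResult are hypothetical, and the rule for raising the warning (an indented line inside a fenced code block) is an assumption about how such a check could work.

```typescript
// Hypothetical result shape: the rewritten text plus a flag for the logger.
interface PreprocessResult {
  text: string;
  indentationLost: boolean;
}

// Built at runtime so this example never contains a literal code fence.
const FENCE = "`".repeat(3);

function flattenCollapsibles(markdown: string): PreprocessResult {
  let inFence = false;
  let indentationLost = false;

  const lines = markdown.split("\n").map((line) => {
    const trimmed = line.trim();

    // Track fenced code blocks so we know when trimming touches code.
    if (trimmed.startsWith(FENCE)) {
      inFence = !inFence;
      return trimmed;
    }
    if (inFence) {
      // Assumption: any indented line inside a fence deserves a warning.
      if (/^\s/.test(line)) indentationLost = true;
      return trimmed;
    }
    // Turn `??? "Title"` into `## Title`, dropping quotes and bold markers.
    if (trimmed.startsWith("???")) {
      return trimmed.replace("???", "##").replace(/"/g, "").replace(/\*\*/g, "").trim();
    }
    return trimmed;
  });

  return { text: lines.join("\n"), indentationLost };
}
```

A caller could log the warning whenever indentationLost is true. Because collapsible content is itself indented, a fenced block nested inside one will always trip the flag, which errs on the side of asking writers to look.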

Content reuse (conrefs)

MkDocs snippets use the --8<-- "filename.md" syntax to include content from other files. Because our DITA XML repositories used a completely different file structure, the script could not resolve these paths automatically during conversion. Instead of ignoring them, the script flagged them as <draft-comment> elements. This allowed technical writers to locate them easily and wire up the real DITA conrefs manually:

updatedString = element.replace("--8<--", `<draft-comment>Import the contents of `).replace(`.md"`, `.md" here.</draft-comment>\n`)

While not elegant, it made the post-conversion cleanup process manageable.
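As a small worked example of that replacement, the sketch below wraps the same two replace calls in a function. The function name flagSnippet and the file path in the usage note are made up for illustration.

```typescript
// Hypothetical wrapper around the replacement shown above.
function flagSnippet(line: string): string {
  // Leave lines without a snippet marker untouched.
  if (!line.includes("--8<--")) return line;

  return line
    .replace("--8<--", "<draft-comment>Import the contents of")
    .replace('.md"', '.md" here.</draft-comment>\n');
}
```

Passing the line `--8<-- "includes/prereqs.md"` yields `<draft-comment>Import the contents of "includes/prereqs.md" here.</draft-comment>` followed by a newline, which is easy to search for in the converted topic.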

Miscellaneous pre-processing fixes

A few other adjustments ran before the renderer touched the files:

Inline styles like {: style="color: red"} from MkDocs attributes were stripped out entirely because we were not preserving local styling.
The Footnotes: marker used in some files as a section divider was removed since it had no DITA equivalent.
A newline was prepended to ## headers to prevent edge cases where Markdown-It failed to parse them correctly after a collapsible element transformation.
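The three adjustments above can be sketched as a single pass over the lines of a file. This is an illustration only: applyMiscFixes is a hypothetical name, and the sketch assumes the Footnotes: marker sits on a line of its own and that only style attributes need stripping.

```typescript
// Hypothetical helper: applies the three miscellaneous fixes line by line.
function applyMiscFixes(markdown: string): string {
  return markdown
    .split("\n")
    // Assumption: the "Footnotes:" divider always occupies its own line.
    .filter((line) => line.trim() !== "Footnotes:")
    .map((line) => {
      // Remove attribute lists such as {: style="color: red"}.
      const withoutStyle = line.replace(/\{:\s*style="[^"]*"\s*\}/g, "");
      // Put an empty line above every level-2 heading.
      return withoutStyle.startsWith("## ") ? "\n" + withoutStyle : withoutStyle;
    })
    .join("\n");
}
```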

Fixing leftover HTML

After the renderer did its job, the output was mostly valid DITA XML. However, Markdown-It passed our embedded HTML tags through untouched because I left the html: true configuration enabled. The next step was cleaning up these elements using Cheerio.

Menu paths

In our source files, UI navigation paths like Foo > Bar > Baz were written as bold text. In DITA, these must be structured using <menucascade> and <uicontrol> elements. The script used Cheerio to find every <strong> element, check if its text contained > or → symbols, split the text, and wrap each part accordingly:

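  const replacement = $('<menucascade></menucascade>');
  parts.forEach(part => {
    // cleanPart is assumed here to be the segment with surrounding whitespace removed.
    const cleanPart = part.trim();
    replacement.append($(`<uicontrol>${cleanPart}</uicontrol>`));
  });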

If a <strong> element did not contain navigation separators, it became a plain <uicontrol>. This handled almost all of our use cases, as bold text in our documentation was used almost exclusively for UI labels.
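The same decision can be shown without Cheerio by working on the decoded text of a bold element and returning an XML string. This is a simplified sketch rather than the tool's code; boldToUiControl and escapeXml are hypothetical names.

```typescript
// Escape the characters that are unsafe in XML text content.
function escapeXml(text: string): string {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical helper: takes the text of a <strong> element and returns DITA markup.
function boldToUiControl(text: string): string {
  // Split on either separator and drop empty segments.
  const parts = text
    .split(/>|→/)
    .map((part) => part.trim())
    .filter((part) => part !== "");

  const controls = parts.map((part) => `<uicontrol>${escapeXml(part)}</uicontrol>`);

  // Several segments form a menu path; a single one is a plain UI label.
  return controls.length > 1
    ? `<menucascade>${controls.join("")}</menucascade>`
    : controls.join("");
}
```

For the text `File > Save As → PDF` this produces a <menucascade> with three <uicontrol> children, while `Submit` becomes a single <uicontrol>.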

Tables

This was the most complex fix to implement. Markdown-It renders Markdown tables as standard HTML tables. I needed to convert them to the DITA table format, which is based on the OASIS Exchange Table Model:

  <table>
    <tgroup cols="1">
      <colspec colname="col1"/>
      <thead><row><entry><p>Header</p></entry></row></thead>
      <tbody><row><entry><p>Cell</p></entry></row></tbody>
    </tgroup>
  </table>

The basic conversion was straightforward, but merged cells were problematic. Some HTML tables used colspan and rowspan attributes, which DITA’s table model handles differently. I wrote an unmergeCells function to expand merged cells into separate, empty cells before performing the conversion. This did not perfectly preserve the layout, but it kept the data intact so writers could refine it later.
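One way to expand merged cells is to place every cell on a grid and fill the slots it spans with empty strings. The sketch below illustrates that idea on a plain data structure instead of Cheerio nodes. It shares the name unmergeCells with the function mentioned above, but the HtmlCell type and the body are assumptions, not the original implementation.

```typescript
// Hypothetical cell model: text plus the optional HTML span attributes.
interface HtmlCell {
  text: string;
  colspan?: number;
  rowspan?: number;
}

function unmergeCells(rows: HtmlCell[][]): string[][] {
  const grid: string[][] = rows.map(() => []);

  rows.forEach((row, r) => {
    let c = 0;
    for (const cell of row) {
      // Skip slots already reserved by a rowspan from an earlier row.
      while (grid[r][c] !== undefined) c++;

      const colspan = cell.colspan ?? 1;
      const rowspan = cell.rowspan ?? 1;
      for (let dr = 0; dr < rowspan && r + dr < rows.length; dr++) {
        for (let dc = 0; dc < colspan; dc++) {
          // The text stays in the top-left slot; the rest become empty cells.
          grid[r + dr][c + dc] = dr === 0 && dc === 0 ? cell.text : "";
        }
      }
      c += colspan;
    }
  });

  // Pad every row to the same width so each <row> gets the same number of entries.
  const width = Math.max(0, ...grid.map((row) => row.length));
  return grid.map((row) => Array.from({ length: width }, (_, i) => row[i] ?? ""));
}
```

The width of the resulting grid is also the value the cols attribute of <tgroup> would need.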

Notes and admonitions

MkDocs supports admonitions like !!! note and !!! warning. In our source files, these appeared either as HTML <aside> elements or as {: .note} / {: .tip} / {: .warning} markers at the end of a paragraph. Both needed to map to DITA <note> elements:

const noteType = html.includes('.note') ? '' : html.includes('.tip') ? 'tip' : 'warning';

Plain notes mapped to a DITA <note> with no type attribute, while tips and warnings received the corresponding type attribute.
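The mapping can be sketched as a small function that builds the <note> element from the marker. The name toDitaNote is hypothetical, and unlike the one-liner above this variant falls back to a plain note when the marker is unrecognized, which is a design choice of the sketch rather than the tool's behavior.

```typescript
// Hypothetical helper: maps a marker such as "{: .tip}" to a DITA <note>.
function toDitaNote(marker: string, content: string): string {
  const type = marker.includes(".tip")
    ? "tip"
    : marker.includes(".warning")
      ? "warning"
      : ""; // ".note" and anything unrecognized become a plain note

  // Only emit the type attribute when there is a value for it.
  const typeAttribute = type === "" ? "" : ` type="${type}"`;
  return `<note${typeAttribute}>${content}</note>`;
}
```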

Fixing the XML structure

At this stage, the output almost looked like DITA, but it still contained structural errors that could only be resolved by evaluating the XML document as a whole.

Fixing sections

Markdown does not have a concept of wrapping content between two headings; it simply renders individual opening and closing heading tags. For concept and reference topics, the output initially looked like this:

<conbody>
 <section><title>Foo</title></section>
 Lorem ipsum...
 <section><title>Bar</title></section>
 Lorem ipsum...
</conbody>

The content sat between the section blocks instead of inside them. I wrote a fixSections function that grabbed all elements between two consecutive <section> tags and restructured the tree so the content sat inside the correct section. The last section was handled as a special case since it had no trailing elements.

For reference topics, a <refbody> cannot hold paragraphs or lists directly; block content has to sit inside an element such as <section>, even if the file has no H2 headings. If the function detected no sections, it wrapped the entire <refbody> content inside a single section.
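The regrouping can be illustrated on a minimal tree model instead of a Cheerio document. The sketch below is an assumption about the shape of such a function: it reuses the name fixSections, but XmlNode and the wrapWhenNoSections flag are hypothetical.

```typescript
// Hypothetical minimal node model used only for this illustration.
interface XmlNode {
  name: string;
  children: XmlNode[];
  text?: string;
}

function fixSections(bodyChildren: XmlNode[], wrapWhenNoSections = false): XmlNode[] {
  const result: XmlNode[] = [];
  let current: XmlNode | null = null;

  for (const node of bodyChildren) {
    if (node.name === "section") {
      // Start a new section; copy it so the input is not mutated.
      current = { ...node, children: [...node.children] };
      result.push(current);
    } else if (current) {
      // Everything after a section heading moves inside that section.
      current.children.push(node);
    } else {
      // Content before the first section stays where it is.
      result.push(node);
    }
  }

  // Reference topics: wrap everything in one section when none exist.
  if (wrapWhenNoSections && current === null && result.length > 0) {
    return [{ name: "section", children: result }];
  }
  return result;
}
```

Because each section simply collects nodes until the next one starts, the last section needs no special handling in this model. Content that precedes the first section is left alone here, so a reference topic with an introduction would still need extra treatment.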

Fixing task structure

DITA tasks have strict structural rules. After the initial render, the contents of the <taskbody> were often unordered or invalid. A fixTask function resolved this using three steps:

Subtasks: If a task file contained H2 headings inside the body, they became nested <task> elements with their own <taskbody>. This allowed us to handle files that documented multiple related procedures in one place.
Context and result: Content appearing before the <steps> element was wrapped in <context>, and content after it was wrapped in <result>. If there were no steps, everything went into <context>.
Step structure: Every <step> requires a <cmd> element as its first child. The function replaced the first <p> inside each step with <cmd> and wrapped all subsequent elements in <info>. It also fixed an edge case where nested lists or paragraphs accidentally ended up inside the <cmd> element by moving them immediately after it.
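Two of these rules, the context/result split and the cmd/info split, can be sketched on a minimal tree model. This is an illustration under assumptions, not the fixTask implementation: XmlNode, wrapContextAndResult, and fixStep are hypothetical names, and subtasks are left out.

```typescript
// Hypothetical minimal node model used only for this illustration.
interface XmlNode {
  name: string;
  children: XmlNode[];
  text?: string;
}

// Wrap content before <steps> in <context> and content after it in <result>.
function wrapContextAndResult(body: XmlNode[]): XmlNode[] {
  const stepsIndex = body.findIndex((node) => node.name === "steps");
  if (stepsIndex === -1) {
    // No steps: everything becomes context.
    return body.length > 0 ? [{ name: "context", children: body }] : [];
  }

  const before = body.slice(0, stepsIndex);
  const after = body.slice(stepsIndex + 1);
  const out: XmlNode[] = [];
  if (before.length > 0) out.push({ name: "context", children: before });
  out.push(body[stepsIndex]);
  if (after.length > 0) out.push({ name: "result", children: after });
  return out;
}

// Turn the first <p> of a step into <cmd> and wrap the rest in <info>.
function fixStep(step: XmlNode): XmlNode {
  const firstP = step.children.findIndex((child) => child.name === "p");
  if (firstP === -1) return step;

  const cmd: XmlNode = { ...step.children[firstP], name: "cmd" };
  const rest = step.children.filter((_, index) => index !== firstP);
  const children: XmlNode[] =
    rest.length > 0 ? [cmd, { name: "info", children: rest }] : [cmd];
  return { ...step, children };
}
```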

Generating topic IDs

Both fixConceptReference and fixTask handled topic ID generation. The id="topic-id-placeholder" string injected by the renderer was replaced with an ID derived from the H1 title. The script stripped every character other than letters, digits, and spaces, replaced the spaces with underscores, and converted the string to lowercase. This generated human-readable IDs consistent with our existing DITA files.
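A minimal sketch of that ID derivation, assuming ASCII titles; toTopicId and applyTopicId are hypothetical names, and the exact regular expressions are a guess at the rules described above.

```typescript
// Hypothetical helper: derives a topic ID from the H1 title.
function toTopicId(title: string): string {
  return title
    .trim()
    .replace(/[^A-Za-z0-9 ]/g, "") // keep letters, digits, and spaces
    .replace(/ +/g, "_")           // collapse runs of spaces into one underscore
    .toLowerCase();
}

// Swap the placeholder injected by the renderer for the generated ID.
function applyTopicId(xml: string, title: string): string {
  return xml.replace('id="topic-id-placeholder"', `id="${toTopicId(title)}"`);
}
```

With the title `Getting Started: Install & Configure`, the generated ID is `getting_started_install_configure`. The sketch does not guard against titles that start with a digit, which would not produce a valid XML ID.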

The final cleanup pass

A final fixPendingTasks function ran on every topic type to resolve remaining issues:

Any remaining <strong> tags that were not menu paths became <uicontrol>.
Raw HTML <a> tags became <xref> elements.
Internal links like href="#some-anchor" were prefixed with the topic ID so they resolved correctly in DITA (for example, href="#my_topic/some-anchor").
External links automatically received format="html" and scope="external" attributes.
<li> elements without a <p> child had one wrapped around their content to satisfy DITA requirements.
Anchor tags used only to mark page positions (using an id attribute instead of href) were hoisted: their ID moved to the parent element, and the anchor tag was removed.
A fixAnchorIdForTitles function handled the MkDocs convention ## My Section {#custom-id}. It extracted the custom ID, applied it to the <title> element, and removed the {#custom-id} marker from the title text.
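The link rules in the list above can be sketched as a function that builds the attributes of an <xref> from an href. The name toXrefAttributes is hypothetical, and treating only http and https URLs as external is an assumption of this sketch.

```typescript
// Hypothetical helper: returns the attribute string for an <xref> element.
function toXrefAttributes(href: string, topicId: string): string {
  // Same-topic anchors need the topic ID in front of the element ID.
  if (href.startsWith("#")) {
    return `href="#${topicId}/${href.slice(1)}"`;
  }
  // Assumption: anything starting with http:// or https:// is external.
  if (/^https?:\/\//i.test(href)) {
    return `href="${href}" format="html" scope="external"`;
  }
  // Other targets, such as relative file paths, are left for manual review.
  return `href="${href}"`;
}
```

For example, `#some-anchor` in a topic with the ID `my_topic` becomes `href="#my_topic/some-anchor"`. The sketch does not escape special characters in the URL, so a real implementation would still need to handle ampersands in query strings.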

Wrapping up

That forms the core of the tool: a Markdown-It parser extended with custom rules for each DITA topic type, pre-processing scripts for MkDocs syntax, and post-processing passes to ensure the XML validates against DITA DTDs.

Once the core library worked, I built a web application and a desktop application so the writers could convert files quickly. Because we phased out the MkDocs websites gradually, writers updated the MkDocs site first, then used the tool to port their changes over to the DITA repositories.

Did it work? For the most part, yes. The output still required human review to correct some semantic tagging, adjust the DITAMAP files, and fix files with complex raw HTML. However, it eliminated the bulk of the mechanical work. Converting a file used to take anywhere from 30 minutes to a few hours; with the tool, it took a few seconds. The writers could focus on parts of the documentation that required editorial judgment instead of copy-pasting XML tags all afternoon.

If I were starting this today, I would probably make a few different choices, which I will cover in the next post. Packaging this as a library was the easy part. Wrapping it in a desktop app introduced a whole new set of headaches, and a certain DOM manipulation library that seemed like a perfectly reasonable choice at the time had other plans.
