Create a Publication Chain with Pandoc and Docker

A former university lecturer, Phillip has extensive experience in all aspects of software development, with particular expertise in Java.

Today, we will be taking a closer look at how professionals can enlist the help of Pandoc to create a robust and easy-to-implement publication chain. Pandoc is an extremely simple yet powerful tool that allows users to convert documents into various formats, depending on their requirements.

It can greatly simplify documentation and publication, or even open up a few new automation possibilities. Best of all, Pandoc relies on Git-friendly Markdown, which means you can also implement a version-control system for your documentation without any additional hassle.

Speaking of hassle, we will be relying on a Docker image to install Pandoc and LaTeX with a simple pull. Installing software can be time-consuming, and setting up a working software environment from scratch when starting a new project is hardly productive. Docker helps mitigate these problems by allowing users to set everything up in minutes, regardless of the platform.

Additionally, it is not uncommon for employers to require that you provide your own computer hardware. This is often referred to as Bring Your Own Device (BYOD), and let’s not forget that the COVID-19 pandemic has made working from home much more prevalent. Without solutions like Docker, it would be hard to support applications running on diverse hardware and operating systems such as Windows, macOS, and Linux.

Let’s get started by taking a closer look at Docker containers and images before proceeding to Pandoc.

Docker to the Rescue

Using Docker containers can eliminate the need to install multiple software applications on a new machine. Prebuilt Docker images are available on Docker Hub for a large number of applications. Leading cloud providers such as AWS, Azure, and Google all provide container registries. There are many other third-party registries including GitLab and Red Hat OpenShift.

It is likely that an image will be available for most (if not all) applications. This means it is not necessary to install most applications and their dependencies. The application can simply be run in a Docker container. This conveniently eliminates issues associated with team members running applications on different hardware and different operating systems. The same image can be used to run containers on any system that can run them, and Docker specialists can make this process extremely fast and efficient.

Pandoc Use Case: Documentation

Documentation may be required in several different formats. The same document may need to be available in different formats such as HTML for presentations and PDF for handouts. Converting between file formats can be tedious and require a lot of time. A good solution is to have a publication chain with a single source of truth. All documents should be written in the same language. It should be a text-based language as it is easier to version and store in Git repositories.

Markdown is a good choice for creating a single source of truth. Software that can convert Markdown documents into a variety of other formats is readily available and tends to be reliable.

Markdown can be easily converted into a range of different formats for various uses.

Pandoc

Pandoc is a software package that can convert documents into different formats. In particular, it can convert Markdown into HTML, PDF, and other widely used formats. The conversion process can be customized using templates and metadata in the Markdown source.

By default, Pandoc requires a LaTeX installation to create PDF files. Installing Pandoc and LaTeX is quite time-consuming. Fortunately, there is a Docker image called pandoc/latex, which eliminates the need to install anything other than Docker.

Docker Commands

A Docker image that contains the necessary software needs to be found or created. It is advisable to pull the image ahead of time so that it is stored locally, as the download can take some time.

docker pull pandoc/latex

Running a command in a Docker container requires a wrapper that starts the container and executes the command inside it. A good solution is to write a shell function on a macOS or UNIX/Linux system. The function can be put in a login script or in a separate file such as $HOME/.functions. It is also possible to write a script or an alias with the same functionality.

function pandoc {
   echo pandoc "$@"
   docker run -it --rm -v "$PWD":/work -w /work pandoc/latex pandoc "$@"
}

This function does the following:

  • It prints the command to the screen.
  • It runs a Docker container from the pandoc/latex image.
  • The -it option creates an interactive terminal session and makes the output of the command visible.
  • The --rm option deletes the container once the command has terminated.
  • The -v $PWD:/work option mounts the current directory on the host to the directory /work in the container.
  • The -w /work makes the /work directory in the container the working directory.
  • The final pandoc "$@" runs the pandoc command in the container, passing along all of the command-line arguments given to the function.

A shell or a script needs to load the function into memory.

. $HOME/.functions

The function is now a command in its own right and behaves the same way as if the pandoc binary was installed locally. This approach can be used for any command available in a Docker image.
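
As a rough sketch of that generalization, the wrapper can take the image name as its first argument. The name drun and the DRY_RUN switch are hypothetical and not part of Docker or of the setup above. The terminal check is an added assumption so that the helper also works when its output is piped.

```shell
# Hypothetical generic wrapper: run COMMAND from IMAGE against the current directory.
# Usage: drun IMAGE COMMAND [ARGS...]
# The body runs in a subshell "( ... )" so its variables stay local.
drun() (
    image=$1
    shift
    # Ask for a terminal (-t) only when stdout is one; pipes and cron jobs get -i alone.
    if [ -t 1 ]; then tty_flags=-it; else tty_flags=-i; fi
    # With DRY_RUN set to a non-empty value, the docker command is printed, not run.
    ${DRY_RUN:+echo} docker run "$tty_flags" --rm -v "$PWD:/work" -w /work "$image" "$@"
)
```

With this in place, drun pandoc/latex pandoc --version would behave like the pandoc function above, and prefixing the call with DRY_RUN=1 only prints the docker command that would be executed.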

Markdown to HTML

To convert Markdown to HTML, it is better to use a template and metadata in the Markdown.

Markdown Metadata

The Markdown source can have a header section that can have arbitrary metadata. The metadata takes the form of key-value pairs. The values can be substituted in the HTML template.

---
title: Document title
links:
  prev: index
  next: page002
...

The header starts with a line containing only three dashes (---) and is terminated by a line containing only three dots (...). Each key is a single word followed by a colon and its value. Keys can be nested. The example defines keys called title, links.prev, and links.next.

This approach uses a separate file for each page. In the example, the previous page is index.md, the current page is page001.md, and the next page is page002.md. In practice, more meaningful file names would be used so that it is easier to reorder and insert pages.
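
When many pages are needed, the header can be generated rather than typed by hand. The following is a minimal sketch, assuming plain titles and page names. The function name page_header is hypothetical and not part of Pandoc.

```shell
# Hypothetical helper: print a Pandoc metadata header for one page.
# Usage: page_header TITLE [PREV] [NEXT]
# Assumes plain values; text containing ":" or "#" would need YAML quoting.
page_header() {
    printf '%s\n' '---'
    printf 'title: %s\n' "$1"
    # Emit the nested "links" key only when at least one neighbor is given.
    if [ -n "${2:-}" ] || [ -n "${3:-}" ]; then
        printf 'links:\n'
        if [ -n "${2:-}" ]; then printf '  prev: %s\n' "$2"; fi
        if [ -n "${3:-}" ]; then printf '  next: %s\n' "$3"; fi
    fi
    printf '%s\n' '...'
}
```

For example, page_header "Document title" index page002 > Project/page001.md would start a new page with the same header as the example.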

HTML Template

An HTML template is simply an HTML file. Metadata substitution and simple control structures can be added between dollar symbols. Here is a simple example of an HTML template for Pandoc:

<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8" />
        <meta name="viewport" content="width=device-width, initial-scale=1.0" />
        <title>$title$</title>
        <link href="../css/style.css" type="text/css" rel="stylesheet" />
    </head>
    <body>
        <header>
            <h1>$title$</h1>
        </header>
        $body$
        <footer>
            $if(links.prev)$
            <a href="$links.prev$.html" class="previous">&laquo; Previous</a>
            $endif$
            $if(links.next)$
            <a href="$links.next$.html" class="next">Next &raquo;</a>
            $endif$
        </footer>
    </body>
</html>

In this template, $title$ is replaced with the title from the metadata and $body$ is replaced with the Markdown text converted to HTML. The conditional statements generate each link only if the corresponding metadata is defined in the Markdown header.

Generating HTML from Markdown

Pandoc just needs to be told what the input and output files are called plus any template files. The default input file format is Markdown. It can infer the output file format from the specified output file extension.

The commands to generate the outputs can be put in a script or a makefile.

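# Convert every Markdown file in Project/ to HTML/Project/*.html
dir=Project
mkdir -p "HTML/${dir}"
for input_file in "${dir}"/*.md
do
    output_file="HTML/${input_file%.md}.html"
    # Rebuild only if the output is missing or older than the input
    if [[ ! -e "${output_file}" || "${input_file}" -nt "${output_file}" ]]
    then
        pandoc --data-dir . --template presentations.html -t html \
                -o "${output_file}" "${input_file}"
    fi
done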

For every .md file in the Project directory, the script creates a corresponding .html file in the HTML/Project directory if the output file doesn’t exist or is older than the input file.
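
The two decisions the loop makes (where the output goes, and whether it is stale) can be isolated as small functions. This is only a sketch. The names html_path_for and needs_rebuild are hypothetical, and the HTML/ prefix is taken from the script above.

```shell
# Hypothetical helpers mirroring the two decisions the loop makes.

# Map a Markdown path to its HTML output path.
# "${1%.md}" removes a trailing ".md" from the first argument.
html_path_for() {
    printf 'HTML/%s.html\n' "${1%.md}"
}

# Succeed when the output ($2) is missing or older than the input ($1).
# "-nt" (newer than) works in bash and most other shells, though POSIX does not require it.
needs_rebuild() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}
```

A loop body could then read: if needs_rebuild "$input_file" "$(html_path_for "$input_file")", run pandoc. A makefile expresses the same rule through its dependency tracking.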

Generating Beamer PDF from Markdown

Beamer is a LaTeX package for producing presentations. The output is a PDF slideshow.

The same Markdown source files can be used to generate the beamer PDF.

pandoc -t beamer -o PDF/Project.pdf -V theme:Boadilla -V colortheme:whale Project/index.md Project/page000.md

The command details are:

  • The -t beamer option tells Pandoc to use LaTeX and Beamer to generate the PDF.
  • The -o option specifies the output file.
  • The -V options select the beamer theme and color theme.
  • The command ends with a list of Markdown files that will be concatenated in the order given.

All of the processing is performed inside a Docker container.
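
Listing every page by hand becomes tedious as a deck grows. One possible sketch collects the files with a glob instead. The name build_slides and the DRY_RUN switch are hypothetical, and the sketch assumes that alphabetical file order is the intended slide order.

```shell
# Hypothetical helper: build one slide deck from every Markdown file in DIR.
# Files are passed in glob order (alphabetical in the usual locales),
# so index.md comes before page001.md.
# The body runs in a subshell "( ... )" so its variables stay local.
build_slides() (
    dir=$1
    set --
    for file in "$dir"/*.md; do
        [ -e "$file" ] || continue   # the pattern stays unexpanded when nothing matches
        set -- "$@" "$file"
    done
    if [ "$#" -eq 0 ]; then
        echo "no Markdown files found in $dir" >&2
        exit 1
    fi
    # With DRY_RUN set to a non-empty value, the command is printed instead of run.
    ${DRY_RUN:+echo} pandoc -t beamer -o "PDF/$dir.pdf" \
        -V theme:Boadilla -V colortheme:whale "$@"
)
```

Running DRY_RUN=1 build_slides Project first is a cheap way to check the file order before the PDF is generated.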

Markdown to PDF

Converting a Markdown document to a PDF document is also quite simple. By default, Pandoc generates the PDF by first converting the Markdown to LaTeX. Metadata can be added to the Markdown header to customize the output, such as setting the paper size and the margin size.

---
title: Title of document
papersize: a4
geometry:
- margin=20mm
...

Generating PDF from Markdown

The command to convert the Markdown to PDF is simple:

pandoc -s Project/outline.md -o PDF/ProjectOutline.pdf

The -s option creates a standalone document.
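
If the Markdown source should stay free of layout settings, the same values can instead be passed as template variables with -V on the command line. The following is a sketch under that assumption. The name build_pdf and the DRY_RUN switch are hypothetical.

```shell
# Hypothetical helper: build a PDF, passing the layout settings as -V variables
# instead of putting them in the Markdown header.
# Usage: build_pdf INPUT.md OUTPUT.pdf
build_pdf() {
    # With DRY_RUN set to a non-empty value, the command is printed instead of run.
    ${DRY_RUN:+echo} pandoc -s "$1" -o "$2" -V papersize:a4 -V geometry:margin=20mm
}
```

For example, build_pdf Project/outline.md PDF/ProjectOutline.pdf would produce the same document as the command above with A4 paper and 20 mm margins.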

Conclusion

It is no longer necessary to spend many days installing software. Simply running a command in a Docker container eliminates the need for installation. Many applications have suitable Docker images on Docker Hub. If the software needs updating, simply pull the latest Docker image.

Setting up a new computer is just a matter of installing Docker, pulling the necessary images, and creating a few scripts.

It is no longer necessary to create documents in different formats. A single format such as Markdown can be used for all documents. Tools such as Pandoc can then generate documents from Markdown into a large number of different formats.

As Markdown is plain text, a full version history is available once it is checked into a Git repository. Git hosting services such as GitHub and GitLab also render Markdown automatically and allow people to comment on changes without having to use messy change histories in the document file itself.

Understanding the basics

What is Docker and what is it used for?

Docker implements lightweight virtualization that allows applications and their dependencies to be run in isolated environments called containers.

How does Docker work?

Docker creates containers from a subset of available CPUs, memory, and other resources and executes a single application that is stored along with its dependencies on a cut-down Linux file system called an image.

What is Pandoc?

Pandoc is open-source software that can convert documents, typically written in Markdown, into a wide range of other document formats including HTML and PDF.

How do you use Pandoc?

Pandoc is a command-line tool that takes as parameters the input file and its document type, the required output file and its document type, and optionally a template file defining the output file format.

What is Pandoc Markdown?

Pandoc Markdown is a slightly modified version of Markdown, a lightweight plain-text markup language often used for formatting readme files and online messaging posts.