Recursive document transformations with Pandoc and Clojure

Today, we’re going to use Pandoc and Clojure to produce a nice EDN file with all the links from a Markdown file.

Motivation: general tools

I strive to learn general tools. I want to be able to mix and combine the tools I already have to solve new problems. To achieve that, I’m willing to sacrifice some clarity and some control.

Pandoc and Clojure are general tools. Pandoc supports a wide range of document formats. Clojure is a great tool for general purpose programming.

Here’s a list of more specific ways of solving adjacent problems:

  1. Use a markdown library API to transform a markdown document
  2. Use jsoup to transform HTML directly
  3. Just write hiccup instead of a text format like Markdown or Org-mode so that you can work with plain data

Specific tools are often easier to get started with than general tools. Doing something specific is also a great way to learn. By minimizing indirection in what you do, you minimize your chance of getting lost.

That’s not what we’re going to do today! Today, we’re aiming for general.

Let’s get to it.

Terminology

First, let’s define our language.

| Term | Definition | More details |
|---|---|---|
| walk | A way to transform recursive data structures | https://clojuredocs.org/clojure.walk |
| Pandoc | Document converter | https://pandoc.org/ |
| Pandoc filter | A program that can transform Pandoc JSON | https://pandoc.org/filters.html |
| Babashka | Clojure runtime for scripting | https://babashka.org/ |

Note: we could use plain Clojure instead of Babashka. But Babashka is a good fit here because of its fast startup time.

Introduction to Pandoc

Pandoc provides a common document format abstraction, and transformation from/to a wide range of formats. Let’s look at an example.

Converting markdown to HTML with Pandoc

Given doc.md:

# Pandoc converts

> /pan/
>
> involving all of a (specified) group or region

"Pan-doc" like "pan-Atlantic", get it?

It supports lots of formats.

We can call Pandoc:

#!/usr/bin/env bash

pandoc doc.md -o doc.html

To produce doc.html:

<h1 id="pandoc-converts">Pandoc converts</h1>
<blockquote>
<p>/pan/</p>
<p>involving all of a (specified) group or region</p>
</blockquote>
<p>“Pan-doc” like “pan-Atlantic”, get it?</p>
<p>It supports lots of formats.</p>

Converting markdown to JSON with Pandoc

Given link.md:

See [teod.eu][teod]

[teod]: https://teod.eu

We can call Pandoc:

#!/usr/bin/env bash

# pandoc link.md -t json | jq > link.json # pretty
pandoc link.md -o link.json # compact

To produce link.json:

{"pandoc-api-version":[1,22,2],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"See"},{"t":"Space"},{"t":"Link","c":[["",[],[]],[{"t":"Str","c":"teod.eu"}],["https://teod.eu",""]]}]}]}

Man, that’s a long line. Here:

{
  "pandoc-api-version": [1, 22, 2],
  "meta": {},
  "blocks": [
    {
      "t": "Para",
      "c": [
        {"t": "Str", "c": "See"},
        {"t": "Space"},
        {
          "t": "Link",
          "c": [
            ["", [], []],
            [{"t": "Str", "c": "teod.eu"}],
            ["https://teod.eu", ""]
          ]
        }
      ]
    }
  ]
}

See? It’s just data 🙂
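Since it’s just data, we can poke at it from Clojure. A minimal sketch, assuming the JSON above has been read into a map with keyword keys (here `doc` is typed in by hand, and the var names are made up for illustration):

```clojure
(def doc
  {:pandoc-api-version [1 22 2]
   :meta {}
   :blocks [{:t "Para"
             :c [{:t "Str" :c "See"}
                 {:t "Space"}
                 {:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

;; Path: first block -> its content -> third inline (the Link)
(def link (get-in doc [:blocks 0 :c 2]))

;; A Link's :c holds three things: attributes, link text, and [url title]
(def link-url (get-in link [:c 2 0]))
(def link-text (get-in link [:c 1 0 :c]))

(println link-url)  ; prints https://teod.eu
(println link-text) ; prints teod.eu
```

Every element is a map with a type under `:t` and, usually, content under `:c`. That regularity is what makes a generic walk possible.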

Transforming JSON with Babashka

Recap:

  1. Pandoc can convert anything* to JSON
  2. We can transform JSON
  3. Pandoc can convert from JSON to anything*

So, by leveraging Pandoc, we can create arbitrary transformations on anything*!

(*anything: https://pandoc.org/index.html)

So, how do we want to do this? A Pandoc filter takes JSON on stdin and produces JSON on stdout.

We can use jet and bb to do this:

echo '{"args": [1, 2]}' | \
    jet --from json --keywordize | \
    bb '(assoc *input* :sum (reduce + (:args *input*)))' | \
    jet --to json
{"args":[1,2],"sum":3}

Nice!
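jet is not strictly required for this round trip. Babashka bundles the cheshire JSON library, so the same idea can be sketched as a single script. This is a sketch under assumptions: `add-sum` and `run-filter!` are names made up for illustration, and the script is not part of the setup used in the rest of this post.

```clojure
(require '[cheshire.core :as json])

(defn add-sum
  "Same transformation as the one-liner above: add the sum of :args."
  [m]
  (assoc m :sum (reduce + (:args m))))

(defn run-filter!
  "Read one JSON value from stdin, transform it, print JSON to stdout."
  [transform]
  (-> (slurp *in*)
      (json/parse-string true) ; true => keywordize keys
      transform
      json/generate-string
      println))

;; In a script file (say, a hypothetical sum.clj), finish with:
;;
;;   (run-filter! add-sum)
;;
;; and run it with:
;;
;;   echo '{"args": [1, 2]}' | bb sum.clj
```

Keeping the transformation (`add-sum`) separate from the I/O (`run-filter!`) means the transformation can be tried at the REPL on plain maps.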

Implementing rickroll.clj

I wanted to work on the Pandoc filter incrementally, writing up each step as I went. That didn’t happen. I got in the zone and wrote everything at once, so you’ll get after-the-fact commentary instead. The source code renderer on play.teod.eu currently (2022-07-15) demands very short source code lines, so you can view rickroll.clj as a raw file or on GitHub if you’d like. Otherwise, keep on scrolling.

Here comes a full listing for the Babashka script. We continue below the code listing!

Jump below rickroll.clj 👇
(ns rickroll
  (:require
   [clojure.walk :refer [prewalk]] ; recursive transformation
   [clojure.edn]  ; read pandoc JSON as EDN
   ))

(comment
  ;; a nice pattern for recursive transformation in Clojure:
  ;;
  ;;   1. walk
  ;;   2. change element if (predicate?)
  ;;   3. otherwise, leave it be.
  ;;
  ;; Example:
  (prewalk (fn [el]
             (if (string? el) ; touch strings
               (keyword el)   ; do this to strings
               el))           ; otherwise let it be
           {:big ["nested" "structure"]}) ; big thing
  )

;; We must identify pandoc JSON links
(defn pandoc-link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

;; What's the simplest link transform we could do?
;; Removing links is easy.
;;
;; For the interested reader, GeePaw Hill provides some
;; great commentary on why you should take small steps.
;;
;;   https://www.geepawhill.org/2021/09/29/many-more-much-smaller-steps-first-sketch/

;; But I digress. Back to our totally serious project.
;; Let's remove some links.
(defn remove-links [pandoc]
  (prewalk (fn [el]
             (if (pandoc-link? el)
               {} ; empty element
               el))
           pandoc))

;; To try, set `transform` to `remove-links` below :)

;; Finally, here's a rickroll pandoc filter:
(defn rickroll [pandoc]

  (let [;; Inline example of the data we're working with
        _link-example {:t "Link",
                       :c [["" [] []]
                           [{:t "Str", :c "teod.eu"}]
                           ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}
        ;; which made the assoc-in easy to write:
        rick-link (fn [el]
                    (assoc-in el [:c 2 0]
                              "https://www.youtube.com/watch?v=dQw4w9WgXcQ"))]
    ;; now, the (if predicate change no-change) pattern again:
    (prewalk (fn [el]
               (if (pandoc-link? el)
                 (rick-link el)
                 el))
             pandoc)))

;; I first tried running it all at once:
;;
;;   pandoc doc.md --filter "bash -c \"jet --from json --keywordize | bb rickroll.clj | jet --to json\"" -o doc-no-links.md
;;
;; But pandoc doesn't support filters with command line arguments.
;; So we need a script to wrap it up.
;; More on the wrapper later.

;; Aaand I hard-code some test data for development.
(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(let [transform rickroll ; choose rickroll or remove-links
      input (try
              (clojure.edn/read *in*)
              (catch RuntimeException _ nil))] ; if *in* looks right, use that.
  (if (map? input)
    (transform input)
    ;; otherwise use test data
    (transform example)))

;; How to run without pandoc:
;;
;;   cat link.json \
;;       | jet --from json --keywordize \
;;       | bb rickroll.clj \
;;       | jet --to json
;;
Jump above rickroll.clj 👆
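One caveat with `remove-links` above: an empty map is not a Pandoc element, so Pandoc may refuse to read the result back. A gentler variant keeps the link text and drops only the target. The sketch below assumes Pandoc’s `Span` element, whose content is a pair of attributes and a list of inlines. `transform-when` and `unlink` are names invented here; they are not part of rickroll.clj.

```clojure
(require '[clojure.walk :refer [prewalk]])

(defn transform-when
  "Walk `data`; apply `f` to every element where `pred` holds."
  [pred f data]
  (prewalk (fn [el] (if (pred el) (f el) el)) data))

(defn pandoc-link? [el]
  (and (map? el) (= "Link" (:t el))))

(defn unlink
  "Turn a Link into a Span with the same attributes and text, minus the URL."
  [{[attr inlines _target] :c}]
  {:t "Span" :c [attr inlines]})

(def para
  {:t "Para"
   :c [{:t "Str" :c "See"}
       {:t "Space"}
       {:t "Link"
        :c [["" [] []]
            [{:t "Str" :c "teod.eu"}]
            ["https://teod.eu" ""]]}]})

(transform-when pandoc-link? unlink para)
;; => {:t "Para"
;;     :c [{:t "Str" :c "See"}
;;         {:t "Space"}
;;         {:t "Span" :c [["" [] []] [{:t "Str" :c "teod.eu"}]]}]}
```

`transform-when` is just the walk, predicate, change pattern from the listing pulled out into a function, so `rickroll` could be written the same way with a different `f`.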

Bash wrapper: rickroll.sh

Pandoc’s --filter requires a single script. So here’s rickroll.sh:

#!/usr/bin/env bash

jet --from json --keywordize \
    | bb rickroll.clj \
    | jet --to json

You can use rickroll.sh like this:

./rickroll.sh < link.json
{
  "pandoc-api-version": [1, 22, 2],
  "meta": {},
  "blocks": [
    {
      "t": "Para",
      "c": [
        {"t": "Str", "c": "See"},
        {"t": "Space"},
        {
          "t": "Link",
          "c": [
            ["", [], []],
            [{"t": "Str", "c": "teod.eu"}],
            ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", ""]
          ]
        }
      ]
    }
  ]
}

Look at all those closing parens! 😄 Hiccup is quite compact when you think about it.

[:p "See " [:a {:href "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
            "teod.eu"]]
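To make the comparison concrete, here is a rough sketch that converts the Pandoc paragraph above to hiccup. It handles only the four element types in our example (`Para`, `Str`, `Space`, `Link`) and passes everything else through untouched. `->hiccup` is a hypothetical helper, not something Pandoc or hiccup provides.

```clojure
(defn ->hiccup
  "Convert a tiny subset of Pandoc elements to hiccup."
  [{:keys [t c] :as el}]
  (case t
    "Str"   c
    "Space" " "
    "Para"  (into [:p] (map ->hiccup) c)
    "Link"  (let [[_attr inlines [url _title]] c]
              (into [:a {:href url}] (map ->hiccup) inlines))
    el)) ; unknown element: leave as-is

(->hiccup
 {:t "Para"
  :c [{:t "Str" :c "See"}
      {:t "Space"}
      {:t "Link"
       :c [["" [] []]
           [{:t "Str" :c "teod.eu"}]
           ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}]})
;; => [:p "See" " " [:a {:href "https://www.youtube.com/watch?v=dQw4w9WgXcQ"} "teod.eu"]]
```

The result keeps `"See"` and `" "` as separate strings, where the hand-written hiccup above joins them into `"See "`. Both should render the same.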

Let’s rickroll ourselves!

#!/usr/bin/env bash

# We could rickroll from the org file:
#
# pandoc --standalone \
#     --from=org+smart \
#     --shift-heading-level-by=1 \
#     --toc \
#     index.org \
#     --filter rickroll.sh \
#     -o rickroll-ourselves.html

# But that's too easy, so let's use the HTML file instead.
pandoc  \
    --standalone \
    -V title:"" \
    index.html \
    --filter rickroll.sh \
    -o rickroll-ourselves.html

Have a look at the result: rickroll-ourselves.html

Oh, that EDN file with links. I almost forgot.

extract_links.clj is rickroll.clj with some edits:

(ns extract-links
  (:require [clojure.walk :refer [prewalk]]
            [clojure.edn]))

(defn link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

(defn link-href [el]
  (when (link? el)
    (get-in el [:c 2 0])))

(defn links [pandoc]
  (let [links-found (atom [])]
    (prewalk (fn [el]
               (if (link? el)
                 (do (swap! links-found conj
                            {:href (link-href el)})
                     el)
                 el))
             pandoc)
    @links-found))

(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(let [input (try
              (clojure.edn/read *in*)
              (catch RuntimeException _ ()))]
  (if (map? input)
    (links input)
    (links example)))

And here’s how to run it:

#!/usr/bin/env bash

pandoc index.org --to json \
    | jet --from json --keywordize \
    | bb extract_links.clj \
    | bb '(clojure.pprint/pprint *input*)'

This produces:

[{:href ".."}
 {:href "https://pandoc.org/MANUAL.html#general-options"}
 {:href "https://clojuredocs.org/clojure.walk"}
 {:href "https://pandoc.org/filters.html"}
 {:href "https://babashka.org/"}
 {:href "rickroll.clj"}
 {:href "https://github.com/teodorlu/play.teod.eu/blob/master/document-transform-pandoc-clojure/rickroll.clj"}
 {:href "https://github.com/weavejester/hiccup"}
 {:href "rick.html"}]
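The atom in `links` works, but the same extraction can be written without mutable state. The sketch below uses `tree-seq` to flatten the document into a sequence of nodes and then filters for links. It also grabs the link text, which extract_links.clj does not do. `links-with-text` and `inline-text` are made-up names, and `inline-text` only understands `Str` and `Space`.

```clojure
(require '[clojure.string :as str])

(defn link? [el]
  (and (map? el) (= "Link" (:t el))))

(defn inline-text
  "Concatenate the plain text of Str and Space inlines; ignore the rest."
  [inlines]
  (str/join (keep (fn [{:keys [t c]}]
                    (case t
                      "Str" c
                      "Space" " "
                      nil))
                  inlines)))

(defn links-with-text [pandoc]
  (->> (tree-seq coll? seq pandoc) ; every nested map, vector and map entry
       (filter link?)
       (mapv (fn [{[_attr inlines [url _title]] :c}]
               {:href url :text (inline-text inlines)}))))

(links-with-text
 {:pandoc-api-version [1 22 2], :meta {},
  :blocks [{:t "Para",
            :c [{:t "Str", :c "See"}
                {:t "Space"}
                {:t "Link",
                 :c [["" [] []]
                     [{:t "Str", :c "teod.eu"}]
                     ["https://teod.eu" ""]]}]}]})
;; => [{:href "https://teod.eu", :text "teod.eu"}]
```

`tree-seq` visits nodes depth-first, so the links come out in document order, the same as with the `prewalk` version.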

That’s all for now.

🙌