Recursive document transformations with Pandoc and Clojure

Today, we’re going to use Pandoc and Clojure to produce a nice EDN file with all the links from a Markdown file.

Motivation: general tools

I strive to learn to use general tools. I want to be able to mix and combine the tools I already have and apply them to new problems. To achieve that, I’m willing to sacrifice some clarity and some control.

Pandoc and Clojure are general tools. Pandoc supports a wide range of document formats. Clojure is a great tool for general purpose programming.

Here’s a list of more specific ways of solving adjacent problems:

  1. Use a markdown library API to transform a markdown document
  2. Use jsoup to transform HTML directly
  3. Just write hiccup instead of a text format like Markdown or Org-mode so that you can work with plain data

Specific tools are often easier to get started with than general tools. Doing something specific is also a great way to learn. By minimizing indirection in what you do, you minimize your chance of getting lost.

That’s not what we’re going to do today! Today, we’re aiming for general.

Let’s get to it.

Terminology

First, let’s define our language.

Term           Definition                                     More details
walk           A way to transform recursive data structures   https://clojuredocs.org/clojure.walk
Pandoc         Document converter                             https://pandoc.org/
Pandoc filter  A program that can transform Pandoc JSON       https://pandoc.org/filters.html
Babashka       Clojure runtime for scripting                  https://babashka.org/

Note: we could use plain Clojure instead of Babashka. But Babashka is a good fit here because of its fast startup time.
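To make "walk" concrete, here is a minimal sketch using prewalk from clojure.walk. The data and the function name inc-numbers are made up for illustration; only prewalk itself comes from the library.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical nested data, only for illustration.
(def nested {:a 1, :b [2 3 {:c 4}]})

(defn inc-numbers
  "Walk `data` and increment every number; leave everything else alone."
  [data]
  (walk/prewalk (fn [el]
                  (if (number? el)
                    (inc el)
                    el))
                data))

(inc-numbers nested)
;; => {:a 2, :b [3 4 {:c 5}]}
```

The function passed to prewalk sees every element in the structure, maps and vectors included, so the shape of the data never has to be spelled out.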

Introduction to Pandoc

Pandoc provides a common document format abstraction, and transformation from/to a wide range of formats. Let’s look at an example.

Converting markdown to HTML with Pandoc

Given doc.md:

# Pandoc converts

> /pan/
>
> involving all of a (specified) group or region

"Pan-doc" like "pan-Atlantic", get it?

It supports lots of formats.

We can call pandoc:

#!/usr/bin/env bash

pandoc doc.md -o doc.html

To produce doc.html:

<h1 id="pandoc-converts">Pandoc converts</h1>
<blockquote>
<p>/pan/</p>
<p>involving all of a (specified) group or region</p>
</blockquote>
<p>“Pan-doc” like “pan-Atlantic”, get it?</p>
<p>It supports lots of formats.</p>

Converting markdown to JSON with Pandoc

Given link.md:

See [teod.eu][teod]

[teod]: https://teod.eu

We can call pandoc:

#!/usr/bin/env bash

# pandoc link.md -t json | jq . > link.json # pretty
pandoc link.md -o link.json # compact

To produce link.json:

{"pandoc-api-version":[1,22,2],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"See"},{"t":"Space"},{"t":"Link","c":[["",[],[]],[{"t":"Str","c":"teod.eu"}],["https://teod.eu",""]]}]}]}

Man, that’s a long line. Here:

{
  "pandoc-api-version": [1, 22, 2],
  "meta": {},
  "blocks": [
    {
      "t": "Para",
      "c": [
        {"t": "Str", "c": "See"},
        {"t": "Space"},
        {
          "t": "Link",
          "c": [
            ["", [], []],
            [{"t": "Str", "c": "teod.eu"}],
            ["https://teod.eu", ""]
          ]
        }
      ]
    }
  ]
}

See? It’s just data 🙂
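Since it is just data, plain Clojure functions are enough to navigate it. The sketch below assumes the JSON has been converted to Clojure data with keyword keys; the map is copied from link.json above, and the var names doc and link are made up for illustration.

```clojure
;; link.json as Clojure data with keyword keys.
(def doc
  {:pandoc-api-version [1 22 2]
   :meta {}
   :blocks [{:t "Para"
             :c [{:t "Str" :c "See"}
                 {:t "Space"}
                 {:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

;; Path: first block -> its content -> third inline element (the Link).
(def link (get-in doc [:blocks 0 :c 2]))

(:t link)
;; => "Link"

;; In this example, a Link's :c holds [attributes, link text, [url title]].
(get-in link [:c 2 0])
;; => "https://teod.eu"

(get-in link [:c 1 0 :c])
;; => "teod.eu"
```

Every element carries its type under :t and its content under :c, which is what makes a generic walk over the whole document practical.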

Babashka as a pandoc filter

First, recap.

  1. Pandoc can convert anything* to JSON
  2. We can transform JSON
  3. Pandoc can convert from JSON to anything*

So, by leveraging Pandoc, we can create arbitrary transformations on anything*!

(*anything: https://pandoc.org/index.html)

So, how do we want to do this? A Pandoc filter takes JSON on stdin and produces JSON on stdout.

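We can use jet and bb to do this. The --keywordize flag turns JSON keys into Clojure keywords, so that (:args *input*) finds the numbers:

echo '{"args": [1, 2]}' | \
    jet --from json --keywordize | \
    bb '(assoc *input* :sum (reduce + (:args *input*)))' | \
    jet --to json
{"args":[1,2],"sum":3}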

Nice!
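The jet pipeline spreads the JSON handling over three processes. As an alternative sketch, the same stdin-to-stdout contract can live in a single Babashka script. This assumes the cheshire.core JSON library that Babashka bundles; the file name identity_filter.clj and the functions transform and run-filter are made up for illustration, and this is not how the rest of this article does it.

```clojure
#!/usr/bin/env bb
;; identity_filter.clj (hypothetical name): JSON in on stdin, JSON out on stdout.
(require '[cheshire.core :as json])

(defn transform
  "Placeholder transformation: returns the document unchanged."
  [pandoc]
  pandoc)

(defn run-filter
  "Parse a JSON string, transform the data, and return a JSON string."
  [json-str]
  (-> json-str
      (json/parse-string true) ; true = keywordize keys
      transform
      json/generate-string))

;; Only touch stdin and stdout when run as a script,
;; not when the file is loaded from a REPL.
(when (= *file* (System/getProperty "babashka.file"))
  (println (run-filter (slurp *in*))))
```

Running something like `bb identity_filter.clj < link.json` should print the document back unchanged. Swapping the body of transform is then the only change needed to do real work.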

Implementing rickroll.clj

I wanted to work on the pandoc filter incrementally, writing up each step. That didn’t happen. I got in the zone, and wrote everything in one go. So you’ll get after-the-fact commentary instead. The source code renderer on play.teod.eu currently (2022-07-15) demands very short source code lines. So you can view rickroll.clj as a raw file or on GitHub if you’d like. Otherwise, keep on scrolling.

Here comes a full listing for the babashka script. We continue below the code listing!

(ns rickroll
  (:require
   [clojure.walk :refer [prewalk]] ; recursive transformation
   [clojure.edn]  ; read pandoc JSON as EDN
   ))

(comment
  ;; a nice pattern for recursive transformation in Clojure:
  ;;
  ;;   1. walk
  ;;   2. change element if (predicate?)
  ;;   3. otherwise, leave it be.
  ;;
  ;; Example:
  (prewalk (fn [el]
             (if (string? el) ; touch strings
               (keyword el)   ; do this to strings
               el))           ; otherwise let it be
           {:big ["nested" "structure"]}) ; big thing in here
  )

;; Here's the predicate we're going to use later:
(defn pandoc-link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

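;; I choose to pull "this is an empty element" out of the walk logic.
;; A bare {} has no :t, so Pandoc would likely reject it when reading
;; the JSON back. An empty string element keeps the output valid.
(defn pandoc-empty
  "Empty Pandoc element"
  []
  {:t "Str", :c ""})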

;; What's the simplest link transform we could do?
;; Removing links is easy.
;; Let's start there.
;;
;; For the interested reader, Geepaw Hill provides some
;; great commentary on why you should take small steps.
;;
;;   https://www.geepawhill.org/2021/09/29/many-more-much-smaller-steps-first-sketch/
;;
;; But I digress. Back to our totally serious project.

(defn remove-links [pandoc]
  (prewalk (fn [el]
             (if (pandoc-link? el)
               (pandoc-empty)
               el))
           pandoc))

;; To try, set `transform` to `remove-links` below :)

;; Finally, here's a rickroll pandoc filter:
(defn rickroll [pandoc]
  (let [;; I like to see an example of the data
        ;; structure I'm working with
        _link-example {:t "Link",
                       :c [["" [] []]
                           [{:t "Str", :c "teod.eu"}]
                           ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}
        ;; which made the assoc-in okay to write:
        rick-link (fn [el]
                    (assoc-in el [:c 2 0]
                              "https://www.youtube.com/watch?v=dQw4w9WgXcQ"))]
    ;; now, same (if predicate change no-change) pattern
    (prewalk (fn [el]
               (if (pandoc-link? el)
                 (rick-link el)
                 el))
             pandoc)))

;; I first tried running it all at once:
;;
;;   pandoc -i doc.md --filter "bash -c \"jet --from json --keywordize | bb rickroll.clj | jet --to json\"" -o doc-no-links.md
;;
;; But it turns out, pandoc doesn't support this.
;; A filter must be a single script.
;; Filters can't take arguments.
;; So we need a wrapper.
;; More on the wrapper later.

;; I hard-code some example data so that "just running" gives me feedback:
(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

;; ... but if *in* looks right, use that.
(def input
  (try
    (clojure.edn/read *in*)
    (catch RuntimeException _
        ())))

(let [transform rickroll] ; choose rickroll or remove-links here
  (if (map? input)
    (transform input)
    (transform example)))

;; How to run without pandoc:
;;
;;   cat link.json \
;;       | jet --from json --keywordize \
;;       | bb rickroll.clj \
;;       | jet --to json --keywordize
;;

Bash wrapper: rickroll.sh

Pandoc’s --filter requires a single script. So here’s rickroll.sh:

#!/usr/bin/env bash

jet --from json --keywordize \
    | bb rickroll.clj \
    | jet --to json --keywordize

You can use rickroll.sh like this:

./rickroll.sh < link.json
{
  "pandoc-api-version": [1, 22, 2],
  "meta": {},
  "blocks": [
    {
      "t": "Para",
      "c": [
        {"t": "Str", "c": "See"},
        {"t": "Space"},
        {
          "t": "Link",
          "c": [
            ["", [], []],
            [{"t": "Str", "c": "teod.eu"}],
            ["https://www.youtube.com/watch?v=dQw4w9WgXcQ", ""]
          ]
        }
      ]
    }
  ]
}

Look at all those closing brackets! 😄 Hiccup is quite compact when you think about it.

[:p "See " [:a {:href "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
            "teod.eu"]]
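For comparison, here is a rough sketch of how the Pandoc data could be turned into hiccup-shaped vectors. It only handles the element types that appear in this example (Para, Str, Space, Link); the function names inline->hiccup and block->hiccup and the :data-unhandled fallback are made up for illustration.

```clojure
(defn inline->hiccup
  "Convert one Pandoc inline element to a hiccup-style value."
  [{:keys [t c]}]
  (case t
    "Str"   c
    "Space" " "
    "Link"  (let [[_attrs inlines [url _title]] c]
              (into [:a {:href url}] (map inline->hiccup inlines)))
    ;; Anything else is marked instead of silently dropped.
    [:span {:data-unhandled t}]))

(defn block->hiccup
  "Convert one Pandoc block element to a hiccup-style value."
  [{:keys [t c]}]
  (case t
    "Para" (into [:p] (map inline->hiccup c))
    [:div {:data-unhandled t}]))

;; The rickrolled paragraph from the output above.
(def para
  {:t "Para"
   :c [{:t "Str" :c "See"}
       {:t "Space"}
       {:t "Link"
        :c [["" [] []]
            [{:t "Str" :c "teod.eu"}]
            ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}]})

(block->hiccup para)
;; => [:p "See" " " [:a {:href "https://www.youtube.com/watch?v=dQw4w9WgXcQ"} "teod.eu"]]
```

The result keeps "See" and " " as separate strings rather than merging them into "See ", which renders the same way.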

Let’s rickroll ourselves!

#!/usr/bin/env bash

# We could rickroll from the org file:
#
# pandoc --standalone \
#     --from=org+smart \
#     --shift-heading-level-by=1 \
#     --toc \
#     -i index.org \
#     --filter rickroll.sh \
#     -o rickroll-ourselves.html

# But that's too easy, so let's use the HTML file instead.

pandoc  \
    --standalone \
    -V title:"" \
    -i index.html \
    --filter rickroll.sh \
    -o rickroll-ourselves.html

Please head over to rickroll-ourselves.html!

Oh, that EDN file with links. I almost forgot.

extract_links.clj is rickroll.clj with some edits:

(ns extract-links
  (:require [clojure.walk :refer [prewalk]]
            [clojure.edn]))

(defn link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

(defn link-href [el]
  (when (link? el)
    (get-in el [:c 2 0])))

;; Keeping the old `rickroll` function for comparison.
(defn rickroll [pandoc]
  (let [;; I just copied in an example of what I was going to generate
        _pandoc-link-example {:t "Link",
                              :c [["" [] []]
                                  [{:t "Str", :c "teod.eu"}]
                                  ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}
        ;; which made the assoc-in okay to write
        link-to-rick (fn [el]
                       (assoc-in el [:c 2 0] "https://www.youtube.com/watch?v=dQw4w9WgXcQ"))]
    ;; now, just follow the walk pattern from above.
    (prewalk (fn [el]
               (if (link? el)
                 (link-to-rick el)
                 el))
             pandoc)))

(defn links [pandoc]
  (let [links-found (atom [])]
    (prewalk (fn [el]
               (if (link? el)
                 (do (swap! links-found conj
                            {:href (link-href el)})
                     el)
                 el))
             pandoc)
    @links-found))

(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para", :c [{:t "Str", :c "See"}
                            {:t "Space"}
                            {:t "Link",
                             :c [["" [] []]
                                 [{:t "Str", :c "teod.eu"}]
                                 ["https://teod.eu" ""]]}]}]})


(def input
  (try
    (clojure.edn/read *in*)
    (catch RuntimeException _
        ())))

(if (map? input)
  (links input)
  (links example))
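The links function above collects results in an atom while walking. As an alternative sketch, not what extract_links.clj does, the same list can be built without mutable state by flattening the document with tree-seq and filtering. The name links-seq and the sample data are made up for illustration.

```clojure
(defn link?
  "Is this a Pandoc link element?"
  [el]
  (= "Link" (:t el)))

(defn links-seq
  "Return a vector of {:href url} maps, one per Link in `pandoc`."
  [pandoc]
  (->> (tree-seq coll? seq pandoc) ; every nested value, depth first
       (filter link?)
       (mapv (fn [el] {:href (get-in el [:c 2 0])}))))

;; Same shape as the example document used in this article.
(def sample
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para"
             :c [{:t "Str" :c "See"}
                 {:t "Space"}
                 {:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(links-seq sample)
;; => [{:href "https://teod.eu"}]
```

tree-seq visits parents before children, so the links come out in document order, just like with prewalk.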

And here’s how to run it:

#!/usr/bin/env bash

pandoc index.org --to json \
    | jet --from json --keywordize \
    | bb extract_links.clj \
    | bb '(clojure.pprint/pprint *input*)'

This produces:

[{:href "./.."}
 {:href "https://pandoc.org/MANUAL.html#general-options"}
 {:href "https://clojuredocs.org/clojure.walk"}
 {:href "https://pandoc.org/filters.html"}
 {:href "https://babashka.org/"}
 {:href "rickroll.clj"}
 {:href
  "https://github.com/teodorlu/play.teod.eu/blob/master/document-transform-pandoc-clojure/rickroll.clj"}
 {:href "https://github.com/weavejester/hiccup"}
 {:href "rick.html"}]

That’s all for now.

🙌