Today, we’re going to use Pandoc and Clojure to produce a nice EDN file with all the links from a Markdown file.
I strive to learn to use general tools. I want to be able to mix and combine the tools I already have to solve new problems. To achieve that, I’m willing to sacrifice some clarity and some control.
Pandoc and Clojure are general tools. Pandoc supports a wide range of document formats. Clojure is a great tool for general purpose programming.
Here’s a list of more specific ways of solving adjacent problems:
Specific tools are often easier to get started with than general tools. Doing something specific is also a great way to learn. By minimizing indirection in what you do, you minimize your chance of getting lost.
That’s not what we’re going to do today! Today, we’re aiming for general.
Let’s get to it.
First, let’s define our language.
Term | Definition | More details |
---|---|---|
walk | A way to transform recursive data structures | https://clojuredocs.org/clojure.walk |
Pandoc | Document converter | https://pandoc.org/ |
Pandoc filter | A program that can transform Pandoc JSON | https://pandoc.org/filters.html |
Babashka | Clojure runtime for scripting | https://babashka.org/ |
Note: we could use plain Clojure instead of Babashka. But Babashka is a good fit here because of its fast startup time.
Pandoc provides a common document format abstraction, and transformation from/to a wide range of formats. Let’s look at an example.
Given doc.md:
# Pandoc converts
> /pan/
>
> involving all of a (specified) group or region
"Pan-doc" like "pan-Atlantic", get it?
It supports lots of formats.
We can call Pandoc:
#!/usr/bin/env bash
pandoc doc.md -o doc.html
To produce doc.html:
<h1 id="pandoc-converts">Pandoc converts</h1>
<blockquote>
<p>/pan/</p>
<p>involving all of a (specified) group or region</p>
</blockquote>
<p>“Pan-doc” like “pan-Atlantic”, get it?</p>
<p>It supports lots of formats.</p>
Given link.md:
See [teod.eu][teod]

[teod]: https://teod.eu
We can call Pandoc:
#!/usr/bin/env bash
# pandoc link.md -t json | jq > link.json # pretty
pandoc link.md -o link.json # compact
To produce link.json:
{"pandoc-api-version":[1,22,2],"meta":{},"blocks":[{"t":"Para","c":[{"t":"Str","c":"See"},{"t":"Space"},{"t":"Link","c":[["",[],[]],[{"t":"Str","c":"teod.eu"}],["https://teod.eu",""]]}]}]}
Man, that’s a long line. Here:
{
"pandoc-api-version": [1, 22, 2],
"meta": {},
"blocks": [
{
"t": "Para",
"c": [
{"t": "Str", "c": "See"},
{"t": "Space"},
{
"t": "Link",
"c": [
["", [], []],
[{"t": "Str", "c": "teod.eu"}],
["https://teod.eu", ""]
]
}
]
}
]
}
See? It’s just data 🙂
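Since it's just data, plain Clojure functions can reach into it. Here's a minimal sketch, assuming the JSON above has been read into Clojure with keyword keys. The name `doc` is made up for this example.

```clojure
;; The same structure as link.json, written as Clojure data with keyword keys.
(def doc
  {:pandoc-api-version [1 22 2]
   :meta {}
   :blocks [{:t "Para"
             :c [{:t "Str" :c "See"}
                 {:t "Space"}
                 {:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

;; First block, third inline element (the Link), third content slot, first item:
(def link-target (get-in doc [:blocks 0 :c 2 :c 2 0]))

;; The visible link text lives in the second content slot:
(def link-text (get-in doc [:blocks 0 :c 2 :c 1 0 :c]))
```

Here `link-target` is `"https://teod.eu"` and `link-text` is `"teod.eu"`. A Link's content is a positional vector rather than a map, so we have to know which slot holds what.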
Recap:
So, by leveraging Pandoc, we can create arbitrary transformations on anything*!
(*anything: https://pandoc.org/index.html)
So, how do we want to do this? A Pandoc filter takes JSON on stdin and produces JSON on stdout.
We can use jet and bb to do this:
echo '{"args": [1, 2]}' | \
jet --from json --keywordize | \
bb '(assoc *input* :sum (reduce + (:args *input*)))' | \
jet --to json
{"args":[1,2],"sum":3}
Nice!
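To see what happens in the middle of that pipeline, here's a rough sketch of the bb step on its own. It assumes jet has already turned the JSON into the EDN text `{:args [1 2]}`. The names `input` and `output` are just for illustration.

```clojure
(require '[clojure.edn :as edn])

;; What bb receives after `jet --from json --keywordize`: EDN text.
(def input (edn/read-string "{:args [1 2]}"))

;; The same transformation as in the one-liner above.
(def output (assoc input :sum (reduce + (:args input))))
```

`output` is `{:args [1 2], :sum 3}`, which the final jet call turns back into JSON.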
rickroll.clj
I wanted to work on the Pandoc filter incrementally, writing each step. That didn’t happen. I got in the zone, and wrote everything. So you’ll get after-the-fact commentary instead. The source code renderer on play.teod.eu currently (2022-07-15) demands very short source code lines. So you can view rickroll.clj as a raw file or on GitHub if you’d like. Otherwise, keep on scrolling.
Here comes a full listing for the Babashka script. We continue below the code listing!
(ns rickroll
  (:require
   [clojure.walk :refer [prewalk]] ; recursive transformation
   [clojure.edn]                   ; read pandoc JSON as EDN
   ))

(comment
  ;; a nice pattern for recursive transformation in Clojure:
  ;;
  ;;  1. walk
  ;;  2. change element if (predicate?)
  ;;  3. otherwise, leave it be.
  ;;
  ;; Example:
  (prewalk (fn [el]
             (if (string? el)             ; touch strings
               (keyword el)               ; do this to strings
               el))                       ; otherwise let it be
           {:big ["nested" "structure"]}) ; big thing
  )

;; We must identify pandoc JSON links
(defn pandoc-link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

;; What's the simplest link transform we could do?
;; Removing links is easy.
;;
;; For the interested reader, Geepaw Hill provides some
;; great commentary on why you should take small steps.
;;
;;   https://www.geepawhill.org/2021/09/29/many-more-much-smaller-steps-first-sketch/

;; But I digress. Back to our totally serious project.
;; Let's remove some links.
(defn remove-links [pandoc]
  (prewalk (fn [el]
             (if (pandoc-link? el)
               {} ; empty element
               el))
           pandoc))

;; To try, set `transform` to `remove-links` below :)

;; Finally, here's a rickroll pandoc filter:
(defn rickroll [pandoc]
  (let [;; Inline example of the data we're working with
        _link-example {:t "Link",
                       :c [["" [] []]
                           [{:t "Str", :c "teod.eu"}]
                           ["https://www.youtube.com/watch?v=dQw4w9WgXcQ" ""]]}
        ;; which made the assoc-in easy to write:
        rick-link (fn [el]
                    (assoc-in el [:c 2 0]
                              "https://www.youtube.com/watch?v=dQw4w9WgXcQ"))]
    ;; now, the (if predicate change no-change) pattern again:
    (prewalk (fn [el]
               (if (pandoc-link? el)
                 (rick-link el)
                 el))
             pandoc)))

;; I first tried running it all at once:
;;
;;   pandoc -i doc.md --filter "bash -c \"jet --from json --keywordize | bb rickroll.clj | jet --to json\"" -o doc-no-links.md
;;
;; But pandoc doesn't support filters with command line arguments.
;; So we need a script to wrap it up.
;; More on the wrapper later.
;; Aaand I hard-code some test data for development.
(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(let [transform rickroll ; choose rickroll or remove-links
      input (try
              (clojure.edn/read *in*)
              (catch RuntimeException _ nil))]
  ;; if *in* looks right, use that.
  (if (map? input)
    (transform input)
    ;; otherwise use test data
    (transform example)))

;; How to run without pandoc:
;;
;; cat link.json \
;; | jet --from json --keywordize \
;; | bb rickroll.clj \
;; | jet --to json --keywordize
;;
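The `assoc-in` path `[:c 2 0]` is easiest to trust after poking at it in a REPL. Here's a small standalone sketch of the same walk-and-replace pattern, run on a paragraph with two links. The names `retarget-links` and `two-links` and the `https://example.com/` URL are made up for this example and are not part of rickroll.clj.

```clojure
(require '[clojure.walk :refer [prewalk]])

(defn link? [el]
  (= "Link" (:t el)))

;; Hypothetical helper: point every link at new-url, leaving the rest alone.
(defn retarget-links [pandoc new-url]
  (prewalk (fn [el]
             (if (link? el)
               (assoc-in el [:c 2 0] new-url)
               el))
           pandoc))

;; Made-up test data: one paragraph, two links, the second with a title.
(def two-links
  {:blocks [{:t "Para"
             :c [{:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "a"}]
                      ["https://teod.eu" ""]]}
                 {:t "Space"}
                 {:t "Link"
                  :c [["" [] []]
                      [{:t "Str" :c "b"}]
                      ["https://pandoc.org/" "Pandoc"]]}]}]})

(def result (retarget-links two-links "https://example.com/"))

(def first-target  (get-in result [:blocks 0 :c 0 :c 2]))
(def second-target (get-in result [:blocks 0 :c 2 :c 2]))
```

Both targets now start with the new URL, and the second link keeps its `"Pandoc"` title, because `assoc-in` only touches slot 0 of the target pair.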
rickroll.sh
Pandoc’s --filter requires a single script. So here’s rickroll.sh:
#!/usr/bin/env bash
jet --from json --keywordize \
| bb rickroll.clj \
| jet --to json --keywordize
You can use rickroll.sh like this:
./rickroll.sh < link.json
{
"pandoc-api-version": [1, 22, 2],
"meta": {},
"blocks": [
{
"t": "Para",
"c": [
{"t": "Str", "c": "See"},
{"t": "Space"},
{
"t": "Link",
"c": [
["", [], []],
[{"t": "Str", "c": "teod.eu"}],
["https://www.youtube.com/watch?v=dQw4w9WgXcQ", ""]
]
}
]
}
]
}
Look at all those closing parens! 😄 Hiccup is quite compact when you think about it.
[:p "See " [:a {:href "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
            "teod.eu"]]
#!/usr/bin/env bash
# We could rickroll from the org file:
#
# pandoc --standalone \
# --from=org+smart \
# --shift-heading-level-by=1 \
# --toc \
# -i index.org \
# --filter rickroll.sh \
# -o rickroll-ourselves.html
# But that's too easy, so let's use the HTML file instead.
pandoc \
--standalone \
-V title:"" \
-i index.html \
--filter rickroll.sh \
-o rickroll-ourselves.html
Have a look at the result: rickroll-ourselves.html
extract_links.clj is rickroll.clj with some edits:
(ns extract-links
  (:require [clojure.walk :refer [prewalk]]
            [clojure.edn]))

(defn link?
  "Is this a valid Pandoc link?"
  [pandoc]
  (= "Link" (:t pandoc)))

(defn link-href [el]
  (when (link? el)
    (get-in el [:c 2 0])))

(defn links [pandoc]
  (let [links-found (atom [])]
    (prewalk (fn [el]
               (if (link? el)
                 (do (swap! links-found conj
                            {:href (link-href el)})
                     el)
                 el))
             pandoc)
    @links-found))

(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(let [input (try
              (clojure.edn/read *in*)
              (catch RuntimeException _ nil))]
  (if (map? input)
    (links input)
    (links example)))
And here’s how to run it:
#!/usr/bin/env bash
pandoc index.org --to json \
| jet --from json --keywordize \
| bb extract_links.clj \
| bb '(clojure.pprint/pprint *input*)'
This produces:
[{:href ".."}
 {:href "https://pandoc.org/MANUAL.html#general-options"}
 {:href "https://clojuredocs.org/clojure.walk"}
 {:href "https://pandoc.org/filters.html"}
 {:href "https://babashka.org/"}
 {:href "rickroll.clj"}
 {:href "https://github.com/teodorlu/play.teod.eu/blob/master/document-transform-pandoc-clojure/rickroll.clj"}
 {:href "https://github.com/weavejester/hiccup"}
 {:href "rick.html"}]
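The walk-plus-atom approach above works, but collecting things is a slightly odd fit for a function meant for transforming. As an alternative sketch (this is not what extract_links.clj does), `tree-seq` from clojure.core can flatten the document into a sequence of nodes, which we then filter. The name `links-via-tree-seq` and the sample data are made up for this example.

```clojure
(defn link? [el]
  (= "Link" (:t el)))

;; Alternative sketch: flatten every nested collection depth-first,
;; keep the Link nodes, and pull out each target URL.
(defn links-via-tree-seq [pandoc]
  (->> (tree-seq coll? seq pandoc)
       (filter link?)
       (mapv (fn [el] {:href (get-in el [:c 2 0])}))))

(def example
  {:pandoc-api-version [1 22 2], :meta {},
   :blocks [{:t "Para",
             :c [{:t "Str", :c "See"}
                 {:t "Space"}
                 {:t "Link",
                  :c [["" [] []]
                      [{:t "Str", :c "teod.eu"}]
                      ["https://teod.eu" ""]]}]}]})

(def found (links-via-tree-seq example))
```

For the example data, `found` is `[{:href "https://teod.eu"}]`. There's no mutable state, and the links come out in document order, since `tree-seq` walks depth-first.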
That’s all for now.
🙌