In this post I describe my experience of programmatically generating a large (5000+ pages) PDF document locally in under 30 seconds!
Backstory
I had to figure out a way to export the content from a Ghost CMS-based website. However, for reasons beyond the scope of this article, the entire site was to be exported into a single print-friendly PDF document. The problem statement was screaming for some spicy automation.
Introduction
Ghost CMS has a great feature that lets you export the entire content of the site as a glorious JSON file! That’s a feature we don’t get to see often in the CMSes on the market lately. The next step is to pull each post out of the array of posts and somehow generate a PDF from it.
In many ways, this sounds like a typical use case for typesetting. Generating HTML from a JSON data source is a solved problem; there are troves of tools out there that one can use (among which Hugo is one of my favorites, btw). But generating a PDF directly, without generating HTML first, is not as straightforward as it sounds. You’d also need a template, obviously. The general recommendation is probably to use LaTeX.
Full disclosure: I love LaTeX. I’ve used it quite a lot in the not-so-distant past and have been impressed by the sheer possibilities. However, LaTeX isn’t exactly beginner-friendly, and it needs a lot of packages just to get started. For my use case, I couldn’t figure out a way to programmatically extract data from a valid JSON file and conditionally render a PDF from it, at least not without relying on programming tools outside the LaTeX ecosystem. Additionally, in my experience, LaTeX compilation is slow, and its errors are ones I have to google my way through before I actually understand them.
Enter Typst
Typst is a new markup-based typesetting system. It was recommended to me by a friend whom one may consider a LaTeX nerd. If I had to explain it, Typst is like Hugo or Handlebars: you first define a template, and the output is rendered from the template and the data. For me, it was the same mental model, except the output is a PDF, PNG or SVG, with no HTML anywhere in the process. And the syntax is a lot like Markdown.
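To make the template-plus-data idea concrete, here is a minimal sketch with made-up inline data standing in for a parsed file. The names card and people are hypothetical, chosen only for illustration:

```typst
// A template is just a function that returns content.
#let card(person) = [
  == #person.name
  Project: #person.project
]

// Made-up inline data; in practice this would come from a file.
#let people = (
  (name: "Ada", project: "typesetting"),
  (name: "Lin", project: "templates"),
)

// Render one card per entry; the results are joined into the document.
#for person in people {
  card(person)
}
```

Saved as a .typ file, this should compile on its own, since it does not depend on any external data.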
Typst fortunately supports multiple data formats for loading data.
This is how easy it is to load Ghost’s JSON data into Typst:
// Template for a single post
#let p(post) = [
  == #post.title
  #v(2pt)
  *published on* = #post.published_at
  #post.plaintext
  #pagebreak()
]
// Function that renders the whole export: posts first, then pages
#let ghostToPDF(db) = [
  = Posts
  #for (post) in (db.db.at(0).data.posts) {
    if (post.visibility == "public" and post.type == "post" and post.status != "draft") {
      p(post)
    }
  }
  = Pages
  #for (post) in (db.db.at(0).data.posts) {
    if (post.visibility == "public" and post.type == "page" and post.status != "draft") {
      p(post)
    }
  }
]
// Call the function with the parsed JSON data
#ghostToPDF(json("ghost.json"))
Well, that’s all the code you need!
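The two loops differ only in the post type, so the filter could be pulled into a single helper. This is a rough sketch, assuming the same ghost.json export sits next to the Typst file. The name published is a hypothetical helper of my own, not part of Ghost or Typst:

```typst
// Assumes ghost.json (the Ghost export) is in the same directory.
#let posts = json("ghost.json").db.at(0).data.posts

// Hypothetical helper: public, non-draft entries of the given type.
#let published(kind) = posts.filter(post => (
  post.visibility == "public" and
  post.type == kind and
  post.status != "draft"
))

// Quick sanity check before compiling thousands of pages.
Posts: #published("post").len() \
Pages: #published("page").len()
```

Printing the counts first is a cheap way to confirm the filter matches what you expect before rendering the full document.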
Place the JSON export in the same directory as the Typst file (e.g. main.typ) and run the Typst CLI command: typst compile main.typ
And voilà! I just compiled the entire site, separated into posts and pages and totaling more than 5000 pages, into a single PDF in under 30 seconds! The resulting file size was around 15 MB.
Here’s a sample output with a title page and a blog post:
Bonus
The exported JSON contained references to the site URL written as __GHOST_URL__, in several thousand places across the file. VS Code (or rather, VSCodium) suffered quite a bit while indexing the entire JSON, so I didn’t bother trying any find-and-replace methods.
Fortunately, Typst has ways to replace text. All I had to do was add these two lines before the rest of the code:
#show "__GHOST_URL__": "https://yoursite.in"
#show "GHOST_URL/": "https://yoursite.in"
This replaces every occurrence as expected (at compile time), which is pretty cool IMO.
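Show rules also accept a regular expression as the selector, so both variants could probably be handled in one rule. A small sketch, not what I actually ran: the sample string is made up, the URL is the same placeholder as above, and the rule always emits a trailing slash, which is my own assumption about the desired output:

```typst
// Match the placeholder with or without a trailing slash.
// Placeholder URL; swap in the real site address.
#show regex("__GHOST_URL__/?"): "https://yoursite.in/"

// Made-up sample text standing in for a post body.
#let sample = "Read more at __GHOST_URL__/about"
#sample
```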
Conclusion
The best part of using Typst was the familiar feeling of using Go-like syntax for functional logic. Using loops and conditional statements felt natural and straightforward.
My next experiment is to use Typst in a Lambda function and generate PDFs on the fly (without having to spin up an entire headless Chrome that converts HTML to PDF). It also compiles to WASM, which makes things even more interesting. Although I have no experience with WASM, I think it is worth exploring as a viable alternative for generating PDFs on the frontend without relying on the client’s browser capabilities, which can be hit or miss.
I’ll probably end up writing my thesis in Typst (which I keep procrastinating hard on, btw). If you haven’t already, I highly recommend checking out Typst. The source code for the project can be found here. Until next time!