Generating YAML from CouchDB docs

Continuing the theme of the last two posts, the old Posterous blog content is now available as JSON inside CouchDB. I’m now going to combine a few pieces that are unique to CouchDB to build up the components that will become blog fodder for Octopress.

Octopress, which is based on Jekyll, uses a mixture of YAML and Markdown for pages and posts. We’ll use a show function, which passes a JSON document to a JavaScript transformation function, to build this up.

First, the Posterous format includes a whole lot of stuff we won’t need. For Octopress we want only title, display_date, tags, and body_full:

{
   "_id": "21298063",
   "_rev": "1-8ba11a44954e3171de4f4fa9d68c3210",
   "is_owned_by_current_user": true,
   "slug": "setting-up-a-shared-photo-library-in-picasa3",
   "tags": [
   ],
   "title": "setting up a shared photo library in Picasa3 on MacOS",
   "display_date": "2009/01/09 13:49:53 -0800",
   "body_full": "setting up a shared photo library for several..."
}

Assuming Octopress can parse the date format, this will be easy. Let’s map these to title, date, categories, and content to build our YAML like this:

---
layout: post
title: "setting up a shared photo library in Picasa3 on MacOS"
date: 2009/01/09 13:49:53 -0800
comments: true
categories: []
---
... body_full goes here ...
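
The mapping itself is just a rename of four fields. As a rough sketch that runs outside CouchDB (the helper name pickPostFields is my own invention, not part of any CouchDB or Octopress API):

```javascript
// Hypothetical helper: keep only the fields Octopress needs and rename
// them to the names used in the YAML front matter.
function pickPostFields(doc) {
    return {
        title: doc.title,
        date: doc.display_date,
        categories: doc.tags,
        content: doc.body_full
    };
}

// Sample document, trimmed from the Posterous export shown earlier.
var sample = {
    _id: '21298063',
    slug: 'setting-up-a-shared-photo-library-in-picasa3',
    tags: [],
    title: 'setting up a shared photo library in Picasa3 on MacOS',
    display_date: '2009/01/09 13:49:53 -0800',
    body_full: 'setting up a shared photo library for several...'
};

var picked = pickPostFields(sample);
console.log(Object.keys(picked).join(', '));  // title, date, categories, content
```

Everything else in the document (_id, _rev, is_owned_by_current_user and so on) is simply ignored.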

The show function is pretty straightforward:

function(doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + doc.display_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n',
            headers: {
                'Content-Type': 'application/text'
            }
        }
    }
}

Let’s walk through that line by line.

  1. Declare the function, and therefore our scope.
  2. Check that all the fields we require are present. This avoids generating an expensive exception in the JavaScript engine if later on I try to access data that isn’t actually present.
  3. Return an object comprising the body content and the headers.
  4. The body is built up from JSON properties of the supplied doc object.
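
One caveat with plain string concatenation: an array appended to a string is joined with commas and loses its brackets, so an empty tags array renders as nothing at all, and a title containing a double quote would break the quoted YAML string. A possible variant, offered as an assumption rather than what the function above does, is to run those two values through JSON.stringify, whose output for strings and arrays of strings should also parse as YAML. The name yamlShow and the sample document are hypothetical:

```javascript
// Sketch of a variant show function. JSON.stringify is standard JavaScript;
// using it for YAML values is an assumption that relies on YAML accepting
// JSON-style double-quoted strings and [..] flow sequences.
var yamlShow = function (doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        return {
            body: '---\nlayout: post\n' +
              'title: ' + JSON.stringify(doc.title) + '\n' +
              'date: ' + doc.display_date + '\n' +
              'comments: true\n' +
              'categories: ' + JSON.stringify(doc.tags) + '\n---\n',
            headers: {
                'Content-Type': 'application/text'
            }
        };
    }
};

// Hypothetical sample document for a local test outside CouchDB.
var sampleDoc = {
    slug: 'ubuntu-saves-the-day',
    title: 'ubuntu "saves" the day',
    display_date: '2010/10/28 04:25:37 -0700',
    tags: ['linux', 'rescue'],
    body_full: 'My work laptop had a BSOD today ...'
};

console.log(yamlShow(sampleDoc, {}).body);
```

Because a show function is an ordinary JavaScript function, it can be called like this with a hand-made document before it ever goes near CouchDB.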

That seems like a good start, so wrap that up into a design document, drop it into your CouchDB and test it out:

$ curl --silent --header "Content-Type: application/text" \
  http://localhost:5984/posts/_design/posts/_show/yaml/31797293
---
title: "ubuntu saves the day"
date: 2010/10/28 04:25:37 -0700
comments: true
categories:
---
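
For reference, a design document that would serve the URL above might look like the following sketch. CouchDB keeps show functions as source strings under the shows field of a design document; the _id of _design/posts and the show name yaml are taken from the URL, while the variable names are my own. The printed JSON could then be uploaded with a PUT to /posts/_design/posts.

```javascript
// The first show function from this post, kept as a real function so it
// can be exercised locally before being serialised.
var yamlShow = function (doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + doc.display_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n',
            headers: {
                'Content-Type': 'application/text'
            }
        };
    }
};

// Design document: each show function is stored as a string of source code.
var designDoc = {
    _id: '_design/posts',
    language: 'javascript',
    shows: {
        yaml: yamlShow.toString()
    }
};

console.log(JSON.stringify(designDoc, null, 2));
```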

Notice how we needed to query using Content-Type: application/text? Try that same link in your browser. You’re prompted for a download that refers to the _id stored in CouchDB.

It would be nicer to get that with the correct Markdown filename already. Let’s use doc.slug for the name, prefixed with the date of the original post. Octopress expects a yyyy-mm-dd format, so I’ve sprinkled it liberally with regex pixie dust.

Finally, an additional HTTP header Content-Disposition: attachment; filename=<file.ext> is required to provide the proposed name via our show function.
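
The date conversion and header value can be tried on their own first. This sketch uses the same two replace() calls as the show function that follows, pulled out into a hypothetical helper (postDate is my name for it) so the intermediate values are visible:

```javascript
// Hypothetical helper wrapping the two replace() calls.
function postDate(displayDate) {
    // Step 1: swap every / for -   -> 2010-10-28 04:25:37 -0700
    var dashed = displayDate.replace(/\//g, '-');
    // Step 2: keep only the leading run of digits and dashes -> 2010-10-28
    return dashed.replace(/^([-0-9]+).+/, '$1');
}

var date = postDate('2010/10/28 04:25:37 -0700');
var disposition = 'attachment; filename=' + date + '-' + 'ubuntu-saves-the-day' + '.md';

console.log(date);         // 2010-10-28
console.log(disposition);  // attachment; filename=2010-10-28-ubuntu-saves-the-day.md
```

One thing to be aware of: the trailing .+ needs at least one character after the date, so a display_date with no time part would lose its final digit. That never happens with the Posterous export, where every display_date carries a time and zone, but swapping .+ for .* would make the helper safe for date-only input too.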

Now’s a good time to append the actual blog post content too.

function(doc, req) {
    if (doc.slug && doc.title && doc.display_date && doc.tags && doc.body_full) {
        // Replace / with - and trim display_date to yyyy-mm-dd- only
        // to match the octopress expected post format.
        // This will be passed as an HTTP header and will be used by
        // browsers or wget as the proposed filename.
        var post_date = doc.display_date.replace(/\//g, '-').replace(/^([-0-9]+).+/, "$1");
        var post_name = 'attachment; filename=' + post_date + '-' + doc.slug + '.md';
        return {
            body: '---\nlayout: post\n' +
              'title: "' + doc.title + '"\n' +
              'date: ' + post_date + '\n' +
              'comments: true\n' +
              'categories: ' + doc.tags + '\n---\n' +
              doc.body_full + '\n',
            headers: {
                'Content-Type': 'application/text',
                'Content-Disposition': post_name
            }
        }
    }
}

I’ve put this into a separate show function, and this is what comes back:

$ curl --silent --verbose --header "Content-Type: application/text" \
 http://localhost:5984/posts/_design/posts/_show/octo/31797293
* About to connect() to localhost port 5984 (#0)
*   Trying 127.0.0.1... connected
* Connected to localhost (127.0.0.1) port 5984 (#0)
> GET /posts/_design/posts/_show/octo/31797293 HTTP/1.1
> User-Agent: curl/7.21.4 (universal-apple-darwin11.0) libcurl/7.21.4 \
 OpenSSL/0.9.8r zlib/1.2.5
> Host: localhost:5984
> Accept: */*
> Content-Type: application/text
>
< HTTP/1.1 200 OK
< Vary: Accept
< Server: CouchDB/1.1.1 (Erlang OTP/R14B04)
< Etag: "6PML44SHRNE54M212K0O6BLXZ"
< Date: Thu, 22 Dec 2011 13:02:36 GMT
< Content-Type: application/text
< Content-Length: 1668
< Content-Disposition: attachment; filename=2010-10-28-ubuntu-saves-the-day.md
<
{ [data not shown]
* Connection #0 to host localhost left intact
* Closing connection #0
---
title: "ubuntu saves the day"
date: 2010-10-28
comments: true
categories:
---
My work laptop had a BSOD today, which looks like it was caused by bit rot ...

$ wget http://localhost:5984/posts/_design/posts/_show/octo/31797293 \
  --content-disposition
Resolving localhost... 127.0.0.1, ::1, fe80::1
Connecting to localhost|127.0.0.1|:5984... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1668 (1.6K) [application/text]
100%[======>] 1,668       --.-K/s   in 0s

2011-12-22 14:04:46 (79.5 MB/s) - `2010-10-28-ubuntu-saves-the-day.md' saved [1668/1668]

Now we can transform arbitrary Posterous blog entries via CouchDB into Markdown format. Next time, I’ll use CouchDB to pull all the data out in one fell swoop.