This is a stripped-down implementation of the server side of the
Atom Publishing Protocol
as an Apache module, written in C. It felt like something that needed to
exist and I am better-qualified for this particular chore than your average
geek; having said that, I have no idea if anyone actually needs such
a thing. mod_atom activity can be tracked on this blog, for now,
here.
If any interest develops, then I’ll transfer discussion
to a blog at mod-atom.net which will be driven, of
course, by mod_atom.
For the moment, I’m going to brain-dump everything about the project right
here, if only as a crutch for my own memory. People who care
about the Atom protocol, and those who care about Apache internals wrangling,
might find it interesting; the intersection of those two groups is, I suspect,
me.
What’s an Apache Module?
It’s code that gets linked into httpd, the Web server binary.
There are
hundreds; a few are
included with the server distro, but most aren’t.
Code in a module doesn’t have to do anything like CGI; it’s just a C
subroutine that gets called with a package of details about the request and
the current server state, which can save some cycles. Might those cycles be
significant in your application? Maybe, sometimes. If mod_atom is fast, it’s
more apt to be fast because of its low-rent flat-file-only approach.
On the other hand, being
in the server means that you have to code in C and you have to be
really careful about concurrency and memory management and all sorts of
low-level grunge.
By the way, to be technically correct, whenever I say Apache I should
probably be saying “httpd”, since while they used to be synonyms, Apache means
much more now; the
httpd Web Server is just one piece.
But httpd is an ugly little splodge of letters, Apache sounds so much
better. And on modern Debian-family systems, httpd is called “apache” anyhow.
Why Me?
Well, I understand the Atom Protocol pretty well and I’ve already
written a couple of Apache modules (for a failed startup), so it’s less work
for me than it would be for nearly anyone else.
Also, I think that the protocol is going to be a big enough part of the Web
ecosystem that Apache, as perhaps the world’s single most important piece of
Web infrastructure, really ought to support it. Think of it as giving
PUT something useful to do.
What Does it Do?
Implements all of the Atom Protocol, near as I can tell.
There’s no database. Everything is persisted in files.
Entry paths look like /blogs/tim/atom/e/entries/2007/06/23/cat-pix
Since it blasts Atom Entries straight into files, it can easily (unlike
most Atom protocol implementations) preserve foreign markup.
It should run fine under any MPM, without concurrency issues.
All the atom:id values begin urn:uuid, so you
could in principle move a whole publication from one server and directory to
another. Those who have memories of me arguing bitterly against URNs in
general and atom:id in particular can please restrain your
snickering while I’m around.
Configuration
There isn’t much. In your Apache config file, you can define as many
“publications” as you want.
Each requires one directive, for example:
AtomPub /blogs/joe /z0/pubs/blogs/jb "Joe's Blog" "J. Blow"
The first argument is a prefix; any URI beginning with it is considered to
be part of the publication.
The second is the filesystem directory where the data is rooted. The
filenames are the same as the URIs, only with the directory substituted for
the prefix.
The title and author are self-explanatory. There are no defaults.
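To make the prefix-for-directory substitution concrete, here is a minimal sketch in plain C. It is not lifted from mod_atom: the function name pub_uri_to_path is made up for illustration, and the check that the prefix ends on a path boundary (so /blogs/joe doesn’t claim /blogs/joey) is my assumption, not something the directive promises.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not taken from mod_atom: map a request URI to a
 * file path by swapping the publication prefix for its data directory.
 * Returns 0 on success, -1 if the URI is outside the publication or the
 * result does not fit in `out`. */
int pub_uri_to_path(const char *prefix, const char *dir,
                    const char *uri, char *out, size_t out_size)
{
    assert(prefix && dir && uri && out);

    size_t plen = strlen(prefix);

    /* The URI must begin with the prefix... */
    if (strncmp(uri, prefix, plen) != 0)
        return -1;

    /* ...and (assumption) the match must end on a path boundary, so
     * that "/blogs/joe" does not claim "/blogs/joey". */
    if (uri[plen] != '\0' && uri[plen] != '/')
        return -1;

    /* Directory first, then whatever followed the prefix in the URI. */
    int n = snprintf(out, out_size, "%s%s", dir, uri + plen);
    if (n < 0 || (size_t)n >= out_size)
        return -1;
    return 0;
}
```

With the directive above, a request for /blogs/joe/atom/e/entries/2007/06/23/cat-pix would land on /z0/pubs/blogs/jb/atom/e/entries/2007/06/23/cat-pix.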
When Apache starts up, if there’s an AtomPub directive but the directory
structure isn’t there, the init code creates it.
mod_atom doesn’t do any other configuration of any kind, for the moment.
Yes, I know there are lots of other kinds of configurations you might like to
be able to do.
People talk about hitting an 80/20 point; this is more like a 60/1 point.
Publications have collections, and per RFC4287, the minimum you need is a
title and an author; so you really couldn’t do this with any less.
And with one line in a config file you get a fully-functional publication.
One thing you can’t configure at all is the directory layout where the data
goes. That’s hard-wired way deep into the code.
Right now, a publication comes with two hard-wired collections named
“Entries” and “Media”. The code can actually (in theory) handle multiple
Entry and Media collections, but I haven’t figured out a cheap enough way to
configure them.
After all, haven’t people been saying “Complexion over Commiseration” or
something like that recently?
How Much Work Was It To Implement the Atom Protocol?
Not much, actually, for a competent C programmer who understands
the protocol and some of Apache. My Apache-module experience was less
valuable than I’d expected,
because I had written Apache 1.* modules and the 2.* API is quite a
bit different.
Anyhow, I started on April 26th and I have enough today to start showing
the world. I program fast but I’ve been busy, so it’s a very part-time thing.
There are 8400 lines of code, but that includes 2600 lines of
Genx (because Apache
doesn’t have much of an XML generator) and then 2700
or so of unit-test code (1700 or so being Genx’s). So it’s really no big
deal.
Life was immensely easier because of having the Ape available.
Being an Apache module imposes some constraints that make unit testing tricky.
While the Ape provides functional rather than unit testing, strictly speaking,
using it shook out loads of bugs and saved a huge amount of time. The setup
was amusingly arcane: the Ape’s Ruby code running under JRuby in a servlet in a
Java EE app server talking to my naked hacked Apache server, 8080 to 4444 I
think.
What with some other things that are there to support
ongoing, my little laptop is running more than its
share of Web servers.
Rocket Science?
There’s really not much. You suck in XML and bit-bags from the net, you
find a place to put ’em, you build feeds describing them, you echo them back
on request, you’re careful about concurrency. It’s vanilla
infrastructure engineering.
There’s one premature optimization; I worried about someone setting up a
few thousand publications on one server (wouldn’t be surprising) and since the
way a module works is you have to look at every URI that comes in to see if
it’s one of yours, the task of scanning through your list of known pubs for
prefix matches could be pretty costly. So, the
mod_atom setup code compiles the list of known pubs into a simple little
finite automaton which can tell you which if any of your pubs a URI belongs
to really fast.
Which is pretty silly, YAGNI territory probably. But I’m a sucker for finite
automata.
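For the curious, the idea can be sketched as a byte-at-a-time trie, which is about the simplest finite automaton that does the job. This is an illustration, not the mod_atom code: the struct and function names, the fixed-size state table, and the longest-prefix-wins rule are all assumptions of mine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_STATES 256    /* arbitrary cap for this sketch */
#define NO_PUB     (-1)

/* One automaton state: a transition per byte value, plus the index of
 * the publication whose prefix ends here (or NO_PUB). */
struct state {
    int next[256];        /* 0 means "no transition"; state 0 is the start */
    int pub;
};

struct matcher {
    struct state states[MAX_STATES];
    int count;
};

void matcher_init(struct matcher *m)
{
    assert(m);
    memset(m, 0, sizeof *m);
    m->states[0].pub = NO_PUB;
    m->count = 1;
}

/* Add one publication prefix. Returns 0, or -1 if the table is full. */
int matcher_add(struct matcher *m, const char *prefix, int pub)
{
    assert(m && prefix && pub >= 0);
    int s = 0;
    for (const unsigned char *p = (const unsigned char *)prefix; *p; p++) {
        if (m->states[s].next[*p] == 0) {
            if (m->count == MAX_STATES)
                return -1;
            m->states[m->count].pub = NO_PUB;
            m->states[s].next[*p] = m->count++;
        }
        s = m->states[s].next[*p];
    }
    m->states[s].pub = pub;
    return 0;
}

/* Walk the URI once, byte by byte; return the publication with the
 * longest matching prefix, or NO_PUB if none matches. */
int matcher_find(const struct matcher *m, const char *uri)
{
    assert(m && uri);
    int s = 0;
    int best = m->states[0].pub;
    for (const unsigned char *p = (const unsigned char *)uri; *p; p++) {
        s = m->states[s].next[*p];
        if (s == 0)
            break;        /* fell off the automaton */
        if (m->states[s].pub != NO_PUB)
            best = m->states[s].pub;
    }
    return best;
}
```

The point is that lookup cost depends on the length of the URI, not on how many publications are configured.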
I tried to avoid mutexing; the only place where you really have to (I
think) is when a PUT comes in and you need to lock things down while you check
the ETag and, if you accept the PUT, blast it in. I think you should be able
to get enough concurrency out of the filesystem for the rest of the
protocol. Based on what I hear, if someone took a mod_atom install and
started firing PUTs at a few existing URIs from a lot of parallel sources, I
bet the apr_global_mutex... calls would start to hurt pretty
quick. I have lots more premature-optimization ideas for that situation.
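The check-then-write step on PUT looks roughly like the sketch below. To keep it self-contained I’ve made some loud assumptions: an in-memory struct stands in for the file on disk, a process-local pthread mutex stands in for the Apache-level lock, the caller supplies the new ETag, and the names (struct resource, conditional_put) are invented for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

/* Stand-in for a stored resource; mod_atom keeps these in files. */
struct resource {
    char etag[64];
    char body[1024];
};

/* Assumption: a process-local pthread mutex stands in for the
 * cross-process lock a real Apache module would need. */
static pthread_mutex_t put_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns an HTTP-style status: 200 if the body was replaced, 412 if
 * the client's ETag is stale, 413 if the new body is too large, 500 if
 * the new ETag does not fit. */
int conditional_put(struct resource *res, const char *if_match,
                    const char *new_body, const char *new_etag)
{
    assert(res && if_match && new_body && new_etag);

    /* Size checks need no lock; they only look at the inputs. */
    if (strlen(new_body) >= sizeof res->body)
        return 413;
    if (strlen(new_etag) >= sizeof res->etag)
        return 500;

    int status;
    pthread_mutex_lock(&put_lock);
    if (strcmp(res->etag, if_match) != 0) {
        status = 412;     /* someone else updated it first */
    } else {
        /* The compare and the write happen under the same lock, so no
         * other PUT can slip in between them. */
        strcpy(res->body, new_body);
        strcpy(res->etag, new_etag);
        status = 200;
    }
    pthread_mutex_unlock(&put_lock);
    return status;
}
```

A single lock like this is exactly what would start to hurt under lots of parallel PUTs, since every writer queues behind it even when they’re touching different entries.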
Frankly, the hardest bit was figuring out all the autoconf and
libtool
voodoo to compile the sucker, and in the end I couldn’t; in the finest
open-source tradition I reused
code from Josh Rotenberg
and did cut/paste/hack till it worked.
I’m assuming that one of these days someone I respect will explain to me
why libtool & friends are a good idea and how to use them properly; until
then I’m going to ignore them and hope they’re replaced. This technique allowed
me to avoid ever learning either imake or C++.
Legal Status
Apache V2 license, copyright Sun Microsystems. If the ASF ever got
interested, I have the go-ahead to sign over whatever to whomever.
Haven’t figured out where to host yet, but
here’s a tarball.
If you want to actually try to run it, do please contact me.
Technical Status
It’s not really ready to use, but I’m publishing it because I want to start
talking and get some advice and opinions on what I should do about some
things, and that’s easier if you can point at source code.
mod_atom passes a few (eighty-odd) unit tests, plus it gets a clean bill of
health from the Ape. One of my short-term to-dos is to run Joe Gregorio’s test
client against it.
I’m pretty sure the basic technical approach to wrangling entries and feeds is
sensible and can probably be made to run very efficiently.
It has one big and one small missing piece, and a major enhancement I think
would be good.
The big missing piece is HTML (see next section).
The small missing piece is collection paging, which just isn’t there at the
moment; you get the last 20 entries in reverse app:edited order
and that’s all you get. No biggie.
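The current no-paging behavior amounts to something like the sketch below: sort by app:edited, newest first, and keep the first 20. The struct and function names are hypothetical, and representing app:edited as a time_t is an assumption made for the sake of a short example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

#define FEED_SIZE 20      /* a feed holds the last 20 entries */

struct entry {
    const char *path;
    time_t edited;        /* app:edited, here as a Unix timestamp */
};

/* qsort comparator: most recently edited first. */
static int by_edited_desc(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;
    if (x->edited > y->edited) return -1;
    if (x->edited < y->edited) return 1;
    return 0;
}

/* Sorts `entries` in place, newest first, and returns how many of the
 * leading entries belong in the feed. */
size_t select_feed_entries(struct entry *entries, size_t count)
{
    assert(entries || count == 0);
    if (count > 0)
        qsort(entries, count, sizeof *entries, by_edited_desc);
    return count < FEED_SIZE ? count : FEED_SIZE;
}
```

Paging would mean handing out the next 20 after those instead of stopping.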
The big enhancement I want to do is non-destructive editing. Right now it
implements PUT by replacing the old data with the new, and DELETE by,
well, deleting the data. I think it would be better, in all cases, to copy
the data aside, uh, somewhere. But I want to talk to people about this one
too, because I suspect it may involve weird corners.
To HTML or not to HTML?
For the moment, mod_atom is just an Atom server, not a blog engine.
Which is to say that it accepts and stores and updates and
deletes the Atom Entries and generates feeds appropriately, but doesn’t
actually generate any HTML versions.
I’m not sure what to do about this. It’d be pretty easy to just pull the
data out of the Atom Entries, wrap some basic HTML around it, and have a
blogging engine. But I think it’s irresponsible to publish HTML from outside
without sanitizing it. While I’m betting that it’s appropriate to do the
low-level persistence and CRUD in the bowels of httpd, I’m having trouble
believing that HTML sanitation and beautification belong in there too.
There are tools like TagSoup and Hpricot which are just the thing for the
job.
So maybe there is an ancillary “blogging system” that does the necessary
with the Atom entries?
Or maybe there’s a TagSoup equivalent available for C that could help out?
To-Do
Suggestions welcome.
Try it out on a few other systems; right now I’ve only tested on OS
X. I expect breakage in my hacked-up build system, but not much in the actual
code. Programs written in C are portable, everyone knows that.
Shake it down with Joe Gregorio’s
APP Test Client.
Add a bunch more tests to the Ape for bits of the protocol which, now
having implemented them, I realize are tricky.
In particular, the Ape never tested sending a PUT to a media resource, so that
portion of the mod_atom code is unexercised and likely buggy.
Add collection paging.
See if anyone at ASF might be interested, now or down the
road.
Fix up error handling so that client errors get an explanation in the
response body, not just an HTTP error code. Apache doesn’t make this as
straightforward as you might expect.
Simultaneously, refactor error-handling internally. Some of my
routines return
apr_status_t and others char *;
it’s kind of ad-hoc and not very well thought
through.
Figure out how to do some load testing.
Do some evangelism. My eyes have that Ruby gleam these days, and
grinding out all this C has been kind of painful so it would be nice if it
turned out to be useful for somebody.