Posted Wednesday, December 9, 2009 at 10:23 p.m. by Chris Amico in Projects about government, journalism, programming and transparency
This weekend, Sunlight Labs is sponsoring a nationwide hackathon, bringing together developers to build open source projects to open up government. I'm planning on dropping in on the one in Dupont Circle.
That has me thinking about a couple recent conversations I've had with the Sunlight crew about what developers, journalists and open government types ought to be working on, and how success can build upon success.
Much of this comes from a session I sat in on back at PubCamp in October, which asked, "What should we do with government data?"
Where data is hidden, locked up or otherwise unavailable, make it available.
This is the simple stuff. Forget about file formats and data feeds for a moment, and just get the data out there. The Sacramento Bee got off to a good start a few years ago by publishing a database of state worker salaries.
The current salaries of anyone who works for state (or local) government have long been available, but they tended to be handed out in thick reams of paper printed from Excel. Putting them online removed a barrier.
Given the choice of PDFs or file cabinets, I'll take a PDF.
Where data is available, make it accessible and usable.
In most cases, though, PDF or paper is a false choice. Most any document or data set I'm interested in started its life on someone's computer, probably in Word or Excel or Access. Converting to a PDF makes it less accessible, and time spent freeing data from PDFs is time that could be spent doing meaningful analysis.
For all the uproar over the Sac Bee's salary database, it's not all that useful by itself. The Bee and its reporters can use it, but I'd have to scrape figures off the page or file my own public records request to do anything with it, and that's especially difficult because the salary database is built in Caspio. The data doesn't live on the Sac Bee's site; it's on Caspio's servers (and mostly invisible to search engines).
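To give a sense of what "scrape figures off the page" involves, here's a minimal sketch using only Python's standard library. It assumes the data sits in a plain HTML table; the sample markup, names and salary figures are made up for illustration, and a Caspio-hosted page would likely need more work than this.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # row being built, or None when outside a <tr>
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table in `html` as a list of rows (lists of strings)."""
    scraper = TableScraper()
    scraper.feed(html)
    scraper.close()
    return scraper.rows


# Hypothetical sample page; not real salary data.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Agency</th><th>Salary</th></tr>
  <tr><td>Jane Doe</td><td>Parks</td><td>$52,000</td></tr>
  <tr><td>John Roe</td><td>Transit</td><td>$61,500</td></tr>
</table>
"""

rows = scrape_table(SAMPLE_HTML)
header, records = rows[0], rows[1:]

# Strip the currency formatting so the figures can be analyzed as numbers.
salaries = [int(r[2].replace("$", "").replace(",", "")) for r in records]
print(header)
print(sum(salaries) / len(salaries))
```

That's a lot of code just to get back to the numbers someone already had in a spreadsheet, which is the point.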
That's why I like the Guardian's Data Blog and Data Store.
Data is displayed in simple, sortable HTML tables. Even better, it's available in Google Spreadsheets, with an invitation to download and mash it up. Kevin Anderson, the Guardian's blog editor, even posted a list of tools to make maps. Very cool.
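Compare that with what happens when the data comes as a download. Here's a rough sketch, assuming a spreadsheet exported as CSV; the column names and figures are invented for the example and don't come from any Guardian data set.

```python
import csv
import io

# Hypothetical CSV export; in practice this would be a downloaded file.
SAMPLE_CSV = """name,agency,salary
Jane Doe,Parks,52000
John Roe,Transit,61500
Ann Poe,Parks,48000
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_by(rows, key, value):
    """Sum the integer column `value`, grouped by the column `key`."""
    totals = {}
    for row in rows:
        totals[row[key]] = totals.get(row[key], 0) + int(row[value])
    return totals


rows = load_rows(SAMPLE_CSV)
by_agency = total_by(rows, "agency", "salary")

# Largest total first.
for agency, total in sorted(by_agency.items(), key=lambda kv: kv[1], reverse=True):
    print(agency, total)
```

No parsing of markup, no cleaning up dollar signs. The time goes into the question instead of the plumbing.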
Where data is accessible and usable, make it meaningful.
This is the hardest part, and it's what reporters most want to do. But if data isn't available, there's no meaning to be had. And if you spend all your time scraping PDFs, that's time not spent looking at what the data means.
Making sense of massive piles of data can be hard. I wrote a bit about our process in Patchwork Nation in my posts about Frameworks for Reporting. Thankfully, there are more tools than there used to be, whether you need to scrape, organize or visualize your data. The rest is just journalism. We know how to do that.
Share.
Almost everything I've listed above is free. Nearly every piece of software I use in my day job is open source. All of it exists because someone decided to give away what they built, and others gave back, and the software is better for it. The journalists and activists working to pry open government should learn from that model.
Here's a real-life example: Congress publishes all kinds of information about itself. Members, committee assignments, votes. It's all online, in varying degrees of usability, thanks to the efforts of countless journalists, activists and developers, both inside and outside of government.
A news organization, in this case the New York Times, collects and organizes all of that for its own reporting purposes. This isn't new. But the Times decides to go a step further and creates an API for that giant database, and opens it up to the public.
Now data that was previously scattered and obfuscated is available and usable.
Shortly after this data is released, Derek Willis, who works for the Times, releases a Python wrapper for the Congress API.
A few months later, I'm poking through Congressional records and need a way to see who's voting with whom. As it happens, the Times Congress API can do that. So I fork Derek's API client, add a method to compare votes and push the result back up to GitHub. Now we both have a better tool.
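The comparison itself is simple once the data is in hand. What follows is not the method I added to Derek's client, and it doesn't call the Times API at all; it's a standalone sketch of the idea, assuming each member's record is a dictionary mapping a roll call number to a position. The function name, the data shape and the sample votes are all hypothetical.

```python
def compare_votes(votes_a, votes_b):
    """Compare two members' positions on the roll calls both took part in.

    Each argument maps a roll call number to a position string,
    e.g. {1: "Yes", 2: "No"}. This shape is an assumption for the sketch.
    """
    shared = votes_a.keys() & votes_b.keys()
    agree = sum(1 for roll_call in shared if votes_a[roll_call] == votes_b[roll_call])
    total = len(shared)
    return {
        "shared": total,
        "agree": agree,
        "disagree": total - agree,
        # Avoid dividing by zero when the members share no roll calls.
        "agree_pct": round(100 * agree / total, 1) if total else None,
    }


# Made-up voting records for two members.
member_a = {1: "Yes", 2: "No", 3: "Yes", 4: "Not Voting"}
member_b = {1: "Yes", 2: "Yes", 3: "Yes", 5: "No"}

result = compare_votes(member_a, member_b)
print(result)
```

A real version would probably want to decide how to treat positions like "Not Voting" or "Present" before counting agreement; this one just compares whatever strings it's given.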
So maybe this is my manifesto on data:
- Where it's hidden, make it available.
- Where it's available, make it usable.
- Where it's usable, make it meaningful.
- Share.
...
As I said, I'll be hacking away at some of this on Saturday. I think I'll work on FedBlogger, a project I started months ago that really could use some love. If you want to help, the source code is online.
