Some notes on creating citable live datasets.

One reason we are so interested in citable live data is that we believe it creates a level of accountability in government data that doesn't currently exist. It also makes it easier for other people to use and reference that data in their own work.

Steps for making data Citable:

1) What data is there? (identifiable)

2) Where is it? (discoverable)

3) What does it mean?

4) How has it changed over time? (delta)

5) Who has touched it? (nice to have, but not necessary beyond knowing the publisher)

Ideally the data should be discoverable. We should be able either to pull it on a regular (daily?) basis or to be notified of changes, so that we can archive it accurately.
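As a rough illustration of the pull approach, the sketch below archives a new snapshot only when the fetched bytes differ from the last archived copy. The function names, the in-memory list used as the archive, and the choice of SHA-256 are assumptions made for the example, not part of any existing system.

```python
import hashlib
from datetime import datetime, timezone


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def archive_if_changed(archive: list, data: bytes, fetched_at: datetime) -> bool:
    """Append a snapshot to the archive only if its content is new.

    `archive` is a hypothetical in-memory list of snapshot dicts; a real
    archive would live in durable storage. Returns True when a snapshot
    was stored, False when the data matched the latest snapshot.
    """
    digest = sha256_hex(data)
    if archive and archive[-1]["hash"] == digest:
        return False  # nothing changed since the last pull
    archive.append(
        {"hash": digest, "fetched_at": fetched_at.isoformat(), "data": data}
    )
    return True


if __name__ == "__main__":
    archive = []
    now = datetime.now(timezone.utc)
    print(archive_if_changed(archive, b"id,value\n1,10\n", now))  # first pull
    print(archive_if_changed(archive, b"id,value\n1,10\n", now))  # unchanged
    print(archive_if_changed(archive, b"id,value\n1,11\n", now))  # changed
```

A notification-based setup could call the same function whenever the publisher signals a change instead of on a timer.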

A basic microformat for a queried dataset:

Query

Time of Query

Answer and Hash of Answer (dataset result)

URI (location of Source Data - publisher)

Source Data, Time of Source Data, Hash of Source
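One possible way to represent the fields above is a small record type. This is only a sketch: the class name, field names, ISO 8601 timestamps, SHA-256 hashing, and JSON serialization are all assumptions for illustration, and the source data itself is referenced by URI and hash here instead of being embedded.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class QueryCitation:
    """Hypothetical record mirroring the microformat fields listed above."""
    query: str
    query_time: str    # time of query, ISO 8601 (assumed format)
    answer: str        # the dataset result returned for the query
    answer_hash: str   # hash of the answer
    source_uri: str    # location of the source data (publisher)
    source_time: str   # time of the source data, ISO 8601 (assumed format)
    source_hash: str   # hash of the source data


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_citation(query: str, query_time: str, answer: str,
                  source_uri: str, source: bytes, source_time: str) -> QueryCitation:
    """Build a citation record, computing both hashes from the raw content."""
    return QueryCitation(
        query=query,
        query_time=query_time,
        answer=answer,
        answer_hash=sha256_hex(answer.encode("utf-8")),
        source_uri=source_uri,
        source_time=source_time,
        source_hash=sha256_hex(source),
    )


def to_json(citation: QueryCitation) -> str:
    """Serialize with sorted keys so the same record always yields the same text."""
    return json.dumps(asdict(citation), sort_keys=True)


if __name__ == "__main__":
    citation = make_citation(
        query="SELECT value WHERE id = 1",
        query_time="2024-01-02T00:00:00Z",
        answer="10",
        source_uri="https://example.org/data.csv",  # placeholder URI
        source=b"id,value\n1,10\n",
        source_time="2024-01-01T00:00:00Z",
    )
    print(to_json(citation))
```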

We could also simply capture the entire dataset itself, hash it, and store it. Storage costs are minimal (approximately $450 on S3), even for a dataset of 10 GB that changes 1% daily.

Depending on traffic, we could just capture each query's result dataset and store it as well (especially if diverse queries are rare).

With new queries, we can check the hash to see whether exactly the same query results were returned before and, if so, return the same pointer/reference.
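The hash check could look something like the following content-addressed store, where the hash of a result doubles as its pointer/reference. The class name, the in-memory dictionary, and the use of SHA-256 are assumptions for the sketch.

```python
import hashlib


class ResultStore:
    """Hypothetical store that keeps each distinct query result once.

    The reference handed back to callers is the SHA-256 hex digest of the
    result, so identical results always map to the same reference.
    """

    def __init__(self) -> None:
        self._by_hash = {}  # reference (hex digest) -> result bytes

    def put(self, result: bytes) -> str:
        """Store the result if unseen and return its reference."""
        ref = hashlib.sha256(result).hexdigest()
        # setdefault keeps the existing copy when the hash is already known
        self._by_hash.setdefault(ref, result)
        return ref

    def get(self, ref: str) -> bytes:
        """Look up a stored result by its reference."""
        return self._by_hash[ref]

    def __len__(self) -> int:
        return len(self._by_hash)


if __name__ == "__main__":
    store = ResultStore()
    first = store.put(b"id,value\n1,10\n")
    second = store.put(b"id,value\n1,10\n")  # same result from a later query
    print(first == second)  # identical results share one reference
    print(len(store))       # only one copy is stored
```

Because the reference is derived from the content, anyone holding a cited result can recompute the hash and confirm that it matches the reference they were given.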

DataSets