Page history last edited by Silona Bonewald 10 years, 4 months ago

Some notes on creating citable live datasets.


One reason we are so interested in citable live data is that we believe it creates a level of accountability in government data that doesn't currently exist. It also makes it easier for other people to use and reference that data in their own work.


Steps for making data citable:

1) What data is there? (identifiable)

2) Where is it? (discoverable)

3) What does it mean?

4) How has it changed over time? (delta)

5) Who has touched it? (nice but not necessary, beyond the publisher)


Ideally the data should be discoverable, and we should be able either to pull it on a regular (daily?) basis or to receive notifications of changes, so that we can archive it accurately.
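One way to implement the pull-based half of this is to compare the publisher's HTTP validators (ETag or Last-Modified), falling back to a content hash when neither is sent. This is a minimal sketch, not part of any existing archiver; the record layout and function name are illustrative:

```python
import hashlib

def should_archive(prev_record, etag=None, last_modified=None, body=None):
    """Decide whether the source appears changed since the last archive pass.

    prev_record is a dict like {"etag": ..., "last_modified": ..., "sha256": ...}
    saved from the previous pass; any of the new values may be None if the
    publisher did not supply them.
    """
    if etag is not None and prev_record.get("etag") is not None:
        return etag != prev_record["etag"]
    if last_modified is not None and prev_record.get("last_modified") is not None:
        return last_modified != prev_record["last_modified"]
    if body is not None and prev_record.get("sha256") is not None:
        return hashlib.sha256(body).hexdigest() != prev_record["sha256"]
    # No basis for comparison: archive to be safe.
    return True
```

On each scheduled pull, the archiver would call this with whatever validators the response carried, and only store a new snapshot when it returns True.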


A basic Microformat for a queried Dataset:


Time of Query

Answer and Hash of Answer (dataset result)

URI (location of Source Data - publisher)

Source Data, Time of Source Data, Hash of Source
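The fields above could be serialized as a small record. A minimal sketch in Python, assuming SHA-256 for the hashes; the field names and example values are illustrative, not part of any agreed spec:

```python
import hashlib
import json
from datetime import datetime, timezone

def make_citation_record(query, answer_bytes, source_uri,
                         source_bytes, source_time):
    """Build one citable record for a query against a published dataset.

    The two hashes let a later reader verify that the stored answer and the
    source data are exactly the ones that were cited.
    """
    return {
        "query": query,
        "query_time": datetime.now(timezone.utc).isoformat(),
        "answer_hash": hashlib.sha256(answer_bytes).hexdigest(),
        "source_uri": source_uri,    # location of the source data (publisher)
        "source_time": source_time,  # publisher's timestamp for the data
        "source_hash": hashlib.sha256(source_bytes).hexdigest(),
    }

record = make_citation_record(
    "SELECT total FROM spending WHERE year = 2009",
    b"...query result bytes...",
    "https://example.gov/datasets/spending.csv",
    b"...full source dataset bytes...",
    "2009-07-01T00:00:00Z",
)
print(json.dumps(record, indent=2))
```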


We could also simply capture the entire dataset itself, hash it, and store it; storage costs are minimal (approximately $450 on S3) even for a 10 GB dataset that changes 1% daily.
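A back-of-the-envelope sketch of the storage volume behind that scenario, assuming only the changed 1% is stored each day on top of the 10 GB base (the dollar figure depends on S3 pricing at the time and is not recomputed here):

```python
# Delta storage for a 10 GB dataset that changes 1% daily, kept for a year.
base_gb = 10.0
daily_delta_gb = base_gb * 0.01           # 1% of the dataset changes per day
year_total_gb = base_gb + daily_delta_gb * 365

print(round(year_total_gb, 1))            # roughly 46.5 GB after one year
```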

Depending on traffic, we could also capture each query's result set and store it (especially if diverse queries are rare).

With new queries we can check via the hash whether the exact same query results were returned before, and if so return the same pointer/reference.
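That dedup step amounts to a content-addressed store: the result's hash is the reference, so identical results are stored once. A minimal sketch, assuming SHA-256 and an in-memory dict in place of real storage:

```python
import hashlib

class ResultStore:
    """Content-addressed store for query results (illustrative, in-memory)."""

    def __init__(self):
        self._by_hash = {}

    def put(self, result_bytes):
        """Store a query result; return a stable reference (its hash).

        If the exact same result was stored before, the existing entry is
        reused and the same reference comes back.
        """
        ref = hashlib.sha256(result_bytes).hexdigest()
        self._by_hash.setdefault(ref, result_bytes)
        return ref

    def get(self, ref):
        return self._by_hash[ref]

store = ResultStore()
ref1 = store.put(b"county,total\nTravis,100\n")
ref2 = store.put(b"county,total\nTravis,100\n")  # same result, same reference
assert ref1 == ref2
```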


