This is where I hope we will post discussion and findings on how we should handle data citations.
Some ideas to throw out there are:
How should we handle citing specific pieces or subsets of data? I would like to point to atomic pieces of data if possible.
For example, a specific row in a CSV file,
or a specific table in an Excel spreadsheet (I think cells are easier),
or even a dataset in the cloud.
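To make the CSV case concrete, here is a minimal sketch of what resolving a row or cell citation could look like. The function name `resolve_cell`, the 0-based row convention, and the sample data are all assumptions made up for illustration, not an agreed scheme.

```python
import csv
import io


def resolve_cell(csv_text, row, column=None):
    """Return the cited row (as a dict) or a single cell from CSV text.

    Assumed conventions: `row` is a 0-based index into the data rows
    (the header line is not counted) and `column` is a header name.
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    record = rows[row]
    return record if column is None else record[column]


# Made-up sample data, only here to exercise the function.
sample = "site,year,count\nA,2008,12\nB,2008,7\n"

print(resolve_cell(sample, 1))           # the whole second data row
print(resolve_cell(sample, 1, "count"))  # one cell in that row
```

Note that a positional row index only stays valid while the file never changes, so a real scheme would probably also need to pin a version of the file.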
People I am trying to recruit are:
David Strauss - large scale systems architect
Brian Fitzpatrick - Data Liberation Front at Google
Brian Aker - Architect for MySQL and Drizzle
People from Microsoft Research
An EC2 and S3 expert (still needed)
Some Random Ideas
- Consider the more general problem of citing structured objects that are not text documents (text documents are a special case important enough to warrant their own model). This brings us to at least two parts of the problem:
- Referencing the whole object. In order to do this, we need a UID for the object. This is addressed in the text document citability problem.
- Referencing a part of the object relative to the whole, e.g. in a table-like structure this could be row/column addressing.
- Structured objects may contain other structured objects that actually appear in multiple places in multiple other objects (e.g. representations like database denormalizations can cause this). We have to decide whether sub-objects are objects in their own right or should only be cited relative to the containing object. A figure in an XLS file is a good test case.
- Datasets can be thought of as large structured objects, allowing us to reduce the problem back to the containment issue. I don't believe it matters if the dataset is "in the cloud".
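One way to picture the two parts of the problem together is a citation made of a UID for the whole object plus a path into it. The sketch below is only an illustration under that assumption: `DataCitation`, `resolve`, the `example:workbook-1` identifier, and the nested dict standing in for a workbook are all hypothetical, and it takes the "cite sub-objects relative to the container" side of the open question above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple, Union


@dataclass(frozen=True)
class DataCitation:
    uid: str                                # identifies the whole object
    path: Tuple[Union[str, int], ...] = ()  # steps into it; empty means the whole object


def resolve(citation: DataCitation, registry: Dict[str, Any]) -> Any:
    """Look up the object by UID, then walk the path down to the cited part."""
    target = registry[citation.uid]
    for step in citation.path:
        target = target[step]  # a dict key or a list index
    return target


# Hypothetical registry: one workbook holding one sheet with a header and two rows.
registry = {
    "example:workbook-1": {
        "sheets": {"Sheet1": [["site", "count"], ["A", 12], ["B", 7]]},
    }
}

whole = DataCitation("example:workbook-1")
cell = DataCitation("example:workbook-1", ("sheets", "Sheet1", 2, 1))

print(resolve(whole, registry))  # the entire workbook
print(resolve(cell, registry))   # row index 2, column index 1 of Sheet1
```

If sub-objects were instead treated as objects on their own, each would get its own `uid` and an empty `path`, and the container would hold references rather than copies.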