Citability Archive Server
The archive server will be built on Bazaar, a distributed, file-centric version control tool with an excellent Python API and good cross-platform support. Bazaar provides several facilities that make it quick to develop an effective and transparent system:
- As a file-based version control tool, it inherently handles archiving changing documents over time.
- As a distributed tool, it allows anonymous replica creation ("branching") with efficient updates.
- Digital signing (and verification) of revisions is built in, which makes the system tamper-evident and repairable as long as multiple, mutually distrusting parties replicate the data. A forged revision would fail the signature check. If the source revision history were tampered with, a "pull" on a replica would refuse to destroy its existing data (unless forced).
- Bazaar efficiently and correctly records document moves and deletions.
- As a file-based version control tool, it has good support for versioned binary data, like images.
It is also a scalable design:
- Large sites can either shard archives into multiple branches (making cloning somewhat harder) or use multi-level stacked branches.
- Bazaar uses a highly efficient "group compress" approach to storing revisions. Minor document changes take minimal space.
- A system like Gearman can queue archive updates and collapse identical queued update requests.
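The idea of collapsing identical queued requests can be illustrated with a small in-process queue. This is only a sketch of the behavior, not Gearman's API: the class name `CoalescingQueue` and its methods are hypothetical, and it assumes that two requests are "identical" when they name the same URL.

```python
from collections import OrderedDict


class CoalescingQueue:
    """FIFO queue that keeps at most one pending update request per URL.

    Hypothetical in-process stand-in for a job queue; a real deployment
    would delegate this to a system like Gearman.
    """

    def __init__(self):
        # OrderedDict preserves insertion order, so it doubles as a FIFO
        # queue and as a set of pending URLs.
        self._pending = OrderedDict()

    def enqueue(self, url):
        """Queue an update for `url`; return False if one is already pending."""
        if url in self._pending:
            return False
        self._pending[url] = True
        return True

    def dequeue(self):
        """Pop the oldest pending URL, or return None if the queue is empty."""
        if not self._pending:
            return None
        url, _ = self._pending.popitem(last=False)
        return url

    def __len__(self):
        return len(self._pending)
```

Once a URL has been dequeued it can be queued again, so a document that changes while an earlier update is being processed is not lost.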
An archive server will support at least three public interfaces:
- A Bazaar branch (via bzr://) for anonymous replication.
- A wrapper for Bazaar that can display the content for a Citability-spec URL, which includes a timestamp and a path. David Strauss wrote a prototype of this in about 10 minutes. It just:
  - Runs "bzr cat --revision=date:[timestamp] [path]" (or its Python API equivalent) for the requested timestamp/path combination.
  - Adds appropriate anchor tags to allow citing a specific section, paragraph, or equivalent.
- An endpoint that clients ping with the URL to archive and the known hash of its raw content. This lets the Citability server queue archival operations while quickly weeding out most redundant requests. As mentioned above, archival requests may be queued through a system like Gearman to handle periods of heavy traffic. When there is a new revision to archive:
  - The content is downloaded and hashed.
  - The hash is cached so that repeated requests to archive the same content are ignored.
  - The content is written to the Bazaar branch and committed.
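The "bzr cat" step of the display wrapper could look roughly like the following. This is a minimal sketch, not the prototype mentioned above: the function names `build_cat_command` and `fetch_archived` are hypothetical, parsing the Citability-spec URL into a timestamp and a path is assumed to happen elsewhere, and the timestamp is assumed to be passed to bzr in its `date:YYYY-MM-DD,HH:MM:SS` revision format.

```python
import subprocess
from datetime import datetime


def build_cat_command(timestamp, path):
    """Build the argument list for "bzr cat" at a given date and path.

    `timestamp` is a datetime already parsed from the request URL (the
    URL parsing itself is outside this sketch).
    """
    if path.startswith("-"):
        # Keep a request path from being interpreted as a command-line option.
        raise ValueError("path must not start with '-'")
    revision = "date:" + timestamp.strftime("%Y-%m-%d,%H:%M:%S")
    return ["bzr", "cat", "--revision=" + revision, path]


def fetch_archived(branch_dir, timestamp, path):
    """Run "bzr cat" inside the branch directory and return the file's bytes.

    Raises subprocess.CalledProcessError if bzr exits with an error.
    """
    command = build_cat_command(timestamp, path)
    return subprocess.check_output(command, cwd=branch_dir)
```

Passing the command as a list rather than a shell string avoids shell quoting problems with paths taken from a request. Adding anchor tags to the returned content would be a separate step.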
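The archival steps behind the ping interface can be sketched as follows. Everything named here is an assumption for illustration: the class `ArchiveDeduplicator`, the choice of SHA-256 (the hash algorithm is not specified above), and the `fetch` and `commit` callables, which stand in for the HTTP download and for writing to and committing in the Bazaar branch.

```python
import hashlib


class ArchiveDeduplicator:
    """Archive content only when its hash differs from the last one seen."""

    def __init__(self, fetch, commit):
        self._fetch = fetch    # callable: url -> bytes (hypothetical downloader)
        self._commit = commit  # callable: (url, content) -> None (hypothetical)
        self._seen = {}        # url -> hex digest of the last archived content

    def handle_ping(self, url, known_hash):
        """Return True if a new revision was archived, False if skipped."""
        # Cheap check first: the hash supplied with the ping matches the
        # cached one, so there is nothing to download.
        if self._seen.get(url) == known_hash:
            return False
        content = self._fetch(url)
        digest = hashlib.sha256(content).hexdigest()
        # The supplied hash is not trusted; compare what was actually fetched.
        if self._seen.get(url) == digest:
            return False
        self._commit(url, content)
        # Cache the hash only after a successful commit, so that a failed
        # commit is retried on the next ping.
        self._seen[url] = digest
        return True
```

In this sketch the cache is a plain dictionary. A real server would presumably use a shared or persistent cache so that the check survives restarts.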
Description: Server-hosted ability to archive documents, process (or create) anchor tags, and show documents using the URL specification.
Project Lead: David Strauss
Project Team
Links:
Where is it hosted? Launchpad: https://launchpad.net/citability
Test environment? None yet.
Documents:
Listed on Launchpad:
https://blueprints.launchpad.net/citability
Binaries:
Hosted on Launchpad:
https://launchpad.net/citability/+download