Discussions and Questions


Want to discuss something?  You can't really discuss it on the mailing list (which is announce-only), but you can discuss it here.

Add a new link off this page, or just add a new section to the bottom.  Inline discussions I guess are more threadable and trackable?  I'm not sure.  Anyone can edit, but you do have to register first.

 

From Ian, July 27 2009:

 

OK, so after a quick look at Thomas and some thought about what I take to be the goals of Citability, here are some thoughts and questions:

 

1. Is Thomas (http://thomas.loc.gov/) pretty much all the data we're looking to make citable, at least federally and as a first step?

 

Silona> Actually, the plan was for ANY public government document to be citable, at any level: any country, any city...  I would actually love to kill Westlaw and their copyrighted citation process!

 

Ian> Sure, I figured federal legislation is what we can all be most interested in to start with; once we have examples of that, applying it locally will be a lot easier.  As I mention below, in this case getting the data from Thomas isn't the hard part; figuring out what to do with that data is.

 

2. Are we really just trying to make a better Thomas?

 

3. Thomas includes some static resources (bills) and some more timely information (bill status).  Does this timely information fit into what gets cited?  Does it involve creating ongoing timelines?

 

4. Does the XML version of legislation contain everything we want to display?  There's a bunch of metadata in there besides just the bill.  (For whatever reason, the XML has somewhat nicer URLs than the HTML.)  Not everything has XML; maybe a bill is only translated at a certain point in its passage?

 

5. Do we want to translate the XML into HTML + microformats or something?  There's stuff like this:

 

<action-date date="20090106">January 6, 2009</action-date>

<action-desc><sponsor name-id="J000032">Ms. Jackson-Lee of Texas</sponsor> introduced the following bill; which was referred to the <committee-name committee-id="HAS00">Committee on Armed Services</committee-name>

 

There's handy information in there, but no particularly good HTML equivalent.  We could link these things up, e.g.:

 

  <a href="/name/J000032" class="fn">Ms. Jackson-Lee of Texas</a>

 

That is, we translate each of these IDs into a URL.  We could try to maintain a list of backlinks as a starting point for these pages, but the URL by itself is already a useful identifier.
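A minimal sketch of the ID-to-URL translation described above, assuming the fragment is first made well-formed (the closing </action-desc> is added here for parsing).  Only the /name/ prefix and the "fn" class come from the discussion; the /committee/ prefix, the "org" class, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# tag -> (ID attribute, URL prefix, CSS class).
# Only "/name/" and "fn" appear in the discussion; the rest is assumed.
LINK_RULES = {
    "sponsor": ("name-id", "/name/", "fn"),
    "committee-name": ("committee-id", "/committee/", "org"),
}

# The fragment quoted above, closed so that it parses on its own.
SAMPLE = (
    '<action-desc><sponsor name-id="J000032">Ms. Jackson-Lee of Texas</sponsor>'
    ' introduced the following bill; which was referred to the '
    '<committee-name committee-id="HAS00">Committee on Armed Services'
    '</committee-name></action-desc>'
)


def element_to_link(elem):
    """Return an <a> tag for an element with a known ID, else None."""
    rule = LINK_RULES.get(elem.tag)
    if rule is None:
        return None
    attr, prefix, css_class = rule
    ident = elem.get(attr)
    if ident is None:
        return None
    # Collapse whitespace, since the source XML wraps names across lines.
    text = " ".join("".join(elem.itertext()).split())
    return '<a href="{}{}" class="{}">{}</a>'.format(
        prefix, escape(ident), css_class, escape(text)
    )


def action_desc_to_html(xml_text):
    """Convert one <action-desc> element into an HTML fragment with links."""
    root = ET.fromstring(xml_text)
    parts = [escape(root.text or "")]
    for child in root:
        link = element_to_link(child)
        if link is None:
            # Unknown element: keep its text, drop the markup.
            link = escape("".join(child.itertext()))
        parts.append(link)
        parts.append(escape(child.tail or ""))
    return "".join(parts)


if __name__ == "__main__":
    print(action_desc_to_html(SAMPLE))
```

Keeping the rules in a table means a new ID type needs only one new entry, and the generated URL stays a plain function of the ID, so it can be used as an identifier even before any page exists behind it.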

 

The XML itself has lots of ids on sections and paragraphs.  These aren't present in the HTML, which is unfortunate, because I am guessing they may(?) be stable, and ids are linkable (just not easy to discover).

 

Silona> That all sounds awesome and like a good idea.  It does go beyond the goal of citability, but PLEASE don't let that stop you!  We love microformats here and making things more readable.

 

 

6. Another added feature over Thomas, I guess, is versioning?  That is, regularly polling the Thomas site to see updates, and keeping a record of all updates.  Do we actually need index pages for all bills then?  I think Silona was talking about a URL structure like:

 

  /bill/House/111/HR65/ -> the latest bill

  /bill/House/111/HR65/20090301 -> the bill as it was on March 1st 2009.

 

But I wonder if /bill/House/111/HR65 should actually be an index page of all versions?

 

Silona> Versioning will be handled by the tools we create to sit on top of this standard, but we are focusing on citations here, so degradable URLs for referenceability are primary!
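A rough sketch of how the URL structure from item 6 might be parsed and degraded.  The path layout comes from the two example URLs above; the regular expression, the BillRef type, and the rule of picking the newest snapshot on or before the requested date are assumptions for illustration, not a settled design.

```python
import bisect
import re
from datetime import date, datetime
from typing import List, NamedTuple, Optional

# /bill/<chamber>/<congress>/<bill>[/<YYYYMMDD>][/]
BILL_URL = re.compile(
    r"^/bill/(?P<chamber>[A-Za-z]+)/(?P<congress>\d+)/(?P<bill>[A-Za-z]+\d+)"
    r"(?:/(?P<asof>\d{8}))?/?$"
)


class BillRef(NamedTuple):
    chamber: str
    congress: int
    bill: str
    as_of: Optional[date]  # None means "the latest version"


def parse_bill_url(path: str) -> BillRef:
    """Parse a bill URL path; raise ValueError if it does not match."""
    m = BILL_URL.match(path)
    if m is None:
        raise ValueError("not a bill URL: %r" % path)
    as_of = m.group("asof")
    return BillRef(
        m.group("chamber"),
        int(m.group("congress")),
        m.group("bill"),
        datetime.strptime(as_of, "%Y%m%d").date() if as_of else None,
    )


def degrade(path: str) -> List[str]:
    """List the parent URLs reached by dropping trailing segments one by one."""
    segments = [s for s in path.split("/") if s]
    return [
        "/" + "/".join(segments[:i]) + "/"
        for i in range(len(segments) - 1, 0, -1)
    ]


def resolve_snapshot(snapshot_dates, as_of):
    """Pick the newest snapshot on or before as_of (newest overall if None)."""
    dates = sorted(snapshot_dates)
    if not dates:
        return None
    if as_of is None:
        return dates[-1]
    i = bisect.bisect_right(dates, as_of)
    return dates[i - 1] if i else None


if __name__ == "__main__":
    print(parse_bill_url("/bill/House/111/HR65/20090301"))
    print(degrade("/bill/House/111/HR65/20090301"))
```

Under this sketch, whether /bill/House/111/HR65/ shows the latest text or an index of versions is purely a presentation choice: both can be produced from the same list of snapshot dates.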

 

7. Would the HTML be the canonical version of all pages?  That is, will the HTML be a parseable and documented format?  (More documented than just "HTML": specific classes with specific meanings.)

 

Silona> Yes.  I also want to do hashes for the archiving servers so that citations can be easily verified.  David Strauss wants to build the archive server with a distributed version control system like GitHub or Bazaar so that we create a verification trail.  I think that is a great idea.

 

Ian> Given the text, an attempt at stable ids (it's a fuzzy problem, so at best it is an attempt) and a feed to allow people to easily track incremental updates, I think the versioning can be implemented on top.  I don't think hashes are important, as you can actually compare the text itself.  No paragraph is all that big, and a hash is almost a compression system in this case. ~Ian
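To make the two positions above concrete, here is a small sketch that shows both: a per-paragraph hash that a citation could carry, and a direct text comparison keyed by paragraph id.  SHA-256, the whitespace normalization, and all function names are assumptions; nothing in the thread fixes a hash algorithm or an id scheme.

```python
import hashlib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not change the hash."""
    return " ".join(text.split())


def paragraph_digest(text: str) -> str:
    """Hex SHA-256 digest of the normalized paragraph text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def verify_citation(cited_digest: str, archived_text: str) -> bool:
    """Check that the archived paragraph still matches the cited digest."""
    return paragraph_digest(archived_text) == cited_digest


def changed_paragraphs(old: dict, new: dict) -> list:
    """Compare two {paragraph id: text} snapshots directly, without hashes.

    Returns the sorted ids that were added, removed, or whose text differs.
    """
    ids = old.keys() | new.keys()
    return sorted(
        pid for pid in ids
        if normalize(old.get(pid, "")) != normalize(new.get(pid, ""))
        or (pid in old) != (pid in new)
    )


if __name__ == "__main__":
    cited = paragraph_digest("Be it enacted by the Senate and House")
    print(verify_citation(cited, "Be it enacted by the\nSenate and House"))
```

The hash lets a citation be checked without shipping the paragraph text alongside it, while the direct comparison needs both snapshots but says which paragraphs changed; either can sit on top of the same stable ids.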

 

If this is all just Thomas data but with better linkability, then the scraping seems pretty simple, but there are a lot of questions about what a truly canonical source of legislative information should be.  Or, if this is something else, then probably all my questions are off base ;)

 

Silona> This is the reason I would like the government to do it, so the URL can be ARCHIVE.HOUSE.GOV.  govtrack.us, whitehouse.wikia.com, and others already do permalinks, but those are not "official" sources.  This is a mechanism to create official, clonable sources with verifiable hashes.

 

Ian> To be canonical, I think you really have to be complete.  So if archive.house.gov contains the bill, it needs to have at least all the information available through Thomas.  That includes some funky metadata.

 

Ian> It occurs to me generally that it would be very doable right now to construct an example of "the bill page we want".  That is, take one bill and hand-massage it into what exactly would show up on archive.house.gov.  Once that is done, and there's a sort of rough consensus on the scope and goals, implementations should be easy.  Or at least, much much easier -- the choice is one of how to do it, not what to do.