Self-citing sources for genealogical research
07 March 2006 01:29
Today on the LDSOSS mailing list, Dan Lawyer sent a link to his new blog, "Taking Genealogy to the Common Person", the purpose of which is described thus:
A clear majority of people on this earth want to know more about their ancestors. In spite of their innate interest, they are often overwhelmed at the complexity of the process and underwhelmed by the experience. This blog is a forum for promoting innovation that will help to take family history to the common person.
A worthy cause indeed. His inaugural post brings up the idea of self-citing internet sources in genealogical research. He proposed some kind of tag markup in a given page indicating source information for the image or content displayed on that page, and put out a challenge to standardize some kind of universal format. He places all this in the context of the digitized image delivery system the LDS Church is building to provide online access to digitized versions of its more than 2 million microfilms.
I commented that microformats might be a good candidate to handle this kind of thing. On the microformats wiki, there is already a potential citation microformat that may be just what we're looking for.
In my comments (which I include, slightly edited, below), I brainstormed several possibilities these self-citing pages could open up. Exciting stuff indeed.
I think microformats might fit the bill for what you're looking for here:
"Designed for humans first and machines second". Microformats are essentially snippets of structured (X)HTML that are machine-readable, embedded in pages that are human-readable.
It appears there is already work in progress to develop a citation microformat (http://www.microformats.org/wiki/citation).
I can think of a number of ways this kind of thing would be useful. As one example, imagine browsing through all the digitized images and being able to click a browser bookmarklet (http://en.wikipedia.org/wiki/Bookmarklet) that does something like "Associate the source document displayed on this page with an individual/marriage/event/etc in my account in the church's new FamilyTree system." Clicking the bookmarklet would scan the current page for the microformat, and then lead you to a page in the FamilyTree system that would let you select the individual/event/marriage with which to associate the image as a source.
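To make the idea concrete, here is a minimal sketch of how a tool might pull a citation out of such a page. The markup and the class names ("hcite", "title", "repository", "film-number") are made up for illustration and are not taken from the draft on the microformats wiki; the film number is a placeholder too.

```python
from html.parser import HTMLParser

# Hypothetical image display page. The class names are assumptions for
# illustration only, not the vocabulary of the draft citation microformat.
SAMPLE_PAGE = """
<html><body>
  <img src="/images/film-0000001/frame-0042.jpg" alt="Parish register page">
  <div class="hcite">
    <span class="title">Parish register, 1750-1812</span>,
    <span class="repository">Family History Library</span>,
    microfilm <span class="film-number">0000001</span>
  </div>
</body></html>
"""

FIELDS = {"title", "repository", "film-number"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}  # tags with no end tag


class CitationParser(HTMLParser):
    """Collect the text of known fields found inside an 'hcite' element."""

    def __init__(self):
        super().__init__()
        self.citation = {}
        self._depth = 0     # nesting depth inside the hcite element (0 = outside)
        self._field = None  # field whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # no matching end tag, so it must not change the depth
        classes = (dict(attrs).get("class") or "").split()
        if self._depth:
            self._depth += 1
            for name in classes:
                if name in FIELDS:
                    self._field = name
                    self.citation[name] = ""
        elif "hcite" in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag not in VOID_TAGS:
            self._depth -= 1
            self._field = None

    def handle_data(self, data):
        if self._field:
            self.citation[self._field] += data


def extract_citation(html):
    """Return a dict of citation fields found in the page (empty if none)."""
    parser = CitationParser()
    parser.feed(html)
    return {name: text.strip() for name, text in parser.citation.items()}


if __name__ == "__main__":
    print(extract_citation(SAMPLE_PAGE))
```

The human reader sees an ordinary source line under the image; the tool sees three labeled fields it can file away.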
Of course, the church could also just put a link on each image display page that does just that, but a microformat would allow systems from different vendors to interact with these sources. A genealogy program could allow me to paste in a URL from which it could automatically extract the source information. Or going one step further, genealogy tool providers (RootsMagic, The Master Genealogist, Legacy, et al.) could provide browser plugins or browser toolbars that automatically detect these self-citing pages, and which could offer actions similar to the example above.
Now if the ContentDMs (digital library software used by BYU and many other digital libraries) of the world could do similar things with their image display pages, we'd really be moving somewhere.
I think if the church, with its weight and influence, were to adopt such a standard, we'd potentially see a lot of other digital content providers begin to follow suit.
As to the concern of having to "load the whole document" just to get the source citation, HTTP lets a client fetch just a page's HTML without also fetching the page's images (each image is a separate request), so I don't see this as (too big of) a problem, unless the text on the page itself is also very large. In a digitized image delivery system, I think the pages would be fairly lightweight as far as the text goes.
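Here is a small sketch of that point, using a throwaway local server so that nothing depends on a real site. The page content and the paths (/image-page.html, /big-scan.jpg) are invented for the demo. The client asks for the page once, and the server records that the image was never requested.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Made-up page: one large image plus a short citation in the HTML itself.
PAGE = (b'<html><body><img src="/big-scan.jpg">'
        b'<p class="hcite">Source text</p></body></html>')
requested_paths = []  # every path the demo server is asked for


class DemoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        requested_paths.append(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        pass  # keep the demo quiet


def fetch_html_only():
    """Start a local server, fetch one page from it, and return the HTML."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/image-page.html")
        body = conn.getresponse().read()
        conn.close()
        return body
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    html = fetch_html_only()
    print(len(html), "bytes of HTML;", "requests seen:", requested_paths)
```

A browser would go on to request the image because it wants to display it; a citation-harvesting tool simply never asks.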
And why stop with just citations? One possibility is to create a "GEDCOM microformat" in which linkage and other genealogical data can be embedded in machine-readable form in human-readable pages (http://www.microformats.org/wiki/genealogy-formats). If each software program and each online family tree system that displays pedigrees and other genealogical information were to include these microformats, there would be a huge number of possibilities for how such a format could be used:
- Tools vendors could again provide bookmarklets, toolbars, or browser plugins to do things like, "Import the individual on this page (and all their ancestors) into my genealogy database." Clicking on such a button/link would pop up the user's genealogy tool of choice, which would pull down the page in question, parse the microformatted data, and chase the resulting tree.
- Plugins could be provided to display lists of individuals/marriages/events/sources that are on the currently displayed page in a sidebar, each with options to import and/or process with the user's tool of preference.
- Search engines and aggregators could do automated match/merge of individuals they parsed out from spidered pages, offering suggestions to their users as to pages related to their research.
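The "chase the resulting tree" step in the first idea above might look something like this. It is only a sketch: the PAGES dictionary stands in for pages that have already been fetched and parsed, and the URLs, names, and record layout are all hypothetical.

```python
from collections import deque

# Stand-in for fetched-and-parsed pages: URL -> data a microformat parser
# might have extracted. All of it is invented for illustration.
PAGES = {
    "/person/1": {"name": "Child", "parents": ["/person/2", "/person/3"]},
    "/person/2": {"name": "Father", "parents": ["/person/4"]},
    "/person/3": {"name": "Mother", "parents": []},
    "/person/4": {"name": "Grandfather", "parents": []},
}


def import_with_ancestors(start_url, fetch=PAGES.get, max_generations=None):
    """Breadth-first walk from one person up through their ancestors.

    fetch(url) returns a parsed record or None; a real tool would download
    and parse the page there. Returns {url: name} for everyone imported.
    """
    imported = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, generation = queue.popleft()
        if url in imported:
            continue  # already visited (pedigrees can loop back on themselves)
        person = fetch(url)
        if person is None:
            continue  # page missing or not microformatted
        imported[url] = person["name"]
        if max_generations is None or generation < max_generations:
            for parent_url in person["parents"]:
                queue.append((parent_url, generation + 1))
    return imported


if __name__ == "__main__":
    print(import_with_ancestors("/person/1"))
    print(import_with_ancestors("/person/1", max_generations=1))
```

The visited check matters more than it looks: the same ancestor often shows up on more than one line of a pedigree.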
Again, if the church were to get behind this kind of effort, starting with the publicly accessible portion of the new FamilyTree system, a lot of other vendors would soon follow.
Let's do it!
Some more thoughts:
Some advantages of microformats:
- They're invisible to the average user
- Yet they provide so many possibilities to tool providers and geeks like me (and through them/us to the masses via the tools they/we build)
- I don't see them as being that difficult to add/implement in most of the tools that are out there that generate HTML, once we have a good standard established (that's the hard part :-).
One more application idea along the lines of a "GEDCOM microformat":
Imagine if the church's new Family Tree system published RSS/Atom feeds of activity happening on people's trees. Each time a person was added, for example, an item could be added to a feed for that tree or account (if it were safe to publish such info, e.g., the new person was not living). And if those feed items embedded these microformats, I could subscribe to these feeds with my aggregator (such as bloglines.com). As new individuals were added to my trees of interest, they would show up in my aggregator, and the lovely browser plugins (this assumes people are using web-based aggregators) would pick up on these microformats in the feeds I'm looking at, offering all the options to import, etc.
Dan responded, saying
Microformats looks like a possible option for what we need to do. I'll spend some time learning about it.
I've thought about the value of having RSS or ATOM feeds from people the Family Tree. Seems like a powerful concept. There is a question of granularity on such a feed. Is it scoped to a person, family, family line, n number of generations, etc.?
The microformat concept for genealogy has some potential also. It would definitely need to be coupled with a citation capability; otherwise there's a risk that we make it easier to propagate unsubstantiated pedigrees.
To which my response was:
Offer flexible levels of granularity, a la del.icio.us or flickr.com.
For example, on del.icio.us (a social bookmarking site), I can subscribe to feeds of:
- all recent urls being submitted
- all popular urls being submitted (urls that are getting submitted most frequently)
- urls being tagged by a specific tag (e.g., 'linux')
- urls being submitted by a particular user
- urls being tagged by a specific tag by a specific user
and so forth.
As an example of using that last feed, one of the sidebars on my website brainshed.com is generated by slurping down and parsing the feed of all urls I have submitted and tagged with 'perl_module'.
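A rough sketch of that kind of feed parsing follows. My sidebar is not built this way; this is just the same idea in Python, and the sample below is a simplified RSS 2.0 document with made-up entries rather than the actual del.icio.us feed format.

```python
import xml.etree.ElementTree as ET

# Simplified, invented feed for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Bookmarks for an example user</title>
  <item>
    <title>Some::Module</title>
    <link>http://example.com/some-module</link>
    <category>perl_module</category>
  </item>
  <item>
    <title>A genealogy article</title>
    <link>http://example.com/genealogy-article</link>
    <category>genealogy</category>
  </item>
</channel></rss>"""


def links_tagged(feed_xml, tag):
    """Return (title, link) pairs for feed items carrying the given tag."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        tags = {category.text for category in item.findall("category")}
        if tag in tags:
            results.append((item.findtext("title"), item.findtext("link")))
    return results


if __name__ == "__main__":
    for title, link in links_tagged(SAMPLE_FEED, "perl_module"):
        print(title, link)
```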
Flickr offers similar functionality for various aspects of their photo service.
As another example of feed possibilities, Yahoo provides RSS feeds for search results. So I can subscribe to an RSS feed based upon the search results of, say, 'hanks genealogy', and theoretically (although it doesn't quite work like I'd like) be notified any time a new search result pops up for those search terms.
So, for FamilyTree, it would be fun to have:
- an RSS feed for all changes being made by a particular user (given the user's permission to do so, etc)
- a feed for all changes to a particular individual or set of individuals
- a feed for all changes to an individual or any of his ancestors for N generations
- a feed for all changes to an individual or any of his descendants for N generations
- a feed for any sources that are added to an individual, a set of individuals, an individual and his descendants, an individual and his ancestors, etc, etc.
- A feed for all new digitized images coming online (i.e., one entry in the feed for when images from FHL #123456789 become generally available)
- A personalized feed for any disputes that are submitted for info in any of my lines.
- And so forth :-).
Now, I don't envy the developers who have to build the backend for such a system, but I don't see it being too hard. Somehow you log changes being made in the system, and for each change you determine which feed interests (see the list above) that change would apply to (you'd also have to determine whether the change is private and shouldn't be made publicly available). Then you'd have an application/CGI/etc. take incoming HTTP requests for feeds and dynamically determine which of the change events need to go in each feed requested (with plenty of caching involved, of course).
Granted that's probably a simplistic view of what would be needed to implement such a system, but I hope you get the idea. Make the set of feeds available infinitely (or nearly so...) customizable, and we'll all probably be surprised at the variety of uses that arise from the availability of these feeds.
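As a very rough sketch of that idea: keep a change log, and when a feed is requested, filter the log down to the public changes that match it. Everything here (the Change record, the ids, the PARENTS table, the function names) is hypothetical and says nothing about how FamilyTree is actually built, and a real system would add the caching mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    user: str          # who made the change
    person_id: str     # which individual it touched
    summary: str
    private: bool = False  # e.g. the change involves a living person


# Invented pedigree links (child -> parents) and change log.
PARENTS = {"p1": ["p2", "p3"], "p2": ["p4"], "p3": [], "p4": []}

CHANGE_LOG = [
    Change("alice", "p1", "Added birth date"),
    Change("bob", "p4", "Added source"),
    Change("alice", "p3", "Edited living person", private=True),
    Change("bob", "p9", "Added unrelated person"),
]


def ancestors(person_id, generations):
    """The person plus their ancestors up to the given number of generations."""
    found = {person_id}
    frontier = [person_id]
    for _ in range(generations):
        frontier = [p for child in frontier for p in PARENTS.get(child, [])]
        found.update(frontier)
    return found


def feed_for_user(user):
    """Public changes made by one user, in log order."""
    return [c for c in CHANGE_LOG if c.user == user and not c.private]


def feed_for_ancestors(person_id, generations):
    """Public changes to a person or any ancestor within N generations."""
    wanted = ancestors(person_id, generations)
    return [c for c in CHANGE_LOG if c.person_id in wanted and not c.private]


if __name__ == "__main__":
    for change in feed_for_ancestors("p1", 2):
        print(change.user, change.person_id, change.summary)
```

Each kind of feed in the list above becomes one more small filter over the same log, which is why I suspect the set of feeds could grow without the backend growing much.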
This is getting long, but I wanted to include this all here, as I've been giving a lot of thought to this kind of stuff lately. I really like the ideas behind sites like edgeio.com and inods.com, where instead of one player (like Amazon) holding all the data, we each own our own data, and host it wherever we want, and it is then spidered and aggregated into useful tools. I'm not saying I don't like Amazon (I love Amazon!), but I do like the idea of being able to keep all my data in one place, instead of having to post reviews here, as well as on Amazon. If Amazon wanted to be really bold, it could do its own aggregation a la inods, and add lists of links to reviews from blogs on its product pages. They would lose some of the editorial control that they now exert, but would gain, in my opinion, some very good content.
Applying these datalibre ideas to genealogy, I could host my genealogy data wherever I want (even on the new FamilyTree system), and if it were all microformatted, then any search engine or aggregator or whatever tools we haven't thought of yet could still use it in meaningful ways.