Distributed genealogical record extraction
29 March 2005 00:38
Last week I attended the Family History Technology Workshop at BYU, and came away with my head abuzz about all the cool things underway in the genealogical community in terms of emerging technologies.
A common thread through several of the presentations was data extraction and indexing. So much genealogically rich information sits on dusty archive shelves, unavailable to most people. Two main steps stand between that data and the general Internet audience:
First, you have to scan or image the "offline" materials to make them available online. Second, once they're in a digital format, they need to be transcribed or "extracted" to make the information indexable and ultimately searchable.
An idea I had with regard to the extraction step is inspired by Project Gutenberg's Distributed Proofreaders. The Distributed Proofreaders website enables anyone with a web browser and an Internet connection to help add new etexts to the Project Gutenberg archive by proofreading small sections of OCRed text. The user logs into the website and is presented with the scanned image and a text box containing the OCR text. The user then makes any corrections to the text and submits the form.
If we take that idea and apply it to genealogical records, I think we have the potential to make a lot of information available online that was previously "locked away" (so to speak) on microfilm or other "offline" media.
At the conference, a presenter from the LDS Church's Family History Department spoke about digitization efforts underway at the church. My dream would be to see the church's entire collection of microfilm records available for viewing online. But as I mentioned above, digitization only gets you halfway there. Once digitized, the church will have a large amount of data that will need to be extracted from the digital images to become indexable and searchable. That's where distributed proofreading comes in.
You could apply the PGDP approach and allow volunteers to sign in to a website where they would be presented with a scanned image and a text box in which to enter a transcription of the text found in the image.
Now let's take that idea a step further: instead of waiting for users to remember to log in, let's proactively send them an email daily (or at whatever rate the user prefers) with the next image to transcribe and an HTML form in which to enter the transcription. Or perhaps provide a customized RSS feed, to which they can subscribe with their newsreader, that delivers the same thing on a recurring basis. My wife suggested the wise addition of a rate-limiting mechanism: a new email or feed item would only be "sent" upon completion of the "pending" item. That way you wouldn't get a whole lot of these things stacked up in your inbox.
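To make the rate-limiting idea concrete, here is a rough sketch in Python of how the "one pending item at a time" rule might work. It is only an illustration under my own assumptions: the names `Volunteer`, `TranscriptionQueue`, `next_item`, and `submit` are hypothetical, the image IDs are made up, and the actual sending of email or publishing of feed items is left out entirely.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Volunteer:
    """Hypothetical volunteer record; `pending` holds the image awaiting transcription."""
    email: str
    pending: Optional[str] = None


class TranscriptionQueue:
    """Hands out scanned images one at a time per volunteer (illustrative only)."""

    def __init__(self, image_ids):
        self.unassigned = deque(image_ids)  # images nobody has been sent yet
        self.transcriptions = {}            # image ID -> submitted text

    def next_item(self, volunteer):
        """Return the next image ID to send, or None if nothing should be sent.

        Nothing is sent while the volunteer still has a pending item,
        or when no unassigned images remain.
        """
        if volunteer.pending is not None or not self.unassigned:
            return None
        volunteer.pending = self.unassigned.popleft()
        return volunteer.pending

    def submit(self, volunteer, image_id, text):
        """Record a transcription and clear the volunteer's pending item."""
        if volunteer.pending != image_id:
            raise ValueError("image is not pending for this volunteer")
        self.transcriptions[image_id] = text
        volunteer.pending = None
```

A daily mailer or feed generator would call `next_item` for each volunteer and skip anyone for whom it returns `None`. Submitting the form would call `submit`, which frees the volunteer to receive the next image on the following run.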