
Distributed genealogical record extraction
29 March 2005 00:38

Last week I attended the Family History Technology Workshop at BYU, and came away with my head abuzz about all the cool things underway in the genealogical community in terms of emerging technologies.

A common thread through several of the presentations was data extraction and indexing. There is so much genealogically rich information out there that sits on dusty archive shelves, unavailable to most people. Two main processes stand between that data and the general Internet audience:

  • Digitization
  • Extraction

You first have to scan or image the "offline" materials to make them available online. Once they're in a digital format they then need to be transcribed or "extracted" so as to make the information indexable and ultimately searchable.
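
To make the "indexable and ultimately searchable" part concrete, here is a minimal Python sketch of an inverted index built over text that has already been extracted. The image identifiers and sample transcriptions are made up for illustration, and the function names `build_index` and `search` are hypothetical, not part of any existing genealogy system.

```python
import re
from collections import defaultdict


def build_index(transcriptions):
    """Map each lowercase word to the set of image ids whose text contains it."""
    index = defaultdict(set)
    for image_id, text in transcriptions.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(image_id)
    return index


def search(index, query):
    """Return the image ids whose transcriptions contain every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the first word's matches, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Invented sample data: image id -> transcribed text.
transcriptions = {
    "film-0001-frame-12": "John Smith, born 1843, son of William Smith",
    "film-0001-frame-13": "Mary Jones married John Smith 1867",
    "film-0002-frame-04": "William Jones, died 1902",
}
index = build_index(transcriptions)
matches = search(index, "john smith")
```

Until an image has a transcription, there is nothing to put in the index, which is why extraction is the step that actually unlocks search.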

An idea I had with regard to the extraction step is inspired by Project Gutenberg's Distributed Proofreaders. The Distributed Proofreaders website enables anyone with a web browser and an Internet connection to help add new etexts to the Project Gutenberg archive by proofreading small sections of OCRed text. The user logs in to the website and is presented with the OCR image and a text box. The user then makes any corrections to the text and submits the form.

So if we take that idea and apply it to genealogical records, I think we have the potential to make a lot of information available online that was previously "locked away" (so to speak) on microfilm or other "offline" media.

At the conference, a presenter from the LDS Church's Family History Department spoke about digitization efforts underway at the church. My dream would be to see the entire collection of the church's microfilm records available for viewing online. But as I mentioned above, digitization only gets you halfway there. Once digitized, the church will have a large amount of data that will need to be extracted from the digital images so as to be indexable and searchable. That's where distributed proofreading comes in.

You could apply the PGDP approach and allow volunteers to sign in to a website where they would be presented with a scanned image and a text box in which to enter a transcription of the text found in the image.

Now let's take that idea a step further: instead of waiting for users to remember to log in, let's proactively send them an email daily (or at whatever rate the user prefers) with the next image for them to transcribe and an HTML form in which they can enter the transcription. Or perhaps provide a customized RSS feed, to which they can subscribe with their newsreader, that would deliver the same thing on a recurring basis. My wife suggested the wise addition of a rate-limiting mechanism by which a new email or feed item would only be "sent" upon completion of the "pending" item. That way you wouldn't get a whole lot of these things stacked up in your inbox.
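
The rate-limiting rule is simple enough to sketch in a few lines of Python. This is only an illustration of the idea, not a design for a real service: the class name `TranscriptionQueue`, its methods, and the volunteer and image identifiers are all hypothetical, and the actual sending of email or publishing of a feed item is left out entirely.

```python
from collections import deque


class TranscriptionQueue:
    """Hands out one image per volunteer and withholds the next until it is done."""

    def __init__(self, image_ids):
        self.unassigned = deque(image_ids)  # images nobody has been sent yet
        self.pending = {}                   # volunteer -> image id awaiting transcription
        self.completed = {}                 # image id -> submitted transcription

    def next_item(self, volunteer):
        """Return an image id to send, or None if the volunteer already has a
        pending item or there is nothing left to hand out."""
        if volunteer in self.pending or not self.unassigned:
            return None
        image_id = self.unassigned.popleft()
        self.pending[volunteer] = image_id
        return image_id

    def submit(self, volunteer, text):
        """Record a transcription and clear the volunteer's pending item.
        Raises KeyError if the volunteer has nothing pending."""
        image_id = self.pending.pop(volunteer)
        self.completed[image_id] = text
        return image_id


queue = TranscriptionQueue(["img-1", "img-2"])
first = queue.next_item("alice")      # "img-1" goes out
blocked = queue.next_item("alice")    # None: img-1 is still pending
queue.submit("alice", "John Smith, born 1843")
second = queue.next_item("alice")     # "img-2" goes out only now
```

A daily mailer or feed generator would call something like `next_item` on each run and simply skip any volunteer for whom it returns `None`, so nothing piles up in an inbox.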


Comments

On 28 January 2006 04:59 Richard K Miller wrote:
I wonder if this sort of work could be done using Amazon.com's Mechanical Turk (http://www.mturk.com/), or something like it -- a giant web app that interfaces between people and programs and has group proofing built in.


© 2009, Daniel C. Hanks