Wednesday, March 05, 2008

I Like the NEH a Lot More Today

I learned today that, along with two colleagues, I am a recipient of an NEH startup grant for "Pattern Recognition through Computational Stylistics: Old English and Beyond." The project has been led by my friend and collaborator (in our linked classes "Anglo-Saxon Literature" and "Computing for Poets"), Prof. of Computer Science Mark LeBlanc, and we are working with Prof. of Mathematics Mike Kahn. Our idea is to take some techniques that have been very successful in bioinformatics and apply them to the Old English corpus (I've pasted in the short description of the project below). Key idea: we don't subordinate traditional humanistic goals to science, and we don't limit the science to making tools. We respect both and get them to work together.

I now have a really fun summer ahead of me, and a new goal: I will have done collaborative projects with four different faculty in Math/Computer Science (with one more project on tap), and it no longer seems unreasonable to think that, before I retire, I could work with all of them. Mathematicians and Computer Scientists are just as much fun to hang out with as Biologists (and I have the ongoing Sheep DNA project there -- getting some minor results), and these projects give me an excuse to lurk around the science center. Fun! (And my kids will be thrilled that they get to run around there this summer.)

Pattern Recognition through Computational Stylistics: Old English and Beyond

Abstract
Entire new worlds of scholarship and research are emerging from a synergy of technology, problem solving, and statistical analysis. This interdisciplinary proposal joins researchers in Old English, statistics, and computer science to show by example how the development and application of computational stylistic tools across an entire corpus enables the large-scale exploration and discovery of patterns and trends that go undetected by the unassisted eye and brain, yet generate novel sets of hypotheses for subsequent scholarship, both digital (through further pattern recognition and analysis) and traditionally humanistic.

From one direction this proposal approaches texts in the same way as genomics (the analysis of DNA texts): we seek information-rich patterns in a long string of letters. From the other direction we come from literary studies, looking for meaning, influence, symbolism, reference, source material, and interpretation. We then meet in the middle, providing a way for literary scholars, computer scientists, and statisticians to work collaboratively to solve complex problems, problems that are so new and so difficult that we really need all the tools we can get. We posit this interdisciplinary tack as a significant contribution to Humanities scholarship: a myriad of new questions can be expected when we embrace the tools and techniques that require computational and statistical expertise and link these approaches with some of the traditional methods of humanistic research.

In particular, such an approach is needed to augment the large and ongoing Dictionary of Old English project (the Dictionary of Old English corpus, the DOE itself, and its new link with the OED). We aim to go beyond traditional lexicography—the painstaking method of looking, individually, at every single word in the corpus—by using statistical models based on relative word frequencies across all texts in a corpus. We can then, for example, generate data that can shed new light on questions such as: Is there a Winchester vocabulary, in slightly different form, in texts other than those previously identified? Is there a “poetic language” statistically different from prose Old English in all poems or only some poems? Are the poems of the Anglo-Saxon Chronicle, given their age and manuscript context, fundamentally different from those of the Exeter Book, Vercelli Book, or Junius Manuscript? Was there a Southeastern ‘koine’ spread by Dunstan? Do writers of similar documents exhibit similar patterns at the very small level as well as in specific vocabulary?

Anticipated outcomes include scalable, open-source software to facilitate the computation and organization of word frequencies and other patterns, as well as empirical measures of success when using various statistical analyses on the condensed data. An additional and essential outcome, from our perspective, is the way this research feeds into the development of course materials for our “connected” (linked, interdisciplinary) undergraduate courses in English, Statistics, and Computer Science, so that computational analyses become a more inviting option for faculty and advanced research students in the Humanities.
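To give a flavor of the "relative word frequencies across all texts" idea in the abstract, here is a minimal sketch in Python. It is not the actual software we're building; the file names, the simple tokenizer, and the normalization choices are all placeholder assumptions (a real Old English corpus needs careful handling of thorn, eth, wynn, and editorial markup).

```python
import re
from collections import Counter

# Hypothetical plain-text files standing in for corpus texts.
files = ["beowulf.txt", "exeter_book.txt", "junius_ms.txt"]

def relative_frequencies(path):
    """Return each word's count divided by the text's total word count."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[^\W\d_]+", f.read().lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

corpus = {path: relative_frequencies(path) for path in files}

# Condense into a text-by-word table over the words every text shares --
# the kind of data a statistical model would then take as input.
shared = sorted(set.intersection(*(set(freqs) for freqs in corpus.values())))
for path, freqs in corpus.items():
    row = [round(freqs[word], 5) for word in shared]
    print(path, row[:10])  # first ten columns of the frequency table
```

A table like this is the "condensed data" the abstract mentions: once every text is reduced to comparable rows of relative frequencies, standard statistical analyses can start hunting for the kinds of patterns the questions above describe.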