Monday, January 31, 2011

A Nice Little Trick Enabled by Lexomics (and Excel)

It's nice when people overhear conversations and then help you.

I was at the climbing gym (Rock Spot Climbing in Boston -- the best climbing gym anywhere) and, in between bouldering runs, was talking with my wife about how my research was coming.  Somehow we got to talking about whether Excel could speed up some of my searching.  A guy at the gym overheard, said he had been the Excel guru for a Psych research project, and offered to help.  What follows comes from that brief collaboration.  By combining material on the Lexomics website with Excel, you can do some interesting searching for uncommon words in the Anglo-Saxon corpus.


Let's say you are researching a particular Old English poem, for example Juliana.  You want to look at the more uncommon words in this poem and see if they are shared with the rest of Cynewulf's poetry or with other texts in Anglo-Saxon.


Go to the Lexomics website, choose "tools," and then "word frequencies."  Click "entire corpus" and then "get stats."  Click on the HERE to download this as an Excel file.  You now have a file with a list of every word in the Anglo-Saxon corpus ranked in order of frequency.

Copy the column of words and the column of word frequencies and paste them into a new spreadsheet as column A and column B.

Now go back to the Lexomics website, go to "tools," "word frequencies," and choose the poem of interest.  "Get stats" for that poem and download them by clicking on HERE.  You now have an Excel file with a list of every word in the poem ranked in order of frequency.  Copy the column with the words and paste it into column C in your spreadsheet.

Now you are ready to find those words that appear in your poem and only a few times in the rest of the corpus.

Go to cell D1 and enter the following formula:

=SUM(IF($A$1:$A$x=C1, IF($B$1:$B$x<n, 1)))

where x = the number of the last row in column A (that is, the total number of words in the corpus list) and n = the low-frequency threshold (e.g., n = 5 if you want all words that appear fewer than 5 times).


*important* do not just press ENTER.  Instead, press CTRL-SHIFT-ENTER, so that Excel treats this as an array formula.





Then copy the formula down the D column by clicking the small box in the lower right corner of cell D1 and dragging down to the row of the last word in column C.



It will take a few moments for processing.


When processing is complete, you will have a 0 in every cell in D in which the word does not fulfill the criteria (appearing in your poem and fewer than n times in the corpus; with n = 5, that means 4 times or fewer), and a 1 when the word does fulfill the criteria.


You can search for these 1's manually or use "Conditional Formatting" to bold or color the rows with a 1 in column D.
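
If you would rather script this than use Excel, the same check can be sketched in a few lines of Python.  This is only a rough equivalent of the spreadsheet steps, not anything provided by the Lexomics site: the function name flag_rare_words is my own, the word lists are typed in by hand rather than read from the downloaded files, and the frequency counts and the "example_rare" entries are invented placeholders, not real corpus figures.

```python
def flag_rare_words(corpus_counts, poem_words, threshold):
    """Return one 0/1 flag per poem word, like column D in the spreadsheet.

    A word gets a 1 when it is listed in the corpus counts with a
    frequency below the threshold, and a 0 otherwise.
    """
    flags = []
    for word in poem_words:
        count = corpus_counts.get(word)  # None if the word is not listed
        flags.append(1 if count is not None and count < threshold else 0)
    return flags


# Columns A and B: corpus words and their frequencies.
# NOTE: these counts are invented for illustration only.
corpus_words = ["and", "on", "wuldor", "example_rare_1", "example_rare_2"]
corpus_freqs = [50000, 30000, 400, 3, 1]
corpus_counts = dict(zip(corpus_words, corpus_freqs))

# Column C: the words of the poem you are studying (placeholder list).
poem_words = ["and", "example_rare_1", "wuldor", "example_rare_2"]

# Column D: 1 for words appearing fewer than 5 times in the corpus.
flags = flag_rare_words(corpus_counts, poem_words, threshold=5)

# The words you would then look up in the concordance.
rare_words = [word for word, flag in zip(poem_words, flags) if flag == 1]
print(rare_words)
```

Two differences from the spreadsheet are worth keeping in mind: the dictionary assumes each word is listed only once in the corpus file, and Python compares words case-sensitively, whereas Excel's = comparison ignores case.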




Now you can search these words in the Dictionary of Old English concordance and see where else they appear.  Look for patterns.  Enjoy.