Friday, November 08, 2013

A Little Formula

Turns out that you can find out stuff about Old English texts with just a simple formula:

For any text of length n
with a sub-segment of length w < n

θ(k) = þ / (þ + ð)

where k is the first term in w;
þ is the total number of thorns in the segment;
ð is the total number of eths in the segment;
and w+k ≤ n.
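
Here is a minimal sketch of that calculation in Python. It assumes the text has already been split into units (words, in this toy case) and that thorns and eths are counted regardless of case. The names `theta` and `rolling_theta`, the 0-based index `k`, and the choice to return `None` when a window contains no interdentals at all are illustrative assumptions, not necessarily how the real program does it.

```python
def theta(units, k, w):
    """Ratio of thorns to thorns-plus-eths in the w units starting at index k.

    units: list of strings (words, lines, paragraphs...), indexed from 0.
    """
    if k < 0 or w <= 0 or k + w > len(units):
        raise ValueError("window must fit inside the text (k + w <= n)")
    # Lower-casing folds capital Þ and Ð into þ and ð before counting.
    segment = "".join(units[k:k + w]).lower()
    thorns = segment.count("þ")
    eths = segment.count("ð")
    total = thorns + eths
    # With no interdentals in the window the ratio is undefined.
    return thorns / total if total else None


def rolling_theta(units, w):
    """theta for every start position k at which a full window fits."""
    return [theta(units, k, w) for k in range(len(units) - w + 1)]
```

With 0-based indexing, the condition that the window fits is exactly k + w ≤ n, which matches the constraint above.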

The real tricks are figuring out if what you're detecting is significant or just a product of stochastic variation and, if it is statistically significant, whether or not it is just an epiphenomenon of a less interesting process.

As Richard Feynman, one of my intellectual heroes, once said, “The first principle is that you must not fool yourself and you are the easiest person to fool.”

Which is why I've been learning to debug programs in Python and re-learning Stats II from 20+ years ago.

Unfortunately, at least one of the more striking findings is looking like it's just an epiphenomenon. But the good news is that the other discoveries seem like they are pretty robust.


wellinghall said...

What does theta(k) (ie the term to the left of the equals sign) represent, Michael?

Michael said...

What I'm calling theta is the ratio, in any given segment of text, of the number of thorns to the number of thorns plus the number of eths (i.e., the total possible number of interdentals in the segment). Theta k is therefore that ratio in a sequence that starts at unit (word, line, paragraph) k and continues for w units. This is a "forward looking" rolling average. It can be adjusted to be centered (starting at k-(w/2) and ending at k+(w/2)) or backward looking, but overall we're doing a continuous rolling average with a "window" that is w units long to calculate the result for the location, k.
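
The three window alignments described above can be sketched in Python roughly as follows. This assumes the per-unit counts of thorns and eths are already available as two equal-length lists. The function name `windowed_theta`, the `align` parameter, and the 0-based indexing are hypothetical, and a window that would run past either end of the text is simply reported as `None`.

```python
def windowed_theta(thorns, eths, k, w, align="forward"):
    """theta for location k, using a window of w units.

    thorns, eths: per-unit counts (equal-length lists, indexed from 0).
    align: "forward"  -> units k .. k+w-1
           "centered" -> window starts at k - w//2
           "backward" -> units k-w+1 .. k
    """
    if align == "forward":
        start = k
    elif align == "centered":
        start = k - w // 2
    elif align == "backward":
        start = k - w + 1
    else:
        raise ValueError(f"unknown alignment: {align!r}")
    end = start + w  # exclusive upper bound
    if start < 0 or end > len(thorns):
        return None  # window does not fit inside the text
    t = sum(thorns[start:end])
    e = sum(eths[start:end])
    # Undefined when the window holds no interdentals at all.
    return t / (t + e) if (t + e) else None
```

Only the starting point of the window changes between the three cases; the ratio itself is computed the same way each time.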

Michael said...

And the reason for doing this is to see if there are patterns to high- or low-thorn use (there are) and if these are correlated with other things of interest (they are). However, the trick so far is to figure out if we're merely finding an oblique way to see the percentage of third-person verbs in a segment, or if we're finding something else. According to all the correlation coefficients I've been calculating, most of the time we're finding something more interesting. But in at least one case the correlation between theta and tau (number of terminal interdentals divided by total number of interdentals in a segment) is 0.99, so there we have just found a proxy for a shift from second person to third person.
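
One way to run that check is to compute the Pearson correlation between the theta series and the tau series, window by window. The sketch below assumes both series are equal-length lists of numbers, one pair per window. The function name `pearson_r` is just a label for illustration, and the code uses the textbook formula rather than any particular statistics package.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

A result close to 1 for a given text would suggest that theta is mostly tracking tau there; a result well below that leaves room for theta to be picking up something else.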

wellinghall said...

Thanks, Michael.