Wednesday, December 11, 2013

A good trick? (and is it legit?)

Every couple of years I team-teach the lit half of a "Connected" pair of classes with my friend, the mathematician Bill Goldbloom Bloch. "The Edge of Reason" links my SciFi class with his Math Thought class over the course of an entire year, with us alternating teaching days and each prof sitting in on and participating in all of the other's classes. It's incredibly fun, and I learn a lot about math and, maybe more importantly, about how mathematicians think. Bill says that most mathematicians have 1 or 2 "good tricks," ways of conceptualizing the world or handling problems, that allow them to make multiple discoveries. Going beyond mathematicians: physicist Richard Feynman had his "integrate over all paths" trick, Einstein had his visualizations, etc. My own "good tricks" have included "read the whole thing" (you'd be amazed by how few scholars do) and "push the metaphor until it breaks."

Now I think I have a new one. And it's mathematical.

This summer our research group was working on the problem of thorn / eth distribution. We were having trouble visualizing the data. I don't know why, but suddenly into my mind popped the notion of a rolling average, something I think I'd learned way back in high school and which had shown up when I was being creative with budgets to avoid laying people off during the financial crisis: it turns out that the amount of money you are allowed to draw from an endowment's revenue stream is based on a rolling average of the returns in several previous quarters. This saved me during the crash, as we had a little more money right at the beginning (since the previous quarters were propping up the average), so we could at least give visiting and part-time faculty a year or two to try to find something else instead of just dropping them into a terrible economy (my sole accomplishment as department chair was that I didn't lay anyone off or fail to renew a contract).

So I started calculating the rolling ratio of θ (total number of þ divided by total number of þ plus total number of ð) through a text: choose a "window" of words or letters, add up all the thorns and eths in that window, calculate θ, and then move the window one unit to the right and re-calculate. The plots of the rolling ratios turn out to be very interesting. I'm just finishing up a paper now on what they might tell us about a work's textual history.
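The windowing procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the research group's actual code: the window is measured in characters, it advances one character at a time, upper-case Þ and Ð are folded to lower case, and a window with no interdentals at all is recorded as `None` because θ is undefined there. The function name `rolling_theta` is hypothetical.

```python
THORN = "þ"
ETH = "ð"


def rolling_theta(text, window):
    """Return theta = thorns / (thorns + eths) for each window position.

    Assumption: the window is a count of characters and moves one
    character at a time. Windows with no thorns or eths give None.
    """
    chars = text.lower()  # fold capital thorn and eth to lower case
    ratios = []
    for start in range(len(chars) - window + 1):
        chunk = chars[start:start + window]
        thorns = chunk.count(THORN)
        eths = chunk.count(ETH)
        total = thorns + eths
        # theta is undefined when the window holds no interdentals
        ratios.append(thorns / total if total else None)
    return ratios
```

Calling `rolling_theta(text, 500)` on a transcription would give one θ value per window position, ready to plot against position in the text. A real analysis would also have to decide how to treat the undefined windows (skip them, or widen the window).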

But I have been worried--following a chance remark by Janet Bately at ISAS Dublin--that all we were detecting with θ was the frequency of first-, second- or third-person plural present tense or plural imperative verbs. These forms end with an interdental, and there certainly seems to be a correlation between terminal interdentals and scribal use of ð (most famously by the B-scribe of Beowulf, but elsewhere as well). I wanted to know if θ was just a complicated proxy measurement for portions of the poem in the plural present tense or the imperative.

So we developed another measure, τ, which is the ratio of terminal interdentals (þ and ð) to the total number of interdentals in a passage. We calculated τ as a rolling ratio as well, and then compared the plots of τ and θ.
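A comparable sketch for τ, again as an illustration rather than the method actually used: it assumes the passage has already been split into words with punctuation stripped, counts a word as having a terminal interdental if it ends in þ or ð, and measures the rolling window in words. The names `tau` and `rolling_tau` are hypothetical.

```python
INTERDENTALS = ("þ", "ð")


def tau(words):
    """Ratio of word-final interdentals to all interdentals in `words`.

    Assumption: each item is a single word with punctuation removed.
    Returns None when the passage contains no interdentals.
    """
    terminal = 0
    total = 0
    for word in words:
        w = word.lower()
        total += sum(w.count(ch) for ch in INTERDENTALS)
        if w.endswith(INTERDENTALS):  # str.endswith accepts a tuple
            terminal += 1
    return terminal / total if total else None


def rolling_tau(words, window):
    """tau for each window of `window` consecutive words, step of one word."""
    return [tau(words[i:i + window])
            for i in range(len(words) - window + 1)]
```

For the two plots to be comparable, θ and τ would need to be computed over the same windows, so in practice one unit (words or letters) has to be chosen for both.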

Sometimes these plots appear to be negatively correlated with each other: when τ goes down, θ increases, but other times, not so much. And just looking at the graphs wasn't entirely satisfactory. So I calculated Pearson correlation coefficients between τ and θ. It turns out that these are pretty ambiguous when applied to whole texts, generally being on the order of .3 (1.0 would be perfect correlation and 0 would be no correlation at all). That wasn't entirely helpful: with an r of .3, tense and number could be contributing to θ, but other things (textual history) could be as well.
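For reference, this is the standard sample Pearson coefficient written out for the two series. The indexing is an assumption made for illustration: θ_i and τ_i stand for the two rolling ratios at window position i, and the bars denote their means over the n positions being compared.

```latex
r = \frac{\sum_{i=1}^{n} (\theta_i - \bar{\theta})(\tau_i - \bar{\tau})}
         {\sqrt{\sum_{i=1}^{n} (\theta_i - \bar{\theta})^2}\,
          \sqrt{\sum_{i=1}^{n} (\tau_i - \bar{\tau})^2}}
```

The value of r always lies between -1 and 1, with negative values meaning that one series tends to fall as the other rises.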

Then last night I was staring in frustration at the τ and θ graphs for the Old English Genesis, and it hit me: there was a visible correlation between τ and θ in Genesis B, but not in Genesis A. I quickly calculated the Pearson correlation coefficient for each poem and indeed, Genesis B is highly correlated, with an r of .69, while Genesis A is only weakly correlated.

And here's where both the "good trick" and my question of legitimacy come in. I realized that I could do the rolling window trick with the correlation coefficient. Calculate τ and θ, then choose a window length and calculate the correlation coefficient for that window. Then shift to the right and recalculate r. Plot the whole thing.

Except that it was hard to read the plot, since you ended up with both positive and negative correlations (negative correlation just means that when one variable goes up, the other goes down; it's just as much a correlation as a positive one). So I had the idea of taking the absolute value of r and plotting that. When you do so, you get very interesting results. Genesis B, for example, jumps right out of the Genesis plot. So too does the canticle-sourced material in Daniel and the section of Christ III that's based on the sermon of Caesarius of Arles.
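The rolling |r| idea can be sketched as follows. This is a minimal illustration, not the code behind the plots described here: it assumes θ and τ have already been computed as two equal-length lists of numbers with no undefined entries, and it returns `None` for any window in which one series is constant, since r is undefined there. The names `pearson_r` and `rolling_abs_r` are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length numeric sequences.

    Returns None if either sequence has zero variance (r is undefined).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def rolling_abs_r(theta, tau, window):
    """Absolute Pearson r of theta vs. tau in each window, step of one."""
    if len(theta) != len(tau):
        raise ValueError("theta and tau must have the same length")
    out = []
    for start in range(len(theta) - window + 1):
        r = pearson_r(theta[start:start + window],
                      tau[start:start + window])
        out.append(None if r is None else abs(r))
    return out
```

Plotting the output against window position gives a single curve in which strongly correlated stretches stand out regardless of the sign of the correlation. Note that taking the absolute value discards the direction of the relationship, so the signed values may still be worth keeping alongside.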

My tentative conclusion: because not all scribes consistently followed the "terminal interdental to be represented by ð" rule, the correlation between τ and θ is actually useful data. Instead of simply invalidating θ, the correlation--and its absence--tells you something about the copying history of the text. My hunch is that it's the later scribes who produce segments with closely correlated τ and θ, so when we don't see the correlation, we can hypothesize that we're looking at a text that was written and copied earlier, one in which the inertia of the earlier forms is influencing the final copy.

But my worry is that a rolling Pearson's correlation coefficient is somehow statistically or mathematically illegitimate. You've got two rolling ratios (τ and θ), each of which over-samples many of the same data points (because the same point is going to influence multiple windows) and then you're doing the same kind of rolling comparison with over-sampling with the relatively complex Pearson formula. I'm worried that my lack of mathematical and statistical sophistication has led me to miss something that should cancel out something else. Unfortunately, it is finals week, so I can't meet with my friend and co-author the statistician for a while at least, so I just have to live with being both excited at a potential discovery and worried that at any moment the intellectual floor is going to collapse out from under it.