U.S. Supreme Court uses corpus created by BYU professor Mark Davies
His billion-word online corpora aid study of language, culture
PROVO — Linguists like to joke that you can tell a lot about a word by the words it hangs out with. And 50 years ago, "gay" was hanging out with "grave" and "brilliant" while "sex" palled around with "hygiene" and "conflicts." Today, the words "gay" and "sex" have much more controversial companions, illustrating not only a change in the grammatical and lexical structure of the English language, but a cultural shift as well — all seen through the study of words.
"I love linguistics," said BYU linguistics professor Mark Davies. "I love looking at how and why language changes, but I'm equally as interested in history and culture, and language can serve as a beautiful window on that."
Davies, regarded by many in the linguistic community as a standard setter, has created a window of more than 1 billion words, gathered from books, magazines, newspapers, academic sources and transcribed interviews.
His corpora (the plural of corpus, the Latin word meaning a body or collection of writings used for analysis) are the largest free collections of English words on the Internet, searched by tens of thousands of users each month, from linguists, teachers and students to district judges and Supreme Court justices, all trying to make sense of this odd language we call English.
Corpora in courtrooms
Thirty years ago, when lawyers or judges disagreed on a word's meaning, there were two solutions: dictionaries or telephone surveys.
"Both of them are unreliable," said BYU linguistics professor and department chair William Eggington. "Dictionaries are usually way behind the times and usually don't cover the full range of the meaning of the word, and in a dictionary, there's no way to measure frequency, how often this meaning is used. And surveys, they're hit and miss. But then the corpus comes along and changes everything."
Suddenly, instead of relying on stale definitions or unscientific survey methods that left room for doubt, judges and attorneys could turn to hefty databases that painted a much more accurate picture of words in context, he said.
These corpora are even finding their way into high-profile cases, like the March 1 Supreme Court decision in which Chief Justice John Roberts cited corpus data as a foundation for holding that the word "personal" describes people, not corporations.
AT&T had been asking for a "personal exemption" so it would not have to reveal certain financial documents. It made perfect sense, the company argued, because legally a business can already be considered a "person."
The high court disagreed. In his opinion, Roberts pointed out that adjectives are usually related to their corresponding nouns, but that there are exceptions, such as "corn" and "corny" or "crank" and "cranky."
He also cited specific corpus data about how the word "personal" has been used over time, data provided in an amicus brief by attorney Neal Goldfarb on behalf of the Project On Government Oversight.
"We do not usually speak of personal characteristics, personal effects, personal correspondence, personal influence, or personal tragedy as referring to corporations or other artificial entities," Roberts wrote. "This is not to say that corporations do not have correspondence, influence, or tragedies of their own, only that we do not use the word 'personal' to describe them."
Though AT&T may be upset by the results of the corpus-based decision, linguists are encouraged by this growing reliance on contextual word banks.
"I expect that we'll see this trend (of using corpus data in courts) accelerating in the future," said Mark Liberman, director of the Linguistic Data Consortium and a professor of phonetics at the University of Pennsylvania. "New kinds of analysis have become feasible, old kinds of analysis have become easier and properly interpreted corpus data is persuasive to judges and juries."