U.S. Supreme Court uses corpus created by BYU professor Mark Davies

His billion-word online corpora aid study of language, culture

Published: Wednesday, Sept. 2 2015 12:41 a.m. MDT

BYU linguistics professor Mark Davies (Provided by Mark Davies)

PROVO — Linguists like to joke that you can tell a lot about a word by the words it hangs out with. And 50 years ago, "gay" was hanging out with "grave" and "brilliant" while "sex" palled around with "hygiene" and "conflicts." Today, the words "gay" and "sex" have much more controversial companions, illustrating not only a change in the grammatical and lexical structure of the English language, but a cultural shift as well — all seen through the study of words.

"I love linguistics," said BYU linguistics professor Mark Davies. "I love looking at how and why language changes, but I'm equally as interested in history and culture, and language can serve as a beautiful window on that."

Davies, regarded by many in the linguistic community as a standard setter, has created a window of more than 1 billion words, gathered from books, magazines, newspapers, academic sources and transcribed interviews.

His corpora (the plural of corpus, a Latin word meaning a body or collection of writings used for analysis) are the largest free collections of English words on the Internet. Tens of thousands of users search them each month, from linguists, teachers and students to district judges and Supreme Court justices, all trying to make sense of this odd language we call English.

Corpora in courtrooms

Thirty years ago, when lawyers or judges disagreed on a word's meaning, there were two solutions: dictionaries or telephone surveys.

"Both of them are unreliable," said BYU linguistics professor and department chair William Eggington. "Dictionaries are usually way behind the times and usually don't cover the full range of the meaning of the word, and in a dictionary, there's no way to measure frequency, how often this meaning is used. And surveys, they're hit and miss. But then the corpus comes along and changes everything."

Suddenly, instead of relying on stale definitions or unscientific survey methods that left room for doubt, judges and attorneys could turn to hefty databases that painted a much more accurate picture of words in context, he said.

These corpora are even finding their way into high-profile cases, like the March 1 Supreme Court decision, where Chief Justice John Roberts cited corpus data as a foundation for limiting the descriptive ability of 'personal' to people, not corporations.

AT&T had been asking for a "personal exemption" so they didn't have to reveal certain financial documents. It made perfect sense, the company argued, because legally a business can already be considered a "person."

The high court disagreed. In his opinion, Roberts pointed out that adjectives are usually related to their corresponding noun, but there are exceptions, such as "corn" and "corny" or "crank" and "cranky."

He also cited specific corpus data about how the word 'personal' has been used over time — data provided in an amicus brief by attorney Neal Goldfarb on behalf of The Project On Government Oversight.

"We do not usually speak of personal characteristics, personal effects, personal correspondence, personal influence, or personal tragedy as referring to corporations or other artificial entities," Roberts wrote. "This is not to say that corporations do not have correspondence, influence, or tragedies of their own, only that we do not use the word 'personal' to describe them."

Though AT&T may be upset by the results of the corpus-based decision, linguists are encouraged by this growing reliance on contextual word banks.

"I expect that we'll see this trend (of using corpus data in courts) accelerating in the future," said Mark Liberman, director of the Linguistic Data Consortium and a professor of phonetics at the University of Pennsylvania. "New kinds of analysis have become feasible, old kinds of analysis have become easier and properly interpreted corpus data is persuasive to judges and juries."

Besides being persuasive, corpus data are often quite fascinating, like the "gangster rap" corpus Eggington was asked to create for one criminal case.

A white gangster rapper suspect from Kansas City, Mo., had been charged with homicide, but after he referred to his black victim using the N-word, prosecutors bumped the charge up to a hate crime.

Yet defense attorneys wanted to show that their client referred to everyone that way, not just African-Americans.

After Eggington created his corpus, discovering along the way a slew of things he never wanted to know, he found that the majority of rap music is purchased by young, white males who really do refer to each other using the N-word, thus showing that the term alone wasn't an "automatic indicator of racial hatred," he said.

"(Corpus data) in language-related cases is probably the same advancement (level) as DNA or the use of fingerprints, though it's not nearly as dramatic," Eggington said. "It doesn't get on the crime shows."

Beyond a dictionary

The first time the Supreme Court justices mentioned a dictionary in an opinion was in 1785, when they quoted an attorney's reference to one.

Since then, their reliance on dictionaries has only increased, setting what some call a dangerous precedent.

As a young law student, Kevin Werbach published an article in the Harvard Law Review in 1994 arguing that the Supreme Court should "exercise greater sensitivity in its use of dictionaries."

Up until the 1980s, the court referenced the dictionary only occasionally and never more than 15 times a term. But between 1987 and 1992, the high court cited dictionaries more than 15 times a term, with 32 references in 1992, Werbach said.

Yet, the justices never specified their reasons for choosing certain dictionaries or definitions, Werbach pointed out, thus missing a crucial step.

Though it's been more than 20 years since his study, Werbach, now a professor of legal studies at the Wharton School, University of Pennsylvania, contends his basic argument is still valid.

"If a court relies on a dictionary to determine the meaning of a term, it should recognize that dictionaries have their own variations and biases," he said. "And the courts should recognize that language evolves. We can see how a word was used in 1800, but that doesn't necessarily tell us what it means today, because the world is so different."

Utah Supreme Court law clerk Stephen Mouritsen echoes that concern, having completed his master's degree with Davies before graduating from BYU's law school.

In his piece published in the coming issue of the BYU Law Review, "The Dictionary Is Not a Fortress: Definitional Fallacies and a Corpus-Based Approach to Plain Meaning," Mouritsen points out that judges often assign too much weight to definitions that are listed higher on the page than others, falling into the "basic human presumption that the most important things should go first in a sequence," he wrote.

Yet in the preface to Webster's Third New International Dictionary, the dictionary most cited by the high court, the editors caution that the use of numbers and listing words is merely a "lexical convenience" that doesn't imply a level of importance.

"The corpus can do what dictionaries and human intuition cannot do alone," he said. "Determine which of the competing senses of a given term are most common, or ordinary, in a given context."

And unlike a dictionary with one or a small handful of authors, corpora show no linguistic bias, Davies said.

"You just scoop up … the text," he said, "and what's in there is what people are (saying)."

Window to history

But these corpora aren't just for judges or linguists, and they offer way more than just answers to complex lexical quandaries.

"It's a depository of not just linguistic data, but of cultural data," Davies said. "If I weren't a linguist, I'd be a cultural historian. I love using it to see how views on issues have changed over the last 200 to 250 years."

Take the word "web" for instance. From 1950 to 1970, the word was contextually surrounded by words like spin, safety, weaving, glass, woven, delicate and spun.

Jump ahead to 1980 to 2000 and surrounding words now include site, sites, page, visit, http, information, world and e-mail.

In a broader glimpse, from 1900 to 1950, the word "gay" pulls up surrounding words like grave, spirits, brilliant, heart and lady. The comparison search from 1960 to 2000 pulls up top words of lesbian, rights, straight, community, military and marriage.

These comparison searches were done in Davies' 400 million-word Corpus of Historical American English (COHA), which was funded by a grant from the National Endowment for the Humanities. They show that something beyond grammar has been changing and is continuing to evolve over time.
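For readers curious about the mechanics, a comparison search of this kind can be approximated by counting which words appear near a target word in texts from two periods. The sketch below is only an illustration and not how COHA itself is implemented: the function name `collocates`, the four-word window, the tiny stopword list and the sample sentences are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical stopword list; a real study would use a much fuller one.
STOPWORDS = frozenset({"the", "a", "to", "of", "our", "for"})

def collocates(texts, node, window=4, top=5, stopwords=STOPWORDS):
    """Return the `top` words found within `window` tokens of `node`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z'-]+", text.lower())
        for i, tok in enumerate(tokens):
            if tok != node:
                continue
            start = max(0, i - window)
            # Words just before and just after this occurrence of the node.
            neighbors = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(
                w for w in neighbors if w not in stopwords and w != node
            )
    return counts.most_common(top)

# Invented sample sentences standing in for two time periods.
early = [
    "the spider began to spin a delicate web",
    "a web woven of delicate silk",
]
late = [
    "visit our web site for more information",
    "the web site has a new page",
]

print(collocates(early, "web"))
print(collocates(late, "web"))
```

On these invented sentences, "delicate" is the most frequent neighbor of "web" in the early set and "site" in the late set. A real corpus search works on hundreds of millions of words and typically adjusts for how common each neighboring word is overall.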

Along with identifying semantic changes, the COHA can also illustrate a word's increasing or decreasing popularity since 1810.

The word "naughty" brings up slightly bell-shaped results, growing until 1870, when it peaked, then decreasing, with a slight bump in 1960 and again in 2000.

Users looking for the word "teenager" won't find it before 1940, though it's been growing rapidly in use ever since, Davies said.

"This is not just a linguistic issue," he said. "Something culturally was going on in the United States that kids in their teens were viewed as being different from adults starting in the '40s and '50s."

After WWII, the country was financially stable enough that young people no longer worked in factories like their mothers or fathers, Davies said. Instead, they now had time to hang out with friends, listen to records and go to the malt shop. And with that, they merited their own title.

Davies' corpora, especially the Corpus of Contemporary American English, with its 410 million-plus words from 1990 to 2010, can also shed light on the growth and expansion of phrases and grammatical constructions, such as the "like" phrase ("I'm like, no way"), which, as could be expected, is far more prevalent in spoken English than in fiction and almost nonexistent in academic writing.

Those interested in checking out their own terms can take a five-minute video tour on the sites and then dive into their own cultural corpus research.

And each month nearly 70,000 people do. Along with fellow linguists, translators, language teachers and language students are also big-time users, eager to learn how English is really spoken by native speakers.

But be warned — it's addicting.

"It's ruined me," Davies jokes. "I can barely read a book or even a newspaper article, because I'll see a word or a phrase, and I'll say, 'Wow, that's an interesting phrase, I wonder what's been going on with that in the last 60, 70 years.'"

So he plugs it into the corpus and gets a glimpse into history.

"To be aware of language change is not going to make you a better person and it may not make you speak or write differently," he continued, "but doggone it, it's fascinating."

e-mail: sisraelsen@desnews.com

Copyright 2015, Deseret News Publishing Company