Language Homogenization at Harvard

On December 20, 2021 I read an interesting paper from CSPI written by Leif Rasmussen. My main takeaway was that among National Science Foundation grant award abstracts there is what Rasmussen calls "Constriction of the space of ideas." In other words, the language of NSF grant abstracts is becoming increasingly similar.

The constriction is demonstrated by a decrease in cosine distance among word embeddings and word frequency vectors between the years 1990 and 2020. Now, I know that sentence was perfectly clear, but, just for fun, let's break that down a little.


Suppose we were going to rate words on a scale of one to ten in two dimensions. First, we are going to rate how old fashioned a word is. Second, we will rate how funny the word seems to us. For each word we will get a pair of numbers.

For example, if we wanted to rate "poppycock" we might say that it's a 9 for "old fashioned" and an 8 for "funny". Poppycock would be 9, 8. Another word, like "snail", isn't especially old fashioned or funny, though the word has been around a while, so we could call it a 5, 2.

Now that we can turn any word into a pair of numbers we can also turn a sequence of words, called a "document", into a pair of numbers. Just convert all the words in the document into pairs of numbers and then find the average of those values.

My snail speaks poppycock. = 7, 4
That swain is up to some jiggery-pokery. = 8, 3
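Here is a minimal sketch of that averaging step in Python. The ratings for "my" and "speaks" are invented purely so that the average lands on the 7, 4 used above, and the `RATINGS` table and `document_vector` function are illustrative names, not part of any library.

```python
# Hand-assigned ratings: (old-fashioned, funny), each on a 1-10 scale.
# "poppycock" and "snail" come from the text; "my" and "speaks" are made up.
RATINGS = {
    "poppycock": (9, 8),
    "snail": (5, 2),
    "my": (6, 2),
    "speaks": (8, 4),
}


def document_vector(words, ratings):
    """Average the rating pairs of every word that has a rating."""
    rated = [ratings[w] for w in words if w in ratings]
    if not rated:
        raise ValueError("no rated words in document")
    n = len(rated)
    # zip(*rated) groups the first numbers together, then the second numbers.
    return tuple(sum(dim) / n for dim in zip(*rated))


print(document_vector(["my", "snail", "speaks", "poppycock"], RATINGS))  # (7.0, 4.0)
```

Words without a rating are simply skipped, which is one possible choice among several.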

We can treat the pair of numbers as a coordinate in two dimensional space. We can imagine an arrow drawn from the origin (0, 0) with an arrow head at our coordinate. A document, which could be anything from a single word, to a simple sentence, or even a full length novel, can be an arrow in 2-d space. Two documents become two arrows.

Example of two documents in 2-d space as arrows.

We can then find the cosine similarity, which is the cosine of the angle between the two arrows. In general it can range from -1 to 1, but because our ratings are never negative it gives us a measure, between zero and one, of how similar the two vectors are.

In our example the cosine similarity between our two vectors is 0.987. That tells us that our documents are similarly composed of old timey silly words. Which is true.

Cosine distance is just 1 - cosine similarity. It's a measure of how different two vectors are: the smaller the distance, the more similar the vectors.
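To check the arithmetic, here is a small sketch using only the standard library. The function names are my own. It computes the similarity for the two example sentences (7, 4 and 8, 3) and then the 1 - similarity measure just described.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """One minus the cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


snail_sentence = (7, 4)
swain_sentence = (8, 3)
print(round(cosine_similarity(snail_sentence, swain_sentence), 3))  # 0.987
print(round(cosine_distance(snail_sentence, swain_sentence), 3))    # 0.013
```

The same two functions work unchanged for vectors of any length, which is what makes the jump from 2 dimensions to 96 or 300 painless.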

Our sentences aren't really that similar though. They only appear to be similar because we are just evaluating them on two dimensions - old-timeyness and silliness. What if we were to evaluate the sentences on 96 dimensions? Or 300? As we expand the number of dimensions we are evaluating we expand our description of words and the extent to which we can tell whether words, or documents, are similar or different.

Evaluating every word on 300 dimensions would be tedious. Luckily machine learning models can now do that for us. A model can process a bunch of text and learn to represent words as vectors. Once that is in place, we can do the same evaluation from our example, just in 300 dimensions instead of 2 dimensions - the math is the same.

Now that we understand the vocabulary, let's review what the CSPI paper did.

  1. Get all of the accepted grant abstracts between 1990 and 2020.
  2. Turn each of these documents into a vector representing the average vector of all the word vectors of each word in the document.
  3. Go year by year and calculate the average cosine distance between grant abstracts.
  4. Finally, plot the cosine distance over time to reveal a decline.

CSPI graph illustrating decline in average cosine distance for word embeddings

What that graph shows is a decline in the cosine distance. The abstracts are getting increasingly similar. Or, as Rasmussen put it - "Constriction of the space of ideas."
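The four steps listed earlier can be sketched roughly as follows. This is my reading of the procedure, not Rasmussen's code: I'm assuming the yearly figure is the mean distance over every pair of documents from that year, and the toy 2-d "word vectors" are made up for illustration.

```python
import math
from itertools import combinations


def mean_vector(vectors):
    """Step 2: a document vector is the average of its word vectors."""
    n = len(vectors)
    return [sum(dim) / n for dim in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_pairwise_distance(doc_vectors):
    """Step 3: average cosine distance over all pairs of documents."""
    pairs = list(combinations(doc_vectors, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def distance_by_year(docs_by_year):
    """Map year -> mean pairwise distance; each document is a list of word vectors."""
    return {
        year: mean_pairwise_distance([mean_vector(doc) for doc in docs])
        for year, docs in sorted(docs_by_year.items())
    }


# Made-up corpus: two documents per year, each a list of 2-d word vectors.
toy_corpus = {
    1990: [[[1, 0], [1, 0]], [[0, 1]]],          # documents point in different directions
    2020: [[[1, 1]], [[2, 2], [4, 4]]],          # documents point in the same direction
}
for year, distance in distance_by_year(toy_corpus).items():
    print(year, round(distance, 3))
```

Comparing every pair grows quadratically with the number of documents in a year, so a real run over thousands of abstracts would likely want a vectorized or sampled version of `mean_pairwise_distance`.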

Language at The Crimson

Since reading the CSPI paper I've been on the lookout for a dataset to which I could apply a similar analysis. I found one in Harvard's The Crimson, the university's daily newspaper, which has an archive going back to before 1900.

I scraped all of the articles between 1900 and 2020 to create my dataset. Dataset available as a torrent here.

I then used spaCy's pretrained "en_core_web_sm" model for word embeddings to compute an annual average cosine distance and plot it. What I found will shock you...
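For anyone who wants to try this, turning articles into vectors with spaCy looks roughly like the sketch below. It assumes spaCy is installed and the model has been downloaded (`python -m spacy download en_core_web_sm`). The `document_vectors` helper is a name I made up, not the exact code behind the plot.

```python
def document_vectors(texts, model_name="en_core_web_sm"):
    """Return one vector per text, using spaCy's Doc.vector."""
    # Imported inside the function so this file still loads without spaCy installed.
    import spacy

    nlp = spacy.load(model_name)
    # nlp.pipe streams the texts through the pipeline in batches.
    return [doc.vector for doc in nlp.pipe(texts)]


# Example usage (needs spaCy and the model installed):
# vectors = document_vectors(["My snail speaks poppycock.",
#                             "That swain is up to some jiggery-pokery."])
# print(len(vectors), vectors[0].shape)
```

spaCy's documentation notes that the small (`sm`) packages do not ship static word vectors, so `doc.vector` there is derived from the pipeline's context-sensitive tensors. The `md` and `lg` packages carry 300-dimensional word vectors instead.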

Okay, maybe it won't shock you. But, what I found is similar to Rasmussen's result. A steady decline in average annual cosine distance between articles. Each dot shows an annual average cosine distance for a year between 1900 and 2020 and the solid line is a regression line showing the trend in cosine distance.
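A trend line like that can be computed with ordinary least squares. The sketch below assumes a straight-line fit, and the three data points are made up for illustration rather than taken from the plot.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (year, average distance) points, NOT the real Crimson values.
years = [1900, 1960, 2020]
distances = [0.30, 0.25, 0.20]
slope, intercept = fit_line(years, distances)
print(slope < 0)  # True: a negative slope means distance falls over time
```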

Can we say anything about what causes this? Not definitively, no. Rasmussen connects the decline in lexical diversity with an increase in diversity words - equity, diversity, inclusion, gender, marginalize, underrepresented, and disparity.

Rasmussen does not say exactly this, but my sense from reading his paper is that he thinks NSF grant applicants are becoming more focused on the political notion of diversity, and that this focus is constricting their ideas, making their writing more and more similar to that of other applicants over time.

We can illustrate this idea with The Crimson dataset by adding a line to our graph that represents the percentage of news articles in a year containing at least one of our diversity words.

The right hand y-axis shows the percentage of articles for a given year that contain at least one diversity word.
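Counting that percentage is straightforward. In the sketch below the word list comes from Rasmussen's set quoted earlier, while the exact-match tokenization and the two sample articles are my own assumptions. Note that exact matching will not count variants such as "marginalized" or "disparities".

```python
import re

DIVERSITY_WORDS = {
    "equity", "diversity", "inclusion", "gender",
    "marginalize", "underrepresented", "disparity",
}


def contains_any(text, words):
    """True if the text contains at least one of the words (exact, case-insensitive)."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return not tokens.isdisjoint(words)


def percent_with_words(articles_by_year, words):
    """Map year -> percentage of that year's articles containing any of the words."""
    return {
        year: 100.0 * sum(contains_any(text, words) for text in texts) / len(texts)
        for year, texts in sorted(articles_by_year.items())
        if texts
    }


# Made-up articles for illustration only.
toy_articles = {
    2020: [
        "The committee released a report on diversity in admissions.",
        "The football team won on Saturday.",
    ],
}
print(percent_with_words(toy_articles, DIVERSITY_WORDS))  # {2020: 50.0}
```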

On the one hand, distance and diversity words are inversely correlated. The correlation between the two is -0.68. Diversity words go up as distance goes down.
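For reference, a correlation like that can be computed as below. I'm assuming the Pearson coefficient here, and the two short series are made up to show the sign convention, not the real yearly values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up series: one rises steadily while the other falls steadily.
word_share = [1, 2, 3, 4]
distance = [8, 6, 4, 2]
print(round(pearson(word_share, distance), 3))  # -1.0
```

A value near -1 means the two series move in opposite directions almost in lockstep, and a value near 0 means no linear relationship.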

On the other hand, that's not too surprising. The diversity words are new (new-ish). Anything else newish would be inversely correlated with cosine distance too. For example, I repeated the same experiment with "tech" words like Google, YouTube, iPhone, browser, and so on. Those tech words correlated at -0.6 with cosine distance.

Another argument against connecting distance and diversity is that distance has been on a long-running decline since 1900, including the first four decades, when diversity words were basically flat. When diversity words pop in the 90's there isn't an immediate reaction in cosine distance; it's only about a decade later, in 2000, that cosine distance takes a steep drop.

I don't really have a better idea than Rasmussen. I'm certainly sympathetic to the notion that, if 20% of your news articles are at least mentioning diversity, then that is an inherent focus (or constriction) of your idea-space. But, I don't really see it in the graph.

I'm using a different dataset than Rasmussen - he looked at NSF grant abstracts and I looked at The Crimson articles. But, in both datasets we see the same trend. A decline in lexical diversity as measured by cosine distance of document vectors from word embeddings AND an increase in documents using diversity words. It's interesting, even though I don't really know what is going on with this.