This post uses Most Distinctive Words to analyze what we talk about when we talk about Presidents.*
I begin with the Wikipedia pages for each U.S. President. I downloaded these in January and then got distracted with work, so they’re a few months out of date, but still relatively fresh compared to most of the texts I work on. I wasn’t too strict about what I took; basically I started at the top of the article and stopped when I felt the article was over. Just having this much gives you access to an underrated form of quantitative textual analysis: checking how long things are. Here are the word counts for each President’s article:
|George H.W. Bush||10832|
To me this variation appears to have barely any rhyme or reason. LBJ is a solid contender for the top spot; his Presidency is very tough to rank, because it includes both an incredible domestic agenda (Civil Rights Act, Medicare) and arguably the worst foreign policy agenda (Vietnam). But if you take the “absolute value” of everything he did, there’s no denying he’s one of the most consequential Presidents. Fillmore is also a decent contender for last place, with less than a fourth of LBJ’s word count; I think he’s probably high in the running for “most forgotten President”.** But in between, things quickly get strange. Eisenhower ahead of 4-termer FDR? John Tyler ahead of Thomas Jefferson? Harding ahead of Teddy Roosevelt? Monroe near the bottom?
The big lesson here is that these pages are pretty weird artifacts. Their authors will have stylistic tics (maybe Tyler got a verbose guy, and Monroe got an Imagiste), and editorial decisions might displace whole sections into other articles. For example, in Jefferson’s article, the Louisiana Purchase gets about 250 words, but there’s also a standalone article about the Louisiana Purchase that’s about 5,000 words long—i.e., more worthy of discussion than the entire administration and life of Millard Fillmore, according to random Wikipedia editors.
Most Distinctive Words
Still, even with these idiosyncrasies, we ought to be able to extract something interesting from the language of these articles. For instance, which Presidents’ write-ups have the most to do with slavery, or war? What are the most remarked-upon aspects of, say, Teddy’s life, or the founding fathers, or the Gilded Age? What words, if any, set apart the discourse surrounding an icon like Lincoln from that around a tremendous moral failure like Andrew Jackson?
To explore these questions I turned to Most Distinctive Words (MDWs). This is basically a measure of the words that appear more frequently in a given text than we would expect, based on their frequency in some comparison corpus. In my case, that means checking which words appear disproportionately often in one guy’s article, compared to what we’d see if the words were distributed evenly across all articles.*** So, for instance, we might expect to see “atomic” appear distinctively often for Truman, since he dropped more atom bombs than anyone else—and, in fact, “atomic” is a distinctive word for him (though “bombing” gets you Reagan and LBJ as well).
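To make the observed-versus-expected idea concrete, here is a minimal Python sketch. It is not the code behind this post (that was written in R); the toy corpus and the function name `observed_vs_expected` are invented for illustration. The expected count is simply the word's share of the whole corpus, scaled to the length of the article in question.

```python
from collections import Counter

def observed_vs_expected(articles, name, word):
    """Return (observed, expected) counts of `word` in articles[name].

    `articles` maps a (hypothetical) article name to its list of tokens.
    Expected = the word's corpus-wide rate times this article's length.
    """
    corpus = Counter()
    for tokens in articles.values():
        corpus.update(tokens)
    total = sum(corpus.values())

    target = articles[name]
    observed = target.count(word)
    expected = corpus[word] / total * len(target)
    return observed, expected

# Toy data, invented for illustration only.
toy = {
    "truman": ["atomic", "war", "atomic", "korea"],
    "reagan": ["war", "hollywood", "california", "war"],
}
obs, exp = observed_vs_expected(toy, "truman", "atomic")
```

In the toy data, "atomic" shows up twice in the Truman tokens where an even spread would predict once. That gap is what makes a word a candidate MDW.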
A few notes about the MDWs you’ll see in the rest of this post: To make life easier, I converted everything to lowercase (that way “train” and “Train” aren’t different words, just because one appears at the beginning of a sentence). I also removed stop words (things like “the” and “of”, which are so frequent that they can skew things, and also are often boring), numbers, and symbols. Finally, I took out the ordinarily used names of Presidents (so, “andrew”, “jackson”, and “jacksons”, the latter to catch possessives), because otherwise they dominate the data, since they are naturally very distinctive of their articles.
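The cleanup steps above might look something like this in Python. This is a rough sketch under my own assumptions, not the original R code: the stop list is a tiny stand-in, `clean_tokens` is a made-up name, and deleting apostrophes before splitting is just one way to end up with tokens like "jacksons".

```python
import re

# A tiny stand-in stop list; a real one would be much longer.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "was"}

def clean_tokens(text, names_to_drop):
    """Lowercase, keep letters only, then drop stop words and names."""
    text = text.lower()
    # Delete apostrophes (straight and curly) so possessives like
    # "jackson's" collapse to "jacksons" and can be filtered by name.
    text = text.replace("'", "").replace("\u2019", "")
    # Keeping only runs of a-z discards numbers and symbols.
    # (Simplification: accented letters would be dropped too.)
    words = re.findall(r"[a-z]+", text)
    return [w for w in words
            if w not in STOP_WORDS and w not in names_to_drop]

sample = "The Train left in 1829. Andrew Jackson's train of thought was lost."
tokens = clean_tokens(sample, {"andrew", "jackson", "jacksons"})
```

Both "Train" and "train" survive as the same token, while the year, the stop words, and all three name forms fall away.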
The System Works
When you check the MDWs for a particular guy, you usually find a pretty nice encapsulation of his Presidency’s Greatest Hits. Here are the top few for Lincoln:†
You start with his two signature issues, pick up his home states, roll through his political acts and opponents, and even capture his assassin and, three cells later, one after the other, the reason he was killed. Another good example is Andrew Jackson:
|Andrew Jackson MDWs|
You’ve got his famous battle (“orleans”), his refusal to understand finance (“banks”), and his penchant for genocide—rendered all the more striking when you realize that “creek” refers to the Creek tribe (now called Muscogee), who lost a brutal war against Jackson and years later were also victims of the Indian Removal Act.
Since the MDWs usually work this well, it’s striking when they depart from expectations. For some guys, this means a focus on the pre-Presidency: Madison’s top word is “constitution”, Reagan’s are littered with California and Hollywood terms, and Eisenhower’s focus on war terminology for eight straight words until they arrive at “interstate”, before jumping back to “ii”. Ulysses S. Grant is similar, which is unsurprising, since his own memoir barely mentions that he was President.
In another case that surprised me a little, the focus is on the post-Presidency:
|William Howard Taft MDWs|
Taft was the only President who ever went on to become a Supreme Court justice. That’s distinguishing in either sense of the word, and a nice legacy for a guy who is probably best known to the public for being too fat to get out of a bathtub. (The article I have says that the evidence for this actually happening is unclear, but gives two sources for the distressingly ambiguous sentence “However, he once did overflow a bathtub.” I’m surprised and a little disappointed to say this whole sequence has been removed from the current version of the article.)
Another guy who surprised me was JFK. The word “assassination” is just 12th on his list; but on reflection, this may have something to do with the separate 8,000-word article on it, not to be confused with the 19,000-word “John F. Kennedy assassination conspiracy theories” article, which is longer than any Presidential article.††
Rules of Distinction
One feature of MDWs is that they privilege proper nouns. This makes sense when you consider just how specific (i.e., distinct) proper nouns are: all sorts of kids have dogs, but only Oblio has Arrow. This means there are a few things that define you if you get a Wikipedia page:
- Your home. A President’s home state usually appears in his top few MDWs. If a guy has two home states, they both appear: Lincoln gets Illinois and Kentucky, Obama gets Illinois and Hawaii (and, even higher, Chicago). This isn’t a universal rule (JFK doesn’t have “massachusetts”), but it’s quite common.
- Your wife. George has Martha, John has Abigail, Abe has Mary, Rutherford has Lucy, Herbert has Lou, Dwight has Mamie, Dick has Pat, Ron has Nancy, Bill has Hillary. You’re known by the person you love. But, there’s also:
- Your enemy. The first word for Washington is “british”; “confederate” makes the top five for Lincoln and Grant; Polk has his “mexico” and Truman his “korea”. Booth, Guiteau, Czolgosz, and Oswald make their expected lists. LBJ has not just “vietnam” but “goldwater”. And look back at the Jackson list above: creek, indian, indians, calhoun, bank, banks, seminole, tribes. That’s eight enemies in just 16 words (and another, “orleans”, is the site of a battle). For everyone, but especially for bloodthirsty maniacs, distinction is conferred by who and what we choose to fight.
Eras, In So Many Words
Another cool option with these MDWs is approaching from the other direction. Once we have them, we can pick a word and see who it encompasses. For instance, take the word “gold”. This turns out to be an MDW for Grant, Hayes, Garfield, Cleveland, Harrison, and McKinley—in other words, every President but one (Arthur) from 1868-1901. This is probably a function of the currency debates that dominated that era (the last three guys also have “silver” as an MDW), but it’s also a nice, very literal way to capture the Gilded Age.
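Going from a word back to the Presidents it covers is just an inverted index. Here is a small Python sketch of that lookup; the function name `presidents_by_word` is made up, and the MDW lists are heavily truncated toy versions built from words mentioned in this post, not the real output.

```python
from collections import defaultdict

def presidents_by_word(mdws):
    """Invert {president: [distinctive words]} into {word: [presidents]}."""
    index = defaultdict(list)
    for president, words in mdws.items():
        # set() guards against a word being listed twice for one President.
        for word in set(words):
            index[word].append(president)
    return index

# Hypothetical, heavily truncated MDW lists for illustration.
toy_mdws = {
    "Grant": ["confederate", "gold"],
    "Cleveland": ["gold", "silver"],
    "McKinley": ["gold", "silver"],
    "Truman": ["atomic", "korea"],
}
index = presidents_by_word(toy_mdws)
gold_presidents = index["gold"]
```

If the input dictionary is in chronological order, each word's list comes out in chronological order too, which makes a run like the “gold” Presidents easy to spot.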
Or take another definitive American word: “slave”. That word and “slaves” appear as MDWs for Washington, Jefferson, Madison, Monroe, John Quincy Adams, and Jackson—six of the first seven Presidents, and all of the ones who owned slaves themselves. (JQA, like his father, didn’t own any slaves, and the two words appear in his article in the context of his fierce opposition to slavery; for the rest of them, the words are there mainly because they owned slaves.) After this crew, those two words largely disappear, with the exceptions of Fillmore (he had “moderate anti-slavery views”, according to the article) and Lincoln (for obvious reasons).
But the issue does not disappear. The words “slavery” or “antislavery” appear as MDWs for JQA, Jackson, Van Buren, Polk, Taylor, Fillmore, Pierce, and Buchanan, before coming to a close with Lincoln. That’s everyone between the Founding Fathers and the close of the Civil War with the exceptions of William Henry Harrison (who served one month) and John Tyler (who was in office, but didn’t exactly serve at all). Many of these Presidents were slave-owners themselves, but we see a shift away from personal ownership as the focus (with a few overlap cases), and toward the rise of a political cause—from slaves to slavery. It’s a striking lexical marker of the transition from one paradigm to another, maybe somehow indicating the point at which Wikipedia writers and readers feel that Presidents were “of their time” instead of responsible for it.
A Final Mystery
I want to end with something I noticed but can’t quite explain. The word “president” actually appears as an MDW in several cases. Here they are:
|president||80||0.008887996||George HW Bush|
In some of these cases, it seems like the word might have to do with unique relationships to the office. Harrison died immediately, Tyler took over even though no one wanted him (he was known as “His Accidency”), while succession laws were still untested, and Johnson abused the office to veto Congress until they impeached him (note: if you include “presidential” in these results, you add Clinton to the mix, suggesting impeachment may play a role). Still, even if this is right, it only explains a few articles. I have no idea what any of this has to do with Taft.
And then there’s this: Every Republican President since 1968 has the word “president” as an MDW. What’s more, in this era it’s only Republicans: Carter, Clinton, and Obama are all missing from that list. Why is this happening? Is it some sort of conservative preference for hierarchy/authority? A right-wing love of the institution? The tendency of these Presidents to wield presidential authority in problematic ways (Watergate, the pardon of the guy who did Watergate, Iran-Contra, the Decider and his father)? Just a random tic from a prolific Wikipedia editor? (Even then, it might be interesting that the editor of these articles has that tic.)
I looked at the word’s usage in the articles in hope of clarity, but the answer wasn’t immediately obvious. I did notice that, in the George W. Bush article, for instance, there was a tendency to call him “President Bush” in photo captions (which are included in the articles I analyzed)—but this doesn’t explain why other articles don’t follow the same practice. This all put me in mind of a bumper sticker I used to see in Texas, that looked roughly like this:
I never knew how to interpret it. What’s the point of stating that the current President is the President? I am being completely honest when I say that I don’t know if this is supposed to be combative, reassuring, snarky, patriotic, a sign of the tribe, or something else I haven’t even thought of. So it’s interesting to see a sort of version of it replicated in these MDWs—105 uses of the word President††† in an article that tells you, right at the top, that it’s about a President. It’s an interesting form of distinction for the modern Republican President—the simple confirmation that they held the job.
*It was very tempting to use this as the title of the post, but I think you just can’t do that anymore. If you Google “what we talk about when we talk about” -love (the last part is so that you don’t get any actual references to Raymond Carver’s short story), you get 211,000 results. Based on those results, here are a few of the things about which we talk about what we talk about when we talk about them:
- Apple and Compelled Speech
- Gun Violence
- “The Uyghurs” (quotation marks in original)
- Clone Club
** I doubt he wins though; his name is too weird. My guess is Ben Harrison.
***Specifically, I used word frequencies from all articles to set expected values, and word frequencies in given articles to set observed values. I then used a Fisher’s exact test to determine which words were significantly more present than expected. I did not look for words that were missing (e.g., if a President’s article says “war” much less than ordinary). My thanks to Mark Algee-Hewitt for helping me write the R code used in this project, and for explaining MDWs to me in the first place.
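For the curious, here is roughly what that test looks like, sketched in Python rather than the R actually used. Two things here are my assumptions: that the test is one-sided (only over-representation counts), and that each article is compared against the rest of the corpus. The function names are invented, and the counts are toy numbers.

```python
from math import comb

def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table

                   this word   other words
        article        a            b
        rest           c            d

    i.e. the chance of seeing `a` or more hits in the article if the
    word were spread evenly (a hypergeometric upper tail).
    """
    n_article = a + b
    n_word = a + c
    total = a + b + c + d
    upper = min(n_article, n_word)
    tail = sum(comb(n_word, k) * comb(total - n_word, n_article - k)
               for k in range(a, upper + 1))
    return tail / comb(total, n_article)

def rank_words(article_counts, corpus_counts):
    """Order an article's words by p-value, most distinctive first.

    `corpus_counts` covers all articles, including this one.
    """
    n_article = sum(article_counts.values())
    n_corpus = sum(corpus_counts.values())
    scored = []
    for word, a in article_counts.items():
        b = n_article - a
        c = corpus_counts[word] - a       # hits outside this article
        d = n_corpus - n_article - c      # everything else
        scored.append((fisher_greater(a, b, c, d), word))
    return [word for _, word in sorted(scored)]

# Toy counts, invented for illustration only.
p = fisher_greater(3, 1, 1, 3)
ranking = rank_words({"atomic": 3, "war": 1}, {"atomic": 4, "war": 4})
```

With real article-sized counts the binomial coefficients get enormous, so in practice a statistics library (SciPy's `fisher_exact`, for instance) would be the saner choice; the hand-rolled version is only here to show what is being computed.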
† In all cases, the words are ordered by p-value, where lower is taken to mean “more distinctive”. Here and below, I’m pasting in partial lists for space purposes.
†† This makes it longer than Macbeth, as well as 7 other Shakespeare plays. See also the 2,800 word “Assassination of John F. Kennedy in Popular Culture” article.
††† W’s article has 105 occurrences of the word “president”, more than three times as many as George Washington, who not only has a roughly equal-length article, but practically invented the office.