*Note: There is an audio version of today’s newsletter, if you’d rather get it that way!*
Today, we're going to talk about Wikipedia and one of the most insane stories of the past several weeks. First, I need to be up front: I did not know that Scots is a language spoken in the Lowlands of Scotland and parts of Ireland. I'm OK admitting to my dumb American ignorance here because my family tree runs deep in both countries. I was ignorant, but at least it was about my own history (no clue if we actually spoke Scots tbh).
Scots has largely been absorbed by English, but it's still there. OK, now to Wikipedia. As with many languages and dialects, there is a version of Wikipedia written in Scots. And the Scots Wikipedia has largely been written by a (now) 19-year-old who doesn't know Scots. Reddit user Ultach unearthed this bomb two weeks back. From their Reddit post:
I’ll just say that if you click on the edit history of pretty much any article on the Scots version of Wikipedia, this person will probably have created it and have been the majority of the edits, and you’ll be able to view their user page from there. They are insanely prolific. They stopped updating their milestones in 2018 but at that time they had written 20,000 articles and made 200,000 edits. That is over a third of all the content currently on the Scots Wikipedia directly attributable to them, and I expect it’d be much more than that if they had updated their milestones, as they continued to make edits and create articles between 2018 and 2020.
The problem is that this person cannot speak Scots. I don’t mean this in a mean spirited or gatekeeping way where they’re trying their best but are making a few mistakes, I mean they don’t seem to have any knowledge of the language at all.
The Redditor then goes on to say that "this person has possibly done more damage to the Scots language than anyone else in history" through cultural vandalism, among other things. In the end, Scots is a dying language, and this 19-year-old, who STARTED MAKING UPDATES AT THE AGE OF 12, created a situation that reinforces the thinking that Scots is a bootleg version of English rather than a language or dialect of its own. There have been active proposals to close the Scots Wikipedia over the years, with users calling it a "Joke project. Funny for a few minutes, but inappropriate use of resources." And that was before the bombshell reveal.
But it doesn't stop there. Languages like Scots often don't have robust digital archives, so Wikipedia can end up as the de facto archive for sometimes dying, sometimes relegated languages (not good). Technologies that work with language, like voice assistants, translation tools, and search, often train their AI models on Wikipedia articles.
From Quartz's "How a Scots Wikipedia scandal highlighted AI's data problem":
“I don’t think people necessarily realize how important Wikipedia is for training all of our language technologies,” said David Yarowsky, a computer science professor at Johns Hopkins University. “When these problems crop up, it really is impacting our ability to do a high quality job on the technologies that these communities want.”
If you have AI-based technology doing translations (or anything related to words), then, as with anything AI, you have to have clean data from the start. Otherwise, your systems learn from the errors and "optimize" in a self-reinforcing cycle of bad output.
So: a 19-year-old who started making up a fake language at the age of 12 has basically fucked an entire language and the prospects of that language surviving in modern times.
English Wikipedia, for better and for worse, has emerged as a place for a shared reality, amidst polarization, filter bubbles, and misinformation. It's become a utility that's trusted as much as the news, writes Michael Mandiberg in The Atlantic. I'm not going to opine on whether that's #good or #bad—that's for another time.
But it reinforces that knowing who writes the articles on Wikipedia is very important. Mandiberg's deep dive into just who these people are and where they come from is fascinating. It literally maps them to reveal that:
There is very little editing done by U.S. citizens across the Plains, Dakotas, west Texas, and the south, excluding Florida and the Carolinas (and some larger metro areas).
Counties with high religious adherence edit Wikipedia at a low rate.
Native American communities, and rural, poor, black counties in the south, are often prevented from editing due to issues of access, education, and poverty.
Global editing patterns reflect the geography of the British Empire: editing activity is way higher in former colonies than in Africa.
The whole deep dive is worth a read.
OK! That's it for today. Love you.