Generative AI and Wikipedia editing: What we learned in 2025

> That means the article contained a plausible-sounding sentence, cited to a real, relevant-sounding source. But when you read the source it’s cited to, the information on Wikipedia does not exist in that specific source. When a claim fails verification, it’s impossible to tell whether the information is true or not.

This has always been a rampant problem on Wikipedia. I can't find any indication that it has increased recently, because they're only investigating articles flagged as potentially AI. So what's the control baseline rate here?

Applying correct citations is actually really hard work, even when you know the material thoroughly. I just assume people write stuff they know from their field, then mostly look to add the minimum number of plausible citations after the fact, and then most people never check them, and everyone seems to just accept it's better than nothing. But I also suppose it depends on how niche the page is, and which field it's in.

There was a fun example of this that happened live during a recent episode of the Changelog[1]. The hosts noted that they were incorrectly described as being "from GitHub" with a link to an episode of their podcast which didn't substantiate that claim. Their guest fixed the citation as they recorded[2].

[1]: https://changelog.com/podcast/668#transcript-265

[2]: https://en.wikipedia.org/w/index.php?title=Eugen_Rochko&diff...

How did they know it was not LLM generated?

LLMs can add unsubstantiated conclusions at a far higher rate than humans working without LLMs.

At some point you're forced to either believe that people have never heard of the concept of a force multiplier, or to return to Upton Sinclair's observation about getting people to believe in things that hurt their bottom line.

I don’t see why people keep blaming cars for road safety problems; people got into buggy crashes for centuries before automobiles even existed

Because a difference in scale can become a difference in category. A handful of buggy crashes can be reduced to operator error, but as the car becomes widely adopted and analysis matures, it becomes clear that the design of the machine and its available use cases have fundamental flaws that cause a higher rate of operator error than desired. Therefore, cars are redesigned to be safer, laws and regulations are put in place, licensing systems are introduced, and traffic calming and road design are considered.

Hope that helps you understand.

Is the sarcasm really that opaque? Who would unironically equate buggy accidents and automobile accidents?

I’d like to introduce you to the internet.

There’s a reason /s was a big thing: one person's obvious sarcasm is (almost tautologically) another person's true statement of opinion.

How much time have you spent around developers?

The problems I've run into are both people giving fake citations (the citations don't actually justify the claim that's being made in the article), and people giving real citations, but if you dig into the source you realize it's coming from a crank.

It's a big blind spot among the editors as well. When this problem was brought up here in the past, with people saying that claims on Wikipedia shouldn't be believed unless people verify the sources themselves, several Wikipedia editors came in and said this wasn't a problem and Wikipedia was trustworthy.

It's hard to see it getting fixed when so many don't see it as an issue. And framing it as a non-issue misleads users about the accuracy of the site.

> but if you dig into the source you realize it's coming from a crank.

It is a dark Sunday afternoon. Bob Park is sitting on his sofa as usual, drunk as usual, when suddenly the TV reveals to him that there is something called the Paranormal (Twilight Zone music)... Instantly Bob knows there are no such things and adds a note to the incomprehensible mess of notes that one day will become his book. He downs one more Budweiser. In the distance lightning strikes a tree. Bob shouts "You don't scare me!" and shakes his fist. After a few more beers a miracle of inspiration descends and, as if channeling, in the span of 10 minutes he writes notes about Cold Fusion, Alternative Medicine, Faith Healing, Telepathy, Homeopathy, Parapsychology, Zener cards, the tooth fairy and Father Xmas. With much confidence he writes that none of them are real. It's been a really productive afternoon. It reminds him of times long gone, back when he actually published many serious papers. He counts the remaining beers in his cooler and says to himself, "In the next book I will need to take on god himself. The world needs to know, god is not real. I too will be the authority on that subject."

https://en.wikipedia.org/w/index.php?title=Special:WhatLinks...

When I've checked Wikipedia citations I've found so much brazen deception - citations that obviously don't support the claim - that I don't have confidence in Wikipedia.

> Applying correct citations is actually really hard work, even when you know the material thoroughly.

Why do you find it hard? Scholarly references can be sources for fundamental claims, review articles are a big help too.

Also, I tend to add things to Wikipedia or other wikis when I come across something valuable rather than writing something and then trying to find a source (which also is problematic for other reasons). A good thing about crowd-sourcing is that you don't have to write the article all yourself or all at once; it can be very iterative and therefore efficient.

It's not that I personally find it hard.

It's more like, a lot of stuff in Wikipedia articles is somewhat "general" knowledge in a given field, where it's not always exactly obvious how to cite it, because it's not something any specific person gets credit for "inventing". Like, if there's a particular theorem then sure you cite who came up with it, or the main graduate-level textbook it's taught in. But often it's just a particular technique or fact that just kind of "exists" in tons of places but there's no obvious single place to cite it from.

So it actually takes some work to find a good reference. Like you say, review articles can be a good source, survey articles or books. But it can take a surprising amount of effort to track down a place that actually says the exact thing. I literally just last week was helping a professor (leader in their field!) try to find a citation during peer review for their paper for an "obvious fact" in the field, that was in their introduction section. It was actually really challenging, like trying to produce a citation for "the sky is blue".

I remember, years ago, creating a Wikipedia article for a particular type of food in a particular country. You can buy it at literally every supermarket there. How the heck do you cite the food and facts about it? It just... is. Like... websites for manufacturers of the food aren't really citations. But nobody's describing the food in academic survey articles either. You're not going to link to Allrecipes. What do you do? It's not always obvious.

The title I've chosen here is carefully selected to highlight one of the main points. It comes (lightly edited for length) from this paragraph:

Far more insidious, however, was something else we discovered:

More than two-thirds of these articles failed verification.

That means the article contained a plausible-sounding sentence, cited to a real, relevant-sounding source. But when you read the source it’s cited to, the information on Wikipedia does not exist in that specific source. When a claim fails verification, it’s impossible to tell whether the information is true or not. For most of the articles Pangram flagged as written by GenAI, nearly every cited sentence in the article failed verification.

FWIW, this is a fairly common problem on Wikipedia in political articles, predating AI. I encourage you to give it a try and verify some citations. A lot of them turn out to be more or less bogus.

I'm not saying that AI isn't making it worse, but bad-faith editing is commonplace when it comes to hot-button topics.

Any articles where newspapers are the main source are basically just propaganda. An encyclopaedia should not be in the business of laundering yellow journalism into what is supposed to be a tertiary resource. If they banned this practice, that would immediately deal with this issue.

That's not what I'm saying. I mean citations that aren't citations: a "source" that doesn't discuss the topic at all or makes a different claim.

A blanket dismissal is a simple way to avoid dealing with complexity, here both in understanding the problem and forming solutions. Obviously not all newspapers are propaganda and at the same time not all can be trusted; not everything in the same newspaper or any other news source is of the same accuracy; nothing is completely trustworthy or completely untrustworthy.

I think accepting that gets us to the starting line. Then we need to apply a lot of critical thought to sometimes difficult judgments.

IMHO quality newspapers do an excellent job - generally better than any other category of source on current affairs, but far from perfect. I remember a recent article for which they interviewed over 100 people, got ahold of secret documents, read thousands of pages, consulted experts .... That's not a blog post or Twitter take, or even a HN comment :), but we still need to examine it critically to find the value and the flaws.

> Obviously not all newspapers are propaganda

citation needed

There is literally no source without bias. You just need to consider whether you think a source's biases are reasonable or not.

See you should work for a newspaper. You have the gumption.

That is probably 95% of Wikipedia articles. Their goal is to create a record of what journalists consider to be true.

Submitted title was "For most flagged articles, nearly every cited sentence failed verification".

I agree, that's interesting, and you've aptly expressed it in your comment here.

People here are claiming that this is true of humans as well. Apart from the fact that bad content can be generated much faster with LLMs, what's your feeling about that criticism? Is there any measure of how many pre-LLM submissions made unsubstantiated claims?

Thank you for publishing this work. Very useful reminder to verify sources ourselves!

Note that this article is only about edits made through the Wiki Edu program, which partners with universities and academics to have students edit Wikipedia on course-related topics. It's not about Wikipedia writ large!

That's interesting as my first thought reading the comments was "this problem seems very similar to many students writing papers just finding citations that sound correct".

Sometimes it is really sad to read from (even PhD level) students on social media about their paper writing practices.

I've found Wiki Edu-edited pages with pages of creative writing exercises. When I read their sources, I found they were clumsily paraphrasing and misunderstanding the source.

LLMs definitely fit the use-case of Wiki Edu students, who are just looking to pass a grade, not to look into a topic because of their interest.

So, a small proportion of articles were detected as bot-written, and a large proportion of those failed validation.

What if in fact a large proportion of articles were bot-written, but only the unverifiable ones were bad enough to be detected?
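To make that selection-effect worry concrete, here is a toy calculation. Every number in it is made up for illustration (none come from the WikiEdu post or from Pangram), and the function and parameter names are hypothetical. It assumes the detector is more likely to flag bot-written articles whose citations fail than ones whose citations hold up, and shows how the failure rate among flagged articles can then sit far above the true rate among all bot-written articles.

```python
def flagged_stats(n_bot, true_fail_rate, detect_if_fail, detect_if_ok):
    """Toy model of a detector whose hit rate depends on article quality.

    n_bot           -- number of bot-written articles (hypothetical)
    true_fail_rate  -- share of them whose citations fail verification
    detect_if_fail  -- chance the detector flags a failing article
    detect_if_ok    -- chance the detector flags a non-failing article
    Returns (share of bot articles flagged, failure rate among flagged).
    """
    failing = n_bot * true_fail_rate
    passing = n_bot * (1 - true_fail_rate)
    flagged_failing = failing * detect_if_fail
    flagged_passing = passing * detect_if_ok
    flagged = flagged_failing + flagged_passing
    return flagged / n_bot, flagged_failing / flagged


# Made-up inputs: 30% of bot-written articles truly fail verification,
# but failing ones are ten times more likely to be flagged.
share_flagged, fail_among_flagged = flagged_stats(1000, 0.30, 0.50, 0.05)
print(f"flagged: {share_flagged:.1%}, failing among flagged: {fail_among_flagged:.1%}")
```

With these invented inputs, fewer than a fifth of the bot-written articles get flagged, yet roughly four out of five flagged ones fail verification, even though the assumed true failure rate is 30%. It says nothing about what the real numbers are; it only shows why the flagged set alone can't answer the question.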

Human editors, I suspect, would pick up the "tells" of generated text, although as we know, there are a lot of false positives in that space.

But it looks like Pangram is a text-classifying NN trained using a technique where they get a human to write a body of text on a subject, and then get various LLMs to write a body of text on the same subject, which strikes me as a good way to approach the problem. Not that I'm in any way qualified to properly understand ML.

More details here: https://arxiv.org/pdf/2402.14873

Set aside the effect within Wikipedia and consider the larger picture: millions of people generating text with LLMs, and at least some of that text being accepted as correct by millions of readers.

The WikiEdu article clearly demonstrates what everyone should have known already: an LLM has no commitment to the truth. An LLM's only commitment is to correct syntax.

I feel like this is such a tragedy of the commons for the LLM providers. Wikipedia probably makes up a huge bulk of their dataset, why taint it? Would be interesting if there was some kind of "you shall not use our platform on Wikipedia" stance adopted.

I don’t think it’s the providers doing this, it’s the awful users. They’re doing the same thing on GitHub. It’s maddening.

Wikipedia having incorrect citations is way older than LLMs. As many other people have pointed out in this thread, if you start pulling strings a lot of what people write starts falling apart.

It's not even unique to Wikipedia. It's really not difficult to find very misleading statements backed by a citation that doesn't even support the claim when you check the original.

This is like saying handing out machine guns is no big change because people have been shooting arrows for a long time. At some point volume becomes the story once it overwhelms the community’s ability to correct errors.

It would be random individuals.

What would be a truly epic application is their own chatbot to ask about applying edit guidelines. After I read almost all of the guidelines, the talk page debates, even among experienced editors, looked waaaay off. The pattern of revert first, make up excuses later seems like the worst newbie deterrent possible, when it should be fine to make mistakes. Many such excuses would get debunked by a bot immediately. It simply won't do any favors. If established editors don't like it, they can edit the guidelines.

I’m honestly surprised LLMs are still screwing up citations. It does not feel like a harder task than building software or generating novel math proofs. In both those cases, of course, there is a verifier, but self-verification with “Does this text support this claim?” seems like it ought to be within the capabilities of a good reasoning model.

But as I understand the situation, even the major Deep Research systems still have this issue.
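For what it's worth, the shape of a "Does this text support this claim?" check can be sketched in a few lines. What follows is a deliberately crude stand-in, not how any Deep Research system or Wikipedia tool works: it swaps the reasoning model for plain word overlap, and the function names, stopword list, and threshold are all assumptions made up for this sketch. A low score only means "a human should look at this citation"; a high score proves nothing about whether the source really supports the claim.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard one).
STOPWORDS = {"the", "and", "from", "that", "with", "this", "for", "are", "was"}


def content_words(text):
    """Lowercased words and numbers longer than two characters, minus stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def support_score(claim, source_text):
    """Fraction of the claim's content words that also appear in the source."""
    claim_words = content_words(claim)
    if not claim_words:
        return 0.0
    return len(claim_words & content_words(source_text)) / len(claim_words)


def needs_human_check(claim, source_text, threshold=0.6):
    """Flag the citation for review when lexical overlap is below the threshold."""
    return support_score(claim, source_text) < threshold


# Source sentence quoted elsewhere in this thread from Wikipedia.
source = ("About three quarters of that rough octagon is the Meseta Central, "
          "a vast plateau ranging from 610 to 760 m in altitude.")
supported = "The Meseta Central is a plateau ranging from 610 to 760 m."
unsupported = "Segovia hosts an annual cheese festival."  # invented claim for the demo

print(needs_human_check(supported, source))
print(needs_human_check(unsupported, source))
```

The hard part is everything this skips: paraphrase, negation, and claims that reuse the source's vocabulary while saying something different, which is exactly the "plausible-sounding but not in the source" failure being discussed.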

I find it very interesting that the main competitor to Wikipedia, which is Grokipedia, is taking a 180-degree approach by being AI-first.

Didn't know about Grokipedia, I've just opened an article in it about Spain, scrolled to a random paragraph, and the information in it is plain wrong:

From https://grokipedia.com/page/Spain#terrain-and-landforms:

> Spain's peninsular terrain is dominated by the Meseta Central, a vast interior plateau covering about two-thirds of the country's land area, with elevations ranging from 610 to 760 meters and averaging around 660 meters

Segovia is at 1,000 meters, and so is most of the top half of the "Meseta". https://en-gb.topographic-map.com/map-763q/Spain/?center=41....

I still stand on not trusting any of what AI spits out, be it code or text. And it takes me usually longer to check that everything is ok than doing it myself, but my brain is enticed by the "effort shortcut" that AI promised.

I'm not an expert on the geography of Spain, and it's rare that I'd defend Grokipedia but in this case I think it is correct.

Meseta Central means central tableland. Segovia is on the edge of the mountain range that surrounds that tableland, but is often referred to as part of it. This is fuzzy though.

Wikipedia says: The Meseta Central (lit. 'central tableland', sometimes referred to in English as Inner Plateau) is one of the basic geographical units of the Iberian Peninsula. It consists of a plateau covering a large part of the latter's interior.[1]

Looking at the map you linked, the flat part is between 610 and 760 meters.

Finally, when speaking about the Iberian Peninsula Wikipedia itself includes this:

> "About three quarters of that rough octagon is the Meseta Central, a vast plateau ranging from 610 to 760 m in altitude."[2]

[1] https://en.wikipedia.org/wiki/Meseta_Central

[2] https://en.wikipedia.org/wiki/Iberian_Peninsula

Spaniard here. Spain is tricky: it's both 'flat' with the meseta and the 2nd most mountainous country in Europe. I am not kidding, look at a height map. It has a plateau... surrounded by mountains and with a bigass sierra at mid-North (Picos de Europa).

Grok does cite that claim as being from https://countrystudies.us/spain/30.htm a page in Eric Solsten and Sandra W. Meditz, editors. Spain: A Country Study. Washington: GPO for the Library of Congress, 1988.

The nice thing about grokipedia is that if you have counter examples like that you can provide it as evidence to change it and it will rewrite the article to be more clear.

You know what other site you can provide evidence to and change to be more correct?

Not Wikipedia, as Wikipedia doesn't care about evidence. Those people care about reputable secondary sources and will ignore you when you point out evidence that contradicts such sources.

I don't ever edit English wikipedia because my English is not nearly up to the standard, and suggestions for improvement (worthwhile IMO) are usually ignored. Grok at least won't ignore you. (I tend to post suggestions to unpopular pages with sparse edit history, which is probably the reason for them going unnoticed.)

I used to frequent IRC channels and forums where no such thing as an old question existed. Someone asked an interesting question on IRC and days or weeks later a response would happen. On forums the response could be more than a year "delayed". Gradually things shifted to newer new new news that couldn't possibly be new enough. Then debates happen where people sometimes link to the vastly superior olds. Wikipedia finally caught up and questions are no longer ignored. Instead they are archived long before an ignored status could be earned.

> I find it very interesting that the main competitor to Wikipedia which is Grokipedia

Encyclopedia Britannica (the website, not the printed book) is the main competitor to Wikipedia and gets an order of magnitude more traffic than Grokipedia. Right now Grokipedia is the new kid on the block. It has yet to be seen if it's just a novelty or if it has staying power, but either way it still has a ways to go before it's Wikipedia's primary competitor.

Main competitor? I’m pretty sure that Uncyclopedia is a more relevant competitor to Wikipedia than Grokipedia. Likely more accurate, too.

In some time it will become a serious alternative.

That thing is "the main competitor to Wikipedia" in the same way I'm the main competitor for the Olympic 100m race. I mean, both I and the winner have legs so it's going to be a close race, right?

It’s on its way to becoming more popular and a clear competitor to it. Just a matter of time.

This happens a lot on Wikipedia. I'm not sure why, but it does and you can see its traces through the Internet as people post the mistaken information around.

One that took me a little work to fix was pointed out by someone on Twitter: https://x.com/Almost_Sure/status/1901112689138536903

When I found the source, the Twitter poster was correct! Someone had decided to translate "A hundred years ago, people would have considered this an outrage. But now..." as "this function is an outrage" which honestly is ironically an outrageous translation. What the hell dude.

But it takes a lot of work to clean up stuff like that! https://en.wikipedia.org/w/index.php?title=Weierstrass_funct...

I had to go find the actual source (not the other 'sources' that repeated off Wikipedia or each other) and then make sure it was correct before dealing with it. A lie can travel halfway around the world...

There seems much defensiveness in the comments here along the lines of "not a new thing" and "not unique to LLM/AI".

It seems to deflect, even gaslight TFA.

> For most of the articles Pangram flagged as written by GenAI, nearly every cited sentence in the article failed verification.

So why deflect that into convenient other pedantry (surely not under the guise tech forums often do so)?

So why the discomfort for part of HN at an assertion that AI is being used for nefarious purposes and the creation of alternate 'truths'?

Astroturfing or marketing, I’d guess. I’ve noticed you’re no longer allowed to say negative things about AI here without significant pushback, and I’d bet this isn’t an organic shift in perception.

I've found that generally people reserve down votes for posts that don't add to the conversation, just like we're supposed to do. It's always been down vote city if you happen to criticize political positions that benefit libertarian technologists. But lately anything critical of AI tends to get a lot of down votes. Even on older posts that you can't find on the front page anymore... It feels inorganic.

> It's always been down vote city if you happen to criticize political positions that benefit libertarian technologists.

This varies wildly by timezone. Usually I get upvoted during European timezones and then brace for the Americans to wake up.

There sure are a lot of green names on this post pushing that agenda. Makes you wonder if it's astroturfing. And why is it necessary? Is AI so fragile it can't let any criticism stand unchallenged?

lol. would have written something shorter for HN, but the main expected audience for it was Wikipedians.

This goes much further than Wikipedia, it's just particularly visible there.

Thanks for the LLM comment, but that's dumb. If the problem really was as bad with humans (it obviously is not), then OP wouldn't've happened:

> For most of the articles Pangram flagged as written by GenAI, nearly every cited sentence in the article failed verification.

Agree. I'm curious about the human contribution baseline.

I trust Grokipedia way more, even though it's AI-generated. Wikipedia on any current topic is dominated by various edit gangs trying to push an agenda.