AI Search Has A Citation Problem
LOL, these are terrible results.
We randomly selected ten articles from each publisher, then manually selected direct excerpts from those articles for use in our queries. After providing each chatbot with the selected excerpts, we asked it to identify the corresponding article’s headline, original publisher, publication date, and URL […] We deliberately chose excerpts that, if pasted into a traditional Google search, returned the original source within the first three results. We ran sixteen hundred queries (twenty publishers times ten articles times eight chatbots) in total.
Results:
Overall, the chatbots often failed to retrieve the correct articles. Collectively, they provided incorrect answers to more than 60 percent of queries. Across different platforms, the level of inaccuracy varied, with Perplexity answering 37 percent of the queries incorrectly, while Grok 3 had a much higher error rate, answering 94 percent of the queries incorrectly.
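A quick back-of-envelope sketch of the study’s arithmetic, assuming queries were split evenly across chatbots. The implied per-chatbot incorrect counts below are derived from the quoted percentages, not figures the study reports directly:

```python
# Hedged sketch: back-of-envelope check of the study's numbers.
publishers, articles_each, chatbots = 20, 10, 8
total_queries = publishers * articles_each * chatbots
assert total_queries == 1600  # "sixteen hundred queries"

per_chatbot = total_queries // chatbots  # 200 queries per chatbot

# Implied incorrect-answer counts at the quoted error rates
# (assumption: each chatbot saw all 200 excerpts):
perplexity_incorrect = round(0.37 * per_chatbot)  # about 74 of 200
grok3_incorrect = round(0.94 * per_chatbot)       # about 188 of 200

print(per_chatbot, perplexity_incorrect, grok3_incorrect)
```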
Most of the tools we tested presented inaccurate answers with alarming confidence, rarely using qualifying phrases […] With the exception of Copilot — which declined more questions than it answered — all of the tools were consistently more likely to provide an incorrect answer than to acknowledge limitations.
Comically, the paid premium models “answered more prompts correctly than their corresponding free equivalents, [but] paradoxically also demonstrated higher error rates. This contradiction stems primarily from their tendency to provide definitive, but wrong, answers rather than declining to answer the question directly.”
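The “contradiction” is only apparent once you note that declining to answer counts as neither correct nor incorrect. A minimal sketch with invented numbers (not the study’s data) shows how a paid model can answer more queries correctly and still post a higher error rate:

```python
# Hypothetical counts, for illustration only: 200 queries each.
def rates(correct, incorrect, declined):
    total = correct + incorrect + declined
    return correct / total, incorrect / total

free = dict(correct=60, incorrect=40, declined=100)     # declines often
premium = dict(correct=80, incorrect=110, declined=10)  # guesses instead

free_acc, free_err = rates(**free)        # 30% correct, 20% incorrect
prem_acc, prem_err = rates(**premium)     # 40% correct, 55% incorrect

# Premium wins on correct answers yet loses on error rate,
# because it answers (wrongly) where the free tier declined.
assert prem_acc > free_acc and prem_err > free_err
print(f"free:    acc={free_acc:.0%} err={free_err:.0%}")
print(f"premium: acc={prem_acc:.0%} err={prem_err:.0%}")
```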
Bottom line: don’t trust an LLM to attribute citations…
Tags: llm llms media journalism news research search ai chatgpt grok perplexity tests citations