  • _Cheap science, real harm: the cost of replacing human participation with synthetic data_ [pdf]

    A new paper from the inimitable Abeba Birhane, on the increasingly common practice of generating synthetic data using LLMs:

    Driven by the goals of augmenting diversity, increasing speed, reducing cost, the use of synthetic data as a replacement for human participants is gaining traction in AI research and product development. This talk critically examines the claim that synthetic data can “augment diversity,” arguing that this notion is empirically unsubstantiated, conceptually flawed, and epistemically harmful. While speed and cost-efficiency may be achievable, they often come at the expense of rigour, insight, and robust science. Drawing on research from dataset audits, model evaluations, Black feminist scholarship, and complexity science, I argue that replacing human participants with synthetic data risks producing both real-world and epistemic harms at worst and superficial knowledge and cheap science at best.

    "Synthetic data: stereotypes compressed" is absolutely spot on. Synthetic data doesn't give insight into human behaviour and beliefs, just into stereotypes. The practice is increasingly common in social science fields, under the names "digital twins" and "silicon samples".

    Tags: data surveys abeba-birhane papers ai synthetic-data digital-twins simulation testing social-science silicon-samples