
Peer-Reviewed at Last: LSE Study Validates AI-Led Qualitative Interviews at Scale

A rigorous new paper accepted at the *Review of Economic Studies* confirms what early adopters already suspected: AI interviewers perform on par with trained human experts. Here is what the research actually shows, and what it means for qualitative research teams.

Prajwal Paudyal, PhD · May 10, 2026 · 8 min read

The Evidence Gap Is Officially Closed

For the past three years, qualitative researchers have asked a reasonable question: where is the peer-reviewed evidence that AI-led interviews actually work? Vendor case studies are not evidence. Pilot projects with 20 participants are not evidence. Conference demos are not evidence.

Now there is evidence. Friedrich Geiecke and Xavier Jaravel at the London School of Economics have published "Conversations at Scale: Robust AI-led Interviews" — a paper accepted at the *Review of Economic Studies*, one of the top five economics journals in the world. This is not a workshop paper or a preprint that will languish on arXiv. This is peer-reviewed validation from one of the most rigorous review processes in social science.

The headline finding: AI-led interviews were rated on par with an average human expert interviewer by trained sociology PhD students from Harvard and LSE, who evaluated transcripts blind. On one benchmark, AI voice actually scored higher than human face-to-face; on another, it trailed by 0.03 points.

But the headline is not the story. The story is in the methodology, the five studies, and the implications for anyone building or buying qualitative research tools.

What They Actually Did

Geiecke and Jaravel did not run a single proof-of-concept study. They ran five large studies across four distinct research domains:

Study 1: Meaning in life. 462 US respondents from Prolific, randomly assigned to either an AI-led interview or open text fields. The AI interviews produced a 148% increase in word count. They surfaced categories — like pet care and companionship (mentioned by 16% of respondents, equal to religion) — that do not appear in any standard close-ended questionnaire. The AI itself could not predict these categories when asked to generate them without the transcript data. The insights came from the conversation, not from the model's priors.

Study 2: French legislative elections. 384 respondents in the week before the June 2024 snap election. Deployed in French within days. Revealed sharp differences in voter motivation across political affiliations. 49% of respondents said they would prefer talking to an AI for their next interview.

Study 3: Educational and occupational choice. 100 US respondents exploring STEM career decisions. Personal interests dominated (81% in education, 76% in career), with hobbies — particularly video games — emerging as a major factor in STEM pathways that survey instruments rarely capture.

Study 4: Mental models of public policy. 800 US respondents. The researchers extracted 15 positive and 20 negative narratives through AI interviews, then validated them in a follow-up close-ended survey with 300 new respondents. 81% of respondents confirmed the AI-generated narrative set covered all their major reasons. This is the qual-to-quant workflow that mixed methods researchers have been trying to scale for decades.

Study 5: Voice interviews on inflation. 354 respondents using GPT-4o voice mode. 55% said they would prefer another AI interview. 54% preferred voice specifically over text.

Across all studies, they tested five different LLMs (GPT-4o, GPT-4.1, Claude Sonnet 4, Llama 3.1 405B, and Llama 4 Maverick 17B), validated both text and voice modalities, and benchmarked against trained sociologists conducting face-to-face interviews in the LSE Behavioural Lab.

Why This Matters More Than Previous Papers

This is not the first paper on AI-led interviewing. Chopra and Haaland built a multi-agent system in 2023. Cuevas et al. compared LLM interviewers to naive baselines. Wuttke et al. ran small comparisons with student interviewers.

What separates Geiecke and Jaravel is breadth and rigor:

  • Scale. Five studies, 2,100+ total respondents across multiple countries and languages.
  • Real comparison. Not student interviewers or hypothetical benchmarks — actual trained sociologists in a controlled lab setting.
  • Multiple models. Not just "GPT-4 works" but systematic testing across proprietary and open-source models.
  • Voice validation. Not just text chat but empirical testing of voice-mode interviews.
  • Venue. Accepted at the *Review of Economic Studies*. The reviewers tried to break it. It held.

The Architecture Insight

The platform uses a single LLM agent — no multi-agent orchestration, no separate models for topic-switching or safety checks. The system prompt encodes six principles from established qualitative methodology: non-directive guidance, palpable evidence collection, cognitive empathy, no assumption of views, one question per message, and staying on topic.
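To make the architecture concrete, here is a minimal sketch of what a single-agent interviewer of this kind can look like. The prompt wording, helper names, and model choice below are illustrative assumptions, not the authors' actual code; the paper's open-source platform may differ in detail.

```python
# Illustrative sketch of a single-agent AI interviewer.
# ASSUMPTIONS: the system-prompt wording and helper structure are
# hypothetical; only the six-principle design follows the paper.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = """You are an expert qualitative interviewer.
Follow these principles:
1. Non-directive guidance: let the respondent lead; do not steer toward answers.
2. Palpable evidence: probe for concrete examples and lived experience.
3. Cognitive empathy: work to understand the respondent's point of view.
4. No assumption of views: never presume what the respondent thinks.
5. Ask exactly one question per message.
6. Stay on the research topic: {topic}."""

def interview_turn(history: list[dict], topic: str) -> str:
    """Send the running transcript to a single LLM and get the next question."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT.format(topic=topic)}]
    messages += history
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content

# Usage: append each respondent answer to `history`, then call interview_turn
# again. There is no orchestration layer; the transcript itself carries state.
history = [
    {"role": "assistant", "content": "What gives your life meaning?"},
    {"role": "user", "content": "Mostly my dog, honestly."},
]
print(interview_turn(history, topic="sources of meaning in life"))
```

Note the design choice: state lives entirely in the transcript, so a single model call per turn is all the system needs.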

They tested three prompt variants (baseline, enhanced, and minimal). All three performed similarly, with the role description alone getting most of the way there. This is a significant finding: you do not need elaborate prompt engineering to get expert-level interview quality from frontier models. You need good research design and the right constraints.
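For intuition, here is an illustrative (not verbatim) rendering of how such variants might be layered, assuming the single-prompt structure sketched above:

```python
# Hypothetical prompt variants; the wording is illustrative, not the paper's text.
MINIMAL = "You are an expert qualitative interviewer exploring {topic}."

BASELINE = MINIMAL + """
Ask one question per message, probe for concrete examples,
do not assume the respondent's views, and stay on topic."""

ENHANCED = BASELINE + """
Additionally, mirror the respondent's own wording when probing,
and briefly summarize before moving to a new sub-topic."""
```

The point of the finding is that something like MINIMAL already captures most of the interview quality; the added constraints buy only a small increment.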

What This Means for Qualitative Research Teams

The validity question is answered. If your stakeholders have been asking "but is AI interviewing actually valid?" — point them to a top-5 economics journal. The methodological burden of proof has shifted. It is now on skeptics to explain why their specific use case would fail where five diverse studies succeeded.

Scale is no longer a trade-off. The traditional choice — deep qualitative insight OR large sample sizes — is dissolving. You can run 800 qualitative interviews in days, extract narratives, and validate them quantitatively with a separate sample. This is not a future possibility. It is a documented, peer-reviewed workflow.

Voice changes the equation. Respondents prefer it. They open up more. They write (speak) more. For sensitive topics, 49% preferred talking to a non-judgmental AI over a human interviewer. Voice is not a nice-to-have feature — it is a modality shift that changes what participants are willing to share.

The "qual then quant" pipeline is now empirically validated. Run AI interviews to surface narratives and categories. Validate with close-ended surveys on a fresh sample. This is the mixed-methods workflow that enterprises have been trying to operationalize, now with documented evidence that it works at scale.
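In code, the pipeline Study 4 documents reduces to two steps: extract candidate narratives from transcripts, then convert them into closed-ended items for a fresh sample. The sketch below is a hypothetical rendering under those assumptions; the function names and extraction prompt are illustrative, not the paper's implementation.

```python
# Hypothetical qual-to-quant pipeline sketch; prompt wording and function
# names are illustrative assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()

def extract_narratives(transcripts: list[str]) -> list[str]:
    """Ask an LLM to distill recurring narratives from interview transcripts."""
    joined = "\n---\n".join(transcripts)
    prompt = ("Identify the distinct narratives respondents use to explain "
              "their views. Return one short narrative per line.\n\n" + joined)
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return [line.strip("- ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()]

def to_survey_items(narratives: list[str]) -> list[dict]:
    """Turn each narrative into a closed-ended agree/disagree item
    for validation on a fresh respondent sample."""
    return [{"statement": n, "scale": ["agree", "neutral", "disagree"]}
            for n in narratives]

# Usage: field to_survey_items(extract_narratives(transcripts)) to a new
# sample and measure coverage, as Study 4 did with 300 fresh respondents.
```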

The Bigger Picture

Qualitative research has always had a scaling problem. Interviews are expensive. Recruiting is slow. Analysis is manual. The depth-versus-breadth trade-off meant that most organizations either ran small qualitative studies and hoped for generalizability, or ran large surveys and hoped the pre-coded categories captured what actually mattered.

This paper does not just validate a technology. It validates a new category of research methodology — one where qualitative depth and quantitative scale are not opposing forces but complementary steps in the same workflow.

For research teams still running 8-12 interviews and calling it a study: the bar just moved. Not because AI is cheaper (though it is), but because the evidence now shows you can have depth AND breadth AND speed without sacrificing rigor.

The question is no longer "does AI interviewing work?" The question is "what are you still doing manually that does not need to be?"


*The full paper is available at SSRN. The open-source platform is on GitHub. The LSE Impact Blog has an accessible summary by the authors.*

*Qualz.ai provides AI-native qualitative research with automated interviews, multi-lens analysis, and exportable reports — the productized version of what this research validates. Book a demo to see it in action.*

