Generative AI Recommendations:
The Illusion of Progress

Why most LLMs fail in real-world TV personalization

ABOUT THIS PAPER

Benchmarking 50 LLMs for content recommendations

ContentWise has spent more than 20 years benchmarking recommender systems, comparing every significant academic and commercial approach against industrial performance criteria. When the Netflix Prize winner was announced in 2010, ContentWise published its finding that the solution was useless for production. Netflix confirmed it two years later.

This paper applies the same benchmarking discipline to LLMs. The findings are specific, evidence-based, and directly actionable for operators building or upgrading recommendation infrastructure today.

ABOUT THIS PAPER

Two dimensions, four combinations

First, you can use the LLM out of the box, without fine-tuning, or you can fine-tune it by feeding your catalog of content and user behavioral data into the LLM, modifying its weights so it knows your user base much better. This comes at an additional cost.

Second, you can use the LLM alone, or you can use it together with a traditional recommender system. These two dimensions give you four combinations.

WHAT'S INSIDE

Answer this question: Why 95% of LLMs fail in production?

Where LLMs can and cannot replace traditional recommender systems, with performance data, not opinions
Six positions in the recommendation pipeline where LLMs apply, with maturity ratings for each
Five deployment scenarios mapped by cost, quality, and integration complexity
Benchmarking results across nearly 50 state-of-the-art models, categorized by architectural approach
Why 95% of deep learning recommendation techniques fail when moved to production systems
The one combination that consistently delivers quality recommendations at production cost
Independent validation from OpenAI, Amazon, WeChat, and Huawei research