Finding Copernicus - Exploring RAG limitations in context-rich documents

#rag #ai #llm #context #retrieval

A seemingly simple movie trivia question exposes fundamental limitations in current AI systems.

I recently started working on a small “movie chat” project using RAG to answer trivia questions about movies. The idea was simple: load a movie script and have an AI agent answer questions about it. However, two seemingly straightforward questions consistently stumped the system:

  • In Back to the Future, what was Doc Brown’s dog’s name in 1955?
  • In Star Wars Episode IV, how many planets does Luke visit?

These failures revealed something important: movie scripts are context-dependent and time-dependent in ways that current RAG systems struggle to handle.

Introduction to RAG

Retrieval Augmented Generation (RAG) is a technique for giving AI agents access to knowledge from external sources such as documents. The standard process involves:

  1. Splitting documents into chunks
  2. Encoding each chunk as embeddings (vectors)
  3. Storing these embeddings in a vector database
  4. Searching for relevant chunks using similarity matching

This approach works well for documents where each section contains independent, factual information. For example, if a document contains Thoreau’s passage about walking, a search for “hiking” can surface it through semantic similarity, even though the word “hiking” never appears in the text. The vector search recognizes the conceptual connection between hiking and walking.
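
To make the four steps concrete, here is a minimal sketch using the sentence-transformers library. The model name and the two sample chunks are my own illustrative choices (a walking-themed sentence stands in for the Thoreau passage), not artifacts from the movie-chat project.

```python
# Minimal RAG retrieval sketch with sentence-transformers
# (pip install sentence-transformers; model choice is illustrative).
from sentence_transformers import SentenceTransformer, util

# 1. Chunks (here, hand-picked sentences instead of a split document)
chunks = [
    "An early-morning walk is a blessing for the whole day.",
    "The flux capacitor is what makes time travel possible.",
]

# 2. Encode each chunk as an embedding vector
model = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vecs = model.encode(chunks)

# 3. A real system would persist chunk_vecs in a vector database.

# 4. Search: embed the query and rank chunks by cosine similarity
query_vec = model.encode("hiking")
scores = util.cos_sim(query_vec, chunk_vecs)
print(chunks[scores.argmax().item()])
# The walking sentence scores highest despite zero keyword overlap.
```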

The Movie Script Problem

Movie scripts present unique challenges for RAG systems. Unlike technical manuals or encyclopedias, where sections largely stand alone, understanding any scene in a movie requires knowing what happened in the scenes before it: much of the meaning of what you are reading depends on earlier sections.

Lack of Context

Consider the Back to the Future question: “What was Doc Brown’s dog’s name in 1955?”

When the RAG system searches for relevant chunks, it finds many references to Doc Brown’s dog throughout the script. However, most of these chunks mention “Einstein” - the dog’s name in 1985. The system has no way to distinguish between the 1985 timeline and the 1955 timeline without broader context. The correct answer, “Copernicus,” appears in the 1955 scenes, but the chunks don’t carry enough temporal context to make this distinction.

[Figure: a RAG system struggling with temporal context]

It’s easy to come up with the required context when you know the question, but it’s much harder to design a generic solution a priori. How do you chunk a document in a way that preserves temporal, geographical, and relational context?
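
One direction worth sketching is to prepend a contextual header to each chunk before embedding it, so that temporal cues travel with the text. The scene metadata and the script line below are hand-written for illustration, not output from any real chunker:

```python
# Hypothetical sketch: attach scene-level context to a chunk before embedding,
# so the 1955/1985 distinction survives chunking.
scene_context = "Back to the Future - year 1955 - Doc Brown's mansion"
raw_chunk = 'DOC: "Copernicus! Come here, boy!"'
contextualized_chunk = f"[Context: {scene_context}]\n{raw_chunk}"
# Embed contextualized_chunk instead of raw_chunk; a query mentioning 1955
# now has a chance of matching on the temporal cue.
```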

Lack of Indirect Relationships

The Star Wars question presents another challenge: “How many planets does Luke visit?”

Answering this requires understanding that Luke travels aboard the Millennium Falcon and identifying all locations where that spacecraft lands. This involves connecting multiple pieces of information across different chunks:

  • Luke is a passenger on the Millennium Falcon
  • The Millennium Falcon visits various planets
  • Therefore, Luke visits those planets

[Figure: understanding indirect relationships in narratives]

Standard vector search struggles to establish these indirect connections. You need to identify all occurrences of related events and piece the narrative together.
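
To see why this is a multi-hop problem, consider a toy set of relationship triples (hand-written for illustration, not extracted by any tool). The answer only emerges by composing two relations that never co-occur in a single chunk:

```python
# Toy multi-hop inference: no single triple says "Luke visits Yavin 4";
# the fact follows from composing passenger_on with lands_on.
triples = [
    ("LUKE", "passenger_on", "MILLENNIUM FALCON"),
    ("MILLENNIUM FALCON", "lands_on", "YAVIN 4"),
    ("MILLENNIUM FALCON", "lands_on", "TATOOINE"),
]

ships = {obj for subj, rel, obj in triples
         if subj == "LUKE" and rel == "passenger_on"}
planets = {obj for subj, rel, obj in triples
           if subj in ships and rel == "lands_on"}
print(planets)  # derived by composition, stated nowhere in the triples
```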

Graph RAG

Microsoft Research’s GraphRAG approach attempts to address these limitations by using an LLM to process each chunk and extract interesting entities and relationships into a graph. Instead of just storing text chunks, the system builds a knowledge graph of entities (people, places, things) and their relationships.

I tested this approach using the open-source R2R project with Neo4j as the graph database. The system successfully extracted entities and relationships:

  • Entity: BROWN (Doc Brown)
  • Entity: EINSTEIN (dog)
  • Entity: COPERNICUS (dog)
  • Relationship: BROWN owns EINSTEIN
  • Relationship: BROWN owns COPERNICUS

However, even with these relationships mapped, we’re still lacking some key context to figure out which dog should be the answer for 1955. The graph knows about both dogs and their connection to Doc Brown, but it doesn’t capture the temporal dimension - which dog belongs to which time period.
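
A tiny illustration of the gap: with only the ownership edges above, any query for Brown’s dog is ambiguous. Time-qualified edges would resolve it, but that is a speculative extension on my part, not something the extraction produced:

```python
# Edges copied from the extraction above: both dogs are equally valid answers.
owns = [("BROWN", "EINSTEIN"), ("BROWN", "COPERNICUS")]
print([dog for owner, dog in owns if owner == "BROWN"])
# -> ['EINSTEIN', 'COPERNICUS']: no way to pick the 1955 dog

# Speculative fix: qualify each edge with the time period in which it holds.
owns_when = [("BROWN", "EINSTEIN", 1985), ("BROWN", "COPERNICUS", 1955)]
print([dog for owner, dog, year in owns_when
       if owner == "BROWN" and year == 1955])
# -> ['COPERNICUS']
```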

The Need for Better Contextualization

Both standard RAG and Graph RAG prove insufficient without stronger contextual frameworks. We need a more holistic approach to context retrieval.

Think about how humans read and remember narratives. When reading this paragraph, you don’t remember the exact words of the previous ones. Instead, you remember the broad context: the themes, the timeline, the characters involved, and how they relate to each other.

Different document types require different contextual models:

  • Movies/Stories: Temporal models (timeline of events, character development)
  • Travel Guides: Geographical models (spatial relationships, proximity)
  • Legal Documents: Version-based models (amendments, historical changes)
  • Scientific Papers: Domain-specific spatial models (relationships between concepts)

Next Steps

I believe we need to explore novel approaches to RAG and chunk contextualization. Some ideas:

  1. Create benchmarks: Develop a benchmark made of movie scripts and trivia questions that test temporal, geographical, and relationship understanding.

  2. Contextualized chunks: Rather than treating chunks as isolated units, maintain their relationship to the broader document structure. Use LLMs to generate contextual summaries that travel with each chunk.

  3. Multi-modal retrieval: Combine vector search, graph relationships, and temporal/spatial models to provide richer context.

  4. Hierarchical context: Build multiple layers of context - from sentence level to paragraph, scene, act, and entire document - allowing the system to zoom in and out as needed (a rough sketch follows below).
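
As a rough sketch of idea 4, each chunk could carry pre-generated summaries at several zoom levels. The dataclass and the summaries below are hand-written illustrations (in a real pipeline an LLM would generate them), not output from any existing system:

```python
from dataclasses import dataclass

@dataclass
class ContextualChunk:
    text: str            # the raw chunk from the script
    scene_summary: str   # what happens in this scene
    act_summary: str     # where the scene sits in the act
    doc_summary: str     # one-line synopsis of the whole script

chunk = ContextualChunk(
    text='DOC: "Copernicus! Come here, boy!"',
    scene_summary="1955: Marty meets the younger Doc Brown at his mansion.",
    act_summary="Stranded in 1955, Marty seeks Doc Brown's help to get home.",
    doc_summary="A teenager accidentally travels from 1985 back to 1955.",
)
# At query time, retrieve on text plus whichever summary level disambiguates
# the question: here, the scene summary carries the year.
```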

I’m curious to hear from others working on similar challenges. Have you encountered other projects or techniques that might solve this “Copernicus challenge”? How are you handling context-dependent information retrieval in your RAG systems?

If you’re interested in following along with this exploration, subscribe to my updates. I’ll be sharing progress on the OpenGPA project and diving deeper into these contextualization techniques.