
Anthropic – Introducing Contextual Retrieval

We're building a corporate RAG for a government entity. What I've learned so far by applying an experimental A/B testing approach to RAG using RAGAS metrics:

- Hybrid retrieval (semantic + vector) followed by LLM-based reranking made no significant change when measured on synthetic eval questions

- HyDE severely decreased both answer quality and retrieval quality when measured with RAGAS on synthetic eval questions

(we still have to do a RAGAS eval using expert and real user questions)

So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But no single method always wins. We found the semantic search of Azure AI Search sufficient as a second method next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. It depends on the use case. Test, test, test.
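As an illustration of combining two retrieval methods, here's a minimal sketch of fusing two ranked result lists with reciprocal rank fusion (RRF); the helper name and the k=60 constant are illustrative defaults, not anything specific to Azure AI Search:

  # Minimal reciprocal rank fusion (RRF) sketch for hybrid retrieval.
  # `bm25_ranked` and `vector_ranked` are lists of document IDs, best first.
  def rrf_fuse(bm25_ranked, vector_ranked, k=60):
      scores = {}
      for ranked in (bm25_ranked, vector_ranked):
          for rank, doc_id in enumerate(ranked):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
      return sorted(scores, key=scores.get, reverse=True)

  # Example: documents found by both retrievers bubble to the top.
  print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))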

Next things we're going to try:

- RAPTOR

- SelfRAG

- Agentic RAG

- Query Refinement (expansion and sub-queries)

- GraphRAG

Learning so far:

- Always compare a baseline against an experiment and try to refute your null hypothesis using measures like RAGAS or others (see the sketch after this list).

- Use three types of evaluation questions/answers: 1. expert-written Q&A, 2. real user questions (from logs), 3. synthetic Q&A generated from your source documents
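A minimal sketch of such a baseline-vs-experiment evaluation with the ragas library; the column names vary between ragas versions, and the example rows are made up:

  # Hedged sketch of a RAGAS evaluation run; run it once for the baseline
  # pipeline and once for the experiment, then compare the metric scores.
  from datasets import Dataset
  from ragas import evaluate
  from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

  eval_data = Dataset.from_dict({
      "question": ["What is the usual adult dose?"],
      "answer": ["One or two 200mg tablets, three times a day."],
      "contexts": [["The usual dose for adults is one or two 200mg tablets 3 times a day."]],
      "ground_truth": ["One or two 200mg tablets or capsules 3 times a day."],
  })

  print(evaluate(eval_data, metrics=[faithfulness, answer_relevancy,
                                     context_precision, context_recall]))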

7 minutes ago | underlines

To add some context, this isn't that novel an approach. A common way to improve RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can further improve your results by running query expansion with HyDE[1], though it's not always an improvement. I use it as a fallback.
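For reference, a minimal sketch of HyDE as described in [1]: generate a hypothetical answer document for the query and embed that instead of the raw query. The complete() and embed() helpers are placeholders for whatever LLM and embedding model you use, not any particular API:

  # HyDE sketch: retrieve with the embedding of a hypothetical answer document.
  # `complete` and `embed` are placeholder helpers, not a specific API.
  def hyde_query_embedding(query, complete, embed):
      hypothetical_doc = complete(
          f"Write a short passage that plausibly answers the question:\n{query}"
      )
      # Retrieval then uses this vector instead of embed(query).
      return embed(hypothetical_doc)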

I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".

The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.

However, other than that, the only thing I see introduced is a cookbook on how to do a particular RAG workflow.

As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, unlike anything offered by other providers. I highly recommend it.

1: https://arxiv.org/abs/2212.10496

2 hours ago | postalcoder

I was trying to do this with prompt caching about a month ago, but then noticed there's a five-minute maximum lifetime for cached prompts - that doesn't really work for my RAG needs (or probably most), where the queries would be run over the next month or year. I can't see any changes to that policy, so I'm a little surprised to see them talk about prompt caching in relation to RAG.

12 minutes ago | bayesianbot

I think the innovation is using caching so as to make the cost of the approach manageable. The way they implemented it is that each time you create a chunk, you ask the LLM to create an atomic chunk from the whole document context. You need to do this for all tens of thousands of chunks in your data, which costs a lot. By caching the documents, you can cut those costs.
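A minimal sketch of that flow with the Anthropic Python SDK; the model name and prompt wording are illustrative, and depending on your SDK version prompt caching may still require a beta header:

  # Sketch: generate a situating context per chunk while caching the full document,
  # so repeated calls over the same document only pay full input price once.
  import anthropic

  client = anthropic.Anthropic()

  def contextualize_chunk(document, chunk):
      response = client.messages.create(
          model="claude-3-haiku-20240307",
          max_tokens=150,
          system=[
              {"type": "text", "text": "You situate chunks within a document."},
              # The long document is marked for caching.
              {"type": "text", "text": document,
               "cache_control": {"type": "ephemeral"}},
          ],
          messages=[{
              "role": "user",
              "content": f"Here is a chunk:\n{chunk}\n\n"
                         "Write one or two sentences situating it within the document.",
          }],
      )
      return response.content[0].text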

2 hours ago | resiros

You could also just save the first atomic chunk that's generated, store it, and reuse it each time yourself. Easier and more consistent.

2 hours ago | skeptrune

To be fair, that only works if you keep chunk windows static.

2 hours ago | postalcoder

Yup. Caching is very nice, but the framing is weird. "Introducing", to me, connotes a product release, not a new tutorial.

2 hours ago | postalcoder

I'm not a fan of this technique. I agree the scenario they lay out is a common problem, but the proposed solution feels odd.

Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cost you a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.

You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation-wise, it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.

Description of "semantic boost" in the Trieve API[1]:

>semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and the distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.

[1]:https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
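A minimal sketch of that weighted move in plain numpy, assuming unquantized dense vectors; this is just the geometric idea, not Trieve's implementation:

  # Sketch of a "semantic boost": move a chunk embedding part of the way
  # towards a boost phrase embedding (distance_factor * L2 distance along the line).
  import numpy as np

  def semantic_boost(chunk_vec, phrase_vec, distance_factor=0.25):
      chunk_vec = np.asarray(chunk_vec, dtype=float)
      phrase_vec = np.asarray(phrase_vec, dtype=float)
      direction = phrase_vec - chunk_vec
      if np.linalg.norm(direction) == 0.0:
          return chunk_vec
      # Stepping distance_factor of the way along the line is exactly
      # distance_factor * L2Distance towards the phrase point.
      return chunk_vec + distance_factor * direction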

4 hours ago | skeptrune

Sorry, random question - do vector DBs work across models? I'd guess no, since embeddings are model-specific AFAIK, but that means a vector DB would lock you into a single LLM, and even within that, a single version, like Claude 3.5 Sonnet - you couldn't move to 3.5 Haiku, Opus etc., never mind ChatGPT or Llama, without reindexing.

27 minutes ago | torginus

The technique I find most useful is to implement a "linked list" strategy where a chunk keeps multiple pointers to the entry it is referenced by. This is done manually, but the diversity of ways you can reference a particular node goes up dramatically.

Another way to look at it: comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance and others farther, due to the perceptions of the comment authors themselves. But if you assign each comment a "parent_id", your access to the post multiplies.

You can see an example of this technique here [1]. I don't attempt to mind-read what the end user will query for; I simply let them tell me, and then index that as a pointer. There are only a finite number of ways to represent a given object, but some representations are very, very, very far from the semantic meaning of the core object.

[1] - https://x.com/yourcommonbase/status/1833262865194557505
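A minimal sketch of that pointer idea with an in-memory store; a search hit on any manually added pointer chunk resolves back to its parent entry via parent_id (all names here are illustrative):

  # Sketch: index user-supplied "pointer" chunks alongside source entries,
  # and resolve any pointer hit back to the entry it references via parent_id.
  entries = {"post-1": "Original post about contextual retrieval."}
  pointers = [
      {"id": "c1", "parent_id": "post-1", "text": "This is basically chunk expansion."},
      {"id": "c2", "parent_id": "post-1", "text": "Prompt caching makes this affordable."},
  ]

  def resolve(hit_ids):
      by_id = {p["id"]: p for p in pointers}
      return {entries[by_id[h]["parent_id"]] for h in hit_ids if h in by_id}

  # A match on comment "c2" surfaces the original post.
  print(resolve(["c2"]))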

2 hours ago | _bramses

My favorite thing about this is the way it takes advantage of prompt caching.

That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.

I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.

My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...

4 hours ago | simonw

You could do a lot of stuff by pre-calculating things for your embeddings. Why cache when you can pre-calculate? That brings into play a whole lot of things people commonly do as part of ETL.

I come from a traditional search background. It's quite obvious to me that RAG is a bit of a naive strategy if you limit it to vector search with some off-the-shelf embedding model. Vector search simply isn't that good. You need additional information-retrieval strategies if you want to improve the context you provide to the LLM, and that is effectively what they are doing here.

Microsoft published an interesting paper on GraphRAG some time ago, where they combine vector-search-based RAG with a conceptual graph that they construct from the indexed data using entity extraction. This allows them to pull in contextually relevant information for matching chunks.

I have a hunch that you could probably get quite far without doing any vector search at all. It would be a lot cheaper too. Simply use a traditional search engine and some tuned queries. The trick, of course, is query tuning, which may not work that well for general-purpose use cases but could work for more specialized ones.

3 hours ago | jillesvangurp

I have experience in traditional search as well, and I think it limits my imagination when it comes to vector search. In the post, I did like the introduction of Contextual BM25, compared to other hybrid approaches that then do RRF.

For question answering, vector/semantic search is clearly a better fit in my mind, and I can see how the contextual models can enable and bolster that. However, because I've implemented and used so many keyword-based systems, that just doesn't seem to be how my brain works.

An example I'm thinking of is finding a sushi restaurant near me with availability this weekend around dinner time. I'd love to be able to search for this as I've written it. How I would actually search for it is: search for "sushi restaurant", sort by distance, and hope the application does a proper job of surfacing time filtering.

Conversely, this is mostly how I would build such a system, perhaps with a layer to determine user intent and pull out restaurant type, location sorting, and time filtering.

I could see using semantic search to narrow the restaurants down to those related to sushi, but do we then drop back into traditional search for filtering and sorting? Use function calling to have the LLM parameterize our search query?
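A minimal sketch of that last idea: a tool schema the LLM could fill in, with the resulting arguments becoming filters and sorts in the traditional search backend. The schema, field names, and example arguments are made up for illustration; the actual LLM call is omitted:

  # Sketch: let the LLM parameterize a structured search via a tool/function schema.
  search_restaurants_tool = {
      "name": "search_restaurants",
      "description": "Find restaurants matching cuisine, distance and availability.",
      "parameters": {
          "type": "object",
          "properties": {
              "cuisine": {"type": "string"},
              "max_distance_km": {"type": "number"},
              "available_from": {"type": "string", "format": "date-time"},
              "available_to": {"type": "string", "format": "date-time"},
          },
          "required": ["cuisine"],
      },
  }

  # What the model might return for "sushi near me with availability Saturday dinner";
  # these arguments then drive filtering and sorting in the search engine.
  args = {"cuisine": "sushi", "max_distance_km": 5,
          "available_from": "2024-09-21T18:00", "available_to": "2024-09-21T21:00"}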

As stated, perhaps I'm not thinking about this the right way because of my experience with existing systems, which I find tend to give me better results when well built.

2 hours ago | TmpstsTrrctta

GraphRAG requires you to define the schema of entity and relation types upfront. This works when you are in a known domain, but in general, when you just want to answer questions from a large reference, you don't know what you need to put in the graph.

an hour ago | visarga

GraphRAG is very cool and outstanding at filling some niches. IIRC, Perplexity's actual search is just BM25 (based on a Lex Fridman interview with the founder).

2 hours ago | postalcoder

Makes sense; Perplexity is usually really responsive and fast.

I need to check out that interview with Lex Fridman.

2 hours ago | jillesvangurp

Do you have the link and the time in the video where he mentions it?

2 hours ago | _hfqa

I don't know anything about AI, but I've always wished I could just upload a bunch of documents/books and the AI would perform some basic keyword searches to figure out what is relevant, then auto-include that in the prompt.

an hour ago | vendiddy

It would help if you tried NotebookLM by Google. It does this: you can upload a document, PDF, whatever, and ask questions. The model replies and also gives you a reference to your material.

42 minutes ago | average_r_user

We're doing something similar. We first chunk the documents based on h1, h2, and h3 headings. Then we add the headers at the beginning of the chunk as context. As an imaginary example, instead of one chunk being:

  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
It is now something like:

  # Fever
  ## Treatment
  ---
  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
This seems to work pretty well, and doesn't require any LLMs when indexing documents.
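A minimal sketch of that heading-prefixed chunking for markdown input; the splitting is deliberately simplistic (one chunk per heading section, no size limits) and the function name is just for illustration:

  # Sketch: chunk a markdown document by headings and prepend the heading path.
  def chunk_with_headings(markdown_text):
      chunks, path, body = [], [], []
      def flush():
          if body:
              chunks.append("\n".join(path) + "\n---\n" + "\n".join(body).strip())
              body.clear()
      for line in markdown_text.splitlines():
          if line.startswith("#"):
              flush()
              level = len(line) - len(line.lstrip("#"))
              path[:] = path[:level - 1] + [line.strip()]
          else:
              body.append(line)
      flush()
      return chunks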


3 hours ago | valstu

I am working on question answering over long documents / bundles of documents, 100+ pages, and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to provide a hierarchical index, which organizes the whole bundle into a tree. At query time I include the path in the tree as additional context.
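A minimal sketch of that indexing flow; the llm() helper stands in for whatever completion call you use, and the tree and path formats are assumptions:

  # Sketch: per-page summaries -> LLM-built hierarchical index -> path used at query time.
  def build_index(pages, llm):
      summaries = [llm(f"Give a title, short summary and subsections for this page:\n{p}")
                   for p in pages]
      # Ask the model to organize all page summaries into a tree (e.g. nested JSON).
      tree = llm("Organize these page summaries into a hierarchical index:\n\n"
                 + "\n\n".join(summaries))
      return summaries, tree

  def with_tree_context(tree_path, chunk):
      # At query time, prepend the chunk's path in the tree as extra context.
      return " > ".join(tree_path) + "\n---\n" + chunk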

an hour ago | visarga

Did you experiment with different ways to format those included headers? Asking because I am doing something similar to that as well.

2 hours ago | cabidaher

Nope, not yet. We have stuck with markdown-ish syntax so far.

2 hours ago | valstu

This sounds a lot like how we used to do research, by reading books and writing any interesting quotes on index cards, along with where they came from. I wonder if prompting for that would result in better chunks? It might make it easier to review if you wanted to do it manually.

6 hours ago | skybrian

The fundamental problem with both keyword- and embedding-based retrieval is that they only access surface-level features. If your document contains "5+5" and you search for "where is the result 10", you won't find the answer. That is why texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.

"study your data before indexing it"

an hour ago | visarga

I guess this does give some insight: using a more space-efficient language for your codebase will mean more functionality fits in the AI's context window when working with Claude and code.