
Anthropic – Introducing Contextual Retrieval

We're building a corporate RAG for a government entity. What I've learned so far by applying an experimental A/B testing approach to RAG using RAGAS metrics:

- Hybrid retrieval (semantic + vector) followed by LLM-based reranking made no significant change when measured on synthetic eval questions

- HyDE severely decreased both answer quality and retrieval quality when measured with RAGAS on synthetic eval questions

(we still have to do a RAGAS eval using expert and real user questions)

So yes, hybrid retrieval is always good - that's no news to anyone building production-ready or enterprise RAG solutions. But no single method always wins. We found the semantic search of Azure AI Search sufficient as a second method next to vector similarity. Others might find BM25 great, or a fine-tuned query post-processing SLM. It depends on the use case. Test, test, test.
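As an illustration of combining two retrieval methods, here's a minimal sketch of fusing two ranked result lists with reciprocal rank fusion (RRF); the helper name and the k=60 constant are illustrative defaults, not anything specific to Azure AI Search:

  # Minimal reciprocal rank fusion (RRF) sketch for hybrid retrieval.
  # `bm25_ranked` and `vector_ranked` are lists of document IDs, best first.
  def rrf_fuse(bm25_ranked, vector_ranked, k=60):
      scores = {}
      for ranked in (bm25_ranked, vector_ranked):
          for rank, doc_id in enumerate(ranked):
              scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
      return sorted(scores, key=scores.get, reverse=True)

  # Example: documents found by both retrievers bubble to the top.
  print(rrf_fuse(["a", "b", "c"], ["c", "a", "d"]))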

Next things we're going to try:

- RAPTOR

- SelfRAG

- Agentic RAG

- Query Refinement (expansion and sub-queries)

- GraphRAG

Learning so far:

- Always compare a baseline against an experiment and try to refute your null hypothesis using measures like RAGAS or others (see the sketch after this list).

- Use three types of evaluation questions/answers: 1. expert-written Q&A, 2. real user questions (from logs), 3. synthetic Q&A generated from your source documents
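A minimal sketch of such a baseline-vs-experiment evaluation with the ragas library; the column names vary between ragas versions, and the example rows are made up:

  # Hedged sketch of a RAGAS evaluation run; run it once for the baseline
  # pipeline and once for the experiment, then compare the metric scores.
  from datasets import Dataset
  from ragas import evaluate
  from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

  eval_data = Dataset.from_dict({
      "question": ["What is the usual adult dose?"],
      "answer": ["One or two 200mg tablets, three times a day."],
      "contexts": [["The usual dose for adults is one or two 200mg tablets 3 times a day."]],
      "ground_truth": ["One or two 200mg tablets or capsules 3 times a day."],
  })

  print(evaluate(eval_data, metrics=[faithfulness, answer_relevancy,
                                     context_precision, context_recall]))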

7 minutes ago | underlines

To add some context, this isn't that novel an approach. A common way to improve RAG results is to "expand" the underlying chunks using an LLM, so as to increase the semantic surface area to match against. You can further improve your results by running query expansion with HyDE[1], though it's not always an improvement. I use it as a fallback.
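For reference, a minimal sketch of HyDE as described in [1]: generate a hypothetical answer document for the query and embed that instead of the raw query. The complete() and embed() helpers are placeholders for whatever LLM and embedding model you use, not any particular API:

  # HyDE sketch: retrieve with the embedding of a hypothetical answer document.
  # `complete` and `embed` are placeholder helpers, not a specific API.
  def hyde_query_embedding(query, complete, embed):
      hypothetical_doc = complete(
          f"Write a short passage that plausibly answers the question:\n{query}"
      )
      # Retrieval then uses this vector instead of embed(query).
      return embed(hypothetical_doc)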

I'm not sure what Anthropic is introducing here. I looked at the cookbook code and it's just showing the process of producing said context, but there's no actual change to their API regarding "contextual retrieval".

The one change is prompt caching, introduced a month back, which allows you to very cheaply add better context to individual chunks by providing the entire (long) document as context. Caching is an awesome feature to expose to developers and I don't want to take anything away from that.

However, other than that, the only thing I see introduced is a cookbook on how to do a particular RAG workflow.

As an aside, Cohere may be my favorite API to work with (no affiliation). Their RAG API is a delight, unlike anything offered by other providers. I highly recommend it.

1: https://arxiv.org/abs/2212.10496

2 hours ago | postalcoder

I was trying to do this with prompt caching about a month ago, but then noticed there's a five-minute maximum lifetime for cached prompts - that doesn't really work for my RAG needs (or probably most), where the queries would be run over the next month or year. I can't see any changes to that policy, so I'm a little surprised to see them talk about prompt caching in relation to RAG.

12 minutes ago | bayesianbot

I think the innovation is using caching so as to make the cost of the approach manageable. The way they implemented it is that each time you create a chunk, you ask the LLM to create an atomic chunk from the whole document context. You need to do this for all tens of thousands of chunks in your data, which costs a lot. By caching the documents, you can cut those costs.
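A minimal sketch of that flow with the Anthropic Python SDK; the model name and prompt wording are illustrative, and depending on your SDK version prompt caching may still require a beta header:

  # Sketch: generate a situating context per chunk while caching the full document,
  # so repeated calls over the same document only pay full input price once.
  import anthropic

  client = anthropic.Anthropic()

  def contextualize_chunk(document, chunk):
      response = client.messages.create(
          model="claude-3-haiku-20240307",
          max_tokens=150,
          system=[
              {"type": "text", "text": "You situate chunks within a document."},
              # The long document is marked for caching.
              {"type": "text", "text": document,
               "cache_control": {"type": "ephemeral"}},
          ],
          messages=[{
              "role": "user",
              "content": f"Here is a chunk:\n{chunk}\n\n"
                         "Write one or two sentences situating it within the document.",
          }],
      )
      return response.content[0].text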

2 hours ago | resiros

You could also just save the first atomic chunk that's generated, store it, and reuse it each time yourself. Easier and more consistent.

2 hours ago | skeptrune

To be fair, that only works if you keep chunk windows static.

2 hours ago | postalcoder

Yup. Caching is very nice, but the framing is weird. "Introducing", to me, connotes a product release, not a new tutorial.

2 hours ago | postalcoder

I'm not a fan of this technique. I agree the scenario they lay out is a common problem, but the proposed solution feels odd.

Vector embeddings have bag-of-words compression properties and can over-index on the first newline-separated text block, to the extent that certain indices in the resulting vector end up much closer to 0 than they otherwise would. With quantization, they can eventually become 0 and cost you a lot of precision in the dense vectors. IDF search overcomes this to some extent, but not enough.

You can "semantically boost" embeddings such that they move closer to your document's title, summary, abstract, etc. and get the recall benefits of this "context" prepend without polluting the underlying vector. Implementation-wise, it's a weighted sum. During the augmentation step where you put things in the context window, you can always inject the summary chunk when the doc matches as well. Much cleaner solution imo.

Description of "semantic boost" in the Trieve API[1]:

>semantic_boost: Semantic boost is useful for moving the embedding vector of the chunk in the direction of the distance phrase. I.e. you can push a chunk with a chunk_html of "iphone" 25% closer to the term "flagship" by using the distance phrase "flagship" and a distance factor of 0.25. Conceptually it's drawing a line (euclidean/L2 distance) between the vector for the innerText of the chunk_html and the distance_phrase, then moving the vector of the chunk_html distance_factor * L2Distance closer to or away from the distance_phrase point along the line between the two points.

[1]:https://docs.trieve.ai/api-reference/chunk/create-or-upsert-...
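A minimal sketch of that weighted move in plain numpy, assuming unquantized dense vectors; this is just the geometric idea, not Trieve's implementation:

  # Sketch of a "semantic boost": move a chunk embedding part of the way
  # towards a boost phrase embedding (distance_factor * L2 distance along the line).
  import numpy as np

  def semantic_boost(chunk_vec, phrase_vec, distance_factor=0.25):
      chunk_vec = np.asarray(chunk_vec, dtype=float)
      phrase_vec = np.asarray(phrase_vec, dtype=float)
      direction = phrase_vec - chunk_vec
      if np.linalg.norm(direction) == 0.0:
          return chunk_vec
      # Stepping distance_factor of the way along the line is exactly
      # distance_factor * L2Distance towards the phrase point.
      return chunk_vec + distance_factor * direction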

4 hours ago | skeptrune

Sorry, random question - do vector DBs work across models? I'd guess no, since embeddings are model-specific AFAIK, but that means a vector DB would lock you into a single LLM, and even within that, a single version, like Claude 3.5 Sonnet - you couldn't move to 3.5 Haiku, Opus etc., never mind ChatGPT or Llama, without reindexing.

27 minutes ago | torginus

The technique I find most useful is to implement a "linked list" strategy where a chunk keeps multiple pointers to the entry it is referenced by. This is done manually, but the diversity of ways you can reference a particular node goes up dramatically.

Another way to look at it: comments. Imagine every comment under this post is a pointer back to the original post. Some will be close in distance and others farther, due to the perceptions of the comment authors themselves. But if you assign each comment a "parent_id", your access to the post multiplies.

You can see an example of this technique here [1]. I don't attempt to mind-read what the end user will query for; I simply let them tell me, and then index that as a pointer. There are only a finite number of ways to represent a given object, but some representations are very, very, very far from the semantic meaning of the core object.

[1] - https://x.com/yourcommonbase/status/1833262865194557505
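A minimal sketch of that pointer idea with an in-memory store; a search hit on any manually added pointer chunk resolves back to its parent entry via parent_id (all names here are illustrative):

  # Sketch: index user-supplied "pointer" chunks alongside source entries,
  # and resolve any pointer hit back to the entry it references via parent_id.
  entries = {"post-1": "Original post about contextual retrieval."}
  pointers = [
      {"id": "c1", "parent_id": "post-1", "text": "This is basically chunk expansion."},
      {"id": "c2", "parent_id": "post-1", "text": "Prompt caching makes this affordable."},
  ]

  def resolve(hit_ids):
      by_id = {p["id"]: p for p in pointers}
      return {entries[by_id[h]["parent_id"]] for h in hit_ids if h in by_id}

  # A match on comment "c2" surfaces the original post.
  print(resolve(["c2"]))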

2 hours ago | _bramses

My favorite thing about this is the way it takes advantage of prompt caching.

That's priced at around 1/10th of what the prompts would normally cost if they weren't cached, which means that tricks like this (running every single chunk against a full copy of the original document) become feasible where previously they wouldn't have made financial sense.

I bet there are all sorts of other neat tricks like this which are opened up by caching cost savings.

My notes on contextual retrieval: https://simonwillison.net/2024/Sep/20/introducing-contextual... and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-caching-with-cl...

4 hours ago | simonw

You could do a lot of stuff by pre-calculating things for your embeddings. Why cache when you can pre-calculate? That brings into play a whole lot of things people commonly do as part of ETL.

I come from a traditional search background. It's quite obvious to me that RAG is a bit of a naive strategy if you limit it to vector search with some off-the-shelf embedding model. Vector search simply isn't that good. You need additional information-retrieval strategies if you want to improve the context you provide to the LLM, and that is effectively what they are doing here.

Microsoft published an interesting paper on GraphRAG some time ago, where they combine vector-search-based RAG with a conceptual graph that they construct from the indexed data using entity extraction. This allows them to pull in contextually relevant information for matching chunks.

I have a hunch that you could probably get quite far without doing any vector search at all. It would be a lot cheaper too. Simply use a traditional search engine and some tuned queries. The trick, of course, is query tuning, which may not work that well for general-purpose use cases but could work for more specialized ones.

3 hours ago | jillesvangurp

I have experience in traditional search as well, and I think it limits my imagination when it comes to vector search. In the post, I did like the introduction of Contextual BM25, compared to other hybrid approaches that then do RRF.

For question answering, vector/semantic search is clearly a better fit in my mind, and I can see how the contextual models can enable and bolster that. However, because I've implemented and used so many keyword-based systems, that just doesn't seem to be how my brain works.

An example I'm thinking of is finding a sushi restaurant near me with availability this weekend around dinner time. I'd love to be able to search for this as I've written it. How I would actually search for it is: search for "sushi restaurant", sort by distance, and hope the application does a proper job of surfacing time filtering.

Conversely, this is mostly how I would build such a system, perhaps with a layer to determine user intent and pull out restaurant type, location sorting, and time filtering.

I could see using semantic search to narrow the restaurants down to those related to sushi, but do we then drop back into traditional search for filtering and sorting? Use function calling to have the LLM parameterize our search query?
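A minimal sketch of that last idea: a tool schema the LLM could fill in, with the resulting arguments becoming filters and sorts in the traditional search backend. The schema, field names, and example arguments are made up for illustration; the actual LLM call is omitted:

  # Sketch: let the LLM parameterize a structured search via a tool/function schema.
  search_restaurants_tool = {
      "name": "search_restaurants",
      "description": "Find restaurants matching cuisine, distance and availability.",
      "parameters": {
          "type": "object",
          "properties": {
              "cuisine": {"type": "string"},
              "max_distance_km": {"type": "number"},
              "available_from": {"type": "string", "format": "date-time"},
              "available_to": {"type": "string", "format": "date-time"},
          },
          "required": ["cuisine"],
      },
  }

  # What the model might return for "sushi near me with availability Saturday dinner";
  # these arguments then drive filtering and sorting in the search engine.
  args = {"cuisine": "sushi", "max_distance_km": 5,
          "available_from": "2024-09-21T18:00", "available_to": "2024-09-21T21:00"}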

As stated, perhaps I'm not thinking about this the right way because of my experience with existing systems, which I find tend to give me better results when well built.

2 hours ago | TmpstsTrrctta

GraphRAG requires you to define the schema of entity and relation types upfront. This works when you are in a known domain, but in general, when you just want to answer questions from a large reference, you don't know what you need to put in the graph.

an hour ago | visarga

GraphRAG is very cool and outstanding at filling some niches. IIRC, Perplexity's actual search is just BM25 (based on a Lex Fridman interview with the founder).

2 hours ago | postalcoder

Makes sense; Perplexity is usually really responsive and fast.

I need to check out that interview with Lex Fridman.

2 hours ago | jillesvangurp

Do you have the link and the time in the video where he mentions it?

2 hours ago | _hfqa

I don't know anything about AI, but I've always wished I could just upload a bunch of documents/books and the AI would perform some basic keyword searches to figure out what is relevant, then auto-include that in the prompt.

an hour ago | vendiddy

It would help if you tried NotebookLM by Google. It does this: you can upload a document, PDF, whatever, and ask questions. The model replies and also gives you a reference to your material.

42 minutes ago | average_r_user

We're doing something similar. We first chunk the documents based on h1, h2, and h3 headings. Then we add the headers at the beginning of the chunk as context. As an imaginary example, instead of one chunk being:

  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
It is now something like:

  # Fever
  ## Treatment
  ---
  The usual dose for adults is one or two 200mg tablets or 
  capsules 3 times a day.
This seems to work pretty well, and doesn't require any LLMs when indexing documents.
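A minimal sketch of that heading-prefixed chunking for markdown input; the splitting is deliberately simplistic (one chunk per heading section, no size limits) and the function name is just for illustration:

  # Sketch: chunk a markdown document by headings and prepend the heading path.
  def chunk_with_headings(markdown_text):
      chunks, path, body = [], [], []
      def flush():
          if body:
              chunks.append("\n".join(path) + "\n---\n" + "\n".join(body).strip())
              body.clear()
      for line in markdown_text.splitlines():
          if line.startswith("#"):
              flush()
              level = len(line) - len(line.lstrip("#"))
              path[:] = path[:level - 1] + [line.strip()]
          else:
              body.append(line)
      flush()
      return chunks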


3 hours ago | valstu

I am working on question answering over long documents / bundles of documents, 100+ pages, and I took a similar approach. I first summarize each page, give it a title, and extract a list of subsections. Then I put all the summaries together and ask the model to provide a hierarchical index, which organizes the whole bundle into a tree. At query time I include the path in the tree as additional context.
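A minimal sketch of that indexing flow; the llm() helper stands in for whatever completion call you use, and the tree and path formats are assumptions:

  # Sketch: per-page summaries -> LLM-built hierarchical index -> path used at query time.
  def build_index(pages, llm):
      summaries = [llm(f"Give a title, short summary and subsections for this page:\n{p}")
                   for p in pages]
      # Ask the model to organize all page summaries into a tree (e.g. nested JSON).
      tree = llm("Organize these page summaries into a hierarchical index:\n\n"
                 + "\n\n".join(summaries))
      return summaries, tree

  def with_tree_context(tree_path, chunk):
      # At query time, prepend the chunk's path in the tree as extra context.
      return " > ".join(tree_path) + "\n---\n" + chunk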

an hour ago | visarga

Did you experiment with different ways to format those included headers? Asking because I am doing something similar to that as well.

2 hours ago | cabidaher

Nope, not yet. We have stuck with markdown-ish syntax so far.

2 hours ago | valstu

This sounds a lot like how we used to do research, by reading books and writing any interesting quotes on index cards, along with where they came from. I wonder if prompting for that would result in better chunks? It might make it easier to review if you wanted to do it manually.

6 hours ago | skybrian

The fundamental problem with both keyword- and embedding-based retrieval is that they only access surface-level features. If your document contains "5+5" and you search for "where is the result 10", you won't find the answer. That is why texts need to be "digested" with an LLM before indexing, to draw out implicit information and make it explicit. It's also what Anthropic proposes we do to improve RAG.

"study your data before indexing it"

an hour ago | visarga

I guess this does give some insight: using a more space-efficient language for your codebase will mean more functionality fits in the AI's context window when working with Claude and code.