>After several weeks, between 2 and 3, the indexing process finished without failures. ... we could finally shut down the virtual machine. The cost was 184 euros on Hetzner, not cheap.
€184 is loose change after spending 3 man-weeks working on the process!
That's the budget I'd have for the coffee shop with the team to discuss budget
That's the budget to discuss and approve the above coffee shop budget
I'd argue the author missed a trick here by using a fancy embedding model without any re-ranking. One of the benefits of a re-ranker (or even a series of re-rankers!) is that you can embed your documents using a really small and cheap model (this also often means smaller embeddings).
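The cheap-embedder-plus-reranker pattern described above can be sketched as a two-stage pipeline. Everything here is a toy stand-in: `cheap_embed` plays the role of a small, inexpensive embedding model and `rerank_score` plays the role of a cross-encoder reranker; only the shape of the pipeline (retrieve broadly with the cheap model, rerank a short candidate list with the expensive one) is the point.

```python
from collections import Counter
from math import sqrt

def cheap_embed(text):
    """Stand-in for a small, cheap embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rerank_score(query, doc):
    """Stand-in for a cross-encoder: rewards exact term overlap."""
    return sum(len(w) for w in query.lower().split() if w in doc.lower())

def search(query, docs, k_retrieve=10, k_final=3):
    # Stage 1: cheap vector retrieval over the whole corpus.
    qv = cheap_embed(query)
    candidates = sorted(docs, key=lambda d: cosine(qv, cheap_embed(d)),
                        reverse=True)[:k_retrieve]
    # Stage 2: the expensive reranker only sees the short candidate list.
    return sorted(candidates, key=lambda d: rerank_score(query, d),
                  reverse=True)[:k_final]
```

The cost argument falls out of the structure: the embedding model runs over every document at index time, the reranker only over `k_retrieve` candidates per query.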
And some have been saying that RAG is obsolete, that the context window of a modern LLM is adequate (preferable?). The example I recently read was that contexts are now large enough to hold the entire "The Lord of the Rings" trilogy.
That may be, but then there's an entire law library, the entirety of Wikipedia (and the example in this article of 451 GB). Surely those are at least an order of magnitude larger than Tolkien's prose and might still benefit from a RAG.
Whether the model responds with correct information is also a function of giving it proper context.
That hasn't changed, nor do I think it will, even with models having very large context windows (e.g. Gemini has 2M tokens). It has been observed that a large context alone is not enough, and that it is better to give the model sufficient, high-quality information than to fill the window with virtually everything. The latter is also impossible, and does not scale well for long, complicated tasks where reaching the context limit is inevitable. In that case you need RAG that is smart enough to extract the relevant information from previous answers/context and make it part of the new context, which in turn lets the model keep its performance at a satisfactory level.
I do think that what we think of as RAG will change!
When any given document can fit into context, and when we can generate highly mission-specific summarization and retrieval engines (for which large amounts of production data can be held in context as they are being implemented)... is the way we index and retrieve still going to be based on naive chunking, and off-the-shelf embedding models?
For instance, a system that reads every article and continuously updates a list of potential keywords with each document and the code assumptions that led to those documents being generated, then re-runs and tags each article with those keywords and weights, and does the same to explode a query into relevant keywords with weights... this is still RAG, but arguably a version where dimensionality is closer tied to your data.
(Such a system, for instance, might directly intuit the difference in vector space between "pet-friendly" and "pets considered," or between legal procedures that are treated differently in different jurisdictions. Naive RAG can throw dimensions at this, and your large-context post-processing may just be able to read all the candidates for relevance... but is this optimal?)
I'm very curious whether benchmarks have been done on this kind of approach.
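A minimal version of the "tag each article with weighted keywords and explode the query the same way" idea can be sketched without any LLM at all. Here the weights are just corpus-rarity (IDF-style), a crude placeholder for the model-generated tags and weights described above; the class name and scoring are entirely made up for illustration.

```python
from collections import Counter
from math import log

class KeywordIndex:
    """Toy weighted-keyword index: documents and queries are both 'exploded'
    into keywords whose weights reflect how rare they are in the corpus."""
    def __init__(self, docs):
        self.docs = docs
        self.df = Counter()                       # document frequency per word
        for d in docs:
            self.df.update(set(d.lower().split()))

    def weights(self, text):
        n = len(self.docs)
        # IDF-style weight: corpus-rare words dominate the match.
        return {w: log((n + 1) / (self.df[w] + 1))
                for w in set(text.lower().split())}

    def query(self, q):
        qw = self.weights(q)
        scored = [(sum(qw.get(w, 0.0) * dw
                       for w, dw in self.weights(d).items()), d)
                  for d in self.docs]
        return max(scored)[1]
```

An LLM-driven version would replace `weights` with model-generated keyword/weight pairs, which is where the "dimensionality closer tied to your data" claim would come in.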
Also the thing with context is that you want to keep it focused on the task at hand.
For example there's evidence that typical use of AGENTS.md actually doesn't improve outcomes but just slows the LLMs down and confuses them.
In my personal testing and exploration I found that small (local) LLMs perform drastically better, both in accuracy and speed, with heavily pruned and focused context.
Just because you can fill in more context doesn't mean you should.
The worry I have is that common usage will lead to LLMs being trained and fine-tuned to accommodate ways of using them that don't make a lot of sense (stuffing context, wasting tokens, etc.), just because that's how most people use them.
RAG is nowhere near obsolete. Model performance on enormous sequences degrades hugely, as they are not well represented in training, and non-quadratic attention approximations are not amazing.
For technical domains, stuffing the context full of related-and-irrelevant or possibly-conflicting information will lead to poor results. The examples of long-context retrieval like finding a fact in a book really aren't representative of the types of context you'd be working with in a RAG scenario. In a lot of cases the problem is information organization, not retrieval, e.g. "What is the most authoritative type of source for this information?" or "How do these 100 documents about X relate to each other?"
I'm not super deep in LLM development, but with RAM being a material bottleneck, and from what I've read about DeepSeek's results offloading factual knowledge into 'engrams', I think the near future will move toward the dense core of LLMs being a distillation of universal reasoning and logic, while factual knowledge is pushed out into slower storage. IIRC Nvidia's Nemotron Cascade is taking MoE even further in that direction too.
I don't need a coding model to be able to give me an analysis of the Declaration of Independence in Urdu from 'memory', and the price in RAM for being able to do that, impressive as it is, is an inefficiency.
Some previous techniques for RAG, like directly using a user message’s embedding to do a vector search and stuffing the results in the prompt, are probably obsolete. Newer models work much better if you use tool calls and let them write their own search queries (on an internal database, and perhaps with multiple rounds), and some people consider that “agentic AI” as opposed to RAG. It’s still augmenting generation with retrieved information, just in a more sophisticated way.
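The tool-call pattern described above can be sketched as a loop in which the model issues its own search queries against an internal store until it decides it has enough context. `fake_model` and `keyword_search` are trivial stubs (not a real LLM API or search engine); the point is the control flow: search as a tool call, multiple rounds allowed, answer only once evidence is gathered.

```python
def keyword_search(store, query):
    """Internal-database search the model can invoke as a tool."""
    terms = query.lower().split()
    return [doc for doc in store if any(t in doc.lower() for t in terms)]

def fake_model(question, evidence):
    """Stub for an LLM choosing its next action: either request another
    search (a 'tool call') or answer from the evidence gathered so far."""
    if not evidence:
        return ("search", question)     # first round: write its own query
    return ("answer", evidence[0])      # enough context gathered: answer

def agentic_rag(question, store, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):         # multiple retrieval rounds allowed
        action, payload = fake_model(question, evidence)
        if action == "answer":
            return payload
        evidence += keyword_search(store, payload)
    return evidence[0] if evidence else None
```

The older technique would instead embed `question` directly and stuff the nearest chunks into the prompt in a single shot, with no chance to reformulate.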
How can it be obsolete? Maybe if you only have toy data you picked to write your blog post. Companies have gigabytes, petabytes of data to draw from.
I assume it’s not possible to get the same results by fine tuning a model with the documents instead?
You will still get hallucinations. With RAG you use the vectors to aid in finding things that are relevant, and then you typically also have the raw text data stored as well. This allows you to theoretically have LLM outputs grounded in the truth of the documents. Depending on implementation, you can also make the LLM cite the sources (filename, chunk, etc).
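The grounding-and-citation setup described can be sketched with a plain chunk store that keeps raw text alongside (filename, chunk index) metadata. `retrieve` here is a trivial term-overlap ranking standing in for the vector search; the file name and helper names are illustrative only.

```python
def build_store(files):
    """Keep the raw chunk text plus (filename, chunk index) so every
    retrieved passage can be cited, not just matched by its vector."""
    store = []
    for filename, text in files.items():
        for i, chunk in enumerate(text.split("\n\n")):
            store.append({"file": filename, "chunk": i, "text": chunk})
    return store

def retrieve(store, query, k=2):
    """Stand-in for vector search: rank chunks by query-term overlap."""
    terms = set(query.lower().split())
    return sorted(store,
                  key=lambda c: len(terms & set(c["text"].lower().split())),
                  reverse=True)[:k]

def cited_context(store, query):
    """Prompt fragment with inline source tags the LLM can echo back."""
    return "\n".join(f'{c["text"]} [{c["file"]}#{c["chunk"]}]'
                     for c in retrieve(store, query))
```

Because the prompt carries the source tag next to each passage, asking the model to cite `[file#chunk]` in its answer becomes a formatting instruction rather than a retrieval problem.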
Maybe a bit off-topic:
For my PhD, I wanted to leverage LLMs and AI to speed up the literature review process*.
Due to time constraints, this never really lifted off for me. At the time I checked (about 6 months ago), several tools were already available (NotebookLM, Anara, Connected Papers, ZotAI, Litmaps, Consensus, Research Rabbit) supporting Literature Review.
They have all pros and cons (and different scopes), but my biggest requirement would be to do this on my Zotero bibliographic collection (available offline as PDF/ePub).
ZotAI can use LMStudio (for embeddings and LLM models), but at that time, ZotAI was super slow and buggy.
Instead of going through the valley of sorrows (as threatofrain shared in the blog post - thanks for that), is there a more or less out-of-the-box solution (paid or free) for the demand (RAG for local literature review support)?
*If I am honest, it was rather a procrastination exercise, but this is for sure relatable for readers of HN :-D
Onyx is good for this; it is the standard pipeline of doc ingestion -> chunk -> embedding -> index -> query -> rerank -> answer.
Oh! Same! I made an R/Shiny powered RAG/research app that hooks into OpenAlex (for papers) and allows you to generate NotebookLM-like outputs. Just got it to inject slides with images taken from the papers, super fun. Takes OpenRouter or local LLMs (if that's your thing). Network graphs too! https://github.com/seanthimons/serapeum/
If you don’t mind a little instability while I work out the bugs, might be interested in my project: https://github.com/rmusser01/tldw_server ; it’s not quite fully ready yet but the backend api is functional and has a full RAG system with a customizable and tweakable local-first ETL so you can use it without relying on any third party services.
I tried to do RAG on my laptop just by setting it all up myself, but the actual LLM gave poor results (I have a small thin-and-light fwiw, so I could only run weak models). The vector search itself, actually, ended up being a little more useful.
What ended up being the main bottleneck in your pipeline—embedding throughput, cost, or something else? Did you explore parallelizing vectorization (e.g., multiple workers) or did that not help much in practice?
51 visitors in real-time.
I love those site features!
In a submission of a few days ago there was something similar.
I love it when a website gives a hint to the old web :)
Odd to me that Elasticsearch isn't finding a second wind in these new ecosystems. It basically is that now: a RAG engine with model integration.
It’s definitely a use case for this and would’ve saved a lot of pain IMO but also seems like it would have added confusing technology to what was a VERY Python-heavy stack that would’ve benefitted from other elements.
Hardest part is always figuring out your company’s knowledge management has been dogsh!t for years so now you need to either throw most of it away or stick to the authoritative stuff somehow.
Elastic plus an agent with MCP may have worked as a prototype very quickly here, but hosting costs for 500GB worth of indexes sounds too expensive for this person’s use case if $185 is a lot.
ah got it! thanks for the color
The people that survived it aren't willing to give it any more of the breath they have left.
haha! it's been ok for me, but a lot of song and dance is required. the saas-version is a black box (in a bad way).
Great write-up. Thank you! I’m contemplating a similar RAG architecture for my engineering firm, but we’re dealing with roughly 20x the data volume (estimating around 9TB of project files, specs, and PDFs).
I've been reading about Google's new STATIC framework (sparse matrix constrained decoding) and am really curious about the shift toward generative retrieval for massive speedups well beyond this approach.
For those who have scaled RAG into the multi-terabyte range: is it actually worth exploring generative retrieval approaches like STATIC to bypass standard dense vector search, or is a traditional sharded vector DB (Milvus, Pinecone, etc.) still the most practical path at this scale?
I would guess the ingestion pain is still the same.
This new world is astounding.
9TB should be fine for a vector DB, for sure. Google Search is many petabytes of index with vector+semantic search, using ScaNN.
You could probably use the hybrid search in LlamaIndex, or Elasticsearch. There is an off-the-shelf Discovery Engine API on GCP, and Vertex AI RAG Engine is end-to-end for building your own; GCP is too expensive though. Alibaba Cloud has a similar solution.
We did it in an engineering setting and had very mixed results. Big 800-page machine manuals are hard to contextualise.
There’s turbopuffer
Cool work! I would be so interested in what would happen if you put the data and the plan/features you wanted into a Claude Code instance and let it go. You did careful thinking, but those models now also go really far and deep; I would be really interested in seeing what it comes up with. For that kind of data, getting something like a Mac mini or whatever (no, not with OpenClaw) would be damn interesting, to see how fast and far you can go.
But where is the fun in that?
Thanks for an interesting read! Are you monitoring usage, and what kind of user feedback have you received? Always curious if these projects end up used because, even with the perfect tech, if the data is low quality, nobody is going to bother
Think that's the first time I've seen someone write about checkpointing; definitely worth doing for similar projects.
I made something similar in my project. The harder task has been choosing the right approach to chunking long documents. I used both structural and semantic chunking approaches; the semantic one helped store the vectors better in the vector DB. I used Qdrant and an OpenAI embedding model.
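The structural half of that split can be sketched as chunking on document structure first (paragraph/section boundaries) with a size cap, before any semantic pass. This is a minimal sketch, assuming blank lines mark the structural boundaries; real documents would use headings, page breaks, etc.

```python
def structural_chunks(text, max_chars=500):
    """Split on blank lines (paragraph boundaries) first, then greedily
    pack paragraphs into chunks under the size cap, so no chunk ever
    straddles a structural boundary."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)            # cap reached: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A semantic pass would then merge or re-split these chunks by embedding similarity rather than by character count.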
Great writeup but ... pretty sure ChromaDB is open source and not "Google's database"?
I'm afraid this hits the credibility of the article for me, that's a pretty weird mistake to make. It's like paying for a Model 3 while thinking it comes from Ford.
Thank you for your feedback!
What would it look like to regularly react to source data changes? Seems like a big missing piece. Event based? regular cadence? Curious what people choose. Great post though.
Depends on the use case, ie frequency and impact of changes.
Typically you would have a reindex process, and you keep track of hashes of chunks to check if you’ve already calculated this exact block before to avoid extra costs. And then run such a reindex process pretty frequently as it’s cheap / costs nothing when there are no changes.
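The hash-tracking described above can be sketched with a content-hash cache: re-embed only chunks whose hash has not been seen before. `embed` is a placeholder for the real (paid) embedding call; the cache shape is illustrative.

```python
import hashlib

def chunk_hash(text):
    """Stable content hash used as the cache key for a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def reindex(chunks, cache, embed):
    """Re-embed only chunks whose content hash isn't cached, so a reindex
    run over unchanged data costs nothing. `cache` maps hash -> embedding."""
    skipped = 0
    for text in chunks:
        h = chunk_hash(text)
        if h in cache:
            skipped += 1                # already paid for this exact chunk
        else:
            cache[h] = embed(text)      # the only place the paid API is hit
    return skipped
```

Run frequently, the second and later passes only pay for chunks that actually changed.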
makes great sense, thanks!
This article came just in the nick of time. I'm in fandoms that lean heavily into fanfiction, and there's a LOT out there on Ao3. Ao3 has the worst search (and you can't even search your account's history!), so I've been wanting to create something like this as a tool for the fandom, where we can query "what was the fic about XYZ where ABC happened?" and get hopefully helpful responses. I'm very tired of not being able to do this, and it would be a fun learning experience.
I've already got the data mostly structured because I did some research on the fandom last year, charting trends and such, so I don't even need to massage the data. I've got authors, dates, chapters, reader comments, and full text already in a local SQLite db.
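With the text already in SQLite, a useful baseline before any embeddings is SQLite's built-in FTS5 full-text index, assuming your Python's SQLite was compiled with FTS5 (most standard builds are). Table and column names here are made up to stand in for the fic database described above.

```python
import sqlite3

con = sqlite3.connect(":memory:")          # stands in for the local fic db
con.execute("CREATE VIRTUAL TABLE fics USING fts5(title, author, body)")
con.executemany(
    "INSERT INTO fics VALUES (?, ?, ?)",
    [("Coffee Shop AU", "alice", "They meet over a spilled latte."),
     ("Space Opera", "bob", "The fleet jumps to warp at dawn.")],
)
# FTS5 MATCH handles "what was the fic where ABC happened?"-style lookups,
# ranked by relevance (bm25) via ORDER BY rank.
rows = con.execute(
    "SELECT title FROM fics WHERE fics MATCH ? ORDER BY rank", ("latte",)
).fetchall()
```

FTS5 won't answer fuzzy "about XYZ" questions the way a RAG setup would, but it's a zero-cost first step that already beats Ao3's native search for exact-phrase recall.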
Cool to see Nomic embeddings mentioned. Though surprised you didn't land on Voyage.
Did you look at Turbopuffer btw?
I assume, based on their concerns about the Hetzner pricing, that they didn't want to pay for Voyage/Turbopuffer. Unless there are free versions of those products that I'm unaware of, but I'm only seeing paid.
ChromaDB is open source with Apache-2.0 license.
https://github.com/chroma-core/chroma