
Show HN: Transform your codebase into a single Markdown doc for feeding into AI

CodeWeaver is a command-line tool designed to weave your codebase into a single, easy-to-navigate Markdown document. It recursively scans a directory, generating a structured representation of your project's file hierarchy and embedding the content of each file within code blocks. This tool simplifies codebase sharing, documentation, and integration with AI/ML code analysis tools by providing a consolidated and readable Markdown output.

CodeWeavers is a software company that focuses on Wine development and sells a proprietary version of Wine called CrossOver for running Windows applications on macOS, ChromeOS and Linux.

https://en.wikipedia.org/wiki/CodeWeavers

Trademark is active. It's an ®, not just a ™: registered, not merely trademarked. To keep it, they have to demonstrate they defend it.

https://www.trademarkia.com/codeweavers-76546826

While this project drops the final "s", you don't get to launch an OS called "Window". The test is a fuzzy match based on likelihood of confusion.

8 days agoTerretta

Yeah, I was thinking "what do the Wine guys have to do with this?"

This project is definitely going to get C&D'd.

7 days agojychang

Do you think they would actually litigate? They seem like different products serving entirely different markets so I am not sure that the trademark infringement claim is very defensible. And how do they prove damages?

7 days agowilliamcotton

CodeLoom could work instead

7 days agopstuart

I use the following for feeding into AI

   find . -type f -print -exec cat {} \; -exec echo \;
Which returns, for each file (recursing into subfolders), the filename and then the content of the file.

Then `| pbcopy` to copy to clipboard and paste it into ChatGPT or similar.

8 days agocrisbal_

I guess this only works for very small codebase?

8 days agosingpolyma3

Correct, but it's the same as what OP shared.

You should use Aider/Cursor for proper indexing/intelligent codebase referencing

8 days agoOsrsNeedsf2P

Cramming thousands of tokens of potentially irrelevant context through unclear indexing paths isn't "proper".

The best results come from feeding precisely targeted context directly into the prompt, where you know exactly what the model sees and how it processes it. The raw prompt gets the most direct use of attention, whereas god knows what the pipeline is for Cursor, or what extra layers and context restrictions they add on top of base Claude.

Giving the model a clean project hierarchy accomplishes a lot efficiently in terms of context tokens. The key is ensuring it only sees what's relevant, without diluting its attention.

Tools like repomix and OP's version, feeding targeted context straight into models like Claude or Google's offerings, outperform Copilot and Cursor in my experience, even though they use the same base models. Use the highest-quality attention (the prompt context) directly, rather than layers of uncertainty and "proper indexing".

7 days agostarfezzy

I'm still puzzled that people are convinced by Cursor, while my experience was meh at best. Can it index your stuff? Okay, it can. Can it refactor a simple function? No, it cannot; it can't even rename a damn Java class. How can I then trust it to generate code based on my codebase? So, what is your use case? Or can anybody point me to some blog/articles/videos showing some real use cases for Cursor? Real as in, something that it provenly can do?

7 days agosoco

I think you know the correct answer:)

7 days agorisyachka

I hoped to be wrong, but no comment so far has even tried to bring a real argument... eh, maybe I'll try again in a year.

4 days agosoco

>Can it refactor a simple function?

Certainly, I do that several times a day.

7 days agokristiandupont

Listen, I don't want to brag too much, but it even made me a function today.

7 days agorob

>Java

found the problem

7 days agoastar1

Correct. It's Web 3.0 2.0. You're supposed to play along to make the stock prices go up and to the right.

7 days agojcgrillo

Not sure if it's Cursor's fault, but very often it doesn't give me the real or complete code from my codebase when auto-editing/auto-completing.

any tips?

8 days agoboredemployee

Or, for a lazier approach:

    $ head -10000 *
    ==> package.json <==
    {
      "name": ...
      ...
    ==> tsconfig.json <==
    {
       "extends": ...
      ...

    $ head -10000 * | llm -s "generate a patch to switch this project to esm"
7 days agousagisushi

That's very nice and compact. I do the same with a short bash script, but wrap each file in triple backticks and attempt to put the correct language label on each, e.g.:

Filename: demo.py

```python
...python code here...
```
8 days agoDrPhish

Seconded, because just having something auto-wrapped like that and put on the clipboard would save me time: release the Snyder cut, er, bash script!

7 days agogenewitch
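For the impatient, a minimal sketch of such a wrapper (this is not DrPhish's actual script; the extension-to-label mapping and function name are illustrative):

```shell
# wrap_files: print each argument as "Filename: ..." followed by a fenced,
# language-labelled code block, roughly as the commenters describe.
wrap_files() {
  for f in "$@"; do
    case "$f" in
      *.py) lang=python ;;
      *.go) lang=go ;;
      *.js) lang=javascript ;;
      *.sh) lang=bash ;;
      *)    lang='' ;;
    esac
    printf 'Filename: %s\n\n```%s\n' "$f" "$lang"
    cat "$f"
    printf '```\n\n'
  done
}
```

Usage would be something like `wrap_files src/*.py | pbcopy`.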

Tip: If you ever need to do this on a public GitHub repository you can use "gitingest".

This will open a website that creates a copy of all the file contents of the repo (code, docs, ...) It's a great tool to use when using new/obscure code with LLMs in my opinion.

The UX is just so easy and great: change the URL from <https://github.com/user_name/repo_name> to <https://gitingest.com/user_name/repo_name>

//edit: fixed URLs

8 days agobeklein

I copied the UX for my https://gitpodcast.com (creates a podcast from a GitHub repo; same idea, replace `hub` with `podcast`)

8 days agomkagenius

I am very impressed by gitpodcast. I just listened to one podcast and, first of all, I am pleased with the idea; the voices are also pleasant to listen to. Thanks for sharing!

7 days agopratyahava

Unfortunate naming, given that CodeWeavers is already a company making a Windows "emulator" for Linux and macOS. [1]

[1] https://www.codeweavers.com/

8 days agoreddalo

CodeWeavers are actually making Wine itself, not just some "emulator". They then distribute it along with some QOL tools as a commercial product called CrossOver.

8 days agoArch-TK

All names are taken. There's no need to point this out every time.

8 days agolgas

Not all names are registered trademarks for software.

7 days agoanamexis

Some are more confusing than others.

8 days agoRexxar

Huewoblfan is not taken! Noiewoidc is free. XIONqlic – totally available, can mean a range of things! Ciohupoij – a bit of Asian flavour, but still a valid free name.

7 days agogloosx

How does this compare to / differ from https://github.com/yamadashy/repomix ?

8 days agopmx

Some advantages of CodeWeaver: it's compiled, so it might be faster, and you can grab a compatible executable from the releases section instead of using `go install`, so there are no dependencies. You can also manually specify what to exclude via a comma-separated list of regular expressions, so it might be more flexible. I've never used Repomix, so those assumptions might not hold. On the other hand, Repomix seems to be far more complete, a full-fledged solution for converting source code to monolithic representations. I wrote CodeWeaver because I just needed something that worked and that, occasionally, I could trust to keep sensitive data away from sketchy LLMs (and I wasn't aware of other solutions).

8 days agotesserato

or https://github.com/azer/llmcat

8 days agoakoculu

I simply have a bash script called printall which takes in some args, and outputs markdown codeblocks with filenames and a tree. One of hundreds of scripts built up over the years.

8 days agoimdsm

if you add fzf to speed up file / folder selection, you'll have your own llmcat :)

8 days agoakoculu

My question exactly. Repomix seems to be a tested utility for something like this.

8 days agoycombinatornews

Same question here. I have found repomix to get the job done really well.

8 days agoActVen

I really want a tool like this that can extract a function and its dependency graph (to a certain depth maybe, and/or exclude node_modules).

I wrote this library [1] and hope to add the fine-grained "reference resolution" utility to it at some point, which could make implementing such a tool a lot simpler.

[1]: https://github.com/aleclarson/ts-module-graph

8 days agoretropragma

I use aider's /copy-context command for that

https://aider.chat/docs/usage/copypaste.html

and with /paste you can apply the changes.

8 days agotherealmarv

Thanks for letting folks know about aider's /copy-context command.

To add some more detail, aider has a mode/UX that is optimized for "copy and paste" coding with LLM web chats. The "big brain" LLM in the web chat does the hard work, and a cheap/local LLM works with aider to apply edits to your local files.

There's a little demo video in the link above that should give you the gist.

7 days agoanotherpaulg

I’ve made a CLI tool that does something similar, called Copcon:

https://github.com/kasperjunge/copcon

Point it at a code project directory to get a file tree and content, optionally with a git diff, copied to the clipboard - ready for copy pasting into ChatGPT.

It is very true that this only works for small projects, as you will bloat the LLM’s context with large codebases.

My solution to this is two files you can use to steer the tool’s behavior:

- .copconignore: For ignoring specific files and directories.

- .copcontarget: For targeting specific files and directories (applied before .copconignore).

These two files provide great control over what to include and exclude in the copied context.

7 days agojuunge

A new tool like this comes out every week, and that's great! But I think it's fair to ask how it compares to popular ones like Repomix. Anyone keeping an eye on this space will want to know how it differs from what's already out there and being used.

8 days agotempoponet

I actually wrote this a couple of months ago, so perhaps nothing similar existed at the time (I remember doing some research, mostly focused on VS Code plugins). Nevertheless, the idea was also to test how Golang could facilitate distributing such micro-tools throughout the internal team, so I probably would have made it anyway. It's nice to know that similar tools exist; I'll take a look at them.

8 days agotesserato

  find . -type f -name '*.py' -exec sh -c 'echo "# $1"; cat "$1"; echo ""' _ {} \; | pbcopy
8 days agomaurycy

Somewhat related. I built an Elm app all in one file as an experiment and to see if I like it. It's a little over 7k lines and I'm occasionally adding more to it.

It's actually pretty straightforward if you're in a language with lexical scoping, and it simplifies some things: no include/cyclic-dependency issues, no modules, no hunting through files, etc.

I feel like this set up could integrate really well w/ AI models.

I've found that the only real limitation, at least in my experiment, was a lack of decent editor support. I use vim, so this wasn't much of an issue for me: there are many great ways to navigate a file, plus a combination of vertical and horizontal splits on a large screen. But when I opened it up in other "modern" editors, the ergonomics fell apart quite a bit.

I think the biggest downside was that re-using variable names between large scopes occasionally made it hard to find the reference I wanted (e.g. i, x, key, val), but again, better editor support allowing you to limit your search to the current scope would help. It's also easily mitigated with more verbose throwaway variable naming.

8 days agorapind

I write Elm and use Emacs primarily, and sometimes Neovim. Are you using LSP in vim? You're doing it right by staying in one file until it hurts (that's the recommendation for Elm), but I can't recall whether I've had issues using go-to-def or other LSP functions like you're describing.

8 days agosqueegee_scream

No LSP. It honestly doesn’t speed me up any. I already have the standard library memorized, plus some of the common community lib methods (List.Extra) and my typing speed is faster than I can think anyways.

I’m thinking the same approach would also work well in F#, Haskell, OCaml.

7 days agorapind

> no hunting through files, etc.

It's easy to switch to files by name with a few keystrokes. File names group the things I'm looking for.

I would much rather do that than try to search through a 7,000 line file for what I need.

> I feel like this set up could integrate really well w/ AI models.

Massive files or too many files break AI models. Grouping functionality into smaller files and including only relevant files is key. The file and folder names can be hints about where to find the right files to include.

8 days agoAurornis

> I would much rather do that than try to search through a 7,000 line file for what I need.

I mean I'm not arguing for it as a best practice. I did it as an experiment (as I stated), and discovered it's actually really easy, and snappy for me to navigate in Vim. Mileage may vary with other editors. Have you tried it?

> Massive files or too many files break AI models

It's growing faster than I code! With the latest Gemini, at least, the context is much larger, at 1-2 million tokens. I'm sure we'll hit a ceiling, though, but I also think we may eventually see some context-caching / RAG-type optimizations.

8 days agorapind

The big problem with that is you’ll eventually blow your context window feeding the model with stuff that it mostly doesn’t need in order to complete its task.

8 days agocruffle_duffle

I can't think of anything I'd want to add to the context for Elm, at least, assuming the standard libraries are already in the model (or can be added via RAG). Gemini is at 2M tokens now, and I expect this will grow at least until it's no longer meaningful.

7 days agorapind

Nice! Built something similar in Rust that supports local and remote repos: https://crates.io/crates/r2md

8 days agostan_kirdey

I thought of using Rust, but ultimately chose Go. I'll take a look and see how something similar came out in Rust!

8 days agotesserato

Something I didn't dig in to find out, but is it possible for these applications to also respect .gitignore files? Might be a handy flag!

8 days agojdironman

In any Node project that basically _must_ be done, or your source code will be eclipsed by whatever is in node_modules.

7 days agofullstackchris
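For what it's worth, when the project is a git repo you can get .gitignore handling for free by letting git produce the file list instead of walking the tree yourself. A minimal sketch (the function name and `== file ==` header format are arbitrary):

```shell
# dump_tracked: print every file git knows about (tracked, plus untracked
# files that are not ignored) with a small header, ready to pipe elsewhere.
# --exclude-standard makes git apply .gitignore, .git/info/exclude, etc.
dump_tracked() {
  git ls-files --cached --others --exclude-standard | while IFS= read -r f; do
    printf '== %s ==\n' "$f"
    cat "$f"
    printf '\n'
  done
}
```

In a Node project this automatically skips node_modules, assuming it is listed in .gitignore as usual.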

This is like a rediscovery of an org-mode capability that has existed for decades, and doesn't do as much.

8 days agoskeledrew

Is it? I use org-babel regularly but wasn't aware of it; what's the function called? As great as org-mode / org-babel is, the user base is small enough that it tends to be overlooked.

7 days agohatmatrix

Well in general I've put entire projects into org docs, and ran the code blocks, essentially using it like a Jupyter notebook (although honestly it wasn't always as smooth as I'd like). And I haven't done this myself, but there's a neat literate programming talk from the last EmacsConf[0] in which the presenter showed some custom capabilities which improved the experience even more for him.

[0] https://emacsconf.org/2024/talks/literate/

7 days agoskeledrew

Following the /llms.txt standard proposal, I created a MkDocs plugin that generates an /llms.txt file at the root of your site. So, same thing, but it generates the Markdown document from your docs (possibly containing an API reference) instead of your code.

7 days agoPawamoy

Such functionality would be useful for developing some scripts and then converting them to a Quarto document [1].

[1] https://quarto.org/

7 days agohatmatrix

I've never used Quarto, but I might give it a go someday. I currently have a convoluted workflow for generating math-heavy documents that involves generating equations using SymPy in a notebook, accumulating them in a string, and ultimately dumping the string into a Markdown file. I would love to simplify this sooner rather than later. I'm also keeping an eye on https://typst.app/ and hoping for a sane alternative to LaTeX to emerge.

6 days agotesserato

Second hooray for Quarto. Great tool.

7 days agombonnet

This could be a lot better. The example linked in the GitHub README is a Markdown file full of binary garbage, because the tool also tried to convert gzip files to Markdown.

Pretty big red flag that this isn't ready for primetime.

8 days agocausal

Thank you for pointing that out. Just fixed it.

8 days agotesserato

My codebase sitting at 4M lines: hold my spaghetti.

8 days agoainiriand

This is self-promotional, but https://github.com/nahco314/feed-llm has a TUI for choosing what to give to the LLM. There are many similar tools out there, but I think this approach is relatively effective for larger codebases.

8 days agonahco314

You can ask Cursor to use information from a specific folder (aka your 4M lines), and it will summarize it and use that.

Not a replacement for the full 4M lines, but it might work for some tasks/prompts.

8 days agoycombinatornews

This kind of context is really useful for LLMs, but in any significant project, including all the code this way will easily exceed context limits. I've been wanting to do something like this for my PHP projects, but instead of dumping entire files, it would just create a map of each file's method signatures, variables, etc. That should give good enough information about what each file is for and can do, while being small enough to be ingested by the AI.

8 days agonunodonato
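As a rough illustration of that idea, a grep-based first pass at a signature map for PHP might look like this (the pattern is a sketch: it will miss multi-line or unusual declarations, and a real tool would use a proper parser):

```shell
# list_signatures DIR: print file:line:declaration for every class,
# interface, trait, and function declaration found in .php files under DIR.
list_signatures() {
  grep -rn --include='*.php' -E \
    '^[[:space:]]*(((public|private|protected|static|abstract|final)[[:space:]]+)*function[[:space:]]+[A-Za-z_]|((abstract|final)[[:space:]]+)?(class|interface|trait)[[:space:]])' \
    "${1:-.}"
}
```

The output (one declaration line per match, with file and line number) is a fraction of the size of the full source, which is the point.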

> including all code in this manner will easily exceed context limitations

The context window for Gemini 2.0 Flash can handle roughly 50000 lines of code, and 2.0 Pro can handle twice that.

7 days agopanarky

That runs out faster than you'd think. Also, attention to and memory of facts in the context degrade as the context grows, which can hurt when you just want to dump everything at once.

7 days agonunodonato

For extra points, compile your docs into one file and feed it that as well.

(unless the reason you're giving AI the code is that you don't have any docs for either humans or machines)

8 days agolornajane

Anybody have experience using something like this with a big codebase and Gemini's 2M context window? I tried a while ago (before 2.0 Flash) to solve some refactoring tasks, and even after spending some time on prompt wrangling I didn't manage to get good results out of it.

I don't know what kind of agent architecture Cursor uses internally, but it seems much better at finding where changes need to be made.

8 days agomtrovo

In my experience feeding large codebases to Gemini, simple tasks work OK (enumerate where such-and-such happens, find where a certain function is called, list TODOs throughout the code, etc.), but tasks that require a bit more logic are trickier. Nevertheless, I had some success with moderately complex refactoring tasks in Python codebases.

8 days agotesserato

This thread has convinced me that Aider/Cursor need to do more marketing.

8 days agoOsrsNeedsf2P

Maybe. But maybe some like the more disconnected way of coding with ai.

8 days agolarusso

Why? It's just moving more of the grunt work of shuffling things around to the human?

8 days agolgas

For me it's still about feeling in control, and the fact that I don't want to inject it into every workflow. I'm open to AI and use it daily, but my terms may be different than others'. I want to control what I share and how. People have secrets and other things in a project. I sometimes rename things because the AI should only deal with the big picture. Paint me paranoid, but that's how it is for me.

8 days agolarusso

Same for Windsurf; I've been using it to generate documentation for codebases. It will generate Markdown with Mermaid diagrams to explain whatever you want to know: from the component architecture of an entire application, to the sequence diagram for a specific button, to data and ER diagrams.

But the approach of fitting your entire codebase into one document so you can include it in your prompt context seems like a dead end; instead, the LLM can use an agent to do targeted search through your code.

7 days agoako

Cursor is all the rage. Nobody talks about Aider, sadly.

7 days agorane

I partially disagree. Maybe it depends on what circles you run in, but at least here on HN I've seen Aider mentioned more times than I can count. Is Cursor more popular? Yeah… but people here are talking about Aider. That's how I learned about it.

7 days agoreplwoacause

The future is not evenly distributed.

8 days agoesafak

I made a similar tool in Golang, https://github.com/foresturquhart/grimoire. It tries to be a bit cleverer, by prioritising files that have had many commits, respecting .gitignore files, and excluding useless content like binaries or vector images.

8 days agoConasg

I can think of no use case where binaries are desired in such representation, so I might bake binary exclusion into CodeWeaver as well. SVGs, on the other hand, might be wanted sometimes, in web design contexts. I'll take a look at your implementation and see what I can learn.

8 days agotesserato

thisismy has a `-g` (greedy) option, which then also takes binaries.

8 days agofranze

Nice! Written in go. I like that :)

8 days agocodecraze

I would say the demand for this kind of tool definitely exists. Good work! From a rough glance it looks pretty similar to another tool that I've been using https://github.com/mufeedvh/code2prompt

8 days agomegadragon9

I literally just wrote something similar, called techdocs [1], in Rust; it uses Claude to generate a README, including API and CLI.

[1] https://github.com/thesurlydev/techdocs

8 days agothesurlydev

Nice! I thought of using Rust. I'll check how you implemented it.

8 days agotesserato

Wouldn't it be wonderful to have a tool where you interact with the AI through the codebase, via an IDE / vim / emacs tree? Say, you open your codebase and start with prompts, and the AI + tool navigates to a function or a place where it needs to and modifies stuff while chatting with you about it? Or you jump somewhere and highlight where you are to scope down its focus (while it still retains all of the code in history/memory). Sort of like pair programming. It sounds so obvious that I'm almost sure I've missed it already existing somewhere. I think I tried Google's thing (forgot the name) but it sucked / wasn't that.

8 days agoKeyframe

I think you’re describing Aider.chat. There are 2 Emacs packages for it, one official and a very recent fork. Aider is a cli so it works great with vim as well.

In Emacs I’ve had good experience with gptel as well but I prefer aider for the coding workflow

8 days agosqueegee_scream

Yep, I've particularly been enjoying the recent "watchfiles" feature where a comment can be added to the source file, and ending it with "ai?" or "ai!" triggers use of said comment as a prompt to ask about or change that section upon save.

7 days agoskeledrew

I'll check it out, thanks!

8 days agoKeyframe

Apologies if I'm missing something, but aren't you describing Cursor/Copilot/Windsurf?

8 days agozknowledge

You're not. Looks like that's kind of it, but would the thing have the context of the whole project when I'm in a file/class/function? With Copilot, in my case, it was so far mostly a fancy autocomplete that only has the immediate vicinity in its memory, whereas it would be vastly more useful if it had the context of the whole project / all files.

8 days agoKeyframe

Cursor indexes the entire codebase with embeddings. It works well in small single-app projects.

8 days agocjonas

It is also the "right thing to do", IMHO.

8 days agokohlerm

The VS Code extension Cline also does this.

8 days agoajoseps

This doesn't sound good to me, you end up with a large codebase that no human has actually laid eyes on. When you get a bug weird enough that you can't reason the LLM through it, then what? What if a bug is because of interactions between two systems, and you don't own one of them? What if there's an issue due to convoluted business process failures, that just end in a bug report like "my data is missing!"? I honestly think in the latter case, the LLM will just fix a 'bug' and miss the forest for the trees.

I prefer the idea of the other comment reply where you use AI as a tool to explore a codebase and assist you, not something you instruct to do the work. It can accelerate you building that experience and intuition at a level we've never been able to do before.

8 days agomeesles

An LLM itself is a large codebase that no human has laid eyes on; instead, you validate it through testing.

Regarding testing, I've had an interaction with Windsurf where I told it there was a bug in the application it generated. It replied "I've added some log statements, can you run it and tell me what you see? Then I'll know what to fix"… The LLM was instructing me…

7 days agoako

Nothing like that at all. For example, I have a few codebases, kind of large (for a certain definition of large), where I know the code since I either wrote it or participated heavily in it. Talking in snippets at a time loses a ton of context; with the whole context, you'd get better suggested solutions.

8 days agoKeyframe

I tried various solutions but I still haven’t found a chat tool that allows me to navigate a large monorepo. I’d like to be able to say "open the file where there is the function to do <xyz>", but current tools don’t understand that.

8 days agohk__2

This works fine in Cursor. As far as I know, you can't say "open the file..." but you can say "where is the function to do <xyz>" and it'll include a link to the file in its response, which you can then click to open.

8 days agolgas

Whilst the pendulum seems well on its way to swinging from microservices back to monoliths, I'm thinking we'll end up in a place that limits the volume and complexity of the code in a single service so that it's just large enough to encompass a point of single responsibility.

Then we can easily drop in and out of using LLMs in the code space.

Service Oriented Architecture lends itself well to the limited context of these models.

Maybe we can revive literate programming and simply build everything from a single markdown document..

8 days ago_puk

Microservices lend themselves to architectural decisions that LLMs are just not trained to understand.

It's one thing for it to be trained on billions of LOC and be useful; it's another for it to have a quality dataset large enough to give it context and understanding of something like Kafka partition ordering and its possible interactions with something like a database and at-least-once delivery. It will give you an explanation of those things in isolation, but not in combination.

8 days agoazthecx

Any unique benefits over using this vs something like Repomix? https://github.com/yamadashy/repomix

8 days agoActVen

CodeWeaver is compiled, so it might be faster. Also, you can grab a compatible executable from the releases section instead of using `go install`, so there are no dependencies. Personally, I considered following the `.gitignore` route but found that manually specifying what to exclude via a comma-separated list of regular expressions gave me the flexibility I needed (initial setup might be a bit tedious, but, then again, you can use an LLM for that).

8 days agotesserato

I could see this being quite useful in the background for apps like Cursor when they need to perform a full-codebase search. I imagine it could be more effective in breaking up larger codebases where embeddings start to fall short. If you could fit the entire document into context, you'd be able to "point the model" in the right direction.

The challenge is maintaining it... But you'd maybe ask the model to do that incrementally on every commit, or just throw it away and regenerate from scratch occasionally.

8 days agocjonas

See the script I created that does something similar with a few improvements for large projects:

https://paste.mozilla.org/9rD95yAy

I would like to be able to create sets of files that I can easily send to the clipboard in this kind of format. The files could correspond to the ones relevant to a particular feature, etc. They don't always fall under the same subtree of the source code, and the entire source code is too big for the context.

8 days agoresters

Link says snippet deleted.

8 days agoroskelld

Which is, like, kinda neat that it exists, but who's using tooling so bad that they're manually copying and pasting that much code into, what, a web-browser text box?

Use better tools, people!

8 days agofragmede

I have always used o1 pro and Deep Research, but these are only available through the web UI. There is no doubt that Cursor and others have a better UI, but the demand for this type of tool exists because OpenAI does not release an API for them.

8 days agonahco314

This seems useful for building new tools. It's not strictly an end-user tool.

8 days agororytbyrne

Exactly, the LLM-RAG boffins are all over stuff like this.

8 days agodazzawazza

If it's useful to anyone, I made a VS Code/Cursor extension that combines all open files into one big text document.

I use it with ChatGPT's o1 pro (which can handle around 100,000 tokens).

1. Open all of the files I think are relevant

2. Use the extension to combine them

3. Copy and paste into ChatGPT

https://marketplace.visualstudio.com/items?itemName=DVYIO.co...

7 days agodavidbarker

I’ll be using this, thank you!

7 days agoreplwoacause

Does anyone know of tools that go the other direction? i.e. taking a technical writeup (scientific paper, architecture docs, or similar) and emitting a candidate codebase.

8 days agororytbyrne

Maybe I don't understand, but isn't this what you use LLMs for?

8 days agoelashri

Yes, I often use one LLM to generate a PRD and then include it in the codebase, then ask the Cursor agent to implement some part of the system using the PRD as a reference. It can't emit an entire codebase in one shot (unless it's a trivial project like "build me a Flappy Bird clone"), but you can use it as scaffolding to manage implementing a whole project in chunks.

8 days agolgas

I don't know of a tool, but I've had some success doing this with a short one-shot prompt. I say something like, "Here's a readme. Develop this in Go.", followed by the readme.

I’ve been getting complete working code with this strategy but I’m creating projects that are relatively simple.

I also notice that I have to give a little deeper context about “how” it should work, which I normally wouldn’t do.

8 days agocodazoda

Given the limited context length of most LLMs, is there value in turning an entire codebase into a doc to feed into an LLM?

I think cherry-picking relevant sections would be necessary for it to function effectively. Has anyone tried using tree-sitter to recursively pull in the source of the functions used in the section you want to analyze, to optimize for this?

7 days agoshipp02

Interesting. I've been converting Jupyter notebooks into markdown for the same purpose. Am considering making a custom tool.

8 days ago__mharrison__

I also have this use case, and would be interested in such a tool. If you intend to write your tool in Golang, consider instead extending CodeWeaver.

8 days agotesserato

If I'm reading this correctly, why include all the code in the Markdown? It's almost as if the AI model that would use this necessarily consumes all the concatenated code plus an explanation of the code; I'm not sure which is better, since the LLM already has access to the entire code as part of the Markdown.

8 days agonarmiouh

I have one for CVEs, in case there are security folks here: it recursively finds details in reference links too, like the commit diff that fixed the vulnerability, and generates one single JSON.

1. https://github.com/BandarLabs/cveingest

8 days agomkagenius

Oh, cool -- this is made with Golang! I'll have to see if I can wrap it in a desktop GUI using Wails.

7 days agobkyan

Nice! I made something similar that converts codebases (local and GitHub URLs), as well as YouTube videos (transcripts) and blog posts, to Markdown.

https://github.com/tanq16/ai-context

7 days agoimport-base64

I see lots of folks here using LLMs on their codebases. Does that mean there isn’t much concern about sharing your app’s code with an LLM? Have people just gotten comfortable with this now? Or does it only matter for closed-source or proprietary codebases?

7 days agoreplwoacause

You can run an LLM on your local machine, and you can get LLM sandboxes for your company.

7 days agoako

https://www.repoprompt.com is better. You need more granular control if you're planning to use this on real, large codebases.

8 days agotribeca18

Is this related to https://gitingest.com/ at all? It seems to be a service doing a similar thing.

8 days agoemmelaich

There are a ridiculous number of projects doing this.

I'm always baffled by the response they get, since this is also the most impractical, poorly scaling way to insert an LLM into your development process.

On one hand, if you realize that, there may be times when you get lucky with the size of a codebase and the nature of your questions, and it works acceptably.

But on the other, this feels like the kind of thing someone who's heard others rave about the utility of AI will try with too large a codebase, paste the result into ChatGPT, and then watch the LLM underperform because it's being flooded with irrelevant context for every basic operation it's asked to do.

There are very few times when providing the entire codebase in the context window, instead of the code relevant to a single operation, makes sense.

8 days agoBoorishBears
[deleted]
8 days ago

It is not. Others have commented pointing to services similar to this one, though.

8 days agotesserato

There is also repo2txt.simplebasedomain.com/local.html

8 days agoAlifatisk

This is great, but I’m pretty sure this is trivial using Emacs and org mode. You could then use pandoc to convert the org file to markdown.

8 days agosqueegee_scream

It's trivial using a number of approaches, e.g. a simple bash or Python script. But I think there's still a fair amount of value in building a common tool for these sorts of things. Everyone who builds their own one-off solution will inevitably encounter more and more of the edge cases (oh, I need to honor .gitignore... oh, I need to be able to override .gitignore and include some ignored things... oh, I need to deal with huge files... etc.), and with a common tool, the tool can collect the ways of dealing with all of these edge cases.

No single user will need something that handles all of the edge cases, but whatever edge cases they do need handled will already be handled. The overall time and frustration saved this way can be huge.
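
To illustrate just the first two of those edge cases, a minimal concatenator that applies a simple ignore-pattern list (glob patterns only — real .gitignore semantics with negation and directory rules are exactly the kind of thing a common tool would accumulate) and skips huge files might look like:

```python
import fnmatch
import tempfile
from pathlib import Path

def bundle(root: Path, ignore_patterns: list[str], max_bytes: int = 100_000) -> str:
    """Concatenate files under `root` into one markdown doc, skipping
    ignored patterns and anything larger than `max_bytes`."""
    fence = "`" * 3  # built at runtime so this sketch's own fence isn't broken
    parts = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
            continue  # edge case 1: honor the ignore list
        if path.stat().st_size > max_bytes:
            parts.append(f"## {rel}\n\n(skipped: {path.stat().st_size} bytes)\n")
            continue  # edge case 2: don't inline huge files
        parts.append(f"## {rel}\n\n{fence}\n{path.read_text()}{fence}\n")
    return "\n".join(parts)

# tiny demo tree
tmp = Path(tempfile.mkdtemp())
(tmp / "main.py").write_text("print('hi')\n")
(tmp / "secrets.env").write_text("KEY=1\n")
(tmp / "big.log").write_text("x" * 200_000)
doc = bundle(tmp, ["*.env"], max_bytes=1_000)
```

Every one-off script starts roughly here and then grows override flags, binary-file detection, and so on — which is the argument for pooling the effort.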

8 days agolgas

How do you do the opposite of this? Transform your markdown files into a codebase that AI can't leech off of?

7 days agonovemp

There’s ClipSource for VS Code that does this.

8 days agostrizzo

From the description, it seems to only work with Python codebases.

3 days agotesserato

Damn, I did that the other day, but manually. I just cat'ed everything from a folder in the order that I wanted and fed it to ChatGPT so it could write a README for tiny.js.

8 days agoatum47

I built a simple tool to do something similar (it's meant for a monorepo and will bundle each subfolder into a subfolder-code.txt text file that you can upload to AIs).

https://github.com/manfrin/bundle-codebases

I don't see much merit in things like markdown or syntax highlighting, as that's just extra noise for the AI. My script tries to cut down on any extraneous data, since the things I'm working on are near the context limit of consumer AIs.

My script also ignores anything in .gitignore and will take a .codebundlerwhitelist (I hate this name and have meant to change it) to only bundle files matching patterns you specify.

8 days agommanfrin

Not just extra noise, but also extra tokens.
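
A back-of-the-envelope sketch of that cost (the ~4 chars/token ratio and 40-character average path length are assumptions, not measurements):

```python
def scaffold_overhead_tokens(n_files: int, avg_path_len: int = 40,
                             chars_per_token: int = 4) -> int:
    """Rough estimate of tokens spent on markdown scaffolding alone:
    per file, a '## path' heading plus opening and closing code fences."""
    fence = "`" * 3
    # heading prefix + path + two fence lines (fence + newline) + blank lines
    per_file = len("## ") + avg_path_len + 2 * (len(fence) + 1) + 4
    return n_files * per_file // chars_per_token

print(scaffold_overhead_tokens(500))  # → 6875
```

A few thousand tokens of pure formatting on a 500-file repo — not fatal, but pure overhead when you're already brushing the context limit.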

8 days agoantirez

Exactly.

8 days agommanfrin

How does this compare to code2prompt or files2prompt? Any benchmarks on which one works better for LLMs?

8 days agosandGorgon

Another alternative is Gitingest [0]. What are the differences?

[0] https://gitingest.com/

8 days agoadityamwagh
[deleted]
8 days ago
[deleted]
4 days ago

[dead]

8 days agobingzhuwuhen

[dead]

8 days agosamueljames324

Wait, just one question…

Can I call this C++ code “machine code” now?