
Show HN: Transform your codebase into a single Markdown doc for feeding into AI

CodeWeaver is a command-line tool designed to weave your codebase into a single, easy-to-navigate Markdown document. It recursively scans a directory, generating a structured representation of your project's file hierarchy and embedding the content of each file within code blocks. This tool simplifies codebase sharing, documentation, and integration with AI/ML code analysis tools by providing a consolidated and readable Markdown output.

CodeWeavers is a software company that focuses on Wine development and sells a proprietary version of Wine called CrossOver for running Windows applications on macOS, ChromeOS and Linux.

https://en.wikipedia.org/wiki/CodeWeavers

Trademark is active. It's an ®, not just a ™: registered, not merely trademarked. To keep it, they have to demonstrate they defend it.

https://www.trademarkia.com/codeweavers-76546826

While this project drops the final "s", you don't get to launch an OS called "Window". The test is a fuzzy match based on likelihood of confusion.

8 days agoTerretta

Yeah, I was thinking "what do the Wine guys have to do with this?"

This project is definitely going to get C&D'd.

7 days agojychang

Do you think they would actually litigate? They seem like different products serving entirely different markets so I am not sure that the trademark infringement claim is very defensible. And how do they prove damages?

7 days agowilliamcotton

CodeLoom could work instead

7 days agopstuart

I use the following for feeding into AI

   find . -type f -print -exec cat {} \; -exec echo \;
Which returns, for each file (recursing into subfolders), the filename and then the content of the file.

Then `| pbcopy` to copy to clipboard and paste it into ChatGPT or similar.

8 days agocrisbal_

I guess this only works for very small codebase?

8 days agosingpolyma3

Correct, but it's the same as what OP shared.

You should use Aider/Cursor for proper indexing/intelligent codebase referencing

8 days agoOsrsNeedsf2P

Cramming thousands of tokens of potentially irrelevant context through unclear indexing paths isn't "proper".

The best results come from feeding precisely targeted context directly into the prompt, where you know exactly what the model sees and how it processes it. The raw prompt gets the most direct use of attention, whereas god knows what the pipeline is for Cursor, or what extra layers and context restrictions they add on top of base Claude.

Giving the model a clean project hierarchy accomplishes a lot efficiently in terms of context tokens. The key is ensuring it only sees what's relevant, without diluting its attention.

Tools like repomix and OP's version, feeding targeted context straight into models like Claude or Google's offerings, outperform Copilot and Cursor in my experience, even though they use the same base models. Use the highest-quality attention (the prompt context) directly, rather than layers of uncertainty and "proper indexing".

7 days agostarfezzy

I'm still puzzled that people are convinced by Cursor, while my experience was meh at best. Can it index your stuff? Okay, it can. Can it refactor a simple function? No, it cannot; it can't even rename a damn Java class. How can I then trust it to generate code based on my codebase? So, what is your use case? Or can anybody point me to some blog/articles/videos showing some real use cases for Cursor? Real as in, something that it provenly can do?

7 days agosoco

I think you know the correct answer:)

7 days agorisyachka

I hoped to be wrong, but no comment so far has even tried to bring a real argument... eh, maybe I'll try again in a year.

4 days agosoco

>Can it refactor a simple function?

Certainly, I do that several times a day.

7 days agokristiandupont

Listen, I don't want to brag too much, but it even made me a function today.

7 days agorob

>Java

found the problem

7 days agoastar1

Correct. It's Web 3.0 2.0. You're supposed to play along to make the stock prices go up and to the right.

7 days agojcgrillo

Not sure if it's Cursor's fault, but very often it doesn't give me the real or complete code from my codebase when auto-editing/auto-completing.

any tips?

8 days agoboredemployee

Or, for a lazier approach:

    $ head -10000 *
    ==> package.json <==
    {
      "name": ...
      ...
    ==> tsconfig.json <==
    {
       "extends": ...
      ...

    $ head -10000 * | llm -s "generate a patch to switch this project to esm"
7 days agousagisushi

That's very nice and compact. I do the same with a short bash script, but wrap each file in triple backticks and attempt to put the correct language label on each, e.g.:

Filename: demo.py

```python
...python code here...
```
8 days agoDrPhish

Seconded, because just having something auto-wrapped like that and put on the clipboard would save me time: release the Snyder cut, er, bash script!

7 days agogenewitch
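For the impatient, a minimal sketch of such a wrapper (this is not DrPhish's actual script; the extension-to-label mapping and function name are illustrative):

```shell
# wrap_files: print each argument as "Filename: ..." followed by a fenced,
# language-labelled code block, roughly as the commenters describe.
wrap_files() {
  for f in "$@"; do
    case "$f" in
      *.py) lang=python ;;
      *.go) lang=go ;;
      *.js) lang=javascript ;;
      *.sh) lang=bash ;;
      *)    lang='' ;;
    esac
    printf 'Filename: %s\n\n```%s\n' "$f" "$lang"
    cat "$f"
    printf '```\n\n'
  done
}
```

Usage would be something like `wrap_files src/*.py | pbcopy`.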

Tip: If you ever need to do this on a public GitHub repository you can use "gitingest".

This will open a website that creates a copy of all the file contents of the repo (code, docs, ...) It's a great tool to use when using new/obscure code with LLMs in my opinion.

The UX is just so easy and great: change the URL from <https://github.com/user_name/repo_name> to <https://gitingest.com/user_name/repo_name>

//edit: fixed URLs

8 days agobeklein

I copied the UX for my https://gitpodcast.com (creates a podcast from a GitHub repo; same idea, replace `hub` with `podcast`)

8 days agomkagenius

I am very impressed by gitpodcast. I just listened to one podcast and, first of all, I am pleased with the idea; the voices are also pleasant to listen to. Thanks for sharing!

7 days agopratyahava

Unfortunate naming, given that CodeWeavers is already a company making a Windows "emulator" for Linux and macOS. [1]

[1] https://www.codeweavers.com/

8 days agoreddalo

CodeWeavers are actually making Wine itself, not just some "emulator". They then distribute it along with some QOL tools as a commercial product called CrossOver.

8 days agoArch-TK

All names are taken. There's no need to point this out every time.

8 days agolgas

Not all names are registered trademarks for software.

7 days agoanamexis

Some are more confusing than others.

8 days agoRexxar

Huewoblfan is not taken! Noiewoidc is free. XIONqlic – totally available, can mean a range of things! Ciohupoij – a bit of Asian flavour, but still a valid free name.

7 days agogloosx

How does this compare to / differ from https://github.com/yamadashy/repomix ?

8 days agopmx

Some advantages of CodeWeaver: it's compiled, so it might be faster, and you can grab a compatible executable from the releases section instead of using `go install`, so there are no dependencies. You can also manually specify what to exclude via a comma-separated list of regular expressions, so it might be more flexible. I've never used Repomix, so those assumptions might not hold. On the other hand, Repomix seems to be far more complete, a full-fledged solution for converting source code to monolithic representations. I wrote CodeWeaver because I just needed something that worked and that, occasionally, I could trust to keep sensitive data away from sketchy LLMs (and I wasn't aware of other solutions).

8 days agotesserato

or https://github.com/azer/llmcat

8 days agoakoculu

I simply have a bash script called printall which takes in some args, and outputs markdown codeblocks with filenames and a tree. One of hundreds of scripts built up over the years.

8 days agoimdsm

if you add fzf to speed up file / folder selection, you'll have your own llmcat :)

8 days agoakoculu

My question exactly. Repomix seems to be a tested utility for something like this.

8 days agoycombinatornews

Same question here. I have found repomix to get the job done really well.

8 days agoActVen

I really want a tool like this that can extract a function and its dependency graph (to a certain depth maybe, and/or exclude node_modules).

I wrote this library [1] and hope to add the fine-grained "reference resolution" utility to it at some point, which could make implementing such a tool a lot simpler.

[1]: https://github.com/aleclarson/ts-module-graph

8 days agoretropragma

I use aider's /copy-context command for that

https://aider.chat/docs/usage/copypaste.html

and with /paste you can apply the changes.

8 days agotherealmarv

Thanks for letting folks know about aider's /copy-context command.

To add some more detail, aider has a mode/UX that is optimized for "copy and paste" coding with LLM web chats. The "big brain" LLM in the web chat does the hard work, and a cheap/local LLM works with aider to apply edits to your local files.

There's a little demo video in the link above that should give you the gist.

7 days agoanotherpaulg

I’ve made a CLI tool that does something similar, called Copcon:

https://github.com/kasperjunge/copcon

Point it at a code project directory to get a file tree and content, optionally with a git diff, copied to the clipboard - ready for copy pasting into ChatGPT.

It is very true that this only works for small projects, as you will bloat the LLM’s context with large codebases.

My solution to this is two files you can use to steer the tool’s behavior:

- .copconignore: For ignoring specific files and directories.

- .copcontarget: For targeting specific files and directories (applied before .copconignore).

These two files provide great control over what to include and exclude in the copied context.

7 days agojuunge

A new tool like this comes out every week, and that's great! But I think it's fair to ask how it compares to popular ones like Repomix. Anyone keeping an eye on this space will want to know how it differs from what's already out there and being used.

8 days agotempoponet

I actually wrote this a couple of months ago, so perhaps nothing similar existed at the time (I remember doing some research, mostly focused on VS Code plugins). Nevertheless, the idea was also to test how Golang could facilitate distributing such micro-tools throughout the internal team, so I probably would have made it anyway. It's nice to know that similar tools exist; I'll take a look at them.

8 days agotesserato

  find . -type f -name '*.py' -exec sh -c 'echo "# $1"; cat "$1"; echo ""' _ {} \; | pbcopy
8 days agomaurycy

Somewhat related. I built an Elm app all in one file as an experiment and to see if I like it. It's a little over 7k lines and I'm occasionally adding more to it.

It's actually pretty straightforward if you're in a language with lexical scoping, and it simplifies some things: no include/cyclic-dependency issues, no modules, no hunting through files, etc.

I feel like this set up could integrate really well w/ AI models.

I've found that the only real limitation, at least in my experiment, was a lack of decent editor support. I use vim, so this wasn't much of an issue for me: there are many great ways to navigate a file, plus a combination of vertical and horizontal splits on a large screen. But when I opened it up in other "modern" editors, the ergonomics fell apart quite a bit.

I think the biggest downside was that re-using variable names between large scopes occasionally made it hard to find the reference I wanted (e.g. i, x, key, val), but again, better editor support allowing you to limit your search to the current scope would help. It's also easily mitigated with more verbose throwaway variable naming.

8 days agorapind

I write Elm and use Emacs primarily, and sometimes Neovim. Are you using LSP in vim? You're doing it right by staying in one file until it hurts (that's the recommendation for Elm), but I can't recall whether I've had issues using go-to-def or other LSP functions like you're describing.

8 days agosqueegee_scream

No LSP. It honestly doesn’t speed me up any. I already have the standard library memorized, plus some of the common community lib methods (List.Extra) and my typing speed is faster than I can think anyways.

I’m thinking the same approach would also work well in F#, Haskell, OCaml.

7 days agorapind

> no hunting through files, etc.

It's easy to switch to files by name with a few keystrokes. File names group the things I'm looking for.

I would much rather do that than try to search through a 7,000 line file for what I need.

> I feel like this set up could integrate really well w/ AI models.

Massive files or too many files break AI models. Grouping functionality into smaller files and including only relevant files is key. The file and folder names can be hints about where to find the right files to include.

8 days agoAurornis

> I would much rather do that than try to search through a 7,000 line file for what I need.

I mean I'm not arguing for it as a best practice. I did it as an experiment (as I stated), and discovered it's actually really easy, and snappy for me to navigate in Vim. Mileage may vary with other editors. Have you tried it?

> Massive files or too many files break AI models

It's growing faster than I code! With the latest Gemini, at least, the context is much larger, at 1-2 million tokens. I'm sure we'll hit a ceiling, though, but I also think we may eventually see some context-caching / RAG-type optimizations.

8 days agorapind

The big problem with that is you’ll eventually blow your context window feeding the model with stuff that it mostly doesn’t need in order to complete its task.

8 days agocruffle_duffle

I can't think of anything I'd want to add to the context for Elm, at least, assuming the standard libraries are already in the model (or can be added via RAG). Gemini is at 2M tokens now, and I expect this will grow at least until it's no longer meaningful.

7 days agorapind

Nice! Built something similar in Rust that supports local and remote repos: https://crates.io/crates/r2md

8 days agostan_kirdey

I thought of using Rust, but ultimately chose Go. I'll take a look and see how something similar came out in Rust!

8 days agotesserato

Something I didn't dig in to find out, but is it possible for these applications to also respect .gitignore files? Might be a handy flag!

8 days agojdironman

In any Node project that basically _must_ be done, or your source code will be eclipsed by whatever is in node_modules.

7 days agofullstackchris
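For what it's worth, when the project is a git repo you can get .gitignore handling for free by letting git produce the file list instead of walking the tree yourself. A minimal sketch (the function name and `== file ==` header format are arbitrary):

```shell
# dump_tracked: print every file git knows about (tracked, plus untracked
# files that are not ignored) with a small header, ready to pipe elsewhere.
# --exclude-standard makes git apply .gitignore, .git/info/exclude, etc.
dump_tracked() {
  git ls-files --cached --others --exclude-standard | while IFS= read -r f; do
    printf '== %s ==\n' "$f"
    cat "$f"
    printf '\n'
  done
}
```

In a Node project this automatically skips node_modules, assuming it is listed in .gitignore as usual.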

This is like a rediscovery of an org-mode capability that has existed for decades, and doesn't do as much.

8 days agoskeledrew

Is it? I use org-babel regularly but wasn't aware of it; what's the function called? As great as org-mode / org-babel is, the user base is small enough that it tends to be overlooked.

7 days agohatmatrix

Well in general I've put entire projects into org docs, and ran the code blocks, essentially using it like a Jupyter notebook (although honestly it wasn't always as smooth as I'd like). And I haven't done this myself, but there's a neat literate programming talk from the last EmacsConf[0] in which the presenter showed some custom capabilities which improved the experience even more for him.

[0] https://emacsconf.org/2024/talks/literate/

7 days agoskeledrew

Following the /llms.txt standard proposal, I created a MkDocs plugin that generates an /llms.txt file at the root of your site. So, same thing, but it generates the Markdown document from your docs (possibly containing an API reference) instead of your code.

7 days agoPawamoy

Such functionality would be useful for developing some scripts and then converting them to a Quarto document [1].

[1] https://quarto.org/

7 days agohatmatrix

I've never used Quarto, but I might give it a go someday. I currently have a convoluted workflow for generating math-heavy documents that involves generating equations using SymPy in a notebook, accumulating them in a string, and ultimately dumping the string into a Markdown file. I would love to simplify this sooner rather than later. I'm also keeping an eye on https://typst.app/ and hoping for a sane alternative to LaTeX to emerge.

6 days agotesserato

Second hooray for Quarto. Great tool.

7 days agombonnet

This could be a lot better. The example linked in the GitHub README is a Markdown file full of binary garbage, because the tool also tried to convert gzip files to Markdown.

Pretty big red flag that this isn't ready for primetime.

8 days agocausal

Thank you for pointing that out. Just fixed it.

8 days agotesserato

My codebase sitting at 4M lines: hold my spaghetti.

8 days agoainiriand

This is self-promotional, but https://github.com/nahco314/feed-llm has a TUI for choosing what to give to the LLM. There are many similar tools out there, but I think this approach is relatively effective for larger codebases.

8 days agonahco314

You can ask Cursor to use information from a specific folder (aka your 4M lines), and it will summarize it and use that.

Not a replacement for the full 4M lines, but it might work for some tasks/prompts.

8 days agoycombinatornews

This kind of context is really useful for LLMs, but in any significant project, including all the code this way will easily exceed context limits. I've been wanting to do something like this for my PHP projects, but instead of dumping entire files, it would just create a map of each file's method signatures, variables, etc. That should give good enough information about what each file is for and can do, while being small enough to be ingested by the AI.

8 days agonunodonato
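As a rough illustration of that idea, a grep-based first pass at a signature map for PHP might look like this (the pattern is a sketch: it will miss multi-line or unusual declarations, and a real tool would use a proper parser):

```shell
# list_signatures DIR: print file:line:declaration for every class,
# interface, trait, and function declaration found in .php files under DIR.
list_signatures() {
  grep -rn --include='*.php' -E \
    '^[[:space:]]*(((public|private|protected|static|abstract|final)[[:space:]]+)*function[[:space:]]+[A-Za-z_]|((abstract|final)[[:space:]]+)?(class|interface|trait)[[:space:]])' \
    "${1:-.}"
}
```

The output (one declaration line per match, with file and line number) is a fraction of the size of the full source, which is the point.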

> including all code in this manner will easily exceed context limitations

The context window for Gemini 2.0 Flash can handle roughly 50000 lines of code, and 2.0 Pro can handle twice that.

7 days agopanarky

That runs out faster than you'd think. Also, attention to and memory of facts in the context degrade as the context grows, which can hurt when you just want to dump everything at once.

7 days agonunodonato

For extra points, compile your docs into one file and feed it that as well.

(unless the reason you're giving AI the code is that you don't have any docs for either humans or machines)

8 days agolornajane

Anybody have experience using something like this with a big codebase and Gemini's 2M context window? I tried a while ago (before 2.0 Flash) to solve some refactoring tasks, and even after spending some time on prompt wrangling I didn't manage to get good results out of it.

I don't know what kind of agent architecture Cursor uses internally, but it seems much better at finding where changes need to be made.

8 days agomtrovo

In my experience feeding large codebases to Gemini, simple tasks work OK (enumerate where such-and-such happens, find where a certain function is called, list TODOs throughout the code, etc.), but tasks that require a bit more logic are trickier. Nevertheless, I had some success with moderately complex refactoring tasks in Python codebases.

8 days agotesserato

This thread has convinced me that Aider/Cursor need to do more marketing.

8 days agoOsrsNeedsf2P

Maybe. But maybe some like the more disconnected way of coding with ai.

8 days agolarusso

Why? It's just moving more of the grunt work of shuffling things around to the human?

8 days agolgas

For me it's still about feeling in control, and the fact that I don't want to inject it into every workflow. I'm open to AI and use it daily, but my terms may be different than others'. I want to control what I share and how. People have secrets and other things in a project. I sometimes rename things because the AI should only deal with the big picture. Paint me paranoid, but that's how it is for me.

8 days agolarusso

Same for Windsurf; I've been using it to generate documentation for codebases. It will generate Markdown with Mermaid diagrams to explain whatever you want to know: from the component architecture of an entire application, to the sequence diagram for a specific button, to data and ER diagrams.

But the approach of fitting your entire codebase into one document so you can include it in your prompt context seems like a dead end; instead, the LLM can use an agent to do targeted search through your code.

7 days agoako

Cursor is all the rage. Nobody talks about Aider, sadly.

7 days agorane

I partially disagree. Maybe it depends on what circles you run in, but at least here on HN I've seen Aider mentioned more times than I can count. Is Cursor more popular? Yeah… but people here are talking about Aider. That's how I learned about it.

7 days agoreplwoacause

The future is not evenly distributed.

8 days agoesafak

I made a similar tool in Golang, https://github.com/foresturquhart/grimoire. It tries to be a bit cleverer, by prioritising files that have had many commits, respecting .gitignore files, and excluding useless content like binaries or vector images.

8 days agoConasg

I can think of no use case where binaries are desired in such representation, so I might bake binary exclusion into CodeWeaver as well. SVGs, on the other hand, might be wanted sometimes, in web design contexts. I'll take a look at your implementation and see what I can learn.

8 days agotesserato

thisismy has a `-g` (greedy) option, which then also takes binaries.

8 days agofranze

Nice! Written in go. I like that :)

8 days agocodecraze

I would say the demand for this kind of tool definitely exists. Good work! From a rough glance it looks pretty similar to another tool that I've been using https://github.com/mufeedvh/code2prompt

8 days agomegadragon9

I literally just wrote something similar, called techdocs [1], in Rust; it uses Claude to generate a README, including API and CLI.

[1] https://github.com/thesurlydev/techdocs

8 days agothesurlydev

Nice! I thought of using Rust. I'll check how you implemented it.

8 days agotesserato

Wouldn't it be wonderful to have a tool where you interact with the AI through the codebase, via an IDE / vim / emacs tree? Say, you open your codebase and start with prompts, and the AI + tool navigates to a function or a place where it needs to and modifies stuff while chatting with you about it? Or you jump somewhere and highlight where you are to scope down its focus (while it still retains all of the code in history/memory). Sort of like pair programming. It sounds so obvious that I'm almost sure I've missed it already existing somewhere. I think I tried Google's thing (forgot the name) but it sucked / wasn't that.

8 days agoKeyframe

I think you’re describing Aider.chat. There are 2 Emacs packages for it, one official and a very recent fork. Aider is a cli so it works great with vim as well.

In Emacs I’ve had good experience with gptel as well but I prefer aider for the coding workflow

8 days agosqueegee_scream

Yep, I've particularly been enjoying the recent "watchfiles" feature where a comment can be added to the source file, and ending it with "ai?" or "ai!" triggers use of said comment as a prompt to ask about or change that section upon save.

7 days agoskeledrew

I'll check it out, thanks!

8 days agoKeyframe

Apologies if I'm missing something, but aren't you describing Cursor/Copilot/Windsurf?

8 days agozknowledge

You're not. Looks like that's kind of it, but would the thing have the context of the whole project when I'm in a file/class/function? With Copilot, in my case, it was so far mostly a fancy autocomplete that only has the immediate vicinity in its memory, whereas it would be vastly more useful if it had the context of the whole project / all files.

8 days agoKeyframe

Cursor indexes the entire codebase with embeddings. It works well in small single-app projects.

8 days agocjonas

It is also the "right thing to do", IMHO.

8 days agokohlerm

The VS Code extension Cline also does this.

8 days agoajoseps

This doesn't sound good to me, you end up with a large codebase that no human has actually laid eyes on. When you get a bug weird enough that you can't reason the LLM through it, then what? What if a bug is because of interactions between two systems, and you don't own one of them? What if there's an issue due to convoluted business process failures, that just end in a bug report like "my data is missing!"? I honestly think in the latter case, the LLM will just fix a 'bug' and miss the forest for the trees.

I prefer the idea of the other comment reply where you use AI as a tool to explore a codebase and assist you, not something you instruct to do the work. It can accelerate you building that experience and intuition at a level we've never been able to do before.

8 days agomeesles

An LLM itself is a large codebase that no human has laid eyes on; instead, you validate it through testing.

Regarding testing, I've had an interaction with Windsurf where I told it there was a bug in the application it generated. It replied "I've added some log statements, can you run it and tell me what you see? Then I'll know what to fix"… The LLM was instructing me…

7 days agoako

Nothing like that at all. For example, I have a few codebases, kind of large (for a certain definition of large), where I know the code since I either wrote it or participated heavily in it. Talking in snippets at a time loses a ton of context; with the whole context, you'd get better suggested solutions.

8 days agoKeyframe

I tried various solutions but I still haven’t found a chat tool that allows me to navigate a large monorepo. I’d like to be able to say "open the file where there is the function to do <xyz>", but current tools don’t understand that.

8 days agohk__2

This works fine in Cursor. As far as I know, you can't say "open the file..." but you can say "where is the function to do <xyz>" and it'll include a link to the file in its response, which you can then click to open.

8 days agolgas

Whilst the pendulum seems well on its way to swinging from microservices back to monoliths, I'm thinking we'll end up in a place that limits the volume and complexity of the code in a single service so that it's just large enough to encompass a point of single responsibility.

Then we can easily drop in and out of using LLMs in the code space.

Service Oriented Architecture lends itself well to the limited context of these models.

Maybe we can revive literate programming and simply build everything from a single markdown document..

8 days ago_puk

Microservices lend themselves to architectural decisions that LLMs are just not trained to understand.

It's one thing for it to be trained on billions of LOC and be useful; it's another for it to have a quality dataset large enough to give it context and understanding of something like Kafka partition ordering and its possible interactions with something like a database and at-least-once delivery. It will give you an explanation of those things in isolation, but not in combination.

8 days agoazthecx

Any unique benefits over using this vs something like Repomix? https://github.com/yamadashy/repomix

8 days agoActVen

CodeWeaver is compiled, so it might be faster. Also, you can grab a compatible executable from the releases section instead of using `go install`, so there are no dependencies. Personally, I considered following the `.gitignore` route but found that manually specifying what to exclude via a comma-separated list of regular expressions gave me the flexibility I needed (initial setup might be a bit tedious, but, then again, you can use an LLM for that).

8 days agotesserato

I could see this being quite useful in the background for apps like Cursor when they need to perform a full-codebase search. I imagine it could be more effective in breaking up larger codebases where embeddings start to fall short. If you could fit the entire document into context, you'd be able to "point the model" in the right direction.

The challenge is maintaining it... But you'd maybe ask the model to do that incrementally on every commit, or just throw it away and regenerate from scratch occasionally.

8 days agocjonas

See the script I created that does something similar with a few improvements for large projects:

https://paste.mozilla.org/9rD95yAy

I would like to be able to create sets of files that I can easily send to the clipboard in this kind of format. The files could correspond to the ones relevant to a particular feature, etc. They don't always fall under the same subtree of the source code, and the entire source code is too big for the context.

8 days agoresters

Link says snippet deleted.

8 days agoroskelld

Which is, like, kinda neat that it exists, but who's using tooling so bad that they're manually copying and pasting that much code into, what, a web-browser text box?

Use better tools, people!

8 days agofragmede

I have always used o1 pro and Deep Research, but these are only available through the web UI. There is no doubt that Cursor and others have a better UI, but the demand for this type of tool exists because OpenAI does not release an API for them.

8 days agonahco314

This seems useful for building new tools. It's not strictly an end-user tool.

8 days agororytbyrne

Exactly, the LLM-RAG boffins are all over stuff like this.

8 days agodazzawazza

If it's useful to anyone, I made a VS Code/Cursor extension that combines all open files into one big text document.

I use it with ChatGPT's o1 pro (which can handle around 100,000 tokens).

1. Open all of the files I think are relevant

2. Use the extension to combine them

3. Copy and paste into ChatGPT

https://marketplace.visualstudio.com/items?itemName=DVYIO.co...

7 days agodavidbarker

I’ll be using this, thank you!

7 days agoreplwoacause

Does anyone know of tools that go the other direction? i.e. taking a technical writeup (scientific paper, architecture docs, or similar) and emitting a candidate codebase.

8 days agororytbyrne

Maybe I don't understand, but isn't this what you use LLMs for?

8 days agoelashri

Yes, I often use one LLM to generate a PRD and then include it in the codebase, then ask the Cursor agent to implement some part of the system using the PRD as a reference. It can't emit an entire codebase in one shot (unless it's a trivial project like "build me a Flappy Bird clone"), but you can use it as scaffolding to manage implementing a whole project in chunks.

8 days agolgas

I don't know of a tool, but I've had some success doing this with a short one-shot prompt. I say something like, "Here's a readme. Develop this in Go.", followed by the readme.

I’ve been getting complete working code with this strategy but I’m creating projects that are relatively simple.

I also notice that I have to give a little deeper context about “how” it should work, which I normally wouldn’t do.

8 days agocodazoda

Given the limited context length of most LLMs, is there value in turning an entire codebase into a doc to feed into an LLM?

I think cherry-picking relevant sections would be necessary for it to function effectively. Has anyone tried using tree-sitter to recursively pull in the source of the functions used in the section you want to analyze, to optimize for this?

7 days agoshipp02

Interesting. I've been converting Jupyter notebooks into markdown for the same purpose. Am considering making a custom tool.

8 days ago__mharrison__

I also have this use case, and would be interested in such a tool. If you intend to write your tool in Golang, consider instead extending CodeWeaver.

8 days agotesserato

If I'm reading this correctly, why include all the code in the Markdown? It's almost as if the AI model that would use this necessarily consumes all the concatenated code plus an explanation of the code; I'm not sure which is better, since the LLM already has access to the entire code as part of the Markdown.

8 days agonarmiouh

I have one for CVEs, in case there are security folks here: it recursively finds details in reference links too, like the commit diff that fixed the vulnerability, and generates one single JSON.

1. https://github.com/BandarLabs/cveingest

8 days agomkagenius

Oh, cool -- this is made with Golang! I'll have to see if I can wrap it in a desktop GUI using Wails.

7 days agobkyan

Nice! I made something similar that converts codebases (local and GitHub URLs), as well as YouTube videos (transcripts) and blog posts, to Markdown.

https://github.com/tanq16/ai-context

7 days agoimport-base64

I see lots of folks here using LLMs on their codebases. Does that mean there isn’t much concern about sharing your app’s code with an LLM? Have people just gotten comfortable with this now? Or does it only matter for closed-source or proprietary codebases?

7 days agoreplwoacause

You can run an LLM on your local machine, and you can get LLM sandboxes for your company.

7 days agoako

https://www.repoprompt.com is better. You need more granular control if you're planning to use this on real, large codebases.

8 days agotribeca18

Is this related to https://gitingest.com/ at all? It seems to be a service doing a similar thing.

8 days agoemmelaich

There are a ridiculous number of projects doing this.

I'm always baffled by the response they get, since this is also the most impractical, poorly scaling way to insert an LLM into your development process.

On one hand, if you realize that, there may be times when you get lucky with the size of a codebase and the nature of your questions, and it works acceptably.

But on the other, this feels like the kind of thing someone who's heard others rave about the utility of AI will try with too large a codebase, paste the result into ChatGPT, and then watch the LLM underperform because it's being flooded with irrelevant context for every basic operation it's asked to do.

There are very few times when providing the entire codebase in the context window, instead of the code relevant to a single operation, makes sense.

8 days agoBoorishBears
[deleted]
8 days ago

It is not. Others have commented pointing to services similar to this one, though.

8 days agotesserato

There is also repo2txt.simplebasedomain.com/local.html

8 days agoAlifatisk

This is great, but I’m pretty sure this is trivial using Emacs and org mode. You could then use pandoc to convert the org file to markdown.

8 days agosqueegee_scream

It's trivial using a number of approaches, e.g. a simple bash or Python script. But I think there's still a fair amount of value in building a common tool for these sorts of things. Everyone who builds their own one-off solution will inevitably encounter more and more of the edge cases (oh, I need to honor .gitignore... oh, I need to be able to override .gitignore and include some ignored things... oh, I need to deal with huge files... etc.), and with a common tool, the tool can collect the ways of dealing with all of these edge cases.

No single user will need something that handles all of the edge cases, but whatever edge cases they do need handled will already be handled. The overall time and frustration saved this way can be huge.
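
To illustrate just the first two of those edge cases, a minimal concatenator that applies a simple ignore-pattern list (glob patterns only — real .gitignore semantics with negation and directory rules are exactly the kind of thing a common tool would accumulate) and skips huge files might look like:

```python
import fnmatch
import tempfile
from pathlib import Path

def bundle(root: Path, ignore_patterns: list[str], max_bytes: int = 100_000) -> str:
    """Concatenate files under `root` into one markdown doc, skipping
    ignored patterns and anything larger than `max_bytes`."""
    fence = "`" * 3  # built at runtime so this sketch's own fence isn't broken
    parts = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if any(fnmatch.fnmatch(rel, pat) for pat in ignore_patterns):
            continue  # edge case 1: honor the ignore list
        if path.stat().st_size > max_bytes:
            parts.append(f"## {rel}\n\n(skipped: {path.stat().st_size} bytes)\n")
            continue  # edge case 2: don't inline huge files
        parts.append(f"## {rel}\n\n{fence}\n{path.read_text()}{fence}\n")
    return "\n".join(parts)

# tiny demo tree
tmp = Path(tempfile.mkdtemp())
(tmp / "main.py").write_text("print('hi')\n")
(tmp / "secrets.env").write_text("KEY=1\n")
(tmp / "big.log").write_text("x" * 200_000)
doc = bundle(tmp, ["*.env"], max_bytes=1_000)
```

Every one-off script starts roughly here and then grows override flags, binary-file detection, and so on — which is the argument for pooling the effort.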

8 days agolgas

How do you do the opposite of this? Transform your markdown files into a codebase that AI can't leech off of?

7 days agonovemp

There’s ClipSource for VS Code that does this.

8 days agostrizzo

From the description, it seems to only work with Python codebases.

3 days agotesserato

Damn, I did that the other day, but manually. I just cat'ed everything from a folder in the order that I wanted and fed it to ChatGPT so it could write a README for tiny.js.

8 days agoatum47

I built a simple tool to do something similar (it's meant for a monorepo and will bundle each subfolder into a subfolder-code.txt text file that you can upload to AIs).

https://github.com/manfrin/bundle-codebases

I don't see much merit in things like markdown or syntax highlighting, as that's just extra noise for the AI. My script tries to cut down on any extraneous data, since the things I'm working on are near the context limit of consumer AIs.

My script also ignores anything in .gitignore and will take a .codebundlerwhitelist (I hate this name and have meant to change it) to only bundle files matching patterns you specify.

8 days agommanfrin

Not just extra noise, but also extra tokens.
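
A back-of-the-envelope sketch of that cost (the ~4 chars/token ratio and 40-character average path length are assumptions, not measurements):

```python
def scaffold_overhead_tokens(n_files: int, avg_path_len: int = 40,
                             chars_per_token: int = 4) -> int:
    """Rough estimate of tokens spent on markdown scaffolding alone:
    per file, a '## path' heading plus opening and closing code fences."""
    fence = "`" * 3
    # heading prefix + path + two fence lines (fence + newline) + blank lines
    per_file = len("## ") + avg_path_len + 2 * (len(fence) + 1) + 4
    return n_files * per_file // chars_per_token

print(scaffold_overhead_tokens(500))  # → 6875
```

A few thousand tokens of pure formatting on a 500-file repo — not fatal, but pure overhead when you're already brushing the context limit.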

8 days agoantirez

Exactly.

8 days agommanfrin

How does this compare to code2prompt or files2prompt? Any benchmarks on which one works better for LLMs?

8 days agosandGorgon

Another alternative is Gitingest [0]. What are the differences?

[0] https://gitingest.com/

8 days agoadityamwagh
[deleted]
8 days ago
[deleted]
4 days ago

[dead]

8 days agobingzhuwuhen

[dead]

8 days agosamueljames324

Wait, just one question…

Can I call this C++ code “machine code” now?