I was intrigued to see how the demo GIF in the README was generated: https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63...
Turns out it's using another project by the same author: https://github.com/tamnd/ascii-gif
The script used for the demo is at https://github.com/tamnd/kage/blob/01e75b87ecc893bbba7943c63... and has a comment showing how to run it:
Looks like it's an opinionated wrapper around https://github.com/charmbracelet/vhs
Have you heard the good news about the terminal savior asciinema -- https://asciinema.org/
I have a bunch of opinionated/personal-use binaries like this in my $HOME/bin/, like delete-all-npm, clean-rust-cache, download-youtube-playlist, and get-markdown <url>. It feels good, and I don't need to remember any commands. Sometimes my coding agent can figure out how to call some of those tools too ;))
You can also do an animated svg which is way smaller than a gif because it's just text keyframes (https://github.com/vytskalt/pseudoc/blob/main/assets/factori...)
FYI, on other platforms (Windows/MacOS), LiceCAP is a fantastic tool to record screen into compact GIFs by the author of Winamp and Reaper DAW:
https://www.cockos.com/licecap/
VHS is fantastic for scripting cli video generation.
One use I'd have for this is company wikis that you want to give folks easy offline access to (maybe the wiki has documentation that's useful at sites that don't have cellular coverage).
Cool!
It would be especially cool to have a version that didn't require the separate serving process - even though it's nifty you can package up a whole site as a single binary.
Maybe a single HTML entrypoint shim with a bit of javascript that could index into an archive (potentially embedded) of the site's content?
Submitting this to Hacker News is the right place! Thanks for your idea. I will consider implementing that :)
Also, in my mind, I already have a script/program to convert HTML to Markdown, so it could actually store everything on disk as a folder of Markdown files, and then commit them to a Git repo.
I think the zim flow was perfect for offline use. I know I will be making use of it as soon as I can figure out how to pass chrome the cookies so I can be signed into the site. Didn't see it in the page, but I didn't look closely yet.
> kage serve $HOME/data/kage/paulgraham.com
If the result is static why does it need a server? Isn't it possible to make it so that it can simply be opened by the browser? Like:
$ firefox $HOME/data/kage/paulgraham.com
Then the result would be usable on machines without kage installed.
You could use python -m http.server instead. I haven't tried it yet, but it should work.
Actually, Kage has two parts: a crawler that crawls pages and converts them to clean HTML by capturing the DOM after rendering in Chrome/Chromium, and a pack/serve component that packages the result as either a ZIM file for Kiwix or an executable file.
Usually JavaScript is blocked when you load pages that way.
Not all JavaScript, but a lot of APIs are restricted
Since when? You won't be able to make HTTP requests to localhost, as it'd be a different Origin, but I don't think any mainstream browser blocks JS outright when you use file:// to load and view HTML files.
Somewhere around 2019, each document loaded from file:// became its own origin in Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=1500453 (I didn't check when this happened in Chromium)
Related WHATWG discussion: https://github.com/whatwg/html/issues/3099
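The origin rule discussed above can be checked with the WHATWG URL parser that ships with Node.js. This is a minimal sketch, and the file path is made up for illustration. A file:// URL serializes to an opaque origin, which is what lets a browser treat each local document as its own origin:

```javascript
// Compare the serialized origin of a file:// URL with an http:// URL.
// The paths below are hypothetical examples, not real Kage output.
function describeOrigin(href) {
  const url = new URL(href);
  return { protocol: url.protocol, origin: url.origin };
}

const local = describeOrigin("file:///home/me/data/kage/paulgraham.com/index.html");
const served = describeOrigin("http://localhost:8000/index.html");

console.log(local.origin);  // opaque origin, serialized as the string "null"
console.log(served.origin); // "http://localhost:8000"
```

Plain markup and classic scripts still load from file://. What changes is anything that compares origins, such as fetch or module imports.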
I thought all the JS was stripped?
I am quite familiar with this and it is factually false
JS modules don't work on file URLs (classic JS does).
You’ll likely run into a ton of CORS issues doing that.
I don't think so. No HTTP requests are made from JS, since it's stripped away, and all the other resources are pulled down (and, I assume, their references made relative), so there really shouldn't be any CORS issues at all.
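Making references relative could look roughly like this. It is a sketch of the general technique, not Kage's actual rewriting code (Kage is written in Go), and `toRelativeRef` is a hypothetical helper name. It assumes both arguments are site-absolute paths:

```javascript
const path = require("node:path");

// Rewrite an absolute same-site reference so it resolves relative to the
// page that contains it. Both arguments are site-absolute paths.
function toRelativeRef(pagePath, targetPath) {
  const fromDir = path.posix.dirname(pagePath);
  const rel = path.posix.relative(fromDir, targetPath);
  return rel === "" ? "." : rel;
}

console.log(toRelativeRef("/essays/startup.html", "/css/site.css")); // ../css/site.css
console.log(toRelativeRef("/index.html", "/essays/startup.html"));   // essays/startup.html
```

With every link and asset rewritten this way, the mirrored folder no longer depends on being served from the site root.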
I find SingleFile [0] to be a much more robust version of this.
It strips out all the JavaScript too, but also packs everything into a single HTML file that is easy to transfer. Binary assets (like web fonts and images) are packed as base64 strings.
They also offer a CLI powered by Puppeteer. [1]
[0]: https://github.com/gildas-lormeau/singlefile
[1]: https://github.com/gildas-lormeau/single-file-cli
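The base64 packing mentioned above boils down to data: URIs. Here is a minimal sketch of the idea in Node.js. The helper names and the three stand-in bytes are assumptions for illustration, not SingleFile's code:

```javascript
// Inline a binary asset as a base64 data: URI so the HTML needs no extra files.
function toDataUri(bytes, mimeType) {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Replace every occurrence of one asset reference with its inlined form.
function inlineAsset(html, assetUrl, bytes, mimeType) {
  return html.split(assetUrl).join(toDataUri(bytes, mimeType));
}

const page = '<img src="logo.png">';
const fakeImage = Uint8Array.from([1, 2, 3]); // stand-in bytes, not a real PNG
console.log(inlineAsset(page, "logo.png", fakeImage, "image/png"));
// <img src="data:image/png;base64,AQID">
```

The trade-off is size: base64 grows binary data by about a third, so this suits a single page better than a large mirrored site.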
It seems this repo only saves one web page?
What I'm implementing here is mirroring a whole website, with all its subpages, so you can browse it all offline. For example, all essays from paulgraham.com.
Oh, I see. In that case, feature-wise, it is actually a modern alternative to HTTrack.
I think the misunderstanding stems from the browser's "Save As" reference in the description. It is misleading. You use "Save As" to save a single page, not an entire website.
Also, the description lacks a clear explanation of the project's purpose. It would be helpful to include a sentence explaining that the program downloads an entire website, not just a single page.
[flagged]
Um. Whose website are you on right now?
Don't come here to laugh but always great when it happens anyways.
Love love love SingleFile too. The FF extension works pretty well for a clean save.
That said, Kage looks promising if OP can combine SingleFile reproduction quality with the HTTrack spidering approach. SPAs are kinda tricky to archive, and I do wonder how well Kage would handle them.
I've seen the option in IE: .mhtml.
For some reason it displays better in IE, but I don't recall seeing this option in Chrome or Firefox recently.
And thanks for the link. Let me implement this single HTML feature, it looks nice to have!
Yeah. An idea on top of that is to bundle an entire website into a single HTML page, with vendored JavaScript to enable client-side routing (all of the original pages' JS is still stripped out).
That way, the page is self-contained as it is, but requires no bundled binary code to serve the site. It is actually safer security-wise.
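The vendored script can be as simple as this:
const site = {
  "path-1": "<!DOCTYPE html><html> ... </html>",
  "path-2": "<!DOCTYPE html><html> ... </html>",
  // More paths
}
function attachListeners() {
  for (const [path, html] of Object.entries(site)) {
    // Quote the attribute value and bind every matching link, not only the first
    for (const link of document.querySelectorAll(`a[href="${path}"]`)) {
      link.onclick = (event) => {
        event.preventDefault()
        // outerHTML cannot be set on the root element, so swap its contents instead
        document.documentElement.innerHTML = html
        attachListeners()
      }
    }
  }
}
document.addEventListener("DOMContentLoaded", attachListeners)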
This is what I first thought and it's a very elegant solution, and not needlessly overcomplicated.
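One detail the single-HTML idea has to handle is embedding page markup inside an inline script without breaking it. The sketch below shows one way a build step might do that. `serializeSite` and `buildBundle` are hypothetical names, and nothing here is taken from Kage:

```javascript
// Serialize a { path: html } map so it can sit inside an inline <script>.
// JSON.stringify alone is not enough: a literal "</script>" in a page body
// would close the tag early, so every "<" is emitted as the escape \u003c.
function serializeSite(pages) {
  return JSON.stringify(pages).replace(/</g, "\\u003c");
}

// Wrap the serialized map and the routing code in a minimal single-file shell.
function buildBundle(pages, routerSource) {
  return [
    "<!DOCTYPE html>",
    '<html><head><meta charset="utf-8"></head><body>',
    `<script>const site = ${serializeSite(pages)};\n${routerSource}</script>`,
    "</body></html>",
  ].join("\n");
}

const pages = { "/": "<h1>Home</h1><script>alert(1)</script>" };
const bundle = buildBundle(pages, "/* routing code goes here */");
console.log(bundle);
```

Because \u003c is a valid escape in both JSON and JavaScript strings, the browser still sees the original markup once the script runs.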
What's the difference from File -> Save As in any web browser on a computer?
That's for a single page, this handles the whole site. Also the browser Save As options often work poorly.
Save As works fine for simple websites with static content.
Let's say you have a site that fetches content from a database. If you Save As, then at best you'll get a local copy of an HTML page with JS that loads the content from the same remote database. It might not work (since the local copy has a different origin), or if it does, it requires you to be online, which defeats half of the purpose.
What this project, and SingleFile, both do is save a snapshot of what the rendered page actually looks like at that moment in time. The scripts are stripped out so it runs locally and has no external dependencies.
I've been using httrack (https://www.httrack.com) to download wikis to read on flights, which isn't perfect but better than I'd found previously. I'll try this out, I'd be delighted to have good results. Thanks for the post.
This brings back memories. Around twenty years ago, internet was still expensive dial-up, so I used to go to an internet cafe, run HTTrack to download websites and manga, copy everything onto my tiny 128MB USB stick (felt very large at that time), then bring it home and read offline ;))
Specifically for wikis, is there a reason you wouldn't use Kiwix? For non "official" releases it's more complicated, but there are some services to generate the ZIM files. The desktop reader app is pretty good in my experience.
https://wiki.openzim.org/wiki/Build_your_ZIM_file
EDIT: https://get.kiwix.org/en/solutions/applications/kiwix-reader...
Kiwix has readers for almost every platform: Android, desktop, iPhone. That's why I made Kage produce a ZIM file.
The executable file is mostly for people who don't have Kiwix installed yet, or just want to run the archive directly.
Thanks, never knew about this and great to hear about it.
https://github.com/archiveteam/grab-site or browsertrix may be easier to use for some, it's what was used to save a lot of the data.gov stuff before it got taken down.
This seems like it has potential to create a lot of load on a site- are there settings to set how fast it clones or avoid images/videos? Is there a way to only get a subset of a website?
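For illustration, here is roughly what such settings tend to look like inside a crawler: a scope filter plus a fixed pause between requests. This is a sketch under assumptions. The `scope` option shape and both function names are hypothetical, not Kage flags:

```javascript
// Decide whether a discovered link belongs to the subset being mirrored.
// `scope` is a hypothetical option shape, not a Kage flag.
function inScope(link, scope) {
  const url = new URL(link, scope.root);
  if (url.origin !== new URL(scope.root).origin) return false;
  if (!url.pathname.startsWith(scope.pathPrefix)) return false;
  const pathname = url.pathname.toLowerCase();
  return !scope.skipExtensions.some((ext) => pathname.endsWith(ext));
}

// Visit URLs one at a time with a fixed pause after each request.
async function crawlPolitely(urls, visit, delayMs) {
  for (const url of urls) {
    await visit(url);
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}

const scope = {
  root: "https://example.com/",
  pathPrefix: "/docs/",
  skipExtensions: [".mp4", ".png", ".jpg"],
};
console.log(inScope("/docs/intro.html", scope));            // true
console.log(inScope("/docs/demo.mp4", scope));              // false
console.log(inScope("https://other.example/docs/", scope)); // false
```

A sequential loop with a delay is the bluntest form of rate limiting, but it keeps the load on the target site predictable.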
Could you help create a new issue for that? I will do it later. It is already 1:00 AM my time, but I am happy that anyone is interested in it. : )
Just pretend you're an AI crawler, problem solved.
Neat project, I like the idea. One thing from a quick read: you launch Chrome with --no-sandbox. Is there a good reason for that? Security wise it's probably not a good idea. If there is no reason, I'd suggest leaving the sandbox on!
In any case, cool stuff :)
--no-sandbox is needed in docker, maybe they assume it will mostly run in docker?
Exactly. For downloading, Kage requires Chrome or Chromium. Running it inside Docker makes setup easier and keeps cleanup simple:
https://github.com/tamnd/kage/blob/main/Dockerfile
Btw, let me think about a way to enable this only when running inside Docker.
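One common heuristic for that is checking for the /.dockerenv file that Docker creates inside containers. The sketch below only illustrates the idea. It is not Kage's code (Kage is written in Go), and the KAGE_IN_DOCKER override variable is an assumption made up for this example:

```javascript
const fs = require("node:fs");

// Heuristic: Docker creates /.dockerenv inside containers. This is a common
// convention rather than a guarantee, so an explicit override is allowed.
function runningInDocker(env = process.env, exists = fs.existsSync) {
  if (env.KAGE_IN_DOCKER === "1") return true; // hypothetical override variable
  return exists("/.dockerenv");
}

// Build Chrome flags, adding --no-sandbox only inside a container.
function chromeFlags(inDocker) {
  const flags = ["--headless=new", "--disable-gpu"];
  if (inDocker) flags.push("--no-sandbox");
  return flags;
}

console.log(chromeFlags(runningInDocker()));
```

Outside a container the sandbox stays on by default, which addresses the security concern raised above.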
I've accumulated a bunch of old website archives over the years. The funny thing is the ugly HTML dumps have been more useful than the "perfect" archive.
It's one of the reasons I've become a bigger fan of RSS over time. A feed from 10-ish years ago is often more usable today than a carefully preserved (application) website.
I have a project for creating and archiving RSS feeds, keeping the full history from the time the crawler starts. I need to clean up a bit, then will open source it soon.
Reminds me of this. https://gwern.net/gwtar
Compared to that is there anything kage does better?
Cool concept. I would like to see this combined with mitmproxy for archive grade fidelity. You could be saving exactly the data served and at the same time a representation by a modern (contemporary) browser, with all JS having run. This combination would be my perfect replacement for the WARC format.
I'm working on WARC too, with format from Common Crawl!
By converting it to Markdown, we save a lot of space, but it is for a different purpose and a different project: https://github.com/tamnd/ccrawl-cli
That's neat! In my opinion, the WARC format is quite tricky and underspecified especially since HTTP2 introduced new semantics. It encodes too much in-band and requires rewriting of the server data. A mitmproxy capture is higher fidelity and supports capturing modern features such as WebSockets. I think if we could wrap Kage's crawler interactions by it and store its capture (the intercepted traffic), we could make a potentially nice new archival format.
I tried to follow well-known formats first, such as WARC and ZIM from Kiwix, so we could benefit from existing tooling support.
For my own custom data format, I have a lot of private code that I plan to release soon. It is optimized for compression, fast lookups, and more. I have been working on it for two years. This is part of a larger, ambitious umbrella project: I am building Google from scratch (all open source), something that anyone can host, including the crawler, indexer, storage, and serving layers. Stay tuned!
I'm a fan of compatibility with established formats!
Sounds awesome. There is a lot of untapped potential with respect to efficiently archiving and indexing websites. I saw the impressive things Marginalia Search is doing in this area (the blog is great when it gets technical). There is also a lot of very complete archives of websites out there which are not being indexed at all, and I would love to make them available for researchers. In any case, I'm interested in your project!
OK, sounds fascinating; following! (your GH)
Thanks ;)
Looking forward to the next project! I love these kinds of archiving tools.
Sounds interesting.
This is quite a useful tool, especially for cases where internet access is limited (flights, for example). I implemented it as a separate feature in mdview.io: for example, you can export a document as an HTML file for offline usage, with all the presentation features like rich tables, Mermaid, etc. built in. Example https://mdview.io/s/why-markdown-became-default-format-for-a... then try Export - Export HTML
This is cool. I could see myself downloading the articles behind the first couple pages of hacker news with this, for viewing on a flight or long distance train ride with spotty internet
how is this different from using puppeteer to load the page and save the DOM as HTML?
Does this work for the Apple Docs website? Really tricky to get those offline.
Making docs available offline was one of my main motivations for building this tool. I will try Apple Docs too.
I previously downloaded the Snowflake docs, and it was something like tens or even hundreds of thousands of pages, I do not remember exactly. The output ended up being very large.
By the way, I forgot to add zstd compression support to my ZIM reader/writer. I will implement that in the next version.
For those with an eReader, one thing that works really well is using pandoc to download and convert a webpage to EPUB that you can then load to your reader.
pandoc --from html --to epub --output /PATH/TO/FILE.epub https://example.com
Thanks, will try this out on the Kobo later.
I was looking for something like this the other day, it can be very helpful.
So this is like using wget --mirror except that it works on pages that require javascript, right?
Yeah, it is. For example, openai.com is rendered with Next.js, so I will try to mirror it tomorrow.
Sounds like .MCH-files re-invented? (-:
Nice idea! fwiw, false positives and all, but the Windows 11 default Windows Security doesn't like it: `leakless.exe: Operation did not complete successfully because the file contains a virus or potentially unwanted software.`
Binary app is a really bad way of storing data. No one would ever want to run a binary shared with them or found online.
Cool project! I know it's written in go, but it would be cool to see something like this which uses Cosmopolitan Libc + redbean or something similar to create a binary which runs anywhere. Would be fun to be able to pass around self-executable website archives.
https://github.com/jart/cosmopolitan
https://justine.lol/cosmopolitan/index.html
https://redbean.dev
(Certificates just expired for justine's website, just ignore the warning.)
The README is LLM slop. This makes me assume the code is the same.
curl can do this
The readme is AI slop, and incredibly grating to read. The disgust I felt while reading it almost put me off trying the project.
Is the code also AI slop?
How does it handle websites with client side paywalls? Can you run it with extensions like bypass paywalls and ublock origin?
nice