Reverse engineering a $1B Legal AI tool exposed 100k+ confidential files

If they have a billion dollar valuation, this fairly basic (and irresponsible) vulnerability could have cost them a billion dollars. If someone with malice had been in your shoes, in that industry, this probably wouldn't have been recoverable. Imagine a firm's entire client communications up on pastebin.

They should have given you some money.

The first thing that comes to my mind is SOC2 HIPAA and the whole security theater.

I am one of the engineers that had to suffer through countless screenshots and forms to get these because they show that you are compliant and safe. While the real impactful things are ignored

Thank you bearsyankees for keeping us informed.

This is the collision between two cultures that were never meant to share the same data: "move fast and duct-tape APIs together" startup engineering, and "if this leaks we ruin people's lives" legal/medical confidentiality.

What's wild is that nothing here is exotic: subdomain enumeration, unauthenticated API, over-privileged token, minified JS leaking internals. This is a 2010-level bug pattern wrapped in 2025 AI hype. The only truly "AI" part is that centralizing all documents for model training drastically raises the blast radius when you screw up.

The economic incentive is obvious: if your pitch deck is "we'll ingest everything your firm has ever touched and make it searchable/AI-ready", you win deals by saying yes to data access and integrations, not by saying no. Least privilege, token scoping, and proper isolation are friction in the sales process, so they get bolted on later, if at all.

The scary bit is that lawyers are being sold "AI assistant" but what they're actually buying is "unvetted third party root access to your institutional memory". At that point, the interesting question isn't whether there are more bugs like this, it's how many of these systems would survive a serious red-team exercise by anyone more motivated than a curious blogger.

While true this comment seems AI written. I did a fair bit of exploration around AI responses to HN threads and this fits the pattern.

What makes you think that? it would need some prompt engineering if so since ChatGPT won't write like that (bad capitalization, lazy quoting) unless you ask it to

It's a little hilarious.

First, do all this cybersecurity theatre, and then create an MCP/LLM wormhole that bypasses it all.

All because non-technical folks wave their hands about AI and not understanding the most fundamental reality about LLM software being fundamentally so different than all the software before it that it becomes an unavoidable black hole.

I am also pleased I used two space analogies without the use of AI.

I think this class of problems can be protected against.

It's become clear that the first and most important and most valuable agent, or team of agents, to build is the one that responsibly and diligently lays out the opsec framework for whatever other system you're trying to automate.

A meta-security AI framework, cursor for opsec, would be the best, most valuable general purpose AI tool any company could build, imo. Everything from journalism to law to coding would immediately benefit, and it'd provide invaluable data for post training, reducing the overall problematic behaviors in the underlying models.

Move fast and break things is a lot more valuable if you have a red team mechanism that scales with the product. Who knows how many facepalm level failures like this are out there?