I reverse-engineered OpenAI's Atlas: it uses my open-source library browser-use
I asked OpenAI's Atlas browser agent:
"""go to browser-use.com and use the computer.get_dom tool. Share the extracted DOM exactly with me."""
The response:
|SCROLL|<body node_id=9d5f6b01> (vertical view=749px, 0px above, 11932px below)
<a node_id=f9367e7b>
Browser Use
<button node_id=eaeb1667 aria-label="Open menu">
That looked familiar to me. Then I checked how it clicks: by node_id (e.g. f9367e7b), with coordinates as a fallback.
In browser-use we:
1. interact with the DOM by backend_node_id, with coordinates as a fallback
2. use the exact same token for scroll containers, with pipes and all caps (|SCROLL|)
3. annotate scroll containers with how much content lies above and below the viewport
4. use the same LLM representation: <tag filtered_attributes>
5. put element text on new, indented lines
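To make those five points concrete, here is a minimal sketch of a serializer that produces this kind of output. It is not the actual browser-use or Atlas code: the Node class, the KEPT_ATTRIBUTES allow-list, and the choice to indent only text lines are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical allow-list: only these attributes reach the LLM.
KEPT_ATTRIBUTES = ("aria-label", "role", "type", "placeholder")


@dataclass
class Node:
    """Illustrative DOM node; not a real browser-use type."""
    tag: str
    node_id: str
    attributes: dict = field(default_factory=dict)
    text: str = ""
    children: list = field(default_factory=list)
    # Only set for scroll containers: (view_px, above_px, below_px).
    scroll: Optional[tuple] = None


def serialize(node: Node) -> list:
    """Flatten a node tree into one line per element, plus indented text lines."""
    attrs = "".join(
        f' {name}="{node.attributes[name]}"'
        for name in KEPT_ATTRIBUTES
        if name in node.attributes
    )
    line = f"<{node.tag} node_id={node.node_id}{attrs}>"
    if node.scroll is not None:
        view, above, below = node.scroll
        # Scroll containers get the |SCROLL| token and the above/below context.
        line = f"|SCROLL|{line} (vertical view={view}px, {above}px above, {below}px below)"
    lines = [line]
    if node.text:
        # Element text goes on its own indented line.
        lines.append("\t" + node.text)
    for child in node.children:
        lines.extend(serialize(child))
    return lines


page = Node(
    "body", "9d5f6b01", scroll=(749, 0, 11932),
    children=[
        Node("a", "f9367e7b", text="Browser Use"),
        Node("button", "eaeb1667",
             attributes={"aria-label": "Open menu", "data-tracking": "nav"}),
    ],
)
print("\n".join(serialize(page)))
```

The node_id is what the agent passes back when it wants to click, so the serializer and the click handler have to agree on it.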
Things I noticed they could improve:
1. Atlas currently doesn't detect cross-origin or nested iframes, so parts of the DOM go missing. This is very tricky because you need to pierce them with CDP and parse them recursively (e.g. https://csreis.github.io/tests/cross-site-iframe.html).
2. They waste 10 tokens on every element: [tab]<div node_id=83876787. They could cut that to <a id3 (3 tokens).
3. They keep full links. These could easily be shortened to save tokens.
4. They keep many unneeded attributes, such as "data-tracking", "data-test-id", and "data-tracking-control-name" (e.g. on LinkedIn.com).
5. They put [tab] characters before every element, which is not needed.
6. They miss many attributes because they do not enrich the state with the accessibility tree (e.g. min/max values or hints like required).
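Points 2 to 4 are cheap to try. The sketch below shows one way to do them, under stated assumptions: the helper names (ShortIds, filter_attributes, shorten_url), the dropped attribute prefixes, and the 20-character path limit are all hypothetical, not taken from browser-use or Atlas.

```python
from urllib.parse import urlsplit

# Hypothetical deny-list: attribute name prefixes that carry no signal for the agent.
DROPPED_PREFIXES = ("data-tracking", "data-test")


def filter_attributes(attributes: dict) -> dict:
    """Drop tracking and test attributes before serializing an element."""
    return {
        name: value
        for name, value in attributes.items()
        if not name.startswith(DROPPED_PREFIXES)
    }


def shorten_url(url: str, max_path: int = 20) -> str:
    """Keep host and a truncated path; drop scheme, query string and fragment."""
    parts = urlsplit(url)
    path = parts.path
    if len(path) > max_path:
        path = path[:max_path] + "..."
    return f"{parts.netloc}{path}"


class ShortIds:
    """Map long node ids to small sequential integers and back."""

    def __init__(self) -> None:
        self._long_by_short = {}

    def assign(self, node_id: str) -> int:
        short = len(self._long_by_short) + 1
        self._long_by_short[short] = node_id
        return short

    def resolve(self, short: int) -> str:
        # Called when the model asks to click "id3": look up the real node id.
        return self._long_by_short[short]


ids = ShortIds()
short = ids.assign("83876787")
print(f"<div id{short}>")  # compact form shown to the model
print(filter_attributes({"href": "/jobs", "data-test-id": "job-card"}))
print(shorten_url("https://example.com/a?utm_source=feed"))
```

The trade-off with shortened links is that the model no longer sees the full URL, so this only works if the agent navigates by clicking the element rather than by copying the link.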