Open-interpreter: A natural language interface for computers

I finally got around to trying this out just now. Here's how to run it using uvx (so you don't need to install anything first):

    uvx --from open-interpreter interpreter

I took the simplest route and pasted in an OpenAI API key, then I typed:

    find largest files on my desktop

It generated a couple of chunks of Python, asked my permission to run them, ran them and gave me a good answer.

Here's the transcript: https://gist.github.com/simonw/f78a2ebd2e06b821192ec91963995...
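For anyone who doesn't want to click through: this is not the code from the transcript, just a minimal sketch of the kind of Python that would answer that prompt. The `largest_files` helper and the `~/Desktop` location are my own assumptions.

```python
import heapq
from pathlib import Path


def largest_files(root: Path, count: int = 10) -> list[tuple[int, Path]]:
    """Return up to `count` (size_in_bytes, path) pairs under `root`, largest first."""
    sizes = []
    for path in root.rglob("*"):
        try:
            # Only count regular files; symlinks are skipped to avoid double counting.
            if path.is_file() and not path.is_symlink():
                sizes.append((path.stat().st_size, path))
        except OSError:
            # Skip files that vanish or cannot be read mid-scan.
            continue
    return heapq.nlargest(count, sizes, key=lambda pair: pair[0])


if __name__ == "__main__":
    desktop = Path.home() / "Desktop"  # assumed location of the desktop folder
    for size, path in largest_files(desktop):
        print(f"{size / 1_000_000:10.1f} MB  {path}")
```

The point of the permission prompt is that you get to read something like this before it touches your disk.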

simon's writeup is here https://simonwillison.net/2024/Nov/24/open-interpreter/

i always thought the potential for openinterpreter would be kind of like an "open source chatgpt desktop assistant" app with swappable llms. especially vision since that (specifically the one teased at 4o's launch https://www.youtube.com/watch?v=yJHw33cVeHo) has not yet been released by oai. they made some headway with the "01" device that they teased... and then canceled.

instead all the demo use cases seem very trivial: "Plot AAPL and META's normalized stock prices". "Add subtitles to all videos in /videos" seems a bit more interesting but honestly trying to hack it in a "code interpreter" inline in a terminal is strictly worse than just opening up cursor for me.

i'd be interested to hear if anyone here is an active user of OI and what you use it for.

It's funny that we're getting so much attention funneled towards the thought-to-machine I/O problem now that LLMs are on the scene.

If the improvements are beneficial now, then surely they were beneficial before.

Prior to LLMs, though, we could have been making judicious use of simple algorithmic approaches to process natural language constructs as command language. We didn't see a lot of interest in it.

> Prior to LLMs, though, we could have been making judicious use of simple algorithmic approaches to process natural language constructs as command language. We didn't see a lot of interest in it.

Siri was released in 2011, and Alexa and Google Assistant followed soon thereafter. Companies spent tens of millions of dollars improving their algorithmic NLP because voice interfaces were "the future". I took a class in the late 2010s that went over all of the methodologies that they used for intent parsing and slot filling. All of that has been largely abandoned at this point in favor of LLMs for everything.

My hope is that at some point people will come back to these UI paradigms as we realize the limitations of "everything is a chat bot". There's a simplicity to the context-free limited voice assistants that had a set of specific use cases they could handle, and the effort to chatbot everything is starting to destroy the legitimate use cases that came out of that era like timers and reminders.
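For anyone who hasn't seen the pre-LLM style, here is a toy sketch of intent parsing and slot filling using plain regular expressions. The intent names and patterns are hypothetical, made up for illustration; the production systems were far more elaborate than this.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical intents; each pattern's named groups are the slots it fills.
INTENT_PATTERNS = {
    "set_timer": re.compile(
        r"^set (?:a )?timer for (?P<amount>\d+) (?P<unit>second|minute|hour)s?$"
    ),
    "add_reminder": re.compile(
        r"^remind me to (?P<task>.+) at (?P<time>\d{1,2}(?::\d{2})? ?(?:am|pm))$"
    ),
}


@dataclass
class ParsedCommand:
    intent: str
    slots: dict = field(default_factory=dict)


def parse(utterance: str) -> Optional[ParsedCommand]:
    """Return the first matching intent with its slots, or None if nothing matches."""
    text = utterance.strip().lower().rstrip(".!?")
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.match(text)
        if match:
            return ParsedCommand(intent, match.groupdict())
    return None


if __name__ == "__main__":
    print(parse("Set a timer for 10 minutes"))
    print(parse("Remind me to call mom at 5pm"))
    print(parse("What's the meaning of life?"))
```

Anything outside the known patterns returns `None` instead of a guess, which is exactly the limited-but-predictable behaviour I'm describing.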

I have a somewhat different perspective. The way I see it, for the past 10+ years, the major vendors were going out of their way to try for a generic NLP interface. By that point, it was already known that controlled language[0] + environmental context could allow for highly functional voice control. But for some reason[1], the vendors really wanted assistants to guess what people mean. As a result, we got 10+ years of shitty assistants that couldn't reliably do anything, not even set a goddamn timer, and weren't able to do much either - it's hard to have many complex features when you can't get the few simplest ones right.

This was a bad direction then. Now, for better or worse, all those vendors got their miracle: LLMs are literally plug-and-play boxes that implement the "parse arbitrary natural-language queries and map them to system capabilities" functionality. Thanks to LLMs, voice interfaces could actually start working, if only vendors could also get the "having useful functionality" part right.

(Note: this is distinct from "everything is a chat bot". That's a bad idea simply because typing text sucks, specifically typing out your thoughts in prose form is about the least efficient way to interact with a tool. Voice interfaces are an exception here.)

[0] - https://en.wikipedia.org/wiki/Controlled_natural_language

[1] - Perhaps this weird idea that controlled languages are too hard for the general population, too much like programming, or some such. They're not. More generally, we've always had to "meet in the middle" with our machines, and it was - and remains - a highly successful approach.
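To make "controlled language + environmental context" concrete, here's a rough sketch. The grammar (VERB [the] DEVICE [in [the] ROOM]), the vocabulary, and the `current_room` context key are all assumptions of mine, not any vendor's actual design.

```python
# Hypothetical vocabulary for a tiny controlled language:
#   VERB [the] DEVICE [in [the] ROOM]
VERBS = {"turn on": "on", "turn off": "off"}
DEVICES = {"lights", "fan", "heater"}


def interpret(command: str, context: dict) -> dict:
    """Map a controlled-language command to an action, or raise ValueError."""
    text = command.strip().lower()
    for phrase, action in VERBS.items():
        if text.startswith(phrase + " "):
            rest = text[len(phrase) + 1:].split()
            break
    else:
        raise ValueError(f"unknown verb in: {command!r}")

    # Articles carry no meaning in this grammar, so drop them.
    words = [word for word in rest if word != "the"]
    if not words or words[0] not in DEVICES:
        raise ValueError(f"unknown device in: {command!r}")
    device = words[0]

    if len(words) == 1:
        # No room given: fall back to where the speaker currently is.
        room = context["current_room"]
    elif len(words) >= 3 and words[1] == "in":
        room = " ".join(words[2:])
    else:
        raise ValueError(f"cannot parse: {command!r}")
    return {"action": action, "device": device, "room": room}


if __name__ == "__main__":
    here = {"current_room": "kitchen"}
    print(interpret("Turn off the lights", here))
    print(interpret("turn on the fan in the living room", here))
```

Nothing is guessed: a command either fits the grammar or gets rejected, and the context fills in what the speaker left out.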

Uh, we did…? Alexa, Siri, Ok Google…

A lot of money was poured into that goal, but because every type of action required a handcrafted integration, they were either costly to develop or extremely limited. That’s no longer the case.

COBOL and SQL would like a word.

People have some solution so they are searching for problems it can fit. Doesn't mean it's the best one...

I find the "Can you ..." phrasing used in this demo/project fascinating. I would have expected the LLM to basically say "Yes I can, would you like me to do it?" to most of these questions, rather than directly and immediately executing the action.

If an employer were to ask an employee, "can you write up this report and send it to me" and they said, "yes I can, would you like me to do it?", I think it would be received poorly. I believe this is a close approximation of the relationship people tend to have with chatgpt.

Depends, the 'can you' (or 'can I get') phrasing appears to be a USA English thing.

Managers often expect subordinates to just know what they mean, but checking instructions and requirements is usually essential and imo is a mark of a good worker.

"Can you dispose of our latest product in a landfill"...

Generally in the UK, unless the person is a major consumer of USA media, "can you" is an enquiry as to capability or whether an action is within the rules.

IME. YMMV.

I'm very curious why you think that! Sincerely. These models undergo significant human-aided training where people express a preference for certain behaviours, and that is fed back into the training process: I feel like the behaviour you mention would probably be trained out pretty quickly since most people would find it unhelpful, but I'm really just guessing.

What distinguishes LLMs from classical computing is that they're very much not pedantic. Because the model is predicting what human text would follow a given piece of content, you can generally expect them to react approximately the way that a human would in writing.

In this example, if a human responded that way I would assume they were either being passive aggressive or were autistic or spoke English as a second language. A neurotypical native speaker acting in good faith would invariably interpret the question as a request, not a question.

In your locality.

I've asked LLM systems "can you..." questions. I'm asking purely about their capability and allowed parameters of operation.

Apparently you think that means I'm brain damaged?

Surely there are better windmills for you to tilt at.