> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is that, in the way they implemented their JPEG screenshots, they have some mechanism that minimizes the number of frames that are in-flight. This is not some inherent property of JPEG though.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.
h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.
Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use websockets for dumb corporate network firewall rules and just use WebRTC for everything else.
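Roughly what that manual back-off loop could look like (Python pseudocode; the `encoder` and `sock` handles and the thresholds are placeholders, not anything from the article):

```python
import time

def latency_driven_bitrate(encoder, sock, frames, min_bps=500_000, max_bps=8_000_000):
    bitrate = 2_000_000          # start from an initial probe, not 40 Mbps
    baseline = None              # lowest send time seen so far (~uncongested cost per frame)
    for frame in frames:
        data = encoder.encode(frame, bitrate)   # hypothetical encoder hook
        t0 = time.monotonic()
        sock.sendall(data)       # blocking send: time spent here grows when the path is congested
        send_time = time.monotonic() - t0

        baseline = send_time if baseline is None else min(baseline, send_time)
        if send_time > 2 * baseline:          # queues are building up somewhere: back off
            bitrate = max(min_bps, int(bitrate * 0.8))
        elif send_time < 1.2 * baseline:      # headroom: creep back up slowly
            bitrate = min(max_bps, int(bitrate * 1.05))
```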
They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading. UDP is not necessary to write a loop.
> They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading.
You're right, I don't know how I managed to skip over that.
> UDP is not necessary to write a loop.
True, but this doesn't really have anything to do with using JPEG either. They basically implemented a primitive form of rate control by only allowing a single frame to be in flight at once. It was easier for them to do that using JPEG because they (by their own admission) seem to have limited control over their encode pipeline.
> have limited control over their encode pipeline.
Frustratingly this seems common in many video encoding technologies. The code is opaque, often has special kernel, GPU and hardware interfaces which are often closed source, and by the time you get to the user API (native or browser) it seems all knobs have been abstracted away and simple things like choosing which frame to use as a keyframe are impossible to do.
I had what I thought was a simple usecase for a video codec - I needed to encode two 30 frame videos as small as possible, and I knew the first 15 frames were common between the videos so I wouldn't need to encode that twice.
I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
> I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
fork()? :-)
But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?
A word processor can save its state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!
In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.
However this is not the case with video codecs.
So for US->Australia/Asia, wouldn't that limit you to 6fps or so due to the half-RTT? Each time a frame finishes arriving you have 150ms or so before your new request even reaches the server.
Probably either (1) they don't request another jpeg until they have the previous one on-screen (so everything is completely serialized and there are no frames "in-flight" ever) (2) they're doing a fresh GET for each and getting a new connection anyway (unless that kind of thing is pipelined these days? in which case it still falls back to (1) above.)
You can still get this backpressure properly even if you're doing it push-style. The TCP socket will eventually fill up its buffer and start blocking your writes. When that happens, you stop encoding new frames until the socket is able to send again.
The trick is to not buffer frames on the sender.
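A rough sketch of "don't buffer on the sender": a one-slot mailbox between capture and a sending thread, so TCP backpressure stalls the sender while the capture side keeps overwriting the slot (plain Python; the `encoder` and `sock` objects are hypothetical):

```python
import threading

class LatestFrame:
    """Single-slot mailbox: writers overwrite, the reader only ever gets the newest frame."""
    def __init__(self):
        self._cv = threading.Condition()
        self._frame = None

    def put(self, frame):
        with self._cv:
            self._frame = frame          # silently drop whatever the sender hasn't taken yet
            self._cv.notify()

    def take(self):
        with self._cv:
            while self._frame is None:
                self._cv.wait()
            frame, self._frame = self._frame, None
            return frame

# Sender thread: while sock.sendall() is blocked by TCP backpressure,
# the capture side keeps overwriting the slot, so no backlog ever forms.
def send_loop(mailbox, encoder, sock):
    while True:
        sock.sendall(encoder.encode(mailbox.take()))
```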
You probably won't get acceptable latency this way since you have no control over buffer sizes on all the boxes between you and the receiver. Buffer bloat is a real problem. That said, yeah if you're getting 30-45 seconds behind at 40 Mbps you've probably got a fair bit of sender-side buffering happening.
Setting aside the various formatting problems and the LLM writing style, this just seems all kinds of wrong throughout.
> “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10Mbps should be way more than enough for a mostly static image with some scrolling text. (And 40Mbps is ridiculous.) This is very likely to be caused by bad encoding settings and/or a bad encoder.
> “What if we only send keyframes?”
The post goes on to explain how this does not work because some other component needs to see P-frames. If that is the case, just configure your encoder to have very short keyframe intervals.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
A single H.264 keyframe can be whatever size you want, *depending on how you configure your encoder*, which was apparently never seriously attempted. Why are we badly reinventing MJPEG instead of configuring the tools we already have? Lower the bitrate and keyint, use a better encoder for higher quality, lower the frame rate if you need to. (If 10 fps JPEGs are acceptable, surely you should try 10 fps H.264 too?)
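For reference, the knobs being described here are ordinary encoder settings. A hypothetical invocation (driven from Python only to keep one language across these sketches; the flags are standard ffmpeg/x264 options, the filenames are made up):

```python
import subprocess

# 10 fps, ~1 Mbps capped, one keyframe per second, low-latency x264 settings.
subprocess.run([
    "ffmpeg", "-i", "desktop_capture.mkv",      # placeholder input
    "-c:v", "libx264",
    "-preset", "veryfast", "-tune", "zerolatency",
    "-r", "10",                                 # output frame rate
    "-g", "10",                                 # keyint: one IDR per second at 10 fps
    "-b:v", "1M", "-maxrate", "1M", "-bufsize", "500k",
    "out.mp4",
], check=True)
```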
But all in all the main problem seems to be squeezing an entire video stream through a single TCP connection. There are plenty of existing solutions for this. For example, this article never mentions DASH, which is made for these exact purposes.
>Setting aside...the LLM writing style
I don't want to set that aside either. Why is AI generated slop getting voted to the top of HN? If you can't be bothered to spend the time writing a blog post, why should I be bothered spending my time reading it? It's frankly a little bit insulting.
Don’t assume something you cannot prove. It was great writing
Normally the 1 sentence per para LinkedIn post for dummies writing style bugs me to no end, but for a technical article that's continually hopping between questions, results, code, and explanations, it fits really well and was a very easy article to skim and understand.
[deleted]
> For example, this article never mentions DASH, which is made for these exact purposes.
DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...
But in either case: you need ffmpeg somewhere in your pipeline for that experience to be even remotely enjoyable. No ffmpeg? No luck, good luck implementing all of that shit yourself.
I'm very familiar with the stack and the pain of trying to livestream video to a browser. If JPEG screenshots work for your clients, then I would just stick with that.
The problem with wolf, gstreamer, moonlight, $third party, is you need to be familiar with how the underlying stack handles backpressure and error propagation, or else things will just "not work" and you will have no idea why. I've worked on 3 projects in the last 3 years where I started with gstreamer, got up and running - and while things worked in the happy path, the unhappy path was incredibly brittle and painful to debug. All 3 times I opted to just use the lower level libraries myself.
Given all of OP's requirements, I think something like NVIDIA Video Codec SDK to a websocket to MediaSource Extensions would work.
However, given that even this post seems to be LLM generated, I don't think the author would care to learn about the actual internals. I don't think this is a solution that could be vibe coded.
This is where LLMs shine, where you need to dip your toes into really complex systems but basically just to do one thing with pretty straightforward requirements.
They might want to check out what VNC has been doing since 1998– keep the client-pull model, break the framebuffer up into tiles and, when client requests an update, perform a diff against last frame sent, composite the updated tiles client-side. (This is what VNC falls back to when it doesn’t have damage-tracking from the OS compositor)
This would really cut down on the bandwidth of static coding terminals where 90% of screen is just cursor flashing or small bits of text moving.
If they really wanted to be ambitious they could also detect scrolling and do an optimization client-side where it translates some of the existing areas (look up CopyRect command in VNC).
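The tile-diff part is small enough to sketch; assuming frames are numpy arrays, and ignoring the CopyRect/scroll optimization:

```python
import numpy as np

TILE = 64  # tile edge in pixels

def changed_tiles(prev, curr):
    """Yield (x, y, tile) for every TILE x TILE block that differs from the last sent frame."""
    h, w = curr.shape[:2]
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            a = prev[y:y + TILE, x:x + TILE]
            b = curr[y:y + TILE, x:x + TILE]
            if not np.array_equal(a, b):
                yield x, y, b

# Client side would composite each received tile back into its copy of the framebuffer
# at (x, y), so a mostly static terminal costs almost nothing per update.
```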
Before VNC, check neko out: https://github.com/m1k1o/neko
I worked on a project that started with VNC and had lots of problems. Slow connect times and backpressure/latency. Switching to neko was a quick/easy win.
The blog post did smell of inexperience. Glad to hear there are other approaches - is something like that open source?
Yup. Go look into tigervnc if you want to see the source. But also you can just search for "tigervnc h.264" and you'll see extensive discussions between the devs on h.264 and integrating it into tiger. This is something that people spent a LOT of brainpower on.
Of all the suggestions in the comments here, this seems like the best one to start with.
Also... I get that the dumb solution to "ugly text at low bitrates" is "make the bitrate higher." But still, nobody looked at a 40M minimum and wondered if they might be looking at this problem from the wrong angle entirely?
Copying how VNC does it is exactly how my first attempt would go. Seems odd to try something like Moonlight which is designed for low latency remote gameplay.
40mbps for video of an LLM typing text didn't immediately fire off alarm bells in anyone's head that their approach was horribly wrong? That's an insane amount of bandwidth for what they're trying to do.
> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
You can still have weird broken stall-outs though.
I dunno, this article has some good problem solving but the biggest and mostly untouched issue is that they set the minimum h.264 bandwidth too high. H.264 can do a lot better than JPEG with a lot less bandwidth. But if you lock it at 40Mbps of course it's flaky. Try 1Mbps and iterate from there.
And going keyframe-only is the opposite of how you optimize video bandwidth.
> Try 1Mbps and iterate from there.
From the article:
“Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
Rejecting it out of hand isn't actually trying it.
10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.
And it would not be blocky garbage, it would still look a lot better than JPEG.
1Mbps for video is rule of thumb I use. Of course that will depend on customer expectations. 500K can work, but it won’t be pretty.
For normal video I think that's a good rule of thumb.
For mostly-static content at 4fps you can cut a bunch more bitrate corners before it looks bad. (And 2-3 JPEGs per second won't even look good at 1Mbps.)
[deleted]
>> 10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.
> And 2-3 JPEGs per second won't even look good at 1Mbps.
Unqualified claims like these are utterly meaningless. It depends too much on exactly what you're doing, some sorts of images will compress much better than others.
Proper rate control for such realtime streaming would also lower framerate and/or resolution to maintain the best quality and latency they can over dynamic network conditions and however little bandwidth they have. The fundamental issue is that they don't have this control loop at all, and are badly simulating it by polling JPEGs.
It might be possible to buffer and queue jpegs for playback as well to help with weird broken stall outs.
Video players used to call it buffering, and resolving it was called buffering issues.
Players today can keep an eye on network quality while playing too, which is neat.
There are so many things that I would have done differently.
> We added a keyframes_only flag. We modified the video decoder to check FrameType::Idr. We set GOP to 60 (one keyframe per second at 60fps). We tested.
Why muck around with P-frames and keyframes? Just make your video 1fps.
> Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10 Mbps is way too much. I occasionally watch YouTube videos where someone writes code. I set my quality to 1080p to be comparable with the article and YouTube serves me the video at way less than 1Mbps. I did a quick napkin math for a random coding video and it was 0.6Mbps. It’s not blocky garbage at all.
Setting to 1 FPS might not be enough. The GOP or P-frame setting needs to be adjusted to make every frame a keyframe.
[deleted]
This blog post smells of LLM, both in the language style and the muddled explanations / bad technical justifications. I wouldn't be surprised if their code is also vibe coded slop.
One man's not-blocky-garbage is another's insufferable hell. Even at 4k I find YouTube quality to be just awful with artefacts everywhere.
Many moons ago I was using this software which would screenshot every five seconds and give you a little time lapse at the end of the day. So you could see how you were spending your computer time.
My hard disk ended up filling up with tens of gigabytes of screenshots.
I lowered the quality. I lowered the resolution, but this only delayed the inevitable.
One day I was looking through the folder and I noticed well almost all the image data on almost all of these screenshots is identical.
What if I created some sort of algorithm which would allow me to preserve only the changes?
I spent embarrassingly long thinking about this before realizing that I had begun to reinvent video compression!
So I just wrote a ffmpeg one-liner and got like 98% disk usage reduction :)
That's fun. I take it JPEG (what settings lolz!) is compressing harder than a keyframe.
But you are watching code. Why not send the code? Plus any css/html used to render it pretty. Or in other words why not a vscode tunnel?
Having pair programmed over some truly awful and locked down connections before, dropped frames are infinitely better than blurred frames which make text unreadable whenever the mouse is moved. But 40mbps seems an awful lot for 1080p 60fps.
Temporal SVC (reduce framerate if bandwidth constrained) is pretty widely supported by now, right? Though maybe not for H.264, so it probably would have scaled nicely but only on Webrtc?
They're just streaming a video feed of an LLM running in a terminal? Why not stream the actual text? Or fetch it piecemeal over AJAX requests? They complain that corporate networks support only HTTPS and nothing else? Do they not understand what the first T stands for?
Indeed, live text streaming is well over 100 years old:
Suppose an LLM opens a browser, or opens a corporate .exe and GUI and starts typing in there and clicking buttons.
You don't give it a browser or buttons to click.
If you are ok with a second or so of latency then MPEG-DASH (the MPEG-standardized counterpart to Apple's HTTP Live Streaming) is likely the best bet. You simply serve the video chunks over HTTP so it should be just as compatible as the JPEG solution used here but provide 60fps video rather than crappy jpegs.
The standard supports adaptive bit rate playback so you can provide both low quality and high quality videos and players can switch depending on bandwidth available.
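A hypothetical packaging step for that, assuming ffmpeg's dash muxer (paths and the bitrate are placeholders; a real ABR ladder would add more video streams and -adaptation_sets):

```python
import subprocess

# Short segments and a small live window keep latency in the low seconds.
subprocess.run([
    "ffmpeg", "-i", "capture.mkv",
    "-map", "0:v", "-c:v", "libx264", "-b:v", "1M", "-g", "48",
    "-f", "dash",
    "-seg_duration", "1",        # 1s segments
    "-window_size", "5",         # only keep a few segments in the manifest
    "-streaming", "1",
    "stream.mpd",
], check=True)
```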
"Think “screen share, but the thing being shared is a robot writing code.”"
Thinks: why not send text instead of graphics, then? I'm sure it's more complicated than that...
Thinks: this video[1] is the processed feed from the Huygens space probe landing on Saturn's moon Titan circa 2005. Relayed through the Cassini probe orbiting Saturn, 880 million miles from the Sun. At a total mission cost of 3.25 billion dollars. This is the sensor data, altitude, speed, spin, ultra violet, and hundreds of photos. (Read the description for what the audio is encoding, it's neat!)
Look at the end of the video, the photometry data count stops at "7996 kbytes received"(!)
> "Turns out, 40Mbps video streams don’t appreciate 200ms+ network latency. Who knew. “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage"
Yeah, I'm thinking the same thing. Capture the text somehow and send that, and reconstruct it on the other end; and the best part is you only need to send each new character, not the whole screen, so it should be very small and lightning fast?
Which features terminal live streaming since the recently released 3.0 :)
> The fix was embarrassingly simple: once you fall back to screenshots, stay there until the user explicitly clicks to retry.
There is another recovery option (a rough sketch follows the list):
- increase the JPEG framerate every couple seconds until the bandwidth consumption approaches the H264 stream bandwidth estimate
- keep track of latency changes. If the client reports a stable latency range, and it is acceptable (<1s latency, <200ms variance?) and bandwidth use has reached 95% of the H264 estimate, re-activate the stream
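A rough sketch of that hysteresis, with made-up `stats` and callback hooks standing in for whatever the real client reports:

```python
import time

def stream_looks_viable(stats, h264_bps_estimate):
    # stats is assumed to expose recent per-frame latencies and current JPEG throughput
    recent = stats.latencies_last(10)
    stable = max(recent) < 1.0 and (max(recent) - min(recent)) < 0.2   # <1s latency, <200ms jitter
    enough_bw = stats.jpeg_bps() >= 0.95 * h264_bps_estimate
    return stable and enough_bw

def recovery_loop(stats, bump_jpeg_fps, switch_to_stream, h264_bps_estimate):
    while True:
        time.sleep(2)
        bump_jpeg_fps()                       # probe upward a little every couple of seconds
        if stream_looks_viable(stats, h264_bps_estimate):
            switch_to_stream()
            return
```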
Given that text/code is what is being viewed, lower res and adaptive streaming (HLS) are not really viable solutions since they become unreadable at lower res.
If remote screen sharing is a core feature of the service, I think this is a reasonable next step for the product.
That said, IMO at a higher level if you know what you're streaming is human-readable text, it's better to send application data pipes to the stream rather than encoding screenspace videos. That does however require building bespoke decoders and client viewing if real time collaboration network clients don't already exist for the tools (but SSH and RTC code editors exist)
I made this because I got tired of screensharing issues in corporate environments:
https://bluescreen.live (code via github).
Screenshot once per second. Works everywhere.
I’m still waiting for mobile screenshare api support, so I could quickly use it to show stuff from my phone to other phones with the QR link.
I recognize this voice :) This is Claude.
So it’s video of an AI typing text?
Why not just send text? Why do you need video at all?
Why send anything at all if the AI isn't even good enough to solve their own problems?
(Although the fact they decided to use Moonlight in an enterprise product makes me wonder if their product actually was vibe coded)
You apparently need video for the 45-second window you then get to prevent catastrophic things from happening. From TFA:
> You’re watching the AI type code from 45 seconds ago
>
> By the time you see a bug, the AI has already committed it to main
>
> Everything is terrible forever
Is this satire? I mean: if the solution for things to not be terrible forever consists in catching what an AI is doing in 45 seconds (!) before the AI commits to trunk, I'm sorry but you should seriously re-evaluate your life plans.
This article reminds me so much of so many hardware providers I deal with at work who want to put equipment on-site and then spend the next year not understanding that our customers manage their own firewall. No, you can’t just add a new protocol or completely change where your stuff is deployed because then our support team has to contact hundreds of customers about thousands of sites.
This was the most entertaining thing I read all day. Kudos.
I've had similar experiences in the past when trying to do remote desktop streaming for digital signage (which is not particularly demanding in bandwidth terms). Multicast streaming video was the most efficient, but annoying to decode when you dropped data. I now wonder how far I could have gone with JPEGs...
When playing with Chromecast-type devices, multicast or manually streaming one frame at a time worked pretty well.
Yes, this is unfortunately still the way and was very common back when iOS Safari did not allow embedded video.
For a fast start of the video, reverse the implementation: instead of downgrading from Websockets to polling when connection fails, you should upgrade from polling to Websockets when the network allows.
Socket.io was one of the first libraries that did that switching and had it wrong first, too. Learned the enterprise network behaviour and they switched the implementation.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
I believe the latter can be adjusted in codec settings.
Of course. But a same-quality h264 keyframe will not be much smaller than a JPEG.
So they replaced a TCP connection with no congestion control with a synchronous poll of an endpoint which is inherently congestion controlled.
I wonder if they just tried restarting the stream at a lower bitrate once it got too delayed.
The talk about how the images look more crisp at a lower FPS is just tuning that I guess they didn't bother with.
> The constraint that ruined everything: It has to work on enterprise networks.
> You know what enterprise networks love? HTTP. HTTPS. Port 443. That’s it. That’s the list.
That's not enough.
Corporate networks also love to MITM their own workstations and reinterpret http traffic. So, no WebSockets and no Server-Side Events either, because their corporate firewall is a piece of software no one in the world wants and everyone in the world hates, including its own developers. Thus it only supports a subset of HTTP/1.1 and sometimes it likes to change the content while keeping Content-Length intact.
And you have to work around that, because IT dept of the corporation will never lift restrictions.
I wish I was kidding.
Back when I had a job at a big old corporation, a significant part of my value to the company was that I knew how to bypass their shitty MITM thing that broke tons of stuff, including our own software that we wrote. So I could solve a lot of problems people had that otherwise seemed intractable because IT was not allowed to disable it, and they didn't even understand the myriad ways it was breaking things.
> So, no WebSockets
The corporate firewall debate came up when we considered websockets at a previous company. Everyone has parroted the same information for so long that it was just assumed that websockets and corporate firewalls were going to cause us huge problems.
We went with websockets anyway and it was fine. Almost no traffic to the no-websockets fallback path, and the traffic that did arrive appeared to be from users with intermittent internet connections (cellular providers, foreign countries with poor internet).
I'm 100% sure there are still corporate firewalls out there blocking or breaking websocket connections, but it's not nearly the same problem in 2025 as it was in 2015.
If your product absolute must, no exceptions, work perfectly in every possible corporate environment then a fallback is necessary if you use websockets. I don't think it's a hard rule that websockets must be avoided due to corporate firewalls any more, though.
I've had to switch from SSE to WebSockets to navigate a corporate network (the entire SSE would have to close before the user received any of the response).
Then we ran into a network where WebSockets were blocked, so we switched to streaming http.
No trouble with streaming http using a standard content-type yet.
> And you have to work around that, because IT dept of the corporation will never lift restrictions.
Unless the corporation is 100% in-office, I’d wager they do in fact make exceptions - otherwise they wouldn’t have a working videoconferencing system.
The challenge is getting corporate insiders to like your product enough to get it through the exception process (a total hassle) when the firewall’s restrictions mean you can’t deliver a decent demo.
They even break server-sent events (which is still my default for most interactive apps)
[deleted]
There are other ways to make server-sent events work.
I try to remember many environments once likely supported Flash.
Request URL has a query parameter with more than 64 characters? Fuck you.
Request lives for longer than 15 sec? Fuck you.
Request POSTs some JSON? Maybe fuck you just a little bit, when we find certain strings in the payload. We won't tell you which though.
>And you have to work around that, because IT dept of the corporation will never lift restrictions.
Because otherwise people do dumb stuff like pasting proprietary designs or PII into deepseek
Oh, they'll do that anyway, once they find the workaround (Oh... you can paste a credit card if you put periods instead of dashes! Oh... I have to save the file and do it from my phone! Oh... I'll upload it as a .txt file and change the extension on the server!)
It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.
>Oh, they'll do that anyway, once they find the workaround ...
This is assuming the DLP service blocks the request, rather than doing something like logging it and reporting it to your manager and/or CIO.
>It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.
Because you can't ask deepseek to extract some unstructured data for you? I'm not sure what the alternative is, just let everyone paste info into deepseek? If you found out that your data got leaked because some employee pasted some data into some random third party service, and that the company didn't have any policies/technological measures against it, would your response still be "yeah it's fine, it's purely illusory security"?
What's the term for the ideology that "laws are silly because people sometimes break them"?
I don't think that's a good read of the post you're replying to. I think a more charitable read would be something like "people break rules for convenience so if your security relies on nobody breaking rules then you don't have thorough security".
You and op can be right at the same time. You imply the rules probably help a lot even while imperfect. They imply that pretending rules alone are enough to be perfect is incomplete.
Posting stuff into Deepseek is banned. The corporate firewall is like putting a camera in your home because you may break the law. But, yeah, arguing against cameras in homes because people find dead angles where they can hide may not be the strongest argument.
Disclaimer: I work in corporate cybersecurity.
I know that some guardrails and restrictions in a corporate setting can backfire. I know that onerous processes to get approval for needed software access can drive people to break the rules or engage in shadow IT. As a member of a firewall team, I did it myself! We couldn't get access to Python packages or PHP for a local webserver we had available to us from a grandfather clause. My team hated our "approved" Sharepoint service request system. So a few of us built a small web app with Bottle (single file web server microframework, no dependencies) and Bootstrap CSS and SQLite backend. Everyone who interacted with our team loved it. Had we more support from corporate it might have been a lot easier.
Good cybersecurity needs to work with IT to facilitate peoples' legitimate use cases, not stand in the way all the time just because it's easier that way.
But saying "corporate IT controls are all useless" is just as foolish to me. It is reasonable and moral for a business to put controls and visibility on what data is moving between endpoints, and to block unsanctioned behavior.
It's called black and white thinking
At the same time, enterprise is where the revenue is.
Against all odds, you're right, that's where somehow revenue is being generated. IT idiocy notwithstanding.
Often, enterprises create moats and then profit from them.
It's not usually IT idiocy, that usually comes from higher up cosplaying their inner tech visionaries.
Corporate IT needs to die.
It's not corporate IT's fault, it's usually corporate leadership's fault; they often cosplay at leading technology without understanding it.
Wherever tech is a first-class citizen with a seat at the corporate table, it can be different.
Believe me, the average Fortune 500 CEO does not know or care what “SSL MITM” is, or whether passwords should contain symbols and be changed monthly, or what the difference is between ‘VPN’ and ‘Zero Trust’.
They delegate that stuff. To the corporate IT department.
But they also say "Here, this is Sarah your auditor. Answer these questions and resolve the findings." - every year
It's all CyberSecurity insurance compliance that in many cases deviates from security best practices.
This is where the problems come from. Auditors are definitely what ultimately causes IT departments to make dumb decisions.
For example, we got dinged on an audit because instead of using RSA4096, we used ed25519. I kid you not, their main complaint was there wasn't enough bits which meant it wasn't secure.
Auditors are snake oil salesmen.
This is 100% it- the auditor is confirming the system is configured to a set of requirements, and those requirements are rarely in lockstep with actual best practices.
Sometimes they have checkboxes to tick in some compliance document and they must run the software that lets them tick those checkboxes, no exceptions, because those compliances allow the company to be on the market. Regulatory captures, etc.
where else are you going to find customers that are so sticky it will take years for them to select another solution regardless of how crappy you are. that will staff teams to work around your failures. who, when faced with obvious evidence of the dysfunction of your product, will roundly blame themselves for not holding it properly. gaslight their own users. pay obscene amounts for support when all you provide is a voice mailbox that never gets emptied. will happily accept your estimate about the number of seats they need. when holding a retro about your failure will happily proclaim that there wasn't anything _they_ could have done, so case closed.
Oh yes you can absolutely profit off that but you have to be dead inside a little bit.
And produce a piece of software no one in the world wants and everyone in the world hates. Yourself included.
I think the general idea/flow of things is "numbers go up, until $bubble explodes, and we built up smaller things from the ground up, making numbers go up, bloating go up, until $bubble explodes..." and then repeat that forever. Seems to be the end result of capitalism.
If you wanna kill corporate IT, you have to kill capitalism first.
I’d say there’s nothing inherently capitalist about large and stupid bureaucracies (but I repeat myself) spending money in stupid ways. Military bureaucracies in capitalist countries do it. Military bureaucracies in socialist countries did it. Everything else in end-stage socialist countries did it too. I’m sorry, it’s not the capitalism—things’d be much easier if it were.
I don't believe that. I don't necessarily love capitalism (though I can't say I see very many realistic better alternatives either), but if HN is full of people who could do corporate IT better (read: sanely), then the conclusion is just that corporate IT is run by morons. Maybe that's because the corporate owners like morons, but nothing about capitalism inherently makes it so.
> corporate IT is run by morons
playing devil's advocate for a second, but corpIT is also working with morons as employees. most draconian rules used by corpIT have a basis in at least one real world example. whether that example happened directly by one of the morons they manage or passed along from corpIT lore, people have done some dumb ass things on corp networks.
Yes, and the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid. Morons will always find a way to stupid, and most organizations push the impediments well past the point of diminishing returns.
> the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid
I would say the problem in the picture is your belief that corporate IT is introducing technical impediments against every instance of stupidity. I bet there's loads of stupidity they don't introduce technical impediments against. It would just not meet the cost-benefit analysis to spend thousands of tech man-hours introducing a new impediment that didn't cost the company much if any money.
It's because corporate IT has to service non-tech people, and non-tech people get pwned by tech savvy nogoodniks. So the only sane behavior of corporate IT is to lock everything down and then whitelist things rarely.
Apparently capitalism doesn’t pay enough for corporate IT admin jobs.
[deleted]
WebSockets over TCP is probably always going to cause problems for streaming media.
WebRTC over UDP is one choice for lossy situations. Media over Quic might be another (is the future here?), and it might be more enterprise firewall friendly since HTTP3 is over Quic.
I have some experience with pushing video frames over TCP.
It appears that the writer has jumped to conclusions at every turn and it's usually the wrong one.
The reason that the simple "poll for jpeg" method works is that polling is actually a very crude congestion control mechanism. The sender only sends the next frame when the receiver has received the last frame and asks for more. The downside of this is that network latency affects the frame rate.
The frame rate issue with the polling method can be solved by sending multiple frame requests at a time, but only as many as will fit within one RTT, so the client needs to know the minimum RTT and the sender's maximum frame rate.
The RFB (VNC) protocol does this, by the way. Well, the thing about rtt_min and frame rate isn't in the spec though.
Now, I will not go though every wrong assumption, but as for this nonsense about P-frames and I-frames: With TCP, you only need one I-frame. The rest can be all P-frames. I don't understand how they came to the conclusion that sending only I-frames over TCP might help with their latency problem. Just turn off B-frames and you should be OK.
The actual problem with the latency was that they had frames piling up in buffers between the sender and the receiver. If you're pushing video frames over TCP, you need feedback. The server needs to know how fast it can send. Otherwise, you get pile-up and a bunch of latency. That's all there is to it.
The simplest, absolutely foolproof way to do this is to use TCP's own congestion control. Spin up a thread that does two things: encodes video frames and sends them out on the socket using a blocking send/write call. Set SO_SNDBUF on that socket to a value that's proportional to your maximum latency tolerance and the rough size of your video frames.
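A minimal sketch of that, assuming placeholder capture/encode hooks; the only load-bearing parts are the small SO_SNDBUF and the blocking sendall:

```python
import socket
import threading

def start_sender(sock, grab_latest_frame, encode, bytes_per_sec, max_latency_s=0.5):
    # Size the kernel send buffer to your latency budget: anything that doesn't fit
    # makes the blocking send below stall, which is exactly the feedback you want.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, int(bytes_per_sec * max_latency_s))

    def loop():
        while True:
            data = encode(grab_latest_frame())   # hypothetical capture/encode hooks
            sock.sendall(data)                   # blocks while the path is congested

    threading.Thread(target=loop, daemon=True).start()
```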
One final bit of advice: use ffmpeg (libavcodec, libavformat, etc). It's much simpler to actually understand what you're doing with that than some convoluted gstreamer pipeline.
About eight years ago I was trying to stream several videos of a drone over the internet for remote product demos. Since we were talking to customers while the demo happened, the latency needed to be less than a few seconds. I couldn't get that latency with the more standard streaming video options I tried, and at the time setting up something based on WebRTC seemed pretty daunting. I ended up doing something pretty much like JPEGs as well, via the jsmpeg library [1]. Worked great.
I was blown away when I realized I could stream mjpeg from a raspberry pi camera with lower latency and less ceremony than everything I tried with webrtc and similar approaches.
An MPEG-1-based screen sharing experiment appeared here 10 years ago:
Yup, when reading this I immediately thought of jsmpeg, which I'm fond of.
from first principles.
Super interesting. Some time ago I wrote some code that breaks down a jpeg image into smaller frames of itself, then creates an h.264 video with the frames, outputting a smaller file than the original image
You can then extract the frames from the video and reconstruct the original jpeg
Additionally, instead of converting to video, you can use the smaller images of the original, to progressively load the bigger image, ie. when you get the first frame, you have a lower quality version of the whole image, then as you get more frames, the code progressively adds detail with the extra pixels contained in each frame
It was a fun project, but the extra compression doesn’t work for all images, and I also discovered how amazing jpeg is - you can get amazing compression just by changing the quality/size ratio parameter when creating a file
Why is video streaming so difficult? We've been doing it for decades, why is there seemingly no FOSS library which lets me encode an arbitrary dynamic frame rate image stream in Rust and get HD data with delta encoding in a browser receiver? This is insanity.
Would HLS be an option? I publish my home security cameras via WebRTC, but I keep HLS as a escape for hotel/cafe WiFi situations (MediaMTX makes it easy to offer both).
Thought of the same. I have not set it up outside of hobby projects, but it should work over HTTP as it says on the box, even inside a strict network?
Yes, it is strictly HTTP, not even persistent connections required.
> I mashed F5 like a degenerate.
I love the style of this blog-post, you can really tell that Luke has been deep down in the rabbit hole, encountered the Balrog and lived to tell the tale.
I like it too, even though it has that distinctive odor of being totally written by chatgpt. (a bit distracting tbh)
[deleted]
We did something similar in one of the places I've worked at. We sent xy coordinates and pointer events from our frontend app to our backend/3d renderer and received JPEG frames back. All of that wrapped in protobuf messages and sent via WS connection. Surprisingly it kinda worked, not "60fps worked" though obviously.
would like to see what alternatives were looked at. RDP with an HTML client (guacamole) seems like a good match
You can do TURN using TLS/TCP over port 443. This can fool some firewalls, but will still fail for instances when an intercepting HTTP proxy is used.
The neat thing about ICE is that you get automatic fallbacks and best path selection. So best case IPv6 UDP, worst case TCP/TLS
One of the nice things about HTTP3 and QUIC will be that UDP port 443 will be more likely to be open in the future.
Yes, though hopefully not for long; unfortunately not all codecs are given equal treatment...
If having native support in a web browser is important, though, then yes, WebP is a better choice (as is JPEG).
> A JPEG screenshot is self-contained. It either arrives complete, or it doesn’t. There’s no “partial decode.”
What about Progressive JPEG?
We did something similar 12+ years ago, `streaming` an app running on AWS inside the browser. Basically you can run 3d studio max on a chromebook. The app is actually running on an AWS instance and it just sends jpegs to the browser to `stream` it. We did a lot of QoS logic and other stuff but it was actually working pretty nice. Adobe used it for some time to allow users to run Photoshop in the browser. Good old days..
Who’s “we” in this case? Amazon (AWS)?
Having built an image sequence player using JPEGs back in the day - I can attest that it slappps.
WebP is well supported in browsers these days. Use WebP for the screenshots instead of JPEG and it will reduce the file size:
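If you want to check that claim against your own screenshots, a quick comparison with Pillow (which can write both formats); the filename and the example numbers are illustrative only:

```python
from io import BytesIO
from PIL import Image

def compare_sizes(path, quality=70):
    """Encode the same image as JPEG and WebP at the same quality setting and report sizes."""
    img = Image.open(path).convert("RGB")
    sizes = {}
    for fmt in ("JPEG", "WEBP"):
        buf = BytesIO()
        img.save(buf, format=fmt, quality=quality)
        sizes[fmt] = len(buf.getvalue())
    return sizes

# e.g. compare_sizes("screenshot.png") -> {'JPEG': 143201, 'WEBP': 98844}  (numbers made up)
```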
A long time ago I was trying to get video multiplexing to work over mobile over 3G. We struggled with H264 which had broad enough hardware support but almost no tooling and software support on the phones we were targeting. Even with engineers from the phone manufacturer as liaison we struggled to get access to any kind of SDK etc. We ended up doing JPEG streaming instead, much like the article said. And it worked great but we discovered we were getting a fraction of the framerate reported in Flash players - the call to refresh the screen was async and the act of receiving and decoding the next frame starved the redraw, so the phone spent more time receiving frames than showing them. Super annoying and I don't think the project survived long enough for us to find a fix.
This reminds me of the time we built a big angular3 codebase for a content platform. When we had to launch, the search engines were expecting content to be part of the page html while we were calling APIs to fetch the content (angular3 didn't have server side rendering at that point).
So the only plausible thing to do was pre-build html pages for the content pages and let angular's JS take its time to load (for ux functionality). It looked like the page flickered when the JS loaded for the first time, but we solved the search engine problem.
I very confused, couldn’t they have achieved much better outcome with existing hls tech with adaptive bitrate playlists? Seems they both created the problem and found a suboptimal solution.
> Why JPEGs Actually Slap
JPEG is extremely efficient to [de/en]code on modern CPUs. You can get close to 1080p60 per core if you use a library that leverages SIMD.
I sometimes struggle with the pursuit of perfect codec efficiency when our networks have become this fast. You can employ half-assed compression and still not max out a 1gbps pipe. From Netflix & Google's perspective it totally makes sense, but unless you are building a streaming video platform with billions of customers I don't see the point.
No mention of PNGs? I don’t usually go to jpegs first for screenshots of text. Did png have worse compression? Burn more cpu? I’m sure there are good reasons, but it seems like they’ve glossed over the obvious choice here.
edit: Thanks for the answers! The consensus is that PNG en/de -coding is too expensive compared to jpeg.
PNG is VERY slow compared to other formats. Not suitable for this sort of thing.
PNGs of screenshots would probably compress well, and the quality to size ratio would definitely be better than JPG, but the size would likely still be larger than a heavily compressed JPG. And PNG encoding/decoding is relatively slow compared to JPG.
PNGs are lossless so you can’t really dial up the compression. You can save space by reducing to 8-bit color (or grayscale!) but it’s basically the equivalent of raw pixels plus zlib.
PNG can be lossy. It can be done by first discarding some image detail, to make adjacent almost-matching pixel values actually match, to be more amenable to PNG's compression method. pngquant.org has a tool that does it.
There are usage cases where you might want lossy PNG over other formats; one is for still captures of 2d animated cartoon content, where H.264 tended to blur the sharp edges and flat color areas and this approach can compensate for that.
PNGs likely perform great, existing enterprise network filters, browser controls, etc, might not, even with how old PNGs are now.
I've found that WebM works much better because of the structure of the data in the container. I've also gone down similar routes using outdated tech, and even invented my own encoders and decoders trying to smooth things out, but what I've found is that the best current approach is WebM, because it is easier to lean on hardware encoders and decoders, including across browsers with the new WebCodecs APIs. What I've been working on is a little different than what is in this post, but I'm pretty sure this logic still stands.
Doesn’t matter now, but what led you to TURN?
You can run all WebRTC traffic over a single port. It’s a shame you spent so much time/were frustrated by ICE errors
That’s great you got something better and with less complexity! I do think people push ‘you need UDP and BWE’ a little too zealously. If you have a homogeneous set of clients stuff like RTMP/Websockets seems to serve people well
[deleted]
There's no real reason other than bad configuration/coding for an H.264 1080p 30fps screen share stream to sustainably use 40mbps. You can watch an action movie at the same frame rate but with 4k resolution while using less than half this bandwidth.
The real solution is using WebRTC, like every single other fucking company that has to stream video is doing. Yes, enterprise consumers require additional configuration. Yes, sometimes you need to provide a "network requirements" sheet to your customer so they can open a ticket with their IT to configure an exception.
Second problem: usually enterprise networks are not as bad as internet cafe networks, but then, internet café networks usually are not locked down, so you should always first try the best case scenario with webrtc and turn servers on 3478. That will also be the best option for really bad networks, but usually those networks are not enterprise networks.
Please configure your encoder, 40mbps bit rate for what you're doing is way way too much.
Test whether TURN is accessible. Try it first with UDP (the best option, and it will also work at the internet cafe); if not, try over tcp on port 443; not working? Try over tls on port 443.
I guess this is great as long as you don't worry about audio sync?
at least the ai agents aren't talking back to us
You're behind by 1.5 years on that thought. They certainly can.
H.264 can be used to encode a single frame as an effective image with better compression than JPEG.
This is similar to what BrowserBox does for the same reasons outlined. Glad to see the control afforded by "ye olde ways" is recognized and widely appreciated.
A very stupid hack that can work to "fix" this could be to buffer the h264 stream at the data center using a proxy before sending it to the real client, etc.
One of the big issues was latency.
Yes, but the real issue (IMO) is that something is causing an avalanche of some kind. You would much rather have a consistent 100ms increased latency for this application if it works much better for users with high loss, etc. Also, to be clear, this is basically just a memory cache. I doubt it would add any "real" latency like that.
The idea is that if the fancy system works well on connection A and works poorly on connection B, what are the differences and how can we modify the system so that A and B are the same from its perspective.
RTP is specifically designed for real-time video.
This is such a great post. I love the "The Oscillation Problem"!
I’m surprised that H264 I-frame only compresses less than JPG.
Maybe because the basic frequency transform is 4x4 vs 8x8 for JPG?
Their h264 iframes were bigger than the jpegs because they told the h264 encoder to produce bigger images. If they had set it to produce images the same size as the jpegs it most likely would have resulted in higher quality.
> I mashed F5 like a degenerate
Bargaining.
> looks at TCP congestion control literature
> closes tab
Eh, there are a few easy things one can try. Make sure to use a non-ancient kernel on the sender side (to get the necessary features), then enable BBR and NOTSENT_LOWAT (https://blog.cloudflare.com/http-2-prioritization-with-nginx...) to avoid buffering more than what's in-flight and then start dropping websocket frames when the socket says it's full.
Also, with tighter integration with the h264 encoder loop one could tell it which frames weren't sent and account for that in pframe generation. But I guess that wasn't available with that stack.
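For what it's worth, the socket-level knobs above look roughly like this from userspace (Linux-specific; the constants may not be exposed by every Python build, hence the numeric fallbacks, and dropping non-writable frames is just one policy):

```python
import select
import socket

def tune_for_low_latency(sock: socket.socket):
    # Per-socket BBR plus a small "not sent" watermark so the kernel doesn't
    # accept much more data than is actually in flight.
    TCP_CONGESTION = getattr(socket, "TCP_CONGESTION", 13)        # Linux value
    TCP_NOTSENT_LOWAT = getattr(socket, "TCP_NOTSENT_LOWAT", 25)  # Linux value
    sock.setsockopt(socket.IPPROTO_TCP, TCP_CONGESTION, b"bbr")
    sock.setsockopt(socket.IPPROTO_TCP, TCP_NOTSENT_LOWAT, 16 * 1024)

def send_or_drop(sock, frame_bytes):
    # If the socket isn't writable right now, drop this frame instead of queueing it.
    _, writable, _ = select.select([], [sock], [], 0)
    if writable:
        sock.sendall(frame_bytes)
        return True
    return False
```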
> We are professionals. We implement proper video codecs. We don’t spam HTTP requests for individual frames like it’s 2009.
I distinctly 'member doing CGI stuff with HTTP multipart responses... although I bet that with the exception of Apache, server (and especially: reverse proxy) side support for that has gone down the drain.
>A single H.264 keyframe is 200-500KB.
Hmm they must be doing something wrong, they're not usually that heavy.
“We didn’t have the expertise to build the thing we were building, got in way over our heads, and built a basic POC using legacy technology, which is fine.”
Awesome!
Good engineering: when you're not too proud to do the obvious, but sort of cheesy-sounding solution.
The LinkedIn slop tone, random bolding, and miscopied Markdown tables make me invoke: "please read the copy you worked on with AI"
smaller thing: many, many, moons ago, I did a lot of work with H.264. "A single H.264 keyframe is 200-500KB." is fantastical.
Can't prove it wrong because it will be correct given arbitrary dimensions and encoding settings, but, it's pretty hard to end up with.
Just pulled a couple 1080p's off YouTube, biggest I-frame is 150KB, median is 58KB (`ffprobe $FILE -show_frames -of compact -show_entries frame=pict_type,pkt_size | grep -i "|pict_type=I"`)
at least it had a minimum of Clause. Clause. Punchline.
I wrote a motion jpeg server for precisely this use case.
The secret to great user experience is you return the current video frame at time of request.
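The comment describes a motion-JPEG server; the simplest polled variant of "return the current frame at the time of the request" looks roughly like this (standard library only; the capture thread that fills `latest` is assumed to exist elsewhere):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

latest = {"jpeg": b""}          # updated by a capture/encode thread elsewhere
lock = threading.Lock()

class FrameHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Whatever frame is current *right now* goes out; stale frames are never queued.
        with lock:
            body = latest["jpeg"]
        self.send_response(200)
        self.send_header("Content-Type", "image/jpeg")
        self.send_header("Content-Length", str(len(body)))
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), FrameHandler).serve_forever()
```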
what about av1?
If you have latency detection already why not pause H.264 frames, then when ack comes just force a key frame and resume (perhaps with adjusted target bitrate)?
That would require that they understand the protocol stack they're using to send H.264 frames
Yeah, monitor the send queue length and reduce bit rate accordingly.
So, they've invented MJPEG?
Or is it intra-only H.264?
I mean, none of this is especially new. It's an interesting trick though!
This is a beautiful cope. Every time technology rolls out something that works great 90% of the time for 90% of the people, those 10%s pile up big time in support and lost productivity. You need functional systems that fall back gracefully to 1994 if necessary.
I started the first ISP in my area. We had two T1s to Miami. When HD audio and the rudiments of video started to increase in popularity, I'd always tell our modem customers, "A few minutes of video is a lifetime of email. Remember how exciting email was?"
One thing this article does point to indirectly is that sometimes simple scales and complex fails.
I'm confused, do people actually watch their agents code like it was a screen share? Why does the AI even mess with that, just send a diff over text? Is it getting a keyboard next?
This is the definition of over-engineering. I don't usually criticize ideas but this is so stupid my head hurts.
Another case of we’re going backwards. The boring stuff is what works every time…
> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
This would make sense... if they were using UDP, but they are using TCP. All the JPEGs they send will get there eventually (unless the connection drops). JPEG does not fix your buffering and congestion control problems. What presumably happened here is the way they implemented their JPEG screenshots, they have some mechanism that minimizes the number of frames that are in-flight. This is not some inherent property of JPEG though.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB. We’re sending LESS data per frame AND getting better reliability.
h.264 has better coding efficiency than JPEG. For a given target size, you should be able to get better quality from an h.264 IDR frame than a JPEG. There is no fixed size to an IDR frame.
Ultimately, the problem here is a lack of bandwidth estimation (apart from the sort of binary "good network"/"cafe mode" thing they ultimately implemented). To be fair, this is difficult to do and being stuck with TCP makes it a bit more difficult. Still, you can do an initial bandwidth probe and then look for increasing transmission latency as a sign that the network is congested. Back off your bitrate (and if needed reduce frame rate to maintain sufficient quality) until transmission latency starts to decrease again.
WebRTC will do this for you if you can use it, which actually suggests a different solution to this problem: use websockets for dumb corporate network firewall rules and just use WebRTC everything else
They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading. UDP is not necessary to write a loop.
> They shared the polling code in the article. It doesn't request another jpeg until the previous one finishes downloading.
You're right, I don't know how I managed to skip over that.
> UDP is not necessary to write a loop.
True, but this doesn't really have anything to do with using JPEG either. They basically implemented a primitive form of rate control by only allowing a single frame to be in flight at once. It was easier for them to do that using JPEG because they (to their own admission) seem to have limited control over their encode pipeline.
> have limited control over their encode pipeline.
Frustratingly this seems common in many video encoding technologies. The code is opaque, often has special kernel, GPU and hardware interfaces which are often closed source, and by the time you get to the user API (native or browser) it seems all knobs have been abstracted away and simple things like choosing which frame to use as a keyframe are impossible to do.
I had what I thought was a simple usecase for a video codec - I needed to encode two 30 frame videos as small as possible, and I knew the first 15 frames were common between the videos so I wouldn't need to encode that twice.
I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
> I couldn't find a single video codec which could do that without extensive internal surgery to save all internal state after the 15th frame.
fork()? :-)
But most software, video codec or not, simply isn't written to serialize its state at arbitrary points. Why would it?
A word processor can save it's state at an arbitrary point... That's what the save button is for, and it's functional at any point in the document writing process!
In fact, nearly everything in computing is serializable - or if it isn't, there is some other project with a similar purpose which is.
However this is not the case with video codecs.
So US->Australia/Asia wouldn't that limit you to 6fps or so due half-rtt? Each time a frame finishes arriving you have 150ms or so for your new request to reach.
Probably either (1) they don't request another jpeg until they have the previous one on-screen (so everything is completely serialized and there are no frames "in-flight" ever) (2) they're doing a fresh GET for each and getting a new connection anyway (unless that kind of thing is pipelined these days? in which case it still falls back to (1) above.)
You can still get this backpressure properly even if you're doing it push-style. The TCP socket will eventually fill up its buffer and start blocking your writes. When that happens, you stop encoding new frames until the socket is able to send again.
The trick is to not buffer frames on the sender.
You probably won't get acceptable latency this way since you have no control over buffer sizes on all the boxes between you and the receiver. Buffer bloat is a real problem. That said, yeah if you're getting 30-45 seconds behind at 40 Mbps you've probably got a fair bit of sender-side buffering happening.
Setting aside the various formatting problems and the LLM writing style, this just seems all kinds of wrong throughout.
> “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10Mbps should be way more than enough for a mostly static image with some scrolling text. (And 40Mbps are ridiculous.) This is very likely to be caused by bad encoding settings and/or a bad encoder.
> “What if we only send keyframes?” The post goes on to explain how this does not work because some other component needs to see P-frames. If that is the case, just configure your encoder to have very short keyframe intervals.
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
A single H.264 keyframe can be whatever size you want, *depending on how you configure your encoder*, which was apparently never seriously attempted. Why are we badly reinventing MJPEG instead of configuring the tools we already have? Lower the bitrate and keyint, use a better encoder for higher quality, lower the frame rate if you need to. (If 10 fps JPEGs are acceptable, surely you should try 10 fps H.264 too?)
But all in all the main problem seems to be squeezing an entire video stream through a single TCP connection. There are plenty of existing solutions for this. For example, this article never mentions DASH, which is made for these exact purposes.
>Setting aside...the LLM writing style
I don't want to set that aside either. Why is AI generated slop getting voted to the top of HN? If you can't be bothered to spend the time writing a blog post, why should I be bothered spending my time reading it? It's frankly a little bit insulting.
Don’t assume something you cannot prove. It was great writing.
Normally the 1 sentence per para LinkedIn post for dummies writing style bugs me to no end, but for a technical article that's continually hopping between questions, results, code, and explanations, it fits really well and was a very easy article to skim and understand.
> For example, this article never mentions DASH, which is made for these exact purposes.
DASH isn't supported on Apple AFAIK. HLS would be an idea, yes...
But in either case: you need ffmpeg somewhere in your pipeline for that experience to be even remotely enjoyable. No ffmpeg? No luck, good luck implementing all of that shit yourself.
I'm very familiar with the stack and the pain of trying to livestream video to a browser. If JPEG screenshots work for your clients, then I would just stick with that.
The problem with wolf, gstreamer, moonlight, $third party, is you need to be familiar with how the underlying stack handles backpressure and error propagation, or else things will just "not work" and you will have no idea why. I've worked on 3 projects in the last 3 years where I started with gstreamer, got up and running - and while things worked in the happy path, the unhappy path was incredibly brittle and painful to debug. All 3 times I opted to just use the lower level libraries myself.
Given all of the OP's requirements, I think something like NVIDIA Video Codec SDK to a websocket to Media Source Extensions would work.
However, given that even this post seems to be LLM generated, I don't think the author would care to learn about the actual internals. I don't think this is a solution that could be vibe coded.
This is where LLMs shine, where you need to dip your toes into really complex systems but basically just to do one thing with pretty straightforward requirements.
They might want to check out what VNC has been doing since 1998 – keep the client-pull model, break the framebuffer up into tiles and, when the client requests an update, perform a diff against the last frame sent, then composite the updated tiles client-side. (This is what VNC falls back to when it doesn’t have damage-tracking from the OS compositor)
This would really cut down on the bandwidth of static coding terminals where 90% of the screen is just a cursor flashing or small bits of text moving.
If they really wanted to be ambitious they could also detect scrolling and do an optimization client-side where it translates some of the existing areas (look up CopyRect command in VNC).
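A toy sketch of that tile-diff idea (NumPy frames assumed; a real implementation would also handle encodings like CopyRect, which this skips):

```python
import numpy as np

TILE = 64  # tile edge in pixels

def changed_tiles(prev: np.ndarray, curr: np.ndarray, tile: int = TILE):
    """Yield (x, y, tile_pixels) for every tile that differs from the
    previously sent frame. prev/curr are HxWx3 uint8 arrays."""
    h, w = curr.shape[:2]
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            a = prev[y:y + tile, x:x + tile]
            b = curr[y:y + tile, x:x + tile]
            if not np.array_equal(a, b):
                yield x, y, b

def update(prev_sent, framebuffer, send_tile):
    """Server loop step: only re-encode and send tiles that changed, then
    remember what the client has so the next diff is against *sent* state."""
    dirty = list(changed_tiles(prev_sent, framebuffer))
    for x, y, pixels in dirty:
        send_tile(x, y, pixels)        # e.g. JPEG-encode just this tile
    return framebuffer.copy() if dirty else prev_sent
```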
Before VNC, check out neko: https://github.com/m1k1o/neko
I worked on a project that started with VNC and had lots of problems. Slow connect times and backpressure/latency. Switching to neko was quick/easy win.
The blog post did smell of inexperience. Glad to hear there are other approaches - is something like that open source?
Yup. Go look into tigervnc if you want to see the source. But also you can just search for "tigervnc h.264" and you'll see extensive discussions between the devs on h.264 and integrating it into tiger. This is something that people spent a LOT of brainpower on.
Of all the suggestions in the comments here, this seems like the best one to start with.
Also... I get that the dumb solution to "ugly text at low bitrates" is "make the bitrate higher." But still, nobody looked at a 40M minimum and wondered if they might be looking at this problem from the wrong angle entirely?
Copying how VNC does it is exactly how my first attempt would go. Seems odd to try something like Moonlight which is designed for low latency remote gameplay.
40mbps for video of an LLM typing text didn't immediately fire off alarm bells in anyone's head that their approach was horribly wrong? That's an insane amount of bandwidth for what they're trying to do.
> When the network is bad, you get... fewer JPEGs. That’s it. The ones that arrive are perfect.
You can still have weird broken stallouts though.
I dunno, this article has some good problem solving but the biggest and mostly untouched issue is that they set the minimum h.264 bandwidth too high. H.264 can do a lot better than JPEG with a lot less bandwidth. But if you lock it at 40Mbps of course it's flaky. Try 1Mbps and iterate from there.
And going keyframe-only is the opposite of how you optimize video bandwidth.
> Try 1Mbps and iterate from there.
From the article:
“Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
Rejecting it out of hand isn't actually trying it.
10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.
And it would not be blocky garbage, it would still look a lot better than JPEG.
1Mbps for video is the rule of thumb I use. Of course that will depend on customer expectations. 500K can work, but it won’t be pretty.
For normal video I think that's a good rule of thumb.
For mostly-static content at 4fps you can cut a bunch more bitrate corners before it looks bad. (And 2-3 JPEGs per second won't even look good at 1Mbps.)
>> 10Mbps is still way too high of a minimum. It's more than YouTube uses for full motion 4k.
> And 2-3 JPEGs per second won't even look good at 1Mbps.
Unqualified claims like these are utterly meaningless. It depends too much on exactly what you're doing, some sorts of images will compress much better than others.
Proper rate control for such realtime streaming would also lower framerate and/or resolution to maintain the best quality and latency they can over dynamic network conditions and however little bandwidth they have. The fundamental issue is that they don't have this control loop at all, and are badly simulating it by polling JPEGs.
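A hand-wavy sketch of what such a control loop could look like; the thresholds, step sizes, and the `encoder` object are all made up, and the point is simply reacting to measured delivery latency rather than a binary good/bad switch:

```python
def adjust(encoder, delivery_latency_ms: float, target_ms: float = 300.0):
    """Very small AIMD-style controller: back off hard when frames arrive
    late, creep back up when there is headroom. `encoder` is a hypothetical
    object with bitrate_kbps / fps attributes."""
    if delivery_latency_ms > 2 * target_ms:
        # Badly congested: halve the bitrate, and trade frame rate for quality
        # so individual frames (text!) stay readable.
        encoder.bitrate_kbps = max(300, encoder.bitrate_kbps // 2)
        encoder.fps = max(2, encoder.fps // 2)
    elif delivery_latency_ms > target_ms:
        encoder.bitrate_kbps = int(encoder.bitrate_kbps * 0.8)
    else:
        # Probe upward gently so we eventually recover full quality.
        encoder.bitrate_kbps = min(8000, encoder.bitrate_kbps + 100)
        if encoder.fps < 30 and delivery_latency_ms < target_ms / 2:
            encoder.fps += 1
```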
It might be possible to buffer and queue jpegs for playback as well to help with weird broken stall outs.
Video players used to call it buffering, and resolving it was called buffering issues.
Players today can keep an eye on network quality while playing too, which is neat.
There are so many things that I would have done differently.
> We added a keyframes_only flag. We modified the video decoder to check FrameType::Idr. We set GOP to 60 (one keyframe per second at 60fps). We tested.
Why muck around with P-frames and keyframes? Just make your video 1fps.
> Now it’s 10Mbps of blocky garbage that’s still 30 seconds behind.
10 Mbps is way too much. I occasionally watch YouTube videos where someone writes code. I set my quality to 1080p to be comparable with the article and YouTube serves me the video at way less than 1Mbps. I did a quick napkin math for a random coding video and it was 0.6Mbps. It’s not blocky garbage at all.
Setting to 1 FPS might not be enough. The GOP or P-frame setting needs to be adjusted to make every frame a keyframe.
This blog post smells of LLM, both in the language style and the muddled explanations / bad technical justifications. I wouldn't be surprised if their code is also vibe coded slop.
One man's not-blocky-garbage is another's insufferable hell. Even at 4k I find YouTube quality to be just awful with artefacts everywhere.
Many moons ago I was using this software which would screenshot every five seconds and give you a little time lapse and the end of the day. So you could see how you were spending your computer time.
My hard disk ended up filling up with tens of gigabytes of screenshots.
I lowered the quality. I lowered the resolution, but this only delayed the inevitable.
One day I was looking through the folder and I noticed well almost all the image data on almost all of these screenshots is identical.
What if I created some sort of algorithm which would allow me to preserve only the changes?
I spent embarrassingly long thinking about this before realizing that I had begun to reinvent video compression!
So I just wrote a ffmpeg one-liner and got like 98% disk usage reduction :)
That's fun. I take it JPEG (what settings lolz!) is compressing harder than a keyframe.
But you are watching code. Why not send the code? Plus any css/html used to render it pretty. Or in other words, why not a vscode tunnel?
Having pair programmed over some truly awful and locked down connections before, dropped frames are infinitely better than blurred frames which make text unreadable whenever the mouse is moved. But 40mbps seems an awful lot for 1080p 60fps.
Temporal SVC (reduce framerate if bandwidth constrained) is pretty widely supported by now, right? Though maybe not for H.264, so it probably would have scaled nicely but only on Webrtc?
They're just streaming a video feed of an LLM running in a terminal? Why not stream the actual text? Or fetch it piecemeal over AJAX requests? They complain that corporate networks support only HTTPS and nothing else? Do they not understand what the first T stands for?
Indeed, live text streaming is well over 100 years old:
https://en.wikipedia.org/wiki/Teleprinter
Suppose an LLM opens a browser, or opens a corporate .exe and GUI and starts typing in there and clicking buttons.
You don't give it a browser or buttons to click.
If you are ok with a second or so of latency then MPEG-DASH (standardized version of HTTP Live Streaming) is likely the best bet. You simply serve the video chunks over HTTP so it should be just as compatible as the JPEG solution used here but provide 60fps video rather than crappy jpegs.
The standard supports adaptive bit rate playback so you can provide both low quality and high quality videos and players can switch depending on bandwidth available.
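As a rough illustration (not something the article tried), packaging encoder output as a live HLS playlist with ffmpeg is only a few flags; adaptive bitrate then just means producing several renditions plus a master playlist on top of this:

```python
import subprocess

# Hypothetical single rendition of a live HLS ladder: short segments keep the
# live edge close, and everything is plain files served over HTTPS/443.
subprocess.run([
    "ffmpeg",
    "-i", "screen_capture.mkv",        # placeholder input
    "-c:v", "libx264", "-b:v", "1500k", "-g", "60",
    "-f", "hls",
    "-hls_time", "2",                  # 2-second segments
    "-hls_list_size", "5",             # rolling live playlist
    "-hls_flags", "delete_segments",
    "stream.m3u8",
], check=True)
```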
"Think “screen share, but the thing being shared is a robot writing code.”"
Thinks: why not send text instead of graphics, then? I'm sure it's more complicated than that...
Thinks: this video[1] is the processed feed from the Huygens space probe landing on Saturn's moon Titan circa 2005. Relayed through the Cassini probe orbiting Saturn, 880 million miles from the Sun. At a total mission cost of 3.25 billion dollars. This is the sensor data, altitude, speed, spin, ultra violet, and hundreds of photos. (Read the description for what the audio is encoding, it's neat!)
Look at the end of the video, the photometry data count stops at "7996 kbytes received"(!)
> "Turns out, 40Mbps video streams don’t appreciate 200ms+ network latency. Who knew. “Just lower the bitrate,” you say. Great idea. Now it’s 10Mbps of blocky garbage"
Who could do anything useful with 10Mbps. :/
[1] https://en.wikipedia.org/wiki/File:Huygens_descent.ogv
Yeah, I'm thinking the same thing. Capture the text somehow and send that, and reconstruct it on the other end; and the best part is you only need to send each new character, not the whole screen, so it should be very small and lightning fast?
Sounds kind of like https://asciinema.org/ (which I've never used, but it seems cool).
Which features terminal live streaming since recently released 3.0 :)
> The fix was embarrassingly simple: once you fall back to screenshots, stay there until the user explicitly clicks to retry.
There is another recovery option (a rough sketch follows the list):
- increase the JPEG framerate every couple of seconds until the bandwidth consumption approaches the H264 stream bandwidth estimate
- keep track of latency changes. If the client reports a stable latency range, and it is acceptable (<1s latency, <200ms variance?) and bandwidth use has reached 95% of the H264 estimate, re-activate the stream
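A sketch of that heuristic; the `state` object and the exact numbers are placeholders:

```python
def maybe_reactivate_stream(state, h264_bw_estimate_bps):
    """Ramp JPEG polling back up and re-enable the H.264 stream only once the
    connection has proven it can sustain comparable bandwidth with stable, low
    latency. `state` is a hypothetical object tracking the fallback session."""
    # Step 1: gradually raise the JPEG frame rate every few seconds.
    if state.seconds_since_last_bump > 3 and state.jpeg_fps < state.max_jpeg_fps:
        state.jpeg_fps += 1
        state.seconds_since_last_bump = 0

    # Step 2: only consider switching back when measured throughput is close to
    # what the H.264 stream would need, and latency is stable.
    bw_ok = state.measured_bps >= 0.95 * h264_bw_estimate_bps
    latency_ok = state.p95_latency_ms < 1000 and state.latency_jitter_ms < 200
    if bw_ok and latency_ok:
        state.reactivate_h264()
```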
Given that text/code is what is being viewed, lower res and adaptive streaming (HLS) are not really viable solutions since they become unreadable at lower res.
If remote screen sharing is a core feature of the service, I think this is a reasonable next step for the product.
That said, IMO at a higher level if you know what you're streaming is human-readable text, it's better to send application data pipes to the stream rather than encoding screenspace videos. That does however require building bespoke decoders and client viewing if real time collaboration network clients don't already exist for the tools (but SSH and RTC code editors exist)
I made this because I got tired of screensharing issues in corporate environments: https://bluescreen.live (code via github).
Screenshot once per second. Works everywhere.
I’m still waiting for mobile screenshare api support, so I could quickly use it to show stuff from my phone to other phones with the QR link.
I recognize this voice :) This is Claude.
So it’s video of an AI typing text?
Why not just send text? Why do you need video at all?
Why send anything at all if the AI isn't even good enough to solve their own problems?
(Although the fact they decided to use Moonlight in an enterprise product makes me wonder if their product actually was vibe coded)
You apparently need video for the 45-second window you then get to prevent catastrophic things from happening. From TFA:
> You’re watching the AI type code from 45 seconds ago
>
> By the time you see a bug, the AI has already committed it to main
>
> Everything is terrible forever
Is this satire? I mean: if the solution for things to not be terrible forever consists in catching what an AI is doing in 45 seconds (!) before the AI commits to trunk, I'm sorry but you should seriously re-evaluate your life plans.
This article reminds me so much of so many hardware providers I deal with at work who want to put equipment on-site and then spend the next year not understanding that our customers manage their own firewall. No, you can’t just add a new protocol or completely change where your stuff is deployed because then our support team has to contact hundreds of customers about thousands of sites.
This was the most entertaining thing I read all day. Kudos.
I've had similar experiences in the past when trying to do remote desktop streaming for digital signage (which is not particularly demanding in bandwidth terms). Multicast streaming video was the most efficient, but annoying to decode when you dropped data. I now wonder how far I could have gone with JPEGs...
If playing with Chromecast-type devices, multicast or manually streaming one frame at a time worked pretty well.
Yes, this is unfortunately still the way and was very common back when iOS Safari did not allow embedded video.
For a fast start of the video, reverse the implementation: instead of downgrading from Websockets to polling when connection fails, you should upgrade from polling to Websockets when the network allows.
Socket.io was one of the first libraries that did that switching, and it had it the wrong way around at first, too. They learned how enterprise networks behave and switched the implementation.
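A sketch of the upgrade-not-downgrade direction in Python (hypothetical endpoints; uses the third-party `websockets` package; a browser client would do the equivalent with fetch + WebSocket):

```python
import asyncio
import urllib.request
import websockets  # third-party: pip install websockets

POLL_URL = "https://example.com/frame.jpg"   # hypothetical endpoints
WS_URL = "wss://example.com/stream"

async def poll_frames(on_frame):
    """Dumb-but-universal baseline: fetch one JPEG at a time."""
    while True:
        frame = await asyncio.to_thread(
            lambda: urllib.request.urlopen(POLL_URL, timeout=5).read())
        on_frame(frame)

async def run(on_frame):
    poller = asyncio.create_task(poll_frames(on_frame))
    upgraded = False
    try:
        # Try to upgrade in the background; only switch once it actually works.
        async with websockets.connect(WS_URL) as ws:
            upgraded = True
            poller.cancel()                  # stop polling once WS is live
            async for frame in ws:
                on_frame(frame)
    except Exception:
        if not upgraded:
            await poller                     # upgrade failed: keep polling
```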
> And the size! A 70% quality JPEG of a 1080p desktop is like 100-150KB. A single H.264 keyframe is 200-500KB.
I believe the latter can be adjusted in codec settings.
Of course. But a same-quality h264 keyframe will not be much smaller than a JPEG.
So they replaced a TCP connection with no application-level rate control with a synchronous poll of an endpoint, which is inherently rate-limited.
I wonder if they just tried restarting the stream at a lower bitrate once it got too delayed.
The talk about how the images look more crisp at a lower FPS is just tuning that I guess they didn't bother with.
> The constraint that ruined everything: It has to work on enterprise networks.
> You know what enterprise networks love? HTTP. HTTPS. Port 443. That’s it. That’s the list.
That's not enough.
Corporate networks also love to MITM their own workstations and reinterpret http traffic. So, no WebSockets and no Server-Side Events either, because their corporate firewall is a piece of software no one in the world wants and everyone in the world hates, including its own developers. Thus it only supports a subset of HTTP/1.1 and sometimes it likes to change the content while keeping Content-Length intact.
And you have to work around that, because IT dept of the corporation will never lift restrictions.
I wish I was kidding.
Back when I had a job at a big old corporation, a significant part of my value to the company was that I knew how to bypass their shitty MITM thing that broke tons of stuff, including our own software that we wrote. So I could solve a lot of problems people had that otherwise seemed intractable because IT was not allowed to disable it, and they didn't even understand the myriad ways it was breaking things.
> So, no WebSockets
The corporate firewall debate came up when we considered websockets at a previous company. Everyone has parroted the same information for so long that it was just assumed that websockets and corporate firewalls were going to cause us huge problems.
We went with websockets anyway and it was fine. Almost no traffic to the no-websockets fallback path, and the traffic that did arrive appeared to be from users with intermittent internet connections (cellular providers, foreign countries with poor internet).
I'm 100% sure there are still corporate firewalls out there blocking or breaking websocket connections, but it's not nearly the same problem in 2025 as it was in 2015.
If your product absolute must, no exceptions, work perfectly in every possible corporate environment then a fallback is necessary if you use websockets. I don't think it's a hard rule that websockets must be avoided due to corporate firewalls any more, though.
I've had to switch from SSE to WebSockets to navigate a corporate network (the entire SSE would have to close before the user received any of the response).
Then we ran into a network where WebSockets were blocked, so we switched to streaming http.
No trouble with streaming http using a standard content-type yet.
> And you have to work around that, because IT dept of the corporation will never lift restrictions.
Unless the corporation is 100% in-office, I’d wager they do in fact make exceptions - otherwise they wouldn’t have a working videoconferencing system.
The challenge is getting corporate insiders to like your product enough to get it through the exception process (a total hassle) when the firewall’s restrictions mean you can’t deliver a decent demo.
They even break server-sent events (which is still my default for most interactive apps)
There are other ways to make server-sent events work.
I try to remember many environments once likely supported Flash.
Request URL has a query parameter with more than 64 characters? Fuck you.
Request lives for longer than 15 sec? Fuck you.
Request POSTs some JSON? Maybe fuck you just a little bit, when we find certain strings in the payload. We won't tell you which though.
>And you have to work around that, because IT dept of the corporation will never lift restrictions.
Because otherwise people do dumb stuff like pasting proprietary designs or PII into deepseek
Oh, they'll do that anyway, once they find the workaround (Oh... you can paste a credit card if you put periods instead of dashes! Oh... I have to save the file and do it from my phone! Oh... I'll upload it as a .txt file and change the extension on the server!)
It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.
>Oh, they'll do that anyway, once they find the workaround ...
This is assuming the DLP service blocks the request, rather than doing something like logging it and reporting it to your manager and/or CIO.
>It's purely illusory security, that doesn't protect anything but does levy a constant performance tax on nearly every task.
Because you can't ask deepseek to extract some unstructured data for you? I'm not sure what the alternative is, just let everyone paste info into deepseek? If you found out that your data got leaked because some employee pasted some data into some random third party service, and that the company didn't have any policies/technological measures against it, would your response still be "yeah it's fine, it's purely illusory security"?
What's the term for the ideology that "laws are silly because people sometimes break them"?
I don't think that's a good read of the post you're replying to. I think a more charitable read would be something like "people break rules for convenience, so if your security relies on nobody breaking rules then you don't have thorough security".
You and op can be right at the same time. You imply the rules probably help a lot even while imperfect. They imply that pretending rules alone are enough to be perfect is incomplete.
Posting stuff into Deepseek is banned. The corporate firewall is like putting a camera in your home because you may break the law. But, yeah, arguing against cameras in homes because people find dead angles where they can hide may not be the strongest argument.
Disclaimer: I work in corporate cybersecurity.
I know that some guardrails and restrictions in a corporate setting can backfire. I know that onerous processes to get approval for needed software access can drive people to break the rules or engage in shadow IT. As a member of a firewall team, I did it myself! We couldn't get access to Python packages or PHP for a local webserver we had available to us from a grandfather clause. My team hated our "approved" Sharepoint service request system. So a few of us built a small web app with Bottle (single file web server microframework, no dependencies) and Bootstrap CSS and SQLite backend. Everyone who interacted with our team loved it. Had we more support from corporate it might have been a lot easier.
Good cybersecurity needs to work with IT to facilitate peoples' legitimate use cases, not stand in the way all the time just because it's easier that way.
But saying "corporate IT controls are all useless" is just as foolish to me. It is reasonable and moral for a business to put controls and visibility on what data is moving between endpoints, and to block unsanctioned behavior.
It's called black and white thinking
At the same time, enterprise is where the revenue is.
Against all odds, you're right, that's where somehow revenue is being generated. IT idiocy notwithstanding.
Often, enterprises create moats and then profit from them.
It's not usually IT idiocy; that usually comes from higher up, cosplaying their inner tech visionaries.
Corporate IT needs to die.
It's not corporate IT's fault; it's usually corporate leadership's fault, since they often cosplay leading technology without understanding it.
Wherever tech is a first-class citizen with a seat at the corporate table, it can be different.
Believe me, the average Fortune 500 CEO does not know or care what “SSL MITM” is, or whether passwords should contain symbols and be changed monthly, or what the difference is between ‘VPN’ and ‘Zero Trust’.
They delegate that stuff. To the corporate IT department.
But they also say "Here, this is Sarah your auditor. Answer these questions and resolve the findings." - every year
It's all CyberSecurity insurance compliance that in many cases deviates from security best practices.
This is where the problems come from. Auditors are definitely what ultimately causes IT departments to make dumb decisions.
For example, we got dinged on an audit because instead of using RSA4096, we used ed25519. I kid you not, their main complaint was that there weren't enough bits, which meant it wasn't secure.
Auditors are snake oil salesmen.
This is 100% it- the auditor is confirming the system is configured to a set of requirements, and those requirements are rarely in lockstep with actual best practices.
Sometimes they have checkboxes to tick in some compliance document and they must run the software that let them tick those checkboxes, no exceptions, because those compliances allow the company to be on the market. Regulatory captures, etc.
Where else are you going to find customers that are so sticky it will take years for them to select another solution, regardless of how crappy you are? That will staff teams to work around your failures? Who, when faced with obvious evidence of the dysfunction of your product, will roundly blame themselves for not holding it properly? Gaslight their own users? Pay obscene amounts for support when all you provide is a voice mailbox that never gets emptied? Who will happily accept your estimate about the number of seats they need? And who, when holding a retro about your failure, will happily proclaim that there wasn't anything _they_ could have done, so case closed?
Oh yes you can absolutely profit off that but you have to be dead inside a little bit.
And produce a piece of software no one in the world wants and everyone in the world hates. Yourself included.
I think the general idea/flow of things is "numbers go up, until $bubble explodes, and we built up smaller things from the ground up, making numbers go up, bloating go up, until $bubble explodes..." and then repeat that forever. Seems to be the end result of capitalism.
If you wanna kill corporate IT, you have to kill capitalism first.
I’d say there’s nothing inherently capitalist about large and stupid bureaucracies (but I repeat myself) spending money in stupid ways. Military bureaucracies in capitalist countries do it. Military bureaucracies in socialist countries did it. Everything else in end-stage socialist countries did it too. I’m sorry, it’s not the capitalism—things’d be much easier if it were.
I don't believe that. I don't necessarily love capitalism (though I can't say I see very many realistic better alternatives either), but if HN is full of people who could do corporate IT better (read: sanely), then the conclusion is just that corporate IT is run by morons. Maybe that's because the corporate owners like morons, but nothing about capitalism inherently makes it so.
> corporate IT is run by morons
playing devil's advocate for a second, but corpIT is also working with morons as employees. most draconian rules used by corpIT have a basis in at least one real world example. whether that example happened directly by one of the morons they manage or passed along from corpIT lore, people have done some dumb ass things on corp networks.
Yes, and the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid. Morons will always find a way to stupid, and most organizations push the impediments well past the point of diminishing returns.
> the problem in that picture is the belief (whichever level of the management hierarchy it comes from) that you can introduce technical impediments against every instance of stupidity one by one until morons are no longer able to stupid
I would say the problem in the picture is your belief that corporate IT is introducing technical impediments against every instance of stupidity. I bet there's loads of stupidity they don't introduce technical impediments against. It would just not meet the cost-benefit analysis to spend thousands of tech man-hours introducing a new impediment that didn't cost the company much if any money.
It's because corporate IT has to service non-tech people, and non-tech people get pwned by tech savvy nogoodniks. So the only sane behavior of corporate IT is to lock everything down and then whitelist things rarely.
Apparently capitalism doesn’t pay enough for corporate IT admin jobs.
WebSockets over TCP is probably always going to cause problems for streaming media.
WebRTC over UDP is one choice for lossy situations. Media over Quic might be another (is the future here?), and it might be more enterprise firewall friendly since HTTP3 is over Quic.
I have some experience with pushing video frames over TCP.
It appears that the writer has jumped to conclusions at every turn and it's usually the wrong one.
The reason that the simple "poll for jpeg" method works is that polling is actually a very crude congestion control mechanism. The sender only sends the next frame when the receiver has received the last frame and asks for more. The downside of this is that network latency affects the frame rate.
The frame rate issue with the polling method can be solved by sending multiple frame requests at a time, but only as many as will fit within one RTT, so the client needs to know the minimum RTT and the sender's maximum frame rate.
The RFB (VNC) protocol does this, by the way. Well, the thing about rtt_min and frame rate isn't in the spec though.
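A sketch of that pipelined pull; the wire format here (one request byte per frame, length-prefixed JPEG responses) is made up for illustration:

```python
import socket

def pipelined_pull(sock: socket.socket, rtt_min_s: float, sender_fps: float, on_frame):
    """Keep roughly rtt_min * fps requests in flight so the sender never idles
    waiting for the next request, without requesting unboundedly. Wire format
    is hypothetical: b'R' asks for one frame, each frame comes back as a
    4-byte big-endian length followed by JPEG bytes."""
    in_flight_target = max(1, int(rtt_min_s * sender_fps))
    in_flight = 0
    while True:
        while in_flight < in_flight_target:
            sock.sendall(b"R")            # queue up requests to cover one RTT
            in_flight += 1
        size = int.from_bytes(_read_exact(sock, 4), "big")
        on_frame(_read_exact(sock, size))
        in_flight -= 1

def _read_exact(sock, n):
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf
```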
Now, I will not go through every wrong assumption, but as for this nonsense about P-frames and I-frames: With TCP, you only need one I-frame. The rest can be all P-frames. I don't understand how they came to the conclusion that sending only I-frames over TCP might help with their latency problem. Just turn off B-frames and you should be OK.
The actual problem with the latency was that they had frames piling up in buffers between the sender and the receiver. If you're pushing video frames over TCP, you need feedback. The server needs to know how fast it can send. Otherwise, you get pile-up and a bunch of latency. That's all there is to it.
The simplest, absolutely foolproof way to do this is to use TCP's own congestion control. Spin up a thread that does two things: encodes video frames and sends them out on the socket using a blocking send/write call. Set SO_SNDBUF on that socket to a value that's proportional to your maximum latency tolerance and the rough size of your video frames.
One final bit of advice: use ffmpeg (libavcodec, libavformat, etc). It's much simpler to actually understand what you're doing with that than some convoluted gstreamer pipeline.
About eight years ago I was trying to stream several videos of a drone over the internet for remote product demos. Since we were talking to customers while the demo happened, the latency needed to be less than a few seconds. I couldn't get that latency with the more standard streaming video options I tried, and at the time setting up something based on WebRTC seemed pretty daunting. I ended up doing something pretty much like JPEGs as well, via the jsmpeg library [1]. Worked great.
[1] https://jsmpeg.com/ (tagline: "decode like it's 1999")
so did they reinvent mjpeg
I was blown away when I realized I could stream mjpeg from a raspberry pi camera with lower latency and less ceremony than everything I tried with webrtc and similar approaches.
An MPEG-1-based screen sharing experiment appeared here 10 years ago:
- https://news.ycombinator.com/item?id=9954870
- https://phoboslab.org/log/2015/07/play-gta-v-in-your-browser...
Yup, when reading this I immediately thought of jsmpeg, which I'm fond of.
from first principles.
Super interesting. Some time ago I wrote some code that breaks down a jpeg image into smaller frames of itself, then creates an h.264 video with the frames, outputting a smaller file than the original image
You can then extract the frames from the video and reconstruct the original jpeg
Additionally, instead of converting to video, you can use the smaller images of the original, to progressively load the bigger image, ie. when you get the first frame, you have a lower quality version of the whole image, then as you get more frames, the code progressively adds detail with the extra pixels contained in each frame
It was a fun project, but the extra compression doesn’t work for all images, and I also discovered how amazing jpeg is - you can get amazing compression just by changing the quality/size ratio parameter when creating a file
Why is video streaming so difficult? We've been doing it for decades, so why is there seemingly no FOSS library which lets me encode an arbitrary dynamic-frame-rate image stream in Rust and get HD data with delta encoding in a browser receiver? This is insanity.
Would HLS be an option? I publish my home security cameras via WebRTC, but I keep HLS as a escape for hotel/cafe WiFi situations (MediaMTX makes it easy to offer both).
Thought of the same. I have not set it up outside of hobby projects, but it should work over HTTP as it says on the box, even inside a strict network?
Yes, it is strictly HTTP, not even persistent connections required.
> I mashed F5 like a degenerate.
I love the style of this blog-post, you can really tell that Luke has been deep down in the rabbit hole, encountered the Balrog and lived to tell the tale.
I like it too, even though it has that distinctive odor of being totally written by ChatGPT. (A bit distracting, tbh.)
We did something similar in one of the places I've worked at. We sent x/y coordinates and pointer events from our frontend app to our backend/3D renderer and received JPEG frames back. All of that was wrapped in protobuf messages and sent via a WS connection. Surprisingly, it kinda worked, not "60fps worked" though, obviously.
I would like to see what alternatives were looked at. RDP with an HTML client (guacamole) seems like a good match.
You can do TURN using TLS/TCP over port 443. This can fool some firewalls, but will still fail for instances when an intercepting HTTP proxy is used.
The neat thing about ICE is that you get automatic fallbacks and best path selection. So best case IPv6 UDP, worst case TCP/TLS
One of the nice things about HTTP3 and QUIC will be that UDP port 443 will be more likely to be open in the future.
webp is smaller than jpeg
https://developers.google.com/speed/webp/docs/webp_study
ALSO - the blog author could simplify - you don't need any code at all in the web browser.
The <img> tag automatically does motion jpeg streaming.
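That works because browsers render `multipart/x-mixed-replace` streams directly in an `<img>` tag. A minimal server sketch, with `latest_jpeg()` as a hypothetical frame source:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

def latest_jpeg() -> bytes:
    """Placeholder: return the most recent JPEG-encoded screenshot."""
    with open("frame.jpg", "rb") as f:   # hypothetical frame file
        return f.read()

class MJPEGHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The browser keeps the connection open and swaps in each new part as
        # it arrives -- no client-side JavaScript required.
        self.send_response(200)
        self.send_header("Content-Type",
                         "multipart/x-mixed-replace; boundary=frame")
        self.end_headers()
        try:
            while True:
                jpeg = latest_jpeg()
                self.wfile.write(b"--frame\r\n")
                self.wfile.write(b"Content-Type: image/jpeg\r\n")
                self.wfile.write(f"Content-Length: {len(jpeg)}\r\n\r\n".encode())
                self.wfile.write(jpeg + b"\r\n")
                time.sleep(0.25)          # ~4 fps
        except BrokenPipeError:
            pass                          # client went away

if __name__ == "__main__":
    HTTPServer(("", 8080), MJPEGHandler).serve_forever()
```

Point `<img src="http://host:8080/">` at it and the browser swaps frames as they arrive.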
… and JPEG XL is smaller than WebP.
JPEG XL looks to have pretty poor support.
https://caniuse.com/jpegxl
Yes, though hopefully not for long; unfortunately not all codecs are given equal treatment...
If having native support in a web browser is important, though, then yes, WebP is a better choice (as is JPEG).
> A JPEG screenshot is self-contained. It either arrives complete, or it doesn’t. There’s no “partial decode.”
What about Progressive JPEG?
We did something similar 12+ years ago, `streaming` an app running on AWS into the browser. Basically you could run 3D Studio Max on a Chromebook. The app is actually running on an AWS instance and it just sends JPEGs to the browser to `stream` it. We did a lot of QoS logic and other stuff, but it was actually working pretty nicely. Adobe used it for some time to allow users to run Photoshop in the browser. Good old days..
Who’s “we” in this case? Amazon (AWS)?
Having built an image sequence player using JPEGs back in the day - I can attest that it slappps.
WebP is well supported in browsers these days. Use WebP for the screenshots instead of JPEG and it will reduce the file size:
https://developers.google.com/speed/webp/gallery1
https://caniuse.com/webp
A long time ago I was trying to get video multiplexing to work over mobile over 3G. We struggled with H264, which had broad enough hardware support but almost no tooling and software support on the phones we were targeting. Even with engineers from the phone manufacturer as liaison, we struggled to get access to any kind of SDK etc. We ended up doing JPEG streaming instead, much like the article said. And it worked great, but we discovered we were getting a fraction of the framerate reported in Flash players - the call to refresh the screen was async, and the act of receiving and decoding the next frame starved the redraw, so the phone spent more time receiving lots of frames but not showing them. Super annoying, and I don’t think the project survived long enough for us to find a fix.
This reminds me of the time we built a big angular3 codebase for a content platform. When we had to launch, the search engines were expecting content to be part of page html while we are calling APIs to fetch the content ( angular3 didn’t have server side rendering at that point)
So only plausible thing to do was pre-build html pages for content pages and let load angular’s JS take its time to load ( for ux functionality). It looked like page flickered when JS loads for the first time but we solved the search engine problem.
I'm very confused - couldn't they have achieved a much better outcome with existing HLS tech with adaptive bitrate playlists? It seems they both created the problem and found a suboptimal solution.
> Why JPEGs Actually Slap
JPEG is extremely efficient to [de/en]code on modern CPUs. You can get close to 1080p60 per core if you use a library that leverages SIMD.
I sometimes struggle with the pursuit of perfect codec efficiency when our networks have become this fast. You can employ half-assed compression and still not max out a 1gbps pipe. From Netflix & Google's perspective it totally makes sense, but unless you are building a streaming video platform with billions of customers I don't see the point.
No mention of PNGs? I don’t usually go to jpegs first for screenshots of text. Did png have worse compression? Burn more cpu? I’m sure there are good reasons, but it seems like they’ve glossed over the obvious choice here.
edit: Thanks for the answers! The consensus is that PNG en/de -coding is too expensive compared to jpeg.
PNG is VERY slow compared to other formats. Not suitable for this sort of thing.
PNGs of screenshots would probably compress well, and the quality to size ratio would definitely be better than JPG, but the size would likely still be larger than a heavily compressed JPG. And PNG encoding/decoding is relatively slow compared to JPG.
PNGs are lossless so you can’t really dial up the compression. You can save space by reducing to 8-bit color (or grayscale!) but it’s basically the equivalent of raw pixels plus zlib.
PNG can be lossy. It can be done by first discarding some image detail, to make adjacent almost-matching pixel values actually match, to be more amenable to PNG's compression method. pngquant.org has a tool that does it.
There are usage cases where you might want lossy PNG over other formats; one is for still captures of 2d animated cartoon content, where H.264 tended to blur the sharp edges and flat color areas and this approach can compensate for that.
PNGs likely perform great, existing enterprise network filters, browser controls, etc, might not, even with how old PNGs are now.
I've found that WebM works much better because of the structure of the data in the container. I've also gone down similar routes using outdated tech, and even invented my own encoders and decoders trying to smooth things out, but what I've found is that the best current approach is using WebM, because it is easier to lean on hardware encoders and decoders, including across browsers with the new WebCodecs APIs. What I've been working on is a little different than what is in this post, but I'm pretty sure this logic still stands.
Doesn’t matter now, but what led you to TURN?
You can run all WebRTC traffic over a single port. It’s a shame you spent so much time/were frustrated by ICE errors
That’s great you got something better and with less complexity! I do think people push ‘you need UDP and BWE’ a little too zealously. If you have a homogeneous set of clients stuff like RTMP/Websockets seems to serve people well
There's no real reason, other than bad configuration/coding, for an H.264 1080p 30fps screen-share stream to sustainably use 40mbps. You can watch an action movie at the same frame rate but with 4k resolution while using less than half this bandwidth.
The real solution is using WebRTC, like every single other fucking company that have to stream video is doing. Yes, enterprise consumers require additional configuration. Yes, sometimes you need to provide a "network requirements" sheet to your customer so they can open a ticket with their IT to configure an exception.
Second problem: usually enterprise networks are not as bad as internet cafe networks, but then, internet café networks usually are not locked down, so you should always try the best-case scenario first, with WebRTC and TURN servers on 3478. That will also be the best option for really bad networks, but usually those networks are not enterprise networks.
Please configure your encoder, 40mbps bit rate for what you're doing is way way too much.
Test whether TURN is accessible. Try it first with UDP (the best option, and it will also work at the internet cafe); if not, try over TCP on port 443; not working? Try over TLS on port 443.
It’s always TCP_NODELAY seems relevant here: https://news.ycombinator.com/item?id=40310896
I guess this is great as long as you don't worry about audio sync?
at least the ai agents aren't talking back to us
You're behind by 1.5 years on that thought. They certainly can.
H.264 can be used to encode a single frame as an effective image with better compression than JPEG.
This is similar to what BrowserBox does for the same reasons outlined. Glad to see the control afforded by "ye olde ways" is recognized and widely appreciated.
A very stupid hack that can work to "fix" this could be to buffer the h264 stream at the data center using a proxy before sending it to the real client, etc.
One of the big issues was latency.
Yes, but the real issue (IMO) is that something is causing an avalanche of some kind. You would much rather have a consistent 100ms increased latency for this application if it works much better for users with high loss, etc. Also, to be clear, this is basically just a memory cache. I doubt it would add any "real" latency like that.
The idea is that if the fancy system works well on connection A and works poorly on connection B, what are the differences, and how can we modify the system so that A and B are the same from its perspective?
RTP is specifically designed for real-time video.
This is such a great post. I love the "The Oscillation Problem"!
I’m surprised that I-frame-only H264 compresses worse than JPG.
Maybe because the basic frequency transform is 4x4 vs 8x8 for JPG?
Their h264 iframes were bigger than the jpegs because they told the h264 encoder to produce bigger images. If they had set it to produce images the same size as the jpegs it most likely would have resulted in higher quality.
> I mashed F5 like a degenerate
Bargaining.
> looks at TCP congestion control literature
> closes tab
Eh, there are a few easy things one can try. Make sure to use a non-ancient kernel on the sender side (to get the necessary features), then enable BBR and NOTSENT_LOWAT (https://blog.cloudflare.com/http-2-prioritization-with-nginx...) to avoid buffering more than what's in-flight and then start dropping websocket frames when the socket says it's full.
Also, with tighter integration with the h264 encoder loop one could tell it which frames weren't sent and account for that in pframe generation. But I guess that wasn't available with that stack.
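A sketch of the sender side of that; 25 is Linux's numeric TCP_NOTSENT_LOWAT value used as a fallback if the Python constant isn't exposed, and the thresholds/buffer sizes are made up:

```python
import select
import socket

def configure(sock: socket.socket):
    # Only a little un-sent data may sit in the kernel before the socket stops
    # reporting itself writable (Linux; 25 == TCP_NOTSENT_LOWAT).
    lowat = getattr(socket, "TCP_NOTSENT_LOWAT", 25)
    sock.setsockopt(socket.IPPROTO_TCP, lowat, 128 * 1024)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 256 * 1024)

def send_or_drop(sock: socket.socket, frame: bytes) -> bool:
    """Send a frame only if the kernel is ready to take more data; otherwise
    drop it rather than letting latency build up. Note sendall() can still
    block briefly if the frame is larger than the remaining buffer."""
    _, writable, _ = select.select([], [sock], [], 0)   # non-blocking poll
    if not writable:
        return False          # socket full: drop this frame
    sock.sendall(len(frame).to_bytes(4, "big") + frame)
    return True
```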
> We are professionals. We implement proper video codecs. We don’t spam HTTP requests for individual frames like it’s 2009.
I distinctly 'member doing CGI stuff with HTTP multipart responses... although I bet that with the exception of Apache, server (and especially: reverse proxy) side support for that has gone down the drain.
>A single H.264 keyframe is 200-500KB.
Hmm they must be doing something wrong, they're not usually that heavy.
“We didn’t have the expertise to build the thing we were building, got in way over our heads, and built a basic POC using legacy technology, which is fine.”
Awesome!
Good engineering: when you're not too proud to do the obvious, but sort of cheesy-sounding solution.
The LinkedIn slop tone, random bolding, and miscopied Markdown tables make me invoke: "please read the copy you worked on with AI"
smaller thing: many, many, moons ago, I did a lot of work with H.264. "A single H.264 keyframe is 200-500KB." is fantastical.
Can't prove it wrong because it will be correct given arbitrary dimensions and encoding settings, but, it's pretty hard to end up with.
Just pulled a couple 1080p's off YouTube, biggest I-frame is 150KB, median is 58KB (`ffprobe $FILE -show_frames -of compact -show_entries frame=pict_type,pkt_size | grep -i "|pict_type=I"`)
at least it had a minimum of Clause. Clause. Punchline.
I wrote a motion jpeg server for precisely this use case.
https://github.com/crowdwave/maryjane
The secret to great user experience is you return the current video frame at time of request.
what about av1?
If you have latency detection already why not pause H.264 frames, then when ack comes just force a key frame and resume (perhaps with adjusted target bitrate)?
That would require that they understand the protocol stack they're using to send H.264 frames
Yeah, monitor the send queue length and reduce bit rate accordingly.
So, they've invented MJPEG?
Or is it intra-only H.264?
I mean, none of this is especially new. It's an interesting trick though!
This is a beautiful cope. Every time technology rolls out something that works great 90% of the time for 90% of the people, those 10%s pile up big time in support and lost productivity. You need functional systems that fall back gracefully to 1994 if necessary.
I started the first ISP in my area. We had two T1s to Miami. When HD audio and the rudiments of video started to increase in popularity, I'd always tell our modem customers, "A few minutes of video is a lifetime of email. Remember how exciting email was?"
One thing this article does point to indirectly is sometimes, simple scales, and complex fails.
I'm confused, do people actually watch their agents code like it was a screen share? Why does the AI even mess with that, just send a diff over text? Is it getting a keyboard next?
This is the definition of over-engineering. I don't usually criticize ideas but this is so stupid my head hurts.
Another case of we’re going backwards. The boring stuff is what works every time…