> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.
The article suggests using LLMs to identify and fix UB. However, as per the above, I think the issue is that we need more expert humans.
LLM-generated code will eventually contain UB.
EDIT: added "eventually"
The problem of UB is not really that it may crash in some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).
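A minimal sketch of the kind of thing meant here, assuming a typical optimizing build (e.g. GCC or Clang at -O2). The function names are made up for illustration, and whether a given compiler actually folds the first check away depends on version and flags:

```cpp
#include <cassert>
#include <limits>

// Tries to detect overflow by performing it. For x == INT_MAX the
// expression x + 1 is signed overflow, which is UB, so the optimizer may
// assume it never happens and reduce the whole body to `return false`.
bool will_overflow_naive(int x) {
    return x + 1 < x;
}

// Well-defined alternative: compare against the limit before adding.
bool will_overflow_checked(int x) {
    return x == std::numeric_limits<int>::max();
}
```

The checked version never evaluates an overflowing expression, so there is nothing for the optimizer to assume away.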
In C / C++ there are two kinds of undefined behaviour. One is where the standard explicitly says that something is UB. The other is everything else that the standard does not cover.
Technically, that's only one kind, because it's written in the standard that anything not mentioned in the standard is undefined behavior.
From the ANSI C standard:
3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements. Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply an overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear: undefined behavior means you're doing something not intended or specified by the language, but the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.
By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too; that clearly violates the intent of this paragraph.
> but that the consequence of this should be somewhat bounded or as expected for the target machine.
Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?
Notice though "ignoring the situation" through "documented manner characteristic of the environment". Even though you can certainly read this in an uncharitable way, you could also try to understand the intent of this paragraph, and I think reading it for its intent is always the best way to interpret a language standard when the wording is ambiguous or soft, especially if you're writing a compiler.
I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5, just that it would be good to print a diagnostic if it can be detected, and if not to do what's "characteristic of the environment". Does that make sense?
A fun one that'd fit the list would be sequence point violations like
i = i++
Fun, sure, but also GCC and Clang will both warn with -Wall (-Wsequence-point / -Wunsequenced).
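For reference, a small sketch of the unambiguous spelling, assuming the intent was simply to increment. `bump` is a made-up name. Note that `i = i++` is undefined in C and in C++ before C++17; C++17 sequences the right operand of an assignment first, so there it is defined but leaves `i` unchanged:

```cpp
#include <cassert>

// The flagged form would be:
//     i = i++;   // -Wsequence-point (GCC) / -Wunsequenced (Clang)
//
// Spelled out so that no evaluation-order rule is involved:
int bump(int i) {
    i += 1;  // one modification of i, in its own statement
    return i;
}
```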
I stopped reading about here:
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.
You might try to argue that uint8_t is not necessarily char. It is true that implementations of C can exist where CHAR_BIT > 8, but those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid: it is.)
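Whichever way the question about the cast is settled, the risky dereference can be sidestepped. This is a generic sketch and not the article's snippet, with hypothetical helper names: inspect an object through `unsigned char*`, and build a wider value from bytes with `memcpy` instead of reading through a casted pointer.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reading an object's bytes through unsigned char* is permitted.
unsigned first_byte(const std::uint32_t& value) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&value);
    return p[0];
}

// Going from a byte buffer to a wider type: copy the bytes rather than
// dereferencing a uint32_t* that points into the buffer, which could run
// into alignment or aliasing problems.
std::uint32_t load_u32(const std::uint8_t* bytes) {
    std::uint32_t value;
    std::memcpy(&value, bytes, sizeof value);
    return value;
}
```

Compilers commonly turn a fixed-size `memcpy` like this into a plain load, so the defined version usually costs nothing.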
Yet another push to use LLMs after casting fear. Now it should be illegal not to use LLMs. A good start of the day.
(I hope casting fear is not UB)
> (I hope casting fear is not UB)
I'm sure that's UB in C
In C++ just use <reinterpret_cast>
The irony is unmistakable.
Ok, and?
"Rewrite everything in Rust. OMG universe is written in Rust so memory safe with zero allocations"
Anyone who uses the construction "C/C++" doesn't write modern C++, and probably isn't very familiar with the recent revisions despite TFA's claims of writing it every day for decades.
Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.
No HN comment can possibly cover all the use cases of C++, but in general, unless you have a very good reason not to, you should be:
- eschewing boomer loops in favor of ranges
- using RAII with smart pointers
- relying on move semantics
- using STL containers instead of raw arrays
- borrowing using spans and string views
These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)
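A rough sketch of that style, kept to C++17 so it builds without extra flags: a standard algorithm and a range-based for stand in for C++20 ranges, and `std::string_view` stands in for spans. `Sample`, `make_sample`, `total` and `count_vowels` are invented for the example:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <numeric>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

struct Sample {
    std::string name;
    std::vector<int> values;  // STL container instead of a raw array
};

// RAII ownership via unique_ptr; arguments are moved in, not copied.
std::unique_ptr<Sample> make_sample(std::string name, std::vector<int> values) {
    auto sample = std::make_unique<Sample>();
    sample->name = std::move(name);
    sample->values = std::move(values);
    return sample;
}

// Standard algorithm instead of a hand-written index loop.
int total(const std::vector<int>& values) {
    return std::accumulate(values.begin(), values.end(), 0);
}

// Non-owning view parameter: accepts std::string or a literal without copying.
std::size_t count_vowels(std::string_view text) {
    constexpr std::string_view vowels = "aeiou";
    std::size_t count = 0;
    for (char c : text) {
        if (vowels.find(c) != std::string_view::npos) ++count;
    }
    return count;
}
```

Nothing here uses `new`, `delete`, or index arithmetic, which removes several of the usual routes to UB, though not all of them.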
Although some people, like Bjarne Stroustrup, object to the term C/C++, it's a bit like Richard Stallman objecting to the term "Linux". The fact is it can mean "C or C++", and I wouldn't assume the author thinks they're the same, but they're talking about both of them together in the same sentence. This seems reasonable given this is about undefined behavior, and it's trivial to accidentally write UB-inducing code in C++ even with modern style (although I'd say you should catch most trivial cases with e.g. ubsan, and a lot of bad cases would be avoided with e.g. ranges, so I think the article is exaggerating the issue).
Well, the author explicitly refers to "C/C++" as one language:
>After all, C/C++ is not a memory safe language.
"C/C++" is a useful term for the common C/C++ subset :)
As far as stdlib usage is concerned: that's just your opinion. The stdlib has a lot of footguns and terrible design decisions too, e.g. std::vector pulling in 20k lines of code into each compilation unit is simply bizarre.
Also:
> - borrowing using spans and string views
Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.
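A small sketch of how that goes wrong, with made-up function names. The first function compiles but hands back a view of a string that no longer exists; it is shown only for contrast and should never be called:

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Dangling: `full` is destroyed when the function returns, so the returned
// view refers to freed storage. Reading through it is UB.
std::string_view dangling_greeting(const std::string& name) {
    std::string full = "hello " + name;
    return full;  // implicit std::string -> std::string_view conversion
}

// Owning return type: the caller receives the storage along with the data.
std::string owned_greeting(const std::string& name) {
    return "hello " + name;
}

// A view is fine as long as the owner outlives it, as it does here.
bool starts_with_hello(std::string_view text) {
    return text.substr(0, 5) == "hello";
}
```

The language does not stop the first function from compiling, which is the point: the view carries no lifetime information.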
I use a C dialect that has generic containers, slices and buffers, while pointer arithmetic is forbidden. At this point, C stdlib has near zero value, unless it is a syscall or a function specially treated by the compiler (e.g. memcpy).
> the upshot is you never need to deal with the Rust community
In the end, everything comes down to culture war.
Perhaps we should rewrite our culture in Rust.
I love arguing with idiots.