I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Taking this one step further -- there's no such thing as the context-free length of a string.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
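To make the gap concrete, here is a minimal Python sketch of the first three measurements for the facepalm emoji from the article, written out as escapes. The grapheme count leans on the third-party `regex` module (the `grapheme` package would work too), and the pixel bounding box is omitted because it needs a font and a rendering engine:

import regex

s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # facepalm + skin tone + ZWJ + male sign + VS-16

print(len(s.encode("utf-8")))         # 17 -- bytes when serialized as UTF-8
print(len(s.encode("utf-16-le")))     # 14 -- i.e. 7 UTF-16 code units
print(len(s))                         # 5  -- Unicode code points
print(len(regex.findall(r"\X", s)))   # 1  -- extended grapheme cluster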
"Unicode is JPG for ASCII" is an incredibly great metaphor.
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
I really wish people would stop giving this bad advice, especially so stridently.
Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.
I vehemently dissent from this view.
You’re arguing against a strawman. The advice wasn’t to ignore learning about code points; it’s that if your solution to a problem involves reasoning about code points, you’re probably doing it wrong and are likely to make a mistake.
Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).
Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.
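A tiny Python illustration of the diacritics point, using nothing more exotic than a combining accent:

# Per-code-point operations silently break even simple diacritics.
s = "noe\u0308l"      # "noël" spelled with "e" + COMBINING DIAERESIS
print(len(s))         # 5 code points, though a reader sees 4 characters
print(s[:3])          # "noe" -- slicing on a code point boundary dropped the umlaut
print(s[::-1])        # naive reversal attaches the umlaut to the wrong letter ("l̈eon")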
[deleted]
It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
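Roughly, the validation difference looks like this (a Python sketch with hypothetical helper names, not the actual CRDT code):

doc = "héllo"   # 5 code points, 6 UTF-8 bytes

def valid_codepoint_insert(pos):
    # Any integer up to the current length is a legal insertion point.
    return 0 <= pos <= len(doc)

def valid_byte_insert(pos):
    # Byte offsets must additionally avoid landing inside a multi-byte sequence.
    b = doc.encode("utf-8")
    return 0 <= pos <= len(b) and (pos == len(b) or (b[pos] & 0xC0) != 0x80)

print([p for p in range(8) if valid_codepoint_insert(p)])   # [0, 1, 2, 3, 4, 5]
print([p for p in range(8) if valid_byte_insert(p)])        # [0, 1, 3, 4, 5, 6] -- 2 would split "é"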
> Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
I hear your point, but invalid codepoint sequences are way less of a problem than strings with invalid UTF8. Text rendering engines deal with weird Unicode just fine. They have to since Unicode changes over time. Invalid UTF8 on the other hand is completely unrepresentable in most languages. I mean, unless you use raw byte arrays and convert to strings at the edge, but that’s a terrible design.
> This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.
ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
> Notably Rust did the correct thing
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
String.len() == number of bytes
String.bytes().count() == number of bytes
String.chars().count() == number of unicode scalar values
String.graphemes().count() == number of graphemes (requires unicode-segmentation which is not in the stdlib)
String.lines().count() == number of lines
Really my only complaint is I don't think String.len() should exist, it's too ambiguous. We should have to explicitly state what we want/mean via the iterators.
Similar to Java:
String.chars().count(), String.codePoints().count(), and, for historical reasons, String.getBytes(StandardCharsets.UTF_8).length
> String.graphemes().count()
That's a real nice API. (Similarly, Python has `@` for matmul but there is no matmul implementation in the stdlib. NumPy provides one so that the `@` operator works.)
ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.
It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?
Any correctly designed grapheme cluster implementation handles emoji characters. It’s part of the spec (says the guy who wrote a Unicode segmentation library for Rust).
> in the global international connected computing world it doesn’t fit at all.
Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
No one is "obsessing" over anything. Reality is there are very few cases where you can use a single 8-bit character set and not run in to problems sooner or later. Say your software is used only in Greece so you use ISO-8859-7 for Greek. That works fine, but now you want to talk to your customer Günther from Germany who has been living in Greece for the last five years, or Clément from France, or Seán from Ireland and oops, you can't.
Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).
There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).
The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.
Except, this is a response to emoji support, which does have encoding issues even if your user base is in the US and only speaks English. Additionally, it is easy to have issues with data that your users use from other sources via copy and paste.
Which audience makes it so you don’t have to worry about text encodings?
This is naive at best
Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names.
It's a much simpler problem and it still tripped up a lot of people.
And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...
Yea, I cringe when I hear the phrase "special characters." They're only special because you, the developer, decided to treat them as special, and that's almost surely going to come back to haunt you at some point in the form of a bug.
> in the global international connected computing world it doesn’t fit at all.
I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code and many downsides to allowing non-ASCII characters outside string constants or comments.
This is American imperialism at its worst. I'm serious.
Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.
Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?
It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.
Well I'm not American and I can tell you that we do not see English source code as imperialism.
In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.
I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.
Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.
> Please stop pushing this UTF-8 everywhere nonsense.
I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII). Both are just encodings for the same UNICODE codepoints; ASCII just cannot go beyond the first 128 codepoints, but that's where UTF-8 comes in, and in a way that's backward compatible with ASCII - which is the one ingenious feature of the UTF-8 encoding.
I'm not advocating for ASCII-everywhere, I'm for bytes-everywhere.
And bytes can conveniently fit both ASCII and UTF-8.
If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.
But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences graphically show the appropriate character "�", otherwise let it pass through unmodified. Just don't crash, show useless error messages or in the worst case try to "fix" it by mangling the data even more.
> "let wohnt_bei_Böckler_STRAẞE"
This string cannot be encoded as ASCII in the first place.
> But if you allow full 8-bit bytes, please don't restrict them to UTF-8
UTF-8 has no 8-bit restrictions... You can encode any 21-bit UNICODE codepoint with UTF-8.
It sounds like you're confusing ASCII, Extended ASCII and UTF-8:
- ASCII: 7-bits per "character" (e.g. not able to encode international characters like äöü) but maps to the lower 7-bits of the 21-bits of UNICODE codepoints (e.g. all ASCII character codes are also valid UNICODE code points)
- Extended ASCII: 8-bits per "character" but the interpretation of the upper 128 values depends on a country-specific codepage (e.g. the interpretation of a byte value in the range between 128 and 255 is different between countries and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong - the problem is Extended ASCII - this allows you to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)
- UTF-8: a variable-width encoding for the full range of UNICODE codepoints, uses 1..4 bytes to encode one 21-bit UNICODE codepoint, and the 1-byte encodings are identical with 7-bit ASCII (e.g. when the MSB of a byte in an UTF-8 string is not set, you can be sure that it is a character/codepoint in the ASCII range).
Out of those three, only Extended ASCII with codepages is 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine since any valid ASCII encoded string is indistinguishable from that same string encoded as UTF-8, e.g. ASCII has been 'retconned' into UTF-8.
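A quick Python illustration of the difference (cp1252 and cp1251 chosen here as stand-ins for arbitrary legacy codepages):

ascii_bytes = "hello".encode("ascii")
print(ascii_bytes == "hello".encode("utf-8"))   # True -- same bytes either way

umlauts = "äöü"
print(umlauts.encode("utf-8"))    # b'\xc3\xa4\xc3\xb6\xc3\xbc' -- unambiguous, 2 bytes each
print(umlauts.encode("cp1252"))   # b'\xe4\xf6\xfc' -- one legacy codepage of many
print(b"\xe4\xf6\xfc".decode("cp1251"))   # 'дць' -- the same bytes read with the wrong codepage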
I’d go farther and say that extended ASCII was an unmitigated disaster of compatibility issues (not to mention that more than a few scripts still don’t fit in the available spaces of an 8-bit encoding). Those of us who were around for the pre-Unicode days understand what a mess it was (not to mention the lingering issues thanks to the much vaunted backwards compatibility of some operating systems).
I'm not GP, but I think you're completely missing their point.
The problem they're describing happens because file names (in Linux and Windows) are not text: in Linux (so Android) they're arbitrary sequences of bytes, and in Windows they're arbitrary sequences of UTF-16 code points not necessarily forming valid scalar values (for example, surrogate pairs can be present alone).
And yet, a lot of programs ignore that and insist on storing file names as Unicode strings, which mostly works (because users almost always name files by inputting text) until somehow a file gets written as a sequence of bytes that doesn't map to a valid string (i.e., it's not UTF-8 or UTF-16, depending on the system).
So what's probably happening in GP's case is that they managed somehow to get a file with a non-UTF-8-byte-sequence name in Android, and subsequently every App that tries to deal with that file uses an API that converts the file name to a string containing U+FFFD ("replacement character") when the invalid UTF-8 byte is found. So when GP tries to delete the file, the App will try to delete the file name with the U+FFFD character, which will fail because that file doesn't exist.
GP is saying that showing the U+FFFD character is fine, but the App should understand that the actual file name is not UTF-8 and behave accordingly (i.e. use the original sequence-of-bytes filename when trying to delete it).
Note that this is harder than it should be. For example, with the old Java API (from java.io[1]) that's impossible: if you get a `File` object from listing a directory and ask if it exists, you'll get `false` for GP's file, because the `File` object internally stores the file name as a Java string. To get the correct result, you have to use the new API (from java.nio.file[2]) using `Path` objects.
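For comparison, this is roughly how the round trip can work in Python on a POSIX-style system: the invalid byte survives because the str API maps it to a lone surrogate and maps it back on the way out (the file name here is made up for the demo):

import os

os.makedirs("demo", exist_ok=True)
raw_name = b"book-\xe9.epub"        # Latin-1 byte 0xE9, not valid UTF-8
open(os.path.join(b"demo", raw_name), "wb").close()

for name in os.listdir("demo"):     # the str API turns the bad byte into a lone surrogate
    print(repr(name))               # 'book-\udce9.epub'
    os.remove(os.path.join("demo", name))   # still deletes the right file: the surrogate
                                            # re-encodes to the original 0xE9 byte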
UTF-8 everywhere is not great and UTF-8 in practice is hardly ASCII-compatible. UTF-8 in source code and file paths outside the pure ASCII range breaks a lot of things, especially on non-English systems due to legacy dependencies, ironically.
Sure, it's backward compatible, as in ASCII handling codes work on systems with UTF-8 locales, but how important is that?
It's neither American nor imperialism -- those are both category mistakes.
Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.
Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change that most libraries will use English and most documentation will be in English and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also due to the dominance of English and ASCII in computing history, most languages already have ASCII-alternatives for their writing so even if you need to refer to non-English names you can do that using only ASCII.
Well, the problem is that what you are advocating is the equivalent of making Latin a prerequisite for studying medicine, which it isn't anywhere. Doctors learn a (very limited) Latin vocabulary as they study and work.
You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't submerged in the language from an early age.
Imagine learning programming using only your high school Spanish. Good luck.
> Imagine learning programming using only your high school Spanish. Good luck.
This + translated materials + locally written books is how STEM fields work in East Asia, the odds of success shouldn't be low. There just needs to be enough population using your language.
Calm down, ASCII is a UNICODE compatible encoding for the first 128 UNICODE code points (which map directly to the entire ASCII range). If you need to go beyond that, just 'upgrade' to UTF-8 encoding.
UNICODE is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (e.g. for the first 128 UNICODE code points, an UTF-8 encoded string is byte-by-byte compatible with the same string encoded in ASCII).
Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") - or any of the legacy 'national' multibyte encodings (Shift-JIS etc...) because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, but some flavour of Extended ASCII decoded with the wrong codepage).
I don’t see much difference between the amount of Italian you need for music and the amount of English you need for programming. You can have a conversation about it in your native language, but you’ll be using a bunch of domain-specific terms that may not be in your native language.
There was a time when most scientific literature was written in French. People learned French. Before that it was Latin. People learned Latin.
This is true, but it’s important to recognize that this was because of the French (Napoleonic) and Roman empires and Christianity, just as the brutal American and UK empires created these circumstances today.
The napoleonic empire lasted about 15 years, so that's a bit of a stretch.
More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators.
ASCII is totally fine as an encoding for the first 128 UNICODE code points. If you need to go above those 128 code points, use a different encoding like UTF-8.
Just never ever use Extended ASCII (8-bits with codepages).
[deleted]
Python 3 deals with this reasonably sensibly, too, I think. They use UTF-8 by default, but allow you to specify other encodings.
Python 3 internally uses UTF-32.
When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8.
"UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.
It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.
It would be pretty silly for them to explode all strings to 4-byte characters.
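You can watch the flexible representation at work; the exact byte counts are CPython implementation details and vary by version and platform, but the relative growth is the point:

import sys

print(sys.getsizeof("aaaa"))            # smallest: 1 byte per code point (Latin-1 storage)
print(sys.getsizeof("aaa\u00e9"))       # still 1 byte per code point -- é fits in Latin-1
print(sys.getsizeof("aaa\u20ac"))       # larger: 2 bytes per code point -- € forces UCS-2
print(sys.getsizeof("aaa\U0001F926"))   # largest: 4 bytes per code point -- emoji forces UCS-4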
You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.
Python3 is specified to use arrays of 8, 16, or 32 bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.
> code points, which need up to 24 bits
They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.
I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.
Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.
But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
My experience personally with python3 (and repeated interactions with about a dozen python programmers, including core contributors) is that python3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions, since a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeatedly unportable to python3. The last attempt was much larger, still failed, and they thought they could probably do it, but it would require far more code and was not worth their effort.
> a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeated unportable to python3.
Show me.
Heh. It always starts this way... then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or, "I could get this working but it isn't worth the effort" but sure, let's do this one more time. Could be they were all missing something obvious - wouldn't know, I avoid python personally, apart from when necessary like with LLM glue.
https://pastebin.com/j4Lzb5q1
This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.
> then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing or, 'I could get this working but it isn't worth the effort'"
It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:
#!/usr/bin/env python
import random, sys, time

def out(b):
    # ASCII 0..7 for the second digit of the color code in the escape sequence
    color = random.randint(48, 55)
    sys.stdout.buffer.write(bytes([27, 91, 51, color, 109, b]))
    sys.stdout.flush()

for i in range(32, 256):
    out(i)
    time.sleep(random.random()/5)

while True:
    out(random.randint(32, 255))
    time.sleep(0.1)
I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this.
(My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)
After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:
I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.)
There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)
I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object. And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.
Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist.
I would like a utf-8 optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid utf-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then utf-8 encoding/decoding becomes a noop and utf-8 specific APIs can quickly check whether the string is malformed or not.
But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string. Keep in mind that UTF-8 is self-synchronizing so even if you encode strings into a larger text-based format without verifying them it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes many programs can support arbitrary encodings or at least arbitrary ASCII-supersets without any additional effort.
The main issue I can see is not garbage bytes in text but mixing of incompatible encoding eg splicing latin-1 bytes in a utf-8 string.
My understanding of the current "always and only utf-8/unicode" zeitgeist is that it comes mostly from encoding issues, among which the complexity of detecting encodings.
I think that the current status quo is better than what came before, but I also think it could be improved.
Me too.
The languages that I really don't get are those that force valid utf-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds.
Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.
Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.
C strings are not bags of bytes because they can't contain 0x00.
It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
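Something like this, as a rough Python sketch: the names follow the comment above, the grapheme count needs a library such as the third-party `regex` module, and "lengthCombined" is read here (an assumption) as "length in grapheme clusters":

import regex

def length_in_bytes(s, encoding="utf-8"):
    return len(s.encode(encoding))      # depends on the chosen encoding

def length_in_characters(s):
    return len(s)                       # code points, in Python's model

def length_combined(s):
    return len(regex.findall(r"\X", s))  # extended grapheme clusters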
I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.
> I have, on the other hand, always wanted the string length.
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
I don’t know about advanced Unicode features… but I use them all the time as a backend developer to validate data input.
I want to make sure that the password is between a given minimum and maximum number of characters. Same with phone numbers, email addresses, etc.
This seems to have always been known as the length of the string.
This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.
> I want to make sure that the password is between a given number of characters. Same with phone numbers, email addresses, etc.
> This seems to have always been known as the length of the string.
Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.
Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.
None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.
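A small Python demonstration of that last point, spelling the two strings out as escapes so no normalization can sneak in along the way:

import unicodedata

a = "\u00e9"       # "é" precomposed: one code point
b = "e\u0301"      # "é" decomposed: "e" + COMBINING ACUTE ACCENT, two code points

print(a == b)                                   # False
print(unicodedata.normalize("NFC", b) == a)     # True -- equal only after normalization
print(len(a), len(b))                           # 1 2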
Which length? Bytes? Code points? Graphemes? Pixels?
> Number of monospaced font character blocks this string will take up on the screen
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
It gets more complicated if you do substring operations.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
"Unicode, being a byte code format"
UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
The point is that not all sequences of characters ("code point" means the integer value, whereas "character" means the thing that number represents) are valid.
non sequitur ... I simply pointed out a mistaken claim and your comment is about something quite different.
(Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)
I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence to ever be the “normal” use case, except to the implementer looking at it as a simpler/faster operation
I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly
Yep for a substring against its parent or other substrings of the same parent that’s definitely true, but I think this question generalizes because the case where you’re comparing strings solely within themselves is an optimization path for the more general. I’m just thinking out loud.
> s.charAt(x) or s.codePointAt(x)
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.
Indeed. Or s.length, whatever that represents.
FWIW, the cheap lazy way to get "number of bytes in DB" from JS, is unescape(encodeURIComponent("ə̀")).length
What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?
In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.
No no, I want to create tomorrow's puzzle.
As the parent said:
> In the case of Wordle, you know the exact set of letters you’re going to be using
This holds for the generator side too. In fact, you have a fixed word list, and the fixed alphabet tells you what a "letter" is, and thus how to compute length. Because this concerns natural language, this will coincide with grapheme clusters, and with English Wordle, that will in turn correspond to byte length because it won't give you words with é (I think). In different languages the grapheme clusters might be larger than 1 byte (e.g. [1], where they're codepoints).
If you're playing at this level, you need to define:
- letter
- word
- 5 :P
Eh, in Macedonian they have some letters that in Russian are just two separate letters.
In German you have the same, only within one language. ß can be written as ss if it isn't available in a font, and only in 2017 did they add a capital version. So depending on the font and the Unicode version, the number of letters can differ.
"Traditionally, ⟨ß⟩ did not have a capital form, and was capitalized as ⟨SS⟩. Some type designers introduced capitalized variants. In 2017, the Council for German Orthography officially adopted a capital form ⟨ẞ⟩ as an acceptable variant, ending a long debate."
Thanks, that is interesting!
should "ß" == "ss" evaluate as true?
I don't see why it should. I also believe parent is wrong as there are unambiguous rules about when to use ß or ss.
Never thought of it, but maybe there are rules that allow the code point for ß to be visually presented as ss? At least (from experience as a user) there seems to be a singular "ss" codepoint.
>also believe parent is wrong as there are unambiguous rules about when to use ß or ss.
I never said it was ambiguous, I said it depends on the unicode version and the font you are using. How is that wrong? (Seems like the capital of ß is still SS in the latest unicode but since ẞ is the preferred capital version now this should change in the future)
ẞ is not the preferred capital version, it is an acceptable variant (according to the Council for German Orthography).
> How is that wrong?
Not sure where, how or if it's defined as part of Unicode, but so far I assumed that for a Unicode grapheme there exists a notion of what the visual representation should look like.
If Unicode still defines the capital of ß as SS, that's an error in Unicode due to slow adoption of the changes in the German language.
"ß as SS that's an error in Unicode"
It's not. Uppercase of ß has always been SS.
Before we had a separate codepoint in Unicode this caused problems with round-tripping between upper and lower case. So Unicode rightfully introduced a separate codepoint specifically for that use case in 2008.
This inspired designers to design a glyph for that codepoint looking similar to ß. Nothing wrong with that.
Some liked the idea and it got some foothold, so in 2017, the Council for German Orthography allowed it as an acceptable variant.
Maybe it will win, maybe not, but for now in standard German the uppercase of ß is still SS and Unicode rightfully reflects that.
In Unicode the default is still SS [1] while the Germans seem to have changed it to ẞ [2]. That means for now it's the same on every system, but once the Unicode standard changes and some systems get updated and others don't, there will be different behavior of len("ß".upper()) around.
I don't know how or if systems deal with this, but ß should be printed as ss if ß is unavailable in the font. It's possible this is completely up to the user.
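For what it's worth, this is how Python currently reflects those rules; the behaviour tracks the Unicode data tables, so it could change if the default mapping ever does:

s = "straße"
print(s.upper())                            # 'STRASSE' -- uppercasing ß still yields SS
print(len(s), len(s.upper()))               # 6 7 -- so length is not stable under .upper()
print("ß".casefold() == "ss".casefold())    # True -- case-insensitive matching treats them alike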
"In unicode the default is still SS [1] while the Germans seem to have changed it to ẞ [2]."
Where does the source corroborate that claim? Can you give us a hint where to find the source?
well I don't speak german, I was asking
I see, wasn't clear to me on what level you were asking. The letter ß has never been generally equivalent to ss in the German language.
From a user experience perspective though it might be beneficial to pretend that "ß" == "ss" holds when parsing user input.
Niße. ;)
FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.
It's not unlikely that what you would ideally use here is the number of grapheme clusters. What is the length of "ë"? Either 1 or 2 codepoints depending on the encoding (combining [1] or single codepoint [2]), and either 1 byte (Latin-1), 2 bytes (UTF-8 single-codepoint) or 3 bytes (UTF-8 combining).
The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.
What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?
With UTF-8 you can implement them on top of bytes.
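For example, a byte-level prefix check in Python is also a correct code point-level prefix check, because UTF-8 is self-synchronizing and prefix-preserving (this deliberately ignores normalization and case questions, which are separate problems):

words = ["käse", "katze", "hund"]
prefix = "kä".encode("utf-8")
print([w for w in words if w.encode("utf-8").startswith(prefix)])   # ['käse']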
That's basically what a string data type is for.
I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.
The whole point is that string length doesn’t necessarily give you the “length” of a “word”, and both of those terms are not well enough defined.
The point is that those terms are ambiguous ... and if you mean the length in grapheme clusters, it can be quite expensive to calculate it, and isn't the right number if you're dealing with strings as objects that are chunks of memory.
[flagged]
It is not possible to write correct code without understanding what you dismiss here.
If the validation rules don't specify (either explicitly or implicitly) what the "length" in the rule corresponds to (if it concerns a natural-language field, it's probably grapheme clusters), then either you should fix the rule, or you care only about checking the box of "I checked the validation rules", in which case it's a people problem and not a technology problem.
You are in the wrong job if you don’t want to think about “nerd shit” while programming.
Idk it pays my bills and I have success at work so I must do something right.
Well I just hope I don’t have to use any of the software you build. What a shameful attitude.
The whole industry is shameful considering devs regularly build software like Palantir.
It's all about paying your bills and I hope you too will realize that at some point.
I agree about Palantir but that’s a big deflection. And don’t patronise me. Paying your bills and paying enough attention to technical details to write good programs are in no way mutually exclusive.
I see where you're coming from, but I disagree on some specifics, especially regarding bytes.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), and it breaks as soon as you're dealing with anything beyond ASCII (which you really should, since East Asian users alone number in the billions).
Same goes to the "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
It's not rare at all - multi-code point emojis are pretty standard these days.
And before that the only thing the relative rarity did for you was that bugs with code working on UTF-8 bytes got fixed while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
I have wanted string length many times in production systems for language processing. And it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale what can happen if you unconsciously mix standards e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se, they are just one instance where modern implementations are all over the place.
There's an awful lot of text in here but I'm not seeing a coherent argument that Python's approach is the worst, despite the author's assertion. It especially makes no sense to me that counting the characters the implementation actually uses should be worse than counting UTF-16 code units, for an implementation that doesn't use surrogate pairs (and in fact only uses those code units to store out-of-band data via the "surrogateescape" error handler, or explicitly requested characters. N.B.: Lone surrogates are still valid characters, even though a sequence containing them is not a valid string.) JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might say "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
ironic that unicode is stripped out of the post's title here, making it very much wrong ;)
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
Ok, we've put Man Facepalming with Light Skin Tone back up there. I failed to find a way to avoid it.
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
That would be "\U0001F926\U0001F3FC\u200D\u2642\uFE0F" in Python's syntax, or "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}" in Rust or JavaScript.
Might be a little long for a title :)
Thanks! Your second option is almost identical to Mlller's (https://news.ycombinator.com/item?id=44988801) but the extra curly braces make it not fit. Seems like they're droppable for characters below U+FFFF, so I've squeezed it in above.
That works! (The braces are droppable for 16-bit codepoints in JS, but required in Rust.)
I can actually fit that within HN's 80 char limit without having to drop the "(2019)" bit at the end, so let's give it a try and see what happens... thanks!
An incredible retitle
Before it wasn't, about 1h ago it was showing me a proper emoji
Funny enough I clicked on the post wondering how it could possibly be that a single space was length 7.
Maybe it isn't a space, but a list of invisible Unicode chars...
It could also be a byte length of a 3 byte UTF-8 BOM and then some stupid space character like f09d85b3
It’s U+0020, a standard space character.
I did exactly the same, thinking that maybe it was invisible unicode characters or something I didn't know about.
Unintentional click-bait.
It can be many Zero-Width Space, or a few Hair-Width Space.
You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
You laugh but my typewriter could do half-br 40 years ago. Was used for typing super/subscript.
[deleted]
Stuff like this makes me so glad that in my world strings are ALWAYS ASCII and one char is always one byte. Unicode simply doesn't exist and all string manipulation can be done with a straightforward for loop or whatever.
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
That English can be well represented with ASCII may have contributed to America becoming an early computing powerhouse. You could actually do things like processing and sorting and doing case insensitive comparisons on data likes names and addresses very cheaply.
[deleted]
The article both argues that the "real" length from a user perspective is Extended Grapheme Clusters - and makes a case against using it, because it requires you to store the entire character database and may also change from one Unicode version to the next.
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right it doesn't say "codepoints" as an alternative solution. That was just my assumption as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
> it doesn't say "codepoints" as an alternative solution. That was just my assumption …
On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
> The problem will be the same if you have to reconstruct the grapheme clusters eventually.
In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.
> You don't want that if you e.g. have an index for fulltext search.
Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
Python does an exceptionally bad job. After dragging the community through a 15-year transition to Python 3 in order to "fix" Unicode, we ended up with support that's worse than in languages that simply treat strings as raw bytes.
Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
> which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding.
Which, to humor the parent, is also true of raw bytes strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
def foo(s: str) -> …:
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take
def foo(s: UnicodeWithBullshit) -> …:
> Python simply operates on arrays of codepoints
But most programmers think in arrays of grapheme clusters, whether they know it or not.
No, I'm not standing for that.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
(HN doesn't render the emoji in comments, it seems)
Why would I want this to be 17, if I'm representing strings as an array of code points, rather than UTF-8?
TXR Lisp:
1> (len " ")
5
2> (coded-length " ")
17
(Trust me when I say that the emoji was there when I edited the comment.)
The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
I haven't thought about this deeply, but it seems to me that the evolution of Unicode has left it unparseable (into extended grapheme clusters, which I guess are "characters") in a forwards-compatible way. If so, it seems like we need a new encoding which actually delimits these (just as UTF-8 delimits code points). Then the original sender determines what is a grapheme, and if they don't know, who does?
[deleted]
Worth giving Raku a shout out here... methods do what they say and you write what you mean. Really wish every other language would pinch the Str implementation from here, or at least the design.
$ raku
Welcome to Rakudo™ v2025.06.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2025.06.
[0] > " ".chars
1
[1] > " ".codes
5
[2] > " ".encode('UTF-8').bytes
17
[3] > " ".NFD.map(*.chr.uniname)
(FACE PALM EMOJI MODIFIER FITZPATRICK TYPE-3 ZERO WIDTH JOINER MALE SIGN VARIATION SELECTOR-16)
[deleted]
I run one of the many online word counting tools (WordCounts.com) which also does character counts. I have noticed that even Google Docs doesn't seem to use grapheme counts and will produce larger than expected counts for strings of emoji.
If you want to see a more interesting case than emoji, check out the Thai language. In Thai, vowels can appear before, after, above, or below the associated consonants, and sometimes on more than one side.
Fascinating and annoying problem, indeed. In Java, the correct way to iterate over the characters (Unicode scalar values) of a string is to use the IntStream provided by String::codePoints (since Java 8), but I bet 99.9999% of the existing code uses 16-bit chars.
This does not fix the problem. The emoji consists of multiple Unicode characters (in turn represented 1:1 by the integer "code point" values). There is much more to it than the problem of surrogate pairs.
I'd disagree that the number of Unicode scalars is useless (in the case of Python 3), but it's a very interesting article nonetheless. Too bad unicode.org decided to break all the URLs in the table at the end.
> We’ve seen four different lengths so far:
> Number of UTF-8 code units (17 in this case)
> Number of UTF-16 code units (7 in this case)
> Number of UTF-32 code units or Unicode scalar values (5 in this case)
> Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agreed to return the number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes that a string uses instead of the number of printable characters, we would not have the inconsistency between languages.
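For reference, the quoted counts can be reproduced in a few lines of Python (the escape sequence below spells out the facepalm emoji from the article):
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # the facepalm emoji's five code points
print(len(s.encode("utf-8")))                  # 17 UTF-8 code units (bytes)
print(len(s.encode("utf-16-le")) // 2)         # 7 UTF-16 code units
print(len(s))                                  # 5 code points / scalar values
# The grapheme cluster count (1) needs a segmentation library such as the third-party regex module.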
"number of bytes" is dependent on the text encoding.
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
> We would not have this problem if we all agree to return number of bytes instead.
I don't understand. It depends on the encoding, doesn't it?
How would that help? UTF-8, 16, and 32 languages would still report different numbers.
> if we all decided to report number of bytes that string used instead number of printable characters
But that isn't the same across all languages, or even across all implementations of the same language.
>Number of extended grapheme clusters (1 in this case)
Only if you are using a new enough version of Unicode. If you were using an older version, it is more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.
When I'm reading text on a screen, I very much am not reading bytes. This is obvious when you actually think about what 'text encoding' means.
You're not reading Unicode code points either, though. Your computer uses bytes; you read glyphs, which roughly correspond to Unicode extended grapheme clusters. Anything in between might look like the correct solution at first but is the wrong abstraction for almost everything.
You are right, but this just drives the point home.
I learned this recently when I encountered a bug due to cutting an emoji character in two, making it unable to render.
If you want to get the grapheme length in JavaScript, JavaScript now has Intl.Segmenter[^1][^2].
Another little thing: The post mentions that tag sequences are only used for the flags of England, Scotland, and Wales. Those are the only ones that are standard (RGI), but because it's clear how the mechanism would work for other subnational entities, some systems support other ones, such as US state flags! I don't recommend using these if you want other people to be able to see them, but...
[deleted]
[dead]
I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible glyph injection attacks cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?
Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.
Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
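To make the "relatively straightforward to parse" claim concrete, here is a rough sketch, in Python and purely illustrative (nobody's production parser), of the checks one decoding step needs:
def decode_one(buf, i):
    b0 = buf[i]
    if b0 < 0x80:                        # 1 byte: plain ASCII
        return b0, 1
    if b0 >> 5 == 0b110:                 # 2-byte sequence
        length, cp, lo = 2, b0 & 0x1F, 0x80
    elif b0 >> 4 == 0b1110:              # 3-byte sequence
        length, cp, lo = 3, b0 & 0x0F, 0x800
    elif b0 >> 3 == 0b11110:             # 4-byte sequence
        length, cp, lo = 4, b0 & 0x07, 0x10000
    else:
        raise ValueError("invalid leading byte")
    if i + length > len(buf):
        raise ValueError("truncated sequence")
    for b in buf[i + 1 : i + length]:
        if b >> 6 != 0b10:               # continuation bytes are 10xxxxxx
            raise ValueError("bad continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp < lo or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("overlong, surrogate, or out-of-range")
    return cp, length                    # decoded code point and bytes consumed
print(decode_one(b"\xf0\x9f\xa4\xa6", 0))   # (129318, 4) -- U+1F926 takes four bytes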
I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that)
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.
It even includes an optimized fast path for ASCII, and it works at compile-time as well.
Well it is a pretty old codebase, the whole project is written in C. I haven't done any Rust programming yet but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!
> You're also apparently insisting on three-letter variable names
Why are the arguments not three-letter though? I would feel terrible if that was my code.
It's just a convention I use for personal projects. Back when I started coding in C, people often just opted to go with one or two character variable names. I chose three for locally-scoped variables because it was usually enough to identify them in a recognizable fashion. The fixed-width nature of it all also made for less eye-clutter. As for function arguments, the fact that they were fully spelled out made it easier for API reference purposes. At the end of the day all that really matters is that you choose a convention and stick with it. For team projects they should be laid out early on and, as long as everyone follows them, the entire project will have a much better sense of consistency.
Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.
You could use a standard that always uses, e.g., 4 bytes per character; that is much easier to parse than UTF-8.
UTF-8 is so complicated because it wants to be backwards compatible with ASCII.
ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also
- requires less memory for most strings, particularly ones that are largely limited to ASCII, like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
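A small Python sketch of the alignment/self-synchronization point (illustrative only):
data = "na\u00EFve \U0001F926 text".encode("utf-8")
i = 3                                      # an arbitrary byte offset, here inside the two-byte 'ï'
while i > 0 and data[i] & 0xC0 == 0x80:    # continuation bytes look like 10xxxxxx
    i -= 1
print(data[i:].decode("utf-8"))            # decodes cleanly from the resynchronized position
print(data.find("\U0001F926".encode("utf-8")))   # 7 -- byte-level substring search just works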
It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.
That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.
That goes all the way back to the beginning.
Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.
True. But then again, backward compatibility isn't really that hard to do with ASCII because the MSB is always zero. The problem, I think, is that the original motivation which ultimately led to the complications we now see with UTF-8 was a desire to save a few bits here and there rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)
In this particular case it was simply a matter of not having enough corner cases defined. I was, however, using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc. for verification. The datasets were in a variety of languages (including emoji), and so I mistakenly thought I had covered all the bases.
But thank you for the link, it's turning out to be a very enjoyable read! There already seems to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
I think what you meant is we should all go back to the simplicity of Shift-JIS
Should have just gone with 32 bit characters and no combinations. Utter simplicity.
That would be extremely wasteful, every single text file would be 4x larger and I'm sure eventually it would not be enough anyway.
Maybe we should have just replaced ASCII, a horrible encoding where an entire 25% of it is wasted. And maybe we could have gotten a bit more efficiency by saying that instead of having both lower- and uppercase letters we just have one, and then have a modifier before it, saving a lot of space since most text could just be lowercase.
Yeah, that's how ASCII works… there's 1 bit for lower/upper case.
I think combining characters are a lot simpler than having every single combination ever.
Especially when you start getting into non latin-based languages.
What does "no combinations" mean?
Like, say, Ä: it might be either a single precomposed code point, or a combination of ¨ and A. Both are now supported, but if you can have more than two such things combining into one thing, it makes a mess.
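The two spellings can be inspected and reconciled with the standard library's unicodedata module (a small Python sketch):
import unicodedata
single = "\u00C4"        # 'Ä' as one precomposed code point
combined = "A\u0308"     # 'A' followed by COMBINING DIAERESIS
print(single == combined)                                  # False: different code point sequences
print(unicodedata.normalize("NFC", combined) == single)    # True after canonical composition
print(len(single), len(combined))                          # 1 2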
That's fundamental to the mission of Unicode because Unicode is meant to be compatible with all legacy character sets, and those character sets already included combining characters.
I think that string length is one of those things that people (including me) don't realise they never actually want. In a production system, I have never actually wanted string length. I have wanted:
- Number of bytes this will be stored as in the DB
- Number of monospaced font character blocks this string will take up on the screen
- Number of bytes that are actually being stored in memory
"String length" is just a proxy for something else, and whenever I'm thinking shallowly enough to want it (small scripts, mostly-ASCII, mostly-English, mostly-obvious failure modes, etc) I like grapheme cluster being the sensible default thing that people probably expect, on average.
Taking this one step further -- there's no such thing as the context-free length of a string.
Strings should be thought of more like opaque blobs, and you should derive their length exclusively in the context in which you intend to use it. It's an API anti-pattern to have a context-free length property associated with a string because it implies something about the receiver that just isn't true for all relevant usages and leads you to make incorrect assumptions about the result.
Refining your list, the things you usually want are:
- Number of bytes in a given encoding when saving or transmitting (edit: or more generally, when serializing).
- Number of code points when parsing.
- Number of grapheme clusters for advancing the cursor back and forth when editing.
- Bounding box in pixels or points for display with a given font.
Context-free length is something we inherited from ASCII where almost all of these happened to be the same, but that's not the case anymore. Unicode is better thought of as compiled bytecode than something you can or should intuit anything about.
It's like asking "what's the size of this JPEG." Answer is it depends, what are you trying to do?
"Unicode is JPG for ASCII" is an incredibly great metaphor.
size(JPG) == bytes? sectors? colors? width? height? pixels? inches? dpi?
> Number of code points when parsing.
You shouldn't really ever care about the number of code points. If you do, you're probably doing something wrong.
I really wish people would stop giving this bad advice, especially so stridently.
Like it or not, code points are how Unicode works. Telling people to ignore code points is telling people to ignore how data works. It's of the same philosophy that results in abstraction built on abstraction built on abstraction, with no understanding.
I vehemently dissent from this view.
You’re arguing against a strawman. The advice wasn’t to ignore learning about code points; it’s that if your solution to a problem involves reasoning about code points, you’re probably doing it wrong and are likely to make a mistake.
Trying to handle code points as atomic units fails even in trivial and extremely common cases like diacritics, before you even get to more complicated situations like emoji variants. Solving pretty much any real-world problem involving a Unicode string requires factoring in canonical forms, equivalence classes, collation, and even locale. Many problems can’t even be solved at the _character_ (grapheme) level—text selection, for example, has to be handled at the grapheme _cluster_ level. And even then you need a rich understanding of those graphemes to know whether to break them apart for selection (ligatures like fi) or keep them intact (Hangul jamo).
Yes, people should learn about code points. Including why they aren’t the level they should be interacting with strings at.
It’s a bit of a niche use case, but I use the codepoint counts in CRDTs for collaborative text editing.
Grapheme cluster counts can’t be used because they’re unstable across Unicode versions. Some algorithms use UTF8 byte offsets - but I think that’s a mistake because they make input validation much more complicated. Using byte offsets, there’s a whole lot of invalid states you can represent easily. Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint. Then inserting at position 2 is valid again. If you send me an operation which happened at some earlier point in time, I don’t necessarily have the text document you were inserting into handy. So figuring out if your insertion (and deletion!) positions are valid at all is a very complex and expensive operation.
Codepoints are way easier. I can just accept any integer up to the length of the document at that point in time.
> Eg maybe insert “a” at position 0 is valid, but inserting at position 1 would be invalid because it might insert in the middle of a codepoint.
You have the same problem with code points, it's just hidden better. Inserting "a" between U+0065 and U+0308 may result in a "valid" string but is still as nonsensical as inserting "a" between UTF-8 bytes 0xC3 and 0xAB.
This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
I hear your point, but invalid codepoint sequences are way less of a problem than strings with invalid UTF8. Text rendering engines deal with weird Unicode just fine. They have to since Unicode changes over time. Invalid UTF8 on the other hand is completely unrepresentable in most languages. I mean, unless you use raw byte arrays and convert to strings at the edge, but that’s a terrible design.
> This makes code points less suitable than UTF-8 bytes as mistakes are more likely to not be caught during development.
Disagree. Allowing 2 kinds of bugs to slip through to runtime doesn’t make your system more resilient than allowing 1 kind of bug. If you’re worried about errors like this, checksums are a much better idea than letting your database become corrupted.
For those following along: https://tomsmeding.com/unicode#U+65%20U+308%20U+20%20U+65%20...
ASCII is very convenient when it fits in the solution space (it’d better be, it was designed for a reason), but in the global international connected computing world it doesn’t fit at all. The problem is all the tutorials, especially low level ones, assume ASCII so 1) you can print something to the console and 2) to avoid mentioning that strings are hard so folks don’t get discouraged.
Notably Rust did the correct thing by defining multiple slightly incompatible string types for different purposes in the standard library and regularly gets flak for it.
> Notably Rust did the correct thing
In addition to separate string types, they have separate iterator types that let you explicitly get the value you want. So:
Really my only complaint is I don't think String.len() should exist; it's too ambiguous. We should have to explicitly state what we want/mean via the iterators. Similar to Java:
ugrapheme and ucwidth are one way to get the grapheme count from a string in Python.
It's probably possible to get the grapheme cluster count from a string containing emoji characters with ICU?
Any correctly designed grapheme cluster implementation handles emoji characters. It's part of the spec (says the guy who wrote a Unicode segmentation library for Rust).
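For Python specifically, one concrete option is the third-party regex module, whose \X pattern matches one extended grapheme cluster (a sketch; the exact count tracks whichever Unicode version the library ships):
import regex                                    # third-party: pip install regex
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"    # the facepalm emoji again
print(len(s))                                   # 5 code points
print(len(regex.findall(r"\X", s)))             # 1 grapheme cluster (with a recent Unicode database)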
> in the global international connected computing world it doesn’t fit at all.
Most people aren't living in that world. If you're working at Amazon or some business that needs to interact with many countries around the globe, sure, you have to worry about text encoding quite a bit. But the majority of software is being written for a much narrower audience, probably for one single language in one single country. There is simply no reason for most programmers to obsess over text encoding the way so many people here like to.
No one is "obsessing" over anything. The reality is there are very few cases where you can use a single 8-bit character set and not run into problems sooner or later. Say your software is used only in Greece, so you use ISO-8859-7 for Greek. That works fine, but now you want to talk to your customer Günther from Germany who has been living in Greece for the last five years, or Clément from France, or Seán from Ireland, and oops, you can't.
Even plain English text can't be represented with plain ASCII (although ISO-8859-1 goes a long way).
There are some cases where just plain ASCII is okay, but there are quite few of them (and even those are somewhat controversial).
The solution is to just use UTF-8 everywhere. Or maybe UTF-16 if you really have to.
Except, this is a response to emoji support, which does have encoding issues even if your user base is in the US and only speaks English. Additionally, it is easy to have issues with data that your users use from other sources via copy and paste.
Which audience makes it so you don’t have to worry about text encodings?
This is naive at best
Here's a better analogy: in the 70s "nobody planned" for names with 's in them. SQL injections, separators, "not in the alphabet", whatever. In the US. Where a lot of people with 's in their names live... Or double-barrelled names.
It's a much simpler problem and it still tripped up a lot of people.
And then you have to support a user with a "funny name" or a business with "weird characters", or you expand your startup to Canada/Mexico and lo and behold...
Yea, I cringe when I hear the phrase "special characters." They're only special because you, the developer, decided to treat them as special, and that's almost surely going to come back to haunt you at some point in the form of a bug.
> in the global international connected computing world it doesn’t fit at all.
I disagree. Not all text is human prose. For example, there is nothing wrong with a programming language that only allows ASCII in the source code, and there are many downsides to allowing non-ASCII characters outside string constants or comments.
This is American imperialism at its worst. I'm serious.
Lots of people around the world learn programming from sources in their native language, especially early in their career, or when software development is not their actual job.
Enforcing ASCII is the same as enforcing English. How would you feel if all cooking recipes were written in French? If all music theory was in Italian? If all industrial specifications were in German?
It's fine to have a dominant language in a field, but ASCII is a product of technical limitations that we no longer have. UTF-8 has been an absolute godsend for human civilization, despite its flaws.
Well I'm not American and I can tell you that we do not see English source code as imperialism.
In fact it's awesome that we have one common very simple character set and language that works everywhere and can do everything.
I have only encountered source code using my native language (German) in comments or variable names in highly unprofessional or awful software and it is looked down upon. You will always get an ugly mix and have to mentally stop to figure out which language a name is in. It's simply not worth it.
Please stop pushing this UTF-8 everywhere nonsense. Make it work great on interactive/UI/user facing elements but stop putting UTF-8-only restrictions in low-level software. Example: Copied a bunch of ebooks to my phone, including one with a mangled non-UTF-8 name. It was ridiculously hard to delete the file as most Android graphical and console tools either didn't recognize it or crashed.
> Please stop pushing this UTF-8 everywhere nonsense.
I was with you until this sentence. UTF-8 everywhere is great exactly because it is ASCII-compatible (i.e. all ASCII strings are automatically also valid UTF-8 strings, so UTF-8 is a natural upgrade path from ASCII). Both are just encodings for the same Unicode code points; ASCII just cannot go beyond the first 128 code points, but that's where UTF-8 comes in, in a way that's backward-compatible with ASCII, which is the one ingenious feature of the UTF-8 encoding.
I'm not advocating for ASCII-everywhere, I'm for bytes-everywhere.
And bytes can conveniently fit both ASCII and UTF-8.
If you want to restrict your programming language to ASCII for whatever reason, fine by me. I don't need "let wohnt_bei_Böckler_STRAẞE = ..." that much.
But if you allow full 8-bit bytes, please don't restrict them to UTF-8. If you need to gracefully handle non-UTF-8 sequences, graphically show the replacement character "�"; otherwise let them pass through unmodified. Just don't crash, show useless error messages or, in the worst case, try to "fix" it by mangling the data even more.
> "let wohnt_bei_Böckler_STRAẞE"
This string cannot be encoded as ASCII in the first place.
> But if you allow full 8-bit bytes, please don't restrict them to UTF-8
UTF-8 has no 8-bit restrictions... You can encode any 21-bit Unicode code point with UTF-8.
It sounds like you're confusing ASCII, Extended ASCII and UTF-8:
- ASCII: 7 bits per "character" (i.e. not able to encode international characters like äöü), but it maps directly onto the lowest 128 of the 21-bit Unicode code points (i.e. all ASCII character codes are also valid Unicode code points)
- Extended ASCII: 8 bits per "character", but the interpretation of the upper 128 values depends on a country-specific codepage (i.e. the interpretation of a byte value in the range 128..255 differs between countries, and this is what causes all the mess that's usually associated with "ASCII". But ASCII did nothing wrong; the problem is Extended ASCII, which allows you to 'encode' äöü with the German codepage but then shows different characters when displayed with a non-German codepage)
- UTF-8: a variable-width encoding for the full range of Unicode code points; it uses 1..4 bytes to encode one 21-bit Unicode code point, and the 1-byte encodings are identical to 7-bit ASCII (i.e. when the MSB of a byte in a UTF-8 string is not set, you can be sure that it is a character/code point in the ASCII range)
Out of those three, only Extended ASCII with codepages is 'deprecated' and should no longer be used, while ASCII and UTF-8 are both fine, since any valid ASCII-encoded string is indistinguishable from that same string encoded as UTF-8; ASCII has effectively been 'retconned' into UTF-8.
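The "retconned into UTF-8" point is easy to verify (a quick Python sketch):
s = "hello"
print(s.encode("ascii") == s.encode("utf-8"))   # True: identical bytes for pure ASCII text
print("äöü".encode("utf-8"))                    # b'\xc3\xa4\xc3\xb6\xc3\xbc' -- two bytes per character
print("äöü".encode("latin-1"))                  # b'\xe4\xf6\xfc' -- the old one-byte codepage world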
I’d go farther and say that extended ASCII was an unmitigated disaster of compatibility issues (not to mention that more than a few scripts still don’t fit in the available spaces of an 8-bit encoding). Those of us who were around for the pre-Unicode days understand what a mess it was (not to mention the lingering issues thanks to the much vaunted backwards compatibility of some operating systems).
I'm not GP, but I think you're completely missing their point.
The problem they're describing happens because file names (in Linux and Windows) are not text: in Linux (so Android) they're arbitrary sequences of bytes, and in Windows they're arbitrary sequences of UTF-16 code points not necessarily forming valid scalar values (for example, surrogate pairs can be present alone).
And yet, a lot of programs ignore that and insist on storing file names as Unicode strings, which mostly works (because users almost always name files by inputting text) until somehow a file gets written as a sequence of bytes that doesn't map to a valid string (i.e., it's not UTF-8 or UTF-16, depending on the system).
So what's probably happening in GP's case is that they managed somehow to get a file with a non-UTF-8-byte-sequence name in Android, and subsequently every App that tries to deal with that file uses an API that converts the file name to a string containing U+FFFD ("replacement character") when the invalid UTF-8 byte is found. So when GP tries to delete the file, the App will try to delete the file name with the U+FFFD character, which will fail because that file doesn't exist.
GP is saying that showing the U+FFFD character is fine, but the App should understand that the actual file name is not UTF-8 and behave accordingly (i.e. use the original sequence-of-bytes filename when trying to delete it).
Note that this is harder than it should be. For example, with the old Java API (from java.io[1]) that's impossible: if you get a `File` object from listing a directory and ask if it exists, you'll get `false` for GP's file, because the `File` object internally stores the file name as a Java string. To get the correct result, you have to use the new API (from java.nio.file[2]) using `Path` objects.
[1] https://developer.android.com/reference/java/io/File
[2] https://developer.android.com/reference/java/nio/file/Path
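The same mismatch is easy to reproduce from Python on a Linux filesystem (a sketch; the directory and file name here are made up):
import os
os.makedirs("demo", exist_ok=True)
raw = b"book-\xff.epub"                      # hypothetical name; \xff is not valid UTF-8
open(os.path.join(b"demo", raw), "w").close()
print(os.listdir(b"demo"))                   # [b'book-\xff.epub'] -- the exact on-disk bytes
print(os.listdir("demo"))                    # ['book-\udcff.epub'] -- surrogateescape'd str
os.remove(os.path.join(b"demo", raw))        # deleting via the raw bytes always works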
UTF-8 everywhere is not great, and UTF-8 in practice is hardly ASCII-compatible. UTF-8 in source code and file paths outside the pure ASCII range breaks a lot of things, especially on non-English systems, due to legacy dependencies, ironically.
Sure, it's backward compatible, as in ASCII handling codes work on systems with UTF-8 locales, but how important is that?
It's neither American nor imperialism -- those are both category mistakes.
Andreas Rumpf, the designer of Nim, is Austrian. All the keywords of Nim are in English, the library function names are in English, the documentation is in English, Rumpf's book Mastering Nim is in English, the other major book for the language, Nim In Action (written by Dominik Picheta, nationality unknown but not American) is in English ... this is not "American imperialism" (which is a real thing that I don't defend), it's for easily understandable pragmatic reasons. And the language parser doesn't disallow non-ASCII characters but it doesn't treat them linguistically, and it has special rules for casefolding identifiers that only recognize ASCII letters, hobbling the use of non-ASCII identifiers because case distinguishes between types and other identifiers. The reason for this lack of handling of Unicode linguistically is simply to make the lexer smaller and faster.
Actually, it would be great to have a lingua franca in every field that all participants can understand. Are you also going to complain that biologists and doctors are expected to learn some rudimentary Latin? English being dominant in computing is absolutely a strength, and we gain nothing by trying to combat that. Having support for writing your code in other languages is not going to change the fact that most libraries will use English, most documentation will be in English, and most people you can ask for help will understand English. If you want to participate and refuse to learn English you are only shooting yourself in the foot - and if you are going to learn English you may as well do it from the beginning. Also, due to the dominance of English and ASCII in computing history, most languages already have ASCII alternatives for their writing, so even if you need to refer to non-English names you can do that using only ASCII.
Well, the problem is that what you are advocating would also make knowing Latin a prerequisite for studying medicine, which it isn't anywhere. That's the equivalent. Doctors learn a (very limited) Latin vocabulary as they study and work.
You severely underestimate how far you can get without any real command of the English language. I agree that you can't become really good without it, just like you can't do haute cuisine without some French, but the English language is a huge and unnecessary barrier to entry that you would put in front of everyone in the world who isn't immersed in the language from an early age.
Imagine learning programming using only your high school Spanish. Good luck.
> Imagine learning programming using only your high school Spanish. Good luck.
This + translated materials + locally written books is how STEM fields work in East Asia, the odds of success shouldn't be low. There just needs to be enough population using your language.
Calm down, ASCII is a Unicode-compatible encoding for the first 128 Unicode code points (which map directly onto the entire ASCII range). If you need to go beyond that, just 'upgrade' to the UTF-8 encoding.
Unicode is essentially a superset of ASCII, and the UTF-8 encoding also contains ASCII as a compatible subset (i.e. for the first 128 Unicode code points, a UTF-8 encoded string is byte-for-byte identical to the same string encoded as ASCII).
Just don't use any of the Extended ASCII flavours (e.g. "8-bit ASCII with codepages") or any of the legacy 'national' multibyte encodings (Shift-JIS etc.), because that's how you get the infamous `?????` or `♥♥♥♥♥` mismatches which are commonly associated with 'ASCII' (but this is not ASCII, it's some flavour of Extended ASCII decoded with the wrong codepage).
I don’t see much difference between the amount of Italian you need for music and the amount of English you need for programming. You can have a conversation about it in your native language, but you’ll be using a bunch of domain-specific terms that may not be in your native language.
There was a time when most scientific literature was written in French. People learned French. Before that it was Latin. People learned Latin.
This is true, but it's important to recognize that this was because of the French (Napoleon) and Roman empires, and Christianity, just as the brutal American and UK empires created these circumstances today.
The napoleonic empire lasted about 15 years, so that's a bit of a stretch.
More relevantly though, good things can come from people who also did bad things; this isn't to justify doing bad things in hopes something good also happens, but it doesn't mean we need to ideologically purge good things based on their creators.
ASCII is totally fine as an encoding for the first 128 Unicode code points. If you need to go above those 128 code points, use a different encoding like UTF-8.
Just never, ever use Extended ASCII (8 bits with codepages).
Python 3 deals with this reasonably sensibly, too, I think. It uses UTF-8 by default, but allows you to specify other encodings.
Python 3 internally uses UTF-32. When exchanging data with the outside world, it uses the "default encoding" which it derives from various system settings. This usually ends up being UTF-8 on non-Windows systems, but on weird enough systems (and almost always on Windows), you can end up with a default encoding other than UTF-8. "UTF-8 mode" (https://peps.python.org/pep-0540/) fixes this but it's not yet enabled by default (this is planned for Python 3.15).
Apparently Python uses a variety of internal representations depending on the string itself. I looked it up because I saw UTF-32 and thought there's no way that's what they do -- it's pretty much always the wrong answer.
It uses Latin-1 for ASCII strings, UCS-2 for strings that contain code points in the BMP and UCS-4 only for strings that contain code points outside the BMP.
It would be pretty silly for them to explode all strings to 4-byte characters.
You are correct. Discussions of this topic tend to be full of unvalidated but confidently stated assertions, like "Python 3 internally uses UTF-32." Also unjustified assertions, like the OP's claim that len(" ") == 5 is "rather useless" and that "Python 3’s approach is unambiguously the worst one". Unlike in many other languages, the code points in Python's strings are always directly O(1) indexable--which can be useful--and the subject string has 5 indexable code points. That may not be the semantics that someone is looking for in a particular application, but it certainly isn't useless. And given the Python implementation of strings, the only other number that would be useful would be the number of grapheme clusters, which in this case is 1, and that count can be obtained via the grapheme or regex modules.
It conceptually uses arrays of code points, which need up to 24 bits. Optimizing the storage to use smaller integers when possible is an implementation detail.
Python 3 is specified to use arrays of 8-, 16-, or 32-bit units, depending on the largest code point in the string. As a result, all code points in all strings are O(1) indexable. The claim that "Python 3 internally uses UTF-32" is simply false.
> code points, which need up to 24 bits
They need at most 21 bits. The bits may only be available in multiples of 8, but the implementation also doesn't byte-pack them into 24-bit units, so that's moot.
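Those 1-, 2- and 4-byte storage classes are visible from sys.getsizeof (a rough sketch; the exact overheads vary by CPython version):
import sys
ascii_s  = "a" * 1000
bmp_s    = "\u20AC" * 1000       # EURO SIGN: needs 2 bytes per character
astral_s = "\U0001F926" * 1000   # outside the BMP: needs 4 bytes per character
print(sys.getsizeof(ascii_s), sys.getsizeof(bmp_s), sys.getsizeof(astral_s))
# The three sizes grow at roughly 1, 2 and 4 bytes per character, per PEP 393.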
I prefer languages where strings are simply sequences of bytes and you get to decide how to interpret them.
Such languages do not have strings. Definitionally a string is a sequence of characters, and more than 256 characters exist. A byte sequence is just an encoding; if you are working with that encoding directly and have to do the interpretation yourself, you are not using a string.
But if you do want a sequence of bytes for whatever reason, you can trivially obtain that in any version of Python.
My personal experience with Python 3 (and repeated interactions with about a dozen Python programmers, including core contributors) is that Python 3 does not let you trivially work with streams of bytes, especially if you need to do character set conversions, since a tiny Python 2 script that I have used for decades for conversion of character streams in terminals has proved to be repeatedly unportable to Python 3. The last attempt was much larger, still failed, and they thought they could probably do it, but it would require far more code and was not worth their effort.
I'll probably just use rust for that script if python2 ever gets dropped by my distro. Reminds me of https://gregoryszorc.com/blog/2020/01/13/mercurial%27s-journ...
> a tiny python2 script that I have used for decades for conversion of character streams in terminals has proved to be repeated unportable to python3.
Show me.
Heh. It always starts this way... then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing" or "I could get this working but it isn't worth the effort". But sure, let's do this one more time. Could be they were all missing something obvious - wouldn't know, I avoid Python personally, apart from when necessary, like with LLM glue. https://pastebin.com/j4Lzb5q1
This is a script created by someone on #nethack a long time ago. It works great with other things as well like old BBS games. It was intended to transparently rewrite single byte encodings to multibyte with an optional conversion array.
> then they confidently send me something that breaks on testing it, then half a dozen more iterations, then "python2 is doing the wrong thing or, 'I could get this working but it isn't worth the effort'"
It almost works as-is in my testing. (By the way, there's a typo in the usage message.) Here is my test process:
I suppressed random output of C0 control characters to avoid messing up my terminal, but I added a test that basic ANSI escape sequences can work through this. (My initial version of this didn't flush the output, which mistakenly led me to try a bunch of unnecessary things in the main script.)
After fixing the `print` calls, the only thing I was forced to change (although I would do the code differently overall) is the output step:
I've tried this out locally (in gnome-terminal) with no issue. (I also compared to the original; I have a local build of 2.7 and adjusted the shebang appropriately.) There's a warning that `bufsize=1` no longer actually means a byte buffer of size 1 for reading (instead it's magically interpreted as a request for line buffering), but this didn't cause a failure when I tried it. (And setting the size to e.g. `2` didn't break things, either.)
I also tried having my test process read from standard input; the handling of ctrl-C and ctrl-D seems to be a bit different (and in general, setting up a Python process to read unbuffered bytes from stdin isn't the most fun thing), but I generally couldn't find any issues here, either. Which is to say, the problems there are in the test process, not in `ibmfilter`. The input is still forwarded to, and readable from, the test process via the `Popen` object. And any problems of this sort are definitely still fixable, as demonstrated by the fact that `curses` is still in the standard library.
Of course, keys in the `special` mapping need to be defined as bytes literals now. Although that could trivially be adapted if you insist.
I would like a UTF-8-optimized bag of bytes where arbitrary byte operations are possible but the buffer keeps track of whether it is valid UTF-8 or not (for every edit of n bytes it should be enough to check about n+8 bytes to validate). Then UTF-8 encoding/decoding becomes a no-op and UTF-8-specific APIs can quickly check whether the string is malformed or not.
But why care if it's malformed UTF-8? And specifically, what do you want to happen when you get a malformed UTF-8 string? Keep in mind that UTF-8 is self-synchronizing, so even if you encode strings into a larger text-based format without verifying them it will still be possible to decode the document. As a user I normally want my programs to pass on the string without mangling it further. Some tool throwing fatal errors because some string I don't actually care about contains an invalid UTF-8 byte sequence is the last thing I want. With strings being an arbitrary bag of bytes, many programs can support arbitrary encodings, or at least arbitrary ASCII supersets, without any additional effort.
The main issue I can see is not garbage bytes in text but the mixing of incompatible encodings, e.g. splicing Latin-1 bytes into a UTF-8 string.
My understanding of the current "always and only UTF-8/Unicode" zeitgeist is that it comes mostly from encoding issues, among which is the complexity of detecting encodings.
I think that the current status quo is better than what came before, but I also think it could be improved.
Me too.
The languages that I really don't get are those that force valid UTF-8 everywhere but don't enforce NFC. Which is most of them, but it seems like the worst of both worlds.
Non-normalized Unicode is just as problematic as non-validated Unicode, IMO.
Python has byte arrays that allow for that, in addition to strings consisting of arrays of Unicode code points.
Yes, I always roll my eyes when people complain that C strings or C++'s std::string/string_view don't have Unicode support. They are bags of bytes with support for concatenation. Any other transformation isn't going to have a "correct" way to do it so you need to be aware of what you want anyway.
C strings are not bags of bytes because they can't contain 0x00.
It's definitely worth thinking about the real problem, but I wouldn't say it's never helpful.
The underlying issue is unit conversion. "length" is a poor name because it's ambiguous. Replacing "length" with three functions - "lengthInBytes", "lengthInCharacters", and "lengthCombined" - would make it a lot easier to pick the right thing.
I have never wanted any of the things you said. I have, on the other hand, always wanted the string length. I'm not saying that we shouldn't have methods like what you state, we should! But your statement that people don't actually want string length is untrue because it's overly broad.
> I have, on the other hand, always wanted the string length.
In an environment that supports advanced Unicode features, what exactly do you do with the string length?
I don’t know about advanced Unicode features… but I use them all the time as a backend developer to validate data input.
I want to make sure that the password is between a given minimum and maximum number of characters. Same with phone numbers, email addresses, etc.
This seems to have always been known as the length of the string.
This thread sounds like a bunch of scientists trying to make a simple concept a lot harder to understand.
> I want to make sure that the password is between a given number of characters. Same with phone numbers, email addresses, etc.
> This seems to have always been known as the length of the string.
Sure. And by this definition, the string discussed in TFA (that consists of a facepalm emoji with a skin tone set) objectively has 5 characters in it, and therefore a length of 5. And it has always had 5 characters in it, since it was first possible to create such a string.
Similarly, "é" has one character in it, but "é" has two despite appearing visually identical. Furthermore, those two strings will not compare equal in any sane programming language without explicit normalization (unless HN's software has normalized them already). If you allow passwords or email addresses to contain things like this, then you have to reckon with that brute fact.
None of this is new. These things have fundamentally been true since the introduction of Unicode in 1991.
Which length? Bytes? Code points? Graphemes? Pixels?
> Number of monospaced font character blocks this string will take up on the screen
Even this has to deal with the halfwidth/fullwidth split in CJK. Even worse, Devanagari has complex rendering rules that actually depend on font choices. AFAIU, the only globally meaningful category here is rendered bounding box, which is obviously font-dependent.
But I agree with the general sentiment. What we really care about is how much space these text blobs take up, whether that be in a DB, in memory, or on the screen.
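The halfwidth/fullwidth split mentioned above is queryable from the standard library (a small Python sketch; actual terminal rendering may still vary):
import unicodedata
for ch in ("A", "\uFF71", "\u30A2", "\u6F22"):      # A, halfwidth ｱ, fullwidth ア, 漢
    print(ch, unicodedata.east_asian_width(ch))
# 'Na', 'H', 'W', 'W' -- 'W' (wide) characters typically occupy two monospaced cells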
It gets more complicated if you do substring operations.
If I do s.charAt(x) or s.codePointAt(x) or s.substring(x, y), I'd like to know which values for x and y are valid and which aren't.
Substring operations (and more generally the universe of operations where there is more than one string involved) are a whole other kettle of fish. Unicode, being a byte code format more than what you think of as a logical 'string' format, has multiple ways of representing the same strings.
If you take a substring of a(bc) and compare it to string (bc) are you looking for bitwise equivalence or logical equivalence? If the former it's a bit easier (you can just memcmp) but if the latter you have to perform a normalization to one of the canonical forms.
"Unicode, being a byte code format"
UTF-8 is a byte code format; Unicode is not. In Python, where all strings are arrays of Unicode code points, substrings are likewise arrays of Unicode code points.
The point is that not all sequences of characters ("code point" means the integer value, whereas "character" means the thing that number represents) are valid.
Non sequitur... I simply pointed out a mistaken claim, and your comment is about something quite different.
(Also that's not what "character" means in the Unicode framework--some code points correspond to characters and some don't.)
I’m fairly positive the answer is trivially logical equivalence for pretty much any substring operation. I can’t imagine bitwise equivalence to ever be the “normal” use case, except to the implementer looking at it as a simpler/faster operation
I feel like if you’re looking for bitwise equivalence or similar, you should have to cast to some kind of byte array and access the corresponding operations accordingly
Yep, for a substring against its parent or other substrings of the same parent that's definitely true, but I think this question generalizes, because the case where you're comparing strings solely within themselves is an optimization path for the more general one. I'm just thinking out loud.
> s.charAt(x) or s.codePointAt(x)
Neither of these are really useful unless you are implementing a font renderer or low level Unicode algorithm - and even then you usually only want to get the next code point rather than one at an arbitrary position.
The values for x and y shouldn't come from your brain, though (with the exception of 0). They should come from previous index operations like s.indexOf(...) or s.search(regex), etc.
Indeed. Or s.length, whatever that represents.
FWIW, the cheap, lazy way to get "number of bytes in DB" from JS is unescape(encodeURIComponent("ə̀")).length
What if you need to find 5 letter words to play wordle? Why do you care how many bytes they occupy or how large they are on screen?
In the case of Wordle, you know the exact set of letters you’re going to be using, which easily determines how to compute length.
No no, I want to create tomorrow's puzzle.
As the parent said:
> In the case of Wordle, you know the exact set of letters you’re going to be using
This holds for the generator side too. In fact, you have a fixed word list, and the fixed alphabet tells you what a "letter" is, and thus how to compute length. Because this concerns natural language, this will coincide with grapheme clusters, and with English Wordle, that will in turn correspond to byte length because it won't give you words with é (I think). In different languages the grapheme clusters might be larger than 1 byte (e.g. [1], where they're codepoints).
If you're playing at this level, you need to define:
- letter
- word
- 5 :P
Eh, in Macedonian they have some letters that in Russian are just two separate letters.
In German you have the same thing within one language: ß can be written as ss if it isn't available in a font, and only in 2017 did they add a capital version. So depending on the font and the Unicode version, the number of letters can differ.
"Traditionally, ⟨ß⟩ did not have a capital form, and was capitalized as ⟨SS⟩. Some type designers introduced capitalized variants. In 2017, the Council for German Orthography officially adopted a capital form ⟨ẞ⟩ as an acceptable variant, ending a long debate."
Thanks, that is interesting!
should "ß" == "ss" evaluate as true?
I don't see why it should. I also believe the parent is wrong, as there are unambiguous rules about when to use ß or ss.
Never thought of it, but maybe there are rules that allow visually presenting the code point for ß as ss? At least (from experience as a user) there seems to be a single "ss" code point.
>also believe parent is wrong as there are unambiguous rules about when to use ß or ss.
I never said it was ambiguous; I said it depends on the Unicode version and the font you are using. How is that wrong? (It seems like the capital of ß is still SS in the latest Unicode, but since ẞ is the preferred capital version now, this should change in the future.)
ẞ is not the preferred capital version, it is an acceptable variant (according to the Council for German Orthography).
> How is that wrong?
Not sure where, how or if it's defined as part of Unicode, but so far I assumed that for a Unicode grapheme there exists a notion of what the visual representation should look like. If Unicode still defines the capital of ß as SS, that's an error in Unicode due to slow adoption of the changes in the German language.
"ß as SS that's an error in Unicode"
It's not. Uppercase of ß has always been SS.
Before we had a separate codepoint in Unicode this caused problems with round-tripping between upper and lower case. So Unicode rightfully introduced a separate codepoint specifically for that use case in 2008.
This inspired designers to design a glyph for that codepoint looking similar to ß. Nothing wrong with that.
Some liked the idea and it got some foothold, so in 2017, the Council for German Orthography allowed it as an acceptable variant.
Maybe it will win, maybe not, but for now in standard German the uppercase of ß is still SS and Unicode rightfully reflects that.
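For anyone curious, the round-tripping problem is easy to see in a language that implements the standard Unicode case mappings, e.g. Python:
print("ß".upper())                 # 'SS' -- the standard uppercase mapping
print("ß".upper().lower())         # 'ss' -- the ß is lost on the round trip
print("\u1E9E")                    # 'ẞ'  -- LATIN CAPITAL LETTER SHARP S (added 2008)
print("\u1E9E".lower())            # 'ß'
print("straße".casefold() == "STRASSE".casefold())   # True -- caseless matching treats ß as ss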
In Unicode the default is still SS [1], while the Germans seem to have changed it to ẞ [2]. That means that for now it's the same on every system, but once the Unicode standard changes and some systems get updated while others don't, there will be different behaviors of len("ß".upper()) around.
I don't know how or if systems deal with this, but ß should be printed as ss if ß is unavailable in the font. It's possible this is completely up to the user.
[1] https://unicode.org/faq/casemap_charprop.html [2] https://www.rechtschreibrat.com/DOX/RfdR_Amtliches-Regelwerk...
"In unicode the default is still SS [1] while the Germans seem to have changed it to ẞ [2]."
Where does the source corroborate that claim? Can you give us a hint where to find the source?
well I don't speak german, I was asking
I see, it wasn't clear to me on what level you were asking. The letter ß has never been generally equivalent to ss in the German language.
From a user experience perspective though it might be beneficial to pretend that "ß" == "ss" holds when parsing user input.
Niße. ;)
FWIW, I frequently want the string length. Not for anything complicated, but our authors have ranges of characters they are supposed to stay in. Luckily no one uses emojis or weird unicode symbols, so in practice there’s no problem getting the right number by simply ignoring all the complexities.
It's not unlikely that what you would ideally use here is the number of grapheme clusters. What is the length of "ë"? Either 1 or 2 codepoints depending on the encoding (combining [1] or single codepoint [2]), and either 1 byte (Latin-1), 2 bytes (UTF-8 single-codepoint) or 3 bytes (UTF-8 combining).
The metrics you care about are likely number of letters from a human perspective (1) or the number of bytes of storage (depends), possibly both.
[1]: https://tomsmeding.com/unicode#U+65%20U+308 [2]: https://tomsmeding.com/unicode#U+EB
What about implementing text algorithms like prefix search or a suffix tree to mention the simplest ones? Don't you need a string length at various points there?
With UTF-8 you can implement them on top of bytes.
That's basically what a string data type is for.
I actually want string length. Just give me the length of a word. My human brain wants a human way to think about problems. While programming I never think about bytes.
The whole point is that string length doesn’t necessarily give you the “length” of a “word”, and both of those terms are not well enough defined.
The point is that those terms are ambiguous ... and if you mean the length in grapheme clusters, it can be quite expensive to calculate it, and isn't the right number if you're dealing with strings as objects that are chunks of memory.
[flagged]
It is not possible to write correct code without understanding what you dismiss here.
If the validation rules don't specify (either explicitly or implicitly) what the "length" in the rule corresponds to (if it concerns a natural-language field, it's probably grapheme clusters), then either you should fix the rule, or you care only about checking the box of "I checked the validation rules", in which case it's a people problem and not a technology problem.
You are in the wrong job if you don’t want to think about “nerd shit” while programming.
Idk it pays my bills and I have success at work so I must do something right.
Well I just hope I don’t have to use any of the software you build. What a shameful attitude.
The whole industry is shameful, considering devs regularly build software like Palantir. It's all about paying your bills, and I hope you too will realize that at some point.
I agree about Palantir but that’s a big deflection. And don’t patronise me. Paying your bills and paying enough attention to technical details to write good programs are in no way mutually exclusive.
I see where you're coming from, but I disagree on some specifics, especially regarding bytes.
Most people care about the length of a string in terms of the number of characters.
Treating it as a proxy for the number of bytes has been incorrect ever since UTF-8 became the norm (basically forever), at least if you're dealing with anything beyond ASCII (which you really should be, since East Asian users alone number in the billions).
The same goes for "string width".
Yes, Unicode scalar values can combine into a single glyph and cause discrepancies, as the article mentions, but that is a much rarer edge case than simply handling non-ASCII text.
It's not rare at all - multi-code point emojis are pretty standard these days.
And before that, the only thing the relative rarity did for you was that bugs in code working on UTF-8 bytes got fixed, while bugs that assumed UTF-16 units or 32-bit code points represent a character were left to linger for much longer.
I have wanted string length many times in production systems for language processing, and it is perfectly fine as long as whatever you are using is consistent. I rarely care how many bytes an emoji actually is unless I'm worried about extreme efficiency in storage, or how many monospace characters it uses unless I do very specific UI things. This blog is more of a cautionary tale about what can happen if you unconsciously mix standards, e.g. by using one in the backend and another in the frontend. But this is not a problem of string lengths per se; they are just one instance where modern implementations are all over the place.
There's an awful lot of text in here but I'm not seeing a coherent argument that Python's approach is the worst, despite the author's assertion. It especially makes no sense to me that counting the characters the implementation actually uses should be worse than counting UTF-16 code units, for an implementation that doesn't use surrogate pairs (and in fact only uses those code units to store out-of-band data via the "surrogateescape" error handler, or explicitly requested characters. N.B.: Lone surrogates are still valid characters, even though a sequence containing them is not a valid string.) JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
> JavaScript is compelled to count UTF-16 code units because it actually does use UTF-16. Python's flexible string representation is a space optimization; it still fundamentally represents strings as a sequence of characters, without using the surrogate-pair system.
Python's flexible string system has nothing to do with this. Python could easily have had len() return the byte count, even the USV count, or other vastly more meaningful metrics than "5", whose unit is so disastrous I can't put a name to it. It's not bytes, it's not UTF-16 code units, it's not anything meaningful, and that's the problem. In particular, the USV count would have been made easy (O(1) easy!) by Python's flexible string representation.
You're handwaving it away in your writing by calling it a "character in the implementation", but what is a character? It's not a character in any sense a normal human would recognize — like a grapheme cluster — as I think if I asked a human "how many characters is <imagine this is man with skin tone face palming>?", they'd probably say "well, … IDK if it's really a character, but 1, I suppose?" …but "5" or "7"? Where do those even come from? An astute person might guess "Oh, perhaps that takes more than one byte, is that its size in memory?" Nope. Again: "character in the implementation" is a meaningless concept. We've assigned words to a thing to make it sound meaningful, but that is like definitionally begging the question here.
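For anyone who wants to reproduce the disputed numbers, here is a quick sketch of my own (not from either commenter), using the escape sequence for the facepalm emoji given elsewhere in the thread:

    s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"   # man facepalming: light skin tone
    print(len(s))                           # 5  -- Unicode scalar values (code points)
    print(len(s.encode("utf-16-le")) // 2)  # 7  -- UTF-16 code units, what JS .length counts
    print(len(s.encode("utf-8")))           # 17 -- UTF-8 bytes
    # The grapheme-cluster count of 1 needs the Unicode segmentation rules,
    # e.g. via the third-party regex module: len(regex.findall(r"\X", s))
    # (assuming a build with recent enough Unicode data).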
Related. Others? (Also, anybody know the answer to https://news.ycombinator.com/item?id=44987514?)
It’s not wrong that " ".length == 7 (2019) - https://news.ycombinator.com/item?id=36159443 - June 2023 (303 comments)
String length functions for single emoji characters evaluate to greater than 1 - https://news.ycombinator.com/item?id=26591373 - March 2021 (127 comments)
String Lengths in Unicode - https://news.ycombinator.com/item?id=20914184 - Sept 2019 (140 comments)
https://news.ycombinator.com/item?id=27529697
Ironic that the unicode got stripped out of the post's title here, making it very much wrong ;)
for context, the actual post features an emoji with multiple unicode codepoints in between the quotes
Ok, we've put Man Facepalming with Light Skin Tone back up there. I failed to find a way to avoid it.
Is there a way to represent this string with escaped codepoints? It would be both amusing and in HN's plaintext spirit to do it that way in the title above, but my Unicode is weak.
That would be "\U0001F926\U0001F3FC\u200D\u2642\uFE0F" in Python's syntax, or "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}" in Rust or JavaScript.
Might be a little long for a title :)
Thanks! Your second option is almost identical to Mlller's (https://news.ycombinator.com/item?id=44988801) but the extra curly braces make it not fit. Seems like they're droppable for characters below U+FFFF, so I've squeezed it in above.
That works! (The braces are droppable for 16-bit codepoints in JS, but required in Rust.)
That would be …
… for Javascript.
I can actually fit that within HN's 80 char limit without having to drop the "(2019)" bit at the end, so let's give it a try and see what happens... thanks!
An incredible retitle
Before it wasn't; about an hour ago it was showing me a proper emoji.
Funny enough I clicked on the post wondering how it could possibly be that a single space was length 7.
Maybe it isn't a space, but a list of invisible Unicode chars...
It could also be the byte length of a 3-byte UTF-8 BOM and then some stupid space character like f09d85b3.
It’s U+0020, a standard space character.
I did exactly the same, thinking that maybe it was invisible unicode characters or something I didn't know about.
Unintentional click-bait.
It can be many Zero-Width Spaces, or a few Hair-Width Spaces.
You never know, when you don’t know CSS and try to align your pixels with spaces. Some programmers should start a trend where 1 tab = 3 hairline-width spaces (smaller than 1 char width).
Next up: The <half-br/> tag.
You laugh but my typewriter could do half-br 40 years ago. Was used for typing super/subscript.
Stuff like this makes me so glad that in my world strings are ALWAYS ASCII and one char is always one byte. Unicode simply doesn't exist and all string manipulation can be done with a straightforward for loop or whatever.
Dealing with wide strings sounds like hell to me. Right up there with timezones. I'm perfectly happy with plain C in the embedded world.
That English can be well represented with ASCII may have contributed to America becoming an early computing powerhouse. You could actually do things like processing, sorting, and case-insensitive comparisons on data like names and addresses very cheaply.
The article both argues that the "real" length from a user perspective is Extended Grapheme Clusters - and makes a case against using it, because it requires you to store the entire character database and may also change from one Unicode version to the next.
Therefore, people should use codepoints for things like length limits or database indexes.
But wouldn't this just move the "cause breakage with new Unicode version" problem to a different layer?
If a newer Unicode version suddenly defines some sequences to be a single grapheme cluster where there were several ones before and my database index now suddenly points to the middle of that cluster, what would I do?
Seems to me, the bigger problem is with backwards compatibility guarantees in Unicode. If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
What do you mean by "use codepoints for ... database indexes"? I feel like you are drawing conclusions that the essay does not propose or support. (It doesn't say that you should use codepoints for length limits.)
> If the standard is continuously updated and they feel they can just make arbitrary changes to how grapheme clusters work at any time, how is any software that's not "evergreen" (I.e. forces users onto the latest version and pretends older versions don't exist) supposed to deal with that?
Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
I was referring to this part, in "Shouldn’t the Nudge Go All the Way to Extended Grapheme Clusters?":
"For example, the Unicode version dependency of extended grapheme clusters means that you should never persist indices into a Swift strings and load them back in a future execution of your app, because an intervening Unicode data update may change the meaning of the persisted indices! The Swift string documentation does not warn against this.
You might think that this kind of thing is a theoretical issue that will never bite anyone, but even experts in data persistence, the developers of PostgreSQL, managed to make backup restorability dependent on collation order, which may change with glibc updates."
You're right, it doesn't say "codepoints" as an alternative solution. That was just my assumption, as it would be the closest representation that does not depend on the character database.
But you could also use code units, bytes, whatever. The problem will be the same if you have to reconstruct the grapheme clusters eventually.
> Why would software need to have a permanent, durable mapping between a string and the number of grapheme clusters that it contains?
Because splitting a grapheme cluster in half can change its semantics. You don't want that if you e.g. have an index for fulltext search.
> it doesn't say "codepoints" as an alternative solution. That was just my assumption …
On the contrary, the article calls code point indexing “rather useless” in the subtitle. Code unit indexing is the appropriate technique. (“Byte indexing” generally implies the use of UTF-8 and in that context is more meaningfully called code unit indexing. But I just bet there are systems out there that use UTF-16 or UTF-32 and yet use byte indexing.)
> The problem will be the same if you have to reconstruct the grapheme clusters eventually.
In practice, you basically never do. Only your GUI framework ever does, for rendering the text and for handling selection and editing. Because that’s pretty much the only place EGCs are ever actually relevant.
> You don't want that if you e.g. have an index for fulltext search.
Your text search won’t be splitting by grapheme clusters, it’ll be doing word segmentation instead.
Python does an exceptionally bad job. After dragging the community through a 15-year transition to Python 3 in order to "fix" Unicode, we ended up with support that's worse than in languages that simply treat strings as raw bytes.
Some other fun examples: https://gist.github.com/ozanmakes/0624e805a13d2cebedfc81ea84...
Yeah I have no idea what is wrong with that. Python simply operates on arrays of codepoints, which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding. This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
> which are a stable representation that can be converted to a bunch of encodings including "proper" utf-8, as long as all codepoints are representable in that encoding.
Which, to humor the parent, is also true of raw byte strings. One of the (valid) points raised by the gist is that `str` is not infallibly encodable to UTF-8, since it can contain values that are not valid Unicode.
> This also allows you to work with strings that contain arbitrary data falling outside of the unicode spectrum.
If I write,
… I want the input string to be Unicode. If I need "Unicode, or maybe with bullshit mixed in", that can be a different type, and then I can take …
> Python simply operates on arrays of codepoints
But most programmers think in arrays of grapheme clusters, whether they know it or not.
No, I'm not standing for that.
Python does it correctly and the results in that gist are expected. Characters are not grapheme clusters, and not every sequence of characters is valid. The ability to store unpaired surrogate characters is a feature: it would take extra time to validate this when it only really matters at encoding time. It also empowers the "surrogateescape" error handler, that in turn makes it possible to supply arbitrary bytes in command line arguments, even while providing strings to your program which make sense in the common case. (Not all sequences of bytes are valid UTF-8; the error handler maps the invalid bytes to invalid unpaired surrogates.) The same character counts are (correctly) observed in many other programming languages; there's nothing at all "exceptional" about Python's treatment.
It's not actually possible to "treat strings as raw bytes", because they contain more than 256 possible distinct symbols. They must be encoded; even if you assume an ecosystem-wide encoding, you are still using that encoding. But if you wish to work with raw sequences of bytes in Python, the `bytes` type is built-in and trivially created using a `b'...'` literal, or various other constructors. (There is also a mutable `bytearray` type.) These types now correctly behave as a sequence of byte (i.e., integer ranging 0..255 inclusive) values; when you index them, you get an integer. I have personal experience of these properties simplifying and clarifying my code.
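A minimal sketch of the str/bytes split and the surrogateescape round trip described above (standard library only; the sample bytes are made up for illustration):

    raw = b"caf\xe9"                                   # 0xE9 on its own is not valid UTF-8
    s = raw.decode("utf-8", errors="surrogateescape")  # invalid byte -> lone surrogate
    print(ascii(s))                                    # 'caf\udce9'
    print(s.encode("utf-8", errors="surrogateescape") == raw)   # True: lossless round trip
    print(b"abc"[0])                                   # 97 -- indexing bytes yields integers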
Unicode was fixed (no quotation marks), with the result that you now have clearly distinct types that honour the Zen of Python principle that "explicit is better than implicit", and no longer get `UnicodeDecodeError` from attempting an encoding operation or vice-versa. (This problem spawned an entire family of very popular and very confused Stack Overflow Q&As, each with probably countless unrecognized duplicates.) As an added bonus, the default encoding for source code files changed to UTF-8, which means in practical terms that you can actually use non-English characters in your code comments (and even identifier names, with restrictions) now and have it just work without declaring an encoding (since your text editor now almost certainly assumes that encoding in 2025). This also made it possible to easily read text files as text in any declared encoding, and get strings as a result, while also having universal newline mode work, and all without needing to reach for `io` or `codecs` standard libraries.
The community was not so much "dragged through a 15-year transition"; rather, some members of the community spent as long as 15 (really 13.5, unless you count people continuing to try to use 2.7 past the extended EOL) years refusing to adapt to what was a clear bugfix of the clearly broken prior behaviour.
Previous discussions:
• https://news.ycombinator.com/item?id=36159443 (June 2023, 280 points, 303 comments; title got reemojied!)
• https://news.ycombinator.com/item?id=26591373 (March 2021, 116 points, 127 comments)
• https://news.ycombinator.com/item?id=20914184 (September 2019, 230 points, 140 comments)
I’m guessing this got posted by one who saw my comment https://news.ycombinator.com/item?id=44976046 today, though coincidence is possible. (Previous mention of the URL was 7 months ago.)
I did post this. I found it by chance, coming from this other post https://tonsky.me/blog/unicode/
In Java,
(HN doesn't render the emoji in comments, it seems.)
Why would I want this to be 17, if I'm representing strings as an array of code points, rather than UTF-8?
TXR Lisp:
(Trust me when I say that the emoji was there when I edited the comment.)
The second value takes work; we have to go through the code points and add up their UTF-8 lengths. The coded length is not cached.
I haven't thought about this deeply, but it seems to me that the evolution of unicode has left it unparseable (into extended grapheme clusters, which I guess are "characters") in a forwards compatible way. If so, it seems like we need a new encoding which actually delimits these (just as utf-8 delimits code points). Then the original sender determines what is a grapheme, and if they don't know, who does?
Worth giving Raku a shout out here... methods do what they say and you write what you mean. Really wish every other language would pinch the Str implementation from here, or at least the design.
I run one of the many online word counting tools (WordCounts.com) which also does character counts. I have noticed that even Google Docs doesn't seem to use grapheme counts and will produce larger than expected counts for strings of emoji.
If you want to see a more interesting case than emoji, check out the Thai language. In Thai, vowels can appear before, after, above, below, or on multiple sides of the associated consonants.
Fascinating and annoying problem, indeed. In Java, the correct way to iterate over the characters (Unicode scalar values) of a string is to use the IntStream provided by String::codePoints (since Java 8), but I bet 99.9999% of the existing code uses 16-bit chars.
This does not fix the problem. The emoji consists of multiple Unicode characters (in turn represented 1:1 by the integer "code point" values). There is much more to it than the problem of surrogate pairs.
I'd disagree that the number of Unicode scalars is useless (in the case of Python 3), but it's a very interesting article nonetheless. Too bad unicode.org decided to break all the URLs in the table at the end.
> We’ve seen four different lengths so far:
> Number of UTF-8 code units (17 in this case)
> Number of UTF-16 code units (7 in this case)
> Number of UTF-32 code units or Unicode scalar values (5 in this case)
> Number of extended grapheme clusters (1 in this case)
We would not have this problem if we all agree to return number of bytes instead.
Edit: My mistake. There would still be inconsistency between different encodings. My point is, if we all decided to report the number of bytes the string uses instead of the number of printable characters, we would not have the inconsistency between languages.
"number of bytes" is dependent on the text encoding.
UTF-8 code units _are_ bytes, which is one of the things that makes UTF-8 very nice and why it has won
> We would not have this problem if we all agree to return number of bytes instead.
I don't understand. It depends on the encoding, doesn't it?
How would that help? UTF-8, 16, and 32 languages would still report different numbers.
> if we all decided to report number of bytes that string used instead number of printable characters
But that isn't the same across all languages, or even across all implementations of the same language.
> Number of extended grapheme clusters (1 in this case)
Only if you are using a new enough version of Unicode. If you were using an older version, it would be more than 1. As new Unicode updates come out, the number of grapheme clusters a string has can change.
When I'm reading text on a screen, I very much am not reading bytes. This is obvious when you actually think about what 'text encoding' means.
You're not reading Unicode code points either, though. Your computer uses bytes; you read glyphs, which roughly correspond to Unicode extended grapheme clusters. Anything in between might look like the correct solution at first, but is the wrong abstraction for almost everything.
You are right, but this just drives the point home.
I learned this recently when I encountered a bug caused by cutting an emoji character in two, making it unable to render.
If you want to get the grapheme length in JavaScript, JavaScript now has Intl.Segmenter[^1][^2].
[^1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
[^2]: https://caniuse.com/mdn-javascript_builtins_intl_segmenter_s...
Call me naive, but I think the length of a space character ought to be one.
Read the article ... the character between the quote marks isn't a space, but HN apparently doesn't support emoji, or at least not that one.
(2019), updated in 2022
Obligatory: Emoji under the hood - https://tonsky.me/blog/emoji/
Another little thing: The post mentions that tag sequences are only used for the flags of England, Scotland, and Wales. Those are the only ones that are standard (RGI), but because it's clear how the mechanism would work for other subnational entities, some systems support other ones, such as US state flags! I don't recommend using these if you want other people to be able to see them, but...
[dead]
I really hate to rant on about this. But the gymnastics required to parse UTF-8 correctly are truly insane. Besides that, we now see issues such as invisible glyph injection attacks etc. cropping up all over the place due to this crappy so-called "standard". Maybe we should just go back to the simplicity of ASCII until we can come up with something better?
Are you referring to Unicode? Because UTF-8 is simple and relatively straightforward to parse.
Unicode definitely has its faults, but on the whole it's great. I'll take Unicode w/ UTF-8 any day over the mess of encodings we had before it.
Needless to say, Unicode is not a good fit for every scenario.
I think GP is really talking about extended grapheme clusters (at least the mention of invisible glyph injection makes me think that)
Those really seem hellish to parse, because there seem to be several mutually independent schemes for how characters are combined into clusters, depending on what you're dealing with.
E.g. modifier characters, tags, zero-width joiners with magic emoji combinations, etc.
So you need both a copy of the character database and knowledge of the interaction of those various invisible characters.
Just as an example of what I am talking about, this is my current UTF-8 parser which I have been using for a few years now.
Not exactly "simple", is it? I am almost embarrassed to say that I thought I had read the spec right. But of course I was obviously wrong and now I have to go back to the drawing board (or else find some other FOSS alternative written in C). It just frustrates me. I do appreciate the level of effort made to come up with an all-encompassing standard of sorts, but it just seems so unnecessarily complicated.
That's a reasonable implementation in my opinion. It's not that complicated. You're also apparently insisting on three-letter variable names, and are using a very primitive language to boot, so I don't think you're setting yourself up for "maintainability" here.
Here's the implementation in the Rust standard library: https://doc.rust-lang.org/stable/src/core/str/validations.rs...
It even includes an optimized fast path for ASCII, and it works at compile-time as well.
Well it is a pretty old codebase, the whole project is written in C. I haven't done any Rust programming yet but it does seem like a good choice for modern programs. I'll check out the link and see if I can glean any insights into what needs to be done to fix my ancient parser. Thanks!
> You're also apparently insisting on three-letter variable names
Why are the arguments not three-letter though? I would feel terrible if that was my code.
It's just a convention I use for personal projects. Back when I started coding in C, people often just opted to go with one or two character variable names. I chose three for locally-scoped variables because it was usually enough to identify them in a recognizable fashion. The fixed-width nature of it all also made for less eye-clutter. As for function arguments, the fact that they were fully spelled out made it easier for API reference purposes. At the end of the day all that really matters is that you choose a convention and stick with it. For team projects they should be laid out early on and, as long as everyone follows them, the entire project will have a much better sense of consistency.
Sure, I'll just write my own language all weird and look like an illiterate so that you are not inconvenienced.
You could use a standard that always uses, e.g., 4 bytes per character; that is much easier to parse than UTF-8.
UTF-8 is so complicated because it wants to be backwards compatible with ASCII.
ASCII compatibility isn't the only advantage of UTF-8 over UCS-4. It also
- requires less memory for most strings, particularly ones that are largely limited to ASCII, like structured text-based formats often are.
- doesn't need to care about byte order. UTF-8 is always UTF-8 while UTF-16 might either be little or big endian and UCS-4 could theoretically even be mixed endian.
- doesn't need to care about alignment: If you jump to a random memory position you can find the next and previous UTF-8 characters. This also means that you can use preexisting byte-based string functions like substring search for many UTF-8 operations.
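To illustrate that last point, here is a small sketch of my own in Python: UTF-8 continuation bytes always match the bit pattern 10xxxxxx, so character boundaries can be found from any byte offset without decoding from the start.

    def prev_boundary(buf: bytes, i: int) -> int:
        """Walk back to the first byte of the character containing offset i."""
        while i > 0 and (buf[i] & 0xC0) == 0x80:   # 0b10xxxxxx = continuation byte
            i -= 1
        return i

    def next_boundary(buf: bytes, i: int) -> int:
        """Walk forward to the start of the character after offset i."""
        i += 1
        while i < len(buf) and (buf[i] & 0xC0) == 0x80:
            i += 1
        return i

    data = "naïve".encode("utf-8")           # b'na\xc3\xafve'
    i = 3                                    # an offset that lands inside 'ï'
    start = prev_boundary(data, i)           # 2
    end = next_boundary(data, start)         # 4
    print(data[start:end].decode("utf-8"))   # ï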
It's not just the variable byte length that causes an issue, in some ways that's the easiest part of the problem. You also have to deal with code points that modify other code points, rather than being characters themselves. That's a huge part of the problem.
That has nothing to do with UTF-8; that's a Unicode issue, and one that's entirely inescapable if you are the Unicode Consortium and your goal is to be compatible with all legacy charsets.
That goes all the way back to the beginning.
Even ASCII used to use "overstriking", where the backspace character was treated as a joiner character to put accents above letters.
True. But then again, backward compatibility isn't really all that hard to do with ASCII because the MSB is always zero. The problem, I think, is that the original motivation which ultimately led to the complications we now see with UTF-8 was based on a desire to save a few bits here and there rather than to create a straightforward standard that was easy to parse. I am actually staring at 60+ lines of fairly pristine code I wrote a few years back that ostensibly passed all tests, only to find out that in fact it does not cover all corner cases. (Could have sworn I read the spec correctly, but apparently not!)
You might like https://fsharpforfunandprofit.com/series/property-based-test...
In this particular case it was simply a matter of not having enough corner cases defined. I was, however, using property-based testing, doing things like reversing then un-reversing the UTF-8 strings, re-ordering code points, merging strings, etc. for verification. The datasets were in a variety of languages (including emoji), and so I mistakenly thought I had covered all the bases.
But thank you for the link; it's turning out to be a very enjoyable read! There already seem to be a few things I could do better thanks to the article, besides the fact that it codifies a lot of interesting approaches one can take to improve testing in general.
I think what you meant is we should all go back to the simplicity of Shift-JIS
Should have just gone with 32 bit characters and no combinations. Utter simplicity.
That would be extremely wasteful, every single text file would be 4x larger and I'm sure eventually it would not be enough anyway.
Maybe we should have just replaced ASCII; it's a horrible encoding where an entire 25% of it is wasted. And maybe we could have gotten a bit more efficiency by saying, instead of having both lower- and uppercase letters, just have one and then have a modifier before it. That would save a lot of space, as most text could just be lowercase.
Yeah, that's how ASCII works… there's 1 bit for lower/upper case.
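Concretely, the upper- and lower-case ASCII letters differ only in the 0x20 bit, which is what makes cheap case mapping possible; a quick sketch:

    print(ord("A"), ord("a"))       # 65 97 -- they differ by 32 (0x20)
    print(chr(ord("A") | 0x20))     # a  -- setting the bit lower-cases
    print(chr(ord("a") & ~0x20))    # A  -- clearing the bit upper-cases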
I think combining characters are a lot simpler than having every single combination ever.
Especially when you start getting into non-Latin-based languages.
What does "no combinations" mean?
Take Ä, say: it might be either a single precomposed character, or a combination of ¨ and A. Both are now supported, but if you can have more than two such things combining into one thing, it makes a mess.
That's fundamental to the mission of Unicode because Unicode is meant to be compatible with all legacy character sets, and those character sets already included combining characters.
So "no combinations" was never going to happen.