My rule of thumb is to treat strings as opaque blobs most of the time. The only validation I'd always enforce is some sane length limit, to prevent users from shoving entire novels inside. If you treat your strings as opaque blobs and use UTF-8, most internationalization problems go away. IMHO, oftentimes input validation is an attempt to solve a problem from the wrong side. Say, when XSS or SQL injection is found on a site, I've seen people's first reaction be to validate user input by looking for "special symbols", or to add a whitelist of allowed characters, instead of simply escaping strings right before rendering HTML (modern frameworks do it automatically) or using parameterized queries if it's SQL. If a user wants to call themselves "alert('hello')", why not? Why the arbitrary limits? I think there are very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
Sanitizing your strings immediately before display is all well and good until you need to pass them to some piece of third-party software that is very dumb and doesn’t sanitize them. You’ll argue that it’s the vendor’s fault, but the vendor will argue that nobody else allows characters like that in their name inputs!
You sanitize at the frontier of what your code controls.
- Sending data to a database: parameterized queries, to sanitize as it is leaving your control.
- Sending it for display to the user: sanitize for a browser.
- Sending to an API: sanitize for whatever rules the API has.
- Sending to a legacy system: sanitize for it.
- Writing a file to the system: sanitize the path.
The common point is that you don't sanitize before you have to send it somewhere. And the advantage of this method is that you limit the chances of getting bitten by reflected injections: if you query some API you don't control, you may well get malicious content back, but since you sanitize when sending it onwards, all is good. Because you're sanitizing on output, not on input.
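To make the output-side idea concrete, here's a minimal Python sketch of sanitizing at two frontiers (the users table and the name value are made up for illustration; html.escape and parameterized sqlite3 queries are standard library features):

    import html
    import sqlite3

    name = "alert('hello')<script>"  # stored verbatim; input validation never rejected it

    # HTML frontier: escape right before rendering
    print(html.escape(name))         # alert(&#x27;hello&#x27;)&lt;script&gt;

    # SQL frontier: parameterized query instead of string concatenation
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))  # the blob stays inert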
Forbidding users from using your service to propagate "little Bobby Tables" pseudo-pranks is likely a good choice.
The choice is different if, like most apps, you are almost exclusively a data sink; but if you are also a data source for others, it pays to be cautious.
I think it’s more of an ethical question than anything. There will always be pranksters and there will never be perfect input validation for names. So who do you oppress? The people with uncommon names? Or the pranksters? I happen to think that if you do your job right, the pranksters aren’t really a problem. So why oppress those with less common names?
I am not saying to only allow [a-zA-Z ]+ in names; what I am saying is that it is OK to block names like "'; drop table users;" or "<script src='https://bad.site.net/'></script>" if part of your business is to distribute that data to other consumers.
And I’m arguing, rhetorically, what if your name produces a syntax error—or worse means something semantically devious—in the query language I’m using? Not all problems look like script tags and semicolons.
> but the vendor will argue that nobody else allows characters like that in their name inputs
...and maybe they will even link to this page to support that statement! But, seeing that most of the pages are German, I bet they do accept the usual German "special" letters (ÄÖÜß) in names?
You do need to use a canonical representation, or you will have two distinct blobs that look exactly the same, tricking other users of the data (other posters in a forum, customer service people in a company, etc.).
Because you don't ever want to store bad data. There's no point to that; it will just create annoying situations and potential security risks. And the best place to catch bad data is while the user is still present, so they can be made aware of the issue (in case they care and are able to solve it). Once they're gone, it becomes nearly impossible and/or very expensive to check what they meant.
There's at least one major exception to this: Unicode normalization.
It's possible for the same logical character to have two different sequences of code points (for example, a-with-umlaut as a single character, vs. a followed by the combining umlaut diacritic). Relatedly, there's distinguishing between the "a" character in Latin, Greek, Cyrillic, and the handful of other places it shows up throughout Unicode.
This comes up in at least 3 ways:
1. A usability issue. It's not always easy to predict which identical-looking variant is produced by a given input method, so users enter identical-looking characters on different devices but get an "account not found" error.
2. A security issue. If some of your backend systems handle these kinds of characters differently, that can cause all kinds of weird bugs, some of which can be exploited.
3. An abuse issue. If it's possible to create accounts with the same-looking name as others that aren't the same account, there can be vectors for impersonation, harassment, and other issues.
So you have to make a policy choice about how to handle this problem. The only things that I've seen work are either restricting the allowed characters (often just to printable ASCII) or being very clear and strict about always performing one of the standard Unicode transformations. But doing that transformation consistently across a big codebase has some real challenges: in particular, it can change based on Unicode version, and guaranteeing that all potential services use the same Unicode version is really non-trivial. So lots of people make the (sensible) choice not to deal with it.
But yeah, agreed that parentheses should be OK.
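(To illustrate the point about canonical representations, a small Python sketch using the standard unicodedata module:)

    import unicodedata

    composed   = "\u00e4"   # "ä" as a single code point
    decomposed = "a\u0308"  # "a" + COMBINING DIAERESIS; renders identically
    print(composed == decomposed)                    # False: two distinct blobs
    print(unicodedata.normalize("NFC", composed) ==
          unicodedata.normalize("NFC", decomposed))  # True after normalizing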
Something we just ran into: there are two Unicode code points for the @ character, the normal one and the fullwidth at sign (U+FF20). It took a lot of head scratching to understand why several Japanese users could not be found by their email address when I was seeing their email right there in the database.
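For anyone hitting the same bug: NFKC normalization folds the fullwidth variant into the plain ASCII one. A quick Python sketch (the email value is made up):

    import unicodedata

    entered = "user\uff20example.com"  # contains U+FF20, the fullwidth at sign
    print(entered == "user@example.com")                                 # False: lookup fails
    print(unicodedata.normalize("NFKC", entered) == "user@example.com")  # True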
> or if you have to interact with some legacy service.
Which happens almost every day in the real world.
You can treat names as byte blobs for as long as you don't use them for their purpose -- naming people.
Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?
>I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
Few exceptions for you are the entirety of the service for others. At the very least you interact with the legacy software of payment systems, which have some firm ideas about what names should be.
> Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?
You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.
The only way to correctly deal with a name that you are unfamiliar with the pronunciation of is to ask how it is pronounced.
You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable - in many cases legal names must be represented accurately as your records might be used for e.g. tax or legal reasons later.
> Would your customer representative be able to pronounce my name somewhat correctly?
Are you implying the CSR's lack of familiarity with the pronunciation of your name means your name should be stored/rendered incorrectly?
Quite the opposite actually. I want it stored correctly and in a way that both me and CSR can understand and so it can be used to interface with other systems.
I don’t, however, know which Unicode subset to use, because you didn’t tell me in the signup form. I have many options, all of them correct, but I don’t know whether your CSR can read Ukrainian Cyrillic, or whether you can tell what the vocative case is and avoid using it when interfacing with the government CA, which expects the nominative.
I think you're touching on another problem, which is that we as users rarely know why the form wants a name. Is it to be used in emails, or for sending packages, or for talking to me?
My language also has a separate vocative case, but I live in a country that has no concept of it and only vestiges of a case system. I enter my name in the nominative, which then of course looks weird if I get emails/letters from them later - they have no idea they should use the vocative. If I knew the form was just for sending me emails, I'd maybe enter my name in the vocative.
Engineers, or UX designers, or whoever does this, like to pretend names are simple. They're just not (obligatory reference to the "falsehoods about names" article). There are many distinct cases for why you may want my name and they may all warrant different input.
- Name to use in letters or emails. It doesn't matter if a CSR can pronounce this if it's used in writing, it should be a name I like to see in correspondence. Maybe it's in a script unfamiliar to most CSRs, or maybe it's just a vocative form.
- Name for verbal communication. Just about anything could be appropriate depending on the circumstances. Maybe an anglicized name I think your company will be able to pronounce, maybe a name in a non-Latin script if I expect it to be understood here, maybe a name in a Latin-extended script if I know most people will still say it reasonably well intuitively. But it could also be an entirely different name from the written one if I expect the written one to be butchered.
- Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.
- Legal name because we're entering a contract or because my ID will be checked later on for some reason.
- Machine-readable legal name for certain systems like airlines. For most of the world's population, this is not the same as the legal name but of course English-language bias means this is often overlooked.
In this specific case, it seems like your concerns are a hypothetical, no?
Not really, no. A lot of us only really have to deal with English-adjacent input (i.e. European languages that share the majority of character forms with English, or cultures that explicitly Anglicise their names when dealing with English folks).
As soon as you have to deal with users with a radically different alphabet/input-method, the wheels tend to come off. Can your CSR reps pronounce names written in Chinese logographs? In Arabic script? In the Hebrew alphabet?
You can analyze the name and direct a case to a CSR who can handle it. May be unrealistic for a 1-2 person company, but every 20+ person company I’ve worked at has intentionally hired CSRs with different language abilities.
First off, no, you can't infer language preference from a name. Even the reasonable and well-meaning assumptions people make about my name only leave me sad and irritated on a good day.
And even if you could, I can't tell whether you actually do it just by looking at what your signup form asks me to input.
A requirement to do that is an extremely broad definition of "treat strings as opaque blobs most of the time" IMHO :)
>Would your customer representative be able to pronounce my name somewhat correctly?
Typical input validation doesn't really solve the problem. For instance, I could enter my name as 'Vrdtpsk,' which is a perfectly valid ASCII string that passes all validation rules, but no one would be able to pronounce it correctly. I believe the representative (if on a call) should simply ask the customer how they would like to be addressed. Unless we want to implement a whitelist of allowed names for customers to choose from...
Many Japanese companies require an alternative name entered in half-width kana to alleviate this exact problem. Unfortunately, most Japanese websites have a million other UX problems that overshadow this clever solution.
This is a problem specific to languages using Chinese characters where most only know some characters and therefore might not be able to read a specific one. Furigana (which is ultimately what you're providing in a separate field here) is often used as a phonetic reading aid, but still requires you to know Japanese to read and pronounce it correctly.
The only generic solution I can think of would be IPA notation, but it would be entirely unreasonable to expect someone to know the IPA for their name, just as it would be unreasonable to expect a random third party to know how to read IPA and replicate the sounds it described.
Absolutely not - do not build anything based on "would your CSR be able to pronounce" something. That's an awful bar - most CSRs can't pronounce my name; would I be excluded from your database?
Seriously, what are you going for here?
That’s the most basic consideration for names, unless you only show it to the user themselves — other people have to be able to read it at least somehow.
Which is why the bag-of-Unicode-bytes approach is as wrong as telling Stęphań he has an invalid name.
Absolutely not. There's no way to understand what a source user's reading capability is. There's no way to understand how a person will pronounce their name by simply reading it, this only works for common names.
And here we go again, engineers expecting the world should behave fitting their framework du jour. Unfortunately, the real world doesn't care about our engineering bubble and goes on with life - where you can be called !xóõ Kxau or ꦱꦭꦪꦤ or X Æ A-12.
> Would your customer representative be able to pronounce my name somewhat correctly?
Worst case, just drop to hexadecimal.
> If you treat your strings as opaque blobs, and use UTF8, most of internationalization problems go away
This is laughably naive.
So many things can go wrong.
Strings are not arrays of bytes.
There is a price to pay if someone doesn't understand that or chooses to ignore it.
> Strings are not arrays of bytes.
That very much depends on the language that you are using. In some, they are.
RTL go brrr
RTL is so much fun, it's the gift that keeps on giving. When I first encountered it I thought, OK, maybe some junior web app developers will sometimes forget it exists and a fun bug or two will get into production. But it's everywhere: Windows, GNU/Linux, automated emails. It can even make malware harder for users to detect on Windows, because you can hide the ".exe" at the beginning of the filename, etc.
And yet when stored on any computer system, that string will be encoded using some number of bytes. Which you can set a limit on even though you cannot cut, delimit, or make any other inference about that string from the bytes without doing some kind of interpretation. But the bytes limit is enough for the situation the OP is talking about.
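For instance, a byte-level cap in Python needs no interpretation at all (the 256-byte limit is an arbitrary example):

    def within_byte_limit(name: str, max_bytes: int = 256) -> bool:
        # Encode and measure; no need to understand the string's contents.
        return len(name.encode("utf-8")) <= max_bytes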
A coworker once implemented a name validation regex that would reject his own name. It still mystifies me how much convincing it took to get him to make it less strict.
I know multiple developers who would just say "well it's their fault, they have to change name then".
I worked with an office of Germans who insisted that ASCII was sufficient. The German language uses letters that cannot be represented in ASCII.
In fairness, they mostly wanted stuff to be in English, and when necessary, to transliterate German characters into their English counterparts (in German there is a standardised way of doing this), so I can understand why they didn't see it was necessary. I just never understood why I, as the non-German, was forever the one trying to convince them that Germans would probably prefer to use their software in German...
I’ve run into a similar-ish situation working with East-Asian students and East-Asian faculty. Me, an American who wants to be clear and make policies easy for everybody to understand: worried about name ordering a bit (Do we want to ask for their last name or their family name in this field, what’s the stupid learning management system want, etc etc). Chinese co-worker: we can just ask them for their last names, everybody knows what Americans mean when they ask for that, and all the students are used to dealing with this.
Hah, fair enough. I think it was an abstract question to me, so I was looking for the technically correct answer. Practical question for him, so he gave the practical answer.
There are some valid reasons to use software in English as a German speaker. Chief among those is probably translations.
If you can speak English, you might be better off using the software in English, as dealing with the English language can often be less of a hassle than dealing with inconsistent, weird, or outright wrong translations.
Even high-quality translations might run into issues where the same thing is translated once as "A" and then as "B" in another context. Or issues where an English technical term is used that has no perfect equivalent in German (i.e. a translation does exist, but is not a well-known, clearly defined technical term). More often than not, though, translations are anything but high quality. Even in expensive products from big international companies.
You should have asked how they would encode the German currency sign (€ for euro) in ASCII or its German counterpart latin1/iso-8859-1...
It's not possible. However, I bet they would argue for using iso-8859-15 (latin9 / latin0) with the international currency sign (¤) instead, or insist that char 128 of latin1 is almost always meant as €, so just ignore the standard in those cases and use a new font.
This would only fail in older printers and who is still printing stuff these days? Nobody right?
Using real utf-8 is just too complex... All these emojis are nuts
EUR is the common answer.
or just double all the numbers and use DM
Weirdly, the old Deutsche Mark doesn't seem to have its own code point in the currency block starting at U+20A0, whereas the Spanish equivalent (the peseta, ₧, not just Pt) does.
> I just never understood why I, as the non-German, was forever the one trying to convince them that Germans would probably prefer to use their software in German...
I cannot know, but they could be ideological. For example, maybe they found it wonderful to use plain ASCII, with no need for special keyboard layouts or anything like that, and decided that German would be much better off without its non-ASCII characters. They could believe something like this and not say it aloud in the discussion with you, because it's irrelevant to the discussion: you weren't trying to change German.
Is name validation even possible?
In certain cultures yes. Where I live, you can only select from a central, though frequently updated, list of names when naming your child. So theoretically only (given) names that are on that list can occur.
Family names are not part of this, but maybe that exists elsewhere too. I don't know how people whose names were given to them before this list was established are handled, however.
An alternative method, which is again culture dependent, is to use virtual governmental IDs for this purpose. Whether this is viable in practice I don't know, never implemented such a thing. But just on the surface, should be.
>So theoretically only (given) names that are on that list can occur.
Unless of course immigration is allowed and doesn't involve changing a name.
Not the OP, but immigration often involves changing your name in the way digital systems store and display it. For example, from محمد to Muhammad or from 陳 to Chen. The pronunciation ideally should stay the same, but obviously there's often slight differences. But if the differences are annoying or confusing, someone might choose an entirely different name as well.
Yes but GP said
> Where I live, you can only select from a central, though frequently updated, list of names when naming your child
I was born in such a country too and still have frequent connections there and I can confirm the laws only apply to citizens of said country so indeed immigration creates exceptions to this rule even if they transliterate their name.
I still don't see how any system in the real world can safely assume its users only have names from that list.
Even if you try to imagine a system for a hospital to register newly born babies... What happens if a pregnant tourist is visiting?
The name a system knows you as doesn’t need to correspond to your legal name or what you are called by others.
With plenty of attitude of course :)
I've only ever interacted with freeform textfields when inputting my name, so most regular systems clearly don't dare to attempt this.
But if somebody was dead set on only serving local customers or having only local personnel, I can definitely imagine someone being brave(?) enough.
This assumes every resident is born and registered in said country, which is a silly assumption. Surely any service catering only to "naturally born citizens" is discriminatory and illegal?
Obviously, foreigners just living or visiting here will not have our strictly local names (thinking otherwise is what would be "silly"). Locals (people with my nationality, so either natural or naturalized citizens) will (*).
(*) I read up on it though, and it seems like exceptions can be requested and allowed, if it's "well supported". Kinda sours the whole thing unfortunately.
> is discriminatory and illegal?
Checked this too (well, using Copilot), it does appear to be illegal in most contexts, although not all.
But then, why would you want to perform name verification specific to my culture? One example I can think of is limiting abuse on social media sites. I vaguely recall Facebook being required to do something like that about a decade ago (although they clearly did not go about it this way).
> Surely, any service only catered only to "naturally born citizen" is discriminatory and illegal?
No, that's also a question that is culturally dependent. In some contexts it's normal and expected.
I read that Iceland asks people to change their names if they naturalise there (because of the -sson or -dottir surname suffix).
But your point stands - not everyone in the system will follow this pattern.
Yes, it is essential when you want to avoid doing business with customers who have invalid names.
You joke, but when a customer wants to give your company their money, it is our duty as developers to make sure their names are valid. That is so business critical!
It's not just business-necessary, it's also mandatory to get right under GDPR.
In legitimate retail, "take the money" has always been the motto.
That said, recently I learned about monetary policy in North Korea and sanctions on the import of luxury goods.
Why Nations Fail (2012) by Daron Acemoglu and James Robinson
What are “invalid names” in this context? Because, depending on the country the person was born in, a name can be literally anything, so I’m not sure what an invalid name looks like (unless you allow an `eval` of sorts).
The non-joke answer for Europe is extended Latin, dashes, spaces and the apostrophe sign, separated into two (or three) distinct ordered fields. Just because a name is originally written in a different script doesn't mean it will be printed only in that script on your ID in the country of residence, or in a travel document issued at home. My name isn't written in Latin characters and it's fine. I know you can't even try to pronounce them, so I have it spelled out in the above-mentioned Latin script.
What if your customer is the artist formerly known as Prince or even X Æ A-12 Musk?
Prince: "Get over yourself and just use your given name." (Shockingly, his given name actually is Prince; I first thought it was only a stage name)
Musk: Tell Elon to get over his narcissism enough to not use his children as his own vanity projects. This isn't just an Elon problem, many people treat children as vanity projects to fuel their own narcissism. That's not what children are for. Give him a proper name. (and then proceed to enter "X Æ A-12" into your database, it's just text...)
There are of course some people who'll point you to a blog post saying no validation is possible.
However, for every 1 user you get whose full legal name is bob@example.com you'll get 100 users who put their e-mail into the name field by accident
And for every 1 user who wants to be called e.e. cummings you'll get 100 who just didn't reach for the shift key and who actually prefer E.E. Cummings. But you'll also get 100 McCarthys and O'Connors and al-Rahmans who don't need their "wrong" capitalisation "fixed" thank you very much.
Certainly, I think you can quite reasonably say a name should consist of between 2 and 75 characters, with no newlines, nulls, emoji, leading or trailing spaces, invalid Unicode code points, or angle brackets.
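A rough Python sketch of that kind of minimal check (the limits are this comment's suggestions, and the emoji test is approximate - most emoji fall in category So):

    import unicodedata

    def plausible_name(name: str) -> bool:
        if not (2 <= len(name) <= 75):
            return False
        if name != name.strip():         # leading/trailing spaces
            return False
        if "<" in name or ">" in name:   # angle brackets
            return False
        # Controls (incl. newlines and nulls), surrogates, private use,
        # unassigned code points, and most emoji:
        bad = {"Cc", "Cs", "Co", "Cn", "So"}
        return all(unicodedata.category(c) not in bad for c in name)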
Don't validate names, use transliteration to make them safe for postal services (or whatever). In SQL this is COLLATE, in the command line you can use uconv:
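Presumably something like this, using ICU's transliteration transforms (output shown for illustration):

    $ echo "Łódź, Štefan" | uconv -x "Any-Latin; Latin-ASCII"
    Lodz, Stefan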
If I ever make my own customer facing product with registration, I'm rejecting names with 'v', 'x' and 'q'. After all, these characters don't exist in my language, and foreign people can always transliterate them to 'w', 'ks' or 'ku' if they have names with weird characters.
The name of the city has the L with stroke (pronounced as a W), so it’s Łódź.
And the transliteration in this case is so far from the original that it's barely recognisable for me (three out of four characters are different and as a native I perceive Ł as a fully separate character, not as a funny variation of L)
The fact that it's pronounced as Вуч and not Лодж still triggers me.
I just looked up the Russian wikipedia entry for it, and it's spelled "Лодзь", but it sounds like it's pronounced "Вуджь", and this fact irritates the hell out of me.
Why would it be transliterated with an Л? And an О? And a з? None of this makes sense.
> Why would it be transliterated with an Л?
Because it _used_ to be pronounced this way in Polish! "Ł" pronounced as "L" sounds "theatrical" these days, but it was more common in the past.
It's a general pattern of what Russia does to names of places and people: aggressively imposing their own cultural paradigm (which follows the even more general pattern). You can look up your civil code's provisions around names and ask a question or two about what historical problem they attempt to solve.
It's not a Russian-specific thing by any stretch.
This happens all the time when names and loanwords get dragged across linguistic boundaries. Sometimes it results from an attempt to "simplify" the respective spelling and/or sounds (by mapping them into tokens more familiar in the local environment); sometimes there's a more complex process behind it; and other times it just happens for various obscure historical reasons.
And the mangling/degradation definitely happens in both directions: hence Москва → Moscow, Paris → Париж.
In this particular case, it may have been an attempt to transliterate from the original Polish name (Łódź), more "canonically" into Russian. Based on the idea that the Polish Ł (which sounds much closer to an English "w" than to a Russian "в") is logically closer to the Russian "Л" (as this actually makes sense in terms of how the two sounds are formed). And accordingly for the other weird-seeming mappings. Then again it could have just ended up that way for obscure etymological reasons.
Either way, how one can be "irritated as hell" over any of this (other than in some jocular or metaphorical sense) is another matter altogether, which I admit is a bit past me.
Correction - it's nothing obscure at all, but apparently a matter of a shift that occurred broadly in the Polish L sound a few centuries ago (whereby it became "dark" and velarized), affecting a great many other words and names (like słowo, mały, etc.). In parts east and south, the "clear" L sound was preserved.
Wait until you hear what Chinese or Japanese languages do with loanwords...
"L with stroke" is the English name for it according to Wikipedia, by the way, not my choice of naming. The transliterated version is not great, considering how far removed it is from the proper pronunciation, but I'm sort of used to it. The almost-correct one above was jarring enough that I wanted to point it out.
(i mean... we do have postal numbers just for problems like this, but both Štefan and Stefan are not-so-uncommon male names over here, so are Jozef and Jožef, etc.)
If you're dealing with a bad API that only takes ASCII, "Celje" is usually better than "ÄŒelje" or "蒌elje".
If you have control over the encoding on the input side and on the output side, you should just use UTF-8 or something comparable. If you don't, you have to try to get something useful on the output side.
Most places where telling Štefan from Stefan is a problem use postal numbers for people too, or/and ask for your DOB.
I don't have a problem differentiating Štefan from Stefan; 's' and 'š' sound pretty different to everyone around here. But if someone runs that script above and transliterates "š" to "s", it can cause confusion.
And no, we don't use "postal numbers for humans".
>And no, we don't use "postal numbers for humans".
An email, a phone number, a tax or social security number, demographic identifier, billing/contract number or combination of them.
All of those will help you tell Stefan from Štefan in the most practical situations.
>But if someone runs that script above and transliterates "š" to "s" it can cause confusion.
It's not nice, it will certainly make Štefan unhappy, but it's not like you will debit the money from the wrong account or deliver to a different address or contact the wrong customer because of that.
Yes, it's easy
bool ValidateName(string name) => true;
(With the caveat that a name might not be representable in Unicode, in which case I dunno. Use an image format?)
name.Length > 0
is probably pretty safe.
That only works if you’re concatenating the first and last name fields. Some people have no last name and thus would fail this validation if the system had fields for first and last name.
Honestly I wish we could just abolish first and last name fields and replace them with a single free text name field since there's so many edge cases where first and last is an oversimplification that leads to errors. Unfortunately we have to interact with external systems that themselves insist on first and last name fields, and pushing it to the user to decide which is part of what name is wrong less often than string.split, so we're forced to become part of the problem.
I did this in the product where I work. We operate globally so having separate first and last name fields was making less sense. So I merged them into a singular full name field.
The first and only people to complain about that change were our product marketing team, because now they couldn’t “personalize” emails like `Hi <firstname>,`. I had the hardest time convincing them that while the concept of first and last names are common in the west, it is not a universal concept.
So as a compromise, we added a “Preferred Name” field where users can enter their first name or whatever name they prefer to be called. Still better than separate first and last name fields.
People can have many names (depending on usage and on "when" - think about marriage), and even if each of those human names can have multiple parts, the plain "text" field is what you should use to represent the name in UIs.
I encourage people to go check the examples the standards give, especially the Japanese and Scandinavian ones.
It’s not just external systems. In many (most?) places, when sorting by name, you use the family names first, then the given names. So you can’t correctly sort by name unless you split the fields. Having a single field, in this case, is “an oversimplification that leads to errors”.
some people have no name at all
Any notable examples apart from young children and Michael Scott that one time?
I've been compiling a list of them:
You seem to have forgotten quite a few, like
See points 40 and 32-36 of Falsehoods Programmers Believe About Names[1]
I know that this is trying to be helpful but the snark in this list detracts from the problem.
Whether it's healthy or not, programmers tend to love snark, and that snark has kept this list circulating and hopefully educating for a long time to this very day
What if my name is
Slim Shady?
Presumably there aren't any people with control characters in their name, for example.
Watch as someone names themselves the bell character, “^G” (ASCII code 7) [1]
When they meet people, they tell them their name is unpronounceable, it’s the sound of a PC speaker from the late 20th century, but you can call them by their preferred nickname “beep”.
In paper and online forms they are probably forced to go by the name “BEL”.
This name, "คุณสมชาย" (Khun Somchai, a common Thai name), appears normal but has a Zero Width Space (U+200B) between "คุณ" (Khun, a title like Mr./Ms.) and "สมชาย" (Somchai, a given name).
In scripts like Thai, Chinese, and Arabic, where words are written without spaces, invisible characters can be inserted to signal word boundaries or provide a hint to text processing systems.
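A small Python sketch of how such a name can look perfectly normal while carrying an invisible code point:

    import unicodedata

    name = "คุณ\u200bสมชาย"   # ZERO WIDTH SPACE between the title and the given name
    print(name)               # renders as one unbroken word
    for ch in name:
        if unicodedata.category(ch) == "Cf":  # U+200B is a Format character
            print(f"invisible code point: U+{ord(ch):04X}")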
This reminds me of a few Thai colleagues who ended up with a legal first name of "Mr." (period included), probably as a result of this.
Buying them plane tickets to attend meetings and so on proved fairly difficult.
But C0 and C1 control codes are out, probably.
> Presumably there aren't any people with control characters in their name, for example.
Of course there are. If you commit to supporting everything anyone wants to do, people will naturally test the boundaries.
The biggest fallacy programmers believe about names is that getting name support 100% right matters. Real engineers build something that works well enough for enough of the population and ship it, and if that's not US-ASCII only then it's usually pretty close to it.
Or unpaired surrogates. Or unassigned code points. Or fullwidth characters. Or "mathematical bold" characters. Though the latter two should be probably solved with NFKC normalization instead.
> Or unpaired surrogates.
That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.
> unassigned code points
Ah, the tyranny of Unicode version support. I was going to suggest that it could be reasonable to check all code points are assigned at data ingress time, but then you urgently need to make sure that your ingress system always supports the latest version of Unicode. As soon as some part of the system goes depending on old Unicode tables, some data processing may go wrong!
How about Private Use Area? You could surely reasonably forbid that!
> fullwidth characters
I’m not so comfortable with halfwidth/fullwidth distinctions, but couldn’t fullwidth characters be completely legitimate?
(Yes, I’m happy to call mathematical bold, fraktur, &c. illegitimate for such purposes.)
> solved with NFKC normalization
I’d be very leery of doing this on storage; compatibility normalisations are fine for equivalence testing, things like search and such, but they are lossy, and I’m not confident that the lossiness won’t affect legitimate names. I don’t have anything specific in mind, just a general apprehension.
That sounds like a reasonable assumption, but probably not strictly correct.
It's safe to reject Cc, Cn, and Cs. You should probably reject Co as well, even though elves can't input their names if you do that.
Cc: Control, a C0 or C1 control code. (Definitely safe to reject.)
Cn: Unassigned, a reserved unassigned code point or a noncharacter. (Safe to reject if you keep up to date with Unicode versions; but if you don’t stay up to date, you risk blocking legitimate characters defined more recently, for better or for worse. The fixed set of 66 noncharacters are definitely safe to reject.)
Cs: Surrogate, a surrogate code point. (I’d put it stronger: you must reject these, it’s wrong not to.)
Co: Private_Use, a private-use character. (About elf names, I’m guessing samatman is referring to Tolkien’s Tengwar writing system, as assigned in the ConScript Unicode Registry to U+E000–U+E07F. There has long been a concrete proposal for inclusion in Unicode’s Supplementary Multilingual Plane <https://www.unicode.org/roadmaps/smp/>, from time to time it gets bumped along, and since fairly recently the linked spec document is actually on unicode.org, not sure if that means something.)
Cf: Format, a format control character. (See the list at <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[...>. You could reject a large number of these, but some are required by some scripts, such as ZERO-WIDTH NON-JOINER in Indic scripts.)
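Putting that policy into a sketch (Python; the Cf allow-list here is deliberately tiny and would need extending per script):

    import unicodedata

    ALLOWED_FORMAT = {"\u200c", "\u200d"}  # ZWNJ and ZWJ, required by some scripts

    def should_reject(ch: str) -> bool:
        cat = unicodedata.category(ch)
        if cat in {"Cc", "Cn", "Cs", "Co"}:  # control, unassigned, surrogate, private use
            return True
        return cat == "Cf" and ch not in ALLOWED_FORMAT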
Challenge accepted, I'll try to put a backspace and a null byte in my firstborn's name. Hope I don't get swatted for crashing the government servers.
If you just use the {Alphabetic} Unicode character class (100K code points), together with a space, hyphen, and maybe comma, that might get you close. It includes diacritics.
I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
I wondered about numbers, but the most famous example of that has been overturned:
"Originally named X Æ A-12, the child (whom they call X) had to have his name officially changed to X Æ A-Xii in order to align with California laws regarding birth certificates."
(Of course I'm not saying you should do this. It is fun to wonder though.)
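For what it's worth, the third-party regex module (unlike the stdlib re) understands Unicode properties, so the check sketched above could look like:

    import regex  # pip install regex; stdlib `re` has no \p{...} support

    NAME_RE = regex.compile(r"^[\p{Alphabetic} \-,]+$")

    print(bool(NAME_RE.match("鈴木涼太")))         # True: CJK counts as {Alphabetic}
    print(bool(NAME_RE.match("Robert'); DROP")))   # False: punctuation rejected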
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
Latin characters are NOT allowed in official names for Japanese citizens. It must be written in Japanese characters only.
For foreigners living in Japan it's quite frequent to end up in a situation where their official name in Latin does not pass the validation rules of many forms online. Issues like forbidden characters, or because it's too long since Japanese names (family name + first name) are typically only 4 characters long.
Also, when you get a visa to Japan, you have to bend and deform the pronunciation of your name to make it fit into the (limited) Japanese syllabary.
Funnily, they even had to register a whole new Unicode range at some point, because old administrative documents sometimes contain characters that were deprecated more than a century ago.
To be clear, I wasn't thinking about within a specific country though.
More like, what is the set of all characters that are allowed in legal names across the world?
You know, to eliminate things like emoji, mathematical symbols, and so forth.
Ah, I see.
I don't know, but I would bet that the sum of all corner cases and exceptions in the world would make it pretty hard to confidently eliminate any "obvious" characters.
From a technical standpoint, Unicode emoji are probably safe to exclude, but on the other hand, some scripts like Chinese characters are fundamentally pictograms, which is semantically not so different from an emoji.
Maybe after centuries of evolution we will end up with a legit universal language based on emojis, and people named with it.
Chinese characters are nothing like emoji. They are more akin to syllables. There is no semantic similarity to emoji at all, even if they were originally derived from pictorial representations.
And they belong to the {Alphabetic} Unicode class.
I'm mostly curious if Unicode character classes have already done all the hard work.
You forgot apostrophe as is common in Irish names like O’Brien.
Yes, though O’Brien is Ó Briain in Irish, according to Wikipedia. I think the apostrophe in Irish names was added by English speakers, perhaps by analogy with "o'clock", perhaps to avoid writing something that would look like an initial.
There are also English names of Norman origin that contain an apostrophe, though the only example I can think of immediately is the fictional d'Urberville.
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
Some Japanese names are written with Japanese characters that do not have Unicode codepoints.
(The Unicode consortium claims that these characters are somehow "really" Chinese characters just written in a different font; holders of those names tend to disagree, but somehow the programmer community that would riot if someone suggested that people with ø in their name shouldn't care when it's written as o accepts that kind of thing when it comes to Japanese).
Apostrophe is common in surnames in parts of the world.
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
The Catalan name Gal·la is growing in popularity, with currently 1515 women in the census having it as a first name in Spain with an average age of 10.4 years old: https://ine.es/widgets/nombApell/nombApell.shtml
There’s this individual’s name which involves a clock sound: Nǃxau ǂToma[1]
> There’s this individual’s name which involves a clock sound: Nǃxau ǂToma
I was extremely puzzled until I realized you meant a click sound, not a clock sound. Adding to my confusion, the vintage IBM 1401 computer uses ǂ as a record mark character.
What if one's name is not in alphabetic script? Let's say, "鈴木涼太".
That's part of {Alphabetic} in Unicode. It validates.
דויד Smith (concatenated) will have an LTR control character in the middle
Oh that's interesting.
Is that a thing? I've never known of anyone whose legal name used two alphabets that didn't have any overlap in letters at all -- two completely different scripts.
Would a birth certificate allow that? Wouldn't you be expected to transliterate one of them?
Comma or apostrophe, like in d'Alembert ?
(And I have 3 in my keyboard, I'm not sure everyone is using the same one.)
Mrs. Keihanaikukauakahihuliheekahaunaele only had a string length problem, but there are people with a Hawaiian ʻokina in their names. U+02BB
It is if you first provide a complete specification of a “name”. Then you can validate if a name is compliant with your specification.
It's super easy actually. A name consists of three parts -- Family Name, Given Name and Patronymic, spelled using Ukrainian Cyrillic. You can have a dash in the Family Name, and the apostrophe is part of Cyrillic for these purposes, but no spaces in any of the three. If you are unfortunate enough not to use Cyrillic (of our variety) or patronymics in the country of your origin (why didn't you stay there, anyway), we will fix it for you, mister Нкріск. If you belong to certain ethnic groups who by their custom insist on not using patronymics, you can have a free pass, but life will be difficult, as not everybody got the memo. No, you cannot use a Matronymic instead of a Patronymic, but give us another 30 years of not having a nuclear war with a country whose name starts with "R" and ends in "full of putin slaves si iiia", and we might see to that.
Unless of course the name is not used for official purposes, in which case you can get away with First-Last combination.
It's really a non-issue, and the answer is jurisdiction-bound. In most of Europe the extended Latin set is used in place of Cyrillic (because they don't know better), so my name is transliterated, for the purposes of being in the uncivilized realms, by my own government. No, I can't just use Л and Я as part of my name anywhere here.
Valid names are those which terminate when run as Python programs.
You may not want Bobby Tables in your system.
If you're prohibiting valid letters to protect your database because you didn't parametrize your queries, you're solving the problem from the wrong end
Sure it is. Context matters. For example, in clone wars.
No, but it doesn’t stop people trying.
There's little more you can do to validate a name internationally than to provide one textbox and check if it's a valid encoding of Unicode. Maybe you can exclude some control and graphical ranges at best.
Of course there are valid concerns that international names should pass through e.g. local postal services, which would require at least some kind of Latinized representation of name and address. I suppose the Latin alphabet is the most convenient minimal common denominator across writing systems, even though I admit being Euro-centric.
I have an 'æ' in my middle name (formally secondary first name because history reasons). Usually I just don't use it, but it's always funny when a payment form instructs me to write my full name exactly as written on my credit card, and then goes on to tell me my name is invalid.
I live in Łódź.
Love receiving packages addressed to ??d? :)
I wonder how many of those packages end up in Vada, Italy. Or Cody, Wyoming. Or Buda, Texas...
I imagine the “Poland” part of the address would narrow it down somewhat.
I got curious if I can get data to answer that, and it seems so.
Based on xlsx from [0], we got the following ??d? localities in Poland:
1 x Bądy, 1 x Brda, 5 x Buda, 120 x Budy, 4 x Dudy, 1 x Dydy, 1 x Gady, 1 x Judy, 1 x Kady, 1 x Kadź, 1 x Łada, 1 x Lady, 4 x Lądy, 2 x Łady, 1 x Lęda, 1 x Lody, 4 x Łódź, 1 x Nida, 1 x Reda, 1 x Redy, 1 x Redz, 74 x Ruda, 8 x Rudy, 12 x Sady, 2 x Zady, 2 x Żydy
Certainly quite a lot to search for a lost package.
Interesting! However, assuming that ASCII characters are always rendered correctly and never as "?", it seems like the only solution for "??d?" would be one of the four Łódźs?
Sounds like someone is getting ready for Advent of Code!
Experienced postal workers most probably know well that ??d? represents a municipality with three non-ascii characters.
Interestingly, Lady, Łady and Lądy will end up the same after the usual transliteration.
And the postal code.
And the packages get there? Don't you put "Łódź (Lodz)" in the city field? Or the postal code takes care of the issue?
Yep, postal code does all the work.
You live in a boat? But how do they know on what sea?
Ironically, there are no big rivers in Łódź (anymore)
As you may be aware, the name field for credit card transactions is rarely verified (perhaps limited to North America, not sure).
Often I’ll create a virtual credit card number and use a fake name, and virtually never have had a transaction declined. Even if they are more aggressively asking for a street address, giving just the house number often works.
This isn't deep cover, but it gives a little bit of anonymity from marketing.
It's for when things go wrong. Same as with wire transfers. Nobody checks it unless there's a dispute.
The thing is though that payment networks do in fact do instant verification and it is interesting what gets verified and when. At gas stations it is very common to ask for a zip code (again US), and this is verified immediately to allow the transaction to proceed. I’ve found that when a street address is asked for there is some verification and often a match on the house number is sufficient. Zip codes are verified almost always, names pretty much never.
This likely has something to do with complexities behind “authorized users”.
Funny thing about house numbers: they have their own validation problems. For a while I lived in a building whose house number was of the form 123½, and that was an ongoing source of problems. If it just truncated the ½, that was basically fine (the house at 123 didn't have apartment numbers and the postal workers would deliver it correctly), but validating it in online forms (twenty-ish years ago) was a challenge. If they ran any validation at all they'd reject the ½, but it was a crapshoot which of "123-1/2" or "123 1/2" would work, or sometimes neither one. The USPS's official recommendation at the time was to enter it as "123 1 2 N Streetname", which usually validated but looked so odd it was my last choice (and some validators rejected the "three numbers" format too).
I don't think I ever tried "123.5", actually.
Around here, there used to be addresses like "0100 SW Whatever Ave" that were distinct from "100 SW Whatever Ave". And we've still got various places that have, for example, "NW 21st Avenue" and "NW 21st Place" as a simple workaround for a not-entirely-regular street grid optimized for foot navigation.
123 + 0.5?
At American gas stations, if you have a Canadian credit card, you type in 00000 because Canadians don't have ZIP codes.
Are we sure they don't actually validate against a more generic postal code field? Then again some countries have letters in their postcodes (the UK comes to mind), so that might be a problem anyways.
Canada has letters in postal codes.
That’s the issue the GP is referring to, since US gas stations invariably just have a simple 5 numeric digit input for “zip” code.
There are so many ways to write your address that I always assume it's just the house number as well. In fact, I vaguely remember that being a specific field when interacting with some old payment gateway.
The government of Ireland has many IT systems that cannot handle áccénted letters. #headdesk
I worked for an Irish company that didn't support ' in names. Did get fixed eventually, but sigh...
Bobby Tables enters the chat
Still much better when it fails at the first step. I once got myself into a bit of a struggle with Windows 10 by using "ł" as part of my Windows username. An amusingly/irritatingly large number of applications, even some of Microsoft's own, could not cope with that.
For a similar reason, many Java applications do not work on Turkish Windows installs. The Turkish İi/Iı problem.
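The Java failure mode is that String.toLowerCase() uses the default locale, so on a Turkish system "TITLE".toLowerCase() becomes "tıtle" and ASCII-keyed lookups break. Python's case mappings are locale-independent, but a sketch still shows why the letters are tricky:

    # Turkish has dotted İ/i and dotless I/ı as four separate letters,
    # so the usual I <-> i case mapping cannot round-trip them:
    print("İ".lower())  # 'i̇' -- ASCII i + COMBINING DOT ABOVE (length 2!)
    print("ı".upper())  # 'I' -- the dotless distinction is silently lost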
My wife had two given names and no surname. (In fact, before eighth class, she only had one given name.) Lacking a surname is very common in some parts of India. Also some parts of India put surname first, and some last, and the specific meaning and customs vary quite a bit too. Indian passports actually separate given names and family names entirely (meaning you can’t reconstruct the name as it would customarily be written). Her passport has the family name line blank. Indigo writes: “Name should be as per government ID”, and has “First And Middle Name” and “Last Name” fields. Both required, of course. I discovered that if you put “-” in the Last Name field, the booking process falls over several steps later in a “something went wrong, try again later” way; only by inspecting an API response in the dev tools did I determine it was objecting to having “-” in the name. Ugh. Well, I have a traditional western First Middle Last name, and from putting it in things, sometimes it gets written First Middle Last and sometimes Last First Middle, and I’ve received some communications addressed to First, some to Last, and some to Middle (never had that happen before!). It’s a disaster.
Plenty of government things have been digitalised in recent years too, and split name fields tend to have been coded to make both mandatory. It’s… disappointing, given the radical diversity of name construction across India.
"Write your name the way it's spelled in your government issued id" is my favorite. I have three ids issued by two governments and no two match letter by letter.
Did you actually get banks to print that on your credit card?
I’m impressed, most banks I know struggle with anything non-[A-Z]!
As someone who really thinks the name field should just be one field accepting any printable Unicode characters, I do wonder what the hell I would need to do if I took customer names in this form, and then my system had to interact with some other service that requires a first/last name split and/or [a-zA-Z] validation, like a bank or postal service.
Automatic transliteration seems to be very dangerous (wrong name on bank accounts, for instance), and not always feasible (some unicode characters have more than one way of being transliterated).
Should we apologize to the user, and just ask the user twice, once correctly, and once for the bad computer systems? This seems to be the only approach that both respects their spelling, and at the same time not creating potential conflict with other systems.
We had problems with a Ukrainian refugee we helped because certified translations of her documents did not match. Her name was transliterated the German way in one place and the English way in another.
Those are translations coming from professionals who swore an oath. Don’t try to do it with code.
In the US, you can generally specify to your certified translators how you want proper names and place names written. I would suggest you or your friend talk to the translators again so that everything matches. It will also minimize future pains.
Also, USCIS usually has an "aliases" field on their forms, which would be a good place to put German government misspellings.
USCIS is a mess.
I know someone who still doesn't know whether they have a middle name as far as American authorities are concerned.
Coupled with "two last names" and it gets really messy, really quickly.
Purchase names don't match the CC name.
Bank statements are actually "for another person".
Border crossings are now extra spicy.
And "pray" that your name doesn't resemble a name in some blacklist.
The fundamental mistake is in trying to take input for one purpose and transform it for another purpose. Just have the user fill in an additional field for their name as it appears on bank statements, or whatever the second purpose is. Trying to be clever about this stuff never works out.
What you call the second purpose is often the only purpose. Or you have to talk to half a dozen other systems each of which have different constraints. You wouldn’t want to present the user half a dozen fields just so that they can choose the nicest representation of their name for each system.
That being said, in Japan it's actually common to have two fields, one for the name in kanji (the "nice" name), and one in katakana (a restricted phonetic alphabet, which earlier 8-bit computer systems used and some probably still use).
You usually don't have a dozen, just two or three. And if you do have a dozen, there's a certain pattern, or at least a common denominator: half of them will be ASCII, the other half will use some kind of local convention you already know how to encode.
You can just show the user the transliteration & have them confirm it makes sense. Always store the original version since you can't reverse the process. But you can compare the transliterated version to make sure it matches.
Debit cards are a pretty common example of this. I believe you can only have ASCII in the cardholder name field.
>But you can compare the transliterated version to make sure it matches
No you can't.
Add: Okay, you need to know why. I'm right here, a living breathing person with a government ID that has the same name written in two scripts, side by side.
There is an algorithm (blessed by the same government that issued said ID) which defines how to transliterate names from one script to the other, published on the parliament's web site and implemented everywhere that's involved in the ID-issuing business.
The algorithm will, however, not produce the outcome you will see on my ID, because I, a living breathing person who has a name, asked nicely to have it spelled the way I like. The next time I visit the ID-issuing place, I could forget to ask nicely, and then I will have two valid IDs (no, the old one will not be marked as void!) bearing three names that don't exactly match. It's all perfectly fine, because the name as a legal concept is defined in a character set you probably can't read anyway.
Please, don't try to be smart with names.
Your example fails to explain any problem with GP's proposal. They would show you a transliteration of your name and ask you to confirm it. You would confirm it or not. It might match one or other of your IDs (in which case you would presumably say yes) or not (in which case you would presumably say no). What's the issue?
You will compare the transliterated version I provided with the one you already have; it will not match, and then what? Either you tell me I have an invalid name, or you just ignore it.
I think they were suggesting the opposite order - do an automatic transliteration and offer you the choice to approve or correct it.
But even if the user is entering both, warning them that the transliteration doesn't match and letting them continue if they want is something that pays for itself in support costs.
I have an ID that transliterated my name, and included the original, but the original contained an obvious typo. I immediately notified the government official, but they refused to fix it. They assured me that only the transliterated name would be used.
Human systems aren't always interested in avoiding or fixing defects.
Okay, I have a non-ASCII (non-Latin, even) name, so I can tell. Just ask explicitly how my name is spelled in a bank system or on my government id. Please don't try transliteration, unless you know the exact rules the other system uses to transliterate my name from one cultural context into another, and even then make it a suggestion and make it clear which purpose it will be used for (and then only use it for that purpose).
And please please please, don't try to be smart and detect the cultural context from the character set before automatically converting it to another character set. It will go wrong, you will not notice for a long time, and people will make mean passive-aggressive screenshots of your product too.
My bank, for example, knows my legal name in Cyrillic but will not print it on a card, so they make a best-effort attempt to transliterate it to ASCII, but put the result in an editable field and ask me to confirm this is how I want it to appear on the card.
Ask for inflections separately.
For instance, in many Japanese forms, there are dedicated fields for the name and the pronunciation of the name. There are possibly multiple ways to read a name (e.g. 山崎 is either やまざき or やまさき). It is better to ask the person "how to read your name?" rather than execute code to guess the reading.
As for transliteration, it's best to avoid if possible. If not possible, then rely on international standards (e.g. Japanese has ISO 3602 and Arabic has ISO 233-2). When international standards don't exist, then fall back to "context-dependent" standards (e.g. in Taiwan, there are several variants of Pinyin. Allow the user to choose the romanization that matches their existing documentation).
Legal name vs. display name
… "legal name" is "things programmer's believe about names" grade. Maybe (name, jurisdiction), but I've seen exceptions to that, too.
Where I live, no less than 3 jurisdictions have a say about my "legal" name, and their laws do not require them to match. At one point, one jurisdiction had two different "legal" names for me, one a typo by my standards, but AFAICT, both equally valid.
There's no solution here, AFAICT; it's just evidence towards why computers cannot be accountability sinks.
WTF-8 is actually a real encoding, used for encoding invalid UTF-16 unpaired surrogates for UTF-8 systems: https://simonsapin.github.io/wtf-8/
I believe this is what Rust OsStrings are under the hood on Windows.
Which I assume stands for "Windows-Transformation-Format-8(bits)".
From the spec's abstract:
> WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair.
Can you still assume the bytes 0x00 and 0xFF are not present in the string (like in UTF-8)?
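If I'm reading the spec right (a sketch, not gospel): Python's "surrogatepass" error handler happens to produce the same byte sequences WTF-8 prescribes for a lone surrogate, so you can poke at it yourself:

```python
# Hedged sketch: encoding an unpaired surrogate, WTF-8 style.
lone = "\ud800"                              # unpaired high surrogate
wtf8 = lone.encode("utf-8", "surrogatepass")
print(wtf8.hex())                            # eda080 -- a UTF-8-shaped 3-byte sequence

try:
    lone.encode("utf-8")                     # strict UTF-8 refuses it
except UnicodeEncodeError as e:
    print(e)                                 # "surrogates not allowed"

# To the question above: these sequences still never contain 0xFF, and
# 0x00 only appears if the text itself contains U+0000, same as UTF-8.
```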
I'll say it again: this is the consequence of Unicode trying to be a mix of HTML and docx instead of a charset. It went too far for an average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too - special symbols simply get removed.
Unicode screwed itself up completely. We wanted a common charset for things like Latin, extended Latin, CJK, Cyrillic, Hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without cutting out all that bs by force.
The “invisible symbols” are necessary to correctly represent human language. For instance, one of the most infamous Unicode control characters — the right-to-left override — is required to correctly encode mixed Latin and Hebrew text [1], which are both scripts that you mentioned. Besides, ASCII has control characters as well.
The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.
If you say that Unicode is too expansive, you would have to make a decision to exclude certain types of human communication from being encodable. In my opinion, including everything without discrimination is much preferable here.
> Is this necessary to correctly represent human language?
Yes! As soon as you have any invisible characters (e.g. RTL or LTR marks, which are required to represent human language), you will be able to encode any data you want.
How many direction marks can we see in this hidden text?
Wow. I did not expect you can just hide arbitrary data inside totally normal-looking strings like that. If I select up to "Copy thi" and decode, there's no hidden string, but just holding shift+right arrow to select "one more character", the "s" in "this", the hidden string comes along.
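A toy sketch of the trick (the bit-per-mark scheme and the placement before the last character are made up for illustration; real-world abuses are subtler):

```python
# Smuggle arbitrary bytes into normal-looking text as invisible
# LRM/RLM direction marks, one mark per bit.
LRM, RLM = "\u200e", "\u200f"   # left-to-right / right-to-left marks

def hide(cover: str, payload: bytes) -> str:
    bits = "".join(f"{b:08b}" for b in payload)
    marks = "".join(LRM if bit == "0" else RLM for bit in bits)
    return cover[:-1] + marks + cover[-1]   # tuck marks before the last char

def reveal(text: str) -> bytes:
    bits = "".join("0" if c == LRM else "1" for c in text if c in (LRM, RLM))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

s = hide("Copy this", b"hi")
print(s)              # renders as "Copy this"
print(len(s))         # 25 -- 16 invisible marks hiding two bytes
print(reveal(s))      # b'hi' -- selecting past the final "s" drags them along
```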
> one of the most infamous Unicode control characters — the right-to-left override
You are linking to an RLM, not an RLO. Those are different characters. RLO is generally not needed and is more special-purpose. RLM causes far fewer problems than RLO.
Really though, I feel like the newer "first strong isolate" character is much better designed and easier to understand than most of the other RTL characters.
Granted, technically speaking emojis are not part of the "Unicode Standard", but they are standardized by the Unicode Consortium and constitute "Unicode Technical Standard #51": https://www.unicode.org/reports/tr51/
I'm happy to discriminate against those damn ancient Sumerians and anyone still using goddamn Linear B.
Sure, but removing those wouldn't make Unicode any simpler, they're just character sets. The GP is complaining about things like combining characters and diacritic modifiers, which make Unicode "ugly" but are necessary if you want to represent real languages used by billions of people.
I’m actually complaining about more “advanced” features like hiding text (see my comment above) or zalgoing it.
And of course endless variations of skin color and gender of three people in a pictogram of a family or something, which is purely a product of a specific subculture that doesn’t have anything in common with text/charset.
If Unicode cared about characters, which happen to be an evolving but finite set, it would simply include them all, together with exactly two direction specifiers. Instead it created a language/format/tag system within itself to build characters, most of which make zero sense to anyone in the world except for grapheme linguists, if that job title even exists.
It will eventually overengineer itself into a set bigger than the set of all real characters, if it hasn't yet.
The practicality and implications of such a system are clearly demonstrated by the $subj.
You're right, of course. The point I was glibly making was that Unicode has a lot of stuff in it, and you're not necessarily stomping on someone's ability to communicate by removing part of it.
I'm also concerned by having to normalize representations that use combining characters etc., but I will add that there are assumptions you can break just by including weird charsets.
For example, the space character in Ogham, U+1680, is considered whitespace but may not be invisible, ultimately because of the mechanics of writing something that's like branches coming off a tree, though carved around a large stone. That might be annoying to think about when you're designing a login page.
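You can check that claim with the stdlib, for what it's worth:

```python
print("\u1680".isspace())   # True -- the Ogham space mark counts as whitespace
print("a\u1680b".split())   # ['a', 'b'] -- plain str.split() honours it too
```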
I mean, we can just make the languages simpler? We can also remove all the hundred different ways to pronounce English sounds. All elementary students will thank you for it xD
You can make a language simpler but old books still exist. I guess if we burn all old books and disallow a means to print these old books again, people would be happy?
Reprint them with new spelling? We have 500 year old books that are unreadable. 99.99% of all books published will not be relevant to anyone that isn’t consuming them right at that moment anyway.
Lovers can read the lord of the rings in the ‘original’ spelling.
People who should use Sumerian characters don't even use them, sadly. First probably because of habit with their transcription, but also because missing variants of characters mean a lot of text couldn't be accurately represented. Also, I'm downvoting you for discriminating against me.
I know you're being funny, but that's sort of the point. There's an important "use-mention" distinction when it comes to historical character sets. You surely could try to communicate in authentic Unicode Akkadian (𒀝𒅗𒁺𒌑(𒌝)), but what's much more likely is that you really just want to refer to characters or short strings thereof while communicating everything else in a modern living language like English. I don't want to stop someone from trying to revive the language for fun or profit, but I think there's an important distinction between cases of primarily historical interest like that, and cases that are awkward but genuine like Inuktut.
> and invisible symbols
Invisible symbols were in Unicode before Unicode was even a thing (ASCII already has a few). I also don't think emojis are the reason why devs add checks like in the OP; it's much more likely that they just don't want to deal with character encoding hell.
As much as devs like to hate on emojis, they're widely adopted in the real world. Emojis are the closest thing we have to a universal language. Having them in the character encoding standard ensures that they are really universal, and supported by every platform; a loss for everyone who's trying to count the number of glyphs in a string, but a win for everyone else.
Unicode has metadata on each character that would allow software to easily strip out or normalize emojis and "decorative" characters.
It might have edge-case problems -- but the characters in the OP's name would not be included.
Also, stripping out emojis may not actually be required or the right solution. If security is the concern, Unicode also has recommended processes and algorithms for dealing with that.
We need better support, in more platforms and languages, for the functions developers actually need on Unicode.
Global human language is complicated as a domain. Legacy issues in actually existing data adds to the complexity. Unicode does a pretty good job at it. It's actually pretty amazing how well it does. Including a lot more than just the character set, and encoding, but algorithms for various kinds of normalizing, sorting, indexing, under various localizations, etc.
It needs better support in the environments more developers are working in, with raised-to-the-top standard solutions for identified common use cases and problems, that can be implemented simply by calling a performance-optimized library function.
(And if we really want to argue about emojis: they seem to be extremely popular, and have literally affected global culture, because people want to use them? Blaming emojis seems like blaming the user! Unicode's support for them actually provides interoperability and vendor-neutral standards for a thing that is wildly popular? But I actually don't think any of the problems or complexity we are talking about, including the OP's complaint, can or should be laid at the feet of emojis.)
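To make the metadata point concrete, a rough sketch using only the stdlib. The general categories used here (So/Sk/Cf) are a crude stand-in of my own choosing; real code should use the UTS #51 emoji properties instead:

```python
import unicodedata

def strip_pictographs(text: str) -> str:
    # Drop "other symbols", modifier symbols (e.g. skin tones), and
    # invisible format characters; letters and their diacritics survive.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch) not in ("So", "Sk", "Cf")
    )

print(strip_pictographs("Stępień 👍🏽"))   # 'Stępień ' -- the name is untouched
```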
There's no argument here.
We could say it's only for scripts and alphabets, OK. But it includes many undeciphered writing systems from antiquity with only a small handful of extant samples.
Should we keep those character sets, very likely never to be used, but exclude the extremely popular emojis?
I used to be on the anti-emoji bandwagon, but really, it's all indefensible. Unicode is characters of communication at an extremely inclusive level.
I'm sure some day it will also have primitive shapes, and you will be able to construct your own alphabet using them + directional modifiers, akin to a generalizable Hangul, in effect becoming some kind of wacky version of SVG that people will abuse in an ASCII art renaissance.
So be it. Sounds great.
No, no, no, no, no… So then we’d get ‘the same’ character with potentially infinite different encodings. Lovely.
Unicode is a coding system, not a glyph system or font.
Fonts are already in there, and proto-glyphs are too, as generalized diacritics. There's also a large variety of generic shapes, lines, arrows, circles and boxes in both filled and unfilled varieties. Lines even have different weights. The absurdity of a custom alphabet can already be partially actualized. Formalism is merely the final step.
This conversation was had 20 years ago and your (and my) position lost. Might as well embrace the inevitable instead of insisting on the impossible.
Whether you agree with it or not won't actually affect unicode's outcome, only your own.
Unicode does not specify any fonts, though many fonts are defined to be consistent with the Unicode standard, nevertheless they are emphatically not part of Unicode.
How symbols including diacritics are drawn and displayed is not a concern for Unicode, different fonts can interpret 'filled circle' or the weight of a glyph as they like, just as with emoji. By convention they generally adopt common representations but not entirely. For example try using the box drawing characters from several different fonts together. Some work, many don't.
macOS already does different encoding for filenames in Japanese than what Windows/Linux do, and I'm sure someone mentioned same situation in Korean here.
Unicode is already a non-deterministic mess.
And that justifies making it an even more complete mess, in new and dramatically worse ways?
Like how phonetic alphabets save space compared to ideograms by just “write the word how it sounds”, the little SVG-icode would just “write the letter how it’s drawn”
Right. Semantic iconography need not be universal or even formal to be real.
Think of all the symbols notetakers invent; ideographs without even a phonology assigned to them.
Being as dynamic and flexible as human expression is hard.
Emojis have even taken on this property naturally. The high-five is also the praying hands, for instance. Culturally specific semantics are assigned to the variety of shapes, such as the eggplant and peach.
Insisting that this shouldn't happen is a losing battle against how humans construct written language. Good luck with that.
There are no emoji in this guy's name.
Unicode has made some mistakes, but having all the symbols necessary for this guy's name is not one of them.
This frustration seems unnecessary; Unicode isn't more complicated than time, and we have far more than enough processing power to handle its most absurd manifestations.
We just need good libraries, which is a lot less work than inventing yet another system.
The limiting factor is not compute power, but the time and understanding of a random dev somewhere.
Time also is not well understood by most programmers. Most just seem to convert it to epoch and pretend that it is continuous.
In what way is Unicode similar to HTML, docx, or a file format? The only features I can think of that are even remotely similar to what you're describing are emoji modifiers.
And no, this webpage is not the result of "carefully cutting out the complicated stuff from Unicode". I'm pretty sure it's just the result of not supporting Unicode in any meaningful way.
>We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while.
We didn't even get that, because slightly different-looking characters from Japanese and Chinese (and other languages) got merged into the same character in Unicode due to having the same origin, meaning you have to use a font based on the language context for it to display correctly.
They are the same character, though. They do not use the same glyph in different language contexts, but Unicode is a character encoding, not a font standard.
They're not. Readers native in one version can't read the other, and there are more than a handful that got duplicated in multiple forms, so they're just not the same, only similar.
You know, the obvious presumption underlying Han unification is that the CJK languages must form a continuous dialect continuum, as if villagers living in the middle of the East China Sea between Shanghai and Nagasaki and Gwangju would speak half-Chinese-Japanese-Korean, and the technical distinctions only exist because of rivalry or something.
Alas, people don't really erect houses on the surface of an ocean, and the CJK languages are each complete isolates with no known shared ancestries, so "it's gotta be all the same" thinking really doesn't work.
I know it's not very intuitive to think that Chinese and Japanese have ZERO syntactic similarity or mutual intelligibility despite the relatively tiny mental share they occupy, but it's just how things are.
You're making the same mistake: the languages are different, but the script is the same (or trivially derived from the Han script). The Ideographic Research Group was well aware of this, having consisted of native speakers of the languages in question.
That's not "mistake", that's the reality. They don't exchange, and they're not the same. "Same or trivially derived" is just a completely false statement that solely exist to justify Han Unification, or maybe something that made sense in the 80s, it doesn't make literal sense.
Yes, but the same is true for overlapping characters in Cyrillic and Latin. A and А are the same glyph; so are т,к,і and t,k,i, and you can even see the difference between some of those.
The duplication there is mostly to remain compatible or trivially transformable with existing encodings. Ironically, the two versions of your example "A" do look different on my device (Android), with a slightly lower x-height for the Cyrillic version.
The irony is you calling it irony. CJK characters that are "the same or trivially derived" are nowhere close to that, yet were given the same code points. CJK unified ideographs are just broken.
So when are we getting UniPhoenician?
This is a bullshit argument that never gets applied to any other living language. The characters are different; people who actually use them in daily life recognise them as conveying different things. If a thumbs-up with a different skin tone is a different character, then a different pattern of lines is definitely a different character.
> If a thumbs up with a different skin tone is a different character
Is it? The skin tone modifier is serving the same purpose as a variant selector for a CJK codepoint would.
The underlying implementation mechanism is not the issue. If unicode had actual support for Japanese characters so that when one e.g. converted text from Shift-JIS (in the default, supported way) one could be confident that one's characters would not change into different characters, I wouldn't be complaining, whether the implementation mechanism involved variant selectors or otherwise.
Okay, that's fair. The support for the selectors is very half-assed and there's no other good mechanism.
It doesn't matter to me what bullshit theoretical semantics excuse there is; for practical purposes it means that Unicode (and hence UTF-8) is insufficient for displaying every human language, especially if you want Chinese and Japanese in the same document/context without switching fonts (like, say, a website).
IMO, the sin of Unicode is that they didn't just pick local language authorities and give them standardized concepts like lines and characters, and start-of-language and end-of-language markers.
Lots of Unicode issues come from handling languages that the code is not expecting, and code currently has no means to select or report quirk support.
I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. They are getting involved anyway, and these problems are popping up anyway.
This doesn't self-synchronize. Removing an arbitrary byte from the text stream (e.g. an SOL / EOL marker) would change the meaning of codepoints far away from the site of the corruption.
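For contrast, UTF-8 itself does self-synchronize, which is easy to demonstrate (a quick Python sketch):

```python
# Lead bytes and continuation bytes are distinguishable in UTF-8, so
# dropping a byte corrupts at most one character instead of shifting
# the meaning of everything that follows.
data = "żółć".encode("utf-8")    # 8 bytes, 2 per letter
broken = data[:1] + data[2:]     # drop one continuation byte
print(broken.decode("utf-8", "replace"))   # '�ółć' -- damage stays local
```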
What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.
I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.
> start-of-language and end-of-language markers
Unicode used to have language tagging, but the tag characters have been (mostly) deprecated.
The lack of such markers prevents Unicode from encoding strings of mixed Japanese and Chinese text correctly. Or in the case of a piece of software that must accept both Chinese and Japanese names for different people, Unicode is insufficient for encoding the written forms of the names.
I’m working with Word documents in different languages, and few people take the care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.
The other historical background is that when Unicode was designed, many national character sets and encodings existed, and Unicode’s purpose was to serve as a common superset of those, as otherwise you’d need markers when switching between encodings. So the existing encodings needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. This was the value proposition of Unicode, to get rid of the case distinctions between national character sets as much as possible. As a sibling comment notes, originally there were also optional language markers, which however nobody used.
I bet the complex-file-format thing probably started with CJK. They wanted to compose Hangul, and later someone had a bright idea to do the same to change the look of emojis.
Don't worry, AI is the new hotness. All they need is to unpack prompts into arbitrary images, and finally unicode will be truly unicode; all our problems will be solved forever.
>so he just selects a subset he can handle and bans everything else.
Yes? And the problem is?
The problem is the scale at which it happens and the lack of ready-to-go methods in most runtimes/libs. No one and nothing is ready for Unicode complexity out of the box, and there's little interest in unscrewing it oneself, because it looks like an absurd minefield and likely is one, from the perspective of an average developer. So they get defensive by default, which results in $subj.
The next guy with a different subset? :)
The subset is mostly defined by the jurisdiction you operate in, which usually defines a process to map names from one subset to another and is also in the business of keeping the log of said operation. The problem is not operating in a subset, but defining it wrong and not being aware there are multiple of those.
If different parts of your system operate in different jurisdictions (or interface with other systems that do), you have to pick multiple subsets and ask the user to provide input for each of them.
You just can't put anything other than ASCII into either a payment card or a PNR, the minimal-length rules differ between the two, and you can't put ASCII into a government database which explicitly rejects all ASCII letters.
Well, the labels of the input fields are written in English, yet the user enters their name in their native language.
What's the reason for having a name at all? You can call the person by this name. But if I write you my name in my language, what can you (not knowing how to read it) do? Only "hey, still-don't-know-you, here is your info".
In my foreign passport I have my name __transliterated__ into the Latin alphabet. Shouldn't this be the case for other places?
A system not supporting non-latin characters in personal names is pitiful, but a system telling the user that they have an invalid name is outright insulting.
That’s the best one of the lot. "Dein Name ist ungültig", "Your name is invalid", written with the informal word for "your".
They're trying to say that you and the server are very close friends, you see? No, no, I get this is not correct, just a joke...
It seems ridiculous to apply form validation to a name, given the complexity of the charsets involved. I don't even validate email addresses. I remember this wonderful explainer of why your email validation regex is wrong: https://www.netmeister.org/blog/email.html
In HTML, if you use <input type="email"> you basically get regex validation. While it doesn't fully follow the RFC, it's a good middle ground, and it gives you an email that you can use on the internet (the RFC obviously allows some cases that are outside this scope). That's why I tend to prefer what's defined in the standard: https://html.spec.whatwg.org/multipage/input.html#email-stat...
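For reference, the spec's check boils down to one deliberately RFC-violating regex; here it is transcribed into Python as a sketch (double-check it against the living standard before relying on it, since I may be misremembering details):

```python
import re

# The WHATWG "valid email address" pattern, as I recall it from the spec.
WHATWG_EMAIL = re.compile(
    r"^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$"
)

print(bool(WHATWG_EMAIL.match("user+tag@example.com")))  # True
print(bool(WHATWG_EMAIL.match("stępień@example.com")))   # False -- ASCII only
```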
Situations like these regularly make me feel ashamed about being a software developer.
I've got a good feel now for which forms will accept my name and which won't, though mostly I default to an ASCII version for safety. Similarly, I've found a way to mangle my address to fit a US house/state/city/zip format.
I don't feel unwelcome; I empathize with the developers. I'd certainly hate to figure out address entry for all countries. At least the US format is consistent across websites, I can have a high degree of confidence that it'll work in the software, and my local postal service knows what to do because they see it all the time.
At the end of the day, a postal address is printed to an envelope or package as a single block of text and then read back and parsed somehow by the people delivering the package (usually by a machine most of the way, but even these days more by humans as the package gets closer to the destination). This means that, in a very real sense, the "correct" way to enter an address is into a single giant multi-line text box with the implication that the user must provide whatever is required to be printed onto the mailing label such that a carrier will successfully be able to find them.
Really, then, the reasons why we bother trying to break out an address into multiple parts is not directly caused by the need for an address at all: it is because we 1) might not trust the user to provide for us everything required to make the address valid (assuming the country or even state, or giving us only a street address with no city or postal code... both mistakes that are likely extremely common without a multi-field form), or 2) need to know some subset of the address ourselves and do not trust ourselves to parse back the fuzzy address the same way as the postal service might, either for taxes or to help establish shipping rates.
FWIW, I'd venture to say that #2 is sufficiently common -- as if you need a street address for shipping you are going to need to be careful about sales taxes and VAT, increasingly often even if you aren't located in the state or even country to which the shipment will be made -- that it almost becomes nonsensical to support accepting an address for a location where you aren't already sure of the format convention ahead of time (as that just leads you to only later realizing you failed to collect a tax, will be charged a fortune to ship there, or even that it simply isn't possible to deliver anything to that country)... and like, if you don't intend to ship anything, you actually do not need the full address anyway (credit cards, as an obvious example, don't need or use the full address).
You can grab JSON data of all ISO-recognized countries and their address formats on GitHub (apologies, I forget the repo name; IIRC there is more than one).
I don't know if it's 100% accurate, but it's not very hard to implement it as part of an address entry form. I think the main issue is that most developers don't know it exists.
OMG, the second screenshot might actually be the application I am working on right now...
It's like with phone numbers. Some people assume they contain only digits.
It's really not that hard, though. PCRE regexes support Unicode letter classes. There is really no excuse for this type of issue.
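A sketch of what that looks like, using the third-party `regex` module since the stdlib `re` lacks PCRE-style `\p{...}` classes (the exact character set here is illustrative, not a recommendation):

```python
import regex  # pip install regex

# Letters from any script, combining marks, plus a few name-ish symbols.
name_ok = regex.compile(r"^[\p{L}\p{M}' \-]+$")

for name in ("Stępień", "O'Brien", "ꦱꦭꦪꦤ", "<script>"):
    print(name, bool(name_ok.match(name)))
# Stępień True / O'Brien True / ꦱꦭꦪꦤ True / <script> False
```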
`..*` should be a good validation regexp. At that point you might as well check only that the length is non-zero and that the name is valid Unicode, and nothing more. Well, OK, maybe look at UTR #36 and maybe disallow / filter out non-printing characters, private-use characters, and whatnot.
How do I allow "stępień" while detecting Zalgo-isms?
Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.
n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row, at which point you can bump up n to somewhere around the maximum number of combining modifiers language X is likely to have, add a special case to say "this is probably language X, so we don't look for Zalgos", or just give up, put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
n=2 is common in Việt Nam (vowel sound + tonal pitch).
Yet Vietnamese can be written in Unicode without any combining characters whatsoever - in NFC normalization each character is one code point - just like the U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
u/egypurnash's point was about limiting glyph complexity. You could canonically decompose then look for more than N (say, N=3) combining codepoints in a row and reject if any are found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing first might be a good first step.
[0] I say conceptually because you might implement a form-insensitive Zalgo detector that looks at each non-combining codepoint, looks it up in the Unicode database to find how many combining codepoints one would need if canonically decomposing and call that `n`, then count from there all the following combining codepoints, and if that exceeds `N` then reject. This approach is fairly optimal because most of the time most characters in most strings don't decompose to more than one codepoint, and even if they do you save the cost of allocating a buffer to normalize into and the associated memory stores.
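Putting the footnote's idea into code, a minimal sketch (this version just normalizes up front rather than the allocation-free variant described above; n=3 is an arbitrary threshold):

```python
import unicodedata

def looks_zalgo(text: str, n: int = 3) -> bool:
    # Decompose, then reject any run of more than n combining marks.
    run = 0
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            run += 1
            if run > n:
                return True
        else:
            run = 0
    return False

print(looks_zalgo("Stępień"))             # False -- one mark per letter after NFD
print(looks_zalgo("ệ"))                   # False -- dot below + circumflex is 2
print(looks_zalgo("e" + "\u0301" * 10))   # True
```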
I can point out that Greek needs n=2: for accent and breathing.
There's nothing special about "Stępień", it has no combining characters, just the usual diacritics that have their own codepoints in Basic Multilingual Plane (U+0119 and U+0144). I bet there are some names out there that would make it harder, but this isn't one.
If you decompose then it uses combining codepoints. Still nothing special.
I could answer your question better if I knew why you need to detect Zalgo-isms.
We have a whitelist of allowed characters, which is a pretty big list.
I think we based it on Lodash's deburr source code. If deburr's output is a-z and some common symbols, it passes (and we store the original value).
If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.
Pfft, "Dein Name ist ungültig" (your name is invalid). Let's get straight to the point, it's the user's fault for having a bad name, user needs to fix this.
Fun fact: there is a semi-standard encoding called WTF-8, which is UTF-8 extended so that it can represent non-well-formed UTF-16 (bad surrogate code points).
It's used in situations like when a UTF-8-based system has to interact with Windows file paths.
I totally get that companies are probably more successful using simple validation rules that work for the vast majority of names, rather than accepting everything just so that some person with no name, or someone whose name cannot possibly be expressed or at least transliterated in Unicode, can use their services.
But that person's name has no business failing validation. They fucked up.
My first name is hyphenated. I still find forms that reject it. My favorite was one that said "invalid first name."
What would be wrong with "enter your name as it appears in the machine-readable zone of your passport" (or "would appear" for people who have never gotten one)? Isn't that the one standard format for names that actually is universal?
The problem is, people exist who have É in their name and will go to court when you spell it as E, and the court will say that 1) you have the technical ability to write it as É and 2) they have a right to have their name spelled correctly. Also, it's not nice, and it's bad for business, to be like this.
> they have a right to have their name spelled correctly
IMO, having the law consider this as an unconditional right is the root of the problem. What happens when people start making up their own characters that aren't in Unicode to put in their names?
> Also it's not nice and bad for business to be like this.
What about having a validated "legal name" for everything official and an unvalidated "display name" that's only ever parroted back to the person who entered it?
> What happens when people start making up their own characters that aren't in Unicode to put in their names?
They first have to fight the Unicode committee, and maybe they actually have a point and the character is made up in a way that is acceptable in society. Then they will fight their local authorities, who run everything on 30-year-old systems. Only after that do they become your problem, at which point you fix your cursed regexp.
>an unvalidated "display name" that's only ever parroted back to the person who entered it?
You will do that wrong too. When you send me an email, I would expect my name to be in a different form compared to what you display in the active-user widget.
The point is, you need to know the exact context in which the name is used and also communicate it to me so I can tell you the right thing to display.
I would like to use my name as my parents gave it to me, thanks. Is that too much to ask for?
How much flexibility are we giving parents in what they name children?
If a parent invented a totally new glyph, would supporting that be a requirement?
Luckily, there is a vital-records office which already bothered to check with the law on that matter, and if they can write it, so can you.
There's the problem that "appears" is a visible phenomenon, and Unicode strings can contain non-visible characters and multiple ways to represent the same visible information. Normalization is supposed to help here, but some sites may fail to do this or do it incorrectly, etc.
But the MRZ of a passport doesn't contain any of those problem characters.
But on some random website, with people copy-pasting from who-knows-what, they will have to normalize or validate, etc. to deal with such characters.
The point is if the standard were to enter it like in the MRZ, it would be easy to validate.
From experience, it's not actually universal. Visa applications will often ask for name information not encoded in the MRZ.
I think the bottom of the page is you missing the joke. It's showing only the name letters that get rejected everywhere else. Similarly for the URL, the URL renders his name correctly when you browse to it in a modern browser. What you've copied is the canonical fallback for unicode.
Under GDPR you have the legal right for your name to be stored and processed with the correct spelling in the EU.
I wouldn't be surprised if that created kafkaesque problems with other institutions that require name to match the bank account exactly, and break/reject non-ASCII at the same time.
I know an Åsa who became variously Åsa, Aasa and Asa after moving to a non-Scandinavian country. That took a while to untangle, and caused some of the problems you describe.
Spelled with an Angstrom, or with a Latin Capital Letter A With Ring Above?
The second. It’s the 27th letter of the Swedish alphabet.
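For what it's worth, the two code points at least normalize together; NFC folds the Angstrom sign into the letter:

```python
import unicodedata
print(unicodedata.normalize("NFC", "\u212b") == "\u00c5")  # True
```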
This does not only apply to banks. The specific court case was brought against a bank, but the law as is applies to any and everyone who processes your personal data.
It’s a general right to have incorrect personal data relating to you rectified by the data processor.
No, anywhere where your name is used.
I lost count of the projects where this was an issue. US and Western European-born devs are oblivious to this problem and it ends up catching them over and over again.
Yeah, it's amazing. My language has a Latin-based alphabet but can't be represented with ISO 8859-1 (aka the Latin-1 charset) so I used to take it for granted that most software will not support inputs in the language... 25 years ago. But Windows XP shipped with a good selection of input methods and used UTF-16, dramatically improving things, so it's amazing to still see new software created where this is somehow a problem.
Except that now there's no good excuse. Things like the name in the linked article would just work out of the box if it weren't for developers actually taking the time to break them by implementing unnecessary and incorrect validation.
I can think of very few situations, where validation of names is actually warranted. One that comes to mind is when you need people's ICAO 9303 compliant names, such as on passports or airline systems. If you need to make sure you're getting the name the person has in their passport's MRZ, then yes, rejecting non-ASCII characters is correct, but most systems don't need to do that.
Software has been gaslighting generations of people around the world.
Side note: not a bad way to skirt surveillance though.
A name like “stępień” will without a doubt have many ambiguous spellings across different intelligence gathering systems (RUMINT, OSINT, …). Americans will probably spell it as “Stefen” or “Steven” or “Stephen”, especially once communicated over phone.
See the Companies House XSS injection situation, where their rationale for forcing a business to change its name was that others using their database could be vulnerable: https://www.theregister.com/2020/10/30/companies_house_xss_s...
There's at least one major exception to the treat-strings-as-opaque-blobs rule: Unicode normalization.
It's possible for the same logical character to be represented by two different sequences of code points (for example, a-with-umlaut as a single character vs. a followed by the combining umlaut diacritic). Related: distinguishing between the "a" character in Latin, Greek, Cyrillic, and the handful of other places it shows up throughout Unicode.
This comes up in at least 3 ways:
1. A usability issue. It's not always easy to predict which identical-looking variant is produced by different input methods, so users enter identical-looking characters on different devices but get an "account not found" error.
2. A security issue. If some of your backend systems handle these kinds of characters differently, that can cause all kinds of weird bugs, some of which can be exploited.
3. An abuse issue. If it's possible to create accounts with the same-looking name as others that aren't the same account, there can be vectors for impersonation, harassment, and other issues.
So you have to make a policy choice about how to handle this problem. The only things that I've seen work are either restricting the allowed characters (often just to printable ASCII) or being very clear and strict about always performing one of the standard Unicode normalizations. But doing that transformation consistently across a big codebase has some real challenges: in particular, results can change based on Unicode version, and guaranteeing that all potential services use the same Unicode version is really non-trivial. So lots of people make the (sensible) choice not to deal with it.
But yeah, agreed that parentheses should be OK.
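A minimal sketch of the usability failure in point 1: the same visible name produced by two input methods, and one standard normalization fixing the comparison:

```python
import unicodedata

nfc = "\u00e4"      # 'ä' as a single precomposed code point
nfd = "a\u0308"     # 'a' plus combining diaeresis -- looks identical
print(nfc == nfd)                                  # False: "account not found"
print(unicodedata.normalize("NFC", nfc) ==
      unicodedata.normalize("NFC", nfd))           # True after normalizing both
```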
Something we just ran into: there are two Unicode code points for the @ character, the normal one and the fullwidth at sign, U+FF20. It took a lot of head-scratching to understand why several Japanese users could not be found by their email address when I was seeing their email right there in the database.
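That one is a compatibility character, so NFKC normalization would have folded it before the lookup (a sketch, not a claim about your actual stack):

```python
import unicodedata
print(unicodedata.normalize("NFKC", "user\uff20example.jp"))  # user@example.jp
```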
> or if you have to interact with some legacy service.
Which happens almost every day in the real world.
You can treat names as byte blobs for as long as you don't use them for their purpose -- naming people.
Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?
>I think there're very few exceptions to this, probably something law-related, or if you have to interact with some legacy service.
Few exceptions for you is the entirety of the service for others. At the very least, you interact with the legacy software of payment systems, which has some ideas about what names should be.
> Suppose you have a unicode blob of my name in your database and there is a problem and you need to call me and say hi. Would your customer representative be able to pronounce my name somewhat correctly?
You cannot pronounce the name regardless of whether it is written in ASCII. Pronouncing a name requires at the very least knowledge of the language it originated in, and attempts at reading it with an English pronunciation can range from incomprehensible to outright offensive.
The only way to correctly deal with a name that you are unfamiliar with the pronunciation of is to ask how it is pronounced.
You must store and operate on the person's name as is. Requiring a name modified, or modifying it automatically, is unacceptable - in many cases legal names must be represented accurately as your records might be used for e.g. tax or legal reasons later.
> Would your customer representative be able to pronounce my name somewhat correctly?
Are you implying the CSR's lack of familiarity with the pronunciation of your name means your name should be stored/rendered incorrectly?
Quite the opposite, actually. I want it stored correctly, in a way that both me and the CSR can understand, and so it can be used to interface with other systems.
I don't, however, know which Unicode subset to use, because you didn't tell me in the signup form. I have many options, all of them correct, but I don't know whether your CSR can read Ukrainian Cyrillic, or whether you can tell what the vocative case is and avoid using it when interfacing with the government CA, which expects the nominative.
I think you're touching on another problem, which is that we as users rarely know why the form wants a name. Is it to be used in emails, or for sending packages, or for talking to me?
My language also has a separate vocative case, but I live in a country that has no concept of it and just vestiges of a case system. I enter my name in the nominative, which then of course looks weird if I get emails/letters from them later - they have no idea to use the vocative. If I knew the form is just for sending me emails, I'd maybe enter my name in the vocative.
Engineers, or UX designers, or whoever does this, like to pretend names are simple. They're just not (obligatory reference to the "falsehoods about names" article). There are many distinct cases for why you may want my name and they may all warrant different input.
- Name to use in letters or emails. It doesn't matter if a CSR can pronounce this if it's used in writing, it should be a name I like to see in correspondence. Maybe it's in a script unfamiliar to most CSRs, or maybe it's just a vocative form.
- Name for verbal communication. Just about anything could be appropriate depending on the circumstances. Maybe an anglicized name I think your company will be able to pronounce, maybe a name in a non-Latin script if I expect it to be understood here, maybe a name in a Latin-extended script if I know most people will still say it reasonably well intuitively. But it could also be an entirely different name from the written one if I expect the written one to be butchered.
- Name for package deliveries. If I'm ordering a package from abroad, I want my name (and address) written in my local convention - I don't care if the vendor can't read it, first the package will make its way to my country using the country and postal code identifiers, and then it should have info that makes sense to the local logistics companies, not to the seller's IT system.
- Legal name because we're entering a contract or because my ID will be checked later on for some reason.
- Machine-readable legal name for certain systems like airlines. For most of the world's population, this is not the same as the legal name but of course English-language bias means this is often overlooked.
In this specific case, it seems like your concerns are a hypothetical, no?
Not really, no. A lot of us only really have to deal with English-adjacent input (i.e. European languages that share the majority of character forms with English, or cultures that explicitly Anglicise their names when dealing with English folks).
As soon as you have to deal with users with a radically different alphabet/input-method, the wheels tend to come off. Can your CSR reps pronounce names written in Chinese logographs? In Arabic script? In the Hebrew alphabet?
You can analyze the name and direct a case to a CSR who can handle it. May be unrealistic for a 1-2 person company, but every 20+ person company I’ve worked at has intentionally hired CSRs with different language abilities.
First off, no, you can't infer language preference from a name. The reasonable and well-meaning assumption about my name on a good day makes me only sad and irritated.
And even if you could, I don't know if you actually do it by looking at what your signup form asks me to input.
A requirement to do that is an extremely broad definition of "treat strings as opaque blobs most of the time" IMHO :)
>Would your customer representative be able to pronounce my name somewhat correctly?
Typical input validation doesn't really solve the problem. For instance, I could enter my name as 'Vrdtpsk,' which is a perfectly valid ASCII string that passes all validation rules, but no one would be able to pronounce it correctly. I believe the representative (if on a call) should simply ask the customer how they would like to be addressed. Unless we want to implement a whitelist of allowed names for customers to choose from...
Derek would like a word.
https://www.youtube.com/watch?v=hNoS2BU6bbQ
Many Japanese companies require an alternative name entered in half width kana to alleviate this exact problem. Unfortunately, most Japanese websites have a million other UX problems that overshadow this clever solution to the problem.
This is a problem specific to languages using Chinese characters where most only know some characters and therefore might not be able to read a specific one. Furigana (which is ultimately what you're providing in a separate field here) is often used as a phonetic reading aid, but still requires you to know Japanese to read and pronounce it correctly.
The only generic solution I can think of would be IPA notation, but it would be entirely unreasonable to expect someone to know the IPA for their name, just as it would be unreasonable to expect a random third party to know how to read IPA and replicate the sounds it described.
Absolutely not - do not build anything based on "would your CSR be able to pronounce" something - that's an awful bar - most CSRs can't pronounce my name - would I be excluded from your database?
Seriously, what are you going for here?
That’s the most basic consideration for names, unless you only show it to the user themselves — other people have to be able to read it at least somehow.
Which is why the bag-of-Unicode-bytes approach is as wrong as telling Stęphań he has an invalid name.
Absolutely not. There's no way to know what a given reader's capability is. There's no way to know how a person pronounces their name by simply reading it; this only works for common names.
And here we go again, engineers expecting the world should behave fitting their framework du jour. Unfortunately, the real world doesn't care about our engineering bubble and goes on with life - where you can be called !xóõ Kxau or ꦱꦭꦪꦤ or X Æ A-12.
> Would your customer representative be able to pronounce my name somewhat correctly?
Worst case, just drop to hexadecimal.
> If you treat your strings as opaque blobs, and use UTF8, most of internationalization problems go away
This is laughably naive.
So many things can go wrong.
Strings are not arrays of bytes.
There is a price to pay if someone doesn't understand that or chooses to ignore it.
> Strings are not arrays of bytes.
That very much depends on the language that you are using. In some, they are.
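Python 3, for one, keeps them firmly apart, which makes the distinction easy to show:

```python
s = "Stępień"
print(len(s))                  # 7 -- code points
print(len(s.encode("utf-8")))  # 9 -- bytes; the two lengths differ
```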
RTL go brrr
RTL is so much fun, it's the gift that keeps on giving. When I first encountered it I thought, OK, maybe some junior web app developers will sometimes forget that it exists and a fun bug or two will get into production, but it's everywhere: Windows, GNU/Linux, automated emails. It can make malware harder for users to detect on Windows, because you can hide the ".exe" at the beginning of the filename, etc.
Here it is today in GNOME 46.0, after so many years, this should say "selected": https://github.com/user-attachments/assets/306737fb-6b01-467... In previous GNOME versions it would mess up even more text in the file properties window.
Here's an article about it, but I couldn't find the more interesting blogpost about RTL: https://krebsonsecurity.com/2011/09/right-to-left-override-a...
And yet when stored on any computer system, that string will be encoded using some number of bytes. Which you can set a limit on even though you cannot cut, delimit, or make any other inference about that string from the bytes without doing some kind of interpretation. But the bytes limit is enough for the situation the OP is talking about.
A coworker once implemented a name validation regex that would reject his own name. It still mystifies me how much convincing it took to get him to make it less strict.
I know multiple developers who would just say "well it's their fault, they have to change name then".
I worked with an office of Germans who insisted that ASCII was sufficient. The German language uses letters that cannot be represented in ASCII.
In fairness, they mostly wanted stuff to be in English and, when necessary, to transliterate German characters into their ASCII counterparts (in German there is a standardised way of doing this), so I can understand why they didn't see it as necessary. I just never understood why I, as the non-German, was forever the one trying to convince them that Germans would probably prefer to use their software in German...
I’ve run into a similar-ish situation working with East-Asian students and East-Asian faculty. Me, an American who wants to be clear and make policies easy for everybody to understand: worried about name ordering a bit (Do we want to ask for their last name or their family name in this field, what’s the stupid learning management system want, etc etc). Chinese co-worker: we can just ask them for their last names, everybody knows what Americans mean when they ask for that, and all the students are used to dealing with this.
Hah, fair enough. I think it was an abstract question to me, so I was looking for the technically correct answer. Practical question for him, so he gave the practical answer.
There are some valid reasons to use software in English as a German speaker. Main among those is probably translations.
If you can speak English, you might be better off using the software in English, as having to deal with the English language can often be less of a hassle than having to deal with inconsistent, weird, or outright wrong translations.
Even high-quality translations might run into issues where the same thing is translated once as "A" and then as "B" in another context. Or run into issues where there is an English technical term being used that has no perfect equivalent in German (i.e. a translation does exist, but is not a well-known, clearly defined technical term). More often than not, though, translations are anything but high quality. Even in expensive products from big international companies.
You should have asked how they would encode the German currency sign (€ for euro) in ASCII or its German counterpart latin1/iso-8859-1...
It's not possible. However, I bet they would argue for using iso-8859-15 (latin9 / latin0) with the international currency sign (¤) instead, or insist that char 128 of latin1 is almost always meant as €, so just ignore the standard in these cases and use a new font.
This would only fail in older printers and who is still printing stuff these days? Nobody right?
Using real utf-8 is just too complex... All these emojis are nuts
EUR is the common answer.
or just double all the numbers and use DM
Weirdly the old Deutsche Mark doesn't seem to have its own code point in the Currency Symbols block starting at U+20A0, whereas the Spanish equivalent (the peseta, ₧, not just Pt) does.
TIL
https://www.compart.com/en/unicode/block/U+20A0
Even Bitcoin is there. And "German Penny Sign"?
> I just never understood why I, as the non-German, was forever the one trying to convince them that Germans would probably prefer to use their software in German...
I cannot know, but they could be ideological. For example, maybe they found it wonderful to use plain ASCII, with no need for special keyboard layouts or anything like that, and decided that German would be much better off without its non-ASCII characters. They could believe something like this and not say it aloud in the discussion with you, because it is irrelevant to the discussion: you weren't trying to change German.
Is name validation even possible?
In certain cultures yes. Where I live, you can only select from a central, though frequently updated, list of names when naming your child. So theoretically only (given) names that are on that list can occur.
Family names are not part of this, but maybe that exists elsewhere too. I don't know how people whose names were given before this list was established are handled, however.
An alternative method, which is again culture-dependent, is to use virtual governmental IDs for this purpose. Whether this is viable in practice I don't know; I've never implemented such a thing. But on the surface, it should be.
>So theoretically only (given) names that are on that list can occur.
Unless of course immigration is allowed and doesn't involve changing a name.
Not the OP, but immigration often involves changing your name in the way digital systems store and display it. For example, from محمد to Muhammad or from 陳 to Chen. Ideally the pronunciation stays the same, but obviously there are often slight differences. And if the differences are annoying or confusing, someone might choose an entirely different name as well.
Yes but GP said
> Where I live, you can only select from a central, though frequently updated, list of names when naming your child
I was born in such a country too and still have frequent connections there, and I can confirm the laws only apply to citizens of said country, so indeed immigration creates exceptions to this rule, even if immigrants transliterate their names.
I still don't see how any system in the real world can safely assume its users only have names from that list.
Even if you try to imagine a system for a hospital to register newly born babies... What happens if a pregnant tourist is visiting?
For example, in Iceland you don't have to name the baby immediately, and the registration times are different for foreign parents: https://www.skra.is/english/people/registration-of-children/...
Of course then you may fall foul of classic falsehood 40: People have names.
For today's lucky 10,000: Falsehoods programmers believe about names (https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...)
For today's lucky 10,000: Ten Thousand (https://xkcd.com/1053/)
The name a system knows you as doesn’t need to correspond to your legal name or what you are called by others.
With plenty of attitude of course :)
I've only ever interacted with freeform textfields when inputting my name, so most regular systems clearly don't dare to attempt this.
But if somebody was dead set on only serving local customers or having only local personnel, I can definitely imagine someone being brave(?) enough.
This assumes every resident is born and registered in said country, which is a silly assumption. Surely, any service catering only to "naturally born citizens" is discriminatory and illegal?
Obviously, foreigners just living or visiting here will not have our strictly local names (thinking otherwise is what would be "silly"). Locals (people with my nationality, so either natural or naturalized citizens) will (*).
(*) I read up on it though, and it seems like exceptions can be requested and allowed, if it's "well supported". Kinda sours the whole thing unfortunately.
> is discriminatory and illegal?
Checked this too (well, using Copilot), it does appear to be illegal in most contexts, although not all.
But then, why would you want to perform name verification specific to my culture? One example I can think of is limiting abuse on social media sites. I vaguely recall Facebook being required to do something like this about a decade ago (although they clearly did not go about it this way).
> Surely, any service catering only to "naturally born citizens" is discriminatory and illegal?
No, that's also a question that is culturally dependent. In some contexts it's normal and expected.
I read that Iceland asks people to change their names if they naturalise there (because of the -sson or -dottir surname suffix).
But your point stands - not everyone in the system will follow this pattern.
Yes, it is essential when you want to avoid doing business with customers who have invalid names.
You joke, but when a customer wants to give your company their money, it is our duty as developers to make sure their names are valid. That is so business critical!
It's not just business-critical, it's also mandatory to get right under GDPR.
In legitimate retail, take the money, has always been the motto.
That said, recently I learned about monetary policy in North Korea and sanctions on the import of luxury goods.
Why Nations Fail (2012) by Daron Acemoglu and James Robinson
https://en.wikipedia.org/wiki/United_Nations_Security_Counci...
What are “invalid names” in this context? Because, depending on the country the person was born in, a name can be literally anything, so I’m not sure what an invalid name looks like (unless you allow an `eval` of sorts).
The non-joke answer for Europe is extended Latin, dashes, spaces and the apostrophe, separated into two (or three) distinct ordered fields. Just because a name is originally written in a different script doesn't mean it will be printed only in that script on your id in the country of residence, or in the travel document issued at home. My name isn't written in Latin characters and it's fine. I know you can't even try to pronounce them, so I have it spelled out in the above-mentioned Latin script.
Obligatory xkcd https://xkcd.com/327/
What if your customer is the artist formerly known as Prince or even X Æ A-12 Musk?
Prince: "Get over yourself and just use your given name." (Shockingly, his given name actually is Prince; I first thought it was only a stage name)
Musk: Tell Elon to get over his narcissism enough to not use his children as his own vanity projects. This isn't just an Elon problem, many people treat children as vanity projects to fuel their own narcissism. That's not what children are for. Give him a proper name. (and then proceed to enter "X Æ A-12" into your database, it's just text...)
There are of course some people who'll point you to a blog post saying no validation is possible.
However, for every 1 user you get whose full legal name is bob@example.com, you'll get 100 users who put their e-mail into the name field by accident.
And for every 1 user who wants to be called e.e. cummings you'll get 100 who just didn't reach for the shift key and who actually prefer E.E. Cummings. But you'll also get 100 McCarthys and O'Connors and al-Rahmans who don't need their "wrong" capitalisation "fixed" thank you very much.
Certainly, I think you can quite reasonably say a name should be comprised of between 2 and 75 characters, with no newlines, nulls, emojis, leading or trailing spaces, invalid unicode code points, or angle brackets.
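One possible reading of that rule, as a hedged Python sketch (len() counts code points rather than grapheme clusters, and the emoji test is a crude block-range approximation; both knowingly imperfect):

    import re

    # Rough approximation of "emoji": the main emoji and symbol blocks only.
    EMOJI_ISH = re.compile(r"[\u2600-\u27BF\U0001F000-\U0001FAFF]")

    def plausible_name(name: str) -> bool:
        return (
            2 <= len(name) <= 75
            and name == name.strip()                      # no leading/trailing spaces
            and not any(ch in name for ch in "\n\x00<>")  # newlines, nulls, angle brackets
            and not any(0xD800 <= ord(ch) <= 0xDFFF for ch in name)  # lone surrogates
            and not EMOJI_ISH.search(name)
        )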
Don't validate names; use transliteration to make them safe for postal services (or whatever). In SQL this is COLLATE; on the command line you can use uconv:
>echo "'Lódź'" | uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
>'Lodz'
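For the decomposable cases, a stdlib-only Python approximation of that transform looks like this (a sketch; uconv's Latin-ASCII rules go far beyond what NFKD decomposition can do):

    import unicodedata

    def strip_diacritics(s: str) -> str:
        # NFKD splits base letters from combining marks; drop the marks.
        return "".join(ch for ch in unicodedata.normalize("NFKD", s)
                       if not unicodedata.combining(ch))

    print(strip_diacritics("Lódź"))  # 'Lodz'
    print(strip_diacritics("Łódź"))  # 'Łodz' -- Ł has no decomposition,
                                     # so only an explicit Ł -> L rule
                                     # (like uconv's) catches it

Which is part of why Ł keeps coming up in this thread: it's a separate letter, not an L plus a mark.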
If I ever make my own customer facing product with registration, I'm rejecting names with 'v', 'x' and 'q'. After all, these characters don't exist in my language, and foreign people can always transliterate them to 'w', 'ks' or 'ku' if they have names with weird characters.
The name of the city has the L with stroke (pronounced as a W), so it’s Łódź.
And the transliteration in this case is so far from the original that it's barely recognisable for me (three out of four characters are different and as a native I perceive Ł as a fully separate character, not as a funny variation of L)
The fact that it's pronounced as Вуч and not Лодж still triggers me.
I just looked up the Russian wikipedia entry for it, and it's spelled "Лодзь", but it sounds like it's pronounced "Вуджь", and this fact irritates the hell out of me.
Why would it be transliterated with an Л? And an О? And a з? None of this makes sense.
> Why would it be transliterated with an Л?
Because it _used_ to be pronounced this way in Polish! "Ł" pronounced as "L" sounds "theatrical" these days, but it was more common in the past.
It's a general pattern of what russia does to the names of places and people, which is aggressively imposing their own cultural paradigm (and which follows the more general pattern). You can look up your civil code's provisions around names and ask a question or two about what historical problem they attempt to solve.
It's not a Russian-specific thing by any stretch.
This happens all the time when names and loanwords get dragged across linguistic boundaries. Sometimes it results from an attempt to "simplify" the respective spelling and/or sounds (by mapping them into tokens more familiar in the local environment); sometimes there's a more complex process behind it; and other times it just happens for various obscure historical reasons.
And the mangling/degradation definitely happens in both directions: hence Москва → Moscow, Paris → Париж.
In this particular case, it may have been an attempt to transliterate from the original Polish name (Łódź), more "canonically" into Russian. Based on the idea that the Polish Ł (which sounds much closer to an English "w" than to a Russian "в") is logically closer to the Russian "Л" (as this actually makes sense in terms of how the two sounds are formed). And accordingly for the other weird-seeming mappings. Then again it could have just ended up that way for obscure etymological reasons.
Either way, how one can be "irritated as hell" over any of this (other than in some jocular or metaphorical sense) is another matter altogether, which I admit is a bit past me.
Correction - it's nothing obscure at all, but apparently a matter of a shift that occurred broadly in the L sound in Polish a few centuries ago (whereby it became "dark" and velarized), affecting a great many other words and names (like słowo, mały, etc.), while in parts east and south the "clear" L sound was preserved.
https://en.wikipedia.org/wiki/Ł
Wait until you hear what Chinese or Japanese languages do with loanwords...
"L with stroke" is the English name for it according to Wikipedia, by the way, not my choice of naming. The transliterated version is not great, considering how far removed from the proper pronunciation it is, but I'm sort of used to it. The almost-correct one above was jarring enough that I wanted to point it out.
Yeah, that'll work great..
https://en.wikipedia.org/wiki/%C4%8Celje
echo "Čelje" | uconv -f "UTF-8" -t "UTF-8" -x "Latin-ASCII"
> "Celje"
https://en.wikipedia.org/wiki/Celje
(I mean... we do have postal numbers just for problems like this, but both Štefan and Stefan are not-so-uncommon male names over here, and so are Jozef and Jožef, etc.)
If you're dealing with a bad API that only takes ASCII, "Celje" is usually better than "ÄŒelje" or "蒌elje".
If you have control over the encoding on the input side and on the output side, you should just use UTF-8 or something comparable. If you don't, you have to try to get something useful on the output side.
Most places where telling Štefan from Stefan is a problem use postal numbers for people too, and/or ask for your DOB.
I don't have a problem differentiating Štefan from Stefan; 's' and 'š' sound pretty different to everyone around here. But if someone runs that script above and transliterates 'š' to 's', it can cause confusion.
And no, we don't use "postal numbers for humans".
>And no, we don't use "postal numbers for humans".
An email, a phone number, a tax or social security number, a demographic identifier, a billing/contract number, or a combination of them.
All of those will help you tell Stefan from Štefan in the most practical situations.
>But if someone runs that script above and transliterates "š" to "s" it can cause confusion.
It's not nice, it will certainly make Štefan unhappy, but it's not like you will debit the money from the wrong account or deliver to a different address or contact the wrong customer because of that.
Yes, it's easy
    name.Length > 0

is probably pretty safe. (With the caveat that a name might not be representable in Unicode, in which case I dunno. Use an image format?)
That only works if you’re concatenating the first and last name fields. Some people have no last name and thus would fail this validation if the system had fields for first and last name.
Honestly I wish we could just abolish first and last name fields and replace them with a single free text name field since there's so many edge cases where first and last is an oversimplification that leads to errors. Unfortunately we have to interact with external systems that themselves insist on first and last name fields, and pushing it to the user to decide which is part of what name is wrong less often than string.split, so we're forced to become part of the problem.
I did this in the product where I work. We operate globally so having separate first and last name fields was making less sense. So I merged them into a singular full name field.
The first and only people to complain about that change were our product marketing team, because now they couldn’t “personalize” emails like `Hi <firstname>,`. I had the hardest time convincing them that while the concept of first and last names are common in the west, it is not a universal concept.
So as a compromise, we added a “Preferred Name” field where users can enter their first name or whatever name they prefer to be called. Still better than separate first and last name fields.
One field?
Like people have only one name... I like the Human Name from the FHIR standard: https://hl7.org/fhir/datatypes.html#HumanName
People can have many names (depending on usage and on "when": think about marriage), and even though each of those human names can contain multiple parts, the "text" field is what you should use to represent the name in UIs.
I encourage people to go check the examples the standard gives, especially the Japanese and Scandinavian ones.
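For reference, roughly the shape of that datatype, condensed into a Python literal (simplified from the spec's examples; see the link above for the real definition):

    human_name = {
        "use": "official",
        "text": "山田 太郎",    # the whole name as it should be displayed
        "family": "山田",       # optional part
        "given": ["太郎"],      # optional, repeatable part
    }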
It’s not just external systems. In many (most?) places, when sorting by name, you use the family names first, then the given names. So you can’t correctly sort by name unless you split the fields. Having a single field, in this case, is “an oversimplification that leads to errors”.
some people have no name at all
Any notable examples apart from young children and Michael Scott that one time?
I've been compiling a list of them:
You seem to have forgotten quite a few, like
See point 40 and 32-36 on Falsehoods programmers believe about names[1]
[1] https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
I know that this is trying to be helpful but the snark in this list detracts from the problem.
Whether it's healthy or not, programmers tend to love snark, and that snark has kept this list circulating and hopefully educating for a long time to this very day
What if my name is
Slim Shady?
Presumably there aren't any people with control characters in their name, for example.
Watch as someone names themselves the bell character, “^G” (ASCII code 7) [1]
When they meet people, they tell them their name is unpronounceable, it’s the sound of a PC speaker from the late 20th century, but you can call them by their preferred nickname “beep”.
In paper and online forms they are probably forced to go by the name “BEL”.
[1] https://en.wikipedia.org/wiki/Bell_character
Or Derek <wood dropping on desk>
https://www.youtube.com/watch?v=hNoS2BU6bbQ
The interaction brings to mind Grzegorz Brzęczyszczykiewicz:
https://www.youtube.com/watch?v=AfKZclMWS1U
(from the Polish comedy film "How I Unleashed World War II")
I thought this was going to be a link to the Key & Peele sketch: https://youtu.be/gODZzSOelss?t=180
I can finally change my name to something that represents my personality: ^G^C
https://en.wikipedia.org/wiki/End-of-Text_character
It's not exactly a bell, but there are clicks: https://en.wikipedia.org/wiki/Click_consonant
https://www.reddit.com/r/Damnthatsinteresting/comments/1614k...
คุณ สมชาย
This name, "คุณสมชาย" (Khun Somchai, a common Thai name), appears normal but has a Zero Width Space (U+200B) between "คุณ" (Khun, a title like Mr./Ms.) and "สมชาย" (Somchai, a given name).
In scripts like Thai, Chinese, and Arabic, where words are written without spaces, invisible characters can be inserted to signal word boundaries or provide a hint to text processing systems.
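That's easy to reproduce in Python (note that even compatibility normalization keeps U+200B, so the two strings never compare equal):

    import unicodedata

    a = "คุณสมชาย"
    b = "คุณ\u200bสมชาย"   # renders identically; zero-width space inserted

    print(a == b)                                 # False
    print(unicodedata.normalize("NFKC", b) == a)  # still False: NFKC keeps U+200B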
That reminds me of a few Thai colleagues who ended up with a legal first name of "Mr." (period included), probably as a result of this.
Buying them plane tickets to attend meetings and so on proved fairly difficult.
But C0 and C1 control codes are out, probably.
> Presumably there aren't any people with control characters in their name, for example.
Of course there are. If you commit to supporting everything anyone wants to do, people will naturally test the boundaries.
The biggest fallacy programmers believe about names is that getting name support 100% right matters. Real engineers build something that works well enough for enough of the population and ship it, and if that's not US-ASCII only then it's usually pretty close to it.
Or unpaired surrogates. Or unassigned code points. Or fullwidth characters. Or "mathematical bold" characters. Though the latter two should be probably solved with NFKC normalization instead.
> Or unpaired surrogates.
That’s just an invalid Unicode string, then. Unicode strings are sequences of Unicode scalar values, not code points.
> unassigned code points
Ah, the tyranny of Unicode version support. I was going to suggest that it could be reasonable to check all code points are assigned at data ingress time, but then you urgently need to make sure that your ingress system always supports the latest version of Unicode. As soon as some part of the system goes depending on old Unicode tables, some data processing may go wrong!
How about Private Use Area? You could surely reasonably forbid that!
> fullwidth characters
I’m not so comfortable with halfwidth/fullwidth distinctions, but couldn’t fullwidth characters be completely legitimate?
(Yes, I’m happy to call mathematical bold, fraktur, &c. illegitimate for such purposes.)
> solved with NFKC normalization
I’d be very leery of doing this on storage; compatibility normalisations are fine for equivalence testing, things like search and such, but they are lossy, and I’m not confident that the lossiness won’t affect legitimate names. I don’t have anything specific in mind, just a general apprehension.
That sounds like a reasonable assumption, but probably not strictly correct.
It's safe to reject Cc, Cn, and Cs. You should probably reject Co as well, even though elves can't input their names if you do that.
Don't reject Cf. That's asking for trouble.
Explanation for those not accustomed, based on <https://www.unicode.org/reports/tr44/#GC_Values_Table> (with my own commentary):
Cc: Control, a C0 or C1 control code. (Definitely safe to reject.)
Cn: Unassigned, a reserved unassigned code point or a noncharacter. (Safe to reject if you keep up to date with Unicode versions; but if you don’t stay up to date, you risk blocking legitimate characters defined more recently, for better or for worse. The fixed set of 66 noncharacters are definitely safe to reject.)
Cs: Surrogate, a surrogate code point. (I’d put it stronger: you must reject these, it’s wrong not to.)
Co: Private_Use, a private-use character. (About elf names, I’m guessing samatman is referring to Tolkien’s Tengwar writing system, as assigned in the ConScript Unicode Registry to U+E000–U+E07F. There has long been a concrete proposal for inclusion in Unicode’s Supplementary Multilingual Plane <https://www.unicode.org/roadmaps/smp/>, from time to time it gets bumped along, and since fairly recently the linked spec document is actually on unicode.org, not sure if that means something.)
Cf: Format, a format control character. (See the list at <https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[...>. You could reject a large number of these, but some are required by some scripts, such as ZERO-WIDTH NON-JOINER in Indic scripts.)
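In Python, that policy might look like the following sketch (the Cn check is only as current as the interpreter's Unicode tables, as noted above):

    import unicodedata

    def name_ok(name: str) -> bool:
        for ch in name:
            cat = unicodedata.category(ch)
            # Reject Cc (controls), Cs (lone surrogates), Cn (unassigned).
            # Cf (format characters such as ZERO-WIDTH NON-JOINER) is
            # deliberately allowed; Co (private use) is a policy choice.
            if cat in ("Cc", "Cs", "Cn"):
                return False
        return True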
Mandatory reference: https://xkcd.com/327/
Challenge accepted, I'll try to put a backspace and a null byte in my firstborn's name. Hope I don't get swatted for crashing the government servers.
If you just use the {Alphabetic} Unicode character class (100K code points), together with a space, hyphen, and maybe comma, that might get you close. It includes diacritics.
I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
I wondered about numbers, but the most famous example of that has been overturned:
"Originally named X Æ A-12, the child (whom they call X) had to have his name officially changed to X Æ A-Xii in order to align with California laws regarding birth certificates."
(Of course I'm not saying you should do this. It is fun to wonder though.)
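As a sketch of the {Alphabetic} idea (using the third-party regex module, since the stdlib re has no Unicode property classes; the extra punctuation set is just the suggestion above):

    import regex

    # Alphabetic code points plus combining marks, space, hyphen,
    # apostrophe and comma.
    NAME = regex.compile(r"[\p{Alphabetic}\p{M} \-',]+")

    for n in ["José", "O'Brien", "鈴木涼太", "Nǃxau", "X Æ A-12"]:
        print(n, bool(NAME.fullmatch(n)))
    # Only the last fails: digits aren't Alphabetic.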
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
Latin characters are NOT allowed in official names for Japanese citizens. It must be written in Japanese characters only.
For foreigners living in Japan it's quite frequent to end up in a situation where their official name in Latin script does not pass the validation rules of many online forms. Issues like forbidden characters, or a name that is too long, since Japanese names (family name + first name) are typically only 4 characters long.
Also, when you get a visa to Japan, you have to bend and deform the pronunciation of your name to make it fit into the (limited) Japanese syllabary.
Funnily, they even had to register a whole new Unicode range at some point, because old administrative documents sometimes contain characters that were deprecated more than a century ago.
https://ccjktype.fonts.adobe.com/2016/11/hentaigana.html
Very interesting about Japan!
To be clear, I wasn't thinking about within a specific country though.
More like, what is the set of all characters that are allowed in legal names across the world?
You know, to eliminate things like emoji, mathematical symbols, and so forth.
Ah, I see.
I don't know, but I would bet that the sum of all corner cases and exceptions in the world would make it pretty hard to confidently eliminate any "obvious" characters.
From a technical standpoint, Unicode emoji are probably safe to exclude, but on the other hand, some scripts like Chinese characters are fundamentally pictographic, which is semantically not so different from an emoji.
Maybe after centuries of evolution we will end up with a legit universal language based on emojis, and people named with it.
Chinese characters are nothing like emoji. They are more akin to syllables. There is no semantic similarity to emoji at all, even if they were originally derived from pictorial representations.
And they belong to the {Alphabetic} Unicode class.
I'm mostly curious if Unicode character classes have already done all the hard work.
https://en.wikipedia.org/wiki/Perri_6
You forgot apostrophe as is common in Irish names like O’Brien.
Yes, though O’Brien is Ó Briain in Irish, according to Wikipedia. I think the apostrophe in Irish names was added by English speakers, perhaps by analogy with "o'clock", perhaps to avoid writing something that would look like an initial.
There are also English names of Norman origin that contain an apostrophe, though the only example I can think of immediately is the fictional d'Urberville.
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
Some Japanese names are written with Japanese characters that do not have Unicode codepoints.
(The Unicode consortium claims that these characters are somehow "really" Chinese characters just written in a different font; holders of those names tend to disagree, but somehow the programmer community that would riot if someone suggested that people with ø in their name shouldn't care when it's written as o accepts that kind of thing when it comes to Japanese).
Apostrophe is common in surnames in parts of the world.
> I'm curious if anyone can think of any other non-alphabetic characters used in legal names around the world, in other scripts?
The Catalan name Gal·la is growing in popularity, with currently 1515 women in the census having it as a first name in Spain with an average age of 10.4 years old: https://ine.es/widgets/nombApell/nombApell.shtml
There’s this individual’s name which involves a clock sound: Nǃxau ǂToma[1]
[1] https://en.m.wikipedia.org/wiki/N%25C7%2583xau_%C7%82Toma
Click characters are part of {Alphabetic}!
https://en.wikipedia.org/wiki/Click_consonant
https://www.compart.com/en/unicode/category/Lo
https://stackoverflow.com/a/4843363
> There’s this individual’s name which involves a clock sound: Nǃxau ǂToma
I was extremely puzzled until I realized you meant a click sound, not a clock sound. Adding to my confusion, the vintage IBM 1401 computer uses ǂ as a record mark character.
What if one's name is not in alphabetic script? Let's say, "鈴木涼太".
That's part of {Alphabetic} in Unicode. It validates.
דויד Smith (concatenated) will have an LTR control character in the middle
Oh that's interesting.
Is that a thing? I've never known of anyone whose legal name used two alphabets that didn't have any overlap in letters at all -- two completely different scripts.
Would a birth certificate allow that? Wouldn't you be expected to transliterate one of them?
Comma or apostrophe, like in d'Alembert ?
(And I have 3 on my keyboard; I'm not sure everyone is using the same one.)
Mrs. Keihanaikukauakahihuliheekahaunaele only had a string length problem, but there are people with a Hawaiian ʻokina in their names. U+02BB
It is if you first provide a complete specification of a “name”. Then you can validate if a name is compliant with your specification.
It's super easy actually. A name consists of three parts -- Family Name, Given Name and Patronymic, spelled using Ukrainian Cyrillic. You can have a dash in the Family Name, and the apostrophe is part of Cyrillic for these purposes, but no spaces in any of the three. If you are unfortunate enough to not use Cyrillic (of our variety) or patronymics in the country of your origin (why didn't you stay there, anyway), we will fix it for you, mister Нкріск. If you belong to certain ethnic groups who by their custom insist on not using patronymics, you can have a free pass, but life will be difficult, as not everybody got the memo really. No, you can not use a Matronymic instead of a Patronymic, but give us another 30 years of not having a nuclear war with the country whose name starts with "R" and ends in "full of putin slaves si iiia" and we might see to that.
Unless of course the name is not used for official purposes, in which case you can get away with First-Last combination.
It's really a non-issue, and the answer is jurisdiction-bound. In most of Europe an extended Latin set is used in place of Cyrillic (because they don't know better), so my name is transliterated for the purposes of being in the uncivilized realms by my own government. No, I can't just use Л and Я as part of my name anywhere here.
Valid names are those which terminate when run as Python programs.
You may not want Bobby Tables in your system.
If you're prohibiting valid letters to protect your database because you didn't parametrize your queries, you're solving the problem from the wrong end
Sure it is. Context matters. For example, in clone wars.
No, but it doesn’t stop people trying.
There's little more you can do to validate a name internationally than to provide one textbox and check if it's a valid encoding of Unicode. Maybe you can exclude some control and graphical ranges at best.
Of course there are valid concerns that international names should pass through e.g. local postal services, which would require at least some kind of Latinized representation of name and address. I suppose the Latin alphabet is the most convenient minimal common denominator across writing systems, even though I admit being Euro-centric.
I have an 'æ' in my middle name (formally a secondary first name, for historical reasons). Usually I just don't use it, but it's always funny when a payment form instructs me to write my full name exactly as written on my credit card, and then goes on to tell me my name is invalid.
I live in Łódź.
Love receiving packages addressed to ??d? :)
I wonder how many of those packages end up in Vada, Italy. Or Cody, Wyoming. Or Buda, Texas...
I imagine the “Poland” part of the address would narrow it down somewhat.
I got curious if I can get data to answer that, and it seems so.
Based on xlsx from [0], we got the following ??d? localities in Poland:
1 x Bądy, 1 x Brda, 5 x Buda, 120 x Budy, 4 x Dudy, 1 x Dydy, 1 x Gady, 1 x Judy, 1 x Kady, 1 x Kadź, 1 x Łada, 1 x Lady, 4 x Lądy, 2 x Łady, 1 x Lęda, 1 x Lody, 4 x Łódź, 1 x Nida, 1 x Reda, 1 x Redy, 1 x Redz, 74 x Ruda, 8 x Rudy, 12 x Sady, 2 x Zady, 2 x Żydy
Certainly quite a lot to search for a lost package.
[0]: https://dane.gov.pl/pl/dataset/188,wykaz-urzedowych-nazw-mie...
Interesting! However, assuming that ASCII characters are always rendered correctly and never as "?", it seems like the only solution for "??d?" would be one of the four Łódźs?
Sounds like someone is getting ready for Advent of Code!
Experienced postal workers most probably know well that ??d? represents a municipality with three non-ascii characters.
Interestingly, Lady, Łady and Lądy will end up the same after the usual transliteration.
And the postal code.
And the packages get there? Don't you put "Łódź (Lodz)" in the city field? Or the postal code takes care of the issue?
Yep, postal code does all the work.
You live in a boat? But how do they know on what sea?
Ironically, there are no big rivers in Łódź (anymore)
As you may be aware, the name field for credit card transactions is rarely verified (perhaps limited to North America, not sure).
Often I’ll create a virtual credit card number and use a fake name, and have virtually never had a transaction declined. Even if they ask more aggressively for a street address, giving just the house number often works. This isn't deep cover, but it gives a little bit of anonymity against marketing.
It's for when things go wrong. Same as with wire transfers. Nobody checks it unless there's a dispute.
The thing is though that payment networks do in fact do instant verification and it is interesting what gets verified and when. At gas stations it is very common to ask for a zip code (again US), and this is verified immediately to allow the transaction to proceed. I’ve found that when a street address is asked for there is some verification and often a match on the house number is sufficient. Zip codes are verified almost always, names pretty much never. This likely has something to do with complexities behind “authorized users”.
Funny thing about house numbers: they have their own validation problems. For a while I lived in a building whose house number was of the form 123½, and that was an ongoing source of problems. If a form just truncated the ½, that was basically fine (the house at 123 didn't have apartment numbers and the postal workers would deliver it correctly), but validating in online forms (twenty-ish years ago) was a challenge. If they ran any validation at all they'd reject the ½, and it was a crapshoot which of "123-1/2" or "123 1/2" would work, or sometimes neither one. The USPS's official recommendation at the time was to enter it as "123 1 2 N Streetname", which usually validated but looked so odd it was my last choice (and some validators rejected the "three numbers" format too).
I don't think I ever tried "123.5", actually.
Around here, there used to be addresses like "0100 SW Whatever Ave" that were distinct from "100 SW Whatever Ave". And we've still got various places that have, for example, "NW 21st Avenue" and "NW 21st Place" as a simple workaround for a not-entirely-regular street grid optimized for foot navigation.
123 + 0.5?
At American gas stations, if you have a Canadian credit card, you type in 00000 because Canadians don't have ZIP codes.
Are we sure they don't actually validate against a more generic postal code field? Then again some countries have letters in their postcodes (the UK comes to mind), so that might be a problem anyways.
Canada has letters in postal codes. That’s the issue the GP is referring to, since US gas stations invariably just have a simple 5 numeric digit input for “zip” code.
There are so many ways to write your address that I always assume it's just the house number as well. In fact, I vaguely remember that being a specific field when interacting with some old payment gateway.
The government of Ireland has many IT systems that cannot handle áccénted letters. #headdesk
I worked for an Irish company that didn't support ' in names. Did get fixed eventually, but sigh...
Bobby Tables enters the chat
Still much better when it fails at the first step. I once got myself into a bit of a struggle with Windows 10 by using "ł" as part of my Windows username. An amusingly/irritatingly large number of applications, even some of Microsoft's own, could not cope with that.
For a similar reason many Java applications do not work in Turkish Windowses. The Turkish İi Iı problem.
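For anyone who hasn't run into it: Turkish has both a dotted and a dotless I as distinct letters, and the bug comes from locale-sensitive case mapping. A quick Python illustration (Python's own str methods are locale-independent, so the Java behaviour is described in the comments):

    # The four Turkish 'I' letters that collide in case-insensitive code:
    for ch in "Iıİi":
        print(f"U+{ord(ch):04X}", ch, repr(ch.upper()), repr(ch.lower()))

    # Java's String.toLowerCase()/toUpperCase() use the default locale,
    # so on a Turkish-locale JVM "QUIT".toLowerCase() yields "quıt"
    # (dotless ı) and comparisons against "quit" quietly fail.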
My wife had two given names and no surname. (In fact, before eighth class, she only had one given name.) Lacking a surname is very common in some parts of India. Also some parts of India put surname first, and some last, and the specific meaning and customs vary quite a bit too. Indian passports actually separate given names and family names entirely (meaning you can’t reconstruct the name as it would customarily be written). Her passport has the family name line blank. Indigo writes: “Name should be as per government ID”, and has “First And Middle Name” and “Last Name” fields. Both required, of course. I discovered that if you put “-” in the Last Name field, the booking process falls over several steps later in a “something went wrong, try again later” way; only by inspecting an API response in the dev tools did I determine it was objecting to having “-” in the name. Ugh. Well, I have a traditional western First Middle Last name, and from putting it in things, sometimes it gets written First Middle Last and sometimes Last First Middle, and I’ve received some communications addressed to First, some to Last, and some to Middle (never had that happen before!). It’s a disaster.
Plenty of government things have been digitalised in recent years too, and split name fields tend to have been coded to make both mandatory. It’s… disappointing, given the radical diversity of name construction across India.
"Write your name the way it's spelled in your government issued id" is my favorite. I have three ids issued by two governments and no two match letter by letter.
Did you actually get banks to print that on your credit card?
I’m impressed, most I know struggle with any kind of non-[A-Z]!
As someone who really thinks the name field should just be one field accepting any printable Unicode characters, I do wonder what the hell I would need to do if I took customer names in this form and my system then had to interact with some other service that requires a first/last name split and/or [a-zA-Z] validation, like a bank or postal service.
Automatic transliteration seems to be very dangerous (wrong name on bank accounts, for instance), and not always feasible (some unicode characters have more than one way of being transliterated).
Should we apologize to the user and just ask twice, once correctly and once for the bad computer systems? This seems to be the only approach that both respects their spelling and avoids creating potential conflicts with other systems.
We had problems with a Ukrainian refugee we helped because certified translations of her documents did not match. Her name was transliterated the German way in one place and the English way in another.
Those are translations coming from professionals who swore an oath. Don’t try to do it with code.
In the US, you can generally specify to your certified translators how you want proper names and place names written. I would suggest you or your friend talk to the translators again so that everything matches. It will also minimize future pains.
Also, USCIS usually has an "aliases" field on their forms, which would be a good place to put German government misspellings.
USCIS is a mess.
I know someone who still doesn't know whether they have a middle name as far as American authorities are concerned.
Coupled with "two last names", it gets really messy, really quickly.
Purchase names don't match the CC name.
Bank statements are actually "for another person".
Border crossings are now extra spicy.
And "pray" that your name doesn't resemble a name on some blacklist.
The fundamental mistake is in trying to take input for one purpose and transform it for another purpose. Just have the user fill in an additional field for their name as it appears on bank statements, or whatever the second purpose is. Trying to be clever about this stuff never works out.
What you call the second purpose is often the only purpose. Or you have to talk to half a dozen other systems each of which have different constraints. You wouldn’t want to present the user half a dozen fields just so that they can choose the nicest representation of their name for each system.
That being said, in Japan it’s actually common to have two fields, one for the name in kanji (the “nice” name), and one in katakana (restricted, 8-bit phonetic alphabet, which earlier/older computer systems used and some probably still use).
You usually don't have a dozen, just two or three, and if you do have a dozen, there is a certain pattern, or at least a common denominator: half of them take ASCII, the other half use some kind of local convention you totally know how to encode.
You can just show the user the transliteration & have them confirm it makes sense. Always store the original version since you can't reverse the process. But you can compare the transliterated version to make sure it matches.
Debit cards are a pretty common example of this. I believe you can only have ASCII in the cardholder name field.
>But you can compare the transliterated version to make sure it matches
No you can't.
Add: Okay, you need to know why. I'm right here, a living breathing person with a government id that has the same name inscribed in two scripts side by side.
There is an algorithm (blessed by the same government that issued said id) which defines how to transliterate names from one script to the other, published on the parliament web site and implemented in all the places that are involved in the id issuing business.
The algorithm will however not produce the outcome you will see on my id, because I, a living breathing person who has a name, asked nicely to spell it the way I like. The next time I visit the id issuing place, I could forget to ask nicely, and then I will have two valid ids (no, the old one will not be marked as void!) with three names that don't exactly match. It's all perfectly fine, because the name as a legal concept is defined in a character set you probably can't read anyway.
Please, don't try be smart with names.
Your example fails to explain any problem with GPs proposal. They would show you a transliteration of your name and ask you to confirm it. You would confirm it or not. It might match one or other of your IDs (in which case you would presumably say yes) or not (in which case you would presumably say no). What's the issue?
You will compare the transliterated version I provided with the one you already have, it will not match, and then what? Either you tell me I have an invalid name, or you just ignore it.
I think they were suggesting the opposite order - do an automatic transliteration and offer you the choice to approve or correct it.
But even if the user is entering both, warning them that the transliteration doesn't match and letting them continue if they want is something that pays for itself in support costs.
I have an ID that transliterated my name, and included the original, but the original contained an obvious typo. I immediately notified the government official, but they refused to fix it. They assured me that only the transliterated name would be used.
Human systems aren't always interested in avoiding or fixing defects.
Okay, I have a non-ASCII (non-Latin, even) name, so I can tell. You just ask explicitly how my name is spelled in the bank system or on my government id. Please don't try transliteration unless you know the exact rules the other system uses to transliterate my name from one cultural context into another, and even then, still make it a suggestion and make it clear which purpose it will be used for (and then only use it for that purpose).
And please please please, don't try to be smart and detect the cultural context from the character set before automatically translating it to another character set. It will go wrong and you will not notice for a long time, but people will make mean passive aggressive screenshots of your product too.
My bank, for example, knows my legal name in Cyrillic but will not print it on a card, so they make a best-effort attempt to transliterate it to ASCII, but they make it an editable field and ask me to confirm that this is how I want it to appear on the card.
Ask for inflections separately.
For instance, in many Japanese forms, there are dedicated fields for the name and the pronunciation of the name. There are possibly multiple ways to read a name (e.g. 山崎 is either やまざき or やまさき). It is better to ask the person "how to read your name?" rather than execute code to guess the reading.
As for transliteration, it's best to avoid if possible. If not possible, then rely on international standards (e.g. Japanese has ISO 3602 and Arabic has ISO 233-2). When international standards don't exist, then fall back to "context-dependent" standards (e.g. in Taiwan, there are several variants of Pinyin. Allow the user to choose the romanization that matches their existing documentation).
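The storage side of that "ask, don't guess" approach is deliberately boring: separate user-supplied fields, nothing derived. A sketch (field names are made up):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PersonName:
        text: str                        # the name as written, e.g. "山崎太郎"
        reading: Optional[str] = None    # user-supplied, e.g. "やまざきたろう"
        romanized: Optional[str] = None  # user-chosen, e.g. "Yamazaki Taro"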
Legal name vs. display name
… "legal name" is "things programmer's believe about names" grade. Maybe (name, jurisdiction), but I've seen exceptions to that, too.
Where I live, no less than 3 jurisdictions have a say about my "legal" name, and their laws do not require them to match. At one point, one jurisdiction had two different "legal" names for me, one a typo by my standards, but AFAICT, both equally valid.
There's no solution here, AFAICT; it's just evidence towards why computers cannot be accountability sinks.
WTF-8 is actually a real encoding, used for encoding invalid UTF-16 unpaired surrogates for UTF-8 systems: https://simonsapin.github.io/wtf-8/
I believe this is what Rust OsStrings are under the hood on Windows.
Which I assume stands for "Windows-Transformation-Format-8(bits)".
Abstract
WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair.
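Python's 'surrogatepass' error handler produces essentially this encoding, which makes the idea easy to poke at (a sketch, not the spec's algorithm):

    lone = "\ud800"                               # an unpaired surrogate
    wtf8 = lone.encode("utf-8", "surrogatepass")  # b'\xed\xa0\x80'
    print(wtf8.decode("utf-8", "surrogatepass") == lone)  # True: round-trips

    try:
        lone.encode("utf-8")                      # strict UTF-8 refuses surrogates
    except UnicodeEncodeError as e:
        print(e)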
Can you still assume the bytes 0x00 and 0xFF are not present in the string (like in UTF-8?)
I'll say it again: this is the consequence of Unicode trying to be a mix of HTML and docx instead of a charset. It went too far for the average Joe DevGuy to understand how to deal with it, so he just selects a subset he can handle and bans everything else. HN does that too - special symbols simply get removed.
Unicode screwed itself up completely. We wanted a common charset for things like Latin, extended Latin, CJK, Cyrillic, Hebrew, etc. And we got it, for a while. Shortly after, it focused on becoming a complex file format with colorful icons and invisible symbols, which is not manageable without cutting out all that bs by force.
The “invisible symbols” are necessary to correctly represent human language. For instance, one of the most infamous Unicode control characters — the right-to-left override — is required to correctly encode mixed Latin and Hebrew text [1], which are both scripts that you mentioned. Besides, ASCII has control characters as well.
The “colorful icons” are not part of Unicode. Emoji are just characters like any other. There is a convention that applications should display them as little coloured images, but this convention has evolved on its own.
If you say that Unicode is too expansive, you would have to make a decision to exclude certain types of human communication from being encodable. In my opinion, including everything without discrimination is much preferable here.
[1]: https://en.wikipedia.org/wiki/Right-to-left_mark#Example_of_...
Copy this sentence into this site and click Decode. (YMMV)
https://embracethered.com/blog/ascii-smuggler.html
> Is this necessary to correctly represent human language?
Yes! As soon as you have any invisible characters (eg. RTL or LTR marks, which are required to represent human language), you will be able to encode any data you want.
How many direction marks can we see in this hidden text?
Wow. Did not expect you can just hide arbitrary data inside totally normal looking strings like that. If I select up to "Copy thi" and decode, there's no hidden string, but just holding shift+right arrow to select just "one more character", the "s" in "this", the hidden string comes along.
> one of the most infamous Unicode control characters — the right-to-left override
You are linking to an RLM, not an RLO. Those are different characters. RLO is generally not needed and more special-purpose. RLM causes far fewer problems than RLO.
Really though, I feel like the newer "first strong isolate" character is much better designed and easier to understand than most of the other RTL characters.
Granted, technically speaking emojis are not part of the "Unicode Standard", but they are standardized by the Unicode Consortium and constitute "Unicode Technical Standard #51": https://www.unicode.org/reports/tr51/
I'm happy to discriminate against those damn ancient Sumerians and anyone still using goddamn Linear B.
Sure, but removing those wouldn't make Unicode any simpler, they're just character sets. The GP is complaining about things like combining characters and diacritic modifiers, which make Unicode "ugly" but are necessary if you want to represent real languages used by billions of people.
I’m actually complaining about more “advanced” features like hiding text (see my comment above) or zalgoing it.
And of course endless variations of skin color and gender of three people in a pictogram of a family or something, which is purely a product of a specific subculture that doesn’t have anything in common with text/charset.
If Unicode cared about characters, which happen to form an evolving but finite set, it would simply include them all, together with exactly two direction specifiers. Instead it created a language/format/tag system within itself to build characters, most of which make zero sense to anyone in the world except grapheme linguists, if that job title even exists.
It will eventually overengineer itself into a set bigger than the set of all real characters, if not yet.
The practicality and implications of such a system are clearly demonstrated by the $subj.
You're right, of course. The point I was glibly making was that Unicode has a lot of stuff in it, and you're not necessarily stomping on someone's ability to communicate by removing part of it.
I'm also concerned by having to normalize representations that use combining characters etc., but I will add that there are assumptions you can break just by including weird charsets.
For example, the space character in Ogham, U+1680, is considered whitespace but may not be invisible, ultimately because of the mechanics of writing something like branches coming off a tree, carved around the edge of a large stone. That might be annoying to think about when you're designing a login page.
I mean, we can just make the languages simpler? We can also remove all the hundred different ways to pronounce English sounds. All elementary students will thank you for it xD
You can make a language simpler, but old books still exist. I guess if we burned all the old books and disallowed any means of printing them again, people would be happy?
Reprint them with new spelling? We have 500 year old books that are unreadable. 99.99% of all books published will not be relevant to anyone that isn’t consuming them right at that moment anyway.
Lovers can read The Lord of the Rings in the ‘original’ spelling.
People who would use Sumerian characters don't even use them, sadly. Probably first out of habit with their transcription, but also because missing character variants mean a lot of text couldn't be accurately represented. Also, I'm downvoting you for discriminating against me.
I know you're being funny, but that's sort of the point. There's an important "use-mention" distinction when it comes to historical character sets. You surely could try to communicate in authentic Unicode Akkadian (𒀝𒅗𒁺𒌑(𒌝)), but what's much more likely is that you really just want to refer to characters or short strings thereof while communicating everything else in a modern living language like English. I don't want to stop someone from trying to revive the language for fun or profit, but I think there's an important distinction between cases of primarily historical interest like that, and cases that are awkward but genuine like Inuktut.
> and invisible symbols
Invisible symbols were in Unicode before Unicode was even a thing (ASCII already has a few). I also don't think emojis are the reason why devs add checks like in the OP; it's much more likely that they just don't want to deal with character encoding hell.
As much as devs like to hate on emojis, they're widely adopted in the real world. Emojis are the closest thing we have to a universal language. Having them in the character encoding standard ensures that they are really universal, and supported by every platform; a loss for everyone who's trying to count the number of glyphs in a string, but a win for everyone else.
Unicode has metadata on each character that would allow software to easily strip out or normalize emojis and "decorative" characters.
It might have edge case problems -- but the characters in the OP's name would not be included.
Also, stripping out emojis may not actually be required or the right solution. If security is the concern, Unicode also has recommended processes and algorithms for dealing with that.
https://www.unicode.org/reports/tr39/
We need better support in more platforms and languages for the functions developers actually need on Unicode.
Global human language is complicated as a domain. Legacy issues in actually existing data add to the complexity. Unicode does a pretty good job at it. It's actually pretty amazing how well it does. It includes a lot more than just the character set and encoding: algorithms for various kinds of normalizing, sorting, and indexing, under various localizations, etc.
It needs better support in the environments most developers are working in, with raised-to-the-top standard solutions for identified common use cases and problems, implementable simply by calling a performance-optimized library function.
(And, if we really want to argue about emojis: they seem to be extremely popular, and have literally affected global culture, because people want to use them? Blaming emojis seems like blaming the user! Unicode's support for them actually provides interoperability and vendor-neutral standards for a thing that is wildly popular. But I actually don't think any of the problems or complexity we are talking about, including the OP's complaint, can or should be laid at the feet of emojis.)
There's no argument here.
We could say it's only for scripts and alphabets, OK. But it includes many undeciphered writing systems from antiquity with only a small handful of extant samples.
Should we keep those character sets, which will very likely never be used, but exclude the extremely popular emojis?
Exclude both? Why? Aren't computers capable enough?
I used to be on the anti-emoji bandwagon, but really, that position is indefensible. Unicode is the characters of communication at an extremely inclusive level.
I'm sure some day it will also have primitive shapes, and you'll be able to construct your own alphabet using them plus directional modifiers, akin to a generalizable Hangul, in effect becoming some kind of wacky version of SVG that people will abuse in an ASCII art renaissance.
So be it. Sounds great.
No, no, no, no, no… So then we’d get ‘the same’ character with potentially infinite different encodings. Lovely.
Unicode is a coding system, not a glyph system or font.
Fonts are already in there, and proto-glyphs are too, as generalized diacritics. There's also a large variety of generic shapes, lines, arrows, circles and boxes in both filled and unfilled varieties. Lines even have different weights. The absurdity of a custom alphabet can already be partially actualized. Formalism is merely the final step.
This conversation was had 20 years ago and your (and my) position lost. Might as well embrace the inevitable instead of insisting on the impossible.
Whether you agree with it or not won't actually affect unicode's outcome, only your own.
Unicode does not specify any fonts, though many fonts are defined to be consistent with the Unicode standard, nevertheless they are emphatically not part of Unicode.
How symbols including diacritics are drawn and displayed is not a concern for Unicode, different fonts can interpret 'filled circle' or the weight of a glyph as they like, just as with emoji. By convention they generally adopt common representations but not entirely. For example try using the box drawing characters from several different fonts together. Some work, many don't.
macOS already encodes Japanese filenames differently from what Windows/Linux do, and I'm sure someone mentioned the same situation for Korean here.
Unicode is already a non-deterministic mess.
And that justifies making it an even more complete mess, in new and dramatically worse ways?
Like how phonetic alphabets save space compared to ideograms by just “write the word how it sounds”, the little SVG-icode would just “write the letter how it’s drawn”
Right. Semantic iconography need not be universal or even formal to be real.
Think of all the symbols notetakers invent; ideographs without even a phonology assigned to them.
Being as dynamic and flexible as human expression is hard.
Emojis have even taken on this property naturally. The high-5 is also the praying hands for instance. Culturally specific semantics are assigned to the variety of shapes, such as the eggplant and peach.
Insisting that this shouldn't happen is a losing battle against how humans construct written language. Good luck with that.
There are no emoji in this guy's name.
Unicode has made some mistakes, but having all the symbols necessary for this guy's name is not one of them.
This frustration seems unnecessary; Unicode isn't more complicated than time, and we have far more than enough processing power to handle its most absurd manifestations.
We just need good libraries, which is a lot less work than inventing yet another system.
The limiting factor is not compute power, but the time and understanding of a random dev somewhere.
Time also is not well understood by most programmers. Most just seem to convert it to epoch and pretend that it is continuous.
In what way is Unicode similar to HTML, docx, or a file format? The only features I can think of that are even remotely similar to what you're describing are emoji modifiers.
And no, this webpage is not the result of "carefully cutting out the complicated stuff from Unicode". I'm pretty sure it's just the result of not supporting Unicode in any meaningful way.
>We wanted a common charset for things like latin, extlatin, cjk, cyrillic, hebrew, etc. And we got it, for a while.
We didn't even get that, because slightly different-looking characters from Japanese and Chinese (and other languages) got merged into the same character in Unicode due to having the same origin, meaning you have to pick a font based on the language context for text to display correctly.
They are the same character, though. They do not use the same glyph in different language contexts, but Unicode is a character encoding, not a font standard.
They're not. Readers native in one version can't read the other, and more than a handful got duplicated in multiple forms anyway, so they're just not the same, only similar.
You know, the obvious presumption underlying Han unification is that the CJK languages must form a continuous dialect continuum, as if villagers living in the middle of the East China Sea between Shanghai, Nagasaki and Gwangju spoke half-Chinese-half-Japanese-half-Korean, and the technical distinctions only existed because of rivalry or something.
Alas, people don't erect houses on the surface of an ocean, and the CJK languages are complete isolates from one another with no known shared ancestry, so "it's gotta be all the same" thinking just doesn't work.
I know it's not very intuitive that Chinese and Japanese have ZERO syntactic similarity or mutual intelligibility, given the relatively tiny mental share they occupy, but that's just how things are.
You're making the same mistake: the languages are different, but the script is the same (or trivially derived from the Han script). The Ideographic Research Group was well aware of this, having consisted of native speakers of the languages in question.
That's not a "mistake", that's the reality. They aren't interchangeable, and they're not the same. "Same or trivially derived" is a completely false statement that exists solely to justify Han unification, or maybe something that made sense in the 80s; it doesn't make literal sense.
Yes, but the same is true for overlapping characters in Cyrillic and Latin. A and А are the same glyph, so are т,к,і and t,k,i and you can even see the difference between some of those.
The duplication there is mostly to remain compatible or trivially transformable with existing encodings. Ironically, the two versions of your example "A" do look different on my device (Android), with a slightly lower x-height for the Cyrillic version.
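For the curious, the two A's really are distinct code points despite (usually) identical glyphs; a quick check in Python:

```python
import unicodedata

for ch in "AА":  # Latin A, then Cyrillic А
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0041 LATIN CAPITAL LETTER A
# U+0410 CYRILLIC CAPITAL LETTER A
```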
The irony is you calling it irony. CJK "same or trivially derived" characters are nowhere close to that, yet they were given the same code points. The CJK Unified Ideographs block is just broken.
So when are we getting UniPhoenician?
This is a bullshit argument that never gets applied to any other live language. The characters are different, people who actually use them in daily life recognise them as conveying different things. If a thumbs up with a different skin tone is a different character then a different pattern of lines is definitely a different character.
> If a thumbs up with a different skin tone is a different character
Is it? The skin tone modifier is serving the same purpose as a variant selector for the CJK codepoint would be.
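Mechanically the two really are the same trick: a base code point followed by a modifier/selector that requests a different glyph. A sketch; 葛 + VS17 is, if I recall correctly, one of the ideographic variation sequences registered in the IVD, and font support for it varies:

```python
# Emoji skin tone: base + modifier, two code points rendered as one glyph
thumbs_up = "\U0001F44D\U0001F3FD"   # 👍 + medium skin tone modifier
print(len(thumbs_up))                # 2

# Ideographic variation sequence: base ideograph + variation selector
katsu = "\u845B\U000E0100"           # 葛 + VS17 requests a specific glyph variant
print(len(katsu))                    # 2
```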
The underlying implementation mechanism is not the issue. If unicode had actual support for Japanese characters so that when one e.g. converted text from Shift-JIS (in the default, supported way) one could be confident that one's characters would not change into different characters, I wouldn't be complaining, whether the implementation mechanism involved variant selectors or otherwise.
Okay, that's fair. The support for the selectors is very half-assed and there's no other good mechanism.
It doesn't matter to me what bullshit theoretical semantics excuse there is; for practical purposes it means that UTF-8 is insufficient for displaying all human languages, especially if you want Chinese and Japanese in the same document/context without switching fonts (like, say, a website).
IMO, the sin of Unicode is that they didn't just pick local language authorities and give them standardized concepts like lines and characters, plus start-of-language and end-of-language markers.
Lots of Unicode issues come from handling languages the code is not expecting, and code currently has no means to select or report quirk support.
I suppose they didn't like getting national borders involved in technical standardization, but that's just unavoidable. They're getting involved anyway, and these problems are popping up anyway.
This doesn't self-synchronize. Removing an arbitrary byte from the text stream (e.g. SOL / EOL) will change the meaning of codepoints far away from the site of the corruption.
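By contrast, UTF-8 does self-synchronize: lead and continuation bytes are distinguishable, so losing a byte damages only one character. A quick Python demonstration:

```python
data = "naïve".encode("utf-8")    # 6 bytes; the ï takes two
corrupted = data[:2] + data[3:]   # drop one byte mid-character
# Only the damaged character is lost; the decoder resynchronizes at once
print(corrupted.decode("utf-8", errors="replace"))   # na�ve
```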
What it sounds like you want is an easy way for English-language programmers to skip or strip non-ASCII text without having to reference any actual Unicode documentation. Which is a Unicode non-goal, obviously. And also very bad software engineering practice.
I'm also not sure what you're getting at with national borders and language authorities, but both of those were absolutely involved with Unicode and still are.
> start-of-language and end-of-language markers
Unicode used to have language tags, but they've been (mostly) deprecated:
https://en.wikipedia.org/wiki/Tags_(Unicode_block)
https://www.unicode.org/reports/tr7/tr7-1.html
The lack of such markers prevents Unicode from encoding strings of mixed Japanese and Chinese text correctly. Or in the case of a piece of software that must accept both Chinese and Japanese names for different people, Unicode is insufficient for encoding the written forms of the names.
I’m working with Word documents in different languages, and few people take the care to properly tag every piece of text with the correct language. What you’re proposing wouldn’t work very well in practice.
The other historical background is that when Unicode was designed, many national character sets and encodings existed, and Unicode’s purpose was to serve as a common superset of those, as otherwise you’d need markers when switching between encodings. So the existing encodings needed to be easily convertible to Unicode (and back), without markers, for Unicode to have any chance of being adopted. This was the value proposition of Unicode, to get rid of the case distinctions between national character sets as much as possible. As a sibling comment notes, originally there were also optional language markers, which however nobody used.
I bet the complex-file-format thing probably started with CJK. They wanted to compose Hangul, and later someone had the bright idea to do the same to change the look of emoji.
Don't worry, AI is the new hotness. All they need is to unpack prompts into arbitrary images, and finally Unicode will truly be Unicode; all our problems will be solved forever.
But hey, multiocular o!
https://en.wikipedia.org/wiki/Cyrillic_O_variants#Multiocula...
(TL;DR a bored scribe's doodle has a code point)
> so he just selects a subset he can handle and bans everything else.
Yes? And the problem is?
The problem is the scale at which it happens and the lack of ready-to-go methods in most runtimes/libs. Nothing and no one is ready for Unicode complexity out of the box, and there's little interest in unscrewing it oneself, because it looks like an absurd minefield, and likely is one, from the perspective of an average developer. So they get defensive by default, which results in $subj.
The next guy with a different subset? :)
The subset is mostly defined by the jurisdiction you operate in, which usually defines a process to map names from one subset to another and is also in the business of keeping the log of said operation. The problem is not operating in a subset, but defining it wrong and not being aware there are multiple of those.
If different parts of your system operate in different jurisdictions (or interface with other systems that do), you have to pick multiple subsets and ask the user to provide input for each of them.
You just can't put anything other than ASCII into either a payment card or a PNR, and the minimum-length rules differ between the two; meanwhile you can't put ASCII into a government database that explicitly rejects all ASCII letters.
Well, the labels of the input fields are written in English, yet the user enters their name in their native language.
What's the reason for having a name at all? So that you can call the person by it. But if I write you my name in my language, what can you (not knowing how to read it) do? Only "hey, still-don't-know-you, here is your info".
In my foreign passport my name is __transliterated__ into the Latin alphabet. Shouldn't this be the case for other places too?
The W3C recommends adding a separate field for pronunciation; see e.g. the paragraph after the second image in https://www.w3.org/International/questions/qa-personal-names...
Wow, that's neat.
I’d expect iCloud to accept that name, even though Rachel True’s name breaks the heck out of it:
https://www.reddit.com/r/ProgrammerHumor/comments/lz27ou/she...
A system not supporting non-latin characters in personal names is pitiful, but a system telling the user that they have an invalid name is outright insulting.
That’s the best one of the lot. "Dein Name ist ungültig", "Your name is invalid", written with the informal word for "your".
They're trying to say that you and the server are very close friends, you see? No, no, I get this is not correct, just a joke...
It seems ridiculous to apply form validation to a name, given the complexity of charsets involved. I don't even validate email addresses. I remember [this](https://www.netmeister.org/blog/email.html) wonderful explainer of why your email validation regex is wrong.
In HTML, if you use <input type="email">, you basically get regex validation for free. While it doesn't fully follow the RFC, it's a good middle ground, and it gives you an email address that you can actually use on the internet (the RFC obviously allows some cases outside this scope). That's why I tend to prefer what's defined in the standard: https://html.spec.whatwg.org/multipage/input.html#email-stat...
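For reference, the spec's pattern (a deliberate "willful violation" of RFC 5322), transcribed into Python; see the link above for the authoritative version:

```python
import re

# The WHATWG "valid email address" pattern, split across lines for readability
EMAIL_RE = re.compile(
    r"[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+"
    r"@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
    r"(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*"
)

print(bool(EMAIL_RE.fullmatch("user@example.com")))      # True
print(bool(EMAIL_RE.fullmatch('"quoted"@example.com')))  # False: RFC-legal, HTML-invalid
```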
Apropos: https://www.kalzumeus.com/2010/06/17/falsehoods-programmers-...
Situations like these regularly make me feel ashamed about being a software developer.
I've got a good feel now for which forms will accept my name and which won't, though mostly I default to an ASCII version for safety. Similarly, I've found a way to mangle my address to fit a US house/state/city/zip format.
I don't feel unwelcome; I empathize with the developers. I'd certainly hate to figure out address entry for all countries. At least the US format is consistent across websites, I can have a high degree of confidence that it'll work in the software, and my local postal service knows what to do because they see it all the time.
At the end of the day, a postal address is printed to an envelope or package as a single block of text and then read back and parsed somehow by the people delivering the package (usually by a machine most of the way, but even these days more by humans as the package gets closer to the destination). This means that, in a very real sense, the "correct" way to enter an address is into a single giant multi-line text box with the implication that the user must provide whatever is required to be printed onto the mailing label such that a carrier will successfully be able to find them.
Really, then, the reason we bother breaking an address into multiple parts is not directly the need for an address at all: it is because we 1) might not trust the user to provide everything required to make the address valid (assuming the country or even state, or giving us only a street address with no city or postal code... both mistakes that are likely extremely common without a multi-field form), or 2) need to know some subset of the address ourselves and do not trust ourselves to parse the fuzzy address back the same way the postal service might, whether for taxes or to establish shipping rates.
FWIW, I'd venture to say that #2 is sufficiently common that it almost becomes nonsensical to accept an address for a location whose format convention you aren't already sure of ahead of time. If you need a street address for shipping, you are going to need to be careful about sales taxes and VAT, increasingly often even if you aren't located in the state or even country the shipment will go to; otherwise you only later realize you failed to collect a tax, will be charged a fortune to ship there, or that it simply isn't possible to deliver anything to that country. And if you don't intend to ship anything, you don't actually need the full address anyway (credit cards, as an obvious example, don't need or use the full address).
You can grab JSON data of all ISO recognized countries and their address formats on GitHub (apologies, I forget the repo name. IIRC there is more than one).
I don't know if it's 100% accurate, but it's not very hard to implement as part of an address entry form. I think the main issue is that most developers don't know it exists.
OMG, the second screenshot might actually be the application I am working on right now...
I thought this was https://simonsapin.github.io/wtf-8/
Yeah, these are just issues caused by ASCII.
It's like with phone numbers. Some people assume they contain only digits.
It's really not that hard, though. PCRE regexes support Unicode letter classes. There is really no excuse for this type of issue.
`..*` should be a good validation regexp. At that point you might as well check only that the length is non-zero and that the name is valid Unicode, and nothing more. Well, OK, maybe look at UTR #36 and maybe disallow / filter out non-printing characters, private-use characters, and whatnot.
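A minimal sketch of that check: non-empty, valid Unicode, and none of the invisible-junk categories. Note that Cf also covers ZWJ/ZWNJ, which some scripts legitimately need, so treat this as a starting point rather than a rule:

```python
import unicodedata

def plausible_name(s: str) -> bool:
    # Cc = control, Cf = format, Co = private use, Cn = unassigned
    if not s.strip():
        return False
    return all(unicodedata.category(c) not in {"Cc", "Cf", "Co", "Cn"} for c in s)

print(plausible_name("Stępień"))        # True
print(plausible_name("Bob\u202Ebob"))   # False: RIGHT-TO-LEFT OVERRIDE is Cf
```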
How do I allow "stępień" while detecting Zalgo-isms?
Zalgo is largely the result of abusing combining modifiers. Declare that any string with more than n combining modifiers in a row is invalid.
n=1 is probably a reasonable falsehood to believe about names until someone points out that language X regularly has multiple combining modifiers in a row, at which point you can bump up N to somewhere around the maximum number of combining modifiers language X is likely to have, add a special case to say "this is probably language X so we don't look for Zalgos", or just give up and put some Zalgo in your test corpus, start looking for places where it breaks things, and fix whatever breaks in a way that isn't funny.
N=2 is common in Việt Nam (vowel-quality mark + tone mark).
Yet Vietnamese can be written in Unicode without any combining characters whatsoever - in NFC normalization each character is one code point - just like the U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW in your example.
u/egypurnash's point was about limiting glyph complexity. You could canonically decompose then look for more than N (say, N=3) combining codepoints in a row and reject if any are found. Canonical forms have nothing to do with actual glyph complexity, but conceptually[0] normalizing first might be a good first step.
[0] I say conceptually because you might implement a form-insensitive Zalgo detector that looks at each non-combining codepoint, looks it up in the Unicode database to find how many combining codepoints one would need if canonically decomposing and call that `n`, then count from there all the following combining codepoints, and if that exceeds `N` then reject. This approach is fairly optimal because most of the time most characters in most strings don't decompose to more than one codepoint, and even if they do you save the cost of allocating a buffer to normalize into and the associated memory stores.
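A simpler, form-insensitive version of that idea as a Python sketch: normalize first rather than counting via the database, trading the optimization described above for brevity. max_run=3 is an arbitrary guess:

```python
import unicodedata

def looks_zalgo(s: str, max_run: int = 3) -> bool:
    # NFD first, so precomposed characters count their marks too
    run = 0
    for ch in unicodedata.normalize("NFD", s):
        if unicodedata.combining(ch):
            run += 1
            if run > max_run:
                return True
        else:
            run = 0
    return False

print(looks_zalgo("Stępień"))             # False: runs of length 1
print(looks_zalgo("Việt Nam"))            # False: runs of length 2
print(looks_zalgo("e" + "\u0301" * 5))    # True: five stacked acutes
```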
I can point out that Greek needs n=2: for accent and breathing.
There's nothing special about "Stępień": it has no combining characters, just the usual diacritics that have their own code points in the Basic Multilingual Plane (U+0119 and U+0144). I bet there are some names out there that would make this harder, but this isn't one.
If you decompose then it uses combining codepoints. Still nothing special.
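Concretely, a quick Python check of both forms:

```python
import unicodedata

name = "Stępień"                           # ę = U+0119, ń = U+0144, precomposed
nfd = unicodedata.normalize("NFD", name)
print(len(name), len(nfd))                 # 7 9
print([f"U+{ord(c):04X}" for c in nfd if unicodedata.combining(c)])
# ['U+0328', 'U+0301']: combining ogonek and combining acute
```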
I could answer your question better if I knew why you need to detect Zalgo-isms.
We have a whitelist of allowed characters, which is a pretty big list.
I think we based it on Lodash's deburr source code. If deburr's output is a-z and some common symbols, it passes (and we store the original value).
https://www.geeksforgeeks.org/lodash-_-deburr-method/
Why do you need to detect Zalgo-isms and why is it so important that you want to force people to misspell their names?
For the unaware (including myself): https://en.wikipedia.org/wiki/Zalgo_text
If you really think you need to programmatically detect and reject these (I'm dubious), there is probably a reasonable limit on the number of diacritics per character.
https://stackoverflow.com/a/11983435
Pfft, "Dein Name ist ungültig" ("your name is invalid"). Let's get straight to the point: it's the user's fault for having a bad name, and the user needs to fix it.
Fun fact: there is a semi-standard encoding called WTF-8, which is UTF-8 extended so that it can represent non-well-formed UTF-16 (unpaired surrogate code points).
It's used in situations where a UTF-8-based system has to interact with Windows file paths.
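Python exposes essentially the same trick through its "surrogatepass" error handler, which produces the byte sequences WTF-8 uses for unpaired surrogates:

```python
lone = "\ud83d"   # half of a surrogate pair, invalid in well-formed UTF-16/UTF-8
wtf8 = lone.encode("utf-8", errors="surrogatepass")
print(wtf8)                                                  # b'\xed\xa0\xbd'
print(wtf8.decode("utf-8", errors="surrogatepass") == lone)  # True
# lone.encode("utf-8") with the default strict handler raises UnicodeEncodeError
```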
I totally get that companies are probably more successful using simple validation rules that work for the vast majority of names, rather than accepting everything just so that some person with no name, or someone whose name cannot possibly be expressed or even transliterated into Unicode, can use their services.
But that person's name has no business failing validation. They fucked up.
My first name is hyphenated. I still find forms that reject it. My favorite was one that said "invalid first name".
What would be wrong with "enter your name as it appears in the machine-readable zone of your passport" (or "would appear" for people who have never gotten one)? Isn't that the one standard format for names that actually is universal?
The problem is, people exist who have É in their name and will go to court when you spell it as E, and the court will say that 1) you have the technical ability to write it as É and 2) they have a right to have their name spelled correctly. Also it's not nice and bad for business to be like this.
> they have a right to have their name spelled correctly
IMO, having the law consider this as an unconditional right is the root of the problem. What happens when people start making up their own characters that aren't in Unicode to put in their names?
> Also it's not nice and bad for business to be like this.
What about having a validated "legal name" for everything official and an unvalidated "display name" that's only ever parroted back to the person who entered it?
> What happens when people start making up their own characters that aren't in Unicode to put in their names?
They first have to fight the Unicode committee, and maybe they actually have a point and the character is made up in a way that is acceptable to society. Then they will fight their local authorities, who run everything on 30-year-old systems. Only after that do they become your problem, at which point you fix your cursed regexp.
> an unvalidated "display name" that's only ever parroted back to the person who entered it?
You will get that wrong too. When you send me an email, I would expect my name to be in a different form from what you display in the active-user widget.
The point is, you need to know the exact context in which the name is used and also communicate it to me so I can tell you the right thing to display.
I would like to use my name as my parents gave it to me, thanks. Is that too much to ask for?
How much flexibility are we giving parents in what they name children?
If a parent invented a totally new glyph, would supporting that be a requirement?
Luckily, there is a vital records office which has already bothered to check the law on that matter, and if they can write it, so can you.
There's the problem that "appears" is a visual phenomenon, while Unicode strings can contain non-visible characters and multiple ways to represent the same visible information. Normalization is supposed to help here, but some sites may fail to do it, or do it incorrectly, etc.
But the MRZ of a passport doesn't contain any of those problem characters.
But on some random website, with people copy-pasting from who-knows-what, they will have to normalize or validate, etc. to deal with such characters.
The point is, if the standard were to enter it like in the MRZ, it would be easy to validate.
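Indeed, validating an MRZ name field is about as trivial as it gets, since ICAO 9303 allows only A-Z plus '<' as filler/separator. A sketch; the transliteration needed to get there (ę to E, ß to SS, and so on) is defined by the spec's mapping tables and is the actual hard part:

```python
import re

MRZ_NAME = re.compile(r"[A-Z<]+")

def valid_mrz_name(field: str) -> bool:
    return MRZ_NAME.fullmatch(field) is not None

print(valid_mrz_name("STEPIEN<<GRZEGORZ"))  # True: surname<<given-names format
print(valid_mrz_name("Stępień"))            # False: must be transliterated first
```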
From experience, it's not actually universal. Visa applications will often ask for name information not encoded in the MRZ.
Just use the unicode identifier rules, my libu8ident. https://github.com/rurban/libu8ident
Windows folks need to convert to UTF-8 first.
That's nice. Which restriction level handles all names?
All of them. Identifiers are identifiable names. Most languages with Unicode name support only support symbols, i.e. not identifiable names (binary trash).
grzegorz brzęczyszczykiewicz
Looks ok in my language: Gřegoř Bženčiščikievič
You missed the "ę"!
I don't think I did. I watched the video and this is the phonetic transcription. I hear b zh e n ch ...
https://www.youtube.com/watch?v=AfKZclMWS1U For the curious
Yes, all these forms should handle existing names...
but the author's own website doesn't (url: xn--stpie-k0a81a.com, bottom of the page: "© 2024 ę ń. All rights reserved.")
I think the bottom of the page is you missing the joke: it's showing only the letters of his name that get rejected everywhere else. Same for the URL: it renders his name correctly when you browse to it in a modern browser. What you've copied is the ASCII (Punycode) fallback for the Unicode domain.
Under GDPR you have the legal right for your name to be stored and processed with the correct spelling in the EU.
https://gdprhub.eu/index.php?title=Court_of_Appeal_of_Brusse...
This seems to only apply to banks.
I wouldn't be surprised if that created kafkaesque problems with other institutions that require name to match the bank account exactly, and break/reject non-ASCII at the same time.
I know an Åsa who became variously Åsa, Aasa and Asa after moving to a non-Scandinavian country. That took a while to untangle, and caused some of the problems you describe.
Spelled with an Angstrom, or with a Latin Capital Letter A With Ring Above?
The second. It’s the 27th letter of the Swedish alphabet.
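Side note: Unicode has both code points, but NFC normalization folds the Angstrom sign into the letter, so a normalizing system stores the same thing either way:

```python
import unicodedata

angstrom = "\u212B"   # ANGSTROM SIGN
a_ring   = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE

print(angstrom == a_ring)                                # False
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True: NFC folds them
```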
This does not only apply to banks. The specific court case was brought against a bank, but the law as is applies to any and everyone who processes your personal data.
It’s a general right to have incorrect personal data relating to you rectified by the data processor.
No, anywhere where your name is used.
I lost count of the projects where this was an issue. US and Western European-born devs are oblivious to this problem and it ends up catching them over and over again.
Yeah, it's amazing. My language has a Latin-based alphabet but can't be represented with ISO 8859-1 (aka the Latin-1 charset) so I used to take it for granted that most software will not support inputs in the language... 25 years ago. But Windows XP shipped with a good selection of input methods and used UTF-16, dramatically improving things, so it's amazing to still see new software created where this is somehow a problem.
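Polish is one concrete example of such a language (the commenter doesn't say which language they mean); it needs ISO 8859-2 rather than 8859-1:

```python
"żółć".encode("iso-8859-2")       # fine: Polish fits in Latin-2
try:
    "żółć".encode("iso-8859-1")   # ż, ł, ć have no Latin-1 code points
except UnicodeEncodeError as e:
    print(e)
```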
Except that now there's no good excuse. Things like the name in the linked article would just work out of the box if it weren't for developers actually taking the time to break them by implementing unnecessary and incorrect validation.
I can think of very few situations, where validation of names is actually warranted. One that comes to mind is when you need people's ICAO 9303 compliant names, such as on passports or airline systems. If you need to make sure you're getting the name the person has in their passport's MRZ, then yes, rejecting non-ASCII characters is correct, but most systems don't need to do that.
Software has been gaslighting generations of people around the world.
Side note: not a bad way to skirt surveillance though.
A name like “stępień” will without a doubt have many ambiguous spellings across different intelligence gathering systems (RUMINT, OSINT, …). Americans will probably spell it as “Stefen” or “Steven” or “Stephen”, especially once communicated over phone.
Turns out just a foreign document is enough for that sometimes ;) http://news.bbc.co.uk/1/hi/northern_ireland/7899171.stm