Good practices for generated identifiers

This post is about randomly generated identifiers: what they are and how to use them well.

Identifiers are a general construct that lets you reference a resource without having your hands directly on it. For example, substack is an identifier, and it is part of the compound name substack.com that lets you reference this web site from all over the world without having your hands directly on it. This one is not a generated identifier, however. It is called a slug, and it was selected by a human being. It is meaningful to a human and can even be written down from human memory, but it comes at the very high cost of a human needing to select it. Anyone who has tried to name a band can tell you how much effort can be consumed by naming something.

A generated identifier is manufactured by the computer without any human input. On a recent Substack post, I could see a URL on the top of the page that looks like https://lexspoon.substack.com/publish/post/176922671. The numeric part 176922671 is generated by the server and took zero brain power from a human being. This is a sequential identifier: each time someone makes a post on Substack, the software increases the post number by one and allocates that number to the next post being made.

These sequential identifiers do have their advantages, and they are used to great effect within database implementations to give a numeric identifier to each row of a given table. For an API, however, they have some large downsides. They reveal how many identifiers have been allocated so far; for example, we can see that Substack has had roughly 177 million posts created at the time of writing. Also, sequential identifiers let callers, by accident or maliciously, guess the identifiers of other people’s posts. If you know that someone sitting in the room with you just started writing a Substack post, then, even if they did not tell you the number for it, you can probably figure it out by making your own post and then decrementing the post number one at a time until you find their post.
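
To make the guessing problem concrete, here is a minimal sketch of a sequential allocator. The class name `SequentialAllocator` and the starting value are made up for illustration; this is not a claim about how Substack implements its counter.

```python
import itertools


class SequentialAllocator:
    """Hypothetical allocator that hands out consecutive integers."""

    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def allocate(self):
        return next(self._counter)


posts = SequentialAllocator(start=176922671)
their_post = posts.allocate()  # someone else's post, number unknown to me
my_post = posts.allocate()     # my own post, created just afterward

# Walking backward from my own identifier finds theirs on the first try.
guess = my_post - 1
print(guess == their_post)  # True
```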

Random identifiers are therefore the more common go-to approach for APIs. You take a random sequence of characters and then make that your identifier. If you look at the Square developer documentation, you will see URLs like https://connect.squareup.com/v2/customers/JDKYHBWT1D4F8MFH63DBMEN8Y4. In this example, JDKYHBWT1D4F8MFH63DBMEN8Y4 is a randomly generated identifier. It’s a good approach.
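
As a rough sketch of how such an identifier might be produced, the following draws each character independently from a 32-symbol alphabet using Python's `secrets` module. The alphabet and the function name `random_id` are assumptions for illustration; I do not know how Square actually generates its identifiers.

```python
import secrets

# Assumed alphabet: 32 digits and uppercase letters, so 5 bits per character.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def random_id(length=26):
    """Return `length` characters chosen independently from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


print(random_id())  # a different 26-character string on every run
```

Using `secrets` rather than `random` means the characters come from the operating system's cryptographic random source, which matters when identifiers should be hard to guess.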

It is a little bit long, however. If this encoding is what I think and will explain, then each character has 32 possible values, and there are 26 characters in it. Since 32 is more than 10, this gives more than 100,000,000,000,000,000,000,000,000 identifiers. I have played a lot of Trimps but do not want to figure out the name of this large number. There just aren’t that many customers in the world. There are only 8,000,000,000 people in the world. So, the first practice I would suggest is to only make these things large enough for the number of identifiers you expect to ever generate, padded by a factor of 1000 in case you are wrong, and padded by maybe another 1000 to 1,000,000 to decrease the chance of random collision. That sounds like a lot, but if you think there are 8 billion possible customers, and they each need 1000 separate customer IDs, and you pad that by 1000 in case you are wrong, and then you multiply that by another million, then I think these need to be about 15 characters, as follows:

$$\log_{32}(8{,}000{,}000{,}000 \times 1000 \times 1000 \times 1{,}000{,}000) \approx 14.55$$
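
The same sizing arithmetic can be sketched in code. This assumes that the estimate (8 billion customers with 1000 IDs each) and both padding factors (1000 in case the estimate is wrong, and 1,000,000 against collisions) are all multiplied together; the function name `id_length` is made up for illustration.

```python
def id_length(total_ids, alphabet_size=32):
    """Smallest length whose identifier space holds at least `total_ids`."""
    length, capacity = 0, 1
    while capacity < total_ids:
        capacity *= alphabet_size
        length += 1
    return length


customers = 8_000_000_000     # everyone in the world
ids_per_customer = 1000       # separate IDs each customer might need
estimate_margin = 1000        # padding in case the estimate is wrong
collision_margin = 1_000_000  # padding against random collisions

needed = customers * ids_per_customer * estimate_margin * collision_margin
print(id_length(needed))  # 15
```

The loop uses integer arithmetic rather than a floating-point logarithm, so it does not round incorrectly when the target lands exactly on a power of 32.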

Overly large identifiers are a significant problem! Identifiers are used in spreadsheets, Slack messages, documents, and emails. In the case of a spreadsheet, a 15-character identifier like JDKYHBWT1D4F8MH is short enough that you can have a column of identifiers and still have plenty of room on the screen for more columns. Be careful not to make your identifiers too short, because that’s an even worse problem, but figure out what you need and then add a couple of characters as a safety measure. Even if you get it wrong, you can add a few more later. Excepting certain kinds of COBOL programs, identifiers will be stored in variable-length strings, so your initial choice of length is not a permanent choice.

For this reason, UUIDs seem like a bad default to me. A humongous identifier like 9c5b94b1-35ad-49bb-b118-8e8fc24abf80 is impractical to type and takes an awfully large amount of horizontal space on a screen.

As an aside, UUIDs are not unique at all, so you still need to check for uniqueness with something like a centralized database if you really do need uniqueness. It is a distraction from the bigger issues to observe that the same 128-bit integer is unlikely to be generated twice in a row by a random number generator. The more likely reason for a UUID collision is in the context where it is generated. If you clone a disk volume, launch two instances of a Docker image, or make a copy of a YAML configuration file, then the targets of those UUIDs can end up existing multiple times.
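
Here is a minimal sketch of that kind of uniqueness check, where an in-memory set stands in for the centralized database. The class name `IdRegistry`, the retry limit, and the 32-symbol alphabet are all assumptions made for illustration; a real system would more likely lean on a unique index in its database.

```python
import secrets

# Assumed alphabet: 32 digits and uppercase letters.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


class IdRegistry:
    """Hypothetical stand-in for a central store of allocated identifiers."""

    def __init__(self):
        self._taken = set()

    def allocate(self, length=15, max_attempts=10):
        """Generate random candidates until one is not already taken."""
        for _ in range(max_attempts):
            candidate = "".join(
                secrets.choice(ALPHABET) for _ in range(length)
            )
            if candidate not in self._taken:
                self._taken.add(candidate)
                return candidate
        raise RuntimeError("could not find an unused identifier")


registry = IdRegistry()
first = registry.allocate()
second = registry.allocate()
print(first != second)  # True
```

Note that the guarantee only holds within one registry. Two separate registries, like two clones of a disk volume, can still hand out the same identifier.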

As another example of all of this, the Substack post I mentioned earlier is number 176922671, and we might think of this as a globally unique ID in a certain way, because there is only one source of Substack posts anywhere.

Wait, though! Read that slowly again. “There is only one source of Substack posts anywhere”. That’s how we get a psychological sense of global uniqueness even though identifiers tend to not be able to accomplish that. We have to assume that there is just one Substack, and it is that assumption that lets us generate universally unique identifiers. If you think about it, that’s not really true. The developers of Substack have personal development environments that each have their own set of post identifiers. As well, when Substack’s developers write test cases, those test cases will each make a temporary copy of Substack and create posts within it. A Substack post number is really only unique within one instance of a Substack deployment.

It is better to think of an identifier as always being only relatively unique, and to stop and ask what scope you need it to be unique within. For example, if you follow that train of thought, it’s really not a bad naming system on a Linux server to use disk volume names like /home and /usr, because the volumes on a server only need to be unique within that individual server. As a beneficial side effect, once you flip your brain around to think about relative uniqueness, you can start to make things collide on purpose! If you replace your /home drive by a different /home drive, you sort of do want everything else on the machine to link to the new drive.

Since there’s not really such a thing as a universally unique identifier, there is a second and far pettier reason I disfavor UUIDs: the developers knew that “unique IDs” are not so attainable, so they named their system “universally unique” IDs. There are many jokes going around about what might be next. “Completely universally unique IDs”? “Maximally stupendously no this time for real unique IDs”? There are few things sillier than generating a 36-character identifier for a hard drive slot in a computer that can only ever hold 4 or 8 hard drives.

Instead of the UUID, I would suggest a wider usage of the Crockford-32 encoding, which I believe is what that customer ID up above is using. This encoding was invented by Douglas Crockford, a co-creator of the JSON format. Here are some of the high points:

  • The encoding uses only letters and numbers, so it will work well as part of a URL, embedded in a string literal, or in many other places an identifier may need to be used.

  • It encodes 5 bits per character, which is about the limit of what you can do and still get other benefits you would like. For comparison, hex encoding only gives you 4 bits per character. You can get 6 bits with systems like base64 or uuencode, but that runs into the next point.

  • Crockford-32 is case insensitive, making identifiers easier to type and to read aloud. Typing mixed-case identifiers on a standard keyboard requires pressing and releasing the shift key repeatedly, which makes each character more like two physical characters even though there is just one on the screen. Likewise for reading a mixed-case string aloud, where you have to say “capital A” or “lower A”, thus turning each visual character into two spoken ones.

As a parting bit of trivia, the Latin alphabet has 26 letters, and together with the 10 digits that gives 36 possible characters to use, but a base-32 encoding only needs 32 options. What four options would you leave out, if it were up to you? Crockford chose O, I, and L due to possible mixups with 0 and 1. The fourth one is a little funny, and I will let you read Doug’s web page on the encoding and see for yourself. It’s a funny but real issue for a mass-market web site, but that’s a tale for another day.
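
To show how little machinery the encoding needs, here is a minimal sketch of encoding and decoding integers with Crockford's 32-symbol alphabet. The function names `encode` and `decode` are my own, and the sketch leaves out the optional hyphens and check symbols that the full specification describes.

```python
# Crockford's alphabet: the digits plus 22 of the 26 letters.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"
DECODE = {ch: value for value, ch in enumerate(ALPHABET)}
# When decoding, look-alike letters are read as the digits they resemble.
DECODE.update({"O": 0, "I": 1, "L": 1})


def encode(number, length):
    """Encode a non-negative integer as exactly `length` characters."""
    if number < 0 or number >= 32 ** length:
        raise ValueError("number does not fit in the requested length")
    chars = []
    for _ in range(length):
        number, digit = divmod(number, 32)
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars))


def decode(text):
    """Decode text to an integer, ignoring case."""
    number = 0
    for ch in text.upper():
        number = number * 32 + DECODE[ch]
    return number


print(encode(1234, 4))  # 016J
print(decode("o16j"))   # 1234
```

The second call shows the forgiving side of the design: a lowercase identifier typed with the letter o in place of a zero still decodes to the intended value.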