Everything you need to know about IDs

Stanislav Spiridonov May 30, 2023

squids.org
ulid.page

An ID:

Uniquely identifies an entity in the collection
Is immutable

Speaking of relational databases, a primary key is a column or a set of columns that uniquely specify a row in a table.

If a primary key consists of real world attributes it is called a natural key. For example an email or a phone number can be a good choice for a primary key of a collection of Users since all emails and all phone numbers are globally unique. In contrast, a full name would be a bad choice for that, it is highly likely to meet two persons with the exact same combination of first and last name.

More often than not, a collection does not have any real world attribute with distinct values that is suitable for a primary key. In such cases a surrogate (or synthetic) key is generated to serve as a primary key. In fact, surrogate keys are so often used that they became de facto way of thinking about primary keys. And I am going to talk only about surrogate keys here in this article.

Server-generated IDs

The simplest way to generate a surrogate key is to assign a monotonically increasing number for each new record. This is commonly implemented as SEQUENCE or AUTO_INCREMENT in mainstream relational databases. A new record would be INSERTed into a database with all the fields populated by the client except the id field, and this newly assigned id will be returned as the result of the query. Since they are generated by a database itself, they are called server- (or system-, or auto-) generated IDs.

Numbers are compact in size and easy to operate with. There are some disadvantages though. You need to be careful with the size (such as 32-bit vs 64-bit) or you risk to face difficult migrations in the future (like Twitter did). Sequential numbers can also disclose some information about business performance. If you sign up today and your user_id is 14528, that means there are at most 14.5k users on that website.

But most importantly, sequences become a performance bottleneck with high insertion rate, and they are hard to maintain in a distributed environment. If a database is sharded then a primary key sequence should be somehow synchronized between shards, which does not scale well.

Client-generated IDs

To overcome this performance bottleneck of sequential IDs people started generating IDs on the client side (hence the name client-generated IDs). I am not sure who was the first one to really invent this, but I remember it became somewhat mainstream with MongoDB and its ObjectID. A new record would be inserted into a database with all the fields populated by the client including the id field, and the database would simply respond with a success or a failure.

Any long enough random number or string would really work. It should be fast and cheap to generate and have a tiny-tiny probability of a collision. UUID is one of the most popular ways to do that. Some monsters like AWS use UUIDs in their APIs. It is pretty popular to use UUIDs in Postgres as a primary key with id uuid DEFAULT gen_random_uuid() PRIMARY KEY, but don’t be confused, this is not a client-generated ID, this is still a server-generated one, even if it is a UUID.

Fully random values perform poorly on insertions. Newly created IDs can land somewhere in the middle of the primary key index and the database will have to re-balance the tree. In contrast, sequential numbers most likely land to the right end of the index. To tackle that a timestamp is placed in the beginning of an ID. Mongo’s ObjectID starts with 4-byte timestamp. Newer versions of UUIDs (v6 and v7) are also optimized for that and have a timestamp in the beginning.

Nano ID is another popular solution. They claim several advantages over UUIDv4, such as shorter IDs and small library size. But essentially it is just a random string, and it does not have any write optimizations.

Apart from solving performance issues client-generated IDs have two other important benefits. First, they allow to make Create operations idempotent. A client can safely retry Create requests with the same ID in all of them and be sure that at most one record will be created as the result of that (with consequent calls returning unique index violation errors). Second, they allow to simplify some flows where an ID can be used before a record is actually created in the database (for logging a request, not a response, or for creating other records that hold a reference to a given record before it is created).

Encoded IDs

So far we have been talking about server side of things. Let’s take a look at IDs from the perspective of those who hold those references and operate with them. In most cases IDs are opaque to users. This is a very important property of IDs, but this is not always the case, that’s why I did not mention that is the beginning of the article. Users usually know only that an ID is, let’s say, “an alphanumeric string of 64 characters long” or “a positive integer number up to 2^63”. They don’t necessarily know how to interpret that string or what this value really means. And they should not know, in my opinion. This gives application developers an advantage.

You might have seen some real-life IDs that are not purely random and hold some useful information. For example, a car’s VIN number. It might look just like a long alphanumeric string which uniquely identifies a car, but inside it has a make and model of that car, year, trim and some other additional info. Or let’s take a look at a credit card number: 16 digits, some of them encode a card network (Visa or MasterCard), card type (debit or credit), bank number and account number.

What advantage does it give? Some additional information can be encoded into an ID which later can be used by a server to optimize request routing or perform validations.

In a global service you might follow data locality laws and store users in several data centers in different countries. If a user ID has a country code inside then a frontend server can route a request to a backend in a corresponding data center simply by looking at that ID without a need to look the user up by this ID.
For a web service that has albums and songs you might want to shard the database by albums and collocate songs with corresponding albums. A complex song ID can include an album’s shard key (which cannot be derived from a song ID alone).
At some point you might want to create a v2 version of a service or schema and all new users will be automatically created in this new system, while existing users are still living in the old one. Routing again can be performed on the frontend level by looking at version (or generation) field in the user ID.
A signature can be added to an ID to verify that a given string is in fact generated by your server and not a random string. Surprisingly, a lot of big web services can be DoS’ed by generating random UUIDs which can overload a server by cache misses.

Any such information in ID can be serialized to a byte blob, this byte blob can be encoded as a text and handed out to a user. Hex is a good and simple text representation, but it is too long. Base64 has the most compact text representation. However, Base62 is a better option for IDs, because it can be fully selected by a double click. Unlike Base64, Base62 has only alphanumeric characters and does not have +, = and /. This small detail hugely improves user experience when an ID is exposed in a browser or in a log file.

Property #2 says that IDs must be immutable. That means that this additional information also should be immutable. It is only possible with an assumption that a user can never change its country, or a song can never be moved between albums. This heavily depends on your particular domain, and you need to be careful while deciding to implement this approach.