Challenges and Solutions in Storing and Updating Sequential Numbers in a Database

Posted on May 17, 2023

When you’re using sequence numbers in a database, there are several potential issues to be aware of:

Concurrency: The most common problem with sequence numbers is dealing with concurrent updates. If multiple processes or threads are updating the sequence number at the same time, you could end up with conflicts or duplicates.
Performance: If a table with a sequence number is updated frequently, it can become a performance bottleneck. Every time a row is inserted or updated, the sequence number must be updated as well, which can slow down the overall performance.
Scalability: If the sequence number gets very large, it could potentially exceed the storage capacity for its data type. You’ll need to plan carefully to ensure that you won’t run out of possible sequence numbers.
Gaps in sequence: In many systems, if a transaction that involves a sequence number is rolled back, that sequence number is not reused. This can lead to gaps in the sequence numbers, which some people may find undesirable.
Replication: In a replicated database environment, maintaining the sequence across multiple databases can be difficult and may require additional coordination to ensure uniqueness.
Portability: If you ever need to move your database to a different system, you may encounter issues with maintaining the sequence numbers, as different database systems handle them differently.

In addition to the basic issues, there are some more complex problems that can arise when using sequence numbers in a database.

Hotspotting: This can occur in distributed databases (like Google’s Cloud Spanner or Apache HBase) when sequence numbers are used as primary keys. Because sequence numbers are monotonically increasing, all writes will tend to go to a single node (the one handling the current range of sequence numbers), leading to an imbalance in the load distribution. This can negatively impact the performance and availability of your database.
Sharding/Partitioning: Similar to hotspotting, if you’re sharding or partitioning your data across multiple databases or tables, sequence numbers can cause issues. Since they’re always increasing, all new data will end up in the same shard or partition, which can lead to imbalances.
Replica Lag: In a replicated database environment, it’s possible for replicas to lag behind the master. If your application reads from a replica, it might not see the latest sequence number, which can cause issues if it then tries to write data based on that outdated number.
Failover and Recovery: In case of a database failure and subsequent recovery, managing sequence numbers can become a challenge. For instance, if a transaction involving the increment of a sequence number was in flight during the failure, it might be unclear after recovery whether the increment had been applied or not. This can lead to duplication or skipping of sequence numbers.
Data Migration: When migrating data between systems or during a system upgrade, sequence numbers can present a challenge, especially if you want to maintain the existing sequence. The new system might have a different mechanism for handling sequences, or there might be conflicts between existing data and new data being added during the migration.
Long Transactions: If you have long-running transactions, sequence numbers can cause issues. For example, if a transaction reserves a sequence number but doesn’t commit for a long time, other transactions may be blocked or forced to skip numbers.
Time Travel or Audit Tracking: If your application supports going back to a previous state or needs to track changes over time, sequence numbers can become complex to manage. You’ll need to figure out how to handle sequence numbers when reverting to a previous state, and you may need to track the history of each sequence number to know when it was used.

These advanced issues underscore the importance of careful planning and management when using sequence numbers in a database. Depending on your specific use case, other methods of generating unique identifiers, like UUIDs.

Here are a few more potential issues with sequence numbers in databases:

Data Integrity Issues: Sequence numbers typically serve as identifiers. If there is an error in sequence generation or allocation, it could lead to serious data integrity issues. For example, if a bug in your code leads to the same sequence number being used for two different records, you could end up with data corruption or loss.
Backup and Restore Issues: If you are backing up and restoring your database, sequence numbers can present a challenge. You need to ensure that after a restore operation, new sequence numbers don’t conflict with those in the restored data.
Lack of Meaning: Sequence numbers themselves don’t convey any information about the data in the record. This can be a disadvantage in situations where having a meaningful or structured identifier could be useful.
Dependency Issues: If other tables or systems depend on the sequence numbers in your table, any issues with the sequence numbers could have wide-ranging effects. For example, if a sequence number is changed or deleted, any dependent data might become orphaned or incorrect.
Space and Performance: Sequence numbers typically require an additional column in the table and an additional index. This can increase the space required to store the table and can decrease performance for insertions and updates.
Manual Intervention: If an error occurs with the sequence number generation (for instance, a sequence number is skipped or duplicated), it can require manual intervention to correct, which can be time-consuming and error-prone.
Complexity with Multi-Row Transactions: When you’re inserting or updating multiple rows in a single transaction, managing sequence numbers can become more complex. You need to ensure that each row gets a unique sequence number, and that those numbers are allocated in the correct order.
Predictability: By nature, sequence numbers are predictable, which can lead to potential security issues. If someone can guess the sequence number of a record, they might be able to use that information to their advantage in a malicious way.

To address these issues, you might consider:

Locking: Use some form of locking or synchronization to ensure that only one process can update the sequence number at a time.
Caching: If your database supports it, you might be able to cache a range of sequence numbers in memory to reduce the number of database operations required.
Planning for scale: Make sure that your sequence number’s data type has a large enough range to support your expected volume of data.
Accepting gaps: Rather than trying to prevent gaps in your sequence numbers, you could design your application to tolerate them.
Using UUIDs: Instead of relying on sequence numbers for uniqueness, consider using universally unique identifiers (UUIDs). These can be generated independently on any system without coordination, which makes them a good choice for distributed systems.

How does facebook handle it by using the id as number?

Facebook, along with many other large-scale web applications, uses a distributed, sharded database architecture. While the specific details about how Facebook handles its database operations are proprietary, they have shared some insights in various public presentations and papers.

For generating unique identifiers, Facebook has developed a system called TAO (The Associations and Objects) that manages the social graph consisting of users and their interactions. It’s believed that they use a form of distributed, sharded ID generation.

When a new object (like a post, comment, or user) is created, TAO assigns it a unique ID that consists of two parts:

The ID of the shard where the object is stored.
A sequence number that is unique within that shard.

This approach enables each shard to generate IDs independently, which avoids the need for a global lock and improves performance. However, it also means that IDs are not globally ordered. You can’t tell whether one object was created before another just by comparing their IDs.

In the early days of Facebook, it’s likely that they used simple auto-incrementing IDs. However, as they grew and needed to distribute their data across more and more servers, they had to develop more sophisticated methods of ID generation.

Context Matters

It’s also worth noting that Facebook uses a custom-built system for managing their data, and their solutions are tailored to their specific needs. Smaller applications or those with different requirements might not need such a complex system.