Database · Engineering · Development

Database Migrations That Don't Cause Downtime

The migrations that bring down production are rarely the ones you were afraid of. Here is the pattern we follow to make schema changes safely on live systems.

EvolRed Team · 7 min read

The worst production incidents we have seen have usually not been caused by bugs. They have been caused by database migrations that seemed routine and were not. A column rename that looked innocuous. A NOT NULL constraint added to a table that turned out to have more rows than anyone had checked. An index that locked a table for twenty minutes during peak traffic. This risk applies whether you are running a single well-organised monolith or a distributed system; in the latter, it tends to compound.

These are almost always avoidable. The technique is not exotic. It is just a specific way of thinking about schema changes that most teams never get taught.

The Mental Model: Schema and Code Are Versioned Together

The source of most migration pain is the assumption that schema and code are deployed as a single atomic unit. They are not. There is always a window — sometimes seconds, sometimes longer — where the running code does not match the current schema. The question is whether that window is safe.

A migration is safe if, at every moment during and after it runs, the currently deployed code can continue to function. A migration is unsafe if there is any state where the code and schema are mismatched in a way that causes errors or data loss.

This frames everything else. A "destructive" migration is not one that changes data. It is one that creates a window where code and schema disagree.

The Expand, Migrate, Contract Pattern

The safe pattern for any non-trivial schema change has three steps:

Expand. Add the new thing alongside the old thing. Deploy.

Migrate. Move reads and writes to the new thing gradually. Deploy.

Contract. Once nothing uses the old thing, remove it. Deploy.

Each step is a separate deploy. Each one is independently safe. The goal is that at no point in the sequence is there a state where the currently running code depends on something that has been removed, or expects something that does not yet exist.

The specific mechanics differ by operation. Let us walk through the common ones.

Adding a Column

Adding a nullable column is almost always safe. Modern versions of Postgres and MySQL handle it without locking, existing code ignores the column, and new code can start using it.

Adding a NOT NULL column is where teams get into trouble. The naive approach — ADD COLUMN new_field VARCHAR(255) NOT NULL DEFAULT 'x' — can rewrite the entire table in older database versions, which means locking, which means downtime.

The safer pattern:

  1. Add the column as nullable, with or without a default.
  2. Backfill existing rows in batches, at a rate your system can absorb.
  3. Once all rows have values, add the NOT NULL constraint.

Postgres 11 and later made this less painful: a column added with a constant default no longer forces a table rewrite. But the batched backfill is still the right approach for any table large enough to matter.
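The pattern above can be sketched against an in-memory SQLite database. The table, column names, and batch size are hypothetical, and the final SET NOT NULL is Postgres syntax, shown as a comment because SQLite cannot add the constraint after the fact:

```python
import sqlite3

# A minimal sketch of add-nullable-then-backfill. Table name, column
# name, fill value, and batch size are all hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [(f"u{i}@example.com",) for i in range(10_000)],
)
conn.commit()

# Step 1: add the column as nullable. Existing code keeps working.
conn.execute("ALTER TABLE users ADD COLUMN plan TEXT")

# Step 2: backfill in bounded batches keyed on the primary key, so no
# single statement touches more than BATCH rows. In production you
# would pause between batches at a rate your system can absorb.
BATCH = 1_000
max_id = conn.execute("SELECT MAX(id) FROM users").fetchone()[0]
for start in range(0, max_id, BATCH):
    conn.execute(
        "UPDATE users SET plan = 'free' "
        "WHERE plan IS NULL AND id > ? AND id <= ?",
        (start, start + BATCH),
    )
    conn.commit()

# Step 3 (Postgres syntax, run only once no NULLs remain):
#   ALTER TABLE users ALTER COLUMN plan SET NOT NULL;
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE plan IS NULL"
).fetchone()[0]
print(remaining)  # 0 once the backfill has covered every row
```

Keying the batches on the primary key keeps each UPDATE's lock footprint bounded, which is the whole point of the exercise.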

Removing a Column

This is where the expand-migrate-contract pattern is most important.

The unsafe path: write the migration, deploy it alongside code that no longer uses the column, cross your fingers. The risk is that if the deploy does not hit all application instances at exactly the same moment, some instances are still reading a column that no longer exists.

The safe path:

  1. Stop writing to the column in code. Deploy.
  2. Stop reading from the column in code. Deploy.
  3. After confirming nothing references it, drop the column. Deploy.

This is slower. It is also how you avoid 3am pages.

Renaming a Column

Column renames are almost never worth doing as a single operation. Even when the database supports it in one statement, you have the same deploy-window problem as deletes, plus you have to update every ORM mapping and every query at the same moment.

The safe pattern is a rename in everything except the database:

  1. Add the new column.
  2. Write to both columns at once.
  3. Backfill the new column from the old.
  4. Switch reads to the new column.
  5. Stop writing to the old column.
  6. Drop the old column.

Every step is a separate deploy. Yes, this takes a while. Yes, your tables look ugly in the middle of it. The alternative is an incident.
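The dual-write phase in the middle is the easiest part to get wrong, so here is a minimal sketch of steps 1 through 4 against SQLite. The column names and the update helper are hypothetical:

```python
import sqlite3

# Sketch of renaming users.fullname to users.display_name without a
# database-level rename. All names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
conn.executemany("INSERT INTO users (fullname) VALUES (?)",
                 [("Ada",), ("Grace",)])

# Step 1: add the new column (its own deploy).
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: application writes hit both columns for the whole transition.
def set_name(user_id, name):
    conn.execute(
        "UPDATE users SET fullname = ?, display_name = ? WHERE id = ?",
        (name, name, user_id),
    )

set_name(1, "Ada Lovelace")

# Step 3: backfill the rows the dual writes have not touched yet.
conn.execute(
    "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
)
conn.commit()

# Step 4: reads switch to the new column. Steps 5 and 6 (stop writing
# to fullname, then drop it) come only after this has been live a while.
names = [r[0] for r in conn.execute(
    "SELECT display_name FROM users ORDER BY id")]
print(names)  # ['Ada Lovelace', 'Grace']
```

Note the ordering: dual writes start before the backfill, so no row can slip through between the two.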

If you only need to rename for cosmetic reasons — "field name does not match the business vocabulary" — the better answer is often to leave the database column alone and rename it in the application layer only. The database does not care what you call it.

Adding an Index on a Large Table

Adding an index on a small table is instant. Adding an index on a large table can take minutes or hours, and the default behaviour in most databases is to lock the table while it happens.

The solution on Postgres is CREATE INDEX CONCURRENTLY, which builds the index without an exclusive lock. It is slower than a normal index build, but it does not take the table offline. On MySQL, InnoDB supports online DDL for most index types. Use it.

Relevant: Strong Migrations is a library that lints Rails migrations for unsafe patterns. The rules it enforces are a good reference even if you are not using Rails — they are the specific patterns that cause production incidents.

Big Backfills

When you need to update every row in a large table — to set values in a new column, to restructure data, to clean up historical inconsistencies — do it in batches, not in one statement.

A single UPDATE users SET new_field = some_function(old_field) on a large table will hold locks, generate an enormous write-ahead log, and may overwhelm replicas. A batched version that processes a few thousand rows at a time, with small pauses between batches, does the same work over a longer period without taking the database down.
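A batched version can also select its own work each round, which copes with sparse or non-contiguous ids. A sketch with hypothetical table, columns, transform, and batch size:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users "
    "(id INTEGER PRIMARY KEY, old_field TEXT, new_field TEXT)"
)
conn.executemany("INSERT INTO users (old_field) VALUES (?)",
                 [(str(i),) for i in range(5_000)])
conn.commit()

BATCH = 500
updated = 0
while True:
    # Each statement touches at most BATCH rows, so locks stay short
    # and the write-ahead log grows in small increments.
    cur = conn.execute(
        "UPDATE users SET new_field = 'v' || old_field "
        "WHERE id IN (SELECT id FROM users WHERE new_field IS NULL LIMIT ?)",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:
        break
    updated += cur.rowcount
    time.sleep(0.01)  # in production: long enough for replicas to keep up

print(updated)  # 5000
```

The loop terminates when an UPDATE finds nothing left to do, so it is also safe to stop and rerun; it simply picks up the remaining NULL rows.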

Tools like gh-ost (for MySQL) and pg_repack (for Postgres) exist for specific large-table operations, and are worth reaching for when the operation is genuinely big.

The Dry Run

Every migration we run in production gets tested first on a clone of production data. Not synthetic data. Real data, in roughly real volumes, with realistic distributions.

The number of migrations we have caught this way that would have been fine on staging and disastrous on production is enough that we never skip it. Some classes of bug — constraint violations on data you forgot existed, performance problems on actually-large tables, subtle behavioural differences between dev and prod versions — only show up on data that resembles production.

If you do not have a way to restore production-like data into a non-production environment, that itself is worth fixing. The capability pays for itself the first time it catches a bad migration.

The Rollback Plan

Every migration should be deployable. It should also, wherever possible, be reversible.

What this means in practice: any migration that can be rolled back (adding a nullable column, adding an index) should have a clear rollback procedure. Any migration that cannot easily be rolled back (dropping a column, dropping a table) should only be run after the code that depends on the old state has been out of production for long enough that you are confident you will not need to revert.

The worst position to be in is "we shipped a bad version of the app, and the migration can't be reversed without data loss, and now we have to ship a fix forward under pressure". The way to avoid this is to make destructive migrations the last step in a long sequence, not part of the deploy that might need to be reverted.

The Uncomfortable Truth

Fast migrations and safe migrations are often different things. A change that takes one migration when you are writing it by yourself on a new app takes four or five when you are doing it on a live system with users who cannot tolerate downtime.

That overhead is not waste. It is the price of running a system people depend on. Teams that try to skip it usually learn why it exists the hard way, and the lesson is significantly more expensive than doing the work in advance — the kind of compounding cost we have written about before.


About to run a migration that makes you nervous? Get in touch — we have probably done it before, and we can help you sequence it safely.