Database Migrations: Turning Caterpillars into Butterflies

For a large codebase, managing database schema can become tedious, especially if you maintain multiple testing environments or customers that update the product at different paces. Sometimes, documenting the latest schema or database changes isn’t enough.

In this article, Toptal Database Engineer Ivan Pavlov introduces us to concepts that help manage database states.


Toptalauthors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

For a large codebase, managing database schema can become tedious, especially if you maintain multiple testing environments or customers that update the product at different paces. Sometimes, documenting the latest schema or database changes isn’t enough.

In this article, Toptal Database Engineer Ivan Pavlov introduces us to concepts that help manage database states.


Toptalauthors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.
Ivan Pavlov
Verified Expert in Engineering
17 Years of Experience

Ivan has both back-end and front-end development experience. He has built software for banks, medical organizations and city administration.

Share

Users don’t care what’s inside the software they use; just that it works smoothly, safely, and unobtrusively. Developers strive to make that happen, and one of the problems they try to solve is how to ensure that the data store is in a state appropriate for the current version of the product. Software evolves, and its data model may also change over time, e.g., to fix design mistakes. To complicate the issue further, you may have a number of test environments or customers that migrate to newer versions of the product at different paces. You can’t just document the structure of the store and what manipulations are needed to use the shiny new version from a single perspective.

Database Migrations: Turning Caterpillars into Butterflies

I once joined a project with a few databases with structures that were updated on demand, directly by developers. This meant that there was no obvious way to find out what changes needed to be applied to migrate the structure to the latest version and there was no concept of versioning at all! This was during the pre-DevOps era and would be considered a total mess nowadays. We decided to develop a tool that would be used to apply every change to the given database. It had migrations and would document schema changes. This made us confident that there would be no accidental changes and schema state would be predictable.

In this article, we’ll have a look at how to apply relational database schema migrations and how to overcome concomitant problems.

First of all, what are database migrations? In the context of this article, a migration is a set of changes that should be applied to a database. Creating or dropping a table, column, or index are common examples of migrations. The shape of your schema may change dramatically over time, especially if development was started when requirements were still vague. So, over the course of several milestones on the way to a release, your data model will have evolved and may have become completely different from what it was in the very beginning. Migrations are just steps to the target state.

To get started, let’s explore what we’ve got in our toolbox to avoid reinventing what’s already done well.

Tools

In every widely used language, there are libraries that help make database migrations easy. For example, in the case of Java, popular options are Liquibase and Flyway. We’ll use Liquibase more in examples, but the concepts apply to other solutions and are not tied to Liquibase.

Why bother using a separate schema migration library if some ORMs already provide an option to automagically upgrade a schema and make it match the structure of mapped classes? In practice, such automatic migrations only do simple schema changes, e.g., creating tables and columns, and cannot do potentially destructive things like dropping or renaming database objects. So non-automatic (but still automated) solutions are usually a better choice because you are forced to describe the migration logic yourself, and you know what exactly is going to happen to your database.

It is also a very bad idea to mix automated and manual schema modifications because you may produce unique and unpredictable schemas if manual changes are applied in the wrong order or not applied at all, even if they are required. Once the tool is chosen, use it to apply all schema migrations.

Typical Database Migrations

Typical migrations include creating sequences, tables, columns, primary and foreign keys, indexes, and other database objects. For most common types of changes, Liquibase provides distinct declarative elements for describing what should be done. It would be too dull to read about every trivial change supported by Liquibase or other similar tools. To get an idea of how the changesets look, consider the following example where we create a table (XML namespace declarations are omitted for brevity):

<?xml version="1.0" encoding="UTF-8"?>
<databaseChangeLog>
    <changeSet id="R1-20180201-create_product_table" author="demo">
        <createTable tableName="PRODUCT">
            <column name="ID" type="BIGINT">
                <constraints primaryKey="true"
                             primaryKeyName="PK_PRODUCT"/>
            </column>
            <column name="CODE" type="VARCHAR(50)">
                <constraints nullable="false"
                             unique="true"
                             uniqueConstraintName="UC_PRODUCT_CODE"/>
            </column>
        </createTable>
    </changeSet>
</databaseChangeLog>

As you can see, the changelog is a set of changesets, and changesets consist of changes. Simple changes like createTable can be combined to implement more complex migrations; e.g., suppose you need to update the product code for all products. It can easily be achieved with the following change:

<sql>UPDATE product SET code = 'new_' || code</sql>

Performance will suffer if you’ve got zillions of products. To speed up the migration, we can rewrite it into the following steps:

  1. Create a new table for products with createTable, just like we saw earlier. At this stage, it’s better to create as few constraints as possible. Let’s name the new table PRODUCT_TMP.
  2. Populate PRODUCT_TMP with SQL in the form of INSERT INTO ... SELECT ... using sql change.
  3. Create all the constraints (addNotNullConstraint, addUniqueConstraint, addForeignKeyConstraint) and indexes (createIndex) you need.
  4. Rename the PRODUCT table to something like PRODUCT_BAK. Liquibase can do it with renameTable.
  5. Rename PRODUCT_TMP to PRODUCT (again, using renameTable).
  6. Optionally, remove PRODUCT_BAK with dropTable.

Of course, it is better to avoid such migrations, but it’s good to know how to implement them in case you run into one of those rare cases where you need it.

If you consider XML, JSON, or YAML to be too bizarre for the task of describing changes, then just use plain SQL and utilize all database vendor-specific features. Also, you can implement any custom logic in plain Java.

The way Liquibase exempts you from writing actual database-specific SQL may lead to overconfidence, but you should not forget about the quirks of your target database; e.g., when you create a foreign key, an index may or may not be created, depending on the specific database management system being used. As a result, you might find yourself in an awkward situation. Liquibase allows you to specify that a changeset should be run only for a particular type of database, e.g., PostgreSQL, Oracle, or MySQL. It makes this possible using the same vendor-agnostic changesets for different databases, and for other changesets, using vendor-specific syntax and features. The following changeset will be executed only if using an Oracle database:

<changeSet id="..." dbms="oracle" author="...">
    ...
</changeSet>

Besides Oracle, Liquibase supports a few other databases out of the box.

Naming Database Objects

Every database object you create needs to be named. You are not required to explicitly provide a name for some types of objects, e.g., for constraints and indexes. But it doesn’t mean that those objects won’t have names; their names will be generated by the database anyway. The problem arises when you need to reference that object to drop or alter it. So it’s better to give them explicit names. But are there any rules about what names to give? The answer is short: Be consistent; e.g., if you decided to name indexes like this: IDX_<table>_<columns>, then an index for the aforementioned CODE column should be named IDX_PRODUCT_CODE.

Naming conventions are incredibly controversial, so we won’t presume to give comprehensive instructions here. Be consistent, respect your team or project conventions, or just invent them if there are none.

Organizing Changesets

The first thing to decide on is where to store the changesets. There are basically two approaches:

  1. Keep changesets with the application code. It is convenient to do so because you can commit and review changesets and application code together.
  2. Keep changesets and application code separate, e.g., in separate VCS repositories. This approach is suitable when the data model is shared across several applications and it is more convenient to store all changesets in a dedicated repository and not scatter them across multiple repositories where the application code lives.

Wherever you store the changesets, it is generally reasonable to break them into the following categories:

  1. Independent migrations that don’t affect the running system. It is usually safe to create new tables, sequences, etc, if the currently deployed application is not aware of them yet.
  2. Schema modifications that alter the structure of the store, e.g., adding or dropping columns and indexes. These changes should not be applied while an older version of the application is still in use because doing so may lead to locks or weird behavior due to changes in the schema.
  3. Quick migrations that insert or update tiny amounts of data. If multiple applications are being deployed, changesets from this category can be executed concurrently without degrading the database performance.
  4. Potentially slow migrations that insert or update a lot of data. These changes are better to be applied when no other similar migrations are being executed.

graphic representation of the four categories

These sets of migrations should be run consecutively prior to deploying a newer version of an application. This approach gets even more practical if a system is composed of several separate applications and some of them use the same database. Otherwise, it is worth separating only those changesets that could be applied without affecting the running applications, and the remaining changesets may be applied together.

For simpler applications, the full set of necessary migrations can be applied at application startup. In this case, all changesets fall into a single category and are run whenever the application is initialized.

Whatever stage is chosen to apply migrations at, it is worth mentioning that using the same database for multiple applications may cause locks when migrations are being applied. Liquibase (like many other similar solutions) utilizes two special tables to record its metadata: DATABASECHANGELOG and DATABASECHANGELOGLOCK. The former is used for storing information about applied changesets, and the latter to prevent concurrent migrations within the same database schema. So, if multiple applications must use the same database schema for some reason, it is better to use non-default names for metadata tables to avoid locks.

Now that the high-level structure is clear, you need to decide how to organize the changesets within each category.

sample changeset organization

It greatly depends on specific application requirements, but the following points are usually reasonable:

  1. Keep changelogs grouped by releases of your product. Create a new directory for each release and place the corresponding changelog files into it. Have a root changelog and include changelogs that correspond to releases. In release changelogs, include other changelogs comprising this release.
  2. Have a naming convention for changelog files and changeset identifiers—and follow it, of course.
  3. Avoid changesets with a lot of changes. Prefer multiple changesets to a single long changeset.
  4. If you use stored procedures and need to update them, consider using the runOnChange="true" attribute of the changeset in which that stored procedure is added. Otherwise, each time it’s updated, you’ll need to create a new changeset with a new version of the stored procedure. Requirements vary, but it’s often acceptable to not track such history.
  5. Consider squashing redundant changes before merging feature branches. Sometimes, it happens that in a feature branch (especially in a long-lived one) later changesets refine changes made in earlier changesets. For example, you may create a table and then decide to add more columns to it. It is worth adding those columns to the initial createTable change if this feature branch was not merged to the main branch yet.
  6. Use the same changelogs to create a test database. If you try to do so, you may soon find out that not every changeset is applicable to the test environment, or that additional changesets are needed for that specific test environment. With Liquibase, this problem is easily solved using contexts. Just add the context="test" attribute to the changesets that need to be executed only with tests, and then initialize Liquibase with the test context enabled.

Rolling Back

Like other similar solutions, Liquibase supports migrating schema “up” and “down.” But be warned: Undoing migrations may not be easy, and it’s not always worth the effort. If you decided to support undoing migrations for your application, then be consistent and do it for every changeset that would need to be undone. With Liquibase, undoing a changeset is accomplished by adding a rollback tag that contains changes required to perform a rollback. Consider the following example:

<changeSet id="..." author="...">
    <createTable tableName="PRODUCT">
        <column name="ID" type="BIGINT">
            <constraints primaryKey="true"
                         primaryKeyName="PK_PRODUCT"/>
        </column>
        <column name="CODE" type="VARCHAR(50)">
            <constraints nullable="false"
                         unique="true"
                         uniqueConstraintName="UC_PRODUCT_CODE"/>
        </column>
    </createTable>
    <rollback>
        <dropTable tableName="PRODUCT"/>
    </rollback>
</changeSet>

Explicit rollback is redundant here because Liquibase would perform the same rollback actions. Liquibase is able to automatically roll back most of its supported types of changes, e.g., createTable, addColumn, or createIndex.

Fixing the Past

No one is perfect, and we all make mistakes. Some of them may be discovered too late when spoiled changes have already been applied. Let’s explore what could be done to save the day.

Manually Update the Database

It involves messing with DATABASECHANGELOG and your database in the following ways:

  1. If you want to correct bad changesets and execute them again:
    • Remove rows from DATABASECHANGELOG that correspond to the changesets.
    • Remove all side effects that were introduced by the changesets; e.g., restore a table if it was dropped.
    • Fix the bad changesets.
    • Run migrations again.
  2. If you want to correct bad changesets but skip applying them again:
    • Update DATABASECHANGELOG by setting the MD5SUM field value to NULL for those rows that correspond to the bad changesets.
    • Manually fix what was made wrong in the database. For example, if there was a column added with the wrong type, then issue a query to modify its type.
    • Fix the bad changesets.
    • Run migrations again. Liquibase will calculate the new checksum and save it to MD5SUM. Corrected changesets won’t be run again.

Obviously, it’s easy to do these tricks during development, but it gets much harder if the changes are applied to multiple databases.

Write Corrective Changesets

In practice, this approach is usually more appropriate. You may wonder, why not just edit the original changeset? The truth is that it depends on what needs to be changed. Liquibase calculates a checksum for each changeset and refuses to apply new changes if the checksum is new for at least one of the previously applied changesets. This behavior can be customized on a per-changeset basis by specifying the runOnChange="true" attribute. The checksum is not affected if you modify preconditions or optional changeset attributes (context, runOnChange, etc.).

Now, you may be wondering, how do you eventually correct changesets with mistakes?

  1. If you’d like those changes to still be applied for new schemas, then just add corrective changesets. For example, if there was a column added with the wrong type, then modify its type in the new changeset.
  2. If you’d like to pretend that those bad changesets never existed, then do the following:
    • Remove the changesets or add the context attribute with a value guaranteeing you’d never try to apply migrations with such a context again, e.g., context="graveyard-changesets-never-run".
    • Add new changesets that will either revert what was done wrong or fix it. These changes should be applied only if bad changes were applied. It can be achieved with preconditions, such as with changeSetExecuted. Don’t forget to add a comment explaining why you are doing so.
    • Add new changesets that modify the schema the right way.

As you see, fixing the past is possible, though it may not always be straightforward.

Mitigating Growing Pains

As your application gets older, its changelog also grows, accumulating every schema change along the path. It’s by design, and there’s nothing inherently wrong with this. Long changelogs can be made shorter by regularly squashing migrations, e.g., after releasing each version of the product. In some cases, it would make initializing fresh schema faster.

illustration of changelogs being squashed

Squashing is not always trivial and may cause regressions without bringing many benefits. Another great option is using a seed database to avoid executing all changesets. It is well suited to test environments if you need to have a database ready as fast as possible, maybe even with some test data. You may think of it as a form of squashing for changesets: At some point (e.g., after releasing another version), you make a dump of the schema. After restoring the dump, you apply migrations as usual. Only new changes will be applied because older ones were already applied before making the dump; therefore, they were restored from the dump.

illustration of a seed database

Conclusion

We intentionally avoided diving deeper in Liquibase’s features to deliver an article that is short and to the point, focused on evolving schemas in general. Hopefully, it’s clear what benefits and problems are brought about by automated application of database schema migrations and how well it all fits into DevOps culture. It’s important not to turn even good ideas into dogma. Requirements vary, and as database engineers, our decisions should foster moving a product forward and not just adhering to recommendations from someone on the internet.

Understanding the basics

  • What is the meaning of schema in database?

    The database schema describes how the data is organized inside the database.

  • What is the difference between a database and a schema?

    A schema is a part of a database. The database is usually comprised of one or more schemas. Also, it’s quite common to call a DBMS a database. It’s usually clear from the context whether we’re talking about the data container or about the system that manages that container.

  • What is an example of a schema?

    If you’re building a tour management application, its database schema would contain entities like airline, flight, or city. Besides user-defined schemas, a DBMS usually has an “information schema” that can be used to query the settings and the metadata.

  • What are the different types of databases?

    Besides relational databases, there are object, document-oriented, and hierarchical databases.

  • Is there a way to make query run faster?

    You can optimize a slow query by restructuring it or changing the schema. A DBMS usually can give you the execution plan and help guess what is slowing your query (e.g., not using indexes or selecting too much data in subqueries). In PostgreSQL, you can use EXPLAIN or EXPLAIN ANALYZE to understand what’s wrong.

  • What is the schema in PostgreSQL?

    In PostgreSQL, you may have several databases within the same cluster; the schema is a structural element of the database. It is the container for tables, views, procedures, etc. Schemas may be considered directories inside the database, but schemas can’t contain other schemas.

Hire a Toptal expert on this topic.
Hire Now
Ivan Pavlov

Ivan Pavlov

Verified Expert in Engineering
17 Years of Experience

Hamburg, Germany

Member since March 7, 2016

About the author

Ivan has both back-end and front-end development experience. He has built software for banks, medical organizations and city administration.

authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

World-class articles, delivered weekly.

By entering your email, you are agreeing to our privacy policy.

World-class articles, delivered weekly.

By entering your email, you are agreeing to our privacy policy.

Join the Toptal® community.