Introduction

Recently, I coordinated Porch’s sponsorship of the development of non-transactional PostgreSQL migrations as a feature for Flyway—an open source database versioning tool developed by Boxfuse. We’ve adopted Flyway in our microservices architecture for many reasons:

  • schema changes are written in SQL instead of yet another domain language
  • we can utilize the native features of the RDBMS
  • the DDL resides with the microservice code
  • replicating the database schema is simple and always repeatable
  • code deployments requiring schema changes are automated
  • the database is versioned, i.e. we always know what patches have been applied

I am happy to report the feature was recently released in version 4.1.

Problem Background

We regularly utilize three commands that PostgreSQL requires to be run outside of a transaction:

  1. Adding new values to an enumerated type
  2. Building an index concurrently to prevent table locking
  3. Vacuuming tables to reclaim disk space from dead rows

Flyway is rightfully constructed to run the DDL and DML of every database migration file within a transaction. If any errors are encountered, the transaction is safely rolled back, leaving the database in a consistent state. With Flyway versions prior to 4.1 being incompatible with the above commands, we would manually intervene in database deployments, thereby negating the automated nature of code deployments. In the following, I will review the design decisions made during the implementation of this feature.

Determining Transactional vs. Non-transactional

The first hurdle is detecting when a migration needs to be run outside of a transaction. Two paths forward are evident. The first path, suggested in the GitHub issue, is to create a filename annotation. The second path is to add a statement parser to Flyway that detects non-transactional commands and runs them in the appropriate isolation level. Despite the extra work it would require on their part, the Boxfuse team was steadfast on following the second path. It provides the better user experience for Flyway users in the long term. The biggest problem with filename annotations is that it adds complexity to the Flyway user interface for a specific use case that only applies to PostgreSQL environments. Users managing other RDBMS’ shouldn’t have to worry about syntax that doesn’t apply to their systems. Since Boxfuse’s goal for Flyway is to be an easy-to-use generic tool that can work with many database platforms, the solution is for Flyway to inspect what is being asked of it and to act according to the situation. Put the burden on the automation instead of the user.

Non-Transactional Isolation

Now that non-transactional migrations are detected, the next issue is determining what to do when non-transactional statements are mixed with transactional statements in the same migration. Let us review Example 1 to see what happens when transactional and non-transactional statements are mixed and errors are introduced.

Example 1

-- V1__initial.sql
create type myschema.mytype as enum
   ('A');
-- V2__throw_error.sql

-- This table will persist after the below error due to the non-transactional run
create table myschema.persists();

-- Altering an enum triggers a non-transactional run
alter type myschema.mytype add value 'B' after 'A';

-- Force an error to see what happens
alter table myschema.persists drop column nonexistent;

In migration V2, the type created in V1 is being altered, causing Flyway’s pre-parser to flag V2 as needing to be run outside of a transaction. Note that the alter type can be anywhere in V2 for the pre-parser to flag it as a non-transactional migration. When Flyway begins executing the migrations, V1 completes successfully, but it has problems with V2 as the column that is being dropped doesn’t exist and PostgreSQL will error out. Since V2 is run outside of a transaction, nothing was rolled back, and the table myschema.persists remains after the failed execution. When V2 is fixed, the V2 migration will fail again, now at the create table statement, because the table already exists.

The safest way to migrate the database is to run as much of the DDL inside of transactions as possible. Since the parser exists to detect non-transactional statements, it would be possible to split up the migration file automatically, running each part transactionally as appropriate. However, this introduces a lot of complexity to the migration process, makes it impossible to completely roll back a partially applied migration, and breaks the versioning system. If V2 only executes half of the migration, it is unclear at what version the database should be left. Thus, it was decided to disallow mixing non-transactional and transactional statements by default. In the spirit of failing fast, Flyway throws an error while parsing if it detects a migration with mixed transactional and non-transactional statements. Example 2 shows the necessary changes.

Example 2

-- V1__initial.sql
create type myschema.mytype as enum
   ('A');
-- V2__throw_error.sql
-- Transactional Migration
-- Still throws an error, but can be rolled back
create table myschema.persists();

-- Force an error to see what happens
alter table myschema.persists drop column nonexistent;
-- V3__modify_type.sql
-- Non-transactional migration
alter type myschema.mytype add value 'B' after 'A';

Creating a condition that migrations must not mix transactional with non-transactional statements creates a restriction that may introduce an unseen problem for a user in the future. In case this unknown situation requiring DDL that normally would be run inside of a transaction needs to be handled, a property called flyway.allowMixedMigrations is added to the configuration. By default it is false. I strongly suggest to never turn it on.

Concurrency

Having solved the first problem of not being able to modify enumerated types, focus turned toward building indices concurrently. Building indices concurrently is important for maintaining high availability in production. One of the great features of Flyway is its ability to handle multiple simultaneous code deployments. In a production environment where a cluster of the new version of the microservice is deployed simultaneously, it would be catastrophic for all of of the instances to attempt to migrate their shared database at the same time.

In the past, when migrating PostgreSQL instances Flyway would open two connections to the target database. In the first connection, a standard SELECT…FOR UPDATE is executed against Flyway’s schema_version table to create a lock that prevents other deployments from migrating at the same time. Once that lock is obtained, Flyway opens its second connection to start applying the migrations. In the case when an index is attempting to be built concurrently, this causes a deadlock scenario. A concurrent index cannot be finalized until all transactions that started before it began are completed. Since Flyway is waiting for the index to complete before releasing the FOR UPDATE lock, it gets stuck, never finishing.

The solution to prevent the deadlock is to switch to advisory locks. Once an advisory lock is obtained, the transaction in which it is created completes. The lock is tied to the session of the connection and will be automatically released upon disconnect or when it is explicitly released. Since the FOR UPDATE transaction does not need to be held open to preserve its exclusive lock, Boxfuse eliminated the second connection. Now Flyway connects, opens the advisory lock, applies migrations including concurrent index builds, updates the schema_version table, releases the lock, and closes the connection.

Vacuuming

Once enumeration modifications and concurrent index builds were working, no notable issues for vacuuming were encountered. Having the ability to vacuum in a migration is useful when a previous migration deletes many rows from a table, say if they were moved to a newly created table, to reclaim the disk space. The auto vacuum process that is enabled by default in versions of PostgreSQL since 8.3 would eventually get to cleaning the table, but for performance reasons it may be necessary to trigger a manual vacuuming in certain situations.

Summary

At the start, I covered how we avoided adding PostgreSQL specific complexity to the Flyway user interface by auto-detecting non-transactional migrations. Next, I explained why non-transactional statements need to be put into isolated migrations, separating transactional DDL. Finally, I discussed the changes to the locking needed for Flyway in order to support concurrent index building.

About the Author

Michael Rasmussen is a Sr. Data Engineer at Porch. He has been working with PostgreSQL databases for a decade, and he has a passion for entity-relational modeling and data normalization. He is a graduate of U.C. Santa Barbara’s College of Engineering with a bachelor’s in Computer Science. He resides in the Seattle area with his two cats Huey and Pookie.