Sharing related information among isolated systems has become increasingly important to organizations, as it allows them to improve the quality and availability of data. There are many situations where it is useful to have a data set that is available and consistent in more than one directory server. That’s why knowing the common methods for performing SQL Server data synchronization is important.

Data availability and consistency can be achieved by data replication and data synchronization processes. Data replication is the process of creating one or more redundant copies of a database for the purpose of fault tolerance or accessibility improvement. Data synchronization is the process of establishing data consistency between two or more databases, and the subsequent continuous updates to maintain said consistency.

Various sources of data getting minced together into SQL queries

In many organizations, performing data synchronization across diverse systems is both desirable and challenging. We can find many use cases where we need to perform data synchronization:

  • Database migration
  • Regular synchronization between information systems
  • Importing data from one information system into another
  • Moving data sets between different stages or environments
  • Importing data from a non-database source

There is no unique way or unanimously agreed method for data synchronization. This task differs from case to case, and even data synchronizations that should be simple at first glance can be complicated, due to the complexity of data structures. In real scenarios, data synchronization consists of many complex tasks, which can take a long time to perform. When a new requirement comes up, database specialists usually have to reimplement the whole synchronization process. Since there are no standard ways of doing this, besides replication, the implementations of data synchronization are rarely optimal. This results in difficult maintenance and higher expenses. Implementation and maintenance of data synchronization is such a time consuming process, it can be a full-time job by itself.

We can implement architecture for data synchronization tasks manually, possibly using Microsoft Sync Framework, or we can benefit from already created solutions within tools for managing Microsoft SQL Server. We will try to describe the most common methods and tools that can be used to solve data synchronization on Microsoft SQL Server databases and try to give some recommendations.

Based on the structure of the source and destination (e.g., databases, tables) we can differentiate use cases when structures are similar or different.

Source and Destination Have Very Similar Structures

This is very often the case when we use data in various stages of the software development lifecycle. For example, the data structure in the testing and production environments is very similar. The common requirement is to compare data between the testing and production database and import data from the production into the testing database.

Source and Destination Have Different Structures

If the structures are different, synchronization is more complicated. This is also a more frequently recurring task. A common case is importing from one database into another. The most common case is when a piece of software needs to import data from another piece of software which is maintained by another company. Usually, imports need to run automatically on a scheduled basis.

The method used depends on personal preferences and complexity of the problem you need to solve.

Regardless of how similar the structures are, we can choose four different ways for solving data synchronization:

  • Synchronization using manually created SQL scripts
  • Synchronization using the data compare method (can be used only when source and target have similar structure)
  • Synchronization using automatically generated SQL scripts - need commercial product

Source and Destination Have the Same or Very Similar Structures

Using Manually Created SQL Scripts

The most straightforward and tedious solution is to manually write SQL scripts for synchronization.

Advantages

  • Can be performed by free and open source (FOSS) tools.
  • If the table has indexes, it is very fast.
  • The SQL script can be saved into a stored procedure, or run periodically as a job for SQL Server.
  • Can be used as an automatic import, even on continuously changed data.

Disadvantages

  • Creating such a SQL script is quite tedious, because three scripts are usually needed for each table: INSERT, UPDATE, and DELETE.
  • You can only synchronize data that is available via SQL queries, so you can’t import from sources like CSV and XML files.
  • It is hard to maintain—when database structure is changed, it is necessary to modify two or three scripts (INSERT, UPDATE, and sometimes also DELETE).

Example

We’ll do synchronization between the table Source, with columns ID and Value, and the table Target, with the same columns.

If tables have the same primary key, and the target table doesn’t have an auto-incrementing (identity) primary key, you can execute the following synchronization script.

 -- insert
INSERT INTO Target (ID, Value)
SELECT ID, Value FROM Source
WHERE NOT EXISTS (SELECT * FROM Target WHERE Target.ID = Source.ID);

-- update
UPDATE Target
SET Value = Source.Value
FROM Target INNER JOIN Source ON Target.ID = Source.ID

-- delete
DELETE FROM Target
WHERE NOT EXISTS (SELECT * FROM Source WHERE Target.ID = Source.ID)

Using the Data Compare Method

In this method, we can use a tool to compare between source and target data. The comparison process generates SQL scripts that apply the differences from the source database into the target database.

There are a number of programs for data comparison and synchronization. These programs mostly use the same approach. The user selects the source and the target database, but other alternatives could be a DB backup, a folder with SQL scripts, or even a connection to a source control system.

Below are the most popular tools that use the data compare approach:

In the first step, the data is read, or just checksums of larger data from the source and from the target are read. Then the comparison process is executed.

These tools also offer additional settings for synchronizing.

We need to set up the following configuration options that are necessary for data synchronization:

Sync Key

By default, the primary key or a UNIQUE constraint is used. If there is no primary key, you can choose a combination of columns. The Sync key is used to pair rows of the source with rows of the target.

Table Pairing

By default, tables are paired by name. You can change this, and pair them according to your own needs. In the dbForge Data Compare software, you can choose SQL query to be the source or destination.

Synchronization Process

After confirming, the tool compares source and target data. The whole process consists of downloading all of the source and target data and comparing them based on specified criteria. By default, values from equally named tables and columns are compared. All tools support mapping column and table names. Also, there is the possibility to exclude IDENTITY (autoincrement) columns or to do some transformations before comparing values (round float types, ignore character case, treat NULL as an empty string, etc.) Data download is optimized. If the data volume is large, only checksums are downloaded. This optimization is helpful in most cases, but time requirements for performing operations increase with the volume of data.

In the next step, there is a SQL script with generated migrations. This script can be saved or run directly. To be safe, we can even make a database backup before running this script. ApexSQL Data Diff tool can create an executable program which runs the script on a selected database. This script contains data which needs to be changed, not the logic how to change it. This means that the script cannot be run automatically to provide a recurring import. That is the biggest disadvantage of this approach.

Advantages

  • Advanced knowledge of SQL is not required, and can be done via GUI.
  • You have the ability to visually check differences between databases before synchronization.

Disadvantages

  • It’s an advanced feature of commercial products.
  • Performance decreases when transferring enormous volumes of data.
  • Generated SQL script contains only differences, and thus cannot be reused for automatically synchronizing future data.

Below you can see the typical UI of these tools.

ApexSQL Data Diff

ApexSQL Data Diff

RedGate SQL Compare

RedGate SQL Compare

Change list in dbForge Data Compare

Change list in dbForge Data Compare

Synchronize with Automatically Generated SQL

This method is very similar to the data comparison method. The only difference compared to the previous method is that there is no data comparison, and the generated SQL script does not contain data differences, but synchronization logic. The generated script can be easily saved into a stored procedure, and can be run periodically (e.g., each night). This method is useful for automatic imports between databases. The performance of this method is much better than the data compare method.

Synchronization by automatically generated SQL is only provided by SQL Database Studio.

SQL Database Studio provides a similar interface to the data compare method. We need to select the source and target (databases or tables). Then we need to set up options (sync keys, pairing and mapping). There is a graphical query builder feature for setting up all parameters.

Advantages

  • Advanced knowledge of SQL is not required.
  • You can set up everything in a GUI pretty quickly.
  • The resulting SQL script can be saved into a stored procedure.
  • Can be used as automatic import - as a job for SQL Server.

Disadvantages

  • It’s an advanced feature of commercial products.
  • Differences can’t be checked manually before synchronization, because the whole process is executed in one step.

Performance Benchmarks

Test Case

Two databases (A and B), each containing one table with 2,000,000 rows. The tables are in two different databases on same SQL Server. This test covers two extreme cases: 1) The source table contains all 2,000,000 rows and the target table is empty. Synchronization needs to provide many INSERTS. 2) The source and target tables contain 2,000,000 rows. The difference is only in one row. Synchronization needs to provide only one UPDATE.

RedGate Data Compare needs 3 steps:

  • Compare
  • Generate script
  • Run script on target database

ApexSQL Data Diff needs 2 steps:

  • Compare
  • Generate script and run script in one step

SQL Database Studio performs the whole synchronization in one step. Below are synchronization times, in seconds. In the column labeled “individual steps” are durations of the synchronization steps listed above.

  Case A. many INSERTs Case A. many INSERTs (individual steps) Case B. UPDATE one row Case B. UPDATE one row (individual steps)
SQL Database Studio 47   5  
RedGate Data Compare 317 13+92+212 23 22+0+1
ApexSQL Data Diff 188 18+170 26 25+

 

Lower is better.

The same test, but databases are on different SQL servers, which are not connected over a linked server.

  Case A. many INSERTs Case A. many INSERTs (individual steps) Case B. UPDATE one row Case B. UPDATE one row (individual steps)
SQL Database Studio 78   44  
RedGate Data Compare 288 17+82+179 25 24+0+1
ApexSQL Data Diff 203 18+185 25 24+1
dbForge Data Compare 326 11+315 16 16+0

 

Lower is better.

Summary

From the results, it’s obvious that RedGate and Apex do not care if databases are on the same SQL server, because the synchronization algorithm is not dependent on SQL Server. SQL Database Studio uses native functions of SQL Server; therefore, the result is better when databases are on the same server.

Source and destination have a different structure

There are also situations when one wide table has to be synchronized into many small related tables.

This example consists of one wide table SourceData which needs to be synchronized into small tables Continent, Country, and City. The scheme is given below.

Scheme for example database

Data in SourceData could be like the one in the image below.

Data points for the example

Using manually created SQL scripts

Script Synchronizing Continent Table

INSERT INTO Continent (Name)
SELECT SourceData.Continent
FROM SourceData
WHERE (SourceData.Continent IS NOT NULL
        AND NOT EXISTS (SELECT * FROM Continent tested WHERE tested.Name =SourceData.Continent ))
GROUP BY SourceData.Continent;

Script Synchronizing City Table

INSERT INTO City (Name, CountryId)
SELECT SourceData.City,
    Country.Id
FROM SourceData
    LEFT JOIN Continent ON SourceData.Continent = Continent.Name
    LEFT JOIN Country ON SourceData.Country = Country.Name AND Continent.Id = Country.ContinentId
WHERE SourceData.City IS NOT NULL
        AND Country.Id IS NOT NULL
        AND NOT EXISTS (SELECT * FROM City tested
            WHERE tested.Name = SourceData.City AND tested.CountryId = Country.Id)
GROUP BY
    SourceData.City,
    Country.Id;

This script is more complicated. It is because records in the tables Country and Continent have to be found. This script inserts missing records into City and fills ContryId correctly.

The UPDATE and DELETE scripts could also be written the same way if needed.

Advantages

  • You do not need any commercial products.
  • The SQL script can be saved into stored procedure or run periodically as a job for SQL Server.

Disadvantages

  • Creating such a SQL script is difficult and complicated (for each table, three scripts—INSERT, UPDATE, and DELETE—are usually necessary).
  • It is very hard to maintain.

Using external tools

This kind of synchronization (wide table into many related tables) cannot be done with the data compare method, because it is focused on different use cases. Since the data compare method produces a SQL script with data to be inserted, it has no straightforward ability to look up references in related tables. For that reason, applications using this method can’t be used (dbForge Data Compare for SQL Server, RedGate SQL Data Compare, Apex SQL Data Diff).

However, SQL Database Studio can help you to create synchronization scripts automatically. In the picture below, there is an element called Editor for Data Synchronization in SQL Database Studio.

Editor for Data Synchronization in SQL Database Studio

Editor looks like the well-known Query builder and works in a very similar way. Each table needs to have a defined synchronization key, but there are also defined relations between tables. In the picture above there is mapping for synchronization as well. In the column list (lower part of image) there are the columns of the table City (for other tables it is similar).

Columns

  • Id — This column is not mapped because it is the primary key (automatically generated).
  • CountryId — This column is defined as a reference for the table.
  • Name — This column is filled from column City in the source table (wide table).

Columns CountryId and Name are chosen as synchronization keys. The synchronization key is a set of columns which uniquely identify a row in the source and target table. You can not use the primary key Id as a synchronization key because this is not in the source table.

After the synchronization, this is what the tables look like:

Content of tables after synchronization

In the example above, there was one wide table as a source. There is also a common scenario when source data is stored in several related tables. Relations in SQL Database Studio are not defined using foreign keys, but column names. This way it is also possible to import from CSV or Excel files (the file is loaded into a temporary table, and synchronization is run from that table). It’s good practice to have unique column names. If this is not possible, you can define aliases to those columns.

Advantages

  • Easy and fast to create
  • Easy to maintain
  • Can be saved into a stored procedure (the stored procedure is saved with data necessary to open the synchronization in an editor later on)

Disadvantages

  • Commercial solution

Comparing the Solutions

Data synchronization consists of a sequence of INSERT, UPDATE, or DELETE commands. There are multiple ways to create sequences of these commands. In this article, we looked at three options for creating synchronization SQL scripts. The first option is to create everything manually. It is feasible (but takes too much time), it requires complex understanding of SQL, and it’s difficult to create and maintain. The second option is to use commercial tools. We looked at the following tools:

  • dbForge Data Compare for SQL Server
  • RedGate SQL Data Compare
  • Apex SQL Data Diff
  • SQL Database Studio

The first three tools work very similarly. They compare data, let the user analyze differences, and can synchronize selected differences (even automatically or from command line). They are beneficial for these usage scenarios:

  • Databases are out of sync due to various errors.
  • You need to avoid replication while transferring data between environments.
  • Data comparison reports in Excel or HTML are needed.

Each tool is loved for one reason or the other: dbForge has great UI and a lot of options, ApexSQL performs better than the rest, and RedGate is the most popular one.

The fourth tool, SQL Database Studio, works a little differently. It generates SQL scripts that contain synchronization logic, not changes. Performance is also great, because all work is done directly on the database server, so no data transfer between the database server and the synchronization tool is needed. This tool is useful for the following use cases:

  • Automatic database migrations where databases has different structure
  • Import into multiple related tables
  • Import from external sources XML, CSV, MS Excel

Understanding the Basics

Which tools are available for database synchronization?

You can use tools such as RedGate Data Compare, ApexSQL Data Diff, and dbForge Data Compare, which use the data compare method. You can use SQL Database Studio, which automatically generates reusable SQL scripts.

About the author

Andrej Gajdos, Czech Republic
member since March 14, 2016
Andrej is a freelance software engineer, consultant, and occasional tech blogger with over six years of experience delivering software. His strong theoretical and practical background in IT, marketing, and sales makes him capable with projects of any size. He makes complex problems simple with fast, clean, and maintainable code. He has worked with an array of different technologies and is currently focusing on full-stack web development. [click to continue...]
Hiring? Meet the Top 10 Freelance SQL Developers for Hire in December 2017

Comments

comments powered by Disqus
Subscribe
The #1 Blog for Engineers
Get the latest content first.
No spam. Just great engineering posts.
The #1 Blog for Engineers
Get the latest content first.
Thank you for subscribing!
Check your inbox to confirm subscription. You'll start receiving posts after you confirm.
Trending articles
Relevant Technologies
About the author
Andrej Gajdos
HTML/CSS Developer
Andrej is a freelance software engineer, consultant, and occasional tech blogger with over six years of experience delivering software. His strong theoretical and practical background in IT, marketing, and sales makes him capable with projects of any size. He makes complex problems simple with fast, clean, and maintainable code. He has worked with an array of different technologies and is currently focusing on full-stack web development.