What is a “staging” area, why is it needed?
Staging is an optional, intermediate storage area in ETL processes.
The decision “to stage or not to stage” can be split into four main considerations:
- Auditing purposes. Thanks to the staging area we are able to compare the original input file with our outcome. It’s extremely useful, especially when the source system overwrites the history (e.g., flat files on an FTP server are being overwritten every day.)
- Recovery needs. Even though PCs are getting faster—typically having more bandwidth of nearly every form—there are still some legacy systems and environments that are not performant to extract data from. It’s good practice to store the data as soon as it’s extracted from the source system. This way staging objects can act as recovery checkpoints, avoiding the situation where a process needs to be completely rerun when it fails at 90 percent done.
- Backup. In the case of a failure, the staging area can be used to recover data in a target system.
- Load performance. If the data has to be loaded as soon as possible into the system, staging is the way to go. Developers load data as-is into the staging area, then perform various transformations on it from there. It’s far more efficient than transforming the data on-the-fly before loading it into the target system, but the tradeoff here is higher disk space usage.
How would you prepare and develop incremental loads?
The most common way to prepare for incremental load is to use information about the date and time a record was added or modified. It can be designed during the initial load and maintained later, or added later in an ETL process based on business logic.
It is very important to make sure that the fields used for this are not modified during the process and that these can be trusted.
The next step is to decide how to capture the changes, but the underlying basics are always the same: Comparing the last modified date to the maximum date already existing in the target, and then taking all records that are larger.
Another option is to prepare a process for delta loads, which would compare already existing records with new ones, and only load the differences. But this is not the most efficient way.
How would you update a big table, i.e. one having over 10 million rows?
The most common way is to use batches: Split one big query into smaller ones, e.g. 10 thousand rows in a batch:
DECLARE @id_control INT = 0 --current batch ,@batchSize INT = 10000 --size of the batch ,@results INT = 1 --row count after batch -- if 0 rows returned, exit the loop WHILE (@results > 0) BEGIN UPDATE [table] SET [column] = [value] WHERE [PrimaryKey column] > @id_control AND [PrimaryKey column] <= @id_control + @batchSize -- the latest row count SET @results = @@ROWCOUNT -- start next batch SET @id_control = @id_control + @batchSize END
If the table is too big, a better option might be to create a new table, insert the changed data, and then switch tables.
What are the disadvantages of indexes?
Indexes allow fast lookups, but they decrease load performance: Heavily indexed tables will not allow effective DML operations, i.e. insertions and updates.
It’s worth noting that indexes take additional disk space. Worse, though, the database back end needs to update all relevant indexes whenever data changes. It also creates additional overhead due to index fragmentation: Developers or DBAs have to take care of index maintenance, reorganization, and rebuilds.
Index fragmentation causes serious performance issues. When new data is inserted into an index, the database engine has to find space for it. It might happen that the new data insert messes up the current order—the SQL engine might split the data from a single data page, which creates an excessive amount of free space (internal fragmentation). It might also mess up the current page order, which forces the SQL engine to jump between pages when reading data from the disk. All of this creates additional overhead to the process of reading the data and forces random disk I/O.
Microsoft, in particular, recommends reorganizing an index when index fragmentation is between 5 and 30 percent, and to rebuild it when it’s greater than 30 percent. An index rebuild in SQL Server creates another index underneath and then replaces the previous one. Rebuilding may block the whole table from reading it (when using an edition other than their Enterprise offering.) Index reorganization is basically a reordering of leaf-pages and an attempt to compact data pages.
How would you prepare logging for ETL process?
Logging is extremely important to keep track of all changes and failures during a load. The most common ways to prepare for logging are to use flat files or a logging table. That is, during the process, counts, timestamps, and metadata about the source and target are added and then dumped into a flat file or table.
This way the load can be checked for invalid runs. When such a table or file exists, the next step would be to prepare notifications. This could be a report, or a simple formatted email, describing the load as soon as it finishes (e.g. the number of processed records compared to the previous load.)
To achieve that in an ETL process, a developer would add event handlers (SSIS) or use variables (like the system variable
@@ROWCOUNT in Transact-SQL) to keep track of inserted, updated, and deleted records.
When using SSIS we can also keep track of every processed package using the
SELECT TOP 10000 o.Object_Name [Project Name] ,REPLACE(e.package_name , '.dtsx', '') [Package Name] ,o.start_time [Start Time] ,o.end_time [End Time] ,e.message_source_name [Message Source Name] ,e.event_name [Event Name] ,e.subcomponent_name [Subcomponent Name] ,e.message_code [Message Code] ,m.message_time [Event Time] ,m.message [Error Message] ,m.message_time [Error Date] ,o.caller_name [Caller Name] ,o.Stopped_By_Name [Stopped By] ,ROW_NUMBER() OVER (PARTITION BY m.operation_id ORDER BY m.message_source_type DESC [Source Type Order] FROM SSISDB.internal.operations o JOIN SSISDB.internal.operation_messages m ON o.operation_id = m.operation_id JOIN SSISDB.internal.event_messages e ON m.operation_id = e.operation_id AND m.operation_message_id = e.event_message_id WHERE o.Object_Name LIKE '%[object name]%' -- database AND event_name LIKE '%Error%' ORDER BY o.end_time DESC ,o.Object_Name ,[Source Type Order] ASC
Other than flat files and the database itself, ETL tools offer native notification and logging features: for example, a dashboard indicating the current load status.
What is the purpose of data profiling in an ETL process? Which steps in the data profiling process are the most important?
Data profiling tasks help maintain data quality. During this phase, a number of issues are checked for and resolved. The most important are:
- Keys and unique identification of a row. Rows to be inserted must be unique. Often businesses use some natural keys to identify a given row, but developers must verify that this is sufficient.
- Data types. Column names that suggest a certain type should be scrutinized: Will the indicated type change the meaning of the column, or potentially allow for data loss? Data types can also affect post-ETL performance: Even if it doesn’t matter much during the process, text loaded into a variable-length string column will, on some RDBMSes, cause a performance hit when users start querying the target.
- Relationships among data. It’s important to know how tables relate to each other. It might require additional modeling to join some parts of data to avoid losing important structural information. Another thing is to understand the cardinality of a relationship, as it determines how the tables involved are going to be joined in the future.
What are three different approaches to implementing row versioning?
Maintaining row history requires implementing a versioning policy. The three most popular types are:
- Insert a new record: In this case, updated information about the row is stored, but it’s not linked to any other information—it’s treated as a new row. Usually, in this case, there is also an additional column (or even more than one) to easily identify the most recent change. It could be, for example, a “current record” flag, a “reason for change” text field, or a “valid from/until” pair of
- Additional column(s): Here the old value of a changed column is moved to the additional column (e.g.
old_amount) and the new value takes the place of the original (e.g.
- History table: First a history table is created, separate from the primary table. Then we have multiple options for how to load data into this table. One of them is to create DML triggers. Functionality provided by RDBMS vendors—like change data capture features—can be handy here. Such features can be far more efficient than triggers, like when they keep track of changes directly in the transaction log, which is responsible for keeping information about any changes made to the database. SQL Server—in particular, 2016 and beyond—can track changes using system-versioned temporal tables. This feature maintains a full history table next to the most current one: The main temporal table keeps only the most recent version of the data, but it is linked to the history table, which contains all previous versions.
What’s better from a performance point of view: Filtering data first and then joining it with other sources, or joining it first and then filtering?
It’s better to filter data first and then join it with other sources.
A good way to improve ETL process performance is to get rid of unwanted data as soon as possible in the process. It reduces the time spent on data transfer and/or I/O and memory processing.
The general rule is to reduce the number of processed rows and to avoid transforming data that never gets to the target.
What is the advantage of third-party tools like SSIS compared to SQL scripts?
Third-party tools offer faster and simpler development. Thanks to their GUIs, these tools can also be used by people who are not technical experts but have wide knowledge about the business itself.
ETL tools are able to generate metadata automatically and have predefined connectors for most sources. One of the most important features is also the ability to join data from multiple files on the fly.
What are the most common transformations in ETL processes?
Although the list might get very long, there are some basic steps performed during the ETL process that every ETL developer should mention immediately: data conversion, aggregation, deduplication, and filtering.
Other options a candidate might mention include:
- Data cleaning
- Calculating new fields
- Lookup operations
- Data validation