What are the most common transformations in ETL processes?
Although the list might get very long, there are some basic steps performed during the ETL process that every ETL developer should mention immediately: data conversion, aggregation, deduplication, and filtering.
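For instance, deduplication is frequently done with a window function. A minimal T-SQL sketch, assuming a hypothetical staging.Customers table with a CustomerID business key and a ModifiedDate column:

-- keep only the newest row per business key; all names here are hypothetical
;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY CustomerID           -- the business key
                              ORDER BY ModifiedDate DESC) AS rn -- newest row gets rn = 1
    FROM staging.Customers
)
DELETE FROM ranked
WHERE rn > 1; -- everything but the latest version of each customer is removed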
Other options a candidate might mention include:
What is a “staging” area, and why is it needed?
Staging is an optional, intermediate storage area in ETL processes.
The decision “to stage or not to stage” can be split into four main considerations:
How would you prepare and develop incremental loads?
The most common way to prepare for incremental loads is to use information about the date and time a record was added or modified. This can be designed into the initial load and maintained afterward, or added later in the ETL process based on business logic.
It is very important to make sure that the fields used for this are not modified during the process and can be trusted.
The next step is to decide how to capture the changes, but the underlying basics are always the same: compare the last-modified date of the incoming records to the maximum date already present in the target, and take all records with a later date.
Another option is to prepare a process for delta loads, which compares already-existing records with the new ones and loads only the differences, but this is usually not the most efficient way.
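As a minimal sketch, assuming hypothetical staging.Orders and dw.FactOrders tables that both carry a ModifiedDate column, an incremental load based on a modification timestamp could look like this:

-- determine the high-water mark already present in the target
DECLARE @last_loaded DATETIME;
SELECT @last_loaded = ISNULL(MAX(ModifiedDate), '19000101') -- fall back to a very old date on the first run
FROM dw.FactOrders;

-- take only the source records changed after that point
INSERT INTO dw.FactOrders (OrderID, Amount, ModifiedDate)
SELECT s.OrderID, s.Amount, s.ModifiedDate
FROM staging.Orders s
WHERE s.ModifiedDate > @last_loaded;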
What is the advantage of third-party tools like SSIS compared to SQL scripts?
Third-party tools offer faster and simpler development. Thanks to their GUIs, these tools can also be used by people who are not technical experts but have broad knowledge of the business itself.
ETL tools can generate metadata automatically and have predefined connectors for most sources. One of their most important features is the ability to join data from multiple files on the fly.
How would you update a big table, i.e. one having over 10 million rows?
The most common way is to use batches: split one big update into smaller ones, e.g. 10,000 rows per batch:
DECLARE @id_control INT = 0      -- current batch
      , @batchSize  INT = 10000  -- size of the batch
      , @results    INT = 1      -- row count after batch

-- if 0 rows returned, exit the loop
WHILE (@results > 0)
BEGIN
    UPDATE [table]
    SET [column] = [value]
    WHERE [PrimaryKey column] >  @id_control
      AND [PrimaryKey column] <= @id_control + @batchSize

    -- the latest row count
    SET @results = @@ROWCOUNT

    -- start next batch
    SET @id_control = @id_control + @batchSize
END
If the table is too big, a better option might be to create a new table, insert the changed data, and then switch tables.
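A rough sketch of that switch-table approach in T-SQL, using a hypothetical dbo.BigTable and an example transformation applied during the copy:

-- build the updated data in a new table
SELECT Id,
       UPPER(Country) AS Country,   -- the "update" is applied while copying
       Amount
INTO dbo.BigTable_new
FROM dbo.BigTable;

-- swap the tables, keeping the old one as a fallback
BEGIN TRANSACTION;
    EXEC sp_rename 'dbo.BigTable', 'BigTable_old';
    EXEC sp_rename 'dbo.BigTable_new', 'BigTable';
COMMIT;

-- recreate indexes and constraints on the new table, then drop dbo.BigTable_old once verified

Note that SELECT ... INTO creates the new table without the original indexes, constraints, or permissions, so those have to be recreated before the switch is considered complete.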
What are the disadvantages of indexes?
Indexes allow fast lookups, but they decrease load performance: heavily indexed tables make DML operations, i.e. inserts and updates, much less efficient.
It’s worth noting that indexes take additional disk space. Worse, though, the database back end needs to update all relevant indexes whenever data changes. This also creates additional overhead due to index fragmentation: developers or DBAs have to take care of index maintenance, reorganization, and rebuilds.
Index fragmentation causes serious performance issues. When new data is inserted into an index, the database engine has to find space for it. The insert may disturb the existing order: the SQL engine might split a full data page in two, which leaves an excessive amount of free space (internal fragmentation). It might also break the logical page order (external fragmentation), which forces the SQL engine to jump between pages when reading data from disk. All of this adds overhead to reads and forces random disk I/O.
Microsoft, in particular, recommends reorganizing an index when fragmentation is between 5 and 30 percent, and rebuilding it when it’s greater than 30 percent. An index rebuild in SQL Server creates a new index behind the scenes and then replaces the previous one; rebuilding may block reads on the whole table unless the online option is used, which requires Enterprise edition. Index reorganization is basically a reordering of leaf pages and an attempt to compact data pages.
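For illustration, in SQL Server the fragmentation level can be read from a dynamic management function and the index then reorganized or rebuilt accordingly; dbo.FactSales and IX_FactSales_Date below are hypothetical names:

-- check fragmentation of the indexes on a table
SELECT i.name AS index_name,
       s.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.FactSales'), NULL, NULL, 'LIMITED') s
JOIN sys.indexes i
    ON i.object_id = s.object_id
   AND i.index_id  = s.index_id;

-- moderate fragmentation (roughly 5-30 percent): reorganize
ALTER INDEX IX_FactSales_Date ON dbo.FactSales REORGANIZE;

-- heavy fragmentation (above 30 percent): rebuild; ONLINE = ON requires Enterprise edition
ALTER INDEX IX_FactSales_Date ON dbo.FactSales REBUILD WITH (ONLINE = ON);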
What’s better from a performance point of view: Filtering data first and then joining it with other sources, or joining it first and then filtering?
It’s better to filter data first and then join it with other sources.
A good way to improve ETL process performance is to get rid of unwanted data as early as possible in the process. This reduces the time spent on data transfer, I/O, and in-memory processing.
The general rule is to reduce the number of processed rows and to avoid transforming data that never gets to the target.
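A simple T-SQL illustration with hypothetical dbo.Orders and dbo.Customers tables, where the date filter is applied before the join rather than after it:

-- reduce the large source first, then join the smaller result set
;WITH recent_orders AS (
    SELECT OrderID, CustomerID, Amount
    FROM dbo.Orders
    WHERE OrderDate >= '2024-01-01'     -- unwanted rows are dropped as early as possible
)
SELECT c.CustomerName, r.OrderID, r.Amount
FROM recent_orders r
JOIN dbo.Customers c
    ON c.CustomerID = r.CustomerID;

A relational engine will often push such a predicate down on its own, but in an ETL data flow that reads from separate sources the developer usually has to apply the filter explicitly in each source query.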
How would you prepare logging for an ETL process?
Logging is extremely important to keep track of all changes and failures during a load. The most common ways to prepare for logging are to use flat files or a logging table. That is, during the process, counts, timestamps, and metadata about the source and target are added and then dumped into a flat file or table.
This way the load can be checked for invalid runs. When such a table or file exists, the next step would be to prepare notifications. This could be a report, or a simple formatted email, describing the load as soon as it finishes (e.g. the number of processed records compared to the previous load.)
To achieve that in an ETL process, a developer would add event handlers (SSIS) or use variables (like the system variable @@ROWCOUNT in Transact-SQL) to keep track of inserted, updated, and deleted records.
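A minimal sketch of such a logging table and of the record a load step might write into it; the etl schema and all names are hypothetical:

-- a simple logging table
CREATE TABLE etl.LoadLog (
    LogID        INT IDENTITY PRIMARY KEY,
    PackageName  VARCHAR(200),
    SourceName   VARCHAR(200),
    TargetName   VARCHAR(200),
    RowsInserted INT,
    RowsUpdated  INT,
    StartTime    DATETIME,
    EndTime      DATETIME,
    Status       VARCHAR(50)
);

DECLARE @start_time DATETIME = GETDATE(),
        @inserted   INT = 0,
        @updated    INT = 0;

-- ...the load steps run here; after each DML statement, capture the row count, e.g.:
-- SET @inserted = @inserted + @@ROWCOUNT;

-- dump the counters and metadata once the load finishes
INSERT INTO etl.LoadLog (PackageName, SourceName, TargetName, RowsInserted, RowsUpdated, StartTime, EndTime, Status)
VALUES ('LoadOrders', 'staging.Orders', 'dw.FactOrders', @inserted, @updated, @start_time, GETDATE(), 'Success');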
When using SSIS, we can also keep track of every processed package using the SSISDB database:
SELECT TOP 10000 o.Object_Name [Project Name]
    ,REPLACE(e.package_name, '.dtsx', '') [Package Name]
    ,o.start_time [Start Time]
    ,o.end_time [End Time]
    ,e.message_source_name [Message Source Name]
    ,e.event_name [Event Name]
    ,e.subcomponent_name [Subcomponent Name]
    ,e.message_code [Message Code]
    ,m.message_time [Event Time]
    ,m.message [Error Message]
    ,m.message_time [Error Date]
    ,o.caller_name [Caller Name]
    ,o.Stopped_By_Name [Stopped By]
    ,ROW_NUMBER() OVER (PARTITION BY m.operation_id ORDER BY m.message_source_type DESC) [Source Type Order]
FROM SSISDB.internal.operations o
JOIN SSISDB.internal.operation_messages m
    ON o.operation_id = m.operation_id
JOIN SSISDB.internal.event_messages e
    ON m.operation_id = e.operation_id
    AND m.operation_message_id = e.event_message_id
WHERE o.Object_Name LIKE '%[object name]%' -- filter by project
    AND e.event_name LIKE '%Error%'
ORDER BY o.end_time DESC
    ,o.Object_Name
    ,[Source Type Order] ASC
Other than flat files and the database itself, ETL tools offer native notification and logging features: for example, a dashboard indicating the current load status.
What is the purpose of data profiling in an ETL process? Which steps in the data profiling process are the most important?
Data profiling tasks help maintain data quality. During this phase, a number of issues are checked for and resolved. The most important are:
What are three different approaches to implementing row versioning?
Maintaining row history requires implementing a versioning policy. The three most popular types are:
Adding a new row: each change is stored as a new version of the row, with the validity period tracked by a pair of datetimes (or a tsrange, perhaps).
Adding a new column: the previous value is moved to a dedicated column (e.g. old_amount) and the new value takes the place of the original (e.g. amount).
Keeping a separate history table: the main table holds only the current version, while previous versions are copied to a history table.
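A minimal sketch of the first approach (a new row per change), with hypothetical names:

-- each change closes the current version and inserts a new one
CREATE TABLE dw.DimCustomer (
    CustomerKey INT IDENTITY PRIMARY KEY,
    CustomerID  INT NOT NULL,              -- business key
    Amount      DECIMAL(18,2),
    ValidFrom   DATETIME NOT NULL,
    ValidTo     DATETIME NULL              -- NULL marks the current version
);

-- close the current version of customer 42...
UPDATE dw.DimCustomer
SET ValidTo = GETDATE()
WHERE CustomerID = 42
  AND ValidTo IS NULL;

-- ...and insert the new one
INSERT INTO dw.DimCustomer (CustomerID, Amount, ValidFrom, ValidTo)
VALUES (42, 199.99, GETDATE(), NULL);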
There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.