10 Essential Data Analysis Interview Questions *

Toptal sourced essential questions that the best data analysts can answer. These questions are drawn from our community, and we encourage experts to submit questions and offer feedback.

Interview Questions

What techniques can be used to handle missing data?

There are plenty of alternatives to handle missing data, although none of them is perfect or fits all cases. Some of them are:

Dropping incomplete rows: Simplest of all, it can be used if the amount of missing data is small and seemingly random.
Dropping variables: This can be used if the proportion of missing data in a feature is too big and the feature is of little significance to the analysis. In general, it should be avoided, as it usually throws away too much information.
Considering “not available” (NA) to be a value: Sometimes missing information is information in itself. Depending on the problem domain, missing values are sometimes non-random: Instead, they’re a byproduct of some underlying pattern.
Value imputation: This is the process of estimating the value of a missing field given other information from the sample. There are various viable kinds of imputation. Some examples are mean/mode/median imputation, k-nearest neighbors (KNN), regression models, and multiple imputation.
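
As a rough illustration of three of the techniques above (dropping incomplete rows, treating NA as a value, and median imputation), here is a minimal sketch in plain Python. The sample rows, field names, and helper functions are hypothetical and chosen only for illustration; in practice a data analysis library would normally do this work.

```python
from statistics import median

# Hypothetical sample: None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 29, "income": None},
    {"age": 45, "income": 58000},
]

def drop_incomplete(rows):
    """Dropping incomplete rows: keep only rows with no missing fields."""
    return [r for r in rows if all(v is not None for v in r.values())]

def add_missing_flag(rows, field):
    """Treating NA as a value: add a boolean indicator for missingness."""
    return [{**r, f"{field}_missing": r[field] is None} for r in rows]

def impute_median(rows, field):
    """Value imputation: fill missing values with the median of observed ones."""
    fill = median(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

complete = drop_incomplete(rows)
flagged = add_missing_flag(rows, "age")
imputed = impute_median(rows, "age")
```

Note that dropping rows discards half of this tiny sample, which is why that option is only reasonable when little data is missing.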

What is data validation?

In any data-oriented process the “garbage in, garbage out” issue is always a possibility. To mitigate it, we make use of data validation, a process composed of a set of rules to ensure that the data reaches a minimum quality standard. A couple of examples of validation checks are:

Data type validation: Checks whether the data is of the expected type (e.g., integer, string) and conforms to the expected format.
Range and constraint validation: Checks if the observed values fall within a valid range. For example, temperature values must be above absolute zero (or, more likely, above a higher minimum that depends on the operating range of the equipment used to record them).
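
Both kinds of check can be sketched in a few lines. The record layout (`sensor_id`, `temperature_c`), the function name, and the default upper bound below are assumptions made for this example only; real limits would come from the equipment specification.

```python
ABSOLUTE_ZERO_C = -273.15

def validate_reading(record, min_c=ABSOLUTE_ZERO_C, max_c=150.0):
    """Return a list of validation errors for one record (empty if valid).

    `min_c` and `max_c` are hypothetical operating limits.
    """
    errors = []
    sensor = record.get("sensor_id")
    temp = record.get("temperature_c")

    # Data type validation
    if not isinstance(sensor, str) or not sensor:
        errors.append("sensor_id must be a non-empty string")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(temp, bool) or not isinstance(temp, (int, float)):
        errors.append("temperature_c must be a number")
    # Range validation (only meaningful once the type is known to be numeric)
    elif not (min_c < temp <= max_c):
        errors.append(f"temperature_c {temp} is outside ({min_c}, {max_c}]")
    return errors
```

For instance, `validate_reading({"sensor_id": "a1", "temperature_c": -300})` reports a range error, while a well-formed record returns an empty list.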

What are the differences between linear and logistic regression?

Linear regression is a statistical model that, given a set of input features, attempts to fit the best possible straight line (or hyperplane, in the general case) relating the independent variables to the dependent variable. Since its output is continuous and its cost function measures the distance from the observed to the predicted values, it is an appropriate choice to solve regression problems (e.g., to predict sales numbers).

Logistic regression, on the other hand, passes its output through the sigmoid function, so it outputs a probability, which by definition is a value bounded between zero and one. Therefore, it is most appropriate to solve classification problems (e.g., to predict whether a given transaction is fraudulent or not).
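
The difference in output can be seen in a small sketch. The weights and bias here are arbitrary illustrative values rather than fitted parameters, and the function names are hypothetical; the point is only that both models compute the same weighted sum, and logistic regression additionally squashes it into the interval from zero to one.

```python
import math

def linear_predict(features, weights, bias):
    """Linear regression output: an unbounded weighted sum."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def logistic_predict(features, weights, bias):
    """Logistic regression output: the weighted sum passed through a sigmoid."""
    z = linear_predict(features, weights, bias)
    # Numerically stable sigmoid: avoids overflow for large negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

score = linear_predict([2.0, 3.0], [1.5, -0.5], 1.0)          # any real number
probability = logistic_predict([2.0, 3.0], [1.5, -0.5], 1.0)  # between 0 and 1
```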

What is model extrapolation? What are its pitfalls?

Model extrapolation is the use of a model to estimate values beyond the range of the data on which it was fitted, on the assumption that the relationships between variables observed within that range still hold.

The main issue with extrapolation is that it is, at best, an educated guess. Since there is no data to support it, it’s generally not possible to claim that the observed relationships still hold. A relationship that looks linear within a given range might actually be non-linear outside of it.
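
To make the pitfall concrete, the sketch below fits a straight line by ordinary least squares to made-up data whose true relationship is quadratic but which is only observed over a narrow range. The data and helper names are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: the true relationship is y = x**2,
# but it is only observed for x between 0 and 3.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ys = [x ** 2 for x in xs]
slope, intercept = fit_line(xs, ys)

def predict(x):
    return slope * x + intercept

inside_error = abs(predict(2.0) - 2.0 ** 2)     # within the observed range
outside_error = abs(predict(10.0) - 10.0 ** 2)  # far outside the observed range
```

Within the observed range the line is off by less than one unit, but at x = 10 the error exceeds 70: the fit gave no warning that the linear approximation would break down.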

What is data leakage in the context of data analysis? What problems may arise from it? Which strategies can be applied to avoid it?

Data leakage occurs when a statistical model is trained with information that would not actually be available when the model is used to make predictions.

Data leakage makes results during model training and validation look much better than what is observed once the model is deployed. The resulting estimates are overly optimistic and may lead to an entirely invalid predictive model.

There is no single recipe to eliminate data leakage, but some practices help to avoid it:

Don’t use future data to make predictions of the past. Although obvious, it’s a very common mistake when validating models, especially when using cross-validation. When training on time-series data, always make sure to use an appropriate validation strategy.
Prepare the data within cross-validation folds. Another common mistake is to apply data preparation steps, like normalization or outlier removal, to the whole dataset before splitting it to validate the model. This leaks information from the validation data into training.
Investigate IDs. It’s easy to dismiss IDs as randomly generated values, but sometimes they encode information about the target variable. If they are leaky, it’s best to remove them from any sort of model.
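
The first two practices can be sketched as follows: split chronologically so that training data always precedes test data, and compute normalization statistics from the training portion only. The toy rows and function names are hypothetical, and a real project would typically rely on a library's pipeline and time-series splitting utilities instead.

```python
from statistics import mean, pstdev

def chronological_split(rows, train_fraction=0.8):
    """Split so that every training row precedes every test row in time."""
    ordered = sorted(rows, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

def fit_scaler(values):
    """Compute normalization statistics from the given values only."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against zero variance
    return mu, sigma

def scale(values, mu, sigma):
    return [(v - mu) / sigma for v in values]

# Hypothetical, unordered time series with one late extreme value.
rows = [
    {"timestamp": t, "value": float(v)}
    for t, v in [(3, 30), (1, 10), (5, 500), (2, 20), (4, 40)]
]

train, test = chronological_split(rows, train_fraction=0.8)

# Statistics come from the training rows alone, then are reused on the test rows.
mu, sigma = fit_scaler([r["value"] for r in train])
train_scaled = scale([r["value"] for r in train], mu, sigma)
test_scaled = scale([r["value"] for r in test], mu, sigma)
```

Had the mean and standard deviation been computed over all five rows, the extreme test value would have influenced how the training data was scaled, which is exactly the kind of leak described above.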

A retail chain owner has collected purchasing history data from his stores for 10 years. The data dictionary is shown below:

Feature	Description
Transaction ID	Unique transaction ID. Must appear just once in the dataset
Store ID	Unique store ID. May appear more than once in the dataset
Client ID	Unique client ID. May appear more than once in the dataset
Item ID	Unique item ID. May appear more than once in the dataset
Item Quantity	Number of items bought together
Item Price	Price of a single item
Date and Time of Purchase	Timestamp of the purchase
Payment Method	One of the following: cash, credit card, debit card, or voucher

What kind of information or analyses could we leverage that might generate value to the business? Assume each transaction represents the purchase of a single type of item.

This is an open-ended question, and the answer will depend on previous experience and domain expertise. The goal is not to get every single item right, but to showcase critical thinking and domain knowledge.

For this scenario, some of the possible paths to explore are:

Determine which are the most popular items sold
Explore how much is spent per transaction
Find which clients spend the most
Find the most recurrent clients
Uncover seasonalities and trends

All of the above can be analyzed over the whole dataset or by region, by store or by time frame. The analyses could be further enriched with store, client and item information, if available.

The information uncovered could be used to:

Better scale inventory sizing and the number of on-site employees using time-series forecasting.
Perform direct marketing to the most profitable clients, which could be identified with the aid of clustering techniques.
Enhance item positioning in a store by grouping items likely to be bought together, which could be identified through market basket analysis. Recommender systems could also be applied.
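
A few of the exploratory paths listed earlier (most popular items, spend per client, most recurrent clients) reduce to simple aggregations. The sketch below uses a handful of invented transactions; the field names are assumptions loosely based on the data dictionary, not an actual schema.

```python
from collections import Counter, defaultdict

# Hypothetical transactions, one item type per transaction.
transactions = [
    {"transaction_id": 1, "client_id": "c1", "item_id": "apple", "quantity": 3, "price": 0.5},
    {"transaction_id": 2, "client_id": "c2", "item_id": "bread", "quantity": 1, "price": 2.0},
    {"transaction_id": 3, "client_id": "c1", "item_id": "bread", "quantity": 2, "price": 2.0},
    {"transaction_id": 4, "client_id": "c3", "item_id": "apple", "quantity": 5, "price": 0.5},
    {"transaction_id": 5, "client_id": "c1", "item_id": "milk", "quantity": 1, "price": 1.0},
]

def units_per_item(transactions):
    """Total units sold per item (popularity)."""
    units = Counter()
    for t in transactions:
        units[t["item_id"]] += t["quantity"]
    return units

def spend_per_client(transactions):
    """Total amount spent per client."""
    spend = defaultdict(float)
    for t in transactions:
        spend[t["client_id"]] += t["quantity"] * t["price"]
    return dict(spend)

def visits_per_client(transactions):
    """Number of transactions per client (recurrence)."""
    return Counter(t["client_id"] for t in transactions)

top_item = units_per_item(transactions).most_common(1)[0]
spend = spend_per_client(transactions)
top_spender = max(spend, key=spend.get)
most_recurrent = visits_per_client(transactions).most_common(1)[0][0]
```

The same aggregations can be grouped by store or by time frame by adding those fields to the grouping key.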

What steps typically make up the data analysis process?

Finding a relevant business problem to solve: Often neglected, this is the most important step of the process, since generating business value is the end goal of any data analyst. Having a clear objective and restricting the data space to be explored is paramount to avoiding wasting resources. Since it requires deep knowledge of the problem domain, this step may be executed by a domain expert other than the data analyst.
Data extraction: The next step is to collect data for analysis. It could be as simple as loading a CSV file, but more often than not it involves gathering data from multiple sources and formats.
Data cleansing: After gathering the data, the dataset needs to be prepared for processing. Likely the most time-consuming step, data cleansing can include handling missing fields, corrupt data, outliers, and duplicate entries.
Data exploration: This is often what comes to mind when thinking of data analysis. Data exploration involves generating statistics, features, and visualizations from the data to better understand its underlying patterns. This then leads to insights that might generate business value.
Data modeling and model validation (optional): Training a statistical or machine learning model is not always required, as a data analyst usually generates value through insights found in the data exploration step, but it may uncover additional information. Easily interpretable models, like linear or tree-based models, and clustering techniques often expose patterns that would be otherwise difficult to detect with data visualization alone.
Storytelling: This last step encompasses every bit of information uncovered previously to finally present a solution to—or at least a path to continue exploring—the business problem proposed in the first step. It’s all about being able to clearly communicate findings to stakeholders and convincing them to take a course of action that will lead to creating business value.

These are the most common steps of data analysis. Although they have been presented as a list, more often than not they are not executed sequentially and some steps may require several iterations as new data sources are added and information is uncovered.

What is the difference between correlation and causation? How can we infer the latter?

Correlation is a statistic that measures the strength and direction of the associations between two or more variables.

Causation, on the other hand, is a relationship that describes cause and effect.

“Correlation does not imply causation” is a famous quote that warns us about the dangers of the very common practice of looking at a strong correlation and assuming causality. A strong correlation may manifest without causation in the following cases:

Lurking variable: An unobserved variable that affects both variables of interest, causing them to exhibit a strong correlation, even when there is no direct relationship between them.
Confounding variable: A confounding variable is one that cannot be isolated from one or more of the variables of interest. Therefore we cannot explain if the result observed is caused by the variation of the variable of interest or of the confounding variable.
Spurious correlation: Sometimes due to coincidence, variables can be correlated even though there is no reasonably logical relationship.

Causation is tricky to infer. The most common approach is to set up a randomized experiment, where the variable that’s a candidate to be the cause is isolated and tested. Unfortunately, in many fields running such an experiment is impractical or not viable, so using logic and domain knowledge becomes crucial to formulating reasonable conclusions.
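
As an illustration of how a lurking variable can produce a strong correlation, the sketch below computes the Pearson correlation coefficient for two invented series that are both driven by temperature. The numbers and variable names are made up for this example; neither series causes the other.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical lurking variable: daily temperature.
temperature = [10, 15, 20, 25, 30]
# Both series below depend on temperature, not on each other.
ice_cream_sales = [26, 34, 46, 54, 66]
sunburn_cases = [6, 7, 11, 12, 16]

r = pearson(ice_cream_sales, sunburn_cases)
```

Here `r` comes out above 0.95, even though selling ice cream does not cause sunburn. The coefficient alone cannot distinguish this situation from a genuine causal link.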

What are precision and recall? In which cases are they used?

Precision and recall are metrics that measure classification performance, each using its own criteria, given by the formulas below:

\[\text{Precision} = \frac{TP}{TP+FP}\]

\[\text{Recall} = \frac{TP}{TP+FN}\]

Where:

TP = True Positive
FP = False Positive
FN = False Negative

In other words, precision is the ratio of correctly classified positive cases over all cases predicted as positive, while recall is the ratio of correctly classified positive cases over all positive cases.

Precision is an appropriate measure when the cost of a false positive is high (e.g. email spam classification), while recall is appropriate when the cost of a false negative is high (e.g. fraud detection).

Both are also frequently used together in the form of the F1-score, which is defined as:

\[\text{F1} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}}\]

The F1-score is the harmonic mean of precision and recall, so it is high only when both are high. Because it balances the two, it’s a good measure of classification performance for highly imbalanced datasets.
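
The formulas above translate directly into code. This is a minimal sketch for binary labels where 1 is the positive class; the function name and the sample labels are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Convention for this sketch: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return precision, recall, f1

# Hypothetical labels: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these labels there are two true positives, one false positive, and two false negatives, so precision is 2/3, recall is 1/2, and F1 is 4/7.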

How can we visualize more than three dimensions of data in a single chart?

Usually, data is visually represented through a chart using locations in the image (height, width, and depth). Going beyond three dimensions, we need to make use of other visual cues to add more information. Some of the most common are:

Color: A visually appealing and intuitive way to depict both continuous and categorical data.
Size: Marker size is also used to represent continuous data. It could be applied to categorical data as well, but since differences in size are more difficult to detect than differences in color, it is not the most appropriate choice for this type of data.
Shape: Lastly, we have shapes, which are an effective way to represent different classes.

Combining all of the above we can visualize up to six dimensions, though one could argue that cramming so much information in a single chart does not make for a very effective visualization.

Another possibility is to make an animated chart, which is quite useful to depict changes through time.

There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.
