Of which steps is the data analysis process typically composed?
- Finding a relevant business problem to solve: Often neglected, this is the most important step of the process, since generating business value is the end goal of any data analyst. Having a clear objective and restricting the data space to be explored is paramount to avoiding wasting resources. Since it requires deep knowledge of the problem domain, this step may be executed by a domain expert other than the data analyst.
- Data extraction: The next step is to collect data for analysis. It could be as simple as loading a CSV file, but more often than not it involves gathering data from multiple sources and formats.
- Data cleansing: After gathering the data, the dataset needs to be prepared for processing. Likely the most time-consuming step, data cleansing can include handling missing fields, corrupt data, outliers, and duplicate entries.
- Data exploration: This is often what comes to mind when thinking of data analysis. Data exploration involves generating statistics, features, and visualizations from the data to better understand its underlying patterns. This then leads to insights that might generate business value.
- Data modeling and model validation (optional): Training a statistical or machine learning model is not always required, as a data analyst usually generates value through insights found in the data exploration step, but it may uncover additional information. Easily interpretable models, like linear or tree-based models, and clustering techniques often expose patterns that would be otherwise difficult to detect with data visualization alone.
- Storytelling: This last step encompasses every bit of information uncovered previously to finally present a solution to—or at least a path to continue exploring—the business problem proposed in the first step. It’s all about being able to clearly communicate findings to stakeholders and convincing them to take a course of action that will lead to creating business value.
These are the most common steps of data analysis. Although they have been presented as a list, more often than not they are not executed sequentially and some steps may require several iterations as new data sources are added and information is uncovered.
What techniques can be used to handle missing data?
There are plenty of alternatives to handle missing data, although none of them is perfect or fits all cases. Some of them are:
- Dropping incomplete rows: Simplest of all, it can be used if the amount of missing data is small and seemingly random.
- Dropping variables: This can be used if the proportion of missing data in a feature is too big and the feature is of little significance to the analysis. In general, it should be avoided, as it usually throws away too much information.
- Considering “not available” (NA) to be a value: Sometimes missing information is information in itself. Depending on the problem domain, missing values are sometimes non-random: Instead, they’re a byproduct of some underlying pattern.
- Value imputation: This is the process of estimating the value of a missing field given other information from the sample. There are various viable kinds of imputation. Some examples are mean/mode/median imputation, KNN, regression models, and multiple imputations.
What is data validation?
In any data-oriented process the “garbage in, garbage out” issue is always a possibility. To mitigate it, we make use of data validation, a process composed of a set of rules to ensure that the data reaches a minimum quality standard. A couple of examples of validation checks are:
- Data type validation: Checks whether the data is of the expected type (eg. integer, string) and conforms to the expected format.
- Range and constraint validation: Checks if the observed values fall within a valid range. For example, temperature values must be above absolute zero (or likely a higher minimum depending on the operating range of the equipment being used to record them.)
How can we visualize more than three dimensions of data in a single chart?
Usually, data is visually represented through a chart using locations in the image (height, width, and depth). Going beyond three dimensions, we need to make use of other visual cues to add more information. Some of the most common are:
- Color: A visually appealing and intuitive way to depict both continuous and categorical data.
- Size: Marker size is also used to represent continuous data. Could be applied for categorical data as well, but since size differences are more difficult to detect than color, it is not the most appropriate choice for this type of data.
- Shape: Lastly, we have shapes, which are an effective way to represent different classes.
Combining all of the above we can visualize up to six dimensions, though one could argue that cramming so much information in a single chart does not make for a very effective visualization.
Another possibility is to make an animated chart, which is quite useful to depict changes through time:
What is the difference between correlation and causation? How can we infer the latter?
Correlation is a statistic that measures the strength and direction of the associations between two or more variables.
Causation, on the other hand, is a relationship that describes cause and effect.
“Correlation does not imply causation” is a famous quote that warns us about the dangers of the very common practice of looking at a strong correlation and assuming causality. A strong correlation may manifest without causation in the following cases:
- Lurking variable: An unobserved variable that affects both variables of interest, causing them to exhibit a strong correlation, even when there is no direct relationship between them.
- Confounding variable: A confounding variable is one that cannot be isolated from one or more of the variables of interest. Therefore we cannot explain if the result observed is caused by the variation of the variable of interest or of the confounding variable.
- Spurious correlation: Sometimes due to coincidence, variables can be correlated even though there is no reasonably logical relationship.
Causation is tricky to be inferred. The most usual solution is to set up a randomized experiment, where the variable that’s a candidate to be the cause is isolated and tested. Unfortunately, in many fields running such an experiment is impractical or not viable, so using logic and domain knowledge becomes crucial to formulating reasonable conclusions.
What are the differences between linear and logistic regression?
Linear regression is a statistical model that, given a set of input features, attempts to fit the best possible straight line (or hyperplane, in the general case) between the independent and the dependent variable. Since its output is continuous and its cost function measures the distance from the observed to the predicted values, it is an appropriate choice to solve regression problems (e.g. to predict sales numbers).
Logistic regression, on the other hand, outputs a probability, which by definition is a bounded value between zero and one, due to the sigmoid activation function. Therefore, it is most appropriate to solve classification problems (e.g. to predict whether a given transaction is fraudulent or not).
What is model extrapolation? What are its pitfalls?
Model extrapolation is defined as estimating beyond a previously observed data range to establish the relationships between variables.
The main issue with extrapolation is that it is, at best, an educated guess. Since it has no data to support it, it’s generally not possible to claim that the observed relationships still hold. A relationship that looks linear in a given range might actually be non-linear when outside of range.
What is data leakage in the context of data analysis? What problems may arise from it? Which strategies can be applied to avoid it?
Data leakage is the process of training a statistical model with information that would be actually unavailable when using the model to make predictions.
Data leakage makes the results during model training and validation much better than what is observed when the model is deployed, generating too optimistic estimates, possibly leading to an entirely invalid predictive model.
There is no single recipe to eliminate data leakage, but some practices are helpful to avoid them:
- Don’t use future data to make predictions of the past. Although obvious, it’s a very common mistake when validating models, especially when using cross-validation. When training on time-series data, always make sure to use an appropriate validation strategy.
- Prepare the data within cross-validation folds. Another common mistake is to make data preparations, like normalization or outlier removal on the whole dataset, prior to splitting the dataset to validate the model, which is a leak of information.
- Investigate IDs. It’s easy to dismiss IDs as randomly generated values, but sometimes they encode information about the target variable. If they are leaky, it’s best to remove them from any sort of model.
A retail chain owner has collected purchasing history data from his stores for 10 years. The data dictionary is shown below:
|Transaction ID||Unique transaction ID. Must appear just once in the dataset|
|Store ID||Unique store ID. May appear more than once in the dataset|
|Client ID||Unique client ID. May appear more than once in the dataset|
|Item ID||Unique item ID. May appear more than once in the dataset|
|Item Quantity||Number of items bought together|
|Item Price||Price of a single item|
|Date and Time of Purchase||Timestamp of the purchase|
|Payment Method||One of the following: cash, credit card, debit card, or voucher|
What kind of information or analyses could we leverage that might generate value to the business? Assume each transaction represents the purchase of a single type of item.
The answer is not closed and will depend on previous experience and domain expertise. The goal is not to get every single item right, but to showcase critical thinking and domain knowledge.
For this scenario, some of the possible paths to explore are:
- Determine which are the most popular items sold
- Explore how much is spent per transaction
- Find which clients spend the most
- Find the most recurrent clients
- Uncover seasonalities and trends
All of the above can be analyzed over the whole dataset or by region, by store or by time frame. The analyses could be further enriched with store, client and item information, if available.
The information uncovered could be used to:
- Better scale inventory sizing and the number of on-site employees using time-series forecasting.
- Perform direct marketing to the most profitable clients, which could be identified with the aid of clustering techniques.
- Enhance item positioning in a store by grouping items likely to be bought together, which could be identified through market basket analysis. Recommender systems could also be applied.
What are precision and recall? In which cases are they used?
Precision and recall are metrics that measure classification performance, each using its own criteria, given by the formulas below:
TP = True Positive FP = False Positive FN = False Negative
In other words, precision is the ratio of correctly classified positive cases over all cases predicted as positive, while recall is the ratio of correctly classified positive cases over all positive cases.
Precision is an appropriate measure when the cost of a false positive is high (e.g. email spam classification), while recall is appropriate when the cost of a false negative is high (e.g. fraud detection).
Both are also frequently used together in the form of the F1-score, which is defined as:
The F1-score balances both precision and recall, so it’s a good measure of classification performance for highly imbalanced datasets.