10 Essential Data Modeling Interview Questions *

Please discuss the three types of data models in an entity-relationship (ER) model.

The Conceptual Model

This model describes the high-level structure of the business data in the initial stage of design. It defines entity names and entity relationships but does not address technical concerns such as computer systems or database management systems.

The Logical Model

This model is used in the implementation of the database management system, and it focuses on the logical structure of the data. It is a staging model that can be both user- and system-oriented. In addition to entity names and entity relationships, it defines attributes, primary keys, and foreign keys for each entity.

The Physical Data Model

This is the final stage of data modeling. It not only targets a specific database management system, but also specifies the operating system, storage strategy, data security, and hardware. This model describes schema details: columns, data types, constraints, triggers, indexes, replicas, and backup strategy. It is the actual design blueprint used by DBAs and database developers to implement the database.

What are the possible cardinalities of a relationship in the entity-relationship (ER) model, and what do they mean?

There are three relationship cardinalities:

One-to-one (1:1)

Here, one occurrence in an entity can relate to one occurrence or zero occurrences in another entity.

For example: A “tag” table with a “tag number” attribute might have a 1:1 relationship to a “vehicle” table with a unique vehicle identification number (VIN) as its primary key. That is, a given vehicle may or may not have a tag, but it can’t have more than one.

Further to this example, there might be motor vehicle tags that are in inventory, but that have not been assigned a vehicle yet.

One-to-many (1:M)

In this case, one occurrence in an entity relates (or can relate) to many occurrences in another entity.

For example: A professor can teach many classes. Therefore, in the “professor” table, each professor record can be associated with many class records.

One professor can be associated with many classes.

The opposite relationship is many-to-one (M:1)—it’s simply a matter of semantics, depending on which table you are mentioning first—e.g., the “class” table would have an M:1 relationship to the “professor” table.

Many-to-many (M:N)

With this type of relationship, many occurrences in an entity can relate to many occurrences in another entity.

E.g., a student can take many classes and a class can contain many students. To implement this relationship, you will need an intermediate table linking students to classes. We can call this table the “student-class” table.

So in total you will need three tables: “student,” “class,” and “student-class.” From “student” to “student-class” is 1:M. From “student-class” to “class” is M:1. Finally, “student” to “class” is M:N using the intermediate table.

Students can be associated with many classes and vice-versa; therefore there's an intermediate student-class table to correctly model the M:N relationship between the two entities.
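The junction-table pattern above can be sketched in code. This is a minimal illustration using SQLite via Python; the table and column names follow the article's example, and the sample data is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE class   (class_id   INTEGER PRIMARY KEY, title TEXT);
-- The intermediate ("junction") table: 1:M from each side yields M:N overall.
CREATE TABLE student_class (
    student_id INTEGER REFERENCES student(student_id),
    class_id   INTEGER REFERENCES class(class_id),
    PRIMARY KEY (student_id, class_id)
);
""")
conn.executemany("INSERT INTO student VALUES (?, ?)",
                 [(1, "Mary"), (2, "Jerry")])
conn.executemany("INSERT INTO class VALUES (?, ?)",
                 [(10, "Business Principles"), (20, "Intro to Marketing")])
conn.executemany("INSERT INTO student_class VALUES (?, ?)",
                 [(1, 10), (1, 20), (2, 10)])  # both students share class 10

# Resolve the M:N relationship by joining through the junction table.
rows = conn.execute("""
    SELECT s.name, c.title
    FROM student s
    JOIN student_class sc ON sc.student_id = s.student_id
    JOIN class c          ON c.class_id   = sc.class_id
    ORDER BY s.name, c.title
""").fetchall()
print(rows)
# [('Jerry', 'Business Principles'), ('Mary', 'Business Principles'),
#  ('Mary', 'Intro to Marketing')]
```

Note that the junction table's composite primary key also prevents the same student from being enrolled in the same class twice.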

What is normalization? What is denormalization?

Normalization is the process of reducing data redundancy while maintaining data integrity. This is performed by creating relationships among tables through primary and foreign keys. Normalization procedures include 1NF (first normal form), 2NF, 3NF, etc.

1NF: Eliminate repeating groups of attributes by moving them into separate entities, and implement a primary key for each entity. (By definition, each cell is then identified by the combination of its unique primary key and a column identifier.)

For example, in this model of university classes, each record entry is uniquely identified by the primary key [Professor_Id]+[Course_Id].

A 1NF model with columns for professor ID, name, sex, type ID/name, and department; course ID and title; term; and campus.

2NF: This includes 1NF, but all non-key attributes (i.e., those other than the primary key columns) must depend on the primary key. If not, separate such attributes into another entity. You are solving the question of whether relationships should be one-to-many or many-to-many at this stage. (See example below.)

To extend our 1NF example to 2NF, we model “verb relationships” but not “adjective relationships.” The course columns depend on only part of the composite primary key [Professor_Id]+[Course_Id]+[Term]. Therefore, we create the entities “Class” and “Professor-Class” to model the many-to-many relationship.

The same model extended to 2NF, using separate "Professor-Class" and "Class" tables.

3NF: This includes 2NF; in addition, no non-key field (column) may have a transitive functional dependency (an indirect relationship between columns) on the primary key. In the above 1NF example, if we know [Course]+[Term] we can determine [Professor_Name], and from [Professor_Name] we can determine [Dept_Name]. Therefore, [Dept_Name] depends transitively on [Course]+[Term].

At this stage, we are creating reference tables with foreign keys—similar to what we did with 2NF, but for “adjective relationships” as well.

Continuing with our example, [Type], [Dept_Id], [Dept_Name], and [Campus] have transitive functional dependencies: they do not depend directly on the primary key [Professor_Id]. So we create two reference tables:

Extending the model to 3NF by adding Type and Department tables.

In most cases, the 3NF stage is sufficient for normalization.

Denormalization, on the other hand, is an optimization technique applied to a normalized database to increase query performance for data analytics. It is a process of reverting from 3NF (or a higher form) toward 1NF. A denormalized model is also sometimes called a dimensional model. It decreases the running time of selected queries by eliminating complex joins, pre-aggregating detailed data into summary-level data, and storing the result in a separate denormalized 1NF structure using in-memory and column-store technology (cf. the related question on this page). That is, the original 3NF model remains in place as the source of the data.

You’ll find a variation of this strategy within any data warehouse dimensional model design strategy.
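The denormalization idea above can be sketched in a few lines. This is a toy illustration using SQLite via Python: the normalized source tables stay intact, and a flat, join-free copy is materialized for reporting. All table names and data here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
CREATE TABLE professor  (prof_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER);
INSERT INTO department VALUES (1, 'Business'), (2, 'Math');
INSERT INTO professor  VALUES (100, 'Vess', 1), (101, 'Chen', 2);
-- Materialize a flat, join-free copy for reporting; the normalized source stays.
CREATE TABLE professor_flat AS
  SELECT p.prof_id, p.name, d.dept_name
  FROM professor p JOIN department d ON d.dept_id = p.dept_id;
""")
flat = conn.execute("SELECT * FROM professor_flat ORDER BY prof_id").fetchall()
print(flat)  # [(100, 'Vess', 'Business'), (101, 'Chen', 'Math')]
```

Queries against `professor_flat` need no join, at the cost of repeating `dept_name` on every row — the classic denormalization trade-off.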

What is the star schema? When is it used?

In a star schema, one or more fact tables reference any number of dimension tables. Usually, the fact tables are derived from third normal form (3NF) tables and hold foreign keys and aggregates (sometimes called “measures”).

It’s used when developing data warehouses and dimensional data marts.

The star schema below shows two fact tables, “Fact_Professor” and “Fact_Professor_Class_Detail,” and four dimension tables: “Dim_Type,” “Dim_Department,” “Dim_Date,” and “Dim_Class.” The foreign keys are [Type_Id], [Dept_Id], and [Date_Key]. Measures (not pictured; these are calculated sums, minimums, maximums, averages, counts, and distinct counts) would include the sum of [Credits], the distinct count of [Professor], the distinct count of [Class_Id], and [Hired_Days], which can be calculated as the difference between [Hired_Date] and today().

Star schema example, with tables Dim_Type, Dim_Department, Dim_Class, Fact_professor, Dim_Date, and Fact_Professor_Class_Detail.
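A typical query against a star schema like this joins the fact table to a dimension and computes measures with aggregates. This is a hedged sketch using SQLite via Python; the table names echo the figure, but the columns and data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Dim_Department (Dept_Id INTEGER PRIMARY KEY, Dept_Name TEXT);
CREATE TABLE Fact_Professor (Professor_Id INTEGER, Dept_Id INTEGER, Credits INTEGER);
INSERT INTO Dim_Department VALUES (1, 'Business'), (2, 'Math');
INSERT INTO Fact_Professor VALUES (100, 1, 9), (101, 1, 6), (102, 2, 12);
""")
# Measures: sum of credits and distinct professor count, grouped by dimension.
rows = conn.execute("""
    SELECT d.Dept_Name,
           SUM(f.Credits)                 AS total_credits,
           COUNT(DISTINCT f.Professor_Id) AS professors
    FROM Fact_Professor f
    JOIN Dim_Department d ON d.Dept_Id = f.Dept_Id
    GROUP BY d.Dept_Name
    ORDER BY d.Dept_Name
""").fetchall()
print(rows)  # [('Business', 15, 2), ('Math', 12, 1)]
```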

Please discuss in-memory analytics and column store databases in the context of data warehouse architecture.

In-memory analytics is a method of caching the entire database in RAM. In the past, RAM has been expensive, so this method was often cost prohibitive. The solution at the time was to use data indexing and store pre-aggregated data.

The slowest part of that process was retrieving data from disk, because limited system memory was not capable of holding an entire database. Today, the price of RAM is more affordable, so the industry now has a much higher adoption rate when it comes to in-memory analytics.

A column store is a modern way of storing and retrieving data in a database. In the past, data was row-oriented and stored horizontally. In order to retrieve the data, it was read horizontally, like reading a book from left to right. So the entire data set would have to be read from the hard drive into RAM before it could be used.

Nowadays, column store (or columnar store) databases use column-oriented storage, which stores the data vertically. Thus there are minimal hard drive reads when retrieving data: the database only reads the columns that a query actually references.

For example, say you have a “Customer Sales” dataset that has 20 columns of all the customers’ personal and sales information, with a million rows. You only need the [Customer Name], [Address], and [Purchase Date] columns. Using a column store, the data will only be read from three columns with the least amount of hard disk reads, drastically reducing read times.

Also, column storage improves the likelihood of compression, because the data values within a single field are highly similar to one another—something that compression algorithms depend on.

Please discuss the issues/disadvantages with this dimensional model:

Fact_Shipment table, with columns Shipment_No, Date, Product, FCLLCL, Port_To_Port, FRREV, FRCOST, FUEL, Custom, Insurance, Security, Special_Handling, and Other_Charges.

Instead of expanding charge line items horizontally, the model should be split into a Fact_Header and a Fact_Detail table. The charge line item column names should be pivoted into one “Charge_Code” column as a dimension, in keeping with in-memory/column-storage best practices.

Separate tables for Fact_Header (with Shipment_No, Date, Product, FCLLCL, and Port_to_Port columns) and Fact_Detail (with Shipment_No, Charge_Code, and Amount columns).
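The pivot described above can be sketched as a small transformation. The column names follow the figure; the amounts and the helper function are hypothetical.

```python
# Charge columns from the wide Fact_Shipment model in the figure.
CHARGE_COLUMNS = ["FRREV", "FRCOST", "FUEL", "Custom", "Insurance",
                  "Security", "Special_Handling", "Other_Charges"]

def to_fact_detail(wide_row):
    """Pivot one wide fact row into (Shipment_No, Charge_Code, Amount) rows."""
    return [
        {"Shipment_No": wide_row["Shipment_No"],
         "Charge_Code": code,
         "Amount": wide_row[code]}
        for code in CHARGE_COLUMNS
        if wide_row.get(code)  # skip missing or zero charges
    ]

wide = {"Shipment_No": "S-1001", "FRREV": 1200.0, "FUEL": 150.0, "Custom": 0}
print(to_fact_detail(wide))
# [{'Shipment_No': 'S-1001', 'Charge_Code': 'FRREV', 'Amount': 1200.0},
#  {'Shipment_No': 'S-1001', 'Charge_Code': 'FUEL', 'Amount': 150.0}]
```

A side benefit of the long format: adding a new charge type means adding a row value, not altering the table schema.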

Please discuss the best strategy to model a set of entities in a NoSQL database.

Beginning with a small-scale NoSQL database, the model can be built using a dynamic schema. That is, it will allow schema changes and adjustments in real time, rather than requiring a predefined schema during the development of an application. However, to ensure data consistency and resolve relationships between data entities, use an entity-relationship (ER) model initially to define entities, relations, attributes, primary keys, and foreign keys. The ER model—since it’s normalized by definition—will ensure data integrity. The model can then be converted/denormalized to a parent-child relationship-embedded entity model. For example:

Start with a 3NF model with “order_id” as the primary key in the entity “Order” and with foreign keys in the entity “Detail”:

Order and Detail tables for a typical 3NF model.

Denormalized, the above 3NF Model then looks like:

The 3NF model denormalized into a single table with order_id, order_date, customer_id, detail_id, item_id, qty, unit_price, and sales_total columns.

Finally, you can convert the denormalized form to a NoSQL parent-child relationship-embedded entity model:

{
    "order_id": "12345",
    "order_date": "1/21/2019",
    "customer_id": "123456",
    "order_details": [
        {
            "item_id": "2345",
            "qty": 2,
            "unit_price": 157.75,
            "sales_price": 315.5
        },
        {
            "item_id": "2110",
            "qty": 1,
            "unit_price": 75.25,
            "sales_price": 75.25
        },
        {
            "item_id": "1760",
            "qty": 3,
            "unit_price": 55,
            "sales_price": 165
        }
    ]
}
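The denormalized-to-document conversion above can be sketched as a grouping step: flat rows are grouped by order ID, and the detail columns become an embedded array. The function name and row layout are hypothetical; the field names mirror the JSON example.

```python
def to_documents(flat_rows):
    """Group denormalized rows into parent documents with embedded details."""
    docs = {}
    for row in flat_rows:
        doc = docs.setdefault(row["order_id"], {
            "order_id": row["order_id"],
            "order_date": row["order_date"],
            "customer_id": row["customer_id"],
            "order_details": [],
        })
        doc["order_details"].append({
            "item_id": row["item_id"],
            "qty": row["qty"],
            "unit_price": row["unit_price"],
            "sales_price": row["qty"] * row["unit_price"],  # derived, as in the example
        })
    return list(docs.values())

rows = [
    {"order_id": "12345", "order_date": "1/21/2019", "customer_id": "123456",
     "item_id": "2345", "qty": 2, "unit_price": 157.75},
    {"order_id": "12345", "order_date": "1/21/2019", "customer_id": "123456",
     "item_id": "2110", "qty": 1, "unit_price": 75.25},
]
doc = to_documents(rows)[0]
print(doc["order_details"][0]["sales_price"])  # 315.5
```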

What are the advantages of NoSQL databases over relational databases?

Flexibility

A relational database only allows structured data, whereas a NoSQL database offers the flexibility to store structured, semi-structured, or unstructured data.

Dynamic Schema

In a relational data model, the schema needs to be predefined. In the real world, applications evolve and changes are needed, and this process is very costly and time-consuming. In recent decades, data objects have become far more complex than the simple text and numbers of earlier systems. For example, geospatial data (polygons made out of GPS coordinates) can be difficult to query efficiently with some RDBMSes.

A dynamic schema offers a solution to add complex data objects easily because you can modify or add new elements as your application grows without a predefined schema.

Large Blobs/Rich Media

In many contexts nowadays, we need to have a more flexible way to store data types like voice files, video files, images, etc. Where once it was considered a best practice to simply store links to a filesystem for data like this, with NoSQL they can be stored directly.

The advantage—aside from avoiding brittle links—is that the database breaks the objects into pieces and can distribute them over a server pool for the ultimate performance gain.

Sharding

Sharding is the process of dividing up and distributing data into smaller databases for faster access to the data. The data is spread across a number of servers without the application being aware. The data and query load are evenly distributed across the servers. That way, if one goes down, it can be replaced effortlessly and with no application disruption.

Compared to a single-server setup, this process reduces big data tasks into small pieces so that they can be processed at the same time in a distributed server farm.

Replication

NoSQL databases are quite sophisticated when it comes to being self-resolving and self-balancing. They offer failover and recovery options. They do this by distributing the database over many geographical/regional data centers. If one region goes down, it automatically relies on the others. This architecture allows for uninterrupted (“high”) availability, because data is replicated across different data centers.

Scaling

Starting with a distributed set of server nodes (all server nodes are peers with no shared resources and can work independently), as the database size increases, additional server nodes can be added on the fly across the data center without interruption or limitation.

In current RDBMS architectures, you can generally only scale up by adding memory and hard drives within the applicable hardware limitations, although native sharding support is in the works in some cases.

Please discuss the different types of NoSQL databases.

Document Databases

A document database is mostly used for semi-structured data storage in a self-describing document format like JSON. Document structures—defined by key-value hierarchies—can be identical or can differ from document to document. Documents with the same attributes are grouped into “collections,” which are similar to tables in a relational database.

{
    "id": "IC001",
    "Category": "Ice Cream",
    "Product Name": "Celebration Cookie Cake",
    "Size": 2.5,
    "Details": {
        "shape": [
            { "id": "001", "type": "Round" },
            { "id": "002", "type": "Sheet" },
            { "id": "003", "type": "Roll" },
            { "id": "004", "type": "Dome" }
        ],	
        "cream flavor": [
            { "id": "1001", "type": "Mint" },
            { "id": "1002", "type": "Oreo" },
            { "id": "1045", "type": "Chocolate" },
            { "id": "1021", "type": "Cheese Cake" },
            { "id": "1022", "type": "Vanilla" },
            { "id": "1033", "type": "Cookie" },
            { "id": "1014", "type": "Butter Pecan" }
        ]
    },
    "Image": "X562FCDQX73DS;L9EBWIDKFDKNLUWNJSNA:IU*DWJHIJDSJDKALijnkw8!WEJD#KLJSATEGD..."
}

Column Databases

In a column database, data is stored column-oriented instead of row-oriented. This type of database is optimized for I/O during analytical database queries. Some examples of NoSQL column databases include Google’s Big Table, Apache Cassandra, and HBase.

Key-value Databases

As with document databases, a key-value data structure is like a dictionary or a map, but in this case the value cannot be nested. It uses a unique key—which can be synthetic or generated by a hash algorithm—to point to its value. The value can be of different types: string, JSON, binary large object (BLOB), etc.

{
    name: "Jimmy Johnson",
    email: "[email protected]",
    street: "13278 Southside Blvd",
    city: "Orlando",
    state:  "FL",
    country: "USA"
}

Cache Systems

Cache systems store data as key-value pairs: frequently accessed data is kept in memory for fast retrieval and access. Redis and clustered Memcached are examples of this. (While Memcached can be used by RDBMSes, it only offers memory on one server, whereas NoSQL cache clusters pool memory from all servers in the cluster.)

Graph Databases

In a graph database, objects are represented as nodes and relationships between objects are represented as edges. Both edges and nodes have defined attributes/properties which can be stored as key-value pairs. Examples include Neo4J, InfiniteGraph, OrientDB, Virtuoso, and Allegro.

Note: For more information please see The Definitive Guide to NoSQL Databases on the Toptal Engineering Blog.

How do you handle one-to-many and many-to-many relationships in NoSQL data modeling?

One-to-many

In a document database, a one-to-many relationship is modeled by embedding the “many” side as child objects within the parent document. For example:

{
    "order_id": "12345",
    "order_date": "1/21/2019",
    "customer_id": "123456",
    "order_details": [
        {
            "item_id": "2345",
            "qty": 2,
            "unit_price": 157.75,
            "sales_price": 315.5
        },
        {
            "item_id": "2110",
            "qty": 1,
            "unit_price": 75.25,
            "sales_price": 75.25
        },
        {
            "item_id": "1760",
            "qty": 3,
            "unit_price": 55,
            "sales_price": 165
        }
    ]
}

Here, each of the objects under order_details is a child document of a different document data type.

Many-to-many

Many-to-many relationships are modeled using two document collections whose embedded objects reference related documents in a third collection by identifier/key. The M:N relationship is resolved at the application level when retrieving data. The following example uses the identifier class_id to resolve the M:N relationship between the professor and student collections.

The professor collection looks like this:

{
    "Professor_id": "1021",
    "Name": "John P Vess",
    "Type": "Adjunct",
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10203",
                "B10112"
            ]
        }
    ]
},
...

And the student collection looks like this:

{
    "Student_id": "201727542",
    "Name": "Mary Carson",
    "Type": "Under Graduate",
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10107",
                "B10119"
            ]
        }
    ]
},
{
    "Student_id": "201821230",
    "Name": "Jerry Smith",
    "Type": "Under Graduate",
    "class_details": [
        {
            "class_id": [
                "B10101",
                "B10203",
                "M10332",
                "C12001"
            ]
        }
    ]
},
...

Finally, the class collection, referred to by the other two collections, looks like this:

{
    "Class_id": "B10101",
    "Class": "Business Principles"
},
{
    "Class_id": "B10203",
    "Class": "Introduction to Marketing"
},
...
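The application-level resolution described above can be sketched as a set intersection over the referenced class IDs. This is a hedged illustration: the collections are flattened to plain lists of dicts with a simplified `class_ids` field (instead of the nested `class_details` structure), and the lookup function is hypothetical.

```python
# In-memory stand-ins for the professor and student collections shown above.
professors = [
    {"Professor_id": "1021", "Name": "John P Vess", "Type": "Adjunct",
     "class_ids": ["B10101", "B10203", "B10112"]},
]
students = [
    {"Student_id": "201727542", "Name": "Mary Carson",
     "class_ids": ["B10101", "B10107", "B10119"]},
    {"Student_id": "201821230", "Name": "Jerry Smith",
     "class_ids": ["B10101", "B10203", "M10332", "C12001"]},
]

def students_of(professor):
    """Resolve the M:N link: students who share at least one class with the professor."""
    taught = set(professor["class_ids"])
    return [s["Name"] for s in students if taught & set(s["class_ids"])]

print(students_of(professors[0]))  # ['Mary Carson', 'Jerry Smith']
```

In a real document database this intersection would typically be performed by querying the student collection for documents whose class IDs appear in the professor's list, rather than by scanning in application memory.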
* There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.