Customized Remote Work Solutions From the World’s Largest Fully Remote CompanyCustomized Remote Work SolutionsLearn More
Back-end
10 minute read

SQL Indexes Explained, Pt. 2

Mirko designs and develops massive, extreme-workload databases. He also trains software developers on databases and SQL.

In the first lesson of SQL Indexes Explained, we learned to use sorting to speed up data retrieval. While our query execution is faster after rows are sorted, the sorting involves reading every row at least once and moving them around. That makes the method slower and less efficient than simply reading the entire table sequentially.

The logical conclusion seems to be that we should maintain sorted copies—which we’ll officially call SQL indexes, prefixed with IX_—of a given table. The retrieval algorithms from the first article would then be applicable, and we would not need to sort the table before starting.

The Index as a Sorted Copy of the Table

Let’s take a look at the literal implementation of that idea, again using Google Sheets. Our Reservation spreadsheet becomes a collection of five sheets containing the same data. Each sheet is sorted according to different sets of columns.

The exercises here are intended to be less exacting than in the previous SQL index tutorial article—they can be done more by feel than by timer and row count. Some exercises will seem very similar, but this time, we’re exploring:

  1. How to more efficiently retrieve data when using separate indexes rather than sorted primary tables
  2. How to maintain order in each index and table when modifying data

The previous tutorial focused on reads, but in many common real-world data dynamics—our hotel reservations included—we have to take into account the effects of indexing on write performance, both for inserting new data and updating existing data.

Preliminary Exercise: Cancel a Reservation

To get a feel for SQL index performance using the sorted table strategy, your task is to delete a reservation for client 12, from the 22nd of August 2020, in hotel 4. Keep in mind that you have to remove a row from all copies of the table and maintain the correct sorting.

Done? It should be clear that the idea of keeping multiple sorted copies of the table is not as good as it seemed. If you still have any doubts, you can also try to re-insert the reservation you just deleted or change the date of an existing reservation.

While sorted copies of the table allow for faster retrieval, as we learned just now, data modification is a nightmare. Whenever we need to add, delete, or update an existing row, we will have to retrieve all copies of the table, find a row and/or the place where it should be added or moved, and finally move blocks of data.

SQL Indexes Using Row Addresses

This spreadsheet contains indexes that use a different approach. Index rows are still sorted according to specific criteria, but we don’t keep all the other information in the index row. Instead, we keep just the “row address,” the address of the row in the Reservations sheet—representing the table itself—in column H.

All RDBMS implementations use an operating system–level capability to quickly find the block on the disk using a physical address. Row addresses typically consist of a block address plus the row’s position within the block.

Let’s do a few exercises to learn how this index design works.

Exercise 1: All of a Client’s Reservations

As in the first article, you are going to simulate the execution of the following SQL query:

SELECT *
FROM Reservations
WHERE ClientID = 12;

Again, there are two reasonable approaches. The first one is simply reading all rows from the Reservations table and fetching only rows matching the criteria:

For each row from Reservations
  If Reservations.ClientID = 12 then write down Reservations.*

The second approach involves reading data from the IX_ClientID sheet, and for any item matching the criteria, finding a row in the Reservation table based on the rowAddress value:

Get first row from IX_ClientID where ClientID = 12

While IX_ClientID.ClientID = 12
  Fetch Reservations.* where rowAddress = IX_ClientID.rowAddress
  Write down Reservations.*
  Get next row from IX_ClientID

Here, the expression Get first row from is implemented by a loop similar to those seen in the previous article:

Repeat
  Fetch next row from IX_ClientID
Until ClientID >= 12

You can find a row with a given rowAddress by sliding down until you find a row, or using a filter on the rowAddress column.

If there were just a handful of reservations to be returned, the approach using the index would be better. However, with hundreds—or sometimes even just tens—of rows to be returned, merely using the Reservations table directly can be faster.

The volume of reads depends on the value of ClientID. For the largest value, you have to read the entire index, while for the lowest value, it is at the beginning of the index. The average value is half of the number of rows.

We’ll return to that part later and present an efficient solution. For now, let’s focus on the part after you find the first row matching our criteria.

Exercise 2: The Number of Reservations Starting on a Given Date

The task is to count the number of check-ins on the 16th of August 2020, using the new index design.

SELECT COUNT (*)
FROM Reservations
WHERE DateFrom = TO_DATE('2020-08-16','YYYY-MM-DD');

The approach of using the appropriate index for counting is superior to a table scan, no matter the number of rows involved. The reason is that we do not have to access the Reservations table at all—we have all information we need in the index itself:

Count := 0
Get first row from IX_DateFrom where DateFrom >= '2020-08-16'

While found and DateFrom < '2020-08-17'
  Count := Count + 1
  Get next row from IX_DateFrom

Write down Count

The algorithm is basically the same as one using sorted tables. However, the index row is much shorter than the table row, so our RDBMS would have to read fewer data blocks from the disk.

Exercise 3: Criminal Investigation (Guest List Given Hotel and Date Range)

Let’s prepare a list of guests that arrived at hotel 3 on the 13th and 14th of August 2020.

SELECT ClientID
FROM Reservations
WHERE DateFrom BETWEEN (
    TO_DATE('2020-08-13','YYYY-MM-DD') AND
    TO_DATE('2020-08-14','YYYY-MM-DD')
  ) AND
  HotelID = 3;

We can read all the rows from the Reservations table or use one of the available indexes. After doing the same exercise with a table sorted according to specific criteria, we found out that the index IX_HotelID_DateFrom is the most efficient.

Get first row from IX_HotelID_DateFrom where
  HotelID = 3 and
  DateFrom between '2020-08-13' and '2020-08-14'

While found and DateFrom < '2020-08-15' and IX_HotelID_DateFrom.HotelID = 3
  Fetch Reservations.* where rowAddress = IX_HotelID_DateFrom.rowAddress
  Write down Reservations.ClientID
  Get next row from IX_HotelID_DateFrom

Can We Design an Even More Efficient Index?

We access the table because of the ClientID value, the only information we need for the list of guests we’re reporting. If we include that value in the SQL index, we don’t have to access the table at all. Try to prepare a list reading only from just such an index, IX_HotelID_DateFrom_ClientID:

Get first row from IX_HotelID_DateFrom_ClientID where 
  HotelID = 3 and 
  DateFrom between '2020-08-13' and '2020-08-14'

While found and HotelID = 3 and DateFrom < '2020-08-15'
  Write down ClientID
  Get next row from IX_HotelID_DateFrom_ClientID

When the index contains all the necessary information for query execution, we say that the index covers the query.

Exercise 4: List of Guests’ Names Instead of IDs

A list of guest IDs would be useless for a police officer investigating a crime. We need to provide names:

SELECT c.ClientName
FROM Reservations r
JOIN Clients c ON r.ClientID = c.ClientID
WHERE r.DateFrom BETWEEN (
    TO_DATE('2020-08-13', 'YYYY-MM-DD') AND
    TO_DATE('2020-08-14', 'YYYY-MM-DD')
  ) AND
  r.HotelID = 3;

To provide a list, besides data from the Reservations table, we also need a Clients table containing guest information, which can be found in this Google sheet.

This exercise is similar to the previous one, and so is the approach.

Get first row from IX_HotelID_DateFrom_ClientID where 
  HotelID = 3 and 
  DateFrom between '2020-08-13' and '2020-08-14'

While found and HotelID = 3 and DateFrom < '2020-08-15' 
  Fetch Clients.* where ClientID = IX_HotelID_DateFrom_ClientID.ClientID
  Write down Clients.ClientName
  Get next row from IX_HotelID_DateFrom_ClientID

The expression Fetch Clients.* where ClientID = IX_HotelID_DateFrom_ClientID.ClientID can be implemented by table scan or using our index. If we use a table scan, for each ClientID from the While loop, we would have to read on average the half of rows from the Clients table:

-- Get row from Clients using table scan
Repeat
  Fetch next row from Clients
Until ClientID = IX_HotelID_DateFrom_ClientID.ClientID or not found
If found
  Write down ClientName

The index implementation we’ve considered so far—let’s call it a “flat” index implementation—would not be too helpful. We would have to read the same number of rows (although smaller rows) from the index, then jump to the row in Clients using RowAddress:

-- Get row from Clients using flat index
Repeat
  Fetch next row from Clients_PK_Flat
Until ClientID >= IX_HotelID_DateFrom_ClientID.ClientID
If found
  Fetch Clients.* where rowAddress = Clients_PK_Flat.rowAddress
  Write down ClientName

Note: Here, PK refers to “primary key,” a term we’ll explore later in the series.

Is there a way to accomplish this without having to read so many rows? Yes—this is exactly what B-tree indexes are for.

Balanced Tree (B-tree) Indexes

Let’s divide the rows from Clients_PK_Flat into four-row blocks and create a list containing the value of the last ClientID from the block and address of the block start (column IndexRowAddress). The resulting database index data structure—you can find in the Clients_PK_2Levels sheet. Try out how the new structure helps you find a client having a ClientID of 28. The algorithm should look like this:

Fetch Level2.*
Loop
  Leaf_address := Level3Address
  Exit when ClientID >= 28
  Fetch next row from Level2

Fetch Level3.* where Level3Address = Leaf_address -- 3-21
Loop
 Client_address := RowAddress
 Exit when ClientID >= 28
 Fetch next row from Level 3

Fetch Clients.* where rowAddress = Client_address -- 42
Write down Clients.*

You probably figured out that we can add another level. Level 1 consists of four rows, as you can see in IX_Clients_PK tab. To find the name of the guest with a ClientID of 28, you have to read three blocks (nodes) of data—one per level—from the primary key structure and finally jump to the Clients row with address 42.

The structure of this SQL index is called a balanced tree. The tree is balanced when the path from the root node to each leaf-level node has the same length, its so-called B-tree depth. In our case, the depth is three.

B-tree example based on the IX_Clients_PK tab in the spreadsheet, showing the lookup path of the above algorithm.

From now on, we’ll consider each index to have B-tree structure, even though our spreadsheets contain only leaf-level entries. The most important facts to know about B-tree are:

  • The B-tree index structure is the most commonly used index by all major RDBMSes on the market.
  • All levels of a balanced tree are ordered by key column values.
  • Data are read from the disk in blocks.
  • One B-tree node contains one or more blocks.
  • The most important factor affecting query performance is the number of blocks read from the disk.
  • The number of items in each new B-tree level, starting by root, ending on leaf level, increases exponentially.

Exercise 5: Criminal Investigation, Part II

Now, the police inspector is looking for a list of corresponding guests’ names, dates of arrival, and hotel names from all of the hotels in city A.

SELECT
  h.HotelName,
  r.DateFrom as CheckInDate,
  c.ClientName
FROM Reservations r
JOIN Clients c ON r.ClientID = c.ClientID
JOIN Hotels h ON r.HotelID = h.HotelID
WHERE r.DateFrom BETWEEN (
    TO_DATE('2020-08-13', 'YYYY-MM-DD') AND
    TO_DATE('2020-08-14', 'YYYY-MM-DD')
  ) AND
  h.City = 'A';

Approach 1

If we use the index IX_DateFrom_HotelID_ClientID, then for each row from the dates range, we would have to access the Hotels table and check if the hotel is from city A. If it is, we would have to access the Clients table, too, to read the client’s name.

For each row from IX_DateFrom_HotelID_ClientID where
    DateFrom between '2020-08-13' and '2020-08-14'
  For each row from Hotels where
      HotelID = IX_DateFrom_HotelID_ClientID.HotelID
    If Hotels.City = 'A' then
      Fetch Clients.* where ClientID = IX_HotelID_DateFrom_ClientID.ClientID
      Write down
        Hotels.HotelName,
        IX_HotelID_DateFrom_ClientID.DateFrom,
        Clients.ClientName

Approach 2

Using IX_HotelID_DateFrom_ClientID gives us a more efficient execution plan.

For each row from Hotels where City = 'A'
  For each row from IX_HotelID_DateFrom_ClientID where
      HotelID = Hotels.HotelID and
      DateFrom between '2020-08-13' and '2020-08-14'
    Fetch Clients.* where ClientID = IX_HotelID_DateFrom_ClientID.ClientID
    Write down
      Hotels.HotelName,
      IX_HotelID_DateFrom_ClientID.DateFrom,
      Clients.ClientName

From the Hotels table, we find all the hotels from city A. Knowing the ID of these hotels, we can read subsequent items from the IX_HotelID_DateFrom_ClientID index. This way, after finding the first row in the B-tree leaf level for each hotel and date, we don’t read the reservations from hotels outside city A.

Leveraging the short Hotels table in conjunction with the IX_HotelID_DateFrom_ClientID index. The table is shown on the left, with two hotel rows highlighted, corresponding to those that are in city A. Each of those hotels is then given a quick lookup via the B-tree process, resulting in them pointing directly to blocks within the index on the right, where all the sought-after data is sequential.

Here, we can see that when we have a database index that’s appropriate for our goals, an additional join can actually make a query faster.

The B-tree structure and how it is updated whenever a row is inserted, updated, or deleted will be addressed in more detail when I explain the motivation for partitioning and its impact. The point is that we can consider this operation quick whenever we use an index.

The Index Query in SQL: Details Make All the Difference

When it comes to indexes and databases, working at the SQL language level hides implementation details to some extent. These exercises are meant to help you get a feel for how execution plans work when using different SQL indexes. After reading the article, I hope you will be able to guess the best execution plan given available indexes and design indexes that would make a query as fast and efficient as possible.

In the next part of this series, we will use and expand newly acquired skills to investigate and understand the most common best practices and anti-patterns in the use of indexes in SQL. I have a list of good and best practices I want to address in the next part, but to make the next article more relevant to your needs and experience, please feel free to post your own questions you would like to see answered.

In a future part of SQL Indexes Explained, we will also learn about table and index partitioning, the right and wrong motivations for using it, and its impact on query performance and database maintenance.

Understanding the basics

What is a table in SQL?

A table in SQL is a database object that contains data organized in rows and columns. It is similar to a spreadsheet in that regard. Each row represents a record consisting of column values. The table definition includes the name and data type of each column and constraints applicable to table data.

What is an index in an SQL table?

SQL tables can have indexes associated with them. The role of indexes is to support fast data retrieval. In addition to that, indexes are also used to enforce unique key constraints and to speed up foreign key constraint evaluation.