
Don’t Panic: How Toptal Engineers Approach the Debugging Process

Test failures can cause panic as teams rush to unblock an app’s release. At Toptal, however, we view challenges as opportunities to understand and improve projects. Get an inside look at our engineering process with an Apollo Client debugging case study.


Mateusz joined Toptal Engineering in 2016. As a front-end architect, he works closely with talent-facing teams to translate business requirements into robust software services. In his free time, Mateusz is a passionate hiker who stops only to eat, drink, and take hundreds of photos, none of them good.

Toptal’s Talent Portal (the web application where talent access the network and jobs) is one of the critical software services that delivers business value to the company. Working as an engineer on this surface is a high-stakes and fast-paced proposition. To provide the best user and developer experiences, we replaced the old portal with an updated platform and a new codebase that automatically deploys to production a few times a day.

However, with great power comes great responsibility—and code maintenance. New code yields new bugs, which engineers must address rapidly in order to keep pushing updates to production. This is what happened when I recently encountered a Cypress test failure blocking our entire delivery pipeline and interrupting many engineers’ work. Though the failure was difficult to address, my team supported my suggestion that we turn this challenge into an opportunity; the bug allowed me to understand our testing framework in-depth, improve our project’s infrastructure, and better prepare other teams for similar issues.

An Inside Look at Toptal Engineering Projects: The Portal Site

First, let’s understand how engineering projects work at Toptal—and which technologies, partners, and testing standards are typically involved—using our updated Talent Portal site as our example.

[Flowchart: Cross-functional project process: a new project is code-complete → engineers and QA define and code test cases → the updated project is released to beta testers → beta testers report bugs → QA and engineers triage and prioritize bugs → engineers fix bugs and update code, looping back until the project is ready for launch. Engineering-specific debugging process, branching from test cases: test failures are triaged → a quick fix unblocks engineering work while Cypress error messages and video captures are debugged → a technical deep dive diagnoses the root cause → possible solutions are investigated and trade-offs weighed → a long-term solution is implemented to avoid repeated failures, feeding back into bug fixes and code updates.]
An overview of an engineering project timeline, including cross-functional and engineering-specific processes.

What Technologies Do We Use?

Toptal engineers frequently adapt to new technologies, and the Talent Portal is no exception. While our previous development environment used jQuery and Backbone.js for data management, our updated site leverages GraphQL and Apollo Client for their excellent developer experience. Many other teams across Toptal also use this pairing for data management. GraphQL, a standard query language, is the technology that allows us to access the information necessary to render user interfaces (such as a logged-in user’s name and job engagements displayed in the Talent Portal). GraphQL is popular among developers due to its speed, type-safety features, and flexible data sources. Apollo Client further enhances GraphQL’s data management experience with intelligent data caching.

In addition to updating our data development environment, the team also shifted gears for UI development, moving from Ruby and JavaScript to React, which keeps our repository clean and improves code maintainability. React also makes our codebase more appealing to engineers, as there is a growing base of enthusiastic React developers.

Who Do We Partner With?

Because our platforms directly impact the livelihoods of our clients and freelance talent, high product quality is non-negotiable. To guarantee it, engineers at Toptal partner with dedicated quality assurance (QA) engineers (integrated with the product team) and with beta testers. We work quickly but cautiously: Our Talent Portal, for example, began as an experiment before launching officially. We shared the new site with a closed group of beta testers and improved the portal based on their feedback.

Given our fast-paced cycle for product updates, it is crucial to release stable iterations of the portal so that we use our limited time with beta testers efficiently. We use automated tests to keep application branches stable, and our QA engineers help establish comprehensive test cases and triage bugs or test failures.

What Are Our Testing Standards?

Toptal engineers take testing seriously—it keeps our products reliable! The Talent Portal and other teams conduct automated end-to-end tests on our projects with the Cypress testing framework. Because Cypress runs in a web browser, it covers certain cases that other environments (like jsdom) cannot test easily.

We run all test cases before finalizing product changes to keep our main code branch stable. When a test that previously passed starts failing on the main branch, that's usually an indication that the test is flaky and produces unreliable results: The same test case can pass or fail depending on a nonobvious factor that might be external to the code. Flakiness reduces our confidence in the tests and must be fixed. When a test fails, there are two methods of triage:

  • For isolated tests with a clear owner, an automatic notification will alert the correct team according to code ownership rules.
  • When ownership is unclear because one user interaction spans multiple teams (as often is the case for Talent Portal Cypress tests), a QA or project engineer will help identify the responsible team.

Before launch, engineers must collaborate to fix incoming bugs and iterate on test cases to address past bugs until the project reaches stability.

How Toptal Engineers Debug: Diagnosing the Root Cause

You may be wondering: With so many players and tools involved in a given project, how do Toptal engineers address bugs effectively? And what happens after triage? The following case study, which details a Cypress failure I encountered, serves as a good illustration of our debugging process.

A Test Failure Caused by Apollo Client Overfetching

Let’s flash back to the day of the bug. Through the triage system, our team receives a notification that one of our Cypress tests is failing on the main branch. This is a burning issue, and we need to act fast; a broken test blocks our delivery pipeline and causes failure alerts on new code, disrupting the work of engineers.

From our past experiences, we know that we can implement a workaround for the problem: If we wait a bit longer, or disable specific test assertions temporarily, the test will pass. While this is a beneficial tactical move to unblock engineers’ work, we can’t stop there. We must identify the root cause. Understanding the behavior behind the failure allows us to improve our own project and prepare other projects for the same issues.
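
To illustrate, a tactical workaround of that kind might look like the following hypothetical Cypress snippet; the test name, selector, and timeout are invented for this example rather than taken from our actual suite.

// Hypothetical spec, for illustration only.
it('updates profile skills', () => {
  // Workaround 1: raise the command timeout so the page has more time to settle.
  cy.contains('button', 'Save', { timeout: 10000 }).should('be.visible').click();

  // Workaround 2: temporarily disable a brittle assertion.
  // cy.get('[data-test="skills-list"] li').should('have.length', 3);
});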

In our initial investigation of the test failure, we notice an error message declaring that the element selected for interaction (in this case, a button) is not reachable. Our Cypress setup captures a video recording of the test execution, so we can inspect what is happening in the web browser. Surprisingly, the video shows that the element is visible on the page. We’re stumped temporarily before realizing that the failure is occurring because there are two copies of the button element in the page context:

  • The instance referenced by Cypress that is currently detached from the page
  • A new copy of the button that is visible on the page

So, the page is re-rendering right after Cypress captures a reference to the element. When the test attempts to click on the button, that button instance is already detached from the page. With this knowledge, we investigate further, examining our operations log. We learn that the additional render is caused by an extra Apollo Client HTTP request (GetProfileSkills)—one that is unnecessary, as the former GetProfileSkills request had just finished.

[Figure: Three consecutive operations, left to right: GetProfileSkills, GetProfileData, GetProfileSkills.]
A simplified version of the log.

A quick check confirms that the test passes consistently if we block the third query. Why would we need to re-fetch a query that came back with data less than a second ago? We don’t. This excessive behavior is an example of Apollo Client overfetching, and this is our bug’s root cause.

A Quick Detour: Apollo Client Caching Explained

To remedy our overfetching problem, let’s perform a deep dive into the Apollo Client caching system so we can best evaluate potential solutions.

Apollo Client’s cache is one of its most essential aspects; it provides an intelligent solution for keeping data available and consistent. By saving the results of past queries in a local cache, Apollo Client can instantly access data for previously cached queries. However, Apollo Client’s cache is known to have issues with duplicate network requests.
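
For reference, this caching comes with a standard client setup along the lines of the sketch below (the endpoint URL is a placeholder):

import { ApolloClient, InMemoryCache } from "@apollo/client";

// Query results are normalized into the InMemoryCache, so repeated queries
// can be answered locally instead of triggering another network request.
const client = new ApolloClient({
  uri: "https://example.com/graphql", // placeholder endpoint
  cache: new InMemoryCache(),
});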

I prepared a simplified example to demonstrate this scenario without going into the unnecessary complexity of our full project queries. In this example, our Content function issues two queries with the useQuery hooks, but our debug log shows that we perform three operations: DataForList1, DataForList2, and DataForList1 again. Let’s examine the queries involved:

import { gql } from "@apollo/client";

const query1 = gql`
  query DataForList1 {
    articles {
      id
      title
      tags {
        name
        color
      }
    }
  }
`;

const query2 = gql`
  query DataForList2 {
    articles {
      id
      title
      tags {
        name
        articlesCount
      }
    }
  }
`;
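
For completeness, the Content function can be pictured as a minimal component along these lines (the rendering details are assumptions rather than the actual project code):

import { useQuery } from "@apollo/client";

// Issues both queries; Apollo Client answers them from the network or the cache.
function Content() {
  const { data: data1 } = useQuery(query1);
  const { data: data2 } = useQuery(query2);

  // Rendering of the two article lists is omitted for brevity.
  return null;
}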

The queries are similar, with the same fields selected from the article type, but their selections from the tag type differ. Since we don't select an id field for the tag type, Apollo Client stores the tags in a denormalized way, as properties of the articles. And because the tags field is an array, Apollo Client's default merge strategy is to overwrite the existing data. When we receive data for the first query, Apollo Client sets the article's cached tags to:

{
  tags: [{name: 'HTML', color: 'blue'}, {name: 'JS', color: 'yellow'}, {name: 'PHP', color: 'red'}]
}

After we receive data for the second query, the article’s cached tags are overwritten:

{
  tags: [{name: 'HTML', articlesCount: 10}, {name: 'JS', articlesCount: 12}, {name: 'PHP', articlesCount: 6}]
}

As a result, Apollo Client’s cache no longer knows the tags’ colors and performs an additional HTTP request to retrieve them. In our example, Apollo Client stops here because the duplicated query’s data is identical to that of the original query. However, if we were to load dynamic values like counters, time stamps, or dates, the client could end up in an infinite loop where the two queries continually trigger each other.

Thinking Like a Toptal Engineer: Potential Solution Trade-offs

With an understanding of the bug’s root cause and Apollo Client’s caching process, we can explore the potential solutions that prevent additional requests and discuss their pros and cons.

Disable the Cache

In cases where we don’t need cached values, we can disable the cache entirely:

const { data: data1 } = useQuery(query1, { fetchPolicy: 'no-cache' });
const { data: data2 } = useQuery(query2, { fetchPolicy: 'no-cache' });

Using the no-cache policy ensures that these queries never read from or write to the cache, so the overwriting that triggers re-fetching can't occur.

Decision: While this solution is simple to implement, disabling the cache negatively impacts our app’s usability and performance. Providing an excellent user experience is a top priority at Toptal, so this solution is not a viable option.

Use a Custom Merge Policy

We can customize Apollo Client's merge behavior (which defaults to overwriting cached values with new values) in the cache typePolicies configuration. To do this, we define a merge function, which accepts two values: the cached data and the incoming data. The function's return value becomes the new cached value. Ignoring edge cases, we can resolve the example project's tag overwriting with a basic custom merge policy:

new InMemoryCache({
  typePolicies: {
    Article: {
      fields: {
        tags: {
          // Combine the cached and incoming tag arrays by name so that fields
          // fetched earlier (e.g., color) are not wiped out by a later query.
          merge: (existing: any[] = [], incoming: any[]) => {
            const result = [...incoming];

            existing.forEach((tag: any) => {
              const index = result.findIndex((t) => t.name === tag.name);

              if (index > -1) {
                // Fold the previously cached fields into the incoming tag.
                result[index] = { ...result[index], ...tag };
              }
            });

            return result;
          }
        }
      }
    }
  }
})

However, this solution does not handle cases with arrays of different sizes.

Decision: A custom merge policy is powerful, but getting it right for arrays is difficult. Adding complexity to a project always makes it harder to maintain, and added configuration on the front end would slow our development speed. (Because multiple front-end applications read data from our GraphQL endpoints, we would need to repeat the configuration changes across projects.) Overall, the complexity of the custom merge function needed to address our problem would cause more problems than it would solve.

Configure a Custom Key Field

We can provide a custom typePolicy that sets a key field. In the example above, we assume our tags are unique by name and merge tags with the same name into a single entity. If this remains true, we can set the name field as a key field so that Apollo Client stores tags in a normalized way. This configuration is global per type and affects how tags are cached in general (not just for articles). Now, article tags will be cached with just their names, and the tag objects themselves will be cached separately under those names:

{tags: ["PHP", "HTML", "JS"]}

Using a custom policy prevents data loss since the normalized tag cache will make updates without wiping existing fields. Let’s look at a simple typePolicy that would fix the example project:

new InMemoryCache({
  typePolicies: {
    Tag: {
      keyFields: ["name"]
    }
  }
})

Decision: Providing a custom key field policy is an effective and relatively low-cost option. However, with this solution, the name key field must always be selected in our queries in order for Apollo Client to perform caching, and forgetting this step can cause runtime errors. Always selecting name is also impractical because we won't always need it to render our user interface. Since no field makes a better key candidate than name, this solution is less than optimal.

Implement a Default Key Field

An even simpler solution than configuring the name key field is to rely on the default key field: id. We can add this field to all queries (since an item ID always exists), which normalizes our cache and avoids wiping existing data. Let's fix our example project by adding id to the tag type selection:

# ...
tags {
  id
  name
  color # query1
}
# ...
tags {
  id
  name
  articlesCount # query2
}
# ...

Decision: Using a default key is safe and straightforward. We can use an ESLint rule to require selecting an id field whenever it exists in the schema (a configuration sketch follows), and this default key doesn't require any additional cache configuration in our front-end applications. Plus, the engineering cost of adding an id field to types is relatively low. Implementing a default key field would also protect other projects that rely on our GraphQL schema from the original test failure without requiring additional code changes.
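
For illustration, such a rule could be configured with the community @graphql-eslint plugin roughly as follows; the plugin and rule names are assumptions based on that plugin's documentation and may differ between versions.

// .eslintrc.js fragment (sketch): flag GraphQL selections that omit an available id field.
module.exports = {
  overrides: [
    {
      files: ['*.graphql'],
      parser: '@graphql-eslint/eslint-plugin',
      plugins: ['@graphql-eslint'],
      rules: {
        '@graphql-eslint/require-id-when-available': 'error',
      },
    },
  ],
};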

The clear benefits and absence of unwanted side effects make this the ideal solution.

Lasting Excellence: The Impact of Thoughtful Engineering Solutions

By updating our queries to use a default key field, we successfully eliminated our initial test failures. The impact of the fix, however, extended far beyond one bug instance. Adding the default key field has protected other projects from production-blocking test failures as well. Moreover, our original failure exposed Apollo Client overfetching as a potential source of performance issues in our front-end projects, allowing us to fortify our user experience. Lastly, obtaining a deep understanding of Cypress and its handling of HTML has helped us debug related issues efficiently.

This case study highlights what I love about engineering at Toptal: We focus on quality above all else. Accepting the challenge the bug presented and figuring out the best possible solution was rewarding. The alternative—allowing subpar performance in our automated tests and failing to understand our tools’ behavior—was unacceptable. Even in a dynamic and fast-paced environment, Toptal engineers are encouraged to take the time to find the proper solution to every problem. The results speak for themselves: better products, and a happier and more engaged team that takes pride in their work.


Understanding the Basics

  • What is Apollo Client used for?

    Apollo Client is a GraphQL client that enhances data management; it also performs intelligent data caching.
  • How is data stored in GraphQL?

    GraphQL is a query language that provides access to data based on data types and logic that developers code. It uses existing data instead of a specific storage engine or database.
  • Is Cypress good for testing?

    Cypress is a great option for end-to-end testing. It runs in a web browser, so it can test certain cases that are difficult to run using other testing tools.
  • Why do Cypress tests fail?

    Cypress tests fail due to defects in software or because the tests are “flaky” and produce unreliable results. (The test may have passed when it was committed to the codebase but later starts to fail.)
  • How do you fix a flaky test in Cypress?

    Often, a flaky test’s result depends on a nonobvious factor that may be external to the code. You can debug the root cause of a flaky test using tools like Cypress video recordings and error logs.
  • How does the Apollo cache work?

    The Apollo cache keeps data available and consistent. It saves the results of past queries (in a local cache) so that they can be instantly accessed when requested again. Because the cache is normalized, it is more efficient than typical HTTP caching.
