14 Essential DevOps Interview Questions
Toptal sourced essential questions that the best DevOps engineers can answer. Driven by our community, we encourage experts to submit questions and offer feedback.
Database migrations and new features are common challenges increasing the complexity of DevOps pipelines.
Feature flags are a common way of dealing with incremental product releases inside of CI environments.
If a database migration is not successful, but was run as a scheduled job, the system may now be in an unusable state. There are multiple ways to prevent and mitigate potential issues:
- Trigger the deployment in multiple steps. The first step in the pipeline starts the build process of the application. The migrations are run in the application context. If the migrations succeed, they trigger the deployment step; if not, the application is not deployed.
- Define a convention that all migrations must be backwards compatible. All features are implemented using feature flags in this case. Application rollbacks are therefore independent of the database.
- Create a Docker-based application that creates an isolated production mirror from scratch on every deployment. Integration tests run on this production mirror without the risk of breaking any critical infrastructure.
It is always recommended to use database migration tools that support rollbacks.
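The first approach above can be sketched as a CI pipeline in which the deploy stage is gated on the migration stage. This is a hypothetical GitLab CI configuration; the stage layout reflects the approach described, but the script names are placeholders:

```yaml
# Hypothetical GitLab CI pipeline: the deploy stage only runs
# if the migration job (run in the application context) succeeds.
stages:
  - build
  - migrate
  - deploy

build:
  stage: build
  script:
    - ./scripts/build.sh            # placeholder build script

migrate:
  stage: migrate
  script:
    - ./scripts/run_migrations.sh   # must exit non-zero on failure

deploy:
  stage: deploy
  script:
    - ./scripts/deploy.sh
  # In GitLab CI, a stage only starts after all previous stages
  # succeed, so a failed migration blocks deployment automatically.
```

The same gating can be expressed in any pipeline tool that supports sequential stages with fail-fast semantics.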
A Pod is the smallest deployable unit in Kubernetes: a group of one or more containers that share storage and a network namespace. Pods have a flat network hierarchy inside an overlay network and communicate with each other directly, meaning that in theory any Pod inside that overlay network can reach any other Pod.
If the CNI network plugin you use supports the Kubernetes NetworkPolicy API, Kubernetes allows you to specify network policies that restrict network access.
Policies can restrict based on IP addresses, ports, and/or selectors. (Selectors are a Kubernetes-specific feature that allow connecting and associating rules or components between each other. For example, you may connect specific volumes to specific Pods based on labels by leveraging selectors.)
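As a concrete illustration, a NetworkPolicy that uses label selectors to restrict ingress might look like this (the labels, namespace, and port are illustrative):

```yaml
# Only Pods labeled app=frontend may reach Pods labeled
# app=backend on TCP port 8080; all other ingress is denied.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-allow-frontend
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Note that such a policy only has an effect when the cluster's CNI plugin enforces the NetworkPolicy API.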
Cloud providers allow fine-grained control over the network plane to isolate components and resources. In general, the cloud providers' concepts are quite similar, but in the details there are some fundamental differences in how they handle this segregation.
In Azure this is called a Virtual Network (VNet), while AWS and Google Cloud call it a Virtual Private Cloud (VPC).
These technologies segregate the networks with subnets and use non-globally routable IP addresses.
Routing differs among these technologies. While AWS customers must define route tables themselves, Azure VNets allow traffic to flow between all resources by default via system routes.
Security policies also contain notable differences between the various cloud providers.
There are multiple ways to build a hybrid cloud. A common way is to create a VPN tunnel between the on-premises network and the cloud VPC/VNet.
AWS Direct Connect or Azure ExpressRoute bypasses the public internet and establishes a secure connection between a private data center and the VPC. This is the method of choice for large production deployments.
The Container Network Interface (CNI) is an API specification focused on the creation and connection of container workloads.
CNI has two main commands: ADD and DEL. Configuration is passed in as JSON data.
When the CNI plugin's ADD command is invoked, a virtual Ethernet (veth) device pair is created and connected between the Pod network namespace and the host network namespace. Once IPs and routes are created and assigned, the information is returned to the Kubernetes API server.
An important feature that was added in later versions is the ability to chain CNI plugins.
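A CNI configuration is plain JSON. The sketch below chains the standard bridge plugin with the portmap plugin; the network name and subnet are illustrative:

```json
{
  "cniVersion": "0.4.0",
  "name": "example-net",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "ipam": {
        "type": "host-local",
        "subnet": "10.22.0.0/16",
        "routes": [{ "dst": "0.0.0.0/0" }]
      }
    },
    {
      "type": "portmap",
      "capabilities": { "portMappings": true }
    }
  ]
}
```

Each plugin in the chain receives the result of the previous one, which is how, for example, port mapping can be layered on top of basic connectivity.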
Kubernetes schedules Pods based on their scheduling policy and the available resources.
Every Pod that needs to run is added to a queue, and the scheduler takes it off the queue and schedules it. If scheduling fails, the error handler adds the Pod back to the queue for a later attempt.
What is the difference between orchestration and classic automation? What are some common orchestration solutions?
Classic automation covers the automation of software installation and system configuration such as user creation, permissions, and security baselining, while orchestration is more focused on the connection and interaction of existing and provided services. (Configuration management covers both classic automation and orchestration.)
Most cloud providers have components for application servers, caching servers, block storage, message queues, databases, etc. They can usually be configured for automated backups and logging. Because all these components are provided by the cloud provider, building an infrastructure solution becomes a matter of orchestrating them.
The amount of classic automation necessary in cloud environments depends on the number of ready-made components available: the more there are, the less classic automation is necessary.
In local or on-premises environments, you first have to automate the creation of these components before you can orchestrate them.
For AWS, a common solution is CloudFormation, with lots of different wrappers around it. Azure uses Azure Resource Manager (ARM) template deployments, and Google Cloud has the Google Cloud Deployment Manager.
A common cloud-provider-agnostic orchestration solution is Terraform. While each of its providers is closely tied to a particular cloud, it offers a common state-definition language built around resources (such as virtual machines, networks, and subnets) and data sources (which reference state that already exists in the cloud).
Nowadays most configuration management tools also provide components to manage the orchestration solutions or APIs provided by the cloud providers.
CI stands for “continuous integration” and CD is “continuous delivery” or “continuous deployment.” CI is the foundation of both continuous delivery and continuous deployment. Continuous delivery and continuous deployment automate releases whereas CI only automates the build.
While continuous delivery aims at producing software that can be released at any time, releases to production are still done manually at someone’s decision. Continuous deployment goes one step further and actually releases these components to production systems.
Blue Green Deployments and Canary Releases are common deployment patterns.
In blue green deployments you have two identical environments. The “green” environment hosts the current production system. Deployment happens in the “blue” environment.
The “blue” environment is monitored for faults, and if everything is working well, load balancers and other components are switched from the “green” environment to the “blue” one.
Canary releases are releases that roll out specific features to a subset of users to reduce the risk involved in releasing new features.
VPCs on AWS generally consist of a CIDR block divided into multiple subnets. AWS allows one internet gateway (IG) per VPC, which is used to route traffic to and from the internet. A subnet whose route table points to the IG is considered public; all others are considered private.
The components needed to create a VPC on AWS are described below:
- The creation of an empty VPC resource with an associated CIDR.
- A public subnet in which components will be accessible from the internet. This subnet requires an associated IG.
- A private subnet that can access the internet through a NAT gateway. The NAT gateway is positioned inside the public subnet.
- A route table for each subnet.
- Two routes: One routing traffic through the IG and one routing through the NAT gateway, assigned to their respective route tables.
- The route tables are then associated to their respective subnets.
- A security group then controls which inbound and outbound traffic is allowed.
This methodology is conceptually similar to physical infrastructure.
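The components listed above can be sketched in Terraform, assuming the AWS provider (all CIDRs and names are illustrative):

```hcl
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_internet_gateway" "igw" {
  vpc_id = aws_vpc.main.id
}

resource "aws_subnet" "public" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

resource "aws_subnet" "private" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.2.0/24"
}

# The NAT gateway lives in the public subnet and needs an Elastic IP.
resource "aws_eip" "nat" {
  domain = "vpc"
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.public.id
}

# One route table per subnet, each with its default route.
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.igw.id
  }
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.main.id
  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

resource "aws_route_table_association" "public" {
  subnet_id      = aws_subnet.public.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  subnet_id      = aws_subnet.private.id
  route_table_id = aws_route_table.private.id
}
```

Security groups would then be attached to the individual instances or load balancers placed in these subnets.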
Infrastructure as Code (IaC) is a paradigm in which infrastructure configuration is managed and tracked in files rather than configured manually or through graphical user interfaces. This allows for more scalable infrastructure configuration and, more importantly, transparent tracking of changes, usually through a version control system.
Configuration management systems are software systems that allow managing an environment in a consistent, reliable, and secure way.
By using an optimized domain-specific language (DSL) to define the state and configuration of system components, multiple people can collaborate on the configuration of thousands of servers and store it in a single place.
CFEngine was among the first generation of modern enterprise solutions for configuration management.
Its goal was a reproducible environment, achieved by automating things such as installing software and creating and configuring users, groups, and their permissions.
Second generation systems brought configuration management to the masses. While able to run in standalone mode, Puppet and Chef are generally configured in master/agent mode where the master distributes configuration to the agents.
Ansible is newer than the aforementioned solutions and popular because of its simplicity. The configuration is stored in YAML, and there is no central server. The state configuration is transferred to the servers over SSH (or WinRM, on Windows) and then executed there. The downside of this approach is that it can become slow when managing thousands of machines.
Any system that is supposed to be capable of healing itself needs to be able to handle faults and partitioning (i.e. when part of the system cannot access the rest of the system) to a certain extent.
For databases, a common way to deal with partition tolerance is to use a quorum for writes. This means that every time something is written, a minimum number of nodes must confirm the write.
The minimum number of nodes necessary to gracefully recover from a single-node fault is three. That way, the two healthy nodes can still form a majority and confirm the state of the system.
For cloud applications, it is common to distribute these three nodes across three availability zones.
Logging solutions are used to monitor system health. Both events and metrics are generally logged and may then be processed by alerting systems. Metrics might be storage space, memory usage, load, or any other kind of continuous data that is constantly monitored, which allows detecting divergence from a baseline.
In contrast, event-based logging might cover events such as application exceptions, which are sent to a central location for further processing, analysis, or bug-fixing.
A commonly used open-source logging solution is the Elasticsearch-Logstash-Kibana (ELK) stack. Stacks like this generally consist of three components:
- A storage component, e.g. Elasticsearch.
- A log or metric ingestion daemon such as Logstash or Fluentd. It is responsible for ingesting large amounts of data and adding or processing metadata while doing so. For example, it might add geolocation information for IP addresses.
- A visualization solution such as Kibana to show important visual representations of system state at any given time.
Most cloud solutions either have their own centralized logging solutions that contain one or more of the aforementioned products or tie them into their existing infrastructure. AWS CloudWatch, for example, contains all parts described above and is heavily integrated into every component of AWS, while also allowing parallel exports of data to AWS S3 for cheap long-term storage.
Another popular commercial solution for centralized logging and analysis, both on-premises and in the cloud, is Splunk. Splunk is considered very scalable, is also commonly used as a Security Information and Event Management (SIEM) system, and has advanced table and data model support.
There is more to interviewing than tricky technical questions, so these are intended merely as a guide. Not every “A” candidate worth hiring will be able to answer them all, nor does answering them all guarantee an “A” candidate. At the end of the day, hiring remains an art, a science — and a lot of work.