PostgreSQL has been one of the more popular open-source object-relational database systems since its initial release nearly two decades ago, but that popularity has grown enormously during the last couple of years. We can get a sense of this by looking at db-engines.com, which calculates a “popularity score” based on factors such as the number of questions on Stack Overflow, job offers, and the number of results on the major search engines. Its graph shows a 72.36 percent increase in the popularity of PostgreSQL from January 2013 (score: 167.475) to February 2016 (score: 288.657).
With such a demand for PostgreSQL database experts, some software engineers have simply added it to their résumé because they think they will get by with the basics, but they fall short when more advanced tasks that require specific database and software development experience are put on their plate.
To help you find high-quality developers, in the United States or abroad and for full-time or part-time roles, who truly understand the tool, this hiring guide takes you through the topics and questions that PostgreSQL experts should know well.
Even if a candidate is a master of SQL, it doesn’t necessarily mean that s/he is a master of PostgreSQL. Yet, if the candidate is not proficient in SQL we know for sure that s/he is not a master of PostgreSQL. A good Postgres DBA should know intimately the principles and use of standard SQL. PostgreSQL supports the majority of features required by the SQL standard ISO/IEC 9075:2011, so a big part of PostgreSQL usage relies on it.
We can weed out a large portion of candidates with a high degree of confidence by starting the interview with SQL principles, concepts and usage: the different types of JOIN, UNION, subqueries and the difference between HAVING and WHERE. Toptal’s list of 20 Essential SQL Interview Questions is a good place to start.
SQL vs. NoSQL
When a SQL database is in order, PostgreSQL is the best database for the job most of the time. While the differences between PostgreSQL and other SQL databases, such as MySQL, might be debated, it is more valuable in the hiring process to discuss the differences between PostgreSQL and NoSQL databases, such as MongoDB. A strong candidate knows when Postgres is the right tool and doesn’t use or recommend it simply because it’s the only thing s/he knows.
Q: When is it appropriate to use PostgreSQL instead of a NoSQL database and how can you tell?
Relational databases, such as PostgreSQL, have advantages and disadvantages compared to NoSQL databases, and understanding those weaknesses and strengths is important. The Brewer Theorem (also known as the CAP theorem) is a handy tool for determining which type of database is the best one for the job.
The Brewer Theorem is built on the assertion that a distributed application cannot simultaneously guarantee the following three systemic requirements: consistency, availability and partition tolerance. Depending on which two of these three are the application’s highest priority, it can be decided which database is the best fit. PostgreSQL is better when consistency and availability are needed, while other combinations may require a NoSQL solution. Here’s a more in-depth look at both options:
Transactionality: PostgreSQL supports transactions at the database level, while many NoSQL databases do not. There are some methods for implementing transactions on top of NoSQL, but these are done at the application level instead of at the database level, so they take a toll on performance.
Aggregation functions: PostgreSQL implements SQL standard aggregation functions, as well as some of its own, with high performance. In NoSQL, there are techniques for getting decent performance, but they don’t come close to the performance achieved by PostgreSQL.
Joins: Relational databases, such as PostgreSQL, are excellent at querying data that is stored across different tables by using the different types of joins, while NoSQL databases often need to perform multiple queries.
Semistructured data: NoSQL databases have a big advantage over relational databases when it comes to handling semistructured data and horizontal scalability. While PostgreSQL has some support for this with the use of the JSONB datatype, NoSQL databases are still the superior choice.
Scaling: NoSQL databases (specifically, the ones that are BASE compliant) are better at scaling write operations since they lack a number of functionalities that ACID compliant databases have, such as referential integrity enforcement.
With these differences and the Brewer Theorem in mind, let’s look at two case study applications:
Financial applications need strict data consistency and transactionality, and the nature of the app means running multiple complex reports. In this case, PostgreSQL is the best tool.
Social networks (successful ones) are expected to scale to a massive volume, and their data is not well structured. Furthermore, the schema keeps changing as more and more fields are added in tandem with new app functionality, and the database has to scale horizontally along with the app. Here, a NoSQL database is the better option.
Proof of Mastery
By this point we should have weeded out the weaker talent, but we still need to find the developers who are truly Postgres masters, the ones who will help us get the most out of our database management and are able to handle the inevitable, daunting tasks that come along.
Q: “We need to query a knowledge base with entries for several topics (each entry has a title and a content).” How would you accomplish this using Postgres?
The Subpar Answer
If a developer is not an expert in PostgreSQL, and thus not familiar with its full text search functionality (which we will discuss further), s/he might answer something like this:
SELECT * FROM posts WHERE title ILIKE '%query-string%' OR content ILIKE '%query-string%';
A candidate that answers this way is not a prime choice. This approach is not using PostgreSQL’s awesome capabilities and the quality of the results would be poor:
If, for example, our query-string is a pluralized word and we have an entry with the singular form of that word, that result will not be included (not to mention words such as “do” and “does”), since this method only looks for inclusion of the query-string in the title or content. Likewise, if we have two separate words in the query-string, no results will come up unless they are in the same order and the same form.
There’s no guarantee of relevance since there is no way to rank these results by the number of occurrences in the content or numbers of exact word matches, and so on.
Performance is also a problem with this approach. Since a pattern with a leading wildcard cannot use a regular B-tree index, this query can be extremely slow even in fairly small databases.
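The first weakness is easy to reproduce outside the database. The following is a toy sketch in Python, not PostgreSQL’s implementation: `substring_match` imitates the `ILIKE '%query-string%'` check, while `token_match` compares normalized words. The `crude_stem` helper is a hypothetical, deliberately naive stand-in for real stemming.

```python
import re


def substring_match(query, document):
    """Rough analogue of: document ILIKE '%query%' (case-insensitive inclusion)."""
    return query.lower() in document.lower()


def crude_stem(word):
    """Hypothetical, deliberately naive stemmer: drop a trailing 's'."""
    return word[:-1] if len(word) > 3 and word.endswith("s") else word


def tokens(text):
    """Lowercase the text, split it into words, and stem each word."""
    return {crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())}


def token_match(query, document):
    """True when every normalized query word occurs in the document."""
    return tokens(query) <= tokens(document)


doc = "How to create an index on a table"

print(substring_match("tables", doc))        # False: plural form is not a substring
print(token_match("tables", doc))            # True: 'tables' normalizes to 'table'
print(substring_match("table index", doc))   # False: words appear in another order
print(token_match("table index", doc))       # True: word order does not matter
```

Matching on normalized words instead of raw substrings is the idea that a dedicated text search feature builds on.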
The Right Answer
So, how can this problem be solved in a better way using some of PostgreSQL’s cool features? Full text search.
The full text search feature is based on the @@ operator (known as the ‘match operator’), which returns true if a tsvector matches a tsquery. A tsvector is a search type that represents a document as a sorted list of normalized words (lexemes) used for text searches. A tsquery is a search type that represents a query. A document is the ‘unit of searching’, which will often be a column of a table or a concatenation of multiple columns (even from multiple tables), such as this:
SELECT title || ' ' || content AS document FROM posts;
We can convert this document into a tsvector with the to_tsvector() function, but we cannot directly use to_tsquery() to convert our query-string into a tsquery, since it expects a string with boolean operators that separate lexemes (like & for ‘and’, | for ‘or’, and ! for ‘not’). To convert user-written text into a tsquery, we can use the function plainto_tsquery(). The resulting query will look like this:
SELECT * FROM posts WHERE to_tsvector(title || ' ' || content) @@ plainto_tsquery('query-string');
Note: depending on your PostgreSQL configuration, you may need to specify the text search configuration (for example, 'english') explicitly in order to have some cool features such as “stemming”.
To order the results by relevance we can use the ts_rank() function, which accepts a tsvector and a tsquery as parameters.
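As a minimal sketch of how the pieces might fit together, the Python helper below builds a ranked search statement. It assumes the posts table with title and content columns from the examples above, passes an explicit text search configuration to `to_tsvector` and `plainto_tsquery` so that stemming applies, and uses the `%s` placeholder style of drivers such as psycopg. The function name and the `limit` parameter are hypothetical.

```python
def build_ranked_search(query_string, config="english", limit=10):
    """Return (sql, params) for a full text search ordered by relevance."""
    # The configuration name is interpolated into the SQL text, so only
    # accept simple names such as 'english' or 'simple'.
    if not config.replace("_", "").isalnum():
        raise ValueError("unexpected text search configuration name")

    document = f"to_tsvector('{config}', title || ' ' || content)"
    query = f"plainto_tsquery('{config}', %s)"

    sql = (
        f"SELECT title, ts_rank({document}, {query}) AS rank\n"
        f"FROM posts\n"
        f"WHERE {document} @@ {query}\n"
        f"ORDER BY rank DESC\n"
        f"LIMIT {int(limit)};"
    )
    # The user's text is passed as a parameter (once per placeholder)
    # instead of being pasted into the SQL string.
    return sql, (query_string, query_string)


sql, params = build_ranked_search("index tables")
print(sql)
print(params)
```

With a driver such as psycopg, the result would be run as `cursor.execute(sql, params)`. On larger tables, an expression index along the lines of `CREATE INDEX posts_fts_idx ON posts USING GIN (to_tsvector('english', title || ' ' || content));` (the index name is an assumption) lets the WHERE clause avoid scanning every row, which addresses the performance weakness of the ILIKE approach.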
Q: What is ‘high availability’, when should you implement it, and how do you do it?
High availability refers to the capability of the database to remain operational for a higher percent of the time in comparison to non-redundant servers. This is also known as having a high service level.
High availability can be achieved with redundant database servers working together to replace the primary server in case of failure. These standby servers (also known as “slaves”) track the changes of the primary, “master” server. While only primary servers may modify the data (perform read/write operations), standby servers have different roles:
Hot standby servers accept connections for the sole purpose of serving read-only queries. This type of standby server is often used for load balancing.
Warm standby servers do not accept any connections; they only follow the changes made to the primary server. If they are promoted to primary, they can start accepting connections and modifying data.
When to Implement High Availability
PostgreSQL has many features for facilitating high availability, but it’s not always appropriate to implement high availability. There are many instances where one would decide against high availability (or some type of it), so it is difficult to come up with a simple guideline, but the deciding factor should always be the needs of the business. Generally, implementing high availability is a detriment to performance and increases the overall complexity of the architecture, thus understanding the gains and losses of each decision is key.
It’s easy to give the above answer as a catch-all cop-out, so the candidate should explain a couple of specific cases to demonstrate understanding:
Scenario 1: A marketing company has a newsletter application. Requests to the web application are scheduled (as are the read/write operations to the database) and do not depend on demand. A couple of hours of downtime per month is not likely to impact the business, so high availability doesn’t give much benefit for the cost.
Scenario 2: A SaaS company offers a service-level agreement of 99 percent uptime to its 500,000 users. This company needs to be built on a high availability implementation.
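To put numbers on the two scenarios, here is a small arithmetic sketch in Python. The helper names are hypothetical, and a month is assumed to be 30 days.

```python
HOURS_PER_MONTH = 30 * 24    # assumption: a 30-day month
HOURS_PER_YEAR = 365 * 24


def allowed_downtime_hours(uptime_percent, period_hours):
    """Downtime budget left by an uptime target over a period."""
    return period_hours * (100 - uptime_percent) / 100


def achieved_uptime_percent(downtime_hours, period_hours):
    """Uptime percentage that results from a given amount of downtime."""
    return 100 * (1 - downtime_hours / period_hours)


# Scenario 2: a 99 percent SLA
print(round(allowed_downtime_hours(99, HOURS_PER_MONTH), 2))   # 7.2 hours per month
print(round(allowed_downtime_hours(99, HOURS_PER_YEAR), 2))    # 87.6 hours per year

# Scenario 1: about two hours of downtime per month
print(round(achieved_uptime_percent(2, HOURS_PER_MONTH), 2))   # 99.72 percent
```

A candidate who can translate an uptime target into a downtime budget like this is in a better position to argue whether the extra complexity of high availability is justified.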
Implementation with Transaction Log Shipping
There are multiple strategies for implementing high availability, but here we’ll cover “transaction log shipping” on a hot standby. PostgreSQL continuously records every change using write-ahead logging (also known as “WAL”). Among other things, these log entries are used to keep standby servers up to date: each standby connects to the primary server and fetches the logs.
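The idea behind log shipping can be illustrated with a toy model in Python. This is only a conceptual sketch under simplifying assumptions (an in-memory list as the log, key/value pairs as changes); the `Primary` and `Standby` classes are hypothetical and do not reflect how PostgreSQL stores or transfers WAL.

```python
class Primary:
    """Toy primary: every change is logged before it is applied."""

    def __init__(self):
        self.wal = []     # append-only log of (key, value) changes
        self.data = {}

    def write(self, key, value):
        self.wal.append((key, value))   # write ahead: log first...
        self.data[key] = value          # ...then apply the change


class Standby:
    """Toy standby: replays the log entries it has not seen yet."""

    def __init__(self):
        self.data = {}
        self.replayed = 0   # number of log entries already applied

    def catch_up(self, primary):
        pending = primary.wal[self.replayed:]
        for key, value in pending:
            self.data[key] = value
        self.replayed += len(pending)
        return len(pending)


primary, standby = Primary(), Standby()
primary.write("alice", 100)
primary.write("bob", 50)
print(standby.catch_up(primary))   # 2 entries replayed
primary.write("alice", 75)
print(standby.data["alice"])       # 100: the standby lags until it fetches again
print(standby.catch_up(primary))   # 1 entry replayed
print(standby.data == primary.data)  # True
```

The standby only needs to remember how far it has replayed, which is why it can fall behind and catch up again without copying the whole database.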
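For the implementation exercise, assume that the primary database is already functional and that the standby database is an exact copy of the primary database. The commands below target the PostgreSQL 9.x series; later releases renamed several of these settings.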
In the primary server:
a) Create a user for the replication:
sudo -u postgres createuser -U postgres replicator -P --replication
(enter password when prompted)
b) Edit pg_hba.conf to allow connections for replication:
host replication replicator <standby-server-ip>/32 md5
c) Edit postgresql.conf to set the wal_level to hot standby and enable WAL archiving:
wal_level = hot_standby
archive_mode = on
archive_command = 'cp %p /path/to/archive/folder/%f'
max_wal_senders = 3 # Number of standby servers, plus two connections for pg_basebackup's stream mode
Note: You may need to create the folder used by archive_command; a good place is /var/lib/postgresql/<postgres-version>/main/mnt/server/wal_archive. Don’t forget to change the ownership to the postgres user after you create the folder. Depending on your network setup, you might also need to add the primary server’s IP address to the list of addresses to listen on:
listen_addresses = 'localhost, <primary-server-ip>'
In the standby server:
a) With PostgreSQL stopped on the standby server, delete its default data directory and copy the files from the data directory of the primary server using the pg_basebackup utility:
rm -r /var/lib/postgresql/<postgres-version>/main
sudo -u postgres pg_basebackup -h <primary-server-ip> -D /var/lib/postgresql/<postgres-version>/main -U replicator -v -P --xlog-method=stream
b) Edit postgresql.conf to enable hot standby:
hot_standby = on
c) Create /var/lib/postgresql/<postgres-version>/main/recovery.conf from the recovery.conf.sample file that ships with PostgreSQL (its location depends on your installation).
d) Edit the recovery.conf to configure the standby mode:
standby_mode = on
primary_conninfo = 'host=<primary-server-ip> port=5432 user=replicator password=<password>'
All these changes need to be made as a root user. Don’t forget to restart both servers so they can pick up the changes, and pay special attention to the network setup: make sure that the hosts can reach each other and that the ports are not being blocked by a firewall. This configuration is a good starting point for exploring customizable options. If you have any issues, a good place to start troubleshooting is the postgres logs of each server (located in “/var/log/postgresql/postgresql--main.log”).
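A quick way to rule out the network as the culprit is to check, from the standby, whether the primary’s port accepts TCP connections. The following is a minimal Python sketch; the function name is hypothetical, and 5432 is the port used in primary_conninfo above.

```python
import socket


def port_is_reachable(host, port=5432, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


if __name__ == "__main__":
    # Replace 127.0.0.1 with the primary server's address when running
    # this from the standby.
    print(port_is_reachable("127.0.0.1", 5432, timeout=1.0))
```

A `False` result points to a firewall, a wrong address, or a listen_addresses setting that does not include the interface in question. A `True` result only shows that the port is open, so authentication problems still have to be diagnosed from the logs.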
With the recent surge in the popularity of NoSQL, some people thought that SQL databases were headed to oblivion and that PostgreSQL, not being the most used SQL database, would be one of the first ones to fall. This hasn’t been the case, and so it remains important to refine the PostgreSQL recruiting process. With the NoSQL euphoria fading away, people are realizing that NoSQL is not a silver bullet, and with PostgreSQL becoming so popular, SQL databases continue to play an important role.