Understanding a Postgres query plan

Sojjwal Kelkar
8 min read · Apr 14, 2020

A query plan is the sequence of steps a database uses to access data. Being able to read a query plan is key to understanding the performance of an SQL query. When tuning a query we need answers to questions like: How are the rows being fetched from a table? Are the indexes being used? What is the cost of joining two tables? A query plan answers all of these questions.

In Postgres we get the query plan by prepending a query with EXPLAIN. Example:
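
A minimal sketch, assuming a hypothetical users table with a created_at column (the plan shape, costs and row counts are illustrative and depend on your data):

EXPLAIN select * from users where created_at > '2020-01-01';

Seq Scan on users  (cost=0.00..155.00 rows=3000 width=64)
  Filter: (created_at > '2020-01-01 00:00:00'::timestamp without time zone)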

A database can employ various algorithms to fetch rows and join tables, so a query plan can be dense with jargon. To read one, it is good to be familiar with these terms. Let’s go through some of the common terms that appear in a query plan and understand why the planner favours one particular fetching mechanism over another.

But first, we need to understand the structure of a query plan. From the official documentation:

The structure of a query plan is a tree of plan nodes. Nodes at the bottom level of the tree are scan nodes: they return raw rows from a table. There are different types of scan nodes for different table access methods: sequential scans, index scans, and bitmap index scans. […] If the query requires joining, aggregation, sorting, or other operations on the raw rows, then there will be additional nodes above the scan nodes to perform these operations. Again, there is usually more than one possible way to do these operations, so different node types can appear here too. The output of EXPLAIN has one line for each node in the plan tree, showing the basic node type plus the cost estimates that the planner made for the execution of that plan node. […] The very first line (the summary line for the topmost node) has the estimated total execution cost for the plan; it is this number that the planner seeks to minimize.

Let’s break this down in simple terms. A query plan is structured as a tree. Each step of this tree is called a ‘plan node’. In the query plan, every node except the topmost begins with an -> arrow. The nodes at the lowest level correspond to how rows are fetched from a table. Each node on a higher level performs some operation on the output of its child nodes. For example, take the following query:

select count(*) from students where age > 10 and mentor_id = 1

One of the possible ways this query can be processed is to fetch the set of all students whose age is greater than 10, fetch another set of students whose mentor_id is 1, and then perform an AND operation on both sets to get the final result. In this case, the query plan would have two bottom-level nodes and a parent node for the AND operation. There can be higher-level nodes for any operation, such as sort, limit, or join, performed on the raw rows. A plausible plan for this query is sketched below.
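
Assuming indexes on both age and mentor_id, one plausible plan looks like this (the bitmap nodes are explained later in this article; index names are hypothetical and all numbers are illustrative):

Aggregate  (cost=25.93..25.94 rows=1 width=8)
  ->  Bitmap Heap Scan on students  (cost=8.60..25.91 rows=5 width=0)
        Recheck Cond: ((age > 10) AND (mentor_id = 1))
        ->  BitmapAnd  (cost=8.60..8.60 rows=5 width=0)
              ->  Bitmap Index Scan on index_students_on_age  (cost=0.00..4.17 rows=100 width=0)
                    Index Cond: (age > 10)
              ->  Bitmap Index Scan on index_students_on_mentor_id  (cost=0.00..4.18 rows=50 width=0)
                    Index Cond: (mentor_id = 1)

The two Bitmap Index Scans are the bottom-level nodes, BitmapAnd combines their results, and the nodes above them fetch and count the matching rows.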

Now let’s go through each aspect of a query plan in more detail.

Fetching raw rows

A database can fetch rows from a table in different ways, depending on factors such as the available indexes, the number of rows to be fetched, and the size of the table in question.

Sequential scan

In a sequential (or full) table scan, all table rows are loaded into memory. If the query has any filtering criteria, they are applied in-memory on the appropriate columns, and only the rows that satisfy the filtering condition are passed on. This method is used when indexes are missing on the columns present in the query predicate (the where clause). In the above example, all user rows are fetched and the rows that fail the created_at condition are discarded.

Though a sequential scan usually indicates missing indexes, a query planner may employ this approach even when an index is present on the filtering column. This may happen if:

  • A table has very few rows. An index scan requires random seeks. If the whole table fits within a few heap pages, it is cheaper in terms of I/O operations to load all those pages into memory, avoiding the overhead of random seeks.
  • A query would return a large percentage of table rows. Suppose a query would return 90% of all rows in a large table. In such a case, each page would contain several rows that satisfy the query predicate, so fetching each whole page once beats performing random seeks that would access the same page several times, as in the sketch below.
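
For instance, even with an index on age, a predicate that matches almost every row may still produce a sequential scan (a hypothetical example; the actual choice depends on your data and statistics):

EXPLAIN select * from students where age > 1;

Seq Scan on students  (cost=0.00..18.50 rows=950 width=64)
  Filter: (age > 1)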

Postgres maintains table statistics (such as row counts and value distributions) that the planner uses to decide between a sequential scan and an index scan.

Index scan

For an index scan to work, an index must be present on the column used in the query predicate. The database uses that index to get references to all the tuples (rows) that satisfy the query predicate. Then the corresponding page is accessed in the heap to fetch the rest of the required data from each tuple.

Suppose that a query has multiple predicates and all the columns in those predicates have indexes on them. The planner would typically not perform index scans for all the predicates. It would use an index scan on only one of those columns and perform in-memory filtering for the remaining predicates.
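
For instance, the planner may use the index on mentor_id and apply the age predicate as an in-memory filter (a sketch; index name and numbers are illustrative):

EXPLAIN select * from students where age > 10 and mentor_id = 1;

Index Scan using index_students_on_mentor_id on students  (cost=0.28..12.10 rows=5 width=64)
  Index Cond: (mentor_id = 1)
  Filter: (age > 10)

The Index Cond line shows the predicate served by the index; the Filter line shows the predicate applied in memory.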

An index scan can also be used for queries with an ORDER BY clause that matches the index order. This saves an extra sorting step.
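
A sketch, assuming id is the primary key of students (plan shape and numbers illustrative):

EXPLAIN select * from students order by id limit 10;

Limit  (cost=0.28..1.08 rows=10 width=64)
  ->  Index Scan using students_pkey on students  (cost=0.28..80.28 rows=1000 width=64)

Note that there is no Sort node: the index already returns rows in the requested order.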

Bitmap scan

A plain index scan fetches one tuple-pointer at a time from the index and immediately visits that tuple in the table. This means a single page may be visited multiple times if several desired tuples are located in it. A more efficient use of I/O operations is to visit a page only once and fetch all desired tuples from it. A bitmap scan does just that.

A bitmap index scan fetches all the tuple-pointers from the index at once and sorts them by physical location using an in-memory bitmap. A bitmap heap scan then visits each heap page once and fetches all desired tuples from it in one go.

This method of row scanning has the overhead of maintaining the in-memory bitmap. If the bitmap gets too large, it becomes lossy: it keeps only references to the heap pages that contain matching tuples instead of tracking individual tuples within those pages. Each such page is then loaded whole, and its tuples are filtered on the Recheck Cond to get the desired data.
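
A sketch of a bitmap scan (index name and numbers illustrative):

EXPLAIN select * from students where age > 10;

Bitmap Heap Scan on students  (cost=6.61..26.32 rows=300 width=64)
  Recheck Cond: (age > 10)
  ->  Bitmap Index Scan on index_students_on_age  (cost=0.00..6.54 rows=300 width=0)
        Index Cond: (age > 10)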

Joining tables

Nested loop join
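
A sketch of a nested loop join, assuming teams and players tables with an index on players.team_id (all numbers illustrative):

EXPLAIN select * from teams join players on players.team_id = teams.id;

Nested Loop  (cost=0.28..33.61 rows=200 width=128)
  ->  Seq Scan on teams  (cost=0.00..1.10 rows=10 width=64)
  ->  Index Scan using index_players_on_team_id on players  (cost=0.28..3.05 rows=20 width=64)
        Index Cond: (team_id = teams.id)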

This method of joining two tables is preferred when one side of the join has few rows. In the above query plan, all the teams are fetched via a sequential scan. Then for each team id, an index scan is performed on the players table to fetch the players belonging to that team.

A nested loop join is performed on two nodes. A node can be a table or an intermediary step in a query plan. The second node of the join is looped over for each row in the first node. This method works best when the second node (the one being looped over) can be index scanned. The join column value from each row of the first node (teams.id in the above example) serves as the key for the index scan of the second node.

A nested loop plan may also contain a Materialize node above the scan of the inner table. Because the inner relation is accessed several times in a nested loop join, the Materialize node saves the scanned rows in memory after the first pass. The same in-memory data is used in each subsequent pass, thus avoiding a disk lookup each time.

Nested loop join is the only method of joining tables if the join condition does not use the equality operator.

Hash join

In a hash join, the relation on the right side of the join is scanned and loaded into an in-memory hash table, using its join attribute as the hash key. Then the left relation is scanned, and its join attribute is used to look up matching rows of the right relation in the hash table.

A hash join can be used when the join condition uses the equality operator, both sides of the join are large, and the hash table fits in memory.
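
A sketch of what this can look like for the same teams/players join (numbers illustrative):

EXPLAIN select * from teams join players on players.team_id = teams.id;

Hash Join  (cost=30.50..120.75 rows=2000 width=128)
  Hash Cond: (players.team_id = teams.id)
  ->  Seq Scan on players  (cost=0.00..64.00 rows=2000 width=64)
  ->  Hash  (cost=18.00..18.00 rows=1000 width=64)
        ->  Seq Scan on teams  (cost=0.00..18.00 rows=1000 width=64)

Here teams is loaded into the hash table and players is scanned to probe it.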

Merge join

In this join method, both relations are first sorted on the join attribute. Then the two relations are scanned in parallel to find the matching rows. This method can be used when the join condition uses the equality operator and both sides of the join are large but can be efficiently sorted on the join attribute.
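
A sketch of a merge join for the same tables (numbers illustrative):

EXPLAIN select * from teams join players on players.team_id = teams.id;

Merge Join  (cost=149.66..324.66 rows=5000 width=128)
  Merge Cond: (teams.id = players.team_id)
  ->  Sort  (cost=135.34..140.34 rows=2000 width=64)
        Sort Key: teams.id
        ->  Seq Scan on teams  (cost=0.00..28.00 rows=2000 width=64)
  ->  Index Scan using index_players_on_team_id on players  (cost=0.28..180.28 rows=5000 width=64)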

In the above example, players is sorted using the index on its team_id column. The teams table could also have been sorted using the index on its primary key, but in this instance the query planner preferred a sequential scan and sort. A sequential scan and sort is preferable to an index scan when the table is large and the cost of the non-sequential disk access required by the index scan is higher than that of a simple full scan and in-memory sort.

Cost estimation

Each node is accompanied by a cost estimation that takes the form:

(cost=10.00..20.00 rows=1 width=8)

The costs are measured in arbitrary units determined by the planner’s cost parameters; conventionally, the unit is sequential disk page fetches. The cost of an upper-level node includes the cost of all its child nodes. A brief description of each field in the cost estimation:

  • cost: A range comprising the estimated start-up cost and the estimated total cost. The start-up cost is the time spent before the transmission of output rows can begin, which may include the time required for sorting. The total cost is the estimated time to transmit all eligible rows, assuming the node runs to completion. In practice, a LIMIT clause in an outer query may stop the node early, but the estimates don’t account for such scenarios.
  • rows: Estimated number of rows emitted by a plan node.
  • width: Estimated average width of rows emitted by a plan node in bytes.

EXPLAIN ANALYZE

EXPLAIN by itself just provides the query plan with cost estimates. If we run a query with EXPLAIN ANALYZE, the query is actually executed. For each plan node, the true row count and the true run time are displayed along with the estimates, so we can check the accuracy of the planner’s estimation.

Depending on the query plan, EXPLAIN ANALYZE may provide much richer information, as in the sketch after this list. This includes:

  • The number of actual loops performed in a nested loop join.
  • If sorting was involved, then the algorithm and the amount of memory used for sorting.
  • If hash joins were involved, the number of hash buckets and the peak memory used for hash tables.
  • Rows rejected by a filter condition or an index recheck.
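
For instance, EXPLAIN ANALYZE on a simple filtered scan might produce something like this (all numbers illustrative):

EXPLAIN ANALYZE select * from students where age > 10;

Seq Scan on students  (cost=0.00..18.50 rows=950 width=64) (actual time=0.014..0.243 rows=948 loops=1)
  Filter: (age > 10)
  Rows Removed by Filter: 52
Planning Time: 0.102 ms
Execution Time: 0.370 ms

The actual values show how close the planner’s estimates were, and Rows Removed by Filter shows the rows rejected by the filter condition.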
