Friday, November 27, 2015

SQL query optimization by examples

I recently had a database optimization quiz at Grokking Coding Challenge 2015. It's a basic but interesting problem, and it reminds me of my own days at school. I hope a write-up about this quiz will help us review our knowledge and learn some interesting things.

0. The problem

=== start question ===

You have a table with the following structure
-- postgresql-compatible SQL
-- each entry in user_purchases indicates that the user has bought an item
-- referrer_name has a few values: 'amazon', 'facebook', 'friends-of-grokking-engineering'
CREATE TABLE user_purchases (
  user_id NUMERIC,
  item_id NUMERIC,
  referrer_name VARCHAR(255),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() 
);

You run a lot of queries to find the items that a user purchased, similar to this:
SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01');

The query is slow, and running EXPLAIN returns:
EXPLAIN SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..3482332.76 rows=1 width=596)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
What would you do to improve the performance of the query (make it run faster), independent of the server's infrastructure (CPU, memory, disk, etc)?

=== end question ===

Comment: The very first solution that comes to mind is using an index to make the query run faster. Simple! But using indexes efficiently takes some consideration. I could think of several solutions, and I decided to get my hands dirty with them.


1. Set up environment

Because I had never used PostgreSQL before, I had to install it on my MacBook to run some experiments. An easy way to try PostgreSQL on Mac OS is Postgres.app. You can also use a GUI tool to work with PostgreSQL more easily.

Let's create the table and feed it some data:
CREATE TABLE user_purchases (
  user_id NUMERIC,
  item_id NUMERIC,
  referrer_name VARCHAR(255),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() 
);
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (123,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (456,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (789,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (123,1,'google','2015-7-01');


2. The original query

Run the original query and analyze the output:
EXPLAIN SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..12.76 rows=1 width=596)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
The output of the EXPLAIN query tells us that the optimizer decided to use a sequential scan to execute the query. It estimates a startup cost of 0.00 (before the first row can be returned) and a total cost of 12.76 to return all the rows. It expects 1 row to be returned, with an average row width of 596 bytes.

But EXPLAIN doesn't actually run the query; it only shows the planner's estimates. We will use EXPLAIN ANALYZE to see how the query runs in reality:
EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..12.76 rows=1 width=596) (actual time=0.066..0.074 rows=3 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 1
Planning time: 1.191 ms
Execution time: 0.142 ms
We can see more information here: the actual time the sequential scan step took, the number of rows it returned, and the number of times it was executed (loops). We also get the total planning and execution times for the query.

For the later experiments, we will use EXPLAIN ANALYZE to get more accurate stats. You can read more about PostgreSQL's EXPLAIN here.
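As a side note, EXPLAIN accepts options. For example, the BUFFERS option (available since PostgreSQL 9.0) additionally reports how many shared-buffer pages each step touched, which helps distinguish cached reads from disk reads:
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM user_purchases WHERE user_id IN (123, 456, 789);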

3. Attempt 1

The original answer proposed by the organizer team is to create an index on each of three columns: user_id, item_id, and created_at. Let's try this solution:
CREATE INDEX ix_user_id ON user_purchases (user_id);
CREATE INDEX ix_item_id ON user_purchases (item_id);
CREATE INDEX ix_created_at ON user_purchases (created_at);
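As a quick sanity check, you can confirm that the indexes exist by querying PostgreSQL's standard pg_indexes view:
SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'user_purchases';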

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..1.08 rows=1 width=596) (actual time=0.057..0.067 rows=3 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 1
Planning time: 1.175 ms
Execution time: 0.139 ms
You can see that the execution time of the query is NOT much improved. Moreover, the query still performed a sequential scan on the user_purchases table and didn't consider the indexes at all. It seems weird. Trying to understand this situation, I found an explanation:
The reason is that indexes have a cost to create, maintain (on writes), and use (on reads).
When an index is used in a SELECT query, the positions of the requested rows are first fetched from the index (instead of from the table directly). Then the requested rows (a subset of the total) are actually fetched from the table. It's a two-step process. Given a large enough number of requested rows (such as all the posts with an id greater than 10), it is actually more efficient to fetch all those rows from the table directly rather than go through the extra step of asking the index for their positions and then fetch those specific rows from the table. If the requested rows are less than roughly 5-10% of all rows in the table, the benefit of using the information stored in the index outweighs the additional intermediate step.
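You can observe this trade-off yourself: PostgreSQL lets you discourage specific plan types per session (useful for experiments only, never in production). A quick sketch:
-- discourage sequential scans for this session, then compare the estimated costs
SET enable_seqscan = off;
EXPLAIN SELECT * FROM user_purchases WHERE user_id IN (123, 456, 789);
SET enable_seqscan = on;  -- restore the default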
OK, our data is just not big enough. Let's populate the table with more data:
#!/usr/bin/env python
# Populate user_purchases with test data (Python 2, psycopg2).
import psycopg2

template = "INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (%d, %d, '%s', '%s')"
conn = psycopg2.connect("dbname='namnx' user='namnx' host='localhost' password=''")
cursor = conn.cursor()
for user_id in xrange(1, 10001):
    for item_id in xrange(1, 11):
        for referrer in ('amazon', 'google', 'facebook', ''):
            for month in xrange(6, 9):
                for day in (1, 15):
                    created_at = "2015-%d-%d" % (month, day)
                    cursor.execute(template % (user_id, item_id, referrer, created_at))
conn.commit()  # without an explicit commit, the inserts are rolled back on exit
cursor.close()
conn.close()
This script inserts 10000 × 10 × 4 × 3 × 2 = 2,400,000 (2.4M) records into the table, which I think is big enough for a meaningful comparison. It will take some time to run. Be patient.
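As an aside, if you don't want to wait for the Python script, PostgreSQL can generate the same 2.4M rows server-side in a single statement. A sketch using generate_series and unnest, with the same values as the script above:
INSERT INTO user_purchases (user_id, item_id, referrer_name, created_at)
SELECT u, i, r, d::timestamptz
FROM generate_series(1, 10000) AS u,
     generate_series(1, 10) AS i,
     unnest(ARRAY['amazon', 'google', 'facebook', '']) AS r,
     unnest(ARRAY['2015-06-01', '2015-06-15', '2015-07-01',
                  '2015-07-15', '2015-08-01', '2015-08-15']) AS d;
-- refresh the planner statistics after a bulk load
ANALYZE user_purchases;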

When the data is ready, re-run the previous query and see the output:
EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_user_id on user_purchases  (cost=0.43..39.90 rows=58 width=32) (actual time=0.125..1.869 rows=63 loops=1)
  Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
Planning time: 1.958 ms
Execution time: 1.997 ms
OK, you can see the index is in effect now. The output of EXPLAIN tells us that the index on the "user_id" column was used. When the query is executed, this index is scanned first to find the rows satisfying the condition "user_id IN (123,456,789)". Those rows are then filtered by the other conditions of the WHERE clause:
created_at >= date('2015-07-01') AND created_at < date('2015-08-01') AND referrer_name='amazon'
The index on the "created_at" column was not used here. Once the planner chooses the "user_id" index for a plain index scan, the remaining conditions are applied as a filter to the rows fetched from the table; an index cannot be applied to rows that were already produced by another index scan. (PostgreSQL can sometimes combine two indexes with a BitmapAnd plan, but the planner did not consider that worthwhile here.) So the index on the "created_at" column is useless for this query.

In addition, the index we created on the "item_id" column is also useless. In database schema design, we should not create an index we will never use: it takes up space in the database and slows down WRITE queries every time we insert new records.
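On a real system, you can hunt for such unused indexes with PostgreSQL's statistics views. A quick sketch (idx_scan counts how many times each index has been scanned since the statistics were last reset):
SELECT indexrelname, idx_scan FROM pg_stat_user_indexes WHERE relname = 'user_purchases';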

Just drop these useless indexes and re-run the query to see the result:
DROP INDEX ix_item_id;
DROP INDEX ix_created_at;

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_user_id on user_purchases  (cost=0.43..39.90 rows=58 width=32) (actual time=0.135..1.333 rows=63 loops=1)
  Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
Planning time: 0.364 ms
Execution time: 1.389 ms
The performance of the query is about the same!

Something to note here: in some cases, the index on the "created_at" column would be chosen instead, because the query optimizer prefers the index that eliminates as many rows as possible. In PostgreSQL, the default index type is the B-tree, which handles range conditions like "created_at >= date('2015-07-01') AND created_at < date('2015-08-01')" well. So if that index eliminated more rows than the one on "user_id", it would be used. With our data, however, the "user_id" index eliminates far more rows.
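If you want to experiment with this, recreate the "created_at" index and try different date ranges. With our synthetic data (only six distinct created_at values, each matching 400,000 rows) the "user_id" index will still win, but on real data with continuous timestamps a narrow range can be the more selective condition. A sketch:
-- recreate the index dropped above, try a narrow range, then clean up
CREATE INDEX ix_created_at ON user_purchases (created_at);
EXPLAIN SELECT * FROM user_purchases WHERE user_id IN (123, 456, 789) AND created_at >= date('2015-07-14') AND created_at < date('2015-07-16');
DROP INDEX ix_created_at;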

Finally, let's drop the remaining index and re-run to see how the query executes without any indexes:
DROP INDEX ix_user_id;

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..70481.09 rows=58 width=32) (actual time=0.046..5209.936 rows=63 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 2399941
Planning time: 1.445 ms
Execution time: 5210.030 ms
OK, without an index the query sucks. Indexes matter!

4. Attempt 2

Another solution is to create a hash index on the "user_id" column and rewrite the query to take advantage of it. This was my solution for the quiz. Let's try it! (Note: in PostgreSQL 9.x, hash indexes are not WAL-logged, so the manual discourages them in production; for this experiment that is fine.)
CREATE INDEX ix_user_id ON user_purchases USING hash (user_id);

EXPLAIN ANALYZE 
SELECT * FROM user_purchases WHERE user_id=123 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION
SELECT * FROM user_purchases WHERE user_id=456 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION
SELECT * FROM user_purchases WHERE user_id=789 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

HashAggregate  (cost=2621.87..2622.44 rows=57 width=32) (actual time=1.391..1.412 rows=63 loops=1)
  Group Key: user_purchases.user_id, user_purchases.item_id, user_purchases.referrer_name, user_purchases.created_at, user_purchases.updated_at
  ->  Append  (cost=9.77..2621.16 rows=57 width=32) (actual time=0.096..1.269 rows=63 loops=1)
        ->  Bitmap Heap Scan on user_purchases  (cost=9.77..873.53 rows=19 width=32) (actual time=0.096..0.433 rows=21 loops=1)
              Recheck Cond: (user_id = 123::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 221
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.079..0.079 rows=242 loops=1)
                    Index Cond: (user_id = 123::numeric)
        ->  Bitmap Heap Scan on user_purchases user_purchases_1  (cost=9.77..873.53 rows=19 width=32) (actual time=0.079..0.411 rows=21 loops=1)
              Recheck Cond: (user_id = 456::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 220
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.071..0.071 rows=241 loops=1)
                    Index Cond: (user_id = 456::numeric)
        ->  Bitmap Heap Scan on user_purchases user_purchases_2  (cost=9.77..873.53 rows=19 width=32) (actual time=0.075..0.410 rows=21 loops=1)
              Recheck Cond: (user_id = 789::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 220
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.065..0.065 rows=241 loops=1)
                    Index Cond: (user_id = 789::numeric)
Planning time: 0.496 ms
Execution time: 1.521 ms
In this case, the hash index was used, but producing the final result takes 3 separate SELECTs and 2 UNIONs (note that a hash index supports only equality comparisons, so it could never help with the created_at range anyway). The query is more complex, and its performance is actually a little... worse than the previous attempt :(. The difference is small, though, and depends on the underlying implementation of the database engine. At least it is still far more efficient than the query with no index.
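One small improvement worth trying (my own suggestion, not part of the quiz): since the three user_ids are distinct, the three branches can never produce overlapping rows, so UNION ALL, which skips the deduplication step (the HashAggregate node above), should be slightly cheaper. It also matches the semantics of the original query better, because a plain SELECT does not remove duplicate rows:
SELECT * FROM user_purchases WHERE user_id=123 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION ALL
SELECT * FROM user_purchases WHERE user_id=456 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION ALL
SELECT * FROM user_purchases WHERE user_id=789 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01');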

Another possible solution is to run the original query as-is with this hash index. We can see the performance is still reasonable:
EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Bitmap Heap Scan on user_purchases  (cost=29.30..2420.99 rows=58 width=32) (actual time=0.313..1.764 rows=63 loops=1)
  Recheck Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
  Heap Blocks: exact=10
  ->  Bitmap Index Scan on ix_user_id  (cost=0.00..29.29 rows=704 width=0) (actual time=0.288..0.288 rows=724 loops=1)
        Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
Planning time: 0.311 ms
Execution time: 1.823 ms

5. Attempt 3

Another solution, which one team proposed, is to use a compound (multicolumn) index. Let's try it now:
DROP INDEX ix_user_id;
CREATE INDEX ix_compound1 ON user_purchases(referrer_name, user_id, created_at);

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_compound1 on user_purchases  (cost=0.43..238.92 rows=58 width=32) (actual time=0.121..0.367 rows=63 loops=1)
  Index Cond: (((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])) AND (created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date))
Planning time: 0.356 ms
Execution time: 0.425 ms
Ohh, this seems to be the most efficient solution so far. One disadvantage of this index is that it requires more space to store and takes more time to build and maintain.
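If you are curious about the space cost, you can compare index sizes directly with the standard pg_relation_size function:
SELECT pg_size_pretty(pg_relation_size('ix_compound1'));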

Also note that the order of columns in the index matters. Let's change this order and re-run the query:
DROP INDEX ix_compound1;
CREATE INDEX ix_compound2 ON user_purchases(user_id, created_at, referrer_name);

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_compound2 on user_purchases  (cost=0.43..121.13 rows=58 width=32) (actual time=0.124..0.464 rows=63 loops=1)
  Index Cond: ((user_id = ANY ('{123,456,789}'::numeric[])) AND (created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
Planning time: 0.328 ms
Execution time: 0.521 ms
In this case, the performance is not much different, but you can try other column orders to see how the performance changes.
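The column order matters most when a query does not constrain the leading column of the index. For example (a sketch to try with ix_compound2, whose leading column is user_id), a query that filters only on referrer_name generally cannot use the B-tree efficiently and falls back to a sequential scan:
EXPLAIN SELECT * FROM user_purchases WHERE referrer_name = 'amazon';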

6. Conclusion

Using the approach discussed here, you can try other solutions and find out which ones perform well and which ones don't. In real life, query optimization is just one part of database tuning, and there are many use cases with different query patterns to consider. We have to balance them, and the right choice also depends on our data and the underlying system.

To learn more about database tuning, I recommend a very good textbook in the field: "Database Tuning: Principles, Experiments, and Troubleshooting Techniques" by Dennis Shasha and Philippe Bonnet.