Nam Nham's blog

Friday, November 27, 2015

SQL query optimization by examples

I recently had a database optimization quiz at Grokking Coding Challenge 2015. This is a basic and interesting problem. It reminds me my own days at school. I hope a write-up about this quiz will help us review our knowledge and learn some interesting things.

0. The problem

=== start question ===

You have a table with the following structure

-- postgresql-compatible SQL
-- each entry in user_purchases indicate that the user have bought an item
-- referrer_name have a few values: ‘amazon’, ‘facebook’, ‘friends-of-grokking-engineering’
CREATE TABLE user_purchases (
  user_id NUMERIC,
  item_id NUMERIC,
  referrer_name VARCHAR(255),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() 
);

You make a lot of query to find items that a user purchased, similar to this

SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01');

The query is slow, and running EXPLAIN returns:

EXPLAIN SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..3482332.76 rows=1 width=596)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))

What would you do to improve the performance of the query (make it run faster), independent of the server's infrastructure (CPU, memory, disk, etc)?

=== end question ===

Comment: The very first solution you can think of is using some index to make the query run faster. Simple! But how to use index efficiently needs some consideration. I can think of some solutions and decide to get my hands dirty with them.

1. Set up environment

Because I had never used PostgreSQL before, I have to install it on my MacBook to do some experiments. An easy way to try PostgreSQL on Mac OS is using Postgres.app. You can use some GUI tools to work with PostgreSQL easily.

Let's create the table and feed it some data:

CREATE TABLE user_purchases (
  user_id NUMERIC,
  item_id NUMERIC,
  referrer_name VARCHAR(255),
  created_at TIMESTAMP WITH TIME ZONE DEFAULT now(),
  updated_at TIMESTAMP WITH TIME ZONE DEFAULT now() 
);

INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (123,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (456,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (789,1,'amazon','2015-7-01');
INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (123,1,'google','2015-7-01');

2. The original query

Run the original query and analyze the output:

EXPLAIN SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..12.76 rows=1 width=596)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))

The output of EXPLAIN query tells us that the optimizer decided to use a sequential scan to execute the query. It estimates that it will cost 0.00 to return the first row, and 12.76 to return all the rows. It thinks there will be 1 row returned, and that the average width of each row will be 596 bytes.

But EXPLAIN doesn't actually run the query, it just runs the estimation. We will use EXPLAIN ANALYZE to see how the query runs in reality:

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..12.76 rows=1 width=596) (actual time=0.066..0.074 rows=3 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 1
Planning time: 1.191 ms
Execution time: 0.142 ms

We can see more information here: the actual time required to run the sequential scan step, the number of rows returned by that step, and the number of times those rows were looped through. We also have the total runtime for the query.

For the later experiments, we will use EXPLAIN ANALYZE to get the more correct stats. Read more about PostgreSQL EXPLAIN query here.

3. Attempt 1

The original answer proposed by the organizer team is creating index on 3 columns: user_id, item_id, and created_at. Let's try this solution:

CREATE INDEX ix_user_id ON user_purchases (user_id);
CREATE INDEX ix_item_id ON user_purchases (item_id);
CREATE INDEX ix_created_at ON user_purchases (created_at);

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..1.08 rows=1 width=596) (actual time=0.057..0.067 rows=3 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 1
Planning time: 1.175 ms
Execution time: 0.139 ms

You can see that the execution time of the query is NOT improved so much. Moreover, you can see the query still performed a sequential scan on the user_purchases table, and didn't consider the index. It's weird. Try to understand this situation, and I found an answer for this:

The reason why this is the case is that indexes have a cost to create and maintain (on writes) and use (on reads).

When an index is used in a SELECT query, first the position of the requested rows is fetched from the index (instead of from the table directly). Then, the requested rows (a subset of the total) are actually fetched from the table. It’s a two step process. Given a large enough number of requested rows (such as all the posts with an id greater than 10), it is actually more efficient to go fetch all those rows from the table directly rather than to go through the extra step of asking the index for the position of those rows and next fetch those specific rows from the table. If the percentage of the rows is smaller than 5-10% of all rows in the table the benefit of using the information stored in the index outweighs the additional intermediate step.

OK, our data is not big enough. Let's populate more data to the table:

#!/usr/bin/env python
import psycopg2

template = "INSERT INTO user_purchases(user_id,item_id,referrer_name,created_at) VALUES (%d, %d, '%s', '%s')"
conn = psycopg2.connect("dbname='namnx' user='namnx' host='localhost' password=''")
cursor = conn.cursor()
for user_id in xrange(1,10001):
    for item_id in xrange(1,11):
        for referrer in ('amazon', 'google', 'facebook', ''):
            for month in xrange(6, 9):
                for day in (1,15):
                    created_at = "2015-%d-%d" % (month,day)
                    stmt = template % (user_id, item_id, referrer, created_at)
                    print stmt
                    cursor.execute(stmt)

This script will insert 10000*10*4*3*2=2400000=(2.4M) records into the table. I think it's big enough to do some comparison. It'll take some time to run. Be patient.

When the data is ready, re-run the previous query and see the output:

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_user_id on user_purchases  (cost=0.43..39.90 rows=58 width=32) (actual time=0.125..1.869 rows=63 loops=1)
  Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
Planning time: 1.958 ms
Execution time: 1.997 ms

OK, you can see the indexes are in effect now. The output of EXPLAIN tells us that the index on "user_id" column was used. When the query is executed, this index is used first to find the rows satisfied the condition "user_id IN (123,456,789)". These rows then will be filtered by other conditions of the WHERE clause:

created_at >= date('2015-07-01') AND created_at < date('2015-08-01') AND referrer_name='amazon'

In this case, the index on "created_at" column was not used. This is because the index on "user_id" column had been applied first, and when executing a query, an index is used to find satisfied rows from the table, not from rows satisfied another index. So the index on "created_at" column is useless here.

In addition, the index we created on "item_id" column is also useless. In database schema design, we should not create an index if we will never use it because it will take space of the database, and it will slow-down the WRITE queries when we insert new records.

Just drop these useless indexes and re-run the query to see the result:

DROP INDEX ix_item_id;
DROP INDEX ix_created_at;

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_user_id on user_purchases  (cost=0.43..39.90 rows=58 width=32) (actual time=0.135..1.333 rows=63 loops=1)
  Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
Planning time: 0.364 ms
Execution time: 1.389 ms

The performance of query seems the same!

Something to note here. In some case, the index on "created_at" column will be applied. This is because the query optimizer will prioritize the index which will eliminate as much rows as possible. In PostgreSQL, the default index type is B-tree index, so it is suitable for range queries like "created_at >= date('2015-07-01') AND created_at < date('2015-08-01')". So if this index will eliminate more rows than the former, it will be used. But in our use case, the former index will eliminate much more rows than the latter.

Let's try to remove all the indexes and re-run to see how the query is executed without using indexes:

DROP INDEX ix_user_id;

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Seq Scan on user_purchases  (cost=0.00..70481.09 rows=58 width=32) (actual time=0.046..5209.936 rows=63 loops=1)
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])))
  Rows Removed by Filter: 2399941
Planning time: 1.445 ms
Execution time: 5210.030 ms

OK, the query sucks and index matters!

4. Attempt 2

Another solution is created a hash index on "user_id" column and re-write the query to utilize this hash index. This is my solution for this quiz. Let's try it!

CREATE INDEX ix_user_id ON user_purchases USING hash (user_id);

EXPLAIN ANALYZE 
SELECT * FROM user_purchases WHERE user_id=123 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION
SELECT * FROM user_purchases WHERE user_id=456 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')
UNION
SELECT * FROM user_purchases WHERE user_id=456 AND referrer_name = 'amazon' AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

HashAggregate  (cost=2621.87..2622.44 rows=57 width=32) (actual time=1.391..1.412 rows=42 loops=1)
  Group Key: user_purchases.user_id, user_purchases.item_id, user_purchases.referrer_name, user_purchases.created_at, user_purchases.updated_at
  ->  Append  (cost=9.77..2621.16 rows=57 width=32) (actual time=0.096..1.269 rows=63 loops=1)
        ->  Bitmap Heap Scan on user_purchases  (cost=9.77..873.53 rows=19 width=32) (actual time=0.096..0.433 rows=21 loops=1)
              Recheck Cond: (user_id = 123::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 221
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.079..0.079 rows=242 loops=1)
                    Index Cond: (user_id = 123::numeric)
        ->  Bitmap Heap Scan on user_purchases user_purchases_1  (cost=9.77..873.53 rows=19 width=32) (actual time=0.079..0.411 rows=21 loops=1)
              Recheck Cond: (user_id = 456::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 220
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.071..0.071 rows=241 loops=1)
                    Index Cond: (user_id = 456::numeric)
        ->  Bitmap Heap Scan on user_purchases user_purchases_2  (cost=9.77..873.53 rows=19 width=32) (actual time=0.075..0.410 rows=21 loops=1)
              Recheck Cond: (user_id = 456::numeric)
              Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
              Rows Removed by Filter: 220
              Heap Blocks: exact=4
              ->  Bitmap Index Scan on ix_user_id  (cost=0.00..9.76 rows=235 width=0) (actual time=0.065..0.065 rows=241 loops=1)
                    Index Cond: (user_id = 456::numeric)
Planning time: 0.496 ms
Execution time: 1.521 ms

In this case, the hash index was used, but in order to get the final result, 3 separate SELECTs and 2 UNIONs are required. This query is more complex and its performance seems a little... worse than the previous :(. But the difference is not much and it depends on the underlying implementation of the database engine. At least, it is still much more efficient than the no-index queries.

Another possible solution is run the original query with this hash index. We can see the performance is still reasonable:

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Bitmap Heap Scan on user_purchases  (cost=29.30..2420.99 rows=58 width=32) (actual time=0.313..1.764 rows=63 loops=1)
  Recheck Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
  Filter: ((created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
  Rows Removed by Filter: 661
  Heap Blocks: exact=10
  ->  Bitmap Index Scan on ix_user_id  (cost=0.00..29.29 rows=704 width=0) (actual time=0.288..0.288 rows=724 loops=1)
        Index Cond: (user_id = ANY ('{123,456,789}'::numeric[]))
Planning time: 0.311 ms
Execution time: 1.823 ms

5. Attempt 3

Another solution is to use compound (multicolumn) index as one team proposed. Let's try it now:

DROP INDEX ix_user_id;
CREATE INDEX ix_compound1 ON user_purchases(referrer_name, user_id, created_at);

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_compound1 on user_purchases  (cost=0.43..238.92 rows=58 width=32) (actual time=0.121..0.367 rows=63 loops=1)
  Index Cond: (((referrer_name)::text = 'amazon'::text) AND (user_id = ANY ('{123,456,789}'::numeric[])) AND (created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date))
Planning time: 0.356 ms
Execution time: 0.425 ms

Ohh, this seems the most efficient solution so far. But one disadvantage of this index is it requires more space to store the index and it takes more time to build.

Also note that the order of columns in the index matters. Let's try to change this order and re-run the query:

DROP INDEX ix_compound1;
CREATE INDEX ix_compound2 ON user_purchases(user_id, created_at, referrer_name);

EXPLAIN ANALYZE SELECT * FROM user_purchases WHERE referrer_name = 'amazon' AND user_id IN (123, 456, 789) AND created_at >= date('2015-07-01') AND created_at < date('2015-08-01')

Index Scan using ix_compound2 on user_purchases  (cost=0.43..121.13 rows=58 width=32) (actual time=0.124..0.464 rows=63 loops=1)
  Index Cond: ((user_id = ANY ('{123,456,789}'::numeric[])) AND (created_at >= '2015-07-01'::date) AND (created_at < '2015-08-01'::date) AND ((referrer_name)::text = 'amazon'::text))
Planning time: 0.328 ms
Execution time: 0.521 ms

In this case, the performance is not much different, but you can use other orders to see how the performance changes.

6. Conclusion

Use the way we discuss here, you can try other solutions to find which ones perform well and which ones not. In real-life, query optimization is just a part of database tuning and there are many use cases with different queries for us to consider. We have to balance these queries. It also depends on our data and underlying system.

To learn more about database tuning, I recommend a very good text book in the field: "Database Tuning: Principles, Experiments, and Troubleshooting Techniques".

Sunday, January 15, 2012

Mua hàng không mất tiền - Lỗ hổng trong một số cổng thanh toán trực tuyến ở Việt Nam

Đây là bài tham luận tôi đã trình bày tại TetCon 2012. Nội dung của bài này sẽ được đăng lên trang chủ của hội thảo. Tôi post lại ở đây để làm kỉ niệm.

Lỗ hổng này tôi đã tình cờ phát hiện được trong khi đang phát triển một dịch vụ có sử dụng API của một cổng thanh toán trực tuyến. Lỗ hổng này về mặt kỹ thuật thì không có gì mới và cũng không phức tạp cho lắm. Tôi cũng khá ngạc nhiên là lỗ hổng này được lập lại ở nhiều cổng thanh toán trực tuyến khác nhau. Vì một số lý do nên tôi không nêu tên cụ thể các cổng thanh toán trực tuyến, nhưng nếu bạn có làm việc trực tiếp với họ thì bạn sẽ biết được tên các cổng thanh toán trực tuyến này. Do thời gian có hạn nên tôi đã không tìm hiểu hết các cổng thanh toán trực tuyến đang hoạt động ở Việt Nam. Tuy nhiên dựa vào nội dung của bài tham luận này bạn có thể biết được cổng thanh toán mà bạn đang phát triển có nằm trong trường hợp này hay không.

----------

Tóm tắt

Thanh toán trực tuyến là một trong những loại hình dịch vụ phát triển mạnh nhất ở Việt Nam trong vài năm vừa qua. Bây giờ chỉ với một chiếc thẻ nội địa hay thẻ tín dụng, bạn có thể mua và thanh toán được rất nhiều mặt hàng khác nhau, từ quần áo, sách vở, vé xem film cho đến vé máy bay. Có thật sự là phải cần có thẻ mới mua được hàng? Trong bài tham luận này, chúng tôi sẽ trình diễn một cách mua hàng không mất tiền, bằng cách khai thác lỗ hổng an ninh trong một vài hệ thống thanh toán trực tuyến phổ biến ở Việt Nam. Những lỗ hổng này đã được thông báo đến các bên liên quan trước khi bài tham luận này được gửi đến TetCon 2012.

1. Giới thiệu

Cùng với sự phát triển của thương mại điện tử, sự ra đời của các cổng thanh toán trực tuyến là một nhu cầu tất yếu. Cổng thanh toán đóng vai trò quan trọng trong việc đem đến sự giao dịch thuận tiện giữa người mua và người bán. Ngoài việc giúp cho giao dịch giữa người bán và người mua được tiến hành thuận tiện, việc đảm bảo an toàn cho các giao dịch này là một yêu cầu bắt buộc.

Trong bài tham luận này, chúng tôi sẽ tập trung phân tích một cơ chế hoạt động phổ biến trong các cổng thanh toán trực tuyến hiện nay ở Việt Nam. Trong cơ chế này, 3 thành phần cơ bản tham gia vào quá trình thanh toán gồm có người mua (customer), địa chỉ bán hàng (merchant's site) và cổng thanh toán (payment gateway). Dòng chảy dữ liệu giữa các thành phần này được mô tả như trong hình 1.

Hình 1. Dòng chảy dữ liệu của quá trình mua bán thông qua cổng thanh toán

Sau khi người mua vào website của người bán chọn lựa mặt hàng cần mua và đồng ý thực hiện việc thanh toán, quá trình thanh toán sẽ diễn ra như sau:

Bước (1): Trình duyệt của người mua hàng sẽ gởi yêu cầu thanh toán về máy chủ trang web bán hàng. Thông tin được gởi về có thể bao gồm: mã đơn hàng, danh sách các mặt hàng, tổng số tiền cần thanh toán,...

Bước (2): Ứng dụng web của người bán hàng sẽ xử lý đơn hàng và sẽ trả kết quả về cho người mua hàng dưới dạng một yêu cầu chuyển hướng HTTP (HTTP redirect). Đích đến của yêu cầu chuyển hướng này là địa chỉ của cổng thanh toán. Thông tin trong yêu cầu chuyển hướng có thể bao gồm định danh của đơn vị bán hàng, mã đơn hàng, số tiền cần thanh toán, địa chỉ trả về kết quả nếu việc thanh toán thành công,... Ngoài ra, để đảm bảo cho cổng thanh toán xác định được rằng yêu cầu gởi đến này được tạo ra từ ứng dụng của người bán hàng, dữ liệu được gởi đi bao gồm một chữ ký của bên bán hàng trên các thông tin còn lại. Chúng tôi sẽ mô tả về các tạo ra chữ ký này trong phần tiếp theo.

Bước (3): Trình duyệt của người mua hàng sau khi nhận được yêu cầu chuyển hướng từ phía người bán sẽ tự động chuyển đến trang web của cổng thanh toán. Ứng dụng web của cổng thanh toán sẽ tiếp nhận yêu cầu này và bắt buộc người mua hàng phải thực hiện các thao tác trên cổng thanh toán như đăng nhập, điền thông tin thẻ tín dụng hoặc ATM, xác nhận đơn hàng,...

Bước (4): Sau khi người mua hàng thực hiện thành công các bước theo đúng quy trình của cổng thanh toán, phía cổng thanh toán sẽ trả về kết quả cho người mua hàng dưới dạng một yêu cầu chuyển hướng HTTP. Đích đến của yêu cầu chuyển hướng này là địa chỉ máy chủ của trang web bán hàng (địa chỉ này được gởi đến cổng thanh toán như ở bước 2). Thông tin đi kèm trong yêu cầu chuyển hướng này dùng để xác nhận với trang web bán hàng rằng người mua đã thanh toán thành công. Các thông tin này cũng được đi kèm với một chữ ký như ở trong bước 2.

Bước (5): Trình duyệt của người mua hàng sẽ tự động chuyển hướng trang web bán hàng sau khi nhận yêu cầu chuyển hướng từ cổng thanh toán như trong bước 4.

Bước (6): Trang web bán hàng sẽ kiểm tra thông tin được gởi đến. Nếu thông tin này có nội dung mô tả đúng như khi khách hàng đã thanh toán đơn hàng và có chữ ký hợp lệ thì giao dịch được thực hiện thành công.

Ưu điểm của quy trình này là ngắn gọn và đơn giản. Nó đem lại sự thuận tiện cho người dùng. Tuy nhiên việc hiện thực nó không đúng có thể tạo ra một lỗ hổng nghiêm trọng giúp cho một người có thể thực hiện mua hàng mà không cần phải thanh toán tiền. Trong bài tham luận này, chúng tôi sẽ phân tích lỗ hổng trong một cách hiện thực của của cơ chế thanh toán này. Lỗ hổng này có thể khai thác được và được tìm thấy trong một số cổng thanh toán trực tuyến phổ biến ở Việt Nam.

2. Một cơ chế tạo chữ ký không an toàn

Trong mô hình thanh toán được mô tả như phần trên, để có thể xác thực được các thông điệp được gởi qua lại, website bán hàng thường được cổng thanh toán cung cấp cho một mã định danh (merchant code) và một khóa chia sẻ chung bí mật (shared secret key). Khóa bí mật này dùng để xác thực các thông điệp được trao đổi giữa các bên.

Trong một số cổng thanh toán chúng tôi đã tiến hành khảo sát, quá trình tạo ra chữ ký cho các thông điệp được thực hiện như sau:

Sắp xếp các thông số được truyền trong thông điệp theo thứ tự bảng chữ cái dựa trên tên của thông số. Ví dụ: order_id=123, merchant_id=1230, amount=100000 được sắp xếp thành amount=100000, merchant_id=1230, order_id=123.
Kết hợp khóa bí mật và giá trị của các thông số đã được sắp xếp thành một chuỗi ký tự. Ví dụ: SECRET1000001230123.
Tính giá trị băm của chuỗi này bằng thuật toán MD5 (hoặc SHA1)
Thêm giá trị băm này vào danh sách các tham số được gởi đi như là một chữ ký xác thực. Ví dụ: signature=7BD70B91F6F026FBFFFA552A7D59B5BD

Bên nhận được thông điệp cũng thực hiện lại quá trình này nhưng sử dụng khóa bí mật được lưu trữ tại hệ thống của mình trong bước 2. Nếu chữ ký được tạo ra giống với chữ ký được gởi đến thì thông điệp được xem là hợp lệ và tiếp tục được xử lý như một giao dịch bình thường.

Việc tạo ra chữ ký theo phương pháp này có thể bị tấn công bằng một số cách sau:

Do không có dấu phân cách giữa các giá trị và tên của tham số không được sử dụng trong việc tạo ra chữ ký điện tử, nên chữ ký được tao ra cho danh sách tham số amount=100000&merchant_id=1230&order_id=123 sẽ giống với chữ ký được tạo ra cho amount=100000&order_id=123&status=0&tranx_id=123.
Do MD5 có thể bị tấn công bằng phương pháp mở rộng chiều dài (length-extension attack), một người có thể chèn thêm dữ liệu bất kỳ vào cuối thông điệp mà có thể tạo ra chữ ký xác thực hợp lệ mà không cần phải biết khóa bí mật. Chúng tôi sẽ nói về cách tấn công này trong phần tiếp theo.

Trong một số cổng thanh toán trực tuyến mà chúng tôi tiến hành khảo sát, một số có thể bị tấn công bằng cách thứ nhất, một số có thể bị tấn công bằng phương pháp thứ hai. Bằng cách khai thác được một trong hai cách này, một người có thể làm giả một thông điệp được gởi từ cổng thanh toán trực tuyến xác nhận rằng người mua đã thanh toán tiền và gởi thông điệp giả mạo này về cho website bán hàng. Và như vậy, người này có thể mua hàng thành công mà không cần phải thông qua cổng thanh toán trực tuyến và không cần phải mất tiền.

3. Tấn công mở rộng chiều dài trên MD5 (Length-Extension Attack)

Tấn công mở rộng chiều dài là một loại tấn công khá phổ biến đối với các hàm băm được xây dựng trên cấu trúc lặp Merkle-Damgård [1] như MD0-MD5 và SHA0-SHA2. Với phương pháp tấn công này, một người có thể dựa trên giá trị băm h(m) và chiều dài len(m) của thông điệp m để tính giá trị băm h(m||pad(m)||m') với bất kỳ thông điệp m' nào. Tham khảo [2] có mô tả về dạng tấn cống này trên API của một dịch vụ chia sẻ ảnh khá phổ biến là Flickr.

Trong các hàm băm được xây dựng trên cấu trục lặp Merkle-Damgård, thông điệp đầu vào được chia ra thành một chuỗi các khối (block) có kích thước bằng nhau và được xử lý theo thứ tự bằng một hàm nén một chiều (one-way compression function). Trong thuật toán MD5, hàm nén nhận vào 2 giá trị - một giá trị móc nối (chaining value) có kích thước 128-bit và một khối (block) thông điệp có kích thước 512-bit - và tạo ra một giá trị móc nối 128-bit khác được sử dụng làm đầu vào cho bước lặp tiếp theo. Thông điệp ban đầu được đệm (padded) để có kích thước là một bội số của 512-bit và được chia thành một chuỗi các khối có kích thước 512-bit. Hàm nén được tính một cách lặp lại với một giá trị móc nối ban đầu và khối thông điệp đầu tiên. Sau khi khối thông điệp cuối cùng được xử lý, giá trị móc nối cuối cùng là kết quả hàm băm của thông điệp ban đầu.

Hình 2. Cấu trúc lặp Merkle-Damgård (sao chép từ Wikipedia)

Do cấu trúc lặp của thuật toán, một người có thể chỉ dựa vào giá trị băm và chiều dài của một thông điệp để tính giá trị băm của một thông điệp dài hơn được bắt đầu bằng thông điệp ban đầu, bao gồm cả giá trị đệm đã được thêm vào để cho thông điệp ban đầu có kích thước là một bội số của 512-bit (hình 3). Vấn đề này đã được mô tả trong [3].

Hình 3. Tấn công mở rộng chiều dài (sao chép từ [4])

Đối với cơ chế tạo chữ ký như đã được mô tả trong phần 2, chúng ta có thể sử dụng phương pháp tấn công mở rộng chiều dài để tính chữ ký của bất kỳ thông điệp m' nào với m' bắt đầu bằng m||p, với p là giá trị được đệm vào SECRET||p trong cấu trúc lặp Merkle-Damgård.

signature1 = MD5(SECRET||m)

signature2 = MD5(SECRET||m||p||x)

Ví dụ, dựa trên thông điệp yêu cầu chuyển hướng được gởi từ website bán hàng về cho khách hàng (như trong bước 2 của hình 1), một người có thể làm giả một thông điệp xác nhận thanh toán thành công được gởi về từ cổng thanh toán (như trong bước 4 của hình 1), và gởi thông điệp này đến website bán hàng để hoàn tất việc mua hàng.

response1 = “amount=100000&merchant_id=1230&order_id=567&signature=[sig1]”
m1 = “1000001230567"
sig1 = MD5(SECRET || m1)
x = 10000056701156
m2 = m1 || p1 || x
sig2 = MD5(SECRET || m2)
response2 = “aaa=[m1||p1]&amount=100000&order_id=567&status=0&tranx_id=1156”

4. Phương án vá lỗi đề nghị

Do bản chất của lỗ hổng này là như nhau nên chúng tôi đề xuất phương án vá lỗi được mô tả trong phần 5.5 của tham khảo [2]. Bạn đọc quan tâm có thể tìm hiểu thêm trong tài liệu này.

5. Kết luận

Trong bài tham luận này chúng tôi đã mô tả về một lỗ hổng mật mã xuất hiện trong một số cổng thanh toán trực tuyến ở Việt Nam. Bằng việc khai thác lỗ hổng này, một người có thể mua hàng nhưng không cần phải trả tiền. Điều này có thể gây tổn hại rất lớn đến cổng thanh toán trực tuyến cũng như các website bán hàng có liên quan. Do tính chất nhạy cảm của vấn đề, chúng tôi quyết định không nêu tên cụ thể của cổng thanh toán và chi tiết về cơ chế hoạt động của các cổng thanh toán này trong bài viết.

Lỗ hổng được trình bày trong bài viết đã được biết đến từ rất lâu trong cộng đồng làm về mật mã học (xem [5]). Tuy nhiên sau nhiều năm, vẫn còn rất nhiều website gặp phải lỗ hổng này. Trên thế giới, lỗ hổng này được tìm thấy ở một dịch vụ chia sẻ ảnh phổ biến là Flickr vào năm 2009 (xem [2]). Và một cách tình cờ, chúng tôi tìm thấy được lỗ hổng này xuất hiện trong một số cổng thanh toán trực tuyến phổ biến ở Việt Nam. Do không có đủ thời gian và nguồn lực, chúng tôi đã không kiểm tra trường hợp này đối với tất cả các cổng thanh toán cũng như các dịch vụ khác. Nhưng qua bài viết này, chúng tôi hi vọng rằng các đơn vị đang phát triển dịch vụ trực tuyến ở Việt Nam có cái nhìn cẩn trọng hơn trong việc hiện thực các chức năng liên quan đến mật mã học nhằm đem đến những sản phẩm tốt và an toàn cho người dùng.

6. Tham khảo
[1] http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction
[2] Thai Duong, Juliano Rizzo. Flickr’s API Signature Forgery Vulnerability, September 2009.
[3] B. Kaliski and M. Robshaw. Message Authentication with MD5. RSA Labs' CryptoBytes, Vol. 1 No. 1, Spring 1995.
[4] H. Travis. Web 2.0 Cryptology, A Study in Failure. OWASP, 2009. Retrieved September 13, 2009, from http://www.subspacefield.org/security/web_20_crypto.pdf.
[5] G. Tsudik. Message authentication with one-way hash functions. ACM Computer Communications Review, 22(5):29–38, 1992.

Wednesday, March 17, 2010

CodeGate 2010 CTF - Challenge 8: Bit-flipping attack on CBC mode

This is a web-based cryptography challenge. In this challenge, we were provided a URL and a hint "the first part is just an IV".
The URL is: http://ctf1.codegate.org/99b5f49189e5a688492f13b418474e7e/web4.php.

Analysis

Go to the challenge URL. It will ask you the username for the first time. After we enter a value, for example 'namnx', it will return only a single message "Hello, namnx!". Examine the HTTP payload, we will see the cookie returned:

web4_auth=1vf2EJ15hKzkIxqB27w0AA==|5X5A0e3r48gXhUXZHEKBa5dpC+XfdVv4oamlriyi5yM=

The cookie includes 2 parts delimited by character '|'. After base64 decode the first part of the cookie, we have a 16-byte value. According to the hint, this is the IV of the cipher. And because it has 16-byte length, I guess that this challenge used AES cipher, and the block size is 16 bytes. Moreover, the cipher has an IV, so it can't be in ECB mode. I guessed it in CBC mode. The last part is the base64 of a 32-byte value. This is a cipher text. We will exploit this value later.

Browse the URL again, we will receive another message: "Welcome back, namnx! Your role is: user. You need admin role." Take a look into this message, we can guess the operation of this app: it will receive the cookie from the client, decrypt it to get the user and role information and return the message to the client based on the user and role information. So, in order to get further information, we must have the admin role. This is our goal in this challenge.

Exploit
I wrote some Python to work on this challenge easier:

import urllib, urllib2
import base64, re

url = 'http://ctf1.codegate.org/99b5f49189e5a688492f13b418474e7e/web4.php'
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'

def get_cookie(user):
    headers = { 'User-Agent' : user_agent}
    values = {'username' : user, 'submit' : "Submit"}
    data = urllib.urlencode(values)
    request = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(request)
    cookie = response.info().get('Set-Cookie')
    groups = re.match("web4_auth=(.+)\|(.+);.+", cookie).groups()
    iv = base64.b64decode(groups[0])
    cipher = base64.b64decode(groups[1])
    return iv, cipher

def get_message(iv, cipher):
    cookie = base64.b64encode(iv) + '|' + base64.b64encode(cipher)
    cookie = urllib.quote(cookie)
    cookie = 'web4_auth=' + cookie
    headers = { 'User-Agent' : user_agent, 'Cookie': cookie}
    request = urllib2.Request(url, None, headers)
    response = urllib2.urlopen(request)
    data = response.read()
    print repr(data)
    groups = re.match(".+, (.*)! .+: (.*)\. You.+", data).groups()
    return groups[0], groups[1]

The first function, get_cookie will submit a value as a username in the first visit to the page, get the returned cookie, and then parse it to get the IV and cipher. The second function, get_message, do the task like when you visit the page in later times, it parses the response message to get the returned username and role.

Let do some test cases:

>>> iv, cipher = get_cookie('123456789012')
>>> len(cipher)
32
>>> iv, cipher = get_cookie('1234567890123')
>>> len(cipher)
48

When you input the user with a 12-byte value, the returned cipher will have 32 bytes (2 blocks). And when you enter a 13-byte value, the cipher will have 48 bytes (3 blocks). This means that beside the username value, the plain text of the cipher will be added more 20 bytes.

Try altering the cipher text to see how it is decrypted:

>>> iv, cipher = get_cookie('1234567890')
>>> cipher1 = cipher[:-1] + '\00'
>>> username, role = get_message(iv, cipher1)
'Welcome back, 1234567\xa2\xc2\xca\xfei\xdb\xee_c\xa7\xd7\x0c\xa9j\xe0\xbb! Your role is: . You need admin role.'

When we altered the last block of the cipher, just first 7 characters of the username is decrypted correctly. So the format of the plain text maybe like this: [9 bytes] + username + [11 bytes].

Now we'll try to determine the first 9 bytes of the plain text:

>>> iv, cipher = get_cookie('1234567890123456')
>>> cipher1 = cipher[:32] + iv + cipher[:16]
>>> username, role = get_message(iv, cipher1)
'Welcome back, 1234567890123456! Your role is: E\x9bpY?\xfbW6\x84{\x8fn\x1e\x80\x10\x1busername=1234567. You need admin role.'

As you can see, the last block of the decrypted role is the first block of the plain text. So, the format of the plain text may be: 'username=' + username + [11 bytes].

To here, we can guess that the format of the plain text can be something like:

'username=' + username + [delimiter] + [param] + '=' + [value]

The last 11 bytes of the plain text can be determined by the code below:

>>> iv, cipher = get_cookie('\x00')
>>> username, role = get_message(iv, cipher)
'Welcome back, \x00##role=user\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00! Your role is: . You need admin role.'

You can see the last 11 bytes of the plain text in the returned message. So, at this time, we can conclude format of the plain text is:

'username=' + username + '##role=' + role

Now, the last thing we have to do is altering the role value to 'admin'. Because we've already known the format of the plain text, we can choose to input the username close to the target plain text and try to alter the cipher text in the way that the decrypted value is what we want.

Let remind the operation of CBC mode in cryptographic ciphers. In encryption process:

y[1] = C(IV xor x[1])
y[n] = C(y[n-1] xor x[n])

and in the decryption:

x[1] = D(y[1]) xor IV
x[n] = D(y[n]) xor y[n-1]

Notice that if we flip one bit in the (n-1)th block of cipher text, the respective bit in the n-th block of plain text will be also flipped. So, we will you this fact to exploit the challenge:

>>> iv, cipher = get_cookie('012345678901234567890123#role=admin')
>>> s = cipher[:16] + chr(ord(cipher[16]) ^ 0x10) + cipher[17:]
>>> username, role = get_message(iv, s)
'Welcome back, 0123456L\xaa\x17m\xe9\x91\xdc\xe2`#z)\xd8m\xd8\x18! Your role is: admin. You need admin role. Congratulations! Here is your flag: the_magic_words_are_squeamish_ossifrage_^-^!!!!!'

Successful! Such an interesting challenge, isn't it?

References
- Block cipher modes of operation: http://en.wikipedia.org/wiki/Block_cipher_modes_of_operation
- Bit-flipping attack: http://en.wikipedia.org/wiki/Bit-flipping_attack

Tuesday, March 16, 2010

CodeGate 2010 CTF - Challenge 7: Weak SSL Cracking

Last weekend, I had a great hacking time with team CLGT in the CodeGate 2010 CTF Preliminary Round. It lasted 36 consecutive hours from 7:00AM March 13 to 7:00PM March 14. There were a lot of teams around the world participating in this hacking contest. And excellently, CLGT proved it as one of the best teams when got the 2nd place in this round. See final ranking.

This entry is my writeup for challenge 7. I think this is an interesting challenge from which you can learn more deeply about SSL protocol and public key cryptography. In this challenge, we were provided a tcpdump file of a SSL traffic and a hint "does the modulus look familiar?". So our goal is to analyze and decrypt this captured traffic to get the flag.

Analysis
Firstly, I used Wireshark to load this file and start to analyze it:

There are 26 packets captured. Packet #4 is a SSL Client Hello packet, but after it, packet #8 and packet #9 have FIN flag. This mean that the session was termininated. So we just ignore them.

Packet #14 is another SSL Client Hello packet. This is where the real session began. Take a look into it:

There is nothing special. It is just a normal SSL Client Hello packet. It happens when a client want to connect to a SSL service. We continue look into the packet #16, the SSL Server Hello packet:

This is the response for SSL Client Hello packet. We can see some useful information here:

- The cipher suite will be used: RSA_WITH_AES_256_CBC_SHA

- The X509 certificate of the server

In the SSL protocol, the server send its certificate to the client in the handshaking process. This certificate will be used for supporting the key exchange afterward. The certificate contains the server's public key and other data. By extracting the public key and recovering the private key from it, we can decrypt the SSL traffic.

Exploit

I wrote some Python code to exploit this challange:

from scapy.all import *
from M2Crypto import X509

def decode_serverhello(packet):
    payload = packet.load
    cert = payload[94:1141]
    cert = X509.load_cert_string(cert, 0)
    return cert

def get_pubkey(cert):
    pubkey = cert.get_pubkey().get_rsa()
    n = long(pubkey.n.encode('hex')[8:], 16)
    e = long(pubkey.e.encode('hex')[9:], 16)
    return n, e

packets = rdpcap('ssl.pcap')
cert = decode_serverhello(packets[15])
n,e = get_pubkey(cert)

Because this traffic used RSA as public key algorithm, the public key contains 2 components: n and e. We get their values from the above code:

n = 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413
e = 65537

In RSA, n is the product of 2 big prime numbers p and q. So, in order to recover the RSA private key from the public key, we must factorize n into p and q. This is the key point of the challenge. In this situation, n is a very big number (232 decimal digits). How can we do that? In the beginning, I didn't know how to solve it. But I remembered the hint "does the modulus look familiar?". So I tried googling it :-D (actually just its last digits). And... oh my god, I was lucky! It is RSA-768. It's factorized just few months ago.

RSA-768 = 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489
        × 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917

So now, we have all components of the RSA keys.

n = 1230186684530117755130494958384962720772853569595334792197322452151726400507263657518745202199786469389956474942774063845925192557326303453731548268507917026122142913461670429214311602221240479274737794080665351419597459856902143413

e = 65537

p = 33478071698956898786044169848212690817704794983713768568912431388982883793878002287614711652531743087737814467999489

q = 36746043666799590428244633799627952632279158164343087642676032283815739666511279233373417143396810270092798736308917

d = 703813872109751212728960868893055483396831478279095442779477323396386489876250832944220079595968592852532432488202250497425262918616760886811596907743384527001944888359578241816763079495533278518938372814827410628647251148091159553

The last thing we have to do is generating the RSA private key in PEM format from these components. But how can we do that? As far as I know, popular cryptographic libraries like OpenSSL do not support this. So in this case, I wrote my own tool to do this task. It is based on ASN1. It is a little long to post here. But if you want to write your own one, I recommend pyasn1.

After having the private key, just import it to Wireshark to decrypt the SSL traffic:

References
- SSL/TLS: http://en.wikipedia.org/wiki/Transport_Layer_Security
- RSA: http://en.wikipedia.org/wiki/RSA

Monday, November 16, 2009

Hello World!

Every morning in Africa, a gazelle wakes up.
It knows it must run faster than the fastest lion,
...or it will be killed.

Every morning a lion wakes up.
It knows it must outrun the slowest gazelle,
...or it will starve to death.

It doesn't matter whether you are a lion or a gazelle.
When the sun comes up, you better start running.

(The World is Flat - Thomas L. Friedman)