Sunday, October 12, 2014

Chunky Outperforms Chatty

This repo on github contains Java code and results from a performance comparison of 5 different SQL data access strategies, ranging from very chatty to very chunky.
  • Chatty = many SQL invocations that touch relatively few records.
  • Chunky = fewer SQL invocations that touch more records.
The two chunkiest strategies (1 & 2) outperform the rest, with roughly 30% or more tps.  Surprisingly, though, these two high-performance strategies don't seem to be real popular.  As a performance engineer, my world has been covered up with a lot of chatty for the last 8 years.

But I'm not alone, being fan of the chunky -- consider Martin Fowler, who has written about ORM (object relational mapping) extensively.  He also has two books on the 'most influential programmer book,' list.  He was preaching chunky way back in 2003:
"Try to pull back multiple rows at once. In particular, 
never do repeated queries on the same table to get multiple rows."
...and others are adding to the chunky chorus ( here here here ).   My post here focuses on the code, the data and the performance.  Fowler did a similar comparison but focused on code style and where to put business logic.  But Fowler's test was a bit extreme -- he worked with 1,000 db records to show the poor performance of chunky.  These tests show the problem by selecting just 100 records.

But keep in mind that performance isn't everything:
"Any fool can write code that a computer can understand.  
Good programmers write code that humans can understand."

Five Strategies

The throughput of five different implementations of the same XML-over-HTTP web service was tested.  The web service is a simple account and transaction inquiry to the Postgres pgbench db.
The following graph shows throughput in red (tps / higher is better) and the strategy number in blue. See how the blue line rises like steps? The test ran each strategy for 1 minute, then moved on to the next larger strategy number: 1,2,3,4,5 and then it repeated 8 more times. Each request inquired upon 5 differnt accounts.  In the graph below, strategy=scenario.
Inquiry for 5 accountIds
The requirements for the web service:
Given 1 to N account numbers, return transaction history and balance info 
for all accounts.  All 5 strategies must build the response data using 
the exact same pojos and the same XML serialization code.
Stragey 1 has the fewest SQL invocations, 5 has the most (5 is a little extreme, actually), and 2, 3 and 4 line up inbetween (yes, in order).
TablesPGBENCH_ACCOUNTSPGBENCH_HISTORYNotes
Strategy_11 SELECT, OUTER JOIN to PGBENCH_HISTORYThe Chunkiest of the 5. Rarely seen in the wild
Strategy_21 SELECT1 SELECTRarely seen in the wild
Strategy_31 SELECT1 SELECT per account
Strategy_41 SELECT per account1 SELECT per account
Strategy_51 SELECT per account1 SELECT PER account to retrieve unique IDs. 1 SELECT for each full history record.The Chattiest of the 5.

Validation

Any one interested in checking my facts in the table above? I know, you're busy, which is part of why production bound code/sql rarely gets a full vetting. Instead of guessing, let's just look at what happened with each strategy -- graphically -- using wuqiSpank.

SQL

The following are the SQL executed for one invocation of each strategy.  The same account numbers (and all their corresponding history records) were used for each strategy.

Strategy 1:

A single query:
SELECT a.aid, a.bid, a.abalance, a.filler, h.tid, h.hid, h.delta, h.mtime, 
h.filler FROM pgbench_accounts a LEFT OUTER JOIN pgbench_history h 
ON a.aid = h.aid WHERE a.aid in (?,?,?,?,?) ORDER BY a.aid, h.mtime desc

Strategy 2:

2 queries:
SELECT a.aid, a.bid, a.abalance, a.filler FROM pgbench_accounts a 
WHERE aid in (?,?,?,?,?)
SELECT tid , hid, bid , aid , delta , mtime , filler 
FROM pgbench_history WHERE aid in (?,?,?,?,?) ORDER BY aid, hid

Strategy 3:


6 queries:
( 1x) SELECT a.aid, a.bid, a.abalance, a.filler 
      FROM pgbench_accounts a WHERE aid in (?,?,?,?,?)
( 5x) SELECT tid, hid, bid, aid, delta, mtime, filler 
      from pgbench_history WHERE aid = ?

Strategy 4:

The following two queries were retrieved 5 times each, for a total of 10 queries:
SELECT aid, bid, abalance, filler from pgbench_accounts WHERE aid = ?
SELECT tid, hid, bid, aid, delta, mtime, filler from pgbench_history 
  WHERE aid = ?

Strategy 5:

The following 25 queries were executed 5 times for a total of about 125 queries.
( 1x)  SELECT aid, bid, abalance, filler from pgbench_accounts WHERE aid = ?
(24x) SELECT hid from pgbench_history WHERE aid = ?

Instructions

  1. Install PostGreSQL. I used 9.2
  2. Load pgbench sample data as detailed below.
  3. Add a seuquence / primary key to the pgbench_history table.
  4. Download the war file and unzipt it to a blank folder.
  5. Make sure the JDBC connection info is right in this file: ./src/main/webapp/META-INF/context.xml
  6. Build it with mvn clean package
  7. Deploy target/sqlPerfAntiPatterns.war to Tomcat 7+
  8. Run the service using this URL:
localhost:8082/sqlPerfAntiPatterns/sqlPerfServlet?pgbenchScenarioNum=2&pgbenchAccountIds=34591,9483121,78941,111294,9122
The following populates the pgbench_accounts table, but not the pgbench_history table:
export DB_NAME=db_pgbench
export SCALE_FACTOR=100
export HOSTNAME=localhost
export PORT=5432
export USER=postgres

pgbench -i -s $SCALE_FACTOR -h $HOSTNAME -p $PORT -U $USER $DB_NAME
-- add primary key to history table
db_pgbench=# ALTER TABLE pgbench_history ADD COLUMN hid SERIAL PRIMARY KEY;
NOTICE:  ALTER TABLE will create implicit sequence "pgbench_history_hid_seq" for serial column "pgbench_history.hid"
NOTICE:  ALTER TABLE / ADD PRIMARY KEY will create implicit index "pgbench_history_pkey" for table "pgbench_history"
ALTER TABLE


--added index for searching history for account id's.
CREATE INDEX idx_aid on pgbench_history (aid);

No comments: