Change the work_mem database parameter from 2MB to 16MB. This moves the disk I/O needed for the hash join and GROUP BY into memory, which is where the bottleneck was coming from. Warning: make sure you have enough system memory for this. Each database connection (in fact, each sort or hash operation) can use up to work_mem, so with 32 connections that could consume 512MB instead of 64MB.
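A quick way to verify the effect is to re-run the query under EXPLAIN ANALYZE with the new setting; the table and column names below are placeholders, not the actual baseline query:

postgres=# set work_mem = 16384;  -- raw value is in kB, i.e. 16MB
postgres=# explain analyze
postgres-#   select signature, count(*) from reports group by signature;

With the old 2MB setting the plan output should show the sort/hash spilling to disk (e.g. an external merge sort); with 16MB it should stay in memory.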
This is a great finding (and so obvious once one knows the solution)!
It looks like Postgres allows this to be set per connection, so we could just set it before starting the baseline computation (the value below is in kB, so 32768 = 32MB):
postgres=# set work_mem=32768;
postgres=# show work_mem;
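If the baseline computation runs inside a transaction, SET LOCAL could be used instead so the override only lasts for that transaction (a sketch, with the actual query omitted):

postgres=# begin;
postgres=# set local work_mem = 32768;
postgres=# -- run the baseline query here
postgres=# commit;

A plain SET, as above, stays in effect for the rest of the session on that connection.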
I have opened the following Bugzilla to track this: Bug 553561