1.  Why was the development of an MPI version of simuPOP discontinued? (Mar. 10, 2009)

A prototype of an MPI version of simuPOP was added to simuPOP 0.7.5 in Dec. 2006. The general idea worked, and I was even able to run small scripts using it. However, because an MPI version could not achieve its initial design goals, and because a full implementation would have required major revisions to the simuPOP core, the MPI code was removed in simuPOP 0.8.3.

When the MPI version of simuPOP was first designed, 2G of RAM sounded huge and 64-bit operating systems were rare. An MPI version seemed to be the only way to break the 4G RAM barrier of 32-bit operating systems. I also hoped that an MPI version could significantly improve the performance of large simulations.

However, compared to single-executable programs, an MPI implementation of simuPOP was extremely difficult to design. Because simuPOP is a programming environment, arbitrary user logic can be used. For example, a user could change the genotype of a random individual using the Python random module. If the script were executed separately on different nodes, different individuals could be chosen, leading to erroneous results. The only feasible MPI design would be a master-slave model in which a master node interprets the script and sends very detailed instructions to the slave nodes. However, this model requires a large amount of communication between nodes, especially when the population changes. Consequently, the MPI modules might not provide any performance advantage over a regular module. This was more or less confirmed by my experimental implementation.
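
The divergence problem can be reproduced without MPI. The following is a minimal sketch in plain Python, not simuPOP code: two simulated "nodes" run the same script line with their own random number generators and end up picking different individuals, whereas a single master that draws once and passes the decision on keeps every copy consistent. The function name pick_individual, the seeds, and the population size are all hypothetical.

```python
import random

def pick_individual(pop_size, rng):
    """Return the index of a randomly chosen individual."""
    return rng.randrange(pop_size)

pop_size = 1000

# Each "node" runs the same script with its own generator state,
# as would happen if the script were executed separately per node.
node_a = random.Random(1)   # hypothetical seed on node A
node_b = random.Random(2)   # hypothetical seed on node B
picks_a = [pick_individual(pop_size, node_a) for _ in range(5)]
picks_b = [pick_individual(pop_size, node_b) for _ in range(5)]
print(picks_a == picks_b)   # False with these seeds

# Master-slave idea: one node draws, the others are told the result.
master = random.Random(1)
chosen = [pick_individual(pop_size, master) for _ in range(5)]
slave_copies = [list(chosen) for _ in range(2)]  # stands in for a broadcast
print(all(copy == chosen for copy in slave_copies))  # True
```

The "broadcast" here is only a list copy; in a real MPI program every such decision would be a message between nodes, which is where the communication cost described above comes from.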

And you know what happened next. RAM became cheaper and cheaper, and even home computers got 4G or more of RAM. Dual-core and quad-core machines became commonplace, and 64-bit operating systems became mainstream. Because it became easy to simulate large populations on a regular workstation, there was less and less need for an MPI version of simuPOP.

Another reason for the removal of the MPI version is that I am looking into an OpenMP implementation. Using a shared-memory architecture, I might be able to simulate several replicates, or produce multiple offspring, simultaneously using different threads. The performance boost could be dramatic. In addition, this implementation would require little modification to the simuPOP codebase, and it is possible that I can distribute simuPOP modules that run on both single-core and multi-core machines...

If everything goes as planned, simuPOP 1.0.x will be bug-fix releases of simuPOP 1.0, and the 1.1.x releases will have OpenMP support.
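
The idea of running several replicates at once can be illustrated with a small analogy in Python. This is not OpenMP and not simuPOP code: it is a sketch that assumes a toy drift model (evolve_replicate, a hypothetical function) and uses a thread pool only to show the structure, namely that each replicate owns its random number generator and shares no mutable state with the others. In CPython the global interpreter lock means these threads will not actually run faster; the speed-up discussed above would come from native threads in the C++ core.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evolve_replicate(seed, pop_size=200, generations=50):
    """Toy drift model of one allele; returns its final frequency."""
    rng = random.Random(seed)      # private generator, no shared state
    count = pop_size // 2          # start at frequency 0.5
    for _ in range(generations):
        freq = count / pop_size
        # Each offspring carries the allele with probability freq.
        count = sum(rng.random() < freq for _ in range(pop_size))
    return count / pop_size

seeds = [11, 12, 13, 14]           # hypothetical per-replicate seeds
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(evolve_replicate, seeds))

serial = [evolve_replicate(s) for s in seeds]
print(parallel == serial)          # True: same seeds give same results
```

Because the replicates are independent, running them concurrently or one after another gives identical results, which is what makes this kind of parallelism comparatively easy to add.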

2.  Icc vs. Gcc: which one is faster for simuPOP simulations? (Feb. 21, 2009)

simuPOP is compiled with Visual C++ 2003 (win32, Python 2.4 and 2.5), Visual C++ Express 2008 (win32, Python 2.6), and GCC/G++ on all other platforms (MacOS, Linux, Solaris). These compilers were chosen because they are the ones used to build the official Python distributions.

Intel icc is usually considered to generate (much) faster code than gcc. I last tried icc with a simuPOP version around 0.6.0, so I am interested to see whether simuPOP 0.9.2 can be compiled with it.

Here are the steps:

  • Download icc from the Intel ICC website. The Linux non-commercial version is free, and is the version I use.
  • Install icc. I use a separate user account and installed icc locally for that user so that it does not interfere with my current development environment. I use csh and set up a ~/.cshrc file as follows:
    source /my/home/intel/Compiler/11.0/081/bin/iccvars.csh intel64
    setenv PATH /my/home/Python26/bin:${PATH}
  • Download the Python 2.6 source code, and compile it as follows:
    > tar zxf Python-2.6.tgz
    > cd Python-2.6
    > setenv CC icc
    > setenv CXX icc
    > ./configure --prefix=/my/home/Python26
    > make
    > make install
    Note that some modules could not be compiled by icc.
  • (Optional) Download and install scons.
  • Then check out a clean copy of simuPOP, and compile simuPOP as usual:
    > python setup.py install
    or
    > scons install
    if scons is installed.
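
After a build like the one above, it is worth confirming which interpreter is actually being run and which compiler it recorded at build time. The following is a minimal sketch written for a current Python 3, not for the Python 2.6 used here (the sysconfig module was only added in later releases; on 2.6 distutils.sysconfig.get_config_var played a similar role). The helper name build_compiler_info is hypothetical, and what an icc-built interpreter reports in the compiler banner is not verified here.

```python
import platform
import sysconfig

def build_compiler_info():
    """Collect what the running interpreter records about its compiler."""
    return {
        "python_version": platform.python_version(),
        "compiler_banner": platform.python_compiler(),
        # Compiler commands saved at configure time; may be None on Windows.
        "CC": sysconfig.get_config_var("CC"),
        "CXX": sysconfig.get_config_var("CXX"),
    }

for key, value in build_compiler_info().items():
    print(f"{key}: {value}")
```

If CC was set to icc before running ./configure, the CC entry is where that choice would be expected to show up.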

icc emits many more remarks (warnings) than gcc, and many of them do not make much sense to me. Anyway, simuPOP compiled successfully, and all simuPOP tests and the examples in the simuPOP user's guide ran smoothly.

Does icc really help the performance of simuPOP? I do not have time for a thorough test, so I ran a typical random mating test from test_21_performance,

> cd simuPOP/test
> python test_21_performance.py TestPerformance.TestRandomMating

for both the icc-compiled and the gcc-compiled simuPOP (both for Python 2.6). Here are the results (execution time in seconds; shorter is better):

                               N=10k                         N=100k                       N=1000k
Compiler   Module       plain      with      with     plain      with      with     plain      with      with
                              selection migration           selection migration           selection migration
Gcc 4.1.2  binary        0.30      0.35      0.53      4.41      7.43      9.02     85.33    101.88    141.08
           short         0.28      0.32      0.48      3.89      7.04      8.49     94.58    115.20    150.69
           long          0.29      0.33      0.50      4.17      7.73      8.52    101.29    118.75    156.76
Icc 11.0   binary        0.34      0.41      0.62      4.83      8.17     10.02     89.03    107.23    151.62
           short         0.32      0.38      0.57      4.21      7.83      9.07     97.84    121.65    160.67
           long          0.29      0.35      0.55      4.40      8.18      8.97     104.4    121.97    165.00

It is quite clear that gcc is as fast as or faster than icc in every case, which comes as a surprise to me. However, these numbers are from single runs, and icc may not have been given the right optimization flags. If you have any suggestions, please feel free to let me know.
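
To put a number on the gap, the timings from the table above can be turned into icc/gcc ratios. The snippet below is a small sketch that only re-uses the published figures; the dictionary layout and the function name slowdown are hypothetical. With these values the ratios range from 1.00 (the long module, N=10k, plain) to roughly 1.19.

```python
# Timings in seconds copied from the table above, in column order:
# N=10k, N=100k, N=1000k, each as (plain, with selection, with migration).
gcc = {
    "binary": [0.30, 0.35, 0.53, 4.41, 7.43, 9.02, 85.33, 101.88, 141.08],
    "short":  [0.28, 0.32, 0.48, 3.89, 7.04, 8.49, 94.58, 115.20, 150.69],
    "long":   [0.29, 0.33, 0.50, 4.17, 7.73, 8.52, 101.29, 118.75, 156.76],
}
icc = {
    "binary": [0.34, 0.41, 0.62, 4.83, 8.17, 10.02, 89.03, 107.23, 151.62],
    "short":  [0.32, 0.38, 0.57, 4.21, 7.83, 9.07, 97.84, 121.65, 160.67],
    "long":   [0.29, 0.35, 0.55, 4.40, 8.18, 8.97, 104.4, 121.97, 165.00],
}

def slowdown(module):
    """icc time divided by gcc time per column (>1 means icc is slower)."""
    return [i / g for i, g in zip(icc[module], gcc[module])]

for module in ("binary", "short", "long"):
    print(module, " ".join(f"{r:.2f}" for r in slowdown(module)))
```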

The comparison was done on a 3-year-old Dell Precision 650 workstation with a Dual Xeon CPU (3.73GHz) and 4G RAM, running RHEL5 x86-64. Just out of curiosity, I also ran the same tests on a new PowerMac with a quad-core CPU (2.6G) and 8G of RAM, and on a PC with a quad-core CPU (Q6600) and 3G RAM running 32-bit Windows Vista. The benchmark from the Mac is impressive. It may be a good time to replace my Linux workstation with a Mac. :-)

                                               N=10k                         N=100k                       N=1000k
Platform       Compiler    Module       plain      with      with     plain      with      with     plain      with      with
                                              selection migration           selection migration           selection migration
MacOS          Gcc 4.1.2   binary        0.23      0.34      0.52      2.31      3.37      5.54     40.12     75.70    100.56
                           short         0.24      0.34      0.49      2.51      3.60      5.53     40.06     75.19    100.26
                           long          0.24      0.33      0.48      2.25      3.38      5.49     40.51     75.08    101.39
Windows Vista  Visual C++  binary        0.53      0.72      0.93      5.53      8.57     10.02     85.38    123.79    141.51
                           short         0.43      0.60      0.81      4.55      7.04      8.66     81.89    118.15    134.67
                           long          0.43      0.61      0.82      4.78      7.59      8.90     88.21    121.93    138.19
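
Since all of the timings above come from single runs, a more careful comparison would repeat each measurement and report the best (or a typical) value. The following is a minimal sketch using the standard timeit module; the workload function is a hypothetical stand-in for one simulation run and has nothing to do with the actual test_21_performance code.

```python
import timeit

def workload(n=20000):
    """Hypothetical stand-in for one simulation run; not simuPOP code."""
    total = 0
    for i in range(n):
        total += i * i % 7
    return total

def best_of(func, repeats=5):
    """Run func several times and return (best, all timings) in seconds."""
    timings = timeit.repeat(func, number=1, repeat=repeats)
    return min(timings), timings

best, timings = best_of(workload)
print(f"best of {len(timings)} runs: {best:.4f} s")
```

Taking the minimum is one common choice because it is least affected by other activity on the machine; reporting the spread of the timings as well would show how much a single run can be trusted.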