Benchmarks & computer hardware

Standard benchmarks

With the addition of GPU acceleration in release 2.0, standard benchmarks to compare the performance of new hardware have become more necessary than ever. Therefore, we suggest the following standard tests on the Plasmodium ribosome data set presented in Wong et al., eLife 2014:

ascp -QT -l 2G -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext@fasp.ebi.ac.uk:archive/10028/data/Particles .
wget ftp://ftp.ebi.ac.uk/pub/databases/emdb/structures/EMD-2660/map/emd_2660.map.gz
gunzip emd_2660.map.gz
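
As a quick sanity check that the particle download is complete, you can count the particle entries in the input STAR file. This is only a sketch, assuming the standard STAR-file layout in which every particle line contains an @-separated image reference:

grep -c "@" Particles/shiny_2sets.star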

2D classification

Run (the MPI version of) the relion_refine program with the following command-line arguments:

--i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d
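
For reference, a complete invocation might look like the line below. This is only a sketch: relion_refine_mpi is the MPI build of the program, and the number of MPI processes (5 here) should be adapted to your own hardware.

mpirun -n 5 relion_refine_mpi --i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d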

3D classification

Run (the MPI version of) the relion_refine program with the following command-line arguments:

--i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d
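
Under the same assumptions as in the 2D example above, a complete 3D invocation might look like:

mpirun -n 5 relion_refine_mpi --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d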

Additional options

One major variable to play with is of course the number of parallel MPI processes to run. In addition, depending on your system, you may want to investigate the following options (a combined example is given after this list):

--j The number of parallel threads to run per MPI process. We often use 4-6.
--dont_combine_weights_via_disc By default, large messages are passed between MPI processes through reading and writing of large files on the computer disk. With this option, the messages are passed through the network instead. We often use this option.
--gpu Use GPU-acceleration. We often use this option.
--pool This determines how many particles get read together into RAM. We often use 10-100.
--no_parallel_disc_io By default, all MPI slaves read their own particles (from disk or into RAM). Use this option to have the master read all particles and then send them through the network. We do not often use this option.
--preread_images By default, all particles are read from the computer disk in every iteration. With this option, they are all read into RAM once, at the very beginning of the job, instead. We often use this option if the machine has enough RAM (more than N*boxsize*boxsize*4 bytes) to store all N particles.
--scratch_dir By default, particles are read every iteration from the location specified in the input STAR file. With this option, all particles are first copied to a scratch disk, from where they will be read (every iteration) instead. We often use this option if we don't have enough RAM to read in all the particles, but do have (a) large enough, fast SSD scratch disk(s) (e.g. mounted as /tmp).
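
As an illustration of how these options combine, the 3D classification benchmark could be run on a machine with two GPUs roughly as follows. This is only a sketch: the scratch mount point /ssd, the use of three MPI processes and the GPU indices 0 and 1 are assumptions that have to be matched to your own system (compare the measured runs further down this page). Note that with --preread_images the RAM needed is roughly the size of the particle stack itself (about 51 GB for this data set), which is why --scratch_dir is used here instead.

mpirun -n 3 relion_refine_mpi --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d --scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc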

Some of our results

Our CPU-cluster

Each node of our cluster has at least 64GB RAM, and an Intel(R) Xeon(R) CPU E5-2667 0 (@ 2.90GHz). The 12 cores of each node are hyperthreaded, so each physical core appears as two cores to the operating system.

benchmark | time (hh:mm) | nr of MPI processes | additional options
Class3D | 23:28 | 60 | --pool 100 --j 4 --dont_combine_weights_via_disc

Note: this calculation used 10 nodes (with a total of 120 physical cores, or 240 hyperthreaded ones); the 60 MPI processes with 4 threads each (--j 4) add up to exactly those 240 logical cores. Our cluster nodes do not have large enough scratch disks to store the data, nor is there enough RAM for all slaves to read the data into memory.

pcterm48

This machine has 2 Titan-X (Pascal) GPUs, 64GB RAM, and an Intel(R) Xeon(R) CPU E5-2620 v3 (@ 2.40GHz).

benchmark | time (hh:mm) | nr of MPI processes | additional options
Class3D | 8:52 | 3 | --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D | 6:21 | 3 | --preread_images --no_parallel_disc_io --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D | 5:23 | 1 | --preread_images --gpu 0,0,0,0,1,1,1,1 --pool 100 --j 8 --dont_combine_weights_via_disc
Class3D | 4:29 | 3 | --scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D | 7:31 | 1 | --scratch_dir /ssd --gpu 0 --pool 100 --j 4 --dont_combine_weights_via_disc
Class2D | 11:02 | 3 | --scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc

Note: reading the particles from our heavily used /beegfs shared file system is relatively slow. Because 64 GB of RAM is only just enough to read the entire data set (51 GB) once, either the two MPI slaves have to get pre-read particles from the master (through --no_parallel_disc_io), or one has to run the non-MPI version of the program with different threads on each card (see the example below). Both approaches provide some speed-up compared to reading from /beegfs, but it is faster still to copy all particles to a local SSD disk first.
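
The third row in the table above is an example of the latter approach: a single, non-MPI relion_refine process with eight threads, i.e. four threads on each of the two cards. A sketch of such an invocation, under the same assumptions as the examples earlier on this page:

relion_refine --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d --preread_images --gpu 0,0,0,0,1,1,1,1 --pool 100 --j 8 --dont_combine_weights_via_disc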

lg26

This machine has four GTX 1080 GPUs, 64 GB RAM, and an Intel(R) Xeon(R) CPU E5-2620 v3 (@ 2.40GHz).

benchmark | time (hh:mm) | nr of MPI processes | additional options
Class3D | 5:39 | 3 | --scratch_dir /ssd --gpu 0:1 --pool 100 --dont_combine_weights_via_disc
Class3D | 3:42 | 5 | --scratch_dir /ssd --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc

lg23

This machine has a single Quadro K5200 GPU, 64 GB of RAM, and an Intel(R) Xeon(R) CPU E5-2687W v3 (@ 3.10GHz)

benchmark | time (hh:mm) | nr of MPI processes | additional options
Class3D | 13:07 | 1 | --scratch_dir /tmp --gpu 0 --pool 100 --dont_combine_weights_via_disc

Note: this older card is still very useful! It is only about half as fast as the Titan-X (Pascal) (13:07 versus 7:31 for the comparable single-GPU run above), yet it beats 10 nodes of our CPU cluster by almost a factor of two (13:07 versus 23:28)!

Computer hardware options

Relion will run on any Linux-like machine, most typically on CPU/GPU clusters or on GPU desktop (gamer-like) machines. A minimum of 64 GB of RAM is recommended to run relion. If you're using a CPU-only cluster, we recommend a cluster of at least 100-200 cores.

Our collaborator at the SciLifeLab in Stockholm, Erik Lindahl, maintains a useful blog with GPU hardware recommendations. Briefly, you'll need an NVIDIA GPU with a CUDA compute capability of at least 3.0, but you don't need the expensive double-precision NVIDIA cards: the high-end gamer cards will also do, but do see Erik's blog for details! Note that 3D auto-refine will benefit from 2 GPUs, whereas 2D and 3D classification can be run just as well with 1 GPU. Apart from your GPUs, you'll probably also benefit from a fast scratch disk (e.g. a 400 GB SSD), especially if your working directories will be mounted over the network connecting multiple machines.
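
If you are unsure which compute capability your cards have, you can check on the machine itself. The following is only a sketch: it assumes the NVIDIA driver is installed and that the deviceQuery example from the CUDA samples has been built in its default location.

nvidia-smi -L
/usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery | grep "CUDA Capability"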

There are now also hardware providers that sell machines with relion pre-installed. Please note that Sjors and Erik have NO financial interest in this. However, because we do think that these companies may provide useful solutions for part of our user base, we mention them here:

* [http://www.exxactcorp.com/relion.php www.exxactcorp.com]

Another option is to run on the Amazon EC2 cloud. This was pioneered by Michael Cianfrocco and Andres Leschziner. You can read about it in the [https://elifesciences.org/content/4/e06664 eLife paper]. They ran the same benchmark described above on the GPU nodes available through the EC2 cloud, and you can read about their results on [https://sites.google.com/site/emcloudprocessing/home/relion2 their own benchmark site].