Standard benchmarks

With the addition of GPU acceleration in release 2.0, standard benchmarks to compare the performance of new hardware have become more necessary than ever. Therefore, we suggest the following standard tests on the Plasmodium ribosome data set presented in Wong et al., eLife 2014:

wget ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz
tar -zxf relion_benchmark.tar.gz

This archive was itself made from the following downloads from EMPIAR and EMDB:

ascp -QT -l 2G -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext@fasp.ebi.ac.uk:archive/10028/data/Particles .
wget ftp://ftp.ebi.ac.uk/pub/databases/emdb/structures/EMD-2660/map/emd_2660.map.gz
gunzip emd_2660.map.gz

2D classification

Run (XXX instances of) the (MPI version of the) relion_refine program with the following command line arguments:

mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d

3D classification

Run (XXX instances of) the (MPI version of the) relion_refine program with the following command line arguments:

mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d

Additional options

One major variable to play with is of course the number of parallel MPI processes to run. In addition, depending on your system, you may want to investigate the following options:

`--j`	The number of parallel threads to run for each MPI process. We often use 4-6.
`--dont_combine_weights_via_disc`	By default large messages are passed between MPI processes through reading and writing of large files on the computer disk. By giving this option, the messages will be passed through the network instead. We often use this option.
`--gpu`	Use GPU-acceleration. We often use this option.
`--pool`	This determines how many particles are processed together in a function call. We often use 10-50 for GPU jobs, and 1 or 2x the number of threads for CPU jobs.
`--no_parallel_disc_io`	By default, all MPI slaves read their own particles (from disk or into RAM). Use this option to have the master read all particles, and then send them all through the network. We do not often use this option.
`--preread_images`	By default, all particles are read from the computer disk in every iteration. Using this option, they are all read into RAM once, at the very beginning of the job instead. We often use this option if the machine has enough RAM (more than N*boxsize*boxsize*4 bytes) to store all N particles.
`--scratch_dir`	By default, particles are read every iteration from the location specified in the input STAR file. By using this option, all particles are copied to a scratch disk, from where they will be read (every iteration) instead. We often use this option if we don't have enough RAM to read in all the particles, but we have large enough fast SSD scratch disk(s) (e.g. mounted as /tmp).
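
The RAM rule of thumb given for `--preread_images` is easy to check before launching a job. The following is a minimal sketch, assuming 4 bytes per pixel as in that rule; the particle count, box size and RAM size are hypothetical values chosen for illustration, and the function name is ours, not part of RELION.

```python
def preread_ram_bytes(n_particles: int, box_size: int, bytes_per_pixel: int = 4) -> int:
    """Lower bound on the RAM needed to hold all particle images at once."""
    return n_particles * box_size * box_size * bytes_per_pixel


# Hypothetical data set: 100,000 particles in 360-pixel boxes.
needed = preread_ram_bytes(100_000, 360)
print(f"{needed / 1e9:.1f} GB needed for the images alone")

# Hypothetical machine with 64 GB of RAM.
ram_bytes = 64e9
if needed < ram_bytes:
    print("--preread_images may be an option")
else:
    print("consider --scratch_dir on a fast local disk instead")
```

Note that this only counts the images themselves; the job needs additional memory for everything else, so a data set that barely fits is unlikely to leave room for several MPI processes each holding their own copy.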

Some of our results

Our CPU-cluster

Each node of our cluster has at least 64GB RAM, and an Intel(R) Xeon(R) CPU E5-2667 0 (@ 2.90GHz). The 12 cores of each node are hyperthreaded, so each physical core appears as two cores to the operating system.

benchmark	time [hr]	nr MPIs	Additional options
Class3D	23:28	60	`--pool 100 --j 4 --dont_combine_weights_via_disc`

Note: this calculation used 10 nodes (with a total of 120 physical cores, or 240 hyperthreaded ones). Our cluster nodes do not have large enough scratch disks to store the data, nor is there enough RAM for all slaves to read the data into memory.

pcterm48

This machine has 2 Titan-X (Pascal) GPUs, 64GB RAM, and a 6-core (12-thread) i7-6800K CPU (@3.40GHz).

benchmark	time [hr]	nr MPIs	Additional options
Class3D	8:52	3	`--gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc`
Class3D	6:21	3	`--preread_images --no_parallel_disc_io --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc`
Class3D	5:23	1	`--preread_images --gpu 0,0,0,0,1,1,1,1 --pool 100 --j 8 --dont_combine_weights_via_disc`
Class3D	4:29	3	`--scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc`
Class3D	7:31	1	`--scratch_dir /ssd --gpu 0 --pool 100 --j 4 --dont_combine_weights_via_disc`
Class2D	11:02	3	`--scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc`

Note: Reading the particles from our heavily used /beegfs shared file system is relatively slow. Because 64GB of RAM is only just enough to read the entire data set (51GB) once, the two MPI slaves will have to get pre-read particles from the master (through --no_parallel_disc_io), or one has to run the non-MPI version of the program with different threads on each card. Both approaches provide some speedup compared to reading them from /beegfs, but it is faster to copy all particles to a local SSD disk first.

lg26

This machine has four GTX1080 GPUs, 64GB RAM and an Intel(R) Xeon(R) CPU E5-2620 v3 (@ 2.40GHz).

benchmark	time [hr]	nr MPIs	Additional options
Class3D	5:39	3	`--scratch_dir /ssd --gpu 0:1 --pool 100 --dont_combine_weights_via_disc --j 6`
Class3D	3:42	5	`--scratch_dir /ssd --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc --j 6`

After an update of the machine to contain 8 GTX1080s, and for a relion-3.0 development version:

benchmark	time [hr]	nr MPIs	Additional options
Class3D	3:31	3	`--scratch_dir /ssd --gpu 4:5 --pool 100 --dont_combine_weights_via_disc --j 6`

lg27

This machine has two Titan Volta GPUs, 128GB RAM and an AMD Ryzen Threadripper 1950X 16-Core Processor.

benchmark	time [hr]	nr MPIs	Additional options
Class3D	2:43	3	`--scratch_dir /ssd --gpu 0:1 --pool 100 --dont_combine_weights_via_disc --j 6`

lg23

This machine has a single Quadro K5200 GPU, 64 GB of RAM, and an Intel(R) Xeon(R) CPU E5-2687W v3 (@ 3.10GHz).

benchmark	time [hr]	nr MPIs	Additional options
Class3D	13:07	1	`--scratch_dir /tmp --gpu 0 --pool 100 --dont_combine_weights_via_disc`

Note: this older card is still very useful! It is only approximately half as fast as a single Titan-X (Pascal), and beats 10 nodes on our cluster by almost a factor of two!
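
The ratios behind this note can be recomputed from the Class3D times in the tables above. The sketch below is only a worked calculation; comparing against the single-GPU Titan-X run (`--gpu 0`, 7:31) is our assumption about which row the note refers to.

```python
def to_minutes(hhmm: str) -> int:
    """Convert a wall-clock time written as h:mm into minutes."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


cluster = to_minutes("23:28")  # CPU cluster, 10 nodes
k5200 = to_minutes("13:07")    # lg23, single Quadro K5200
titan_x = to_minutes("7:31")   # pcterm48, single Titan-X (Pascal)

print(f"K5200 versus 10 cluster nodes: {cluster / k5200:.2f}x faster")
print(f"Titan-X (Pascal) versus K5200: {k5200 / titan_x:.2f}x faster")
```

Both ratios come out somewhat below two, in line with the "approximately half" and "almost a factor two" wording.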

fmg01-24

These machines are our standard GPU nodes on the cluster. They have four 1080Ti cards, 256 GB of RAM, and an Intel(R) Xeon(R) CPU E5-2667 v4 (@ 3.20GHz). The newer nodes (fmg15-24) run Scientific Linux 7 (SL7), the older ones (fmg01-14) run SL6.

benchmark	time [hr]	nr MPIs	OS	Additional options
Class3D	2:51	5	SL6	`--scratch_dir /ssd --gpu 0:1:2:3 --pool 3 --dont_combine_weights_via_disc`
Class3D	2:26	5	SL7	`--scratch_dir /ssd --gpu 0:1:2:3 --pool 3 --dont_combine_weights_via_disc`
Class3D	2:34	5	SL7	`--scratch_dir /ssd --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc`
Class3D	3:06	3	SL7	`--scratch_dir /ssd --gpu 0,1:2,3 --pool 100 --dont_combine_weights_via_disc`
Class3D	4:30	1	SL7	`--scratch_dir /ssd --gpu 0,1,2,3 --pool 100 --dont_combine_weights_via_disc`
Class3D	x:xx	1	SL7	`--preread_images --gpu 0,1,2,3 --pool 100 --dont_combine_weights_via_disc`

How to use GPUs in RELION

Which programs have been GPU-accelerated?

RELION-2.0+ is GPU-accelerated for:

refine, refine_mpi (only the slaves, not the master!)
autopick, autopick_mpi (master and slaves)

Which cards can I use?

The implemented GPU support is compatible with CUDA compute capability 3.5 or higher. See Wikipedia's CUDA page for a complete list of such GPUs.

Typical GPU usage

If you have one or more CUDA-capable GPUs, using them in RELION is as easy as adding the flag --gpu. If this flag is used without arguments, RELION will by default distribute its processes over all available (visible) GPUs. This default behaviour will likely be the preferred one if you do not need to share the computer with anyone else, and you only want to run a single job at a time. Assuming you have a 4-GPU setup, the GPUs will be numbered 0, 1, 2 and 3. In general, you will most likely want to run a single MPI process on each GPU. You could then just use the --gpu option without arguments and specify 4 working (slave) MPI ranks:

mpirun -n 4 `which relion_autopick_mpi` --gpu

For classification and refinement jobs, the master does not share the heavy calculations performed on the GPU, so you would use:

mpirun -n 5 `which relion_refine_mpi` --gpu

Note that 3D auto-refinement always needs to be run with at least 3 MPI processes (a master, and one slave for each half-set). Therefore, machines with at least two GPU cards would be preferable for refinement using GPUs. If you need to (or want to) run multiple mpi-ranks on each GPU, RELION will attempt to do so in an efficient way if you simply specify more ranks than there are GPUs.

You can run multiple threads just as with previous versions of relion, using the --j <x> option. Each MPI process will launch the specified number of threads. This may speed up calculations, without costing much extra memory either on the CPU (RAM) or on the GPU. On a 4-GPU development machine with 16 visible cores, we often run classifications or refinements using:

mpirun -n 5 `which relion_refine_mpi` --j 4 --gpu

This produces 4 working (slave) MPI ranks, each with 4 threads. There is a single rank per card, but multiple CPU cores can use each GPU, maximizing overall hardware utilization. Each MPI rank requires its own copy of large objects in CPU and GPU memory, but if everything fits into memory it may in fact be faster to run 2 or more MPI processes on each GPU. The MPI processes may then fall out of step with each other, so that one process is busy doing calculations while another is, for example, reading images from disk.
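
The value passed to `mpirun -n` follows from the number of GPUs, the number of working ranks per GPU, and whether the program has a master that does not compute on the GPU. The helper below is a hypothetical illustration of that arithmetic, not a RELION utility.

```python
def mpi_ranks(n_gpus: int, ranks_per_gpu: int = 1, has_idle_master: bool = True) -> int:
    """Number of MPI processes to request with `mpirun -n`."""
    workers = n_gpus * ranks_per_gpu
    # relion_refine_mpi has a master that does not use the GPU, so it needs one extra rank.
    return workers + 1 if has_idle_master else workers


print(mpi_ranks(4, has_idle_master=False))  # relion_autopick_mpi on 4 GPUs
print(mpi_ranks(4))                         # relion_refine_mpi, one rank per GPU
print(mpi_ranks(4, ranks_per_gpu=2))        # relion_refine_mpi, two ranks per GPU
```

Keep in mind that the number of ranks multiplied by the `--j` value should stay within the number of available CPU cores.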

Specifying which GPUs to use

This section describes more advanced syntax for restricting RELION processes to certain GPUs on multi-GPU setups. You can use an argument to the --gpu option to provide a list of device indices. The syntax is to delimit ranks with colons [:], and threads with commas [,]. Any GPU indices provided are taken to be a list, which is repeated if it is shorter than the total number of GPUs. By extension, the following rules apply:

If a GPU id is specified more than once for a single MPI rank, that GPU will be assigned proportionally more of the threads of that rank.
If no colons are used (i.e. GPUs are only specified for a single rank), then the GPUs specified apply to all ranks.
If GPUs are specified for more than one rank but not for all ranks, the unrestricted ranks are assigned the same GPUs as the restricted ranks, by a modulo rule.

For example, if you only want to use two of the four GPUs for all MPI ranks, because you want to leave the other two free for a different user/job, then (by rule 2 above) you can specify

mpirun -n 3 `which relion_refine_mpi` --gpu 2,3

If you want an even spread over ALL GPUs, then you should not specify selection indices, as RELION will handle this itself. On your hypothetical 4-GPU machine, you would simply say

mpirun -n 9 `which relion_refine_mpi` --gpu

One can also schedule individual threads from MPI processes on the GPUs. This would be most useful when available RAM is a limitation. Then one could for example run 3 MPI processes, each of which spawns a number of threads on two of the cards, as follows:

mpirun -n 3 `which relion_refine_mpi` --j 2 --gpu 0,1:2,3

Finally, for completeness, the following is a more complex example to illustrate the full functionality of the GPU-device specification options.

mpirun -n 4 ... --j 3 --gpu 2:2:1,3

Slave ranks 1 and 2 are restricted to device 2, while slave rank 3 is restricted to devices 1 and 3. All ranks simply distribute themselves uniformly across their restrictions, which results in a higher memory load on device 2 and no utilization of device 0. (Each rank resident on a device has a shared memory object for its threads on that device, so device 2 has roughly twice the memory load of device 1 in this case.) The mapping will be (rank/thread : GPU id):

(1/0):2
(1/1):2
(1/2):2
(2/0):2
(2/1):2
(2/2):2
(3/0):1
(3/1):3
(3/2):1
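
The mapping above can be reproduced with a short sketch. It covers only the fully specified case (one colon-separated entry per slave rank) and assumes that the threads of a rank cycle through that rank's device list in order; this is our reading of the listed example, not RELION's actual code, and the function name is hypothetical.

```python
def gpu_mapping(gpu_arg: str, n_slaves: int, n_threads: int) -> dict:
    """Map (slave rank, thread) to a GPU id for a fully specified --gpu argument."""
    # Colons separate ranks, commas separate the devices offered to one rank.
    per_rank = [[int(dev) for dev in rank.split(",")] for rank in gpu_arg.split(":")]
    if len(per_rank) != n_slaves:
        raise ValueError("this sketch needs one device list per slave rank")
    mapping = {}
    for rank, devices in enumerate(per_rank, start=1):  # rank 0 is the master
        for thread in range(n_threads):
            # Assumed rule: threads cycle through the rank's device list.
            mapping[(rank, thread)] = devices[thread % len(devices)]
    return mapping


# mpirun -n 4 (one master plus three slaves), --j 3, --gpu 2:2:1,3
for (rank, thread), device in gpu_mapping("2:2:1,3", n_slaves=3, n_threads=3).items():
    print(f"({rank}/{thread}):{device}")
```

Listing a device more than once for a rank gives it proportionally more of that rank's threads under the same cycling rule, for example `0,0,0,0,1,1,1,1` with 8 threads puts four threads on each card.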

Monitor GPU use and performance

If you are on a machine which has CUDA-capable GPUs, you should be able to see them by asking the driver to list them and their current usage:

nvidia-smi
Thu Apr 14 15:38:15 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.68     Driver Version: 352.68         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
| N/A   33C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
| N/A   30C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:83:00.0     Off |                    0 |
| N/A   42C    P0    82W / 149W |  11397MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   57C    P0    95W / 149W |  11397MiB / 11519MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     27260    C   ...kspace/relion_build/bin/relion_refine_mpi 11340MiB |
|    3     27262    C   ...kspace/relion_build/bin/relion_refine_mpi 11340MiB |
+-----------------------------------------------------------------------------+

In this case you can see 4 GPUs, two of which are being used for a RELION run with very high usage. From the display, one can assess the utilisation of the device by the percentage of "GPU-Util". It is however possible to see a high utilisation without using the GPU at full capacity, as the clock frequency might be too low. For instance, if the temperature of the GPU rises too high, the clock frequency is lowered to avoid damage to the hardware. If the power draw ("Pwr" in the above) is close to the displayed maximum, this is a better indication that the GPUs are being fully utilized. If this is not the case, see the following section for more in-depth analysis and diagnosis. To see a continually updated version of this display, you can use e.g.

watch nvidia-smi

To monitor clock frequency and GPU performance in a way that can be more easily put into a log file, it may be better to use the automatically refreshing command

nvidia-smi dmon
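
The power-draw check described above can also be scripted. The sketch below parses text in the CSV shape printed by `nvidia-smi --query-gpu=index,utilization.gpu,power.draw,power.limit --format=csv,noheader,nounits` and flags cards that report high utilisation while drawing well below their power cap. The sample numbers are copied from the display above; the thresholds and the function name are arbitrary assumptions for illustration.

```python
import csv
import io

# Sample report; on a real machine this text would come from the nvidia-smi query.
SAMPLE = """\
0, 0, 26.00, 149.00
1, 0, 27.00, 149.00
2, 99, 82.00, 149.00
3, 94, 95.00, 149.00
"""


def busy_but_underpowered(report: str, min_util: float = 90.0,
                          min_power_fraction: float = 0.8) -> list:
    """Return indices of GPUs with high GPU-Util but low power draw."""
    flagged = []
    for row in csv.reader(io.StringIO(report), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        index, util, draw, limit = int(row[0]), float(row[1]), float(row[2]), float(row[3])
        if util >= min_util and draw / limit < min_power_fraction:
            flagged.append(index)
    return flagged


print(busy_but_underpowered(SAMPLE))
```

With these thresholds, both cards running RELION in the sample are flagged, since they draw roughly 55-65% of their 149 W cap despite reporting more than 90% utilisation.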

Leaving some GPU memory free

By default, RELION will grab almost all of the memory on the GPU, which may not be optimal, e.g. when on a shared workstation. There is an additional flag called --free_gpu_memory, which specifies how many MB is to be left for dynamic allocation and other processes. Normally you should not have to use it, but to e.g. overlap different runs with low demands on memory (such as autopicking and 2D-classification) during pipelined or on-the-fly processing, it may be useful to specify about half the GPU memory as available to other processes.

Computer hardware options

CPU cluster

Relion will run on any Linux-like machine, most typically on CPU/GPU clusters, or on GPU desktop (gamer-like) machines. A minimum of 64 GB of RAM is recommended to run relion. If you're using a CPU-only cluster, we recommend at least a cluster of 100-200 cores. There are many different options when buying a cluster. We recommend speaking to your university's high-performance computing people when planning to buy one.

Assemble your own GPU machines

Our collaborator at the SciLifeLab in Stockholm, Erik Lindahl, has made a useful blog with GPU hardware recommendations. Briefly, you'll need an NVIDIA GPU with a CUDA compute capability of at least 3.0. You don't need the expensive double-precision NVIDIA cards: the high-end gamer cards will also do, but do see Erik's blog for details! Note that 3D auto-refine will benefit from 2 GPUs, while 2D and 3D classification can be run just as well with 1 GPU. Apart from your GPUs, you'll probably also benefit from a fast scratch disk (e.g. a 400GB SSD), especially if your working directories will be mounted over the network connecting multiple machines.

Ready-to-use GPU machines

There are now also hardware providers that sell machines with RELION pre-installed on them. Please note that Sjors and Erik have NO financial interest in this. However, because we do think that these companies may provide useful solutions for part of our userbase, we mention these here:

Cloud computing

Another option is to run on the Amazon EC2 cloud. This was pioneered by Michael Cianfrocco and Andres Leschziner. You can read about it in the eLife paper. They ran the same benchmark described above on the GPU nodes available through the EC2 cloud. You can read about their results on their own benchmark site.
