Benchmarks & computer hardware

From Relion

Standard benchmarks

We suggest the following standard tests on the Plasmodium ribosome data set presented in Wong et al., eLife 2014:

wget ftp://ftp.mrc-lmb.cam.ac.uk/pub/scheres/relion_benchmark.tar.gz
tar -zxf relion_benchmark.tar.gz

This archive itself was made from the following downloads from EMPIAR and EMDB:

ascp -QT -l 2G -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh emp_ext@fasp.ebi.ac.uk:archive/10028/data/Particles .
wget ftp://ftp.ebi.ac.uk/pub/databases/emdb/structures/EMD-2660/map/emd_2660.map.gz
gunzip emd_2660.map.gz

2D classification

Run (XXX instances of) the (MPI version of the) relion_refine program with the following command-line arguments:

mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ctf --iter 25 --tau2_fudge 2 --particle_diameter 360 --K 200 --zero_mask --oversampling 1 --psi_step 6 --offset_range 5 --offset_step 2 --norm --scale --random_seed 0 --o class2d

3D classification

Run (XXX instances of) the (MPI version of the) relion_refine program with the following command-line arguments:

mpirun -n XXX `which relion_refine_mpi` --i Particles/shiny_2sets.star --ref emd_2660.map:mrc --firstiter_cc --ini_high 60 --ctf --ctf_corrected_ref --iter 25 --tau2_fudge 4 --particle_diameter 360 --K 6 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --offset_range 5 --offset_step 2 --sym C1 --norm --scale --random_seed 0 --o class3d

Additional options

One major variable to play with is of course the number of parallel MPI processes to run. In addition, depending on your system, you may want to investigate the following options:

--j The number of parallel threads to run within each MPI process. We often use 4-6.
--dont_combine_weights_via_disc By default large messages are passed between MPI processes through reading and writing of large files on the computer disk. By giving this option, the messages will be passed through the network instead. We often use this option.
--gpu Use GPU-acceleration. We often use this option.
--cpu Use CPU-acceleration. We often use this option on clusters without GPUs.
--pool This determines how many particles are processed together in a single function call. We often use 10-50 for GPU jobs, and 1-2x the number of threads for CPU jobs.
--no_parallel_disc_io By default, all MPI slaves read their own particles (from disk or into RAM). Use this option to have the master read all particles, and then send them all through the network. We do not often use this option.
--preread_images By default, all particles are read from the computer disk in every iteration. Using this option, they are all read into RAM once, at the very beginning of the job instead. We often use this option if the machine has enough RAM (more than N*boxsize*boxsize*4 bytes) to store all N particles.
--scratch_dir By default, particles are read every iteration from the location specified in the input STAR file. By using this option, all particles are copied to a scratch disk, from where they will be read (every iteration) instead. We often use this option if we don't have enough RAM to read in all the particles, but we have large enough fast SSD scratch disk(s) (e.g. mounted as /tmp).
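The RAM rule of thumb for --preread_images can be checked with a quick shell calculation. The particle count below is an assumption for illustration (roughly what a ~51GB data set of 360-pixel boxes implies); count the entries in your own STAR file for a real estimate:

```shell
# RAM needed to pre-read all particles: N * boxsize * boxsize * 4 bytes
NR_PARTICLES=105000   # assumed particle count; check your own STAR file
BOXSIZE=360           # box size in pixels for this benchmark
BYTES=$(( NR_PARTICLES * BOXSIZE * BOXSIZE * 4 ))
echo "Approximately $(( BYTES / 1024 / 1024 / 1024 )) GB of RAM needed"
```

This prints approximately 50 GB here, in line with the 51GB data-set size mentioned in the notes further down.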

Some of our RELION 2 results

the MRC CPU-cluster

Each node of our cluster has at least 64GB RAM, and an Intel® Xeon® CPU E5-2667 0 (@ 2.90GHz). The 12 cores of each node have Intel® Hyper-Threading Technology enabled, so each physical core appears as two cores to the operating system.

benchmark   time [hr]   nr MPIs   Additional options
Class3D     23:28       60        --pool 100 --j 4 --dont_combine_weights_via_disc

Note: this calculation used 10 nodes (with a total of 120 physical cores, or 240 with Intel® Hyper-Threading Technology enabled). Our cluster nodes do not have large enough scratch disks to store the data, nor is there enough RAM for all slaves to read the data into memory.

pcterm48

This machine has 2 Titan-X (Pascal) GPUs, 64GB RAM, and a 6-core (12-thread) i7-6800K CPU (@3.40GHz).

benchmark   time [hr]   nr MPIs   Additional options
Class3D     8:52        3         --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D     6:21        3         --preread_images --no_parallel_disc_io --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D     5:23        1         --preread_images --gpu 0,0,0,0,1,1,1,1 --pool 100 --j 8 --dont_combine_weights_via_disc
Class3D     4:29        3         --scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc
Class3D     7:31        1         --scratch_dir /ssd --gpu 0 --pool 100 --j 4 --dont_combine_weights_via_disc
Class2D     11:02       3         --scratch_dir /ssd --gpu 0:1 --pool 100 --j 4 --dont_combine_weights_via_disc

Note: Reading the particles from our heavily used /beegfs shared file system is relatively slow. Because 64GB of RAM is only just enough to read the entire data set (51GB) once, the two MPI slaves will have to get pre-read particles from the master (through --no_parallel_disc_io), or one has to run the non-MPI version of the program with different threads on each card. Both approaches provide some speedup compared to reading them from /beegfs, but it is faster to copy all particles to a local SSD disk first.

lg26

This machine has four GTX1080 GPUs, 64GB RAM and an Intel® Xeon® CPU E5-2620 v3 (@ 2.40GHz).

benchmark   time [hr]   nr MPIs   Additional options
Class3D     5:39        3         --scratch_dir /ssd --gpu 0:1 --pool 100 --dont_combine_weights_via_disc --j 6
Class3D     3:42        5         --scratch_dir /ssd --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc --j 6

After an update of the machine to contain 8 GTX1080s, and for a relion-3.0 development version:

benchmark   time [hr]   nr MPIs   Additional options
Class3D     3:31        3         --scratch_dir /ssd --gpu 4:5 --pool 100 --dont_combine_weights_via_disc --j 6

lg27

This machine has two Titan Volta GPUs, 128GB RAM and an AMD Ryzen Threadripper 1950X 16-core processor.

benchmark   time [hr]   nr MPIs   Additional options
Class3D     2:43        3         --scratch_dir /ssd --gpu 0:1 --pool 100 --dont_combine_weights_via_disc --j 6

lg23

This machine has a single Quadro K5200 GPU, 64 GB of RAM, and an Intel® Xeon® CPU E5-2687W v3 (@ 3.10GHz).

benchmark   time [hr]   nr MPIs   Additional options
Class3D     13:07       1         --scratch_dir /tmp --gpu 0 --pool 100 --dont_combine_weights_via_disc

Note: this older card is still very useful! It runs at only about half the speed of the Titan-X (Pascal), yet still beats 10 nodes of our CPU cluster by almost a factor of two!

fmg01-24

These machines are our standard GPU nodes on the cluster. They have four 1080Ti cards, 256 GB of RAM, and an Intel® Xeon® CPU E5-2667 v4 (@ 3.20GHz). The newer nodes (fmg15-24) run Scientific Linux 7 (SL7); the older ones (fmg01-14) run SL6.

benchmark   time [hr]   nr MPIs   OS    Additional options
Class3D     2:51        5         SL6   --scratch_dir /ssd --gpu 0:1:2:3 --pool 3 --dont_combine_weights_via_disc
Class3D     2:26        5         SL7   --scratch_dir /ssd --gpu 0:1:2:3 --pool 3 --dont_combine_weights_via_disc
Class3D     2:34        5         SL7   --scratch_dir /ssd --gpu 0:1:2:3 --pool 100 --dont_combine_weights_via_disc
Class3D     3:06        3         SL7   --scratch_dir /ssd --gpu 0,1:2,3 --pool 100 --dont_combine_weights_via_disc
Class3D     4:30        1         SL7   --scratch_dir /ssd --gpu 0,1,2,3 --pool 100 --dont_combine_weights_via_disc
Class3D     x:xx        1         SL7   --preread_images --gpu 0,1,2,3 --pool 100 --dont_combine_weights_via_disc

Some RELION 3 results obtained on Intel's Endeavour cluster

NVIDIA Tesla V100 - RELION built with CUDA 10.1 and GCC 7.3

Each node has at least 128GB RAM, and two Intel® Xeon® Gold 6148 CPUs (@ 2.40GHz). The 40 cores of each node have Intel® Hyper-Threading Technology enabled, so each physical core appears as two cores to the operating system. The systems used for this test have one NVIDIA Tesla V100 per node, with the exception of one that has four. All I/O to an SSD-based LSF filesystem.

Build command:

cmake -DCUDA_ARCH=70 -DCUDA=ON -DCudaTexture=ON -DGUI=OFF -DCMAKE_BUILD_TYPE=Release ..

benchmark                                                          time [hr]   nr MPIs   Additional options
Class3D - 1 node, 1 V100 - GPU accelerated version                 3:06        5         --pool 30 --j 6 --gpu --dont_combine_weights_via_disc
Class3D - 1 node, 4 V100 - GPU accelerated version                 1:12        9         --pool 30 --j 6 --gpu --dont_combine_weights_via_disc
Class3D - 4 nodes, 1 V100 each node - GPU accelerated version      1:01        28        --pool 30 --j 6 --gpu --dont_combine_weights_via_disc

Intel "Skylake" systems - RELION built for Intel® AVX-2 with GCC 7.3

Each node has at least 128GB RAM, and two Intel® Xeon® Gold 6148 CPUs (@ 2.40GHz). The 40 cores of each node have Intel® Hyper-Threading Technology enabled, so each physical core appears as two cores to the operating system. All I/O to an SSD-based LSF filesystem.

Build command:

CC=/opt/intel/gcc7.3/bin/gcc CXX=/opt/intel/gcc7.3/bin/g++ cmake -DCMAKE_C_FLAGS="-O3 -ftree-vectorize -ffast-math -g -march=broadwell -mtune=broadwell -mavx2 -mpclmul -mfsgsbase -mrdrnd -mf16c -mfma -mbmi -mbmi2 -mprefetchwt1" -DCMAKE_CXX_FLAGS="-O3 -ftree-vectorize -ffast-math -g -march=broadwell -mtune=broadwell -mavx2 -mpclmul -mfsgsbase -mrdrnd -mf16c -mfma -mbmi -mbmi2 -mprefetchwt1" -DCMAKE_EXE_LINKER_FLAGS="-L/opt/intel/gcc7.3/lib64 -L$TBBROOT/lib/intel64_lin/gcc4.7" -DGUI=OFF -DALTCPU=ON -DMKLFFT=ON -DCMAKE_BUILD_TYPE=Release ..

Note: Building for Intel® AVX-512 with -march=skylake-avx512 -mtune=skylake-avx512 and other options needed to get full AVX-512 optimization with GCC 7.3 did not produce performance which improved on the AVX2 build shown here.

benchmark                                     time [hr]   nr MPIs   Additional options
Class3D - 1 node - CPU accelerated version    13:34       8         --pool 40 --j 10 --cpu --dont_combine_weights_via_disc
Class3D - 4 nodes - CPU accelerated version   3:36        16        --pool 40 --j 20 --cpu --dont_combine_weights_via_disc

Intel "Skylake" systems - RELION built for Intel® AVX-2 with Intel(R) C++ Compiler 2018 Update 3

Each node has at least 128GB RAM, and two Intel® Xeon® Gold 6148 CPUs (@ 2.40GHz). The 40 cores of each node have Intel® Hyper-Threading Technology enabled, so each physical core appears as two cores to the operating system. All I/O to an SSD-based LSF filesystem.

Build command:

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -DCMAKE_C_FLAGS="-O3 -ip -g -xCORE-AVX2 -restrict" -DCMAKE_CXX_FLAGS="-O3 -ip -g -xCORE-AVX2 -restrict" -DGUI=OFF -DCMAKE_BUILD_TYPE=Release ..

benchmark                                     time [hr]   nr MPIs   Additional options
Class3D - 1 node - CPU accelerated version    8:21        8         --pool 40 --j 10 --cpu --dont_combine_weights_via_disc
Class3D - 4 nodes - CPU accelerated version   2:10        16        --pool 40 --j 20 --cpu --dont_combine_weights_via_disc

Intel "Skylake" systems - RELION built for Intel® AVX-512 with Intel(R) C++ Compiler 2018 Update 3

Each node has at least 128GB RAM, and two Intel® Xeon® Gold 6148 CPUs (@ 2.40GHz). The 40 cores of each node have Intel® Hyper-Threading Technology enabled, so each physical core appears as two cores to the operating system. All I/O to an SSD-based LSF filesystem.

Build command:

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -DCMAKE_C_FLAGS="-O3 -ip -g -xCOMMON-AVX512 -restrict" -DCMAKE_CXX_FLAGS="-O3 -ip -g -xCOMMON-AVX512 -restrict" -DGUI=OFF -DCMAKE_BUILD_TYPE=Release ..

benchmark                                     time [hr]   nr MPIs   Additional options
Class3D - 1 node - legacy                     15:38       8         --pool 40 --j 10 --dont_combine_weights_via_disc
Class3D - 1 node - CPU accelerated version    4:05        8         --pool 40 --j 10 --cpu --dont_combine_weights_via_disc
Class3D - 4 nodes - legacy                    3:47        16        --pool 40 --j 20 --dont_combine_weights_via_disc
Class3D - 4 nodes - CPU accelerated version   1:08        16        --pool 40 --j 20 --cpu --dont_combine_weights_via_disc

Accelerated RELION, using GPUs or CPU-vectorization

Most of the benchmarks above use GPU acceleration by adding the --gpu flag, or the accelerated CPU version by adding the --cpu flag. Either option lets RELION run accelerated code for its largest computations. This section gives more information on how to make the most of RELION when you use GPUs or CPU vectorization to speed up calculations.

Using GPU-acceleration

Which programs have been GPU-accelerated?

RELION-2.0+ is GPU-accelerated for:

  1. refine, refine_mpi (only the slaves, not the master!)
  2. autopick, autopick_mpi (master and slaves)

Which cards can I use?

The implemented GPU support is compatible with CUDA compute capability 3.5 or higher. See Wikipedia's CUDA page for a complete list of such GPUs.

Typical GPU usage

If you have one or more CUDA-capable GPUs, using them in RELION is as easy as adding the flag --gpu. If this flag is used without arguments, RELION will by default distribute its processes over all available (visible) GPUs. This default behaviour is likely the preferred one if you do not need to share the computer with anyone else and you only want to run a single job at a time. Assuming you have a 4-GPU setup, the GPUs will be numbered 0, 1, 2 and 3. In general, you will most likely want to run a single MPI process on each GPU. You could then just use the --gpu option without arguments and specify 4 working (slave) MPI ranks:

mpirun -n 4 `which relion_autopick_mpi` --gpu

For classification and refinement jobs, the master does not share the heavy calculations performed on the GPU, so you would use:

mpirun -n 5 `which relion_refine_mpi` --gpu

Note that 3D auto-refinement always needs to be run with at least 3 MPI processes (a master, and one slave for each half-set). Therefore, machines with at least two GPU cards would be preferable for refinement using GPUs. If you need to (or want to) run multiple mpi-ranks on each GPU, RELION will attempt to do so in an efficient way if you simply specify more ranks than there are GPUs.

You can run multiple threads just as with previous versions of RELION, using the --j <x> option. Each MPI process will launch the specified number of threads. This may speed up calculations without costing much extra memory, either on the CPU (RAM) or on the GPU. On a 4-GPU development machine with 16 visible cores, we often run classifications or refinements using:

mpirun -n 5 `which relion_refine_mpi` --j 4 --gpu

This produces 4 working (slave) MPI ranks, each with 4 threads: a single rank per card, while allowing multiple CPU cores to use each GPU, which maximizes overall hardware utilization. Each MPI rank requires its own copy of large objects in CPU and GPU memory, but if everything fits into memory it may in fact be faster to run 2 or more MPI processes on each GPU, as the MPI processes can drift out of sync, so that one MPI process is busy doing calculations while another is, for example, reading images from disk. NOTE: it is a common misconception that you NEED multiple MPI processes to use multiple GPUs; this is not the case. One MPI process can use many GPUs, since it can spread its threads across them.
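As an illustration of that last point, the pcterm48 benchmarks above include a single-process run that spreads 8 threads over two cards. Without MPI, the corresponding invocation is simply:

```shell
# one relion_refine process; threads 1-4 go to GPU0 and threads 5-8 to GPU1
relion_refine --j 8 --gpu 0,0,0,0,1,1,1,1 <other flags and options>
```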

Specifying which GPUs to use

This section describes more advanced syntax for restricting RELION processes to certain GPUs on multi-GPU setups. You can give the --gpu option an argument that lists device indices. The syntax is to delimit ranks with colons [:] and threads with commas [,]. Any GPU indices provided are taken to be a list, which is repeated if it is shorter than the total number of GPUs. By extension, the following rules apply:

  1. If a GPU id is specified more than once for a single MPI rank, that GPU will be assigned proportionally more of that rank's threads.
  2. If no colons are used (i.e. GPUs are only specified for a single rank), then the specified GPUs apply to all ranks.
  3. If GPUs are specified for more than one rank but not for all ranks, the unrestricted ranks are assigned the same GPUs as the restricted ranks, by a modulo rule.

For example, if you want to use only two of the four GPUs, because you want to leave the other two free for a different user or job, you can specify:

mpirun -n 3 `which relion_refine_mpi` --gpu 2:3
    # slave 1 uses GPU2; slave 2 uses GPU3

If you want an even spread over ALL GPUs, you should not specify selection indices; RELION will handle this itself. On the hypothetical 4-GPU machine, you would simply say:

mpirun -n 3 `which relion_refine_mpi` --gpu
    # slave 1 uses GPU0 and GPU1 for its threads; slave 2 uses GPU2 and GPU3

One can also schedule individual threads from MPI processes on the GPUs. This is most useful when available RAM is a limitation. One could, for example, run 3 MPI processes, each of which spawns a number of threads on specified cards, as follows:

mpirun -n 3 `which relion_refine_mpi` --j 4 --gpu 0,1,1,2:3
    # slave 1 puts thread 1 on GPU0, threads 2 and 3 on GPU1, and thread 4 on GPU2;
    # slave 2 puts all 4 threads on GPU3

Finally, for completeness, the following is a more complex example to illustrate the full functionality of the GPU-device specification options.

mpirun -n 4 ... --j 3 --gpu 2:2:1,3
    # slave 1 runs 3 threads on GPU2; slave 2 runs 3 threads on GPU2;
    # slave 3 distributes its 3 threads as evenly as possible across GPU1 and GPU3

Slave ranks 1 and 2 are restricted to device 2, while slave rank 3 is restricted to devices 1 and 3. All ranks simply distribute themselves uniformly across their restrictions, which results in a higher memory load on device 2 and no utilization of device 0 (note that each rank resident on a device has a shared memory object for its threads on that device, so device 2 carries roughly twice the memory load of device 1 in this case). The mapping will be (rank/thread : GPU id):

(1/0):2
(1/1):2
(1/2):2
(2/0):2
(2/1):2
(2/2):2
(3/0):1
(3/1):3
(3/2):1

Monitor GPU use and performance

If you are on a machine with CUDA-capable GPUs, you can ask the driver to list them and their current usage with the command nvidia-smi:

nvidia-smi
Thu Apr 14 15:38:15 2016       
+------------------------------------------------------+                       
| NVIDIA-SMI 352.68     Driver Version: 352.68         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:04:00.0     Off |                    0 |
| N/A   33C    P8    26W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:05:00.0     Off |                    0 |
| N/A   30C    P8    27W / 149W |     55MiB / 11519MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           On   | 0000:83:00.0     Off |                    0 |
| N/A   42C    P0    82W / 149W |  11397MiB / 11519MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   57C    P0    95W / 149W |  11397MiB / 11519MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    2     27260    C   ...kspace/relion_build/bin/relion_refine_mpi 11340MiB |
|    3     27262    C   ...kspace/relion_build/bin/relion_refine_mpi 11340MiB |
+-----------------------------------------------------------------------------+

In this case you can see 4 GPUs, two of which are being used by a RELION run at very high usage. From the display, one can assess the utilisation of each device through the "GPU-Util" percentage. It is, however, possible to see high utilisation without using the GPU at full capacity, as the clock frequency might be too low: for instance, if the temperature of the GPU rises too high, the clock frequency is lowered to avoid damage to the hardware. If the power draw ("Pwr" in the above) is close to the displayed maximum, that is a better indication that the GPUs are being fully utilised. To see a continually updated version of this display, you can use e.g.

watch nvidia-smi

To monitor clock frequency and GPU performance in a way that can more easily be written to a log file, it may be better to use the automatically refreshing command

nvidia-smi dmon

Leaving some GPU memory free

By default, RELION will grab almost all of the memory on the GPU, which may not be optimal, e.g. on a shared workstation. The additional flag --free_gpu_memory specifies how many MB are to be left free for dynamic allocation and other processes. Normally you should not have to use it, but when overlapping different runs with low memory demands (such as autopicking and 2D classification) during pipelined or on-the-fly processing, it may be useful to leave about half the GPU memory available to other processes.
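As a sketch, to leave roughly half of an 11GB card free for a second job (the value of 5500 MB here is purely illustrative, not a tuned recommendation):

```shell
# leave ~5.5GB of GPU memory free for a concurrently running job
mpirun -n 3 `which relion_refine_mpi` --gpu --free_gpu_memory 5500 <other flags and options>
```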

Using CPU-acceleration

Which programs have been CPU-accelerated?

RELION-3.0+ is CPU-accelerated for

  • relion_refine
  • relion_refine_mpi (only the slaves, not the master!)

A note regarding CPU-based classification without alignment:

RELION does not permit use of GPUs for classification without alignment, since there is not sufficient parallelism to use GPUs efficiently. It might therefore seem intuitive to use CPU acceleration to benefit this type of run. CPU acceleration, however, relies on exposed parallelism just like the GPU acceleration. Classification without alignment will therefore not show a dramatic speedup over the conventional CPU code in the accelerated CPU version either: it is simply difficult to construct an efficient parallel program for this type of run, which exposes little parallelism and is heavy on file input/output.

Using the Accelerated version

If you compiled RELION for CPU acceleration, you can call relion_refine or relion_refine_mpi with the --cpu flag. Doing so will run the accelerated version; otherwise the original CPU version of RELION will be run. There is no reason not to use this flag if it is available, unless you want to verify old runs or behaviour. The rest of this section provides details on how to compile this version of RELION, and how to attain the best possible performance with it.

Build instructions

RELION 3.0+ is compiled with either GPU or CPU acceleration. To have access to both, one currently has to compile two separate versions and call each as desired. There is currently no out-of-the-box way to have both in the same executable.

To compile RELION with CPU acceleration, add -DALTCPU=ON to your cmake configuration (for details, see the installation section). Building with this option requires the TBB (Threading Building Blocks) library; RELION will look for TBB on your system and fetch and install it if it cannot be found. You can force this fetch and install (and make sure you are using the latest TBB version) by adding -DFORCE_OWN_TBB=ON to your cmake configuration. In addition, you can make use of the Intel Math Kernel Library (Intel MKL) when using the Intel C++ Compiler (ICC). This is optional (but scales better with increasing thread counts), and can be enabled through -DMKLFFT=ON. Note that ICC is not a free compiler like GCC, but ICC-compiled binaries can be run with the free ICC runtime.
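Putting the options above together, a minimal CPU-accelerated cmake configuration might look as follows (a sketch only; the compiler choice and architecture flags, which matter a great deal for performance, are discussed below):

```shell
# CPU acceleration with RELION's own TBB copy and MKL FFTs, no GUI
cmake -DALTCPU=ON -DFORCE_OWN_TBB=ON -DMKLFFT=ON -DGUI=OFF -DCMAKE_BUILD_TYPE=Release ..
```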

Compilers (GCC, ICC, ...) by default generate executable programs that run on most available hardware, i.e. they aim for broad compatibility. This general compatibility comes at the price of a generally slower program, so you can benefit considerably from compiling for the specific hardware RELION will run on. If your cluster has hardware of various ages and you want the best performance from each, you can either build one general, lower-performing binary, or several high-performing RELION binaries, one for each hardware generation.

Compiling a fast CPU-accelerated RELION binary

We recommend that you build RELION 3.0 with Intel Parallel Studio XE 2018 Cluster Edition (the ICC-compiler). With the Plasmodium ribosome benchmark, we noted that GCC 7.3 hardware-optimized builds appear to run much slower than those built with Intel Parallel Studio XE 2018 Cluster Edition (see above). ICC is not free. However, if you want to run a version of RELION built with the Intel compiler, you can download the runtime libraries for free from:

Intel C++ Compiler libraries

Intel Math Kernel Library and Intel MPI: download at least "Intel Math Kernel Library (Intel MKL)" and "Intel MPI Library (Linux Package)", and make sure the product "Intel Performance Libraries for Linux*" is selected at the top.

The above runtime libraries need to be installed in a location that is accessible to all machines running the ICC-compiled binary.

To build with ICC, you need to set up your environment:

source /opt/intel/impi/<version>/intel64/bin/mpivars.sh intel64
source /opt/intel/compilers_and_libraries_<version>/linux/bin/compilervars.sh intel64
source /opt/intel/compilers_and_libraries_<version>/linux/mkl/bin/mklvars.sh intel64

You also need to define the following environment variables before running cmake (or inline with your cmake command):

export CC=mpiicc
export CXX=mpiicpc

Building for specific architectures

AMD EPYC™

CC=mpiicc CXX=mpiicpc cmake -D CMAKE_C_FLAGS="-O3 -ip -g -march=core-avx2 -restrict " -D CMAKE_CXX_FLAGS="-O3 -ip -g -march=core-avx2 -restrict " -DGUI=OFF -DALTCPU=ON -DMKLFFT=ON -D CMAKE_BUILD_TYPE=Release ..

Intel® AVX2-compatible CPUs (this binary will only run on Intel® AVX2 platforms)

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -D CMAKE_C_FLAGS="-O3 -ip -g -xCORE-AVX2 -restrict " -D CMAKE_CXX_FLAGS="-O3 -ip -g -xCORE-AVX2 -restrict " -DGUI=OFF -D CMAKE_BUILD_TYPE=Release ..

Intel® AVX512-compatible CPUs (this binary will only run on Intel® AVX512 platforms)

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -D CMAKE_C_FLAGS="-O3 -ip -g -xCOMMON-AVX512 -restrict " -D CMAKE_CXX_FLAGS="-O3 -ip -g -xCOMMON-AVX512 -restrict " -DGUI=OFF -D CMAKE_BUILD_TYPE=Release ..

Intel® AVX512-compatible CPUs and Intel® AVX2-compatible CPUs (this binary will run best on Intel® AVX2 and Intel® AVX512 platforms, and more slowly on all others - note that this build will create a lot of informational messages)

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -D CMAKE_C_FLAGS="-O3 -ip -g -axCORE-AVX2,COMMON-AVX512 -restrict " -D CMAKE_CXX_FLAGS="-O3 -ip -g -axCORE-AVX2,COMMON-AVX512 -restrict " -DGUI=OFF -D CMAKE_BUILD_TYPE=Release ..

Intel® Xeon Phi™ Processors (x700 Family)

CC=mpiicc CXX=mpiicpc cmake -DCUDA=OFF -DALTCPU=ON -DCudaTexture=OFF -DMKLFFT=ON -D CMAKE_C_FLAGS="-O3 -ip -g -xMIC-AVX512 -restrict " -D CMAKE_CXX_FLAGS="-O3 -ip -g -xMIC-AVX512 -restrict " -DGUI=OFF -D CMAKE_BUILD_TYPE=Release ..

Optimizing performance for CPU-accelerated RELION

Best performance is seen when the pool size (--pool) is roughly twice the number of threads (--j), and not lower than about 30. However, lowering the pool size may also decrease the memory used by a process or rank.
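This rule of thumb amounts to the following small calculation (a sketch of the arithmetic only, not a RELION tool):

```shell
# choose --pool as roughly twice --j, but not lower than about 30
J=20                        # threads per rank (--j)
POOL=$(( 2 * J ))
if [ "$POOL" -lt 30 ]; then POOL=30; fi
echo "--j $J --pool $POOL"
```

With 20 threads per rank this yields --pool 40, as used in the cluster examples below.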

When running multi-node, we see best results with 4 MPI ranks per node on dual-socket systems with Intel® processors (so --j should be set to the total virtual cores in the machine divided by 4, with --pool twice that value), although this may also depend on the data set. 8 MPI ranks per node seems to work best on dual-socket AMD EPYC™ systems. Clusters with less than 128GB of RAM per node will want to run with fewer ranks per system.

We recommend running with more than 9 threads per rank (--j) if at all possible, which may mean dropping to 3 ranks per node. We have observed that decreasing the number of ranks and increasing the number of threads typically results in shorter runtimes than increasing the number of ranks and decreasing the number of threads. We recommend running with Intel® Hyper-Threading Technology enabled.

Systems with 256GB or more are recommended for the CPU-accelerated kernels.

To get best performance, set the following environment variables and/or put them in-line with your MPI run commands:

export OMP_SCHEDULE="dynamic"
export KMP_BLOCKTIME=0

Our best performance on a single dual-socket machine of the Broadwell generation or later was with Intel® Hyper-Threading Technology enabled, 8 MPI ranks (mpirun -n 8), enough threads per rank to use all the available logical cores (--j <#logicalcores/8>), and a pool size of at least 40 (--pool 40). Haswell-generation machines with Intel® Hyper-Threading Technology enabled might benefit from a single MPI rank with as many threads (--j) as virtual cores; in that case, set the pool size to double the thread count.

As an example, a dual-socket Intel® Xeon® Gold 6148 processor system with 80 logical cores (Intel® Hyper-Threading Technology enabled) would be configured with 8 ranks, each with 80/8 = 10 threads per rank, and a pool size of 40, like so:

mpirun -n 8 relion_refine_mpi <other flags and options> --j 10 --pool 40

When running on a cluster, the recommendations above apply per node. There does not seem to be much benefit to running more than 4 MPI ranks per node on dual-socket systems, although quad-socket systems should show better performance by doubling that number.

For example, if your cluster consists of dual-socket Intel® Xeon® Gold 6148 processor systems with Intel® Hyper-Threading Technology enabled, you would configure RELION to run with 4 ranks per node, 20 threads per rank, and a pool size of 40 like so:

mpirun -np <total_ranks> -perhost 4 relion_refine_mpi <other flags and options> --j 20 --pool 40

We don't recommend using more than 16 machines in total for a given RELION job at this time, as scaling beyond that appears limited. This of course depends on the size of the input data, but RELION's data-flow management is not equipped to handle arbitrarily large jobs, so scaling will likely be limited by both the available task parallelism and latency at this point.

Computer hardware options and considerations

RELION will run on any Linux-like machine, from a laptop to a supercomputer. While a laptop will suffice to view results and manage the pipeline workflow, processing can get quite demanding, depending on your data. The minimal equipment required is typically a dual-socket workstation or a GPU desktop (gaming-like). A minimum of 64 GB of RAM is recommended to run RELION with small image sizes (say 200x200) on either the original or accelerated versions of RELION. 360x360 problems run best on systems with more than 128GB of RAM, and systems with 256GB or more RAM are recommended for the CPU-accelerated kernels on larger image sizes. Below is more detailed information regarding each flavour of hardware.

Single machine (workstation)

A single machine may suffice for projects making use of cryo-EM, and provides a compact solution without the need for additional infrastructure. As noted above, efficient RELION processing generally requires a fairly high-end computer, typically with one or more GPUs.

Assemble your own GPU machines

Our collaborator at SciLifeLab in Stockholm, Erik Lindahl, maintains a useful blog with GPU hardware recommendations. Briefly, you'll need an NVIDIA GPU with a CUDA compute capability of at least 3.0, but you don't need the expensive double-precision NVIDIA cards: the high-end gamer cards will also do, but do see Erik's blog for details! Note that 3D auto-refine will benefit from 2 GPUs, while 2D and 3D classification can be run just as well with 1 GPU. Apart from your GPUs, you'll probably also benefit from a fast scratch disk (e.g. a 400GB SSD), especially if your working directories will be mounted over the network connecting multiple machines.

Ready-to-use GPU machines

There are now also hardware providers that sell machines with RELION pre-installed. Please note that Sjors and Erik have NO financial interest in this. However, because we do think these companies may provide useful solutions for part of our user base, we mention them here:

CPU cluster

If you're using a CPU-only cluster, we recommend at least 8 modern dual-socket nodes. There are many different options when buying a cluster; we recommend speaking to your university's high-performance computing experts when planning a purchase.

Cloud computing

Another option is to run on the Amazon EC2 cloud. This was pioneered by Michael Cianfrocco and Andres Leschziner. You can read about it in the eLife paper. They ran the same benchmark described above on the GPU nodes available through the EC2 cloud. You can read about their results on their own benchmark site.

As of January 2019, we recommend c5*.18xlarge, r5*.12xlarge or r5*.24xlarge instances for CPU-only processing (CPU performance is best with large numbers of threads and plenty of memory), and p3.8xlarge instances (at least) for GPU computations. Each instance should be configured with the maximum "vCPUs" allowed per model, and RELION then run with ranks per model and threads per rank configured as you would for a CPU cluster, substituting "vCPU" for "virtual core" in the recommendations above.
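For instance, on a p3.8xlarge instance (listed by Amazon with 4 V100 GPUs and 32 vCPUs; verify the current specification before relying on it), one could mirror the single-node V100 settings from the benchmarks above, with one slave rank per GPU:

```shell
# 1 master + 4 slaves (one per V100); 4 slaves x 6 threads fit within 32 vCPUs
mpirun -n 5 `which relion_refine_mpi` <other flags and options> --gpu --j 6 --pool 30 --dont_combine_weights_via_disc
```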