Minimise computational costs

From Relion

This page explains where RELION's computational costs arise, so that users have the information they need to minimise the costs of their MAP refinements.

Understanding the algorithm

Expectation

The expectation step is the "alignment" step: each experimental image is compared to projections of the reference map in all orientations. Consequently, this is the most expensive step in terms of CPU. CPU costs increase with increasingly fine angular or translational sampling rates (linearly with the number of orientations sampled; see NrHiddenVariableSamplingPoints in the stdout file). When classifying, the CPU costs also increase linearly with the number of references used. In terms of memory (RAM), the expectation step may also be quite costly, in particular if large images are used. The scaling behaviour is somewhat complicated, as more data is kept in memory as the resolution increases. For Niko's recoated rotavirus data, we used 2x downsized images of 400x400 pixels, which still fitted into our 8x2Gb machines; using the original 800x800 pixel images did not.
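The scalings above can be sketched in a back-of-the-envelope cost model. The linear dependence on orientations and references is stated above; the quadratic dependence on box size (pixels compared per image) is an illustrative assumption for this sketch, not a RELION-documented formula, and the result is in relative units only:

```python
# Illustrative cost model for the expectation step (relative units only).
# Linear scaling in orientations and classes follows the text above; the
# box_px**2 term (pixels compared per image) is an assumption for this sketch.
def expectation_cost(n_images, n_orientations, n_classes, box_px):
    return n_images * n_orientations * n_classes * box_px ** 2

# Halving the box size (2x downscaling) cuts the per-comparison work ~4x:
full = expectation_cost(100_000, 9216, 1, 800)
down = expectation_cost(100_000, 9216, 1, 400)
print(full / down)  # -> 4.0
```

Only ratios between two such estimates are meaningful; the absolute numbers carry no units.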

If you ever get the following error in your stderr file: Allocate: No space left, then you know you have run out of memory. In that case, first check whether you are running the intended number of MPI processes on each node. You can monitor memory usage and the number of MPI processes on your nodes by logging into them and using the "top" command. It often takes some work to set up your job submission system to handle hybrid parallelization, i.e. jobs that use both MPI and threads. See the installation page for more details on how to do this.


The total wallclock time needed to run the expectation step may be greatly reduced using parallel computing. This has been implemented at two different levels: MPI (message passing interface) is used to communicate between different computing nodes (separate computers that are connected to each other over a network), while so-called threads are used to parallelize tasks among the multiple cores of modern multi-core computers. Threads have the advantage of sharing the memory of one computer (so that memory does not need to be replicated for each thread). MPI has the advantage of scalability: one can always buy more computers and link them together into a larger cluster, while there is a maximum to the number of cores in a single computer. The recommended way to run RELION (in particular for 3D refinements, where memory requirements are larger than in 2D) is to use as many threads as there are cores on your nodes, and then run one MPI process on each node. For the 3D auto-refine option, be aware that the two independent half data sets are refined on two halves of the slaves, while a single master node directs everything. Therefore, it is most efficient to use an odd number of nodes, and the minimum number of nodes to use is 3.
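The odd-node recommendation can be made concrete with a small sketch: one node acts as the master, and the remaining workers are split into two groups, one per independent half-set. With an even total node count the two groups cannot be equal, so one worker effectively waits on the slower half. The helper function below is illustrative, not part of RELION:

```python
# Sketch of node allocation in 3D auto-refine: one master node plus two
# equal-as-possible groups of slave nodes, one group per half-set.
def half_set_workers(total_nodes):
    if total_nodes < 3:
        raise ValueError("3D auto-refine needs at least 3 nodes")
    workers = total_nodes - 1          # one node is the master
    return workers // 2, workers - workers // 2

print(half_set_workers(5))  # -> (2, 2): balanced half-sets
print(half_set_workers(6))  # -> (2, 3): the extra node gains little
```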

Maximization

The maximization step is the "reconstruction" step. This step is typically much faster than the expectation step. However, it is not parallelized very well. The only parallelization implemented is that multiple reconstructions (e.g. in case of classification, or the two independent reconstructions for gold-standard FSCs) are performed in parallel. Implementation of threads in the FFTW library yielded limited speed-ups in release 1.1, but this implementation was removed from release 1.2 due to instabilities.

Although not very slow, the maximization step does take quite a bit of memory, scaling cubically with the image size. It could be that you have no memory problems in the expectation step, but that you run out of memory in the maximization step. The only solution to this is to use smaller (downscaled) images. You can downscale your images in the RELION Preprocessing procedure.
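The cubic scaling makes the pay-off of downscaling easy to quantify. Absolute memory use depends on padding, precision and implementation details, so the sketch below deals only in ratios:

```python
# Relative RAM needed by the maximization (reconstruction) step scales
# roughly with box_size**3, as stated above. Only ratios are meaningful.
def relative_reconstruction_ram(box_px):
    return box_px ** 3

ratio = relative_reconstruction_ram(800) / relative_reconstruction_ram(400)
print(ratio)  # -> 8.0: 2x downscaling cuts reconstruction memory ~8x
```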

In RELION 3.1, one can Skip gridding. In our tests on a limited number of datasets, this does not seem to harm the resolution, but we welcome more feedback.

Tips to increase speed

Don't use too many classes in 2D classification

We hardly ever use more than 200 classes in 2D classification. Even for very large data sets, 200 classes seems to be just fine.

Don't waste time on bad data

If after a first (or perhaps second) round of 2D classification the 2D classes do not look good, then do not waste valuable CPU time on these data. Go back to the bench and the microscope and get better data. To use Chris Russo's words: "for bad data, the fastest way to process is rm -rf *"...

Use down-scaled particles

We often use relatively high magnifications in the microscope and end up with large boxed particles that take a lot of time to read from disc and to move around in MPI messages. Therefore, things can be sped up significantly by downscaling your particles, especially for the initial 2D and 3D classification runs, where one often only needs lower resolutions. Remember that the maximum attainable resolution from any run is two times the pixel size, so a down-sampled pixel size of 4.0 Angstrom can still give you 8.0 Angstrom reconstructions/2D class averages. This is usually more than enough to get a good separation of suitable particles from junk particles, and often even to separate out the main conformational variability.
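The resolution limit mentioned above is the Nyquist criterion, which is trivial to compute:

```python
# Nyquist limit: the best attainable resolution is twice the pixel size.
def nyquist_resolution(pixel_size_angstrom):
    return 2.0 * pixel_size_angstrom

print(nyquist_resolution(4.0))  # -> 8.0 Angstrom, as in the example above
```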

You can downsample your images using the Rescale particles? option on the extract tab of the Particle extraction job-type. Once you have done your initial 2D and 3D classifications, you can re-extract the selected particles at the original scale (or a less down-sampled one) for the high-resolution refinements.

Do classifications without alignments

Many forms of compositional or conformational variability can be separated very well in 3D classification runs where you omit the alignments. This is based on the assumption that alignment of all particles against a single, consensus reference structure (in a previous 3D auto-refine run with all particles) still leads to good orientations. You can do this by inputting the output data.star file from the consensus refinement (e.g. Refine3D/run1_data.star) as the input STAR file in the 3D classification run, and then on the Sampling tab set Perform image alignment? to No. This has two major advantages:

  1. Your classification will go MUCH faster compared to searching over many/all orientations
  2. You can focus your classification by providing a mask (Optimisation tab -> Reference mask) that is only white in the region of the particle that you would like to classify on. In our hands, this has been particularly powerful in classifying out bound/unbound factors that are relatively small compared to the entire complex, but is also proving useful in an increasing number of other cases, including floppy domains dangling off a more rigid scaffold structure, provided the mask is wide enough to encompass multiple conformations of the floppy part.

Limit orientational searches in 2D/3D classifications

Decreasing the angular sampling in 2D classifications, or decreasing the range of offset searches in 2D or 3D classifications, are powerful ways to gain speed. The default values suggested on the GUI may be overkill for your data, but this ultimately depends strongly on what you are trying to classify.
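To see how much these settings matter, one can count the sampling points in a 2D classification: in-plane rotations times translational offsets. The square offset grid below is a simplification for this sketch (RELION actually searches offsets within a circle), so treat the numbers as illustrative:

```python
# Illustrative count of sampling points in a 2D classification:
# (number of in-plane rotations) x (number of offsets on a square grid).
# The square grid is a simplification; RELION searches within a circle.
def n_sampling_points(psi_step_deg, offset_range_px, offset_step_px):
    n_rot = int(360 / psi_step_deg)
    n_trans_1d = 2 * int(offset_range_px / offset_step_px) + 1
    return n_rot * n_trans_1d ** 2

coarse = n_sampling_points(12, 5, 1)   # 30 rotations x 121 offsets = 3630
fine = n_sampling_points(6, 5, 1)      # 60 rotations x 121 offsets = 7260
print(fine / coarse)  # -> 2.0: halving the angular step doubles the work
```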

Reuse of scratch

When a new job starts, it deletes everything in the scratch directory and copies the particles again. If the previous job crashed, the same set of particles is already on scratch, so deleting and copying them again is wasteful; the --reuse_scratch option reuses the existing particles instead. When a job finishes, it clears the scratch; --keep_scratch prevents this.

These options are useful when you plan to run a sequence of Refine3D, CtfRefine and Refine3D again. In the first Refine3D job, specify --keep_scratch to keep copied particles even after successful job completion. Then run CtfRefine. In the second Refine3D, specify --reuse_scratch to use particles copied in the first Refine3D job.

Note that you cannot reuse scratch when the particle set was changed (e.g. after Subset selection). RELION does not verify the consistency of the input STAR file and the state of the scratch space. It is your responsibility to judge when you can --reuse_scratch.

Efficient use of scratch

This section is relevant only when you run RELION on multiple physical nodes.

By default (i.e. Use parallel disc I/O: Yes), each node reads images that are processed by the node. This happens every iteration. With Pre-read all particles into RAM or Copy particles to scratch directory, all nodes copy everything to local scratch or memory only once at the beginning of the job. Admittedly this is not very efficient and refactoring of this is on our TODO list (but not for 3.1).

When Use parallel disc I/O: No, the master reads particles and sends them to worker nodes every iteration. When combined with Pre-read all particles into RAM or Copy particles to scratch directory, the master copies everything to local scratch or memory once at the beginning of the job and sends it to workers every iteration.

Thus, if the disk is slow but connection between computation nodes is fast, we recommend Use parallel disc I/O: No with Pre-read all particles into RAM or Copy particles to scratch directory. Otherwise, all nodes access the storage and slow it down.
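The recommendation above amounts to a simple decision rule. The helper below is only a sketch of that rule; the boolean inputs and the function itself are illustrative, not RELION parameters:

```python
# Decision sketch for the I/O settings described above (illustrative only).
def recommend_io(shared_disk_is_slow, interconnect_is_fast):
    if shared_disk_is_slow and interconnect_is_fast:
        # Only the master touches the slow storage; data travels over MPI.
        return ("Use parallel disc I/O: No",
                "Pre-read all particles into RAM (or copy to scratch)")
    # Otherwise let every node read its own particles from storage.
    return ("Use parallel disc I/O: Yes", None)

print(recommend_io(True, True)[0])   # -> Use parallel disc I/O: No
print(recommend_io(False, True)[0])  # -> Use parallel disc I/O: Yes
```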

Tips to decrease CPU memory requirements

RELION can be compiled in single-precision. This will reduce RAM-requirements to 50% of the double-precision code. Runs will no longer be numerically the same as the double-precision runs (not even when specifying the same --random_seed), but preliminary tests indicate that the single-precision code still gives good results.
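The 50% figure follows directly from the value sizes: a double-precision value takes 8 bytes and a single-precision value takes 4. The 400-pixel box in the example below is an arbitrary illustration:

```python
# Why single precision halves RAM: 8 bytes per double vs 4 bytes per float.
# The 400**3 voxel volume is an arbitrary example size.
n_values = 400 ** 3
double_gb = n_values * 8 / 1e9
single_gb = n_values * 4 / 1e9
print(single_gb / double_gb)  # -> 0.5
```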

Tips to decrease GPU memory requirements

By setting Skip padding? to Yes, you can reduce the GPU memory requirement to roughly one eighth! When your box is tight, you might see artifacts in the corners of the box. For Refine3D and MultiBody refinement, this is harmless as long as you mask it out in the PostProcess step. For Class3D, this might lead to worse results.
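The "roughly one eighth" comes from the padding factor: RELION by default pads the reference by a factor of 2 in each dimension before the Fourier transform, so skipping padding shrinks the padded volume by 2 cubed. Treat the exact saving as approximate, since other buffers also occupy GPU memory:

```python
# Skipping the default 2x padding shrinks the padded volume in each of
# the three dimensions, so the memory saving is roughly pad**3.
pad = 2
saving = pad ** 3
print(saving)  # -> 8, i.e. roughly one eighth of the memory is needed
```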