PreProcessing

If you have already performed your preprocessing in a different package (and you don't want to repeat it, despite the ease and speed of the Preprocessing procedure outlined below), then please see the Prepare_input_files page for more information.


Motion correction

As of version 2.0, RELION implements wrappers to UCSF MOTIONCORR and to Niko Grigorieff's UNBLUR, which can be used to perform whole-image beam-induced motion correction on direct-electron detector movies.

CTF-estimation

RELION implements a wrapper to Niko Grigorieff's CTFFIND3 and to Alexis Rohou's CTFFIND4. See Niko Grigorieff's web site for details on CTFFIND3. As of version 2.0, RELION also implements a wrapper to Kai Zhang's Gctf program.

Particle picking

As of version 1.3, RELION implements functionality for both manual and reference-based, semi-automated particle picking. The autopicking uses procedures inspired by Alan Roseman's findEM. However, unlike findEM, it is based on probabilities (with Gaussian pdfs and thus squared difference terms), which makes the auto-picking in RELION very sensitive to the intensity scale of the references used for picking. Ideally, one therefore searches the micrographs with 2D class averages that were generated from the same data set (or one of similar quality). It is not a good idea to use projections from an atomic model, or negative stain class averages, to search for particles in your cryo-EM micrographs. For this reason, we typically pick particles manually (a few hundred to several thousand) from a subset of the recorded micrographs, use these to calculate 2D class averages in a preliminary 2D classification run, and then use the best classes from that run to autopick in all micrographs.

Both the selection of micrographs on which to pick and the classes to use as references should be given as input STAR files. If the classes were calculated using CTF correction, then the input STAR file with the micrographs should also contain CTF information for each of the micrographs, and the same CTF settings should be used on this tab as were used in the 2D classification run. By limiting the resolution of the autopicking, for example to 25 Angstroms, one can prevent the pitfalls of "Einstein from noise". In general: if you cannot see the particles, then they are probably NOT there. In that case you'd better spend your time on making better samples and/or grids rather than on running RELION or any other single-particle analysis program. For most cases an in-plane rotational sampling of 5 degrees does the job.

There are two important parameters to optimise: a "picking threshold" and a "minimum inter-particle distance" (in Angstroms). The threshold ranges from 0 (pick everything as a particle) to 1 (pick very few particles). The inter-particle distance is often set to 50-75% of the particle diameter (but this may depend on the shape of your particles). Because it is very hard to predict the optimal values for these parameters (especially for the threshold), they typically need some tweaking. The program's computational cost scales linearly with the number of references (and the in-plane sampling) used. For a typical 4k x 4k micrograph, a 5-degree sampling (which is enough for most data sets) and ~10 references, we see computation times of around half an hour. That may be too slow to allow for a lot of tweaking of the parameters. The option therefore exists to write so-called figure-of-merit (FOM) maps to disk. These are intermediate results from which the actual particle positions are picked. One can write out these maps in a first run, and then re-read the same maps in a series of subsequent runs (each of which would only take a few seconds) in order to find the best parameters. There is one drawback: the FOM maps are very large files and there are many of them. Because your file system may become overloaded, the option to write FOM maps is not available in the parallel version of the autopicking program. The recommended way of using this procedure is therefore:

  • Select a few (2-5?) micrographs: e.g. including at least one with a higher defocus and one with a lower defocus.
  • Using this subset STAR file as input, run the autopicker in sequential mode (1 MPI proc. on the Running tab) with "Write FOM maps? : Yes ".
  • Visualize the resulting coordinates on the micrographs, and decide on better values for the picking threshold and inter-particle distance for each of the selected micrographs, and try to find a compromise for all of them.
  • Re-run the autopicking program, now with "Write FOM maps? : No " and "Read FOM maps? : Yes ". This will be very quick. If you left the display program of the micrographs open, you can just right-mouse click to select Reload coordinates.
  • Repeat the previous step until you have found values for the threshold and inter-particle distance that are a good compromise for all selected micrographs.
  • Delete the temporary FOM maps (rm Micrographs/*.spi) to free disk space.
  • Then, finally run the autopicking program on a STAR file containing all micrographs (plus their CTF information if necessary) with "Write FOM maps? : No " and "Read FOM maps? : No " using as many MPI processes as deemed necessary.
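
The tuning loop above is cheap because the expensive step (computing the FOM map) is separated from the cheap step (selecting peaks from it). As a rough illustration only, the following Python sketch re-picks peaks from a small, made-up FOM map for several thresholds. The function names, the greedy peak selection and the toy map are assumptions made for this example; they are not taken from RELION's implementation.

```python
import math

def angstrom_to_pixels(distance_a, pixel_size_a):
    """Convert a distance in Angstroms to pixels (hypothetical helper)."""
    return distance_a / pixel_size_a

def pick_peaks(fom, threshold, min_distance_px):
    """Greedy peak picking on a 2D figure-of-merit map given as a list of rows.

    Returns (x, y) coordinates, strongest peak first.
    """
    # Collect all pixels at or above the threshold, strongest first.
    candidates = [
        (value, y, x)
        for y, row in enumerate(fom)
        for x, value in enumerate(row)
        if value >= threshold
    ]
    candidates.sort(reverse=True)

    picked = []
    for value, y, x in candidates:
        # Keep a candidate only if it is far enough from every stronger pick.
        if all(math.hypot(y - py, x - px) >= min_distance_px for _, py, px in picked):
            picked.append((value, y, x))
    return [(x, y) for _, y, x in picked]

# Toy FOM map (assumed values); a real map has one value per micrograph pixel.
fom_map = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.6, 0.0, 0.0],
    [0.0, 0.5, 0.2, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.7],
]
min_dist = angstrom_to_pixels(distance_a=4.0, pixel_size_a=2.0)  # 2 pixels

# Re-using the same map for several thresholds costs almost nothing.
for thr in (0.2, 0.5, 0.8):
    print(thr, pick_peaks(fom_map, thr, min_dist))
```

A lower threshold admits more candidates, while the minimum distance suppresses weaker peaks that sit next to a stronger one.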

As of version 2.0, autopicking has been GPU-accelerated. Note that the master as well as all the slaves will run on the GPU (in contrast to relion_refine, where the master does not run on the GPU). The acceleration makes autopicking almost instantaneous.

Particle extraction

Given the manually-selected or autopicked coordinates, RELION will extract all particles in boxes of a user-defined size. As a rule-of-thumb, we often use boxes that are approximately 2x the longest dimension of the particles. Accepted formats for coordinate files from other packages are XMIPP(2.4), EMAN BOXER, and XIMDISP. As of version 1.3, the preferred (and native) format for coordinate files is STAR.
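
As a small worked example of the rule of thumb above, the following Python sketch converts a particle's longest dimension into a box size in pixels. The function name, the pixel size and the rounding up to an even number are assumptions made for illustration (the rounding follows the advice on this page to keep image sizes even); this is not a RELION function.

```python
import math

def box_size_px(longest_dimension_a, pixel_size_a, factor=2.0):
    """Box size in pixels: factor x longest dimension, rounded up to an even number."""
    size = math.ceil(factor * longest_dimension_a / pixel_size_a)
    return size if size % 2 == 0 else size + 1

# Hypothetical particle of 150 Angstroms on micrographs with 1.35 Angstrom pixels.
print(box_size_px(150.0, 1.35))
```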

Also as of version 1.3, RELION implements a particle sorting routine. It calculates difference images between extracted particles and their aligned (and CTF-convoluted) references, and bases Z-scores on the characteristics of these difference images (such as mean, standard deviation, skewness, excess kurtosis and rotational symmetry). The sorting program will add an extra column to the particle STAR file with the resulting Z-score. You can then display all particles, ordered on this column in the STAR file. By doing so, "good" particles tend to be at the top of the display, while "bad" particles tend to be at the bottom.
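
To make the idea of Z-score based sorting more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not RELION's algorithm: it uses only two of the listed features (mean and standard deviation of each difference image), and the way the per-feature Z-scores are combined (average of absolute values) is an assumption, as are all names and the toy data.

```python
import statistics

def feature_zscores(values):
    """Z-score of each value relative to the whole set of particles."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def sort_particles(diff_images):
    """diff_images maps a particle name to a flat list of difference-image pixels.

    Returns (name, score) pairs, lowest (most typical) score first.
    """
    names = list(diff_images)
    means = [statistics.fmean(diff_images[n]) for n in names]
    stds = [statistics.pstdev(diff_images[n]) for n in names]
    z_mean = feature_zscores(means)
    z_std = feature_zscores(stds)
    # Assumed combination rule: average the absolute per-feature Z-scores.
    scores = {n: (abs(a) + abs(b)) / 2 for n, a, b in zip(names, z_mean, z_std)}
    return sorted(scores.items(), key=lambda item: item[1])

# Toy difference images (assumed values); p4 differs strongly from its reference.
diffs = {
    "p1": [0.1, -0.1, 0.0, 0.0],
    "p2": [0.0, 0.1, -0.1, 0.0],
    "p3": [0.2, -0.2, 0.1, -0.1],
    "p4": [3.0, 2.5, -0.5, 4.0],
}
for name, score in sort_particles(diffs):
    print(name, round(score, 2))
```

In this toy data set the outlier p4 ends up last, which mirrors how atypical particles collect at the bottom of the sorted display.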

The extracted particles may be re-scaled, re-windowed, normalized, and have their contrast inverted (in that order). Keep re-scaled and re-windowed image sizes even numbers. Always normalize your particles, and use a reasonable radius for the circle around your particles outside of which the standard deviation and average values for the noise are calculated. If there are white or black artefacts on the micrographs (e.g. caused by dust or hot/dead pixels), these may be removed by using a positive value for the dust removal options. All black/white pixels with values above the given parameter times the standard deviation of the noise are replaced by random values from a Gaussian distribution. For cryo-EM data, values around 3.5-5 are often useful. Make sure you do not erase part of the true signal.
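
The normalization and dust-removal steps described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function and parameter names are hypothetical, images are plain lists of rows, and it assumes that normalization subtracts the background mean and divides by the background standard deviation. It is not RELION's code, and the order of the two calls here is only for the example.

```python
import math
import random

def background_stats(image, radius_px):
    """Mean and standard deviation of the pixels outside a centred circle."""
    n = len(image)
    centre = (n - 1) / 2.0
    values = [
        image[y][x]
        for y in range(n)
        for x in range(n)
        if math.hypot(y - centre, x - centre) > radius_px
    ]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def remove_dust(image, radius_px, white_sigma=-1.0, black_sigma=-1.0, seed=0):
    """Replace outlier pixels by Gaussian noise; non-positive sigmas switch a test off."""
    rng = random.Random(seed)
    mean, std = background_stats(image, radius_px)
    cleaned = []
    for row in image:
        new_row = []
        for v in row:
            too_white = white_sigma > 0 and v > mean + white_sigma * std
            too_black = black_sigma > 0 and v < mean - black_sigma * std
            new_row.append(rng.gauss(mean, std) if (too_white or too_black) else v)
        cleaned.append(new_row)
    return cleaned

def normalize(image, radius_px):
    """Assumed convention: background mean becomes 0 and its standard deviation 1."""
    mean, std = background_stats(image, radius_px)
    return [[(v - mean) / std for v in row] for row in image]

# Toy 8x8 image: a checkerboard background of 10 +/- 2 with one hot pixel.
image = [[12.0 if (x + y) % 2 == 0 else 8.0 for x in range(8)] for y in range(8)]
image[4][4] = 200.0

cleaned = remove_dust(image, radius_px=3.0, white_sigma=4.0)
normalized = normalize(cleaned, radius_px=3.0)
print(cleaned[4][4], normalized[0][0], normalized[0][1])
```

With too small a sigma value, genuine high-contrast signal would also be replaced by noise, which is why the text above warns against erasing part of the true signal.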

Dividing your data into groups

If you use the semi-automated preprocessing procedure explained above, your data will be divided into as many groups as there are micrographs. During refinement, a different noise spectrum and signal scale factor are estimated independently for each group. To get robust noise and signal estimates, make sure each group contains at least ~10-20 particles. If you have very few particles per micrograph, then you may want to combine multiple micrographs into one group (i.e. use the same rlnMicrographName for particles coming from multiple micrographs). If you do so, make sure you join micrographs with similar apparent signal-to-noise ratios. Often this means micrographs with similar defocus values, but do note that each particle may still have its own defocus values (CTF corrections are done per-particle, not per group).
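
One possible way to combine micrographs into groups is sketched below in Python. It sorts micrographs by defocus (used here as a stand-in for apparent signal-to-noise ratio) and merges neighbours until each group reaches a minimum particle count. The function name, the greedy strategy, the group naming and the example values are all assumptions for illustration; the resulting group name is what one would write into rlnMicrographName for the particles concerned.

```python
def group_micrographs(micrographs, min_particles=20):
    """Greedily merge defocus-sorted micrographs into groups.

    micrographs: list of (name, defocus_angstrom, n_particles) tuples.
    Returns a dict mapping each micrograph name to a group name.
    """
    ordered = sorted(micrographs, key=lambda m: m[1])
    groups, current, count = [], [], 0
    for name, _defocus, n_particles in ordered:
        current.append(name)
        count += n_particles
        if count >= min_particles:
            groups.append(current)
            current, count = [], 0
    if current:
        # Leftover micrographs join the last group, which is closest in defocus.
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return {
        name: f"group_{i + 1:03d}"
        for i, members in enumerate(groups)
        for name in members
    }

# Hypothetical micrographs: (name, defocus in Angstroms, number of particles).
mics = [
    ("mic1.mrc", 25000, 8),
    ("mic2.mrc", 12000, 15),
    ("mic3.mrc", 13000, 7),
    ("mic4.mrc", 26000, 14),
    ("mic5.mrc", 30000, 5),
]
for name, group in sorted(group_micrographs(mics).items()):
    print(name, group)
```

Only the group label changes in such a scheme; each particle keeps its own defocus values, in line with the per-particle CTF correction noted above.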

As of version 2.0, particles may also be regrouped in a more convenient manner using the 'Class options' tab on the 'Subset selection' jobtype.