Classify 3D structural heterogeneity

From Relion

Filling in the GUI

For 3D classifications, select the run-type of 3D classification from the drop-down menu at the top of the GUI.

I/O tab

  • Apart from the initial reference map, the number of classes is the second most important parameter of this procedure. Often one performs multiple calculations with different values.
  • If the reference was not reconstructed from the input images in either XMIPP or RELION, you should assume it is not on the absolute greyscale. However, as also mentioned in the important note on 3D classification in the Prepare input files section, it is highly recommended to classify structural heterogeneity starting from a consensus model that comes from the data themselves.
  • Provide the correct symmetry point group. Note that there are various conventions for icosahedral symmetry; see also the Conventions page. Make sure your input map is in the convention you provide here.
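For reference, the I/O choices above map roughly onto relion_refine command-line options as follows. This is only a sketch: the file names are placeholders, and the option names (--K for the number of classes, --firstiter_cc for a reference that is not on the absolute greyscale) should be checked against your RELION version.

```shell
# Hypothetical sketch: I/O tab settings expressed as relion_refine options.
# File names are placeholders; --K is the number of classes, --sym the symmetry
# point group, and --firstiter_cc is used when the reference is not on the
# absolute greyscale.
relion_refine \
    --i particles.star \
    --ref consensus_model.mrc \
    --firstiter_cc \
    --K 4 \
    --sym C1
```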

CTF tab

  • The pixel size (in Angstrom) should be the same as the one used to estimate the CTF parameters (unless you rescaled the images afterwards, in which case the same scale factor should be applied to the pixel size).
  • If no CTF correction is to be performed, make sure you phase-flipped your data during preprocessing. See the Prepare input files page.
  • If the particles have been phase flipped, tell the program about this.
  • Some data sets have very low-resolution features that are not accounted for in the linear CTF model (with ~10% amplitude contrast). This sometimes leads to overly strong low-resolution features in the reconstructed maps. Classification driven by these very low-resolution features may then hamper the separation of distinct structural states. Therefore, it may be useful to ignore the CTFs (i.e. set them to one) until their first maximum. In several cases, this has led to successful classification of structurally heterogeneous data that could not be classified using the full CTF correction. If desired, full CTF correction can then be applied subsequently during separate refinements of the distinct classes.
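On the command line, the CTF tab settings above correspond roughly to the following relion_refine options (a sketch; the input file name is a placeholder):

```shell
# Hypothetical sketch: CTF tab settings as relion_refine options.
# --ctf_intact_first_peak ignores the CTFs (sets them to one) until their first
# maximum, as discussed above; --ctf_phase_flipped declares that the input
# particles were already phase-flipped during preprocessing.
relion_refine \
    --i particles.star \
    --ctf \
    --ctf_phase_flipped \
    --ctf_intact_first_peak
```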

Optimisation tab

  • Successful classification often requires starting from a very strongly low-pass filtered map. If your input map is not low-pass filtered, it may be filtered internally using the Initial low-pass filter option. Typically, one filters as much as possible, i.e. to just before the reference becomes a featureless blob that can no longer be refined. For example, we use 60 Angstroms for ribosomes and GroEL.
  • Often 25-50 iterations are necessary before the refinement converges to a stable solution. Note that there is currently no convergence criterion implemented, so the user is responsible for monitoring convergence. A job may be killed if it converges before its maximum number of iterations has been reached; conversely, if it has not yet converged by the last iteration, the previous run may be continued.
  • The regularisation parameter determines the relative weight between the experimental data and the prior. Bayes' law dictates it should be 1, but better results are often obtained using slightly higher values (e.g. 2-4), especially when dealing with cryo-data. (For negative stain data we often observe that lower values, e.g. 1-2, are better.)
  • The particle diameter (in Angstroms) serves to define a soft spherical mask that will be applied to the references to reduce their background noise. Note that a (preferably soft) user-provided mask (1=protein, 0=solvent) may also be used for highly non-spherical particles. Be careful though not to mask away any unexpected signal and always use a soft mask, i.e. one with values between 0 and 1 at the protein/solvent boundary.
  • Optionally, one may also apply a soft mask with zeros in the solvent area to the experimental particles. This reduces noise, so that classifications may be made more reliably. However, masking the experimental data also introduces correlations between the Fourier components, which are not described in the statistical model. Using zero-valued solvent masks on the particles often yields better classification results, although it may somewhat hamper resolution.
  • Because 2D (and 3D) classification may suffer from overfitting (especially for very weak data), there is an option to limit the resolution in the E-step (i.e. the alignment). If one sees signs of overfitting (i.e. hairy features extending into the solvent around a particle), we typically set this to values in the range of 20-5 Angstroms, often starting with lower resolution limits in initial runs and progressively becoming more permissive in subsequent runs as the data set becomes cleaner.
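A sketch of how these Optimisation tab settings would look on the relion_refine command line, using example values from the text above (the input file name is a placeholder, and option names should be checked against your RELION version):

```shell
# Hypothetical sketch: Optimisation tab settings as relion_refine options.
# --ini_high:           initial low-pass filter in Angstroms
# --tau2_fudge:         regularisation parameter T
# --particle_diameter:  diameter of the soft spherical mask, in Angstroms
# --zero_mask:          zero the solvent area of the experimental particles
# --strict_highres_exp: resolution limit (Angstroms) for the E-step alignment
relion_refine \
    --i particles.star \
    --iter 25 \
    --ini_high 60 \
    --tau2_fudge 4 \
    --particle_diameter 200 \
    --zero_mask \
    --strict_highres_exp 12
```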

Sampling tab

  • CPU requirements increase rapidly with finer angular sampling (but, in contrast to ML3D implementations, memory requirements do not!). Therefore, 3D classification is often performed at relatively coarse angular sampling, e.g. 7.5 degrees for ribosomes. Ultimately, however, this will depend on the nature of the heterogeneity one wants to classify.
  • If fine angular samplings are required, one could use coarse samplings initially and then restart the previous run (see the Running_RELION page) with a finer angular sampling combined with local angular searches.
  • Translational search ranges may depend on how well-centered the particles were picked, but often 6 pixels will do the job (translational searches in subsequent iterations are centered at the optimal translation of the previous one, so particles may "move" much more than the original search range over the course of an entire refinement). Note that pre-centering prior to RELION refinement is not necessary, and also not recommended (it often messes up the Gaussian distribution of origin offsets).
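These Sampling tab settings can be sketched on the command line as follows (example values; the input file name is a placeholder):

```shell
# Hypothetical sketch: Sampling tab settings as relion_refine options.
# --healpix_order 2 corresponds to the 7.5-degree angular sampling mentioned
# above; --offset_range and --offset_step are in pixels. For a restarted run
# with local angular searches, one would add --sigma_ang together with a finer
# --healpix_order.
relion_refine \
    --i particles.star \
    --healpix_order 2 \
    --offset_range 6 \
    --offset_step 2
```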

Compute tab

As of RELION-2.0, there are more computation-options accessible from the GUI. These are:

  • Combine iterations through disc? This option was implemented when some network cards on our cluster were buggy and large messages often failed: instead of sending messages over the network, the MPI nodes communicate by writing large files to disc. If you have reasonably fast and reliable network connections, it is probably better to set this option to "No", as combining through disc will be quite slow (although that depends on the speed of your disc access).
  • Use parallel disc I/O? If set to Yes, all MPI slaves will read their own images from disc. Otherwise, only the master will read images and send them through the network to the slaves. Parallel file systems like gluster or fhgfs are good at parallel disc I/O. NFS may break with many slaves reading in parallel.
  • Number of pooled particles: Particles are processed in individual batches by MPI slaves. During each batch, a stack of particle images is only opened and closed once to improve disk access times. All particle images of a single batch are read into memory together. The size of these batches is at least one particle per thread used. The nr_pooled_particles parameter controls how many particles are read together for each thread. If it is set to 3 and one uses 8 threads, batches of 3x8=24 particles will be read together. This may improve performance on systems where disk access, and particularly metadata handling of disk access, is a problem. It has a modest cost of increased RAM usage.
  • Pre-read all particles into RAM? If set to Yes, all particle images will be read into computer memory, which will greatly speed up calculations on systems with slow disk access. However, one should of course be careful with the amount of RAM available. Because particles are read in double precision, it will take ( N * box_size * box_size * 8 / (1024 * 1024 * 1024) ) gigabytes to read N particles into RAM. For 100 thousand 200x200 images, that becomes 30 Gb, or 120 Gb for the same number of 400x400 particles. Remember that a single MPI slave running on each node with as many threads as there are cores will have access to all of that node's RAM.
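The memory estimate above is easy to reproduce; this sketch evaluates the formula for the two box sizes from the example (integer division floors the result, hence 29 and 119 rather than the rounded 30 and 120):

```shell
# RAM (GB) needed to pre-read N double-precision images of box_size x box_size:
# N * box_size * box_size * 8 / 1024^3
N=100000
for BOX in 200 400; do
    echo "${BOX}x${BOX}: $(( N * BOX * BOX * 8 / 1024 / 1024 / 1024 )) GB"
done
```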
  • Use GPU acceleration? If set to Yes, the program will run CUDA code to accelerate computations on NVIDIA graphics cards. Note that only cards with CUDA compute capability 3.5 or higher are supported. It is typically recommended to run 1 MPI process on each card. The option 'Which GPUs to use' can be used to specify which MPI process runs on which card, e.g. "0:1:2:3" means that slaves 1-4 will run on cards 0-3. Note that the master (mpi-rank 0) will not use any GPU: it will merely orchestrate the calculations. You can run multiple threads within each MPI process to further accelerate the calculations; we often use, for example, 4 threads.
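Putting the GPU recommendations together, a run on a four-GPU node could be launched roughly like this (a sketch following the description above; the input file name is a placeholder):

```shell
# Hypothetical sketch: 5 MPI ranks = 1 master (no GPU) + 4 slaves on cards 0-3,
# each slave running 4 threads.
mpirun -n 5 relion_refine_mpi \
    --i particles.star \
    --gpu "0:1:2:3" \
    --j 4
```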

Running tab

  • If one uses multi-core nodes, the use of multiple threads is recommended because the shared-memory parallelisation increases the amount of memory available per process. MPI is typically used for more scalable parallelisation over the different nodes.

An example: 10k ribosome test data set

Please see the Classification example for a full example of how to perform 3D classification in RELION.