Refine a structure to high-resolution

= Typical refinement strategy =
High-resolution refinement will typically require multiple runs, which are continuations of each other (see [[Running RELION#Continuing an old run | Continuing an old run]]). For reasons of computational efficiency, one often performs initial runs at relatively coarse angular sampling rates and with relatively large offset searches. Then, once the resolution of the model(s) no longer improves (see [[Analyse results]]), one continues the previous run using a finer angular sampling (and a smaller range and step size for the translations). Also, once most of the images have found orientations close to their correct ones, ''local angular searches'' may be performed to speed up the calculations at very fine angular samplings.
Note that in the standard output to the screen (<code>stdout</code>), the program will print estimated accuracies of the angular and translational assignments. These estimates are based on a detailed comparison of the signal in the current reconstruction(s) and the noise in the data. Do ''not'' use finer angular or translational sampling rates than these estimated accuracies.
''A fundamental difference from conventional refinement schemes'' is that iterating does not only serve to find the optimal orientations: '''iteration also serves to progressively increase the resolution of the model'''. Therefore, more iterations than in conventional refinement schemes may be necessary to reach the highest possible resolution.
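As a concrete illustration, a continuation run with finer sampling might be launched as follows. This is a minimal sketch only: the file names are hypothetical, and exact option names and values may differ between RELION versions (the GUI assembles the equivalent command for you).

```shell
# Continue an earlier refinement from its iteration-17 optimiser file
# (hypothetical names), now with finer angular sampling and a smaller,
# finer translational search.
mpirun -n 8 relion_refine_mpi \
  --continue Refine3D/run1_it017_optimiser.star \
  --o Refine3D/run1_ct17 \
  --healpix_order 3 \
  --offset_range 3 \
  --offset_step 1 \
  --j 4
```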
= Filling in the GUI =
For 3D refinements, select the run-type of <code>3D auto-refine</code> from the drop-down menu at the top of the GUI. (This is a new feature of version 1.1.) This procedure implements so-called '''gold-standard FSC calculations''', where two models are refined independently for two random halves of the data to prevent overfitting. Thereby, reliable resolution estimates and clean reconstructions are obtained without compromising reconstruction quality, see (Scheres & Chen, Nature Methods, in press) for more details. Note that for cyclic point group symmetries (i.e. C<n>), the two half-reconstructions are averaged up to 40 Angstrom resolution to prevent diverging orientations.
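The gold-standard FSC mentioned above compares the two half-reconstructions shell by shell in Fourier space. The following is a minimal NumPy sketch of the FSC formula itself, for illustration only; it is not RELION's implementation, and the shell binning here is simplified.

```python
import numpy as np

def fsc(map1, map2, n_shells=10):
    """Fourier Shell Correlation between two equally-sized 3D maps.

    FSC(shell) = Re(sum F1 * conj(F2)) / sqrt(sum|F1|^2 * sum|F2|^2),
    computed over spherical shells of Fourier radius.
    """
    f1 = np.fft.fftn(map1)
    f2 = np.fft.fftn(map2)
    # Radius of every Fourier voxel, in cycles per voxel
    freqs = [np.fft.fftfreq(s) for s in map1.shape]
    grid = np.meshgrid(*freqs, indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grid))
    # Bin radii into shells up to the Nyquist frequency (0.5)
    shells = np.minimum((radius / 0.5 * n_shells).astype(int), n_shells - 1)
    corr = np.zeros(n_shells)
    for s in range(n_shells):
        sel = shells == s
        num = np.real(np.sum(f1[sel] * np.conj(f2[sel])))
        den = np.sqrt(np.sum(np.abs(f1[sel]) ** 2) *
                      np.sum(np.abs(f2[sel]) ** 2))
        corr[s] = num / den if den > 0 else 0.0
    return corr
```

Two identical maps give an FSC of 1 in every shell; for two independently refined half-maps the curve decays with resolution, and the resolution estimate is read off where the curve drops below a threshold.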


== I/O tab ==


* See the [[Prepare input files]] page on how to prepare your input data.  


* See the notes of the reference map on the [[Prepare input files]] page.


* If the reference was not reconstructed from the input images in either XMIPP or RELION, you may assume it is not on the absolute greyscale.  


* Provide the correct symmetry point group. Note there are various settings for icosahedral symmetry, also see the [[Conventions]]. Make sure your input map is in the one you provide here.


== CTF tab ==


* The pixel size (in Angstrom) should be the same as the one used to estimate the CTF parameters (unless you rescaled the images afterwards, in which case the same scale factor should be applied to the pixel size).


* If no CTF correction is to be performed, make sure you phase-flipped your data during preprocessing. See the [[Prepare input files]] page.
 
* If the particles have been phase flipped, tell the program about this.
 
* Some data sets have very-low-resolution features that are not accounted for in the linear CTF model (with ~10% amplitude contrast). This will sometimes lead to exaggerated low-resolution features in the reconstructed maps, which may then hamper separation of distinct structural states. Therefore, it may be useful to ignore the CTFs (i.e. set them to one) until their first maximum. In several cases, this has led to successful classification of structurally heterogeneous data that could not be classified using full CTF-correction. If desired, full CTF-correction can then be applied subsequently during separate refinements of the distinct classes.
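The "ignore CTFs until their first maximum" idea can be illustrated with a toy one-dimensional CTF (spherical aberration ignored, sign conventions simplified; the parameter values are arbitrary examples, and this is not RELION's CTF code):

```python
import numpy as np

def ctf_1d(k, defocus=15000.0, wavelength=0.02, amp_contrast=0.1):
    """Toy 1D CTF: k in 1/Angstrom, defocus and wavelength in Angstrom.
    Spherical aberration is ignored for simplicity."""
    chi = np.pi * wavelength * defocus * k ** 2
    w = amp_contrast
    return np.sqrt(1.0 - w * w) * np.sin(chi) + w * np.cos(chi)

def ignore_until_first_peak(ctf):
    """Return a copy of the CTF with all values up to and including its
    first maximum set to one, mimicking the option described above."""
    c = np.abs(ctf)
    i = 1
    while i < len(c) and c[i] >= c[i - 1]:
        i += 1  # walk up to the first peak of |CTF|
    out = ctf.copy()
    out[:i] = 1.0
    return out
```

Beyond the first peak the CTF is left untouched, so higher-resolution features are still handled by the normal CTF model.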


== Optimisation tab ==


* To prevent model bias it is recommended to start refinement from a very strongly low-pass filtered map. If your input map is not low-pass filtered, it may be filtered internally using the <code>Initial low-pass filter</code> option. Typically, one filters ''as much as possible'', i.e. up to the point just before the reference becomes a featureless blob that can no longer be refined. For example, we use 80 Angstroms for ribosomes and 60 Angstroms for GroEL.  


* The particle diameter (in Angstroms) serves to define a soft spherical mask that will be applied to the references to reduce their background noise. Note that a (preferably soft) user-provided mask (1=protein, 0=solvent) may also be used for highly non-spherical particles. Be careful though not to mask away any unexpected signal and always use a soft mask, i.e. one with values between 0 and 1 at the protein/solvent boundary.
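A soft spherical mask of the kind described above (1 inside the particle diameter, 0 in the solvent, and a smooth fall-off in between) can be sketched as follows; the raised-cosine edge and the 5-pixel edge width are illustrative choices, not RELION's exact implementation:

```python
import numpy as np

def soft_spherical_mask(box, diameter_px, edge_px=5):
    """Soft spherical mask for a cubic box: 1 inside the given diameter,
    0 outside, with a raised-cosine edge in between."""
    center = (box - 1) / 2.0
    grid = np.indices((box, box, box)) - center
    r = np.sqrt((grid ** 2).sum(axis=0))   # distance to the box centre
    radius = diameter_px / 2.0
    mask = np.zeros((box, box, box))
    mask[r <= radius] = 1.0
    edge = (r > radius) & (r <= radius + edge_px)
    mask[edge] = 0.5 * (1.0 + np.cos(np.pi * (r[edge] - radius) / edge_px))
    return mask
```

All mask values lie between 0 and 1, so no hard edge is introduced at the protein/solvent boundary.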


== Sampling tab ==


* The initial angular and translational sampling rates given here will be automatically increased to their optimal values by the auto-refine procedure. We tend to use 7.5 degrees angular sampling for non-icosahedral cases and 3.7 degrees for icosahedral viruses. Most of the time, using 6 pixels for the initial translational searches is enough, although this ultimately depends somewhat on how well-centered the particles were picked. However, note that pre-centering prior to RELION refinement is not necessary, and also not recommended (it often messes up the Gaussian distribution of origin offsets).
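The angular sampling values offered by the GUI follow a roughly halving series (30, 15, 7.5, 3.7, 1.8, ... degrees) indexed by HEALPix order. A small helper illustrating that relationship (an approximation assumed from the GUI's listed values, not taken from RELION's code):

```python
def approx_angular_sampling(healpix_order):
    """Approximate angular sampling in degrees for a HEALPix order,
    assuming the halving series 30, 15, 7.5, 3.75, ... degrees."""
    return 30.0 / (2 ** healpix_order)
```

Under this assumption, the 7.5 degrees suggested for non-icosahedral cases corresponds to order 2, and the 3.7 degrees for icosahedral viruses to order 3.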


== Compute tab ==
As of RELION-2.0, more computation options are accessible from the GUI. These are:


* Combine iterations through disc? This option was implemented for clusters where buggy network cards made large MPI messages fail: the MPI nodes then communicate by writing large files to disc instead. If you have reasonably fast and reliable network connections, it is better to set this option to "No", as going through disc is quite slow (although that depends on the speed of your disc access).


* Use parallel disc I/O? If set to Yes, all MPI slaves will read their own images from disc. Otherwise, only the master will read images and send them through the network to the slaves. Parallel file systems like gluster or fhgfs are good at parallel disc I/O. NFS may break with many slaves reading in parallel.


* Number of pooled particles: Particles are processed in individual batches by MPI slaves. During each batch, a stack of particle images is only opened and closed once to improve disk access times. All particle images of a single batch are read into memory together. The size of these batches is at least one particle per thread used. The nr_pooled_particles parameter controls how many particles are read together for each thread. If it is set to 3 and one uses 8 threads, batches of 3x8=24 particles will be read together. This may improve performance on systems where disk access, and particularly metadata handling of disk access, is a problem. It has a modest cost of increased RAM usage.


* Pre-read all particles into RAM? If set to Yes, all particle images will be read into computer memory, which will greatly speed up calculations on systems with slow disc access. However, one should of course be careful with the amount of RAM available. Because particles are read in double precision, it will take ( N * box_size * box_size * 8 / (1024 * 1024 * 1024) ) gigabytes to read N particles into RAM. For 100,000 200x200 images that becomes 30 GB, or 120 GB for the same number of 400x400 particles. Remember that a single MPI slave on each node, running as many threads as available cores, will have access to all available RAM on that node.
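The memory and batch-size arithmetic from the two options above can be written out explicitly (a direct transcription of the formulas in the text):

```python
def preread_ram_gb(n_particles, box_size):
    """RAM in GB needed to pre-read N particles of box_size x box_size
    pixels, stored in double precision (8 bytes per pixel)."""
    return n_particles * box_size * box_size * 8 / (1024 ** 3)

def batch_size(nr_pooled_particles, n_threads):
    """Particles read together per batch: pooled particles per thread
    times the number of threads."""
    return nr_pooled_particles * n_threads
```

For example, 100,000 particles in 200x200 boxes need about 30 GB of RAM, and with nr_pooled_particles=3 and 8 threads, batches of 24 particles are read together.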


* Use GPU acceleration? If set to Yes, the program will run CUDA code to accelerate computations on NVIDIA graphics cards. Note that only cards with CUDA compute capability 3.5 or higher are supported. It is typically recommended to run 1 MPI process on each card. The option 'Which GPUs to use' can be used to specify which MPI process is run on which card, e.g. "0:1:2:3" means that slaves 1-4 will run on cards 0-3. Note that the master (MPI rank 0) will not use any GPU: it will merely orchestrate the calculations. You can run multiple threads within each MPI process to further accelerate the calculations; we often use 4 threads, for example.
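The 'Which GPUs to use' syntax described above can be illustrated with a small parser (a hypothetical helper for clarity, not RELION's actual parsing code):

```python
def parse_gpu_assignment(spec):
    """Map MPI slave ranks to GPU ids for a colon-separated string such
    as "0:1:2:3": entry i is the card for slave i+1, while rank 0 (the
    master) uses no GPU."""
    cards = spec.split(":")
    return {rank + 1: card for rank, card in enumerate(cards)}
```

So "0:1:2:3" assigns slaves 1-4 to cards 0-3, matching the example in the text.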


== Running tab ==
* If one uses multi-core nodes, the use of multiple threads (as many threads as cores on a machine) is recommended because the shared-memory parallelisation increases the amount of memory available per process. MPI is typically used for more scalable parallelisation over the different nodes. (In terms of CPU usage, MPI parallelisation is a bit more efficient than threads.)
 
= Analyzing results =
 
The program will write a <code>out_it???_half?_model.star</code> and a <code>out_it???_half?_class001.mrc</code> file for each of the two independent data-set halves at every iteration. Only upon convergence will the program write a single <code>out_model.star</code> and <code>out_class001.mrc</code> file with the results from the joined halves of the data. These final files are the ones you will be most interested in! Note that, to prevent overfitting, the joined map should not be used for further refinement.
 
'''Also remember that your map will still need sharpening!''' Have a look at the [[Analyse results#Map sharpening | Analyse results]] section for more details.

Latest revision as of 09:37, 15 June 2016
