An Information Portal to Biological Macromolecular Structures
pdb_extract extracts statistical information from the output files produced by many software packages for protein structure determination by X-ray crystallography and NMR methods. This statistical information is written into a complete mmCIF file that is ready for PDB deposition.

For X-ray structure determination, pdb_extract merges all the information into two mmCIF (macromolecular Crystallographic Information File) files. One mmCIF file contains the structure factors. The other contains the atomic coordinates and the statistics extracted from the steps of structure determination (data collection/integration/reduction, heavy atom phasing, molecular replacement, density modification, and final structure refinement) for various methods (MR, SAD, MAD, SIR, SIRAS, MIR, MIRAS). These two mmCIF files are ready for PDB deposition.

For NMR structure determination, statistics from the header section of the PDB file and from other LOG files produced by the software are merged into one mmCIF file containing the coordinates. This file, along with any constraint files (if applicable), is ready for PDB deposition.

The current version supports 35 software packages and hundreds of different output files produced at various steps. See the supported software list for details.

The mmCIF files assembled by pdb_extract should be uploaded to the ADIT server. Enter any additional information into ADIT and submit your files directly from there.

The advantage of using pdb_extract:
IMPORTANT NOTES:
The source and binary versions of pdb_extract can be downloaded from deposit.pdb.org/software. The source is available under an Open Source license. The binary distributions are available for Intel-Linux.

The web interface can be accessed at pdb-extract.rutgers.edu.

pdb_extract has been integrated into CCP4 and the CCP4i interface (version 5.0 and above). Users can run pdb_extract under the CCP4 environment.
System Requirements:
Installing the binary distribution is recommended, since it is quick to install and takes little disk space. The binary distributions are available for Intel-Linux.
Step 1. Uncompress and unbundle the distribution using the following command:

    zcat pdb-extract-vX.XXX-XXX.tar.gz | tar -xf -

Step 2. Set up the environment variables.

* Define the PDB_EXTRACT environment variable to point to the installation directory. Assuming that the installation directory is /home/username/pdb-extract-vX.XXX-XXX, execute in the shell:

    For C shell users:
        setenv PDB_EXTRACT /home/username/pdb-extract-vX.XXX-XXX
    For Bourne shell users:
        PDB_EXTRACT=/home/username/pdb-extract-vX.XXX-XXX; export PDB_EXTRACT

* Add the "bin" subdirectory to the PATH environment variable. Execute in the shell:

    For C shell users:
        setenv PATH "$PDB_EXTRACT/bin:"$PATH
    For Bourne shell users:
        PATH="$PDB_EXTRACT/bin:"$PATH; export PATH
Step 1. Uncompress and unbundle the distribution using the following command:

    zcat pdb-extract-vX.XXX-XXX.tar.gz | tar -xf -
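After setting the environment variables described in the installation steps, it can be useful to confirm that the shell actually sees them. The following is a minimal Bourne shell sketch of such a check. The function name check_pdb_extract_env is hypothetical and is not part of the pdb_extract distribution. It only assumes the layout described in the installation steps (a "bin" subdirectory under the directory named by PDB_EXTRACT).

```shell
#!/bin/sh
# check_pdb_extract_env: hypothetical helper, not shipped with pdb_extract.
# Returns 0 when PDB_EXTRACT is set, has a bin subdirectory, and that
# subdirectory is a component of PATH. Otherwise prints a reason and returns 1.
check_pdb_extract_env() {
    # PDB_EXTRACT must be defined and non-empty
    if [ -z "${PDB_EXTRACT:-}" ]; then
        echo "PDB_EXTRACT is not set" >&2
        return 1
    fi
    # The installation directory must contain a bin subdirectory
    if [ ! -d "$PDB_EXTRACT/bin" ]; then
        echo "no bin subdirectory under $PDB_EXTRACT" >&2
        return 1
    fi
    # Surround PATH with colons so that every component can be matched exactly
    case ":$PATH:" in
        *":$PDB_EXTRACT/bin:"*)
            return 0
            ;;
        *)
            echo "$PDB_EXTRACT/bin is not on PATH" >&2
            return 1
            ;;
    esac
}
```

Calling check_pdb_extract_env in a new shell, for example after adding the two settings to a login script, reports which of the two steps still needs attention.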
An example is included in this distribution. It is located in the subdirectory "pdb-extract-vX.X/examples/Example_1". The directory contains the following:
To run the example, change to the example directory and invoke the test.sh and test_script.sh scripts:

    cd pdb-extract-vX.XXX-XXX/pdb-extract-vX.X/examples/Example_1

A. Run the script test.sh
All the Unix commands are included in the script file test.sh.

    ./test.sh

B. Run the script test_script.sh
The script test_script.sh is an alternative way to obtain the same result as above. It also combines various programs. The difference is that it uses the component extract instead of pdb_extract and pdb_extract_sf. All the information is included in the file log_script.inp.

    ./test_script.sh

Please see the script files for explanations of the input/output arguments.
There are four ways to extract crystallographic information and deposit complete data to the Protein Data Bank.
The four interfaces have different features. For example, the CCP4i and Web interfaces provide a simple graphical interface: users only select the program names and output file names to do the job. The full Unix command line method provides the greatest flexibility, but users need to read the command options to run the program. The script input method provides a simple local interface.

Here we give a concrete example showing how to use pdb_extract for complete data extraction. In this example, the experimental method used to solve the protein structure was multi-wavelength anomalous dispersion (MAD). The information for the experiment is as follows:
STEP 1. Obtain the template data file data_template.text using the command:

    extract -pdb refmac.pdb

After running the program, you will get a file called data_template.text. CATEGORIES 1-2 contain the extracted unit cell parameters and the unique molecular chemical sequence group. Please modify these two categories as necessary. You may skip the other categories until you submit your assembled mmCIF file to ADIT. However, if you have multiple structures to submit, you are recommended to use the data_template file, since it can be re-used without re-entering the same information.
The content of the data template file data_template.text is given in the Appendix.
STEP 2. Obtain coordinates and all the statistics
Run the pdb_extract program:

    pdb_extract -e MAD \                                (MAD experiment)
      -i HKL -iLOG index.log \                          (from indexing)
      -s HKL -iLOG scale_refine.log \                   (from scaling for refinement)
      -sp HKL scale1.log scale2.log scale3.log \        (from scaling for phasing)
      -p SOLVE -iLOG solve.prt \                        (from phasing)
      -d RESOLVE -iLOG resolve.log \                    (from density modification)
      -r refmac5 -icif refmac -ipdb refmac.pdb \        (from final refinement)
      -iENT data_template.text \                        (structural & author information)
      -o pdb_extract.cif                                (output file in mmCIF format)

Note: if you write the options into a script file, there must be a space before each \ sign and no space after it. The parenthesized annotations are explanatory and are not part of the command.

STEP 3. Obtain structure factors
Run pdb_extract_sf to convert the data into mmCIF format and merge all the files into one file:
    pdb_extract_sf \
      -rt F -rp MTZ -idat scale_refine.mtz \            (data for refinement)
      -dt I -dp HKL \                                   (data for phasing)
      -c 1 -w 1 -idat scale1.sca \                      (crystal 1 & diffraction 1)
      -c 1 -w 2 -idat scale2.sca \                      (crystal 1 & diffraction 2)
      -c 1 -w 3 -idat scale3.sca \                      (crystal 1 & diffraction 3)
      -o pdb_extract_sf.cif                             (output file in mmCIF format)

The output file (pdb_extract_sf.cif) contains one reflection data block for refinement and one data block for protein phasing.

STEP 4. Validation and deposition
It is recommended to validate the two files (pdb_extract_sf.cif, pdb_extract.cif) using ADIT before submitting your data. Submit your data from ADIT.
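The command lines in STEP 2 and STEP 3 name more than a dozen input files, so a quick check that each one exists can save a failed run. The sketch below shows one possible way to do this in the Bourne shell. The function name missing_inputs is hypothetical and is not part of pdb_extract. It does not inspect file contents and says nothing about whether a log comes from the last or best trial.

```shell
#!/bin/sh
# missing_inputs: hypothetical helper, not part of the pdb_extract distribution.
# Prints "missing: NAME" for every argument that is not a readable regular
# file. Returns 0 if all arguments are present, 1 otherwise.
missing_inputs() {
    status=0
    for f in "$@"; do
        # A usable input must be a regular file that the user can read
        if [ ! -f "$f" ] || [ ! -r "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}
```

For the MAD example, this could be called as `missing_inputs index.log scale_refine.log scale1.log scale2.log scale3.log solve.prt resolve.log refmac.pdb data_template.text scale_refine.mtz scale1.sca scale2.sca scale3.sca` before running pdb_extract and pdb_extract_sf. The file names are those used in the example and should be replaced with your own.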
STEP 1. Obtain the plain text file log_script.inp using the command:

    extract -pdb refmac.pdb

You will get one script file called log_script.inp and one data template file called data_template.text.
The content of the file log_script.inp is shown in the Appendix.

STEP 2. Run the program:

    extract -ext log_script.inp

You will get the same results as with the Unix command line option.

STEP 3. Validation and deposition (same as in the Unix command line option).
Step 1. From the main window of CCP4i, select the Data Harvesting Management Tool option.

Step 2. From the Run program option, select Extract additional information for deposition.

Step 3. Select Generate a data template file from various steps. Type (or select using browse) in the yellow boxes either the PDB or mmCIF file name obtained from the final structure refinement and the output file name. In this case, the output coordinate file is refmac.pdb. Run the pdb_extract program to obtain the data template file. Edit this file according to the instructions in the text file.

Step 4. Select Generate a complete mmCIF file for PDB deposition from various steps. Select the program names and the log file names generated by the selected programs. Run the pdb_extract program to obtain the complete data in mmCIF format. The final output file can be uploaded to ADIT for online structure validation and submission.

NOTE: Each file name must start at the beginning of its yellow box. There should be no white space in any box, even if no file name is typed in.
STEP 1. Obtain the template data file data_template.text using the command:

    extract   -pdb   coordinate_PDB_file_name   -nmr    (if PDB format)

After running the program, you will get a data template file called data_template.text. This data template file contains 21 data fields for entering information that cannot be extracted electronically. Please enter the necessary information and carefully check CATEGORY 1, which contains the unique molecular chemical sequence. Please modify CATEGORY 1 as necessary. Additional structure information can be filled into CATEGORIES 2-21 for complete data deposition. The content of the data template file data_template.text is given in the Appendix.

STEP 2. Obtain coordinates and all the statistics
Run the pdb_extract program using the following command:

    pdb_extract   -r CNS   -ipdb cns.pdb   -ient data_template.text   -nmr

Statistical information can be extracted from the header section of the PDB file. You will generate a complete mmCIF file containing atomic coordinates and other information about the structure.

STEP 3. Data validation and submission
Please upload the extracted mmCIF file, as well as any constraint files, to the ADIT server for data validation and submission.
Listed below are the programs used from data collection to structure determination.

This section is used to collect statistical information from the LOG files generated by the programs for Data Scaling/Merging/Averaging.
Important: The log files must be generated from the LAST (or BEST) trial, which corresponds to the files used for phasing or molecular replacement.
The extracted information may include the following:
* Intensities (or amplitudes) and standard deviations
* Data completeness (overall, resolution shells)
* Redundancy (overall, resolution shells), mosaicity
* R-merge, R-sym (overall, resolution shells)
* Average I/sigma (overall, resolution shells)
* Total and unique reflections collected
* Resolution range

Some helpful hints for getting LOG files from the programs for Data Scaling/Merging/Averaging:

Using HKL/HKL2000/scalepack
HKL (or HKL2000 or Scalepack) is a package by Otwinowski for data
collection/reduction/scaling.
You can use the graphical interface or the scalepack script to scale your
data. The LOG file (e.g. scale1.log) contains statistics for
PDB deposition.
Using D*trek
D*trek is a package by Jim Pflugrath at Rigaku/MSC for data collection/reduction/scaling.
You can use the graphical interface to scale (or merge/average) your
data. The LOG file (e.g. scale1.log) containing statistics is from the step of scaling data.
Using SAINT
SAINT is a package by Bruker (Siemens Molecular Analytical Research Tool)
for data collection/reduction/scaling. The LOG file (e.g.
scale1.ls) containing statistics is from the step of scaling data.
Using SCALA
SCALA is a CCP4-supported program that scales together multiple
observations of reflections. SCALA generates an mmCIF or LOG file
containing useful statistics. When you run the program, you must ask it
to export the data harvest file (mmCIF type). The mmCIF file will be
named name.scala or name.truncate. Otherwise, it will generate a LOG file.
This section is used to collect key statistical information from Molecular Replacement. You may first generate a LOG file from the rotation function, then generate a LOG file from the translation function. You can upload the two LOG files into this section for data extraction. You can also upload one LOG file which is generated from MR. Important: The log files must be generated from the LAST (or BEST) trial which corresponds to the files used for density modification or refinement.
The extracted information may include the following:
* Low and high resolution used in rotation and translation
* Rotation and translation methods
* Reflection cut-off criteria, reflection completeness
* Correlation coefficients for I or F between observed and calculated
* R-factor, packing information, and model details
Using CNS/CNX/XPLOR
CNS can be used to do molecular replacement. After you finish the translation search, you will get a log file called translation.list, which contains all the molecular replacement information.

Using Amore (CCP4)
Amore is a program for molecular replacement. It is distributed in the CCP4 package. After the rotation and translation searches, you will have two log files, rotation.log and translation.log. You may extract information from both log files. If you run the program in one script, you may generate a single LOG file. Upload this LOG file to the web interface.

Using Molrep (CCP4)
Molrep is a program for molecular replacement. It is distributed in the CCP4 package. When you run the script, you can specify a LOG file name (e.g. molrep.log). All the statistical information will be recorded in the log file.

Using EPMR
EPMR is a Unix command line program for molecular replacement. When you run the program, please give a log file name, as in the following:

    Epmr [options] files > epmr.log

All the statistical information will be written to the log file.

Using Phaser
Phaser was developed by Randy Read's group at the University of Cambridge. It is a program for phasing macromolecular crystal structures with maximum likelihood methods. The program generates a LOG file which can be uploaded to the web interface for data extraction.
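For command line programs such as EPMR, where the log is produced by redirecting output to a file, repeated trials can silently overwrite an earlier log. The sketch below is one possible Bourne shell wrapper that writes a command's output to a named log and refuses to overwrite an existing file, so that each trial keeps its own log. The name run_with_log is hypothetical and is not part of EPMR or pdb_extract. Capturing standard error along with standard output is an assumption made here for completeness and may not be needed for extraction.

```shell
#!/bin/sh
# run_with_log: hypothetical helper, not part of EPMR or pdb_extract.
# Usage: run_with_log LOGFILE COMMAND [ARGS...]
# Runs COMMAND and writes its standard output and standard error to LOGFILE.
# Returns 2 without running anything if LOGFILE already exists. Otherwise
# returns the exit status of COMMAND.
run_with_log() {
    log=$1
    shift
    # Keep the log of an earlier trial instead of overwriting it
    if [ -e "$log" ]; then
        echo "refusing to overwrite $log" >&2
        return 2
    fi
    # Redirect both output streams of the command into the log file
    "$@" > "$log" 2>&1
}
```

A run could then be recorded as `run_with_log epmr_trial1.log Epmr [options] files`, using a new log name for each trial. The log from the last (or best) trial is the one to supply for extraction.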
Heavy atom phasing is performed at an earlier stage of structure determination. The log files generated from phasing contain important statistical information which should be deposited to the Protein Data Bank. From heavy atom phasing, you may have LOG files and a heavy atom coordinate file.
The phasing methods are the following:
* MR: molecular replacement
* SAD: single-wavelength anomalous dispersion
* MAD: multi-wavelength anomalous dispersion
* SIR: single isomorphous replacement
* SIRAS: single isomorphous replacement with anomalous scattering
* MIR: multiple isomorphous replacement
* MIRAS: multiple isomorphous replacement with anomalous scattering
Important: The log files must be generated from the LAST (or BEST) trial, which corresponds to the files used for density modification or refinement.
The following items may be extracted:
* Wavelength, f_prime, f_double_prime, resolution range
* FOM (acentric, centric, overall, resolution shells)
* R-Cullis (acentric, centric, overall, resolution shells)
* R-Kraut (acentric, centric, overall, resolution shells)
* Phasing power (acentric, centric, overall, resolution shells)
* Number of heavy atom sites, heavy atom type
* Heavy atom location method
* Heavy atom B-factors, occupancies, and xyz coordinates
Using SOLVE (version 2.00 and above):
SOLVE is a program for finding heavy atom locations and refining heavy atom parameters. The statistical information is written to a file solve.prt (the default name used by the program). The heavy atom coordinates are written to a file ha.pdb. Note: You may upload the two files solve.prt (file type: LOG) and ha.pdb (file type: PDB).
Using CNS/CNX/XPLOR
CNS is a complete software system for protein crystallography. The scripts for heavy atom location and phasing refinement are mad_phase.inp or ir_phase.inp. When you run these scripts, you will get output files like phase_final.summary, phase_final.sdb or mad_phase.fp.
The output file phase_final.summary has all the phasing statistics. (Note: The refined heavy atom coordinates, B factors and occupancies can be found in a file like phase_final.sdb. If you prefer to convert it to the PDB format, you can run the script sdb_to_pdb.inp. You will get a file phase_final.pdb in PDB format.) Note: You may input at most three files (as shown above) for extracting phase information.
Using MLPHARE (CCP4)
MLPHARE is a program in the CCP4 suite. It is used for refining heavy atom parameters. Whether you use the CCP4i graphical interface or the script mode, you need to ask the program to write a harvesting file. Select the data harvest button when you use the CCP4i interface. Do not use the keyword NOHARV when you use a script. After you finish running this program, you will get a file (e.g. name.mlphare) in mmCIF format. It contains all the information for heavy atom phasing refinement. To extract the wavelength information, you need to run the program REVISE in CCP4 (version 4.0-4.2.2). You may get a file (e.g. prephadata.log). Note: You may input at most two files (as shown above) for extracting phase information.
Using SHARP (version 1.3.x and 2.0 and above):
SHARP is a program for finding heavy atom positions and refining heavy atom parameters. When you run SHARP or autoSHARP, the log files which have useful information are normally in the directory sharpfiles/logfiles_local/dirs, where dirs are all the subdirectories for your various structures. Please note that the location of generated log files may depend on how the program is installed!
SHARP produces many output files.
For version 1.3.x:
* Heavy.pdb contains the heavy atom coordinates.
* FOMstats.html contains figure of merit statistics.
* Otherstat.html contains Rcullis, Rkraut, phasing power.
For version 2.0 and above:
* Heavy.pdb contains the heavy atom coordinates.
* FOMstats.html contains figure of merit statistics.
* RCullis_?.html contains Rcullis.
* PhasingPower_?.html contains phasing power.
The easiest way to obtain these files is to run the program from the SUSHI interface. Review all the log files in an internet browser and save them as plain text files. Note: You may input at most four files (as shown above) for extracting phase information.
Using SnB (version 2.0 and above):
SnB has no heavy atom parameter refinement, and it has no corresponding statistics. SnB gives the heavy atom or substructure coordinates (e.g. heavy.pdb) in PDB format. Note: You may input only one file (as shown above) for phasing extraction.
Using BnP (version 0.93 and above):
BnP is a combination of the programs SnB and Phases. The heavy atom positions are located by SnB and the heavy atom parameters are refined by Phases. The log file (e.g. auto.log) can be found in the directory ~/PHASES/*. The log file normally contains the phasing power for each phasing set. The file is in LOG format. Note: You may input at most one file (as shown above) for extracting phase information.
Using SHELXD or SHELXS (version 97):
Heavy atom or substructure coordinates are produced in PDB format (e.g. heavy.pdb). Note: You may input at most one file (as shown above) for extracting phase information.
Density modification is normally performed after obtaining phases. If you do density modification in your structure determination, its statistical information is needed for PDB deposition. If density modification is not done in a separate step, you may skip this step, since you do not have a log file specifically for density modification.
Important: The log files must be generated from the LAST (or BEST) trial, which corresponds to the file used for refinement.
The following items may be extracted:
* Density modification method
* FOM after density modification (overall, resolution shells)
* Solvent mask determination method
* Structure solution software
Using RESOLVE (version 2.00 and above):
RESOLVE is a density modification program in the SOLVE/RESOLVE package. Normally it runs together with SOLVE, but it can also be run separately. When you run RESOLVE, you will get a log file like resolve.log. Only one log file (resolve.log) is needed for extraction. File type is LOG.
Using CNS/CNX/XPLOR
CNS users may need to run an input script like density_modify.inp. You will get a log file called density_modify.list. Only one log file (density_modify.list) is needed for extraction. File type is LOG.
Using DM (CCP4)
DM is a density modification program in the CCP4 suite. When you run DM, either from the CCP4i graphical interface or from a script, you will get a log file like dm.log. Only one log file (dm.log) is needed for extraction. File type is LOG.
Using SOLOMON (CCP4)
SOLOMON is another density modification program in the CCP4 suite. When you run SOLOMON, either from the CCP4i graphical interface or from a script, you will get a log file like Solomon.log. Only one log file (Solomon.log) is needed for extraction. File type is LOG.
Structure refinement is performed at the end of structure determination. The atomic coordinates are generated in PDB or mmCIF format and the statistics are generated in log files. The pdb_extract program is used to extract this statistical information. Since statistics can be carried in the header section of the PDB file, you may not need to provide any LOG files for some programs, such as CNS and REFMAC5.
Important: The log file and the coordinate file must be generated from the LAST (or BEST) trial, which corresponds to the file that is used for deposition to the PDB.
The following items may be extracted:
* Resolution range (highest resolution shell)
* Number of reflections used in refinement and in the R-Free set
* R-factor (overall, resolution shells)
* Number of atoms refined
* Cell parameters and space group
* The xyz coordinates of all the atoms
* RMS Bond Distances, Bond Angles, Chiral Volume, Torsion Angles
* Isotropic temperature factor restraints
* Non-crystallographic symmetry restraints
* Solvent model used
* Overall Average Isotropic B Factor
* Overall Anisotropic B Factor
* Overall Isotropic B Factor
* Topology/parameter data used to refine the deposited model
* Refinement software
Using REFMAC5 (CCP4):
REFMAC5 is a program for structure refinement used in the CCP4 suite. If you run this program using CCP4i or the script, you can get a PDB file with all the refinement information at the header section. You may directly deposit this PDB file.
Using CNS/CNX/XPLOR
CNS/CNX/XPLOR is a program for final structure refinement. It exports coordinate files in both PDB and mmCIF formats. You need the script deposit_mmcif.inp to generate the mmCIF format. The mmCIF file carries more statistical information than the PDB file. Authors are encouraged to deposit the mmCIF file. Otherwise, they may need to fill in more information manually. You may not need to provide any LOG file generated by CNS/CNX/XPLOR.
Using SHELXL (version 97):
SHELXL is a subprogram in the SHELX package. It is used for structure refinement. After you finish structure refinement, you need to run the interactive program shelxpro and use option B. After going through shelxpro, you will get a PDB file (e.g. name.pdb) with header information.
Using TNT (version 5f):
TNT is a crystal structure refinement program. Data from this program can be extracted from the output PDB file and some LOG files. You can use the to_pdb command to convert coordinates in TNT format (name.cor) to the PDB format (name.pdb). The command is:

    to_pdb name.cor

After finishing refinement, you must use the rfactor command to generate a log file (e.g. rfactor.log) which contains the refinement statistics. The command is:

    rfactor name.cor > rfactor.log

To extract the symmetry information, you must provide the symmetry file (e.g. p6122.dat). This information is in the control file name.tnt.
Using ARP/wARP:
ARP/wARP is an automatic program for model building and refinement. REFMAC5 is used for the structure refinement step. The new version (6.0 or above) can use CCP4i as its graphical interface. You can run this program either from CCP4i or from a script. You will get a log file (for example, warpNtrace_refine.log) and a PDB file like warpNtrace.pdb. Note: If the coordinate file warpNtrace.pdb is used directly for deposition, you can use this option. Otherwise, use another program for the final refinement.
Using PHENIX
PHENIX is a new software suite for the automated determination of macromolecular structures using X-ray crystallography and other methods. The PDB file generated by phenix.refine has both non-standard 'REMARK' records and the standard 'REMARK 3'. It is acceptable to keep the non-standard REMARK records for deposition. Note: Sometimes the MTZ file from PHENIX contains only 2Fo-Fc. Before deposition, you must make sure that the amplitude (Fo) or intensity (I) is included in the MTZ file.
There are three executable components of the program (pdb_extract, pdb_extract_sf, extract). The arguments of each program are described in detail below.
PROGRAM DESCRIPTION: pdb_extract is used to extract statistical information from the output files produced by the software for protein structural determination using Xray Crystallography and NMR me |