Frequently Asked Questions
1. Where can I get more information about the QSAR methodology?
2. What forms of kNN QSAR are supported?
3. Which accuracy functions are supported for QSAR Category?
4. What formats are needed for the SD and activity files?
5. How are the values in the model statistics table computed?
6. How is the Activity Histogram generated?
7. How do the basic kNN parameters affect the model building?
8. What are the Advanced kNN Parameters?
9. What types of descriptors are available?
10. What forms of descriptor normalization are supported?
11. How are data sets divided?
12. What methods of data set splitting are supported?
13. How do the sphere exclusion parameters affect data set splitting?
14. How can I reach the C-ChemBench development team?
1. Where can I get more information about the QSAR methodology?
For a detailed explanation of the methodology, see the QSAR Manual.
2. What forms of kNN QSAR are supported?
We currently support Simulated Annealing kNN QSAR for both continuous and categorical data.
3. Which accuracy functions are supported for QSAR Category?
Four different accuracy formulae are currently available. The default is Correct Classification Rate (CCR) for both optimization and selection of models. The formula to be used is chosen under the kNN Advanced Parameters tab in Model Building. The four supported formulae are:
Standard Accuracy
Correct Classification Rate
Name 1
Name 2
For more information about CCR, see, for example, Combinatorial QSAR Modeling of P-Glycoprotein Substrates by Lima et al. In general, the first two formulae are more appropriate for nominal data and the last two for ordinal data.
4. What formats are needed for the SD and activity files?
The SD file is a standard chemical structure storage format as described on the Symyx MDL web site. The activity file can be either a two-column (space- or tab-separated) text file or an Excel spreadsheet. In either case, the file should start with a header row, and the two columns should hold the compound ID and the activity value.
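As an illustration of the two-column text layout, a minimal Python sketch of a reader might look like the following. The function name `read_activity_text`, the header labels, and the compound IDs are hypothetical; this is not the parser used by C-ChemBench.

```python
def read_activity_text(text):
    """Parse two-column activity text: one header row, then 'ID value' rows."""
    lines = [line for line in text.splitlines() if line.strip()]
    activities = {}
    for line in lines[1:]:                  # skip the header row
        compound_id, value = line.split()   # splits on spaces or tabs
        activities[compound_id] = float(value)
    return activities

# Hypothetical sample: one tab-separated row and one space-separated row.
sample = "ID\tActivity\ncmpd1\t5.2\ncmpd2 6.8\n"
activities = read_activity_text(sample)
```

The sketch assumes compound IDs contain no whitespace; a row with more or fewer than two fields raises a `ValueError`.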
5. How are the values in the model statistics table computed?
For QSAR continuous, the values are:
nnn: the number of nearest neighbors used in this model.
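q2: the leave-one-out (LOO) cross-validated correlation coefficient R2 (q2) for the training set:
$$q^2 = 1 - \frac{\sum_i \left(y_i^{obs} - y_i^{pred}\right)^2}{\sum_i \left(y_i^{obs} - \bar{y}\right)^2}$$
where $y_i^{pred}$ and $y_i^{obs}$ are the predicted and observed activities of the i-th compound of the training set, and $\bar{y}$ is the average activity.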
n: the number of test set compounds predicted by the model.
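r: the correlation between the observed and predicted activity values, computed as the Pearson correlation coefficient. Specifically,
$$r = \frac{\sum_i \left(y_i - \bar{y}\right)\left(\hat{y}_i - \bar{\hat{y}}\right)}{\sqrt{\sum_i \left(y_i - \bar{y}\right)^2 \, \sum_i \left(\hat{y}_i - \bar{\hat{y}}\right)^2}}$$
where $y_i$ and $\hat{y}_i$ are the observed and predicted activities, respectively, and $\bar{y}$ and $\bar{\hat{y}}$ are their means. We display both r and r2 as a convenience to the user.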
r2: the square of the Pearson correlation coefficient between the observed and predicted activity values.
R01^2: the coefficient of determination for the regression through the origin between predicted and observed activities. Specifically,

R02^2: the coefficient of determination for the regression through the origin between observed and predicted activities. Specifically,

k1: The slope of the 0-origin line fitted for predicted vs. observed activities. Specifically,

k2: The slope of the 0-origin line fitted for observed vs. predicted activities. Specifically,

For QSAR category, the following additional values
6. How is the Activity Histogram generated?
The range of activity values is always divided into 10 bins.
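A minimal Python sketch of such a binning is shown below. It assumes the 10 bins have equal width and that the maximum value is counted in the last bin; the function name `activity_histogram` and the sample values are hypothetical.

```python
def activity_histogram(values, n_bins=10):
    """Count values in n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if width == 0:
            index = 0  # all values identical: put everything in the first bin
        else:
            # Clamp so the maximum value lands in the last bin, not an 11th one.
            index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    return counts

counts = activity_histogram([0.0, 1.0, 2.5, 5.0, 9.9, 10.0])
```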
7. How do the basic kNN parameters affect the model building?
The kNN modeling technique identifies descriptor subspaces in which similar compounds have similar response variables. The QSAR modeling technique explores subspaces of different dimensions as defined by the descriptor minimum, maximum, and step size. For example, the default settings of minimum 5, maximum 20, and step size 5 will explore subspaces of dimensions 5, 10, 15, and 20.
Because the process is stochastic in nature, different runs may optimize to different models. In order to fully explore each subspace, we generate multiple models. The number of runs identifies how many models are generated.
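The mapping from minimum, maximum, and step size to the explored dimensions can be sketched in Python as follows. The function name `subspace_dimensions` is hypothetical, and the sketch assumes the maximum is included whenever the step lands on it, as in the example above.

```python
def subspace_dimensions(minimum, maximum, step):
    """List the descriptor subspace dimensions explored, maximum included."""
    return list(range(minimum, maximum + 1, step))

dimensions = subspace_dimensions(5, 20, 5)
```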
8. What are the Advanced kNN Parameters?
This set of parameters is intended for users who are fully familiar with the kNN QSAR model development procedure. Users who are not familiar with it should leave the parameters unchanged. For those who are familiar with the process, the parameters are described below.
Number of nearest neighbors: During model development, this is the maximum number of neighbors used in the kNN pattern recognition. For each set of descriptors selected throughout the process, k will be optimized between 1 and this number.
Percentage for pseudo neighbors: In order to reduce the computational time of model building, we do not compute the distance to every other compound in the data set. Rather we use the technique of pseudo-neighbors that is described in (), where a subset of the compounds is used as the potential nearest neighbors. This number represents the percentage of the data set to be used. Note that the default setting of 100 directs that the full set be used.
Number of Permutations: Each time that a model is built, a simulated annealing process is used to optimize the descriptors selected for the model. This number determines how many descriptors are replaced in each cycle of the simulated annealing process.
Number of Cycles: This is the maximum number of cycles that will be run prior to a temperature reduction in the simulated annealing process. (Temperature reductions can also be made based on finding a better set of selected descriptors based on q2.)
Log Initial Temperature: The temperature at which the simulated annealing process is initialized in log10 units.
Log Final Temperature: The temperature below which the simulated annealing process is terminated in log10 units. The final temperature must be lower than the initial temperature. Note that the number of different temperatures in the simulated annealing process is a significant contributor to computational time required.
Mu: The factor by which the temperature is reduced. Specifically, the new temperature is computed by multiplying the old temperature by mu.
Minimum q2 (Continuous only): The minimum q2 of an acceptable model. Only acceptable models are displayed and used in consensus prediction of external sets. It will be possible to change this threshold in the model analysis phase (coming soon). See the description of the model statistics for more information on q2.
Minimum r2 (Continuous only): The minimum r2 of an acceptable model. Only acceptable models are displayed and used in consensus prediction of external sets. It will be possible to change this threshold in the model analysis phase (coming soon). See the description of the model statistics for more information on r2.
Minimum slope and maximum slope (Continuous only): The cutoffs for the acceptable slope, k1, of the 0-origin line fitted for predicted vs. observed and slope k2 of the 0-origin line fitted for observed vs. predicted. If both k1 and k2 fall outside this range, the model is rejected.
Relative_diff_R_R0 (Continuous only): A measure of the quality of the predictive power of the model. Models with a value above this are considered unreliable and are rejected. The value is computed as follows:
Min(ABS(1 - R01^2/r^2), ABS(1 - R02^2/r^2))
Diff_R01_R02 (Continuous only): Also a measure of the predictive power of the model. Models with a value above this are considered unreliable and are rejected. The value is computed as follows:
ABS(R01^2 - R02^2)
Minimum Accuracy for Training Set (Category only): The minimum accuracy of an acceptable model as calculated by the selected accuracy function from the kNN basic parameters. Models that have an accuracy less than this value for LOO prediction of the training set are rejected.
Minimum Accuracy for Test Set (Category only): The minimum accuracy of an acceptable model as calculated by the selected accuracy function from the kNN basic parameters. Models that have an accuracy less than this value for prediction of the test set are rejected.
Applicability Domain Cutoff: Each predictive model is essentially a subspace of selected descriptors in which similar compounds display similar activity. Within this subspace, we calculate the average distance (and standard deviation) from each compound to its k nearest neighbors. Only compounds that have a distance within some number of standard deviations of that average will be predicted. This cutoff defines that number of standard deviations. As the cutoff increases, prediction coverage increases while accuracy of prediction may drop. The default value allows prediction of compounds with a total distance within 1 standard deviation of the average seen in the training set.
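Two of the parameter groups above lend themselves to a short Python sketch: the temperature schedule implied by Log Initial Temperature, Log Final Temperature, and Mu, and the acceptance checks for continuous models. This is an illustration only, not the C-ChemBench code. It assumes mu lies strictly between 0 and 1, that annealing stops once the temperature drops below the final temperature, and that the two difference measures subtract the ratio or value as their names suggest. All function and parameter names, and the threshold values in the usage lines, are hypothetical.

```python
def temperature_schedule(log_initial, log_final, mu):
    """List the temperatures visited, from 10**log_initial down to 10**log_final."""
    if not 0.0 < mu < 1.0:
        raise ValueError("mu must be between 0 and 1 to reduce the temperature")
    if log_final >= log_initial:
        raise ValueError("final temperature must be lower than initial temperature")
    temperature = 10.0 ** log_initial
    final = 10.0 ** log_final
    schedule = []
    while temperature >= final:      # stop once below the final temperature
        schedule.append(temperature)
        temperature *= mu            # new temperature = old temperature * mu
    return schedule

def passes_continuous_filters(q2, r2, k1, k2, r01_2, r02_2,
                              min_q2, min_r2, min_slope, max_slope,
                              max_relative_diff, max_diff_r01_r02):
    """Apply the acceptance criteria described for continuous models."""
    # The model is rejected only if BOTH slopes fall outside the range.
    slope_ok = (min_slope <= k1 <= max_slope) or (min_slope <= k2 <= max_slope)
    relative_diff = min(abs(1 - r01_2 / r2), abs(1 - r02_2 / r2))
    diff_r01_r02 = abs(r01_2 - r02_2)
    return (q2 >= min_q2 and r2 >= min_r2 and slope_ok
            and relative_diff <= max_relative_diff
            and diff_r01_r02 <= max_diff_r01_r02)

# Illustrative values only; these are not the application's defaults.
schedule = temperature_schedule(log_initial=0, log_final=-1, mu=0.5)
accepted = passes_continuous_filters(
    q2=0.7, r2=0.8, k1=0.99, k2=1.01, r01_2=0.79, r02_2=0.78,
    min_q2=0.5, min_r2=0.6, min_slope=0.85, max_slope=1.15,
    max_relative_diff=0.1, max_diff_r01_r02=0.3)
```

The length of `schedule` shows why the temperature settings drive computational time: a mu closer to 1 or a wider gap between the initial and final temperatures produces more temperature steps, each of which may run up to the configured number of cycles.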
9. What types of descriptors are available?
MolConn-Z, Dragon, MOE2D keys, and MACCS keys.
10. What forms of descriptor normalization are supported?
We support range scaling, autoscaling, or no scaling. Range scaling scales the descriptor values based on the minimum and maximum values actually seen in the data set -- normalizing them from 0 to 1. Autoscaling normalizes based on standard deviation.
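The two scaling options can be sketched in Python for a single descriptor column as follows. The sketch assumes that autoscaling subtracts the mean before dividing by the sample standard deviation, which is the usual meaning of the term but is not spelled out above; the function names are hypothetical.

```python
import statistics

def range_scale(values):
    """Map values linearly onto [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant descriptor: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def autoscale(values):
    """Center on the mean and divide by the sample standard deviation (assumed)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)      # requires at least two values
    if sd == 0:
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]

descriptor = [2.0, 4.0, 6.0]
ranged = range_scale(descriptor)
auto = autoscale(descriptor)
```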
11. How are data sets divided?
Data sets are divided into three groups -- the training set, the test set, and the external validation set -- in order to provide robust and predictive models. This division is required, but the user can control the size of the external validation set and choose how the splitting is done.
12. What methods of data set splitting are supported?
At this time, we only support the sphere exclusion method of rational data set division.
For details on this methodology, see J Comput Aided Mol Des. 2003 Feb-Apr;17(2-4):241-53.
13. How do the sphere exclusion parameters affect data set splitting?
Sphere exclusion is a structured methodology for dividing the data set into training and test sets. It does this by selecting a compound and assigning each neighbor within a certain radius to either the training or the test set. It then eliminates all selected compounds and repeats this process until all compounds have been assigned to one of the two sets.
Number of Sphere Radii: Allows the user to create multiple training and test sets to be used in model building by defining the number of different radii to be used. The value of the different radii is determined by Rmin+i*(Rmax-Rmin)/(4*N), where Rmin is the minimum distance between two points in the dataset and Rmax is the maximum distance between two points in the dataset.
Number of Starting Points: The process can begin with either one or two selected points. When two points are used, they are the compounds with the minimum and maximum activity values. If one point is selected, it is
Selection of Next Training Set Point is Based on: When the assigned compounds are removed from the data set, another point needs to be selected. This parameter determines how that next point is chosen: randomly, or as close or as far as possible from the previous sphere center.
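The radius formula given for Number of Sphere Radii can be sketched in Python as follows. The text does not define N or the range of i, so the sketch assumes that N is the number of sphere radii and that i runs from 1 to N; it also assumes Euclidean distances between descriptor vectors. The function name `sphere_radii` and the sample points are hypothetical.

```python
import math
from itertools import combinations

def sphere_radii(points, n_radii):
    """Radii Rmin + i*(Rmax - Rmin)/(4*N) for i = 1..N (assumed index range)."""
    distances = [math.dist(a, b) for a, b in combinations(points, 2)]
    r_min, r_max = min(distances), max(distances)
    return [r_min + i * (r_max - r_min) / (4 * n_radii)
            for i in range(1, n_radii + 1)]

# Three hypothetical compounds in a 2-D descriptor space.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
radii = sphere_radii(points, n_radii=2)
```

Under these assumptions the largest radius is Rmin + (Rmax - Rmin)/4, so every radius stays in the lower quarter of the range of pairwise distances.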
14. How can I reach the C-ChemBench development team?
The C-ChemBench development team can be reached at ceccrhelp (at) listserv.unc.edu.