Help & Resources


Modeling workflow

The steps in creating a Chembench predictor are detailed below.

Select a Modeling Dataset

A dataset can contain one of two types of activity file, and each type is used with different modeling methods. "Continuous" activity values vary over a range, while "Category" activity values are discrete numbers (typically 1, 2, 3...). Continuous data is used with regression methods, and Category data is used with classification methods. As a result, the Modeling page enables different options depending on whether you pick a Continuous dataset or a Category one.

Define Model Descriptors

The descriptor type option defines which descriptors will be used to represent your compounds in the modeling process. If you want to use descriptor types other than those Chembench can generate, you can create a dataset containing your own descriptors from the Datasets page. Descriptor generation and scaling are skipped if you supply your own descriptors.

The generated descriptor values can be scaled by range scaling or auto scaling, or they can be left unscaled. Range scaling changes the range of each descriptor: it finds the maximum and minimum value of the descriptor, subtracts the minimum from each of the descriptor's values, then divides each result by (max-min) to produce values between 0 and 1. In auto scaling, the mean and standard deviation are found for each descriptor. The mean is subtracted from the descriptor's values, and the result is divided by the standard deviation. Auto scaling may perform better than range scaling in cases where outliers are expected in the descriptor values, because a single extreme value sets the maximum or minimum and compresses all the other range-scaled values into a narrow band. There is some debate over whether auto scaling or range scaling should be used in QSAR.
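The two scaling methods described above can be sketched in a few lines of Python. This is a minimal illustration, not Chembench's own code: the function names are hypothetical, a constant descriptor is mapped to zeros to avoid dividing by zero, and the population standard deviation is assumed (the text above does not say whether the population or sample form is used).

```python
from statistics import mean, pstdev


def range_scale(values):
    """Subtract the minimum, then divide by (max - min); results lie in [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Assumption: a constant descriptor is mapped to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def auto_scale(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # Assumption: population standard deviation.
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# One descriptor column across four hypothetical compounds.
descriptor = [2.0, 4.0, 6.0, 10.0]
ranged = range_scale(descriptor)
auto = auto_scale(descriptor)
print(ranged)
print(auto)
```

Each descriptor column is scaled independently, so in practice a function like these would be applied once per descriptor.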

Select Predictor Type and Parameters

At present, there are four model types available:

  • Random Forest (as implemented by scikit-learn)
  • Support Vector Machines (as implemented by libsvm)
  • k-Nearest Neighbors with Genetic Algorithm descriptor selection ("GA-kNN")
  • k-Nearest Neighbors with Simulated Annealing descriptor selection ("SA-kNN")

Choose Internal Data Splitting Method

(Note: for Random Forest predictors, no further internal data splitting is performed, so if Random Forest has been selected as the predictor type this step will not be displayed.)

The dataset's external validation set has already been defined. The compounds not in the external set, referred to as the "modeling set", will be divided into a training set and a test set for the creation of each model. For each such internal split, a model will be built on the training set and applied to the test set. (At the end of the modeling process, the good models are collected together into a predictor, and the predictor is applied to the external validation set.)
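As a rough illustration of repeated internal splitting, the sketch below divides a modeling set into several random training/test pairs. The function name, the number of splits, and the test fraction are all assumptions made for the example; they are not Chembench parameters.

```python
import random


def random_internal_splits(modeling_set, n_splits=3, test_fraction=0.2, seed=0):
    """Return a list of (training_set, test_set) pairs drawn at random.

    All parameter names and defaults here are hypothetical.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(modeling_set) * test_fraction))
    splits = []
    for _ in range(n_splits):
        shuffled = list(modeling_set)
        rng.shuffle(shuffled)
        # The first n_test compounds form the test set, the rest the training set.
        splits.append((shuffled[n_test:], shuffled[:n_test]))
    return splits


# Ten hypothetical compound identifiers standing in for the modeling set.
compounds = [f"cmpd_{i}" for i in range(10)]
splits = random_internal_splits(compounds)
for training_set, test_set in splits:
    print(len(training_set), len(test_set))
```

One model would then be built per pair, trained on the first element and evaluated on the second.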

The internal train/test splits can be made randomly or by sphere exclusion. Sphere exclusion is a process that chooses training set compounds which are close to the test set compounds in the descriptor space. This can help modeling by ensuring that each model will be presented with test cases that the model can reasonably predict.

We recommend using sphere exclusion for small datasets (under 300 compounds), and random selection for larger datasets (300 or more compounds).
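The idea behind sphere exclusion can be shown with a deliberately simplified sketch. It is not the procedure Chembench runs: the function name and the fixed radius are assumptions, and every compound inside a sphere is sent to the test set, whereas real implementations typically control the training/test ratio. The point it illustrates is that each test compound ends up within a set distance of some training compound in descriptor space.

```python
import math
import random


def sphere_exclusion_split(points, radius, seed=0):
    """Simplified sphere exclusion over a list of descriptor vectors.

    Returns (train_indices, test_indices). Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    rng.shuffle(remaining)
    train, test = [], []
    while remaining:
        # Pick the next unassigned compound as a sphere center (training set).
        center = remaining.pop(0)
        train.append(center)
        # Unassigned compounds inside the sphere go to the test set.
        inside = {i for i in remaining
                  if math.dist(points[center], points[i]) <= radius}
        test.extend(i for i in remaining if i in inside)
        remaining = [i for i in remaining if i not in inside]
    return train, test


# Six hypothetical compounds described by two scaled descriptors each.
points = [(0.0, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 1.0), (0.5, 0.0), (0.0, 0.5)]
radius = 0.3
train, test = sphere_exclusion_split(points, radius)
print(sorted(train), sorted(test))
```

A larger radius sends more compounds to the test set; a radius of zero leaves every compound in the training set.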

The Modeling Job

Modeling goes through six discrete steps:

  • First, descriptors for the selected dataset are scaled.
  • Second, the training and test sets are created.
  • Third, a y-randomized version of each train-test set is created, where the activity values are scrambled; this is set aside for later.
  • Fourth, the modeling procedure is performed on the train-test sets, generating models.
  • Fifth, the modeling procedure is run again, this time on the y-randomized train-test sets; this creates the y-randomized models.
  • Sixth, the models are bundled into a predictor and applied to the external validation set.
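The y-randomization in the third step amounts to permuting the activity values while leaving the descriptors in place, so that any real link between structure and activity is broken. A minimal sketch follows, assuming activities are held in a simple list aligned with the compounds; the function name and the seed parameter are hypothetical.

```python
import random


def y_randomize(activities, seed=0):
    """Return a scrambled copy of the activity values; the input is left unchanged."""
    rng = random.Random(seed)
    scrambled = list(activities)
    rng.shuffle(scrambled)
    return scrambled


# Hypothetical continuous activity values for five training compounds.
activities = [5.1, 6.3, 4.8, 7.2, 5.9]
scrambled = y_randomize(activities)
print(scrambled)
```

Models built on the scrambled activities should perform much worse than the real models; if they do not, the real models' apparent accuracy may be due to chance.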

When the job is finished, it can be viewed by clicking on its name in the Predictors section of the My Bench page.