Help & Resources
File Formats
Datasets uploaded to Chembench are expected to contain these types of files.
Activity (.act
) Files
The .act
file stores activities of the dataset's compounds from some assay. An activity file is
necessary for building predictive models on Chembench. Each line of an activity file is a chemical identifier
followed by an activity value. Activity files may contain continuous or category data.
The chemical identifiers in an activity file may be anything: SMILES strings, chemical names, and index
numbers are commonly used. The only constraint is that the chemical identifiers in your activity file must
match the identifiers in .sdf
and .x
files uploaded in the same dataset and
be in the same order.
Continuous activity data can be any decimal number. Typically continuous data comes from quantitative assays, e.g., of binding affinity. An example of a continuous activity file:
compound1 2.48 compound2 4.89 compound3 7.22 compound4 9.73 compound5 12.19 compound6 14.55 compound7 17.34 ...
Category activity data represents endpoints or is discretized from continuous data. Category activities are typically non-negative consecutive integers (e.g. 0, 1, 2). An example of a category activity file:
compound1 0 compound2 0 compound3 1 compound4 1 compound5 1 compound6 2 compound7 2 ...
Structure (.sdf
) Files
The .sdf
file stores the structures of the compounds in the dataset. An example:
compound1 comment line (can be anything) 44 47 0 0 1 0 0 0 0 0999 V2000 1.3550 -4.8300 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1.0920 -3.9960 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 0.4780 -2.3340 0.0000 C 0 0 3 0 0 0 0 0 0 0 0 0 ... 1.0240 3.0240 0.0000 O 0 0 0 0 0 0 0 0 0 0 0 0 0.5970 4.3590 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 2.1340 2.1230 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0 1 2 1 0 0 0 0 2 3 1 0 0 0 0 3 4 1 0 0 0 0 ... 41 42 1 0 0 0 0 41 44 1 0 0 0 0 42 43 1 0 0 0 0 M END $$$$ ...
Note that each compound is terminated by the sequence $$$$
.
Matrix (.x
) files
The .x
file is a descriptor file format used by Chembench. It is similar to the matrix format
accepted by other data mining programs. It contains a matrix of compounds and their descriptor values. All
descriptor values must be numeric. The format is described below.
[LINE 1]: 120 50
This header line indicates that a 120 by 50 matrix follows: There are 120 compounds, each with 50 descriptor values.
[LINE 2]: descriptor1 descriptor2 descriptor3...
The second line contains the names of the descriptors.
[LINE 3]: 1 compound1 0.5 0.609756 0.5625 ... [LINE 4]: 2 compound2 0 0 0.0208333 0.142857 ... [LINE 5]: 3 compound3 0 0 0.0208333 0.142857 ... ...
From the third line on, each line represents one compound. The first value on each line is an index, starting at 1. The second value is an ID for the compound that matches with the IDs in the corresponding SDF and ACT files. The remaining numbers are the values of the descriptors for the compound.