Chemical Data Formats

The common formats for encoding small molecules are SMILES, InChI and SDF or MOL files. SMILES and InChI are text strings that encode the connectivity between the atoms of a molecule rather than their positions. Whereas SMILES codes may give rise to ambiguity, the more comprehensive InChI system unambiguously describes the full stereochemistry of a molecule. However, neither of these two formats can be readily used to display the three-dimensional structure of a molecule, because 3D structure can only be encoded by atom positions in a coordinate system. SDF (or MOL) files are the most frequently used file formats for small-molecule structures that encode atomic coordinates.


SDF

The structure-data file (SDF) format is based on the MOL file format. Both were developed by MDL Information Systems, which was later acquired by Accelrys (now named BIOVIA, which belongs to Dassault Systèmes). The original MOL file encoded only a single molecule, whereas files in SDF format can encode single or multiple molecules.
SDF files are formatted ASCII files that store information about the positions of the individual atoms that make up the molecule. Information about hybridisation state and connectivity is also encoded, although these latter data are less frequently and often inconsistently used. Multiple molecules are delimited by lines consisting of four dollar signs ($$$$).

Section Description Example
1 Header

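Consists of three lines. The first line is the name/title of the compound. The second line typically records the program that wrote the file and a timestamp. The third line is a comment line and is empty here.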
Aspirin
  CDK 0130151625

2 Atom count and version number

Here: 13 atoms and 13 bonds, MOL V2000 file format.
 13 13  0  0  0  0  0  0  0  0999 V2000
3 Atom block

1 line for each atom: x, y, z, element, etc.
    0.6500   -3.7500    0.0000 O    0  0  0  0  0  0   0  0  0  0  0  0
    1.9491   -3.0000    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
    1.9490   -1.5000    0.0000 O    0  0  0  0  0  0   0  0  0  0  0  0
    0.6500   -0.7500    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
   -0.6490   -1.5000    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
   -1.9490   -0.7500    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
   -1.9490    0.7500    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
   -0.6490    1.5000    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
    0.6500    0.7500    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
    1.9490    1.5000    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
    3.2481    0.7500    0.0000 O    0  0  0  0  0  0   0  0  0  0  0  0
    1.9491    3.0000    0.0000 O    0  0  0  0  0  0   0  0  0  0  0  0
    3.2481   -3.7500    0.0000 C    0  0  0  0  0  0   0  0  0  0  0  0
4 Bond block

1 line for each bond:
1st atom, 2nd atom, bond type, etc.
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  1  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  6  7  1  0  0  0  0
  7  8  2  0  0  0  0
  8  9  1  0  0  0  0
  4  9  2  0  0  0  0
  9 10  1  0  0  0  0
 10 11  2  0  0  0  0
 10 12  1  0  0  0  0
  2 13  1  0  0  0  0
5 Properties block

Optional; only if required to define charge, etc.
The example shown here sets a charge of -2 on atom no. 3 (oxygen). Note: this example is for illustration only and has no chemical meaning for Aspirin.
M  CHG  1   3  -2
6 End of molecule connectivity/property blocks

M  END
7 User-defined properties block

Optional; 3 lines per property:
> <key>
{value}
{empty line}
> <H-Acceptors>
4

> <H-Donors>
1

> <InChI>
InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)

> <InChI aux info>
AuxInfo=1/1/N:13,7,6,8,5,2,9,4,10,1,11,12,3/E:(11,12)/rA:21OCOCCCCCCCOOCHHHHHHHH/rB:d1;s2;s3;s4;d5;s6;d7;d4s8;s9;d10;s10;s2;s5;s6;s7;s8;s12;s13;s13;s13;/rC:;;;;;;;;;;;;;;;;;;;;;

> <InChI key>
BSYNRYMUTXBXSQ-UHFFFAOYSA-N

> <Mass>
180.0

> <Ring count>
1

> <Rotatable bonds>
3

> <SMILES>
O=C(Oc1ccccc1C(=O)O)C

> <clogP>
0.4

> <tPSA>
63.6

8 End of molecule

$$$$

More details about the MOL/SDF file format can be found, for example, here.
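
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not the way cApp reads these files: the function names (split_records, parse_record) and the small two-record example are hypothetical, only the V2000 fixed-width layout is handled, and the fallback name 'unnamed' simply mirrors the behaviour described for cApp.

```python
def split_records(text):
    """Split the text of an SDF file into one list of lines per molecule."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":          # end-of-molecule delimiter
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(l.strip() for l in current):     # plain MOL file without $$$$
        records.append(current)
    return records


def parse_record(lines):
    """Parse one V2000 record into name, atoms, bonds and properties."""
    name = lines[0].strip() or "unnamed"    # header line 1
    counts = lines[3]                       # counts line follows the 3 header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])

    atoms = []                              # (element, x, y, z)
    for line in lines[4:4 + n_atoms]:
        x, y, z = float(line[0:10]), float(line[10:20]), float(line[20:30])
        atoms.append((line[31:34].strip(), x, y, z))

    first_bond = 4 + n_atoms
    bonds = []                              # (1st atom, 2nd atom, bond type)
    for line in lines[first_bond:first_bond + n_bonds]:
        bonds.append((int(line[0:3]), int(line[3:6]), int(line[6:9])))

    # User-defined properties follow "M  END": "> <key>", value line(s), empty line.
    properties, key, seen_end = {}, None, False
    for line in lines[first_bond + n_bonds:]:
        if not seen_end:
            seen_end = line.startswith("M  END")
        elif line.startswith(">"):
            key = line[line.index("<") + 1:line.rindex(">")]
            properties[key] = []
        elif not line.strip():
            key = None                      # empty line closes the property
        elif key is not None:
            properties[key].append(line)
    properties = {k: "\n".join(v) for k, v in properties.items()}
    return {"name": name, "atoms": atoms, "bonds": bonds, "properties": properties}


# Hypothetical two-record file: water (with a property) and an unnamed carbon atom.
EXAMPLE_SDF = "\n".join([
    "Water",
    "  sketch",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.9572    0.0000    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "   -0.2400    0.9266    0.0000 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  1  3  1  0  0  0  0",
    "M  END",
    "> <Mass>",
    "18.0",
    "",
    "$$$$",
    "",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
])

molecules = [parse_record(r) for r in split_records(EXAMPLE_SDF)]
```

The sketch slices fixed-width columns instead of splitting on whitespace, because adjacent fields in the counts and bond lines can run together once atom numbers reach three digits.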

In cApp, compounds from single or multi-molecule SDF files can be added by one of the File - Add compound functions. If no compound name is given, cApp will assign 'unnamed'. These files should, but do not have to, carry a '.sdf' or '.mol' extension.

SDF files often have more than one chemical entity specified in one entry, e.g. if the compound exists as an ion pair. The construction of molecules from such entries will not be successful, and cApp will highlight such compounds with a red background in the first column. When reading such files, cApp provides an auto-selection procedure whereby the largest chemical entity of every entry is automatically selected. Note: the largest entity is the molecule of interest in most, but not all, cases. Auto-selection can be invoked by:


SMILES

The Simplified Molecular-Input Line-Entry System (SMILES) encodes the connectivity of atoms within a compound in an ASCII string, but does not offer 2D or 3D coordinates for molecules.

Example: O=C(Oc1ccccc1C(=O)O)C


In cApp, single compounds can be added to a compound set by Compound Set - Add compound.
Alternatively, files with multiple SMILES codes can be processed by one of the File - Add compound functions. Files that contain multiple SMILES codes need to list one code per line, followed by the name of the compound (separated by a space or tab), as in this example. If no compound name is given, cApp will assign 'unnamed'. These files should, but do not have to, carry a '.smi' or '.smiles' extension.
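
The one-code-per-line layout can be illustrated with a short Python sketch. The function name read_smiles_lines is hypothetical, and the code is not taken from cApp. It only shows how a line splits into code and name, and how the 'unnamed' fallback could be applied.

```python
def read_smiles_lines(lines):
    """Return (smiles, name) pairs from lines of the form '<code> <name>'."""
    compounds = []
    for raw in lines:
        line = raw.strip()
        if not line:                      # skip empty lines
            continue
        # Split at the first run of spaces or tabs only, so that names
        # containing spaces stay in one piece.
        parts = line.split(None, 1)
        smiles = parts[0]
        name = parts[1].strip() if len(parts) > 1 else "unnamed"
        compounds.append((smiles, name))
    return compounds


example = [
    "O=C(Oc1ccccc1C(=O)O)C\tAspirin\n",
    "CC(=O)O acetic acid\n",
    "\n",
    "c1ccccc1\n",
]
compounds = read_smiles_lines(example)
```

The same loop works for a file opened with open(path), since iterating over a file object yields its lines.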

SMILES codes may contain more than one chemical entity, e.g. if the compound exists as an ion pair. In such SMILES patterns, the different entities are separated by a period ('.'). The construction of molecules from such patterns will not be successful, and cApp will highlight such compounds with a red background in the first column. For such cases, cApp provides an auto-selection procedure whereby the largest chemical entity of the SMILES pattern is automatically selected. Note: the largest entity is the molecule of interest in most, but not all, cases. Auto-selection can be invoked by:
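
The idea of selecting the largest entity from a multi-entity SMILES pattern can be sketched in Python as shown here. How cApp measures 'largest' is not stated, so this sketch assumes that the number of non-hydrogen atoms is the criterion. The names ATOM, heavy_atom_count and largest_fragment are hypothetical, and the atom counting is a simple regular-expression approximation rather than a full SMILES parser.

```python
import re

# One match per atom: bracket atoms first, then the two-letter organic-subset
# elements, then the one-letter aliphatic and aromatic atoms.
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_count(fragment):
    """Count non-hydrogen atoms in one SMILES fragment (approximation)."""
    count = 0
    for token in ATOM.findall(fragment):
        if token.startswith("["):
            # Skip an optional isotope number, then read the element symbol.
            symbol = re.match(r"\[\d*([A-Z][a-z]?|[a-z]+)", token)
            if symbol and symbol.group(1) == "H":
                continue                  # explicit hydrogen or proton
        count += 1
    return count


def largest_fragment(smiles):
    """Return the '.'-separated entity with the most non-hydrogen atoms."""
    fragments = smiles.split(".")
    # max() keeps the first fragment when two are equally large.
    return max(fragments, key=heavy_atom_count)


sodium_acetate = largest_fragment("[Na+].CC(=O)[O-]")
amine_salt = largest_fragment("CCN(CC)CC.Cl")
```

For the two salts above, the sketch keeps the acetate and the amine and drops the counter-ions. As noted above, the largest entity is not always the molecule of interest, so the result should be checked.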


InChI

The IUPAC International Chemical Identifier, InChI, was developed as an accurate code to denote the exact configuration of a molecule. With the InChI code, a molecule is unambiguously described by means of its stereochemical, tautomeric, isotopic and electronic charge information.
InChI information exists in two forms:


InChI uses a single string that is broken up into six layers (and their sublayers) of information coding for the structure itself. Layers and sublayers are separated by a single '/', which is followed by a prefix and the particular code.

Layer  Prefix            Description
       InChI=1S          Every InChI starts with the string "InChI=" followed by the version number, which is currently 1. The letter 'S' indicates standard InChI.
1      (no prefix)       Empirical formula
1      c                 Atom connections
1      h                 Hydrogen atoms
2      p                 Protons
2      q                 Charge on the compound
3      b                 Double bonds
3      s                 Type of stereochemistry information
3      t,m               Tetrahedral stereochemistry
4      i,h and b,t,m,s   Isotopic layer
5      f                 Fixed-H; not included in standard InChIs.
6      r                 Reconnected metal atoms; not included in standard InChIs.

Example: InChI=1S/C20H32O4/c1-12(10-21)4-3-9-20(11-22)16-8-5-13(2)14-6-7-15(19(23)24)18(20)17(14)16/h7,12-14,16-18,21-22H,3-6,8-11H2,1-2H3,(H,23,24)/t12-,13+,14+,16-,17+,18+,20+/m1/s1
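
The way an InChI string breaks into prefixed segments can be shown with a short Python sketch. The function name split_inchi and the labels "version" and "formula" are hypothetical choices for this illustration. The sketch only separates the segments and does not interpret or validate them.

```python
def split_inchi(inchi):
    """Split an InChI string into (prefix, code) pairs, in order."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len("InChI="):].split("/")
    layers = [("version", version)]
    for segment in segments:
        if not segment:
            continue
        if segment[0].islower():
            # Prefixed segment: one lower-case letter followed by the code.
            layers.append((segment[0], segment[1:]))
        else:
            # The empirical formula carries no prefix.
            layers.append(("formula", segment))
    return layers


aspirin = split_inchi(
    "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
)
alanine = split_inchi(
    "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
)
```

A list of pairs is returned instead of a dictionary because prefixes such as h, b, t, m and s can occur again in the isotopic layer.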


In cApp, single compounds can be added to a compound set by Compound Set - Add compound.
Alternatively, files with multiple InChI codes can be processed by one of the File - Add compound functions. Files that contain multiple InChI codes need to list one code per line, followed by the name of the compound (separated by a space or tab), as in this example. If no compound name is given, cApp will assign 'unnamed'. These files should, but do not have to, carry a '.inc' or '.inchi' extension.