Understanding and predicting molecular properties is the very first step in drug development. Properties such as solubility hint at whether a compound makes a good drug candidate. These fundamental properties are often related to the ADME (Absorption, Distribution, Metabolism and Excretion) profile of a molecule, which lacks a proper theoretical description and remains challenging to predict.
The partition coefficient (logP) quantifies the relative solubility of a compound in octanol versus water. Octanol is a non-polar solvent that stands in for the lipid environments of the human body, such as the cell membrane. Water makes up most of the rest of the body and is found mainly in the blood and inside cells. Generally, a desirable drug molecule dissolves both in water, so that it is distributed via the bloodstream, and in lipid-like solvents, so that it can penetrate into the cell interior and take effect. A partition coefficient in the range (-0.4, +5.6) is therefore desired; this range is usually attributed to the Ghose filter, while Lipinski's rule of five sets the related criterion that logP should not exceed 5. Beyond its own importance, logP also serves as an indicative descriptor for other medicinal properties. As a result, accurate computational modeling of logP is crucial.
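To make the quantity concrete, the sketch below computes logP from its standard definition, the base-10 logarithm of the ratio of a compound's equilibrium concentration in octanol to that in water, and checks it against the (-0.4, +5.6) window quoted above. The function names and the example concentrations are hypothetical and chosen only for illustration; they are not measured values.

```python
import math


def log_p(conc_octanol: float, conc_water: float) -> float:
    """Return log10 of the octanol/water concentration ratio.

    Both concentrations must be in the same units and strictly positive.
    """
    if conc_octanol <= 0 or conc_water <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(conc_octanol / conc_water)


def in_desired_range(logp: float, low: float = -0.4, high: float = 5.6) -> bool:
    """Check whether logP lies strictly inside the open interval (low, high)."""
    return low < logp < high


# Illustrative (not measured) concentrations: 100-fold preference for octanol.
example = log_p(100.0, 1.0)
print(example, in_desired_range(example))
```

A positive logP means the compound prefers the octanol phase, and a negative value means it prefers water, so the window favors compounds that are moderately lipophilic without being insoluble in water.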
Traditional quantitative structure-activity/property relationship (QSAR/QSPR) models often treat logP as linearly dependent on various molecular descriptors, including hydrogen-bond acceptors and donors, molecular surface area, etc. But such descriptors can hardly capture molecular connectivity information and can overlook substantial differences between molecules. Molecular fingerprints were originally designed to facilitate fragment-based structure search in molecular databases. They are regarded as one of the most accurate ways of associating molecular features with chemical activities. The cheminformatics package RDKit comes with multiple fingerprint libraries, such as Morgan circular, Avalon bit-based and ErG fingerprints. The fingerprints are hashed into a bit vector of user-defined length.
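The idea of hashing structural features into a fixed-length bit vector can be illustrated with a toy sketch in plain Python. This is not RDKit's algorithm: the fragment labels, the function name fold_features, and the choice of CRC32 as the hash are all assumptions made for illustration. It only shows how an arbitrary set of features maps onto a vector whose length the user picks.

```python
import zlib


def fold_features(features, n_bits=16):
    """Hash each feature string to one position in a bit vector of length n_bits.

    Distinct features may collide on the same bit; shorter vectors
    collide more often, which is the usual cost of a compact fingerprint.
    """
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    bits = [0] * n_bits
    for feature in features:
        # CRC32 is deterministic across runs, unlike Python's built-in hash().
        index = zlib.crc32(feature.encode("utf-8")) % n_bits
        bits[index] = 1
    return bits


# Hypothetical fragment labels standing in for substructures of one molecule.
fragments = ["C-C", "C-O", "O-H", "C-C-O"]
print(fold_features(fragments, n_bits=16))
print(fold_features(fragments, n_bits=64))
```

Because the mapping is deterministic, two molecules that share a fragment always set the same bit, which is what makes such vectors usable both for substructure search and as fixed-size inputs to a property model.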