Machine-Learning in Drug Discovery

Virtual high-throughput screening (vHTS) can be used in drug discovery to replace experimental HTS or to preselect a subset of compounds for screening, reducing cost and time. The two main approaches to vHTS are structure-based and ligand-based screening. Structure-based screening does not need any known active molecules, but does need either a crystal structure or a good homology model of the target protein. Ligand-based screening, on the other hand, does not need any knowledge of the target protein; instead it identifies common patterns and features of known active molecules by creating a 2D fingerprint or 3D pharmacophore.
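
As a rough illustration of the 2D-fingerprint idea, the sketch below ranks a small library by Tanimoto similarity to a known active. The fingerprints are hypothetical sets of "on" bit indices invented for this example, and the compound names are placeholders; this is a generic similarity search, not the method of any tool described in this article.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_library(query_fp: set, library: dict) -> list:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: each integer is the index of a set bit.
known_active = {1, 4, 7, 9, 15}
library = {
    "cmpd_A": {1, 4, 7, 9, 20},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9, 15},
}
ranking = rank_library(known_active, library)
```

In practice the bit sets would come from a cheminformatics toolkit rather than being written by hand, and the top of the ranking would be the subset preselected for experimental screening.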

Scientists at Imperial College London/Equinox Pharma have developed a new rule-based vHTS methodology (INDDEx) that exploits logic-based machine learning to enhance performance in ligand-based screening.  INDDEx (Investigational Novel Drug Discovery by Example) is particularly good at identifying actives that are structurally distinct from the training set, making it useful for scaffold-hopping.

INDDEx learns easily interpretable qualitative logic rules from active ligands. These rules – in the form of ‘an active molecule requires fragment A and fragment B separated by a distance in Angstroms’ or ‘an active molecule requires the presence of fragment C’ or ‘an active molecule must NOT contain fragment D’ – give an insight into chemistry, relate molecular substructure to activity and can be used to guide the next steps of drug design chemistry. These qualitative rules are then weighted using Support Vector Machines (SVMs) to produce QSAR rules that can be used to generate novel in silico hits.
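
The three rule shapes quoted above can be sketched as boolean tests whose weighted sum gives a score. Everything in this example is an assumption made for illustration: the `Molecule` record, the fragment names, the 5.0 Å target distance with its tolerance, and the weights and bias (which in INDDEx would come from the SVM step, not be set by hand). It is not the INDDEx implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Molecule:
    """Hypothetical record: fragment names plus pairwise distances in Angstroms."""
    name: str
    fragments: set
    distances: dict = field(default_factory=dict)  # {frozenset({"A", "B"}): 5.4}


def has_fragment(frag):
    """Rule: an active molecule requires the presence of `frag`."""
    return lambda mol: frag in mol.fragments


def lacks_fragment(frag):
    """Rule: an active molecule must NOT contain `frag`."""
    return lambda mol: frag not in mol.fragments


def pair_at_distance(frag_1, frag_2, target, tolerance=1.0):
    """Rule: two fragments must be present at roughly `target` Angstroms apart."""
    def rule(mol):
        d = mol.distances.get(frozenset((frag_1, frag_2)))
        return d is not None and abs(d - target) <= tolerance
    return rule


# Illustrative rules with made-up weights standing in for SVM-learned ones.
WEIGHTED_RULES = [
    (pair_at_distance("A", "B", target=5.0), 1.5),
    (has_fragment("C"), 0.8),
    (lacks_fragment("D"), 0.6),
]
BIAS = -1.5  # hypothetical offset; a positive score is read as "predicted active"


def score(mol):
    """Linear score: bias plus the weights of every rule the molecule satisfies."""
    return BIAS + sum(weight for rule, weight in WEIGHTED_RULES if rule(mol))


hit = Molecule("hit", {"A", "B", "C"}, {frozenset(("A", "B")): 5.4})
miss = Molecule("miss", {"A", "D"})
```

Because each rule stays a readable statement about fragments, the satisfied rules can be listed alongside the score to explain why a molecule was ranked highly.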

INDDEx has been shown to be a powerful new approach to virtual screening. Its strength lies in learning topological descriptors from multiple active compounds, although for scaffold hopping in isolation it performs well even when there are only a small number of active molecules to learn from. One very attractive feature of INDDEx is that the rules it produces can be readily understood and used by medicinal chemists. The technology has been extensively validated and shown to outperform comparable approaches (J. Phys. Chem. B 2012, 116, 6732-6739). In a study by Equinox and Imperial on sirtuin 2 (an NAD-dependent histone deacetylase), INDDEx combined with structure-based docking was able to learn from only eight actives and identify a chemically novel hit that was experimentally validated to have an IC50 of 0.6 µM.

INDDEx has wide-scale applications including rescuing failed programmes, directing hit-to-lead programmes and scaffold-hopping. Furthermore, INDDEx has the potential to derive rules for off-target activities such as the hERG receptor.

If you would like to find out more about INDDEx and Equinox Pharma, please contact us.

SkelGen – A Newly Available Tool for Computational Drug Design

With the rapidly growing body of biostructural information, structure-based drug design has increased in importance and a variety of computational methods have found a place in the drug discovery toolkit.

The de novo design program, SkelGen, was developed by De Novo Pharmaceuticals based on research begun in the Department of Pharmacology at the University of Cambridge. SkelGen constructs candidate ligands by assembling small molecular fragments within a protein target such as an enzyme or receptor (usually derived from X-ray crystal data). When growing a ligand, SkelGen uses information coded in the fragments and within its algorithm to favour synthetically tractable molecules. SkelGen is able to explore around one trillion low molecular weight, drug-like molecules using a default set of 1600 fragments. Since the accessible chemical space is so large, the majority of designed molecules are novel and patentable.

Whilst SkelGen can be run with minimal input, it also permits extensive control by the end-user, allowing the scientist to incorporate prior knowledge and insights into the drug design process. As well as completely de novo design, molecule generation can also be started from a user-defined fragment (for example, a low-affinity molecule identified by fragment-screening). SkelGen can also be used for scaffold hopping (chemotype switching) and focused library design.

Until recently SkelGen was only accessible through collaborations with De Novo Pharmaceuticals but is now available under both academic and commercial licenses. With these new licensing models, SkelGen can be a cost-effective (and accessible) tool for all scientists engaged in drug design. If you would like to find out more about SkelGen, please contact us.

New Tools for Characterizing Natural Products

Researchers at the University of California, San Diego have developed computational tools that will allow scientists to quickly and easily determine whether newly isolated nonribosomal peptides (NRPs) are novel or already known. NRPs are a diverse family of secondary metabolites produced by microorganisms such as bacteria and fungi and have a wide range of biological activities: examples of NRPs in clinical use include the antibiotics penicillin, cephalosporin and vancomycin; the immunosuppressant cyclosporine; and the cytostatic drug bleomycin. NRPs, which are produced independently of mRNA by nonribosomal peptide synthetases, often have cyclic and/or branched structures and may contain non-proteinogenic amino acids as well as numerous other chemical modifications. The lack of genomic DNA information and the complex chemical structures have made it very difficult, time-consuming and costly to determine the structure of NRPs, but the new algorithms should replace manual annotation and allow much more rapid characterization. The algorithms are able to calculate the chemical structure of a newly isolated cyclic NRP from fragments generated by mass spectrometry and can be used alongside ‘dereplication’ tools that calculate the mass spectrometry signature of known NRPs to determine whether the newly isolated NRP has already been described.
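
To make the dereplication idea concrete, here is a toy sketch that computes the theoretical fragment masses of a cyclic peptide and picks the known peptide whose spectrum shares the most peaks with an observed one. It is a textbook-style simplification, not the published algorithms: it assumes only a handful of standard residues with integer masses, ignores the non-proteinogenic residues and modifications that real NRPs carry, and uses an invented list of "known" peptides.

```python
from collections import Counter

# Nominal (integer) residue masses in Da for a few standard amino acids.
RESIDUE_MASS = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "L": 113}


def cyclic_spectrum(peptide):
    """Masses of all contiguous sub-peptides of a cyclic peptide, plus 0 and the full mass."""
    masses = [RESIDUE_MASS[res] for res in peptide]
    n = len(masses)
    doubled = masses + masses  # lets fragments wrap around the ring
    spectrum = [0, sum(masses)]
    for start in range(n):
        for length in range(1, n):
            spectrum.append(sum(doubled[start:start + length]))
    return sorted(spectrum)


def shared_peaks(observed, theoretical):
    """Count peaks present in both spectra (multiset intersection)."""
    overlap = Counter(observed) & Counter(theoretical)
    return sum(overlap.values())


def dereplicate(observed, known_peptides):
    """Return the best-matching known peptide and its shared-peak count."""
    scores = {p: shared_peaks(observed, cyclic_spectrum(p)) for p in known_peptides}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Simulated measurement: the spectrum of cyclo-GAVL with its heaviest peak missing.
observed = cyclic_spectrum("GAVL")[:-1]
match = dereplicate(observed, ["GASP", "GAVL"])
```

A high shared-peak count against a known entry suggests the isolate has already been described; a low best score would flag it as a candidate for full structure determination.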

The study is published in full in the journal Nature Methods, and the web-based tools for sequencing NRPs are available (at no cost to researchers at non-profit organizations) at:

As Numerous as the Stars in the Sky

Maybe not quite as numerous as the stars in the sky, but chemists at the University of Berne have created a database of almost 1 billion drug-like molecules with 13 or fewer heavy atoms. Writing in the Journal of the American Chemical Society, Jean-Louis Reymond and Lorenz Blum describe a new searchable database, GDB-13, of molecules containing up to 13 atoms of C, N, O, S, and Cl.

The search for novel leads is one of the key challenges in drug discovery, and Reymond and Blum had previously developed the chemical universe database GDB-11, which describes 26.4 million structures containing 11 or fewer atoms of C, N, O, and F and satisfying simple chemical stability and synthetic feasibility rules. The limiting factor in computing GDB-11 was the elimination of unstable or chemically impossible molecules from the initial list, most of which contained multiple heteroatoms. To speed up computation of GDB-13, a very fast ‘element-ratio’ filter was used, with cut-off values of (N + O)/C < 1.0, N/C < 0.571, and O/C < 0.666. Fluorine was eliminated from GDB-13 since it was rarely found and had not proved attractive to the group when following up output from GDB-11. With these modifications, the algorithm was sufficiently fast to compute the database up to 13 heavy atoms, producing 910 million molecules in 40,000 CPU h. The molecular enumeration was dominated by monocyclic, bicyclic, and tricyclic molecules, most of which were heterocyclic – 54% of GDB-13 molecules have at least one three- or four-membered ring. Essentially all of the molecules are drug-like (Lipinski or Vieth criteria) and many are also lead-like or fragment-like. A chlorine/sulphur set (67.3 million compounds) that enumerates molecules with chlorine atoms as aromatic substituents and sulphur atoms in aromatic heterocycles, sulphones, sulphonamides, and thioureas was also generated. This set is felt to be of particular interest for virtual screening because of the distinct molecular shapes and functional groups that are possible with these larger atoms. Despite a large fraction of chemical space being excluded to accelerate computation, the authors believe that, with 977,468,314 entries, GDB-13 is the largest publicly available database of virtual molecules ever reported.
The database is available free of charge and should provide a rich source of inspiration for previously undescribed bioactive fragments. For those unable to resist, fluorine atoms can be added during optimisation.
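
The ‘element-ratio’ filter described above is simple enough to sketch directly from the quoted cut-offs. The function name and the choice to take atom counts as input are assumptions for this example, as is rejecting carbon-free formulas (the source does not say how those were handled); the actual GDB-13 enumeration code is not shown in the article.

```python
def passes_element_ratio_filter(n_carbon: int, n_nitrogen: int, n_oxygen: int) -> bool:
    """Apply the three cut-offs quoted for GDB-13 to a set of atom counts."""
    if n_carbon == 0:
        # Assumption: the ratios are undefined without carbon, so reject.
        return False
    return (
        (n_nitrogen + n_oxygen) / n_carbon < 1.0
        and n_nitrogen / n_carbon < 0.571
        and n_oxygen / n_carbon < 0.666
    )


# Hypothetical candidate formulas as (C, N, O) counts.
candidates = {
    "C10 N2 O1": (10, 2, 1),  # all ratios well below the cut-offs
    "C4 N3": (4, 3, 0),       # N/C = 0.75, too nitrogen-rich
    "C3 N1 O2": (3, 1, 2),    # (N + O)/C = 1.0, not strictly below 1.0
    "C6 O4": (6, 0, 4),       # O/C is about 0.667, just over 0.666
}
kept = [name for name, counts in candidates.items()
        if passes_element_ratio_filter(*counts)]
```

Because the test needs only three atom counts, it can discard heteroatom-heavy candidates before any more expensive stability or feasibility checks are run, which is consistent with the speed-up the authors describe.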