Pharma R&D Today
Ideas and Insight supporting all stages of Drug Discovery & Development
Predictive Analytics for Drug Discovery
Posted on September 28th, 2016 by Matthew Clark in Pharma R&D
The drive to predict molecular properties, and thereby reduce lab testing and focus research projects, started before the invention of the computer; a notable early advance was the Hammett equation in the 1930s. The next great step forward came in 1962 from Corwin Hansch, who was one of the first to use computers to perform the calculations. Since then, both computing and analytic technologies have made enormous leaps. However, one element has remained the same since before 1930: the need for data to create predictive models.
An explosion of measurements has occurred in the past 20 years, fueled by technologies that make bioassays easier, particularly high-throughput screening. However, the data reported by the scientific community has been locked in thousands of tables in journal articles and patents. The Elsevier Reaxys Medicinal Chemistry (RMC) product extracts the numeric data, along with the target, the assay type, and other information, making it a powerful resource for building predictive models of protein-ligand binding.
The RMC data includes millions of data points for thousands of targets, encompassing hundreds of assays. To build predictive models from this data, we used the open-source KNIME toolset and the well-tested R statistics system as a framework to gather the information from RMC, normalize it, and apply sophisticated predictive modeling techniques. The process includes model validation: data withheld from model building is used to test each model's predictive ability, and this test set provides a measure of the expected prediction error for each model. Figure 1 shows an example of predicted vs. actual data for compounds binding to the protein EGFR (UniProt P00533).
Figure 1 Predictive Model for EGFR.
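The normalize-then-validate workflow described above can be illustrated with a minimal Python sketch. It is not the KNIME/R pipeline used for the post: the data is synthetic, the single-descriptor least-squares model is a deliberately simple stand-in for the more sophisticated techniques mentioned, and the function names (`to_p_activity`, `holdout_validate`) are hypothetical. It also assumes that normalization means converting concentrations reported in mixed units to a common -log10(mol/L) scale, which the post does not spell out.

```python
import math
import random

# Conversion factors from common concentration units to mol/L.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_p_activity(value, unit):
    """Normalize a concentration-based activity (e.g. IC50) to -log10(mol/L)."""
    return -math.log10(value * UNIT_TO_MOLAR[unit])


def fit_line(xs, ys):
    """Ordinary least squares for one descriptor: y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(predicted, actual):
    """Root-mean-square error between predicted and measured values."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def holdout_validate(records, test_fraction=0.25, seed=0):
    """Fit on a training subset; report RMSE on records never used for fitting."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
    predictions = [slope * x + intercept for x, _ in test]
    return (slope, intercept), rmse(predictions, [y for _, y in test])


# Synthetic, made-up records: (descriptor value, IC50 value, unit). Not RMC data.
rng = random.Random(42)
raw = []
for _ in range(40):
    descriptor = rng.uniform(0.0, 5.0)
    true_p = 5.0 + 0.6 * descriptor + rng.gauss(0.0, 0.3)
    raw.append((descriptor, 10 ** (9 - true_p), "nM"))  # stored in nM

records = [(d, to_p_activity(v, u)) for d, v, u in raw]
model, test_rmse = holdout_validate(records)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test RMSE={test_rmse:.2f}")
```

The important design point is that the test records play no part in fitting, so the reported RMSE estimates the error to expect on compounds the model has not seen, which is what a predicted vs. actual plot such as Figure 1 visualizes.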
Extending this concept further, we can build a large number of predictive models to form an entire simulated screening panel, as we did for a set of diverse kinases, shown in Figure 2. This allows prediction not only of a compound's activity, but also of its selectivity for a particular kinase or set of kinases.
Figure 2 Predictive Model for a Panel of Kinases.
Black – very likely to bind < 1 µM, red – likely to bind < 1 µM, yellow – possibly binding < 1 µM.
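One way a panel like this could be assembled is sketched below in Python. The post does not say how the "very likely", "likely" and "possibly" categories were derived, so the rule used here is an assumption: each prediction's margin over the 1 µM threshold is graded in units of that model's expected test-set error. The kinase names, predicted values, error values and cut-offs are all hypothetical.

```python
THRESHOLD_P = 6.0  # 1 µM expressed as -log10(mol/L)


def binding_category(predicted_p, expected_error):
    """Grade how confidently a prediction clears the 1 µM threshold.

    The margin is measured in units of the model's expected error.
    The cut-offs (2, 1 and 0 error units) are illustrative assumptions.
    """
    if expected_error <= 0:
        raise ValueError("expected_error must be positive")
    margin = (predicted_p - THRESHOLD_P) / expected_error
    if margin >= 2.0:
        return "very likely"  # black in the Figure 2 legend
    if margin >= 1.0:
        return "likely"       # red
    if margin >= 0.0:
        return "possibly"     # yellow
    return "unlikely"


def screen_panel(predictions, expected_errors):
    """Return {target: category} for one compound across a panel of models."""
    return {
        target: binding_category(value, expected_errors[target])
        for target, value in predictions.items()
    }


# Hypothetical predicted activities for one compound, one model per kinase.
predictions = {"kinase_A": 7.5, "kinase_B": 6.6, "kinase_C": 6.1, "kinase_D": 5.0}
# Hypothetical expected errors taken from each model's held-out test set.
expected_errors = {"kinase_A": 0.5, "kinase_B": 0.5, "kinase_C": 0.5, "kinase_D": 0.5}

panel = screen_panel(predictions, expected_errors)
for target, category in panel.items():
    print(f"{target}: {category}")

# A crude selectivity read-out: targets predicted to bind with any confidence.
predicted_hits = [t for t, c in panel.items() if c != "unlikely"]
print("predicted hits:", predicted_hits)
```

Reading across one compound's row in this way gives the selectivity picture described above: a compound graded "very likely" for one kinase and "unlikely" for the rest is predicted to be selective.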
As one of its next steps, the R&D Life Science Solutions team is investigating the use of deep-learning systems to analyze the complete set of bioactivities, structures, and known toxicities of the compounds, in order to relate specific activities to toxicities observed in animals and humans. This would allow identification of simple in vitro screens that may serve as markers to help predict in vivo toxicities.
C. Hansch, P. P. Maloney, T. Fujita and R. M. Muir, Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients, Nature 1962, 194, 178-180; C. Hansch, R. M. Muir, T. Fujita, P. P. Maloney, C. F. Geiger and M. Streich, The Correlation of Biological Activity of Plant Growth-Regulators and Chloromycetin Derivatives with Hammett Constants and Partition Coefficients, J. Am. Chem. Soc. 1963, 85, 2817-2824.
All opinions shared in this post are the author’s own.