Pharma R&D Today

Ideas and Insight supporting all stages of Drug Discovery & Development

Mutually empowering – semantic-based machine learning and subject matter expertise

Posted on May 7th, 2021 in AI & Data

In a day dedicated to emerging science and technologies at the Pistoia Alliance virtual conference Collaborative R&D in Action, SciBite CTO James Malone opened the program with a compelling exploration of use cases for semantic-based machine learning (ML). A simple but elegant ML strategy based on “seeding” named entity recognition (NER) can facilitate ontology creation, drive language translation, mine insights from social media platforms, and speed up question answering. His most important takeaway: semantic-based ML and subject matter expertise (SME) are mutually empowering.

NER learns a new domain. Or language.

In this strategy, “seed” terms are passed to an ML model, which then identifies candidate terms in ingested text that are similar to the seeds. That similarity may be based on position relative to the seed term, word root, or other features. The resulting cluster of terms is a bit noisy and requires review, but through an iterative process of term seeding, candidate generation, and candidate review and pruning, those clusters grow into meaningful categories. Scaling up the process by training a transformer model, Malone and his team were able to construct a 6,000-term ontology for genetic variation. Essential to this incremental machine learning is subject matter expertise to review and improve the term clusters at each iteration. With both elements – ML and SME – models built on this strategy are highly flexible in application. For example, the strategy turned out to be very good at translating Japanese to English, but not without the supervision of native speakers.
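
To picture the loop, here is a minimal sketch – a toy example, not SciBite’s implementation. It bootstraps a term cluster from two seeds using cosine similarity over hand-made vectors; in practice the embeddings would come from a trained transformer model, and the “review” step would be a real expert, not a hard-coded filter.

```python
import numpy as np

# Toy embeddings standing in for vectors from a trained model
# (hand-made for illustration; real vectors would come from a transformer).
EMBEDDINGS = {
    "insertion":   np.array([0.90, 0.10, 0.00]),
    "deletion":    np.array([0.85, 0.15, 0.05]),
    "duplication": np.array([0.80, 0.20, 0.10]),
    "missense":    np.array([0.88, 0.12, 0.02]),
    "guitar":      np.array([0.00, 0.90, 0.40]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def expand_seeds(seeds, vocab, threshold=0.95):
    """One bootstrapping round: propose vocabulary terms similar to any seed."""
    candidates = set()
    for term, vec in vocab.items():
        if term not in seeds and any(
            cosine(vec, vocab[s]) >= threshold for s in seeds
        ):
            candidates.add(term)
    return candidates

seeds = {"insertion", "deletion"}
for round_no in range(3):
    candidates = expand_seeds(seeds, EMBEDDINGS)
    # Human-in-the-loop step: an expert accepts or rejects each candidate.
    accepted = {t for t in candidates if t != "guitar"}  # stand-in for SME review
    if not accepted:
        break
    seeds |= accepted
    print(f"round {round_no}: accepted {sorted(accepted)}")
```

Each pass grows the seed set, so later rounds can reach terms that were not similar to the original seeds but are similar to accepted ones – which is exactly why expert pruning at every iteration matters.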

Open-minded NER for real-world language

Another example comes from models trained on ontology-annotated data to extract insights from broad sources of real-world evidence, such as Facebook, Reddit, Twitter, or patient network forums. Analyses of these sources often stumble over the mismatch between standardized scientific terminology and the looser language used by the general public. Because the models have been trained to recognize how genes or diseases (for instance) appear within a sentence, they can infer that a phrase looks like a gene or a disease from the surrounding language. Posts can therefore be scanned for drug names, and sentences that appear to describe an adverse event can be identified even when the phrase is not in an ontology – “could not sleep,” for example, rather than the specific term “insomnia.” This more flexible approach to NER may open new opportunities to mine these far broader, content-rich sources. However, the necessarily looser semantics call for SME to validate outcomes.
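
A minimal sketch of the idea follows, using a regular expression as a crude stand-in for a trained model: a complaint phrase near a known drug name is flagged as a candidate adverse event even though the phrase appears in no ontology. The drug list, pattern, and post are all invented for illustration.

```python
import re

DRUG_NAMES = {"zolpidem", "metformin"}  # tiny stand-in dictionary
AE_ONTOLOGY = {"insomnia", "nausea"}    # standardized terms only

# Contextual cue: a complaint following "since taking <drug>" is treated as
# a candidate adverse event, whether or not it matches an ontology term.
PATTERN = re.compile(
    r"since taking (\w+)[, ]+(?:i )?(.+?)(?:[.!?]|$)", re.IGNORECASE
)

post = "Since taking zolpidem, I could not sleep at all."
for match in PATTERN.finditer(post):
    drug, complaint = match.group(1).lower(), match.group(2).strip()
    if drug in DRUG_NAMES:
        print(f"drug={drug!r}, candidate AE={complaint!r}, "
              f"in ontology: {complaint in AE_ONTOLOGY}")
```

Here “could not sleep at all” surfaces as a candidate despite matching nothing in the ontology – the kind of hit a real model infers from sentence context, and the kind an expert then has to validate.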

Know your data. Know your semantics.

Beyond term extraction, such “forgiving” semantic-based ML may support Bidirectional Encoder Representations from Transformers (BERT) when a question has multiple answers hidden in a very large body of text, or when the answers conflict. ML-supported NER can narrow the search to the paragraphs in which answers are most likely to be found, streamlining real-time processing.
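
One way to picture this narrow-then-read pattern is the sketch below. It assumes the Hugging Face transformers library and is not SciBite’s pipeline: a lightweight entity check filters the paragraphs, and only the survivors reach the (comparatively slow) extractive question-answering model.

```python
from transformers import pipeline  # assumes Hugging Face transformers is installed

# Invented example paragraphs; only some mention the entity of interest.
paragraphs = [
    "BRCA1 variants were associated with increased risk in the cohort.",
    "The venue offered excellent catering throughout the conference.",
    "No BRCA1 carriers discontinued treatment due to adverse events.",
]
entities = {"BRCA1"}

# NER-style pre-filter: the reader model never sees irrelevant text.
candidates = [p for p in paragraphs if any(e in p for e in entities)]

qa = pipeline("question-answering")  # downloads a default extractive QA model
for p in candidates:
    result = qa(question="Which gene was associated with increased risk?",
                context=p)
    print(f"{result['answer']!r} (score {result['score']:.2f}) from: {p}")
```

Because the reader runs per paragraph, pruning two-thirds of the text up front cuts inference time by roughly the same fraction – the point of using NER as a filter.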

Regardless of application, however, it remains essential to understand the problem you are tackling, know the data you use to solve it, and apply SME to judge whether your output is correct and useful. Malone described a foray into Wikipedia that underscored the importance of that foresight even when working with structured content. Consider Wikipedia’s topical hierarchy, which makes “hearing” a sub-category of “perception” but then positions “Peruvian folk music” under “hearing.” That’s likely surprising to humans, but is it to machines?
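
A toy version of that Wikipedia example (the two category links are taken from the talk; the code is illustrative) shows why the machine raises no eyebrow: to a naive traversal, the odd ancestry is just another path.

```python
# Toy category graph mirroring the Wikipedia example from the talk.
HIERARCHY = {
    "hearing": "perception",
    "Peruvian folk music": "hearing",
}

def ancestors(term):
    """Walk parent links upward – a machine sees nothing odd along the way."""
    while term in HIERARCHY:
        term = HIERARCHY[term]
        yield term

print(list(ancestors("Peruvian folk music")))  # ['hearing', 'perception']
```

Nothing in the traversal signals that “Peruvian folk music is a kind of perception” is absurd; only subject matter expertise catches it.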

Discover more about SciBiteAI
