Pharma R&D Today
Ideas and Insight supporting all stages of Drug Discovery & Development
Pharma can now track the most relevant patent info – fast and at scale
Posted on April 26th, 2022 by Ann-Marie Roche in AI & Data
Since its initial release in January 2021, AI-fueled Patent Expansion in Reaxys has won the respect of pharmaceutical companies trying to stay ahead of their competitors by keeping track of the latest patents around drug discovery and new therapeutics. With a recent expansion that allows Reaxys to also loop in information from images found in patents, it’s a good time to look back – and forward – at this continual work-in-progress. We spoke to Umesh Nandal, Elsevier’s Director of Data Science, about the logic behind the tech: “It’s not just about the speed of access to all this annotated data, but the quality.”
Addressing a need: Competitive Intelligence + Novelty Search
When the Patent Expansion project began over two years ago, there was a real demand for increasing the patent coverage in Reaxys. Pharmaceutical companies are spending too much time and money on projects that will later prove to be unpatentable. With the millions of new patents coming out every year, how can they possibly keep track?
By using the latest NLP and NER technologies, patents – along with the millions of biological targets and substance names they include – can be indexed and annotated in Reaxys at unprecedented speed, scale and quality.
Leveraging Elsevier’s key strengths: Data Science + Domain Knowledge
The development and evaluation of Patent Expansion really plays on Elsevier’s strengths since it not only applies the latest AI technologies, but also leverages the company’s extensive chemistry curation expertise. As a result, pharmaceutical companies get up-to-the-moment insights on the targets and substances most relevant to their work. Reaxys is now also able to glean chemical information from the images found in patents – making the pipeline and its models even stronger and more reliable.
Umesh, as well as being a director of data science, holds a Master’s in chemistry. He led the cross-functional team of AI, chemistry and technical experts that developed the patent pipeline. We talked to him about the technology behind the expansion – in terms of both the data science and its operationalisation (you can read more about how Umesh successfully mixes data and domain expertise here).
Below Umesh talks about the technology behind the project in more detail.
What makes the Reaxys Patent Expansion unique in enhancing the chemist workflow?
It is unique in several ways. It’s the first time that a deep learning model has been deployed to extract chemical information at this scale and at enterprise level. Before, people only used rule-based approaches. The team really set out to create a whole end-to-end pipeline, where entities are extracted, chemical names are normalized and linked to the correct chemical structures, and the results are then evaluated for relevancy. And we proved we can productionize this cutting-edge data science at an enterprise level. Most remarkably, perhaps, while millions of patents are already flowing through, we are maintaining the same quality!
How has the customer response been?
It was very nice to hear all the positive feedback after it went live. Many customers are now reaching out to collaborate on more long-term partnerships. In fact, we are talking to major pharmaceutical companies who were considering other companies specialised in data science. Now they are reaching out to us because we also bring our chemistry domain expertise.
What are some of the challenges when indexing patents?
Besides the sheer number of them, patents are curious things in themselves. First, they are very long documents – they can run to around 800 pages. Second, there’s the language issue. Not only are they written in legal language – which is very different from general language or the language of the scientific literature – but they also exist in different languages. Third, the information is often hidden – frequently intentionally, because the authors don’t want potential competitors to spot the novelty of the patent too easily. And fourth, there’s the complexity of the chemistry itself, because the names can be very long. If a name has an extra comma or dash added by mistake, the structure can be completely different. And as we know, chemistry is all about structures.
Can you break down the data science behind the pipeline to deal with these challenges?
Patent Expansion has three components: 1) the speedy bibliographic indexing of the millions of patents coming out every year, 2) the indexing of biological targets to facilitate finding relevant patents when customers search for target names or acronyms, and 3) the indexing of substances’ names and images from a patent document.
We begin extracting the entities by putting patents into a structured format, which is then read and the entities identified. We do this largely based on tech from our sister company LexisNexis, who have a lot of IP and patent experience. Then the main challenge is to normalize these entities. For example, aspirin is mentioned in patents in various forms but we need to link these all together under a single concept.
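The normalization step described above can be sketched very simply: variant surface forms of the same compound are mapped to one canonical concept. The synonym table and concept IDs below are invented for illustration, not the actual Reaxys data.

```python
# Hypothetical sketch of chemical-name normalization: different surface
# forms of the same compound found in patents are linked to one concept ID.
SYNONYMS = {
    "aspirin": "CHEM:0001",
    "acetylsalicylic acid": "CHEM:0001",
    "2-acetoxybenzoic acid": "CHEM:0001",
    "ibuprofen": "CHEM:0002",
}

def normalize(entity: str):
    """Return the canonical concept ID for an extracted entity, if known."""
    return SYNONYMS.get(entity.strip().lower())
```

With this mapping, “Aspirin” and “acetylsalicylic acid” both resolve to the same concept, so a search for either surfaces the same patents.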
And then we have to figure out what’s actually relevant – and it’s here we use machine learning models. Relevancy is trickier, because what counts as relevant varies from customer to customer. So, we spoke with customers to validate what was most relevant for them. We then created reference sets to evaluate the models, and assign scores based on, for example, where the information appears in a patent – a mention in the title, abstract or claims section scores higher than one in the body, for example. And that’s what we also use in Reaxys to filter out false positives.
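The section-based scoring Umesh describes can be sketched as weighted mention counts with a relevancy threshold. The weights and threshold below are illustrative assumptions, not the values actually used in Reaxys.

```python
# Hypothetical section weights: mentions in the title, abstract or claims
# count more toward relevancy than mentions in the body of the patent.
SECTION_WEIGHTS = {"title": 3.0, "abstract": 2.5, "claims": 2.0, "description": 1.0}

def relevancy_score(mentions):
    """mentions: list of (section, count) pairs for one entity in one patent."""
    return sum(SECTION_WEIGHTS.get(section, 0.5) * count
               for section, count in mentions)

def is_relevant(mentions, threshold=3.0):
    """Filter out likely false positives that score below the threshold."""
    return relevancy_score(mentions) >= threshold
```

An entity mentioned once in the title passes the threshold on its own, while a single mention buried in the description does not.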
What were the biggest challenges when it came to the actual operationalization of the pipeline – taking that data science ‘brain’ you created and producing the ‘body’ by which it interacts with customers?
The biggest challenge was for targets – because it was the first deliverable. The team was not only relatively new but also very diverse. We had already brought the data scientists and the medicinal chemistry experts together, and now we also had to loop in the developers – the technical people. Naturally, we had many discussions with them about the architecture. At the same time, we had to establish standard processes and templates for writing the code, because we were not building it for just one ad hoc project. We wanted to reuse this pipeline for subsequent phases and other projects. And indeed, while the data science behind the substance extraction was more challenging, the pipeline was already there, so in a sense it was easier.
How important was establishing this collaboration between all these specialists – the data experts, the content experts, the technical experts?
It was of paramount importance. Only through this collaboration across cross-functional teams, and having everyone understand who was bringing what to the table, were we able to productionize our data science capabilities at scale, and with the robustness and reliability we had envisioned.
And now the team has made it possible to extract substance information from images!
It’s essential to combine the information coming from both text and images because sometimes a compound is mentioned in an image but not in the text. But we split this part of the project up into two phases because images required a whole different type of data science. It’s very challenging to deploy deep learning models in a production environment due to the large size of trained models. And it only becomes more challenging when these models are part of a complicated pipeline where quality is the key. But the team did a great job in deploying the image classification model online with a fast prediction response time – without the pipeline breaking. And that’s the beauty of this pipeline, since it’s built up out of microservices. It’s not sequential but rather modular. So, we can switch things in and out very easily.
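The modular, microservice-style design can be pictured as a list of interchangeable stages sharing one interface, so a new stage – such as the image-extraction model – can be switched in without touching the rest of the pipeline. The stage names and document structure below are hypothetical.

```python
# Hypothetical sketch of a modular extraction pipeline: each stage takes a
# document dict and returns it enriched; stages can be swapped in and out.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def extract_text_entities(doc: Dict) -> Dict:
    # Pretend text-based NER: pull hits already found in the patent text.
    doc.setdefault("entities", []).extend(doc.get("text_hits", []))
    return doc

def extract_image_entities(doc: Dict) -> Dict:
    # Pretend image model: pull compounds recognized in patent images.
    doc.setdefault("entities", []).extend(doc.get("image_hits", []))
    return doc

def run_pipeline(doc: Dict, stages: List[Stage]) -> Dict:
    for stage in stages:
        doc = stage(doc)
    return doc

# Adding image extraction is just a change to the stage list:
text_only = [extract_text_entities]
with_images = [extract_text_entities, extract_image_entities]
```

Because the stages are independent, a heavier deep learning model can replace one stage without the pipeline breaking – the property Umesh highlights.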
And presumably, in the name of continual improvement and maintaining quality and increasing impact, the patent pipeline will continue to be developed and evaluated…
Absolutely. We need to regularly check if our models are healthy and deliver the quality Elsevier is known for – updating it with new patents and retraining whenever necessary. Actually, there are two types of quality. There is the technical quality – making sure the pipeline doesn’t break, especially whenever we add more components and heavier models. And there’s the content quality, which is not only monitored by our domain experts but also verified with our customers. And in order to keep it up to date with their expectations, we are creating these gold sets, or reference sets, to monitor the data drift.
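Monitoring content quality against a gold (reference) set, as described above, can be sketched as comparing the pipeline’s extracted entities with expert-curated annotations and flagging a run whose F1 score falls below an agreed floor. The floor value here is an illustrative assumption.

```python
# Hypothetical quality check against a gold set of expert annotations.
def precision_recall_f1(predicted: set, gold: set):
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def quality_alert(predicted: set, gold: set, floor: float = 0.9) -> bool:
    """True when extraction quality has drifted below the acceptable floor."""
    return precision_recall_f1(predicted, gold)[2] < floor
```

Re-running this check on fresh patents periodically is one simple way to catch data drift before it reaches customers.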
And since this pipeline is so modular, will you be able to apply it to other aspects besides patents?
Yes! We will actually be reusing the same pipeline for the journals work we’re doing over 2022 – so information from journals can be accessed within days as well. And there are certainly other opportunities being discussed, for example applying the patents pipeline to other use cases in chemistry such as polymers. The sky really is the limit.