Pharma R&D Today
Ideas and Insight supporting all stages of Drug Discovery & Development
Interview with Greg Landrum at Elsevier: What are the ingredients of a successful Open Source cheminformatics software?
Posted on November 30th, 2020 by Elena Herzog in Chemistry
(Written by Elena Herzog in collaboration with Markus Fischer, Gerd Blanke, Jarek Tomczac and Gabrielle Whittick)
RDKit, a collection of cheminformatics and machine learning software, is helping to solve chemical information challenges. The founder and creator of RDKit, Greg Landrum, was interviewed by the UDM (Unified Data Model) team, facilitated by Elsevier, to share his experience of what the road to success looks like and what ingredients an open source project needs in order to succeed. The lessons from the interview will help to shape the future of the UDM project, which is moving from its consortium-led Pistoia Alliance model to a community-led model.
How did it all begin?
Greg is a chemist. After his postdoc in Germany, he moved to California and joined a couple of start-ups. Eventually he started a small computational chemistry start-up providing consulting and machine learning services. Open Source in chemistry was limited back in 2000, and the absence of good alternatives sparked the creation of the RDKit. The open source oelib (which eventually became OpenBabel) did not have a licence they could use, and attempts to license the commercial Daylight toolkit were unsuccessful. So they started writing code themselves, adding new pieces little by little. The company was eventually shut down in 2006 and, rather than seeking a purchaser for the technology, they decided to open source the code. Greg joined the CADD group at Novartis in Basel and was able to set up a process allowing him to continue working on the open-source RDKit while at a large pharma company. In 2011, the development ramped up even more when he moved to the Research. Extensions were either funded internally or written by external programmers whom Novartis paid to work on RDKit. “Working with the other scientists at Novartis really helped inform the direction we took with the RDKit,” said Greg. In 2016, Greg left Novartis for KNIME, the company behind the OS data analysis platform, and at the same time started a small consultancy company, T5 Informatics, which provides custom development services around RDKit. The combination of RDKit as OS software and T5 Informatics has allowed Greg to do what he enjoys most: spending his time developing and extending functionality together with people who share similar interests.
What does the RDKit community look like?
“The heart of any successful open source project is its community,” says Greg. Insights into that community are not easy to get, though; that is simply the way the OS project is run. Nobody asks anybody who they are or where they come from. Some insight comes from the RDKit UGMs (User Group Meetings). The most recent UGM, held virtually in October 2020 because of Covid-19, registered more than 500 participants, the highest number in the nine-year history of the RDKit UGM. Registrants who replied to a Google survey came from industry (52%), academia (40%), and government, laboratories and non-profit research organisations (8%). Of the industry respondents, 70% were from pharma and biotech and 20% from software, which is hardly surprising given the features RDKit provides. The UGMs are heavily focused on Europe, but there are large numbers of users in the US, Japan and China. A Japanese UGM was planned for this year, but it was cancelled because of the Covid-19 situation.
How do people contribute to RDKit and why do they contribute?
Greg defines contribution in its broad sense, for instance:
- Code, for sure
- High-quality bug reports are considered very valuable
- Good documentation is very valuable and incredibly helpful
- Participation in answering questions, commenting and discussing issues
The rdkit-discuss mailing list is the primary communication method for the RDKit community; people also use it as a Q&A platform. It is hard to determine why people decide to answer emails. If a question is about a specific feature, developers often answer it, but again, there is no real mechanism to make people contribute unless they want to. From time to time, some “wrong” answers show up, but proficiency and comfort come with experience. The majority of users have a problem to solve and are looking for people who might be working on a similar problem. Some people may feel an obligation: ‘I am using it, why should I not contribute?’ For some, it is about recognition; active people are recognized in the community. It also seems that if code is attached to a publication, researchers are more inclined to use it. This increases citations, which matters to both the publication and the author. Greg believes that there are data supporting this, but he was not 100% sure. Another “selfish” motive for contributing to OS projects is being able to carry on working on the code in the future, even after leaving or changing employers. Whatever the reasons might be, the important thing is that the RDKit community is friendly and open; people feel good about the project, and all of this surely helps with adoption.
How do companies contribute to RDKit?
Many companies have contributed to the development and extension of RDKit by either funding developers internally or hiring external developers. Companies that participate have an easy way to attract people with RDKit expertise. For instance, many students work on OS software, so employers can see exactly what developers do and how they do it. Examples of companies using the RDKit internally and contributing to it include Schrödinger, Cresset, Novartis, Roche, Medchemica, Relay Therapeutics and NextMove Software. Many other companies are using RDKit. For example, Elsevier is providing and supporting it on Entellect’s Reaction Workbench, PerkinElmer is using it in Spotfire, and one can use chemistry extensions based on the RDKit in Mathematica. Google runs “Summer of Code,” which includes projects that improve and contribute to RDKit tools. These important use cases increase adoption and acceptance of RDKit.
What are the benefits for companies of contributing code to RDKit?
This is a very important point and, in fact, there are many good reasons why companies choose to contribute code to RDKit.
- Testing and validation of code become easier as the pool of testers is theoretically unlimited
- If a company decides that a piece of code is not IP critical, the code can be supported by the community and somebody from the community might fix bugs
- Developers and cheminformaticians with RDKit expertise are known to the companies that follow and contribute to the development. These developers can be quickly mobilized to work on features that companies are interested in
- The UGMs circulate lists of open positions advertised by companies, and this year there was a channel in Discord to announce open positions. Companies can post openings on the mailing list or LinkedIn group. In addition, a conversation has started on how to fund developers on a contract basis, although there is currently no organization that can accept funding for RDKit
What governance structure does RDKit have and who decides on what?
The Python community refers to Guido van Rossum, the creator of the language, as “Benevolent dictator for life” (or BDFL). The RDKit currently follows more or less this model. There is not much of a governance structure; however, there are four core maintainers, and any contribution is reviewed by at least two of them. Theoretically, two developers must sign off and one of them should be Greg. He mentions that this may not be the best way in the long term, but it is how it is. There are not many decisions that they need to make; most of the decisions are tactical, and each maintainer decides what they want to work on. There is a broad list of topics they want to work on, some driven by long-term goals and some by companies’ requests. The other three maintainers are from Schrödinger, Novartis and Relay.
Under what licence does RDKit operate?
“OS licences are extremely important and contentious,” Greg points out. RDKit uses the BSD licence. The BSD licence is very permissive and allows commercial use; this is intentional. The code is covered by copyright. By default, copyrighted material cannot be re-used; the licence, however, allows use and redistribution of the code. At the top of each RDKit source file, there is a copyright statement listing the authors who contributed the code. At the bottom of each file, it states: all rights reserved, and covered by the licence. One can consult the licence to check what is allowed and what is not. For example, you cannot take the code, remove the copyright notices and republish it. The licence also includes a clause disclaiming liability. Greg recommends using standard licences for OSS, as many big companies are familiar with them and are therefore more willing to use the OS software. To be clear, companies can build on the RDKit code and commercialize it. Schrödinger and Cresset use RDKit in their computational chemistry code. RDKit is intended to be used in computational software, and companies do not need to communicate anything to Greg or the RDKit community. Moreover, there are filed patents that use RDKit. For example, as of October 2020, a Google patent search returned 168 results in which RDKit is used.
Are there any IP rights or copyrights when people contribute to RDKit?
This can be tricky in some cases. Some OSS projects want to cover everything under one copyright, so contributors must assign their copyright before the code is accepted. The RDKit does not do this. As RDKit is not an organization, it cannot ask people to assign the copyright to it. Contributors (and their employers) determine the copyright on the pieces of code they contribute. However, all contributions must be covered by the same BSD licence as the rest of the RDKit.
Does RDKit accept funding for specific projects?
Because RDKit does not have any legal organization, it cannot accept funding. There are consultants you can pay, but there is no central place to pay for development work. Contributing companies fund their own programmers or external programmers to work on RDKit development and extensions. For example, Novartis has done both: it has paid T5 Informatics and had internal developers contribute to RDKit. T5 Informatics, in turn, being a consulting company, could process funding for RDKit if needed. To the extent that Greg could focus on RDKit development, the RDKit has benefited from it. When asked about crowdsourcing, Greg mentioned a success story in which Andrew Dalke managed to raise funding for the development of MMPDB. However, it is questionable how successful future projects would be at raising money from interested individuals. The cheminformatics space is small, and the number of companies that would be interested in sponsoring RDKit development outside of commercial interests is limited. How to fund a range of interesting projects that are not urgent or visible enough is still occupying the creator’s mind.
How does Greg see the future of RDKit?
Greg feels it is a long game, and the hope is that the toolkit continues to evolve. Adoption and wider use in research IT organisations such as Elsevier and in pharma are extremely important and would have positive effects. In addition, more systematic integration of the software into internal workflows at commercial companies would increase adoption and expand the community.
Is there any value for RDKit to work closely together with UDM?
The UDM is primarily an exchange standard, not software; it is more of an open documentation project and less OSS, unless there is an idea to build software that does something around UDM. Open documentation projects might use different licences (for example, Creative Commons licences). It is difficult to say what the right model for UDM would be, but having it as an OSS project under the umbrella of a standards organization such as IUPAC is a good idea. If UDM is successful, a reader and a writer for the format could be handy; having the code and being able to do something with UDM files is valuable and useful and might speed up adoption.
The RDKit is used to process, harmonize, enhance and analyse chemical data. Demand has increased both for software that can help make data, for example, AI/ML ready and for chemists who have the skills and knowledge to carry out these tasks. Elsevier, with its high-quality chemical and biological data, often processes these data for various modelling projects, such as AI/ML-based synthesis and pharmacological modelling predictions. As such, it is well positioned to support OS projects and chemical standards, not least because its customers are increasingly using and embedding these tools and standards in their ecosystems. The interview with Greg Landrum confirms Elsevier’s interest in working together and helping researchers and healthcare professionals advance science and improve health outcomes for the benefit of society. We are thankful to Greg Landrum for sharing information with Pistoia’s UDM team and Elsevier on how the RDKit works and what contributed to its success. The shared information is already informing the next steps of the UDM Project’s transition. Finally, the knowledge gathered from this interview might help commercial companies and research organisations to build and maintain future relationships with various types of Open Source and Open Documentation Projects.
Senior Manager in Innovation and Partner Development at Elsevier