Navigating the Virus Regulation Pathway through Text Mining and Knowledge Graph

Posted on May 20th, 2020 in COVID-19

A computer virus exploits the operating system of a computer to replicate (copy itself) and send copies of itself to other computers in the network. In the same manner, a human virus manipulates the cell’s “operating system,” managed by cellular proteins, to replicate and infect other cells in the body. Specifically, the virus forces the cell to terminate its ongoing operations and start making copies of its viral particles by controlling the expression and behavior of pre-existing proteins in the cell.

SARS-CoV-2, the coronavirus responsible for COVID-19, is very new. Yet the MERS and SARS epidemics generated substantial scientific knowledge about coronaviruses, which can help us understand the biological mechanisms by which SARS-CoV-2 controls human cells.

Text mining for knowledge extraction

With this in mind, we used Elsevier’s text mining tool to identify human proteins that scientists have shown to be increased or decreased by coronaviruses after infection. Similarly, we searched for cellular processes that are increased or decreased by chemical compounds, such as FDA-approved drugs, and that could therefore counteract the control coronaviruses exert over cellular processes during infection. This is equivalent to scanning the scientific literature for computer processes that are turned on (increased) or off (decreased) by computer viruses and antivirus programs.

We gathered the information and stored it in a database that records the links between the studied drugs, proteins and diseases (MERS and SARS). To allow continuous access to up-to-date information, such as drug side effects, we ensured that the drugs, proteins and diseases in our database are linked to external identifiers in reviewed databases such as HGNC and UniProt. We also preserved the provenance of the extracted information by linking it to the original PubMed identifiers.
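To make the shape of such a database concrete, here is a minimal Python sketch of one way the links could be represented and queried. It is an illustration under stated assumptions, not the schema of the published dataset: the class names, the "increases"/"decreases" labels, the helper opposing_drugs, and every entity name and identifier below are hypothetical placeholders. It also simplifies by treating every regulated target as a protein.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str         # "drug", "protein", or "disease"
    external_id: str  # link to an external database (placeholder values here)


@dataclass(frozen=True)
class Regulation:
    source: Entity     # the drug or disease doing the regulating
    target: Entity     # the protein being regulated
    effect: str        # "increases" or "decreases"
    pubmed_ids: tuple  # provenance: where the statement was extracted from


def opposing_drugs(edges, disease_name):
    """Return sorted (drug, protein) pairs where the drug's effect on a
    protein is the opposite of the named disease's effect on it."""
    disease_effect = {
        e.target.name: e.effect
        for e in edges
        if e.source.kind == "disease" and e.source.name == disease_name
    }
    pairs = []
    for e in edges:
        if e.source.kind != "drug":
            continue
        virus_effect = disease_effect.get(e.target.name)
        if virus_effect is not None and virus_effect != e.effect:
            pairs.append((e.source.name, e.target.name))
    return sorted(pairs)


# Made-up placeholder records, NOT taken from the real dataset.
sars = Entity("SARS", "disease", "DISEASE:PLACEHOLDER")
protein_a = Entity("PROTEIN_A", "protein", "HGNC:PLACEHOLDER_A")
protein_b = Entity("PROTEIN_B", "protein", "HGNC:PLACEHOLDER_B")
drug_x = Entity("DRUG_X", "drug", "DRUG:PLACEHOLDER_X")

edges = [
    Regulation(sars, protein_a, "increases", ("PMID:PLACEHOLDER_1",)),
    Regulation(sars, protein_b, "decreases", ("PMID:PLACEHOLDER_2",)),
    Regulation(drug_x, protein_a, "decreases", ("PMID:PLACEHOLDER_3",)),
    Regulation(drug_x, protein_b, "decreases", ("PMID:PLACEHOLDER_4",)),
]

print(opposing_drugs(edges, "SARS"))  # [('DRUG_X', 'PROTEIN_A')]
```

Keeping the PubMed identifiers on each edge, rather than on the entities, means every individual claim can be traced back to the paper it came from.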

Knowledge graph for data explorers

At this point, we decided to use a graph database to explore the data, which now looks like a directed graph. The dataset is freely accessible for download and exploration on Mendeley. Our partner Neo4j has also loaded the data into a hosted Neo4j instance, which can be accessed using the following credentials:

username: elsevier

password: 3153v13ruser

It helps to familiarize yourself with Cypher, the Neo4j query language, before exploring this data. Here is a useful reference card that you can use.
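As a rough idea of what a Cypher query over this kind of graph might look like, the Python sketch below builds a parameterized query string. The node labels (Disease, Protein), the relationship type (REGULATES) and the property names (name, effect, pmid) are assumptions made for illustration only; the hosted instance may use a different schema, so check the actual labels first, for example by running CALL db.schema.visualization() in the Neo4j browser.

```python
def build_regulation_query(disease, effect, limit=25):
    """Build a parameterized Cypher query listing proteins that a disease
    increases or decreases. All labels and property names are hypothetical."""
    if effect not in ("increases", "decreases"):
        raise ValueError("effect must be 'increases' or 'decreases'")
    query = (
        "MATCH (d:Disease {name: $disease})-[r:REGULATES]->(p:Protein)\n"
        "WHERE r.effect = $effect\n"
        "RETURN p.name AS protein, r.pmid AS pmid\n"
        "LIMIT $limit"
    )
    # Values travel as parameters instead of being pasted into the query text.
    params = {"disease": disease, "effect": effect, "limit": limit}
    return query, params


query, params = build_regulation_query("SARS", "increases", limit=10)
print(query)
print(params)
```

The query text can be pasted into the Neo4j browser with the parameters filled in by hand, or the query and params pair can be handed to a Neo4j client library. Passing values as parameters keeps the query text fixed and avoids quoting problems.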

Coronavirus Research Hub for data scientists

Recognizing the need to gather and explore data in support of COVID-19 drug and vaccine development, Elsevier has launched the Coronavirus Research Hub with a particular focus on data science. In addition to the original dataset behind this virus regulation pathway, the portal includes access to COVID-19 related full-text articles in ScienceDirect, more than 14 million cross-publisher full-text articles, and the CORD-19 Dataset. Join the research hub today!

