One of the great things about the Data in Biotech podcast is that we get to look in depth at so many aspects of data science impacting the biotech and pharma space. This episode looked at knowledge graphs: what they are, why they are useful, and how LLMs, AI, and machine learning are making them an increasingly valuable asset to biotech organizations. Cody Schiffer gave his expert insight into the topic, with real-world examples of this approach to data in action, its benefits, limitations, and future.
Guest Profile
Formally trained as a biologist and biomedical engineer, Cody Schiffer is currently Associate Director of Machine Learning at Sumitomo Pharma America Inc (SMPA). He is responsible for expanding SMPA's computational capabilities for drug discovery, development, and commercialization by developing natural language processing (NLP) powered applications and knowledge graph-based analytics.
The Highlights
In this podcast, Cody gave us a detailed insight into his work around knowledge graphs in the biopharma space. If you’re new to knowledge graphs, there is a quick explainer below; otherwise, skip ahead for the highlights.
What is a Knowledge Graph?
A knowledge graph represents the semantic relationships between entities, illustrating the connections between them. The entities are known as nodes; edges are the links between nodes; and labels categorize the edges, defining the relationship between the nodes they connect. A triple, made up of a subject, a predicate, and an object, expresses a single fact in the graph. Knowledge graphs work particularly well for showing complex links across multiple large data sets.
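To make the terminology concrete, here is a minimal sketch in Python using the networkx library. The entities and relationships are invented for illustration and are not drawn from the episode.

```python
# Minimal knowledge-graph sketch with networkx. The entities and
# relationships here are illustrative examples only.
import networkx as nx

kg = nx.MultiDiGraph()

# Nodes (entities) carry a type so they can be categorized later.
kg.add_node("Metformin", type="Drug")
kg.add_node("Type 2 Diabetes", type="Disease")
kg.add_node("AMPK", type="Protein")

# Each labeled edge encodes one (subject, predicate, object) triple.
kg.add_edge("Metformin", "Type 2 Diabetes", predicate="treats")
kg.add_edge("Metformin", "AMPK", predicate="activates")

# Read the facts back out as triples.
for subject, obj, data in kg.edges(data=True):
    print(subject, data["predicate"], obj)
```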
Improving the Literature Search Process (10:50): We started off the conversation by looking at the challenges of literature search and how data science can be used to improve this process. Cody gave the example of how SMPA is working to make it easier for its internal stakeholders to identify the most relevant papers and understand how they connect with each other, using knowledge graphs to show complex relationships. He also explained the limitations of how disease ontologies are traditionally grouped. From a physician's perspective, having diseases grouped by organ system is very useful; however, it is not necessarily the most effective way to look at how diseases relate to each other. A knowledge graph allows those traditional groupings to persist while also surfacing further links between diseases, letting researchers see additional novel connections.
Incorporating Unstructured Information into a Knowledge Graph (16:44): Large Language Models (LLMs) have significantly accelerated the integration of unstructured data into knowledge graphs. They can generate candidate relationships to add to the graph, but it is crucial that the models are fine-tuned to correctly interpret biomedical information so the results are accurate. There is a real need for benchmarking to ensure that the ‘extracted material is not only properly structured for the graph, but it's also intelligently extracted.’
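As a rough, hypothetical illustration of this extraction pattern (not SMPA's actual pipeline), a pipeline step might prompt a model for candidate triples and treat the output as unvalidated until it passes benchmarking. The `call_llm` function below is a placeholder, not a real API.

```python
# Hypothetical sketch of LLM-based triple extraction. `call_llm` is a
# placeholder for your (ideally biomedically fine-tuned) model client.
import json

EXTRACTION_PROMPT = """Extract biomedical relationships from the text below.
Return a JSON list of objects with keys "subject", "predicate", "object".

Text: {text}"""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in a real model client here."""
    raise NotImplementedError

def extract_candidate_triples(text: str) -> list[dict]:
    # The output is only a candidate set; it should be benchmarked and
    # validated before any triple enters the knowledge graph.
    raw = call_llm(EXTRACTION_PROMPT.format(text=text))
    return json.loads(raw)
```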
The Importance of Usability (29:07): Cody speaks about the importance of ensuring that the information in a knowledge graph can be conveyed to the stakeholders who need to use and interact with it. From experience, he saw the need for a front end that helps users learn how to explore the graph. This includes functionality like sorting graph nodes by type or by certain properties, or reorganizing edges and nodes hierarchically. It might also include integrating a chatbot that lets users ask open questions of the knowledge graph and get natural-language answers. The key is to ensure the end user has the tools to derive value.
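Building on the earlier networkx sketch, operations like these are what a front end might expose; again, the node types and predicates are illustrative.

```python
# Front-end-style operations on the sketch graph from earlier:
# group nodes by type, and filter edges by predicate.
drug_nodes = [
    node for node, attrs in kg.nodes(data=True)
    if attrs.get("type") == "Drug"
]

treats_relations = [
    (subj, obj) for subj, obj, data in kg.edges(data=True)
    if data.get("predicate") == "treats"
]

print(drug_nodes)        # ['Metformin']
print(treats_relations)  # [('Metformin', 'Type 2 Diabetes')]
```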
Maintaining Knowledge Graphs (31:44): One of the challenges when taking knowledge graphs from development into use by teams is maintaining them so they remain easy to use, up to date, and seamless in operation. The graph is constantly expanding, but from a practical perspective, users need its growth to represent genuinely new knowledge without adding noise to the system. This forces a decision about when to freeze the graph, balancing a usable tool against accurate, up-to-date information.
Knowledge Graphs for Drug Discovery (38:41): When asked what the future looks like, one of the areas Cody was most excited about was using knowledge graphs to improve the drug discovery process. He sees it as an area with potential for high impact, as knowledge graphs can improve time efficiency, lower development cost, and, most notably, limit risk. Risk can be reduced in several ways, for example by better representing molecules as graphs or by developing novel methods to predict molecule-target binding using graph data science. There is real room for improvement in this area, and knowledge graphs have great potential here.
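As a toy illustration of the "molecules as graphs" idea, atoms can be modeled as nodes and bonds as edges. This is purely illustrative; production pipelines would typically use a cheminformatics library such as RDKit together with a graph ML framework.

```python
# Toy molecule-as-graph sketch: heavy atoms of ethanol (C-C-O) as nodes,
# bonds as edges. Illustrative only; real work would use RDKit or similar.
import networkx as nx

ethanol = nx.Graph()
ethanol.add_node(0, element="C")
ethanol.add_node(1, element="C")
ethanol.add_node(2, element="O")
ethanol.add_edge(0, 1, bond="single")
ethanol.add_edge(1, 2, bond="single")

# Simple structural features like these can feed downstream models that
# score molecule-target binding.
print(ethanol.number_of_nodes(), ethanol.number_of_edges())  # 3 2
```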
Continuing the Conversation
One of the topics that bubbled throughout our conversation with Cody was the importance of formative input from stakeholders into the data science models they utilize. Cody explained the need for front-end tools to make it easy for users to interact with knowledge graphs and that the process of building that functionality involved working closely with stakeholders to improve the quality of the underlying data and ensure they would be able to derive value from it.
One of the most inspiring aspects of Cody’s and SMPA’s approach to the construction of their proprietary knowledge graph is that they are treating the knowledge graph as a representation of the organization’s collective knowledge. They are making it easy to receive feedback from domain experts so that knowledge locked in an individual person’s brain can be represented in a format that can drive insights for the entire SMPA team. Too few organizations are so strategic about embodying internal knowledge in a format that can lead to serendipitous knowledge transfer to arbitrary parts of the organization. SMPA is also laying the long-term groundwork for discoveries to be made in the future that could not otherwise be made if they were not intelligently combining and curating the knowledge graph.
Why is Formative Input From Domain Experts Into Foundational Data Science Tools So Crucial?
- Building an internal repository of high-quality, proprietary IP: The role of Data Science is to create internal knowledge that continually builds on itself and becomes a valuable asset to the company. Empowering data science teams to own the infrastructure for taking knowledge out of the heads of domain experts and embedding it in systems the company can leverage in perpetuity sets the company up for long-term success and reduces reliance on key domain experts.
- Documenting and testing the assumptions of domain experts: Often, there are unspoken assumptions that underlie the decisions a biotech organization makes: assumptions about biological relationships, about the competitive landscape for the company, about the viability of investment in a given treatment approach. Encouraging the organization to document these assumptions and review them regularly creates a better baseline for clarity about the company's strategic and tactical approach.
- Fostering the flow of information between Data Science and domain experts: Data science tools for domain experts are only as good as the feedback the Data Science team receives. By creating an interface for direct interaction with the knowledge graph data that underlies analyses of connections, the team is well positioned to understand the needs and perspectives of its stakeholders. This is the approach of "data products" over "data outputs" that positions the Data Science team to contribute more fully to the long-term success of the organization.
At CorrDyn, we have a significant focus on working with stakeholders from the beginning and throughout the project to ensure what we deliver scales well and becomes an asset that provides long-term ROI for the organization.
If you're interested in discovering how your organization can unlock the value of data and maximize its potential, get in touch with CorrDyn for a free SWOT analysis.
Want to hear the full conversation? Listen here: