As we have often seen with guests who are founders, the inspiration for their business proposition is rooted in frustrations they have encountered in previous roles. Harry Rickerby, Co-Founder and CEO at Briefly Bio, is no exception. He talks about his new venture and the challenges that led him to set up the company earlier this year, and of course, as this is Data in Biotech, his views on Data, AI, and ML.
Harry originally studied biology at Imperial College, London, developing an interest in synthetic biology. From there, he joined LabGenius in 2014 as its first employee, focusing on data and machine learning to help with protein engineering and drug discovery. He remained with the company for eight years, moving into more of a leadership role during this time. Harry left the organization in 2022 to co-found Briefly Bio, where he is developing a platform to address the challenge of incomplete and inconsistent documentation of experiments.
Our conversation with Harry drilled down into the specific challenges he has encountered over almost a decade in the industry, the challenges that led him to set up Briefly Bio. From the issues that arise from incomplete, inconsistent documentation to the potential of Large Language Models (LLMs), here are the highlights:
Further Reading: For anyone wanting to hear more from Harry himself, he regularly publishes on Substack. He also recommends the blog of previous Data in Biotech guest Jesse Johnson, Scaling Biotech, for insights into building data teams and data systems within a biotech organization.
The highlights only scratch the surface of our conversation with Harry on how LLMs can be used in the biotech space; for more detail, you can listen to the podcast in full here.
For the host of Data in Biotech, Harry's vision for LLMs, and his view that they should be integrated into software with purpose and intent, really resonated in this conversation. There is a lot of noise around LLMs and Generative AI at the moment, which fits with Gartner's placement of the technology at the 'Peak of Inflated Expectations' on its AI Hype Cycle.
When we look realistically at how organizations can use Generative AI and LLMs, it is clear they need a considerable amount of guidance and direction to have a meaningful impact. Briefly is a great example of what is needed to move the conversation on LLMs from simple chatbots that add little value to powerful tools that can tackle some of the big challenges facing scientists who need to make better use of their data.
This naturally brings us back to a familiar challenge: how do we create better data? Data is our bread and butter at CorrDyn, so we think about this a lot. Following this podcast, let's look at where the role of a data science team begins.
Without tools, platforms, templates, or standardization for how experimental data is recorded, wet lab scientists have no hope of generating the consistent, complete data that data science teams are looking for. This is a problem for the data science team to address in partnership with the wet lab team. Neither can achieve long-term success unless the process begins by improving the quality of the data coming from the wet lab.
Any issues in the data generated move downstream through the entire organization and cannot be solved at a later date. Empowering scientists to create better data and metadata needs to be the starting point for any data team. Tackling the issue of data quality at the source is central to making the process of analyzing the data easier and more reliable. But how can we improve data quality?
Automation is one of the key tools in the arsenal for generating quality data. Using automation, LLMs, or any tool to remove the drudgery of proper documentation, and to remove the ambiguity around what consistently and accurately captured protocols look like, is at the heart of quality data. Data scientists cannot exist in isolation. They must collaborate with the wider science team to ensure they have datasets that are detailed enough to work with. Only by starting at the beginning, when the data is being generated, can they guarantee this.
The good news is that your team does not necessarily have to adopt a new tool for documentation and tracking to get the value of better metadata from your experiments. Regardless of how your metadata is generated, the goal should always be to extract it into a place where your data team can work with a copy of it, free from compliance or process concerns, such as a data warehouse. This sandbox is where your data science team can apply repeatable information extraction and cleaning approaches that address your most critical metadata needs while applying quality control to the data your wet lab team generates. All of this is typically achievable without altering the day-to-day experience of the wet lab team, while creating a meaningful feedback loop for the data science and wet lab teams to collaborate toward better experimental metadata.
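To make that idea a little more concrete, here is a minimal sketch of what a repeatable cleaning-and-QC pass over extracted experiment metadata might look like. This is an illustration, not CorrDyn's or Briefly Bio's actual tooling: the column names (experiment_id, protocol, operator, run_date), file paths, and QC rules are hypothetical examples of the kind of checks a data science team could run in its warehouse sandbox.

```python
# Minimal sketch of a repeatable cleaning-and-QC pass over experiment metadata
# exported from a wet lab system. Column names, paths, and rules are hypothetical.
import pandas as pd

REQUIRED_COLUMNS = ["experiment_id", "protocol", "operator", "run_date"]


def clean_and_qc(raw: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (rows ready for the warehouse sandbox, rows flagged for wet lab review)."""
    df = raw.copy()

    # Normalize free-text fields so downstream grouping and joins are consistent.
    for col in ["protocol", "operator"]:
        df[col] = df[col].astype("string").str.strip().str.lower()

    # Parse dates; unparseable values become NaT and are caught by the QC below.
    df["run_date"] = pd.to_datetime(df["run_date"], errors="coerce")

    # QC: flag rows with missing required metadata or duplicate experiment IDs.
    missing = df[REQUIRED_COLUMNS].isna().any(axis=1)
    duplicated = df["experiment_id"].duplicated(keep=False)
    flagged = df[missing | duplicated].assign(
        qc_reason=lambda d: d.apply(
            lambda row: "missing required field"
            if row[REQUIRED_COLUMNS].isna().any()
            else "duplicate experiment_id",
            axis=1,
        )
    )
    clean = df[~(missing | duplicated)]
    return clean, flagged


if __name__ == "__main__":
    raw = pd.read_csv("eln_export.csv")  # hypothetical export from the lab system
    clean, flagged = clean_and_qc(raw)
    clean.to_csv("warehouse_staging/experiments.csv", index=False)    # stand-in for a warehouse load
    flagged.to_csv("qc_review/flagged_experiments.csv", index=False)  # feedback loop for the wet lab team
```

Because the checks live in code rather than in someone's head, the same pass can be re-run on every export, and the flagged file gives the wet lab team a concrete, low-friction starting point for that feedback loop.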
Briefly Bio’s vision of better upfront metadata generation is a better long-term solution to the problem because it secures the wet lab team’s buy-in to experimental metadata generation from the outset. However, teams that remain committed to, and invested in, optimizing their existing tools and processes can still achieve many of the benefits without restructuring their R&D processes.
If you're interested in discovering how your organization can unlock the value of data and maximize its potential, get in touch with CorrDyn for a free SWOT analysis.
Want to listen to the full podcast? Listen here: