7 Principles for Determining Which LLM Business Cases Will Work, and Which Ones Will Fail

Written by Ross Katz | Jun 25, 2024 12:42:21 PM

Over the past 1.5 years, we have seen a steady increase in the number of clients asking questions about how they can apply generative artificial intelligence (GenAI) and large language models (LLMs) to any business problem available to them. The excitement about GenAI has caused executive stakeholders to either reallocate resources away from traditional machine learning (ML) or infuse additional resources into data science research & development (R&D) teams for the purpose of discovering the problems that can be best solved by LLMs.

Avoiding a bad start

The majority of CorrDyn engagements incorporate an assessment process to discover where data acquisition, data infrastructure, business intelligence, and machine learning can drive additional value. If there is no clear story for how value will be generated by the new technology tools, then we will generally re-scope or turn down the engagement. We recognize that delivering tools that expend additional resources but drive little-to-no value is the beginning of the end of our relationship with our clients, and our average client relationship spans 3+ years.

At a fundamental level, many Data Science projects, and especially ML projects, are R&D initiatives. The value that can be driven from these projects, and the timeline for that value to be delivered, is uncertain. We are very excited about the potential for LLM-based tools, but we don’t know whether LLMs will continue to scale in terms of their capabilities or whether capabilities will begin to plateau. We do believe that the capabilities available today are enough to drive value for a decade or more.

We operate from a position of pragmatism: companies should not hire (data) consultants to build projects with uncertain budgets, deliverables and timelines. Because return on investment (ROI) and time to value (TTV) are the most critical metrics we use to evaluate our projects’ success, we have learned to sense in advance where these projects are likely to fail, either for technical or organizational reasons.

Don’t let the tail wag the dog

Therefore, as with any technology, we firmly believe that starting with the tool and searching for the use-case is a backwards way of approaching GenAI. To help you think through the most worthwhile business cases for LLM-based tool implementation, this blog post provides an overview of our principles for where these tools are likely to drive ROI and TTV, and where LLM implementations are likely to fail.

Principle 1: Moonshots will fail. Moon bites might succeed.

A moonshot is, by definition, a project with a very low probability of a high-value outcome for your company. This is not to say that your organization should not invest in moonshots, but my experience is that most large organizations lack the will to see these projects through. If the moonshot is critical to the long-term success of the organization, then well-defined milestones, criteria for success, and a resource allocation scheme need to be developed in advance and protected from the vicissitudes of organizational performance.

Because so many moonshots fail, my advice is always to try to deconstruct it into a set of milestones (“moonbites”) that are each valuable in isolation. If we can construct a moonshot from valuable steps along the path, each of which can be evaluated on its merits independently, then a moonshot can deliver value to the organization even when it fails.

An example of a moonshot in the context of LLMs would be an automated research agent that populates a knowledge graph without supervision.

Does this seem achievable within constrained use cases? Yes.

Does it seem likely that your one data scientist or software engineer will solve this problem at scale in 6 months? No.

So break it down:

Step 1: Establish evaluation criteria and target performance.
Step 2: Prove that your LLM can extract relevant data (entities/nodes and relationships/edges in the case of knowledge graphs) from specific types of documents according to your evaluation criteria.
Step 3: Create / purchase an interface to efficiently review, evaluate, and edit the candidates the LLM agent surfaces, and add them to a knowledge graph.
Step 4: Construct a benchmark dataset using the interface from Step 3 or an existing reference graph.
Step 5: Tweak your LLM extraction to improve performance against the benchmark dataset. Expand the benchmark dataset based on human review.
Step 6: Revise steps 2-5 until LLM extraction is exceeding target performance.
Step 7: Scale up the data you introduce to the LLM.
Step 8: Establish a performance monitoring process. Sample data for human review.
Step 9: Add features to the interface to enable review of nodes/edges recently added to the knowledge graph by other users.
Step 10: Add features to the LLM agent to automatically conduct research activities and add nodes to knowledge graph, which can be reviewed for accuracy. Cap the number of nodes/edges that can be added in a given time period. Review extensively at first.
Step 11: Slowly reduce the sample of nodes reviewed by humans as trust in the LLM agent grows.

As you can see, it’s not quite as simple as “just build it”.

Principle 2: Keep the use case clear and constrained.

In all ML projects, there is technology risk, strategic risk, and execution risk (see sidebar). LLMs come with inherent technology risk because the tools are so new. Unpredictibility is intentionally built into the modeling paradigm, and the available frameworks for maximizing output value while minimizing risk are still works in progress. In a situation where technology risk is so high, it is incumbent on the organization to minimize strategic risk and execution risk.

Strategic risk can be minimized by selecting business cases for which value will accrue even if the technology is much less robust than vendors would have you believe. The business case is so obvious that the technology does not need to have the spark of artificial general intelligence (AGI) for your company to succeed.

Execution risk can be minimized by recognizing which elements of the execution plan your company is well-positioned to take on, and which elements require additional support (internal or external) to deliver on a high standard, on the required timeline, and within the allocated budget. Elements to consider include:

Definition and evaluation of the business case, including reasonable timeline and cost.
Development of evaluation benchmarks for a well-defined business case.
Exploration and selection of available models, third-party and open source (where applicable).
Development of systems to interface with selected model(s), which may include frontend (user-facing) and backend (system-facing) applications or services. Because frameworks for the development of these systems are evolving rapidly, and approaches are myriad, this is an area where substantial time and resources can be invested.
Inference (ongoing model output production) of model(s), and hosting (where applicable)
Evaluation and improvement of developed systems using evaluation benchmarks.
Enhancement of developed systems to improve performance on evaluation benchmarks and to meet expanded use-cases.

Working with a partner can be a good way to fill in the specific gaps you have with regard to strategic risk and execution risk. It allows you to focus on the areas where talent and resources are well-matched to the problem and complement internal teams where talent and resources are mismatched.

Sidebar: Types of ML Project Risk

ML Technology risk: The risk that the technology cannot be developed or applied to accomplish the organizational goal, with predictable timeline and cost.

ML Execution risk: The risk that a given organization cannot apply the technology in question to the business problem at hand without encountering barriers that are related to the nature of the organization (i.e. talent, processes, resources), and not the nature of the technology.

ML Strategic risk: The risk that a given organization will attempt to leverage the technology for a purpose that is either (a) not strategically beneficial to the organization, or (b) not well-suited to the selected technology and execution profile of the organization.

Ideally, all three should be kept to a minimum, but opportunity is usually proportional to risk.

Principle 3: Find use cases with repeatable inputs and outputs, consistent with training data.

LLMs are very good at reproducing language and ideas from their training data, as well as reproducing systematic mappings from inputs to outputs represented by their training data. LLMs are not very good at recognizing new relationships between inputs and outputs unless those relationships are repeatable and clearly demonstrated for them in the form of few-shot prompts or training data. If your input-to-output relationships vary substantially from use to use, then the use-case is not well-suited to the application of GenAI currently. This is different than saying that data must be well-structured; LLMs do very well with unstructured data. But the questions asked and the expected outputs need to be consistent.

Broadly speaking, we see consistent value being driven using three functions of LLMs:

Parsers: Extracting structured data from unstructured text, and leveraging that structured data for insights or automation.
Summarizers: Condensing large bodies of text into easily understandable summaries for internal or external stakeholders.
Question Answerers: Delivering a concise response to a question about a body of text, assuming the question can be answered based almost entirely on the text itself.

We believe a ton of value can be driven from these three functions alone. We also see a fourth emerging function where value can be driven but substantial development cost may be required:

Repeated Workflow Automation: Many of the leading-edge use cases focus on composing functions of existing software with the natural language understanding of LLMs, as well as their ability to call functions of software systems. The use cases for these systems are endless, but minimizing the level flexibility required will directly influence their achievability. This is what is currently being marketed as “Agents.”

Principle 4: Know what constitutes success.

Experimentation with these models is beneficial for understanding their capabilities, but not for the development of valuable business tools on time and within budget constraints. If you do not start with evaluation criteria in advance, you will not know whether the system is improving in terms of delivering against your business requirements. In situations like these, developers can add complexity and cost to the system unnecessarily because they want to “see whether it works better” or because they just want to add new tools to their resumes. This can lead to the deterioration in performance and the death of a large investment. Always define evaluation criteria, with examples, before starting work.

Principle 5: Plan your entire GenAI journey.

Do not start work on a project until you understand what productionalization entails at a high level; in terms of infrastructure cost, development cost, and maintenance cost.

Toy examples and demos are relatively easy to construct, but production-level systems are challenging to develop and manage. Do not be seduced by the demos. Once your team has selected the business case and a roadmap to its achievement, consider the volume of concurrent requests you will receive, the expected latency of response, the resulting scale of the expected production infrastructure, and the expected value per request your company will receive. Then, consider what options for the deployment and management of production infrastructure exist, and what the cost profiles will look like in various scenarios.

If there are few-to-no scenarios where production infrastructure can be sustainably deployed, managed, and maintained, then the business case should be placed in the backlog and reevaluated regularly as production deployment approaches change. It is always worth considering the risk that third-party APIs will increase prices, third party models will drift in terms of capabilities or approaches, and the potential maintenance burden of shifting functionality or APIs. Are you ready to rebuild your stack every 3-6 months?

Principle 6: Validate everything.

Do not believe what vendors are telling you until you have validated it directly. Do not believe what is possible until you have measured its performance. Do not believe timeline and cost estimates until your team has built a working MVP.

This is true of all technology vendors, but it is particularly true when the technology risk is this high. A lot of venture capital has infused companies that tout GenAI and LLM capabilities. These companies are searching for scalable, repeatable business models leveraging systems that are inherently uncertain. A few of these companies will be very successful. The vast majority will fail. If it seems too good to be true, it probably is. If you need assistance conducting vendor selection and evaluation, drop me a message and I’m happy to help with the selection process.

Principle 7: Start simple; add complexity as needed.

Start with the simplest possible system, and add complexity only when the use-case dictates the complexity is needed.

Technology developers (yes, even Data Scientists) can often be seduced by the excitement of new tools. This is often what brought them into the field in the first place, and some are more concerned about the tools and frameworks they can place on their resume than the relevance of those tools and frameworks to the problem at hand. As a leader of ML projects, it is essential to constrain the solution space to the minimum number of tools necessary to complete the job. With solid evaluation criteria, you can create a baseline system that helps you understand the system’s worst-case performance. Then, you can consider additions to the system that might increase performance and evaluate them one-by-one.

This parsimonious approach to ML system development is the best way to maximize performance and minimize cost. My experience is that smaller, stackable projects and developing simpler systems will save your company time and money.

Overall, I believe that a pragmatic approach to the development of GenAI systems is the only way to navigate this emerging technology landscape. Does this approach potentially slow your company down? Absolutely. Yet I firmly believe that this is a problem space where companies should slow down in order to speed up. The companies that are more rigorous about their approach are more likely to succeed.

***This post was first published by Ross Katz on LinkedIn. View the original article here***

View full post