
***This post was first published by Ross Katz on LinkedIn. View the original article here***

A lot of clients are exploring the potential of generative machine learning models (LLMs or GenAI) for various use cases. The question that consistently comes up is whether they should use a third-party LLM API like OpenAI, Google’s Gemini, Anthropic’s Claude, or Cohere, or one of the myriad open alternatives based on Llama, Mistral, Command-R, StarCoder, or whatever the most recent or most hyped model release is.

The LLM Landscape

The overlap in capabilities of these systems is often difficult to discern. Although there are great examples of open LLMs using tools and acting as the backend for agents, it is not always clear whether open models can reach the same level of flexibility and reliability as third-party models for a given task. What is clear is that the third-party LLM APIs will continue to push ahead with more advanced features as their race to differentiate via “closer-to-AGI” capabilities continues.

Overall, the choice between open and third-party LLMs depends on how generalized your task is. If your task is very specific and you know exactly what the inputs and outputs will be, you can often fine-tune an open LLM to exceed the cost/performance characteristics of third-party APIs (especially if you expect to reach scale, or if you are comfortable with your data being hosted on an upstart cloud GPU provider). If your task is relatively generalized, with inputs and outputs that vary substantially from interaction to interaction, or if it relies on capabilities that have only been reliably demonstrated in trillion-plus-parameter models like ChatGPT and Gemini, third-party-developed and -hosted LLMs will tend to outperform. And for proofs of concept (POCs), the APIs will always be there to smooth development and provide a friendly test environment.

Asking the Right Questions

Every comparison of open vs. third-party software, including LLMs, should start with a critical exploration of the use case involved. Companies have only begun to develop the muscles to critically evaluate use cases for machine learning, let alone use cases for LLMs. Let’s walk through some of the questions we need to ask to determine the path forward:

Do we know what success looks like? Can we procure or create 100 examples of success and failure? How diverse are the examples we have provided?

Without a benchmark dataset against which to compare the performance of open and third-party alternatives, we are “flying blind.”

We do not know the level of performance we are sacrificing when choosing one over the other. This makes it impossible to:

  1. Establish a baseline for the minimum viable level of performance we require from the system.
  2. Understand the value tradeoff between cost, performance, and latency.
  3. Evaluate whether our investments are improving the system’s performance.

In short, a benchmark dataset is non-negotiable if you intend to take LLM capabilities to production. Start here regardless of the choice you make between an open model and a third-party API.
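As a concrete starting point, here is a minimal sketch of what such a benchmark harness might look like, assuming a JSONL file of input/expected pairs and a `call_model` wrapper; the `call_openai` and `call_local_model` names are hypothetical stand-ins for whichever alternatives you are comparing:

```python
import json

def load_benchmark(path: str) -> list[dict]:
    """Load benchmark examples: one JSON object per line with
    'input' and 'expected' fields (plus optional metadata)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

def exact_match(prediction: str, expected: str) -> bool:
    """Simplest possible metric; swap in whatever scoring fits your task."""
    return prediction.strip().lower() == expected.strip().lower()

def evaluate(call_model, examples: list[dict]) -> float:
    """Score a model (open or third-party) against the same fixed dataset,
    so the comparison between alternatives is apples-to-apples."""
    correct = 0
    for ex in examples:
        prediction = call_model(ex["input"])  # your wrapper around the API or local model
        if exact_match(prediction, ex["expected"]):
            correct += 1
    return correct / len(examples)

# Usage: run the same harness against each candidate.
# examples = load_benchmark("benchmark.jsonl")
# print("API baseline:", evaluate(call_openai, examples))
# print("Fine-tuned open model:", evaluate(call_local_model, examples))
```

Running every candidate against the same fixed examples is what makes the baseline, the cost/performance tradeoff, and any later improvement measurable.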

Is the system internal-facing or external-facing? A software-only integration or a human-facing tool?

If the LLM is only serving as middleware or data-parsing software, then the reliability you require from the system is limited in scope. Software integrations tend to expect more consistent inputs and outputs, and smaller LLMs can be fine-tuned to focus narrowly on the task required.
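As a rough illustration of what “fine-tuned to focus narrowly” can mean in practice, here is a sketch of LoRA adapter tuning using the Hugging Face transformers/peft/datasets stack; the checkpoint and the `task_data.jsonl` file are placeholders, not recommendations:

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

BASE = "mistralai/Mistral-7B-v0.1"  # any small open checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# LoRA trains a few million adapter weights instead of all base weights,
# which is what makes narrow, task-specific tuning cheap.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# task_data.jsonl is a placeholder: one {"text": "..."} example per line.
data = load_dataset("json", data_files="task_data.jsonl")["train"]
data = data.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512))

Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapter", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```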

Humans interfacing with an LLM introduce randomness into the inputs, which expands the potential for errors, harmful responses, and security vulnerabilities. Human-facing tools will therefore typically require a more advanced, third-party model to account for the variability in inputs the model can expect to receive.

Can users/clients of the system tolerate errors, or do they expect near-perfect clarity, consistency, truthfulness, and helpfulness, with absolutely no harmfulness in the responses provided?

This influences how polished we expect the system to be before it is rolled out. It also affects the extent to which the company can spend time building from less mature components in-house versus leveraging the more mature LLM capabilities available off the shelf.

How important is time-to-value?

If speed of delivery is critical to accomplishing the business goal or establishing momentum toward it, then third-party LLM APIs look more attractive.

If the development of internal capabilities and proprietary models is critical to the company’s long-term success, then starting from open models looks more attractive by comparison.

Do we understand the strategic value we expect the system to drive?

Is the goal just to have a GenAI system your team can point to so that executives can feel secure in knowing that you are developing internal capabilities with LLMs? Or are there real strategic business priorities that need to be achieved for the system to be rolled out in production?

For example, a strategic priority may be reducing the burden of customer support requests on the customer success team, and a reduction of requests by 25% is the goal for the current year. Another strategic priority might be reducing the amount of time a team spends drafting extensive documents for regulators.

Higher strategic value increases the likelihood you want to bring the model in-house. Higher operational value increases the likelihood you are willing to utilize a third party API for the foreseeable future, under the assumption that the cost of the API will be lower than the money its outputs will save your company.

Are third-party LLMs potential competitors, or competitor enablers, for your business?

If a third-party LLM could become a competitor, or enable a competitor using data you provide, then building on an open LLM with your proprietary data is the clear path forward. The alternative is getting your Chief Counsel comfortable with the contractual commitments of the third-party API and accepting the risk that those commitments are not fully enforceable.

What are the other risks of the system, and how do those align with utilizing a third-party API?

Every organization has acceptable and unacceptable risks. An acceptable risk might be a decline in net promoter scores in order to achieve cost-reduction targets.

Unacceptable risks might include a leak of sensitive, personally identifiable information, exposure of the company’s intellectual property to a third party, or a response that is harmful or offensive to a customer. Understanding these constraints helps determine which options are off the table. It also guides the guardrails you need to apply and the tests you need to conduct to get comfortable with the risks of whichever choice you make.
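Guardrails do not have to be exotic. Here is a sketch of the kind of cheap, deterministic check you can put in front of any model, open or third-party, reusing the hypothetical `call_model` wrapper from the benchmark sketch above; the regex patterns are illustrative only, not a complete PII detector:

```python
import re

# Illustrative patterns only -- real deployments use dedicated PII/abuse
# detection, but even simple checks turn "unacceptable risk" into a testable gate.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN-like numbers
    re.compile(r"\b\d{13,16}\b"),              # bare card-number-like digit runs
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email addresses
]

def violates_guardrails(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

def guarded_response(call_model, prompt: str) -> str:
    """Check both what goes into the model and what comes out of it."""
    if violates_guardrails(prompt):
        return "Sorry, I can't process requests containing personal data."
    response = call_model(prompt)
    if violates_guardrails(response):
        return "Sorry, I can't share that information."
    return response
```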

Are we risk-averse or value-seeking?

If the risk of exposed data is catastrophic, then bringing the LLM in-house is clearly the better alternative. But if you have nothing to lose, money is no object, and time-to-market is critical, utilizing a third-party API becomes the clear choice. There is a reason that many startups are built around third-party APIs until they establish product-market fit.

Do we understand the full functional requirements?

In order to understand the cost, performance, and latency tradeoffs between open and third-party LLMs, we need to explore the full space of functional expectations (the latency probe sketched after this list can help put numbers on some of these):

  • How frequently is the information users request from the model changing?
  • How many concurrent users/clients do we expect to have?
  • How important is the latency of response to the value we drive?
  • Do we have data systems and MLOps capacity in place to serve and monitor multiple concurrent instances of an open LLM?
  • Do we have the resources necessary to architect and provision new infrastructure, if applicable?
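To put rough numbers on the concurrency and latency questions, a simple probe like the following can help. It assumes an OpenAI-compatible HTTP endpoint at a placeholder URL and uses the `httpx` library:

```python
import asyncio
import time

import httpx  # pip install httpx

ENDPOINT = "http://localhost:8000/v1/chat/completions"  # placeholder URL
PAYLOAD = {"model": "my-model",  # placeholder model name
           "messages": [{"role": "user", "content": "ping"}]}

async def one_request(client: httpx.AsyncClient) -> float:
    """Time a single round trip to the endpoint."""
    start = time.perf_counter()
    await client.post(ENDPOINT, json=PAYLOAD, timeout=60)
    return time.perf_counter() - start

async def probe(concurrency: int) -> None:
    """Fire N simultaneous requests and report latency percentiles."""
    async with httpx.AsyncClient() as client:
        latencies = list(await asyncio.gather(
            *[one_request(client) for _ in range(concurrency)]))
    latencies.sort()
    print(f"{concurrency} concurrent: "
          f"p50={latencies[len(latencies) // 2]:.2f}s "
          f"p95={latencies[int(len(latencies) * 0.95)]:.2f}s")

# Ramp concurrency to see where latency starts to degrade.
# for n in (1, 8, 32):
#     asyncio.run(probe(n))
```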

Do we expect the required system to be reliant on retrieval of relevant information from outside the LLM?

If yes, is that relevant information sensitive? How frequently is it updated?

LLMs are basically systems for compressing information. They have “parameters” that contain both knowledge about the world and knowledge about how to interpret, collate, and respond to the prompts they are given. Knowledge can also be introduced to the system through the prompt.

LLM use cases often assume the LLM has access to current information related to the user and the question being asked. If the LLM is only trained on publicly available information from months ago, then additional context must be retrieved and provided to the LLM when the user/client makes a request. This means you are planning to develop a “Retrieval-Augmented Generation” (RAG) application.

If that is your use case, then the contextual information also needs to be evaluated for sensitivity, and expectations for it need to be incorporated into the functional requirements. Different LLMs differ in how much additional information they can process (context length) and in how well they can deliver the output your end users/clients expect.

RAG applications are the most common production LLM systems today, but RAG can increase the data-leakage risk of leveraging third-party APIs unless systems are developed to constrain the data the LLMs receive (and the users on whose behalf they receive it).
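For orientation, a minimal RAG loop might look like the sketch below. The embedding model, the `call_model` wrapper, and the document list are assumptions for illustration, not a specific recommendation:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # a small open embedding model

def retrieve(query: str, docs: list[str], doc_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Return the k document chunks most similar to the query (cosine similarity)."""
    q = encoder.encode(query)
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def answer(call_model, query: str, docs: list[str], doc_vecs: np.ndarray) -> str:
    # Only the retrieved chunks leave your boundary -- this is the point at
    # which you constrain what a third-party API is allowed to see, and for whom.
    context = "\n\n".join(retrieve(query, docs, doc_vecs))
    return call_model(f"Answer using only this context:\n{context}\n\nQuestion: {query}")

# docs = ["...internal policy text...", "...product FAQ..."]
# doc_vecs = encoder.encode(docs)  # precompute once; use a vector DB at scale
# print(answer(call_model, "What is our refund policy?", docs, doc_vecs))
```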

Do we have a plan in place to learn from user / client interactions with the system?

Regardless of the path you choose, the data created from user interactions will be critical to iteratively improving the system’s performance, expanding the benchmark dataset to new use cases, and tweaking other aspects of the system so that it fulfills the strategic and functional requirements more effectively, subject to privacy, latency, and cost constraints. A plan for capturing and learning from this data should be in place as part of the architecture design, and it may influence whether you select a proprietary API or an open LLM.
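One lightweight way to start is an append-only interaction log; the schema below is illustrative, not prescriptive:

```python
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class Interaction:
    timestamp: float
    prompt: str
    response: str
    model: str                        # which backend served it (open model vs. API)
    user_feedback: str | None = None  # thumbs up/down, correction, etc.

def log_interaction(record: Interaction, path: str = "interactions.jsonl") -> None:
    """Append-only log; these records later become new benchmark examples
    and fine-tuning data (after scrubbing anything sensitive)."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# log_interaction(Interaction(time.time(), "How do I reset my password?",
#                             "Go to Settings > Security...", model="gpt-4"))
```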

A Direct Comparison

A high-level comparison of the criteria is provided below. Several of these points are debatable, but I have simplified them for the purposes of comparison.

[Table: A simplified comparison of open models and third-party APIs]

*Depending on how verbose and concurrent the use case is, since APIs typically charge per token (roughly per word)

**When you are using an open-source model hosted by a third-party API, I still consider this a third-party API, equivalent to OpenAI, Gemini, Claude, or Cohere, because the risks are the same.
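To make the per-token point concrete, here is a back-of-envelope cost sketch. The prices are illustrative placeholders, not any provider’s actual rates:

```python
# Back-of-envelope API cost model. Prices are illustrative placeholders --
# plug in real rates from your vendor, and compare the total against a
# flat monthly GPU rental for a self-hosted open model.
PRICE_PER_1K_INPUT = 0.01   # USD per 1,000 input tokens (assumed)
PRICE_PER_1K_OUTPUT = 0.03  # USD per 1,000 output tokens (assumed)

def monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int) -> float:
    per_request = (input_tokens / 1000) * PRICE_PER_1K_INPUT \
                + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    return per_request * requests_per_day * 30

# A verbose RAG prompt (4k tokens in, 500 out) at 10k requests/day:
# print(f"${monthly_cost(10_000, 4_000, 500):,.0f}/month")  # ~$16,500/month
```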

Clients ask the open LLM vs. proprietary API question because it is complicated and difficult to answer in the absence of additional information. Hopefully this post provides a springboard for asking the right questions and clarifying where the most appropriate allocation of resources lies for your business.

If you have any questions or additional thoughts on the criteria and tradeoffs that need to be considered upfront, please get in touch.

Afterword: Hosting Considerations

Obviously, open models also require a hosting solution, and I will touch briefly on the hosting considerations here.

[Figure: A two-by-two matrix comparing self-hosted vs. third-party hosting options and proprietary vs. third-party-owned models]

Right now, LLM inference at scale means buying or renting GPUs or custom hardware. Several startups offer serverless GPUs with the promise of reducing the cost of inference on open and/or in-house fine-tuned models. However, this requires a willingness to expose your data to an upstart cloud services provider, which may not meet your requirements for data security and IP protection.

And there are startups offering inference-optimized hardware for LLMs for those who are willing to invest huge amounts of capital to reduce marginal inference costs while maximizing data privacy/security (think finance, healthcare). But if you’re investing that kind of capital, you are probably not taking my advice from this post. You are consulting your company's version of me.

When you host an open model on your own cloud infrastructure, within your own virtual private cloud (VPC), I consider you the owner of the hosting solution, as if it were on-premises. The differentiating attribute is whether any third party, other than an already trusted cloud provider, can access the data you send to the LLM, and whether that data can be used to train future versions of the LLM that you do not own. In this case, the data and model are yours.
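For a sense of what this quadrant looks like in practice, here is a minimal self-hosted inference sketch, assuming vLLM and an open checkpoint running on a GPU instance inside your VPC:

```python
# pip install vllm -- runs on a GPU instance inside your own VPC,
# so prompts and outputs never leave infrastructure you control.
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # any open checkpoint
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize our Q3 incident report: ..."], params)
print(outputs[0].outputs[0].text)
```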


Post by Ross Katz
April 17, 2024