Large Language Models: A pragmatic Guide

May 3, 2023

To help you navigate the quickly evolving landscape of Large Language Models (LLMs), we created a comprehensive guide to support the decision making of you and your company.

Foundation models in general, and Large Language Models (LLMs) in particular, constitute a new paradigm in Artificial Intelligence. Faster than many people expect, LLMs will revolutionize how we live our lives and how we approach knowledge work. If you have followed the news on AI recently, you were likely overwhelmed by a bombardment of names of new models that are released at an increasing pace. Names like GPT-4, T5, Chinchilla or LLaMA all represent language models based on the transformer architecture, pre-trained on vast amounts of text and able to generate language.

Knowing how the different general-purpose LLMs differ along multiple dimensions, and being able to choose the right model in a given situation, will be a vital part of every entrepreneur's toolkit. This is why we put together this guide for you to use as a cheat sheet to build your AI-first company.

How to compare LLMs

Although many dimensions might be interesting for comparing different LLMs, we narrowed them down to five parameters that we find most insightful when looking at a model. If you are interested in other parameters, or would like more detailed information on the topic, please feel free to reach out to us.

High level quality assessment

The most important question is how different language models compare in their performance. The quality of the model determines how good it is at solving your problems.
It is of course difficult to assess the quality of LLMs, because for many models there is only limited information available and so far there are no standardized benchmarks. You can nevertheless evaluate quality by combining openly available assessments (performance on research datasets) with your own experiments. For starters, you may look at this extensive survey of Large Language Models.
The quality of LLMs is currently a fast-moving target. What is the best performance an LLM may ultimately reach? We can’t say for sure right now; new models will keep pushing the limits.
In our analysis, GPT-4 is currently the most capable language model available. Together with the currently available 8k-token context window, this makes the model an ideal candidate for showcasing what is possible right now. Claude, on the other hand, may provide somewhat more secure and less biased output because of Anthropic's approach of training “constitutional AI”.

Type of Access

Different LLMs are accessible to various degrees. The level of access you can get as a company determines what you can accomplish with the model. Some models are open source and give you complete access to the source code and the trained weights (e.g. GPT-NeoX, FLAN-T5), while other models can only be accessed via an API (e.g. GPT-4, Claude, Luminous). Moreover, some models are only open for research purposes (e.g. LLaMA), so you can get access only under a non-commercial license. For some models you cannot get access at all (e.g. Chinchilla). The relevant questions for you are, firstly, whether you can access a model and, secondly, whether access through an API or open source is better suited to your needs.
The type of access you get matters if you intend to fine-tune a model. A fine-tuned model is a custom model that is based on the original LLM. The user provides their own data set and retrains the LLM, creating a customized model with very high performance relative to the size of the user’s data set.
API: If you want to access an LLM via an API, you have to pay the provider of the model.
Open Source: If you use an open source model, you have complete control over it, but you also need to invest in engineering and computing. Hosting a large language model for fast inference is not easy, and training these models likewise requires a lot of engineering and knowledge. Up until now, these models have typically been trained with a shorter context window than e.g. ChatGPT or GPT-4, which introduces some obstacles when productionizing them for context-hungry applications.

Costs

As the name suggests, LLMs have gotten quite large. This means they are computationally expensive. It is therefore important to have an overview of pricing before you make a model choice.
If you use an LLM via an API, you usually get billed by the tokens you use in requests to the model. A word in English consists of 4.8 characters on average, and a token is roughly equivalent to 4 characters, so one word corresponds to slightly more than one token. Moreover, some providers differentiate in their pricing between prompts and completions. In case of differentiated prices we will simply provide a price range. The concrete costs will depend on how you use the API.
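To make the token arithmetic concrete, here is a minimal sketch of a cost estimate for a single API request. It assumes the rule of thumb of roughly 4 characters per token for English text (a real tokenizer will give different counts), and the function names and prices are hypothetical placeholders rather than any provider's actual API or rates.

```python
import math

# Assumption: about 4 characters per token for English text (rule of thumb only).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Rough token count derived from character length, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_cost(prompt: str, expected_completion_tokens: int,
                  prompt_price_per_1k: float,
                  completion_price_per_1k: float) -> float:
    """Estimated cost of one request when prompt and completion tokens
    are billed at different prices per 1,000 tokens."""
    prompt_tokens = estimate_tokens(prompt)
    total = (prompt_tokens * prompt_price_per_1k
             + expected_completion_tokens * completion_price_per_1k)
    return total / 1000


# Hypothetical prices per 1,000 tokens; check your provider's pricing page.
example_cost = estimate_cost("Summarize the following report: ...", 500, 0.03, 0.06)
print(f"Estimated cost per request: ${example_cost:.4f}")
```

Multiplying the per-request estimate by your expected request volume gives a first idea of the monthly bill.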
To get an overview you might look at the API costs of OpenAI, Anthropic and Aleph Alpha.
Additional costs come up if you want to host the model yourself. You can run your fine-tuned model either on your own hardware or in the cloud. In both cases you have to pay for the amount of compute you are using. The costs will mostly depend on the size of your model and the latency you are willing to accept. You can run very large models with low amounts of compute, but the time your model needs to respond will be much longer. For a more detailed analysis of inference costs, see here.
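The trade-off between compute and latency can be sketched with a back-of-the-envelope calculation. The example below assumes a machine that handles requests one after another at full utilization. The setup names, hourly prices and latencies are purely hypothetical and only illustrate the arithmetic.

```python
def cost_per_request(hourly_price: float, latency_seconds: float) -> float:
    """Compute cost of one request, assuming the machine serves requests
    sequentially and is fully utilized."""
    return hourly_price * latency_seconds / 3600


# Hypothetical setups: (hourly price in USD, seconds per response).
setups = {
    "large_gpu_server": (12.0, 2.0),
    "small_gpu_server": (1.5, 20.0),
}

for name, (price, latency) in setups.items():
    per_request = cost_per_request(price, latency)
    print(f"{name}: ${per_request:.4f} per request at {latency:.0f}s latency")
```

With numbers like these, the cheaper machine is not automatically cheaper per request, so it is worth running the calculation with your own prices and measured latencies.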

TLDR

If you want to use a large language model and are unsure which one to choose, you can roughly follow these heuristics:

Your data is sensitive and you don't want to share it? Use Pythia / FLAN-T5 / Dolly or very recent models like StableLM!

You have a low-complexity use case and no sensitive data? Use GPT-3.5, because it currently offers the best price/quality tradeoff!

You have a highly complex use case? Use GPT-4 or Claude!

You are a researcher? LLaMA models are probably the best models available for research use right now.
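The heuristics above can also be written down as a small decision function. This is only a sketch: the function name and the order in which the questions are checked are our assumptions, and the model lists simply mirror the recommendations in this guide.

```python
from typing import List


def recommend_models(sensitive_data: bool, high_complexity: bool,
                     research_only: bool = False) -> List[str]:
    """Map the TLDR heuristics to a list of candidate models.

    Assumption: research use is checked first, then data sensitivity,
    then use-case complexity. Adjust the order to your own priorities.
    """
    if research_only:
        return ["LLaMA"]
    if sensitive_data:
        # Open source models you can host yourself.
        return ["Pythia", "FLAN-T5", "Dolly", "StableLM"]
    if high_complexity:
        return ["GPT-4", "Claude"]
    return ["GPT-3.5"]


print(recommend_models(sensitive_data=False, high_complexity=True))
```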

New models are currently published on an almost weekly basis. This table gives a nice overview and is updated continuously.

Legal Risks for LLM users

Finally, we must acknowledge that there are legal risks involved if you use, fine-tune or train an LLM.
There are ongoing lawsuits against providers of LLMs, and it is not clear how they will develop.
We do not give any legal advice, nor do we assess any concrete risks for companies that build a product on top of an LLM.
However, you should be aware that there are unsettled legal questions whose outcome might affect your business.
All available models have been trained on large piles of data. Whether this data was obtained legally, and to what degree using a model trained on it constitutes copyright infringement, is heavily disputed. Make sure to read up on the data used in training the model of your choice and clarify whether you can use it for your use case.

Author

Alexander Fecke
