To solve a real-world business problem using a Large Language Model (LLM), you need three things:
- A base LLM, for example, a proprietary model like GPT-3 or a fully open-source one like Dolly.
- An algorithm that fine-tunes the model parameters using a private dataset, so the model performs the task you want it to perform.
- A private dataset for the specific business problem you want to solve.
Imagine you work at Facebook as an ML engineer and need to develop an ML algorithm to filter offensive posts. You have a large historical dataset with pairs (post text, flag), where the flag is either offensive or non-offensive.
The performance of your ML solution depends on the quality of each of these three ingredients.
- Open-source LLMs are catching up with proprietary models from OpenAI and Google, so you don’t have to worry much here.
- Similarly, state-of-the-art fine-tuning algorithms are going open-source as well, e.g., PEFT by Hugging Face.
- What really makes the difference is ingredient #3. While everyone focuses on #1 and #2, it is the quality of your private dataset that will make (or break) your project.
Let’s go back to the Facebook example.
So you are an ML engineer at Facebook who wants to develop an ML algorithm to filter offensive posts.
This model will be deployed and used by the content moderation team to ensure that the platform's content adheres to Facebook's content quality policy.
🗣️ Content moderation is the process of reviewing and monitoring the content generated and published by the platform users.
Every post submitted by a user can be classified either as:
- non-offensive, meaning it adheres to FB content moderation policy, so it is published on the platform.
- offensive, otherwise. In this case, the content is not published.
So, your job is to build a Machine Learning classifier that can automatically detect offensive posts.
For that, you want to use:
- A state-of-the-art Large Language Model
- A fine-tuning algorithm
- A historical dataset of pairs (post text, flag), where the flag is either offensive or non-offensive. This is a large dataset, gathered internally at Facebook and expanded over the years. Each dataset sample was annotated either by a human from the content moderation team, or by an ML algorithm. This observation is crucial.
How do you build the best model possible?
Trying different
- base LLMs (for example, Flan-T5 instead of Dolly) or
- fine-tuning methods (e.g., LoRA instead of AdaLoRA)
will obviously influence the quality of your end model. These are model-centric techniques.
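To see why methods like LoRA are attractive, here is a toy, plain-Python illustration of its core idea (this is not the actual `peft` API, and the dimensions are illustrative): instead of updating a full d×d weight matrix, you train two small low-rank factors, which cuts the number of trainable parameters dramatically.

```python
# Toy illustration of LoRA's parameter savings (not the real peft API).
# A full fine-tune of one d x d weight matrix trains d * d parameters;
# LoRA instead trains two low-rank factors A (r x d) and B (d x r),
# i.e. 2 * d * r parameters, applying W + (alpha / r) * (B @ A) at inference.

d = 768   # hidden size of a typical small transformer layer (illustrative)
r = 8     # LoRA rank (illustrative)

full_finetune_params = d * d   # update the whole matrix
lora_params = 2 * d * r        # update only the low-rank factors

print(full_finetune_params)                                # 589824
print(lora_params)                                         # 12288
print(round(lora_params / full_finetune_params * 100, 2))  # 2.08 (percent)
```

With these (made-up) numbers, LoRA trains about 2% of the parameters a full fine-tune of that matrix would, which is why it fits on modest hardware.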
Model-centric techniques help you squeeze extra performance out of your end model. However, as usual, the money is somewhere else.
And this somewhere else is the dataset you use to fine-tune your LLM.
The quality of your end model can only be as good as the quality of the dataset used to fine-tune your LLM.
No matter how good your LLM or fine-tuning procedure is, if your dataset labels (offensive vs non-offensive) are incorrect (aka noisy), your model's performance will be sub-optimal.
So, if you really want to improve the model, use data-centric techniques to improve the quality of your dataset.
In this case, as you want to build a Machine Learning classifier, you need to make sure the labels (offensive vs non-offensive) are clean.
Your dataset, like most real-world datasets, was annotated by different humans and ML models over time. This hybrid approach
- made the labeling process scalable,
- but also introduced a non-negligible amount of labeling errors into your dataset.
Finding and fixing labeling errors is where your focus should be.
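The core trick behind automated label-error detection is simple: take a trained model's predicted probabilities and flag examples whose given label the model assigns low probability (their "self-confidence"). Here is a minimal, plain-Python sketch of that idea; the posts, probabilities, and the fixed 0.5 threshold are all made up for illustration.

```python
# Minimal sketch of self-confidence-based label-error detection.
# For each post we have: the assigned label and a model's predicted
# probability for each class (0 = non-offensive, 1 = offensive).
# All numbers below are made up for illustration.

posts = [
    # (assigned_label, predicted_probs [p_non_offensive, p_offensive])
    (0, [0.97, 0.03]),
    (1, [0.08, 0.92]),
    (0, [0.06, 0.94]),   # labeled non-offensive, model strongly disagrees
    (1, [0.88, 0.12]),   # labeled offensive, model strongly disagrees
    (0, [0.81, 0.19]),
]

def likely_label_errors(data, threshold=0.5):
    """Flag indices where the model's probability for the assigned
    label (its self-confidence) falls below the threshold."""
    return [i for i, (label, probs) in enumerate(data)
            if probs[label] < threshold]

print(likely_label_errors(posts))  # [2, 3]
```

Real tools estimate per-class thresholds from the data itself (this is the "confident learning" approach) rather than using a fixed cutoff, but the principle is the same: the examples your model confidently disagrees with are your best candidates for label errors.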
Data-centric AI 💰
Data-centric techniques have been shown to reduce LLM prediction errors by up to 37%.
Read more about it in this hands-on article by Chris Mauch, Data Scientist at Cleanlab.
How to improve data quality in a scalable way?
Manually fixing labeling errors is only feasible for relatively small datasets. Luckily, there are open-source tools that can help you automate data cleaning.
Automatic data cleaning 🛁
Cleanlab is my favorite data-cleaning tool. It is a free, open-source Python library that helps you clean data and labels by automatically detecting issues in a Machine Learning dataset.
My advice 🧠
The only way to learn data-centric AI is to get your hands dirty.
So, I suggest the following:
- Follow this hands-on tutorial with source code by Cleanlab, and try to replicate it with your dataset.
- Publish the results on your GitHub
- Ping me!
I would love to share your work with over 30k followers on Twitter and LinkedIn 🤗
That is all for today, folks.
Want to design, develop, and deploy real-world ML products?
Join 7.5k Machine Learning developers in our weekly newsletter.
Read The Real World ML Newsletter every Saturday morning.
Drop your email below to subscribe ↓
Keep on learning!