How to generate training data: faster and better

When you create machine learning models in the real world, as opposed to online courses or Kaggle-style competitions, you need to generate the training data yourself.

This is the first step in any ML project. And as crucial as it is, it is the one we overlook the most.

As a freelance ML engineer, I take on new projects often and face the same problem again and again: How can I generate the training data, faster and better?

In this article, I want to share some of the best practices I have discovered along the way and help you boost your productivity.

You can find all the code in this repository.

The problem(s)

The first step in developing a machine learning model is getting the training data.

In real-world ML projects, more often than not, you do not get the data. You generate it. Unless you work at a very ML-savvy company with a mature data engineering infrastructure (e.g. Google, Facebook, Amazon, and similar), this step is far from trivial. Working as an AI/ML consultant for startups, I need to generate my own training data. Every time.

In most cases, this data is tabular, i.e. something that has rows (samples/observations) and columns (features and target). This data is stored in a DB like a Redshift or BigQuery cluster, or even a SQL Server or Postgres instance. You can query this data using SQL.

Ideally, you would write an SQL query, like this:

-- SQL query you thought you would write
SELECT
    feature1,
    feature2,
    ...
    featureN,
    target
FROM
    that_wonderful_table

Then you would run it once, export a CSV file with the training data, and move on to the data exploration and model building phase.

However, in reality, things are a bit more complex, because:

1. that_wonderful_table does not exist

Instead, you end up writing a more complex SQL query that fetches, aggregates, and joins data from different tables.

Something like this:

-- SQL query you end up writing
WITH table1 AS (
    SELECT ...
    FROM ...
    GROUP BY ...
)

,table2 AS (
    SELECT ...
    FROM ...
    GROUP BY ...
)

,table3 AS (
    SELECT ...
    FROM ...
)

SELECT
    feature1,
    feature2,
    ...
    featureN,
    target
FROM
    table1
    INNER JOIN table2 ON ...
    LEFT JOIN table3 ON ...

When you write long SQL you are bound to make mistakes. Either syntactic ones (easy to fix), or fundamental ones (harder to fix) that come from misunderstanding what each table and field in the DB represents.

You will probably need a few days, and several back-and-forth calls with the person who takes care of the DB (a DB admin or, ideally, a data engineer), to understand where to find each bit of data your model needs. Then you write the query, run it, and cross your fingers, hoping to get that golden dataset you need to start your ML work.

Well, this is not what happens most of the time, because of problem number 2.

2. My SQL query does not execute

You finish writing that long SQL query and send it for execution.

And wait… and wait… 0% progress… still waiting…

Yep, after a while you realize it won’t execute.

I have been there numerous times. Stuck trying to get the god damn data.


The problem is that you are trying to fetch too much data at once.

A quick fix is to query just a small sample of the data. You can do this at the very beginning of the project, to unblock yourself and get a baseline performance on the problem. However, as you want to improve the model you will probably realize there are some useful features you did not include in the query. This takes you to problem number 3.

3. My SQL query needs to be adjusted (often many times)

You can hack your way through once to get an initial dataset. However, this recipe is not going to work in the long run, and this is why:

  • As you prototype machine learning models, you will need to add extra features to your training data that are sitting somewhere in the DB. Hence, you will need to edit and re-run the query. Again.
  • When you move your model to production (which is the end goal of your work, by the way), you will need to implement a re-training pipeline to avoid model drift. You will need to re-generate the training data once a day, week, or month.

Generating training data is a critical step that you will need to iterate on. A manual approach like the one I described (which is sadly the most used in practice by ML engineers and data scientists) is not optimal.

Is there a better way to solve problems 1, 2, 3 and generate training data better and faster?

Yep!

The solution

The solution I came up with is based on the principle of divide (automate 🤖) and conquer.

Step 1: Split up the SQL query into smaller queries

The problem of trying to fetch all the training data with one query execution is two-fold:

  • Chances are the query will not work, because the amount of data you want to get at once is too large for the DB to handle.
  • Even if the query runs today, there is no guarantee it will work tomorrow (when you re-train the model), as the data will have grown. Or the day after. Eventually, it will break.

A trick that our friends, the data engineers, use constantly is to query the data smartly. Instead of one query that fetches all the data, we run N queries that extract the data in N chunks.

For example, the DB of an e-commerce site will have at least one table with all historical orders, with a few fields like the purchased item id, the dollar amount paid, and the time of the purchase. Instead of fetching all the data at once:

-- one big query
SELECT
    price,
    item,
    ...
    time
FROM
    orders_table

You can fetch the same data in N steps with a sequence of queries:

-- one small query
SELECT
    price,
    item,
    ...
    time
FROM
    orders_table
WHERE
    DATE(time) = '2021-07-20' -- << replace with each of the N days you want to fetch

Running N smaller queries is scalable and removes volume bottlenecks in the DB. But how can you automate the execution of N queries? I guess you do not want to copy and paste N queries into your DB IDE or CLI to get the data, right?

This takes us to step 2 of our solution.

Step 2: Automate query execution with Python

Python is the language you are probably using for your machine learning project. It is also a great language for quickly automating boring tasks, like pulling data from a DB into CSV files.

Python also has a rich ecosystem of DB client libraries that let you run queries against your DB from a Python script. Some of the ones I have used extensively in the past are pyodbc for SQL Server and psycopg2 for Postgres/Redshift.
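By the way, the helper get_data_from_db that appears in the snippets below is never defined in the article. Here is a minimal sketch of what it could look like, built on pandas.read_sql_query; the demo runs against an in-memory SQLite DB standing in for the real warehouse, and the table and column names are made up:

```python
import sqlite3

import pandas as pd

def get_data_from_db(query: str, conn) -> pd.DataFrame:
    """Run a SQL query against an open DB connection and return a DataFrame."""
    return pd.read_sql_query(query, conn)

# demo: an in-memory SQLite DB standing in for the real warehouse
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_table (price REAL, item TEXT, time TEXT)")
conn.executemany(
    "INSERT INTO orders_table VALUES (?, ?, ?)",
    [(9.99, "book", "2021-07-20 10:00:00"),
     (4.50, "pen", "2021-07-21 12:30:00")],
)
conn.commit()

# fetch one day's chunk (SQLite uses DATE(); your warehouse's syntax may differ)
df = get_data_from_db(
    "SELECT price, item FROM orders_table WHERE DATE(time) = '2021-07-20'",
    conn,
)
print(df)  # only the 2021-07-20 order
```

With a real Redshift or Postgres cluster you would create the connection with psycopg2.connect(...) instead; pandas.read_sql_query works the same way on top of it.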

Having said this, an easy way to automate the execution of N queries is to put together a query string template and a simple for loop in a data.py script:

# data.py
from pathlib import Path

import pandas as pd

QUERY_TEMPLATE = """
SELECT
    ... features, target, ts
FROM
    your_table
WHERE
    ts BETWEEN '{date}' AND DATEADD(day, 1, '{date}')
"""

def generate(from_date: str, to_date: str):
    """
    Fetches data from the DB, day by day, and stores it in separate CSV files.
    Then, compacts all the CSV files into a final training set.
    """
    # path where all the downloaded csv files are stored
    download_dir = Path(config['DATA_DIR']) / 'downloads'

    # list of days we want to download data for
    dates = [d.strftime("%Y-%m-%d") for d in pd.date_range(from_date, to_date)]

    for date in dates:
        # replace placeholder {date} in QUERY_TEMPLATE
        query = QUERY_TEMPLATE.format(date=date)

        # fetch data from the DB for this date
        data = get_data_from_db(query)

        # save data in a csv file
        data.to_csv(download_dir / f'{date}.csv', index=False)

    # concatenate data for all dates
    all_data = pd.concat([pd.read_csv(download_dir / f'{date}.csv') for date in dates])

    # and save it as a single csv file
    file_path = Path(config['DATA_DIR']) / f'training_data_{from_date}_{to_date}.csv'
    all_data.to_csv(file_path, index=False)

This solution works. However, it also repeats a lot of unnecessary work. Data that was previously downloaded and stored in CSV files should not be re-downloaded, unless there was a bug or a change in the SQL query.

Let’s take care of these details in the last step.

Step 3: Wrap it up

To avoid re-processing dates, we can add an overwrite parameter.

Sometimes you want to force an overwrite. E.g. you find a bug in your SQL query and want to re-generate all files again. In this case, you set overwrite=True.

And sometimes you just want to download files for the missing dates to expand your training data. E.g. when you re-train the model in production, you do not need to re-download data for all dates, only for the dates since you last trained the model. In this case, you set overwrite=False.

Here is the code:

def generate(from_date: str, to_date: str, overwrite: bool):
    """
    Fetches data from the DB, day by day, and stores it in separate CSV files.
    Then, compacts all the CSV files into a final training set.
    If overwrite = False, do not process dates that were already downloaded.
    """
    # path where all the downloaded csv files are stored
    download_dir = Path(config['DATA_DIR']) / 'downloads'
    # create it if it does not exist yet
    download_dir.mkdir(parents=True, exist_ok=True)

    # list of days we want to download data for
    dates = [d.strftime("%Y-%m-%d") for d in pd.date_range(from_date, to_date)]

    for date in dates:

        output_path = download_dir / f'{date}.csv'
        if output_path.exists() and not overwrite:
            # skip dates that were already downloaded
            continue

        # fetch data from db for this date
        data = get_data_from_db(date=date)

        # save data in a csv file
        data.to_csv(output_path, index=False)

    # concatenate data for all dates
    all_data = pd.concat([pd.read_csv(download_dir / f'{date}.csv') for date in dates])

    # and save it as csv file
    file_path = Path(config['DATA_DIR']) / f'training_data_{from_date}_{to_date}.csv'
    all_data.to_csv(file_path, index=False)

As a final touch, I am a big fan of the tqdm library. It lets you print awesome progress bars that show the iteration speed and the estimated time to completion.

Let’s add one 🙂

from tqdm import tqdm

def generate(from_date: str, to_date: str, overwrite: bool):
    """
    Fetches data from the DB, day by day, and stores it in separate CSV files.
    Then, compacts all the CSV files into a final training set.
    If overwrite = False, do not process dates that were already downloaded.
    """
    # path where all the downloaded csv files are stored
    download_dir = Path(config['DATA_DIR']) / 'downloads'
    # create it if it does not exist yet
    download_dir.mkdir(parents=True, exist_ok=True)

    # list of days we want to download data for
    dates = [d.strftime("%Y-%m-%d") for d in pd.date_range(from_date, to_date)]

    pbar = tqdm(dates)
    for date in pbar:

        pbar.set_description(f'Processing {date}')
        
        output_path = download_dir / f'{date}.csv'
        if output_path.exists() and not overwrite:
            continue

        # fetch data from db for this date
        data = get_data_from_db(date=date)

        # save data in a csv file
        data.to_csv(output_path, index=False)

    # concatenate data for all dates
    all_data = pd.concat([pd.read_csv(download_dir / f'{date}.csv') for date in dates])

    # and save it as csv file
    file_path = Path(config['DATA_DIR']) / f'training_data_{from_date}_{to_date}.csv'
    all_data.to_csv(file_path, index=False)
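To hook generate into a re-training pipeline, one option (my own sketch, not from the article) is to expose it as a command-line entry point in data.py and let a scheduler such as cron call it; the flag names below are assumptions:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Command-line interface for data.py, so a scheduler can invoke it."""
    parser = argparse.ArgumentParser(description="Generate training data")
    parser.add_argument("--from-date", required=True, help="first day, YYYY-MM-DD")
    parser.add_argument("--to-date", required=True, help="last day, YYYY-MM-DD")
    parser.add_argument(
        "--overwrite", action="store_true",
        help="re-download days that already have a CSV (e.g. after a query fix)",
    )
    return parser

# at the bottom of data.py you would then call:
#   args = build_parser().parse_args()
#   generate(args.from_date, args.to_date, args.overwrite)

# demo: parse a sample command line
args = build_parser().parse_args(["--from-date", "2021-07-01", "--to-date", "2021-07-31"])
print(args.from_date, args.to_date, args.overwrite)
```

Scheduling `python data.py --from-date ... --to-date ...` once a day then keeps the training set fresh, and the overwrite=False default means only the new days get downloaded.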

You can find all the code in this repository.

Conclusion

So, this is one of my strategies to increase productivity as an ML engineer.

With this nice data.py script you will be able to:

  • Add new features to your training set, faster.
  • Debug and re-generate the dataset, faster.
  • And ultimately train ML models and create products faster and better.

I hope this helps, folks. If you have any questions, feel free to get in touch.

And if you want to learn more about real-world ML tricks and tips and become a better Machine Learning developer, subscribe to my newsletter.

Have a great day!