How to scrape NLP datasets from YouTube

Too lazy to scrape NLP data yourself?

Do you need text data for your next Natural Language Processing (NLP) project but cannot find the right dataset online?

In this post, I’ll show you a quick way to scrape NLP datasets from YouTube using Python. Everything is condensed into a simple script you can use right away.

Let’s go!

DATA, DATA, DATA

Machine learning (ML) programs are a combination of code and DATA (capital letters, yes).

DATA is the critical ingredient that determines whether an ML project succeeds or fails. And this is something we ML practitioners tend to forget. Fancy deep learning models are like Homer’s sirens: their sweet songs make many ML projects lose focus and fail.


I sometimes think that ML is too short of an acronym. Something like MLFD (Machine Learning From DATA) would be horrible to pronounce but would avoid a ton of frustration and failed projects.

Without DATA there is no ML.

Still, most ML engineers and data scientists have never scraped their own datasets. I was one of them until a couple of weeks ago, when I scraped my first text dataset. From YouTube. In the next section, I explain how.

YouTube is a mine of text

I wanted to fine-tune the gigantic GPT-2 model to generate funny text. But where could I find funny text?

I know a bunch of funny people, mostly stand-up comedians, but they do not publish books or posts. They talk, and more often than not you can find recordings of their shows on YouTube. Also, many YouTube videos have captions available, sometimes produced by Google’s ML-based speech-to-text system, and sometimes written by humans.

I would need to search YouTube for relevant videos (step 1) and scrape the captions for each of them (step 2).

For example, I would type “Hannah Gadsby stand up” (great comedian, by the way), open each of the search results, and download the captions.

This is how a slow human would do it. But, who has time for this?

Python is a great language for building automation tools. Also, YouTube is such a mature and popular product that Google offers an API that lets us search for content programmatically, without the need to write exotic Selenium scripts.

There are two good Python libraries for the job, one to look for videos based on a keyword and one to download the transcript given a video id:

- youtube-search-python, to search for videos matching a keyword
- youtube-transcript-api, to fetch the captions for a given video id

Code

Without further preamble, here is the code:

# data.py

from typing import List
import csv
import json
from argparse import ArgumentParser
from pathlib import Path

# pip install youtube-search-python
from youtubesearchpython import VideosSearch

# pip install youtube-transcript-api
from youtube_transcript_api import YouTubeTranscriptApi


def get_youtube_video_ids(keyword: str, limit: int = 10) -> List[str]:
    """
    Returns list of video ids we find for the given 'keyword'
    """
    video_search = VideosSearch(keyword, limit=limit)
    results = video_search.result()['result']
    return [r['id'] for r in results]


def get_youtube_video_transcript(video_id: str) -> str:
    """"
    Returns transcript of the given 'video_id'
    """
    try:
        transcript = YouTubeTranscriptApi.get_transcript(
            video_id, languages=['en-US', 'en']
        )
        utterances = [p['text'] for p in transcript]
        return ' '.join(utterances)

    except Exception:
        # no transcript available (or captions disabled) for this video
        return None


def save_transcripts(transcripts: List[str], keyword: str, path: Path):
    """
    Stores locally in file the transcripts with associated keyword
    """
    output = [{'keyword': keyword, 'text': t} for t in transcripts if t is not None]

    # check if path points to a csv or a json file
    if path.suffix == '.csv':
        # save as csv
        keys = output[0].keys()
        with open(path, 'w', newline='') as output_file:
            dict_writer = csv.DictWriter(output_file, keys)
            dict_writer.writeheader()
            dict_writer.writerows(output)

    else:
        # save as json
        with open(path, 'w') as output_file:
            json.dump(output, output_file)


if __name__ == '__main__':

    parser = ArgumentParser()
    parser.add_argument('--keyword', type=str, required=True)
    parser.add_argument('--n_samples', type=int, default=100)
    parser.add_argument('--output', type=Path, required=True)
    args = parser.parse_args()

    video_ids = get_youtube_video_ids(keyword=args.keyword, limit=args.n_samples)

    transcripts = [get_youtube_video_transcript(video_id) for video_id in video_ids]

    save_transcripts(transcripts, args.keyword, args.output)

You can invoke the script as follows:

$ python data.py --keyword "hannah gadsby stand up" --n_samples 10 --output ./example.json
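Once the script finishes, the JSON output is a list of {'keyword': ..., 'text': ...} records, matching what save_transcripts writes above. As a minimal sketch, assuming an output file like ./example.json, you could load it back into a plain text corpus like this:

```python
import json
from pathlib import Path
from typing import List


def load_corpus(path: Path) -> List[str]:
    """Load the records written by save_transcripts, keeping only the text."""
    with open(path) as f:
        records = json.load(f)  # list of {'keyword': ..., 'text': ...} dicts
    return [record['text'] for record in records]


# After running data.py, e.g.:
#   corpus = load_corpus(Path('./example.json'))
```

From here, the list of transcript strings is ready to feed into a tokenizer or a fine-tuning pipeline.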

You can find the complete code in this GitHub repo.
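If you pass an --output path ending in .csv instead, save_transcripts writes a header row followed by one keyword/text row per transcript. A minimal sketch of reading that file back with the standard library (the filename transcripts.csv is just an example):

```python
import csv
from pathlib import Path
from typing import Dict, List


def read_transcripts_csv(path: Path) -> List[Dict[str, str]]:
    """Read the CSV written by save_transcripts back into a list of dicts."""
    with open(path, newline='') as f:
        # each row comes back keyed by the header: 'keyword' and 'text'
        return list(csv.DictReader(f))
```

Each row comes back as a dict with the same 'keyword' and 'text' keys the script wrote, so both output formats load into the same shape.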

Conclusion

As a data scientist, you need to know how to get the data your models need. In this post, I shared a simple trick I used to scrape NLP data from YouTube, and one you can reuse in your next project.

If you want to read more about data science, machine learning, and freelancing, subscribe to the datamachines newsletter or check out my blog.

Have a great day!