9 Cleaning data

9.1 Getting the data

The dataset used throughout this book was originally curated by a group at Stanford (Maas et al. 2011). There are several ways to get your hands on a dataset: you can curate one yourself or rely on the work of others. This dataset can be found at https://ai.stanford.edu/~amaas/data/sentiment, but for simplicity's sake it's downloadable from this github repo.

Let’s grab the data!

import tarfile
from pathlib import Path

import requests

tar_path = Path("aclImdb_v1.tar.gz")
if not tar_path.exists():
    r = requests.get(
        "https://github.com/spenceforce/NLP-Simple-to-Spectacular/releases/download/aclImdb_dataset/aclImdb_v1.tar.gz",
        stream=True,
    )
    with tar_path.open("wb") as f:
        for chunk in r.iter_content(chunk_size=128):
            f.write(chunk)

with tarfile.open(tar_path) as tar:
    tar.extractall(filter="data")
    data_path = Path("aclImdb")
    # The untarred directory is `aclImdb` instead of `aclImdb_v1`.

This extracts the data from the tarball then assigns the output directory as a pathlib.Path object to data_path.

I highly recommend checking out the pathlib library if you don’t already use it. It provides a nice object oriented API for handling file paths.
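
For instance, Path objects let you build paths with the / operator and inspect their pieces without any string munging. Here's a tiny illustration (assuming the tarball above has been extracted):

reviews_dir = Path("aclImdb") / "train" / "pos"  # join paths with /
reviews_dir.parent    # PosixPath('aclImdb/train')
reviews_dir.name      # 'pos'
reviews_dir.is_dir()  # True if the dataset was extracted as shown above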

There are a lot of datasets out there. Some better than others. Before trying to predict anything, it’s a good idea to get a sense of what the dataset actually contains. Hopefully the datasets you’re using are well documented. The dataset we’re working with comes with a README which we can see in the directory contents.

Don’t worry if the paths shown in this notebook are different from the ones you see on your compute resource. The structure of the data directory should be the same, and by leveraging relative paths with Path objects, everything should just work.
list(data_path.iterdir())
[PosixPath('aclImdb/.ipynb_checkpoints'),
 PosixPath('aclImdb/imdbEr.txt'),
 PosixPath('aclImdb/README'),
 PosixPath('aclImdb/train'),
 PosixPath('aclImdb/imdb.vocab'),
 PosixPath('aclImdb/test')]

If you're working through this notebook in Jupyter, I recommend pausing to read the README yourself.

I’ll summarize the main takeaways here for those of you reading this online.

9.1.1 Dataset specifics

There are 50k reviews, split into two groups of 25k. One group is for training machine learning models and the other is for testing them. Within each group of 25k, the reviews are split into half positive and half negative.

There’s at most 30 reviews for a given movie. The train and test sets have reviews for different movies, so none of the movies reviewed in the train set show up in the test set. Negative reviews have a score of 4 or less and positive reviews have a score of 7 or more.

There are an additional 50k reviews without any labels. These reviews are intended for unsupervised learning purposes (we’ll learn about unsupervised learning in the future). There is an equal number of reviews with a score of 4 or less and 5 or more.

9.1.2 Directory structure

The classification dataset file naming convention is [DATASET]/[LABEL]/[ID]_[RATING].txt where DATASET is one of train or test, LABEL is one of pos or neg (for positive and negative respectively), ID is a numeric identifier for a review, and RATING is the score the reviewer gave the movie.

A word of warning for JupyterLab users: Jupyter crashed when I tried to open the directories containing the reviews (like train/pos). My guess is there are too many files for Jupyter to display and it becomes unresponsive. Your mileage may vary.

The unsupervised dataset file naming convention is train/unsup/[ID]_0.txt where ID is a numeric identifier for a review.
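
To make the naming convention concrete, here's a quick sketch of pulling the ID and rating out of a review's file name; the loading code later in this chapter uses exactly this split.

p = Path("train/pos/200_10.txt")  # an example path following the convention above
ID, rating = map(int, p.stem.split("_"))
ID, rating  # (200, 10)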

URLs for the review sections are also provided in [DATASET]/urls_[LABEL].txt. Line N in one of these files refers to ID N in the associated dataset/label combination. For example, line 200 in train/urls_pos.txt refers to the review page for the movie reviewed in train/pos/200_10.txt. Here's an example URL from one of these files: http://www.imdb.com/title/tt0064354/usercomments. It turns out IMDB has changed its URL format, so this link is broken. They now use "reviews" in place of "usercomments", like http://www.imdb.com/title/tt0064354/reviews. An oddity in the URLs is that they do not point to the individual review, but to the review section of the movie the review is about.

By the time you read this, they may have changed the URL structure again.
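
If you'd like to check the mapping between review IDs and line numbers yourself, something like this should do it (it uses the same zero-based line numbering the loading code below relies on):

urls = (data_path / "train" / "urls_pos.txt").read_text().splitlines()
urls[200]  # the review section URL for the movie reviewed in train/pos/200_10.txt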

9.2 Question the dataset

An integral part of building a machine learning model is understanding where the data comes from and how it was acquired. There may be assumptions made during the data curation process that you don’t agree with. An artefact in the data may have unintended side effects downstream, such as training a model that works very well on data it’s seen but performs poorly on data it hasn’t.

As you read about the dataset and its curation process, I recommend you keep asking "why?" to better understand the choices that were made. Why did they use those specific thresholds for choosing positive vs. negative labels? Why that ratio of positive to negative labels? Why that many reviews per movie? Why set up the train/test sets that way?

In parallel, also think about how you are going to leverage this data. Your end goal may be different from the curators of the dataset and that should be taken into account as you prepare the data for training machine learning models.

Here are a couple of questions that come to my mind.

  • Why isn’t there a dataset for multi-label classification?
  • Why a maximum of 30 reviews per movie? Why not 10 or 50?
  • What was the rationale for picking the movies? Was it random or spread evenly by genre?
  • Were the movies made during a certain time period?

Some of these questions we could answer ourselves if we wanted to, and they can lead to a richer set of information. IMDB provides an API to programmatically gather movie data over the web. With that tool one can gather movie genres, release dates, associated actors, and more. If we never ask the questions, we'll never think to look. Maybe we curate our own dataset because this one doesn't provide what we need, or maybe the questions lead to further analysis and the dataset is tweaked based on what we find. We will stay as true to the source dataset as possible, but don't let that stop you from thinking about ways to improve this dataset or curate an entirely new one.

OMDb API is another API that provides IMDB metadata. I found their service to be much more transparent on pricing and easier to get access to.

9.2.1 Cleaning the data

The boundaries between cleaning data and analyzing data can be fuzzy at times, but for sake of simplicity we will only analyze the data as needed to properly clean it.

The pieces of information provided in this dataset are:

  • ID
  • review
  • rating
  • label
  • movie ID

The movie ID isn't explicitly provided, but we can extract it from the review URLs, which have the format http://www.imdb.com/title/[MOVIE_ID]/usercomments.
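
Pulling the movie ID out of one of those URLs is a one-liner; the loading function below does the same thing for every line of the urls files.

"http://www.imdb.com/title/tt0064354/usercomments".split("/")[4]  # 'tt0064354'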

Let’s gather the training set which is under train/pos and train/neg.

import pandas as pd


def get_review_data(reviews_dir, urls_file):
    """
    Return a `pd.DataFrame` containing the review ID,
    movie ID, rating, and review.
    """
    with urls_file.open() as f:
        movie_ids = {
            i: url.split("/")[4]
            for i, url in enumerate(f.readlines())
        }

    data = []
    for p in reviews_dir.iterdir():
        ID, rating = map(int, p.stem.split("_"))
        data.append(
            {
                "id": ID,
                "movie_id": movie_ids[ID],
                "rating": rating,
                "review": p.open().read().strip(),
            }
        )

    return pd.DataFrame(data)


def get_train_data():
    """
    Return a `pd.DataFrame` with the supervised training data.
    """
    dfs = []
    for label, label_name in enumerate(["neg", "pos"]):
        df = get_review_data(
            data_path / "train" / label_name,
            data_path / f"train/urls_{label_name}.txt",
        )
        df["label"] = label
        dfs.append(df)
    return pd.concat(dfs)


train_df = get_train_data()
train_df.head()
id movie_id rating review label
0 7275 tt0082799 1 "National Lampoon Goes to the Movies" (1981) i... 0
1 1438 tt0397501 4 Well! What can one say? Firstly, this adaptati... 0
2 9137 tt0364986 1 What can I say, this is a piece of brilliant f... 0
3 173 tt0283974 3 A decent sequel, but does not pack the punch o... 0
4 8290 tt0314630 2 Alan Rudolph is a so-so director, without that... 0

Those look like reviews! Now that we have data in hand to play with, what questions can we answer about it? Maybe we should verify the information provided by the curators of this dataset.

  • Are there 25k reviews in the train set?
  • Are they evenly split between positive and negative?
  • Is the max number of reviews per movie 30?

We can use DataFrame.info() to answer the first question. This method gives some general information about the dataframe, including the column names, number of non-null values in each column, column data types, and the number of rows.

Null values include None and NaN. NaN, or Not a Number, represents a value that is undefined, such as the result of dividing by 0.
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 25000 entries, 0 to 12499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        25000 non-null  int64 
 1   movie_id  25000 non-null  object
 2   rating    25000 non-null  int64 
 3   review    25000 non-null  object
 4   label     25000 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB

This dataframe contains 25k entries, and all of those entries have non-null reviews. So the train set does contain 25k reviews. But we know nothing about the quality of these reviews. Are there duplicates? Empty strings? Are some complete gibberish? It’s impossible to qualitatively check every data point, but checking some basic properties of the data can go a long way. I like to call these sanity checks. Let’s start with empty strings.

train_df["review"].str.len().min()
np.int64(52)

The shortest review has 52 characters, so there are no empty reviews. Let's turn our attention to duplicates. There are many ways to check this. We'll use DataFrame.describe.

train_df.describe(include=[object])
movie_id review
count 25000 25000
unique 3456 24904
top tt0374240 This show comes up with interesting locations ...
freq 30 3

The movie_id column shows that the most frequent movie ID shows up 30 times, so we know there are at most 30 reviews for a given movie in this dataset.

The include=[object] argument tells describe to look at columns with the object data type instead of numeric data types. Strings fall under the object type in pandas.

The review column has a count of 25k, but 24,904 unique entries, which means there are 96 duplicate reviews. From a training perspective it may not make sense to have duplicate items in the train set; odds are we don't want them, but if they have conflicting labels we may want to keep them after all. Let's take a peek at them to see if they belong to the same movie and have the same ratings and labels. We'll create a new dataframe that contains only duplicated reviews.

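# keep=False flags every copy of a repeated review, not just the
# occurrences after the first, so the mask selects all duplicates.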
is_duplicate = train_df["review"].duplicated(keep=False)
duplicate_reviews = train_df[is_duplicate]

Now that we have the duplicated reviews, let’s dig into them a bit. Are they for the same movie? Do they have the same rating or label?

duplicate_nunique = (
    duplicate_reviews.groupby("review").agg("nunique")
)
duplicate_nunique.head(3)
id movie_id rating label
review
'Dead Letter Office' is a low-budget film about a couple of employees of the Australian postal service, struggling to rebuild their damaged lives. Unfortunately, the acting is poor and the links between the characters' past misfortunes and present mindsets are clumsily and over-schematically represented. What's most disappointing of all, however, is the portrayal is life in the office of the film's title: there's no mechanisation whatsoever, and it's quite impossible to ascertain what any of the staff really do for a living. Granted, part of the plot is that the office is threatened with closure, but this sort of office surely closed in the 1930s, if it ever truly existed. It's a shame, as the film's overall tone is poignant and wry, and there's some promise in the scenario: but few of the details convince. Overall, it feels the work of someone who hasn't actually experienced much of real life; a student film, with a concept and an outline, but sadly little else. 2 2 1 1
.......Playing Kaddiddlehopper, Col San Fernando, etc. the man was pretty wide ranging and a scream. I love watching him interact w/ Amanda Blake, or Don Knotts or whomever--he clearly was having a ball and I think he made it easier on his guests as well--so long as they Knew ahead of time it wasn't a disciplined, 19 take kind of production. Relax and be loose was clearly the name of the game there.<br /><br />He reminds me of guys like Milton Berle, Benny Hill, maybe Jerry Lewis some too. Great timing, ancient gags that kept audiences in stitches for decades, sheer enjoyment about what he was doing. His sad little clown he played was good too--but in a touching manner.<br /><br />Personally I think he's great, having just bought a two DVD set of his shows from '61 or so, it brings his stuff back in a fond way for me. I can remember seeing him on TV at the end of his run when he was winding up the series in 1971 or so.<br /><br />Check this out if you are a fan or curious. He was a riot. 2 2 1 1
<br /><br />Back in his youth, the old man had wanted to marry his first cousin, but his family forbid it. Many decades later, the old man has raised three children (two boys and one girl), and allows his son and daughter to marry and have children. Soon, the sister is bored with brother #1, and jumps in the bed of brother #2.<br /><br />One might think that the three siblings are stuck somewhere on a remote island. But no -- they are upper class Europeans going to college and busy in the social world.<br /><br />Never do we see a flirtatious moment between any non-related female and the two brothers. Never do we see any flirtatious moment between any non-related male and the one sister. All flirtatious moments are shared between only between the brothers and sister.<br /><br />The weakest part of GLADIATOR was the incest thing. The young emperor Commodus would have hundreds of slave girls and a city full of marriage-minded girls all over him, but no -- he only wanted his sister? If movie incest is your cup of tea, then SUNSHINE will (slowly) thrill you to no end. 2 1 1 1

groupby groups rows with the same value in the review column together. The return value of this operation is a DataFrameGroupBy. This special object performs operations on each group as if it were its own dataframe, instead of on all rows of the dataframe at once. agg normally performs an operation on the entire dataframe, but since this is a DataFrameGroupBy object, the agg method is applied to each group. In this case it counts the number of unique values in each column for each unique review.
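
If the groupby/agg combination is new to you, here's a minimal toy example with made-up data showing the same pattern:

toy = pd.DataFrame(
    {
        "review": ["great film", "great film", "awful"],
        "movie_id": ["tt001", "tt002", "tt003"],
        "rating": [9, 9, 2],
    }
)
# One row per unique review; each cell counts the unique values in that group.
toy.groupby("review").agg("nunique")

The row for "great film" shows 2 unique movie IDs but only 1 unique rating, which is the same pattern we see in the output above.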

Some of these reviews have two movie IDs. Let’s inspect the two reviews above that have multiple movie IDs.

is_first2 = duplicate_reviews["review"].isin(
    duplicate_nunique.index[:2]
)
duplicate_reviews[is_first2]
id movie_id rating review label
10893 985 tt0223119 4 'Dead Letter Office' is a low-budget film abou... 0
12445 4102 tt0118939 4 'Dead Letter Office' is a low-budget film abou... 0
101 6069 tt0163806 8 .......Playing Kaddiddlehopper, Col San Fernan... 1
5458 9319 tt0043224 8 .......Playing Kaddiddlehopper, Col San Fernan... 1

Remember, we are provided with the URL of the review section for each review's movie, which is where we extracted the movie ID from. This means we can go backwards from movie ID to movie URL. One review is for the movie at the URLs http://www.imdb.com/title/tt0223119/reviews and http://www.imdb.com/title/tt0118939/reviews, and the other at http://www.imdb.com/title/tt0163806/reviews and http://www.imdb.com/title/tt0043224/reviews. When I click on each pair, one gets redirected to the other. Movie IDs tt0223119 and tt0118939 are for Dead Letter Office, and tt0163806 and tt0043224 are for The Red Skelton Hour. Although they have different movie IDs, they are reviews for the same movie and therefore are truly duplicate reviews. It turns out each movie can have multiple movie IDs; we call this a one-to-many relationship.

I checked these links in August 2024. The URL endpoints may have changed by the time you read this.
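
If you'd rather not click through by hand, requests can follow the redirect for you. This is just a sketch; it assumes IMDB still redirects duplicate IDs to the main one and doesn't block the request.

r = requests.get("http://www.imdb.com/title/tt0223119/reviews")
r.url  # should land on the tt0118939 review page if the redirect still works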

While it's not incredibly important, The Red Skelton Hour is actually a TV show. It turns out this dataset contains reviews for both movies and TV shows. See what a little digging can turn up?

We could check every single movie ID for duplicate reviews manually (or have an intern do it), since it's possible there are duplicate reviews for different movies, but our task is to predict a label (positive or negative) given a review. Our models don't need to see the same review over and over again in order to make predictions about it…unless duplicate reviews have different labels! Why would we want to train on both examples in that case? If the same review can be positive or negative, then training a machine learning model on both examples will teach the model to be uncertain about reviews where the language is more ambiguous.

Let's see if there are any duplicate reviews with multiple labels or ratings.

different_labels = (duplicate_nunique["label"] > 1) | (
    duplicate_nunique["rating"] > 1
)
duplicate_nunique[different_labels].head()
id movie_id rating label
review

There aren't, which means these duplicates stem from duplicate movie IDs. Since the labels are the same for each duplicate, I feel confident just removing the duplicate entries. I've taken the liberty of performing this analysis on the test set and found the same issue, so we'll remove duplicates from the test set as well.

def get_review_data(reviews_dir, urls_file, dedup=False):
    """
    Return a `pd.DataFrame` containing the review ID,
    movie ID, rating, and review.
    """
    with urls_file.open() as f:
        movie_ids = {
            i: url.split("/")[4]
            for i, url in enumerate(f.readlines())
        }

    data = []
    for p in reviews_dir.iterdir():
        ID, rating = map(int, p.stem.split("_"))
        data.append(
            {
                "id": ID,
                "movie_id": movie_ids[ID],
                "rating": rating,
                "review": p.open().read().strip(),
            }
        )

    rv = pd.DataFrame(data)
    if dedup:
        return rv.drop_duplicates("review").copy()
    return rv


def get_df(dataset, dedup=False):
    """Return a `pd.DataFrame` for a dataset."""
    dfs = []
    for label, label_name in enumerate(["neg", "pos"]):
        df = get_review_data(
            data_path / dataset / label_name,
            data_path / dataset / f"urls_{label_name}.txt",
            dedup,
        )
        df["label"] = label
        dfs.append(df)
    return pd.concat(dfs)


def get_train_data(dedup=True):
    """
    Return a `pd.DataFrame` with the supervised training data.
    """
    return get_df("train", dedup)


def get_test_data(dedup=True):
    """
    Return a `pd.DataFrame` with the supervised testing data.
    """
    return get_df("test", dedup)


train_df = get_train_data()
train_df.groupby("label").describe(include="object")[
    ("review", "count")
]
label
0    12432
1    12472
Name: (review, count), dtype: object
train_df.shape
(24904, 5)

We don’t quite have 25k reviews split evenly across labels, but it’s close enough. I think this is in good shape and we can turn our attention to the test set.

test_df = get_test_data()
test_df.groupby("label").describe(include="object")[
    ("review", "count")
]
label
0    12361
1    12440
Name: (review, count), dtype: object
test_df.shape
(24801, 5)

We've removed about 200 reviews from the test set and can now start comparing the test set to the train set.

9.2.2 Train-test contamination

Now let's talk a bit about what the test set is used for. It's a separate set of data used to evaluate a machine learning model after it's trained, measuring the model's performance on data it has never seen before. This is important because when a model is used in production, it will be making predictions about all kinds of inputs it wasn't trained on, and it needs to generalize well beyond the training data. We do not want data leaking from the train set into the test set.

Easy enough to check.

leak_df = test_df[test_df["review"].isin(train_df["review"])]
leak_df.shape
(123, 5)

There are 123 reviews from the train set in the test set.

Let’s think about that for a second. The reviews in the test set should be for movies that aren’t reviewed in the train set, but we have 123 reviews in the test set that are duplicates of those in the train set. How could this happen?

Remember that we saw duplicate reviews in the train set. This was a result of the same movie having multiple movie IDs. That could explain what’s happening here. Maybe the reviews in both the train and test sets have different movie IDs. We can test that!

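# Overlapping column names (like movie_id) get _x/_y suffixes for the
# left (train) and right (test) frames after the merge.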
both_df = train_df.merge(test_df, on="review")
(both_df["movie_id_x"] == both_df["movie_id_y"]).any()
np.False_

Bingo. By aligning the reviews with merge() we can directly compare movie IDs for duplicate reviews, and none of them are the same. When this dataset was curated, the one-to-many relationship between movies and movie IDs in IMDB wasn't accounted for, resulting in not just duplicate reviews within the train and test sets independently, but also leakage of data from the train set into the test set. It gets worse though. We could remove the training reviews that leaked into the test set, but the test set should only contain reviews for movies that aren't reviewed in the train set at all. Because of the relationship between movies and movie IDs, we have to further process the dataset to check whether the movies reviewed in the test set are all different from those in the train set, even after removing duplicates.

Why is it important that there are no overlapping movies when the reviews are different? The curators of this dataset give a good explanation in the README they provide,

“In the entire collection, no more than 30 reviews are allowed for any given movie because reviews for the same movie tend to have correlated ratings. Further, the train and test sets contain a disjoint set of movies, so no significant performance is obtained by memorizing movie-unique terms and their associated with observed labels.” (Maas et al. 2011)

Our goal is to train machine learning models that recognize general patterns of language to predict positive or negative sentiment for a given piece of text. If all the reviews in the train set for Batman Begins are positive, the model may learn to associate words like "Batman", "Bruce Wayne", "Christian Bale", and "Christopher Nolan" with positive sentiment. This doesn't generalize to other movies. If reviews for Batman Begins are in the test set, the model will likely predict those reviews correctly, improving its performance on the test set. The test set is supposed to be unbiased, but that goes out the window when data from the train set leaks into the test set.

Note

There are a few forms of data leakage, but what I've described here is often called "train-test contamination": information specific to the train set also shows up in the test set. In our case it's duplicate reviews and movie-specific terms. It's a real problem in machine learning, and it can make AI look better than it really is. Finding it in datasets can be tricky, especially as datasets get bigger and bigger. Thought needs to go into preventing sources of leakage, and you may never catch every instance. For example, there's another source of leakage that wasn't considered for this dataset: the train and test sets should contain reviews from a disjoint set of reviewers, because individual reviewers may use specific terms or have unique speech patterns that don't generalize to all reviewers, and a machine learning model could learn that.

First we’ll remove the leakage of reviews from the train set into the test set.

def get_train_test_data():
    """Return train and test `pd.DataFrame`s."""
    train_df = get_train_data()
    test_df = get_test_data()
    test_df = test_df[
        ~test_df["review"].isin(train_df["review"])
    ].copy()
    return train_df, test_df

Now, let's handle the issues that come from the one-to-many relationship of movies and movie IDs. The way movie IDs work in IMDB is that there's one main ID for a movie, and other movie IDs for that movie point to the main ID. We saw this earlier when we looked at URLs for duplicate reviews. When you visit those URLs, the duplicate movie ID redirects to the main one. Try it with these URLs: http://www.imdb.com/title/tt0223119/reviews and http://www.imdb.com/title/tt0118939/reviews. Notice how the first link redirects to the second. That's because tt0118939 is the main ID for that movie and tt0223119 is a duplicate ID that points to it.

I've taken the liberty of creating a CSV file that maps every movie ID in this dataset to its corresponding main movie ID. We'll use it to replace the movie IDs in our train and test sets, then check how many reviews there are per movie and whether there's any overlap of movies between the train and test sets.

If you'd like to see details on how I did this, check out this README.
def get_review_data(reviews_dir, urls_file, dedup=False):
    """
    Return a `pd.DataFrame` containing the review ID,
    movie ID, rating, and review.
    """
    with urls_file.open() as f:
        movie_ids = {
            i: url.split("/")[4]
            for i, url in enumerate(f.readlines())
        }

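    # Map each movie ID to its main ID. (This assumes the CSV has two
    # columns: a movie ID followed by the main ID it points to.)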
    movie_id_map = dict(pd.read_csv("all_movie_ids.csv").values)

    data = []
    for p in reviews_dir.iterdir():
        ID, rating = map(int, p.stem.split("_"))
        data.append(
            {
                "id": ID,
                "movie_id": movie_id_map[movie_ids[ID]],
                "rating": rating,
                "review": p.open().read().strip(),
            }
        )

    rv = pd.DataFrame(data)
    if dedup:
        return rv.drop_duplicates("review").copy()
    return rv

Are there any movies in the original train set with more than 30 reviews?

get_train_data(dedup=False)["movie_id"].value_counts().head()
movie_id
tt0235326    32
tt0169528    32
tt0171363    30
tt0187078    30
tt0049223    30
Name: count, dtype: int64

Yes, there are. And what about in the deduplicated train set?

get_train_data(dedup=True)["movie_id"].value_counts().head()
movie_id
tt0118480    30
tt0023775    30
tt0049223    30
tt0277941    30
tt0284978    30
Name: count, dtype: int64

No, there aren't. So deduplication actually brings the maximum number of reviews per movie down to 30, which was the original intent of this dataset. Let's repeat this for the test set.

get_test_data(dedup=False)["movie_id"].value_counts().head()
movie_id
tt0239496    60
tt0152015    48
tt0202381    41
tt0108915    40
tt0126810    33
Name: count, dtype: int64

By comparison that’s pretty dramatic. We see one movie with 60 reviews and a couple with over 40!

get_test_data(dedup=True)["movie_id"].value_counts().head()
movie_id
tt0365830    30
tt0373024    30
tt0365513    30
tt0079261    30
tt0086050    30
Name: count, dtype: int64

But it’s the same story when we deduplicate the reviews. It seems the movies with more than 30 reviews are due to duplicate reviews caused by movies with multiple movie IDs. This is good news because the deduplication takes care of the overrepresented movies for us.

This leaves one more question about train-test contamination though. Are there any movies with reviews in both the train and test sets?

train_df, test_df = get_train_test_data()
test_df["movie_id"].isin(train_df["movie_id"]).any()
np.True_

That is unfortunate. After removing reviews from the test set that appear in the train set, we’re still left with reviews in the test set for at least one movie that is reviewed in the train set. Let’s dig in a little.

overlapping_movies = test_df[
    test_df["movie_id"].isin(train_df["movie_id"])
]
overlapping_movies.shape
(103, 5)

There are 103 reviews in the test set that shouldn't be there because their associated movie is reviewed in the train set. How many movies are we talking about?

len(overlapping_movies["movie_id"].unique())
7

103 reviews across 7 movies. Let’s also remove these from the test set as they can artificially inflate the benchmarking performance of our models. Instead of removing the duplicate training reviews from the test set, we can filter out reviews with the same movie ID. This will capture duplicates from the same movie.

def get_train_test_data():
    """Return train and test `pd.DataFrame`s."""
    train_df = get_train_data()
    test_df = get_test_data()
    test_df = test_df[
        (~test_df["movie_id"].isin(train_df["movie_id"]))
    ].copy()
    return train_df, test_df


train_df, test_df = get_train_test_data()
train_df.shape, test_df.shape
((24904, 5), (24576, 5))

We now have 24,904 reviews in the train set and 24,576 in the test set. By removing all reviews from the test set with a movie ID seen in the train set, that should handle duplicate reviews across these groups as well as duplicate movies, but there’s still something wrong with our test set. Our original test set was 25k. We removed 199 duplicate reviews in the test set, then 123 reviews seen in the train set, then 103 reviews with movie IDs seen in the train set. Adding that up doesn’t give us 24,576…

25000 - 199 - 123 - 103
24575

Our test set has one too many reviews. That's because we didn't actually remove all of the duplicate reviews in the test set that appear in the train set; we only removed the reviews whose movie IDs appear in the train set. There's one review duplicated between the train and test sets that is attached to different movies, so filtering by movie ID alone missed it!

test_df[test_df["review"].isin(train_df["review"])]
id movie_id rating review label
10020 12159 tt0182766 8 There has been a political documentary, of rec... 1
train_df[train_df["review"].isin(test_df["review"])]
id movie_id rating review label
355 10643 tt0184773 8 There has been a political documentary, of rec... 1

It turns out these two movies are part of a documentary series. One documentary ended up in the train set, the other in the test set, and one reviewer happened to write the same review for both. All that's left is to remove from the test set both the reviews seen in the train set and the reviews whose movie IDs appear in the train set.

def get_train_test_data():
    """Return train and test `pd.DataFrame`s."""
    train_df = get_train_data()
    test_df = get_test_data()
    same_review = test_df["review"].isin(train_df["review"])
    same_movie = test_df["movie_id"].isin(train_df["movie_id"])
    test_df = test_df[~same_review & ~same_movie].copy()
    return train_df, test_df


train_df, test_df = get_train_test_data()
train_df.shape, test_df.shape
((24904, 5), (24575, 5))

Now we have 24,575 reviews in the test set. All is good and we can move on.

9.3 Reflection

With that, our data cleaning journey comes to an end. Yes, there is more that could be done, like ensuring no reviews from the same reviewer show up in the train and test sets, but we don’t have user IDs associated with these reviews. Besides, the process would look similar to what we’ve already done, just with a little more leg work. For our purpose of learning about NLP this dataset is fine.

We covered a lot of ground, and while we’re here I’d like to take a moment to reflect on what we found.

  • Duplicate reviews in both the train and test sets.
  • Reviews are for movies and TV shows.
  • More than 30 reviews for some movies.
    • Fortunately these were all duplicate reviews.
  • Train-test contamination.

I especially want to draw your attention to the train-test contamination. The amount of contamination in this dataset may be negligible when it comes to benchmarking machine learning models, I don't really know. But the fact that it's there, and that this dataset is provided by multiple deep learning libraries as well as used for benchmarking tasks in research (Howard and Ruder 2018), should make you pause. I've used datasets at face value without questioning them, and I guarantee people have taken this dataset at face value. These libraries provide the data as-is: 25k train and 25k test reviews, but we know there aren't really 25k unique reviews in each set. This is not a critique of the dataset or its curators, the libraries that provide it, or the researchers that use it. It is a reminder to verify the data is actually what you think it is, because we've seen here that it isn't always what it looks like.

Researchers may perform their own preprocessing of the data as we have here. I point to this paper as an example of researchers using the dataset because it is a popular dataset, not as an example of someone using it without preprocessing. In fact, I think this paper is so important it got its own chapter.

9.3.1 Keep asking “why?”

Now that we’ve reached the end, did we answer all the questions we set out to answer? Did you come up with other questions while we worked through this? If we created a similar dataset from scratch today, what would you do differently?

9.4 Unsupervised learning data

The IMDB dataset includes an unsupervised learning dataset. The unsupervised set has no ratings and no labels; it's just reviews. As with the train and test sets we've already gone over, the same principle of deduplication applies, and that's really all we need.

def get_unsup_data(dedup=True):
    """
    Return a `pd.DataFrame` with the unsupervised data.
    """
    rv = get_review_data(
        data_path / "train/unsup",
        data_path / "train/urls_unsup.txt",
        dedup,
    )
    rv.drop(columns="rating", inplace=True)
    # Drop the ratings column since every review in the
    # unsupervised set is given a rating of 0 regardless
    # of its actual rating.
    return rv


unsup_df = get_unsup_data()
unsup_df.shape
(49507, 3)
unsup_df["movie_id"].value_counts().head()
movie_id
tt0325596    30
tt0086856    30
tt0758053    30
tt0284850    30
tt0469062    30
Name: count, dtype: int64

Ok, we’re really done now. Thanks for bearing with me. Cleaning data is probably my least favorite part of machine learning because it can feel like busy work, but it’s so important. Even if you leave the dataset the way you found it, it’s a great opportunity to learn about the dataset before you do any modeling. I often find that data cleaning is an ongoing process as I build machine learning models because the models can point to oddities in the data I never saw during my initial exploration. You will find article after article about how machine learning works, with little discussion of how the data was prepared. I want you to walk away from this chapter knowing that cleaning data, analysis, and machine learning are all intertwined.