1  Machine learning needs data

Without data, machine learning is nothing. After all, what would it learn if it has nothing to learn from? Think about everything you’ve learned throughout your life: all the books you’ve read, the movies you’ve watched, the experiences you’ve lived. Everything you’ve touched, tasted, heard, felt, and smelled comes together to shape who you are and what you know. Our brains adapt and change in response to all of this input. Our brains learn. Take away all of the memories and experiences and what are we left with? Thoughts, maybe, but of what? Without the context of our lives there isn’t much to think about. Machine learning models work like our brains, only simpler. They take data and try to make sense of it, to learn from it. The more data, the easier it is to learn, at least in theory.

1.1 The dataset

Our models will be using datasets of movie reviews from IMDB (Maas et al. 2011). The original datasets can be found at https://ai.stanford.edu/~amaas/data/sentiment.

Note

One of the most important chapters in this book is about cleaning data. It’s a bonus chapter at the end. It walks through the steps and analysis I performed to clean and prepare the dataset for this book. At this point it’s not important; the focus should be on building models as fast as possible and iterating on them. By the end of this book, you should feel comfortable with machine learning, at which point understanding where data comes from and how it’s prepared becomes useful. After all, we want to apply the things learned here to the real world, and that starts with real-world data.

It is still a bonus chapter and not required for understanding machine learning, but I found some surprising things when I cleaned this dataset. If you make it to the end, you really should read it.

I’ve taken the liberty of further cleaning the data and making it accessible through a Python API so we can get right to work on machine learning and NLP.

This book provides a conda environment file (see Section 2) with everything you need to run the code. If you want to access the dataset without setting up the conda environment, you can install it through pip. If you’re just following along on the book website, there’s nothing you need to do.

$ pip install git+https://github.com/spenceforce/NLP-Simple-to-Spectacular

Let’s get a feel for the datasets before we move on to machine learning. There are two movie review datasets available: one for classification and the other for unsupervised learning.

1.2 Classification dataset

This dataset contains movie reviews and labels indicating whether the review is positive (label 1) or negative (label 0). It is intended for benchmarking sentiment classification tasks. Sentiment classification is about predicting the feeling a text conveys, such as happiness, sadness, or anger. In this case it’s predicting whether a review says a movie is good or bad.

This dataset is split into a set for training and a set for testing. We can access both the train and test sets with get_train_test_data, which returns a DataFrame object for each set. The DataFrame class is a staple of pandas. Dataframes are tables; they are not unique to pandas, but pandas is the de facto Python library for working with them. You can think of a dataframe as the programmatic version of an Excel spreadsheet.
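If dataframes are new to you, here’s a minimal sketch of one built by hand; the column names and values below are invented purely for illustration.

import pandas as pd

# A toy dataframe with made-up values, just to show the row/column structure.
toy_df = pd.DataFrame({
    "review": ["Loved it!", "Terrible."],
    "label": [1, 0],
})
print(toy_df)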

from nlpbook import get_train_test_data

train_df, test_df = get_train_test_data()

train_df and test_df have the same format, so we’ll just inspect train_df. We can see how many rows are in the dataframe and information about the columns with DataFrame.info().

train_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 24904 entries, 0 to 12499
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        24904 non-null  int64 
 1   movie_id  24904 non-null  object
 2   rating    24904 non-null  int64 
 3   review    24904 non-null  object
 4   label     24904 non-null  int64 
dtypes: int64(3), object(2)
memory usage: 1.1+ MB

DataFrame.info() says there are 24,904 rows. There are five columns, three of which have type int64 and two with type object; the object types are strings in our dataframe.

pandas assigns the type object to columns holding non-numeric values, such as strings.
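If you want to convince yourself of that, one quick check (a sketch, not part of the book’s API) is to look at the Python type of a single value in an object column.

# The review column has dtype object; its individual values are Python strings.
print(type(train_df["review"].iloc[0]))  # expected: <class 'str'>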

Each row is for one review. A brief rundown of what the columns are:

  • id: The review ID. Within each label, this is a unique identifier for the review.
  • movie_id: The movie ID. A unique identifier for the movie the review is about.
  • rating: A score from 1-10 that the reviewer gave the movie (we’ll take a quick look at it after this list).
  • review: This is the review. Pretty self-explanatory.
  • label: A 0 or 1 value indicating if the review is negative or positive, respectively.
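Here’s that quick look at the rating column. Series.describe() summarizes a numeric column; this is just an optional exploratory check.

train_df["rating"].describe()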

The columns we’re interested in are review and label. review will be the input to all our models, as this is the natural language we are trying to process. label is what we’re trying to predict!
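When we start building models, a common pattern (sketched here; the variable names X and y are a convention I’m borrowing, not part of the book’s API) is to pull those two columns apart into inputs and targets.

# The review text is the model input; the label is the prediction target.
X = train_df["review"]
y = train_df["label"]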

Let’s inspect a few reviews with DataFrame.head().

train_df.head()
     id   movie_id  rating                                             review  label
0  7275  tt0082799       1  "National Lampoon Goes to the Movies" (1981) i...      0
1  1438  tt0397501       4  Well! What can one say? Firstly, this adaptati...      0
2  9137  tt0364986       1  What can I say, this is a piece of brilliant f...      0
3   173  tt0283974       3  A decent sequel, but does not pack the punch o...      0
4  8290  tt0314630       2  Alan Rudolph is a so-so director, without that...      0

The review column looks like natural language and the label column has numeric values just like DataFrame.info() said.

We can also see how this dataset is split by label.

train_df.value_counts("label")
label
1    12472
0    12432
Name: count, dtype: int64

There are 12,472 positive labels and 12,432 negative labels. That’s almost a 50/50 split.
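If you’d rather see the split as proportions, value_counts accepts a normalize flag; given the counts above, each class should come out to roughly 0.50.

train_df.value_counts("label", normalize=True)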

1.3 Unsupervised dataset

The train/test sets above have labels, 0 or 1, which allows them to be used for supervised learning. In supervised learning we have true outputs, the review labels in this case, to compare to our machine learning model’s outputs. We can supervise the model’s learning by comparing its outputs to the labels and letting the model know how it’s doing.

Unsupervised learning uses just input data. There’s no label to use as a comparator; instead the model must learn from the data without knowing whether it is right or wrong. This kind of learning is less about predicting a specific property and more about learning general properties of the data.

This dataset is available through get_unsup_data. Let’s inspect it with DataFrame.info().

from nlpbook import get_unsup_data

unsup_df = get_unsup_data()
unsup_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 49507 entries, 0 to 49999
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        49507 non-null  int64 
 1   movie_id  49507 non-null  object
 2   review    49507 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.5+ MB

As you can see, there is no label column. It’s just reviews and nothing else.
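If you want to see what one of these unlabeled reviews looks like, you can print one directly; this is just a quick sanity check.

# Print the first unlabeled review; the exact text depends on the row order.
print(unsup_df["review"].iloc[0])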

The next chapter will focus on building our first model. It will be simple and not very good.