1 Machine learning needs data
Without data, machine learning is nothing. After all, what would it learn if it has nothing to learn from? Think about everything you’ve learned throughout your life: all the books you’ve read, the movies you’ve watched, the experiences you’ve lived. Everything you’ve touched, tasted, heard, felt, and smelled comes together to shape who you are and what you know. Our brains adapt and change in response to all of this input. Our brains learn. Take away all of those memories and experiences and what are we left with? Thoughts, maybe, but of what? Without the context of our lives there isn’t much to think about. Machine learning models work like our brains, but simpler. They take data and try to make sense of it, or learn from it. The more data, the easier it is to learn, at least in theory.
1.1 The dataset
Our models will be using datasets of movie reviews from IMDB (Maas et al. 2011). The original datasets can be found at https://ai.stanford.edu/~amaas/data/sentiment.
One of the most important chapters in this book is about cleaning data. It’s a bonus chapter at the end that walks through the steps and analysis I performed to clean and prepare the dataset for this book. At this point it’s not important; the focus should be on building models as fast as possible and iterating on them. By the end of this book, you should feel comfortable with machine learning, at which point understanding where data comes from and how it’s prepared becomes useful. After all, we want to apply what we learn here to the real world, and that starts with real-world data.
It is still a bonus chapter and not required for understanding machine learning, but I found some surprising things when I cleaned this dataset. If you make it to the end, you really should read it.
I’ve taken the liberty of further cleaning the data and making it accessible through a Python API so we can get right to work on machine learning and NLP.
This book provides a conda environment file (see Section 2) with everything you need to run the code. If you want to access the dataset without setting up the conda environment, you can install it through pip. If you’re just following along through the book website, there’s nothing you need to do.
$ pip install git+https://github.com/spenceforce/NLP-Simple-to-Spectacular
Let’s get a feel for the datasets before we move on to machine learning. There are two movie review datasets available. One for classification and the other for unsupervised learning.
1.2 Classification dataset
This dataset contains movie reviews and labels indicating if the review is positive (label 1) or negative (label 0). It is intended for benchmarking sentiment classification tasks. Sentiment classification is about predicting the feeling a text conveys, such as happiness, sadness, or anger. In this case it’s predicting whether a review says a movie is good or bad.
This dataset is split into a set for training and a set for testing. We can access both the train and test sets with `get_train_test_data`, which returns a `DataFrame` object for each set. The `DataFrame` class is a staple of `pandas`. Dataframes are tables, and they are not unique to `pandas`, but `pandas` is the de facto Python library for working with dataframes. You can think of a dataframe as the programmatic version of an Excel spreadsheet.

`train_df` and `test_df` have the same format, so we’ll just inspect `train_df`. We can see how many rows are in the dataframe and information about the columns with `DataFrame.info()`.
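If you want to see what `DataFrame.info()` reports without the book’s dataset on hand, here’s a minimal sketch using a toy dataframe with the same column layout. The rows are made up for illustration; the real data comes from `get_train_test_data`.

```python
import pandas as pd

# A toy dataframe mimicking the review dataset's columns.
# These rows are invented; the real dataset has ~25,000 of them.
train_df = pd.DataFrame(
    {
        "id": [0, 1],
        "movie_id": ["tt0082799", "tt0397501"],
        "rating": [1, 4],
        "review": ["Not great.", "Could be better."],
        "label": [0, 0],
    }
)

# Prints the row count, column names, non-null counts, and dtypes.
train_df.info()
```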
<class 'pandas.core.frame.DataFrame'>
Index: 24904 entries, 0 to 12499
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 24904 non-null int64
1 movie_id 24904 non-null object
2 rating 24904 non-null int64
3 review 24904 non-null object
4 label 24904 non-null int64
dtypes: int64(3), object(2)
memory usage: 1.1+ MB
`DataFrame.info()` says there are 24,904 rows. There are five columns, three of type `int64` and two of type `object`; the `object` types are strings in our dataframe. `pandas` assigns the type `object` to non-numeric values. Each row is for one review. A brief rundown of the columns:

- `id`: The review ID. For each label, this is a unique identifier for the review.
- `movie_id`: The movie ID. A unique identifier for the movie the review is about.
- `rating`: A score from 1–10 that the reviewer gave the movie.
- `review`: This is the review. Pretty self-explanatory.
- `label`: A 0 or 1 value indicating if the review is negative or positive, respectively.
The columns we’re interested in are `review` and `label`. `review` will be the input to all models, as this is the natural language we are trying to process. `label` is what we’re trying to predict!
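Pulling those two columns out of a dataframe is a one-liner each in `pandas`. A sketch with made-up rows standing in for the real dataset:

```python
import pandas as pd

# Made-up rows standing in for the real review dataset.
train_df = pd.DataFrame(
    {
        "review": ["Loved it.", "Terrible movie."],
        "label": [1, 0],
        "rating": [9, 2],
    }
)

# The model input (the text) and the target we want to predict.
X = train_df["review"]
y = train_df["label"]
```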
Let’s inspect a few reviews with `DataFrame.head()`.
|   | id   | movie_id  | rating | review                                            | label |
|---|------|-----------|--------|---------------------------------------------------|-------|
| 0 | 7275 | tt0082799 | 1      | "National Lampoon Goes to the Movies" (1981) i... | 0     |
| 1 | 1438 | tt0397501 | 4      | Well! What can one say? Firstly, this adaptati... | 0     |
| 2 | 9137 | tt0364986 | 1      | What can I say, this is a piece of brilliant f... | 0     |
| 3 | 173  | tt0283974 | 3      | A decent sequel, but does not pack the punch o... | 0     |
| 4 | 8290 | tt0314630 | 2      | Alan Rudolph is a so-so director, without that... | 0     |
The `review` column looks like natural language and the `label` column has numeric values, just like `DataFrame.info()` said.
We can also see how this dataset is split by label.
There are 12,472 positive labels and 12,432 negative labels. That’s almost a 50/50 split.
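A label split like this can be checked with `value_counts`. A minimal sketch on toy labels; the 12,472/12,432 counts above come from the real dataset:

```python
import pandas as pd

# Toy labels; the real dataset has 12,472 positives and 12,432 negatives.
train_df = pd.DataFrame({"label": [1, 0, 1, 0, 1]})

# Counts how many rows have each label value.
counts = train_df["label"].value_counts()
print(counts)
```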
1.3 Unsupervised dataset
The train/test sets above have labels, 0 or 1, which allows them to be used in a supervised learning fashion. In supervised learning we have true outputs, the review labels in this case, to compare to our machine learning model’s outputs. We can supervise the model’s learning by comparing its outputs to the labels and letting the model know how it’s doing.
Unsupervised learning uses just input data. There’s no label to use as a comparator. Instead, the model must learn from the data without knowing whether it is right or wrong. This kind of learning is less about predicting a specific property and more about learning general properties of the data.
This dataset is available through `get_unsup_data`. Let’s inspect it with `DataFrame.info()`.
<class 'pandas.core.frame.DataFrame'>
Index: 49507 entries, 0 to 49999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 49507 non-null int64
1 movie_id 49507 non-null object
2 review 49507 non-null object
dtypes: int64(1), object(2)
memory usage: 1.5+ MB
As you can see, there is no `label` column. It’s just reviews and nothing else.
The next chapter will focus on building our first model. It will be simple and not very good.