Step 2: Explore Your Data

Building and training a model is only one part of the workflow. Understanding the characteristics of your data beforehand will enable you to build a better model. This could simply mean obtaining higher accuracy; it could also mean requiring less training data or fewer computational resources.

Load the Dataset

First up, let’s load the dataset into Python.

import os
import random

import numpy as np


def load_imdb_sentiment_analysis_dataset(data_path, seed=123):
    """Loads the IMDb movie reviews sentiment analysis dataset.

    # Arguments
        data_path: string, path to the data directory.
        seed: int, seed for randomizer.

    # Returns
        A tuple of training and validation data.
        Number of training samples: 25000
        Number of test samples: 25000
        Number of categories: 2 (0 - negative, 1 - positive)

    # References
        Maas et al., http://www.aclweb.org/anthology/P11-1015

        Download and uncompress archive from:
        http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data.
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname)) as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname)) as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
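As a quick usage sketch, assuming you have downloaded and extracted the archive locally (the data path below is a placeholder, not something defined by the code above), the loader can be called like this:

# Hypothetical path: replace with the directory that contains the
# extracted 'aclImdb' folder.
data_path = 'path/to/data'
(train_texts, train_labels), (test_texts, test_labels) = (
    load_imdb_sentiment_analysis_dataset(data_path))
print(len(train_texts), len(test_texts))  # Expected: 25000 25000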

Check the Data

After loading the data, it’s good practice to run some checks on it: pick a few samples and manually check whether they are consistent with your expectations. For example, print a few random samples to see if the sentiment label corresponds to the sentiment of the review. Here is a review we picked at random from the IMDb dataset: “Ten minutes worth of story stretched out into the better part of two hours. When nothing of any significance had happened at the halfway point I should have left.” The expected sentiment (negative) matches the sample’s label.
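A minimal sketch of such a spot check is shown below. The print_random_samples helper is ours for illustration (it is not part of the guide’s code) and assumes the texts and labels returned by load_imdb_sentiment_analysis_dataset:

import random

def print_random_samples(texts, labels, num_samples=3, seed=42):
    """Prints a few randomly chosen reviews with their labels for a manual check."""
    random.seed(seed)
    for index in random.sample(range(len(texts)), num_samples):
        sentiment = 'positive' if labels[index] == 1 else 'negative'
        print('Label: {} ({})'.format(labels[index], sentiment))
        print(texts[index][:300])  # Truncate long reviews for readability.
        print('-' * 40)

# Example: print_random_samples(train_texts, train_labels)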

Collect Key Metrics

Once you’ve verified the data, collect the following important metrics that can help characterize your text classification problem:

  1. Number of samples : Total number of examples you have in the data.

  2. Number of classes : Total number of topics or categories in the data.

  3. Number of samples per class : Number of samples per class (topic/category). In a balanced dataset, all classes will have a similar number of samples; in an imbalanced dataset, the number of samples in each class will vary widely.

  4. Number of words per sample : Median number of words in one sample.

  5. Frequency distribution of words : Distribution showing the frequency (number of occurrences) of each word in the dataset.

  6. Distribution of sample length : Distribution showing the number of words per sample in the dataset.

Let’s see what the values for these metrics are for the IMDb reviews dataset (See Figures 3 and 4 for plots of the word-frequency and sample-length distributions).

Metric name                     Metric value
Number of samples               25000
Number of classes               2
Number of samples per class     12500
Number of words per sample      174
Table 1: IMDb reviews dataset metrics
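To reproduce the per-class counts in Table 1 directly from the loaded labels, a small Counter-based helper works. The sketch below is our own illustration and is not part of explore_data.py:

from collections import Counter

import numpy as np

def get_class_distribution(labels):
    """Returns the number of samples per class as a {class: count} dict."""
    return dict(Counter(np.asarray(labels).tolist()))

# Example: get_class_distribution(train_labels) -> {0: 12500, 1: 12500}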

explore_data.py contains functions to calculate and analyze these metrics. Here are a couple of examples:

import numpy as np
import matplotlib.pyplot as plt


def get_num_words_per_sample(sample_texts):
    """Returns the median number of words per sample given corpus.

    # Arguments
        sample_texts: list, sample texts.

    # Returns
        int, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)


def plot_sample_length_distribution(sample_texts):
    """Plots the sample length distribution.

    # Arguments
        sample_texts: list, sample texts.
    """
    # Sample length here is measured in characters.
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()
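The word-frequency distribution shown in Figure 3 can be approximated with a similar helper. The sketch below uses naive whitespace tokenization and is our own simplified version, not the function used to generate the figure:

from collections import Counter

import matplotlib.pyplot as plt

def plot_word_frequency_distribution(sample_texts, num_words=50):
    """Plots the frequency of the most common words in the corpus."""
    counts = Counter()
    for text in sample_texts:
        counts.update(text.lower().split())  # Naive whitespace tokenization.
    words, frequencies = zip(*counts.most_common(num_words))
    plt.bar(range(len(words)), frequencies)
    plt.xticks(range(len(words)), words, rotation=90)
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.title('Frequency distribution of words')
    plt.show()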

Figure 3: Frequency distribution of words for IMDb

Figure 4: Distribution of sample length for IMDb
