Implementing Text Classification in Python

The study of Data Science has seen an exponential rise in the last few years, and one of its fastest-growing subfields is Natural Language Processing.

In this article, we will first build a brief intuition for NLP and then implement one of its common use cases, text classification, in Python.

What is Natural Language Processing?

NLP, or Natural Language Processing, is the study of extracting meaningful information from raw textual data. Because of the variety of data-generating sources, the majority of our data is unclean and comes in the form of natural language. This unstructured data carries a lot of hidden information which, when analyzed, can help a business grow in new dimensions.

For an e-commerce website, the entire business is built on its customer base. To ensure customers get the maximum benefit, it is highly recommended that the company analyze its log data and extract customers' search patterns. This helps the company stay ahead of its competitors in the market.

Natural Language Processing is one such method, and Python has several libraries, such as NLTK, spaCy, and CoreNLP, for dealing with textual data. There are also various pre-trained models that can be used for specific NLP tasks, but those are beyond the scope of this article.

Text classification in Python

One of the applications of Natural Language Processing is text classification: the process by which raw text is assigned to one of several categories such as good/bad, positive/negative, or spam/not spam. Even a news article can be sorted into topics with this method.
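Before training a model, the idea can be illustrated with a deliberately naive sketch: a hand-written keyword rule that flags a message as spam if it contains any of a few trigger words. The word list and function name here are purely illustrative; real systems, like the LSTM built below, learn such patterns from labeled data instead of relying on hand-picked rules.

```python
# Illustrative only: a naive keyword rule for spam detection.
# SPAM_WORDS is a made-up list, not taken from any real dataset.
SPAM_WORDS = {"free", "winner", "prize", "urgent"}

def classify(message: str) -> str:
    """Label a message 'spam' if it contains any trigger word."""
    words = set(message.lower().split())
    return "spam" if words & SPAM_WORDS else "ham"

print(classify("You are a winner! Claim your free prize now"))  # spam
print(classify("Are we still meeting for lunch today?"))        # ham
```

A rule like this breaks down quickly (misspellings, context, new vocabulary), which is exactly why the rest of the article trains a model on data instead.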

In this article, we will classify messages as spam or not spam using Python. The dataset contains a total of 5,574 labeled messages, and our task is to separate the spam messages from the ham. Below are the code snippets, with a description of each block used to build the text classification model.

  • The first step for any Data Science problem is importing the necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
%matplotlib inline

Apart from traditional libraries like Pandas and NumPy, we have also imported the LSTM, or Long Short-Term Memory, layer, which belongs to the Recurrent Neural Network family used in Deep Learning. It is one of the most popular Deep Learning techniques and is used across a variety of applications such as speech recognition and time-series analysis. We will use a Long Short-Term Memory network architecture to classify messages as spam or ham.

  • The read_csv method of pandas is used to load the data, and head() shows its first five rows.
df = pd.read_csv('../input/spam.csv', delimiter=',', encoding='latin-1')
df.head()


  • The columns Unnamed: 2, Unnamed: 3, and Unnamed: 4 would not have any influence on our model, so we drop them before further processing.
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
  • Now we are left with labeled data of two columns: one with the 'spam'/'ham' label, the other with the message text. Let's visualize the dataset to see how many spam and ham messages it contains, using the countplot function of the seaborn module in Python. Seaborn is built on top of Matplotlib but offers a wider range of styling features.
sns.countplot(df.v1)
plt.title('Number of ham and spam messages')


  • As expected, there are more ham messages, almost five times as many as spam. In the next step, we create vectors of our features and the target variable. We create vectors because a machine cannot interpret textual data directly, so it must be converted into numbers. The sklearn module of Python has a LabelEncoder class which encodes categorical labels as integers; here, 'ham' becomes 0 and 'spam' becomes 1.
X = df.v2
Y = df.v1
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1,1)
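To make the encoding step concrete, here is a minimal stdlib sketch of what LabelEncoder effectively does: it takes the sorted unique labels and maps each one to its index, which is why 'ham' maps to 0 and 'spam' to 1 in this dataset. The toy `labels` list below is made up for illustration.

```python
# Stdlib sketch of LabelEncoder's behavior (illustrative, not sklearn itself).
labels = ['ham', 'spam', 'ham', 'ham', 'spam']  # toy example labels

classes = sorted(set(labels))                    # sorted unique classes: ['ham', 'spam']
mapping = {c: i for i, c in enumerate(classes)}  # {'ham': 0, 'spam': 1}
encoded = [mapping[label] for label in labels]

print(encoded)  # [0, 1, 0, 0, 1]
```

The subsequent `reshape(-1, 1)` in the code above simply turns this flat list of integers into a column vector, the shape Keras expects for a single-output model.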
  • The model learns from our training set and is evaluated on the test data. We use 85% of the initial data for training and hold out the remaining 15% for testing.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15)
  • Data Pre-processing is the most time-consuming but important part of a Machine Learning project. Some of the pre-processing techniques used in text analysis are tokenizing, normalization, and so on.
max_words = 1000
max_len = 150
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences, maxlen=max_len)
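The tokenization and padding steps can be sketched with plain Python to show what is happening under the hood: build a word-to-integer index from the texts, map each text to a list of ids, then pad every list with zeros to a fixed length. The toy texts below are made up; note that the real Keras Tokenizer also lowercases, strips punctuation, and orders the index by word frequency, which this first-occurrence sketch skips.

```python
# Stdlib sketch of Tokenizer + pad_sequences (illustrative, not Keras itself).
texts = ["free prize now", "see you at lunch", "free lunch"]  # toy corpus
max_len = 5

# Build the vocabulary: word -> integer id (1-based; 0 is reserved for padding).
vocab = {}
for text in texts:
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab) + 1

# Convert each text to a sequence of ids, then pre-pad with zeros
# to max_len (Keras pads at the front by default).
sequences = [[vocab[w] for w in t.split()] for t in texts]
padded = [[0] * (max_len - len(s)) + s for s in sequences]

print(padded[0])  # [0, 0, 1, 2, 3]
```

The resulting matrix of equal-length integer rows is what the Embedding layer in the model below consumes.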
  • Once the data is pre-processed, it is fed to the model for training. We define a Recurrent Neural Network built around the LSTM architecture.
def RNN():
    inputs = Input(name='inputs', shape=[max_len])
    layer = Embedding(max_words, 50, input_length=max_len)(inputs)
    layer = LSTM(64)(layer)
    layer = Dense(256, name='FC1')(layer)
    layer = Activation('relu')(layer)
    layer = Dropout(0.5)(layer)
    layer = Dense(1, name='out_layer')(layer)
    layer = Activation('sigmoid')(layer)
    model = Model(inputs=inputs, outputs=layer)
    return model
  • The model is compiled with binary_crossentropy as the loss function and accuracy as the evaluation metric.
model = RNN()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])

  • The training set is fit to the model.
model.fit(sequences_matrix, Y_train, batch_size=128, epochs=10,
          validation_split=0.2,
          callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])


  • Given its accuracy on the validation set, this is our final model. It is now tested on the test data, which must first be tokenized and padded in the same way as the training data.
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

accr = model.evaluate(test_sequences_matrix,Y_test)


  • The loss and the accuracy on the test data.
print('Test set\n Loss: {:0.3f}\n Accuracy: {:0.3f}'.format(accr[0], accr[1]))


There are several text classification algorithms; in this article, we used an LSTM network in Python to separate spam messages from ham.

Conclusion –

Understanding and manipulating raw data is gradually becoming part of every organization. It is therefore worth knowing the nitty-gritty of Natural Language Processing and applying its fundamentals to use cases such as the one shown in this blog.
