In this article, we’ll learn how to deploy a machine learning project. We will solve a problem for a bank that wants to identify whether a loan applicant will default, based on attributes such as loan amount, funded amount, and income. We will use the bank’s past data on defaulters to learn the patterns they follow and tell the bank whether a future loan applicant is likely to default.
Let’s start this machine learning tutorial by importing the Python libraries and modules we’ll need to implement our machine learning algorithm.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
Our next step is to load the data file. The data is in CSV format, so we will use the pandas read_csv() function to load it into a dataframe stored under the variable name ‘df’.
We can look at our data with the pandas head() method; by default it returns the first five rows of the dataframe.
We can check the number of samples and features present in our data by using the shape attribute of pandas dataframe.
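The loading and inspection steps above can be sketched as follows. The article does not give the file name or the data itself, so the CSV text and its values here are hypothetical stand-ins; with the real file you would pass its path to read_csv() instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the bank's CSV file; the real file name and
# values are not given in the article.
csv_text = """loan_amnt,funded_amnt,annual_inc,grade,defaulter
5000,5000,24000,B,0
2500,2500,30000,C,1
2400,2400,12252,C,0
10000,10000,49200,C,0
3000,3000,80000,B,0
5600,5600,40000,F,1
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())  # first five rows by default
print(df.shape)   # (number of rows, number of columns)
```

Note that shape is an attribute, not a method, so it is written without parentheses.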
So we have 18 columns in total and about 46,000 samples. One of these 18 columns is our target variable, the ‘defaulter’ column, which takes the values 0 or 1.
Null values are a common data quality problem and can noticeably hurt a machine learning model. So we need to deal with them before moving on to model development.
Check the number of null values present in our data by using the isnull() method in conjunction with the sum() method. This will return the total number of null values for each feature.
We can see that there is only 1 null value in each of the annual_inc and payment_inc_ratio features and 3 null values in the pub_rec feature. We have almost 46,600 samples in our data, so the number of samples with null values is very low. We can therefore proceed by discarding the samples that have null values.
To remove the samples with null values we will use the dropna() method and pass inplace=True as an argument, so that the changes are saved in the original dataframe itself.
df.dropna(inplace = True)
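As a small illustration of the null check and removal described above, here is a sketch on a toy dataframe. The values are made up for the example; only the column names are borrowed from the article. It uses the non-inplace form of dropna(), which returns a new dataframe and leaves the original untouched.

```python
import numpy as np
import pandas as pd

# Toy data with a few missing values (made up for illustration)
toy = pd.DataFrame({
    "annual_inc": [24000.0, np.nan, 12252.0, 49200.0],
    "pub_rec": [0.0, 0.0, np.nan, 1.0],
    "defaulter": [0, 1, 0, 0],
})

# Number of null values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one null value
cleaned = toy.dropna()
print(cleaned.shape)
```

Comparing the shape before and after is a quick way to confirm how many samples were discarded.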
Our next step is to check the nature of the relationship between our target variable and the independent features. Here our target variable is the defaulter column. To check the correlation we will call the pandas corr() method, which uses the Pearson correlation by default. For every pair of features it returns a value between -1 and 1 representing the direction and strength of their linear relationship.
-1 –> A correlation of -1 means the 2 features are perfectly negatively correlated: as the value of one feature increases, the value of the other decreases.
0 –> A correlation of 0 means there is no linear correlation between the 2 features.
1 –> A correlation of 1 means the 2 features are perfectly positively correlated: as the value of one feature increases, the value of the other also increases.
The features having a high positive or negative correlation with the target variable must be considered as an important feature in predicting the value of the target variable.
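The correlation check can be sketched like this on a toy dataframe (values invented for illustration). Selecting the numeric columns explicitly with select_dtypes() keeps the example independent of how a given pandas version treats text columns in corr().

```python
import pandas as pd

# Toy data (made up); 'grade' is a text column and cannot be correlated directly
toy = pd.DataFrame({
    "int_rate": [10.0, 15.0, 13.0, 7.0, 21.0, 12.0],
    "annual_inc": [24000, 30000, 12252, 80000, 40000, 49200],
    "grade": ["B", "C", "C", "A", "F", "B"],
    "defaulter": [0, 1, 0, 0, 1, 0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr_matrix = numeric.corr()

# Correlation of each feature with the target, strongest first
target_corr = (
    corr_matrix["defaulter"]
    .drop("defaulter")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value puts strongly negative and strongly positive features side by side, since both are informative.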
In the above table we can see only 12 columns, including the target variable, but our original data had 18 features in total. So where did the rest of the columns go when we computed the correlation?
The answer is that corr() works only on numeric features. Older pandas versions silently skip text columns; from pandas 2.0 onward, corr() raises an error on text columns unless numeric_only=True is passed. Our machine learning model can likewise interpret only numeric features, so we need to convert the text features to numeric representations before using them to develop our model.
We will use the LabelEncoder class from sklearn.preprocessing to convert the text features to numeric features. It assigns an integer to each unique value in the column. For example, if a feature has the values a, b and c, then a is represented by 0, b by 1 and c by 2. LabelEncoder assigns these numbers according to the sorted order of the unique values, not the order in which they occur in the column. This means that a is represented by 0 even if c appears first in the data.
To use LabelEncoder, we need to instantiate it and then call its fit_transform() method on each text feature we want to convert.
encoder = LabelEncoder()
df['income_verified'] = encoder.fit_transform(df['income_verified'])
df['grade'] = encoder.fit_transform(df['grade'])
df['home_ownership'] = encoder.fit_transform(df['home_ownership'])
df['pymnt_plan'] = encoder.fit_transform(df['pymnt_plan'])
df['purpose'] = encoder.fit_transform(df['purpose'])
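A possible variation, shown here on a toy dataframe with made-up values, is to keep one fitted encoder per column. LabelEncoder numbers the sorted unique values, and refitting a single encoder on each column overwrites the previous mapping, so storing the encoders in a dictionary (the name encoders is just an illustrative choice) lets you inspect the mapping or decode values later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "grade": ["C", "A", "B", "A"],
    "home_ownership": ["RENT", "OWN", "MORTGAGE", "RENT"],
})

# One encoder per column, so each mapping stays available afterwards
encoders = {}
for column in ["grade", "home_ownership"]:
    encoder = LabelEncoder()
    toy[column] = encoder.fit_transform(toy[column])
    encoders[column] = encoder

print(toy)

# classes_ holds the original values; the position of each value is its code
print(encoders["grade"].classes_)

# inverse_transform maps codes back to the original text values
print(encoders["grade"].inverse_transform([2, 0]))
```

Here "C" is encoded as 2 even though it appears first, because the codes follow the sorted order A, B, C.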
We have converted the important text features into numeric features and assigned them under their respective feature names. Since a state name isn’t a significant feature that can help us to determine whether a person is a defaulter or not, we have avoided converting that feature.
Let’s look at our data now.
Since there was no significant correlation between the target variable and any single independent feature, we will use the set of features that can best help determine whether a person will default. We will save the features as a dataframe under the variable name x and the target variable as y.
x = df[['loan_amnt', 'funded_amnt', 'term', 'int_rate', 'installment', 'grade', 'home_ownership', 'annual_inc', 'income_verified', 'pymnt_plan', 'purpose', 'pub_rec', 'total_rec_late_fee', 'inactive_loans', 'payment_inc_ratio']]
# Select the target as a 1-D Series; scikit-learn warns when y is a single-column dataframe
y = df['defaulter']
After declaring the input features and the target variable, we need to split our data into train and test datasets to verify how well our machine learning model generalizes. By generalization we mean that the model should correctly classify any new, unseen sample as a defaulter or not. So we will train our model on the train dataset and check whether it generalizes using the test dataset.
To split our dataset, we will use train_test_split from sklearn.model_selection and pass x and y along with test_size=0.1, which means that 10% of the complete dataset is used for testing and the other 90% for training.
train_x, test_x, train_y, test_y = train_test_split(x,y, test_size=0.1, random_state=15)
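To see what the split produces, here is a sketch on randomly generated data. The data is synthetic and only the column names come from the article; the point is the resulting sizes and the alignment between features and labels.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with two features and a binary target
rng = np.random.default_rng(15)
x = pd.DataFrame({
    "loan_amnt": rng.integers(1000, 35000, size=100),
    "int_rate": rng.uniform(5, 25, size=100),
})
y = pd.Series(rng.integers(0, 2, size=100), name="defaulter")

# Hold out 10% of the rows for testing; random_state makes the split repeatable
train_x, test_x, train_y, test_y = train_test_split(
    x, y, test_size=0.1, random_state=15
)

print(train_x.shape, test_x.shape)
```

If one class is much rarer than the other, passing stratify=y to train_test_split keeps the class proportions similar in both parts.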
Now comes the main part of our blog, creating our machine learning model.
Creating Machine Learning projects in Python
We will create a logistic regression model by importing the LogisticRegression class from sklearn.linear_model and instantiating it.
Then we can call the fit() method on our training data that is train_x and train_y to train our model.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(train_x, train_y)
Finally let’s check how accurate our model is by evaluating its accuracy score on our test data. This will also be representative of how much our model is able to generalize the pattern learned while training.
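The evaluation step can be sketched as below. This example is self-contained and uses synthetic data from make_classification rather than the bank's data, so its accuracy will differ from the figure reported in this article. Setting max_iter=1000 is an assumption made here to give the solver room to converge; it is not a setting from the original model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the loan dataset
features, target = make_classification(
    n_samples=500, n_features=5, random_state=15
)
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.1, random_state=15
)

# max_iter raised from the default as a precaution (an assumption of this sketch)
model = LogisticRegression(max_iter=1000)
model.fit(train_x, train_y)

# Fraction of test samples whose predicted label matches the true label
predictions = model.predict(test_x)
accuracy = accuracy_score(test_y, predictions)
print(f"Test accuracy: {accuracy:.3f}")
```

For a classifier, model.score(test_x, test_y) returns the same accuracy in a single call.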
Woah! We got an accuracy score of around 95% on our test data, which suggests that our model classifies new, unseen samples well.
This completes our walkthrough of how to deploy machine learning projects in Python.
Happy Learning and keep exploring!!