FraudDetectorVotingClassifier

This is part two of a three-project series exploring different approaches to classifying credit card fraud with artificial intelligence. The other two projects, FraudDetectorLogisticRegression and FraudDetectorDeepLearning, can be found on my profile. Enjoy!

In this project, I used a voting classifier machine learning ensemble model to detect credit card fraud.

Some Background

In the last project, we used a logistic regression model to classify credit card transactions in creditcard_sampledata.csv, and the results weren’t too pretty. As a reminder, here’s the classification report and confusion matrix.

Classification report:
              precision    recall  f1-score   support

           0       1.00      0.98      0.99      2390
           1       0.12      0.80      0.21        10

    accuracy                           0.97      2400
   macro avg       0.56      0.89      0.60      2400
weighted avg       1.00      0.97      0.98      2400

Confusion matrix:
[[2331   59]
 [   2    8]]

How can this be improved? It’s easy to default to deep learning, but I wanted to play around with some other, more traditional, machine learning methods before diving into neural networks.

In this project, I used an ensemble method composed of three different machine learning models. The first is a logistic regression model, explained in the previous project in the series. The second is a decision tree classifier, and the third is a random forest classifier.

A decision tree is a tree-like model used for classification and regression. It splits the data into branches based on feature values and makes its prediction at a leaf node. Each internal node represents a test on a feature, each branch represents an outcome of that test, and each leaf node represents a class label (for classification) or a continuous value (for regression). A random forest classifier is itself an ensemble method: it trains many decision trees on random subsets of the data and features and combines their predictions.
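To make the node/branch/leaf vocabulary concrete, here is a minimal sketch of a decision tree written by hand as nested conditionals. The feature names (`amount`, `hours_since_last_purchase`) and the thresholds are made up for illustration and are not taken from creditcard_sampledata.csv; a trained DecisionTreeClassifier learns its own tests from the data.

```python
# Hypothetical hand-written decision tree; features and thresholds are invented.
def classify_transaction(amount, hours_since_last_purchase):
    """Return 1 (fraud) or 0 (non-fraud) by walking from the root to a leaf."""
    # Internal node: test on the 'amount' feature
    if amount > 1000.0:
        # Internal node: test on the 'hours_since_last_purchase' feature
        if hours_since_last_purchase < 1.0:
            return 1  # leaf: fraud
        return 0      # leaf: non-fraud
    return 0          # leaf: non-fraud


print(classify_transaction(2500.0, 0.2))   # large purchase right after another one
print(classify_transaction(2500.0, 12.0))  # large purchase, but not in quick succession
print(classify_transaction(40.0, 0.2))     # small purchase
```

Each `if` is an internal node, each path through the conditionals is a branch, and each `return` is a leaf.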

My voting classifier uses hard voting: each of the three models predicts a class, and the ensemble outputs the class chosen by the majority.
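As a rough illustration of hard voting, the sketch below takes the majority of three models' predicted classes for each transaction. The prediction lists are hypothetical, and `hard_vote` is a helper written for this example, not part of scikit-learn.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the class predicted by the most models for one sample."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions for four transactions, one list per model
lr_preds = [1, 0, 1, 0]
rf_preds = [0, 0, 1, 0]
dt_preds = [1, 0, 0, 1]

# Group the three votes per transaction and take the majority class
ensemble_preds = [hard_vote(votes) for votes in zip(lr_preds, rf_preds, dt_preds)]
print(ensemble_preds)  # [1, 0, 1, 0]
```

With three models and two classes there is always a strict majority, so no tie-breaking rule is needed here.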

Code Breakdown

creditcard_sampledata.csv contains information on credit card purchases. I started off by importing the necessary libraries and loading the dataset.

# Import necessary libraries for data manipulation and machine learning
import pandas as pd
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset from the specified CSV file into a DataFrame
df = pd.read_csv("creditcard_sampledata.csv")

I defined a function, prep_data(), that returns the features and the target variable (fraud or non-fraud). I then split the dataset into training and test sets and applied SMOTE to the training set only, so that both classes have the same number of observations during training while the test set keeps its original class distribution.

# Function to preprocess the data by separating the features and the target variable
def prep_data(df):
    # Separate the features (X) by dropping the 'Class' column
    X = df.drop('Class', axis=1)
    # Store the target variable (y) from the 'Class' column
    y = df['Class']
    return X, y

# Preprocess the data to get features (X) and target (y)
X, y = prep_data(df)

# Split the dataset into training and testing sets, with 30% of the data used for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Apply Synthetic Minority Over-sampling Technique (SMOTE) to balance the class distribution in the training set
sm = SMOTE(random_state=42)
X_train, y_train = sm.fit_resample(X_train, y_train)
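The core idea behind SMOTE can be sketched in a few lines: a synthetic minority sample is placed somewhere on the line segment between an existing minority sample and one of its minority-class neighbours. This is a simplified illustration only, not imblearn's implementation; the two fraud samples and the helper `smote_like_sample` are hypothetical, and real SMOTE chooses the neighbour from the k nearest minority samples.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point between two minority-class points."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [p + lam * (n - p) for p, n in zip(point, neighbor)]


rng = random.Random(42)
fraud_a = [120.0, 3.0]  # hypothetical fraud sample with two made-up features
fraud_b = [180.0, 5.0]  # hypothetical nearby fraud sample

synthetic = smote_like_sample(fraud_a, fraud_b, rng)
print(synthetic)  # each feature lies between the two original samples
```

Repeating this until the fraud class is as large as the non-fraud class is, roughly, what fit_resample does to the training set.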

Next, I defined my three models, entered them into the VotingClassifier, and fitted the ensemble model to the training set.

# Initialize three different classifiers with specified hyperparameters
# Logistic Regression with adjusted class weights to handle class imbalance
clf1 = LogisticRegression(class_weight={0:1, 1:15}, random_state=5)
# Random Forest with specific parameters including class weights, maximum depth, and number of estimators
clf2 = RandomForestClassifier(class_weight={0:1, 1:12}, criterion='gini', max_depth=8, max_features='log2',
            min_samples_leaf=10, n_estimators=30, n_jobs=-1, random_state=5)
# Decision Tree with balanced class weights
clf3 = DecisionTreeClassifier(random_state=5, class_weight="balanced")

# Combine the three classifiers into a single ensemble model using hard voting
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)], voting='hard')

# Train the ensemble model using the training data
ensemble_model.fit(X_train, y_train)

Lastly, I used the model to make predictions on the test set and printed out the classification report and confusion matrix.

# Use the trained ensemble model to make predictions on the test set
predicted = ensemble_model.predict(X_test)

# Print the classification report which includes precision, recall, and F1-score for each class
print(classification_report(y_test, predicted))
# Print the confusion matrix to show the performance of the model in terms of true/false positives and negatives
print(confusion_matrix(y_test, predicted))

Classification report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      2390
           1       0.57      0.40      0.47        10

    accuracy                           1.00      2400
   macro avg       0.78      0.70      0.73      2400
weighted avg       1.00      1.00      1.00      2400

Confusion matrix:
[[2387    3]
 [   6    4]]

Comparing the two reports, precision for the fraud class rose sharply from 0.12 to 0.57, but recall fell from 0.80 to 0.40. In terms of the confusion matrices, false positives dropped from 59 to 3, while missed fraud cases rose from 2 to 6 out of 10. Is this a better result? It’s really up to the bank to decide. They might want to catch more fraud knowing they will end up with additional false positives, or vice versa. In an ideal world, both precision and recall would be higher, and I'll try to achieve this with deep learning in the next project, FraudDetectorDeepLearning.
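The fraud-class figures quoted above follow directly from the two confusion matrices. The sketch below recomputes them, assuming scikit-learn's layout of [[TN, FP], [FN, TP]]; `fraud_metrics` is a helper written for this example.

```python
def fraud_metrics(confusion):
    """Compute precision, recall and F1 for class 1 from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)  # share of flagged transactions that are fraud
    recall = tp / (tp + fn)     # share of actual fraud that was flagged
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)


# Confusion matrices reported in this project and the previous one
logistic_regression = [[2331, 59], [2, 8]]
voting_classifier = [[2387, 3], [6, 4]]

print(fraud_metrics(logistic_regression))  # (0.12, 0.8, 0.21)
print(fraud_metrics(voting_classifier))    # (0.57, 0.4, 0.47)
```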