Spam classification using classic models

Spam Classification

This is my final assignment for Data Mining, an undergraduate elective course at UCAS. It may be helpful to you.

Problem Description

Problem link: https://challenge.datacastle.cn/v3/cmptDetail.html?id=352

Given email text information, establish a classification model to determine which emails are spam.

Phone Number Checker

1.Theory

After downloading the training and test data, I looked through the spam in the training set for obvious features. The vast majority of spam emails contain a “phone number”: usually a string of 11 digits, sometimes with separators such as spaces or ‘-’ between the digits.


So a natural idea is to mark every email that contains a phone number as spam and every other email as normal. To identify phone numbers, slide a window over every run of 13 consecutive characters in the text; if 9 or more of those characters are digits, the window is considered a phone number. For example, “0871-872-9755” will be recognized as a phone number.

2.Code

Critical Code:

import pandas as pd

def check_phone_number(text):
    # Count the number of digits in the string
    def count_digits(s):
        return sum(c.isdigit() for c in s)

    # Check every window of length 13 in the text
    for i in range(0, len(text) - 12):
        if count_digits(text[i:i+13]) >= 9:
            return 1

    return 0

# Build the output
for index, row in test_data_df.iterrows():
    id = row['ID']
    if check_phone_number(row['Email']):  # If a phone number is recognized in the text
        Label = 'spam'  # Mark as spam
    else:
        Label = 'ham'
    new_row = pd.DataFrame([[id, Label]], columns = ['ID', 'Label'])
    result_df = pd.concat([result_df, new_row], ignore_index = True)

3.Result


The accuracy on the training set is about 0.9457, and the accuracy on the test set is about 0.9434.

My first rule for identifying phone numbers required 11 or more digits in a 13-character sliding window, which gave an accuracy of 0.9372.

I then found that some numbers have fewer than 11 digits and should still be recognized as phone numbers, so I changed the rule to 9 or more digits in a 13-character window. Accuracy on the test set improved to 0.9434.
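The effect of the two thresholds can be compared with a small parameterized sketch. The function name `has_phone_number`, its parameters, and the sample strings are made up for illustration; the rolling digit count is just one way to avoid recounting every window.

```python
def has_phone_number(text, window=13, min_digits=9):
    """Return True if any `window`-character slice holds at least `min_digits` digits."""
    if len(text) < window:
        return False
    is_digit = [c.isdigit() for c in text]
    count = sum(is_digit[:window])          # digits in the first window
    if count >= min_digits:
        return True
    for i in range(window, len(text)):
        # Slide the window one character: add the new one, drop the oldest
        count += is_digit[i] - is_digit[i - window]
        if count >= min_digits:
            return True
    return False

# Hypothetical sample texts, not taken from the data set
samples = ["Call 0871-872-9755 now", "Call 0800 123 456 today", "See you at 10:30 tomorrow"]
for s in samples:
    print(s, has_phone_number(s, min_digits=11), has_phone_number(s, min_digits=9))
```

The second sample contains only 10 digits, so it is caught by the 9-digit rule but missed by the 11-digit rule.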

Naive Bayes classifier based on bag of words

1.Theory

Build one large bag of words from all spam emails and another from all normal emails.

First, calculate the prior probability: take the proportions of spam and normal emails in the training set as the priors.

Then, starting from the prior, multiply in the relative frequency with which each word of the email appears in the spam bag to obtain a score for the email being spam, and do the same with the normal bag to obtain a score for the email being normal. Compare the two scores and take the larger one as the prediction.

Laplace smoothing was used to avoid the problem of calculating a probability of 0 when the frequency of a word in the bag of words is 0.
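Written out, the rule above is the usual multinomial Naive Bayes decision rule. The notation is introduced here for illustration only: $c \in \{\text{spam}, \text{ham}\}$ is a class, $w_1, \dots, w_k$ are the words of the email, $N_{c,w}$ is the count of word $w$ in the bag for class $c$, $N_c = \sum_w N_{c,w}$, and $|V|$ is the vocabulary size. This is the textbook add-one form; the code in this post uses a slightly different denominator.

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{k} P(w_i \mid c),
\qquad
P(w \mid c) = \frac{N_{c,w} + 1}{N_c + |V|}
```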

2.Code

Critical Code:

(1) Data Processing:

import re

def remove_punctuation(text):
    # Replace every character that is not a letter, digit, underscore, or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    return text

def change_digit_to_zero(text):
    # Replace every digit with the character '0'
    return re.sub(r'\d', '0', text)

# Process the text: replace punctuation with spaces, map digits to '0', and convert to lowercase
def init_text(text):
    text = remove_punctuation(text)
    text = change_digit_to_zero(text)
    text = text.lower()
    return text

spam_emails, ham_emails = [], []
spam_word_count, ham_word_count = {}, {}

for index, row in train_data_df.iterrows():
    text = init_text(row['Email'])
    # Put the processed text into different lists according to labels
    if row['Label'] == 'spam':
        spam_emails.append(text)
    else:
        ham_emails.append(text)

# Count the frequency of each word appearing in spam and normal emails
for email in spam_emails:
    words = email.split()  # Split on runs of whitespace
    for word in words:  # Build a word bag
        spam_word_count[word] = spam_word_count.get(word, 0) + 1

for email in ham_emails:
    words = email.split()
    for word in words:
        ham_word_count[word] = ham_word_count.get(word, 0) + 1

# Calculate prior probability
spam_prior_prob = len(spam_emails) / (len(spam_emails) + len(ham_emails))
ham_prior_prob = len(ham_emails) / (len(spam_emails) + len(ham_emails))

(2) Prediction:

def predict(email):  # Predict whether an email is spam from its text
    text = init_text(email)  # Initialize text
    word_count = {}
    words = text.split()  # Split on runs of whitespace
    for word in words:  # Build a word bag
        word_count[word] = word_count.get(word, 0) + 1

    spam_prob = spam_prior_prob
    ham_prob = ham_prior_prob

    for word, count in word_count.items():
        # Add-one smoothing; each distinct word contributes once, and the
        # denominator adds the number of distinct words in this email
        spam_prob *= (spam_word_count.get(word, 0) + 1) / (sum(spam_word_count.values()) + len(word_count))
        ham_prob *= (ham_word_count.get(word, 0) + 1) / (sum(ham_word_count.values()) + len(word_count))
        # Rescale both scores by the same factor to avoid floating-point underflow
        if spam_prob < 1e-6 or ham_prob < 1e-6:
            spam_prob *= 1e6
            ham_prob *= 1e6

    if spam_prob > ham_prob:
        return 'spam'
    else:
        return 'ham'
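A common alternative to rescaling the product is to sum logarithms instead. The sketch below is not the code used for the results in this post: it is a minimal log-space variant that, as an assumption, smooths with the vocabulary size in the denominator and counts repeated words. The function names and the toy data are invented for illustration.

```python
import math
from collections import Counter

def train_counts(emails):
    """Count word occurrences over a list of already-tokenized emails."""
    counts = Counter()
    for words in emails:
        counts.update(words)
    return counts

def log_score(words, prior, counts, vocab_size):
    """log P(c) + sum_i log P(w_i | c), with add-one smoothing over the vocabulary."""
    total = sum(counts.values())
    score = math.log(prior)
    for w in words:
        score += math.log((counts[w] + 1) / (total + vocab_size))
    return score

def predict_log(words, spam_counts, ham_counts, spam_prior, ham_prior):
    vocab_size = len(set(spam_counts) | set(ham_counts))
    spam = log_score(words, spam_prior, spam_counts, vocab_size)
    ham = log_score(words, ham_prior, ham_counts, vocab_size)
    return 'spam' if spam > ham else 'ham'

# Toy data, invented for illustration
spam_docs = [["win", "cash", "call", "0000"], ["free", "cash", "call"]]
ham_docs = [["see", "you", "at", "lunch"], ["call", "me", "after", "lunch"]]
spam_counts, ham_counts = train_counts(spam_docs), train_counts(ham_docs)

print(predict_log(["free", "cash"], spam_counts, ham_counts, 0.5, 0.5))
print(predict_log(["lunch", "at", "noon"], spam_counts, ham_counts, 0.5, 0.5))
```

Because log-probabilities are added rather than multiplied, even very long emails do not underflow, so no rescaling step is needed.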

3.Result

(1) Training time

The total time of training and prediction is 6.44s

(2) Accuracy on the training set

Accuracy on the training set is 0.9850

(3) Accuracy on the test set

The accuracy is 0.9452 without processing the numerical string.

Phone numbers are strings of digits. Each spam email contains a different number, but essentially they can all be treated as the same word. To let them play their due role, I converted every digit to 0, and the accuracy improved to 0.947.

At first, when splitting the text into words, I simply deleted punctuation marks:

text = re.sub(r'[^\w\s]', '', text)

Later, I discovered that some words are separated only by symbols, with no spaces in between, such as ‘…’. After the punctuation was removed, the two words were joined together, so I changed the preprocessing to replace punctuation marks with spaces:

text = re.sub(r'[^\w\s]', ' ', text)

The accuracy improved to 0.9587.
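The difference between the two preprocessing choices, and the effect of mapping digits to 0, can be seen on a single string. The sample sentence and the helper names below are made up for illustration; only the regular expressions come from the post.

```python
import re

def delete_punctuation(text):
    return re.sub(r'[^\w\s]', '', text)    # first attempt: drop punctuation

def punctuation_to_space(text):
    return re.sub(r'[^\w\s]', ' ', text)   # later version: replace with a space

def digits_to_zero(text):
    return re.sub(r'\d', '0', text)

sample = "FREE entry...call 0871-872-9755 now!"  # hypothetical email text

print(delete_punctuation(sample).lower().split())
# ['free', 'entrycall', '08718729755', 'now']
print(punctuation_to_space(sample).lower().split())
# ['free', 'entry', 'call', '0871', '872', '9755', 'now']
print(digits_to_zero(punctuation_to_space(sample)).lower().split())
# ['free', 'entry', 'call', '0000', '000', '0000', 'now']
```

Deleting punctuation fuses "entry" and "call" into one unseen token, while replacing it with a space keeps both words; mapping digits to 0 then collapses different phone numbers into a few shared tokens.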

Naive Bayes Model Based on TF-IDF

1.Theory

Email text initialization:

  1. Replace punctuation with spaces
  2. Replace all numbers with ‘0’
  3. Convert all letters to lowercase
  4. Split the text into words on runs of one or more spaces
  5. Merge the segmentation results into a string with only one space between each word

Call CountVectorizer() to turn the $n$ emails into $n$ small bags of words. Assuming the vocabulary kept across all texts contains $m$ distinct words, each sample has $m$ features, representing the number of times each word appears in the email.

Call TfidfTransformer() to calculate the TF-IDF value of each small bag of words. At this point, the $m$ features of each sample become the corresponding TF-IDF values of $m$ words in the email text.

What remains is an ordinary classification problem over $m$-dimensional feature vectors, which can be handled by various methods such as a naive Bayes model, logistic regression, or a support vector machine.
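To make the TF-IDF step concrete, here is a small pure-Python sketch. It assumes the smoothed IDF, $\ln\frac{1+n}{1+\mathrm{df}} + 1$, and the L2 row normalization that scikit-learn documents as the TfidfTransformer defaults, and it tokenizes by whitespace only. The function name `tfidf_rows` and the three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocab, rows), where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    vocab = sorted({w for d in docs for w in d.split()})
    n = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(w for d in docs for w in set(d.split()))
    # Smoothed IDF (assumed to match scikit-learn's default)
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for d in docs:
        tf = Counter(d.split())                          # raw term counts
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])             # L2-normalize each row
    return vocab, rows

docs = ["free cash call now", "call me now", "free free cash"]  # toy corpus
vocab, rows = tfidf_rows(docs)
print(vocab)
for row in rows:
    print([round(x, 3) for x in row])
```

In the second document, "me" occurs in only one document and so receives a larger weight than "call", which occurs in two.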

2.Code

Critical Code:

(1) Data Processing: CountVectorizer() was called to construct the bag of words, and TfidfTransformer() was called to calculate the TF-IDF values.

import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def remove_punctuation(text):
    # Replace every character that is not a letter, digit, underscore, or whitespace with a space
    text = re.sub(r'[^\w\s]', ' ', text)
    return text

def change_digit_to_zero(text):
    # Replace every digit with the character '0'
    return re.sub(r'\d', '0', text)

# Initialize the text
def init_text(text):
    text = remove_punctuation(text)
    text = change_digit_to_zero(text)
    text = text.lower()
    words = text.split()
    # Merge the segmentation results into a string with only one space between each word
    text = " ".join(words)
    return text

train_text_list = []
test_text_list = []

# Initialize the text and save it to a list
for index, row in train_data_df.iterrows():
    text = init_text(row['Email'])
    train_text_list.append(text)

for index, row in test_data_df.iterrows():
    text = init_text(row['Email'])
    test_text_list.append(text)

# Create a bag of words data structure (the vocabulary is fitted on the training and test texts together)
cv = CountVectorizer(max_features = 3000, max_df = 0.1, min_df = 7)
count = cv.fit_transform(train_text_list + test_text_list)
train_count = count[0 : len(train_text_list)]
test_count = count[len(train_text_list) : len(train_text_list) + len(test_text_list)]

# Calculate TF-IDF
tfidf = TfidfTransformer()
train_tfidf_matrix = tfidf.fit_transform(train_count)
# Note: fit_transform refits the IDF weights on the test set here;
# tfidf.transform(test_count) would reuse the weights learned from the training set
test_tfidf_matrix = tfidf.fit_transform(test_count)
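A more conventional arrangement, shown here only as a sketch and not as the code behind the reported scores, fits both the vocabulary and the IDF weights on the training texts alone and then applies them unchanged to the test texts. The toy texts below are invented, and the vectorizer is left at its defaults rather than using the max_features, max_df, and min_df settings above.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy texts, invented for illustration
train_texts = ["free cash call now", "call me after lunch", "win free cash"]
test_texts = ["free lunch now", "unseen words only"]

cv = CountVectorizer()
train_count = cv.fit_transform(train_texts)     # learn the vocabulary from the training texts
test_count = cv.transform(test_texts)           # reuse it; words never seen in training are ignored

tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(train_count)  # learn IDF weights on the training set
test_tfidf = tfidf.transform(test_count)        # apply the same IDF weights to the test set

print(train_tfidf.shape, test_tfidf.shape)
```

With this arrangement the two matrices share the same columns and the same per-word weights, so a model trained on one can be applied to the other without any information from the test set entering the features.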

(2) Model training: MultinomialNB() was called to implement a Naive Bayes model that assumes a multinomial distribution over the features.

from sklearn.naive_bayes import MultinomialNB

# Train Bayes model on the training set
bayes_model = MultinomialNB()
bayes_model.fit(train_tfidf_matrix, train_data_df['Label'].tolist())

# Get the score on training set
score = bayes_model.score(train_tfidf_matrix, train_data_df['Label'].tolist())
print(score)

# Get the prediction result
y_pred = bayes_model.predict(test_tfidf_matrix)

3.Result

(1) Training time

The total training and prediction time is 0.27s

(2) Accuracy on the training set

The accuracy of prediction on the training set is 0.9901.

(3) Accuracy on the test set

The accuracy of prediction on the test set is 0.9847.

Logistic regression

1.Theory

By constructing TF-IDF as input features in the manner described above, it can be transformed into a classification problem with multidimensional features, and a logistic regression model can be used for prediction.

2.Code

Critical Code:

(1) Data Processing: The same as for the naive Bayes model based on TF-IDF.

(2) Model training: LogisticRegressionCV was called to implement the logistic regression model

from sklearn.linear_model import LogisticRegressionCV

# Train a logistic regression model on the training set
lr_model = LogisticRegressionCV(max_iter = 100000)
lr_model.fit(train_tfidf_matrix, train_data_df['Label'].tolist())

# Get the score on training set
score = lr_model.score(train_tfidf_matrix, train_data_df['Label'].tolist())
print(score)

# Get the prediction result
y_pred = lr_model.predict(test_tfidf_matrix)

3.Result

(1) Training time

The total training and prediction time is 1.63s

(2) Accuracy on the training set

The accuracy of prediction on the training set is 1.0

(3) Accuracy on the test set

The accuracy of prediction on the test set is 0.9865

SVM

1.Theory

By constructing TF-IDF as input features in the manner described above, it can be transformed into a classification problem with multidimensional features, and an SVM model can be used for prediction.

2.Code

Critical Code:

(1) Data Processing: This is basically the same as for the TF-IDF-based Bayesian and logistic regression models; the only difference is that ‘ham’ and ‘spam’ are converted to 0 and 1.

(2) Model training: Only the model being called changes; the other steps are basically the same.

from sklearn.svm import LinearSVC

# Convert the Label strings 'ham' and 'spam' to 0.0 and 1.0, respectively
train_y = [float(label == 'spam') for label in train_data_df['Label']]

# Train a linear support vector machine on the training set
SVR_model = LinearSVC()
SVR_model.fit(train_tfidf_matrix, train_y)

# Get the prediction result
y_pred = SVR_model.predict(test_tfidf_matrix)

3.Result

(1) Training time

The total training and prediction time is 0.35s

(2) Accuracy on the training set

The accuracy of prediction on the training set is 0.9978

(3) Accuracy on the test set

The accuracy of prediction on the test set is 0.9874

Other classification models (Decision Tree, Random Forest, Multilayer Perceptron)

1.Theory

Their data processing is roughly the same as for the TF-IDF-based Bayes, logistic regression, and SVM models above, and each one simply calls a library implementation to solve a multidimensional-feature classification problem, so they are not elaborated further.

2.Code

Decision Tree:

from sklearn.tree import DecisionTreeRegressor

# Train a decision tree model on the training set
DTR_model = DecisionTreeRegressor()
DTR_model.fit(train_tfidf_matrix, train_y)

# Get the prediction result
y_pred = DTR_model.predict(test_tfidf_matrix)

Random Forest:

from sklearn.ensemble import RandomForestClassifier

# Train a Random Forest model on the training set
RF_model = RandomForestClassifier()
RF_model.fit(train_tfidf_matrix, train_y)

# Get the prediction result
y_pred = RF_model.predict(test_tfidf_matrix)

Multilayer Perceptron:

from sklearn.neural_network import MLPRegressor

# Train an MLP model on the training set
MLP_model = MLPRegressor()
MLP_model.fit(train_tfidf_matrix, train_y)

# Get the prediction result (a continuous value per email, not a class label)
y_pred = MLP_model.predict(test_tfidf_matrix)
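DecisionTreeRegressor and MLPRegressor are regressors, so their predictions are continuous values rather than class labels and need to be thresholded before accuracy can be computed; the five-model code later in this post thresholds the MLP output at 0.5. A minimal sketch of that step follows. The helper names, scores, and labels are all made up for illustration.

```python
def to_labels(scores, threshold=0.5):
    """Map continuous regressor outputs to 'spam' / 'ham' by thresholding."""
    return ['spam' if s > threshold else 'ham' for s in scores]

def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

scores = [0.93, 0.08, 0.51, -0.02, 1.07]        # hypothetical regressor outputs
truth = ['spam', 'ham', 'ham', 'ham', 'spam']   # hypothetical true labels

labels = to_labels(scores)
print(labels)                    # ['spam', 'ham', 'spam', 'ham', 'spam']
print(accuracy(labels, truth))   # 0.8
```

Regressor outputs can fall outside the interval from 0 to 1, as the last two scores show, which is harmless once a threshold is applied.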

3.Result

Decision Tree:

Training time is 0.72s.

The accuracy of prediction on the training set is 0.9981.

Accuracy on the test set is 0.9390.

Random Forest:

Training time is 3.37s.

The accuracy of prediction on the training set is 0.9997.

The accuracy of prediction on the test set is 0.9874.

Multilayer Perceptron:

Training time is 8.51s.

The accuracy of prediction on the training set is 0.9968.

The accuracy of prediction on the test set is 0.9919.

Combination Model

1.Theory

I have implemented a Bayesian model and a logistic regression model based on TF-IDF, and their prediction accuracies are close.

To combine them, a natural idea is to let the two models predict separately. If the two predictions agree, that label is the final result. Otherwise, some other method is needed to decide whether the email is spam.

I printed out the text of every email on which the two models disagreed.

I found that most of these samples are spam. My guess is that both models tend to predict some spam emails as normal emails, so I adopted a simple rule: every email on which the two models disagree is treated as spam.

I also tried predicting with the Bayesian model, the logistic regression model, and the support vector machine at the same time and taking the most frequent of the three labels as the final classification, but performance on the test set did not improve. However, when every sample on which the three models disagree is treated as spam, the score on the test set does improve.

Afterwards, I tried predicting with five models at once: Bayesian, logistic regression, SVM, random forest, and MLP. Taking a vote among the five predictions gave no improvement in accuracy on the test set. Since the models tend to predict some spam emails as normal emails, I instead labeled an email as spam if any of the five models predicted spam, and accuracy on the test set improved.
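The two combination rules described above, majority voting and "spam if any model says spam", differ only in the threshold applied to the number of spam votes. A minimal sketch, with function names and vote patterns invented for illustration:

```python
def majority_vote(votes):
    """votes: 0/1 predictions from the models for one email (1 = spam)."""
    return 'spam' if sum(votes) * 2 > len(votes) else 'ham'

def any_spam(votes):
    """Label as spam as soon as a single model predicts spam."""
    return 'spam' if sum(votes) >= 1 else 'ham'

# Hypothetical votes of five models on four emails
per_email_votes = [
    [1, 1, 1, 1, 1],   # all models agree on spam
    [0, 0, 0, 0, 0],   # all models agree on ham
    [1, 0, 0, 1, 0],   # two of five say spam
    [0, 0, 0, 0, 1],   # only one model says spam
]

print([majority_vote(v) for v in per_email_votes])  # ['spam', 'ham', 'ham', 'ham']
print([any_spam(v) for v in per_email_votes])       # ['spam', 'ham', 'spam', 'spam']
```

The "any" rule trades more false alarms for fewer missed spam emails, which helps only when the individual models err mostly in the direction of missing spam.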

2.Code

Critical Code:

(1) Bayes and logistic regression combination model:

# Train a logistic regression model on the training set
lr_model = LogisticRegressionCV(max_iter = 100000)
lr_model.fit(train_tfidf_matrix, train_data_df['Label'].tolist())

# Train a Bayes model on the training set
bayes_model = MultinomialNB()
bayes_model.fit(train_tfidf_matrix, train_data_df['Label'].tolist())

# Get the prediction result
y_logistic_pred = lr_model.predict(test_tfidf_matrix)
y_bayes_pred = bayes_model.predict(test_tfidf_matrix)

# Get the result
for index, row in test_data_df.iterrows():
    id = row['ID']
    label1 = y_logistic_pred[index]
    label2 = y_bayes_pred[index]
    if label1 == label2:  # Compare the prediction results of the two models
        label = label1  # If they are the same, use it as the prediction result
    else:  # Otherwise, print the email and mark it as spam
        print(row['Email'])
        label = 'spam'
    new_row = pd.DataFrame([[id, label]], columns = ['ID', 'Label'])
    result_df = pd.concat([result_df, new_row], ignore_index = True)

(2) Five models combination:

from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegressionCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestClassifier

# Train the five models on the 0/1 labels in train_y
SVR_model = LinearSVC()
SVR_model.fit(train_tfidf_matrix, train_y)

lr_model = LogisticRegressionCV(max_iter = 100000)
lr_model.fit(train_tfidf_matrix, train_y)

bayes_model = MultinomialNB()
bayes_model.fit(train_tfidf_matrix, train_y)

MLP_model = MLPRegressor()
MLP_model.fit(train_tfidf_matrix, train_y)

RF_model = RandomForestClassifier()
RF_model.fit(train_tfidf_matrix, train_y)

# Predictions on the test set
y_logistic_pred = lr_model.predict(test_tfidf_matrix)
y_bayes_pred = bayes_model.predict(test_tfidf_matrix)
y_SVR_pred = SVR_model.predict(test_tfidf_matrix)
y_RF_pred = RF_model.predict(test_tfidf_matrix)
y_MLP_pred = MLP_model.predict(test_tfidf_matrix)

# Predictions on the training set
y_logistic_train = lr_model.predict(train_tfidf_matrix)
y_bayes_train = bayes_model.predict(train_tfidf_matrix)
y_SVR_train = SVR_model.predict(train_tfidf_matrix)
y_RF_train = RF_model.predict(train_tfidf_matrix)
y_MLP_train = MLP_model.predict(train_tfidf_matrix)

# Accuracy of the combined rule on the training set
right_cnt = 0
wrong_cnt = 0
for index, row in train_data_df.iterrows():
    # MLPRegressor outputs a continuous value, so threshold it at 0.5
    y_MLP = 1 if y_MLP_train[index] > 0.5 else 0
    # Label the email as spam if at least one model predicts spam
    if y_logistic_train[index] + y_bayes_train[index] + y_SVR_train[index] + y_RF_train[index] + y_MLP >= 1.0:
        label = 'spam'
    else:
        label = 'ham'
    if label == row['Label']:
        right_cnt += 1
    else:
        wrong_cnt += 1
print(right_cnt / (right_cnt + wrong_cnt))

# Apply the same rule to the test set
for index, row in test_data_df.iterrows():
    y_MLP = 1 if y_MLP_pred[index] > 0.5 else 0
    if y_logistic_pred[index] + y_bayes_pred[index] + y_SVR_pred[index] + y_RF_pred[index] + y_MLP >= 1.0:
        label = 'spam'
    else:
        label = 'ham'
    id = row['ID']
    new_row = pd.DataFrame([[id, label]], columns = ['ID', 'Label'])
    result_df = pd.concat([result_df, new_row], ignore_index = True)

3.Result

(1) Training time

Combination of two models: The total training and prediction time is 1.91s.
Combination of five models: The total training and prediction time is 12.20s.

(2) Accuracy on the training set

Combination of two models: The accuracy of prediction on the training set is 1.0.
Combination of five models: The accuracy of prediction on the training set is 0.9991.

(3) Accuracy on the test set

Combination of two models: The accuracy of prediction on the test set is 0.9883.
Combination of five models: The accuracy of prediction on the test set is 0.9919.

Result and Summary

Ranking

As shown above, among all the prediction results I submitted, the highest accuracy on the test set was 0.9919, ranking 11th.

Comparison of different models

| Model | Training time | Accuracy on the training set | Accuracy on the test set |
|---|---|---|---|
| Phone Number Checker | - | 0.9457 | 0.9434 |
| Bayes Classifier based on Bag of Words | 6.44s | 0.9850 | 0.9587 |
| Bayes Classifier based on TF-IDF | 0.27s | 0.9901 | 0.9847 |
| Logistic Regression | 1.63s | 1.0000 | 0.9865 |
| SVM | 0.35s | 0.9978 | 0.9847 |
| Decision Tree | 0.72s | 0.9981 | 0.9390 |
| Random Forest | 3.37s | 0.9997 | 0.9847 |
| MLP | 8.51s | 0.9968 | 0.9883 |
| bayes-LR Combination Model | 1.91s | 1.0000 | 0.9883 |
| bayes-LR-SVR-RF-MLP Combination Model | 12.20s | 0.9991 | 0.9919 |

Summary

Phone number checking was a small attempt made by observing the characteristics of spam emails in the training set, and the effect was quite good.

The accuracy of the Bayesian classifier based on TF-IDF is much higher than that based on bag of words, and the former is also much faster in training speed than the latter.

Among these TF-IDF based models, Bayesian classifiers, logistic regression, support vector machines, random forests, and multi-layer perceptrons all have good accuracy on the training set. Among them, the multi-layer perceptron has the longest training time and the highest accuracy on the test set. The training time for Random Forest is the second longest. The accuracy of logistic regression is slightly better than SVM and Bayesian classifiers, but the training speed is slower compared to them.

Originally, it was expected that the accuracy of the multi-layer perceptron would not be very high because its expressive power was too strong and it was prone to overfitting. However, surprisingly, it was the model with the best accuracy on the test set.

The decision tree exhibits overfitting, which may be related to its strong expressive ability and small dataset size.

The combination of the Bayes and logistic regression models also has good accuracy on the test set, and the five-model combination achieves the best accuracy, which suggests that the individual models do tend to predict spam emails as normal emails on the test set.
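One way to check this tendency directly, on any split where the true labels are available (for example a held-out part of the training set), is to count the two kinds of error separately. The sketch below uses invented labels and a hypothetical helper name, `error_breakdown`.

```python
from collections import Counter

def error_breakdown(predicted, actual):
    """Count how often each (actual, predicted) pair of labels occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical labels for eight emails
actual =    ['spam', 'spam', 'spam', 'ham', 'ham', 'ham',  'ham', 'spam']
predicted = ['spam', 'ham',  'ham',  'ham', 'ham', 'spam', 'ham', 'spam']

counts = error_breakdown(predicted, actual)
missed_spam = counts[('spam', 'ham')]   # spam predicted as ham
false_alarm = counts[('ham', 'spam')]   # ham predicted as spam
spam_recall = counts[('spam', 'spam')] / actual.count('spam')

print(missed_spam, false_alarm, spam_recall)  # 2 1 0.5
```

If missed spam clearly outnumbers false alarms for every individual model, treating any spam vote as decisive is a reasonable correction; if not, the rule is more likely to hurt.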

Areas that can be improved:

  1. Each model is called directly with its default parameters, without any tuning. Some models may perform better after parameter tuning.

  2. The text preprocessing is relatively rough and does not recognize and process special strings such as URLs, links, and garbled characters.

  3. Each model was trained only once, so the ‘Training time’ figures may not be precise.

  4. The “combination” of models is relatively simple and crude, and the single models’ “tendency to predict spam as normal mail” has not been explained in principle.

  5. In terms of interpretability, perhaps one or more combined formulas could be used for prediction, which would be easier to interpret, monitor, and correct.
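For the timing point in particular, repeating each run and reporting a summary gives more stable numbers than a single measurement. A minimal sketch using only the standard library follows; `time_runs` and `dummy_training` are hypothetical names, and the dummy function stands in for a real training-plus-prediction call.

```python
import statistics
import time

def time_runs(fn, repeats=5):
    """Run fn() `repeats` times and return (median, min, max) wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), min(durations), max(durations)

def dummy_training():
    # Placeholder workload standing in for model.fit(...) and model.predict(...)
    sum(i * i for i in range(100_000))

median, fastest, slowest = time_runs(dummy_training)
print(f"median {median:.4f}s, min {fastest:.4f}s, max {slowest:.4f}s")
```

The median is less sensitive than the mean to an occasional slow run caused by other processes on the machine.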


http://example.com/2024/07/26/Spam-classification-using-classic-models/
Author: John Doe
Posted on: July 26, 2024