SMS Spam Detection Using Python and Machine Learning

Project Overview

SMS Spam Detection is a machine learning project that automatically classifies SMS messages as spam or ham (non-spam). As the volume of spam grows, such a filter helps protect users from unwanted or potentially harmful messages. The project combines Natural Language Processing (NLP) techniques with Naive Bayes classifiers to detect spam.

Key Features

Data Cleaning & Preprocessing:
- Convert messages to lowercase
- Tokenization of text
- Remove special characters, punctuation, and stopwords
- Apply stemming for word normalization
Feature Extraction:
- Convert text into numerical data using TF-IDF Vectorization
- Scale features to the [0, 1] range with MinMaxScaler
Machine Learning Models:
- Gaussian Naive Bayes (GNB)
- Multinomial Naive Bayes (MNB)
- Bernoulli Naive Bayes (BNB)
Exploratory Data Analysis (EDA):
- Visualize spam vs ham distribution using pie charts
- Analyze message length, number of words, and sentences
- WordClouds for frequent words in spam and ham messages
- Correlation heatmaps and histograms
Evaluation Metrics:
- Accuracy
- Precision
- Confusion Matrix
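
To make the evaluation metrics above concrete, here is a minimal sketch that computes them by hand. The labels are made up for illustration (1 = spam, 0 = ham) and are not results from this project. Precision deserves attention in spam filtering because every false positive is a legitimate message flagged as spam.

```python
# Hypothetical labels for illustration only: 1 = spam, 0 = ham.
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham passed through
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham flagged as spam
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed

# Rows are true classes, columns are predicted classes.
confusion = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp) if (tp + fp) else 0.0     # share of flagged messages that are spam

print(confusion)                                     # [[4, 1], [1, 2]]
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")  # accuracy=0.75 precision=0.67
```

Scikit-learn's `accuracy_score`, `precision_score`, and `confusion_matrix` (in `sklearn.metrics`) return the same values for these inputs.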

Technologies Used

Programming Language: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn, NLTK, Scikit-learn, WordCloud
Tools: Jupyter Notebook or any Python IDE

Project Workflow

Load the SMS dataset (spam.csv).
Clean the data: remove duplicates and handle missing values.
Convert the target labels (spam/ham) into numerical format.
Perform text preprocessing: lowercasing, tokenization, stopword and punctuation removal, and stemming.
Visualize data with charts, histograms, and WordClouds to identify patterns.
Convert text messages into numerical features using TF-IDF Vectorization.
Scale the features to the [0, 1] range using MinMaxScaler, which keeps them non-negative as Multinomial Naive Bayes requires.
Split the dataset into training and testing sets (80/20).
Train the Gaussian, Multinomial, and Bernoulli Naive Bayes models and evaluate each one using accuracy, precision, and a confusion matrix.
Analyze results and draw insights for spam detection.
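
The workflow above can be sketched end to end as follows. This is a minimal illustration under several assumptions, not the project's actual notebook: a tiny in-memory DataFrame stands in for spam.csv, the column names `label` and `text` are hypothetical, a regex replaces NLTK's tokenizer, scikit-learn's built-in `ENGLISH_STOP_WORDS` replaces NLTK's stopword corpus (so no corpus download is needed), and the stratified split with a fixed `random_state` is an added choice so the toy test set contains both classes.

```python
import re

import pandas as pd
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for spam.csv; column names "label" and "text" are assumptions.
df = pd.DataFrame({
    "label": ["ham", "spam", "ham", "spam", "ham", "spam",
              "ham", "spam", "ham", "spam", "spam"],
    "text": [
        "Are we still meeting for lunch today?",
        "WINNER! You have won a free prize. Call now to claim.",
        "I'll call you when I get home.",
        "Free entry in a weekly cash prize draw. Text WIN to claim.",
        "Can you send me the notes from class?",
        "Urgent: your account has won a cash reward. Call now!",
        "Happy birthday! Hope you have a great day.",
        "Claim your free ringtone now. Reply YES to this message.",
        "Running late, see you in ten minutes.",
        "Congratulations, you won free tickets! Text now to claim your prize.",
        "WINNER! You have won a free prize. Call now to claim.",  # duplicate row
    ],
})

# Cleaning: drop rows with missing values, then drop exact duplicates.
df = df.dropna(subset=["label", "text"]).drop_duplicates().reset_index(drop=True)

# Encode the target: ham -> 0, spam -> 1.
df["target"] = df["label"].map({"ham": 0, "spam": 1})

stemmer = PorterStemmer()

def transform_text(text: str) -> str:
    """Lowercase, tokenize, drop punctuation and stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # regex tokenizer (assumption)
    kept = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in kept)

df["clean_text"] = df["text"].apply(transform_text)

# TF-IDF features; converted to a dense array because MinMaxScaler and
# GaussianNB do not accept sparse input.
X = TfidfVectorizer().fit_transform(df["clean_text"]).toarray()
X = MinMaxScaler().fit_transform(X)
y = df["target"].to_numpy()

# 80/20 split; stratify and random_state are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)

models = {"GNB": GaussianNB(), "MNB": MultinomialNB(), "BNB": BernoulliNB()}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }
    print(name, results[name]["accuracy"], results[name]["precision"])
    print(results[name]["confusion_matrix"])
```

The sketch follows the listed order and fits the vectorizer and scaler on the full dataset before splitting. A stricter variant would fit both on the training split only and then apply them to the test split, which keeps test-set statistics out of the features. Scores on a toy dataset this small say nothing about real performance.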