SMS Spam Detection & Classification

SMS Spam Detection is a machine learning project that classifies SMS messages as spam or ham (non-spam). As the volume of spam grows, a filter like this helps protect users from unwanted or potentially harmful messages. The project uses Natural Language Processing (NLP) techniques and Naive Bayes classifiers to detect spam.


  • Data Cleaning & Preprocessing:
    • Convert messages to lowercase
    • Tokenization of text
    • Remove special characters, punctuation, and stopwords
    • Apply stemming for word normalization
  • Feature Extraction:
    • Convert text into numerical data using TF-IDF Vectorization
    • Scale features to the [0, 1] range with MinMaxScaler
  • Machine Learning Models:
    • Gaussian Naive Bayes (GNB)
    • Multinomial Naive Bayes (MNB)
    • Bernoulli Naive Bayes (BNB)
  • Exploratory Data Analysis (EDA):
    • Visualize spam vs ham distribution using pie charts
    • Analyze message length, number of words, and sentences
    • Generate WordClouds of frequent words in spam and ham messages
    • Plot correlation heatmaps and histograms
  • Evaluation Metrics:
    • Accuracy
    • Precision
    • Confusion Matrix
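
The cleaning steps listed above (lowercasing, tokenization, removal of punctuation and stopwords, stemming) could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `transform_text`, the regex tokenizer, and the small hand-picked `STOPWORDS` set are assumptions made so the snippet runs without downloading NLTK corpora. Only `PorterStemmer` comes from NLTK here.

```python
import re

from nltk.stem import PorterStemmer

# Assumption: a tiny hand-picked set standing in for NLTK's English
# stopword corpus, which would require a separate download.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "your",
             "have", "for", "and", "of", "in", "at"}

stemmer = PorterStemmer()


def transform_text(text: str) -> str:
    """Hypothetical helper: clean one SMS message and return it as a string."""
    # 1. Lowercase the message.
    text = text.lower()
    # 2. Tokenize; keeping only alphanumeric runs also drops punctuation
    #    and special characters.
    tokens = re.findall(r"[a-z0-9]+", text)
    # 3. Remove stopwords.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 4. Stem each remaining token.
    return " ".join(stemmer.stem(t) for t in tokens)


print(transform_text("You have WON a free prize!!! Call now."))
```

Returning a single space-joined string keeps the output ready for a vectorizer that expects raw text per message.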

  1. Programming Language: Python
  2. Libraries: Pandas, NumPy, Matplotlib, Seaborn, NLTK, Scikit-learn, WordCloud
  3. Tools: Jupyter Notebook or any Python IDE

  1. Load the SMS dataset (spam.csv) and perform data cleaning.
  2. Remove duplicates and handle missing values.
  3. Convert the target labels (spam/ham) into numerical format.
  4. Preprocess the text: lowercase, tokenize, remove stopwords and punctuation, and apply stemming.
  5. Visualize data with charts, histograms, and WordClouds to identify patterns.
  6. Convert text messages into numerical features using TF-IDF Vectorization.
  7. Scale the features using MinMaxScaler.
  8. Split the dataset into training and testing sets (80/20).
  9. Train Naive Bayes models and evaluate using accuracy, precision, and confusion matrix.
  10. Analyze results and draw insights for spam detection.
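
Steps 6 to 9 above might look something like the sketch below. The messages and labels are made-up toy data used only so the snippet runs on its own, and `random_state=2` and `stratify` are assumptions, so the scores it prints say nothing about the real dataset. The order (vectorize, scale, then split) mirrors the steps as listed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler

# Toy data for illustration only (0 = ham, 1 = spam).
messages = [
    "win a free prize now", "free cash call now", "claim your free reward",
    "urgent winner call this number", "cheap loans text now",
    "see you at lunch", "are you coming home", "call me when you land",
    "meeting moved to monday", "thanks for the help",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Step 6: TF-IDF features. GaussianNB needs a dense array, hence toarray().
X = TfidfVectorizer().fit_transform(messages).toarray()

# Step 7: scale every feature to [0, 1].
X = MinMaxScaler().fit_transform(X)

# Step 8: 80/20 split (stratify and random_state are assumptions).
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=2, stratify=labels
)

# Step 9: train each Naive Bayes variant and collect the metrics.
results = {}
for model in (GaussianNB(), MultinomialNB(), BernoulliNB()):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[type(model).__name__] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_test, y_pred, labels=[0, 1]),
    }

for name, metrics in results.items():
    print(name, metrics["accuracy"], metrics["precision"])
    print(metrics["confusion_matrix"])
```

Precision is worth tracking alongside accuracy because a false positive means a legitimate message is hidden from the user. A common variation is to fit the vectorizer and scaler on the training split only, so no information from the test messages reaches the model.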

  • Source: SMS Spam Collection Dataset (spam.csv)
  • Columns:
    • v1: Target label (spam or ham)
    • v2: SMS text message
  • Preprocessing:
    • Dropped unnecessary columns
    • Label encoding for target column (0 = ham, 1 = spam)
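
As a rough sketch of the loading and encoding described above, the snippet below reads a small inline CSV that imitates the `v1`/`v2` layout instead of the real `spam.csv`. The `extra` column, the sample rows, and the `target` column name are assumptions for illustration. With the real file, the call would be along the lines of `pd.read_csv("spam.csv")`, possibly with an explicit `encoding` argument.

```python
import io

import pandas as pd

# Assumption: an inline stand-in for spam.csv with one unneeded column.
raw = io.StringIO(
    "v1,v2,extra\n"
    "ham,See you at lunch,\n"
    "spam,WIN a free prize now,\n"
    "ham,See you at lunch,\n"
)
df = pd.read_csv(raw)

# Keep only the label and text columns, then drop duplicate and missing rows.
df = df[["v1", "v2"]].drop_duplicates().dropna().reset_index(drop=True)

# Encode the target: 0 = ham, 1 = spam.
df["target"] = df["v1"].map({"ham": 0, "spam": 1})

print(df)
```

An explicit mapping makes the 0/1 assignment visible in the code. scikit-learn's `LabelEncoder` would give the same result here because it assigns codes in sorted label order.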



⚠️ Note: This project is for educational purposes only. Not for commercial sale.