Students will:
- Load a spam email dataset (1000 emails, 10 features)
- Train a Naive Bayes classifier
- Train a k-Nearest Neighbors classifier
- Train an SVM classifier
- Compare the results
Time Required: 3-5 hours
3480project-classification1/
├── data/
│ └── spam_data.csv # 1000 emails, 10 features
├── notebooks/
│ └── spam_classification.ipynb # Main notebook (This is where you do your work)
└── README.md # This file
- Import libraries
- Load the spam dataset
- View class distribution
- Clean dataset
- Split into train/test sets (70/30)
- Scale features for SVM
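The preparation steps above might look like the sketch below. It uses a synthetic stand-in for `data/spam_data.csv` (the real notebook would call `pd.read_csv`); the column names follow the dataset description, and the 70/30 split and scaling match the steps listed:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for data/spam_data.csv (1000 emails, 10 features)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.random((1000, 10)),
                 columns=["word_free", "word_money", "word_winner", "word_click",
                          "word_urgent", "num_exclamation", "num_dollar",
                          "num_capitals", "email_length", "has_link"])
y = (rng.random(1000) < 0.4).astype(int)  # ~40% spam, matching the distribution

# 70/30 train/test split; fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

# Scale features for SVM (and kNN); fit the scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (700, 10) (300, 10)
```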
- Create Naive Bayes Classifier model
- Make predictions
- View confusion matrix
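A minimal Naive Bayes sketch for these steps, using `GaussianNB` (a reasonable choice for numerical features; the notebook may specify a different variant) on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix

# Synthetic stand-in for the spam features (1000 samples, 10 features)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

# Create and train the Naive Bayes model (no scaling needed)
nb = GaussianNB()
nb.fit(X_train, y_train)

# Make predictions and view the confusion matrix
y_pred = nb.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)  # rows = true class, columns = predicted class
```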
- Determine the best value of $k$
- Create kNN model
- Make predictions
- View confusion matrix
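One common way to pick $k$ (a sketch, not necessarily the method the notebook prescribes) is cross-validated accuracy over odd values of $k$ on the scaled training set:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data; kNN needs scaled features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Determine the best value of k via 5-fold cross-validation (odd k avoids ties)
best_k, best_score = None, -1.0
for k in range(1, 22, 2):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            X_train_s, y_train, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

# Create the kNN model, make predictions, view the confusion matrix
knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train_s, y_train)
print(best_k, confusion_matrix(y_test, knn.predict(X_test_s)), sep="\n")
```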
- Create SVM model with RBF kernel
- Make predictions
- View confusion matrix
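The SVM steps can be sketched as below; note that the model is trained and evaluated on scaled features, again using synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
scaler = StandardScaler().fit(X_train)

# Create an SVM with RBF kernel and train it on scaled features
svm = SVC(kernel="rbf", random_state=42)
svm.fit(scaler.transform(X_train), y_train)

# Make predictions and view the confusion matrix
cm = confusion_matrix(y_test, svm.predict(scaler.transform(X_test)))
print(cm)
```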
- Compare accuracy, precision, recall, F1-score
- Create comparison visualization
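A sketch of the comparison step (synthetic stand-in data; the bar chart is one possible visualization, saved to a hypothetical `model_comparison.png`):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Synthetic stand-in data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
scaler = StandardScaler().fit(X_train)

models = {"Naive Bayes": GaussianNB().fit(X_train, y_train),
          "SVM (RBF)": SVC(kernel="rbf", random_state=42)
                       .fit(scaler.transform(X_train), y_train)}

# Compute accuracy, precision, recall, F1 for each model
rows = {}
for name, model in models.items():
    X_eval = scaler.transform(X_test) if name.startswith("SVM") else X_test
    y_pred = model.predict(X_eval)
    rows[name] = [accuracy_score(y_test, y_pred),
                  precision_score(y_test, y_pred),
                  recall_score(y_test, y_pred),
                  f1_score(y_test, y_pred)]

table = pd.DataFrame(rows, index=["Accuracy", "Precision", "Recall", "F1"]).T
print(table)

# Comparison visualization: grouped bar chart (skipped if matplotlib is absent)
try:
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    table.plot.bar(rot=0)
    plt.savefig("model_comparison.png")
except ImportError:
    pass
```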
- Which model performed best?
- What is the difference between the metrics?
- When should you use each algorithm?
Email Spam Classification Dataset
- Size: 1,000 emails
- Features: 10 numerical features
- word_free, word_money, word_winner, word_click, word_urgent
- num_exclamation, num_dollar, num_capitals
- email_length, has_link
- Target: is_spam (0=Ham, 1=Spam)
- Distribution: 60% Ham, 40% Spam
After completing this project, students will understand:
- How to implement classification algorithms with scikit-learn
- The differences between Naive Bayes, kNN, and SVM
- How to evaluate models using multiple metrics
- When to use different algorithms
- Data Loading & Preparation (20 points)
  - Correctly load and split data
  - Apply appropriate preprocessing
- Naive Bayes Implementation (20 points)
  - Train model correctly
  - Generate predictions and confusion matrix
- SVM Implementation (20 points)
  - Train model with scaled data
  - Generate predictions and confusion matrix
- Model Comparison (20 points)
  - Calculate all metrics
  - Create comparison visualization
- Discussion Questions (20 points)
  - Answer all 4 questions thoughtfully
  - Demonstrate understanding of concepts
Students should see:
- Both models achieving 80-95% accuracy
- SVM typically slightly more accurate
- Similar precision and recall values
- Clear differences in confusion matrices
- Type: Probabilistic classifier
- Assumption: Features are independent
- Pros: Fast, simple, works well with limited data
- Cons: Independence assumption rarely holds
- Type: Instance-based/lazy learner
- Assumption: Similar data points belong to the same class
- Pros: Simple, no training phase, works well with irregular decision boundaries, naturally handles multi-class
- Cons: Slow predictions, sensitive to irrelevant features, requires feature scaling, memory intensive (stores all training data)
- Type: Margin-based classifier
- Goal: Find optimal hyperplane separating classes
- Pros: Effective in high-dimensional spaces, handles non-linear data (with kernels)
- Cons: Requires feature scaling, slower training
- Accuracy: Overall correctness
- Precision: Of predicted spam, how many were actually spam?
- Recall: Of actual spam, how many did we catch?
- F1-Score: Balance between precision and recall
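These four metrics can be computed directly from the confusion-matrix counts. A worked example with hypothetical counts (110 true positives, 10 false positives, 15 false negatives, 165 true negatives; 300 test emails in total):

```python
# Hypothetical confusion-matrix counts for the spam (positive) class
tp, fp, fn, tn = 110, 10, 15, 165

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
precision = tp / (tp + fp)   # of predicted spam, how many were spam?
recall    = tp / (tp + fn)   # of actual spam, how many did we catch?
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(accuracy, 3), round(precision, 3),
      round(recall, 3), round(f1, 3))  # 0.917 0.917 0.88 0.898
```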
Problem: Can't import sklearn
- Run `pip install scikit-learn`
Problem: Kernel keeps dying
- Reduce dataset size
- Close other programs
- Restart Jupyter
Problem: Different results each time
- Check that `random_state=42` is set everywhere
- Run cells in order - Don't skip ahead
- Read error messages - They usually tell you what's wrong
- Compare your metrics - Both models should work reasonably well
- Think about the questions - There's no single "right" answer
- Notebook runs completely without errors
- All cells have output visible
- Comparison tables (confusion matrix and classification report) are generated
- All 4 discussion questions are answered
- Your name and date are filled in
Contact me at any time if you have questions. The best time to reach me is during my office hours.
Good luck! 🚀