Thesis Details

Measuring Emails Against Phishing Tactics using Natural Language Processing (NLP)

Patrick Farah

Submission Year: 2025

Abstract

Phishing remains a predominant cybersecurity threat, exploiting human vulnerabilities through fraudulent emails that imitate legitimate communication. This thesis introduces a hybrid framework that uses Natural Language Processing (NLP) to identify phishing emails by examining both lexical patterns and psychological manipulation strategies. The model combines conventional TF-IDF vectorization with handcrafted features that capture social engineering cues, including urgency, authority, and manipulation signals. Several phishing and legitimate email sources, including the Enron email corpus and publicly accessible phishing archives, were cleaned, merged, and transformed into a binary-labeled dataset of over 10,000 email samples. A logistic regression classifier trained on TF-IDF vectors served as the baseline and was subsequently augmented with social engineering feature counts. The hybrid model attained a detection accuracy of 96% with low rates of false positives and false negatives, underscoring the value of enhancing lexical models with domain-specific psychological attributes. The evaluation metrics, comprising precision, recall, F1-score, and a confusion matrix, support the model's robustness and reliability. This study provides a practical and interpretable solution for phishing detection and establishes a basis for future integration with transformer-based models and domain verification systems. Future work includes applying explainable AI methods and deploying the system as a web-based interface featuring a user-oriented Suspicion Meter.
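
The social engineering feature counts mentioned in the abstract could be computed along the following lines. This is a minimal sketch, not the thesis's implementation: the cue lexicons, the `CUE_LEXICONS` name, and the `cue_counts` function are hypothetical, since the abstract does not list the actual terms or code used.

```python
import re

# Hypothetical cue lexicons, chosen for illustration only.
# The thesis abstract names the categories but not the terms.
CUE_LEXICONS = {
    "urgency": ["urgent", "immediately", "now", "expires", "within 24 hours"],
    "authority": ["bank", "administrator", "security team", "ceo", "it department"],
    "manipulation": ["verify your account", "click here", "suspended",
                     "confirm your password"],
}


def cue_counts(text):
    """Count whole-word occurrences of each cue category in an email body."""
    lowered = text.lower()
    counts = {}
    for category, phrases in CUE_LEXICONS.items():
        total = 0
        for phrase in phrases:
            # Word boundaries keep "now" from matching inside "known".
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += len(re.findall(pattern, lowered))
        counts[category] = total
    return counts


if __name__ == "__main__":
    email = ("URGENT: your account is suspended. "
             "Click here immediately to verify your account.")
    print(cue_counts(email))
```

In a hybrid setup such as the one the abstract describes, these per-category counts would be appended as extra columns to the TF-IDF matrix before fitting the logistic regression classifier. One possible route, assumed here rather than stated in the source, is scikit-learn's `TfidfVectorizer` combined with `scipy.sparse.hstack`.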

