Measuring Emails Against Phishing Tactics Using Natural
Language Processing (NLP)
Phishing continues to be a predominant cybersecurity threat, capitalizing on human
vulnerabilities through fraudulent emails that imitate authentic communication. This
thesis introduces a hybrid framework utilizing Natural Language Processing (NLP) to
identify phishing emails through the examination of lexical patterns and psychological
manipulation strategies. The model combines conventional TF-IDF vectorization with
handcrafted features that capture social engineering cues, including urgency,
authority, and manipulation signals. Several phishing and legitimate email datasets, including the
Enron email corpus and publicly accessible phishing archives, were sanitized,
combined, and transformed into a binary-labeled dataset comprising over 10,000 email
samples. A logistic regression classifier trained on TF-IDF vectors served as the
baseline and was subsequently augmented with social engineering feature counts.
The resulting hybrid model attained a detection accuracy of 96% with low rates of
false positives and false negatives, underscoring the value of enhancing
lexical models with domain-specific psychological attributes. The evaluation metrics,
comprising precision, recall, F1-score, and a confusion matrix, support the model's
robustness and reliability. This study provides a practical and interpretable
solution for phishing detection and establishes a basis for future integration with
transformer-based models and domain verification systems. Future work includes
the application of explainable AI methods and the deployment of the system
as a web-based interface featuring a user-oriented Suspicion Meter.
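
The social engineering feature counts described above can be illustrated with a minimal sketch. It assumes simple phrase matching against three small cue lists. The lexicons, the names CUE_LEXICONS, count_cues, and feature_vector, and the sample email are hypothetical and chosen only for illustration; the lists used in this work may differ. In a hybrid setup, a count vector of this kind would be appended to each email's TF-IDF vector before training the classifier.

```python
import re

# Hypothetical cue lexicons for illustration only; not the thesis's actual lists.
CUE_LEXICONS = {
    "urgency": ["urgent", "immediately", "act now", "within 24 hours", "expires"],
    "authority": ["security team", "administrator", "bank", "it department", "official"],
    "manipulation": ["verify your account", "click here", "suspended",
                     "confirm your password", "prize"],
}


def count_cues(text):
    """Count whole-phrase, case-insensitive cue matches per category."""
    lowered = text.lower()
    counts = {}
    for category, phrases in CUE_LEXICONS.items():
        total = 0
        for phrase in phrases:
            # Word boundaries prevent matches inside longer words.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            total += len(re.findall(pattern, lowered))
        counts[category] = total
    return counts


def feature_vector(text):
    """Return counts in a fixed order: urgency, authority, manipulation."""
    counts = count_cues(text)
    return [counts[category] for category in CUE_LEXICONS]


if __name__ == "__main__":
    sample = ("URGENT: your account is suspended. Click here immediately "
              "to verify your account with the security team.")
    print(count_cues(sample))
    print(feature_vector(sample))
```

Plain counts keep each feature directly inspectable, which suits the interpretability goal stated above, although phrase matching of this kind will miss paraphrased or obfuscated cues.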