Explaining Advanced Malware Detection Techniques Using XAI Tools (SHAP, ELI5, and LIME)
In the ever-evolving landscape of cybersecurity, the persistent threat of malware calls for new detection mechanisms. As malware continues to grow in sophistication and complexity, the need for advanced, interpretable detection methods becomes increasingly urgent. This thesis presents a novel approach to malware detection that harnesses SHAP (SHapley Additive exPlanations) values and the “ELI5 for PyTorch” model.
Explainable Artificial Intelligence (XAI) and machine learning (ML) models play pivotal roles in cybersecurity threat detection, as recent research demonstrates. Studies in this area pursue distinct focuses, but they consistently highlight the efficacy of ML models for threat detection while underscoring the critical role of XAI in elucidating the models’ decision-making processes. This transparency is essential for building trust and enhancing the robustness of cybersecurity systems.
This thesis, situated at the nexus of advanced ML techniques and cybersecurity imperatives, bridges the gap between model performance and interpretability. By leveraging SHAP values and the novel “ELI5 for PyTorch” model, it not only detects previously unseen malware variants with precision but also provides detailed insight into the decision-making process of the detection models. Furthermore, SOREL-20M, a large-scale benchmark dataset, provides the research community with an extensive real-world corpus for developing explainable machine-learning models that reliably classify threats.
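To make this concrete, the sketch below shows one common way of obtaining SHAP attributions for a PyTorch classifier using SHAP’s DeepExplainer. It is a minimal illustration only: the feed-forward network, feature dimensionality, and random tensors are placeholders standing in for a trained detector and SOREL-20M samples, and the “ELI5 for PyTorch” model itself is a custom component of this thesis that is not reproduced here.

```python
# Minimal sketch (not the thesis code): SHAP attributions for a small
# PyTorch malware classifier via DeepExplainer. Architecture, feature
# dimensionality, and data are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn
import shap

# Hypothetical feed-forward classifier over pre-extracted static features.
model = nn.Sequential(
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 2),   # benign vs. malicious logits
)
model.eval()

background = torch.randn(100, 256)   # reference samples for expected values
samples = torch.randn(5, 256)        # samples to explain

# DeepExplainer works directly on PyTorch modules.
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(samples)

# Depending on the SHAP version, shap_values is either a list with one
# attribution array per output class or a single array with a class
# dimension; each row attributes a prediction to the 256 input features.
print(np.shape(shap_values))
```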
Additionally, adversarial methods have been employed to generate explanations for misclassifications by intrusion detection systems (IDS), while innovative techniques such as Symbolic Deep Learning (SDL) aim to provide explainable suggestions for threat detection. Furthermore, XAI techniques such as SHAP, LIME, and gradient-based attribution methods have been used to explain the decisions of malware classifiers, with a focus on improving interpretability without sacrificing performance. This body of research not only underscores the importance of explainable AI in cybersecurity but also highlights the diverse range of methodologies and their potential impact on system transparency and analysts’ decision-making.
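As a rough illustration of the LIME approach mentioned above, the sketch below explains a single prediction of a tabular classifier with a local linear surrogate. The scikit-learn model, feature names, and random data are assumptions made purely for illustration, not the classifiers or datasets evaluated in the cited work.

```python
# Minimal, illustrative LIME example for a tabular "malware" classifier.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))       # placeholder feature matrix
y_train = rng.integers(0, 2, 500)     # 0 = benign, 1 = malicious
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=[f"feat_{i}" for i in range(20)],
    class_names=["benign", "malicious"],
    discretize_continuous=True,
)

# LIME perturbs the sample locally and fits a sparse linear surrogate whose
# weights approximate the classifier's behaviour around that sample.
exp = explainer.explain_instance(X_train[0], clf.predict_proba, num_features=5)
print(exp.as_list())
```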
The following table highlights the significance of XAI and the geographic distribution of the published approaches. Through
the application of machine learning models and SHAP explainer algorithms,
this research aimed to identify the factors influencing the
classification of cyber threats. The datasets included features related
to obfuscation techniques in URLs, system calls, binders, and composite
behaviors in Android malware. Three machine learning classifiers were used: a Random Forest classifier, XGBoost, and a Keras Sequential neural network. The predictions were explained using
three SHAP variants: TreeSHAP, KernelSHAP, and DeepSHAP.
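The sketch below illustrates the TreeSHAP variant applied to a tree-based model such as XGBoost; it is a minimal, assumed example in which random placeholder features stand in for the Android-malware behavioural features described above.

```python
# Minimal TreeSHAP sketch on an XGBoost classifier (illustrative only).
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 30))        # placeholder behavioural features
y = rng.integers(0, 2, 1000)      # 0 = benign, 1 = malicious

model = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X, y)

# TreeExplainer implements the TreeSHAP algorithm: exact Shapley values
# computed in polynomial time by exploiting the tree structure.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # typically (n_samples, n_features)
                                         # for a binary XGBoost model

# Rank features by mean absolute attribution to see which ones drive the
# benign/malicious decision across the dataset.
importance = np.abs(shap_values).mean(axis=0)
print(importance.argsort()[::-1][:10])
```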
This analysis provided key insights into the critical features driving
attack classification, offering a deeper understanding of cybersecurity
threats. By presenting the findings in a clear and interpretable manner,
this study contributes to the advancement of effective cybersecurity strategies, enabling stakeholders to detect and mitigate cyber threats in an increasingly digital landscape.

Keywords: Malware Detection, Explainable AI, SOREL-20M Benchmark Dataset, ELI5 for PyTorch, Classification Benchmark.