Explaining Advanced Malware Detection Techniques Using XAI Tools (SHAP, ELI5, and LIME)
In the ever-evolving landscape of cybersecurity, the persistent threat of malware calls for new detection mechanisms. As malware continues to grow in sophistication and complexity, the need for advanced, interpretable detection methods becomes increasingly urgent. This thesis presents a novel approach to malware detection that harnesses SHAP (SHapley Additive exPlanations) values and the “ELI5 for PyTorch” model.
Explainable Artificial Intelligence (XAI) and machine learning (ML) models play pivotal roles in cybersecurity threat detection, as recent research demonstrates. Studies in this area pursue distinct focuses, but they consistently highlight the efficacy of ML models for threat detection while underscoring the critical role of XAI in elucidating the models’ decision-making processes. This transparency is essential for building trust and enhancing the robustness of cybersecurity systems.
This thesis, situated at the nexus of advanced ML techniques and cybersecurity imperatives, bridges the gap between model performance and interpretability. By leveraging SHAP values and the novel “ELI5 for PyTorch” model, it not only detects previously unseen malware variants with precision but also provides detailed insight into the decision-making process of the detection models. Furthermore, SOREL-20M, a large-scale benchmark dataset, provides the research community with an extensive real-world corpus for developing explainable machine-learning models that reliably classify threats.
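To make this concrete, the sketch below shows one common way of obtaining SHAP attributions for a PyTorch classifier using SHAP’s DeepExplainer. It is a minimal illustration only: the feed-forward network, feature dimensionality, and random tensors are placeholders standing in for a trained detector and SOREL-20M samples, and the “ELI5 for PyTorch” model itself is a custom component of this thesis that is not reproduced here.

```python
# Minimal sketch (not the thesis code): SHAP attributions for a small
# PyTorch malware classifier via DeepExplainer. Architecture, feature
# dimensionality, and data are illustrative placeholders.
import numpy as np
import torch
import torch.nn as nn
import shap

# Hypothetical feed-forward classifier over pre-extracted static features.
model = nn.Sequential(
    nn.Linear(256, 64),
    nn.ReLU(),
    nn.Linear(64, 2),   # benign vs. malicious logits
)
model.eval()

background = torch.randn(100, 256)   # reference samples for expected values
samples = torch.randn(5, 256)        # samples to explain

# DeepExplainer works directly on PyTorch modules.
explainer = shap.DeepExplainer(model, background)
shap_values = explainer.shap_values(samples)

# Depending on the SHAP version, shap_values is either a list with one
# attribution array per output class or a single array with a class
# dimension; each row attributes a prediction to the 256 input features.
print(np.shape(shap_values))
```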
Additionally, adversarial methods have been employed to generate explanations for misclassifications by intrusion detection systems (IDS), while innovative techniques such as Symbolic Deep Learning (SDL) aim to provide explainable suggestions for threat detection. Furthermore, XAI techniques such as SHAP, LIME, and gradient-based attribution methods have been used to explain the decisions of malware classifiers, with a focus on improving interpretability without sacrificing performance. This body of research not only underscores the importance of explainable AI in cybersecurity but also highlights the diverse range of methodologies and their potential impact on system transparency and analysts’ decision-making.
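As a rough illustration of the LIME approach mentioned above, the sketch below explains a single prediction of a tabular classifier with a local linear surrogate. The scikit-learn model, feature names, and random data are assumptions made purely for illustration, not the classifiers or datasets evaluated in the cited work.

```python
# Minimal, illustrative LIME example for a tabular "malware" classifier.
import numpy as np
from lime.lime_tabular import LimeTabularExplainer
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.random((500, 20))       # placeholder feature matrix
y_train = rng.integers(0, 2, 500)     # 0 = benign, 1 = malicious
clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=[f"feat_{i}" for i in range(20)],
    class_names=["benign", "malicious"],
    discretize_continuous=True,
)

# LIME perturbs the sample locally and fits a sparse linear surrogate whose
# weights approximate the classifier's behaviour around that sample.
exp = explainer.explain_instance(X_train[0], clf.predict_proba, num_features=5)
print(exp.as_list())
```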
The following table highlights the significance of XAI and the geographic distribution of the published approaches. Through
the application of machine learning models and SHAP explainer algorithms,
this research aimed to identify the factors influencing the
classification of cyber threats. The datasets included features related
to obfuscation techniques in URLs, system calls, binders, and composite
behaviors in Android malware. Three machine learning classifiers were used: a Random Forest classifier, XGBoost, and a Keras Sequential neural network. The predictions were explained using
three SHAP variants: TreeSHAP, KernelSHAP, and DeepSHAP.
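The sketch below illustrates the TreeSHAP variant applied to a tree-based model such as XGBoost; it is a minimal, assumed example in which random placeholder features stand in for the Android-malware behavioural features described above.

```python
# Minimal TreeSHAP sketch on an XGBoost classifier (illustrative only).
import numpy as np
import shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.random((1000, 30))        # placeholder behavioural features
y = rng.integers(0, 2, 1000)      # 0 = benign, 1 = malicious

model = xgb.XGBClassifier(n_estimators=200, max_depth=4).fit(X, y)

# TreeExplainer implements the TreeSHAP algorithm: exact Shapley values
# computed in polynomial time by exploiting the tree structure.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # typically (n_samples, n_features)
                                         # for a binary XGBoost model

# Rank features by mean absolute attribution to see which ones drive the
# benign/malicious decision across the dataset.
importance = np.abs(shap_values).mean(axis=0)
print(importance.argsort()[::-1][:10])
```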
This analysis provided key insights into the critical features driving
attack classification, offering a deeper understanding of cybersecurity
threats. By presenting the findings in a clear and interpretable manner,
this study contributes to the advancement of effective cybersecurity strategies, enabling stakeholders to detect and mitigate cyber threats in an increasingly digital landscape.

Keywords: Malware Detection, Explainable AI, SOREL-20M Benchmark Dataset, ELI5 for PyTorch, Classification Benchmark.