Overfitting In Decision Trees: Remedies Through Ensemble Learning And Explainable AI

Mansi, Yatu Rani, Archana Kumar,

doi:10.5281/zenodo.20050883

This study focuses on improving the performance of decision tree models by addressing key challenges such as overfitting and lack of interpretability. To overcome these issues, ensemble learning techniques including bagging and boosting are applied to enhance prediction accuracy and model stability. In addition, explainable artificial intelligence methods such as SHAP and LIME are used to improve model transparency and provide insights into feature importance. The proposed approach also incorporates data preprocessing techniques, including handling imbalanced datasets using SMOTE. The experimental results show that the proposed model outperforms traditional decision tree models in terms of accuracy and overall performance. However, the increased computational complexity highlights certain limitations. Overall, this study provides an effective approach for developing accurate and interpretable machine learning models for real-world applications.

Machine learning has become a key technology in modern applications enabling systems to learn from data and make decisions. It is widely used in domains such as healthcare, finance, and recommendation systems. Among various machine learning techniques decision tree algorithms are popular due to their simplicity, ease of implementation, and interpretability [1].

A decision tree is a supervised learning algorithm that can be used for both classification and regression tasks. It works by recursively partitioning the data into subsets based on feature values, making decisions at each node to maximize a specific criterion. On the other hand despite these advantages decision trees suffer from a significant limitation known as overfitting. Overfitting occurs when a model is trained too well on the training data and captures noise instead of meaningful patterns. As a result, model performs well on training data but fails to generalize effectively to unseen data. This issue is common in deep decision trees where excessive branching makes the model highly specific and less robust. This leads to weak adaptability of new, unseen data undermining the model’s predictive or explanatory power [2], [3].

To address this challenge, ensemble learning techniques have been widely used. Ensemble learning is a technique that uses multiple models of different kinds to create one model to improve accuracy and stability. One of the most commonly used technique is Random forest which constructs multiple decision trees using different subsets of data and aggregates their predictions. This approach reduces variance and significantly improves generalization performance [4], [5]. Another powerful ensemble technique is Gradient Boosting which builds models sequentially by correcting the errors of previous models. Gradient boosting focuses on improving performance step by step. This technique gives high predictive accuracy but increase the complexity of the model. Ensemble methods help improve accuracy and reduce overfitting but they introduce a new challenge related to interpretability. Unlike a single decision tree, these models are harder to understand. It becomes difficult to understand how they make decisions. This lack of transparency can reduce trust in machine learning systems [6].

To overcome this limitation, two widely used explainable artificial intelligence (XAI) techniques such as Local Interpretable model-agnostic Explanations (LIME) and Shapely Additive Explanations (SHAP) have been developed. Building on these methods, this research proposes a novel interpretability metric that evaluates the consistency base on LIME and SHAP as a measure of robustness in model interpretability [7].

This paper aims to reduce overfitting in decision trees by applying ensemble learning techniques while maintaining interpretability using explainable AI techniques. The proposed approach seeks to achieve a balance between model performance and transparency.

LITERATURE REVIEW

Several studies have been conducted to improve the performance of decision trees and address the problem of overfitting. Various techniques like ensemble learning methods and explainable artificial intelligence approaches have been developed to enhance the model accuracy and interpretability. This section reviews the existing literature related to decision trees, overfitting, ensemble methods and interpretability techniques. Decision trees, although highly interpretable tend to overfit when the model becomes complex, especially in the presence of noisy and high dimensional data. This limitation has motivated the use of advanced techniques to improve model generalization [8].

The reviewed studies showed that ensemble learning techniques such as bagging and boosting significantly improves the predictive performance of the decision tree models by reducing overfitting. Additionally, the integration of explainable artificial intelligence (XAI), specifically SHAP and LIME has enhanced the model transparency and interpretability. These approaches have been successfully applied in various domains such as healthcare, finance to demonstrate there practical significance. A comparative analysis of existing approaches indicates that ensemble learning methods primarily focus on improving predictive accuracy whereas XAI techniques emphasize model interpretability. However, very few studies provide a unified framework that effectively balances both aspects [9], [10] [11], [12].

Despite these advancements, several challenges remain. Ensemble methods often increase computational complexity and reduce interpretability. While explainable AI techniques may produce inconsistent explanations and require significant computational resources. Furthermore, existing studies focused either on improving interpretability or accuracy rather than achieving a balance between them. Therefore, there is a need to develop an efficient and scalable models that can give high predictive performance while maintaining interpretability. This highlights a critical research gap where an integrated approach combining ensemble learning and explainable AI is required to achieve both accuracy and transparency without significantly increasing computational cost [13], [14], [15], [16].

PROBLEM STATEMENT

Despite significant advancements in machine learning decision tree models continue to suffer from overfitting which limits their ability to generalize unseen data. Ensemble learning techniques, such as bagging and boosting help improve prediction performance of the decision tree models by reducing overfitting but they often increase model complexity and reduce interpretability. On the other hand, explainable artificial intelligence (XAI) techniques such as

SHAP and LIME enhance model transparency and interpretability but may produce inconsistent explanations and require significant computational resources. Furthermore, existing studies focused either on improving interpretability or accuracy rather than achieving a balance between them. Therefore, there is a need to develop an efficient and scalable models that can give high predictive performance while maintaining interpretability for real-world applications [17], [18], [19].

OBJECTIVE

The main objectives of this study are:

To analyze the problem of overfitting in decision trees.
To improve prediction accuracy using ensemble learning techniques.
To enhance model interpretability using XAI methods such as SHAP and LIME.
To handle imbalanced data using techniques like SMOTE.
To develop a model that balance accuracy, interpretability and computational efficiency.

METHODOLOGY

This study proposes a structured approach to improve the performance of decision tree models while maintaining interpretability. The methodology integrates data preprocessing, ensemble learning techniques and explainable artificial intelligence methods to achieve a balance between model transparency and prediction accuracy. The overall process consists of data collection, preprocessing, model training, application of ensemble learning methods and model interpretation using SHAP and LIME.

Proposed methodology of the system :

Data Collection

↓

Data Preprocessing

↓

Handling Imbalanced data (SMOTE)

↓

Model Training (Decision Tree)

↓

Ensemble Learning (Random Forest / Boosting)

↓

Model Evaluation

↓

XAI (SHAP + LIME)

↓

Result Analysis

Data collection: The dataset is collected from reliable sources related to application domains like healthcare or financial data. This dataset contains multiple features used for prediction.

Data preprocessing: This process includes handling missing and duplicate values, normalization and feature selection to improve model performance.

Handling imbalanced data: To address imbalanced data, SMOTE (Synthetic Minority Over- Sampling Technique) is applied to generate synthetic samples and balance the dataset [20].

Model development: A decision tree model is initially trained to analyze the baseline performance. Further, ensemble learning techniques such as Random Forest and boosting are applied to improve accuracy and reduce overfitting [21].

Model Evaluation: The model performance is evaluated using metrices such as Accuracy, Precision, Recall and F1-score.

Explainability (XAI): SHAP and LIME are applied to interpret the model predictions. These techniques provide feature importance and local explanations, improving transparency and trust in the model [22].

Result Analysis: The result are analyzed to compare model performance and interpretability, ensuring that the proposed model approach achieves a balance between both.

RESULT AND DISCUSSION

The performance of the proposed model is evaluated using publicly available dataset. The dataset was preprocessed by handling missing and duplicate values, normalizing features and balancing class distribution using SMOTE. Different models, including decision trees and ensemble learning techniques were trained and evaluated using performance metrices such as accuracy, precision, recall and F1- score.

F1-
Model	Accuracy	Precision	Recall	Score
Decision Tree	78%	75%	72%	73%
Random Forest	88%	85%	84%	84%
Boosting Model	91%	89%	87%	88%
Proposed Model	93%	91%	90%	90%
(Ensemble + XAI)

The results indicate that ensemble learning techniques significantly improve model performance compared to the basic decision tree. The decision tree model shows lower accuracy due to overfitting which affects it ability to generalize unseen data. In contrast, Random Forest reduces overfitting by combining multiple trees, resulting in improved accuracy. Boosting further enhances performance by focusing on misclassified instances, leading to better prediction results.

The proposed model achieves the highest accuracy as it combines ensemble learning with preprocessing techniques such as SMOTE, which helps in handling imbalanced data. Additionally, the use of SHAP and LIME improves model interpretability by providing insights into feature importance and prediction behaviour [22].

However, the improved performance comes with certain limitations. The use of ensemble methods and XAI techniques increases computational complexity and requires more processing time. This may limit the applicability of the model in real-time systems.

CONCLUSION

This research aims to address the limitations of decision tree models particularly the problem of overfitting and lack of interpretability. To overcome these limitations, ensemble learning techniques such as Bagging and Boosting were applied to improve prediction accuracy and model stability. In addition explainable artificial intelligence (XAI) methods such as SHAP and LIME were integrated to enhance model transparency and provide insights into feature importance.

The experimental results demonstrated that the proposed approach significantly outperforms the traditional decision tree model in terms of accuracy, precision, recall, and F1score. The use of ensemble learning techniques helped in reducing overfitting while explainable artificial methods improves the understanding of the model. However , the integration of these techniques increases computational complexity and may affect real- time applicability.

Overall, the study highlights the importance of balancing accuracy and interpretability in machine learning models and provides a foundation for developing efficient and reliable predictive systems.

FUTURE WORK

Although the proposed approach demonstrates improved performance and interpretability, there are several areas for future enhancement. Future research can focus on reducing computational complexity to make the model more efficient for real-time applications. Additionally, more advanced and consistent explainable AI techniques can be explored to overcome the limitations of SHAP and LIME.

Further improvements can include the use of deep learning models combined with explainability techniques to handle more complex datasets. The model can also be tested on larger and more diverse datasets to validate its robustness and scalability. Moreover, the proposed approach can be applied to real-world domains such as healthcare, financial prediction, and fraud detection to evaluate its practical usability [23].

REFERENCES

Ibmoiye Domor Mienye and Nobert Jere, A survey of decision trees : concepts, algorithms and applications, IEEE Access, 2024
A.D. Mankar, S.D. Bholte, K.G. Kharade, K.A. Raskar, Metaanalysis of overfitting of decision trees, Journal of Nonlinear Analysis and Optimization, 2024
Erblin Halabaku, Eliot Bytyci, Overfitting in Machine learning: A comparative analysis of decision trees and random forests, Intelligent Automation & Soft Computing, 2024
Hasan Ahmed Salman, Ali Kalakech and Amani Steiti, Random forest algorithm overview, Babylonian journal of machine learning, 2024
Anantha Babu Shanmugavel, Vijayan Ellappan, Anand Mehendran, Murali Subramanian, Ramanathan Lakshmanan and Manuel Mazzara, A novel ensemble based reduced overfitting model model with convolutional neural network for traffic sign recognition system, Electronics (MDPI), 2023
V. S. Stency, N Mohanasundaram, Revathi Santhosh, Ensembled gradient boosting technique with decision tree for intrusion detection system, International Journal of Intelligent systems and applications in engineering, 2024
Ahmed Salih, Zahra Raisi, Ilaria Boscolo Galazzo, Petia Radeva, A perspective on explainable artificial intelligence methods: SHAP and LIME, Advanced Intelligent Systems, 2024
Mykola Zlobin, Volodymyr Bazylevych, A data driven approach for balancing overfitting and underfitting in decision tree models, Collection of Scientific Papers, 2025
Hongke Zhao, Wenhui Liu, Yaxian Wang, Likang Wu, Comparative analysis of algorithmic approaches in ensemble learning: Bagging and Boosting, Scientific Reports, 2025
Abdallah, Hagar F. Gouda & Fatma D.M., Comparative performance of bagging and boosting ensemble models for predicting lumpy skin disease with multiclass-imbalanced data, Scientific Reports, 2025
Evandro S. Ortigossa, Thales Goncalves, Luis Gustavo Nanato, Explainable artificial intelligence (XAI)- from theory to methods and applications, IEEE Access, 2024
Bhawani Sankar Panigrahi, M. Vanitha , Mohd Ashraf, R.V.S Lalitha, D. Haritha, Ajith Sundaram, Explainable AI frameworks using SHAP and LIME enhance interpretable defect classification in additive manufacturing, Nondestructive Testing and Evaluation, 2026
Abel Abusitta, Miles Q. Li, Benjamin C.M. Fung, Survey on explainable AI: Techniques, challenges and open issues, Expert Systems with Applications, 2024
Trisna Ari Roshinta, Szucs Gabor, A comparative study of LIME and SHAP for enhancing trustworthiness and efficiency in explainable AI systems, IEEE International Conference on Computing (ICOCO),2024
Joshua Pinem, Widi Astuti , Adiwijaya, Explainable Ensemble learning Framework with SMOTE, SHAP and LIME FOR PREDICTING 30- DAY readmission in Diabetic patients, Jurnal Resti ( Rekayasa Sistem dan Teknologi Informasi), 2025
Ahmed Salih, Zahra Raisi, Ilaria Boscolo Galazzo, Petia Radeva, Steffen Erhard Petersen, Karim Lekadir, Gloria Manegez, A perspective on explainable artificial intelligence methods: SHAP and LIME, Advanced Intelligent Systems, 2024
MD. Mahmudal Hasan, Understanding model predictions: A comparative analysis of SHAP and LIME on various ML algorithms, Journal of Scientific and Technological Research, 2024
Hagar F. Gouda, Fatma D.M. Abadallah, Comparative performance of bagging and boosting ensemble models for predicting lumpy skin disease with multiclass- imbalanced data, Scientific Reports, 2025
Ashima Kukkar, Gagandeep Kaur, A novel adaptive ensemble classifier with LIME and SHAP-Based interpretability for fake news detection, Expert Systems with Applications, 2025
Essa E. Almazroei, Ensemble machine learning framework with SHAP and LIME for accurate early prediction of student success in online learning environments,Scientific Reports, 2026
Mie Wang, Feixiang Ying, Jianing Yang & Dongming Zhu, An explainable (interpretable) stacking ensemble machine learning model for real- time and short- term significant sea wave height prediction, Sustainable Energy Technologies and Assessments, 2026
Bhawani Sankar Panigrahi, M. Vanitha, Mohd Ashraf, R.V.S. Lalitha, D. Haritha & Ajith Sundaram, Explainable AI frameworks using SHAP and LIME enhance interpretable defect classification in additive manufacturing, Nondestructive Testing and Evaluation, 2026
Guillermo A. Francia lii, Hossain Shahriar, Eman El- Sheikh, Md Abdur Rahman, Sheikh Iqbal Ahamed, An explainable artificial intelligence approach for improved dynamic analysis with SHAP and LIME, IEEE International Conference on Computing (ICOCO), 2026

Mansi

Corresponding author

Department of AI and Data Science, Dr. Akhilesh Das Gupta Institute of Professional Studies, Delhi, India

Yatu Rani

Co-author

Archana Kumar

Mansi*, Yatu Rani, Archana Kumar, Overfitting In Decision Trees: Remedies Through Ensemble Learning And Explainable AI, Int. J. Sci. R. Tech., 2026, 3 (5), 214-219. https://doi.org/10.5281/zenodo.20050883

View Article

Overfitting In Decision Trees: Remedies Through Ensemble Learning And Explainable AI

Abstract

Keywords

Introduction

Reference

Mansi

Yatu Rani

Archana Kumar

More related articles

Anticancer Activity of Grapes and Papaya: A Compre...

Power Electronics in Renewable Energy Systems...

A Review on Health-Related Effects & Pharmacologic...

View more

Exploring the Mathematics of Spacetime in Einstein’s Relativity...

From Localization to Connectomics: A Contemporary View of Human Brain Structure ...

Computer-Aided Drug Design in Modern Pharmaceutical Research: An In-Silico Persp...

View more

Related Articles

Finite Element Wear Behaviour Modeling of AA7075 Coated with WS2/Cu Using ANSYS...

A Comprehensive Review on Oral Disintegrating Tablets...

Formulation, Development & Evaluation Of Polyherbal Sunscreen...

Ion Exchange Chromatography in the Analysis of Brain-Derived DNA: Unravelling th...

Anticancer Activity of Grapes and Papaya: A Comprehensive Review...

More related articles

Anticancer Activity of Grapes and Papaya: A Comprehensive Review...

Power Electronics in Renewable Energy Systems...

A Review on Health-Related Effects & Pharmacological Activity of Tamarindus Indi...

View more

Anticancer Activity of Grapes and Papaya: A Comprehensive Review...

Power Electronics in Renewable Energy Systems...

A Review on Health-Related Effects & Pharmacological Activity of Tamarindus Indi...

View more