Chronic kidney disease (CKD) is one of the greatest global health challenges, affecting over 850 million people worldwide [1]. Clinically, CKD is characterised by a persistent decline in the kidneys' filtering capacity, commonly assessed through the estimated glomerular filtration rate (eGFR), serum creatinine and proteinuria [1, 2]. Because CKD is a silent disease in most cases, many patients progress to a complicated stage before they receive adequate care, which exposes them to cardiovascular complications, hospitalisation and even death [3]. Rising rates of diabetes, high blood pressure and other lifestyle-related illnesses have played a major role in the burden of CKD in developing nations [3].
Conventional diagnosis relies on manual analysis of biochemical markers. This becomes difficult with complex datasets whose multivariate relationships cannot easily be established by human evaluation. Machine learning (ML) models have shown significant potential for CKD prediction because they can process complex clinical data and identify hidden patterns [4, 5]. Random Forest (RF), Logistic Regression (LR), Support Vector Machines (SVM) and boosting models such as XGBoost have demonstrated encouraging performance [4, 6]. Despite this progress, most research has addressed binary classification between CKD and non-CKD, which limits its clinical relevance for treatment planning, since the severity of CKD depends strongly on its stage [6, 13].
Another important challenge is interpretability. ML models, especially ensemble and boosting models, are often regarded as black boxes whose inner workings are hard to understand. This limits their adoption in health systems that demand transparent, auditable and clinically interpretable decisions to guarantee patient safety and trust [9, 14].
SHAP (SHapley Additive exPlanations) addresses this problem by assigning each feature a mathematically consistent contribution to the final prediction [9]. Nevertheless, numerical SHAP values are not always intuitive for clinicians. Fuzzy logic, which mirrors human reasoning, offers a natural way to translate model decisions into readable rules [10, 12].
The literature on CKD prediction has several research gaps. First, stage-wise classification of CKD remains scarce, with most studies performing binary classification [13, 16]. Second, real-world CKD datasets show severe class imbalance, especially at low stages, and many studies do not address this issue adequately [8]. Third, many ML models used to predict CKD lack adequate explainability [14]. Fourth, hybrid systems that combine ML, SHAP and fuzzy reasoning are uncommon. Finally, deployable tools such as GUI-based decision-support systems are not available, which restricts application in clinical and screening settings [13].
The proposed study bridges these gaps by presenting a complete explainable framework for CKD stage prediction that combines LightGBM and SMOTE with SHAP, fuzzy reasoning and a graphical interface. The main contributions of the research are the following:
- Development of a holistic CKD stage prediction tool that covers all six stages (0-5).
- Effective handling of class imbalance, using SMOTE to improve performance on the minority classes.
- Implementation of global and local explainability with SHAP.
- Integration of fuzzy rules as a mechanism for clinical interpretability.
- Development of a GUI for real-time prediction, visualisation and personalised suggestions.
LITERATURE REVIEW
CKD prediction has attracted a significant amount of machine learning research. The first models of this type classified the presence of CKD using Random Forest (RF), Support Vector Machines (SVM) and Logistic Regression (LR) on structured clinical data [4, 5]. RF predicted well because it is robust to noise and can model nonlinear relationships, whereas SVM was effective at handling high-dimensional medical data. XGBoost subsequently achieved better precision through gradient boosting, but remained difficult to interpret because of its complex internal structure [6].
One of the most significant weaknesses of CKD datasets is extreme class imbalance: most samples are early-stage, and few belong to the advanced CKD stages (4-5). Such an imbalance can bias model training and lead to poor generalisation on the minority classes. The Synthetic Minority Oversampling Technique (SMOTE) has been shown to address this imbalance by creating synthetic samples within the neighbourhoods of minority-class samples, which reduces classifier bias and improves recall [8]. Studies that include SMOTE have consistently reported higher F1-score and sensitivity on underrepresented CKD groups.
For machine learning to be adopted in healthcare, it must be interpretable. One of the most mathematically sound schemes for explaining model decisions is SHAP, introduced by Lundberg and Lee, which computes the marginal contribution of each feature to the output [9]. SHAP has shown promise in explaining predictive models for diabetes, cardiovascular disease and oncology, yet it is rarely used to explain CKD staging. Arvind et al. noted that SHAP could be appropriate in the clinical setting, particularly because it can provide explanations that align with physician reasoning and regulatory requirements [14].
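To make the notion of a feature's marginal contribution concrete, the following minimal sketch computes exact Shapley values for a toy three-feature risk score by enumerating every feature coalition. It does not use the shap library, and the linear `toy_risk` function, the feature names, the baseline and the patient values are all hypothetical choices made for illustration; none of them are taken from the study.

```python
from itertools import combinations
from math import factorial

# Hypothetical reference values and one hypothetical patient (illustration only).
BASELINE = {"egfr": 90.0, "creatinine": 1.0, "bun": 15.0}
PATIENT = {"egfr": 40.0, "creatinine": 2.5, "bun": 30.0}


def toy_risk(x):
    """Hypothetical linear risk score; stands in for a trained model."""
    return 0.02 * (90.0 - x["egfr"]) + 0.5 * x["creatinine"] + 0.01 * x["bun"]


def value_fn(coalition):
    """Model output when only the features in `coalition` take the patient's
    values and all other features stay at the baseline."""
    x = {name: PATIENT[name] if name in coalition else BASELINE[name]
         for name in BASELINE}
    return toy_risk(x)


def shapley_values(value, features):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all coalitions of the remaining features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


phi = shapley_values(value_fn, list(BASELINE))
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.3f}")
```

The contributions sum to the difference between the patient's score and the baseline score, which is the additivity property that makes such attributions consistent. Exhaustive enumeration grows exponentially with the number of features, so it is practical only for small toy cases like this one.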
In addition to numerical interpretability, fuzzy logic supports human-like reasoning through linguistic terms such as low GFR, moderately high creatinine or high BUN. Fuzzy logic, originally introduced by Zadeh [10] and extended by Kosko [11], is widely used in clinical diagnostic systems because of its flexibility in handling uncertainty and its interpretability. Son et al. showed that fuzzy rule-based reasoning increases clinician trust and improves the usability of decision-support systems [12].
The latest systematic reviews of CKD prediction models highlight several gaps in the current literature that remain unaddressed [13, 16]. These gaps are a scarcity of research on stage-wise classification, inadequate handling of imbalanced datasets, little incorporation of explainability methods such as SHAP, and the absence of solutions that link ML with fuzzy reasoning and user interfaces. Moreover, most models remain at the stage of academic research and do not become real-world clinical solutions because deployable GUI-based tools are lacking [13]. Based on these observations, there is a clear need for a holistic CKD prediction framework that:
- performs stage-wise prediction,
- handles class imbalance,
- provides transparent explanations,
- incorporates fuzzy reasoning to achieve clinical interpretability, and
- provides a GUI for real-time decision support.
The current study addresses all of these research gaps, which is why it can be regarded as an important contribution to the CKD prediction literature.
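The linguistic terms mentioned in this review, such as low GFR or high creatinine, can be illustrated with simple membership functions and a single rule. The sketch below is only an assumption-laden example: the breakpoints (30 and 60 for eGFR, 1.2 and 2.0 for creatinine), the function names and the rule itself are hypothetical and are neither clinical thresholds nor the rule base of the study.

```python
def left_shoulder(x, full, zero):
    """Membership 1.0 at or below `full`, 0.0 at or above `zero`, linear between."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def right_shoulder(x, zero, full):
    """Membership 0.0 at or below `zero`, 1.0 at or above `full`, linear between."""
    if x <= zero:
        return 0.0
    if x >= full:
        return 1.0
    return (x - zero) / (full - zero)


def egfr_low(egfr):
    # Hypothetical breakpoints, chosen only for illustration.
    return left_shoulder(egfr, full=30.0, zero=60.0)


def creatinine_high(creatinine):
    # Hypothetical breakpoints, chosen only for illustration.
    return right_shoulder(creatinine, zero=1.2, full=2.0)


def rule_strength(egfr, creatinine):
    """Firing strength of: IF eGFR is low AND creatinine is high.
    The fuzzy AND is taken as the minimum of the two memberships."""
    return min(egfr_low(egfr), creatinine_high(creatinine))


def explain(egfr, creatinine):
    """Render the rule and its degrees of truth as a readable sentence."""
    return (
        f"IF eGFR is low ({egfr_low(egfr):.2f}) "
        f"AND creatinine is high ({creatinine_high(creatinine):.2f}) "
        f"THEN advanced-stage risk is high "
        f"(strength {rule_strength(egfr, creatinine):.2f})"
    )


print(explain(45.0, 2.5))
```

Taking the minimum for AND is one common choice of fuzzy conjunction; other operators, such as the product, would give different firing strengths for the same inputs.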
METHODOLOGY
The proposed CKD prediction system comprises data preprocessing, class balancing, machine learning classification, probability calibration, SHAP-based explainability, fuzzy rule-based reasoning and deployment in a GUI. This multistage pipeline is designed to deliver high predictive accuracy, transparency and clinical usability.
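As an illustration of the class-balancing stage only, the following sketch shows the interpolation idea behind SMOTE in plain Python: a synthetic sample is placed at a random position on the segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_like`, its parameters and the sample points are hypothetical, and the sketch is a simplification rather than the implementation used in the study. In practice a maintained implementation, such as the `SMOTE` class in the imbalanced-learn package, would normally be preferred.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate `n_new` synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (SMOTE-style sketch)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical two-feature minority samples (values are illustrative only).
MINORITY = [(1.0, 2.0), (2.0, 3.0), (3.0, 1.0), (2.5, 2.5)]
new_points = smote_like(MINORITY, n_new=6, k=2, seed=1)
for point in new_points:
    print(point)
```

Oversampling of this kind is generally applied to the training split only, so that the evaluation data keep their original class distribution.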
Govardan Sai Palla*
Dr. I. Kullayamma
10.5281/zenodo.17918804