Computer Science and Engineering, GRD IMT Dehradun
The Heart Disease Prediction System is an application designed to predict the presence of heart disease in individuals based on critical medical data. This web-based solution, implemented using Python and Streamlit, combines data science and machine learning techniques to offer an intuitive and interactive interface for healthcare professionals, researchers, and students. By leveraging supervised learning algorithms such as K-Nearest Neighbors (KNN), Decision Trees, Random Forests, and Support Vector Machines (SVM), the application facilitates accurate predictions while enabling exploratory data analysis. Key functionalities include detailed data visualization, advanced feature engineering, and model selection, all of which aim to improve the interpretability and predictive power of the system. The project highlights the potential of machine learning in addressing critical health challenges, providing an accessible and effective tool for disease prediction.
Cardiovascular diseases (CVDs) are among the most prevalent health conditions worldwide and a leading cause of morbidity and mortality. The heart is a vital organ that pumps blood throughout the body and so sustains every other organ. If it stops working correctly, the brain and other critical organs cease to function, and death can follow within minutes. Changes in lifestyle, work-related stress, and unhealthy dietary habits have contributed significantly to the global rise in heart-related diseases. According to the World Health Organization (WHO), cardiovascular diseases account for approximately 17.7 million deaths annually, representing 31% of all global deaths, and early detection and intervention are crucial to mitigating risk and improving patient outcomes. Advances in data science and machine learning have paved the way for predictive systems that can help healthcare providers diagnose and manage such conditions more effectively.
In India, estimates of the share of deaths attributable to CVDs range from 30% to 42%. The Global Burden of Disease study estimates the age-standardized CVD death rate in India at 272 per 100,000 people, higher than the global average of 235 per 100,000. Heart-related diseases have become the primary cause of death in India, with 1.7 million fatalities reported in 2016, as per the 2016 Global Burden of Disease Report. The economic impact is equally severe; between 2005 and 2015, India is estimated to have
LITERATURE SURVEY
Chala Beyene et al. [1] proposed a framework for the prediction and analysis of heart disease occurrence using data mining techniques. The primary goal of their methodology is early, automated diagnosis of heart disease with results delivered quickly, which is particularly beneficial in healthcare organizations with limited specialist expertise. The proposed system uses a variety of medical attributes, including blood sugar level, heart rate, age, and sex, to determine whether an individual is at risk of heart disease. By drawing on these attributes, the framework aims to improve prediction accuracy and support timely medical intervention.
Senthil Kumar Mohan et al. [2] proposed a hybrid machine learning approach for predicting heart disease using the Cleveland dataset. Their method begins with a data pre-processing step in which tuples with missing values are removed and attributes such as age and sex are excluded, as the authors deemed them personal and irrelevant to prediction accuracy. The remaining 11 attributes, which hold significant clinical relevance, were retained for analysis. The authors introduced the Hybrid Random Forest Linear Method (HRFLM), which combines Random Forest (RF) with a Linear Method (LM). The HRFLM framework comprises four main algorithms:
METHODOLOGY
The Heart Disease Prediction System is structured into five main modules: data structure analysis, data visualization, feature engineering, model building, and prediction. Each module is designed to address a specific aspect of the data science pipeline, ensuring a comprehensive approach to data exploration and model development.
Data Loading and Exploration
The application begins by loading the heart disease dataset and performing an initial exploration to understand its structure, identify missing values, and recognize patterns. The dataset contains 303 instances, each representing one patient's medical record. Each record has 14 attributes describing the patient's condition, such as age, sex, blood pressure, and cholesterol level, together with the target variable, which indicates the presence or absence of heart disease. The dataset is loaded using Pandas and cached to improve performance. Users can then inspect its shape, column names, data types, and summary statistics, which gives a foundational understanding of the data before any modeling.
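The loading and exploration step can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the inline CSV, its column names, and its values are made up for the example, and `functools.lru_cache` stands in for the app's caching so the sketch runs without Streamlit (in a Streamlit app a decorator such as `st.cache_data` would typically fill that role).

```python
import io
from functools import lru_cache

import pandas as pd

# Hypothetical stand-in for the CSV file. Column names and values are
# illustrative only and are not taken from the project's dataset.
SAMPLE_CSV = """age,sex,trestbps,chol,target
60,1,140,230,1
35,1,128,245,1
44,0,132,210,1
52,1,118,,0
58,0,122,340,0
"""


@lru_cache(maxsize=1)
def load_data() -> pd.DataFrame:
    """Read the CSV once; later calls reuse the cached DataFrame."""
    return pd.read_csv(io.StringIO(SAMPLE_CSV))


def summarize(df: pd.DataFrame) -> dict:
    """Collect the structural facts a user would inspect first."""
    return {
        "shape": df.shape,                        # (rows, columns)
        "columns": list(df.columns),              # attribute names
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),     # missing values per column
        "stats": df.describe(),                   # summary statistics
    }


df = load_data()
summary = summarize(df)
print(summary["shape"])
print(summary["missing"])
print(summary["stats"])
```

Counting missing values per column at this stage shows early whether any attribute needs imputation or row removal before model building.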
a. Loading the Data
The dataset is loaded from a CSV file using Pandas, which provides an efficient way to read and manipulate structured data. The dataset is loaded into a DataFrame for easy
3. Random Forest Classifier
Overview
Random Forest is an ensemble learning method that builds multiple decision trees during training and merges their outputs to improve accuracy and reduce overfitting. It is a robust model that performs well on many tasks and is particularly effective for classification problems.
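The ensemble behavior rests on three small operations: resampling the training records with replacement, restricting the features considered at a split, and voting across the trees. The sketch below shows only these mechanics in plain Python, with no tree training. The function names are hypothetical, and the feature count of 13 assumes that one of the 14 attributes is the target.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) records at random WITH replacement."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), k))


def majority_vote(predictions):
    """Merge the per-tree class predictions into a single label."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)
rows = list(range(10))  # stand-ins for 10 training records

sample = bootstrap_sample(rows, rng)
features = random_feature_subset(n_features=13, k=3, rng=rng)
print(sample)    # some records repeat, others are left out
print(features)  # three feature indices for one split

# Five hypothetical trees vote on one patient: three say 1, two say 0.
print(majority_vote([1, 0, 1, 1, 0]))  # 1
```

Because each tree sees a different resample and a different feature subset at each split, the trees make partly independent errors, which is what the vote averages out.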
Working
Bootstrap Aggregating (Bagging): Random Forest uses a technique called bagging, where multiple decision trees are trained on different random subsets of the data. These subsets are generated by bootstrapping, which means randomly sampling with replacement from the training data.
Feature Randomization: At each node, the algorithm randomly selects a subset of features for the split, reducing the correlation between trees and ensuring diversity in the forest.
[...] hyperplane that best separates the data into two classes. The model assigns labels based on which side of the hyperplane the data points fall on.
Non-linear SVM: For non-linearly separable data, SVM uses kernel functions to map the data into a higher-dimensional space where a linear separator can be found. Common kernel functions include:
SVM is highly effective for binary classification tasks and works well in high-dimensional spaces. However, it can be computationally expensive, especially with large datasets or when using complex kernels. Regularization (parameter C) and kernel choice are crucial for the model’s performance. Cross-validation is employed to evaluate the model’s generalization ability and to avoid overfitting.
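One way to compare regularization strengths and kernels under cross-validation is sketched below. It assumes scikit-learn, which the text does not name explicitly. The data is synthetic (generated to match the 303-row, 13-feature shape described earlier), and the candidate kernels and values of C are illustrative choices rather than the project's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; only the shape mirrors the dataset described.
X, y = make_classification(
    n_samples=303, n_features=13, n_informative=6, random_state=0
)

results = {}
for kernel in ("linear", "rbf"):
    for C in (0.1, 1.0, 10.0):
        # Scaling sits inside the pipeline so each fold is scaled using
        # statistics from its own training portion only.
        model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=C))
        scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
        results[(kernel, C)] = scores.mean()

best = max(results, key=results.get)
print("best (kernel, C):", best)
print("mean CV accuracy:", round(results[best], 3))
```

Smaller values of C regularize more strongly and tolerate more margin violations, while larger values fit the training data more closely, so the cross-validated score is what guards against choosing an overfit setting.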
OUTCOME OF
Prediction Module
The prediction module allows users to enter patient-specific information such as age, blood pressure, and cholesterol level. The input is processed and passed to a selected machine learning model, for example a Support Vector Machine (SVM), which predicts the likelihood of heart disease. Visual feedback is provided for clarity.
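The flow from user input to prediction might look like the sketch below. Everything specific in it is an assumption made for illustration: scikit-learn as the library, the three field names, the generated training data and its labeling rule, the `predict_patient` helper, and the 0.5 decision threshold.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical subset of the input fields, in the order the model expects.
FEATURES = ["age", "trestbps", "chol"]

# Generated training data with an arbitrary labeling rule, used only so
# that the example has a fitted model to call.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[55, 130, 240], scale=[9, 17, 50], size=(200, 3))
y_train = (X_train[:, 0] + 0.1 * X_train[:, 2] > 79).astype(int)

# probability=True enables predict_proba on the fitted SVC.
model = make_pipeline(StandardScaler(), SVC(probability=True, random_state=0))
model.fit(X_train, y_train)


def predict_patient(model, patient: dict) -> tuple[int, float]:
    """Turn one patient's form input into a label and a probability."""
    # Build a single-row feature matrix in the agreed column order.
    row = np.array([[float(patient[name]) for name in FEATURES]])
    prob = float(model.predict_proba(row)[0, 1])  # P(class 1)
    label = int(prob >= 0.5)                      # assumed threshold
    return label, prob


label, prob = predict_patient(model, {"age": 61, "trestbps": 140, "chol": 260})
print(label, round(prob, 2))
```

Keeping the scaler and the classifier in one pipeline means the user's raw input is transformed exactly as the training data was, which avoids a common source of silent errors at prediction time.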
Steps in Prediction
to achieve its objectives:
Matplotlib and Seaborn: Visualization libraries for creating graphs and plots that facilitate data exploration and feature analysis.
CONCLUSION
The Heart Disease Prediction System demonstrates the potential of machine learning in addressing real-world health challenges. By integrating data visualization, feature engineering, and predictive modeling into a single platform, the project provides a comprehensive tool for heart disease prediction. The interactive interface simplifies complex machine learning processes, making the application accessible to a wide range of users. Through this project, we illustrate how advances in data science and machine learning can be harnessed to create impactful solutions. While the system offers promising results, future enhancements could include expanding the dataset, incorporating additional models, and tuning hyperparameters for improved accuracy. The application serves as a foundation for further research and development in predictive healthcare, showcasing the potential of technology to improve human well-being.
REFERENCE
Himanshu Kothari*, Suman Rani, An Overview of the Heart Disease Prediction Using Machine Learning and its Application, Int. J. Sci. R. Tech., 2025, 2 (6), 560-565. https://doi.org/10.5281/zenodo.15715476