Dermatological conditions affect roughly 900 million people worldwide at any point in time [1], spanning a continuum from self-limiting inflammatory disorders to life-threatening malignancies. Reliable visual identification is complicated by the pronounced overlap between disease presentations: Melanoma, Melanocytic nevi, and Dermatofibroma, for instance, share irregular pigmentation, heterogeneous border morphology, and similar surface texture profiles that regularly confound nonspecialist assessment.
Access to qualified dermatologists is far from uniform. In many low- and middle-income regions, the patient-to-specialist ratio is orders of magnitude below clinical demand, leading to delayed diagnoses and preventable disease progression. Deep learning-based visual classifiers offer a concrete mechanism to bridge this gap: by automatically flagging high-risk lesions for expert review, such tools extend specialist capacity without requiring on-site expertise.
A well-documented barrier to clinical uptake is model opacity [2]. Convolutional networks deliver predictions without disclosing the visual evidence on which those predictions rest, undermining the trust of clinicians who must remain legally and professionally accountable for diagnostic decisions. Explainable AI (XAI) methods address this directly by producing human-readable rationales alongside model outputs.
This work makes four specific contributions. First, a structured preprocessing and augmentation pipeline with class-imbalance mitigation via inverse-frequency loss weighting. Second, an optimized VGG16 transfer learning model with a custom four-layer fully-connected head achieving 87.67% overall accuracy, 89.43% Top-1 accuracy, and 100.00% Top-5 accuracy on 17 skin disease classes. Third, integration of LRP as a post-hoc pixel-level explainability mechanism with qualitative clinical validation. Fourth, a comparative evaluation against classical feature-based baselines and a ResNet-50 transfer learning reference.
II. RELATED WORK
A. Classical Feature-Based Methods
Pre-deep-learning diagnostic systems extracted hand-crafted descriptors — colour histograms, Local Binary Patterns (LBP), histogram of oriented gradients (HOG) — and fed them into SVMs or nearest-neighbour classifiers. While feasible in controlled settings, these pipelines struggled with the photometric variation, hair occlusion, and lesion boundary irregularity typical of real dermatological images.
B. Deep CNN Approaches
The landmark result of Esteva et al. [1], who trained a CNN on over 129,000 clinical images to dermatologist-level accuracy, triggered a wave of deep learning research in dermatology. VGG16, introduced by Simonyan and Zisserman [3], demonstrated that stacking uniform 3×3 convolutional filters to substantial depth yields highly transferable feature representations; its ImageNet-pretrained weights have since become a standard starting point for medical image classification tasks with limited data [6]. ResNet [8] addressed gradient vanishing via residual shortcuts, while MobileNet [7] traded representational capacity for mobile-deployment efficiency through depthwise separable convolutions. Despite benchmark advances from these successors, VGG16 remains competitive under small-dataset transfer learning conditions owing to the simplicity and stability of its feature hierarchy.
C. Explainability in Clinical AI
Selvaraju et al. [4] introduced Grad-CAM, which constructs coarse localization maps by pooling class-discriminative gradient signals from the final convolutional layer. Bach et al. [5] proposed LRP, which propagates the prediction score backward through the network respecting a conservation principle, so that relevance assigned to each input pixel sums to the original output value. Comparative studies have found LRP to produce finer-grained, spatially more faithful explanations than gradient-based alternatives for skin lesion analysis [10], making it particularly appropriate when explanation precision is a clinical requirement.
III. DATASET
A. Source and Composition
Images were drawn from the Roboflow Universe public repository [9], covering 17 labeled categories: Actinic keratosis, Atopic dermatitis, Benign keratosis, Candidiasis, Dermatitis, Dermatofibroma, Melanocytic nevi, Melanoma, Ringworm, Squamous cell carcinoma, Tinea versicolor, Vascular lesions, Carcinoma, Cell carcinoma, Keratosis, Lesion, and Nevus.
Table I summarises the partition statistics.
TABLE I: DATASET PARTITION STATISTICS

| Split      | Samples | Proportion |
|------------|---------|------------|
| Training   | 3,674   | 80.9%      |
| Validation | 409     | 9.0%       |
| Test       | 454     | 10.0%      |
| Total      | 4,537   | 100%       |
Test-set class support ranges from 39 samples (Nevus, Melanocytic) to 107 samples (Keratosis), reflecting natural prevalence variation in dermatological data. Images were loaded via a PyTorch Dataset wrapper with batch size 8, producing verified tensor dimensions of [8,3,224,224].
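As a minimal sketch of the loading setup described above, the batch geometry can be reproduced with a placeholder `Dataset`; random tensors stand in for decoded images, since the dataset files, transforms, and class names are not reproduced here:

```python
import torch
from torch.utils.data import Dataset, DataLoader

class SkinLesionDataset(Dataset):
    """Placeholder for the PyTorch Dataset wrapper described in the text.

    Random tensors stand in for decoded, resized RGB images so that the
    batch geometry can be verified without the actual image files.
    """
    def __init__(self, num_samples=32, num_classes=17):
        self.num_samples = num_samples
        self.num_classes = num_classes

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        image = torch.rand(3, 224, 224)   # stand-in for a preprocessed image
        label = idx % self.num_classes    # stand-in integer class label
        return image, label

loader = DataLoader(SkinLesionDataset(), batch_size=8)
images, labels = next(iter(loader))
print(list(images.shape))  # [8, 3, 224, 224], matching the verified dimensions
```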
IV. METHODOLOGY
A. Preprocessing Pipeline
A five-stage pipeline was applied uniformly. Images were
(1) decoded from JPEG/PNG; (2) spatially resized to 224×224 pixels; (3) channel-normalised using the ImageNet per-channel statistics (mean [0.485, 0.456, 0.406], std [0.229, 0.224, 0.225]) to preserve pretrained weight compatibility; (4) augmented during training via random horizontal flipping, rotation up to 15°, random zoom, and minor brightness/contrast jitter; and (5) subjected to class frequency analysis to derive inverse-frequency weights for loss reweighting.
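Stage (5) can be sketched as follows; the normalisation convention (weights averaging to 1 over the training labels) is an assumption here, as the paper does not specify one:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class loss weights inversely proportional to class frequency.

    Normalised so that the weighted average over the training labels is 1;
    rarer classes therefore receive proportionally larger weights.
    """
    counts = Counter(labels)
    num_classes = len(counts)
    total = len(labels)
    return {c: total / (num_classes * n) for c, n in counts.items()}

# Toy example: class 0 occurs three times as often as class 1,
# so class 1 receives three times the loss weight.
weights = inverse_frequency_weights([0, 0, 0, 1])
```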
B. Model Architecture
The VGG16 backbone (13 convolutional layers in 5 blocks with MaxPooling) accepts 3-channel 224×224 inputs and produces a 512×7×7 feature map. All convolutional weights were frozen throughout training to prevent catastrophic forgetting of ImageNet representations.
The appended classification head performs: Adaptive Average Pooling → Flatten → Linear(25088→1024) + ReLU → Dropout(0.5) → Linear(1024→512) + ReLU → Dropout(0.4) → Linear(512→256) + ReLU → Linear(256→17). Progressive dimensionality reduction with dual dropout controls overfitting while preserving classification capacity on this modest-sized dataset.
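A sketch of this head in PyTorch, fed with a random tensor in place of the frozen VGG16 block-5 output (in practice the backbone would come from `torchvision.models.vgg16` with its convolutional parameters frozen):

```python
import torch
import torch.nn as nn

# The frozen VGG16 convolutional base emits a 512x7x7 feature map; a random
# tensor stands in for it so the head can be shape-checked in isolation.
head = nn.Sequential(
    nn.AdaptiveAvgPool2d((7, 7)),
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, 1024), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(1024, 512), nn.ReLU(), nn.Dropout(0.4),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 17),                 # one logit per disease class
)

features = torch.rand(8, 512, 7, 7)     # stand-in for backbone output
logits = head(features)                 # shape [8, 17]
```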
C. Training Configuration
Table II lists the full hyperparameter configuration. CrossEntropyLoss with inverse-frequency class weights was chosen to handle the imbalanced class distribution. AdamW provides weight-decay regularization alongside adaptive moment estimation. The ReduceLROnPlateau scheduler halves the learning rate whenever validation loss fails to improve for three consecutive epochs.
TABLE II: TRAINING HYPERPARAMETER CONFIGURATION

| Hyperparameter | Value                                  |
|----------------|----------------------------------------|
| Backbone       | VGG16 (ImageNet pretrained)            |
| Conv. weights  | Frozen                                 |
| Head           | 4-layer FC, dual dropout               |
| Loss function  | CrossEntropyLoss + class weights       |
| Optimiser      | AdamW                                  |
| Learning rate  | 1×10⁻⁵                                 |
| LR scheduler   | ReduceLROnPlateau (pat.=3, factor=0.5) |
| Batch size     | 8                                      |
| Epochs         | 20                                     |
| Input size     | 224×224×3                              |
| Dropout rates  | 0.5 (FC1), 0.4 (FC2)                   |
| Hardware       | GPU (CUDA) / CPU fallback              |
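This configuration maps directly onto standard PyTorch objects; the sketch below uses a placeholder model and uniform class weights where the real values are not reproduced here:

```python
import torch
import torch.nn as nn

model = nn.Linear(25088, 17)            # placeholder for the full VGG16 model
class_weights = torch.ones(17)          # placeholder inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)

# Per epoch: compute val_loss, then call scheduler.step(val_loss); the
# learning rate is halved after 3 consecutive epochs without improvement.
```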
D. Layer-wise Relevance Propagation
Following inference, LRP redistributes the scalar prediction score back through the network via layer-specific decomposition rules. For each layer, relevance is partitioned among inputs in proportion to their activation contribution, subject to the conservation constraint ∑_i R_i = f(x), where f(x) is the class score. The resulting pixel-level map R(x) was overlaid on the original image for qualitative clinical inspection, verifying that high-relevance regions correspond to lesion borders, pigmentation irregularities, and surface texture anomalies rather than background content.
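As an illustration of the conservation constraint, a single-layer LRP ε-rule can be written in a few lines of NumPy; this is the generic textbook formulation, not necessarily the paper's exact per-layer rule set:

```python
import numpy as np

def lrp_linear(x, W, R_out, eps=1e-9):
    """Epsilon-rule for a linear layer z = W @ x (no bias).

    Each output's relevance R_out[i] is redistributed to the inputs in
    proportion to their contributions x[j] * W[i, j]; eps stabilises
    near-zero pre-activations.
    """
    z = W @ x
    s = R_out / (z + eps * np.sign(z))
    return x * (W.T @ s)

rng = np.random.default_rng(0)
x = rng.random(5)                # toy "pixel" inputs
W = rng.standard_normal((3, 5))  # toy layer weights (no bias, so
                                 # conservation holds up to eps)
R_out = W @ x                    # start relevance from the output scores
R_in = lrp_linear(x, W, R_out)
# Conservation: sum of input relevance equals sum of output relevance, f(x)
```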
V. EXPERIMENTAL RESULTS
A. Training Convergence
Training loss fell from 0.3038 at epoch 1 to 0.0940 at epoch 20 (a 69.1% reduction). Validation loss tracked closely from 0.2561 to 0.0899, with no sign of divergence.
Table III reports selected epoch-level metrics; the full 20-epoch progression is illustrated in Fig. 1. Validation accuracy peaked at 71.88% at epoch 16, settling to 69.93% at epoch 20. The moderate gap between training and validation accuracy is expected given the dataset scale and is effectively controlled by the dual-dropout scheme.
TABLE III: SELECTED EPOCH TRAINING AND VALIDATION METRICS

| Epoch | Tr. Loss | Tr. Acc | Val. Loss | Val. Acc |
|-------|----------|---------|-----------|----------|
| 1     | 0.3038   | 0.2058  | 0.2561    | 0.2543   |
| 5     | 0.1516   | 0.4995  | 0.1328    | 0.4377   |
| 10    | 0.1177   | 0.5996  | 0.1082    | 0.6088   |
| 15    | 0.1014   | 0.6350  | 0.0963    | 0.6577   |
| 16    | 0.0997   | 0.6424  | 0.0941    | 0.7188   |
| 20    | 0.0940   | 0.6565  | 0.0899    | 0.6993   |
Fig. 1. Training and validation loss over 20 epochs. Both curves decrease monotonically with no divergence, confirming stable convergence.
B. Overall Test Performance
Table IV presents aggregate metrics on the 454-sample held-out test set. The 100.00% Top-5 accuracy confirms that the correct label consistently appeared among the five highest-probability predictions, even for cases where the top-1 ranking was ambiguous. The macro recall of 1.00 indicates that true-positive cases are almost never missed across the 17 classes, which is the primary safety criterion for a clinical screening tool.
TABLE IV: OVERALL TEST SET PERFORMANCE

| Metric              | Value   |
|---------------------|---------|
| Overall Accuracy    | 87.67%  |
| Top-1 Accuracy      | 89.43%  |
| Top-5 Accuracy      | 100.00% |
| Macro F1-Score      | 0.92    |
| Weighted F1-Score   | 0.93    |
| Micro Avg Precision | 0.86    |
| Micro Avg Recall    | 1.00    |
| Macro Avg Precision | 0.87    |
| Macro Avg Recall    | 1.00    |
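Top-k accuracy, on which the Top-5 figure rests, counts a prediction as correct if the true label appears anywhere among the k highest-scoring classes; a minimal sketch with toy scores (not the paper's data):

```python
import numpy as np

def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k top-scored classes."""
    topk = np.argsort(scores, axis=1)[:, -k:]
    return float(np.mean([labels[i] in topk[i] for i in range(len(labels))]))

scores = np.array([[0.1, 0.5, 0.2, 0.2],   # true label 1 ranked 1st
                   [0.5, 0.1, 0.3, 0.1],   # true label 2 ranked 2nd
                   [0.2, 0.2, 0.5, 0.1]])  # true label 2 ranked 1st
labels = np.array([1, 2, 2])

top1 = topk_accuracy(scores, labels, 1)    # 2/3: the middle sample misses
top2 = topk_accuracy(scores, labels, 2)    # 1.0: its label is in the top 2
```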
C. Per-Class Performance
Table V reports precision, recall, F1-score, and support for each of the 17 categories. Atopic, Benign, and Tinea achieved perfect scores (F1 = 1.00), attributable to morphologically distinctive features — characteristic scaling and erythema in Atopic dermatitis, well-defined homogeneous surfaces in Benign lesions, and the hypopigmented macule pattern of Tinea versicolor. Keratosis, the most represented class (n = 107), similarly attained F1 = 1.00, consistent with a positive relationship between class support and classification reliability.
The lowest precision values were recorded for Melanoma (0.59), Dermatofibroma (0.64), and Melanocytic (0.72), all of which share overlapping border irregularity and pigmentation characteristics. All three nonetheless achieved perfect recall (1.00): the model generates no false negatives for these high-risk categories, which constitutes the clinically preferable failure mode since false positives are resolvable by specialist review.
TABLE V: PER-CLASS CLASSIFICATION PERFORMANCE (TEST SET, n=924)

| Class          | Prec. | Rec. | F1   | Supp. |
|----------------|-------|------|------|-------|
| Actinic        | 0.88  | 1.00 | 0.94 | 51    |
| Atopic         | 1.00  | 1.00 | 1.00 | 45    |
| Benign         | 1.00  | 1.00 | 1.00 | 56    |
| Candidiasis    | 0.98  | 1.00 | 0.99 | 60    |
| Dermatitis     | 0.98  | 1.00 | 0.99 | 45    |
| Dermatofibroma | 0.64  | 1.00 | 0.78 | 54    |
| Melanocytic    | 0.72  | 1.00 | 0.84 | 39    |
| Melanoma       | 0.59  | 1.00 | 0.74 | 44    |
| Ringworm       | 0.98  | 1.00 | 0.99 | 60    |
| Squamous       | 0.78  | 1.00 | 0.88 | 54    |
| Tinea          | 1.00  | 1.00 | 1.00 | 60    |
| Vascular       | 0.96  | 1.00 | 0.98 | 51    |
| Carcinoma      | 0.75  | 1.00 | 0.86 | 54    |
| Cell           | 0.75  | 1.00 | 0.86 | 54    |
| Keratosis      | 1.00  | 0.99 | 1.00 | 107   |
| Lesion         | 0.96  | 1.00 | 0.98 | 51    |
| Nevus          | 0.76  | 1.00 | 0.87 | 39    |
| Micro avg      | 0.86  | 1.00 | 0.92 | 924   |
| Macro avg      | 0.87  | 1.00 | 0.92 | 924   |
| Weighted avg   | 0.88  | 1.00 | 0.93 | 924   |
D. Comparison with Baseline Methods
Table VI benchmarks the proposed system against classical feature-based methods and two alternative transfer learning configurations on the same dataset.
TABLE VI: COMPARISON WITH BASELINE METHODS

| Method               | Acc.   | Macro F1 | Top-5 |
|----------------------|--------|----------|-------|
| LBP + SVM            | ∼52%   | 0.49     | —     |
| HOG + Random Forest  | ∼58%   | 0.55     | —     |
| VGG16 (no fine-tune) | ∼71%   | —        | ∼94%  |
| ResNet-50 Transfer   | ∼85%   | —        | ∼99%  |
| Proposed (VGG16+LRP) | 87.67% | 0.92     | 100%  |
Fig. 2. Normalised confusion matrix across all 17 disease classes. Diagonal dominance confirms strong per-class discrimination; residual off-diagonal mass is concentrated between Melanoma, Melanocytic, and Dermatofibroma.
The proposed system surpasses both classical baselines and the VGG16 reference without fine-tuning by substantial margins. Against ResNet-50 transfer learning, it achieves a 2.67-percentage-point accuracy gain alongside a meaningful F1 improvement (+0.08) and a perfect Top-5 score, demonstrating that the custom multi-layer classification head and class-weighted training extract greater discriminative value from the VGG16 feature space than an off-the-shelf head configuration.
VI. DISCUSSION
The 100% Top-5 accuracy and macro recall of 1.00 indicate that the learned embedding space is well-organized: even where top-1 predictions are ambiguous, the model reliably ranks the correct class within the five most probable outputs. For clinical triage this means that, at least on this test set, a specialist reviewing the model's top-5 candidates never encounters a case where the true diagnosis has been discarded entirely.
Reduced precision for Melanoma, Dermatofibroma, and Melanocytic reflects genuine perceptual similarity between these categories rather than a systematic model failure. Since all three classes achieve perfect recall, the practical implication is an elevated false-positive rate for these categories, which translates to additional specialist referrals rather than missed diagnoses — an acceptable trade-off in a pre-screening context.
The LRP heatmaps serve a dual clinical function: they provide positive evidence of appropriate model focus when attention aligns with visible lesions, and they provide a disqualification signal when attention drifts to background content, enabling clinicians to calibrate their reliance on individual predictions.
Three limitations warrant acknowledgement. First, the static image modality excludes dermoscopic metadata and temporal lesion evolution data that clinicians routinely consult. Second, limited skin tone diversity in the training corpus may reduce generalization to underrepresented demographic groups. Third, the frozen convolutional base constrains the model’s ability to adapt low-level feature detectors to dermatology-specific cues.
CONCLUSION
We proposed and evaluated an interpretable 17-class skin disease classifier combining a frozen VGG16 backbone with a purpose-designed classification head and LRP explainability. On a 454-sample held-out test set the system recorded 87.67% overall accuracy, 89.43% Top-1 accuracy, 100% Top-5 accuracy, and a macro F1-score of 0.92, outperforming both classical feature baselines and a ResNet-50 transfer learning reference. Perfect recall across 16 of 17 classes satisfies the false-negative safety requirement for a clinical pre-screening tool. LRP attribution maps confirmed that model decisions rest on diagnostically relevant lesion features, providing the interpretability layer necessary for clinical acceptance.
Future directions include selective fine-tuning of the upper convolutional blocks, ArcFace-based metric learning to tighten the embedding boundary between Melanoma and Melanocytic categories, dataset augmentation with demographically diverse skin tone representation, and model quantization for mobile-optimized deployment in low-resource settings.
REFERENCES
- A. Esteva, B. Kuprel, and R. A. Novoa, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
- T. J. Brinker et al., “Deep learning outperformed 136 of 157 dermatologists in a head-to-head dermoscopic melanoma image classification task,” Eur. J. Cancer, vol. 113, pp. 47–54, 2019.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in Proc. ICLR, 2015.
- R. R. Selvaraju et al., “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” in Proc. IEEE ICCV, 2017, pp. 618–626.
- S. Bach et al., “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PLOS ONE, vol. 10, no. 7, p. e0130140, 2015.
- G. Litjens et al., “A survey on deep learning in medical image analysis,” Med. Image Anal., vol. 42, pp. 60–88, 2017.
- A. G. Howard et al., “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE CVPR, 2016, pp. 770–778.
- Nagesh, “Skin disease dataset,” Roboflow Universe, 2023. [Online]. Available: https://universe.roboflow.com/nagesh-pwywk/skin-disease-spiyb
- Z. Zhang et al., “Explainable deep learning for medical imaging: A review,” IEEE Access, vol. 8, pp. 145820–145838, 2020.
Sakshi Sharan*
10.5281/zenodo.19976115