Department of Computer Science, Hussaini Adamu Federal Polytechnic, Kazaure, Nigeria
Sign language translation (SLT) is crucial for bridging communication gaps in deaf and hard-of-hearing communities. This paper presents an ensemble deep learning framework combining EfficientNetB0 for efficient spatial feature extraction, ResNet50 for deep hierarchical recognition, and a transformer encoder-decoder for temporal modeling. Trained on the WLASL and MSASL datasets (8,500 videos, 2,500 glosses), the model achieves 98.6% accuracy, a BLEU score of 0.75, an F1 score of 0.82 for continuous signing, and 65 ms/frame latency. It outperforms state-of-the-art methods in translation quality and efficiency while maintaining robustness across environmental variations and multilingual scenarios. Ablation studies validate the ensemble's complementary strengths. The system offers practical deployment potential for mobile and assistive technologies.
Sign language serves as the primary communication medium for over 466 million people with disabling hearing loss worldwide, projected to exceed 900 million by 2050 [1]. As a complete linguistic system with unique grammar and cultural nuances, it poses significant challenges for automated translation [2]. Deep learning has advanced SLT from isolated gesture recognition using convolutional neural networks (CNNs) [3] to sequence modeling with recurrent networks and transformers [4], [5]. Ensemble approaches leverage architectural diversity for improved performance [6], yet challenges persist in balancing accuracy, real-time efficiency, continuous signing handling, dataset diversity, and multilingual generalization [2], [7]. This study addresses these gaps through a hybrid ensemble framework. Objectives include developing an efficient architecture, enhancing continuous interpretation, improving robustness via augmentation, validating performance, and assessing deployability. Contributions comprise a novel adaptive fusion ensemble, superior continuous recognition (F1=0.82), benchmark-setting metrics (98.6% accuracy, 65 ms latency), and deployment guidelines. The work holds broad impact for education, healthcare, employment, and social inclusion [8], while preserving linguistic diversity and informing multimodal AI advancements [9].
Related Work
SLT research has progressed from hand-crafted features and hidden Markov models to deep learning paradigms [2]. CNNs such as EfficientNet [10] and ResNet [11] dominate spatial feature extraction [3], [6]. Temporal modeling employs LSTMs [12], TCNs, or transformers [4], [5]. Hybrid CNN-transformer models integrate their respective strengths [13], while ensembles enhance robustness [9], [6]. Multimodal inputs, including MediaPipe keypoints [14] and sensor fusion [15], improve invariance. Data augmentation mitigates data scarcity [9], with evaluation metrics encompassing accuracy, BLEU [16], WER, F1, and latency [2]. Persistent gaps include real-time constraints, continuous signing, limited dataset diversity, cross-linguistic transfer, and deployment robustness [7], [2]. This work bridges them via a diverse ensemble with keypoint preprocessing and comprehensive augmentation.
METHODOLOGY
System Architecture
The ensemble integrates:
EfficientNetB0: lightweight spatial extractor with MBConv blocks; ImageNet-pretrained [10].
ResNet50: deep residual network for hierarchical features; ImageNet-pretrained [11].
Transformer: 6-layer encoder-decoder (8 heads, 512-dim) with positional encoding and cross-attention for sequence modeling [4].
Attention Fusion Layer: dynamically weights branch features based on confidence and context [9].
Inputs are keypoint sequences and frames; outputs are gloss predictions or text translations.
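The attention fusion step can be illustrated with a minimal NumPy sketch. This is a simplified stand-in, not the paper's implementation: `attention_fusion` and the scoring vector `score_w` are hypothetical names, and a single linear scoring function replaces whatever learned confidence/context mechanism the actual fusion layer uses. It shows only the core idea: score each branch's feature vector, normalize the scores with a softmax, and take the weighted sum.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_fusion(features, score_w):
    """Fuse per-branch feature vectors by confidence-weighted attention.

    features: list of (d,) arrays, one per branch
              (e.g., EfficientNetB0, ResNet50, transformer outputs)
    score_w:  (d,) scoring vector; illustrative stand-in for a learned layer
    """
    # One scalar confidence score per branch
    scores = np.array([f @ score_w for f in features])
    # Softmax turns scores into fusion weights that sum to 1
    weights = softmax(scores)
    # Weighted sum of branch features
    fused = sum(w * f for w, f in zip(weights, features))
    return fused, weights
```

In practice the fused vector would feed the downstream gloss classifier or translation decoder, and `score_w` would be trained jointly with the rest of the ensemble.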
Datasets
Combined WLASL [17] and MSASL [6]: 8,500 videos, 2,500 glosses, 150+ signers, 45 hours. Stratified split: 60% train, 20% validation, 20% test.
Preprocessing and Augmentation
Keypoints: MediaPipe Holistic extracts 42 hand, 468 face, and 33 pose landmarks [14]. Normalization: 224×224 RGB frames. Segmentation: motion thresholding for continuous sequences. Augmentation: spatial (rotation ±15°, scaling 0.85-1.15×), temporal (speed 0.8-1.2×), occlusion, noise, blur.
Training and Implementation
NVIDIA RTX 3090; TensorFlow 2.10. Adam optimizer (learning rate 10^-4 to 10^-3); batch size 32; 70 epochs with early stopping. Staged training: isolated components, joint fine-tuning, ensemble fusion. Combined loss: cross-entropy (classification/sequence) + attention regularization.
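The spatial and temporal augmentation ranges above can be sketched on a keypoint sequence. This is an illustrative NumPy version, not the paper's pipeline: `augment_keypoints` is a hypothetical helper, and it applies only the rotation, scaling, and speed perturbations listed (occlusion, noise, and blur are omitted for brevity).

```python
import numpy as np

rng = np.random.default_rng(0)

def augment_keypoints(seq):
    """Augment a keypoint sequence of shape (T, K, 2), coords in [0, 1].

    Applies the ranges stated in the text: rotation ±15°,
    scaling 0.85-1.15×, and temporal speed change 0.8-1.2×.
    """
    # Spatial: random rotation about the sequence centroid
    theta = np.deg2rad(rng.uniform(-15, 15))
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    center = seq.reshape(-1, 2).mean(axis=0)
    seq = (seq - center) @ rot.T + center
    # Spatial: random isotropic scaling about the same centroid
    seq = (seq - center) * rng.uniform(0.85, 1.15) + center
    # Temporal: speed change via linear resampling of frames
    t = seq.shape[0]
    new_t = max(2, int(round(t / rng.uniform(0.8, 1.2))))
    idx = np.linspace(0, t - 1, new_t)
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, t - 1)
    frac = (idx - lo)[:, None, None]
    return (1 - frac) * seq[lo] + frac * seq[hi]
```

A real pipeline would apply such transforms on the fly during training so each epoch sees a different perturbation of every sequence.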
RESULTS
Training converged at epoch 67 (68 hours). Validation/test accuracy: 98.6%.
Comparison with State-of-the-Art
Table 1 compares key methods.
Table 1. Performance comparison.
| Method | Year | Accuracy (%) | BLEU | WER (%) | F1 (Continuous) | Latency (ms/frame) |
|---|---|---|---|---|---|---|
| Chung [7] | 2022 | 91.7 | 0.53 | 12.0 | 0.62 | 90 |
| Shi et al. [13] | 2022 | 93.8 | 0.68 | - | 0.73 | 105 |
| Albert et al. [6] | 2023 | 99.8 | 0.68 | 6.5 | N/A | 120 |
| Alkhoraif [5] | 2025 | 95.2 | 0.61 | 8.7 | 0.70 | 160 |
| Proposed | 2025 | 98.6 | 0.75 | 5.4 | 0.82 | 65 |
The proposed method is superior in BLEU (+10.3% relative to [6]), continuous F1, and latency (-46% vs. [6]).
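The relative deltas quoted above follow directly from Table 1; a two-line check makes the arithmetic explicit (values taken from the table, comparing against Albert et al. [6]):

```python
# Headline deltas from Table 1, relative to Albert et al. [6]
bleu_proposed, bleu_baseline = 0.75, 0.68
lat_proposed, lat_baseline = 65, 120  # ms/frame

bleu_gain = (bleu_proposed - bleu_baseline) / bleu_baseline    # ~ +10.3%
latency_change = (lat_proposed - lat_baseline) / lat_baseline  # ~ -45.8%

print(f"BLEU: {bleu_gain:+.1%}, latency: {latency_change:+.1%}")
```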
Ablation Study
Table 2 shows the impact of each component.
Table 2. Ablation results.
| Configuration | Accuracy (%) | BLEU | F1 | Latency (ms/frame) |
|---|---|---|---|---|
| EfficientNetB0 only | 94.2 | 0.62 | 0.68 | 35 |
| ResNet50 only | 95.8 | 0.65 | 0.71 | 58 |
| Transformer only | 91.5 | 0.70 | 0.75 | 95 |
| EfficientNetB0 + ResNet50 | 97.1 | 0.69 | 0.74 | 48 |
| EfficientNetB0 + Transformer | 96.3 | 0.72 | 0.78 | 55 |
| ResNet50 + Transformer | 97.5 | 0.73 | 0.79 | 82 |
| Full ensemble | 98.6 | 0.75 | 0.82 | 65 |
The full ensemble yields synergistic gains.
Continuous and Multilingual Performance
Continuous F1: 0.82 overall; performance degrades gracefully with sequence length (0.91 for short sequences, 0.71 for very long ones). Multilingual: 99.1% (ASL, trained), 92.5% (BSL), 78.4% zero-shot (ISL).
Robustness and Efficiency
Environmental accuracy drops are mitigated by augmentation (maximum -15.4% under combined challenges). Inference: 65 ms/frame on GPU (15.4 fps); model size 475 MB. Statistical significance: p < 0.001 vs. baselines (5 runs).
Error analysis: confusions occur mainly between visually similar signs; attention concentrates on key frames.
DISCUSSION
The ensemble balances accuracy and efficiency via EfficientNetB0's lightness, ResNet50's depth, and the transformer's sequence modeling [4], [10], [11]. Translation gains stem from cross-attention aligning signs to text. Continuous recognition addresses co-articulation via temporal segmentation.
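The sign-to-text alignment performed by cross-attention can be made concrete with a single-head, NumPy-only sketch. This is a generic scaled dot-product attention illustration, not the paper's 8-head, 512-dim implementation; `cross_attention` is a hypothetical name, and projections, masking, and multi-head splitting are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no projections).

    queries: (Tq, d) decoder-side text token states
    keys, values: (Tk, d) encoder-side sign-frame representations
    Returns the attended values and the (Tq, Tk) alignment matrix.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)  # similarity of each token to each frame
    weights = softmax(scores, axis=-1)      # each token's weights over frames sum to 1
    return weights @ values, weights
```

The alignment matrix is what lets each output word attend to the sign frames that produced it, which is the mechanism the discussion credits for the translation gains.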
LIMITATIONS
Dataset biases (ASL-dominant); cultural/pragmatic gaps; degradation under extreme conditions.
Implications: real-time applications in education and healthcare; the system complements human interpreters.
Ethical considerations: privacy (keypoints reduce risks), fairness monitoring, community involvement.
Future work: multimodal fusion [15], few-shot adaptation, bidirectional translation, field trials.
CONCLUSION
The proposed ensemble advances SLT with 98.6% accuracy, 0.75 BLEU, 0.82 continuous F1, and 65 ms latency—setting new benchmarks for practical, inclusive systems. It promotes accessibility while highlighting needs for diverse data and ethical deployment.
REFERENCE
Musbahu Yunusa Makama*, Development of a Robust Sign Language Translation System Using an Ensemble of Efficient Net Models for Enhanced Recognition Accuracy and Real-Time Performance, Int. J. Sci. R. Tech., 2025, 2 (11), 445-447. https://doi.org/10.5281/zenodo.17627834