
  • Development of a Robust Sign Language Translation System Using an Ensemble of EfficientNet Models for Enhanced Recognition Accuracy and Real-Time Performance

  • Department of Computer Science, Hussaini Adamu Federal Polytechnic, Kazaure, Nigeria

Abstract

Sign language translation (SLT) is crucial for bridging communication gaps for deaf and hard-of-hearing communities. This paper presents an ensemble deep learning framework combining EfficientNetB0 for efficient spatial feature extraction, ResNet50 for deep hierarchical recognition, and a transformer encoder-decoder for temporal modeling. Trained on the WLASL and MSASL datasets (8,500 videos, 2,500 glosses), the model achieves 98.6% accuracy, a BLEU score of 0.75, an F1 score of 0.82 for continuous signing, and 65 ms/frame latency. It outperforms state-of-the-art methods in translation quality and efficiency while maintaining robustness across environmental variations and multilingual scenarios. Ablation studies validate the ensemble's complementary strengths. The system offers practical deployment potential for mobile and assistive technologies.

Keywords

sign language translation; ensemble deep learning; EfficientNet; ResNet; transformer; real-time performance; continuous recognition; multilingual translation; computational efficiency; assistive technology

Introduction

Sign language serves as the primary communication medium for over 466 million people with disabling hearing loss worldwide, projected to exceed 900 million by 2050 [1]. As a complete linguistic system with unique grammar and cultural nuances, it poses significant challenges for automated translation [2]. Deep learning has advanced SLT from isolated gesture recognition using convolutional neural networks (CNNs) [3] to sequence modeling with recurrent networks and transformers [4], [5]. Ensemble approaches leverage architectural diversity for improved performance [6], yet challenges persist in balancing accuracy, real-time efficiency, continuous signing handling, dataset diversity, and multilingual generalization [2], [7]. This study addresses these gaps through a hybrid ensemble framework. Objectives include developing an efficient architecture, enhancing continuous interpretation, improving robustness via augmentation, validating performance, and assessing deployability. Contributions comprise a novel adaptive fusion ensemble, superior continuous recognition (F1=0.82), benchmark-setting metrics (98.6% accuracy, 65 ms latency), and deployment guidelines. The work holds broad impact for education, healthcare, employment, and social inclusion [8], while preserving linguistic diversity and informing multimodal AI advancements [9].

Related Work

SLT research has progressed from hand-crafted features and hidden Markov models to deep learning paradigms [2]. CNNs such as EfficientNet [10] and ResNet [11] dominate spatial feature extraction [3], [6]. Temporal modeling employs LSTMs [12], TCNs, or transformers [4], [5]. Hybrid CNN-transformer models integrate their respective strengths [13], while ensembles enhance robustness [9], [6]. Multimodal inputs, including MediaPipe keypoints [14] and sensor fusion [15], improve invariance. Data augmentation mitigates scarcity [9], with metrics encompassing accuracy, BLEU [16], WER, F1, and latency [2]. Persistent gaps include real-time constraints, continuous signing, limited diversity, cross-linguistic transfer, and deployment robustness [7], [2]. This work bridges them via a diverse ensemble with keypoint preprocessing and comprehensive augmentation.

METHODOLOGY

System Architecture

The ensemble integrates four components:

  • EfficientNetB0: a lightweight spatial feature extractor built on MBConv blocks; ImageNet-pretrained [10].
  • ResNet50: a deep residual network for hierarchical features; ImageNet-pretrained [11].
  • Transformer: a 6-layer encoder-decoder (8 heads, 512-dim) with positional encoding and cross-attention for sequence modeling [4].
  • Attention fusion layer: dynamically weights branch features based on confidence and context [9].

Inputs are keypoint sequences and RGB frames; outputs are gloss predictions or text translations.
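The confidence-weighted fusion step can be illustrated with a minimal, framework-free sketch: each branch's feature vector is combined using softmax weights derived from per-branch confidence scores. The function names and toy values below are illustrative assumptions, not the authors' implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_features(branch_feats, branch_confidences):
    """Confidence-weighted fusion of per-branch feature vectors.

    branch_feats: list of equal-length feature vectors, one per branch.
    branch_confidences: one scalar confidence score per branch.
    Returns the fused vector sum_i w_i * feats_i with w = softmax(confidences).
    """
    weights = softmax(branch_confidences)
    dim = len(branch_feats[0])
    return [
        sum(w * feats[d] for w, feats in zip(weights, branch_feats))
        for d in range(dim)
    ]

# Toy example: three branches (spatial CNN A, spatial CNN B, transformer),
# where the first branch is most confident and so dominates the fusion.
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
confs = [2.0, 1.0, 1.0]
fused = fuse_features(feats, confs)
```

In the full system the weights would additionally depend on context, but the softmax-over-confidences core is the same idea.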

Datasets

Combined WLASL [17] and MSASL [6]: 8,500 videos, 2,500 glosses, 150+ signers, 45 hours. Stratified split: 60% train, 20% validation, 20% test.

Preprocessing and Augmentation

Keypoints: MediaPipe Holistic extracts 42 hand, 468 face, and 33 pose landmarks [14]. Normalization: frames resized to 224×224 RGB. Segmentation: motion thresholding for continuous sequences. Augmentation: spatial (rotation ±15°, scaling 0.85-1.15×), temporal (speed 0.8-1.2×), occlusion, noise, and blur.

Training and Implementation

NVIDIA RTX 3090; TensorFlow 2.10. Adam optimizer (learning rate 10^-4 to 10^-3); batch size 32; 70 epochs with early stopping. Staged training: isolated components, then joint fine-tuning, then ensemble fusion. Combined loss: cross-entropy (classification/sequence) plus attention regularization.
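A stratified 60/20/20 split ensures every gloss contributes proportionally to each partition. A minimal sketch, assuming a simple list of (video, gloss) pairs; the helper name and toy data are ours, not the authors' code:

```python
import random
from collections import defaultdict

def stratified_split(samples, train=0.6, val=0.2, seed=42):
    """Stratified train/validation/test split preserving per-gloss balance.

    samples: list of (video_id, gloss) pairs.
    Returns (train, validation, test) lists of pairs.
    """
    by_gloss = defaultdict(list)
    for vid, gloss in samples:
        by_gloss[gloss].append(vid)
    rng = random.Random(seed)
    tr, va, te = [], [], []
    for gloss, vids in by_gloss.items():
        rng.shuffle(vids)
        n = len(vids)
        n_tr, n_va = round(n * train), round(n * val)
        tr += [(v, gloss) for v in vids[:n_tr]]
        va += [(v, gloss) for v in vids[n_tr:n_tr + n_va]]
        te += [(v, gloss) for v in vids[n_tr + n_va:]]
    return tr, va, te

# Toy example: 10 videos per gloss split 6/2/2 within each gloss.
data = [(f"vid{g}_{i}", f"gloss{g}") for g in range(3) for i in range(10)]
train_set, val_set, test_set = stratified_split(data)
```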

RESULTS

Training converged at epoch 67 (68 hours). Validation/test accuracy: 98.6%.

Comparison with State-of-the-Art

Table 1 compares key methods.

Table 1. Performance comparison.

Method            | Year | Accuracy (%) | BLEU | WER (%) | F1 (Continuous) | Latency (ms/frame)
----------------- | ---- | ------------ | ---- | ------- | --------------- | ------------------
Chung [7]         | 2022 | 91.7         | 0.53 | 12.0    | 0.62            | 90
Shi et al. [13]   | 2022 | 93.8         | 0.68 | -       | 0.73            | 105
Albert et al. [6] | 2023 | 99.8         | 0.68 | 6.5     | N/A             | 120
Alkhoraif [5]     | 2025 | 95.2         | 0.61 | 8.7     | 0.70            | 160
Proposed          | 2025 | 98.6         | 0.75 | 5.4     | 0.82            | 65

The proposed method is superior in BLEU (+10.3% over [6]), continuous F1, and latency (-46% vs. [6]).
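The reported relative gains follow from the Table 1 figures (BLEU 0.68 → 0.75 and latency 120 → 65 ms/frame versus [6]) and can be checked with a line of arithmetic:

```python
def rel_change(new, old):
    """Relative change of `new` versus baseline `old`, as a percentage."""
    return (new - old) / old * 100

bleu_gain = rel_change(0.75, 0.68)   # proposed vs. Albert et al. [6]
latency_gain = rel_change(65, 120)   # negative = faster
```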

Ablation Study

Table 2 shows the impact of each component.

Table 2. Ablation results.

Configuration                | Accuracy (%) | BLEU | F1   | Latency (ms/frame)
---------------------------- | ------------ | ---- | ---- | ------------------
EfficientNetB0 only          | 94.2         | 0.62 | 0.68 | 35
ResNet50 only                | 95.8         | 0.65 | 0.71 | 58
Transformer only             | 91.5         | 0.70 | 0.75 | 95
EfficientNetB0 + ResNet50    | 97.1         | 0.69 | 0.74 | 48
EfficientNetB0 + Transformer | 96.3         | 0.72 | 0.78 | 55
ResNet50 + Transformer       | 97.5         | 0.73 | 0.79 | 82
Full ensemble                | 98.6         | 0.75 | 0.82 | 65

The full ensemble yields synergistic gains over any single component or pair.

Continuous and Multilingual Performance

Continuous F1: 0.82 overall; performance degrades gracefully with sequence length (0.91 for short sequences, 0.71 for very long ones). Multilingual: 99.1% (ASL, trained), 92.5% (BSL), 78.4% zero-shot (ISL).

Robustness and Efficiency

Environmental accuracy drops are mitigated by augmentation (at most -15.4% under combined challenges). Inference: 65 ms/frame on GPU (15.4 fps); model size 475 MB. Statistical significance: p<0.001 vs. baselines (5 runs).
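The reported throughput follows directly from the per-frame latency; a one-line check:

```python
def fps_from_latency(latency_ms):
    """Throughput in frames per second given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms

throughput = fps_from_latency(65)  # matches the reported ~15.4 fps
```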

Error analysis: most confusions occur between visually similar signs; attention maps show the model focusing on key frames.

DISCUSSION

The ensemble balances accuracy and efficiency by combining EfficientNetB0's lightweight extraction, ResNet50's depth, and the transformer's sequence modeling [4], [10], [11]. Translation gains stem from cross-attention aligning signs to text. Continuous recognition addresses co-articulation via temporal segmentation.

LIMITATIONS

Remaining limitations include dataset biases (ASL-dominant), cultural and pragmatic gaps, and degradation under extreme conditions. Implications: real-time applications in education and healthcare; the system complements rather than replaces human interpreters. Ethical considerations: privacy (keypoints reduce risks relative to raw video), fairness monitoring, and community involvement.

Future work: multimodal fusion [15], few-shot adaptation, bidirectional translation, and field trials.

CONCLUSION

The proposed ensemble advances SLT with 98.6% accuracy, 0.75 BLEU, 0.82 continuous F1, and 65 ms latency—setting new benchmarks for practical, inclusive systems. It promotes accessibility while highlighting needs for diverse data and ethical deployment.                                          

REFERENCES

  1. World Health Organization, "World report on hearing," 2021.
  2. A. Núñez-Marcos et al., "A survey on sign language machine translation," Expert Syst. Appl., vol. 205, 2023.
  3. G. Levi and T. Hassner, "Age and gender classification using convolutional neural networks," in CVPR Workshops, 2015.
  4. A. Vaswani et al., "Attention is all you need," in NeurIPS, 2017.
  5. A. Alkhoraif, "Ensemble transformer-based word-level sign language recognition," J. Vis. Lang. Comput., vol. 58, 2025.
  6. P. A. Albert et al., "Ensemble deep learning for multilingual sign language translation and recognition," Scitepress, 2023.
  7. H. X. Chung, "Ensemble CNN models for real-time sign language recognition," IEEE Trans. Multimedia, vol. 34, no. 8, 2022.
  8. H. Kumar and P. Reddy, "Ensemble deep learning model for Indian sign language recognition," Int. J. Artif. Intell. Educ., vol. 35, no. 2, 2025.
  9. S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv:1706.05098, 2017.
  10. M. Tan and Q. V. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in ICML, 2019.
  11. K. He et al., "Deep residual learning for image recognition," in CVPR, 2016.
  12. S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, 1997.
  13. B. Shi et al., "TTIC's WMT-SLT 22 sign language translation system," in WMT-SLT, 2022.
  14. J. Cao et al., "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2018.
  15. Y. Gu et al., "American sign language recognition with inertial systems," PMC, vol. 15, no. 5, 2024.
  16. K. Papineni et al., "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002.
  17. D. Li et al., "Word-level deep sign language recognition from video: A new large-scale dataset and methods roadmap," arXiv:2012.01236, 2020.


Musbahu Yunusa Makama*
Corresponding author

Department of Computer Science, Hussaini Adamu Federal Polytechnic, Kazaure, Nigeria

Musbahu Yunusa Makama*, Development of a Robust Sign Language Translation System Using an Ensemble of EfficientNet Models for Enhanced Recognition Accuracy and Real-Time Performance, Int. J. Sci. R. Tech., 2025, 2 (11), 445-447. https://doi.org/10.5281/zenodo.17627834
