Shri Ram Murti Smarak College of Engineering and Technology, Bareilly, U.P., India
This paper introduces Stutter Clear, a machine learning system designed to detect stuttering in real-time speech and convert it into smooth, fluent speech while preserving the speaker's natural voice identity. The system works in three main stages: audio preprocessing and feature extraction, stuttering detection and segmentation, and fluency enhancement using speech synthesis or transformation techniques. To achieve this, a hybrid deep learning architecture is used that combines CNN–RNN (or Conv–TCN) networks with Transformer-based sequence-to-sequence models. The paper also discusses the system's design, dataset preparation, training process, evaluation metrics, limitations, and directions for future research.
1. Introduction

Stuttering is a widespread speech disorder that affects the natural rhythm and flow of speech. It is characterized by involuntary repetitions, prolongations, or blocks of sounds and syllables, which often make communication challenging for the speaker. This disruption in fluency can lead to frustration, anxiety, and low self-confidence, especially in social or professional situations [1]. Although stuttering varies in severity from person to person, its psychological and emotional effects are often profound. For many individuals, the fear of speaking in public or engaging in conversation becomes a major barrier, limiting their personal growth, career opportunities, and quality of life [2].

Over the years, various therapeutic and technological solutions have been developed to help people who stutter. Traditional speech therapy focuses on breathing techniques, controlled speech, and behavioral exercises to improve fluency [3]. While these methods are helpful, their effectiveness depends largely on regular practice and the individual's response to therapy [4]. Technological aids such as Delayed Auditory Feedback (DAF) devices attempt to improve fluency by altering how the speaker hears their own voice, encouraging smoother speech patterns. However, these solutions are not universally effective and often fail to address the real-time dynamics of stuttering; many users also find such devices uncomfortable or difficult to use in everyday communication [5].

In response to these limitations, this research proposes an adaptive, intelligent system that detects and corrects stuttering automatically in real time. The main goal is to provide a seamless speaking experience by combining modern machine learning and speech processing technologies [6]. The system operates in three core stages: detection, classification, and correction. In the first stage, the system captures live audio and performs preprocessing tasks such as noise removal and feature extraction. These features are then analyzed to identify stuttering patterns in the second stage [7]. The model classifies and segments various types of disfluencies, including repetitions, prolongations, and speech blocks. In the final stage, the detected stuttered segments are processed and corrected to produce smooth, fluent speech while preserving the speaker's natural tone and voice identity [8].

To achieve this, the system uses a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs) for feature extraction and sequence modeling. Transformer-based sequence-to-sequence models are additionally integrated to handle complex temporal dependencies and improve fluency correction accuracy [9]. These models enable the system to understand speech patterns, detect disfluencies efficiently, and transform them into fluent speech in real time [10].

Beyond the technical aspects, the research also covers dataset preparation, model training, and performance evaluation using objective and subjective metrics [11]. By training the model on diverse speech samples containing varied stuttering patterns, the system learns to generalize across different speakers and speech contexts. Evaluation parameters such as detection accuracy, fluency improvement rate, and voice naturalness are used to measure performance [12].
2. Related Work
Over the past several decades, numerous studies and technological developments have focused on addressing the challenges of stuttering and improving speech fluency. One of the earliest and most widely explored approaches involves Delayed Auditory Feedback (DAF) and Frequency-Altered Feedback (FAF) devices. These electronic aids modify how the speaker hears their own voice, either by introducing a slight delay or by shifting the pitch frequency [13]. This auditory alteration can temporarily improve speech fluency by helping the speaker slow down and regulate their speech rhythm [14]. However, while DAF and FAF devices can produce short-term benefits, they often lack adaptability and long-term effectiveness [15]. Users may experience only partial fluency improvement, and the devices can cause discomfort or distraction during prolonged use. Furthermore, these tools do not adapt to the individual's specific speech patterns or stuttering triggers, limiting their real-world application and scalability [16].

With the rise of artificial intelligence and machine learning, data-driven approaches to stuttering detection emerged. Traditional machine learning algorithms such as Support Vector Machines (SVMs), Random Forests, and Hidden Markov Models (HMMs) were among the first applied to identifying disfluencies in recorded speech [17]. These methods rely on handcrafted features such as pitch, energy, and temporal pauses extracted from speech signals. While they achieved reasonable accuracy in controlled experiments, they struggled to perform consistently in real-world, noisy environments [18].

To overcome these challenges, deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), were introduced. CNNs proved effective for extracting spectral and temporal features from audio spectrograms, while RNNs and their variants, such as Long Short-Term Memory (LSTM) networks, captured sequential dependencies in speech. These advancements significantly improved stuttering detection performance, enabling more accurate identification of disfluency types such as repetitions, prolongations, and blocks. However, despite their success in detection, these models typically stopped at identifying stuttered segments and did not attempt to enhance or correct them [19].

In parallel, major progress has been made in neural speech enhancement and voice conversion. Using architectures such as Transformers and sequence-to-sequence (Seq2Seq) models, researchers have developed systems that can reconstruct, clean, or modify speech at the spectrogram level [20]. These models have been used to reduce background noise, enhance clarity, and even convert one speaker's voice to sound like another's while maintaining naturalness and intelligibility. The integration of attention mechanisms and autoencoder structures has further improved the quality and smoothness of generated speech, paving the way for real-time voice transformation applications [21].

Building on these foundations, the present paper proposes a unified framework that combines stuttering detection and speech transformation into a single intelligent pipeline. Unlike previous studies that treated detection and correction as separate tasks, this research integrates both processes into a continuous workflow [22].
The proposed system first detects stuttering events using advanced deep learning models and then applies neural transformation techniques to reconstruct fluent, natural-sounding speech in real time [23]. This holistic approach not only enhances fluency but also preserves the speaker's unique voice characteristics. By merging insights from DAF/FAF feedback mechanisms, machine learning–based disfluency detection, and neural speech synthesis, this study introduces a new generation of adaptive, real-time stuttering correction systems capable of delivering personalized and sustainable fluency improvement [24], [25].
3. Methodology
The problem addressed here is transforming an input audio signal x(t) containing stuttered speech into an output y(t) that retains the speaker's original voice identity and natural tone while removing disfluencies. The primary objectives are to detect stuttering events with high accuracy (target precision/recall ≥ 0.8), maintain near real-time latency (≤ 200 ms), and enhance speech fluency and naturalness as measured by the Mean Opinion Score (MOS). The proposed system architecture consists of three main modules.

The first module, Preprocessing and Feature Extraction, samples mono audio at 16 kHz with a frame size of 20–30 ms and a hop of 10 ms. Features such as log-Mel spectrograms (40 bands), MFCCs, energy, zero-crossing rate, pitch (F0), and spectral flux are extracted, along with additional parameters such as the short-time energy envelope and voicing probability. These features provide a rich representation of the speech signal, crucial for accurate stuttering detection.

The second module, Stuttering Detection and Segmentation, labels each audio frame as fluent, repetition, prolongation, or block. This is achieved with a hybrid model combining a convolutional encoder with a Bi-LSTM (or Temporal Convolutional Network) followed by a Conditional Random Field (CRF) for temporal smoothing. The model is trained using a combination of categorical cross-entropy and Dice or Focal loss to handle class imbalance. The architecture processes T × 40 Mel feature inputs through multiple Conv1D layers with ReLU activation and batch normalization, followed by two Bi-LSTM layers (hidden size 256) and a fully connected softmax layer with CRF decoding. The output consists of onset and offset timestamps marking stuttering events.

The third module, Fluency Enhancement, integrates two complementary strategies. The first is a rule-based local editing approach that operates with low latency by removing short repetitions through cross-fade merging, compressing prolonged sounds, and filling silent "blocks" with smooth voiced transitions. Although fast, this method struggles with complex disfluencies. The second approach uses a neural speech transformation model based on a Transformer-based sequence-to-sequence architecture with attention. It takes the stuttered segment's spectrogram as input and generates a corrected spectrogram, which is then converted into audio using a neural vocoder such as WaveRNN or HiFi-GAN. The model can be trained either on paired (stuttered → fluent) data or through unsupervised CycleGAN-based learning for unpaired datasets. A hybrid strategy combining both methods balances latency against audio quality.

Dataset preparation draws samples from public speech corpora, specialized stuttering datasets, and volunteer recordings. Each sample is annotated at both frame and event levels by speech-language pathologists (SLPs), labeling instances of repetition, prolongation, and block. Data augmentation techniques such as time-stretching, pitch shifting, noise addition, reverb, and simulated stuttering are applied to increase variability. For supervised learning, paired fluent and stuttered samples from the same speakers are used. During model training, optimizers such as Adam or AdamW are employed with an initial learning rate of 1e-3 and a suitable scheduler. The batch size is set to 32 for detection tasks, while it varies for Transformer models.
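As a concrete illustration of the preprocessing module, the following is a minimal sketch of frame-level feature extraction, assuming the librosa library. Only the 16 kHz sampling rate, the 20–30 ms window, the 10 ms hop, and the 40 Mel bands come from the paper; the n_fft, MFCC count, and F0 search range are illustrative choices.

```python
# Minimal feature-extraction sketch (assumes librosa; parameters beyond the
# paper's 16 kHz / 25 ms window / 10 ms hop / 40-band settings are illustrative).
import librosa
import numpy as np

SR = 16000      # sampling rate (Hz)
WIN = 400       # 25 ms analysis window at 16 kHz
HOP = 160       # 10 ms hop at 16 kHz
N_MELS = 40     # log-Mel bands

def extract_features(path: str) -> np.ndarray:
    """Return a (T, D) matrix of per-frame features: log-Mel, MFCC, RMS, ZCR, F0."""
    y, _ = librosa.load(path, sr=SR, mono=True)

    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=512, win_length=WIN, hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)                                 # (40, T)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)                  # (13, T)
    rms = librosa.feature.rms(y=y, frame_length=WIN, hop_length=HOP)   # energy
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=WIN, hop_length=HOP)                           # (1, T)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=SR,
                     frame_length=4 * WIN, hop_length=HOP)             # pitch

    # Stack to (T, D); trim to the shortest stream, since frame counts can
    # differ by one or two depending on padding.
    T = min(log_mel.shape[1], mfcc.shape[1], rms.shape[1], zcr.shape[1], len(f0))
    return np.hstack([log_mel[:, :T].T, mfcc[:, :T].T,
                      rms[:, :T].T, zcr[:, :T].T, f0[:T, None]])
```

Spectral flux and voicing probability, which the paper also lists, can be approximated with librosa.onset.onset_strength and the voiced-probability output of librosa.pyin, respectively.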
The training process incorporates L1, Mel-spectrogram, and adversarial losses for GAN-based speech enhancement, and pretrained vocoders are fine-tuned to match target speaker characteristics.

For evaluation, objective metrics include precision, recall, F1-score, and onset/offset error in milliseconds for detection performance, along with PESQ, STOI, and Mel-Cepstral Distortion (MCD) for assessing speech quality. Latency is measured to ensure it stays below 200 ms. Subjective evaluation uses Mean Opinion Score (MOS) ratings for fluency, naturalness, and voice similarity, complemented by ABX tests in which human listeners choose the more fluent version between original and corrected speech. The experimental setup includes baseline comparisons among DAF-style systems, rule-based editing, and the proposed hybrid method. Additional studies examine the contribution of each module (detection only, detection plus rule-based editing, and the full hybrid) and test generalization across speakers and noise levels (20, 10, and 0 dB SNR). The expected outcome is that the hybrid approach balances low latency with high naturalness, offering a significant improvement in both fluency correction and preservation of voice authenticity.
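To make the detection architecture and training configuration concrete, here is a minimal PyTorch sketch under the hyperparameters stated above (Conv1D layers with ReLU and batch normalization, two Bi-LSTM layers of hidden size 256, four frame classes, Adam at 1e-3, batch size 32). The CRF decoder and the Dice/Focal terms are omitted for brevity; plain cross-entropy stands in for the combined loss.

```python
# Sketch of the frame-level stutter detector: Conv1D encoder + 2x Bi-LSTM
# (hidden 256) + per-frame softmax over {fluent, repetition, prolongation, block}.
# CRF decoding and Dice/Focal losses from the paper are omitted here.
import torch
import torch.nn as nn

class StutterDetector(nn.Module):
    def __init__(self, n_feats: int = 40, n_classes: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, 256, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 256, n_classes)

    def forward(self, x):                      # x: (B, T, n_feats) Mel features
        h = self.encoder(x.transpose(1, 2))    # Conv1d wants (B, C, T)
        h, _ = self.rnn(h.transpose(1, 2))     # back to (B, T, 512)
        return self.head(h)                    # per-frame logits (B, T, 4)

model = StutterDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # paper's initial LR
loss_fn = nn.CrossEntropyLoss()

def train_step(feats, labels):
    """feats: (32, T, 40) feature batch; labels: (32, T) integer frame tags."""
    logits = model(feats)
    loss = loss_fn(logits.reshape(-1, 4), labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The frame logits are decoded into onset/offset timestamps by merging runs of identical non-fluent labels, which is what the CRF smoothing in the full model makes robust.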
Fig 1: Stutter Detection and Fluency Enhancement
The algorithm for stutter detection and fluency enhancement shown in Fig 1 operates through both offline and real-time processes to convert stuttered speech into smooth, natural, fluent audio while preserving the speaker's voice identity. The process begins with the input of an audio signal, which undergoes preprocessing to prepare it for analysis, including sampling, feature extraction, and noise handling, so that the data fed into the system is clean and consistent. Once preprocessing is complete, the stutter detection stage identifies frames of speech that contain disfluencies such as repetitions, prolongations, or blocks. These detections determine whether a stuttering event is simple or complex based on its duration, type, and intensity.

If the event is classified as simple, such as a short repetition or mild prolongation, it is corrected using rule-based methods in real time. These involve operations like removing redundant segments, compressing prolonged sounds, and merging audio fragments smoothly using cross-fade techniques. This ensures low latency and fast processing, making the path suitable for live or near real-time applications. In contrast, if the event is more complex, such as a long or irregular disfluency, the system employs neural synthesis using Transformer-based sequence-to-sequence models. This neural path reconstructs the fluent version of the stuttered segment at the spectrogram level and then converts it into high-quality audio through a neural vocoder such as HiFi-GAN or WaveRNN.

After processing through either path, the edited or synthesized segments are merged back into the fluent speech stream using smooth blending to maintain continuity and naturalness. The final output audio is therefore fluent, intelligible, and consistent with the speaker's original voice characteristics. This hybrid algorithm balances the low latency of rule-based edits against the superior quality of neural synthesis, making it suitable for both offline training and real-time deployment.
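The routing step of Fig 1 can be summarized by the short sketch below. The 0.3 s simplicity threshold, the event representation, and the neural_resynth callback are assumptions for illustration only, and prolongation compression (e.g., time-scale modification) is left out for brevity.

```python
# Illustrative routing of one detected event to the rule-based or neural path.
import numpy as np

SR = 16000
XFADE = int(0.02 * SR)   # 20 ms linear cross-fade to avoid clicks at seams

def crossfade_join(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Join two waveform chunks with a linear cross-fade (both >= XFADE long)."""
    ramp = np.linspace(0.0, 1.0, XFADE)
    seam = a[-XFADE:] * (1.0 - ramp) + b[:XFADE] * ramp
    return np.concatenate([a[:-XFADE], seam, b[XFADE:]])

def correct_event(audio, start_s, end_s, event_type, neural_resynth):
    """Apply the hybrid policy to a single detected stuttering event."""
    s, e = int(start_s * SR), int(end_s * SR)
    if event_type == "repetition" and (end_s - start_s) < 0.3:
        # Simple event: excise the repeated span and cross-fade the cut.
        return crossfade_join(audio[:s], audio[e:])
    # Complex event: regenerate the span with the seq2seq model + vocoder
    # (neural_resynth is a hypothetical stand-in for that pipeline).
    fluent = neural_resynth(audio[s:e])
    return crossfade_join(crossfade_join(audio[:s], fluent), audio[e:])
```

The cross-fade at every seam is what keeps the merged stream continuous, mirroring the "smooth blending" step in the figure.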
Table 1: Details of Speech Dataset Including Speaker Demographics, Speech Type, Stuttering Information, and Audio Files
| ID | Speaker Name | Gender | Age | Native Language | Accent | Recording Duration (sec) | Speech Type | Stuttering Type (if any) | Audio File Name |
|----|--------------|--------|-----|-----------------|--------|--------------------------|----------------|--------------------------|-----------------|
| 1  | Rahul Sharma | M      | 25  | Hindi           | Indian | 60                       | Read Speech    | None                     | rahul_s1.wav    |
| 2  | Priya Mehta  | F      | 27  | Hindi           | Indian | 55                       | Conversational | Repetition               | priya_s1.wav    |
| 3  | Amit Verma   | M      | 30  | English         | Indian | 70                       | Read Speech    | Block                    | amit_s1.wav     |
| 4  | Sneha Reddy  | F      | 22  | Telugu          | Indian | 65                       | Conversational | None                     | sneha_s1.wav    |