Shri Ram Murti Smarak College of Engineering and Technology, Bareilly, U.P., India
This paper introduces Stutter Clear, a machine learning system designed to detect stuttering in real-time speech and convert it into smooth, fluent speech while preserving the speaker's natural voice identity. The system works in three main stages: audio preprocessing and feature extraction, stuttering detection and segmentation, and fluency enhancement using speech synthesis or transformation techniques. To achieve this, a hybrid deep learning architecture is used that combines CNN–RNN (or Conv–TCN) networks with Transformer-based sequence-to-sequence models. The paper also discusses the system's design, dataset preparation, training process, evaluation metrics, limitations, and directions for future research.
1. Introduction

Stuttering is a widespread speech disorder that affects the natural rhythm and flow of speech. It is characterized by involuntary repetitions, prolongations, or blocks of sounds and syllables, which often make communication challenging for the speaker. This disruption in fluency can lead to frustration, anxiety, and low self-confidence, especially in social or professional situations [1]. Although stuttering varies in severity from person to person, its psychological and emotional effects are often profound. For many individuals, the fear of speaking in public or engaging in conversation becomes a major barrier, limiting their personal growth, career opportunities, and quality of life [2].

Over the years, various therapeutic and technological solutions have been developed to help people who stutter. Traditional speech therapy focuses on breathing techniques, controlled speech, and behavioral exercises to improve fluency [3]. While these methods are helpful, their effectiveness depends largely on regular practice and the individual's response to therapy [4]. Technological aids such as Delayed Auditory Feedback (DAF) devices attempt to improve fluency by altering how the speaker hears their own voice, encouraging smoother speech patterns. However, these solutions are not universally effective and often fail to address the real-time dynamics of stuttering; many users also find such devices uncomfortable or difficult to use in everyday communication [5].

In response to these limitations, this research proposes an adaptive, intelligent system that detects and corrects stuttering automatically in real time. The main goal is to provide a seamless speaking experience by combining modern machine learning and speech processing technologies [6]. The system operates in three core stages: detection, classification, and correction. In the first stage, the system captures live audio and performs preprocessing tasks such as noise removal and feature extraction. These features are then analyzed to identify stuttering patterns in the second stage [7]. The model classifies and segments various types of disfluencies, including repetitions, prolongations, and speech blocks. In the final stage, the detected stuttered segments are processed and corrected to produce smooth, fluent speech while preserving the speaker's natural tone and voice identity [8].

To achieve this, the system uses a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs) for feature extraction and sequence modeling. Transformer-based sequence-to-sequence models are additionally integrated to handle complex temporal dependencies and improve fluency correction accuracy [9]. These models enable the system to understand speech patterns, detect disfluencies efficiently, and transform them into fluent speech in real time [10].

Beyond the technical aspects, the research also covers dataset preparation, model training, and performance evaluation using objective and subjective metrics [11]. By training the model on diverse speech samples containing varied stuttering patterns, the system learns to generalize across different speakers and speech contexts. Evaluation parameters such as detection accuracy, fluency improvement rate, and voice naturalness are used to measure performance [12].
2. Related Work
Over the past several decades, numerous studies and technological developments have focused on addressing the challenges of stuttering and improving speech fluency. One of the earliest and most widely explored approaches involves Delayed Auditory Feedback (DAF) and Frequency-Altered Feedback (FAF) devices. These electronic aids modify how the speaker hears their own voice, either by introducing a slight delay or by shifting the pitch frequency [13]. This auditory alteration can temporarily improve speech fluency by helping the speaker slow down and regulate their speech rhythm [14]. However, while DAF and FAF devices can produce short-term benefits, they often lack adaptability and long-term effectiveness [15]. Users may experience only partial fluency improvement, and the devices can cause discomfort or distraction during prolonged use. Furthermore, these tools do not adapt to the individual's specific speech patterns or stuttering triggers, limiting their real-world application and scalability [16].

With the rise of artificial intelligence and machine learning, data-driven approaches to stuttering detection emerged. Traditional machine learning algorithms such as Support Vector Machines (SVMs), Random Forests, and Hidden Markov Models (HMMs) were among the first applied to identifying disfluencies in recorded speech [17]. These methods rely on handcrafted features such as pitch, energy, and temporal pauses extracted from speech signals. While they achieved reasonable accuracy in controlled experiments, they struggled to perform consistently in real-world, noisy environments [18].

To overcome these challenges, deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), were introduced. CNNs proved effective for extracting spectral and temporal features from audio spectrograms, while RNNs and their variants, such as Long Short-Term Memory (LSTM) networks, captured sequential dependencies in speech. These advancements significantly improved stuttering detection performance, enabling more accurate identification of disfluency types such as repetitions, prolongations, and blocks. However, despite their success in detection, these models typically stopped at identifying stuttered segments and did not attempt to enhance or correct them [19].

In parallel, major progress has been made in neural speech enhancement and voice conversion. Using architectures such as Transformers and sequence-to-sequence (Seq2Seq) models, researchers have developed systems that can reconstruct, clean, or modify speech at the spectrogram level [20]. These models have been used to reduce background noise, enhance clarity, and even convert one speaker's voice to sound like another's while maintaining naturalness and intelligibility. The integration of attention mechanisms and autoencoder structures has further improved the quality and smoothness of generated speech, paving the way for real-time voice transformation applications [21].

Building on these foundations, the present paper proposes a unified framework that combines stuttering detection and speech transformation into a single intelligent pipeline. Unlike previous studies that treated detection and correction as separate tasks, this research integrates both processes into a continuous workflow [22].
The proposed system first detects stuttering events using advanced deep learning models and then applies neural transformation techniques to reconstruct fluent, natural-sounding speech in real time [23]. This holistic approach not only enhances fluency but also preserves the speaker's unique voice characteristics. By merging insights from DAF/FAF feedback mechanisms, machine learning–based disfluency detection, and neural speech synthesis, this study introduces a new generation of adaptive, real-time stuttering correction systems capable of delivering personalized and sustainable fluency improvement [24], [25].
3. Methodology
The problem addressed here is transforming an input audio signal x(t) containing stuttered speech into an output y(t) that retains the speaker's original voice identity and natural tone while removing disfluencies. The primary objectives are to detect stuttering events with high accuracy (target precision/recall ≥ 0.8), maintain near real-time latency (≤ 200 ms), and enhance speech fluency and naturalness as measured by the Mean Opinion Score (MOS). The proposed system architecture consists of three main modules.

The first module, Preprocessing and Feature Extraction, samples mono audio at 16 kHz with a frame size of 20–30 ms and a hop of 10 ms. Features such as log-Mel spectrograms (40 bands), MFCCs, energy, zero-crossing rate, pitch (F0), and spectral flux are extracted, along with additional parameters such as the short-time energy envelope and voicing probability. These features provide a rich representation of the speech signal, crucial for accurate stuttering detection.

The second module, Stuttering Detection and Segmentation, labels each audio frame as fluent, repetition, prolongation, or block. This is achieved with a hybrid model combining a convolutional encoder with a Bi-LSTM (or Temporal Convolutional Network) followed by a Conditional Random Field (CRF) for temporal smoothing. The model is trained using a combination of categorical cross-entropy and Dice or Focal loss to handle class imbalance. The architecture processes T × 40 Mel feature inputs through multiple Conv1D layers with ReLU activation and batch normalization, followed by two Bi-LSTM layers (hidden size 256) and a fully connected softmax layer with CRF decoding. The output consists of onset and offset timestamps marking stuttering events.

The third module, Fluency Enhancement, integrates two complementary strategies. The first is a rule-based local editing approach that operates with low latency by removing short repetitions through cross-fade merging, compressing prolonged sounds, and filling silent "blocks" with smooth voiced transitions. Although fast, this method struggles with complex disfluencies. The second approach uses a neural speech transformation model based on a Transformer-based sequence-to-sequence architecture with attention. It takes the stuttered segment's spectrogram as input and generates a corrected spectrogram, which is then converted into audio using a neural vocoder such as WaveRNN or HiFi-GAN. The model can be trained either on paired (stuttered → fluent) data or through unsupervised CycleGAN-based learning for unpaired datasets. A hybrid strategy combining both methods balances latency against audio quality.

Dataset preparation draws samples from public speech corpora, specialized stuttering datasets, and volunteer recordings. Each sample is annotated at both frame and event levels by speech-language pathologists (SLPs), labeling instances of repetition, prolongation, and block. Data augmentation techniques such as time-stretching, pitch shifting, noise addition, reverb, and simulated stuttering are applied to increase variability. For supervised learning, paired fluent and stuttered samples from the same speakers are used. During model training, optimizers such as Adam or AdamW are employed with an initial learning rate of 1e-3 and a suitable scheduler. The batch size is set to 32 for detection tasks, while it varies for Transformer models.
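As a concrete illustration of the preprocessing module, the following is a minimal sketch of frame-level feature extraction, assuming the librosa library. Only the 16 kHz sampling rate, the 20–30 ms window, the 10 ms hop, and the 40 Mel bands come from the paper; the n_fft, MFCC count, and F0 search range are illustrative choices.

```python
# Minimal feature-extraction sketch (assumes librosa; parameters beyond the
# paper's 16 kHz / 25 ms window / 10 ms hop / 40-band settings are illustrative).
import librosa
import numpy as np

SR = 16000      # sampling rate (Hz)
WIN = 400       # 25 ms analysis window at 16 kHz
HOP = 160       # 10 ms hop at 16 kHz
N_MELS = 40     # log-Mel bands

def extract_features(path: str) -> np.ndarray:
    """Return a (T, D) matrix of per-frame features: log-Mel, MFCC, RMS, ZCR, F0."""
    y, _ = librosa.load(path, sr=SR, mono=True)

    mel = librosa.feature.melspectrogram(
        y=y, sr=SR, n_fft=512, win_length=WIN, hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)                                 # (40, T)
    mfcc = librosa.feature.mfcc(S=log_mel, n_mfcc=13)                  # (13, T)
    rms = librosa.feature.rms(y=y, frame_length=WIN, hop_length=HOP)   # energy
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=WIN, hop_length=HOP)                           # (1, T)
    f0 = librosa.yin(y, fmin=50, fmax=400, sr=SR,
                     frame_length=4 * WIN, hop_length=HOP)             # pitch

    # Stack to (T, D); trim to the shortest stream, since frame counts can
    # differ by one or two depending on padding.
    T = min(log_mel.shape[1], mfcc.shape[1], rms.shape[1], zcr.shape[1], len(f0))
    return np.hstack([log_mel[:, :T].T, mfcc[:, :T].T,
                      rms[:, :T].T, zcr[:, :T].T, f0[:T, None]])
```

Spectral flux and voicing probability, which the paper also lists, can be approximated with librosa.onset.onset_strength and the voiced-probability output of librosa.pyin, respectively.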
The training process incorporates L1, Mel-spectrogram, and adversarial losses for GAN-based speech enhancement, and pretrained vocoders are fine-tuned to match target speaker characteristics.

For evaluation, objective metrics include precision, recall, F1-score, and onset/offset error in milliseconds for detection performance, along with PESQ, STOI, and Mel-Cepstral Distortion (MCD) for assessing speech quality. Latency is measured to ensure it stays below 200 ms. Subjective evaluation uses Mean Opinion Score (MOS) ratings for fluency, naturalness, and voice similarity, complemented by ABX tests in which human listeners choose the more fluent version between original and corrected speech. The experimental setup includes baseline comparisons among DAF-style systems, rule-based editing, and the proposed hybrid method. Additional studies examine the contribution of each module (detection only, detection plus rule-based editing, and the full hybrid) and test generalization across speakers and noise levels (20, 10, and 0 dB SNR). The expected outcome is that the hybrid approach balances low latency with high naturalness, offering a significant improvement in both fluency correction and preservation of voice authenticity.
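To make the detection architecture and training configuration concrete, here is a minimal PyTorch sketch under the hyperparameters stated above (Conv1D layers with ReLU and batch normalization, two Bi-LSTM layers of hidden size 256, four frame classes, Adam at 1e-3, batch size 32). The CRF decoder and the Dice/Focal terms are omitted for brevity; plain cross-entropy stands in for the combined loss.

```python
# Sketch of the frame-level stutter detector: Conv1D encoder + 2x Bi-LSTM
# (hidden 256) + per-frame softmax over {fluent, repetition, prolongation, block}.
# CRF decoding and Dice/Focal losses from the paper are omitted here.
import torch
import torch.nn as nn

class StutterDetector(nn.Module):
    def __init__(self, n_feats: int = 40, n_classes: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2),
            nn.BatchNorm1d(128), nn.ReLU(),
        )
        self.rnn = nn.LSTM(128, 256, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 256, n_classes)

    def forward(self, x):                      # x: (B, T, n_feats) Mel features
        h = self.encoder(x.transpose(1, 2))    # Conv1d wants (B, C, T)
        h, _ = self.rnn(h.transpose(1, 2))     # back to (B, T, 512)
        return self.head(h)                    # per-frame logits (B, T, 4)

model = StutterDetector()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # paper's initial LR
loss_fn = nn.CrossEntropyLoss()

def train_step(feats, labels):
    """feats: (32, T, 40) feature batch; labels: (32, T) integer frame tags."""
    logits = model(feats)
    loss = loss_fn(logits.reshape(-1, 4), labels.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The frame logits are decoded into onset/offset timestamps by merging runs of identical non-fluent labels, which is what the CRF smoothing in the full model makes robust.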
Fig 1: Stutter Detection and Fluency Enhancement
The algorithm for stutter detection and fluency enhancement shown in Fig 1 operates through both offline and real-time processes to convert stuttered speech into smooth, natural, fluent audio while preserving the speaker's voice identity. The process begins with the input of an audio signal, which undergoes preprocessing to prepare it for analysis, including sampling, feature extraction, and noise handling, so that the data fed into the system is clean and consistent. Once preprocessing is complete, the stutter detection stage identifies frames of speech that contain disfluencies such as repetitions, prolongations, or blocks. These detections determine whether a stuttering event is simple or complex based on its duration, type, and intensity.

If the event is classified as simple, such as a short repetition or mild prolongation, it is corrected using rule-based methods in real time. These involve operations like removing redundant segments, compressing prolonged sounds, and merging audio fragments smoothly using cross-fade techniques. This ensures low latency and fast processing, making the path suitable for live or near real-time applications. In contrast, if the event is more complex, such as a long or irregular disfluency, the system employs neural synthesis using Transformer-based sequence-to-sequence models. This neural path reconstructs the fluent version of the stuttered segment at the spectrogram level and then converts it into high-quality audio through a neural vocoder such as HiFi-GAN or WaveRNN.

After processing through either path, the edited or synthesized segments are merged back into the fluent speech stream using smooth blending to maintain continuity and naturalness. The final output audio is therefore fluent, intelligible, and consistent with the speaker's original voice characteristics. This hybrid algorithm balances the low latency of rule-based edits against the superior quality of neural synthesis, making it suitable for both offline training and real-time deployment.
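The routing step of Fig 1 can be summarized by the short sketch below. The 0.3 s simplicity threshold, the event representation, and the neural_resynth callback are assumptions for illustration only, and prolongation compression (e.g., time-scale modification) is left out for brevity.

```python
# Illustrative routing of one detected event to the rule-based or neural path.
import numpy as np

SR = 16000
XFADE = int(0.02 * SR)   # 20 ms linear cross-fade to avoid clicks at seams

def crossfade_join(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Join two waveform chunks with a linear cross-fade (both >= XFADE long)."""
    ramp = np.linspace(0.0, 1.0, XFADE)
    seam = a[-XFADE:] * (1.0 - ramp) + b[:XFADE] * ramp
    return np.concatenate([a[:-XFADE], seam, b[XFADE:]])

def correct_event(audio, start_s, end_s, event_type, neural_resynth):
    """Apply the hybrid policy to a single detected stuttering event."""
    s, e = int(start_s * SR), int(end_s * SR)
    if event_type == "repetition" and (end_s - start_s) < 0.3:
        # Simple event: excise the repeated span and cross-fade the cut.
        return crossfade_join(audio[:s], audio[e:])
    # Complex event: regenerate the span with the seq2seq model + vocoder
    # (neural_resynth is a hypothetical stand-in for that pipeline).
    fluent = neural_resynth(audio[s:e])
    return crossfade_join(crossfade_join(audio[:s], fluent), audio[e:])
```

The cross-fade at every seam is what keeps the merged stream continuous, mirroring the "smooth blending" step in the figure.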
Table 1: Details of Speech Dataset Including Speaker Demographics, Speech Type, Stuttering Information, and Audio Files
| ID | Speaker Name | Gender | Age | Native Language | Accent | Recording Duration (sec) | Speech Type | Stuttering Type (if any) | Audio File Name |
|----|--------------|--------|-----|-----------------|--------|--------------------------|----------------|--------------------------|-----------------|
| 1  | Rahul Sharma | M      | 25  | Hindi           | Indian | 60                       | Read Speech    | None                     | rahul_s1.wav    |
| 2  | Priya Mehta  | F      | 27  | Hindi           | Indian | 55                       | Conversational | Repetition               | priya_s1.wav    |
| 3  | Amit Verma   | M      | 30  | English         | Indian | 70                       | Read Speech    | Block                    | amit_s1.wav     |
| 4  | Sneha Reddy  | F      | 22  | Telugu          | Indian | 65                       | Conversational | None                     | sneha_s1.wav    |