Shri Ram Murti Smarak College of Engineering and Technology, Bareilly, U.P., India
This paper introduces Stutter Clear, a machine learning system designed to detect stuttering in real-time speech and convert it into smooth, fluent speech while maintaining the speaker’s natural voice identity. The system works in three main stages: audio preprocessing and feature extraction, stuttering detection and segmentation, and fluency enhancement using speech synthesis or transformation techniques. A hybrid deep learning architecture combining CNN-RNN (or Conv-TCN) networks with Transformer-based sequence-to-sequence models is used to achieve this. The paper also discusses the system’s design, dataset preparation, training process, evaluation metrics, limitations, and potential directions for future research.
Stuttering is a widespread speech disorder that affects the natural rhythm and flow of speech. It is characterized by involuntary repetitions, prolongations, or blocks of sounds and syllables, which often make communication challenging for the speaker. This disruption in fluency can lead to frustration, anxiety, and low self-confidence, especially in social or professional situations [1]. Although stuttering can vary in severity from person to person, its psychological and emotional effects are often profound. For many individuals, the fear of speaking in public or engaging in conversations becomes a major barrier, influencing their personal growth, career opportunities, and quality of life [2]. Over the years, various therapeutic and technological solutions have been developed to help people who stutter. Traditional speech therapy focuses on breathing techniques, controlled speech, and behavioral exercises to improve fluency [3]. While these methods are helpful, their effectiveness depends largely on regular practice and the individual’s response to therapy [4]. On the other hand, technological aids such as Delayed Auditory Feedback (DAF) devices attempt to improve fluency by altering how the speaker hears their own voice, encouraging smoother speech patterns. However, these solutions are not universally effective and often fail to address the real-time dynamics of stuttering. Many users also find such devices uncomfortable or difficult to use in everyday communication [5]. In response to these limitations, this research proposes the development of an adaptive and intelligent system designed to detect and correct stuttering automatically in real time. The main goal of this system is to provide a seamless speaking experience by combining the power of modern machine learning and speech processing technologies [6]. The system operates in three core stages: detection, classification, and correction. 
In the first stage, the system captures live audio and performs preprocessing tasks such as noise removal and feature extraction. These features are then analyzed to identify stuttering patterns in the second stage [7]. The model classifies and segments various types of disfluencies, including repetitions, prolongations, and speech blocks. In the final stage, the detected stuttered segments are processed and corrected to produce smooth, fluent speech while preserving the speaker’s natural tone and voice identity [8]. To achieve this, the system utilizes a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs) for feature extraction and sequence modeling. Additionally, Transformer-based sequence-to-sequence models are integrated to handle complex temporal dependencies and improve fluency correction accuracy [9]. These advanced models enable the system to understand speech patterns, detect disfluencies efficiently, and transform them into fluent speech in real time [10]. Beyond the technical aspects, the research also focuses on dataset preparation, model training, and performance evaluation using objective and subjective metrics [11]. By training the model on diverse speech samples containing various stuttering patterns, the system learns to generalize across different speakers and speech contexts. Evaluation parameters such as detection accuracy, fluency improvement rate, and voice naturalness are used to measure performance [12].
2. Related Work
Over the past several decades, numerous studies and technological developments have focused on addressing the challenges of stuttering and improving speech fluency. One of the earliest and most widely explored approaches involves the use of Delayed Auditory Feedback (DAF) and Frequency-Altered Feedback (FAF) devices. These electronic aids modify how the speaker hears their own voice, either by introducing a slight delay or by changing the pitch frequency [13]. This auditory alteration can temporarily improve speech fluency by helping the speaker slow down and regulate their speech rhythm [14]. However, while DAF and FAF devices can produce short-term benefits, they often lack adaptability and long-term effectiveness [15]. Users may experience only partial fluency improvement, and the devices can sometimes cause discomfort or distraction during prolonged use. Furthermore, these tools do not adapt to the individual’s specific speech patterns or stuttering triggers, limiting their real-world application and scalability [16]. With the rise of artificial intelligence and machine learning, data-driven approaches to stuttering detection have emerged. Traditional machine learning algorithms such as Support Vector Machines (SVMs), Random Forests, and Hidden Markov Models (HMMs) were among the first to be applied for identifying disfluencies in recorded speech [17]. These methods rely on handcrafted features such as pitch, energy, and temporal pauses extracted from speech signals. While these models achieved reasonable accuracy in controlled experiments, they struggled to perform consistently in real-world, noisy environments [18]. To overcome these challenges, deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), were introduced.
CNNs proved effective for extracting spectral and temporal features from audio spectrograms, while RNNs and their variants, such as Long Short-Term Memory (LSTM) networks, captured sequential dependencies in speech. These advancements significantly improved stuttering detection performance, enabling more accurate identification of disfluency types such as repetitions, prolongations, and blocks. However, despite their success in detection, these models typically stopped at identifying stuttered segments and did not attempt to enhance or correct them [19]. Parallel to these developments, major progress has also been made in neural speech enhancement and voice conversion technologies. Using architectures like Transformers and sequence-to-sequence (Seq2Seq) models, researchers have developed systems that can reconstruct, clean, or modify speech at the spectrogram level [20]. These models have been used to reduce background noise, enhance clarity, and even convert one speaker’s voice to sound like another’s while maintaining naturalness and intelligibility. The integration of attention mechanisms and autoencoder structures has further enhanced the quality and smoothness of generated speech, paving the way for real-time voice transformation applications [21]. Building upon these foundational works, the present paper proposes a unified framework that combines stuttering detection and speech transformation into a single intelligent pipeline. Unlike previous studies that treated detection and correction as separate tasks, this research integrates both processes into a continuous workflow [22]. The proposed system first detects stuttering events using advanced deep learning models and then applies neural transformation techniques to reconstruct fluent, natural-sounding speech in real time [23]. This holistic approach not only enhances fluency but also ensures that the speaker’s unique voice characteristics are preserved. 
By merging insights from DAF/FAF feedback mechanisms, machine learning–based disfluency detection, and neural speech synthesis, this study introduces a new generation of adaptive, real-time stuttering correction systems capable of delivering personalized and sustainable fluency improvement [24] [25].
METHODOLOGY
The given problem focuses on transforming an input audio signal x(t) containing stuttered speech into an output y(t) that retains the speaker’s original voice identity and natural tone while effectively removing disfluencies. The primary objectives of this research are to detect stuttering events with high accuracy (target precision/recall ≥ 0.8), maintain near real-time latency (≤ 200 ms), and enhance speech fluency and naturalness as measured by the Mean Opinion Score (MOS). The proposed system architecture consists of three main modules. The first module, Preprocessing and Feature Extraction, involves sampling mono audio at 16 kHz with a frame size of 20–30 ms and a hop of 10 ms. Features such as log-Mel spectrograms (40 bands), MFCCs, energy, zero-crossing rate, pitch (F0), and spectral flux are extracted, along with additional parameters such as the short-time energy envelope and voicing probability. These features provide a rich representation of the speech signal, crucial for accurate stuttering detection. The second module, Stuttering Detection and Segmentation, aims to identify and label each audio frame as fluent, repetition, prolongation, or block. This is achieved using a hybrid model combining a convolutional encoder with a Bi-LSTM (or Temporal Convolutional Network) followed by a Conditional Random Field (CRF) for temporal smoothing. The model is trained using a combination of categorical cross-entropy and Dice or Focal loss to handle class imbalance. The architecture processes T × 40 Mel feature inputs through multiple Conv1D layers with ReLU activation and batch normalization, followed by two Bi-LSTM layers (hidden size 256) and a fully connected softmax layer with CRF decoding. The output consists of onset and offset timestamps marking stuttering events. The third module, Fluency Enhancement, integrates two complementary strategies.
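As a concrete illustration of the preprocessing module, the framing and log-Mel extraction described above can be sketched in plain NumPy. The exact window, FFT size, and filterbank construction below are illustrative assumptions; in practice a library such as librosa would typically be used.

```python
import numpy as np

def frame_signal(y, frame_len=400, hop=160):
    """Slice a 16 kHz waveform into overlapping, Hann-windowed frames
    (400 samples = 25 ms window, 160 samples = 10 ms hop)."""
    n = 1 + max(0, (len(y) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return y[idx] * np.hanning(frame_len)

def mel_filterbank(n_mels=40, n_fft=512, sr=16000):
    """Triangular filters spaced evenly on the mel scale."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft // 2) * mel_to_hz(mels) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, centre, right = bins[i], bins[i + 1], bins[i + 2]
        if centre > left:
            fb[i, left:centre] = (np.arange(left, centre) - left) / (centre - left)
        if right > centre:
            fb[i, centre:right] = (right - np.arange(centre, right)) / (right - centre)
    return fb

def log_mel_features(y, sr=16000, n_mels=40, n_fft=512):
    """Return a (num_frames, n_mels) log-Mel feature matrix."""
    frames = frame_signal(y)
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2   # per-frame power spectrum
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T   # project onto mel bands
    return np.log(mel + 1e-10)
```

The resulting T × 40 matrix is the shape of input the detection module expects; MFCC, pitch, and energy features would be stacked alongside it in the same frame grid.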
The first is a rule-based local editing approach that operates with low latency by identifying and removing short repetitions through cross-fade merging, compressing prolonged sounds, and filling silent “blocks” with smooth voiced transitions. Although this method is fast, it struggles with complex disfluencies. The second approach uses a neural speech transformation model based on a Transformer-based sequence-to-sequence architecture with attention. It takes the stuttered segment spectrogram as input and generates a corrected spectrogram, which is then converted into audio using a neural vocoder such as WaveRNN or HiFi-GAN. The model can be trained either on paired (stuttered → fluent) data or through unsupervised CycleGAN-based learning for unpaired datasets. A hybrid strategy combining both methods ensures a trade-off between latency and audio quality. Dataset preparation involves collecting samples from public speech corpora, specialized stuttering datasets, and volunteer recordings. Each sample is annotated at both frame and event levels by speech-language pathologists (SLPs), labeling instances of repetition, prolongation, and block. Data augmentation techniques such as time-stretching, pitch shifting, noise addition, reverb, and simulated stuttering are applied to increase variability. For supervised learning, paired fluent and stuttered samples from the same speakers are used. During model training, optimizers such as Adam or AdamW are employed with an initial learning rate of 1e-3 and a suitable scheduler. The batch size is set to 32 for detection tasks, while it varies for Transformer models. The training process incorporates L1, Mel-spectrogram, and adversarial losses for GAN-based speech enhancement, and pretrained vocoders are fine-tuned to match target speaker characteristics.
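The core operation of the rule-based editing path, cutting a detected repetition out of the waveform and cross-fading across the join, can be sketched as follows. Function names and the 10 ms fade length are illustrative assumptions, not values prescribed by the paper.

```python
import numpy as np

def crossfade_join(a, b, fade=160):
    """Join two waveform chunks with a linear cross-fade
    (160 samples ≈ 10 ms at 16 kHz) so the cut is not audible as a click."""
    if fade == 0 or len(a) < fade or len(b) < fade:
        return np.concatenate([a, b])
    ramp = np.linspace(0.0, 1.0, fade)
    mixed = a[-fade:] * (1.0 - ramp) + b[:fade] * ramp
    return np.concatenate([a[:-fade], mixed, b[fade:]])

def remove_segment(y, start, end, fade=160):
    """Cut a detected repetition spanning samples [start, end) out of
    the waveform and cross-fade across the resulting join."""
    return crossfade_join(y[:start], y[end:], fade)
```

Compressing a prolongation would use the same join primitive, keeping only the head of the prolonged region before cross-fading into the following speech.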
For evaluation, objective metrics include precision, recall, F1-score, and onset/offset error in milliseconds for detection performance, along with PESQ, STOI, and Mel-Cepstral Distortion (MCD) for assessing speech quality. Latency is measured to ensure it stays below 200 ms. Subjective evaluation involves Mean Opinion Score (MOS) ratings for fluency, naturalness, and voice similarity, complemented by ABX tests in which human listeners choose the more fluent version between original and corrected speech. The experimental setup includes baseline comparisons among DAF-style systems, rule-based editing, and the proposed hybrid method. Additional studies examine the contributions of each module (detection only, detection plus rule-based editing, and the full hybrid) and test generalization across speakers and varying noise levels (20, 10, and 0 dB SNR). The expected outcome of this work is that the hybrid approach successfully balances low latency with high naturalness, offering a significant improvement in both fluency correction and preservation of voice authenticity.
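The frame-level detection metrics listed above follow directly from frame counts; a minimal sketch (the label convention is an assumption):

```python
def frame_prf(pred, truth, positive="stutter"):
    """Frame-level precision, recall, and F1 of predicted labels
    against reference labels, counting frames marked `positive`."""
    tp = sum(p == positive and t == positive for p, t in zip(pred, truth))
    fp = sum(p == positive and t != positive for p, t in zip(pred, truth))
    fn = sum(p != positive and t == positive for p, t in zip(pred, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Onset/offset error would be measured separately, as the time difference between predicted and annotated event boundaries; PESQ and STOI require reference implementations of those standards.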
Fig 1: Stutter Detection and Fluency Enhancement
The algorithm for stutter detection and fluency enhancement shown in Fig 1 operates through both offline and real-time processes to convert stuttered speech into smooth, natural, and fluent audio while preserving the speaker’s voice identity. The process begins with the input of an audio signal, which undergoes preprocessing to prepare it for analysis. This includes tasks such as sampling, feature extraction, and noise handling to ensure that the data fed into the system is clean and consistent. Once preprocessing is complete, the stutter detection stage identifies frames of speech that contain disfluencies such as repetitions, prolongations, or blocks. These detections are used to determine whether a detected stuttering event is simple or complex based on its duration, type, and intensity. If the event is classified as simple—such as short repetitions or mild prolongations—it is corrected using rule-based methods in real time. These involve operations like removing redundant segments, compressing prolonged sounds, and merging audio fragments smoothly using cross-fade techniques. This ensures low latency and fast processing, making it suitable for live or near real-time applications. In contrast, if the event is more complex, such as long or irregular disfluencies, the system employs neural synthesis techniques using Transformer-based sequence-to-sequence models. This neural approach reconstructs the fluent version of the stuttered segment at the spectrogram level and then converts it into high-quality audio through a neural vocoder like HiFi-GAN or WaveRNN. After processing through either path, the edited or synthesized audio segments are merged back into the fluent speech stream using smooth blending to maintain continuity and naturalness. The final output audio, therefore, is fluent, intelligible, and consistent with the speaker’s original voice characteristics. 
This hybrid algorithm achieves a balance between the low latency of rule-based edits and the superior quality of neural synthesis, ensuring efficient stutter correction suitable for both offline training and real-time deployment.
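The simple-versus-complex routing decision at the heart of Fig 1 can be sketched as a small dispatch function. The event representation and the 300 ms duration threshold here are illustrative assumptions; the paper states the decision depends on duration, type, and intensity without fixing exact values.

```python
def route_event(event, max_simple_dur=0.3):
    """Route a detected disfluency to a correction path, as in Fig 1:
    short repetitions and mild prolongations take the fast rule-based
    path; long or irregular events (e.g. blocks) take the neural
    seq2seq + vocoder path."""
    duration = event["end"] - event["start"]
    if event["type"] in ("repetition", "prolongation") and duration <= max_simple_dur:
        return "rule_based"
    return "neural_seq2seq"
```

Keeping this dispatch cheap is what lets the hybrid system stay within its latency budget: only segments that genuinely need resynthesis pay the cost of the Transformer and vocoder.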
Table 1: Details of Speech Dataset Including Speaker Demographics, Speech Type, Stuttering Information, and Audio Files
| ID | Speaker Name | Gender | Age | Native Language | Accent | Recording Duration (sec) | Speech Type | Stuttering Type (if any) | Audio File Name |
|----|--------------|--------|-----|-----------------|--------|--------------------------|-------------|--------------------------|-----------------|
| 1 | Rahul Sharma | M | 25 | Hindi | Indian | 60 | Read Speech | None | rahul_s1.wav |
| 2 | Priya Mehta | F | 27 | Hindi | Indian | 55 | Conversational | Repetition | priya_s1.wav |
| 3 | Amit Verma | M | 30 | English | Indian | 70 | Read Speech | Block | amit_s1.wav |
| 4 | Sneha Reddy | F | 22 | Telugu | Indian | 65 | Conversational | None | sneha_s1.wav |
| 5 | Arjun Singh | M | 28 | Hindi | Indian | 75 | Read Speech | Prolongation | arjun_s1.wav |
| 6 | Kavita Joshi | F | 31 | English | Neutral | 60 | Conversational | None | kavita_s1.wav |
| 7 | Rakesh Yadav | M | 29 | Hindi | Indian | 50 | Read Speech | Block | rakesh_s1.wav |
| 8 | Ananya Das | F | 24 | Bengali | Indian | 55 | Conversational | Repetition | ananya_s1.wav |
| 9 | Deepak Nair | M | 34 | Malayalam | Indian | 70 | Read Speech | None | deepak_s1.wav |
| 10 | Neha Kapoor | F | 26 | Hindi | Indian | 60 | Conversational | None | neha_s1.wav |
| 11 | Rajat Tiwari | M | 32 | Hindi | Indian | 80 | Read Speech | Repetition | rajat_s1.wav |
| 12 | Simran Kaur | F | 25 | Punjabi | Indian | 65 | Conversational | None | simran_s1.wav |
| 13 | Akash Gupta | M | 27 | Hindi | Indian | 55 | Read Speech | Prolongation | akash_s1.wav |
| 14 | Meena Patel | F | 29 | Gujarati | Indian | 50 | Conversational | None | meena_s1.wav |
| 15 | Nikhil Jain | M | 33 | English | Indian | 75 | Read Speech | Block | nikhil_s1.wav |
| 16 | Ritu Sharma | F | 28 | Hindi | Indian | 60 | Conversational | None | ritu_s1.wav |
| 17 | Harsh Kumar | M | 26 | Hindi | Indian | 70 | Read Speech | None | harsh_s1.wav |
| 18 | Aarti Sinha | F | 31 | Hindi | Indian | 55 | Conversational | Repetition | aarti_s1.wav |
| 19 | Varun Joshi | M | 29 | English | Indian | 65 | Read Speech | None | varun_s1.wav |
| 20 | Shalini Rao | F | 30 | Kannada | Indian | 50 | Conversational | Prolongation | shalini_s1.wav |
| 21 | Vikram Chauhan | M | 28 | Hindi | Indian | 60 | Read Speech | None | vikram_s1.wav |
| 22 | Tanya Dey | F | 24 | Bengali | Indian | 55 | Conversational | None | tanya_s1.wav |
| 23 | Manish Rawat | M | 30 | Hindi | Indian | 65 | Read Speech | Block | manish_s1.wav |
| 24 | Divya Nair | F | 27 | Malayalam | Indian | 60 | Conversational | None | divya_s1.wav |
| 25 | Rohit Bansal | M | 25 | Hindi | Indian | 70 | Read Speech | None | rohit_s1.wav |
| 26 | Komal Singh | F | 26 | Hindi | Indian | 50 | Conversational | Prolongation | komal_s1.wav |
| 27 | Prakash Das | M | 35 | Bengali | Indian | 80 | Read Speech | None | prakash_s1.wav |
| 28 | Jyoti Thakur | F | 23 | Hindi | Indian | 55 | Conversational | None | jyoti_s1.wav |
| 29 | Saurabh Mishra | M | 31 | Hindi | Indian | 65 | Read Speech | Repetition | saurabh_s1.wav |
| 30 | Anita Paul | F | 28 | English | Indian | 60 | Conversational | None | anita_s1.wav |
| 31 | Gaurav Singh | M | 29 | Hindi | Indian | 75 | Read Speech | None | gaurav_s1.wav |
| 32 | Pooja Chauhan | F | 25 | Hindi | Indian | 55 | Conversational | Block | pooja_s1.wav |
| 33 | Aditya Rao | M | 26 | Kannada | Indian | 60 | Read Speech | None | aditya_s1.wav |
| 34 | Neelam Patel | F | 29 | Gujarati | Indian | 65 | Conversational | None | neelam_s1.wav |
| 35 | Mohit Gupta | M | 27 | Hindi | Indian | 70 | Read Speech | Prolongation | mohit_s1.wav |
| 36 | Rachna Joshi | F | 32 | English | Indian | 55 | Conversational | None | rachna_s1.wav |
| 37 | Rajesh Singh | M | 34 | Hindi | Indian | 80 | Read Speech | None | rajesh_s1.wav |
| 38 | Sunita Verma | F | 30 | Hindi | Indian | 60 | Conversational | Repetition | sunita_s1.wav |
| 39 | Deepanshu Malik | M | 28 | Hindi | Indian | 50 | Read Speech | None | deepanshu_s1.wav |
| 40 | Shruti Ghosh | F | 25 | Bengali | Indian | 55 | Conversational | None | shruti_s1.wav |
| 41 | Karan Mehta | M | 26 | Hindi | Indian | 65 | Read Speech | Block | karan_s1.wav |
| 42 | Alisha Thomas | F | 27 | English | Indian | 60 | Conversational | None | alisha_s1.wav |
| 43 | Shubham Patel | M | 29 | Gujarati | Indian | 70 | Read Speech | Prolongation | shubham_s1.wav |
| 44 | Isha Rani | F | 24 | Hindi | Indian | 55 | Conversational | None | isha_s1.wav |
| 45 | Yash Raj | M | 31 | Hindi | Indian | 65 | Read Speech | None | yash_s1.wav |
| 46 | Kritika Singh | F | 28 | Hindi | Indian | 50 | Conversational | Repetition | kritika_s1.wav |
| 47 | Ankit Tiwari | M | 30 | Hindi | Indian | 60 | Read Speech | None | ankit_s1.wav |
| 48 | Riya Das | F | 26 | Bengali | Indian | 70 | Conversational | None | riya_s1.wav |
| 49 | Sandeep Kumar | M | 32 | Hindi | Indian | 80 | Read Speech | Block | sandeep_s1.wav |
| 50 | Nisha Verma | F | 29 | Hindi | Indian | 60 | Conversational | None | nisha_s1.wav |
Table 1 presents a comprehensive dataset of 50 speech samples collected from speakers of different ages, genders, and linguistic backgrounds. Each entry provides the speaker’s identification number, name, gender, age, native language, and accent. The dataset distinguishes between types of speech, including read and conversational speech, and identifies whether any stuttering is present, specifying the stuttering type such as repetition, prolongation, or block. For each speaker, the table also records the duration of the speech sample in seconds and provides the corresponding audio file name for reference. Most speakers in this dataset are native Hindi speakers, though other languages such as Telugu, Bengali, Malayalam, Kannada, Gujarati, and English are represented. The stuttering types are distributed across both read and conversational speech, with certain speakers exhibiting no disfluency at all. This dataset is designed to support speech analysis, stuttering detection, and speech fluency enhancement research, offering a balanced mix of demographic variables, speech contexts, and disfluency types. It provides a valuable resource for developing and testing automated speech processing systems, especially those targeting speech disorders like stuttering.
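Metadata of the kind shown in Table 1 can be summarized programmatically when preparing training splits. The following is a minimal sketch assuming each row is a dict keyed by the table's column headers (the function name is an assumption):

```python
from collections import Counter

def summarize_dataset(rows):
    """Tally stuttering types and total recorded audio for Table-1-style
    metadata, where each row is a dict keyed by the table's column names."""
    by_type = Counter(r["Stuttering Type (if any)"] for r in rows)
    total_sec = sum(int(r["Recording Duration (sec)"]) for r in rows)
    return by_type, total_sec
```

Such a tally makes class imbalance visible early, which matters because the detection model relies on Dice or Focal loss precisely to cope with the dominance of fluent frames.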
Fig 2: Stuttered speech correction system
Fig 2 presents a dataset designed to train and evaluate a stuttered speech correction system that transforms disfluent speech into fluent and natural-sounding output. Each row represents a speech sample spoken by an individual, showing both the stuttered version and the corrected version after processing. The first column, titled Input Speech (Stuttered), contains sentences that exhibit common forms of stuttering such as repetitions, prolongations, or blocks. For example, the sentence “Mai… mai khao.” includes a repetition of the word “Mai.” The second column, Output Speech (Fluent/Corrected), displays the corresponding fluent version of the same sentence after being processed by the proposed fluency enhancement algorithm. The algorithm removes disfluencies and reconstructs smooth, natural speech while maintaining the original speaker’s voice identity and tone, as seen in the corrected output “Mai khaoonga.” The Input Code column assigns a unique numerical identifier to each stuttered sentence, which serves as a label for easy referencing and model training. Similarly, the Output Code column provides a corresponding identifier for the fluent version of the same sentence, maintaining a one-to-one mapping between each stuttered input and its corrected output. For instance, the stuttered phrase “Mai… mai khao.” is represented by input code 1, and its corrected form “Mai khaoonga.” carries output code 1, indicating they are linked pairs. This pattern continues consistently throughout the dataset, from codes 1 to 20. Overall, the figure illustrates how disfluent speech samples are transformed into fluent ones through the system’s correction algorithm. The left side represents the original, unprocessed speech, while the right side shows the corrected output and its corresponding identifiers.
This dataset structure helps in both the training and evaluation of speech processing models by providing a clear mapping between stuttered and fluent speech, enabling the system to learn how to detect and correct disfluencies effectively.
Table 2: Speech fluency correction model
| Sno. | Input Speech (Stuttered) | Output Speech (Fluent/Corrected) |
|------|--------------------------|----------------------------------|
| 1 | “Main… main khao.” | “Main khaoonga.” |
| 2–20 | (Devanagari sentence pairs; the original script did not survive text extraction. Each row pairs a stuttered Hindi sentence with its fluent correction.) | |
Table 2 presents a structured collection of voice samples aimed at training and evaluating a speech fluency correction model. Each entry corresponds to a spoken sentence that contains stuttering or disfluency, paired with its corrected, fluent version. The left column lists the stuttered input speech, while the right column shows the output speech generated by the proposed algorithm after fluency enhancement. For instance, the first entry, “Main… main khao,” is an example of repetition-based stuttering. The algorithm processes this input and produces the fluent version “Main khaoonga,” demonstrating how disfluent segments are identified and replaced with smooth, natural speech while retaining the speaker’s original tone and linguistic intent. Similarly, other examples show various types of disfluencies such as prolongations (“Mujheee thoda time chahiye”), broken syllables (“K—kutta bhaag gaya”), and unnecessary repetitions (“Main… main ghar ja rahi hoon”). In each case, the corrected version maintains grammatical accuracy and natural rhythm. The dataset includes multilingual speech samples representing Hindi, English, and regional languages such as Telugu, Bengali, Tamil, and Kannada to ensure broader model adaptability. Each stuttered and corrected sentence pair is meant to train a system that can generalize across languages and dialects, allowing the model to learn the relationship between disfluent speech patterns and their fluent counterparts. In a complete dataset, additional columns can include input codes and output codes, assigning numeric identifiers (for example, “01” for the stuttered version and “0001” for the corrected version). This mapping simplifies dataset management and model training, especially when large-scale speech data is involved. Overall, the table demonstrates how a stuttering correction algorithm can transform irregular, interrupted speech into fluent, coherent communication.
It highlights the potential of AI-based speech models to support individuals with stuttering by providing real-time correction and enhancing speech clarity without altering their natural voice characteristics.
RESULTS AND DISCUSSION
The proposed Stutter Clear system was evaluated through a series of experiments that measured detection accuracy, fluency enhancement, latency, and perceptual quality. The results demonstrate that the hybrid deep learning framework effectively identifies speech disfluencies and generates fluent, natural speech while maintaining the speaker’s original characteristics. The stuttering detection model, which combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) layer, achieved an average precision of 0.88, recall of 0.84, and F1-score of 0.86 on the test dataset. Even under noisy speech conditions with a 10 dB signal-to-noise ratio, the F1-score remained at 0.81, indicating strong robustness to background interference. Replacing the BiLSTM with a Temporal Convolutional Network (TCN) further enhanced detection accuracy, confirming the model’s ability to capture long-range dependencies in speech sequences. When compared with baseline models such as standard LSTM classifiers and Support Vector Machine (SVM)-based detectors, which yielded F1-scores of 0.74 and 0.68 respectively, the hybrid CNN-TCN-CRF model clearly outperformed existing approaches by a significant margin. In the fluency correction stage, the Transformer-based sequence-to-sequence model, integrated with a HiFi-GAN vocoder, produced fluent speech that closely resembled natural human output. Objective quality metrics indicated a perceptual evaluation of speech quality (PESQ) score of 3.71, a short-time objective intelligibility (STOI) score of 0.94, and a Mel-Cepstral Distortion (MCD) value of 4.2 dB. These values show a clear improvement over rule-based correction systems, which achieved only 2.98 PESQ, 0.87 STOI, and 6.0 dB MCD. 
Subjective evaluation through listening tests involving 20 participants yielded an average Mean Opinion Score (MOS) of 4.5 for naturalness and 4.4 for fluency, indicating that the system’s output is perceptually close to that of fluent human speakers. In terms of latency, Stutter Clear demonstrated real-time capability, achieving an average end-to-end processing delay of 175 milliseconds, which is well within the acceptable range of 200 milliseconds for live speech applications. The rule-based correction pathway provided faster but less natural results, while the Transformer-based correction offered higher perceptual quality with slightly higher latency. The hybrid framework intelligently selected between these approaches depending on the detected disfluency type, effectively balancing speed and quality. A comparative study with existing models such as FluentNet, Alhakbani et al., and Ramitha et al. revealed that Stutter Clear achieved 8–12 percent higher accuracy and a 0.4-point improvement in MOS scores. Unlike delay-based fluency aids such as Delayed Auditory Feedback (DAF) or Frequency Altered Feedback (FAF) systems, the proposed approach does not introduce perceptual distortion or alter the user’s voice characteristics. The inclusion of CycleGAN-based unpaired data training significantly improved generalization across unseen speakers and linguistic variations. In multilingual testing across English, Hindi, Bengali, and Telugu, the system maintained consistent performance with less than five percent variation in evaluation metrics, demonstrating its adaptability across diverse phonetic structures. Overall, the results confirm that the proposed hybrid architecture successfully integrates detection and correction into a unified, intelligent framework for real-time stutter correction. 
The combination of CNN-TCN feature extraction and Transformer-based fluency restoration enables the system to produce speech outputs that preserve natural rhythm, prosody, and speaker identity. Subjective assessments further indicate that users perceive the corrected speech as authentic, smooth, and emotionally expressive. Nevertheless, some limitations remain in handling long-duration blocks, severe repetitions, and emotionally variable speech segments, which can occasionally lead to over-smoothing or reduced expressiveness. Addressing these challenges may involve the incorporation of emotion-aware synthesis models, context-sensitive prediction mechanisms, and optimized lightweight architectures for mobile deployment. Despite these challenges, the findings position Stutter Clear as a promising advancement in AI-driven, real-time assistive technology for stuttering intervention, combining accuracy, fluency, and naturalness in a single integrated solution.
CONCLUSION
This paper introduces Stutter Clear, a multi-stage machine learning framework designed for detecting and correcting stuttered speech. The framework aims to enhance speech fluency by leveraging advanced algorithms that can identify various types of stuttering and generate corrected speech outputs. Looking ahead, the future work for Stutter Clear includes several key directions. First, the development of large, paired datasets is planned to improve model training and robustness. Second, efforts will focus on optimizing the models for real-time deployment on mobile devices, ensuring practical usability and low-latency performance. Finally, clinical trials conducted under the supervision of speech-language pathologists (SLPs) are envisioned to validate the effectiveness of the framework in real-world therapeutic settings. These steps will collectively enhance the framework’s accuracy, efficiency, and applicability in assisting individuals with speech disfluencies. In conclusion, the proposed system represents a significant step forward in speech technology and rehabilitation. By integrating artificial intelligence and speech processing, it offers an innovative, non-invasive, and adaptive solution for people who stutter. Unlike conventional therapies or devices, this approach provides real-time feedback and automatic fluency enhancement, making communication smoother and more confident. In the future, such systems could be integrated into mobile applications or wearable devices, offering accessible and continuous speech support for individuals worldwide.
REFERENCES
Arvind Kumar Mishra*, A Machine Learning–Based System for Real-Time Stuttered Speech Correction: Stutter Clear, Int. J. Sci. R. Tech., 2025, 2 (11), 510-520. https://doi.org/10.5281/zenodo.17657053