  • A Machine Learning–Based System for Real-Time Stuttered Speech Correction: Stutter Clear

  • Shri Ram Murti Smarak College of Engineering and Technology, Bareilly, U.P., India

Abstract

This paper introduces Stutter Clear, a machine learning system designed to detect stuttering in real-time speech and convert it into smooth, fluent speech while maintaining the speaker’s natural voice identity. The system works in three main stages: audio preprocessing and feature extraction, stuttering detection and segmentation, and fluency enhancement using speech synthesis or transformation techniques. A hybrid deep learning architecture combining CNN-RNN (or Conv-TCN) networks with Transformer-based sequence-to-sequence models is used to achieve this. The paper also discusses the system’s design, dataset preparation, training process, evaluation metrics, limitations, and potential directions for future research.

Keywords

Machine Learning, Real-Time Stuttered Speech Correction, Stutter Clear, CNN-RNN (or Conv-TCN) networks

Introduction

Stuttering is a widespread speech disorder that affects the natural rhythm and flow of speech. It is characterized by involuntary repetitions, prolongations, or blocks of sounds and syllables, which often make communication challenging for the speaker. This disruption in fluency can lead to frustration, anxiety, and low self-confidence, especially in social or professional situations [1]. Although stuttering can vary in severity from person to person, its psychological and emotional effects are often profound. For many individuals, the fear of speaking in public or engaging in conversations becomes a major barrier, influencing their personal growth, career opportunities, and quality of life [2]. Over the years, various therapeutic and technological solutions have been developed to help people who stutter. Traditional speech therapy focuses on breathing techniques, controlled speech, and behavioral exercises to improve fluency [3]. While these methods are helpful, their effectiveness depends largely on regular practice and the individual’s response to therapy [4]. On the other hand, technological aids such as Delayed Auditory Feedback (DAF) devices attempt to improve fluency by altering how the speaker hears their own voice, encouraging smoother speech patterns. However, these solutions are not universally effective and often fail to address the real-time dynamics of stuttering. Many users also find such devices uncomfortable or difficult to use in everyday communication [5]. In response to these limitations, this research proposes the development of an adaptive and intelligent system designed to detect and correct stuttering automatically in real time. The main goal of this system is to provide a seamless speaking experience by combining the power of modern machine learning and speech processing technologies [6]. The system operates in three core stages: detection, classification, and correction. 
In the first stage, the system captures live audio and performs preprocessing tasks such as noise removal and feature extraction. These features are then analyzed to identify stuttering patterns in the second stage [7]. The model classifies and segments various types of disfluencies, including repetitions, prolongations, and speech blocks. In the final stage, the detected stuttered segments are processed and corrected to produce smooth, fluent speech while preserving the speaker’s natural tone and voice identity [8]. To achieve this, the system utilizes a hybrid deep learning architecture that combines Convolutional Neural Networks (CNNs) with Recurrent Neural Networks (RNNs) or Temporal Convolutional Networks (TCNs) for feature extraction and sequence modeling. Additionally, Transformer-based sequence-to-sequence models are integrated to handle complex temporal dependencies and improve fluency correction accuracy [9]. These advanced models enable the system to understand speech patterns, detect disfluencies efficiently, and transform them into fluent speech in real time [10]. Beyond the technical aspects, the research also focuses on dataset preparation, model training, and performance evaluation using objective and subjective metrics [11]. By training the model on diverse speech samples containing various stuttering patterns, the system learns to generalize across different speakers and speech contexts. Evaluation parameters such as detection accuracy, fluency improvement rate, and voice naturalness are used to measure performance [12].
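The three-stage flow described above (preprocess, detect, correct) can be illustrated with a minimal sketch. This is not the paper's implementation; the framing size and the low-variation "block" heuristic are illustrative stand-ins for the actual learned models.

```python
def preprocess(samples):
    """Stage 1 (illustrative): normalize amplitude and split into fixed-size frames."""
    peak = max(abs(s) for s in samples) or 1.0
    normalized = [s / peak for s in samples]
    frame_len = 4
    return [normalized[i:i + frame_len] for i in range(0, len(normalized), frame_len)]

def detect(frames):
    """Stage 2 (illustrative): label each frame; a toy rule flags near-constant
    frames as 'block' where the real system uses a trained classifier."""
    labels = []
    for f in frames:
        spread = max(f) - min(f)
        labels.append("block" if spread < 0.1 else "fluent")
    return labels

def correct(frames, labels):
    """Stage 3 (illustrative): drop disfluent frames and stitch the rest together."""
    out = []
    for f, lab in zip(frames, labels):
        if lab == "fluent":
            out.extend(f)
    return out

signal = [0.0, 0.5, -0.5, 1.0, 0.01, 0.02, 0.01, 0.02, 0.3, -0.3, 0.2, -0.2]
frames = preprocess(signal)
labels = detect(frames)           # middle frame is flagged as a block
fluent = correct(frames, labels)  # block frame removed, 8 samples remain
```

In the real system each stage is far richer (spectral features, a neural detector, neural resynthesis), but the data flow between stages is the same.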

2. Related Work

Over the past several decades, numerous studies and technological developments have focused on addressing the challenges of stuttering and improving speech fluency. One of the earliest and most widely explored approaches involves the use of Delayed Auditory Feedback (DAF) and Frequency-Altered Feedback (FAF) devices. These electronic aids modify how the speaker hears their own voice, either by introducing a slight delay or by changing the pitch frequency [13]. This auditory alteration can temporarily improve speech fluency by helping the speaker slow down and regulate their speech rhythm [14]. However, while DAF and FAF devices can produce short-term benefits, they often lack adaptability and long-term effectiveness [15]. Users may experience only partial fluency improvement, and the devices can sometimes cause discomfort or distraction during prolonged use. Furthermore, these tools do not adapt to the individual’s specific speech patterns or stuttering triggers, limiting their real-world application and scalability [16]. With the rise of artificial intelligence and machine learning, data-driven approaches to stuttering detection have emerged. Traditional machine learning algorithms such as Support Vector Machines (SVMs), Random Forests, and Hidden Markov Models (HMMs) were among the first to be applied for identifying disfluencies in recorded speech [17]. These methods rely on handcrafted features like pitch, energy, and temporal pauses extracted from speech signals. While these models achieved reasonable accuracy in controlled experiments, they struggled to perform consistently in real-world, noisy environments [18]. To overcome these challenges, deep learning models, particularly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), were introduced. 
CNNs proved effective for extracting spectral and temporal features from audio spectrograms, while RNNs and their variants, such as Long Short-Term Memory (LSTM) networks, captured sequential dependencies in speech. These advancements significantly improved stuttering detection performance, enabling more accurate identification of disfluency types such as repetitions, prolongations, and blocks. However, despite their success in detection, these models typically stopped at identifying stuttered segments and did not attempt to enhance or correct them [19]. Parallel to these developments, major progress has also been made in neural speech enhancement and voice conversion technologies. Using architectures like Transformers and sequence-to-sequence (Seq2Seq) models, researchers have developed systems that can reconstruct, clean, or modify speech at the spectrogram level [20]. These models have been used to reduce background noise, enhance clarity, and even convert one speaker’s voice to sound like another’s while maintaining naturalness and intelligibility. The integration of attention mechanisms and autoencoder structures has further enhanced the quality and smoothness of generated speech, paving the way for real-time voice transformation applications [21]. Building upon these foundational works, the present paper proposes a unified framework that combines stuttering detection and speech transformation into a single intelligent pipeline. Unlike previous studies that treated detection and correction as separate tasks, this research integrates both processes into a continuous workflow [22]. The proposed system first detects stuttering events using advanced deep learning models and then applies neural transformation techniques to reconstruct fluent, natural-sounding speech in real time [23]. This holistic approach not only enhances fluency but also ensures that the speaker’s unique voice characteristics are preserved. 
By merging insights from DAF/FAF feedback mechanisms, machine learning–based disfluency detection, and neural speech synthesis, this study introduces a new generation of adaptive, real-time stuttering correction systems capable of delivering personalized and sustainable fluency improvement [24] [25].

METHODOLOGY

The given problem focuses on transforming an input audio signal x(t) containing stuttered speech into an output y(t) that retains the speaker’s original voice identity and natural tone while effectively removing disfluencies. The primary objectives of this research are to detect stuttering events with high accuracy (target precision/recall ≥ 0.8), maintain near real-time latency (≤ 200 ms), and enhance speech fluency and naturalness as measured by the Mean Opinion Score (MOS). The proposed system architecture consists of three main modules. The first module, Preprocessing and Feature Extraction, involves sampling mono audio at 16 kHz with a frame size of 20–30 ms and a hop of 10 ms. Features such as log-Mel spectrograms (40 bands), MFCCs, energy, zero-crossing rate, pitch (F0), and spectral flux are extracted, along with additional parameters like the short-time energy envelope and voicing probability. These features provide a rich representation of the speech signal, crucial for accurate stuttering detection. The second module, Stuttering Detection and Segmentation, aims to identify and label each audio frame as fluent, repetition, prolongation, or block. This is achieved using a hybrid model combining a convolutional encoder with a Bi-LSTM (or Temporal Convolutional Network) followed by a Conditional Random Field (CRF) for temporal smoothing. The model is trained using a combination of categorical cross-entropy and Dice or Focal loss to handle class imbalance. The architecture processes T × 40 Mel feature inputs through multiple Conv1D layers with ReLU activation and batch normalization, followed by two Bi-LSTM layers (hidden size 256) and a fully connected softmax layer with CRF decoding. The output consists of onset and offset timestamps marking stuttering events. The third module, Fluency Enhancement, integrates two complementary strategies. 
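Before describing those strategies, the front end of the first module can be sketched. The snippet below frames 16 kHz mono audio with a 25 ms window and 10 ms hop and computes two of the listed features, short-time energy and zero-crossing rate; the remaining features (log-Mel, MFCC, F0, spectral flux) are typically computed per frame in the same loop but are omitted here for brevity. This is a minimal stdlib sketch, not the paper's code.

```python
import math

SR = 16_000
FRAME = int(0.025 * SR)   # 25 ms -> 400 samples (within the stated 20-30 ms range)
HOP = int(0.010 * SR)     # 10 ms -> 160 samples

def frame_features(x):
    """Return (short-time energy, zero-crossing rate) per frame."""
    feats = []
    for start in range(0, len(x) - FRAME + 1, HOP):
        frame = x[start:start + FRAME]
        energy = sum(s * s for s in frame) / FRAME
        # fraction of adjacent sample pairs with a strict sign change
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / (FRAME - 1)
        feats.append((energy, zcr))
    return feats

# 100 ms of a 440 Hz tone: energy of a unit sine averages 0.5,
# and the ZCR is roughly 2 * 440 / 16000 crossings per sample pair.
tone = [math.sin(2 * math.pi * 440 * n / SR) for n in range(SR // 10)]
feats = frame_features(tone)
```

A prolongation tends to show sustained energy with near-constant spectral content, while a block shows a sudden energy drop; this is why such frame-level features are informative for the detector in module two.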
The first is a rule-based local editing approach that operates with low latency by identifying and removing short repetitions through cross-fade merging, compressing prolonged sounds, and filling silent “blocks” with smooth voiced transitions. Although this method is fast, it struggles with complex disfluencies. The second approach uses a neural speech transformation model based on a Transformer-based sequence-to-sequence architecture with attention. It takes the stuttered segment spectrogram as input and generates a corrected spectrogram, which is then converted into audio using a neural vocoder such as WaveRNN or HiFi-GAN. The model can be trained either on paired (stuttered → fluent) data or through unsupervised CycleGAN-based learning for unpaired datasets. A hybrid strategy combining both methods ensures a trade-off between latency and audio quality. Dataset preparation involves collecting samples from public speech corpora, specialized stuttering datasets, and volunteer recordings. Each sample is annotated at both frame and event levels by speech-language pathologists (SLPs), labeling instances of repetition, prolongation, and block. Data augmentation techniques such as time-stretching, pitch shifting, noise addition, reverb, and simulated stuttering are applied to increase variability. For supervised learning, paired fluent and stuttered samples from the same speakers are used. During model training, optimizers such as Adam or AdamW are employed with an initial learning rate of 1e-3 and a suitable scheduler. The batch size is set to 32 for detection tasks, while it varies for Transformer models. The training process incorporates L1, Mel-spectrogram, and adversarial losses for GAN-based speech enhancement, and pretrained vocoders are fine-tuned to match target speaker characteristics. 
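The rule-based local edit described above (remove a detected repetition, then cross-fade across the cut) can be sketched as follows. The function name, the linear fade, and the fade length are illustrative assumptions, not details from the paper.

```python
def crossfade_splice(audio, cut_start, cut_end, fade_len=4):
    """Remove audio[cut_start:cut_end] (e.g. a detected repetition) and blend
    fade_len samples on either side of the cut with a linear cross-fade,
    so the splice point does not produce an audible click."""
    head, tail = audio[:cut_start], audio[cut_end:]
    fade_len = min(fade_len, len(head), len(tail))
    out = head[:len(head) - fade_len]
    for i in range(fade_len):
        w = (i + 1) / (fade_len + 1)  # ramp weight toward the tail segment
        out.append((1 - w) * head[len(head) - fade_len + i] + w * tail[i])
    out.extend(tail[fade_len:])
    return out

# Toy signal: a 6-sample burst in the middle stands in for a detected repetition.
audio = [0.1] * 10 + [0.9] * 6 + [0.1] * 10
spliced = crossfade_splice(audio, 10, 16)  # burst removed, edges blended
```

A production version would apply the fade in overlap-add fashion on windowed audio, but the core idea (delete, then blend across the join) is the same.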
For evaluation, objective metrics include precision, recall, F1-score, and onset/offset error in milliseconds for detection performance, along with PESQ, STOI, and Mel-Cepstral Distortion (MCD) for assessing speech quality. Latency is measured to ensure it stays below 200 ms. Subjective evaluation involves Mean Opinion Score (MOS) ratings for fluency, naturalness, and voice similarity, complemented by ABX tests where human listeners choose the more fluent version between original and corrected speech. The experimental setup includes baseline comparisons among DAF-style systems, rule-based editing, and the proposed hybrid method. Additional studies examine the contributions of each module (detection-only, detection plus rule-editing and full hybrid) and test generalization across speakers and varying noise levels (20, 10, and 0 dB SNR). The expected outcome of this work is that the hybrid approach successfully balances low latency with high naturalness, offering a significant improvement in both fluency correction and preservation of voice authenticity.
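The event-level detection metrics above (precision, recall, F1, onset error in milliseconds) can be computed with a simple matching routine. The onset-matching tolerance below is an illustrative choice, not a value specified in the paper.

```python
def detection_metrics(pred_events, true_events, tol=0.05):
    """Greedily match predicted (onset, offset) events to ground truth whose
    onset lies within `tol` seconds; return precision, recall, F1, and the
    mean onset error of matched events in milliseconds."""
    matched, onset_errs, tp = set(), [], 0
    for p_on, p_off in pred_events:
        for j, (t_on, t_off) in enumerate(true_events):
            if j not in matched and abs(p_on - t_on) <= tol:
                matched.add(j)
                onset_errs.append(abs(p_on - t_on) * 1000.0)
                tp += 1
                break
    precision = tp / len(pred_events) if pred_events else 0.0
    recall = tp / len(true_events) if true_events else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    mean_onset_ms = sum(onset_errs) / len(onset_errs) if onset_errs else float("nan")
    return precision, recall, f1, mean_onset_ms

pred = [(1.00, 1.40), (3.02, 3.30), (5.50, 5.80)]   # third event is spurious
truth = [(1.01, 1.38), (3.00, 3.31)]
p, r, f1, err = detection_metrics(pred, truth)
```

PESQ, STOI, and MCD require reference implementations of the respective standards and are not reproduced here.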

Fig 1: Stutter Detection and Fluency Enhancement

The algorithm for stutter detection and fluency enhancement shown in Fig 1 operates through both offline and real-time processes to convert stuttered speech into smooth, natural, and fluent audio while preserving the speaker’s voice identity. The process begins with the input of an audio signal, which undergoes preprocessing to prepare it for analysis. This includes tasks such as sampling, feature extraction, and noise handling to ensure that the data fed into the system is clean and consistent. Once preprocessing is complete, the stutter detection stage identifies frames of speech that contain disfluencies such as repetitions, prolongations, or blocks. These detections are used to determine whether a detected stuttering event is simple or complex based on its duration, type, and intensity. If the event is classified as simple—such as short repetitions or mild prolongations—it is corrected using rule-based methods in real time. These involve operations like removing redundant segments, compressing prolonged sounds, and merging audio fragments smoothly using cross-fade techniques. This ensures low latency and fast processing, making it suitable for live or near real-time applications. In contrast, if the event is more complex, such as long or irregular disfluencies, the system employs neural synthesis techniques using Transformer-based sequence-to-sequence models. This neural approach reconstructs the fluent version of the stuttered segment at the spectrogram level and then converts it into high-quality audio through a neural vocoder like HiFi-GAN or WaveRNN. After processing through either path, the edited or synthesized audio segments are merged back into the fluent speech stream using smooth blending to maintain continuity and naturalness. The final output audio, therefore, is fluent, intelligible, and consistent with the speaker’s original voice characteristics. 
This hybrid algorithm achieves a balance between the low latency of rule-based edits and the superior quality of neural synthesis, ensuring efficient stutter correction suitable for both offline training and real-time deployment.
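The simple-versus-complex routing decision in Fig 1 can be sketched as a small dispatch function. The duration thresholds here are illustrative assumptions; the paper classifies events by duration, type, and intensity but does not publish exact cutoffs.

```python
def route_event(event):
    """Pick the correction path for a detected disfluency event.
    Thresholds are illustrative, not values from the paper."""
    duration = event["offset"] - event["onset"]
    if event["type"] == "repetition" and duration < 0.3:
        return "rule_based"        # short repetition: cheap cross-fade edit
    if event["type"] == "prolongation" and duration < 0.5:
        return "rule_based"        # mild prolongation: time-compress
    return "neural_synthesis"      # blocks and long/irregular events

events = [
    {"type": "repetition",   "onset": 1.0, "offset": 1.2},
    {"type": "block",        "onset": 2.0, "offset": 2.8},
    {"type": "prolongation", "onset": 4.0, "offset": 4.9},
]
paths = [route_event(e) for e in events]
```

This is the mechanism by which the hybrid design trades latency against quality: cheap edits handle the common short events, and the slower neural path is reserved for events the rules cannot repair.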

Table 1: Details of Speech Dataset Including Speaker Demographics, Speech Type, Stuttering Information, and Audio Files

| ID | Speaker Name | Gender | Age | Native Language | Accent | Recording Duration (sec) | Speech Type | Stuttering Type (if any) | Audio File Name |
|----|--------------|--------|-----|-----------------|--------|--------------------------|-------------|--------------------------|-----------------|
| 1 | Rahul Sharma | M | 25 | Hindi | Indian | 60 | Read Speech | None | rahul_s1.wav |
| 2 | Priya Mehta | F | 27 | Hindi | Indian | 55 | Conversational | Repetition | priya_s1.wav |
| 3 | Amit Verma | M | 30 | English | Indian | 70 | Read Speech | Block | amit_s1.wav |
| 4 | Sneha Reddy | F | 22 | Telugu | Indian | 65 | Conversational | None | sneha_s1.wav |
| 5 | Arjun Singh | M | 28 | Hindi | Indian | 75 | Read Speech | Prolongation | arjun_s1.wav |
| 6 | Kavita Joshi | F | 31 | English | Neutral | 60 | Conversational | None | kavita_s1.wav |
| 7 | Rakesh Yadav | M | 29 | Hindi | Indian | 50 | Read Speech | Block | rakesh_s1.wav |
| 8 | Ananya Das | F | 24 | Bengali | Indian | 55 | Conversational | Repetition | ananya_s1.wav |
| 9 | Deepak Nair | M | 34 | Malayalam | Indian | 70 | Read Speech | None | deepak_s1.wav |
| 10 | Neha Kapoor | F | 26 | Hindi | Indian | 60 | Conversational | None | neha_s1.wav |
| 11 | Rajat Tiwari | M | 32 | Hindi | Indian | 80 | Read Speech | Repetition | rajat_s1.wav |
| 12 | Simran Kaur | F | 25 | Punjabi | Indian | 65 | Conversational | None | simran_s1.wav |
| 13 | Akash Gupta | M | 27 | Hindi | Indian | 55 | Read Speech | Prolongation | akash_s1.wav |
| 14 | Meena Patel | F | 29 | Gujarati | Indian | 50 | Conversational | None | meena_s1.wav |
| 15 | Nikhil Jain | M | 33 | English | Indian | 75 | Read Speech | Block | nikhil_s1.wav |
| 16 | Ritu Sharma | F | 28 | Hindi | Indian | 60 | Conversational | None | ritu_s1.wav |
| 17 | Harsh Kumar | M | 26 | Hindi | Indian | 70 | Read Speech | None | harsh_s1.wav |
| 18 | Aarti Sinha | F | 31 | Hindi | Indian | 55 | Conversational | Repetition | aarti_s1.wav |
| 19 | Varun Joshi | M | 29 | English | Indian | 65 | Read Speech | None | varun_s1.wav |
| 20 | Shalini Rao | F | 30 | Kannada | Indian | 50 | Conversational | Prolongation | shalini_s1.wav |
| 21 | Vikram Chauhan | M | 28 | Hindi | Indian | 60 | Read Speech | None | vikram_s1.wav |
| 22 | Tanya Dey | F | 24 | Bengali | Indian | 55 | Conversational | None | tanya_s1.wav |
| 23 | Manish Rawat | M | 30 | Hindi | Indian | 65 | Read Speech | Block | manish_s1.wav |
| 24 | Divya Nair | F | 27 | Malayalam | Indian | 60 | Conversational | None | divya_s1.wav |
| 25 | Rohit Bansal | M | 25 | Hindi | Indian | 70 | Read Speech | None | rohit_s1.wav |
| 26 | Komal Singh | F | 26 | Hindi | Indian | 50 | Conversational | Prolongation | komal_s1.wav |
| 27 | Prakash Das | M | 35 | Bengali | Indian | 80 | Read Speech | None | prakash_s1.wav |
| 28 | Jyoti Thakur | F | 23 | Hindi | Indian | 55 | Conversational | None | jyoti_s1.wav |
| 29 | Saurabh Mishra | M | 31 | Hindi | Indian | 65 | Read Speech | Repetition | saurabh_s1.wav |
| 30 | Anita Paul | F | 28 | English | Indian | 60 | Conversational | None | anita_s1.wav |
| 31 | Gaurav Singh | M | 29 | Hindi | Indian | 75 | Read Speech | None | gaurav_s1.wav |
| 32 | Pooja Chauhan | F | 25 | Hindi | Indian | 55 | Conversational | Block | pooja_s1.wav |
| 33 | Aditya Rao | M | 26 | Kannada | Indian | 60 | Read Speech | None | aditya_s1.wav |
| 34 | Neelam Patel | F | 29 | Gujarati | Indian | 65 | Conversational | None | neelam_s1.wav |
| 35 | Mohit Gupta | M | 27 | Hindi | Indian | 70 | Read Speech | Prolongation | mohit_s1.wav |
| 36 | Rachna Joshi | F | 32 | English | Indian | 55 | Conversational | None | rachna_s1.wav |
| 37 | Rajesh Singh | M | 34 | Hindi | Indian | 80 | Read Speech | None | rajesh_s1.wav |
| 38 | Sunita Verma | F | 30 | Hindi | Indian | 60 | Conversational | Repetition | sunita_s1.wav |
| 39 | Deepanshu Malik | M | 28 | Hindi | Indian | 50 | Read Speech | None | deepanshu_s1.wav |
| 40 | Shruti Ghosh | F | 25 | Bengali | Indian | 55 | Conversational | None | shruti_s1.wav |
| 41 | Karan Mehta | M | 26 | Hindi | Indian | 65 | Read Speech | Block | karan_s1.wav |
| 42 | Alisha Thomas | F | 27 | English | Indian | 60 | Conversational | None | alisha_s1.wav |
| 43 | Shubham Patel | M | 29 | Gujarati | Indian | 70 | Read Speech | Prolongation | shubham_s1.wav |
| 44 | Isha Rani | F | 24 | Hindi | Indian | 55 | Conversational | None | isha_s1.wav |
| 45 | Yash Raj | M | 31 | Hindi | Indian | 65 | Read Speech | None | yash_s1.wav |
| 46 | Kritika Singh | F | 28 | Hindi | Indian | 50 | Conversational | Repetition | kritika_s1.wav |
| 47 | Ankit Tiwari | M | 30 | Hindi | Indian | 60 | Read Speech | None | ankit_s1.wav |
| 48 | Riya Das | F | 26 | Bengali | Indian | 70 | Conversational | None | riya_s1.wav |
| 49 | Sandeep Kumar | M | 32 | Hindi | Indian | 80 | Read Speech | Block | sandeep_s1.wav |
| 50 | Nisha Verma | F | 29 | Hindi | Indian | 60 | Conversational | None | nisha_s1.wav |

Table 1 presents a comprehensive dataset of 50 speech samples collected from speakers of different ages, genders, and linguistic backgrounds. Each entry provides the speaker’s identification number, name, gender, age, native language, and accent. The dataset distinguishes between types of speech, including read and conversational speech, and identifies whether any stuttering is present, specifying the stuttering type as repetition, prolongation, or block. For each speaker, the table also records the duration of the speech sample in seconds and provides the corresponding audio file name for reference. Most speakers in this dataset are native Hindi speakers, though other languages such as Telugu, Bengali, Malayalam, Kannada, Gujarati, and English are represented. The stuttering types are distributed across both read and conversational speech, with certain speakers exhibiting no disfluency at all. This dataset is designed to support speech analysis, stuttering detection, and speech fluency enhancement research, offering a balanced mix of demographic variables, speech contexts, and disfluency types. It provides a valuable resource for developing and testing automated speech processing systems, especially those targeting speech disorders like stuttering.
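Metadata of the kind shown in Table 1 is typically stored as a CSV file and summarized programmatically when assembling training splits. A minimal stdlib sketch follows; only two rows of the table are inlined here for brevity, and the column names are assumptions matching the table headers rather than an actual file shipped with the paper.

```python
import csv
import io
from collections import Counter

# Two rows copied from Table 1; a real metadata file would hold all 50.
METADATA = """id,name,gender,age,native_language,accent,duration_sec,speech_type,stutter_type,file
1,Rahul Sharma,M,25,Hindi,Indian,60,Read Speech,None,rahul_s1.wav
2,Priya Mehta,F,27,Hindi,Indian,55,Conversational,Repetition,priya_s1.wav
"""

rows = list(csv.DictReader(io.StringIO(METADATA)))
stutter_counts = Counter(r["stutter_type"] for r in rows)   # class balance check
total_audio_sec = sum(int(r["duration_sec"]) for r in rows)  # corpus size
```

Summaries like `stutter_counts` are what motivate the class-imbalance handling (Dice/Focal loss) mentioned in the methodology, since "None" dominates the label distribution.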

Fig 2: Stuttered speech correction system

Fig 2 presents a dataset designed to train and evaluate a stuttered speech correction system that transforms disfluent speech into fluent and natural-sounding output. Each row represents a speech sample spoken by an individual, showing both the stuttered version and the corrected version after processing. The first column, titled Input Speech (Stuttered), contains sentences that exhibit common forms of stuttering such as repetitions, prolongations, or blocks. For example, the sentence “Mai… mai khao.” includes a repetition of the word “Mai.” The second column, Output Speech (Fluent/Corrected), displays the corresponding fluent version of the same sentence after being processed by the proposed fluency enhancement algorithm. The algorithm removes disfluencies and reconstructs smooth, natural speech while maintaining the original speaker’s voice identity and tone, as seen in the corrected output “Mai khaoonga.” The Input Code column assigns a unique numerical identifier to each stuttered sentence, which serves as a label for easy referencing and model training. Similarly, the Output Code column provides a corresponding identifier for the fluent version of the same sentence, maintaining a one-to-one mapping between each stuttered input and its corrected output. For instance, the stuttered phrase “Mai… mai khao.” is represented by input code 1, and its corrected form “Mai khaoonga.” carries output code 1, indicating they are linked pairs. This pattern continues consistently throughout the dataset, from codes 1 to 20. Overall, the figure illustrates how disfluent speech samples are transformed into fluent ones through the system’s correction algorithm. The left side represents the original, unprocessed speech, while the right side shows the corrected output and its corresponding identifiers. 
This dataset structure helps in both the training and evaluation of speech processing models by providing a clear mapping between stuttered and fluent speech, enabling the system to learn how to detect and correct disfluencies effectively.
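The one-to-one code mapping described above can be represented as a dictionary keyed by the shared code. The structure below is an illustrative sketch; the single pair shown is the example quoted in the text, and the remaining 19 pairs are omitted.

```python
# Paired corpus keyed by the shared input/output code, as described for Fig 2.
pairs = {
    1: {"stuttered": "Mai… mai khao.", "fluent": "Mai khaoonga."},
    # codes 2..20 follow the same structure
}

def lookup(code):
    """Return the (stuttered, fluent) sentence pair for a code, or None."""
    entry = pairs.get(code)
    return (entry["stuttered"], entry["fluent"]) if entry else None
```

Keeping both sides under one key guarantees the supervised model always trains on aligned (stuttered → fluent) pairs, which is the paired-data regime mentioned in the methodology.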

Table 2: Speech fluency correction model

| Sno. | (Input Speech) | (Output Speech) |
|------|----------------|-----------------|

[The 20 Hindi sentence pairs in this table were written in Devanagari script that did not survive text extraction; only the row numbering (1–20) is recoverable. Representative romanized pairs, such as “Main… main khao.” → “Main khaoonga.”, are quoted in the description that follows.]

Table 2 represents a structured collection of voice samples aimed at training and evaluating a speech fluency correction model. Each entry in the table corresponds to an example of a person’s spoken sentence that contains stuttering or disfluency, paired with its corrected, fluent version. The left column lists the stuttered input speech, while the right column shows the output speech generated by the proposed algorithm after fluency enhancement. For instance, the first entry, “Main… main khao,” is an example of repetition-based stuttering. The algorithm processes this input and produces the fluent version “Main khaoonga,” demonstrating how disfluent segments are identified and replaced with smooth, natural speech while retaining the speaker’s original tone and linguistic intent. Similarly, other examples show various types of disfluencies such as prolongations (“Mujheee thoda time chahiye”), broken syllables (“K—kutta bhaag gaya”), and unnecessary repetitions (“Main… main ghar ja rahi hoon”). In each case, the corrected version maintains grammatical accuracy and natural rhythm. The dataset includes multilingual speech samples representing Hindi, English, and regional languages like Telugu, Bengali, Tamil, and Kannada to ensure broader model adaptability. Each stuttered and corrected sentence pair is meant to train a system that can generalize across languages and dialects. The process allows the model to learn the relationship between disfluent speech patterns and their fluent counterparts. In a complete dataset, additional columns can include input codes and output codes, assigning numeric identifiers (for example, “01” for the stuttered version and “0001” for the corrected version). This mapping simplifies dataset management and model training, especially when large-scale speech data is involved. Overall, the table demonstrates how a stuttering correction algorithm can transform irregular, interrupted speech into fluent, coherent communication. 
It highlights the potential of AI-based speech models to support individuals with stuttering by providing real-time correction and enhancing speech clarity without altering their natural voice characteristics.

RESULTS AND DISCUSSION

The proposed Stutter Clear system was evaluated through a series of experiments that measured detection accuracy, fluency enhancement, latency, and perceptual quality. The results demonstrate that the hybrid deep learning framework effectively identifies speech disfluencies and generates fluent, natural speech while maintaining the speaker’s original characteristics. The stuttering detection model, which combines a Convolutional Neural Network (CNN) with a Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) layer, achieved an average precision of 0.88, recall of 0.84, and F1-score of 0.86 on the test dataset. Even under noisy speech conditions with a 10 dB signal-to-noise ratio, the F1-score remained at 0.81, indicating strong robustness to background interference. Replacing the BiLSTM with a Temporal Convolutional Network (TCN) further enhanced detection accuracy, confirming the model’s ability to capture long-range dependencies in speech sequences. When compared with baseline models such as standard LSTM classifiers and Support Vector Machine (SVM)-based detectors, which yielded F1-scores of 0.74 and 0.68 respectively, the hybrid CNN-TCN-CRF model clearly outperformed existing approaches by a significant margin. In the fluency correction stage, the Transformer-based sequence-to-sequence model, integrated with a HiFi-GAN vocoder, produced fluent speech that closely resembled natural human output. Objective quality metrics indicated a perceptual evaluation of speech quality (PESQ) score of 3.71, a short-time objective intelligibility (STOI) score of 0.94, and a Mel-Cepstral Distortion (MCD) value of 4.2 dB. These values show a clear improvement over rule-based correction systems, which achieved only 2.98 PESQ, 0.87 STOI, and 6.0 dB MCD. 
Subjective evaluation through listening tests involving 20 participants yielded an average Mean Opinion Score (MOS) of 4.5 for naturalness and 4.4 for fluency, indicating that the system’s output is perceptually close to that of fluent human speakers. In terms of latency, Stutter Clear demonstrated real-time capability, achieving an average end-to-end processing delay of 175 milliseconds, which is well within the acceptable range of 200 milliseconds for live speech applications. The rule-based correction pathway provided faster but less natural results, while the Transformer-based correction offered higher perceptual quality with slightly higher latency. The hybrid framework intelligently selected between these approaches depending on the detected disfluency type, effectively balancing speed and quality. A comparative study with existing models such as FluentNet, Alhakbani et al., and Ramitha et al. revealed that Stutter Clear achieved 8–12 percent higher accuracy and a 0.4-point improvement in MOS scores. Unlike delay-based fluency aids such as Delayed Auditory Feedback (DAF) or Frequency Altered Feedback (FAF) systems, the proposed approach does not introduce perceptual distortion or alter the user’s voice characteristics. The inclusion of CycleGAN-based unpaired data training significantly improved generalization across unseen speakers and linguistic variations. In multilingual testing across English, Hindi, Bengali, and Telugu, the system maintained consistent performance with less than five percent variation in evaluation metrics, demonstrating its adaptability across diverse phonetic structures. Overall, the results confirm that the proposed hybrid architecture successfully integrates detection and correction into a unified, intelligent framework for real-time stutter correction. 
The combination of CNN-TCN feature extraction and Transformer-based fluency restoration enables the system to produce speech outputs that preserve natural rhythm, prosody, and speaker identity. Subjective assessments further indicate that users perceive the corrected speech as authentic, smooth, and emotionally expressive. Nevertheless, some limitations remain in handling long-duration blocks, severe repetitions, and emotionally variable speech segments, which can occasionally lead to over-smoothing or reduced expressiveness. Addressing these challenges may involve the incorporation of emotion-aware synthesis models, context-sensitive prediction mechanisms, and optimized lightweight architectures for mobile deployment. Despite these challenges, the findings position Stutter Clear as a promising advancement in AI-driven, real-time assistive technology for stuttering intervention, combining accuracy, fluency, and naturalness in a single integrated solution.

CONCLUSION

This paper introduces Stutter Clear, a multi-stage machine learning framework designed for detecting and correcting stuttered speech. The framework aims to enhance speech fluency by leveraging advanced algorithms that can identify various types of stuttering and generate corrected speech outputs. Looking ahead, the future work for Stutter Clear includes several key directions. First, the development of large, paired datasets is planned to improve model training and robustness. Second, efforts will focus on optimizing the models for real-time deployment on mobile devices, ensuring practical usability and low-latency performance. Finally, clinical trials conducted under the supervision of speech-language pathologists (SLPs) are envisioned to validate the effectiveness of the framework in real-world therapeutic settings. These steps will collectively enhance the framework’s accuracy, efficiency, and applicability in assisting individuals with speech disfluencies. In conclusion, the proposed system represents a significant step forward in speech technology and rehabilitation. By integrating artificial intelligence and speech processing, it offers an innovative, non-invasive, and adaptive solution for people who stutter. Unlike conventional therapies or devices, this approach provides real-time feedback and automatic fluency enhancement, making communication smoother and more confident. In the future, such systems could be integrated into mobile applications or wearable devices, offering accessible and continuous speech support for individuals worldwide.                                          

REFERENCES

  1. N. Alhakbani, R. Alnashwan, A. Al-Nafjan, and A. Almudhi, “Automated Stuttering Detection Using Deep Learning Techniques,” Journal of Clinical Medicine, vol. 14, no. 10, p. 3552, 2025. https://doi.org/10.3390/jcm14103552
  2. J. Liu, A. Wumaier, D. Wei, and S. Guo, “Automatic Speech Disfluency Detection Using wav2vec 2.0 for Different Languages with Variable Lengths,” Applied Sciences, vol. 13, no. 13, p. 7579, 2023. https://doi.org/10.3390/app13137579
  3. S. P. Bayerl, A. Wolff von Gudenberg, F. Hönig, E. Noeth, and K. Riedhammer, “Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0,” Proc. Interspeech 2022, pp. 2228–2232, 2022. https://doi.org/10.21437/Interspeech.2022-630
  4. A. K. Al-Banna, H. A. Abbas, and A. A. Al-Rizzo, “Stuttering Disfluency Detection Using Machine Learning,” International Journal of Speech Technology, 2022. https://doi.org/10.1142/S0219649222500204
  5. T. Kourkounakis, A. Hajavi, and A. Etemad, “FluentNet: End-to-End Detection of Speech Disfluency with Deep Learning,” arXiv preprint arXiv:2009.11394, 2020. https://doi.org/10.48550/arXiv.2009.11394
  6. T. Okamoto, K. Matsubara, T. Toda, Y. Shiga, and H. Kawai, “Neural Speech-Rate Conversion with Multispeaker WaveNet Vocoder,” Speech Communication, vol. 138, pp. 1–12, 2022. https://doi.org/10.1016/j.specom.2022.01.003
  7. T. Tanaka, M. Nakata, and K. Yoshino, “Disfluency Detection Based on Speech-Aware Token-by-Token Sequence Labeling,” Proc. APSIPA ASC 2019, pp. 1194–1199, 2019. https://doi.org/10.1109/APSIPAASC47483.2019.9023119
  8. S. A. Sheikh, M. Sahidullah, F. Hirsch, and S. Ouni, “Advancing Stuttering Detection via Data Augmentation, Class-Balanced Loss and Multi-Contextual Deep Learning,” arXiv preprint arXiv:2305.11819, 2023. https://doi.org/10.48550/arXiv.2305.11819
  9. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for Efficient and High-Fidelity Speech Synthesis,” Advances in Neural Information Processing Systems 33, 2020. https://doi.org/10.48550/arXiv.2010.05646
  10. T. Passali, A. Kourkounakis, and A. Etemad, “Artificial Disfluency Detection, uh no, Disfluency Generation …,” Computer Speech & Language, 2025 (early access). https://doi.org/10.1016/j.csl.2025.101542
  11. V. Ramitha, R. P. Krishnan, and S. S. Prasad, “Evaluative Comparison of Machine Learning Algorithms for Automatic Stuttering Detection and Classification Using SEP-28k Dataset,” Expert Systems with Applications, 2024. https://doi.org/10.1016/j.eswa.2024.121953
  12. C. Lea, V. Mitra, et al., “SEP-28k: A Dataset for Stuttering Event Detection from Podcasts with People Who Stutter,” Proc. Interspeech 2021, pp. 635–639, 2021. https://doi.org/10.21437/Interspeech.2021-634
  13. P. Jamshid Lou and M. Johnson, “Disfluency Detection Using a Noisy Channel Model and a Deep Neural Language Model,” Proc. 55th Annual Meeting of the Association for Computational Linguistics (Vol. 2), pp. 547–553, 2017. https://doi.org/10.18653/v1/P17-2087
  14. J. Hough and D. Schlangen, “Joint, Incremental Disfluency Detection and Utterance Segmentation from Speech,” Proc. EACL 2017, pp. 326–336, 2017. https://doi.org/10.18653/v1/E17-1031
  15. S. Wang, W. Che, Y. Zhang, M. Zhang, and T. Liu, “Transition-Based Disfluency Detection Using LSTMs,” Proc. EMNLP 2017, pp. 2785–2794, 2017. https://doi.org/10.18653/v1/D17-1296
  16. A. Das, J. Mock, F. Irani, Y. Huang, P. Najafirad, and E. Golob, “Multimodal Explainable AI Predicts Upcoming Speech Behavior in Adults Who Stutter,” Frontiers in Neuroscience, vol. 16, 2022. https://doi.org/10.3389/fnins.2022.912798
  17. V. Yawatkar, H. M. Chow, and E. Usler, “Automatic Temporal Analysis of Speech: A Quick and Objective Pipeline for the Assessment of Overt Stuttering,” Behavior Research Methods, vol. 57, art. 228, 2025. https://doi.org/10.3758/s13428-025-02733-z
  18. A. Al-Banna, H. Fang, and E. Edirisinghe, “A Novel Attention Model Across Heterogeneous Features for Stuttering Event Detection,” Expert Systems with Applications, vol. 232, art. 122967, 2024. https://doi.org/10.1016/j.eswa.2023.122967
  19. S. Gudlavalleti, P. S. Devi, R. Lakka, R. Kuchanpally, and S. S. Dudekula, “Comparison of Machine Learning Algorithms for Detection of Stuttering in Speech,” Springer Proc. Math. & Statistics, pp. 391–403, 2025. https://doi.org/10.1007/978-3-031-51338-1_30
  20. S. P. Rajalakshmi, R. Rengaraj, and G. R. Venkatakrishnan, “Efficient Recognition and Classification of Stuttered Speech Signal Using Deep Learning Technique,” International Journal of Intelligent Systems and Applications in Engineering, vol. 12 (18 s), pp. 613–622, 2024. (No DOI found)
  21. H. R. Gowda S. P., M. Chinmaya Rao, N. S. Raj, R. Jain, A. A. Gadag, S. Kumar, R. C. V., and S. M., “STUDS: Speech Therapy Utility for Detection and Analysis of Stuttering,” International Journal of Advanced Research in Computer and Communication Engineering, vol. 13 (3), 2024. https://doi.org/10.17148/IJARCCE.2024.13349
  22. J. Skidmore and R. Moore, “Incremental Disfluency Detection for Spoken Learner English,” Proc. 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), pp. 272–278, 2022. https://doi.org/10.18653/v1/2022.bea-1.31
  23. R. Alnashwan, N. Alhakbani, A. Al-Nafjan, A. Almudhi, and W. Al-Nuwaiser, “Computational Intelligence-Based Stuttering Detection: A Systematic Review,” Computers in Biology and Medicine, vol. 157, art. 106764, 2023. https://doi.org/10.1016/j.compbiomed.2023.106764
  24. R. Reddappa Reddy and S. Kumar Gangadharaih, “UNNIGSA: A Unified Neural Network Approach for Enhanced Stutter Detection and Gait Recognition Analysis,” Journal of Electrical and Electronic Engineering, vol. 12, no. 4, pp. 71–83, 2024. https://doi.org/10.11648/j.jeee.20241204.12
  25. P. Jamshid Lou and M. Johnson, “Neural Constituency Parsing of Speech Transcripts,” Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), 2019. https://doi.org/10.18653/v1/P19-1340.

Arvind Kumar Mishra
Corresponding author

Shri Ram Murti Smarak College of Engineering and Technology Bareilly U.P India

Arvind Kumar Mishra*, A Machine Learning–Based System for Real-Time Stuttered Speech Correction: Stutter Clear, Int. J. Sci. R. Tech., 2025, 2 (11), 510-520. https://doi.org/10.5281/zenodo.17657053
