Abstract
Precise detection of speech endpoints is an important factor affecting the performance of systems that must extract speech utterances from a speech signal, such as Automatic Speech Recognition (ASR) systems. Existing endpoint detection (EPD) methods mostly use Short-Term Energy (STE) and Zero-Crossing Rate (ZCR) based approaches and their variants. However, STE- and ZCR-based EPD algorithms often fail in the presence of Non-speech Sound Artifacts (NSAs) produced by the speakers. Pattern recognition and classification techniques have also been applied, but those methods require labeled data for training. In this article, a novel approach is proposed to extract speech endpoints; the algorithm is termed Wavelet Convolution based Speech Endpoint Detection (WCSED). WCSED decomposes the speech signal into high-frequency and low-frequency components using wavelet convolution and then computes information-entropy based thresholds for the two frequency components. The low-frequency thresholds are used to extract voiced speech segments, whereas the high-frequency thresholds are used to extract unvoiced speech segments while filtering out the NSAs. WCSED does not require any labeled data for training and can extract speech segments automatically. Experiments are carried out on two speech databases, and the results are promising even in the presence of NSAs.
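The two building blocks named in the abstract, a split into low- and high-frequency components by wavelet convolution and an information-entropy measure on each component, can be illustrated with a minimal sketch. This is not the WCSED algorithm itself: the abstract does not specify the wavelet, the framing, or how thresholds are derived from the entropy, so the Haar kernels, the histogram-based entropy, and the names `haar_split`, `frame_entropy`, `framewise_entropy`, `frame_len`, and `bins` are all assumptions chosen for illustration.

```python
import numpy as np


def haar_split(signal):
    """One-level Haar analysis: convolve with a low-pass and a high-pass
    kernel, then keep every second sample. Assumes an even-length signal.
    (The Haar wavelet is an assumption; the abstract names no wavelet.)"""
    x = np.asarray(signal, dtype=float)
    low_kernel = np.array([1.0, 1.0]) / np.sqrt(2.0)    # pairwise average
    high_kernel = np.array([1.0, -1.0]) / np.sqrt(2.0)  # pairwise difference
    low = np.convolve(x, low_kernel, mode="full")[1::2]
    high = np.convolve(x, high_kernel, mode="full")[1::2]
    return low, high


def frame_entropy(frame, bins=16):
    """Shannon entropy (in bits) of the amplitude histogram of one frame.
    (Histogram-based entropy is an illustrative choice, not the paper's.)"""
    hist, _ = np.histogram(frame, bins=bins)
    p = hist[hist > 0] / hist.sum()
    return float(-np.sum(p * np.log2(p)))


def framewise_entropy(band, frame_len=256, bins=16):
    """Entropy of each non-overlapping frame of a sub-band signal."""
    n_frames = len(band) // frame_len
    return np.array([
        frame_entropy(band[i * frame_len:(i + 1) * frame_len], bins)
        for i in range(n_frames)
    ])


if __name__ == "__main__":
    # Synthetic stand-in for a speech signal: silence, a tone, then silence.
    t = np.arange(4096)
    x = np.zeros(4096)
    x[1024:3072] = np.sin(2 * np.pi * 0.01 * t[1024:3072])
    low, high = haar_split(x)
    print("low-band frame entropies: ", framewise_entropy(low))
    print("high-band frame entropies:", framewise_entropy(high))
```

Any rule that turns these per-frame entropies into voiced or unvoiced decisions would have to come from the paper itself; the sketch stops at the quantities the abstract names.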
Original language | English |
---|---|
Pages (from-to) | 162-175 |
Number of pages | 14 |
Journal | Communications in Nonlinear Science and Numerical Simulation |
Volume | 67 |
Publication status | Published - Feb 2019 |
Keywords
- Pattern recognition
- Signal processing
- Speech endpoint detection
- Speech recognition
- Wavelet convolution
ASJC Scopus subject areas
- Numerical Analysis
- Modeling and Simulation
- Applied Mathematics