| English Abstract |
In this paper, we explore compact convolutional neural networks (CNNs) for end-to-end keyword spotting, mapping raw audio directly to recognition results without traditional spectrogram-based feature extraction. On the Speech Commands dataset, these fully convolutional models reach 90.5% accuracy, an improvement of 12.15 percentage points over traditional methods with similar structures, which achieve only 78.35%. This indicates that learned CNN features outperform predefined FFT-based transforms on this task. The results show that compact end-to-end CNNs enable efficient and accurate small-vocabulary keyword spotting, making them well suited to resource-constrained edge devices. All code will be released on the authors' GitHub [Lin and Lyu, 2023]. |