English Abstract |
In this paper, we propose a novel scheme for applying feature statistics normalization techniques to robust speech recognition. In the proposed approach, the processed temporal-domain feature sequence is first converted into the modulation spectral domain. The magnitude part of the modulation spectrum is decomposed into overlapping non-uniform sub-band segments, and each sub-band segment is then processed individually by well-known normalization methods such as mean normalization (MN) and mean and variance normalization (MVN). Finally, we reconstruct the feature stream from all the modified sub-band magnitude spectral segments and the original phase spectrum using the inverse DFT. With this process, the components corresponding to the more important modulation spectral bands in the feature sequence can be processed separately, and because adjacent segments overlap, each band contains more spectral samples, which yields more accurate statistics estimates. For the Aurora-2 clean-condition training task, the newly proposed overlapping sub-band spectral MN and MVN provide further error rate reductions over the conventional non-overlapping ones. |
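The pipeline described in the abstract (DFT, overlapping sub-band normalization of the magnitude, inverse DFT with the original phase) can be illustrated with a minimal NumPy sketch. Several details here are assumptions, not taken from the paper: the function names `band_stats` and `subband_normalize`, the band edges, the `overlap` width, the use of externally supplied reference statistics (`ref_stats`), the additive form of MN, and the choice to estimate statistics on the overlap-extended segment while writing normalized values only to the core bins of each band.

```python
import numpy as np

def band_stats(mag, edges, overlap):
    """Mean and std of the magnitude spectrum over each overlap-extended band.

    edges: increasing bin indices; band i covers bins edges[i]..edges[i+1]-1.
    overlap: number of neighbouring bins borrowed on each side (assumed scheme).
    """
    stats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        seg = mag[max(lo - overlap, 0):min(hi + overlap, len(mag))]
        stats.append((seg.mean(), seg.std()))
    return stats

def subband_normalize(x, edges, ref_stats, overlap=2, use_variance=True, eps=1e-12):
    """Normalize one feature trajectory x[n] per modulation sub-band.

    ref_stats: list of (mean, std) targets per band, e.g. estimated from
    clean training data (hypothetical input, not specified in the abstract).
    """
    x = np.asarray(x, dtype=float)
    spec = np.fft.rfft(x)                      # modulation spectrum of the sequence
    mag, phase = np.abs(spec), np.angle(spec)  # phase is kept unchanged
    new_mag = mag.copy()
    bands = list(zip(edges[:-1], edges[1:]))
    for (lo, hi), (mu, sd), (mu_ref, sd_ref) in zip(
            bands, band_stats(mag, edges, overlap), ref_stats):
        core = mag[lo:hi]
        if use_variance:                       # MVN: match mean and variance
            out = (core - mu) / (sd + eps) * sd_ref + mu_ref
        else:                                  # MN: match mean only (additive form)
            out = core - mu + mu_ref
        new_mag[lo:hi] = np.maximum(out, 0.0)  # a magnitude cannot be negative
    # Recombine modified magnitude with the original phase and invert the DFT.
    return np.fft.irfft(new_mag * np.exp(1j * phase), n=len(x))
```

In this sketch a 64-frame trajectory has 33 one-sided DFT bins, so non-uniform edges such as `[0, 1, 3, 7, 15, 33]` would give narrow bands at low modulation frequencies and wider ones above. Setting `overlap=0` reduces it to a non-overlapping variant for comparison.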