Alibaba Open-Sources Its Acoustic Modeling Technology for Speech Recognition
Editor's note: The author of this article is Shiliang Zhang, a senior algorithm engineer at Alibaba's Machine Intelligence Technology Lab. The article introduces Alibaba's new acoustic modeling technology for speech recognition: the Deep Feedforward Sequential Memory Network (DFSMN). Speech recognition systems based on DFSMN have already been deployed successfully in many scenarios, including courtroom trial transcription, intelligent customer service, video review, real-time caption transcription, voiceprint verification, and IoT. We are now open-sourcing the DFSMN implementation built on the Kaldi speech recognition toolkit, together with the training recipes. With the released code and training pipeline, the best performance to date can be obtained on the public English dataset LibriSpeech.
This post presents DFSMN, an improved Feedforward Sequential Memory Network (FSMN) architecture for large vocabulary continuous speech recognition. We release the source code and training recipes for DFSMN based on the popular Kaldi speech recognition toolkit, and demonstrate that DFSMN achieves the best performance on the LibriSpeech speech recognition task.
Acoustic Modeling in Speech Recognition
Deep neural networks have become the dominant acoustic models in large vocabulary continuous speech recognition systems. Depending on how the networks are connected, there exist various types of neural network architectures, such as feedforward fully-connected neural networks (FNN), convolutional neural networks (CNN) and recurrent neural networks (RNN).
For acoustic modeling, it is crucial to take advantage of the long-term dependencies within the speech signal. Recurrent neural networks (RNNs) are designed to capture long-term dependencies in sequential data through a simple mechanism of recurrent feedback. RNNs can learn to model sequential data over an extended period of time, store the memory in their connections, and carry out rather complicated transformations on the sequential data. As opposed to FNNs, which can only learn to map a fixed-size input to a fixed-size output, RNNs can in principle learn to map one variable-length sequence to another. Therefore, RNNs, especially long short-term memory (LSTM) networks, have become the most popular choice in acoustic modeling for speech recognition.
In our previous work, we proposed a novel non-recurrent neural architecture, namely the feedforward sequential memory network (FSMN), which can effectively model long-term dependencies in sequential data without using any recurrent feedback. FSMN is inspired by the filter design knowledge in digital signal processing that any infinite impulse response (IIR) filter can be well approximated by a high-order finite impulse response (FIR) filter. Because the recurrent layer in RNNs can be conceptually viewed as a first-order IIR filter, it may be precisely approximated by a high-order FIR filter. We therefore extend the standard feedforward fully connected neural network by augmenting its hidden layers with memory blocks that adopt a tapped-delay-line structure, as in FIR filters. Fig. 1 (a) shows an FSMN with one memory block added to its ℓ-th hidden layer, and Fig. 1 (b) shows the FIR-filter-like memory block in FSMN. As a result, the overall FSMN remains a pure feedforward structure, so it can be learned in a much more efficient and stable way than RNNs. The learnable FIR-like memory blocks in FSMNs encode long context information into a fixed-size representation, which helps the model capture long-term dependencies. Experimental results on the English Switchboard speech recognition task show that FSMN can outperform the popular BLSTM while training faster.
Fig. 1. Illustration of FSMN and its tapped-delay memory block
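To make the tapped-delay idea concrete, here is a minimal NumPy sketch of a bidirectional FSMN-style memory block in its vectorized form, with elementwise tap coefficients. The names, shapes, and function signature are our own illustration rather than the released Kaldi code; the learnable taps `a` and `c` play the role of the FIR filter coefficients:

```python
import numpy as np

def memory_block(h, a, c):
    """Hypothetical sketch of an FSMN memory block (not the Kaldi code).
    h: (T, D) hidden activations over T frames;
    a: (N1 + 1, D) lookback taps (index 0 weights the current frame);
    c: (N2, D) lookahead taps.
    Returns the (T, D) memory-block output."""
    T, _ = h.shape
    n1 = a.shape[0] - 1
    n2 = c.shape[0]
    m = np.zeros_like(h)
    for t in range(T):
        for i in range(n1 + 1):        # current and past frames
            if t - i >= 0:
                m[t] += a[i] * h[t - i]
        for j in range(1, n2 + 1):     # future (lookahead) frames
            if t + j < T:
                m[t] += c[j - 1] * h[t + j]
    return m
```

At each frame, the block compresses a window of N1 past and N2 future activations into a fixed-size vector, which is what lets a pure feedforward network see long context without recurrent feedback.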
DFSMN Open Source
Fig. 2. Illustration of Deep-FSMN (DFSMN) with skip connection
In this work, building on our previous FSMN work and recent work on neural networks with very deep architectures, we present an improved FSMN structure, namely the Deep-FSMN (DFSMN) (as shown in Fig. 2), by introducing skip connections between the memory blocks in adjacent layers. These skip connections enable information to flow across different layers and thus alleviate the gradient vanishing problem when building very deep structures. We can successfully build DFSMNs with dozens of layers that significantly outperform the previous FSMN.
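As a rough illustration, the following hypothetical sketch (reusing `memory_block` from above) stacks memory blocks with identity skip connections between adjacent layers; it is an assumption-laden simplification that omits the low-rank projections and stride factors of the actual DFSMN components:

```python
def dfsmn_stack(h, layers):
    """Hypothetical sketch of stacked DFSMN layers with skip connections.
    h: (T, D) input activations; layers: list of (a, c) tap arrays,
    one pair per layer, shaped as in memory_block above."""
    prev_mem = None
    for a, c in layers:
        mem = h + memory_block(h, a, c)   # memory block plus its own input
        if prev_mem is not None:
            mem = mem + prev_mem          # skip connection from the memory
                                          # block of the previous layer
        prev_mem = mem
        h = np.maximum(mem, 0.0)          # hidden nonlinearity (e.g. ReLU)
    return h
```

The identity skip path gives gradients a direct route from the top of the stack down to the earliest memory blocks, which is what makes networks with dozens of such layers trainable.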
We implement DFSMN based on the popular Kaldi speech recognition toolkit and release the source code at https://github.com/tramphero/kaldi. DFSMN is embedded into kaldi-nnet1 by adding DFSMN-related components and CUDA kernel functions. We use mini-batch based training instead of multi-stream training, which is more stable and efficient.
Improving the State of the Art
We have trained DFSMN on the LibriSpeech corpus, a large (1000-hour) corpus of English read speech derived from audiobooks in the LibriVox project, sampled at 16 kHz. We trained DFSMN with two official settings using the Kaldi recipes: 1) a model trained on the "cleaned" data (the 960-hours setting); 2) a model trained on the speed-perturbed and volume-perturbed "cleaned" data (the 3000-hours setting).
For the plain 960-hours setting, the best model previously released in Kaldi is the cross-entropy-trained BLSTM. For comparison, we trained DFSMN with the same front-end processing and decoding configuration as the official BLSTM, using the cross-entropy criterion. The experimental results are shown in Table 1. For the augmented 3000-hours setting, the previous best result was achieved by a TDNN trained with lattice-free MMI followed by sMBR-based discriminative training. In comparison, we trained DFSMN with cross-entropy followed by one epoch of sMBR-based discriminative training. The experimental results are shown in Table 2. In both settings, our DFSMN achieves significant performance improvements over the previous best results.
Table 1. Performance (WER in %) of BLSTM and DFSMN trained on cleaned data.
| Model | Small LM | Large LM |
| --- | --- | --- |
| Official-BLSTM | 6.85 | 5.22 |
| DFSMN | 4.73 | 4.36 |
| Relative Gain | +30.95% | +16.48% |
Table 2. Performance (WER in %) of TDNN and DFSMN trained on speed-perturbed and volume-perturbed cleaned data.
| Model | Small LM | Large LM |
| --- | --- | --- |
| TDNN | 6.15 | 4.31 |
| DFSMN | 5.10 | 3.96 |
| Relative Gain | +17.07% | +8.12% |
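In both tables, the relative gain is the relative WER reduction over the baseline, i.e. (WER_baseline − WER_DFSMN) / WER_baseline; for example, the small-LM gain in Table 1 is (6.85 − 4.73) / 6.85 ≈ 30.95%.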
How to get our implementation and reproduce our results
We provide two ways to get the implementation and reproduce our results: 1) a GitHub project based on Kaldi; 2) a PATCH file with the DFSMN-related code and example scripts.
- Get the GitHub project
git clone https://github.com/tramphero/kaldi
- Apply the PATCH
The PATCH is built against the Kaldi speech recognition toolkit at commit "04b1f7d6658bc035df93d53cb424edc127fab819". You can apply this PATCH to your own Kaldi branch with the following commands:
#Take a look at what changes are in the patch
git apply --stat Alibaba_MIT_Speech_DFSMN.patch
#Test the patch before you actually apply it
git apply --check Alibaba_MIT_Speech_DFSMN.patch
#If you don’t get any errors, the patch can be applied cleanly.
git am --signoff < Alibaba_MIT_Speech_DFSMN.patch
The training scripts and experimental results for the LibriSpeech task are available at https://github.com/tramphero/kaldi/tree/master/egs/librispeech/s5. There are three DFSMN configurations with different model sizes: DFSMN_S, DFSMN_M, DFSMN_L.
## Train the FSMN models on the cleaned-up data
## Three configurations of DFSMN with different model sizes: DFSMN_S, DFSMN_M, DFSMN_L
local/nnet/run_fsmn_ivector.sh DFSMN_S
local/nnet/run_fsmn_ivector.sh DFSMN_M
local/nnet/run_fsmn_ivector.sh DFSMN_L
DFSMN_S is a small DFSMN with six DFSMN components, while DFSMN_L is a large DFSMN consisting of ten DFSMN components. For the 960-hours setting, it takes about two to three days to train DFSMN_S using only one M40 GPU. The detailed experimental results are listed in the RESULTS file.
For more details, take a look at our paper and the open-source project.
