Taiwan Mandarin Speech with Video (TMSV)

The TMSV Dataset

TMSV (Taiwan Mandarin Speech with Video) is an audio-visual dataset based on the script of TMHINT (Taiwan Mandarin hearing in noise test). TMSV was recorded by 18 speakers, including 13 males and 5 females, each providing 320 video clips, amounting a total of 5,760 speech utterances. Each utterance in the TMHINT script consists of 10 Chinese characters. The length of each clip in TMSV is approximately 2–4 seconds. The video clips were recorded in a recording studio with sufficient light, and the speakers were filmed from the front view at 50 frames per second at a resolution of 1080p. The audio channels were recorded at 48 kHz. The misread labels and actual readed texts for ASR tasks are provided as well. The extracted wav files in the audio folders are downsampled to 16 kHz in mono channel.

Shang-Yi Chuang
Shang-Yi Chuang
ML Researcher | ASR R&D

Extremely self-motivated engineer with excellent understanding of machine learning algorithms. Interested in speech processing, natural language processing, and multimodal learning.