Preparing Japanese Audio Datasets for TensorFlow

February 23, 2021

Note: This post is a personal reflection, not a tutorial.

To set up these datasets, we will follow this guide:

JSUT

JSUT is a Japanese speech dataset consisting of about 10 hours of speech from a single female speaker. The transcript was designed to cover commonly used words.
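A figure like the one quoted above can be sanity-checked after extracting the audio. The following is a minimal sketch, not part of the original setup: it assumes the dataset unpacks into a directory of uncompressed PCM `.wav` files, and the function name `total_hours` and the example directory name are hypothetical.

```python
import wave
from pathlib import Path


def total_hours(root):
    """Sum the duration of every .wav file under root, in hours.

    Assumes plain PCM WAV files; the standard-library wave module
    cannot read compressed or floating-point WAV variants.
    """
    total_seconds = 0.0
    for path in sorted(Path(root).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            # duration = number of frames / frames per second
            total_seconds += wav.getnframes() / wav.getframerate()
    return total_seconds / 3600.0


if __name__ == "__main__":
    # "jsut_extracted" is a placeholder for wherever the archive was unpacked.
    print(f"{total_hours('jsut_extracted'):.2f} hours")
```

If the directory does not exist or holds no `.wav` files, the function simply returns 0.0, so a surprisingly small number usually means the path is wrong.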

TensorFlow Datasets only has version 1 of this dataset, which does not include Japanese.

Version 6 has 5 hours of Japanese speech in total, 3 hours of which are validated.

Japanese sentences that have audio on Tatoeba.

Consists of about 1 hour of Japanese speech made up of 1,525 sentences.

The zip file can be downloaded here.

Japanese speech dataset.

About 30 hours of speech from 100 speakers.

Contains some of the same sentences as JSUT.
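Shared sentences matter if one dataset is used for training and the other for evaluation, since the same text could end up on both sides. As a rough sketch of how the overlap might be measured, assuming the transcripts of each dataset have already been loaded into lists of strings (the function names here are hypothetical, and loading is left out because the file formats differ per dataset):

```python
import unicodedata


def normalize(sentence):
    """NFKC-normalize and drop all whitespace so trivially different copies match."""
    return "".join(unicodedata.normalize("NFKC", sentence).split())


def shared_sentences(transcripts_a, transcripts_b):
    """Return the normalized sentences that appear in both collections."""
    set_a = {normalize(s) for s in transcripts_a}
    set_b = {normalize(s) for s in transcripts_b}
    return set_a & set_b
```

NFKC normalization is a design choice here: it folds half-width and full-width variants together, so two transcripts that differ only in character width or spacing count as the same sentence. Differences in punctuation or in kanji versus kana spelling would still be treated as distinct.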