Note: This post is a personal reflection, not a tutorial.

These datasets, prepared for use with TensorFlow, are available here.

To set up these datasets we will follow this guide:

JSUT

https://sites.google.com/site/shinnosuketakamichi/publication/jsut

JSUT is a Japanese speech dataset consisting of about 10h of speech from a single female speaker. The transcript was designed to cover commonly used words.

Common Voice Version 6

TensorFlow Datasets only has version 1 of this dataset, which does not include Japanese.

Version 6 has 5h of Japanese speech in total, 3h of which is validated.

Tatoeba Japanese

Japanese sentences on Tatoeba that have audio.

Consists of about 1h of Japanese speech across 1525 sentences.

The zip file can be downloaded here.

JVS Corpus

More information is available at the dataset’s webpage.

Japanese speech dataset.

About 30h of speech from 100 speakers.

Contains some of the same sentences as JSUT.
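
To get a feel for how much data this adds up to, here is a rough sketch that tallies the approximate figures quoted above. The `Corpus` class and the helper names are hypothetical, made up for this illustration, and the hours are the rounded values from this post rather than exact measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Corpus:
    """Approximate size of one speech corpus (figures rounded, from this post)."""
    name: str
    hours: float
    # Only Common Voice distinguishes validated from unvalidated audio here.
    validated_hours: Optional[float] = None


# Rounded figures as quoted in the sections above.
CORPORA = [
    Corpus("JSUT", 10.0),
    Corpus("Common Voice 6 (Japanese)", 5.0, validated_hours=3.0),
    Corpus("Tatoeba Japanese", 1.0),
    Corpus("JVS", 30.0),
]


def total_hours(corpora: Sequence[Corpus], validated_only: bool = False) -> float:
    """Sum the hours, optionally counting only validated audio where known."""
    total = 0.0
    for corpus in corpora:
        if validated_only and corpus.validated_hours is not None:
            total += corpus.validated_hours
        else:
            total += corpus.hours
    return total


def avg_seconds_per_sentence(hours: float, sentences: int) -> float:
    """Average clip length in seconds, given total hours and sentence count."""
    return hours * 3600 / sentences


if __name__ == "__main__":
    print(total_hours(CORPORA))                       # all audio
    print(total_hours(CORPORA, validated_only=True))  # validated audio only
    print(round(avg_seconds_per_sentence(1.0, 1525), 2))  # Tatoeba clips
```

With these rounded numbers the four corpora come to roughly 46h in total, or about 44h if only the validated part of Common Voice is counted, and Tatoeba clips average a bit over two seconds each. Since JVS shares some sentences with JSUT, the amount of distinct text is smaller than the raw hours suggest.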