Implement Shion (詩音) from Sing a Bit of Harmony (讓我聽見愛的歌聲) with Python¶

:Event: PyCon APAC 2022
:Presented: 2022/07/20 (pre-recorded) nikkie

你好 (Hello)❗️ PyCon APAC 2022¶

Many thanks to all the staff who worked so hard❤️

About nikkie (me)¶

  • loves Python (& Anime, Japanese cartoons)
  • Twitter @ftnext / GitHub @ftnext
  • PyCon JP: 2019〜2020 staff & 2021 chair

About nikkie (me)¶

  • Data scientist at Uzabase, Inc. (NLP, writing Python)
  • We're hiring!! (Engineers, Data scientists, Researchers)

[Uzabase logo]

See also my other talk: "Revisit Python from statements and PEG"¶

Sing a Bit of Harmony¶

https://ainouta.jp/

Sing a Bit of Harmony¶

  • An animated film released in Japan in October 2021.
  • Sci-fi x coming-of-age x musical
  • The key character is Shion, an AI humanoid robot🤖.

Shion says "I will make you happy!"¶

https://youtu.be/1UeIEUoHZ6E

Want to implement Shion!!¶

  • Shion is a program.
  • I can write programs in Python.
  • 👉 I should be able to write a program like Shion.

Implement Shion (詩音) from Sing a Bit of Harmony (讓我聽見愛的歌聲) with Python¶

  • I will share the details of my maker project, "Implement Shion with Python".
  • I hope it provides a little inspiration for your own maker project.

Caveats⚠️¶

  • The implementation shared here is a wild fancy of nikkie (a fan)
    • The movie does not seem to mention any operating system or programming language.
  • nikkie is not an audio practitioner.
    • I am self-taught, so if there is a better way, let me know!

Implement Shion with Python¶

  • Implement one feature: talking with people
  • Start small (v0.0.1)

Define Shion v0.0.1¶

  • Implement the software only
  • A program that can talk with a human
  • Like a smart speaker

Demo: Shion v0.0.1¶

  • Reads the spoken sentences back aloud:

    • Hello (こんにちは)
    • Okay? I'm giving you a command. (いい? 命令するよ?)

Organize technical requirements¶

Technologies behind Shion v0.0.1

Definition of Shion v0.0.1¶

A human provides voice input; Shion then:

  1. Transcribe speech into text
  2. Process the text to create response text
  3. Read the response text out loud

Technical requirements¶

  • Input: convert voice to text
  • Output: read text out loud
  • Processing: parroting (this time)
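
Putting the three requirements together, the target loop looks roughly like this (a minimal sketch; speech_to_text and text_to_speech are placeholder names, wired up with real implementations later in this talk):

from typing import Callable


def conversation_loop(
    speech_to_text: Callable[[], str],  # ASR piece (built later)
    text_to_speech: Callable[[str], None],  # TTS piece (built later)
) -> None:
    """Sketch of the Shion v0.0.1 loop: listen, parrot, speak."""
    while True:
        text = speech_to_text()  # 1. transcribe speech into text
        response = text  # 2. parroting (this time)
        text_to_speech(response)  # 3. read the response out loud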

Validate then refine¶

  • I won't know whether I'm satisfied with Shion until I make it.
  • Validate the idea quickly as the first move.
  • If it looks good, make it more like Shion.

Technology to read text out loud¶

  • Called "speech synthesis"
  • Also called Text-To-Speech (TTS)

TTS (Text-To-Speech) in this talk¶

  • First move: call OS command
  • Refinement: use a pre-trained machine learning model

Technology to convert voice to text¶

  • Called "speech recognition"
  • Also called Automatic Speech Recognition (ASR)

ASR (Automatic Speech Recognition) in this talk¶

  • First move: call Web API
  • Refinement: use a pre-trained machine learning model

Technology line-up in this talk¶

  • TTS first move
  • ASR first move
  • TTS refinement
  • ASR refinement

TTS first move: call OS command¶

Shion v0.0.1 at this step¶

  1. Transcribe speech into text
  2. Process the text to create response text
  3. Read the response text out loud 👈

TTS command¶

  • macOS: say command (detailed later)
  • Linux and Windows: espeak command
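
For Linux and Windows, a minimal sketch calling espeak from Python (assuming the espeak binary is installed and on PATH):

import subprocess

# Read text aloud with espeak
subprocess.run(["espeak", "Hello, PyCon APAC!"])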

say command in macOS¶

say -v <voice> <text>

In [2]:
!say -v Kyoko いま、幸せ?

say -v ?: obtain a list of voices¶

  • ja_JP: Kyoko
  • zh_TW: Mei-Jia
In [3]:
!say -v Mei-Jia 你好

Call say command from Python¶

  • subprocess in the standard library
  • Example in docs: "Speaking logging messages"
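
That docs example wraps a TTS command in a logging handler. A condensed sketch of the idea (not the verbatim cookbook code; the class name here is mine, and I use say instead of the docs' espeak):

import logging
import subprocess


class SpeakingHandler(logging.Handler):
    """Read each log message aloud (here with the macOS say command)."""

    def emit(self, record: logging.LogRecord) -> None:
        subprocess.run(["say", self.format(record)])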

subprocess.run¶

  • Calls any command, not just TTS.
  • Pass the command as a sequence of program arguments.

Example

subprocess.run(["ls", "-l"])  # Call `ls -l`

TTS with subprocess.run¶

In [4]:
import subprocess
In [5]:
subprocess.run(["say", "-v", "Kyoko", "いま、幸せ?"])
Out[5]:
CompletedProcess(args=['say', '-v', 'Kyoko', 'いま、幸せ?'], returncode=0)
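
If you want a failure to raise an exception instead of passing silently, subprocess.run accepts check=True:

# Raises CalledProcessError if `say` exits with a non-zero status
subprocess.run(["say", "-v", "Kyoko", "いま、幸せ?"], check=True)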

TTS sample script¶

# readline import enables line editing for input()
import readline  # noqa: F401
import subprocess


def say(sentence: str):
    """Read the sentence aloud with the macOS say command."""
    subprocess.run(["say", "-v", "Kyoko", sentence])


if __name__ == "__main__":
    while True:
        # Prompt: "Enter a sentence to read aloud (q to quit)"
        sentence = input("読み上げたい文を入力してください (qで終了): ")
        stripped = sentence.strip()
        if not stripped:
            continue
        if stripped.lower() == "q":
            break

        say(stripped)

Technology line-up in this talk¶

  • TTS first move
  • ASR first move
  • TTS refinement
  • ASR refinement

ASR first move: Call Web API¶

Shion v0.0.1 at this step¶

  1. Transcribe speech into text 👈
  2. Process the text to create response text
  3. Read the response text out loud ✅

ASR Web APIs¶

  • Google Cloud Speech-to-Text API (👈 used this time)
  • Microsoft Azure Speech
  • IBM Speech to Text
  • etc. etc.

SpeechRecognition¶

  • Library for ASR
  • Supports Web APIs and engines
  • https://github.com/Uberi/speech_recognition

Process with SpeechRecognition¶

  1. Get audio from a microphone
  2. Send audio to ASR Web API

1. Get audio from a microphone¶

In [6]:
import speech_recognition as sr
In [7]:
r = sr.Recognizer()
In [8]:
with sr.Microphone(sample_rate=16_000) as source:
    print("なにか話してください")  # "Please say something"
    audio = r.listen(source)
    print("音声を取得しました")  # "Audio captured"
なにか話してください
音声を取得しました

2. Send audio to ASR Web API¶

Select Google Cloud Speech-to-Text API

In [9]:
import os
In [10]:
# The service account key path comes from an environment variable
with open(os.environ["SPEECH_TO_TEXT_API_SERVICE_ACCOUNT_KEY"]) as f:
    credentials = f.read()

2. Send audio to ASR Web API (cont.)¶

In [11]:
recognized_text = r.recognize_google_cloud(
    audio, credentials, language="ja-JP"
)
print(recognized_text.strip())
こんにちは
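
One caveat worth knowing: recognize_google_cloud raises SpeechRecognition's own exceptions on failure, so a long-running loop may want to catch them. A minimal sketch:

# Sketch: handle the two failure modes of recognize_google_cloud
try:
    recognized_text = r.recognize_google_cloud(
        audio, credentials, language="ja-JP"
    )
except sr.UnknownValueError:
    print("Could not recognize the speech")
except sr.RequestError as error:
    print(f"API request failed: {error}")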

ASR sample script¶

import argparse

import speech_recognition as sr


def input_from_microphone(recognizer: "sr.Recognizer") -> "sr.AudioData":
    """Capture one utterance from the default microphone."""
    with sr.Microphone(sample_rate=16_000) as source:
        print("なにか話してください")  # "Please say something"
        audio = recognizer.listen(source)
        print("音声を取得しました")  # "Audio captured"
        return audio


def recognize_speech(
    recognizer: "sr.Recognizer", audio: "sr.AudioData", credentials: str
) -> str:
    """Transcribe the audio with the Google Cloud Speech-to-Text API."""
    recognized_text = recognizer.recognize_google_cloud(
        audio, credentials, language="ja-JP"
    )
    return recognized_text.strip()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("credentials_path")
    args = parser.parse_args()

    with open(args.credentials_path) as f:
        credentials = f.read()

    r = sr.Recognizer()

    while True:
        audio = input_from_microphone(r)
        text = recognize_speech(r, audio, credentials)
        print(text)

        # Prompt: "Press q to quit here, or Enter to continue"
        character = input("ここで終了する場合はq、続ける場合はEnterを押してください: ")
        if character.strip().lower() == "q":
            break

Process text¶

Shion v0.0.1 at this step¶

  1. Transcribe speech into text ✅
  2. Process the text to create response text 👈
  3. Read the response text out loud ✅

Parroting the text🦜 (this time)¶

  • The simplest text processing
def talk_with_chatbot(sentence: str) -> str:
    return sentence
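
For example, the parrot returns its input unchanged:

>>> talk_with_chatbot("いま、幸せ?")
'いま、幸せ?'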

Place validating ideas quickly above everything¶

  • TTS: subprocess.run
  • ASR: Web API
  • parroting the text

Validation result¶

  • LGTM👍 (my qualitative feedback)
  • Make it more like Shion!

Room for refinement of quick implementation 1/2¶

  • The say command depends on the OS (macOS)
  • Shion does not seem to run macOS

Room for refinement of quick implementation 2/2¶

  • Calling a Web API requires Internet access
  • Shion is standalone, i.e., does not communicate with Web APIs

Make it more like Shion!¶

  • Download pre-trained machine learning models beforehand
  • TTS & ASR with machine learning models

Technology line-up in this talk¶

  • TTS first move
  • ASR first move
  • TTS refinement
  • ASR refinement

TTS refinement: use pre-trained model¶

Shion v0.0.1 at this step¶

  1. Transcribe speech into text ✅
  2. Process the text to create response text ✅
  3. Read the response text out loud 👈

ttslearn¶

  • Library for TTS (with Japanese support)
  • Companion library to the book 『Pythonで学ぶ音声合成』 (Speech Synthesis with Python)📘
  • https://github.com/r9y9/ttslearn

Example of speech synthesis in Japanese¶

In [12]:
from ttslearn.dnntts import DNNTTS
In [13]:
dnntts_engine = DNNTTS()
In [14]:
audio_array, sampling_rate = dnntts_engine.tts("いま、幸せ?")

DNNTTS()¶

  • Implementation of TTS with a deep neural network (DNN)
  • Loads pre-trained models (downloading them if needed)
  • The tts method returns a NumPy array of audio data (with its sampling rate)

sounddevice¶

Play and Record Sound with Python

  • https://github.com/spatialaudio/python-sounddevice/
  • Used here to play NumPy arrays of audio data

Example of TTS¶

In [15]:
import sounddevice as sd
In [16]:
sd.play(audio_array, sampling_rate)
sd.wait()
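
As an aside, if you also want to keep the synthesized audio, soundfile (which appears later in this talk for ASR) can write the array to a WAV file. A minimal sketch (the filename is arbitrary):

import soundfile as sf

# Persist the synthesized audio as a WAV file
sf.write("synthesized.wav", audio_array, sampling_rate)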

Refined TTS sample script¶

# readline import enables line editing for input()
import readline  # noqa: F401

import sounddevice as sd
from ttslearn.dnntts import DNNTTS

# Load the pre-trained DNN-based TTS model (downloaded on first use)
dnntts_engine = DNNTTS()


def say(sentence: str):
    """Synthesize the sentence and play it through the speakers."""
    audio_array, sampling_rate = dnntts_engine.tts(sentence)
    sd.play(audio_array, sampling_rate)
    sd.wait()  # block until playback finishes


if __name__ == "__main__":
    while True:
        # Prompt: "Enter a sentence to read aloud (q to quit)"
        sentence = input("読み上げたい文を入力してください (qで終了): ")
        stripped = sentence.strip()
        if not stripped:
            continue
        if stripped.lower() == "q":
            break

        say(stripped)

Technology line-up in this talk¶

  • TTS first move
  • ASR first move
  • TTS refinement
  • ASR refinement

ASR refinement: use pre-trained model¶

Shion v0.0.1 at this step¶

  1. Transcribe speech into text 👈
  2. Process the text to create response text ✅
  3. Read the response text out loud ✅✨

ESPnet¶

end-to-end speech processing toolkit

  • https://github.com/espnet/espnet
  • Used here for ASR (the library also supports TTS; future work)

Use pre-trained model in ESPnet¶

  • Use the model published on Hugging Face

    • pre-trained by its owner
  • pip install espnet-model-zoo

Example of using a pre-trained model¶

In [17]:
from espnet2.bin.asr_inference import Speech2Text
In [18]:
speech2text = Speech2Text.from_pretrained(
    "kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave"
)

Refine the ASR feature with a pre-trained model in ESPnet¶

  1. First step: ASR of WAV file
  2. ASR of voice input from microphone

First step: ASR of WAV file¶

SoundFile¶

an audio library based on libsndfile, CFFI and NumPy.

  • https://github.com/bastibe/python-soundfile

Tip: create a WAV file with the say command¶

In [19]:
!say -v Kyoko いま、幸せ? -o sample.wav --data-format=LEF32@16000
  • @16000 specifies the sampling rate (ref: man say)
  • This model is pre-trained at a sampling rate of 16,000 Hz, so we match it.
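
To double-check a file's rate before feeding it to the model, soundfile (imported in the next cell) can report it. A quick sketch:

import soundfile as sf

# The WAV file's sampling rate should match the model's 16,000 Hz
assert sf.info("sample.wav").samplerate == 16_000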

ASR of WAV file¶

In [20]:
import soundfile as sf
In [21]:
speech_array, sampling_rate = sf.read("sample.wav")
In [22]:
nbests = speech2text(speech_array)
text, tokens, *_ = nbests[0]
print(text)
今幸せ

ASR of voice input from microphone¶

Handle microphone: SpeechRecognition again¶

In [23]:
r = sr.Recognizer()
with sr.Microphone(sample_rate=16_000) as source:
    print("なにか話してください")  # "Please say something"
    audio = r.listen(source)
    print("音声を取得しました")  # "Audio captured"
なにか話してください
音声を取得しました

Get bytes in WAV format¶

In [24]:
wav_bytes = audio.get_wav_data()
type(wav_bytes)
Out[24]:
bytes

Convert to NumPy array¶

In [25]:
from io import BytesIO
In [26]:
wav_stream = BytesIO(wav_bytes)
speech_array, sampling_rate = sf.read(wav_stream)
type(speech_array)
Out[26]:
numpy.ndarray

ASR of array¶

In [27]:
nbests = speech2text(speech_array)
text, tokens, *_ = nbests[0]
print(text)
えー今幸せ

Refined ASR sample script¶

from io import BytesIO

import numpy as np
import soundfile as sf
import speech_recognition as sr
from espnet2.bin.asr_inference import Speech2Text

# Load the pre-trained Japanese ASR model published on Hugging Face
speech2text = Speech2Text.from_pretrained(
    "kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave"
)

SAMPLING_RATE_HZ = 16_000  # the rate the ASR model was pre-trained at


def input_from_microphone(recognizer: "sr.Recognizer") -> "sr.AudioData":
    """Capture one utterance from the default microphone."""
    with sr.Microphone(sample_rate=SAMPLING_RATE_HZ) as source:
        print("なにか話してください")  # "Please say something"
        audio = recognizer.listen(source)
        print("音声を取得しました")  # "Audio captured"
        return audio


def convert_to_array(audio: "sr.AudioData") -> "np.ndarray":
    """Convert the captured audio to a NumPy array for the ASR model."""
    wav_bytes = audio.get_wav_data()
    wav_stream = BytesIO(wav_bytes)
    audio_array, sampling_rate = sf.read(wav_stream)
    assert sampling_rate == SAMPLING_RATE_HZ
    return audio_array


def recognize_speech(audio_array: "np.ndarray") -> str:
    """Transcribe the audio array; return the best hypothesis."""
    nbests = speech2text(audio_array)
    text, tokens, *_ = nbests[0]
    return text


if __name__ == "__main__":
    r = sr.Recognizer()

    while True:
        audio = input_from_microphone(r)
        array = convert_to_array(audio)
        text = recognize_speech(array)
        print(text)

        # Prompt: "Press q to quit here, or Enter to continue"
        character = input("ここで終了する場合はq、続ける場合はEnterを押してください: ")
        if character.strip().lower() == "q":
            break

Shion v0.0.1 refined!¶

  1. Transcribe speech into text ✅✨
  2. Process the text to create response text ✅
  3. Read the response text out loud ✅✨

shion.py: the integration¶

  1. Transcribe speech into text (ASR)
  2. Process the text to create response text (parroting)
  3. Read the response text out loud (TTS)

shion.py¶

from io import BytesIO

import numpy as np
import sounddevice as sd
import soundfile as sf
import speech_recognition as sr
from espnet2.bin.asr_inference import Speech2Text
from ttslearn.dnntts import DNNTTS

# Load the pre-trained Japanese ASR model published on Hugging Face
speech2text = Speech2Text.from_pretrained(
    "kan-bayashi/csj_asr_train_asr_transformer_raw_char_sp_valid.acc.ave"
)

SAMPLING_RATE_HZ = 16_000  # the rate the ASR model was pre-trained at


def input_from_microphone(recognizer: "sr.Recognizer") -> "sr.AudioData":
    """Capture one utterance from the default microphone."""
    with sr.Microphone(sample_rate=SAMPLING_RATE_HZ) as source:
        print("なにか話してください")  # "Please say something"
        audio = recognizer.listen(source)
        print("音声を取得しました")  # "Audio captured"
        return audio


def convert_to_array(audio: "sr.AudioData") -> "np.ndarray":
    """Convert the captured audio to a NumPy array for the ASR model."""
    wav_bytes = audio.get_wav_data()
    wav_stream = BytesIO(wav_bytes)
    audio_array, sampling_rate = sf.read(wav_stream)
    assert sampling_rate == SAMPLING_RATE_HZ
    return audio_array


def recognize_speech(audio_array: "np.ndarray") -> str:
    """Transcribe the audio array; return the best hypothesis."""
    nbests = speech2text(audio_array)
    text, tokens, *_ = nbests[0]
    return text


def recognize_microphone_input(recognizer: "sr.Recognizer") -> str:
    """Step 1: transcribe speech into text (ASR)."""
    audio = input_from_microphone(recognizer)
    array = convert_to_array(audio)
    return recognize_speech(array)


def process_text(sentence: str) -> str:
    """Step 2: create the response text (parroting, this time)."""
    return sentence


# Load the pre-trained DNN-based TTS model (downloaded on first use)
dnntts_engine = DNNTTS()


def say(sentence: str):
    """Step 3: read the response text out loud (TTS)."""
    audio_array, sampling_rate = dnntts_engine.tts(sentence)
    sd.play(audio_array, sampling_rate)
    sd.wait()  # block until playback finishes


if __name__ == "__main__":
    r = sr.Recognizer()

    while True:
        text = recognize_microphone_input(r)
        response = process_text(text)
        say(response)

        # Prompt: "Press q to quit here, or Enter to continue"
        character = input("ここで終了する場合はq、続ける場合はEnterを押してください: ")
        if character.strip().lower() == "q":
            break

What I'm learning by implementing Shion with Python¶

  • Implement quickly
  • Use machine learning (2 points)

Implement quickly¶

  • Put implementing TTS and ASR quickly above everything
  • Validated the system (Shion) by piecing together quick implementations

    • ref: tracer bullets in "The Pragmatic Programmer"

Use machine learning 1/2¶

  • Used an ASR Web API as the first move
  • We developers always have the option of using Web APIs for machine learning tasks!

Use machine learning 2/2¶

  • Used pre-trained models for TTS and ASR
  • Just as we pip install libraries, we can download and use pre-trained models for machine learning tasks!

Summary🌯 Implement Shion (詩音) from Sing a Bit of Harmony (讓我聽見愛的歌聲) with Python¶

  • Define Shion v0.0.1, implement as shion.py
  • Share implementations (ASR and TTS in Python) and lessons

Define Shion v0.0.1¶

  1. Transcribe speech into text (ASR)
  2. Process the text to create response text (parroting)
  3. Read the response text out loud (TTS)

I won't know whether I'm satisfied with Shion until I make it, but¶

  • Quick implementations as the first move; validate the idea first
  • Pieced together quick implementations to validate the system (tracer bullets)
  • It looked good to me, so I'm making it more like Shion

ASR in Python¶

  • Use a Web API (as the quick implementation)
  • Use pre-trained machine learning model

TTS in Python¶

  • Call an OS command (as the quick implementation)
  • Use pre-trained machine learning model

Lessons through Shion v0.0.1¶

  • If part of what you want to create can be viewed as a machine learning task, the following approaches are also available

    • Use Web API
    • Use pre-trained models

Thank you very much for your attention.¶

I hope this talk provides a little inspiration for your maker project.