Speech analysis/synthesis

Speech analysis/synthesis (Japanese: 音声分析合成) is a form of speech processing that analyzes speech to obtain features and then resynthesizes speech from those features^[1]^[2]^[3].

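Overview

Speech analysis/synthesis refers to the signal-processing chain "speech → acoustic features → speech". In other words, it treats speech analysis, which converts a speech signal into features, and feature-based speech synthesis as a single unit.

Speech coding can be regarded as speech analysis/synthesis performed for compression or encryption: analysis corresponds to encoding, the features to the code, and synthesis to decoding. In speech modification, likewise, it is often the acoustic features that are modified rather than the signal itself. The side effects of a modification (e.g., distortion, noise) depend strongly on the properties of the acoustic features and on the design of the synthesizer, so there is a clear benefit in understanding analysis and synthesis as one system. For these reasons speech analysis/synthesis is important as a foundational technology for acoustic signal processing in general^[4].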

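Vocoder

Speech analysis/synthesis systems are collectively called vocoders.

The word "vocoder" was coined in Dudley (1939), a paper on speech coding, to mean "a system that codes the voice and resynthesizes the voice from that code"^[5]. Because it performs feature extraction by analysis and resynthesis based on those features, such a system is a speech analysis/synthesis system, and today "vocoder" is widely used as the general term for speech analysis/synthesis systems^[6].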

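Examples

A variety of speech analysis/synthesis systems (vocoders) have been proposed^[7]. Some examples follow.

Table. Vocoders and their characteristics
Name	Acoustic features	Synthesizer	Original work
Channel vocoder	fo, volume, subband intensity envelopes^[8]	Subtractive synthesis	Dudley (1939)
Phase vocoder	Complex amplitude (STFT)	iSTFT
LPC vocoder	Excitation signal, LP coefficients	Linear prediction (subtractive synthesis)
Spectral modeling synthesis^[9] (sound analysis/synthesis)	fo, amplitude / spectrum	Harmonic additive synthesis / noise subtractive synthesis	Serra, Smith (1990)
TANDEM-STRAIGHT^[10]
WORLD^[11]	fo, spectral envelope, aperiodicity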

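Channel vocoder

A channel vocoder is a vocoder that encodes speech into the fundamental frequency and subband intensity envelopes and resynthesizes it by subtractive synthesis.

The analysis consists of a pitch branch and a spectrum branch.

The spectrum branch obtains the spectral code by "band splitting → half-wave rectification → low-pass filtering^[12]". In synthesis, this code controls the power in each frequency band^[13]. Because the bands are split first, each code is the signal of one subband, and half-wave rectification followed by low-pass filtering is a common envelope-extraction method in signal processing. The code can therefore be interpreted as the intensity envelope of a subband (a slowly varying amplitude^[14]). From the viewpoint of the source-filter model, it can be interpreted as reflecting articulation^[15]. In the terminology of transmission engineering, it can also be interpreted as the signal component of an amplitude modulation^[16].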
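The envelope-extraction step of the spectrum branch can be sketched as follows. This is a minimal illustration, not Dudley's circuit: it assumes band splitting has already produced one subband signal, it uses a first-order (one-pole) digital low-pass filter as a stand-in for the analog filter, and the function names, the 8 kHz sample rate, and the toy input are all hypothetical. The 25 Hz cutoff merely echoes the "25-cycle lowpass filter" quoted from Dudley (1939).

```python
import math


def half_wave_rectify(signal):
    """Keep positive samples and replace negative samples with zero."""
    return [max(x, 0.0) for x in signal]


def one_pole_lowpass(signal, cutoff_hz, sample_rate):
    """First-order IIR low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    y = 0.0
    out = []
    for x in signal:
        y += a * (x - y)
        out.append(y)
    return out


def subband_envelope(subband, cutoff_hz, sample_rate):
    """Half-wave rectification followed by low-pass filtering."""
    rectified = half_wave_rectify(subband)
    return one_pole_lowpass(rectified, cutoff_hz, sample_rate)


# Toy subband (assumed): a 1 kHz carrier whose amplitude steps from 0.2 to 1.0.
sample_rate = 8000
n_samples = 8000
gains = [0.2 if i < n_samples // 2 else 1.0 for i in range(n_samples)]
subband = [g * math.sin(2.0 * math.pi * 1000.0 * i / sample_rate)
           for i, g in enumerate(gains)]

envelope = subband_envelope(subband, cutoff_hz=25.0, sample_rate=sample_rate)
quiet = envelope[n_samples // 2 - 1]  # settled value for the 0.2 segment
loud = envelope[-1]                   # settled value for the 1.0 segment
print(quiet, loud)
```

The output follows the slow amplitude change of the subband while the 1 kHz carrier is largely smoothed away, which is the sense in which the code is a slowly varying intensity envelope. The envelope is proportional to the subband amplitude rather than equal to it, so a real system would apply a gain; that detail is left out here.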

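Methods

Speech analysis/synthesis draws on a variety of speech analysis and speech synthesis methods. In addition, various techniques and frameworks specific to speech analysis/synthesis have been developed that exploit the advantage of treating analysis and synthesis as one. Some examples follow.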

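Analysis by synthesis

Analysis by synthesis (AbS) is a framework of speech analysis/synthesis in which features are extracted (that is, analysis is performed) based on an evaluation of the synthesized sound.

In simple speech analysis/synthesis, analysis and synthesis are performed independently, so the features obtained for a given input are determined uniquely by the analyzer. In AbS, by contrast, a provisional analysis is performed first and speech is resynthesized from the resulting features. The synthesized sound is then evaluated to judge whether the features represent the speech well. If they do not, the provisional features are updated (reanalysis) and the same synthesis and evaluation are repeated, which improves the analysis. This framework of "analysis through a loop of analysis, synthesis, and evaluation" is AbS.

AbS presupposes the existence of a synthesizer, so it is a framework that exploits the defining trait of speech analysis/synthesis: treating analysis and synthesis as one.

Because AbS has to run many loop iterations to obtain a single set of features, it has the drawback of a high analysis cost. The most primitive form of AbS synthesizes from every feature candidate by brute force and keeps the best one, which is clearly expensive. AbS as used in practice applies various devices to limit computation, such as hierarchical narrowing of the candidates and gradient methods.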
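The brute-force form of AbS can be sketched as follows. This is a toy illustration under stated assumptions, not a practical coder: the "synthesizer" is a single sinusoid, the features are a hypothetical (frequency, amplitude) pair, the evaluation is a plain sum of squared errors, and the candidate grid is arbitrary. All function and variable names are invented for the example.

```python
import math


def synthesize(freq_hz, amplitude, n_samples, sample_rate):
    """Toy synthesizer (assumed): one sinusoid defined by two features."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n_samples)]


def evaluate(target, synthesized):
    """Sum of squared errors between target and synthesized sound."""
    return sum((t - s) ** 2 for t, s in zip(target, synthesized))


def analysis_by_synthesis(target, candidates, sample_rate):
    """Brute-force AbS: synthesize from every candidate and keep the best."""
    best_features = None
    best_error = float("inf")
    for freq_hz, amplitude in candidates:
        synthesized = synthesize(freq_hz, amplitude, len(target), sample_rate)
        error = evaluate(target, synthesized)
        if error < best_error:
            best_features = (freq_hz, amplitude)
            best_error = error
    return best_features, best_error


sample_rate = 8000
target = synthesize(220.0, 0.5, 400, sample_rate)  # stand-in for input speech

# Exhaustive candidate grid: 31 frequencies x 4 amplitudes = 124 syntheses.
candidates = [(f, a)
              for f in range(100, 401, 10)
              for a in (0.25, 0.5, 0.75, 1.0)]

features, error = analysis_by_synthesis(target, candidates, sample_rate)
print(features, error)
```

Every candidate costs one full synthesis and one evaluation, so the cost grows with the size of the grid. This is the expense that hierarchical narrowing or gradient-based updates are meant to reduce.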

CELP in speech coding is one example that adopts AbS.

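History

Speech processing itself has a long history: diverse research on speech analysis and on speech synthesis has existed since before the 20th century^[17].

The history of "speech analysis/synthesis", which treats the analysis and synthesis of speech as one, began with Dudley (1939). That paper showed that speech can be decomposed into fundamental frequency, volume, and the relative intensities of frequency bands, manipulated as needed, and resynthesized.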

Footnotes

^ "SPEECH has been remade ... by analyzing a talker's speech for the fundamental speech information and then using this information to remake the speech with a synthesizing device" (Dudley 1939, p. 169).
^ "Speech analysis/synthesis is ... defined as a mechanism that decomposes speech into some set of speech parameters and generates a waveform from that set of parameters." (Morise 2019; translated from the Japanese original).
^ "A technique that extracts feature parameters by analyzing the speech waveform and then resynthesizes the speech waveform from them (the speech analysis/synthesis method)" (translated from the Japanese original). Quoted from: 発見と発明の日本デジタル博物館. 音声分析合成方式の研究. 卓越研究データベース, registration no. 948. Japan Society for the Promotion of Science (日本学術振興会). Version of 2022-11-28.
^ "Speech analysis/synthesis technology plays the role of a foundational technology that supports a variety of research fields." (Morise 2019; translated from the Japanese original).
^ "The apparatus used has been called a 'vocoder' because it operates on the principle of coding the voice and then reconstructing the voice in accordance with this code." (Dudley 1939, p. 169).
^ "Modernization of speech analysis/synthesis systems ... vocoder technology originating with Dudley" (Itakura 2006; translated from the Japanese original).
^ "such conventional high-quality vocoders as STRAIGHT ... and WORLD" Tachibana, et al. (2018). An Investigation of Noise Shaping with Perceptual Weighting for Wavenet-Based Speech Generation. doi: 10.1109/ICASSP.2018.8461332
^ "In the synthesizer two streams of sound are employed ... first sound streams ... by three properties: ... determined by fundamental frequency of vibration; ... determined by the total sound power; ... determined by the relative amount of sound power in various fixed frequency bands ... second sound stream ... by three properties: ... random frequency components with no true pitch ... determined by total sound power ... determined by the relative power in fixed frequency bands." (Dudley 1939, p. 170).
^ Xavier Serra; Julius O. Smith III (1990). “Spectral Modeling Synthesis: A Sound Analysis/Synthesis System Based on a Deterministic Plus Stochastic Decomposition”. Computer Music Journal 14 (4). doi:10.2307/3680788. JSTOR 3680788.
^ "TANDEM-STRAIGHT is ... a speech analysis/synthesis system." (translated from the Japanese original). STRAIGHT Library. (2013). STRAIGHT Library - Introduction. University of Yamanashi.
^ "WORLD is a speech analysis, transformation, and synthesis system that develops the idea of the vocoder." (translated from the Japanese original). Morise. (2013). WORLD. University of Yamanashi.
^ "the spectrum analysis begins with the separation of the original speech power into frequency bands ... The power selected by the transmitting band filter is rectified to obtain a measure thereof and the resulting current passed through a 25-cycle lowpass filter" (Dudley 1939, p. 174).
^ "The spectrum is measured electrically in the analyzer and the resulting spectrum-defining currents are then passed to the synthesizer where they control the amount of power at the different frequencies" (Dudley 1939, p. 173).
^ "speech-defining signals ... vary at slow rates." (Dudley 1939, p. 176).
^ "they are equivalent to lip and other motions, that is, they are the parametric equivalents of such syllabic motions which contain the real speech message that is impressed upon the cord tone and the breath tone as carriers." (Dudley 1939, p. 176).
^ "In transmission engineering parlance the two streams of sound may be regarded as carriers (complex multi-frequency carriers), and the slow variations as signals. These signals have been impressed on the voiced carrier by both frequency modulation (pitch change) and selective amplitude modulation and on the unvoiced carrier by selective amplitude modulation." (Dudley 1939, p. 176).
^ "speech analysis ... speech synthesis ... these have separately been subjects of study by many workers in a wide variety of fields" (Dudley 1939, p. 169).

References

Homer Dudley (1939). “Remaking Speech”. The Journal of the Acoustical Society of America 11 (2): 169–177. doi:10.1121/1.1916020.
Masanori Morise (森勢将雅) (2019). “話声の合成における基盤技術”. 日本音響学会誌 (The Journal of the Acoustical Society of Japan) 75 (7): 387–392. CRID 1390283659837422336. doi:10.20697/jasj.75.7_387. ISSN 03694232.
Fumitada Itakura (板倉文忠) (2006). “音声分析合成の基礎技術とその音声符号化への応用” (PDF). 電子情報通信学会研資 6: 4–5. CRID 1571980075445130496.
