Not long ago machine learning arrived in speech synthesis, as it has in many other fields. It turned out that many components of the overall system can be replaced with neural networks, which makes it possible not only to approach the quality of existing algorithms but to exceed it significantly.
![](https://habrastorage.org/webt/g2/-j/hw/g2-jhwalba1v-5toodv7dcqxjro.jpeg)
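I decided to try building fully neural-network-based synthesis with my own hands and, at the same time, to share the experience with the community. Read on to see what came of it.
Speech synthesis
Building a speech synthesis system takes a whole team of specialists from different fields. Each field has a host of algorithms and approaches of its own, and doctoral dissertations and thick books have been written describing the basic ones. Let's start with a surface-level understanding of each.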
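Linguistics
- Text normalization. First, all abbreviations, numbers, and dates have to be expanded into words. "The 50s of the 20th century" should become "the fifties of the twentieth century", and an abbreviated address such as "St. Petersburg, Bolshoy pr. P.S." should become "the city of Saint Petersburg, Bolshoy Prospekt of the Petrograd Side". This should happen as naturally as if a person had been asked to read the text aloud.
- Preparing a stress dictionary. Stress can be placed according to the rules of the language. In English it often falls on the first syllable, in Spanish on the penultimate one. These rules have many exceptions that follow no general pattern, and they have to be accounted for. For Russian there are no general rules for stress placement at all, so you cannot get anywhere without a dictionary of words with marked stress.
- Homograph disambiguation. Homographs are words that are spelled the same but pronounced differently. A native speaker easily places the stress in "a lock on the door" and "a castle on the hill" (one and the same spelling in Russian), but "the key to the lock" is a harder case. Homographs cannot be fully resolved without taking context into account.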
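To make the normalization step described above more concrete, here is a deliberately tiny sketch in Python. It is only an illustration under loud assumptions: the abbreviation table is hypothetical, the example is in English rather than Russian, and numbers are spelled digit by digit, which a real normalizer would not do.

```python
import re

# Hypothetical lookup tables; a real normalizer needs far larger ones
# plus grammar-aware number and date expansion.
ABBREVIATIONS = {"st.": "street", "dr.": "doctor"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def spell_digits(match):
    """Spell a number digit by digit (a deliberate simplification)."""
    return " ".join(DIGITS[int(d)] for d in match.group())


def normalize(text):
    """Lowercase, expand known abbreviations, and spell out digits."""
    words = [ABBREVIATIONS.get(w, w) for w in text.lower().split()]
    return re.sub(r"\d+", spell_digits, " ".join(words))


print(normalize("Dr. Smith lives at 42 Main St."))
```

Even this toy version shows why the task is hard: whether "St." means "street" or "saint" depends on context, just like the homographs above.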
Prosody
- Syntagma detection and pause placement. A syntagma is a segment of speech that is relatively complete in meaning. When people speak, they usually insert pauses between such phrases. We need to learn to split text into syntagmas.
- Determining the intonation type. Completion, question, and exclamation are the simplest intonations. Expressing irony, doubt, or enthusiasm is a much harder task.
Phonetics
- Obtaining a transcription. Since we ultimately work with pronunciation rather than spelling, it is logical to use sounds (phonemes) instead of letters (graphemes). Converting a grapheme sequence into phonemes is a separate task made up of many rules and exceptions.
- Computing intonation parameters. At this point we have to decide how pitch and speaking rate will change depending on the pauses placed, the phoneme sequence chosen, and the type of intonation to be expressed. Besides fundamental frequency and rate, there are other parameters you could experiment with for a long time.
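Acoustics
- Selecting sound units. Synthesis systems work with so-called allophones: realizations of a phoneme that depend on its surroundings. Recordings from the training data are cut into pieces according to the phoneme markup and form an allophone database. Each allophone is characterized by a set of parameters such as context (neighboring phonemes), pitch, and duration. The synthesis process itself consists of selecting the sequence of allophones that best fits the current conditions.
- Modification and sound effects. The resulting recording sometimes needs post-processing: special filters that bring the synthesized speech a little closer to a human voice or correct some of its defects.
If it suddenly seems to you that all this can be simplified, worked out in your head, or solved by quickly finding heuristics for the individual modules, imagine that you need to build synthesis for Hindi. If you do not know the language, you cannot even assess the synthesis quality without bringing in someone who knows it well enough. My native language is Russian, and I can hear when a synthesizer gets the stress wrong or speaks with the wrong intonation. At the same time, all synthesized English sounds about the same to me, to say nothing of more exotic languages.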
Implementations
We will try to find an End-2-End (E2E) implementation of synthesis that takes on all the difficulties related to the subtleties of the language. In other words, we want to build a neural-network-based system that takes text as input and produces synthesized speech as output. Is it possible to train a network that would let us replace a whole team of narrow specialists with a team (perhaps even a single person) specializing in machine learning?
Searching Google for "end2end tts" returns a lot of results, headed by the Tacotron implementation from Google itself. The easiest route seemed to me to start from specific people on GitHub who do research in this area and share their implementations of various architectures.
I would single out three:
Look through their repositories: they are a real storehouse of information. There are many architectures and approaches to the E2E synthesis problem. The main ones include:
- Tacotron (versions 1 and 2)
- DeepVoice (versions 1, 2, and 3)
- Char2Wav
- DCTTS
- WaveNet
We have to choose one of these. As the basis for future experiments I chose Deep Convolutional Text-To-Speech (DCTTS) as implemented by Kyubyong Park. The original paper is publicly available. Let's take a closer look at the implementation.
The author published synthesis results for three different datasets at different stages of training. To my taste, as a non-native speaker, they sound quite decent. The last English dataset (Kate Winslet's audiobook) contains only about 5 hours of speech, which suits me well, since my own dataset contains roughly the same amount of data.
Some time after I had trained my system, the repository was updated with a note that the author had successfully trained a model for Korean. This matters a lot: languages can differ greatly, and robustness to the language is a welcome bonus. It suggests that training should not need special treatment for each training set, whether that concerns its language, its voice, or some other characteristic.
Another important point for systems of this kind is training time. On the hardware I have, Tacotron would, by my estimate, train for about two weeks. For early-stage prototyping that seemed too resource-intensive. Of course, nobody has to pedal to keep it running, but building even a basic prototype would take a lot of calendar time. DCTTS, in its final version, trains in a couple of days.
Every researcher has a set of tools they use in their work, chosen to their own taste. I really like PyTorch. Unfortunately, I could not find a DCTTS implementation in it and had to use TensorFlow. Perhaps at some point I will publish my own PyTorch implementation.
Training data
A good dataset is the main guarantee of success when implementing synthesis. Preparing a new voice is a very thorough process. A professional announcer reads prepared phrases for many hours. In each utterance the speaker has to hold all the pauses, speak without jerks or slowdowns, reproduce the correct fundamental-frequency contour, and do all of it with the right intonation. On top of that, not all voices sound equally pleasant.
I had at my disposal a dataset of about 8 hours recorded by a professional announcer. My colleagues and I are currently discussing the possibility of making this voice freely available for non-commercial use. If everything works out, the distribution will include, along with the recordings themselves, the exact text of each one.
Getting started
We want to create a network that takes text as input and produces synthesized sound as output. The abundance of implementations shows that this is possible, though of course with a number of caveats.
The main system parameters are usually called hyperparameters and are kept in a separate file named hparams.py or, as in our case, hyperparams.py. Everything that can be tweaked without touching the main code goes into the hyperparameters, from the log directory to the sizes of the hidden layers. The hyperparameters are then used in the code like this:
from hyperparams import Hyperparams as hp

batch_size = hp.B  # batch size
From here on, every variable with the hp prefix is taken from the hyperparameters file. The assumption is that these parameters do not change during training, so be careful when restarting anything with new parameters.
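For orientation, a hyperparameters file of this kind might look roughly like the sketch below. The attribute names are the ones that appear in the code in this article, but every value and the log directory name are placeholders assumed for illustration, not the settings of the actual implementation.

```python
class Hyperparams:
    """Hypothetical excerpt of hyperparams.py; all values are placeholders."""
    # Signal processing
    sr = 22050         # sampling rate, Hz
    n_fft = 2048       # FFT size
    hop_length = 256   # step between analysis windows, in samples
    win_length = 1024  # analysis window length, in samples
    n_mels = 80        # number of mel bands
    r = 4              # time reduction factor for the mel spectrogram
    # Training
    B = 16             # batch size
    logdir = "logdir/run01"  # assumed directory name for logs


hp = Hyperparams
batch_size = hp.B
frames_per_second = hp.sr / hp.hop_length
print(batch_size, round(frames_per_second, 1))
```

Keeping everything in one class means an experiment can be described by a single file, which is convenient when comparing runs.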
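Text
Text is usually processed with a so-called embedding layer placed first in the network. The idea is simple: it is just a lookup table that maps each character of the alphabet to a vector. During training we fit the optimal values of these vectors, and when synthesizing with the finished model we simply take the values from the table. The same approach is used in the already well-known Word2Vec, which builds vector representations of words.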
For example, take a simple alphabet:
['a', 'b', 'c']
Suppose that during training we found the following optimal values for each character:
{ 'a': [0, 1], 'b': [2, 3], 'c': [4, 5] }
Then for the string aabbcc, after the embedding layer we get the following matrix:
[[0, 1], [0, 1], [2, 3], [2, 3], [4, 5], [4, 5]]
This matrix is then passed on to the other layers, which no longer deal with the notion of a character.
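The lookup from the example above can be reproduced in a few lines of NumPy. This is only a sketch of the idea: in the real network the table is a trainable tensor, while here the values are fixed by hand and the helper names are my own.

```python
import numpy as np

alphabet = ['a', 'b', 'c']
# Character -> row index in the embedding table
char2idx = {ch: i for i, ch in enumerate(alphabet)}
# One row per character; in a real model these values are learned
embedding_table = np.array([[0, 1], [2, 3], [4, 5]])


def embed(text):
    """Replace every character with its vector from the table."""
    indices = [char2idx[ch] for ch in text]
    return embedding_table[indices]


matrix = embed("aabbcc")
print(matrix.tolist())
```

A character missing from `char2idx` raises a `KeyError`, which is exactly the limitation discussed next.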
At this point we meet the first limitation: the set of characters we can send to synthesis is limited. For every character there must be a non-zero number of examples in the training data, preferably in different contexts. This means the alphabet has to be chosen carefully.
In my experiments I settled on this option:
# Alphabet ("E" marks the end of the string)
vocab = "E абвгдеёжзийклмнопрстуфхцчшщъыьэюя-"
This is the Russian alphabet, a hyphen, a space, and an end-of-string marker. There are several important points and assumptions here:
- I did not add punctuation marks to the alphabet. On the one hand, we do not actually pronounce them. On the other hand, punctuation is what we use to split a phrase into parts (syntagmas) separated by pauses. How would the system pronounce "execute cannot pardon", a phrase whose meaning depends entirely on where the pause goes?
- There are no digits in the alphabet. They are expected to be expanded into words before the text is sent to synthesis, that is, during normalization. In general, every E2E architecture I have seen requires normalized text.
- There are no Latin characters in the alphabet. The system will not be able to pronounce English. You can try transliteration and get a strong Russian accent.
- The alphabet contains the letter ё. In the data I trained on, it was placed wherever it belonged, and I decided to leave that as is. However, when I was evaluating the results it turned out that this letter now also has to be placed correctly before text is sent to synthesis; otherwise the system pronounces exactly е rather than ё.
Future versions may pay more attention to each of these points, but for now let's leave everything in this slightly simplified form.
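The practical consequence of these points is that text has to be checked against the alphabet before it reaches the network. A minimal sketch of such a check follows; the alphabet string is assumed to be the Russian alphabet described above with "E" as the end-of-string marker, and the `encode` helper is hypothetical rather than part of the implementation.

```python
# Assumed alphabet: end marker "E", space, Russian letters, hyphen
vocab = "E абвгдеёжзийклмнопрстуфхцчшщъыьэюя-"
char2idx = {ch: i for i, ch in enumerate(vocab)}


def encode(text):
    """Turn text into indices, refusing characters the model never saw."""
    text = text.lower()
    unknown = sorted(set(text) - set(vocab))
    if unknown:
        raise ValueError("characters outside the alphabet: %r" % unknown)
    # Append the end-of-string marker
    return [char2idx[ch] for ch in text] + [char2idx["E"]]


print(encode("да"))
```

Digits, punctuation, and Latin letters all end up in the `ValueError` branch, so they must be handled by normalization beforehand.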
Sound
Almost all systems operate not on the signal itself but on various kinds of spectra computed over windows with a certain hop. I will not go into detail, since there is plenty of literature on the subject, and will focus on implementation and usage. The DCTTS implementation uses two kinds of spectra: the magnitude spectrum and the mel spectrum.
They are computed as follows (this listing and all subsequent code are taken from the DCTTS implementation but modified for clarity):
import librosa
import numpy as np
from hyperparams import Hyperparams as hp

# Load the audio file
y, sr = librosa.load(wavename, sr=hp.sr)
# Trim leading and trailing silence
y, _ = librosa.effects.trim(y)
# Pre-emphasis
y = np.append(y[0], y[1:] - hp.preemphasis * y[:-1])
# Short-time Fourier transform
linear = librosa.stft(y=y, n_fft=hp.n_fft, hop_length=hp.hop_length, win_length=hp.win_length)
# Magnitude spectrogram
mag = np.abs(linear)
# Mel spectrogram
mel_basis = librosa.filters.mel(sr=hp.sr, n_fft=hp.n_fft, n_mels=hp.n_mels)
mel = np.dot(mel_basis, mag)
# Convert to decibels
mel = 20 * np.log10(np.maximum(1e-5, mel))
mag = 20 * np.log10(np.maximum(1e-5, mag))
# Normalize to the range (0, 1]
mel = np.clip((mel - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)
mag = np.clip((mag - hp.ref_db + hp.max_db) / hp.max_db, 1e-8, 1)
# Transpose to (time, frequency) and cast to float32
mel = mel.T.astype(np.float32)
mag = mag.T.astype(np.float32)
# Pad with zeros so that the number of frames is a multiple of hp.r
t = mel.shape[0]
num_paddings = hp.r - (t % hp.r) if t % hp.r != 0 else 0
mel = np.pad(mel, [[0, num_paddings], [0, 0]], mode="constant")
mag = np.pad(mag, [[0, num_paddings], [0, 0]], mode="constant")
# Reduce the time resolution of the mel spectrogram
mel = mel[::hp.r, :]
Almost all E2E synthesis projects use the LibROSA library (https://librosa.github.io/librosa/) for these computations. It contains a lot of useful things, and I recommend looking through the documentation to see what it offers.
Now let's see what the magnitude spectrum looks like for one of the files from the dataset I used:
![](https://habrastorage.org/webt/9_/_k/x8/9__kx8nn97o0ksl3i0iayi6l2ek.png)
This way of presenting windowed spectra is called a spectrogram. Time in seconds is on the horizontal axis and frequency in hertz on the vertical axis. Color encodes the spectral magnitude: the brighter the point, the larger the magnitude.
The mel spectrum is the magnitude spectrum taken on the mel scale with a certain step and window. The number of steps (bands) is set in advance, and most implementations use 80 for synthesis (set by the hp.n_mels parameter). Switching to the mel spectrum greatly reduces the amount of data while preserving the characteristics that matter for a speech signal. The mel spectrogram for the same file looks like this:
![](https://habrastorage.org/webt/kg/lb/gb/kglbgboruly8io9anzhvttcjnku.png)
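To give a feeling for what "taken on the mel scale" means, here is a small sketch that places 80 band centers evenly in mels and converts them back to hertz. It uses the common HTK-style formula as an assumption; LibROSA's default mel filter bank uses a slightly different variant, so the exact numbers will not match the implementation.

```python
import numpy as np


def hz_to_mel(f_hz):
    """HTK-style conversion from hertz to mels."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_band_centers(sr, n_mels):
    """Centers of n_mels bands spaced evenly on the mel scale up to sr / 2."""
    edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    return mel_to_hz(edges[1:-1])


centers = mel_band_centers(sr=22050, n_mels=80)
steps = np.diff(centers)
print(steps[0] < steps[-1])  # bands get wider toward high frequencies
```

The bands are narrow at low frequencies and wide at high ones, which is why 80 values per frame are enough to keep what matters for speech.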
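Note the decimation of the mel spectrogram in time in the last line of the listing. We take only every fourth vector (hp.r == 4), reducing the sampling rate along the time axis accordingly. Speech synthesis then comes down to predicting mel spectra from a sequence of characters. The idea is simple: the less the network has to predict, the better it will cope.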
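The padding and decimation steps are easy to check on a toy array. The sketch below repeats the same arithmetic as the listing, with a made-up 10-frame "spectrogram" of 3 bands and the helper name chosen by me.

```python
import numpy as np


def pad_and_decimate(mel, r):
    """Pad the time axis to a multiple of r, then keep every r-th frame."""
    t = mel.shape[0]
    num_paddings = r - (t % r) if t % r != 0 else 0
    padded = np.pad(mel, [[0, num_paddings], [0, 0]], mode="constant")
    return padded, padded[::r, :]


# Toy input: 10 frames, 3 bands, values 0..29
mel = np.arange(10 * 3, dtype=np.float32).reshape(10, 3)
padded, reduced = pad_and_decimate(mel, r=4)
print(padded.shape, reduced.shape)
```

Ten frames are padded to twelve, and frames 0, 4, and 8 survive, so the network has to predict a sequence four times shorter.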
We can compute a spectrogram from audio, but we cannot listen to one. So we need to restore the signal. For this purpose systems often use the Griffin-Lim algorithm and its more modern variants (for example, RTISILA). The algorithm makes it possible to restore a signal from its magnitude spectra. The implementation I used:
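import copy

import librosa
import numpy as np
from hyperparams import Hyperparams as hp


def griffin_lim(spectrogram, n_iter=hp.n_iter):
    x_best = copy.deepcopy(spectrogram)
    for i in range(n_iter):
        # Go to the time domain and back to get a consistent phase estimate
        x_t = librosa.istft(x_best, hop_length=hp.hop_length, win_length=hp.win_length, window="hann")
        est = librosa.stft(x_t, n_fft=hp.n_fft, hop_length=hp.hop_length, win_length=hp.win_length)
        phase = est / np.maximum(1e-8, np.abs(est))
        # Keep the known magnitudes, take only the estimated phase
        x_best = spectrogram * phase
    x_t = librosa.istft(x_best, hop_length=hp.hop_length, win_length=hp.win_length, window="hann")
    y = np.real(x_t)
    return y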
The signal can then be restored from the magnitude spectrogram as follows (the steps are the reverse of those used to obtain the spectrum):
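import numpy as np
from scipy import signal

# Transpose back to (frequency, time)
mag = mag.T
# De-normalize
mag = (np.clip(mag, 0, 1) * hp.max_db) - hp.max_db + hp.ref_db
# Convert from decibels back to amplitude
mag = np.power(10.0, mag * 0.05)
# Restore the signal
wav = griffin_lim(mag**hp.power)
# De-pre-emphasis
wav = signal.lfilter([1], [1, -hp.preemphasis], wav)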
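The normalization and de-normalization formulas are inverses of each other only inside the clipping range, which is worth seeing once on numbers. In this sketch the reference and maximum levels are assumed values standing in for hp.ref_db and hp.max_db.

```python
import numpy as np

REF_DB = 20.0   # assumed stand-in for hp.ref_db
MAX_DB = 100.0  # assumed stand-in for hp.max_db


def normalize(mag):
    """Amplitude -> decibels -> value clipped to (0, 1]."""
    db = 20 * np.log10(np.maximum(1e-5, mag))
    return np.clip((db - REF_DB + MAX_DB) / MAX_DB, 1e-8, 1)


def denormalize(norm):
    """Value in [0, 1] -> decibels -> amplitude."""
    db = (np.clip(norm, 0, 1) * MAX_DB) - MAX_DB + REF_DB
    return np.power(10.0, db * 0.05)


mag = np.array([0.001, 0.01, 0.1, 1.0, 5.0])
restored = denormalize(normalize(mag))
too_loud = denormalize(normalize(np.array([100.0])))
print(np.allclose(restored, mag), too_loud)
```

With these levels, amplitudes above 10 (that is, above REF_DB decibels) are clipped and come back as 10, so the choice of reference level limits the loudest values that survive the round trip.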
Let's take the magnitude spectrum, restore the signal from it, and then listen.
Original:
Restored signal:
To my taste, the result got worse. The authors of Tacotron (whose first version also uses this algorithm) noted that they used Griffin-Lim as a temporary solution to demonstrate the capabilities of the architecture. WaveNet and similar architectures make it possible to synthesize higher-quality speech, but they are heavier and take considerable effort to train.
Training
DCTTS, which we chose, consists of two practically independent neural networks: Text2Mel and the Spectrogram Super-resolution Network (SSRN).
![](https://habrastorage.org/webt/pk/82/jb/pk82jbp5bnguczow3_ygrqh9gdq.png)
Text2Mel predicts the mel spectrum from text, using an attention mechanism that links two encoders (TextEnc, AudioEnc) and one decoder (AudioDec). Note that Text2Mel restores exactly the decimated mel spectrum.
SSRN restores the full magnitude spectrum from the mel spectrum, filling in the skipped frames and restoring the original sampling rate along the time axis.
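It can help to keep track of the array shapes the two networks are responsible for. The sketch below only does the bookkeeping implied by the preprocessing above; the FFT size is an assumed value, and the function name is my own.

```python
def spectrogram_shapes(num_frames, n_fft, n_mels, r):
    """Target shapes for Text2Mel and SSRN given the number of STFT frames."""
    # Frames are padded up to a multiple of r before decimation
    padded = -(-num_frames // r) * r
    text2mel_target = (padded // r, n_mels)   # decimated mel spectrogram
    ssrn_target = (padded, 1 + n_fft // 2)    # full magnitude spectrogram
    return text2mel_target, ssrn_target


# Assumed example: 810 frames, n_fft = 2048, 80 mel bands, r = 4
text2mel_target, ssrn_target = spectrogram_shapes(810, 2048, 80, 4)
print(text2mel_target, ssrn_target)
```

So Text2Mel works with a short and narrow target, and SSRN has to produce four times as many frames with many more frequency bins in each.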
The sequence of computations is described in detail in the original paper. Besides, the source code of the implementation is available, so you can always debug it and dig into the subtleties. Note that the author of the implementation departed from the paper in a few places. I would highlight two points:
- additional normalization layers were added, without which, according to the author, nothing worked;
- the implementation uses dropout for better regularization, which the paper does not have.
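I took the voice dataset with its 8 hours of recordings (several thousand files) and kept only the recordings that satisfy these conditions:
- the text contains only letters, spaces, and hyphens;
- the text length does not exceed hp.max_N;
- the length of the mel spectrogram after decimation does not exceed hp.max_T.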
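The three conditions above translate into a short filter function. This is a sketch only: the limits are assumed placeholder values for hp.max_N and hp.max_T, the reduction factor mirrors hp.r == 4, and `keep_record` is a hypothetical helper, not a function from the implementation.

```python
import re

MAX_N = 180  # assumed stand-in for hp.max_N (characters)
MAX_T = 210  # assumed stand-in for hp.max_T (decimated mel frames)
R = 4        # reduction factor, as in hp.r

# Russian letters (including ё), spaces, and hyphens only
ALLOWED = re.compile(r"^[а-яё \-]+$")


def keep_record(text, num_mel_frames):
    """Return True if a recording passes all three conditions."""
    text = text.lower()
    if not ALLOWED.match(text):
        return False
    if len(text) > MAX_N:
        return False
    reduced_frames = -(-num_mel_frames // R)  # ceiling division
    return reduced_frames <= MAX_T


print(keep_record("привет мир", 400), keep_record("в 5 часов", 400))
```

Filtering like this is cheap, and it explains the drop from 8 hours of recordings to a smaller usable set.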
That left me with a little over 5 hours. I computed the required spectra for all recordings and started training Text2Mel and SSRN in turn. All of this is done quite simply:
$ python prepro.py
$ python train.py 1
$ python train.py 2
Note that in the original repository prepro.py is called prepo.py. My inner perfectionist could not put up with that, so I renamed it.
DCTTS contains only convolutional layers and, unlike RNN-based implementations such as Tacotron, trains much faster.
On my machine with an Intel Core i5-4670, 16 GB of RAM, and a GeForce 1080, 50 thousand steps of Text2Mel take 15 hours and 75 thousand steps of SSRN take 5 hours. The time per thousand steps barely changed during training, so it is easy to estimate how long training for a larger number of steps would take.
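Since the time per thousand steps stays roughly constant, the estimate is a simple proportion. The sketch below uses the timings quoted above; the 200 thousand step target is an arbitrary number chosen for illustration.

```python
def hours_for_steps(target_steps, measured_steps, measured_hours):
    """Linear extrapolation, assuming a constant time per step."""
    return target_steps * measured_hours / measured_steps


# Measured on the machine described above
text2mel_per_1k = hours_for_steps(1_000, 50_000, 15)  # hours per 1000 steps
ssrn_per_1k = hours_for_steps(1_000, 75_000, 5)

# Hypothetical longer run of Text2Mel
text2mel_200k = hours_for_steps(200_000, 50_000, 15)
print(text2mel_per_1k, round(ssrn_per_1k, 3), text2mel_200k)
```

Text2Mel turns out to be several times more expensive per step than SSRN on this hardware, so it dominates the total training time.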
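The batch size can be adjusted with hp.B. From time to time training crashed with an out-of-memory error, so I simply halved the batch size and restarted training from scratch. I believe the problem lies somewhere in the guts of TensorFlow (I was not using the latest version) and in the intricacies of its batching implementation. I did not investigate further, since at a value of 8 the crashes stopped.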
Result
Once the models are trained, we can finally run synthesis. To do this, fill a file with phrases and run:
$ python synthesize.py
I slightly modified the implementation so that it generates phrases from a file of my choice.
The results are saved as WAV files in the samples directory. Here are examples of the synthesis produced by the system I ended up with:
Conclusions and remarks
The result exceeded my personal expectations for quality. The system places stress, the speech is intelligible, and the voice is recognizable. Overall, not bad for a first version, especially considering that only 5 hours of training data were used.
Questions remain about how controllable such synthesis is. So far it is impossible even to correct the stress in a word if it is wrong. We are rigidly bound to the maximum phrase length and the size of the mel spectrogram. There is no way to control intonation or playback speed.
I did not publish my changes to the original implementation's code. They concerned only the loading of training data and of phrases for synthesis with the finished system, as well as the hyperparameter values: the alphabet (hp.vocab) and the batch size (hp.B). The rest of the implementation is unchanged.
In this story I did not touch at all on putting such systems into production, and fully E2E speech synthesis systems are still very far from that. I used a GPU with CUDA, and even so everything runs slower than real time. On a CPU it is just indecently slow.
All these issues will be addressed by large companies and the research community in the coming years. I am sure it will be very interesting.