Turns out WaveNet is autoregressive, so it needs the previous step to predict the next one.
Quite obvious when you think about how people speak sentences: they hardly ever stop in the middle of a word.
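To make the "needs previous step" point concrete, here is a toy sketch of sequential generation. It is not WaveNet itself: `generate` and `toy_predict` are made-up names, and the predictor is a trivial stand-in for the network. The only point is that step t cannot start until step t-1 is done.

```python
def generate(predict_next, seed, n_steps):
    """Produce n_steps samples one at a time; each call sees all earlier samples."""
    samples = list(seed)
    for _ in range(n_steps):
        # The next sample depends on everything generated so far,
        # so this loop cannot be parallelised across time steps.
        samples.append(predict_next(samples))
    return samples[len(seed):]


def toy_predict(history):
    """Hypothetical stand-in for the network: average of the last two samples."""
    return 0.5 * (history[-1] + history[-2])


out = generate(toy_predict, seed=[0.0, 1.0], n_steps=3)
print(out)
```

Cutting a sentence in the middle means the second half starts without this history, which is why the cut point matters.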
Need to think about how to proceed next and how to cut sentences.
Looked at Deep Voice 3 and some accent models from Baidu.
I can generate 8 sentences at a time, and each takes 8 minutes, so if I cut between words and get the last mels between words right, I can get it down to 1 minute. But I need to store the model somewhere.
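A rough sketch of the cutting idea, assuming per-frame energy of the mel spectrogram is a usable proxy for gaps between words (an assumption, not something verified here). `split_at_quiet_frames` and its `search` parameter are hypothetical names. The 1 minute figure also assumes generation time scales linearly with length and that the 8 chunks fit in one batch: 8 min / 8 chunks = 1 min.

```python
def split_at_quiet_frames(frame_energy, n_chunks, search=5):
    """Return (start, end) frame ranges, cutting at the quietest frame
    within `search` frames of each evenly spaced split point."""
    n = len(frame_energy)
    cuts = [0]
    for k in range(1, n_chunks):
        target = k * n // n_chunks
        lo = max(cuts[-1] + 1, target - search)
        hi = min(n - 1, target + search)
        if lo > hi:
            continue  # no room left for another cut
        # Quietest frame in the window is the best guess for a word gap.
        best = min(range(lo, hi + 1), key=lambda i: frame_energy[i])
        cuts.append(best)
    cuts.append(n)
    return list(zip(cuts[:-1], cuts[1:]))


# Toy energies: silence (0) at frames 3 and 8.
energy = [5, 5, 5, 0, 5, 5, 5, 5, 0, 5, 5, 5]
chunks = split_at_quiet_frames(energy, n_chunks=3, search=2)
print(chunks)
```

This only picks the cut points. Each chunk would still need the right context at its start (the "last mels" before the cut), which this sketch does not handle.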

I've forgotten my machine learning and speech synthesis skills from a previous life, time to load more skills ...

