Anyone tried converting speech waveforms to some type of image and then using those as training data for a stable diffusion model?

Hypothetically it should generate "ultrarealistic" waveforms for phonemes, in any given style of voice. The training labels are naturally the words or phonemes themselves, in text format (well, embedding vectors, really).

After that it's a matter of testing text-to-image, which should generate the relevant phonemes as images of waveforms (or your given visual representation, however you choose to pack it)
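As a minimal sketch of the "waveform as image" packing step, here is one way to turn a clip of audio into a fixed-size 2-D array that an image model could train on. The STFT-based spectrogram, the 512x512 target size, and the log scaling are all assumptions for illustration, not a tested recipe:

```python
# Sketch: pack a speech waveform into a fixed-size 2-D "image" for an
# image-based diffusion model. Uses scipy's STFT; sizes are arbitrary.
import numpy as np
from scipy.signal import stft

def waveform_to_image(samples, sample_rate=16000, size=512):
    """Return a size x size uint8 log-magnitude spectrogram image."""
    # nperseg = 2*size - 2 yields exactly `size` frequency bins
    _, _, z = stft(samples, fs=sample_rate, nperseg=2 * size - 2)
    mag = np.log1p(np.abs(z))        # compress dynamic range
    mag = mag[:size, :]              # keep the lowest `size` frequency bins
    # crop or zero-pad the time axis to `size` frames
    if mag.shape[1] >= size:
        mag = mag[:, :size]
    else:
        mag = np.pad(mag, ((0, 0), (0, size - mag.shape[1])))
    mag /= mag.max() + 1e-9          # normalize to [0, 1]
    return (mag * 255).astype(np.uint8)

# one second of noise as a stand-in for real speech
img = waveform_to_image(np.random.randn(16000))
print(img.shape)  # (512, 512)
```

Going the other way (image back to audio) is the harder half: a log-magnitude spectrogram discards phase, so reconstruction would need something like Griffin-Lim or a vocoder.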

I would have tried this myself but I only have 3 GB of VRAM.

Even rudimentary voice generation that produces recognizable words from text input would be interesting to see implemented, and maybe a first for SD.

In other news:
Implementing a SQL backend for an identity explorer. Basically the system generates sets of values for given known identities, and stores the formulas as strings, along with the values.
For any given test set of values we can then cross-reference to look up equivalent identities, and test whether those same identities hold for other test sets of actual variable values. If not, the identity string can be removed, or moved elsewhere in the database for further exploration and experimentation.
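A minimal sketch of that cross-referencing step, using sqlite3 with made-up table and column names: formulas are stored as strings alongside the value each produced on a test set, and a self-join surfaces pairs that agree (candidate identities to verify on further test sets).

```python
# Sketch of the identity store: formulas as strings plus the value each
# produced for a test set; equal values on the same set are candidates.
# Schema and names are hypothetical, for illustration only.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results (
    formula  TEXT,     -- formula as a string, e.g. '(a+b)**2'
    test_set INTEGER,  -- which set of variable values was used
    value    REAL      -- value the formula produced
)""")

# two formulas that agree on test set 1 (a=2, b=3), plus one that doesn't
rows = [("(a+b)**2", 1, 25.0),
        ("a**2 + 2*a*b + b**2", 1, 25.0),
        ("a*b", 1, 6.0)]
con.executemany("INSERT INTO results VALUES (?, ?, ?)", rows)

# candidate identities: distinct formula pairs with equal value on a test set
candidates = con.execute("""
    SELECT r1.formula, r2.formula
    FROM results r1
    JOIN results r2
      ON r1.test_set = r2.test_set
     AND r1.value    = r2.value
     AND r1.formula  < r2.formula
""").fetchall()
print(candidates)  # [('(a+b)**2', 'a**2 + 2*a*b + b**2')]
```

Candidates that survive re-testing against fresh value sets stay in; the rest get removed or parked in another table.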

I'm hoping that by doing this I can somewhat automate the process of finding identities. Right now I rely on logs and the OS's built-in text search for a test value, then look that value up in whatever files show up and cross-reference the logged equations that produced it.

I was even considering processing the logs of equations and identities as some form of training data perhaps for a ML system that generates plausible new identities but that's a little outside my reach I think.

Finally, now that I know the new modular function converts semiprimes into numbers with larger factor trees, I'm thinking of writing a visual browser that maps the connections from factor tree to factor tree, making them expandable and collapsible, and allowing me to adjust the formula and regenerate trees on the fly.
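For the browser, the underlying data could be as simple as a recursive factor tree, here sketched as nested tuples (prime leaves, composite nodes). Trial division is fine for small numbers; this is just the structure an expand/collapse UI would walk, not the modular function itself:

```python
# Sketch: a factor tree as nested tuples - a prime is a leaf, a composite
# is (n, left_subtree, right_subtree). Trial division only; small n.
def smallest_factor(n):
    """Smallest prime factor of n (n itself if n is prime)."""
    i = 2
    while i * i <= n:
        if n % i == 0:
            return i
        i += 1
    return n

def factor_tree(n):
    """Return n as a leaf if prime, else split into two subtrees."""
    f = smallest_factor(n)
    if f == n:
        return n
    return (n, factor_tree(f), factor_tree(n // f))

print(factor_tree(12))  # (12, 2, (6, 2, 3))
```

Linking trees in the browser would then just mean drawing an edge from one tree's root to the root the formula maps it to, and regenerating subtrees lazily as nodes are expanded.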

  • 2
I know google boo boo baaa but can't you just use the Google Colab stuff to have more vram? It sounds like an interesting idea.
  • 2
@jonas-w not familiar with Google Colab. At all.

Edit: it's not that I can't google it (I can and will) but there's a TON of stuff out there and platforms and libraries and details that are entirely over my head.

    I assume it's a cloud host but that's all I think I know.
  • 4
    Why converting to image instead of using the original waveform data?
  • 1
    @iiii stable diffusion trains on images only (as far as I know)

Only thing is waveforms tend to be really dense (so the size of the training data per sample would be fairly large). I'd still try the naive approach first just to see if anything even *resembling* speech is produced. And if not, or if the original format is too dense, some other visual packing would be needed other than waveforms converted to straight images.
  • 1
    @Wisecrack https://colab.research.google.com/g...

    Google colab uses these python notebook things and I think to some extent it is completely free.

This above is stable diffusion as a Google Colab (never used it tho as I have enough vram)
  • 1
What kind of waveform are you talking about? How about a spectrogram? That should show vowels quite clearly.
  • 1
I used CLIP + VQGAN some time ago (before stable diffusion etc) as a Google Colab and it worked relatively well, but I never wrote my own Colab python scripts.