Speech to music: speak a sentence and convert it to music… The basic pitch and the formant frequencies of the spoken sentence “President Hosni Mubarak resigned” translated to the equal tempered scale (with the F1 and F2 frequencies each lowered by 3 octaves to keep the score readable :)).
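In case anyone wants to reproduce the translation: mapping a frequency to the nearest equal-tempered note, with an optional octave drop, can be sketched like this in Python (a minimal sketch, not the actual script I used; the function name and example frequencies are just illustrative):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def freq_to_note(freq_hz, octave_shift=0):
    """Name of the nearest equal-tempered note for a frequency.

    octave_shift lowers the pitch by that many octaves first
    (e.g. -3 to bring the formant frequencies into a readable range).
    """
    shifted = freq_hz * 2.0 ** octave_shift
    # MIDI convention: A4 = 440 Hz = note 69, 12 semitones per octave
    midi = round(69 + 12 * math.log2(shifted / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

print(freq_to_note(440.0))        # A4
print(freq_to_note(2400.0, -3))   # 2400 Hz dropped by 3 octaves -> D4
```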
Too bad that the synthesized sound is not as enjoyable as the sense of the sentence (which is why I didn’t add any midi/wav file to this post..). But I’m still working on it :). There are still too many things I haven’t taken into account (intensity, rhythm, …). Ah, and concerning the tempo… it seems I still have to do something about that as well…
A very short analysis.
I find it exciting that the note score shows very clearly which of the notes represent the sibilants (like s) and the nasals (consonants like m or n). How is that? The sibilants generally have a very high F1 frequency, whereas the nasals have a relatively low F1 frequency (see here for a table and a vowel-to-score translation of some of the main vowels). Both can be identified fairly easily in the F1 bar displayed in the picture. However, the i-vowel, at first sight, seems identical to the sibilants: both are marked by a high F1 frequency. Nevertheless, we can still easily distinguish the two sounds. Vowels are usually voiced, whereas s, for example, is not (z, on the other hand, is voiced). And if a sound is voiced, then there is a basic pitch, which is not the case for the unvoiced sounds. And there we go: a quick look at the score shows that the notes denoting i do in fact also have a basic pitch, whereas the s sounds generally don’t (unless they are preceded by a vowel or pronounced more like a z, as for instance in resigned).
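The voiced/unvoiced distinction can even be checked automatically. Here is a rough sketch of the idea (not the analysis tool I actually used; the threshold and frequency range are arbitrary assumptions): a voiced frame shows a strong autocorrelation peak at its pitch period, while an unvoiced s-like frame does not.

```python
import numpy as np

def is_voiced(frame, sr, fmin=60.0, fmax=400.0, threshold=0.3):
    """Crude voiced/unvoiced decision via normalized autocorrelation.

    A voiced frame (vowel, nasal, or a voiced z) has a strong
    autocorrelation peak at its pitch period; an unvoiced s does not.
    threshold is an assumed tuning value, not a calibrated one.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] == 0:
        return False
    ac = ac / ac[0]                      # normalize: ac[0] == 1
    lo, hi = int(sr / fmax), int(sr / fmin)  # plausible pitch-period lags
    return ac[lo:hi].max() > threshold

sr = 16000
t = np.arange(int(0.03 * sr)) / sr
vowel_like = np.sin(2 * np.pi * 150 * t)      # periodic, like a voiced vowel
rng = np.random.default_rng(0)
s_like = rng.standard_normal(t.size)          # noise-like, like an unvoiced s
print(is_voiced(vowel_like, sr))  # True
print(is_voiced(s_like, sr))      # False
```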
To sum up, as far as the formants are concerned, I don’t have a very clear idea yet whether to take them into account for the speech-to-music translation. At least for the melody, they don’t seem to do much: the basic melody is set by the basic pitch.
What makes the formants so interesting and difficult at the same time is that they are generally not harmonic to the basic pitch. Instruments produce overtones that always land on notes of the same pitch class some n octaves higher (in music, the formula Pitch*(2^n) gives you those octave overtones for each tone). In speech, however, as we can read from the note score, this is not at all the case. And here this is even a crucial characteristic, since the F1 and F2 formants are the numbers that give each of the vowels its own timbre, so that we are actually able to disambiguate them. Hence, in speech the formants must be different from the basic pitch.
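To see just how far off the Pitch*(2^n) octave ladder the formants sit, one can measure the distance in cents (the pitch and the F1/F2 values below are rough, illustrative numbers, not taken from the actual recording):

```python
import math

def cents_off_octave_grid(f0, freq):
    """Distance in cents from freq to the nearest f0 * 2**n (n integer),
    i.e. to the octave ladder that Pitch*(2^n) overtones sit on."""
    x = math.log2(freq / f0)
    return 1200 * (x - round(x))

f0 = 120.0  # assumed speaking pitch, just for illustration
for label, f in [("2*f0 (one octave up)", 240.0),
                 ("8*f0 (three octaves up)", 960.0),
                 ("F1 of an /i/-like vowel, ~280 Hz", 280.0),
                 ("F2 of an /i/-like vowel, ~2250 Hz", 2250.0)]:
    print(f"{label}: {cents_off_octave_grid(f0, f):+.1f} cents")
```

The octave overtones come out at exactly 0 cents, while the formant values land more than a quarter tone squared away from the grid, which is why they sound inharmonic next to the basic pitch.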
So, initially, I thought that the F1 and F2 numbers could perhaps be represented together with the basic pitch as chords, or harmonics. But that hasn’t worked out very well yet: the formants seem to make the melody extremely odd rather than support it. But perhaps that is exactly what makes speech so interesting :).
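For the record, the chord idea itself is simple enough. A small sketch of what I mean (hypothetical frame values; the formants get the same -3 octave shift as in the score above):

```python
import math

def to_midi(freq_hz, octave_shift=0):
    """Nearest equal-tempered MIDI note (A4 = 440 Hz = note 69)."""
    return round(69 + 12 * math.log2(freq_hz * 2.0 ** octave_shift / 440.0))

def speech_chord(f0, f1, f2, formant_shift=-3):
    """One chord per analysis frame: the basic pitch plus F1 and F2,
    with the formants dropped a few octaves into a playable range."""
    return sorted({to_midi(f0),
                   to_midi(f1, formant_shift),
                   to_midi(f2, formant_shift)})

# Illustrative frame values for an /i/-like vowel:
print(speech_chord(f0=120.0, f1=280.0, f2=2250.0))  # [25, 47, 61]
```

Played together, the three MIDI notes form exactly the kind of odd-sounding chord I was describing: the two formant notes don’t sit at harmonic intervals above the pitch note.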
FYI: the sentence above (1.95 seconds) was extracted from the program All Things Considered by NPR on January 11. Listen here to the original sequence:
Interesting link: http://www.newscientist.com/article/dn4031