The Vision
ChromoVOX will represent the next revolution in voice processing. Moving beyond mere pitch-correction, this technology will intelligently improve the tonal characteristics of the human voice. Quite simply, it will transform mediocre vocals into great vocals. As with H(ω)Tone, ChromoVOX will improve the tonality of a recorded track; the difference is that it will perform the onerous task of determining the unique tonal characteristics of one voice and applying them to another voice.
The Challenge
Most musical instruments have consistent tonal characteristics over their playable range, making a single played instance of a note or chord a sufficient reference for generating an impulse response. Even a guitar, each of whose strings produces notes with a unique timbre, can have its overall tonal characteristics represented by a single reference clip, so long as the reference and subject clips contain the same note or chord played on the same string(s). A drum kit is an exception, but only in that its individual pieces must be treated as separate instruments, each with its own reference clip.
In stark contrast, every syllable, every vowel and consonant rendered by the human voice must be treated as a separate musical instrument, thus necessitating a different impulse response for each instance. ChromoVOX must be able to intelligently detect and adapt to every distinct vocal sound or phone throughout an entire recorded track.
The Proof
H(ω)Tone can be used to demonstrate the underlying concept to be implemented and expanded in ChromoVOX. Although H(ω)Tone was not designed for voice processing, it can still be used to process vocals under certain conditions. In fact, vocals processed with H(ω)Tone were used in one song from the Electric Fez: Project A’ EP. (See also the EP’s Track Credits.)
With H(ω)Tone, we derive a transfer function from a reference track and then apply its corresponding equalization profile to a subject track, giving it the tonal characteristics of the reference track. For instruments, typically only one such profile needs to be generated. For vocals, we need to derive a transfer function for every phone.
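For readers who want a concrete picture, here is a rough Python sketch of the general idea only; it is not H(ω)Tone's actual code. It estimates a magnitude transfer curve as the ratio of the reference and subject spectra over matched windows. The function name and the use of Welch averaging are our own illustrative choices.

```python
import numpy as np
from scipy.signal import welch

def transfer_curve(ref_window, sub_window, nperseg=4096, eps=1e-12):
    """Estimate a magnitude transfer curve that would give the subject
    window the spectral balance of the reference window.
    Illustrative sketch only; not H(w)Tone's published algorithm."""
    # Welch averaging smooths the spectra over the whole window
    _, ref_psd = welch(ref_window, nperseg=nperseg)
    _, sub_psd = welch(sub_window, nperseg=nperseg)
    # Ratio of magnitude spectra (PSD is power, hence the square root)
    return np.sqrt(ref_psd / (sub_psd + eps))
```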
The example below illustrates the considerable challenge this poses for the developer (and what makes vocals more compelling and appealing to the listener than instruments). To process an audio clip of the monosyllabic word “Yeah” sung melismatically over two notes, one must split the clip into three sub-clips: the first containing the consonant “y”, the second containing the short “e” vowel, and the third containing the short “a” vowel. Note that the reference singer (who sings very well) and the subject singer (who does not) are singing “Yeah” in an almost identical manner. In this example, the subject consonant was not processed but was instead spliced directly to the processed subject clips.
The start points and the sample window are selected for the short “e” vowel section of the reference and subject clips. Once these parameters are set, we generate a transfer function that represents the short “e” vowel.
The parameters used were as follows:
Ref. Time index=25992 samples
Sub. Time index=71048 samples
Window Size=21463 samples
Zeros=1024, Poles=0
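Purely as an illustration of how these parameters might be used (the stand-in arrays and the FIR-design step below are our assumptions, not H(ω)Tone's actual implementation), the matched “e” windows could be extracted and the transfer curve from the earlier sketch realized as a 1024-zero FIR filter:

```python
import numpy as np
from scipy.signal import firwin2

# Stand-ins for the real recordings (mono float arrays). In practice these
# would be the reference and subject vocal tracks loaded from disk.
rng = np.random.default_rng(0)
reference_track = rng.standard_normal(200000)
subject_track = rng.standard_normal(200000)

# Parameters listed above for the short "e" vowel (all values in samples)
REF_START, SUB_START, WINDOW = 25992, 71048, 21463
N_ZEROS = 1024  # "Zeros=1024, Poles=0" suggests an all-zero (FIR) model

# Matched "short e" windows from the reference and subject clips
ref_e = reference_track[REF_START:REF_START + WINDOW]
sub_e = subject_track[SUB_START:SUB_START + WINDOW]

# Transfer curve (sketch above), realized as a 1024-zero FIR filter whose
# frequency response approximates that curve
curve_e = transfer_curve(ref_e, sub_e)
freqs = np.linspace(0.0, 1.0, len(curve_e))   # 0 .. Nyquist, normalized
fir_e = firwin2(N_ZEROS + 1, freqs, curve_e)  # 1024 zeros -> 1025 taps
```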
The start points and the sample window are selected for the short “a” vowel section of the reference and subject clips. Once these parameters are set, we generate a transfer function that represents the short “a” vowel.
The parameters used were as follows:
Ref. Time index=69374 samples
Sub. Time index=119166 samples
Window Size=21463 samples
Zeros=1024, Poles=0
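The same illustrative steps apply to the “a” vowel, using its own start points:

```python
# Same sketch as for the "e" vowel, with the "a" vowel's parameters above
ref_a = reference_track[69374:69374 + 21463]
sub_a = subject_track[119166:119166 + 21463]
curve_a = transfer_curve(ref_a, sub_a)
fir_a = firwin2(N_ZEROS + 1, np.linspace(0.0, 1.0, len(curve_a)), curve_a)
```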
The derived transfer curves, one for the “e” and one for the “a”, are each applied to the same subject clip using convolution, resulting in two processed subject vocal clips. Two transfer curves are needed in this instance because the curve derived from the “e” vowel cannot be applied to the “a” vowel, and vice versa.
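In the sketch, applying each curve by convolution could look like the following (using SciPy's fftconvolve; again, not H(ω)Tone's actual code):

```python
from scipy.signal import fftconvolve

# Convolve the subject clip with each FIR filter; this yields two processed
# copies of the clip, one carrying the reference's "e" tonality and one
# carrying its "a" tonality.
processed_e = fftconvolve(subject_track, fir_e, mode="same")
processed_a = fftconvolve(subject_track, fir_a, mode="same")
```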
Finally, the two convolved vowel clips (“e” and “a”), along with the unprocessed consonant “y”, are spliced together; a sketch of this splice follows the clips below. The length of the spliced clip was intentionally made a little shorter so that it stays in sync with the song into which it was mixed. (The processing used in H(ω)Tone does not change the timing of the clips.) We can compare the subject clip to the reference clip below:
Subject clip (the sound we currently have)
Reference clip (the sound we want)
Processed and spliced subject clip (the result)
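Continuing the sketch, the splice described above might be expressed as follows; the segment boundaries are hypothetical, inferred from the start points listed earlier:

```python
# Hypothetical segment boundaries (samples), inferred from the start points
Y_END = 71048            # end of the unprocessed "y" / start of the "e" vowel
E_END = 119166           # end of the "e" vowel / start of the "a" vowel
A_END = 119166 + 21463   # end of the "a" vowel

spliced = np.concatenate([
    subject_track[:Y_END],       # unprocessed "y" consonant, used as-is
    processed_e[Y_END:E_END],    # "e" section taken from the e-curve copy
    processed_a[E_END:A_END],    # "a" section taken from the a-curve copy
])
```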
From this example, we observe that even though both singers are singing the same vocal line in essentially the same manner, the vowels as originally rendered do not have exactly the same quality. For example, one singer’s “a” is a little more rounded than the other’s. But this does not seem to affect the overall result, since any nuances in the reference vocals are captured and then applied, along with their better sound, to the subject vocals.
The Call
We are currently seeking collaborators and funding partners to develop, market, and sell ChromoVOX. Interested parties can refer to the Contact page.