7 mouth frames seems like overkill; perhaps 3 are enough (closed, mid-way, open) for normal speech, with maybe a wide-open frame for pain or shouting.
Ideally you'd want frames for lip-spreading vowels (like English /i/ or Japanese /u/), but that would require very sophisticated analysis of the sound file.
An alternative to real-time analysis is to run the analysis offline in a separate program that writes its results to a file accompanying each sound file. I think that approach could be handled entirely within cgame.
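To make the offline idea concrete, here is a rough sketch of what the separate analysis program might compute: one mouth frame index per fixed-size window of 16-bit PCM, chosen from the window's loudness. Everything in it is an assumption on my part, not anything from the Q3 source: the MOUTH_* names, the function names, the window size, and especially the thresholds, which would need tuning by ear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mouth frame indices (not from the Q3 source). */
enum { MOUTH_CLOSED, MOUTH_MID, MOUTH_OPEN, MOUTH_WIDE };

/* Pick a mouth frame from the mean-square level of one window of
   16-bit PCM. The thresholds are guesses (they correspond to RMS
   levels of roughly 0.02, 0.15 and 0.5 of full scale). */
static int MouthFrameForWindow(const int16_t *samples, size_t count) {
    double sum = 0.0;
    double meanSquare;
    size_t i;

    if (count == 0) {
        return MOUTH_CLOSED;
    }
    for (i = 0; i < count; i++) {
        double s = samples[i] / 32768.0;   /* normalise to [-1, 1) */
        sum += s * s;
    }
    meanSquare = sum / (double)count;

    if (meanSquare < 0.0004) return MOUTH_CLOSED;
    if (meanSquare < 0.0225) return MOUTH_MID;
    if (meanSquare < 0.25)   return MOUTH_OPEN;
    return MOUTH_WIDE;                     /* pain or shouting */
}

/* Write one frame index per window into out[] and return how many
   were written. The last window may be shorter than windowSize. */
static size_t AnalyzeMouthFrames(const int16_t *samples, size_t numSamples,
                                 size_t windowSize,
                                 unsigned char *out, size_t outMax) {
    size_t n = 0;
    size_t pos;

    assert(samples != NULL && out != NULL);
    if (windowSize == 0) {
        return 0;
    }
    for (pos = 0; pos < numSamples && n < outMax; pos += windowSize) {
        size_t len = numSamples - pos;
        if (len > windowSize) {
            len = windowSize;
        }
        out[n++] = (unsigned char)MouthFrameForWindow(samples + pos, len);
    }
    return n;
}
```

The out[] array is what would get written to the file that ships with each sound; at playback time the game would only have to index into it by elapsed time, so no audio analysis happens while the game is running.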
Q3 does have CHAN_VOICE:
// sound channels
// channel 0 never willingly overrides
// other channels will always override a playing sound on that channel
typedef enum {
    CHAN_AUTO,
    CHAN_LOCAL,         // menu sounds, etc
    CHAN_WEAPON,
    CHAN_VOICE,
    CHAN_ITEM,
    CHAN_BODY,
    CHAN_LOCAL_SOUND,   // chat messages, etc
    CHAN_ANNOUNCER      // announcer voices, etc
} soundChannel_t;