UTK

EA MicroTalk (also UTalk or UTK) is a linear-predictive speech codec used in various games by Electronic Arts. The earliest known game to use it is Beasts & Bumpkins (1997). The codec has a bandwidth of 11.025 kHz (sampling rate 22.05 kHz) and a frame size of roughly 20 ms (432 samples), and it supports mono only. It is typically encoded at 32 kbit/s.
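As a quick check of these numbers, the Python sketch below computes the exact frame duration (432 samples at 22050 Hz, which the text rounds to 20 ms) and the average compressed size of one frame at the nominal 32 kbit/s. The helper names are illustrative only, and the bits-per-frame figure is an average; individual frames may be larger or smaller.

```python
# Nominal MicroTalk parameters, taken from the description above.
SAMPLE_RATE_HZ = 22050
SAMPLES_PER_FRAME = 432


def frame_duration_ms(samples=SAMPLES_PER_FRAME, rate=SAMPLE_RATE_HZ):
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples / rate


def average_bits_per_frame(bitrate_bps, samples=SAMPLES_PER_FRAME, rate=SAMPLE_RATE_HZ):
    """Average number of coded bits per frame at the given bitrate."""
    return bitrate_bps * samples / rate


print(f"{frame_duration_ms():.2f} ms")               # 19.59 ms
print(f"{average_bits_per_frame(32000):.1f} bits")   # 626.9 bits
```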

The codec was reverse-engineered by trass3r on August 05, 2010 and was later analyzed by Andrew D'Addesio on January 12, 2012.

MicroTalk has been seen in several containers, depending on the game:
 * PT/.m10 (Beasts and Bumpkins)
 * Maxis UTK (The Sims Online, SimCity 4)
 * SCxl (FIFA 2001 PS2, FIFA 2002 PS2)

All code and text content on this page is public domain and licensed under the UNLICENSE.

For decoders covering all of the above games, see https://github.com/daddesio/utkencode.

Variants
There are a few variants of MicroTalk:
 * MicroTalk 10:1 (codec_id=4 in EA SCxl) and MicroTalk 5:1 (codec_id=22) are actually the same codec; only the compression parameters chosen by the encoder differ (reduced_bw is enabled in 10:1 and disabled in 5:1). MicroTalk 10:1 is typically encoded at 32 kbit/s and 5:1 at 64 kbit/s.
 * MicroTalk Revision 3 (revision=3 in SCxl) is a variant which supports raw PCM samples. (See Revision 3.)
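The 10:1 and 5:1 names line up roughly with the ratio of the decoded 16-bit mono PCM bitrate (22050 * 16 = 352,800 bit/s) to the typical coded bitrates. The sketch below shows that arithmetic; reading the names this way is our interpretation and is not documented by EA.

```python
# Bitrate of the decoded output: 22050 samples/s * 16 bits * 1 channel.
PCM_BITRATE_BPS = 22050 * 16 * 1  # 352800 bit/s


def compression_ratio(coded_bitrate_bps):
    """Ratio of the decoded PCM bitrate to the coded bitrate."""
    return PCM_BITRATE_BPS / coded_bitrate_bps


for name, rate in (("10:1", 32000), ("5:1", 64000)):
    print(name, f"{compression_ratio(rate):.1f}")
# 10:1 11.0
# 5:1 5.5
```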

UTK file header
The simplest container is Maxis UTK. The UTK file header is little-endian and is very similar to the Maxis XA file header, with a few modifications.


 * sID - A 4-byte string identifier equal to "UTM0".
 * dwOutSize - The size in bytes of the decompressed audio, not including the WAV header; equal to num_samples*2.
 * dwWfxSize - The size in bytes of the WAVEFORMATEX structure to follow; must be 20 (0x14).
 * wFormatTag - The decoded audio format; set to WAVE_FORMAT_PCM (0x0001).
 * nChannels - Number of channels in the decoded audio; must be 1.
 * nSamplesPerSec - Sampling rate of the decoded audio; this is always 22050.
 * nAvgBytesPerSec - Byte rate of the decoded audio; equal to nSamplesPerSec*nChannels*wBitsPerSample/8 or nSamplesPerSec*nBlockAlign.
 * nBlockAlign - Number of bytes per sampling interval in the decoded audio; equal to nChannels*wBitsPerSample/8.
 * wBitsPerSample - Bits per sample in the decoded audio (8, 16, 24, 32, etc.); this is always 16.
 * cbSize - The size in bytes of extra format information following the WAVEFORMATEX structure; must be 0.
 * padding - Two bytes of padding; this is always 0.
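The fields above can be read with Python's struct module. This is a minimal sketch, assuming the fields are packed in the order listed with no alignment padding between them (the 4-byte sID, two DWORDs, the 18-byte WAVEFORMATEX, then 2 bytes of padding, which together account for dwWfxSize = 20), giving a 32-byte header. UTK_HEADER, UtkHeader and parse_utk_header are hypothetical names.

```python
import struct
from dataclasses import dataclass

# Assumed layout: little-endian, no gaps, 32 bytes in total.
# sID, dwOutSize, dwWfxSize, wFormatTag, nChannels, nSamplesPerSec,
# nAvgBytesPerSec, nBlockAlign, wBitsPerSample, cbSize, padding
UTK_HEADER = struct.Struct("<4sIIHHIIHHHH")


@dataclass
class UtkHeader:
    out_size: int     # dwOutSize: decoded PCM size in bytes
    sample_rate: int  # nSamplesPerSec (expected to be 22050)
    num_samples: int  # dwOutSize / 2, since output is 16-bit mono


def parse_utk_header(data: bytes) -> UtkHeader:
    """Parse and sanity-check a Maxis UTK header."""
    if len(data) < UTK_HEADER.size:
        raise ValueError("truncated UTK header")
    (sid, out_size, wfx_size, format_tag, channels, rate,
     avg_bytes, block_align, bits, cb_size, _padding) = UTK_HEADER.unpack_from(data)
    if sid != b"UTM0":
        raise ValueError(f"bad sID {sid!r}")
    if wfx_size != 20:
        raise ValueError("dwWfxSize must be 20")
    # The decoded format is always 16-bit mono PCM with no extra format bytes.
    if format_tag != 1 or channels != 1 or bits != 16 or cb_size != 0:
        raise ValueError("unsupported WAVEFORMATEX values")
    # Check the derived fields against their defining formulas.
    if block_align != channels * bits // 8 or avg_bytes != rate * block_align:
        raise ValueError("inconsistent nBlockAlign/nAvgBytesPerSec")
    # The sampling rate and padding are returned/ignored rather than enforced.
    return UtkHeader(out_size=out_size, sample_rate=rate, num_samples=out_size // 2)
```

For example, UTK_HEADER.pack(b"UTM0", 864, 20, 1, 1, 22050, 44100, 2, 16, 0, 0) builds a header describing one 432-sample frame of output.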

Huffman tables
Normal model (Table 0):

Large-pulse model (Table 1):

Revision 3
MicroTalk Revision 3 does not appear to be used in any games (this is unconfirmed), but it can be created using the Electronic Arts Sound eXchange tool sx.exe (sx -mt5_blk input.dat -=output.dat).

The only difference from the standard format is that each audio frame may optionally carry raw 16-bit PCM data, which overwrites count samples starting at a given offset in the decoded frame.

The decoder logic changes from this:

    while position < num_samples:
        utk_decode_frame

to this:

    while position < num_samples:
        raw_data_present = (read_byte == 0xee)
        utk_decode_frame
        reset_bit_reader
        if (raw_data_present):
            offset = read_i16_be    // 16-bit big-endian signed integer
            count = read_i16_be
            if (offset < 0 || offset > 432) die("offset out of range")
            if (count < 0 || count > 432 - offset) die("count out of range")
            for (i = 0; i < count; i++):
                decompressed_frame[offset+i] = read_i16_be

sx.exe reads the offset and count fields as 16-bit signed integers and performs no bounds checking (see 004274D1 in sx.exe v3.01.01), so a specially crafted MicroTalk Rev. 3 file can crash sx.exe. For completeness, we have added bounds checking to the above pseudocode.
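The raw-PCM step of the Revision 3 loop can be sketched in Python as follows. This assumes the decoded frame is a list of 432 integers and that data[pos:] starts at the byte-aligned offset field that follows the compressed frame (that is, after reset_bit_reader). The frame decoder itself is not shown, and apply_raw_pcm is a hypothetical name. It applies the same bounds checks as the pseudocode.

```python
import struct

FRAME_SIZE = 432  # samples per MicroTalk frame


def apply_raw_pcm(frame: list, data: bytes, pos: int) -> int:
    """Overwrite part of a decoded frame with raw big-endian PCM.

    Reads the offset and count fields at data[pos], copies `count` samples
    into frame[offset:offset + count], and returns the position just past
    the PCM data. Raises ValueError where the pseudocode calls die().
    """
    # Both fields are 16-bit big-endian signed integers.
    offset, count = struct.unpack_from(">hh", data, pos)
    pos += 4
    if offset < 0 or offset > FRAME_SIZE:
        raise ValueError("offset out of range")
    if count < 0 or count > FRAME_SIZE - offset:
        raise ValueError("count out of range")
    # The samples themselves are also 16-bit big-endian signed integers.
    samples = struct.unpack_from(f">{count}h", data, pos)
    frame[offset:offset + count] = samples  # same length, so frame stays 432 long
    return pos + 2 * count
```

A caller would check whether the byte read before the frame was 0xee, decode the frame, and only then call apply_raw_pcm.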