UTK

From Niotso Wiki
Jump to: navigation, search
UTalk
Type of format Audio
Filename extension .utk
Magic number UTM0
Byte order Little endian
Type ID 0x1B6B9806
UTalk (UTK) is a speech-oriented lossy audio codec used in The Sims Online based on CELP. It has a fixed bandwidth of 11.025kHz (sampling rate 22.05kHz) and frame size of 20ms. Of all audio codecs that the game uses, UTalk uses the smallest bitrate at 32kbps, compared to XA at 88.2kbps, and MP3 at 96 or 128.

The codec appears to be unique to The Sims Online. It was reverse-engineered by Andrew D'Addesio, alias Fatbag, on January 12, 2012.

All content on this page is public domain. A complete working Maxis UTalk file reading implementation can be found in the FileHandler interface of Niotso under the ISC license.

UTK file header[edit]

The UTK file header is the same as the XA file header, with a few modifications.

struct UTKHeader
{
    char  sID[4];
    DWORD dwOutSize;
    DWORD dwWfxSize;
    /* WAVEFORMATEX */
    WORD  wFormatTag;
    WORD  nChannels;
    DWORD nSamplesPerSec;
    DWORD nAvgBytesPerSec;
    WORD  nBlockAlign;
    WORD  wBitsPerSample;
    DWORD cbSize;
};
  • sID - A 4-byte string identifier equal to "UTM0"
  • dwOutSize - The decompressed size of the audio stream
  • dwWfxSize - The size in bytes of the WAVEFORMATEX structure to follow; must be 20
  • wFormatTag - The decoded audio format; set to WAVE_FORMAT_PCM (0x0001)
  • nChannels - Number of channels in the decoded audio data
  • nSamplesPerSec - Sampling rate used in the decoded audio data
  • nAvgBytesPerSec - Bytes per second consumed by the decoded audio data; equal to nChannels*nSamplesPerSec*wBitsPerSample/8 or nSamplesPerSec*nBlockAlign
  • nBlockAlign - The number of bytes consumed by an audio frame (one sample for each channel) in the decoded audio data; equal to nChannels*wBitsPerSample/8
  • wBitsPerSample - The bits per sample for one audio channel in the decoded audio data; 8-, 16-, or 24-bit, etc.
  • cbSize - The size in bytes of extra format information appended to the end of the WAVEFORMATEX structure; must be 0. Note that in the original WAVEFORMATEX, this parameter is a WORD, not a DWORD.

Note that the part of the header from wFormatTag below is a modified Microsoft WAVEFORMATEX structure.

Constants[edit]

Tables[edit]

UTalk makes use of three tables, defined below:

const float UTKCosine[64] = {
    0,
    -.99677598476409912109375, -.99032700061798095703125, -.983879029750823974609375, -.977430999279022216796875,
    -.970982015132904052734375, -.964533984661102294921875, -.958085000514984130859375, -.9516370296478271484375,
    -.930754005908966064453125, -.904959976673126220703125, -.879167020320892333984375, -.853372991085052490234375,
    -.827579021453857421875, -.801786005496978759765625, -.775991976261138916015625, -.75019800662994384765625,
    -.724404990673065185546875, -.6986110210418701171875, -.6706349849700927734375, -.61904799938201904296875,
    -.567460000514984130859375, -.515873014926910400390625, -.4642859995365142822265625, -.4126980006694793701171875,
    -.361110985279083251953125, -.309523999691009521484375, -.257937014102935791015625, -.20634900033473968505859375,
    -.1547619998455047607421875, -.10317499935626983642578125, -.05158700048923492431640625,
    0,
    +.05158700048923492431640625, +.10317499935626983642578125, +.1547619998455047607421875, +.20634900033473968505859375,
    +.257937014102935791015625, +.309523999691009521484375, +.361110985279083251953125, +.4126980006694793701171875,
    +.4642859995365142822265625, +.515873014926910400390625, +.567460000514984130859375, +.61904799938201904296875,
    +.6706349849700927734375, +.6986110210418701171875, +.724404990673065185546875, +.75019800662994384765625,
    +.775991976261138916015625, +.801786005496978759765625, +.827579021453857421875, +.853372991085052490234375,
    +.879167020320892333984375, +.904959976673126220703125, +.930754005908966064453125, +.9516370296478271484375,
    +.958085000514984130859375, +.964533984661102294921875, +.970982015132904052734375, +.977430999279022216796875,
    +.983879029750823974609375, +.99032700061798095703125, +.99677598476409912109375
};
 
const uint8_t UTKCodebook[512] = {
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 17,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 21,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 18,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 25,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 17,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 22,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 18,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5,  0,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 17,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 21,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 18,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 26,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 17,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5, 22,
    4,  6,  5,  9,  4,  6,  5, 13,  4,  6,  5, 10,  4,  6,  5, 18,
    4,  6,  5,  9,  4,  6,  5, 14,  4,  6,  5, 10,  4,  6,  5,  2,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 23,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8, 27,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 24,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8,  1,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 23,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8, 28,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 24,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8,  3,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 23,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8, 27,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 24,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8,  1,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 23,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8, 28,
    4, 11,  7, 15,  4, 12,  8, 19,  4, 11,  7, 16,  4, 12,  8, 24,
    4, 11,  7, 15,  4, 12,  8, 20,  4, 11,  7, 16,  4, 12,  8,  3
};
 
const uint8_t UTKCodeSizes[29] = {8,7,8,7,2,2,2,3,3,4,4,3,3,5,5,4,4,6,6,5,5,7,7,6,6,8,8,7,7};

Parameters and coefficients[edit]

typedef struct {
    const uint8_t *InData, *InDataEnd;
    unsigned UnreadBitsValue, UnreadBitsCount;
    int HalvedExcitation;
    unsigned VoicedThreshold;
    float InnovationPower[64];
    float RC[12];
    float History[12];
    float Delay[324];
    float DecompressedFrame[432];
} utkcontext_t;
 
static void InitUTKParameters(utkcontext_t *ctx){
    int i;
    float base;
    ctx->UnreadBitsValue = *(ctx->InData++);
    ctx->UnreadBitsCount = 8;
    ctx->HalvedExcitation = ReadBits(ctx, 1);
    ctx->VoicedThreshold = 32 - ReadBits(ctx, 4);
    ctx->InnovationPower[0] = (ReadBits(ctx, 4)+1) * 8; /* significand */
 
    base = 1.04f + (float)ReadBits(ctx, 6)/1000;
    for(i=1; i<64; i++)
        ctx->InnovationPower[i] = ctx->InnovationPower[i-1]*base;
 
    memset(ctx->RC, 0, 12*sizeof(float));
    memset(ctx->History, 0, 12*sizeof(float));
    memset(ctx->Delay, 0, 324*sizeof(float));
}

Functions[edit]

Utility functions[edit]

#ifndef read_int32
 #define read_uint32(x) (unsigned)(((x)[0]<<(8*0)) | ((x)[1]<<(8*1)) | ((x)[2]<<(8*2)) | ((x)[3]<<(8*3)))
 #define read_uint16(x) (unsigned)(((x)[0]<<(8*0)) | ((x)[1]<<(8*1)))
#endif
#ifndef write_int32
 #define write_uint16(dest, src) do {\
    (dest)[0] = ((src)&0x00FF)>>(8*0); \
    (dest)[1] = ((src)&0xFF00)>>(8*1); \
    } while(0)
#endif
 
#ifndef round
 #define round(x) ((x) >= 0 ? (x)+0.5 : (x)-0.5)
#endif
#ifndef clamp
 #define clamp(x, low, high) ((x) < low ? low : (x) > high ? high : (x))
#endif
#ifndef min
 #define min(x, y) ((x) < (y) ? (x) : (y))
#endif
 
static uint8_t ReadBits(utkcontext_t *ctx, uint8_t bits){
    unsigned value = ctx->UnreadBitsValue & (255>>(8-bits));
    ctx->UnreadBitsValue >>= bits;
    ctx->UnreadBitsCount -= bits;
 
    if(ctx->UnreadBitsCount < 8 && ctx->InData != ctx->InDataEnd){
        ctx->UnreadBitsValue |= *(ctx->InData++) << ctx->UnreadBitsCount;
        ctx->UnreadBitsCount += 8;
    }
    return value;
}
 
void utk_decode(const uint8_t *__restrict InBuffer, uint8_t *__restrict OutBuffer, size_t InSize, size_t Samples){
    utkcontext_t ctx;
    ctx.InData = InBuffer;
    ctx.InDataEnd = InBuffer + InSize;
    InitUTKParameters(&ctx);
 
    while(Samples){
        int i, BlockSize = min(Samples, 432);
        DecodeFrame(&ctx);
 
        for(i=0; i<BlockSize; i++){
            int value = round(ctx.DecompressedFrame[i]);
            value = clamp(value, -32768, 32767);
            write_uint16(OutBuffer, value);
            OutBuffer += 2;
        }
        Samples -= BlockSize;
    }
}

Decoder functions[edit]

static void DecodeFrame(utkcontext_t *ctx){
    int i,j;
    float Excitation[118]; /* includes 5 0-valued samples to both the left and the right */
    float RCDelta[12];
    int Voiced = 0;
 
    memset(&Excitation[0], 0, 5*sizeof(float));
    memset(&Excitation[113], 0, 5*sizeof(float));
 
    for(i=0; i<12; i++){
        unsigned index = ReadBits(ctx, (i<4) ? 6 : 5);
        if(i==0 && index < ctx->VoicedThreshold) Voiced++;
        RCDelta[i] = (UTKCosine[index + ((i<4)?0:16)] - ctx->RC[i])/4;
    }
 
    for(i=0; i<4; i++){
        float PitchGain, InnovationGain;
        int Phase = ReadBits(ctx, 8);
        PitchGain = (float)ReadBits(ctx, 4)/15;
        InnovationGain = ctx->InnovationPower[ReadBits(ctx, 6)];
 
        if(!ctx->HalvedExcitation){
            GenerateExcitation(ctx, Voiced, &Excitation[5], 1);
        }else{
            /* Fill the excitation window with half as many samples and interpolate the rest */
            int Alignment = ReadBits(ctx, 1); /* whether to fill the even or odd samples */
            int FillWithZero = ReadBits(ctx, 1);
            GenerateExcitation(ctx, Voiced, &Excitation[5+Alignment], 2);
 
            if(FillWithZero){
                for(j=0; j<108; j+=2)
                    Excitation[5 + (1-Alignment) + j] = 0.0;
            }else{
                /* Use sinc interpolation with 6 neighboring samples */
                float *x = &Excitation[5 + (1-Alignment)];
                for(j=0; j<54; j++, x+=2)
                    *x =   (x[-1]+x[+1]) * .5973859429f
                         - (x[-3]+x[+3]) * .1145915613f
                         + (x[-5]+x[+5]) * .0180326793f;
 
                InnovationGain /= 2;
            }
        }
 
        /* If 216-Phase is negative on the first subframe, it will read into RC and History
           as the reference decoder does, which have been initialized to 0 in InitUTKParameters(). */
        for(j=0; j<108; j++)
            ctx->DecompressedFrame[108*i + j] = InnovationGain*Excitation[5+j] + PitchGain*ctx->Delay[108*i + j + (216-Phase)];
    }
 
    memcpy(ctx->Delay, &ctx->DecompressedFrame[108], 324*sizeof(float));
 
    for(i=0; i<4; i++){
        /* Linearly interpolate the reflection coefficients for the current subframe */
        for(j=0; j<12; j++)
            ctx->RC[j] += RCDelta[j];
 
        Synthesize(ctx, i*12, (i!=3) ? 12 : 396);
    }
}
 
static void GenerateExcitation(utkcontext_t *ctx, int Voiced, float * Window, int Interval){
    if(Voiced){
        int Table = 0;
        int i = 0;
        while(i<108){
            unsigned code = UTKCodebook[(Table<<8) | (ctx->UnreadBitsValue&0xFF)];
            Table = (code<2 || code>8);
            ReadBits(ctx, UTKCodeSizes[code]);
 
            if(code >= 4){
                /* Fill a sample with a value specified by the code; magnitude is limited to 6.0 */
                Window[i] = (code-1)/4;
                if(code&1)
                    Window[i] *= -1.0;
 
                i += Interval;
            }else if(code >= 2){
                /* Fill between 7 and 70 samples (capped to the number of remaining samples) with 0s */
                int x = ReadBits(ctx, 6) + 7;
                x = min(x, (108 - i)/Interval);
 
                while(x--){
                    Window[i] = 0.0;
                    i += Interval;
                }
            }else{
                /* Fill a sample with a custom value with magnitude >= 7.0 */
                Window[i] = 7.0;
                while(ReadBits(ctx, 1))
                    Window[i]++;
 
                if(!ReadBits(ctx, 1))
                    Window[i] *= -1.0;
 
                i += Interval;
            }
        }
    }else{
        /* Unvoiced: restrict all samples to 0.0, -2.0, or +2.0 without using the codebook */
        int i;
        for(i=0; i<108; i+=Interval){
            if(!ReadBits(ctx, 1)) Window[i] = 0.0;
            else if(!ReadBits(ctx, 1)) Window[i] = -2.0;
            else Window[i] = 2.0;
        }
    }
}
 
static void Synthesize(utkcontext_t *ctx, size_t Sample, size_t Samples){
    float LPC[12];
    int offset = -1;
    RCtoLPC(ctx->RC, LPC);
 
    while(Samples--){
        int i;
        for(i=0; i<12; i++){
            if(++offset == 12) offset = 0;
            ctx->DecompressedFrame[Sample] += LPC[i] * ctx->History[offset];
        }
        ctx->History[offset--] = ctx->DecompressedFrame[Sample++];
    }
}
 
static void RCtoLPC(const float *__restrict RC, float *__restrict LPC){
    int i,j;
    float RCTemp[12], LPCTemp[12];
    RCTemp[0] = 1.0;
    memcpy(&RCTemp[1], RC, 11*sizeof(float));
 
    for(i=0; i<12; i++){
        LPC[i] = 0.0;
        for(j=11; j>=0; j--){
            LPC[i] -= RC[j] * RCTemp[j];
            if(j != 11)
                RCTemp[j+1] = RCTemp[j] + RC[j] * LPC[i];
        }
        RCTemp[0] = LPCTemp[i] = LPC[i];
 
        for(j=0; j<i; j++)
            LPC[i] -= LPCTemp[i-j-1] * LPC[j];
    }
}

Bitstream specification[edit]

The UTalk bitstream consists of bits that are packed and subsequently extracted from the least significant bit of the first byte of the compressed data buffer (labelled bit 0 in the bitstream diagrams on this page) to the most significant bit of the first byte (labelled bit 7), followed by the least significant bit of the second byte (labelled bit 8), and so on. A number consisting of multiple bits is packed in the bitstream starting with its least significant bit and ending with its most significant bit. The largest single number encoded in the bitstream is 8 bits wide.

Header[edit]

The bitstream begins with the following 15 bit header.

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|h| thres |innsig |  innbase  |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Halved Excitation is signaled by a flag ("h") indicating whether to use half sampling rate for the excitation. Voiced Threshold (evaluated as follows: 32 - thres) is an integer used later to determine if audio frames are voiced. Innovation Significand (evaluated as follows: (innsig+1) * 8) and Innovation Base (evaluated as follows: 1.04f + (float)innbase/1000) are floating point numbers used later to calculate innovation gains.

The audio frames, carrying 432 samples each, are contiguously stored directly following this header.

Audio frame[edit]

Reflection coefficients[edit]

An audio frame begins by signaling 12 indices into the UTKCosine table. The first four indices are each encoded 6 bits wide and span from index 0 to 63, while the remaining eight are each encoded only 5 bits wide and span from index 16 to 47. If the first index is below the Voiced Threshold, the entire audio frame is considered voiced; otherwise, the frame is unvoiced. The 12 reflection coefficients are constructed by indexing the table.

Excitation subframes[edit]

The excitation for the audio frame is encoded in four subframes of 108 samples each.

Each subframe starts by signaling the adaptive codebook phase offset in samples ("phase"), followed by the pitch gain ("pgain") as a fixed point integer using steps of 1/15ths, and followed by the innovation power ("innpow").

 0                   1
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|     phase     | pgain |  innpow   |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The innovation gain is constructed by evaluating Innovation Significand times Innovation Base raised to the Innovation Power, and pgain can be converted to floating point by dividing it by 15.

If the bitstream does not use Halved Excitation, the excitation window for the subframe is 108 samples long; however, if Halved Excitation is set, the excitation window is encoded at half the sampling rate, coding only 54 samples, and then upsampled into 108. In that case, two additional flags are specified for the subframe.

 0
 0 1
+-+-+
|a|z|
+-+-+

Each sample explicitly coded in the excitation window can either take on the even samples (sample 0, sample 2, sample 4, ...) or the odd samples (sample 1, sample 3, sample 5, ...). The Alignment flag ("a") signals the former case if unset and the latter case if set. If the Zero flag ("z") is unset, the remaining samples are interpolated and the innovation gain is halved, but if set, the remaining samples are filled with 0s, and the innovation gain remains untouched.

The procedure for generating the excitation window of either 54 or 108 samples follows.

If the frame is unvoiced, each sample is coded in sequential order as either a 0 (signaled by a 0), a -2 (signaled by a 1 followed by a 0), or a +2 (signaled by a 1 followed by a 1).

If the frame is voiced, the fixed codebook is used. A variable named Table is initialized to 0, whose purpose is to determine whether to refer to the first half (as indicated by a 0) or the second half (as indicated by a 1) of UTKCodebook. The following process then repeats until the entire excitation has been generated. An 8-bit index (containing at least one Huffman code in the least significant bits) is extracted from the bitstream and is used as an index to retrieve a codeword from the half of UTKCodebook indicated by Table, and the highest (8-UTKCodeSizes[codeword]) bits of the 8-bit index are placed back into the bitstream (in equivalent terms, rather than removing all 8 bits from the bitstream after reading the index, only the first UTKCodeSizes[codeword] of them are removed). Table is then updated to a 1 if the codeword was less than 2 or greater than 8, or to a 0 otherwise. Afterwards,

  • If the codeword is greater than or equal to 4, the next sample is coded as floor((code-1)/4) if the codeword is even or -floor((code-1)/4) if the codeword is odd.
  • If the codeword is less than 4 but greater than or equal to 2, the next n samples are coded as 0s, where n is a 6-bit integer extracted from the bitstream, incremented by 7, and capped to the remaining number of samples in the window.
  • If the codeword is less than 2, the next sample is coded with a custom value n, where n is a unary number extracted from the bitstream, incremented by 7, and then, if another bit extracted directly after the unary number is a 0, multiplied by -1.

Lastly, if Halved Excitation was used with the Zero flag unset, the remaining samples of the excitation window are filled using sinc interpolation with 6 neighboring samples, with the coefficients .5973859429, -.1145915613, and .0180326793 used as weights for the neighbors distanced 1, 3, and 5 samples, respectively, to both the left and right of each remaining sample. If a neighbor is missing, the value associated with that neighbor should be treated as 0.

When the excitation window for the subframe has been constructed, it is combined with the contribution of the adaptive codebook (which is treated as all zeros when read for the first audio frame) to become the final excitation: FinalExcitation[108*i + n] = InnovationGain * Excitation[n] + PitchGain * AdaptiveCodebook[108*i + 216 - Phase + n] for 0 <= n <= 107, where i is a number from 0 to 3 indicating the current subframe. On the first subframe, Phase should be capped to 216.

When the above process has repeated for all four subframes, the adaptive codebook is updated with the values of the final excitation: AdaptiveCodebook[n] = FinalExcitation[n] for 0 <= n <= 107.

Linear prediction synthesis[edit]

The synthesis filter for the audio frame is run four times, each with a new set of LPC coefficients corresponding to the reflection coefficients linearly interpolated in fourths from those of the previous frame: InterpolatedRC[k] = RCPrevious[k] + (RCCurrent[k] - RCPrevious[k])*r/4 for 0 <= k <= 11 and 1 <= r <= 4. A window of 12 is used for each of the first three runs, and a window of 396 is used for the final run. Conversion of the interpolated reflection coefficients into linear prediction coefficients can be performed using step-up recursion.

If the linear prediction coefficients are ordered by associativity with the most present to the most past samples (as used in the source code provided), then each sample in the decompressed audio frame is given by

                                            11
                                            __
DecompressedFrame[n] = FinalExcitation[n] + \  (LPC[k] * DecompressedFrame[n-1-k])
                                            /_
                                            k=0

where values of DecompressedFrame for n-1-k < 0 refer to values of DecompressedFrame[432+n-1-k] for the prior frame; the prior frame should be treated as all 0s by the first audio frame.

Reference decoder quirks[edit]

There exist a number of peculiarities in the reference decoder used in TSOAudioViewD.dll.

The first is that Phase is not capped to 216 on the first subframe. If this ever happened, this would cause the copy to begin in InnovationPower[64] and leak into RC[12], followed by History[12], before reaching Delay. These are all float values, and they are all packed into a struct, which means this is safe, defined, and portable behavior in C.

More oddly, however, is the fact that Delay and the first 108 samples of DecompressedFrame are together used as the adaptive codebook. The game's UTK files rely on this behavior; reordering Delay so that it does not immediately precede DecompressedFrame will distort the decompressed output.

Another issue is an off-by-one mistake in the excitation generation code for handling codewords 2 and 3 ("Fill between 7 and 70 samples (capped to the number of remaining samples) with 0s"):

CPU Disasm
Address   Hex dump          Command                                  Comments
029C5801  |.  6A 06         |PUSH 6                                  ; /Arg2 = 6
029C5803  |.  FF75 08       |PUSH DWORD PTR SS:[EBP+8]               ; |Arg1 = context
029C5806  |.  E8 CDFCFFFF   |CALL UpdateCompressionParameters        ; \TSOAudioViewD.ReadBits
029C580B  |.  59            |POP ECX                                 ;
029C580C  |.  83C0 07       |ADD EAX,7                               ; x += 7
029C580F  |.  59            |POP ECX                                 ;
029C5810  |.  8BC8          |MOV ECX,EAX                             ;
029C5812  |.  0FAF4D 14     |IMUL ECX,DWORD PTR SS:[EBP+14]          ; y = x*Interval
029C5816  |.  03CB          |ADD ECX,EBX                             ; y += i
029C5818  |.  83F9 6C       |CMP ECX,6C                              ; Compare with 108
029C581B  |.- 7E 09         |JLE SHORT 029C5826                      ; If less than or equal, x is <= the cap
029C581D  |.  6A 6C         |PUSH 6C                                 ; Otherwise, x is above the cap; set x = 108
029C581F  |.  58            |POP EAX                                 ;
029C5820  |.  2BC3          |SUB EAX,EBX                             ; x -= i
029C5822  |.  99            |CDQ                                     ;
029C5823  |.  F77D 14       |IDIV DWORD PTR SS:[EBP+14]              ; x /= Interval
029C5826  |>  85C0          |TEST EAX,EAX                            ;
029C5828  |.- 7E 6D         |JLE SHORT 029C5897                      ; If x is zero, we have no samples to fill, so we are done
029C582A  |.  8B75 14       |MOV ESI,DWORD PTR SS:[EBP+14]           ;
029C582D  |.  8B4D 10       |MOV ECX,DWORD PTR SS:[EBP+10]           ;
029C5830  |.  8BF8          |MOV EDI,EAX                             ; Set EDI, the counter for the following loop, to EAX
029C5832  |.  0FAFC6        |IMUL EAX,ESI                            ; Multiply EAX by Interval
029C5835  |.  8BD6          |MOV EDX,ESI                             ;
029C5837  |.  8D0C99        |LEA ECX,[EBX*4+ECX]                     ; Set ECX to i*4 + Window
029C583A  |.  C1E2 02       |SHL EDX,2                               ; Set EDX to Interval*4
029C583D  |.  03D8          |ADD EBX,EAX                             ; i += x
029C583F  |>  D9EE          |/FLDZ                                   ;
029C5841  |.  D919          ||FSTP DWORD PTR DS:[ECX]                ; Set the sample at ECX to zero
029C5843  |.  03CA          ||ADD ECX,EDX                            ; Increment ECX by EDX (Interval*4)
029C5845  |.  4F            ||DEC EDI                                ;
029C5846  |.- 75 F7         |\JNE SHORT 029C583F                     ;

This translates to the following logic:

Set x to (ReadBits(6) + 7)
Compare (x * Interval + i) with 108.
If greater: Change x to (108 - i) / Interval.

for(j=0; j<x; j++){
    Window[i] = 0.0;
    i += Interval;
}

At first glance, this appears to do "if(x > RemainingSamples) x = RemainingSamples", so let's study the calculation of RemainingSamples, the latter calculation specifically ("(108 - i) / Interval"). Consider the situation in which HalvedExcitation is 1 (which means that Interval=2) and Alignment is 1 and we are currently on the 105th sample.

key: x  = sample that currently needs to be filled in
    " " = sample that should be skipped
     $  = first non-sample
     *  = current position

  *
 105 106 107 108
 [x] [ ] [x] [$]

RemainingSamples = floor((108 - 105)/2 = 3/2 = 1.5) = 1.0

The code thinks the maximum number of samples to fill in is 1, when it should be 2. This incorrect logic is also used for the first calculation ("x * Interval + i") as well:

-> Compare (x * Interval + i) with 108.
-> Compare (current position + total slots we incorrectly think we need for these x new samples) with 108.

In both calculations, the code thinks that each sample takes up Interval slots, which is not true: the last sample only takes up one slot, regardless of the Interval. This means that if Alignment=1 is used together with HalvedExcitation=1, then codewords 2 and 3 will never fill the last sample needed for the excitation window even if it specifies enough samples to do so; the code caps the number of samples one value too short, so that the last value for the excitation window is not filled in and i is not increased to >=108 (which is needed to end the loop). In fact, a single audio frame can be encoded into an indefinite number of bytes (allowing you to make a UTK file as pointlessly high-bitrate-for-no-gain as you want to at a specific time) by continually specifying the UTKCodebook index 255 on table 0 (codeword 2).

How the fixed codebook achieves compression[edit]

Let's say the encoder wants to encode the very first sample of the excitation signal as a specific value. The options "00000000", "10000000", "01000000", "11000000", "00100000", and so on result in the following output on the decoder end:

Let's see what this means.

00000000 (Index 0)
* Table 0: Codeword 4
    * Set Table to 0
    * Skip 2 bits
    * Window[i] = +0
    * Next index must be 000000__ in binary.
* Table 1: Codeword 4
    * Set Table to 0
    * Skip 2 bits
    * Window[i] = +0
    * Next index must be 000000__ in binary.

"00000000" means:

  • If currently on Table 0, set Table to 0, encode this sample as 0, and restrict the next index to begin with 6 0s.
  • If currently on Table 1, set Table to 0, encode this sample as 0, and restrict the next index to begin with 6 0s.

Now do a regular expression search for "^00" through the list. It turns out that that all indices that begin with "00" result in the codeword 4, whose action is to set Table to 0, encode this sample as 0, and restrict the next index to the remaining 6 bits. The remaining 6 bits actually have no influence on which codeword to select. If you repeat this process ("^0", "^1", and then more specifically "^01", "^10", "^11", and so on), you will come up with the following data:

Table 0:
To fill one 0, the first two bits must be 00; the next 6 bits are free to be anything.
To fill one +1, the first two bits must be 10
To fill one -1, the first two bits must be 01.
To fill one +2, the first four bits must be 1101.
To fill one -2, the first four bits must be 1100.
To fill one +3, the first five bits must be 11101.
To fill one -3, the first five bits must be 11100.
To fill one +4, the first six bits must be 111101.
To fill one -4, the first six bits must be 111100.
To fill one +5, the first seven bits must be 1111101.
To fill one -5, the first seven bits must be 1111100.
To fill one +6, the first eight bits must be 11111101.
To fill one -6, the first eight bits must be 11111100.
To fill a custom value with magnitude >= 7.0, the first eight bits must be 11111110.
To fill between 7 and 70 samples with 0s, the first eight bits must be 11111111.

Table 1:
To fill one 0, the first two bits must be 00; the next 6 bits are free to be anything.
To fill one +1, the first three bits must be 011.
To fill one -1, the first three bits must be 010.
To fill one +2, the first three bits must be 101.
To fill one -2, the first three bits must be 100.
To fill one +3, the first four bits must be 1101.
To fill one -3, the first four bits must be 1100.
To fill one +4, the first five bits must be 11101.
To fill one -4, the first five bits must be 11100.
To fill one +5, the first six bits must be 111101.
To fill one -5, the first six bits must be 111100.
To fill one +6, the first seven bits must be 1111101.
To fill one -6, the first seven bits must be 1111100.
To fill a custom value with magnitude >= 7.0, the first seven bits must be 1111110.
To fill between 7 and 70 samples with 0s, the first seven bits must be 1111111.

Hence, this special codebook is simply an optimization trick on the decoder end for implementing a Huffman-code decoder; it's an alternative to doing "read the first bit; if a 0, do this, otherwise, read the second bit; if a 0, do this, otherwise, read the third bit; ...", the latter of which requires branch instructions. (This trick is possible because the maximum Huffman code length for this model is only 8 bits, so the lookup table does not have to be very large.)

You will notice that Table 0 is more efficient at encoding a +/- 1, Table 0 and Table 1 are equally efficient at encoding a 0, and Table 1 is more efficient at encoding everything else. Further, a +/- 0 or +/- 1 means the Table is set to 0, and all other sample values mean the Table is set to 1. Hence, the codebook also implements a Markov model of the excitation.

Naming notes[edit]

The UTK files in the game were stored in DBPF archives without filenames. However, we can deduce the codec's name and file extension from this text contained in sys/tsoaudio.ini (starting with The Sims Online Play Test Rated T version):

;kGZCLSID_cWaveXA = 0x1d07eb4b
;kGZCLSID_cGZSndSegmentDataSourceUTalk = 0x1b6b9806
;kGZCLSID_cGZWave = 0xbb7051f5
;kGZCLSID_cHitMp3DecoderDataSource = 0x3cec2b47
1=0x1d07eb4b,XA
2=0x1b6b9806,UTK
3=0xbb7051f5,WAV
4=0x3cec2b47,MP3

WAV and MP3 are common audio formats, and XA refers to Maxis XA (with extension ".xa" in The Sims 1).