Now that my library works in UE5, and I can communicate over the UDP socket to my server, it’s time to take a look at getting objects moving. As a refresher, clients send only their inputs and some additional metadata to the server. It’s usually a Really Bad Idea to send positional information because it can’t be trusted, like, at all.
My protocol is made up of opcodes that the client sends to the server; the movement opcodes look like this:
// generated from flatbuffers flatc
enum InputType : uint16_t {
    InputType_MoveForward = 0,
    InputType_MoveBackward = 1,
    InputType_MoveStop = 2,
    InputType_MoveTurnLeft = 3,
    InputType_MoveTurnRight = 4,
    InputType_MoveTurnStop = 5,
    InputType_MoveStrafeLeft = 6,
    InputType_MoveStrafeRight = 7,
    InputType_MoveSprint = 8,
    InputType_MoveJump = 9,
    InputType_MoveFallLand = 10,
    InputType_MoveStartSwim = 11,
    InputType_MoveStopSwim = 12,
    InputType_MovePitchUp = 13,
    InputType_MovePitchDown = 14,
    InputType_MIN = InputType_MoveForward,
    InputType_MAX = InputType_MovePitchDown
};
For the most part this is a 1:1 mapping of opcode to movement, meaning the server doesn't need to know anything except the opcode. As an example, either we are moving forward or backward. This is a binary yes/no input, because the server knows the character's speed and can simply calculate NewPosition = CurrentPosition + CharacterSpeed * DeltaTime.
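To make that concrete, here's a minimal sketch of how the server side could apply those opcodes. The PlayerState struct, kMoveSpeed, and the function names are my own illustration, not code from my actual server:

#include <cmath>
#include <cstdint>

// Trimmed copy of the generated InputType enum from above.
enum InputType : uint16_t {
    InputType_MoveForward = 0,
    InputType_MoveBackward = 1,
    InputType_MoveStop = 2,
    // remaining opcodes elided
};

// Hypothetical per-player state tracked entirely on the server.
struct PlayerState {
    float x = 0.0f, y = 0.0f; // position, owned by the server
    float heading = 0.0f;     // radians
    int moveDir = 0;          // +1 forward, -1 backward, 0 stopped
};

constexpr float kMoveSpeed = 6.0f; // units per second, decided by the server

// Opcodes only flip state; the client never sends a position.
void ApplyInput(PlayerState& p, InputType op) {
    switch (op) {
        case InputType_MoveForward:  p.moveDir = +1; break;
        case InputType_MoveBackward: p.moveDir = -1; break;
        case InputType_MoveStop:     p.moveDir = 0;  break;
        default: break; // turning, jumping, etc. handled elsewhere
    }
}

// Integration happens on the server's own tick, not the client's.
void Tick(PlayerState& p, float deltaTime) {
    p.x += std::cos(p.heading) * kMoveSpeed * p.moveDir * deltaTime;
    p.y += std::sin(p.heading) * kMoveSpeed * p.moveDir * deltaTime;
}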
The only complication is turning. Why is this a complication? Well, turning has a magnitude, and that magnitude changes significantly between someone tapping the left/right keys and someone holding down the right mouse button and flicking their character 180 degrees.
As we will hopefully be having hundreds of clients here, I'd like to reduce the amount of data every client sends. That means I'd prefer not to send floats, because floats are Big. I'd like to compress the turn value as much as possible, and to do that I first need to know an acceptable range.
Flicking the mouse to turn abruptly, I was able to get the rotation value per frame tick to a maximum of around +/-90.0f. Playing around, I started clamping the maximum values and found +/-30.0f to be an acceptable range without sacrificing too much mobility. OK, so we have our range. Such a small range felt like it should fit in a uint8_t, i.e. a single byte. Floats are usually 32 bits, and the MovementVector X/Y axis in UE5 is actually stored as a double (64 bits on most machines these days), so this was going to take some magic.
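Quick back-of-the-napkin check that a byte is even plausible (just arithmetic, nothing project-specific):

#include <cstdio>

int main() {
    const float range = 30.0f - (-30.0f); // 60 degrees of clamped turn range
    const int steps = 255;                // a uint8_t spans 0..255, i.e. 255 intervals
    printf("finest resolution: %f\n", range / steps); // ~0.235 degrees per step
    return 0;
}

About a quarter of a degree per step at best, which sounded like plenty for a per-tick turn value.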
Floats
Floats are not stored in memory like integers are. I'll spare you the education and point you to some wonderful resources that really helped me grok them. I strongly recommend watching IEEE 754 Standard for Floating Point Binary Arithmetic; it really clarified how exactly floats are stored and what you can and can't do with them. I think what surprised me the most is just how often numbers can't actually be stored exactly as a float; it all seems like magic.
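You can see the weirdness for yourself with a one-liner (a standalone toy, nothing to do with my project):

#include <cstdio>

int main() {
    // 0.1 has no exact binary representation, so the float actually
    // stores the nearest value it can represent.
    float f = 0.1f;
    printf("%.20f\n", f); // prints 0.10000000149011611938
    return 0;
}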
After understanding them a bit more, I realized I wanted nothing to do with them and didn't want to operate on them directly. My first idea was to simply multiply the rotation by 100, truncate it, and cast it to an integer. Then I was going to scale the integral part and the fractional part differently (big mistake, too complicated): the top 4 bits of the integer scaled by 2, and the lower 4 fractional bits by 0.25, something like this:
4 bits = integer part (scale = 2): 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30
4 bits = fraction (scale = 0.25): 0, 0.25, 0.5, 0.75, ... up to 3.75
Unfortunately, working with 'digits' is very different from working with bits. I think there is a way, but I could not figure out how to extract the decimal digits efficiently. I was hoping for some masking and shifting bit tricks, but base-10 digits don't line up with bit boundaries, so I ended up having to basically loop over them with divide and modulo:
#include <cstdio>
#include <cmath>

// Peel the decimal digits off one at a time. There's no mask-and-shift
// trick for base-10 digits, so divide and modulo it is.
void print_digits(int truncated) {
    while (truncated > 0) {
        printf("%d\n", truncated % 10); // least significant digit first
        truncated /= 10;
    }
}

int main() {
    double x = 13.1;
    double y = x * 100;
    int truncated = (int) std::trunc(y); // chop off whatever fraction is left
    printf("x: %f\ny: %f\ntrunc: %d\n", x, y, truncated);
    print_digits(truncated);
    return 0;
}
This sucks and I don't like having to loop, so I figured it was time to find a better way.
Quantization
Quantization is basically a form of compression, and it's wayyy more popular now thanks to all the interest in LLMs/ML, but it's obviously been used for many, many years. There are a few ways to do it; in ML you usually see it done with a scale factor and a zero point, roughly x_q = clip(round(x / scale + zero_point), min, max).
I searched for code that does this and didn't find anything in C/C++. I did, however, come across this interesting write-up on Quantization for Neural Networks. I converted the Python code near the bottom of that post to C++ so I could test it out:
#include <cstdio>
#include <cmath>
#include <algorithm>

template <typename T>
T clip(const T& n, const T& lower, const T& upper) {
    return std::max(lower, std::min(n, upper));
}

// Snap the real value onto the integer grid defined by scale s and
// zero point z: x_q = clip(round(x / s + z), min, max)
int quantization(float x, float s, float z, float min, float max)
{
    float x_q = std::round(1 / s * x + z);
    return (int) clip(x_q, min, max);
}

// Undo it: the result s * (x_q - z) is only approximately the original.
float dequantization(int x_q, float s, float z)
{
    return s * (x_q - z);
}

int main() {
    float x = 29.6f;
    float scale = 0.4f;
    float zeroPoint = 0.0f; // I'll be honest, no idea what zero point is for
    int result = quantization(x, scale, zeroPoint, -127.f, 127.f);
    printf("orig: %f quantized: %d, back to float: %f\n",
           x, result, dequantization(result, scale, zeroPoint));
    return 0;
}
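With those inputs the round trip happens to be clean: 29.6 / 0.4 rounds to exactly 74, and 74 * 0.4 lands right back on 29.6. A value that doesn't sit on the 0.4 grid, like 29.5, would come back as 29.6 instead; that's the precision you give up.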
I don't really understand the zero-point thing; as best I can tell it's the integer that the real value 0.0 maps to, so a range that isn't symmetric around zero can still represent zero exactly. The TensorFlow documentation says it should be zero, which checks out for a symmetric range like mine. Anyways, I didn't even write that code until I started writing this blog post. I actually found another solution from the venerable Gaffer On Games.
If you are interested in netcode even in the slightest, I cannot recommend his blog posts enough; they've shown a lot of developers how netcode and physics should work. For my problem at hand, however, one post stood out: Serialization Strategies.
Partway down, he states:
But what about situations where you don’t need full precision? What about a floating point value in the range [0,10] with an acceptable precision of 0.01? Is there a way to send this over the network using less bits?
Yes there is. The trick is to simply divide by 0.01 to get an integer in the range [0,1000] and send that value over the network. On the other side, convert back to a float by multiplying by 0.01.
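In other words (a tiny standalone sketch of the quoted trick, using his [0,10] example):

#include <cstdio>
#include <cstdint>
#include <cmath>

int main() {
    float value = 7.23f;                                  // somewhere in [0, 10]
    uint16_t wire = (uint16_t) std::round(value / 0.01f); // integer in [0, 1000]
    float restored = wire * 0.01f;                        // receiving side
    printf("sent: %d restored: %f\n", wire, restored);
    return 0;
}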
This sounds like what I want! I converted his code to two separate functions and commented it so it’s a bit easier to understand:
// Taken from https://www.gafferongames.com/post/serialization_strategies/
// but removed stream code and error handling
#include <cstdio>
#include <cstdint>
#include <cmath>
#include <algorithm>

uint8_t compress_float(float value, float min, float max, float resolution)
{
    // Get the full possible range
    const float delta = max - min;
    printf("Delta: %f\n", delta);
    // Get the number of possible values at this resolution
    const float values = delta / resolution;
    printf("Values: %f\n", values);
    // Round up so we know the maximum possible integer value
    const uint32_t maxIntegerValue = (uint32_t) std::ceil(values);
    printf("maxIntegerValue: %u\n", maxIntegerValue);
    // Normalize between 0 and 1 by dividing (value - min) by the range
    float normalizedValue = std::clamp((value - min) / delta, 0.0f, 1.0f);
    printf("normalizedValue: %f\n", normalizedValue);
    // Multiply the normalized value by the maximum possible integer.
    // Adding 0.5f before the floor rounds to nearest instead of always
    // truncating down: e.g. 7.9 + 0.5 = 8.4, floored to 8, rather than 7.
    uint8_t integerValue = (uint8_t) std::floor(normalizedValue * maxIntegerValue + 0.5f);
    printf("integerValue: %d\n", integerValue);
    // return the compressed value
    return integerValue;
}

float uncompress_float(uint8_t value, float min, float max, float resolution)
{
    const float delta = max - min;
    const float values = delta / resolution;
    const uint32_t maxIntegerValue = (uint32_t) std::ceil(values);
    // undo our compression
    return value / float(maxIntegerValue) * delta + min;
}

int main() {
    float min = -30;
    float max = 30;
    float resolution = 0.48f; // 60 / 0.48 = 125 steps, fits comfortably in a byte
    float z = 29.6f;
    uint8_t result = compress_float(z, min, max, resolution);
    printf("-----------\n");
    printf("%d, %f\n", result, uncompress_float(result, min, max, resolution));
    return 0;
}
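Running it with my numbers: the delta is 60, 60 / 0.48 gives 125 possible values, 29.6 normalizes to about 0.993, which lands on the integer 124, and decompressing 124 gives back 29.52. Being 0.08 degrees off on a single frame's turn is nothing.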
I then threw this code into the UE5 client and ran the compression/decompression routines prior to passing the value to the movement vector. It turns out the loss in precision and clamping of values isn’t that big of a deal, as you can see in the recording.
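For reference, the experiment looked roughly like this. It's an illustrative sketch of an input handler, not my actual client code:

#include <algorithm>
#include <cstdint>

// From the compression code above.
uint8_t compress_float(float value, float min, float max, float resolution);
float uncompress_float(uint8_t value, float min, float max, float resolution);

// Hypothetical turn handler: run the full wire round trip before the
// value ever reaches the movement code, so the recording shows exactly
// what the server (and every other client) would see.
void HandleTurnInput(float AxisValue /* per-tick yaw delta */)
{
    const float Clamped = std::clamp(AxisValue, -30.0f, 30.0f);
    const uint8_t Wire = compress_float(Clamped, -30.0f, 30.0f, 0.48f);
    const float RoundTripped = uncompress_float(Wire, -30.0f, 30.0f, 0.48f);
    // Pawn->AddControllerYawInput(RoundTripped); // feed the lossy value in
}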
One final note: I think I can actually get double the precision here, because I'm sending two different opcodes depending on the direction the character is turning! Since I send a MoveTurnLeft, I can set min to 0 and max to 30, then set my resolution to 0.24 and get double the precision. For MoveTurnRight I do the same thing but negate the value after I decompress it.
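That would look something like this (a sketch reusing the functions above; the opcode plumbing is illustrative):

#include <cstdint>
#include <cmath>

// From the compression code above.
uint8_t compress_float(float value, float min, float max, float resolution);
float uncompress_float(uint8_t value, float min, float max, float resolution);

// Sender: the opcode already carries the sign, so the byte only needs
// to carry the magnitude, at double the resolution.
uint8_t compress_turn(float turn /* clamped to [-30, 30] */)
{
    return compress_float(std::fabs(turn), 0.0f, 30.0f, 0.24f);
}

// Receiver: rebuild the sign from whichever opcode arrived.
float uncompress_turn(uint8_t wire, bool isTurnRight)
{
    const float magnitude = uncompress_float(wire, 0.0f, 30.0f, 0.24f);
    return isTurnRight ? -magnitude : magnitude;
}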
Re-reading my post here and looking at the ML quantization code vs Glenn’s… the ML quantization might be quicker. I’ll need to run some benchmarks. It certainly looks like less code!
This week I’ll be spending more time figuring out the details of my movement system. Stay tuned!