Client Desyncs, Mostly Solved


I am cautiously optimistic that my client desyncs are (mostly) solved. Boy, has it been a wild ride. I started this endeavor 3 months ago, when I first noticed my client was slowly becoming unable to hit another player that should have been within range of the collision sphere. I was not expecting what it would take to get to where I am now.

Bugs

I had a lot of bugs in my netcode, but the one that really bit me was taking sequence ids mod N where N does not evenly divide 256. For my state buffer I was storing 24 states, and for my jitter buffer I was storing 12 elements. Why is this a problem? Because of unsigned integer overflow wrapping around. (This is defined behavior for unsigned types, by the way!) My netcode uses uint16_t (2 bytes) for sequence ids, however my game loop logic truncates this to a uint8_t when storing the sequence ids in various buffers/states.

Let’s see an example of what this bug looks like with the following code:

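#include <cstdint>
#include <cstdio>

int main()
{
	uint8_t ReadSeq{};     // wraps to 0 after 255
	uint8_t MaxSeqId = 12; // buffer size; does not evenly divide 256

	for (size_t i = 0; i < 264; i++)
	{
		uint8_t Idx = (ReadSeq % MaxSeqId);
		printf("seq: %zu read: %d idx: %d\n", i, ReadSeq, Idx);
		ReadSeq++;
	}
}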

Let’s walk through some iterations manually:

  • At i = 0, Idx = 0
  • At i = 11, Idx = 11
  • At i = 12, Idx = 0 (12 % 12 = 0)
  • At i = 253, Idx = 1
  • At i = 254, Idx = 2
  • At i = 255, Idx = 3
  • At i = 256, ReadSeq has wrapped to 0, so Idx = 0 instead of the expected 4. Boom.

This obviously had catastrophic results for my jitter buffer (and state buffer!) as the index would suddenly jump to an invalid part of the buffer and start playing or replaying from an invalid index.

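The fix here was to ensure my buffer sizes evenly divide 256 (for example, a power of two such as 16 or 32), so the index stays continuous when the uint8_t wraps from 255 back to 0. Pretty silly bug that I clearly overlooked!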
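One way to catch this class of bug at compile time is a static_assert on the buffer size. This is only a minimal sketch, not the code from the game: JitterSize, SlotFor and IsContinuousAfter are hypothetical names, and 16 is an arbitrary example size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical size: any value that evenly divides 256 keeps the slot
// index continuous when a uint8_t sequence id wraps from 255 to 0.
constexpr std::size_t JitterSize = 16;
static_assert(256 % JitterSize == 0,
              "buffer size must evenly divide 256 for uint8_t sequence ids");

// Maps a truncated sequence id to a slot in the ring buffer.
constexpr std::size_t SlotFor(std::uint8_t Seq)
{
	return Seq % JitterSize;
}

// Returns true if the slot of Seq + 1 directly follows the slot of Seq,
// including across the 255 -> 0 wrap.
constexpr bool IsContinuousAfter(std::uint8_t Seq)
{
	return SlotFor(static_cast<std::uint8_t>(Seq + 1)) == (SlotFor(Seq) + 1) % JitterSize;
}
```

With JitterSize set to 12 or 24, the static_assert fails to compile, which is exactly the situation described above.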

After I fixed the above and everything appeared to be working, I accidentally hit the ‘dodge’ key, which played the dodge animation and caused my client to desync again… After poring over the logs I noticed the “LastClientAckId” was 0 when the client got desynced. This is weird because the previous packet ack’d 23 and the packet after the desync was 25, so what happened to 24?

I track the LastClientAckId in my state object, and this is a rather recent addition. I forgot to add this property to the PlayerWorldStates class’s move assignment operator and move constructor:

// move assignment
PlayerWorldStates& operator=(PlayerWorldStates&& other) noexcept
{
	States = std::move(other.States);
	Acks = std::move(other.Acks);
	SequenceId = other.SequenceId;
	ClientSequenceId = other.ClientSequenceId; // <-- forgot this
	return *this;
}

// move constructor
PlayerWorldStates(PlayerWorldStates&& other) noexcept
{
	States = std::move(other.States);
	Acks = std::move(other.Acks);
	SequenceId = other.SequenceId;
	ClientSequenceId = other.ClientSequenceId; // <-- and this
}

You (like me) may be wondering why on earth playing an animation would cause the move assignment operator or move constructor to be called. Well, when I play animations, I create new ECS components, and adding components to an entity causes flecs to shift/move things around in its storage. As you may have guessed, this causes the object to be moved and these functions to be called.
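One way to make this kind of omission impossible is to not write the move operations by hand at all, so the compiler-generated ones pick up every member, including ones added later. A minimal sketch, assuming all members are movable types; PlayerWorldStatesSketch and Relocate are hypothetical, simplified stand-ins and not the real class or anything flecs provides:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the real state object.
// No move constructor or move assignment is declared, so the compiler
// generates both and moves every member automatically.
struct PlayerWorldStatesSketch
{
	std::vector<int> States;
	std::vector<std::uint8_t> Acks;
	std::uint8_t SequenceId = 0;
	std::uint8_t ClientSequenceId = 0;
	std::uint8_t LastClientAckId = 0; // newly added member, moved without extra code
};

// Imitates the relocation an ECS may perform when an entity's storage changes.
inline PlayerWorldStatesSketch Relocate(PlayerWorldStatesSketch&& Old)
{
	PlayerWorldStatesSketch Moved(std::move(Old));
	return Moved;
}
```

If hand-written move operations are unavoidable, a small test that moves an object and checks every field afterwards would have flagged the missing member as well.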

The Solution

As I hinted in my last post, it was recommended that I implement some sort of dynamic throttling. At first I implemented it server side, with the intent of pushing throttling commands down to clients as a simplistic back pressure algorithm. The problem is that the server doesn’t actually see any problem, or at least not nearly as quickly as the client does: the client sees server packets coming in faster than it’s reading them, leading to the overflow I was experiencing.

Here’s some logging which demonstrates the slow drift of the client not reading packets fast enough from the jitter buffer (I get 2 packets at once; notice the timestamps and the Drift value):

[2025-04-08 17:35:47.773] [test] [info] PreFrame: Got MessageData_EncServerCommand SeqId: 7 ServerTime 244 OpCode: 1
[2025-04-08 17:35:47.773] [test] [info] PreFrame: Buffer Insert: 7 Buf Read Seq: 1, Buf Write Seq: 6 Drift: 5
[2025-04-08 17:35:47.773] [test] [info] PreFrame: Got MessageData_EncServerCommand SeqId: 8 ServerTime 278 OpCode: 1
[2025-04-08 17:35:47.773] [test] [info] PreFrame: Buffer Insert: 8 Buf Read Seq: 1, Buf Write Seq: 7 Drift: 6
...
[2025-04-08 17:35:48.410] [test] [info] PreFrame: Got MessageData_EncServerCommand SeqId: 26 ServerTime 902 OpCode: 1
[2025-04-08 17:35:48.410] [test] [info] PreFrame: Buffer Insert: 26 Buf Read Seq: 3, Buf Write Seq: 9 Drift: 6
[2025-04-08 17:35:48.410] [test] [info] PreFrame: Got MessageData_EncServerCommand SeqId: 27 ServerTime 937 OpCode: 1
[2025-04-08 17:35:48.410] [test] [info] PreFrame: Buffer Insert: 27 Buf Read Seq: 3, Buf Write Seq: 10 Drift: 7

Every 40-50 packets or so I’d start getting 2 packets from the server in a single tick, causing the drift between my read and write sequence numbers to get bigger and bigger. This eventually leads to the dreaded overflow, where my write sequence laps the read sequence and starts overwriting data that hasn’t been read yet.
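The Drift value in those logs behaves like the forward distance from the read position to the write position around the ring. A minimal sketch of that calculation, assuming both indices are already reduced modulo the buffer size; RingDistance is a hypothetical helper and the 12-slot size in the usage below is taken from the jitter buffer size mentioned earlier:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: forward distance from Read to Write on a ring of
// Size slots. Both indices are assumed to be in [0, Size).
constexpr std::uint32_t RingDistance(std::uint32_t Read, std::uint32_t Write, std::uint32_t Size)
{
	// Adding Size before subtracting keeps the unsigned math from underflowing
	// when Write has wrapped around and is numerically smaller than Read.
	return (Write + Size - Read) % Size;
}
```

For the log lines above, RingDistance(1, 6, 12) gives 5 and RingDistance(3, 10, 12) gives 7. Note that a result of 0 is ambiguous in this form: it can mean the buffer is empty or that the writer has lapped the reader, which is why letting the drift approach the buffer size is dangerous.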

So, some client side throttling was in order. One problem with my current design is that my game loop and tick rate are tightly coupled, meaning things must happen in a pretty specific order or the simulations will not work across the network.

For example, my sequence ids lock in what my simulation results will be. I can’t send packets without incrementing those ids, and I can’t increment those ids without running a simulation tick. I can’t just “send packets faster.” But what I can do is run my entire game loop faster.

Keep in mind my physics + netcode are all handled within my PMO game library, which is shared between clients and servers. The rendering is all done in Unreal and I can separate the two quite easily. Basically, the client will have no idea if I’m dynamically adjusting the physics/netcode tick rate.

What’s great about having flecs handle my game loop is that I can dynamically adjust the FPS of my client using flecs’ set_target_fps. One thing to note, I could not actually get the FPS modifications to work if I called world.set_target_fps(NewValue) inside of a flecs system. I had to call it outside of flecs. Maybe I needed to add a defer_begin/defer_end?

When I process server messages on the client, I extract some details from the jitter buffer:

// Our network buffer is ready so we can play
GameWorld.system<network::ClientJitterBuffer, NetworkReady>("OnProcessUpdate")
	.write<network::PhysicsDesynced>()
	.kind(flecs::PreFrame)
	.each([&](flecs::iter& It, size_t Index, network::ClientJitterBuffer &Buffer, NetworkReady)
{
	auto GameData = Buffer.Get();
	auto Command = Game::Message::GetServerCommand(GameData->Data->data());
	auto Drift = math::Distance(Buffer.ReadSequence%Buffer.JitterSize, Buffer.WriteSequenceId%Buffer.JitterSize, Buffer.JitterSize);
	
	switch (Drift)
	{
		case 1: case 2:
			SuggestedFPS = 28;
		break;
		case 3:
			SuggestedFPS = 29;
		break;
		case 4: case 5: case 6:
			SuggestedFPS = 30;
		break;
		case 7:
			SuggestedFPS = 33;
		break;
		case 8:
			SuggestedFPS = 34;
		break;
		default:
			SuggestedFPS = 40;
		break;
	}
	PMOWorld->ProcessServerUpdate(Command);
});

Then, in my game loop I simply adjust the FPS with the newly suggested value:

GameWorld.set_target_fps(GameWorld.get<network::NetworkClient>()->SuggestedFPS);

Altering the FPS, while maybe not the best method, was the easiest due to how my game loop works. Lo and behold, as my client starts to drift (as I get 2 packets at once), I start running my game client a little bit faster, which lets me drain the jitter buffer just fast enough that the server data doesn’t start overwriting itself before I have time to read it.
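To see why nudging the tick rate is enough, here is a toy simulation of the idea. Everything in it is an assumption made for illustration: the server writes 30 packets per second, the client really achieves only 98% of whatever FPS it asks for, and the backlog starts at 5. SuggestedFpsFor mirrors the switch statement from the system above; SimulateMaxDrift is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the drift-to-FPS table from the post.
inline int SuggestedFpsFor(std::uint32_t Drift)
{
	switch (Drift)
	{
		case 1: case 2:         return 28;
		case 3:                 return 29;
		case 4: case 5: case 6: return 30;
		case 7:                 return 33;
		case 8:                 return 34;
		default:                return 40;
	}
}

// Toy model: returns the largest backlog (drift) seen during the run.
inline std::uint32_t SimulateMaxDrift(bool Throttle, int Seconds)
{
	const std::int64_t ServerIntervalUs = 1000000 / 30; // assumed server send rate
	const std::int64_t EndUs = static_cast<std::int64_t>(Seconds) * 1000000;
	std::int64_t NextWriteUs = 0;
	std::int64_t NextReadUs = 0;
	std::uint32_t Drift = 5; // assumed initial backlog
	std::uint32_t MaxDrift = Drift;

	for (std::int64_t NowUs = 0; NowUs < EndUs; NowUs += 100)
	{
		if (NowUs >= NextWriteUs)
		{
			++Drift; // a server packet lands in the jitter buffer
			if (Drift > MaxDrift) MaxDrift = Drift;
			NextWriteUs += ServerIntervalUs;
		}
		if (NowUs >= NextReadUs)
		{
			const int Fps = Throttle ? SuggestedFpsFor(Drift) : 30;
			if (Drift > 0) --Drift; // the client consumes one packet per tick
			// Assumed: the client only reaches 98% of the requested rate.
			NextReadUs += static_cast<std::int64_t>(100000000) / (Fps * 98);
		}
	}
	return MaxDrift;
}
```

In this model, a fixed 30 FPS client falls further and further behind over a minute, while the throttled client keeps its backlog below the 12-slot jitter buffer size. The numbers are not measurements from the game, only a way to reason about the feedback loop.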

What’s left

Well, two things. One: I want to work on something else, it’s been three bloody months of sweat and tears to get network synchronization and rollback to work. I really want to work on other things! But two, I need to also play with what I have in worse network conditions.

I was about to implement a custom network simulation layer until a friend (who is also building their own MMORPG) told me about clumsy, which basically does everything I need for testing various network conditions (latency, throttling, out-of-order delivery, etc.). So I absolutely need to test my network stack with this when I get a moment.

I also need to implement network smoothing for when a client DOES legitimately desync, so I can slowly adjust the player to their real location. But honestly I’m just amazed I got to this point. It’s been a 3 month slog and I’m exhausted.