Thursday, January 26, 2023

InterlockedCompareExchange without volatile.

cl /LD /Z7 /GS- /O1 a.c /link /noentry /incremental:no && link /dump /disasm a.dll

__int64 _InterlockedCompareExchange64(volatile __int64*, __int64, __int64);

#define LOCK \
  __int64 lock = *plock; \
  return (lock & 1) && _InterlockedCompareExchange64(plock, lock & ~1, lock) == lock;

__declspec(dllexport) __int64 BrokenTryLock(__int64* plock)
{
    LOCK
}

__declspec(dllexport) __int64 WorkingTryLock(volatile __int64* plock)
{
    LOCK
}

BrokenTryLock:

  mov         r8d,1
  test        byte ptr [rcx],r8b
  je          1C
  mov         rdx,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  mov         rax,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          1F
1C:
  xor         r8d,r8d
1F:
  mov         rax,r8
  ret

WorkingTryLock:
  mov         rax,qword ptr [rcx]
  mov         r8d,1
  test        r8b,al
  je          40
  mov         rdx,rax
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          43
40:
  xor         r8d,r8d
43:
  mov  rax, r8
  ret

This is a common pattern.

It is correct with volatile.

Let's break down what is going on.

The idea is that we are trying to take a lock.

Only trying. Under contention, the lock is not acquired and false is returned. If there is to be a spin or spin with backoff, etc., that is up to the caller. This function has a small finite maximum number of operations, no loop.


Restated:

  First do a cheaper non-interlocked check, and if the lock appears not held, do a "real" interlocked checked, which can also fail. Return true if the lock was acquired.

x86 is "complex". Specifically, while it has separate load/store ("mov") instructions, most instructions can also access memory while doing other work. There are many examples. You can add to memory. You can call or jump through a pointer in memory. This may sound obvious, of course you can do these operations, on all processors, but most processors require a separate load or store prior to or after add, jmp, call etc. to load the operands or store the results (only when needed of course; RISC architectures also have more registers and get more done without accessing memory). For further example, consider that x86 call stores the return address in memory. But on ARM the return address goes into the link register, there is no store to memory (unless needed in non-leaf prior to making another call). And x86 ret reads the return address from memory. ARM return is branch to register, typically link register. That had to be separately loaded from memory (if spilled in non-leaf).

RISC architectures separate load/store instructions from almost everything else.

When optimizing for size, the compiler has taken advantage of x86 and combined the mov and the test. This part of the dynamic path is one instruction shorter. When the lock is under contention, this function likely runs one fewer instruction than the speed-optimized version (not shown). This is considered the rare slow path. The number of memory accesses in this case is unchanged.

When the lock is not under contention, the size-optimized version still needs the value of the mov, and so it is issued in the second part of the function when size-optimized. In the volatile or speed-optimized form, the value from the mov is retained from before the test. As a result, the static number of instructions in either case is the same.

And due to vagaries of x86 variably sized instructions, the size-optimized form is one byte larger. In the common uncontended case, the size-optimized form turns one memory access into three (x86 hides memory latencies very well, so I do not presume this is slower, maybe).

Other than the varying efficiencies, the semantics are significantly different, and the size-optimized form causes problems. The problem is that the "compare" value is read, the "exchange" value is computed from it, and then "compare" is read again. The exchange value is no longer a function of the actual compare value that is used.

Consider a way that people sometimes write reference counting:

void increment(long* p)
{
long old, next;
do {
old = *p;
next = old + 1;
} while (InterlockedCompareExchange(p, next, old) != old);
}

This gets transformed into more like:

void increment(long* p)
{
long old, next;
do {
next = *p + 1;
old = *p; // problem
} while (InterlockedCompareExchange(p, next, old) != old);
}

Between the computation of next, and reread from p to form old, another thread can have incremented the value already. The thread losing the race will write back the value that was already there, instead of incrementing. The compare should fail, but is more likely to succeed (unless there is yet another increment in the mean time).

When it comes time to decrement, if the race is not duplicated (highly unlikely!) an extra decrement will occur.


A complete program that shows the problem. It works, as long as /O1 is not used.

// One thread flips high bit, one thread increments lower 63 bits.
// Counter should never go down.
// int64_t cannot really overflow. Nor int63.
//
// To see this program fail:
//   cl /O1 inc.cpp && inc.exe
// To see it succeed:
//   cl /O2 inc.cpp && inc.exe
// or:
//   cl inc.cpp && inc.exe

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define LOCK    (0x800000000000i64)
#define COUNTER (0x7fffffffffffi64)

// Ordinally the return value of InterlockedCompareExchange
// is compared with old to see if the exchange succeeded.
// For this demonstration-only program, that is not needed.

DWORD Flip(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = old ^ LOCK;
        InterlockedCompareExchange64(p, next, old);
    }
}

// This thread increments the counter and..it does not
// matter what it does to the lock. For sample purposes
// it is always clearing.

DWORD Inc(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = ((old & COUNTER) + 1);
        InterlockedCompareExchange64(p, next, old);
    }
}

int main()
{
    int64_t volatile i = 0;
    CreateThread(0, 0, Inc, const_cast<int64_t*>(&i), 0, 0);
    CreateThread(0, 0, Flip, const_cast<int64_t*>(&i), 0, 0);
    while (true)
    {
        // The counter should only go up, or stay the same.
        // But it goes down sometimes due to the problem.
        int64_t j1 = (i & COUNTER);
        int64_t j2 = (i & COUNTER);
        if (j2 < j1)
        {
            printf("error:%lld %%lld\n", j1, j2);
            exit(1);
        }
    }
}