Monday, November 25, 2024

ARM64 assembly learnings and cheat sheet.

ldr x0, [x1,#n]

 x0 = *(x1 + n)


ldr x0, [x1],#n

 x0 = *x1

 x1 += n


ldr x0, [x1,#n]!

 x1 += n

 x0 = *x1
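
Roughly, in C (my own sketch, not from any ARM reference; assuming 64-bit loads and a byte offset n):

#include <stdint.h>

uint64_t ldr_offset(const uint64_t* x1, int n)       // ldr x0, [x1,#n]
{
    return *(const uint64_t*)((const char*)x1 + n);
}

uint64_t ldr_post_index(const uint64_t** x1, int n)  // ldr x0, [x1],#n
{
    uint64_t x0 = **x1;                               // load first
    *x1 = (const uint64_t*)((const char*)*x1 + n);    // then advance x1
    return x0;
}

uint64_t ldr_pre_index(const uint64_t** x1, int n)   // ldr x0, [x1,#n]!
{
    *x1 = (const uint64_t*)((const char*)*x1 + n);    // advance x1 first
    return **x1;                                      // then load
}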

Friday, September 27, 2024

Ptrdiff_t is sorta signed.

We are using qsort/bsearch on pointers.
The code looked like this:

typedef struct { int64_t a[0xb0 / 8]; } Data;

int compare(void** p1, void** p2)
{
  ptrdiff_t delta = (*(Data**)p1 - *(Data**)p2);
  return (delta < 0) ? -1 : ((delta > 0) ? 1 : 0);
}

This is somewhat idiomatic and sometimes works, but often wrong or undefined.
In the past I'd seen this go wrong due to subtracting a larger unsigned from a smaller unsigned.
You should always write more like:

  return (*p1 < *p2) ? -1 : ((*p1 > *p2) ? 1 : 0);

But I hadn't seen this go wrong with pointers. Well, I hadn't seen it with pointers much.
Here is a program using the same behaviors:
#include <stdlib.h>
#include <intrin.h>
#define assert(x) ((void)((x) ? (void)0 : __debugbreak()))
typedef struct { __int64 a[0xb0 / 8]; } Data;
int main()
{
    while (1)
    {
        Data* d1 = (Data*)malloc(sizeof(Data));
        Data* d2 = (Data*)malloc(sizeof(Data));
        assert(((d1 - d2) < 0) == (d1 < d2));
        free(d1);
        free(d2);
    }
}

The problem in all these cases is that comparison and subtraction of pointers is only defined if the pointers are to elements within the same array. So what? What difference might that make? It means the difference of the pointers must be a multiple of the size of the type they point to. It means the unsigned and signed division implied by the subtraction will return the same result. They only vary in rounding, i.e. when the answer is not a whole number. But if the difference is not a whole multiple, then unsigned and signed division will round to different values.
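
A worked example of that rounding difference, with a hypothetical byte delta of -8 and sizeof(Data) == 0xb0 == 176:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    // The pointer subtraction divides the raw byte difference by sizeof(Data).
    // When the difference is not a multiple of 176, signed and unsigned
    // division disagree.
    int64_t  signed_bytes   = -8;
    uint64_t unsigned_bytes = (uint64_t)-8;                       // 0xFFFFFFFFFFFFFFF8
    printf("%lld\n", (long long)(signed_bytes / 176));            // 0, so "equal"
    printf("%llu\n", (unsigned long long)(unsigned_bytes / 176)); // huge, so "greater"
}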

And so why did this come up? The code was failing. What changed to make it fail?
I had changed from the build system's default WholeProgramOptimization (/LTCG), which is incremental (link /LTCG:incremental), to regular /LTCG.

/LTCG:incremental divides signed in this case, and the code depended on it.
/LTCG divides unsigned, and that broke the code.

I changed the code to compare with <. I suspect that to fully avoid undefined behavior, we should cast to char*, uintptr_t, or intptr_t (should pointers be considered signed or not?).
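
Something like this sketch, which compares via uintptr_t instead of subtracting (my names; not the exact production fix):

#include <stdint.h>

static int ComparePointers(const void* p1, const void* p2)
{
    // The array being sorted holds Data* elements; compare the pointer
    // values themselves rather than their difference.
    uintptr_t a = (uintptr_t)*(void* const*)p1;
    uintptr_t b = (uintptr_t)*(void* const*)p2;
    return (a < b) ? -1 : ((a > b) ? 1 : 0);
}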

Monday, August 21, 2023

string literals as template parameters, sort of

It is a very long-standing problem of C++ that you cannot use string literals as template parameters.

I acknowledge that macros are bad, however...


C++ lambdas have a special property of generating anonymous unique types, and therefore provide a scope in which to create additional types.

This seems to be quite useful. It seems like producing types (or functions) within lambdas within macros provides template-like powers, and not in the bad old way that macros on their own do.

I'm not sure where all this will go, but here is a start:

#include <stdio.h>
#include <string>

#define STRING_TEMPLATE_FUNCTION(str) \
    []() { struct Type { static const char* Function() { return str; } }; return Type::Function; }()

#define STRING_TEMPLATE_CLASS(str) \
    [=]() { struct Type : std::string { using std::string::operator=; Type() : std::string(str) { } }; return []{ return Type(); }(); }()

int main()
{
    auto foo = STRING_TEMPLATE_FUNCTION("foo");
    auto bar = STRING_TEMPLATE_CLASS("bar");
    auto bar2 = STRING_TEMPLATE_CLASS("bar2");

    printf("%s\n", foo());
    printf("%s\n", bar.c_str());
    printf("%s\n", bar2.c_str());
    bar = bar2;
    printf("%s\n", bar.c_str());
}


Friday, August 11, 2023

A bit on modulo arithmetic.

Let's consider implementing the AMD64 adc add with carry instruction.

Add without carry is easy.

But how do we accommodate the extra carry_in?

How do we compute the flags? Unsigned carry and signed overflow?

Aka last carry and second to last carry.


Doing the operation bit-wise is easy. You can loop over the bits.

But what about using C++ unsigned::operator+ to do almost all the work in one operation?


intsafe.h's LongLongAdd (I wrote it) teaches us, without carry_in:


 - Adding positive to negative never overflows.

 - If you add two positive numbers, you expect a positive result.

 - If you add two negative numbers, you expect a negative result.

 - Overflow if inputs have the same sign and output has the other sign.


The HPPA manual says something similar, I was pleased to later read.


But intsafe.h functions have no carry_in. What about it?

QEMU seems to ignore it for flag computation (and is correct).

So, the answer is, if you consider the edge cases, adding carry_in does not change their outcome. Or rather, it does not change the correctness of the rules.

So these are the rules.


z = x + y + carry;

signed_overflow = (x < 0) == (y < 0) && (x < 0) != (z < 0);


Let's go through edge cases to confirm.

We will use 4bit numbers. The third term in any addition is carry_in.

Negative numbers are interchangeable with large positive numbers.

If the output is shown as having two digits, the first is the unsigned carry_out.

Base 16 is implied. F == 0xF == decimal 15.

8 == -8; F == -1

The value is considered negative if the second digit is higher than 7.


check large positive:

  7+7+0=E overflowed

  7+7+1=F overflowed

check medium positive:

  3+4+0=7 no overflow

  3+4+1=8 overflowed

check negatives:

  8+8+0=10 overflowed

  8+8+1=11 overflowed

  F+F+0=1E=-2 no overflow

  F+F+1=1F=-1 no overflow

check mix of positive and negative:

  7+8+0=F=-1 no overflow

  7+8+1=10=0 no overflow

  7+F+0=16=6 no overflow

  7+F+1=17=7 no overflow


How about carry?

intsafe.h's ULongLongAdd is less informative. Carry occurs if the output is less than either input.

This does not work with carry_in. The output is equal to the inputs if adding max + max + carry.

Like adding 4bit numbers: F+F+1=1F


Again, QEMU does not seem to consider it (and is correct).


So, think again, more like LongLongAdd:

 - Adding two "negative" (unsigned!) numbers always carries. Edge case: 8+8=10

 - Adding two positive numbers never carries. Edge case: 7+7=E

 - Adding positive and "negative" carries if the result is positive. Edge cases: 1+F=10; 8+7=F


And again, adding carry_in does not push any edge case over the edge.

8+8+1=11

7+7+1=F

1+F+1=11

8+7+1=10 this one seems like carry_in made a difference. It did push the result into having carry_out, but the same rules work and do not need to consider carry_in.


Why does carry_in not matter in this regard?

I think it has something to do with the asymmetric range. There is one more negative number than positive numbers, and carry_in is always positive.


unsigned_carry = (x < 0 && y < 0) || ((x < 0) != (y < 0) && z >= 0);
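
Putting both rules into code, a minimal sketch for emulating 64-bit adc (my wording of the rules above; "negative" here means the sign bit is set):

#include <stdint.h>

typedef struct { uint64_t sum; int carry; int overflow; } AdcResult;

AdcResult adc64(uint64_t x, uint64_t y, int carry_in)
{
    uint64_t z = x + y + (carry_in != 0);             // wraps modulo 2^64
    int xn = (int64_t)x < 0, yn = (int64_t)y < 0, zn = (int64_t)z < 0;
    AdcResult r;
    r.sum = z;
    r.overflow = (xn == yn) && (xn != zn);            // same-sign inputs, different-sign result
    r.carry = (xn && yn) || ((xn != yn) && !zn);      // no special case needed for carry_in
    return r;
}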


Tuesday, July 25, 2023

Placing string literals in sections.

 We have an assertion implementation something like this:

#define RetailAssert(expr) \
    __declspec(allocate(".rdata$cold")) static const char exp[] = #expr; \
    if (!(expr)) { fprintf(stderr, "%s(%d): assertion failed:%s\n", __FILE__, __LINE__, exp); abort(); }

__FILE__ and the format string should be similarly cold too.

I would like to make it an expression, like standard assert, something like this:

#define RetailAssert(expr) \
    ((expr) ? (void)0 : (void)(fprintf(stderr, "%s(%d): assertion failed:%s\n", __FILE__, __LINE__, __declspec(allocate(".rdata$cold"))(#expr)), abort()))

But __declspec(allocate) seemingly cannot be used with string literals.

This seems like a language hole. String literals are their own special thing.

There is #pragma const_seg, and __pragma(const_seg).

They do not quite work because they operate at function level.

Like, the last __pragma applies to its entire enclosing function.

Lately I have found that lambdas are very good at breaking things into functions, perhaps unnaturally, where you don't really intend to have a function, just chunks of code, in order to gain functionality only offered to functions (such as __declspec(noinline)).

So here:

// Place a string literal in a section.
// For example, for hot/cold split of assertion messages in retail builds.
// const_seg and data_seg for /GF and not
// 4177: pragma 'data_seg' should only be used at global scope or namespace scope
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#undef NDEBUG
#include <assert.h>

#pragma section("1", read)
#pragma section("2", read)
#pragma section("3", read)

#define SEGSTR(seg, str)                \
   ([] { return                         \
    __pragma(warning(suppress:4177))    \
    __pragma(const_seg(push, seg))      \
    __pragma(data_seg(push, seg))       \
        str;                            \
    }                                   \
    __pragma(const_seg(pop))            \
    __pragma(data_seg(pop)) ())

extern "C" { extern IMAGE_DOS_HEADER __ImageBase; }

 

static PCH segNameOf(const void* p)

{

    PIMAGE_DOS_HEADER dos = &__ImageBase;

    PCH base = (PCH)dos;

    PIMAGE_NT_HEADERS nt = (PIMAGE_NT_HEADERS)(dos->e_lfanew + (PCH)dos);

    PIMAGE_SECTION_HEADER sec = IMAGE_FIRST_SECTION(nt);

    size_t nsec = nt->FileHeader.NumberOfSections;

    size_t i = 0;

    for (i = 0; i < nsec; ++i)

    {

        if (p >= (base + sec->VirtualAddress) && p < (base + sec->VirtualAddress + sec->Misc.VirtualSize))

        {

            //printf("vprot %X\n", sec->Characteristics);

            return (PCH)sec->Name;

        }

        ++sec;

    }

    return "";

}


__declspec(noinline) 

void assert_failed(const char* expression, const char* file, int line)

{

    const char* a = SEGSTR("3", "%s(%d): assertion failed %s\n");

    printf("default:%s 1:%s 2:%s 3:%s\n", segNameOf(""), segNameOf(expression), segNameOf(file), segNameOf(a));


    fprintf(stderr, a, file, line, expression);

    //abort();

}


#define Assert(exp) ((exp) ? ((void)0) : assert_failed(SEGSTR("1", #exp), SEGSTR("2", __FILE__), __LINE__))

 

int main(int argc, char** argv)

{

    Assert(argc==1);

    printf("default:%s\n", 1+segNameOf(""));

    printf("text:%s\n", 1+segNameOf(&main));

    static const char rdata[]="";

    static char data[]="";

    printf("rdata:%s\n", 1+segNameOf(&rdata));

    printf("data:%s\n", 1+segNameOf(&data));

    assert(!strcmp("rdata", 1+segNameOf(&rdata)));

    assert(!strcmp("data", 1+segNameOf(&data)));

    assert(!strcmp("text", 1+segNameOf(&main)));

}



This is maybe not quite the final answer, because I have found that captureless, parameterless lambdas have an extra cost associated with them: being passed their unused "this" pointer. This can be mitigated by using a local class with a static member function inside a lambda. I will make that adjustment later. Well, in this case the codegen looks OK actually. I guess the compiler only passes the lambda its address when the lambda with the real code is larger, or maybe the newer compiler fixed this.

Wednesday, February 1, 2023

More Interlocked problems.

 There are helpers like:

template <typename T>
bool TryCompareExchange(T volatile * p, T exchange, T* compare)
{
    T old = Traits<T>::InterlockedCompareExchange(p, exchange, *compare);
    bool success = (old == *compare);
    if (!success) *compare = old;
    return success;
}

Traits handles char, short, int, long, __int64, unsigned, etc., and same-sized structs (hopefully with consistent/zeroed padding).
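
For concreteness, a hedged sketch of one Traits specialization (the real Traits is not shown above, so this shape is a guess), using the MSVC _InterlockedCompareExchange intrinsic:

#include <intrin.h>

template <typename T> struct Traits;

template <> struct Traits<long>
{
    static long InterlockedCompareExchange(long volatile* p, long exchange, long compare)
    {
        // The intrinsic returns the value that was at *p before the operation.
        return _InterlockedCompareExchange(p, exchange, compare);
    }
};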

Leading to uses like:

void increment(long volatile * p)
{
    long old = *p;
    while (!TryCompareExchange(p, old + 1, &old)) ;
}

The problem comes when used with InterlockedCompareExchange128.
There is no atomic get/set128.
They can be synthesized from InterlockedCompareExchange128.
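
For the 128-bit case, a hedged sketch of what the specialization might look like (Pair is my invention; x64 only, and the destination must be 16-byte aligned). _InterlockedCompareExchange128 has an awkward shape: success is the return value, and the observed value is always written back through the last parameter, which is exactly the "old" value the helper wants:

#include <intrin.h>

struct alignas(16) Pair
{
    __int64 low;
    __int64 high;
    bool operator==(const Pair& o) const { return low == o.low && high == o.high; }
};

template <typename T> struct Traits;

template <> struct Traits<Pair>
{
    static Pair InterlockedCompareExchange(Pair volatile* p, Pair exchange, Pair compare)
    {
        // Overwrites 'compare' with the value observed at *p, whether or not
        // the exchange happened.
        _InterlockedCompareExchange128(&p->low, exchange.high, exchange.low, &compare.low);
        return compare;
    }
};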

There are at least two incorrect forms, which do work in non-race conditions, so lull us into complacency ("it works"):

template <typename T>
T Get(T volatile * p)
{
    T value;
    while (!TryCompareExchange(p, *p, &value)) ; // incorrect code
    return value;
}

*p is a non-atomic read.
If *p happens to match the uninitialized value, then the non-atomically read *p will be written back.
This can be mitigated by initializing value to a never-set value.
But finding such a value is not really needed (keep reading).


template <typename T>
void Set(T volatile * p, T value)
{
    TryCompareExchange(&value, value, p); // incorrect code
}

The write through the last parameter is not atomic.


The correct forms:

template <typename T>
T Get(T volatile * p)
{
    T value;
    (void)TryCompareExchange(p, value, &value);
    return value;
}

If the compare happens to match, the same value is written back, atomically; either way, value ends up holding the current contents and is returned.

 

template <typename T>
void Set(T volatile * p, T value)
{
    T old = *p; // non-atomic read, ok
    while (!TryCompareExchange(p, value, &old)) ; 
}

Loop until the compare matches (the non-atomic read only seeds the first guess; each failed TryCompareExchange refreshes old with the value actually seen), at which point the desired value is written atomically.

Thursday, January 26, 2023

InterlockedCompareExchange without volatile.

cl /LD /Z7 /GS- /O1 a.c /link /noentry /incremental:no && link /dump /disasm a.dll

__int64 _InterlockedCompareExchange64(volatile __int64*, __int64, __int64);

#define LOCK \
  __int64 lock = *plock; \
  return (lock & 1) && _InterlockedCompareExchange64(plock, lock & ~1, lock) == lock;

__declspec(dllexport) __int64 BrokenTryLock(__int64* plock)
{
    LOCK
}

__declspec(dllexport) __int64 WorkingTryLock(volatile __int64* plock)
{
    LOCK
}

BrokenTryLock:

  mov         r8d,1
  test        byte ptr [rcx],r8b
  je          1C
  mov         rdx,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  mov         rax,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          1F
1C:
  xor         r8d,r8d
1F:
  mov         rax,r8
  ret

WorkingTryLock:
  mov         rax,qword ptr [rcx]
  mov         r8d,1
  test        r8b,al
  je          40
  mov         rdx,rax
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          43
40:
  xor         r8d,r8d
43:
  mov  rax, r8
  ret

This is a common pattern.

It is correct with volatile.

Let's break down what is going on.

The idea is that we are trying to take a lock.

Only trying. Under contention, the lock is not acquired and false is returned. If there is to be a spin or spin with backoff, etc., that is up to the caller. This function has a small finite maximum number of operations, no loop.


Restated:

  First do a cheaper non-interlocked check, and if the lock appears not held, do a "real" interlocked check, which can also fail. Return true if the lock was acquired.

x86 is "complex". Specifically, while it has separate load/store ("mov") instructions, most instructions can also access memory while doing other work. There are many examples. You can add to memory. You can call or jump through a pointer in memory. This may sound obvious, of course you can do these operations, on all processors, but most processors require a separate load or store prior to or after add, jmp, call etc. to load the operands or store the results (only when needed of course; RISC architectures also have more registers and get more done without accessing memory). For further example, consider that x86 call stores the return address in memory. But on ARM the return address goes into the link register, there is no store to memory (unless needed in non-leaf prior to making another call). And x86 ret reads the return address from memory. ARM return is branch to register, typically link register. That had to be separately loaded from memory (if spilled in non-leaf).

RISC architectures separate load/store instructions from almost everything else.

When optimizing for size, the compiler has taken advantage of x86 and combined the mov and the test. This part of the dynamic path is one instruction shorter. When the lock is under contention, this function likely runs one fewer instruction than the speed-optimized version (not shown). This is considered the rare slow path. The number of memory accesses in this case is unchanged.

When the lock is not under contention, the size-optimized version still needs the value of the mov, and so it is issued in the second part of the function when size-optimized. In the volatile or speed-optimized form, the value from the mov is retained from before the test. As a result, the static number of instructions in either case is the same.

And due to the vagaries of x86's variably sized instructions, the size-optimized form is one byte larger. In the common uncontended case, the size-optimized form turns one memory access into three (x86 hides memory latencies very well, so I do not presume this is slower, but maybe).

Other than the varying efficiencies, the semantics are significantly different, and the size-optimized form causes problems. The problem is that the "compare" value is read, the "exchange" value is computed from it, and then "compare" is read again. The exchange value is no longer a function of the actual compare value that is used.
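
As an aside, a hedged sketch of another way to keep the single read without changing the signature (AlsoTryLock is my name; it relies on MSVC honoring the read through a volatile lvalue, and I would still prefer the volatile parameter):

__declspec(dllexport) __int64 AlsoTryLock(__int64* plock)
{
    // One read through a volatile lvalue; both the cheap test and the
    // comparand come from the same load, so there is nothing to re-read.
    __int64 lock = *(volatile __int64*)plock;
    return (lock & 1) && _InterlockedCompareExchange64(plock, lock & ~1, lock) == lock;
}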

Consider a way that people sometimes write reference counting:

void increment(long* p)
{
    long old, next;
    do {
        old = *p;
        next = old + 1;
    } while (InterlockedCompareExchange(p, next, old) != old);
}

This gets transformed into more like:

void increment(long* p)
{
    long old, next;
    do {
        next = *p + 1;
        old = *p; // problem
    } while (InterlockedCompareExchange(p, next, old) != old);
}

Between the computation of next and the reread of p to form old, another thread can have incremented the value already. The thread losing the race will write back the value that was already there, instead of incrementing. The compare should fail, but is more likely to succeed (unless there is yet another increment in the meantime).

When it comes time to decrement, if the race is not duplicated (highly unlikely!), an extra decrement will occur.


A complete program that shows the problem. It works, as long as /O1 is not used.

// One thread flips a high (lock) bit, one thread increments the counter bits below it.
// The counter should never go down.
// int64_t cannot really overflow. Nor can the narrower counter.
//
// To see this program fail:
//   cl /O1 inc.cpp && inc.exe
// To see it succeed:
//   cl /O2 inc.cpp && inc.exe
// or:
//   cl inc.cpp && inc.exe

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define LOCK    (0x800000000000i64)
#define COUNTER (0x7fffffffffffi64)

// Ordinarily the return value of InterlockedCompareExchange
// is compared with old to see if the exchange succeeded.
// For this demonstration-only program, that is not needed.

DWORD WINAPI Flip(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = old ^ LOCK;
        InterlockedCompareExchange64(p, next, old);
    }
}

// This thread increments the counter and... it does not
// matter what it does to the lock. For sample purposes
// it always clears it.

DWORD WINAPI Inc(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = ((old & COUNTER) + 1);
        InterlockedCompareExchange64(p, next, old);
    }
}

int main()
{
    int64_t volatile i = 0;
    CreateThread(0, 0, Inc, const_cast<int64_t*>(&i), 0, 0);
    CreateThread(0, 0, Flip, const_cast<int64_t*>(&i), 0, 0);
    while (true)
    {
        // The counter should only go up, or stay the same.
        // But it goes down sometimes due to the problem.
        int64_t j1 = (i & COUNTER);
        int64_t j2 = (i & COUNTER);
        if (j2 < j1)
        {
            printf("error:%lld %%lld\n", j1, j2);
            exit(1);
        }
    }
}