Friday, September 27, 2024

Ptrdiff_t is sorta signed.

We are using qsort/bsearch on pointers.
The code looked like this:

typedef struct { int64_t a[0xb0 / 8]; } Data;

int compare(void** p1, void** p2)
{
  ptrdiff_t delta = (*(Data**)p1 - *(Data**)p2);
  return (delta < 0) ? -1 : ((delta > 0) ? 1 : 0);
}

This is somewhat idiomatic and sometimes works, but often wrong or undefined.
In the past I'd seen this go wrong due to subtracting a larger unsigned from a smaller unsigned.
You should always write more like:

  return (*p1 < *p2) ? -1 : ((*p1 > *p2) ? 1 : 0);

But I hadn't seen this go wrong with pointers. Well, I hadn't seen it used with pointers much.
Here is a program using the same behaviors:
#include <stdlib.h>
#define assert(x) ((void)((x) ? (void)0 : __debugbreak()))
typedef struct { __int64 a[0xb0 / 8]; } Data;
int main()
{
    while (1)
    {
        Data* d1 = (Data*)malloc(sizeof(Data));
        Data* d2 = (Data*)malloc(sizeof(Data));
        assert(((d1 - d2) < 0) == (d1 < d2));
        free(d1);
        free(d2);
    }
}

The problem in all these cases is that comparison and subtraction of pointers is only defined if the pointers are to elements within the same array. So what? What difference might that make? It means the difference of the pointers must be a multiple of the size of the type they point to, which in turn means the unsigned and signed division implied by the subtraction return the same result. They only vary in rounding, i.e. when the answer is not a whole number. But if the difference is not a whole multiple of the element size, unsigned and signed division can round to different values.
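A small illustration of that last point (mine, not from the original failure): take a byte delta that is not a multiple of sizeof(Data) == 0xb0 and divide it both ways.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int64_t byte_delta = -8;                   // pointers 8 bytes apart, p1 below p2
    int64_t s = byte_delta / 0xb0;             // signed: 0, so a comparator would say "equal"
    uint64_t u = (uint64_t)byte_delta / 0xb0;  // unsigned: a huge "positive" quotient
    printf("signed %lld unsigned %llu\n", (long long)s, (unsigned long long)u);
    return 0;
}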

And so why did this come up? The code was failing. What changed to make it fail?
I had changed from the build system's default WholeProgramOptimization (/LTCG), which is incremental (link /LTCG:incremental), to regular /LTCG.

/LTCG:incremental divides signed in this case, and the code depended on it.
/LTCG divides unsigned, and that broke the code.

I changed the code to compare with <. I suspect that to fully avoid undefined behavior, we should cast to char*, uintptr_t, or intptr_t (should pointers be considered signed or not?).
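For reference, a minimal sketch (mine) of the comparator written that way, comparing through uintptr_t (the resulting order is implementation-defined, but the behavior is no longer undefined):

#include <stdint.h>

typedef struct { int64_t a[0xb0 / 8]; } Data;

// qsort/bsearch comparator over an array of Data*. Comparing as integers
// avoids both the pointer subtraction (and its implied division) and the
// relational comparison of unrelated pointers.
static int compare(const void* p1, const void* p2)
{
    uintptr_t a = (uintptr_t)*(Data* const*)p1;
    uintptr_t b = (uintptr_t)*(Data* const*)p2;
    return (a < b) ? -1 : ((a > b) ? 1 : 0);
}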

Monday, August 21, 2023

string literals as template parameters, sort of

It is a very long-standing problem of C++ that you cannot use string literals as template parameters.

I acknowledge that macros are bad, however...

C++ lambdas have a special property of generating anonymous unique types, and therefore provide a scope in which to create additional types.

This seems to be quite useful. Producing types (or functions) within lambdas within macros provides template-like powers, and not in the bad old way that macros on their own do.

I'm not sure where all this will go, but here is a start:

#include <stdio.h>
#include <string>

#define STRING_TEMPLATE_FUNCTION(str) \
    []() { struct Type { static const char* Function() { return str; } }; return Type::Function; }()

#define STRING_TEMPLATE_CLASS(str) \
    [=]() { struct Type : std::string { using std::string::operator=; Type() : std::string(str) { } }; return []{ return Type();}(); }()

int main()
{
    auto foo = STRING_TEMPLATE_FUNCTION("foo");
    auto bar = STRING_TEMPLATE_CLASS("bar");
    auto bar2 = STRING_TEMPLATE_CLASS("bar2");

    printf("%s\n", foo());
    printf("%s\n", bar.c_str());
    printf("%s\n", bar2.c_str());
    bar = bar2;
    printf("%s\n", bar.c_str());
}


Friday, August 11, 2023

A bit on modulo arithmetic.

Let's consider implementing the AMD64 adc add with carry instruction.

Add without carry is easy.

But how do we accommodate the extra carry_in?

How do we compute the flags? Unsigned carry and signed overflow?

Aka last carry and second to last carry.


Doing the operation bit-wise is easy. You can loop over the bits.

But what about using C++ unsigned::operator+ to do almost all the work in one operation?


intsafe.h's LongLongAdd (I wrote it) teaches us, without carry_in:


 - Adding positive to negative never overflows.

 - If you add two positive numbers, you expect a positive result.

 - If you add two negative numbers, you expect a negative result.

 - Overflow if inputs have the same sign and output has the other sign.


The HPPA manual says something similar, I was pleased to later read.


But intsafe.h functions have no carry_in. What about it?

QEMU seems to ignore it for flag computation (and is correct).

So, the answer is, if you consider the edge cases, adding carry_in does not change their outcome. Or rather, it does not change the correctness of the rules.

So these are the rules.


z = x + y + carry;

signed_overflow = (x < 0) == (y < 0) && (x < 0) != (z < 0);


Let's go through edge cases to confirm.

We will use 4-bit numbers. The third term in any addition is carry_in.

Negative numbers are interchangeable with large positive numbers.

If the output is shown as having two digits, the first is the unsigned carry_out.

Base 16 is implied. F == 0xF == decimal 15.

8 == -8; F == -1

A result is considered negative if its last digit is higher than 7.


check large positive:

  7+7+0=E overflowed

  7+7+1=F overflowed

check medium positive:

  3+4+0=7 no overflow

  3+4+1=8 overflowed

check negatives:

  8+8+0=10 overflowed

  8+8+1=11 overflowed

  F+F+0=1E=-2 no overflow

  F+F+1=1F=-1 no overflow

check mix of positive and negative:

  7+8+0=F=-1 no overflow

  7+8+1=10=0 no overflow

  7+F+0=16=6 no overflow

  7+F+1=17=7 no overflow


How about carry?

intsafe.h's ULongLongAdd is less informative: there is carry if the output is less than either input.

This does not work with carry_in. The output is equal to the inputs when adding max + max + carry_in.

Like adding 4bit numbers: F+F+1=1F


Again, QEMU does not seem to consider it (and is correct).


So, think again, more like LongLongAdd:

 - Adding two "negative" (unsigned!) numbers always carries. Edge case: 8+8=10

 - Adding two positive numbers never carries. Edge case: 7+7=E

 - Adding positive and "negative" carries if the result is positive. Edge cases: 1+F=10; 8+7=F


And again, adding carry_in does not push any edge case over the edge.

8+8+1=11

7+7+1=F

1+F+1=11

8+7+1=10 this one seems like carry_in made a difference. It did push the result into having carry_out, but the same rules work and do not need to consider carry_in.


Why does carry_in not matter in this regard?

I think it has something to do with the asymmetric range. There is one more negative number than there are positive numbers, and carry_in is never negative.


unsigned_carry = ((x < 0) && (y < 0)) || (((x < 0) != (y < 0)) && (z >= 0));
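A brute-force check (mine, not from the post) that both formulas match a wider-precision reference over all 4-bit cases:

#include <assert.h>
#include <stdio.h>

int main(void)
{
    for (int x = 0; x < 16; ++x)
    for (int y = 0; y < 16; ++y)
    for (int carry = 0; carry < 2; ++carry)
    {
        int z = (x + y + carry) & 0xF;

        // Reference flags computed the long way.
        int ref_carry = ((x + y + carry) >> 4) & 1;
        int sx = (x >= 8) ? x - 16 : x;     // sign-extend the 4-bit inputs
        int sy = (y >= 8) ? y - 16 : y;
        int wide = sx + sy + carry;
        int ref_overflow = (wide < -8 || wide > 7);

        // The rules above, using only the 4-bit sign bits, ignoring carry_in.
        int nx = (x >> 3) & 1, ny = (y >> 3) & 1, nz = (z >> 3) & 1;
        int signed_overflow = (nx == ny) && (nx != nz);
        int unsigned_carry = (nx && ny) || ((nx != ny) && !nz);

        assert(signed_overflow == ref_overflow);
        assert(unsigned_carry == ref_carry);
    }
    printf("all 512 cases agree\n");
    return 0;
}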


Tuesday, July 25, 2023

Placing string literals in sections.

 We have an assertion implementation something like this:

#define RetailAssert(expr) \
    __declspec(allocate(".rdata$cold")) static const char exp[] = #expr; \
    if (!(expr)) { fprintf(stderr, "%s(%d): assertion failed:%s\n", __FILE__, __LINE__, exp); abort(); }

__FILE__ and the format string should be similarly cold too.

I would like to make it an expression, like standard assert, something like this:

#define RetailAssert(expr) \
    ((expr) ? (void)0 : (fprintf(stderr, "%s(%d): assertion failed:%s\n", __FILE__, __LINE__, __declspec(allocate(".rdata$cold"))(#expr)), abort()))

But __declspec(allocate) seemingly cannot be used with string literals.

This seems like a language hole. String literals are their own special thing.

There is #pragma const_seg, and __pragma(const_seg).

They do not quite work because they operate at function level.

Like, the last __pragma applies to its entire enclosing function.

Lately I have found that lambdas are very good at breaking things into functions, perhaps unnaturally (where you don't really intend to have a function, just chunks of code), in order to gain functionality only offered to functions (such as __declspec(noinline)).

So here:

// Place a string literal in a section.
// For example, for hot/cold split of assertion messages in retail builds.
// const_seg and data_seg for /GF and not
// 4177 pragma 'data_seg' should only be used at global scope or namespace scope
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#undef NDEBUG
#include <assert.h>

#pragma section("1", read)
#pragma section("2", read)
#pragma section("3", read)

#define SEGSTR(seg, str)                \
   ([] { return                         \
    __pragma(warning(suppress:4177))    \
    __pragma(const_seg(push, seg))      \
    __pragma(data_seg(push, seg))       \
        str;                            \
    }                                   \
    __pragma(const_seg(pop))            \
    __pragma(data_seg(pop)) ())

extern "C" { extern IMAGE_DOS_HEADER __ImageBase; }

static PCH segNameOf(const void* p)
{
    PIMAGE_DOS_HEADER dos = &__ImageBase;
    PCH base = (PCH)dos;
    PIMAGE_NT_HEADERS nt = (PIMAGE_NT_HEADERS)(dos->e_lfanew + (PCH)dos);
    PIMAGE_SECTION_HEADER sec = IMAGE_FIRST_SECTION(nt);
    size_t nsec = nt->FileHeader.NumberOfSections;
    size_t i = 0;
    for (i = 0; i < nsec; ++i)
    {
        if (p >= (base + sec->VirtualAddress) && p < (base + sec->VirtualAddress + sec->Misc.VirtualSize))
        {
            //printf("vprot %X\n", sec->Characteristics);
            return (PCH)sec->Name;
        }
        ++sec;
    }
    return "";
}

__declspec(noinline)
void assert_failed(const char* expression, const char* file, int line)
{
    const char* a = SEGSTR("3", "%s(%d): assertion failed %s\n");
    printf("default:%s 1:%s 2:%s 3:%s\n", segNameOf(""), segNameOf(expression), segNameOf(file), segNameOf(a));

    fprintf(stderr, a, file, line, expression);
    //abort();
}

#define Assert(exp) ((exp) ? ((void)0) : assert_failed(SEGSTR("1", #exp), SEGSTR("2", __FILE__), __LINE__))

int main(int argc, char** argv)
{
    Assert(argc==1);
    printf("default:%s\n", 1+segNameOf(""));
    printf("text:%s\n", 1+segNameOf(&main));
    static const char rdata[]="";
    static char data[]="";
    printf("rdata:%s\n", 1+segNameOf(&rdata));
    printf("data:%s\n", 1+segNameOf(&data));
    assert(!strcmp("rdata", 1+segNameOf(&rdata)));
    assert(!strcmp("data", 1+segNameOf(&data)));
    assert(!strcmp("text", 1+segNameOf(&main)));
}



This is maybe not quite the final answer, because I have found that captureless, parameterless lambdas have an extra cost associated with them: being passed their unused "this" pointer. This can be mitigated by using a local class with a static member function inside a lambda. I will make that adjustment later. Well, in this case the codegen looks OK actually. I guess the compiler only passes the lambda its address when the lambda with the real code is larger, or maybe the newer compiler fixed this.

Wednesday, February 1, 2023

More Interlocked problems.

 There are helpers like:

template <typename T>
bool TryCompareExchange(T volatile * p, T exchange, T* compare)
{
    T old = Traits<T>::InterlockedCompareExchange(p, exchange, *compare);
    bool success = (old == *compare);
    if (!success) *compare = old;
    return success;
}

Traits handles char, short, int, long, __int64, unsigned, etc., and same-sized structs (hopefully with consistent/zeroed padding).
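For concreteness, a minimal sketch (mine, names assumed) of what one Traits specialization might look like, using the compiler intrinsic directly so the member name does not collide with the InterlockedCompareExchange define in the Windows headers:

#include <intrin.h>

template <typename T> struct Traits;

// One possible specialization; the real helper presumably covers all the
// sizes and struct cases listed above.
template <> struct Traits<__int64>
{
    static __int64 InterlockedCompareExchange(__int64 volatile* p, __int64 exchange, __int64 compare)
    {
        return _InterlockedCompareExchange64(p, exchange, compare);
    }
};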

Leading to uses like:

void increment(long volatile * p)
{
    long old = *p;
    while (!TryCompareExchange(p, old + 1, &old)) ;
}

The problem comes when used with InterlockedCompareExchange128.
There is no atomic get/set128.
They can be synthesized from InterlockedCompareExchange128.

There are at least two incorrect forms, which do work in non-race conditions, so lull us into complacency ("it works"):

template <typename T>
T Get(T volatile * p)
{
    T value;
    while (!TryCompareExchange(p, *p, &value)) ; // incorrect code
    return value;
}

*p is a non-atomic read.
If *p happens to match the uninitialized value, then the non-atomic read of *p will be written back.
This can be mitigated by initializing value to a never-set value.
But finding such a value is not really needed (keep reading).


template <typename T>
void Set(T volatile * p, T value)
{
    TryCompareExchange(&value, value, p); // incorrect code
}

The write through the last parameter is not atomic.


The correct forms:

template <typename T>
T Get(T volatile * p)
{
    T value; // any initial value works; see below
    (void)TryCompareExchange(p, value, &value);
    return value;
}

If the compare does match, the same value will be written back, atomically.

 

template <typename T>
void Set(T volatile * p, T value)
{
    T old = *p; // non-atomic read, ok
    while (!TryCompareExchange(p, value, &old)) ; 
}

Loop until the non-atomic read matches, upon which write the desired value atomically.
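Putting it together for the 128-bit case, a minimal sketch (mine, not the actual helpers; the Pair/Get128/Set128 names are assumptions) built directly on the intrinsic. x64 only, and the destination must be 16-byte aligned:

#include <intrin.h>

struct alignas(16) Pair { __int64 lo, hi; };

Pair Get128(Pair volatile* p)
{
    Pair value = {}; // any initial guess works
    // If the guess matches, the same value is written back; if not,
    // value receives the current contents. Either way the result is atomic.
    _InterlockedCompareExchange128((__int64 volatile*)p, value.hi, value.lo, (__int64*)&value);
    return value;
}

void Set128(Pair volatile* p, Pair value)
{
    Pair old = { p->lo, p->hi }; // non-atomic read, ok: it is only a guess
    // Loop until the guess matches, upon which the desired value is written atomically.
    while (!_InterlockedCompareExchange128((__int64 volatile*)p, value.hi, value.lo, (__int64*)&old))
        ;
}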

Thursday, January 26, 2023

InterlockedCompareExchange without volatile.

cl /LD /Z7 /GS- /O1 a.c /link /noentry /incremental:no && link /dump /disasm a.dll

__int64 _InterlockedCompareExchange64(volatile __int64*, __int64, __int64);

#define LOCK \
  __int64 lock = *plock; \
  return (lock & 1) && _InterlockedCompareExchange64(plock, lock & ~1, lock) == lock;

__declspec(dllexport) __int64 BrokenTryLock(__int64* plock)
{
    LOCK
}

__declspec(dllexport) __int64 WorkingTryLock(volatile __int64* plock)
{
    LOCK
}

BrokenTryLock:

  mov         r8d,1
  test        byte ptr [rcx],r8b
  je          1C
  mov         rdx,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  mov         rax,qword ptr [rcx]          <== problem, only with /O1, not with /O2
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          1F
1C:
  xor         r8d,r8d
1F:
  mov         rax,r8
  ret

WorkingTryLock:
  mov         rax,qword ptr [rcx]
  mov         r8d,1
  test        r8b,al
  je          40
  mov         rdx,rax
  and         rdx,0FFFFFFFFFFFFFFFEh
  lock cmpxchg qword ptr [rcx],rdx
  je          43
40:
  xor         r8d,r8d
43:
  mov         rax,r8
  ret

This is a common pattern.

It is correct with volatile.

Let's break down what is going on.

The idea is that we are trying to take a lock.

Only trying. Under contention, the lock is not acquired and false is returned. If there is to be a spin or spin with backoff, etc., that is up to the caller. This function has a small finite maximum number of operations, no loop.


Restated:

  First do a cheaper non-interlocked check, and if the lock appears not held, do a "real" interlocked check, which can also fail. Return true if the lock was acquired.

x86 is "complex". Specifically, while it has separate load/store ("mov") instructions, most instructions can also access memory while doing other work. There are many examples. You can add to memory. You can call or jump through a pointer in memory. This may sound obvious, of course you can do these operations, on all processors, but most processors require a separate load or store prior to or after add, jmp, call etc. to load the operands or store the results (only when needed of course; RISC architectures also have more registers and get more done without accessing memory). For further example, consider that x86 call stores the return address in memory. But on ARM the return address goes into the link register, there is no store to memory (unless needed in non-leaf prior to making another call). And x86 ret reads the return address from memory. ARM return is branch to register, typically link register. That had to be separately loaded from memory (if spilled in non-leaf).

RISC architectures separate load/store instructions from almost everything else.

When optimizing for size, the compiler has taken advantage of x86 and combined the mov and the test. This part of the dynamic path is one instruction shorter. When the lock is under contention, this function likely runs one fewer instruction than the speed-optimized version (not shown). This is considered the rare slow path. The number of memory accesses in this case is unchanged.

When the lock is not under contention, the size-optimized version still needs the value of the mov, and so it is issued in the second part of the function when size-optimized. In the volatile or speed-optimized form, the value from the mov is retained from before the test. As a result, the static number of instructions in either case is the same.

And due to vagaries of x86 variably-sized instructions, the size-optimized form is one byte larger. In the common uncontended case, the size-optimized form turns one memory access into three (x86 hides memory latencies very well, so I do not presume this is slower, but it might be).

Other than the varying efficiencies, the semantics are significantly different, and the size-optimized form causes problems. The problem is that the "compare" value is read, the "exchange" value is computed from it, and then "compare" is read again. The exchange value is no longer a function of the actual compare value that is used.

Consider a way that people sometimes write reference counting:

void increment(long* p)
{
    long old, next;
    do {
        old = *p;
        next = old + 1;
    } while (InterlockedCompareExchange(p, next, old) != old);
}

This gets transformed into more like:

void increment(long* p)
{
    long old, next;
    do {
        next = *p + 1;
        old = *p; // problem
    } while (InterlockedCompareExchange(p, next, old) != old);
}

Between the computation of next and the reread of *p to form old, another thread can have incremented the value already. The thread losing the race will write back the value that was already there, instead of incrementing. The compare should fail, but is now more likely to succeed (unless there is yet another increment in the meantime).

When it comes time to decrement, if the race is not duplicated (highly unlikely!), an extra decrement will occur.
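A hedged sketch of the straightforward fix, per the volatile observation above: with a volatile parameter the compiler cannot legally reread *p between forming old and computing next (for a plain increment, InterlockedIncrement is simpler still):

#include <windows.h>

void increment(long volatile* p)
{
    long old, next;
    do {
        old = *p;        // a single read; volatile prevents the reread transformation
        next = old + 1;
    } while (InterlockedCompareExchange(p, next, old) != old);
}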


A complete program that shows the problem. It works, as long as /O1 is not used.

// One thread flips the lock bit, one thread increments the counter bits.
// The counter should never go down.
// The counter is wide enough that it cannot really overflow.
//
// To see this program fail:
//   cl /O1 inc.cpp && inc.exe
// To see it succeed:
//   cl /O2 inc.cpp && inc.exe
// or:
//   cl inc.cpp && inc.exe

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define LOCK    (0x800000000000i64)
#define COUNTER (0x7fffffffffffi64)

// Ordinarily the return value of InterlockedCompareExchange
// is compared with old to see if the exchange succeeded.
// For this demonstration-only program, that is not needed.

DWORD Flip(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = old ^ LOCK;
        InterlockedCompareExchange64(p, next, old);
    }
}

// This thread increments the counter and..it does not
// matter what it does to the lock. For sample purposes
// it is always clearing.

DWORD Inc(void* pv)
{
    int64_t* p = (int64_t*)pv;
    while (true)
    {
        int64_t old = *p;
        int64_t next = ((old & COUNTER) + 1);
        InterlockedCompareExchange64(p, next, old);
    }
}

int main()
{
    int64_t volatile i = 0;
    CreateThread(0, 0, Inc, const_cast<int64_t*>(&i), 0, 0);
    CreateThread(0, 0, Flip, const_cast<int64_t*>(&i), 0, 0);
    while (true)
    {
        // The counter should only go up, or stay the same.
        // But it goes down sometimes due to the problem.
        int64_t j1 = (i & COUNTER);
        int64_t j2 = (i & COUNTER);
        if (j2 < j1)
        {
            printf("error:%lld %%lld\n", j1, j2);
            exit(1);
        }
    }
}

Monday, April 18, 2022

Windows process start and dynamic linking

 I was looking at Apple dyld.

And it struck me as a little weird, er, surprising.

That they have to do work without benefit of a heap allocator.


It is true, heap allocators require initialization

and code that runs before that initialization must do without heap.


But it struck me that the structure of Windows is a bit more

elegant here, and simpler, possibly smaller, faster, easier to maintain, more general, etc.

(ok, maybe not smaller, due to an "extra" C runtime, but still, this is not much, and provides a lot of value, for example you can printf to the debugger via this code: DbgPrint).


So here is a brief description of Windows process startup and dynamic loading.


There are just a few basic aspects to the structure from which the rest follow.


 - ntdll.dll has a very special relationship with the kernel. (Yes, "dll dll", hereafter just "ntdll").


 - All usermode processes begin in ntdll. (I am ignoring Pico processes.)

   They do *not* begin in executables. No matter what flags

   the executable is built with. There is no "alternate dyld" or "ELF interpreter".

   There is no such thing as a statically linked executable on Windows.

   Well, yes, the executable can just ret or int3. It can even try to

   statically link system service stubs (their interface is sadly unstable).

   It need not have any imports. It can seem statically linked. 


   But execution still starts in ntdll no matter what the executable looks like, and such an executable cannot do much, given the unstable undocumented kernel interface (NtOpenFile, etc.)

(Prior to Windows XP, an executable with no imports would actually crash, attempting to run the address in kernel32.dll that the creator assumed would be mapped in all processes.)

 - Rather, all usermode threads begin in ntdll. There is no specific kernel to user upcall for new processes, only new threads. It suffices.


 - Process initialization occurs by virtue of thread start noticing

   the process has not been initialized. If you create a process suspended,

   and multiple threads in it suspended, and then resume them all "quickly" (NtResumeProcess?), they will race to initialize the process first.

   This is synchronized and safe.


 - ntdll contains statically linked all the system service stubs (NtOpenFile, etc.)

   They are exported from there to the rest of the usermode OS.

   Aside: win32u.dll contains the ones for win32k.sys, for use by user32.dll/gdi32.dll.

 In the past these were statically linked into user32.dll and gdi32.dll but got separated at some point. I don't know how DirectX works. Maybe via gdi32!Escape()?


 - ntdll has no imports. It probably could have some if it was careful, but this is kinda the point.


 - ntdll has thread locals, via an internal non-extensible mechanism. Not declspec(thread), not TlsAlloc.

   All of ntdll's thread locals are allocated along with all usermode threads, by the kernel. This is not really about ntdll per se. These "built in" thread locals are also very efficient to access, fixed register + offset. They should be spent wisely, at least each page of them, as every thread pays for them. This is known as the TEB, the thread environment block. It is at e.g. GS:0 on AMD64, FS:0 for x86, and dedicated registers for other processors. See NtCurrentTeb in winnt.h.


 - The special relationship between ntdll and the kernel extends, such that

   all kernel to user upcalls go through ntdll. For example exception dispatch (KiUserExceptionDispatcher), asynchronous I/O completion (KiUserApcDispatcher), win32k callbacks (KiUserCallbackDispatcher).

 - ntdll contains its own statically linked C runtime. Such as strcmp, bsearch.

   This is partially exported to the system, but is not generally reused. Usermode generally uses the universal C runtime (ucrt) or older msvcrt.dll. They are more complete and support e.g. C++ exception handling. The ntdll C runtime is built from the msvcrt.dll sources, but ifdef'ed ("not all lines") and selectively built ("not all files"), e.g. to avoid kernel dependencies, though they could work.

   i.e. no fread or malloc, though malloc would be trivial. While this is somewhat wasteful, it is not terrible. This C runtime, libcntpr.lib, is also statically linked to and exported from the kernel, which is why e.g. malloc makes sense to omit (usermode heap is built on NtAllocateVirtualMemory / VirtualAlloc, kernel has those but it is not usually what kernel code wants, kernel "heap" is historically ExAllocatePool, etc.) (In truth, this C runtime was later ifdefed again to separate ntdll and kernel).


 - Other than mapping ntdll into all processes, the kernel either maps the executable and/or passes its path and/or mapped base to ntdll. Passing the path would suffice, since ntdll could map it, just as it maps .dlls.


 - ntdll process initialization then proceeds like so:


   initialize heap (the process heap, i.e. GetProcessHeap())

   recursively walk the executable's imports,

      searching for .dlls

      mapping them (roughly: CreateFile + CreateFileMapping(SEC_IMAGE) + MapViewOfFile)

      resolving their imports

      calling DllMain

   call the executable's entry

Since ntdll has no imports, the only dependencies here, by static construction, are the system services and ntdll itself (being careful to initialize ntdll in the right order, e.g. heap first, but some other things too, like critical section support; critical sections optionally use heap).


- You can step through all this. The "magic" is asking the debugger to stop on module loads, ntdll specifically:

 cdb /xe ld:ntdll.dll foo.exe (or maybe just /xe ld).

This breaks very early in a usermode process, long before main and long before the builtin initial breakpoint.


 - It should be noted that exception dispatch is also in ntdll.

   Dynamic loading has no problem using exceptions.

   (Glossing over: ntdll is written mostly in C. It can use C exceptions. It does

   not have a C++ runtime. The C++ runtime uses the "wrong" kind of thread locals (FlsAlloc), e.g. for rethrow, so it does not at present work here, but it could, or maybe the rethrow functionality could be omitted; exceptions on Windows at least do not require heap, unlike other systems; they can be used to indicate out of memory.)


 - I think Apple could/should merge libSystem and dyld and therefore ease

   the development of dyld, but there may be reasons they are split,

   some functionality I am unaware of. Or maybe it is just too much

   work at this point. Maybe they have static executables that do not use dyld,

   or even libSystem?