ldr x0, [x1,#n]
x0 = *(x1 + n)
ldr x0, [x1],#n
x0 = *x1
x1 += n
ldr x0, [x1,#n]!
x1 += n
x0 = *x1
typedef struct { int64_t a[0xb0 / 8]; } Data;

int compare(void** p1, void** p2)
{
    ptrdiff_t delta = *(Data**)p1 - *(Data**)p2;
    return (delta < 0) ? -1 : ((delta > 0) ? 1 : 0);
}

This looks right but is not. Pointer subtraction divides the byte difference by the element size, which is only defined for pointers into the same array, and with a size like 0xb0 the optimized division can get the sign wrong for unrelated pointers. Compare the pointers directly instead:

return (*p1 < *p2) ? -1 : ((*p1 > *p2) ? 1 : 0);
#include <stdlib.h>
#define assert(x) ((void)((x) ? 0 : __debugbreak()))
typedef struct { __int64 a[0xb0 / 8]; } Data;
int main()
{
    while (1)
    {
        Data* d1 = (Data*)malloc(sizeof(Data));
        Data* d2 = (Data*)malloc(sizeof(Data));
        assert(((d1 - d2) < 0) == (d1 < d2));
        free(d1);
        free(d2);
    }
}
It is a very long-standing problem of C++ that you cannot use string literals as template parameters.
I acknowledge that macros are bad, however..
C++ lambdas have a special property of generating anonymous unique types, and therefore provide a scope in which to create additional types.
This seems quite useful. Producing types (or functions) within lambdas within macros provides template-like powers, and not in the bad old way that macros on their own do.
I'm not sure where all this will go, but here is a start:
#include <stdio.h>
#include <string>
#define STRING_TEMPLATE_FUNCTION(str) \
    []() { \
        struct Type { static const char* Function() { return str; } }; \
        return Type::Function; \
    }()

#define STRING_TEMPLATE_CLASS(str) \
    [=]() { \
        struct Type : std::string { \
            using std::string::operator=; \
            Type() : std::string(str) { } \
        }; \
        return []{ return Type(); }(); \
    }()
int main()
{
    auto foo = STRING_TEMPLATE_FUNCTION("foo");
    auto bar = STRING_TEMPLATE_CLASS("bar");
    auto bar2 = STRING_TEMPLATE_CLASS("bar2");
    printf("%s\n", foo());
    printf("%s\n", bar.c_str());
    printf("%s\n", bar2.c_str());
    bar = bar2;
    printf("%s\n", bar.c_str());
}
Let's consider implementing the AMD64 adc add with carry instruction.
Add without carry is easy.
But how do we accommodate the extra carry_in?
How do we compute the flags, unsigned carry and signed overflow?
That is, the last carry and the second-to-last carry.
Doing the operation bit-wise is easy. You can loop over the bits.
But what about using C++ unsigned::operator+ to do almost all the work in one operation?
intsafe.h's LongLongAdd (I wrote it) teaches us, without carry_in:
- Adding positive to negative never overflows.
- If you add two positive numbers, you expect a positive result.
- If you add two negative numbers, you expect a negative result.
- Overflow if inputs have the same sign and output has the other sign.
The HPPA manual says something similar, I was pleased to later read.
But intsafe.h functions have no carry_in. What about it?
QEMU seems to ignore it for flag computation (and is correct).
So the answer is: if you consider the edge cases, adding carry_in does not change their outcome, or rather, the correctness of the rules.
So these are the rules.
z = x + y + carry;
signed_overflow = ((x < 0) == (y < 0)) && ((x < 0) != (z < 0));
Let's go through edge cases to confirm.
We will use 4-bit numbers. The third term in any addition is carry_in.
Negative numbers are interchangeable with large positive numbers.
If the output is shown as having two digits, the first is the unsigned carry_out.
Base 16 is implied: F == 0xF == decimal 15.
8 == -8; F == -1
The sign is considered negative if the second digit is higher than 7.
check large positive:
7+7+0=E overflowed
7+7+1=F overflowed
check medium positive:
3+4+0=7 no overflow
3+4+1=8 overflowed
check negatives:
8+8+0=10 overflowed
8+8+1=11 overflowed
F+F+0=1E=-2 no overflow
F+F+1=1F=-1 no overflow
check mix of positive and negative:
7+8+0=F=-1 no overflow
7+8+1=10=0 no overflow
7+F+0=16=6 no overflow
7+F+1=17=7 no overflow
How about carry?
intsafe.h's ULongLongAdd is less informative: carry occurred if the output is less than either input.
This does not work with carry_in. The output is equal to the inputs when adding max + max + carry.
Like adding 4-bit numbers: F+F+1=1F
Again, QEMU does not seem to consider it (and is correct).
So, think again, more like LongLongAdd:
- Adding two "negative" (unsigned!) numbers always carries. Edge case: 8+8=10
- Adding two positive numbers never carries. Edge case: 7+7=E
- Adding positive and "negative" carries if the result is positive. Edge cases: 1+F=10; 8+7=F
And again, adding carry_in does not push any edge case over the edge.
8+8+1=11
7+7+1=F
1+F+1=11
8+7+1=10 This one seems like carry_in made a difference. It did push the result into having carry_out, but the same rules work without needing to consider carry_in.
Why does carry_in not matter in this regard?
I think it has something to do with the asymmetric range. There is one more negative number than positive numbers, and carry_in is always positive.
unsigned_carry = ((x < 0) && (y < 0)) || (((x < 0) != (y < 0)) && (z >= 0));
We have an assertion implementation something like this:
#define RetailAssert(expr) \
    __declspec(allocate(".rdata$cold")) static const char exp[] = #expr; \
    if (!(expr)) { fprintf(stderr, "%s(%d): assertion failed: %s\n", __FILE__, __LINE__, exp); abort(); }
__FILE__ and the format string should be similarly cold too.
I would like to make it an expression, like standard assert, something like this:
#define RetailAssert(expr) \
    ((expr) ? 0 : (fprintf(stderr, "%s(%d): assertion failed: %s\n", __FILE__, __LINE__, __declspec(allocate(".rdata$cold"))(#expr)), abort()))
But __declspec(allocate) seemingly cannot be used with string literals.
This seems like a language hole. String literals are their own special thing.
There is #pragma const_seg, and __pragma(const_seg).
They do not quite work because they operate at function level.
Like, the last __pragma applies to its entire enclosing function.
Lately I have found that lambdas are very good at breaking code into functions, perhaps unnaturally, where you don't really intend to have a function, just a chunk of code, in order to gain functionality only offered to functions (such as __declspec(noinline)).
So here:
// Place a string literal in a section.
// For example, for hot/cold split of assertion messages in retail builds.
// const_seg and data_seg for /GF and not
// 4177 pragma 'data_seg' should only be used at global scope or namespace scope
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#undef NDEBUG
#include <assert.h>
#pragma section("1", read)
#pragma section("2", read)
#pragma section("3", read)
#define SEGSTR(seg, str) \
    ([] { return \
        __pragma(warning(suppress:4177)) \
        __pragma(const_seg(push, seg)) \
        __pragma(data_seg(push, seg)) \
        str; \
    } \
    __pragma(const_seg(pop)) \
    __pragma(data_seg(pop)) ())
extern "C" { extern IMAGE_DOS_HEADER __ImageBase; }
static PCH segNameOf(const void* p)
{
    PIMAGE_DOS_HEADER dos = &__ImageBase;
    PCH base = (PCH)dos;
    PIMAGE_NT_HEADERS nt = (PIMAGE_NT_HEADERS)(dos->e_lfanew + (PCH)dos);
    PIMAGE_SECTION_HEADER sec = IMAGE_FIRST_SECTION(nt);
    size_t nsec = nt->FileHeader.NumberOfSections;
    PCH pc = (PCH)p;
    for (size_t i = 0; i < nsec; ++i)
    {
        if (pc >= (base + sec->VirtualAddress) && pc < (base + sec->VirtualAddress + sec->Misc.VirtualSize))
        {
            //printf("vprot %X\n", sec->Characteristics);
            return (PCH)sec->Name;
        }
        ++sec;
    }
    return (PCH)"";
}
__declspec(noinline)
void assert_failed(const char* expression, const char* file, int line)
{
    const char* a = SEGSTR("3", "%s(%d): assertion failed %s\n");
    printf("default:%s 1:%s 2:%s 3:%s\n", segNameOf(""), segNameOf(expression), segNameOf(file), segNameOf(a));
    fprintf(stderr, a, file, line, expression);
    //abort();
}
#define Assert(exp) ((exp) ? ((void)0) : assert_failed(SEGSTR("1", #exp), SEGSTR("2", __FILE__), __LINE__))
int main(int argc, char** argv)
{
    Assert(argc == 1);
    printf("default:%s\n", 1 + segNameOf(""));
    printf("text:%s\n", 1 + segNameOf(&main));
    static const char rdata[] = "";
    static char data[] = "";
    printf("rdata:%s\n", 1 + segNameOf(&rdata));
    printf("data:%s\n", 1 + segNameOf(&data));
    assert(!strcmp("rdata", 1 + segNameOf(&rdata)));
    assert(!strcmp("data", 1 + segNameOf(&data)));
    assert(!strcmp("text", 1 + segNameOf(&main)));
}
There are helpers like:
template <typename T>
bool TryCompareExchange(T volatile * p, T exchange, T* compare)
{
    T old = Traits<T>::InterlockedCompareExchange(p, exchange, *compare);
    bool success = (old == *compare);
    if (!success) *compare = old;
    return success;
}
Traits handles char, short, int, long, __int64, unsigned, etc., and same-sized structs (hopefully with consistent/zeroed padding).
Leading to uses like:
void increment(long volatile * p)
{
    long old = *p;
    while (!TryCompareExchange(p, old + 1, &old)) ;
}

The problem comes when used with InterlockedCompareExchange128.
There are at least two incorrect forms, which do work in non-race conditions, so lull us into complacency ("it works"):
template <typename T>
T Get(T volatile * p)
{
    T value;
    while (!TryCompareExchange(p, *p, &value)) ; // incorrect code
    return value;
}

template <typename T>
void Set(T volatile * p, T value)
{
    TryCompareExchange(&value, value, p); // incorrect code
}
The write through the last parameter is not atomic.
The correct forms:
template <typename T>
T Get(T volatile * p)
{
    T value = T(); // any initial guess works
    (void)TryCompareExchange(p, value, &value);
    return value;
}
If the compare does match, the same value will be written back, atomically.
template <typename T>
void Set(T volatile * p, T value)
{
T old = *p; // non-atomic read, ok
while (!TryCompareExchange(p, value, &old)) ;
}
Loop until the non-atomic read matches, upon which write the desired value atomically.
cl /LD /Z7 /GS- /O1 a.c /link /noentry /incremental:no && link /dump /disasm a.dll
BrokenTryLock:
mov r8d,1

This is a common pattern.
It is correct with volatile.
Let's break down what is going on.
The idea is that we are trying to take a lock.
Only trying. Under contention, the lock is not acquired and false is returned. If there is to be a spin or spin with backoff, etc., that is up to the caller. This function has a small finite maximum number of operations, no loop.
Restated:
First do a cheaper non-interlocked check, and if the lock appears not held, do a "real" interlocked check, which can also fail. Return true if the lock was acquired.
x86 is "complex". Specifically, while it has separate load/store ("mov") instructions, most instructions can also access memory while doing other work. There are many examples. You can add to memory. You can call or jump through a pointer in memory.

This may sound obvious, of course you can do these operations, on all processors, but most processors require a separate load or store prior to or after add, jmp, call etc. to load the operands or store the results (only when needed of course; RISC architectures also have more registers and get more done without accessing memory).

For further example, consider that x86 call stores the return address in memory, but on ARM the return address goes into the link register; there is no store to memory (unless needed in a non-leaf function prior to making another call). And x86 ret reads the return address from memory, while ARM return is a branch to a register, typically the link register, which had to be separately loaded from memory (if spilled in a non-leaf function).
RISC architectures separate load/store instructions from almost everything else.
When optimizing for size, the compiler has taken advantage of x86 and combined the mov and the test. This part of the dynamic path is one instruction shorter. When the lock is under contention, this function likely runs one fewer instruction than the speed-optimized version (not shown). This is considered the rare slow path. The number of memory accesses in this case is unchanged.
When the lock is not under contention, the size-optimized version still needs the value of the mov, and so it is issued in the second part of the function when size-optimized. In the volatile or speed-optimized form, the value from the mov is retained from before the test. As a result, the static number of instructions in either case is the same.
And due to vagaries of x86's variably sized instructions, the size-optimized form is one byte larger. In the common uncontended case, the size-optimized form turns one memory access into three (x86 hides memory latencies very well, so this is not necessarily slower).
Other than the varying efficiencies, the semantics are significantly different, and the size-optimized form causes problems. The problem is that the "compare" value is read, the "exchange" value is computed from it, and then "compare" is read again. The exchange value is no longer a function of the actual compare value that is used.
Consider a way that people sometimes write reference counting:
void increment(long* p)
{
    long old, next;
    do {
        old = *p;
        next = old + 1;
    } while (InterlockedCompareExchange(p, next, old) != old);
}
This gets transformed into more like:
void increment(long* p)
{
    long old, next;
    do {
        next = *p + 1;
        old = *p; // problem
    } while (InterlockedCompareExchange(p, next, old) != old);
}
When it comes time to decrement, if the race is not duplicated (highly unlikely!), an extra decrement will occur.
A complete program that shows the problem. It works, as long as /O1 is not used.
// One thread flips high bit, one thread increments lower 63 bits.