Friday, December 1, 2017
Epilogues and esp. prologues are probably not what you think.
Epilogues might not be what you think and prologues almost definitely are not what you think.
First, the easier clarification: a function can have any number of epilogues.
It can have zero epilogues, one epilogue at the end, one epilogue not at the end,
or several epilogues.
An epilogue is not the code located at the end of the function,
it is the last thing a function runs -- it is about dynamic execution,
not static location.
A function will have zero epilogues if it never returns:
C:\> type no_epilogue.c
cl /LD /O2 /GL /GS- no_epilogue.c /link /incremental:no /export:no_epilogue /nod /noentry
link /dump /disasm no_epilogue.dll
void no_epilogue(void (*f)()) { while (1) f(); }
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
0000000180001000: 40 53 push rbx
0000000180001002: 48 83 EC 20 sub rsp,20h
0000000180001006: 48 8B D9 mov rbx,rcx
0000000180001009: 0F 1F 80 00 00 00 nop dword ptr [rax+0000000000000000h]
00
0000000180001010: FF D3 call rbx
0000000180001012: EB FC jmp 0000000180001010
A function can have multiple epilogues if it has an "early return":
C:\> type multiple_epilogues.c
cl /LD /O2 /GL /GS- multiple_epilogues.c /link /incremental:no /export:multiple_epilogues /nod /noentry
link /dump /disasm multiple_epilogues.dll
int multiple_epilogues(int i, int (*f)(void), int (*g)(void))
{
if (i)
return f();
return g() + g() + g();
}
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
0000000180001000: push rdi
0000000180001002: sub rsp,20h
0000000180001006: mov rdi,r8
0000000180001009: test ecx,ecx
000000018000100B: je 0000000180001015
000000018000100D: add rsp,20h <== possibly epilog
0000000180001011: pop rdi <== epilog
0000000180001012: jmp rdx <== epilog
0000000180001015: mov qword ptr [rsp+30h],rbx
000000018000101A: call rdi
000000018000101C: mov ebx,eax
000000018000101E: call rdi
0000000180001020: add ebx,eax
0000000180001022: call rdi
0000000180001024: add eax,ebx
0000000180001026: mov rbx,qword ptr [rsp+30h]
000000018000102B: add rsp,20h <== possibly epilog
000000018000102F: pop rdi <== epilog
0000000180001030: ret <== epilog
And this multiple epilogue case accidentally demonstrates the next point.
Just as epilogue is not instructions located at the end of a function,
prologue is not instructions located at the start of a function.
The prologue *instructions* are the instructions that save nonvolatile
registers, or adjust rsp (prior to frame pointer establishment -- not alloca),
or establish the frame pointer (mov x, rsp).
The prologue instructions can be, and are, interleaved with somewhat arbitrary
other instructions. The critical requirement is that nonvolatiles be saved
before they are changed -- and likewise that the rsp adjustment
and frame pointer establishment be recorded -- where recording happens
by virtue of executing the instruction marked as such in the "xdata".
The multiple-epilogue example above has such a "dispersed" prologue.
Let's look at it again in more detail:
C:\> link /dump /unwindinfo /disasm multiple_epilogues.dll
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
0180001000: push rdi <=== prologue instruction, unsurprising
0180001002: sub rsp,20h <=== prologue instruction, unsurprising
0180001006: mov rdi,r8
0180001009: test ecx,ecx
018000100B: je 0000000180001015
018000100D: add rsp,20h
0180001011: pop rdi
0180001012: jmp rdx
0180001015: mov qword ptr [rsp+30h],rbx <=== also a prologue instruction
018000101A: call rdi <=== offset 1A in the unwind info below
018000101C: mov ebx,eax
018000101E: call rdi
0180001020: add ebx,eax
0180001022: call rdi
0180001024: add eax,ebx
0180001026: mov rbx,qword ptr [rsp+30h]
018000102B: add rsp,20h
018000102F: pop rdi
0180001030: ret
Function Table (1)
Begin End Info Function Name
00000000 00001000 00001031 0000208C
Unwind version: 1
Unwind flags: None
Size of prologue: 0x1A <== This is also telling.
Count of codes: 4
Unwind codes:
1A: SAVE_NONVOL, register=rbx offset=0x30
06: ALLOC_SMALL, size=0x20
02: PUSH_NONVOL, register=rdi
/*\
 *
 *
 * look here
The critical information we want to look at is the leftmost column
of the unwind codes. These are the offsets just after prologue instructions.
They are reverse sorted by offset, and the underlying data is not fixed
size per line shown -- you must always walk them linearly from the start.
Offsets 2 and 6 are what you expect -- the first two instructions.
But offset 1A is quite a bit into the function -- that is a bit surprising when you first see it.
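Since the codes are variable size, walking them takes a tiny bit of care. Here is a minimal
sketch of such a walk -- the struct layouts follow the documented UNWIND_INFO/UNWIND_CODE
format, which is not in the public headers under these names, so treat it as illustrative:
#include <windows.h>
#include <stdio.h>

typedef union {
    struct { UCHAR CodeOffset; UCHAR UnwindOp : 4; UCHAR OpInfo : 4; };
    USHORT FrameOffset;
} UNWIND_CODE;

typedef struct {
    UCHAR Version : 3, Flags : 5;
    UCHAR SizeOfProlog;
    UCHAR CountOfCodes;
    UCHAR FrameRegister : 4, FrameOffset : 4;
    UNWIND_CODE UnwindCode[1];
} UNWIND_INFO;

// How many slots each opcode consumes -- this is why the walk must be linear.
static ULONG SlotCount(UNWIND_CODE code)
{
    switch (code.UnwindOp) {
    case 0:  /* UWOP_PUSH_NONVOL     */ return 1;
    case 1:  /* UWOP_ALLOC_LARGE     */ return code.OpInfo ? 3 : 2;
    case 2:  /* UWOP_ALLOC_SMALL     */ return 1;
    case 3:  /* UWOP_SET_FPREG       */ return 1;
    case 4:  /* UWOP_SAVE_NONVOL     */ return 2;
    case 5:  /* UWOP_SAVE_NONVOL_FAR */ return 3;
    case 8:  /* UWOP_SAVE_XMM128     */ return 2;
    case 9:  /* UWOP_SAVE_XMM128_FAR */ return 3;
    case 10: /* UWOP_PUSH_MACHFRAME  */ return 1;
    default: return 1; // 6 and 7 belong to newer unwind versions; not handled in this sketch
    }
}

void DumpUnwindCodes(const UNWIND_INFO* info)
{
    for (ULONG i = 0; i < info->CountOfCodes; i += SlotCount(info->UnwindCode[i]))
        printf("%02X: op=%u info=%u\n",
               info->UnwindCode[i].CodeOffset,
               info->UnwindCode[i].UnwindOp,
               info->UnwindCode[i].OpInfo);
}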
And another thing. While the specification is that the offsets are just after the instruction
that does the nonvolatile save, etc., the requirement and reality are looser.
The offset can be later than the save, as long as it is before the nonvolatile changes.
As well, the stack location of a save can differ between the save and the recorded offset,
because rsp can move in between. For example, the compiler will store nonvolatiles into
home space, then adjust rsp, and only then (or at the same place) record that the
nonvolatile was saved.
This can be achieved with the multiple_epilogues.c example just by compiling with /O1 instead of /O2. Let's see:
cl /LD /O1 /GL /GS- multiple_epilogues.c /link /incremental:no /export:multiple_epilogues /nod /noentry
link /dump /disasm /unwindinfo multiple_epilogues.dll
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1
0180001000: mov qword ptr [rsp+8],rbx <==== rbx saved here
0180001005: push rdi
0180001006: sub rsp,20h <==== but recorded here
018000100A: mov rdi,r8
018000100D: test ecx,ecx
018000100F: je 0000000180001015
0180001011: call rdx
0180001013: jmp 0000000180001021
0180001015: call rdi
0180001017: mov ebx,eax
0180001019: call rdi
018000101B: add ebx,eax
018000101D: call rdi
018000101F: add eax,ebx
0180001021: mov rbx,qword ptr [rsp+30h]
0180001026: add rsp,20h
018000102A: pop rdi
018000102B: ret
Function Table (1)
Begin End Info Function Name
00000000 00001000 0000102C 0000208C
Unwind version: 1
Unwind flags: None
Size of prologue: 0x0A
Count of codes: 4
Unwind codes:
0A: SAVE_NONVOL, register=rbx offset=0x30 <=== rbx save
0A: ALLOC_SMALL, size=0x20 <=== two unwind codes with same offset
06: PUSH_NONVOL, register=rdi
And see how rbx is saved at rsp+8 but recorded as rsp+30h, because
rsp changes by 28h (push rdi plus sub rsp,20h) between the save and the recorded position.
And this is all ok. If you take an exception between the save and recorded
position of the save, rbx has not been changed, and need not be restored.
Such an exception is rare -- maybe stack overflow -- but the ABI accounts
for exceptions and stack walks from arbitrary instructions.
Saturday, November 25, 2017
DLL_THREAD_ATTACH and DLL_THREAD_DETACH are not what they sound like (but are documented correctly)
DLL_PROCESS_ATTACH
DLL_PROCESS_DETACH
DLL_THREAD_ATTACH
DLL_THREAD_DETACH
Ostensibly, THREAD_ATTACH is where you initialize
any thread locals and THREAD_DETACH is where you clean them up.
However, they do not work the way they sound -- though they are documented correctly.
DLL_THREAD_ATTACH is only called for threads created after a dll is loaded.
DLL_THREAD_DETACH is only called for threads that exit while a dll is loaded.
That leaves DLL_THREAD_ATTACH not called for threads that exist
before the dll is loaded, and DLL_THREAD_DETACH not called for threads
that still exist when a dll is unloaded.
Therefore, if you have a thread local with a constructor, it can be used without being constructed.
If you have a thread local with a destructor, it might not be called.
Here is an example:
F:\1>type dll.cpp dll.h dll.def exe.cpp
dll.cpp
#include <windows.h>
#include <stdio.h>
int constructs;
int destroys;
__declspec(thread) bool constructed;
struct A
{
A() { constructed = true; printf(" constructed:%d on thread:%d \n", ++constructs, GetCurrentThreadId()); }
~A() { printf(" destroyed:%d on thread:%d \n", ++destroys, GetCurrentThreadId()); }
void F1() { printf(" called on thread:%d constructed:%d \n", GetCurrentThreadId(), (int)constructed); }
};
__declspec(thread) A a;
extern "C" void F1() { a.F1(); }
PCSTR ReasonString(ULONG Reason)
{
switch (Reason)
{
case DLL_PROCESS_ATTACH: return "DLL_PROCESS_ATTACH";
case DLL_PROCESS_DETACH: return "DLL_PROCESS_DETACH";
case DLL_THREAD_ATTACH: return "DLL_THREAD_ATTACH";
case DLL_THREAD_DETACH: return "DLL_THREAD_DETACH";
}
return "unknown";
}
BOOL __stdcall DllMain(HINSTANCE dll, ULONG reason, PVOID reserved)
{
printf(" DllMain dll:%p, reason:%s reserved:%d thread:%d \n", dll, ReasonString(reason), !!reserved, GetCurrentThreadId());
if (reason == DLL_PROCESS_DETACH)
printf(" unloading with %d leaked constructions \n", constructs - destroys);
return TRUE;
}
dll.h
extern "C" void F1();
dll.def
EXPORTS
F1
exe.cpp
#include <stdio.h>
#include <windows.h>
#include "dll.h"
decltype(&F1) pf1;
HANDLE thread[3];
ULONG threadid[3];
HANDLE event[3];
ULONG __stdcall Thread(PVOID p)
{
size_t i = (size_t)p;
// wait for dll/function to be available
while (!pf1)
Sleep(1);
pf1();
// Indicate this thread is done with the dll.
SetEvent(event[i]);
// keep doing more work on this thread -- at least pretend
if (i == 2)
{
printf(" not exiting thread:%d \n", GetCurrentThreadId());
Sleep(INFINITE);
}
printf(" exiting thread:%d \n", GetCurrentThreadId());
return 0;
}
int main()
{
size_t i = 0;
printf(" initial thread:%d \n", GetCurrentThreadId());
// create thread before loading dll (current thread as well)
event[i] = CreateEvent(0, 0, 0, 0);
thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
printf(" created thread:%d \n", threadid[i]);
++i;
auto const dll = LoadLibrary("dll");
(PROC&)pf1 = GetProcAddress(dll, "F1");
pf1();
// create threads after loading dll
event[i] = CreateEvent(0, 0, 0, 0);
thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
printf(" created thread:%d \n", threadid[i]);
++i;
event[i] = CreateEvent(0, 0, 0, 0);
thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
printf(" created thread:%d \n", threadid[i]);
++i;
// Wait for all calls to the dll to be done.
WaitForMultipleObjects(3, event, TRUE, INFINITE);
// Wait for some of the threads to exit.
WaitForMultipleObjects(2, thread, TRUE, INFINITE);
FreeLibrary(dll);
}
F:\1>cl /LD /MD /Zi dll.cpp /link /def:dll.def
F:\1>cl /MD /Zi exe.cpp
F:\1>.\exe.exe
initial thread:80676
created thread:70020
constructed:1 on thread:80676
DllMain dll:00007FFBDEFD0000, reason:DLL_PROCESS_ATTACH reserved:0 thread:80676
called on thread:80676 constructed:1
created thread:66288
constructed:2 on thread:66288
DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_ATTACH reserved:0 thread:66288
created thread:49672
called on thread:70020 constructed:0 called on thread:66288 constructed:1
exiting thread:66288
exiting thread:70020
constructed:3 on thread:49672
DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_ATTACH reserved:0 thread:49672
called on thread:49672 constructed:1
not exiting thread:49672
destroyed:1 on thread:66288
DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_DETACH reserved:0 thread:66288
DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_DETACH reserved:0 thread:70020
destroyed:2 on thread:80676
DllMain dll:00007FFBDEFD0000, reason:DLL_PROCESS_DETACH reserved:0 thread:80676
unloading with 1 leaked constructions
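As an aside -- a sketch, not part of the example above -- the construction half can be
sidestepped by using a function-local thread_local, which is constructed lazily on first
use on each thread, including threads that existed before the dll was loaded. Destruction
on threads still alive at unload remains a problem either way.
struct A
{
    A();
    ~A();
    void F1();
};

A& GetA()
{
    // Constructed the first time each thread calls GetA(), regardless of
    // whether the thread predates LoadLibrary of this dll.
    thread_local A a;
    return a;
}

extern "C" void F1() { GetA().F1(); }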
Saturday, October 28, 2017
What's in a name -- "structured" exception handling.
Consider that Windows has "structured exceptions"
or "SEH" -- "structured exception handling".
Such as STATUS_ACCESS_VIOLATION -- same as SIGSEGV.
divide by zero
stack overflow
invalid instruction
There are many predefined 32-bit values, and you can RaiseException your own.
Consider that both signals and exceptions can be "handled".
For example, you can simulate mmap and memory mapped reading
of files by associating a range of address space with a file,
and as pages are touched, VirtualProtect/mprotect the pages
and read in the file contents. (Writing is more elaborate
as a worker thread or such has to issue the writes.)
Nevermind if this is slow or debugger-unfriendly or just somehow bad,
or just use mmap directly.
There are enough other reasons for these mechanisms to exist.
They are costly to implement and nobody did this lightly.
Consider that Posix lets you install signal handlers
for a process with sigaction. Is there a way to do it per thread?
Clearly what Posix calls a "signal", Windows calls an "exception".
So why does Windows call them "structured" exceptions, vs just plain "exceptions"?
"SEH" structured exception handling has become imbued with very negative connotations,
because you can catch access violations, and this causes bugs, security bugs.
However, it is simply a reasonable general purpose mechanism, and underlies C++ exceptions and C# exceptions. Don't fault it for providing a bit more functionality than most people want.
If this functionality were not provided here, and anybody wanted it, they would be hard pressed
to get their job done well. And there are cases where it is crucial, which I might cover later.
The operating system's job is to provide very general purpose mechanisms that most people
never use to their full functionality, aiming to provide the union of everyone's needs.
Satisfying everyone is a big feature set.
Anyway, I believe what "structured" means is in fact:
Instead of calling a function to set or unset a thread's or process's handler,
"structured" exception handlers are established/unestablished very efficiently by
virtue of static lexical scoping and dynamic scope i.e. stack walking.
"structured" means almost the same thing as "scoped".
Consider:
void f1();
void f2()
{
__try
{
f1();
__try
{
f1();
}
__except(...)
{
}
}
__except(...)
{
...
}
}
void f1()
{
__try
{
... ;
}
__except(...)
{
...
}
}
What would this look like using Posix signals?
Every enter and exit of a __try, and, roughly, every exit of an __except,
would require an expensive call to establish or unestablish a handler.
Assuming there is a way to do it per-thread.
Absent that, you would establish a handler per process that you communicate
with somehow so it can determine lexical scope and walk the stack for dynamic scope.
Maybe via thread locals, maybe optimized "thread local storage", which is a far cry
efficiency-wise from how Windows achieves this.
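For concreteness, here is a very rough sketch of that Posix shape -- illustrative only;
the names are made up, and siglongjmp out of a SIGSEGV handler is a delicate simplification
standing in for real dispatch:
#include <signal.h>
#include <setjmp.h>

static __thread sigjmp_buf try_context;  // per-thread "__try" state
static __thread int in_try;              // crude stand-in for lexical scope

static void handler(int sig)
{
    if (in_try)
        siglongjmp(try_context, 1);      // "handle" it: resume at the __try
    signal(sig, SIG_DFL);                // otherwise fall back to default behavior
    raise(sig);
}

void install_once(void)                  // one process-wide install
{
    struct sigaction sa = {0};
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, 0);
}

// Every "__try { body(); } __except(...) { }" then costs roughly this on every
// entry, unlike the nearly free __try on Windows:
void try_body(void (*body)(void))
{
    if (sigsetjmp(try_context, 1) == 0)
    {
        in_try = 1;
        body();
    }
    in_try = 0;
}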
On Windows, the cost of entering and exiting a __try is very small
on 32bit x86, and essentially zero on all other architectures.
The cost on the other architectures is generating some cold data on the side
at compile/link time, and *perhaps*, but perhaps not, some inhibition of
compiler optimizations.
On non-x86 platforms, communication of lexical scope is achieved by mapping
the instruction pointer (or relative virtual address within a module) to static data.
Dynamic scope is achieved by having an ABI that ensures the stack is always efficiently
walkable, again, without severely (or at all) compromising code quality.
On x86, instead of mapping the instruction pointer, functions have one volatile local integer
to indicate scope, and guaranteed stack-walkability, in the face of the lack of an adequate ABI,
is achieved via a highly optimized per-thread (or per-fiber) linked list of frames.
While it is indeed highly optimized, it is slower in most scenarios than the other architectures.
Stepping through almost any code will show the linked list is reached through FS:[0].
FS itself points at a bunch of per-thread (or per-fiber) data, and this linked list
is the very first thing in it. That data is referred to as the "thread environment block" (TEB)
or "thread information block" (TIB), which to me just sounds like "Thread".
x86 may achieve faster throw/raise/dispatch of exceptions, but it spreads a "peanut butter tax"
throughout all the non-exceptional code paths.
As well, on Windows, if you do really want process-wide handlers, there are "vectored"
exception handlers. This was added circa Windows 2000 or Windows XP.
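For completeness, the process-wide ("vectored") shape looks roughly like this -- a small
sketch, with the handler merely observing and letting structured dispatch continue:
#include <windows.h>
#include <stdio.h>

static LONG CALLBACK VectoredHandler(PEXCEPTION_POINTERS pointers)
{
    // Runs for every exception in the process, before frame-based handlers are searched.
    printf("exception %08X at %p\n",
           (unsigned)pointers->ExceptionRecord->ExceptionCode,
           pointers->ExceptionRecord->ExceptionAddress);
    return EXCEPTION_CONTINUE_SEARCH;    // just observe; let structured dispatch continue
}

int main(void)
{
    PVOID cookie = AddVectoredExceptionHandler(1 /* call this handler first */, VectoredHandler);
    __try { RaiseException(0xE0000001, 0, 0, NULL); }
    __except (EXCEPTION_EXECUTE_HANDLER) { printf("caught by the structured handler\n"); }
    RemoveVectoredExceptionHandler(cookie);
    return 0;
}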
Therefore Windows provides Posix-like parity with a little used
mechanism, and far surpasses it with a heavily used mechanism (again, recall that
SEH is the basis of C++ and C# exceptions, as well as by default interacting with setjmp/longjmp.)
"vectored" here meaning "global" instead of scoped or structured.
- Jay
Friday, October 27, 2017
Windows AMD64 ABI nuances part 3 -- why so concerned with exception handling?
Many programmers will roll their eyes and protest.
"I have barely any try/catch or throw or Windows __try/__except/__finally in my program, so why so concerned with exception handling?"
"One is all you need."
If all of your exceptions are "fail fast" (exit(), ExitProcess(), TerminateProcess(), etc.),
then sure, nevermind.
More common than try/catch or __try/__except/__finally are functions with locals that have destructors.
Every successful construction is conceptually paired with a "finally destroy".
struct A
{
A();
~A();
};
void f() { A a, b; }
This function has an exception handler.
It looks like:
void f()
{
construct a
try
{
construct b
destroy b
}
finally
{
destroy a
}
}
- Jay
Windows AMD64 ABI nuances part 2 -- function types
https://docs.microsoft.com/en-us/cpp/build/function-types
The two kinds are called "nested" or "frame" or "non-leaf" functions, and leaf functions.
A common misconception is that a leaf function is one that makes no calls.
This is a natural misunderstanding, because people think of call trees, and leaves of it.
While a function that makes calls is indeed not a leaf, that is not the only characteristic that makes a function not a leaf.
A frame function is a function that changes non-volatile registers or has an exception handler (or both). These are the most basic, and the only, two reasons a function is a frame function. Everything else follows.
Two common examples of changing non-volatile registers are changing RSP to allocate stack space, or calling another function, which also changes RSP.
But these are not additional conditions, merely common examples of changing non-volatile registers. RSP is just another non-volatile (https://docs.microsoft.com/en-us/cpp/build/register-usage). A function is expected to return with the same RSP as it started with.
Another reason to change non-volatiles is to use them for local variables.
If you change a non-volatile, then you must first save it, and describe how you saved it. Alternatively, instead of saving it, you can describe how you changed it -- that is, you can state how much you subtracted from RSP, so that restore is not loading the old value, but just adding.
And then, the description of how you saved non-volatiles goes into xdata, which is found via pdata.
As well, if you have an exception handler, that is also described in the xdata.
So, by virtue of the base reasons of changing non-volatiles or having an exception handler, a frame function has pdata/xdata.
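For example -- a sketch; exact code generation varies by compiler and options:
// Leaf: uses only volatile registers, does not change RSP or any other
// nonvolatile, and has no exception handler -- so it needs no pdata/xdata.
int leaf(int a, int b)
{
    return a + b;
}

int helper(int);

// Frame: merely making a call changes RSP (the call pushes a return address,
// and the caller allocates home space for the callee), so this function gets
// pdata/xdata even though it "does nothing special".
int frame(int a)
{
    return helper(a) + 1;
}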
- Jay
Thursday, October 26, 2017
Windows AMD64 ABI nuances part 1 -- the point of pdata/xdata.
The official Windows x64 ABI documentation is required reading material for this blog entry.
I am focusing here not on calling convention -- where
in registers/stack to place parameters, or sizeof(long) --
but on exception handling and "pdata" and "xdata".
"p" means procedure, or what most programmers now call functions.
Pascal calls them "procedures" for example.
"x" presumably means "exception", or "arbitrary but not p".
So, what is the point of all the pdata/xdata?
There are one or two or three basic purposes, depending on what you consider the same thing.
pdata/xdata lets debuggers walk the stack.
This data could be relegated to symbols, if that
was the only point, and if symbol-less debugging
was allowed to degrade so much as to break stack walking.
Keep in mind that you usually only have some symbols, like
for your code, but don't have all the symbols for functions
on the stack. So carrying around a small amount of metadata
at runtime can greatly improve the debugging experience.
As well, pdata/xdata lets other components walk the stack at runtime.
Such as profilers or sampling profilers (ETW).
It is not particularly practical to expect ETW to find and read
symbols while profiling, let alone for all code on the stack.
pdata/xdata let exception handling dispatch walk the stack.
Now, "walk the stack" -- is that just retrieving return addresses?
For strictly stack walking, no, not exactly, and for exception dispatch,
definitely not.
Beyond retrieving return addresses, stack walking must recover non-volatile registers
in order to recover frame pointers, in order to recover stack pointers and return addresses.
The basic stack walk method is "recover RSP and then dereference and increment it".
However "recover RSP" is not trivial.
This point about nonvolatile restoration feeding into frame/stack/return restoration
is left as kind of a "hint" and not fully elaborated here.
Think about it. A function can leave rsp in some frame pointer --
what we might think of as rbp, but it can be any nonvolatile -- and then the function
can alloca() freely, and then call another function or arbitrary functions that
save and change arbitrary nonvolatiles. How do you walk the stack? You must
restore all nonvolatiles, iteratively, to restore frame pointers, to discover
stack pointers, to discover return addresses.
As well, when exceptions are dispatched, and handlers are called, and
execution resumes somewhere ("the exception is caught"), then beyond
a correct stack pointer, code needs non-volatiles restored,
because locals can live in non-volatiles and are expected to survive exceptions.
I claim these use-cases are all really one slightly general thing -- restore nonvolatile registers.
RSP and RIP can be considered essentially non-volatile.
When you return "normally" from a function, RIP, RSP, and all non-volatiles are restored to what they were before you were called. (You can quibble off-by-oneness.) Likewise, a debugger or exceptions simulate returning from a function, referred to as "unwinding", without running any of the "remaining" code in the function that would normally restore the registers. They can do this via the pdata/xdata.
pdata describes the start/end of a function, and refers to the xdata.
xdata holds the "unwind codes", which describe how to undo the effects of the function's prologue, restoring all non-volatile registers, including RSP, and therefore RIP (return address) as well.
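A debugger-style walk over these structures looks roughly like the following sketch
(error handling and the exact termination condition are simplified):
#include <windows.h>
#include <stdio.h>

void WalkStack(void)
{
    CONTEXT context;
    RtlCaptureContext(&context);

    while (context.Rip)
    {
        DWORD64 image_base = 0;
        // pdata: find the RUNTIME_FUNCTION covering the current instruction.
        PRUNTIME_FUNCTION function = RtlLookupFunctionEntry(context.Rip, &image_base, NULL);

        if (!function)
        {
            // Leaf function: no pdata, RSP is unchanged, return address is at [RSP].
            context.Rip = *(DWORD64*)context.Rsp;
            context.Rsp += 8;
            continue;
        }

        // xdata: apply the unwind codes, restoring RSP, RIP, and nonvolatiles
        // as if the function had returned.
        PVOID handler_data = NULL;
        DWORD64 establisher_frame = 0;
        RtlVirtualUnwind(UNW_FLAG_NHANDLER, image_base, context.Rip, function,
                         &context, &handler_data, &establisher_frame, NULL);

        printf("return address %p\n", (PVOID)context.Rip);
    }
}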
Let's see an example where exception handling can be seen to restore non-volatile registers.
Well, I was unable to get the C compiler to demonstrate it, so this is in assembly; it took a while and will end this first installment.
I do have more planned.
First let's provide a minimal C runtime for our assembly.
We are only building an import library, so we just need function names.
Calling printf in the modern C runtime is more involved so we will use the old one.
msvcrt.c:
void printf() { }
void exit() { }
void __C_specific_handler() { }
msvcrt.def:
EXPORTS
printf
exit
__C_specific_handler
To build this:
cl /LD msvcrt.c /link /def:msvcrt.def /noentry /nod
del msvcrt.dll
And now the assembly nvlocala.asm:
include ksamd64.inc
extern printf:proc
extern exit:proc
extern RtlUnwindEx:proc
altentry resume
.const
str1 db "hello %X %X %X %X", 10, 0
.code
; int handler(ExceptionRecord, EstablisherFrame, ContextRecord, DispatcherContext)
; RtlUnwindEx(TargetFrame, TargetIp, ExceptionRecord, ReturnValue, OriginalContext, HistoryTable)
; 0 8 10 18 20 28
nested_entry handler, _text
;int 3
; Save nonvolatiles just so we can trash them, to demonstrate the point.
; Note that even when we call RtlUnwindEx, our frame is properly unwound.
push_reg r12 ; the last 4 registers are nonvolatile -- easy rule to remember
push_reg r13
push_reg r14
push_reg r15
alloc_stack 038h ; establish room for 6 parameters and align
end_prologue
; Trash nonvolatiles to help demonstrate the point.
xor r12, r12
xor r13, r13
xor r14, r14
xor r15, r15
; Dispatch or unwind?
mov eax, ErExceptionFlags[rcx]
test eax, EXCEPTION_UNWIND
jne unwind
; dispatch -- always handle it, resuming at hardcoded location
xor eax, eax
mov [rsp + 028h], rax ; HistoryTable is optional
mov [rsp + 020h], r8 ; OriginalContext = ContextRecord
mov r8, rcx ; ExceptionRecord = ExceptionRecord
mov rcx, rdx ; TargetFrame = EstablisherFrame
lea rdx, resume ; TargetIp = resume
mov r9, 05678h ; ReturnValue, just to demonstrate the feature
call RtlUnwindEx
int 3 ; should never get here
unwind: ; We are called for unwind as about the last thing RtlUnwindEx
; does before restoring context to the Rip we specify.
mov eax, ExceptionContinueSearch
add rsp, 038h
begin_epilogue
pop r15
pop r14
pop r13
pop r12
ret
nested_end handler, _text
nested_entry entry, _text, handler
;int 3
push_reg r12 ; the last 4 registers are nonvolatile -- easy rule to remember
push_reg r13
push_reg r14
push_reg r15
alloc_stack 028h ; room for 5 parameters and aligned
end_prologue
; Cache some values in nonvolatiles -- the point of this exercise.
mov r12, 0123h
mov r13, 0234h
mov r14, 0456h
mov r15, 0789h
lea rcx, str1 ; 0
mov rdx, r12 ; 8
mov r8, r13 ; 10
mov r9, r14 ; 18
mov [rsp + 020h], r15
call printf
lea rcx, str1
mov rdx, r12
mov r8, r13
mov r9, r14
mov [rsp + 020h], r15
call printf
; Produce an access violation, which will be caught.
call qword ptr[0]
int 3 ; should never get here
; Exception will resume here, because this is hardcoded in the handler.
resume:
lea rcx, str1
mov rdx, r12
mov r8, r13
mov r9, rax ; ReturnValue to RtlUnwindEx
mov [rsp + 020h], r15
call printf
lea rcx, str1
mov rdx, r13
mov r8, r12
mov r9, r14 ; again with the other nonvolatile
mov [rsp + 020h], r15
call printf
mov ecx, 3
call exit
int 3 ; should not get here
add rsp, 028h ; match the alloc_stack 028h above
begin_epilogue
pop r15
pop r14
pop r13
pop r12
ret
nested_end entry, _text
end
Build and run:
ml64 nvlocala.asm /link /entry:entry /subsystem:console .\msvcrt.lib kernel32.lib
nvlocala.exe
- Jay
Thursday, April 6, 2017
Unbuffered I/O of arbitrary record sizes.
The task was indexing log files. It was a bit of an experiment, so I had fun with it.
The files were all around 10MB each, and there were around 40GB of logs per day.
So, the overall data did not fit in memory (this was years ago), but individual files did.
Something I had always wanted to use successfully was unbuffered I/O.
Unbuffered I/O was appropriate here because an indexer tends to read things only once. Unbuffered is also uncached -- the data goes directly into user-supplied buffers, skipping the file system cache.
On Windows, at least, unbuffered I/O requires aligned file offsets, aligned read/write lengths, and aligned buffer addresses. This makes sense -- the alignment allows the block-level driver to work directly -- no buffering. The alignment requirement is not constant, it is up to some underlying driver and can be queried. But I just use 64K.
To align the buffers I just use VirtualAlloc. This aligns at 64K boundaries and is not interesting or challenging.
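A minimal sketch of that setup -- not the original indexer, and it assumes a hardcoded 64K
rather than querying the alignment:
#include <windows.h>

#define ALIGNMENT 0x10000  // 64K comfortably covers the real sector/page requirements

HANDLE OpenUnbuffered(PCSTR path)
{
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL, OPEN_EXISTING,
                       FILE_FLAG_NO_BUFFERING, NULL);
}

PVOID AllocAlignedBuffer(SIZE_T size)
{
    // VirtualAlloc returns 64K-aligned addresses, satisfying the buffer alignment.
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}

// offset and length must be multiples of ALIGNMENT; the final read of a file
// asks for a full aligned length and ReadFile reports how much actually arrived.
BOOL ReadAligned(HANDLE file, UINT64 offset, PVOID buffer, DWORD length, DWORD* read)
{
    OVERLAPPED overlapped = {0};
    overlapped.Offset = (DWORD)offset;
    overlapped.OffsetHigh = (DWORD)(offset >> 32);
    return ReadFile(file, buffer, length, read, &overlapped);
}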
The indexer was line oriented (spec suggested by someone else and I went with it). The lines were of arbitrary length. Not fixed length and not short.
And word oriented -- it would find words, and record the offset of the start of the line they occurred on.
Initially I said, hey, this is easy, whenever a line crosses the end of the buffer, double the buffer size.
This does work a little teeny tiny bit, but it is actually pretty dumb and bad.
What happens is that you will repeatedly hit a line that crosses the end of the buffer. It doesn't require a long line to do this.
The buffer doubles and doubles and doubles. The only thing stopping it is when the entire file fits in the buffer. Which, for 10MB, isn't terrible, but it defeats the point and doesn't scale well (line orientation or word orientation should lead to an ability to index documents that do not fit in memory; as well, the indexer was partly multi-threaded, so a 10MB buffer per thread is perhaps large).
So, the solution is to read into the middle of the buffer (double the initial buffer size perhaps).
If you then cross the end of the buffer, you copy the buffer contents to before the middle. Not the entire half buffer, just from the start of the line that crosses the end of the buffer.
This will happen repeatedly.
The last line in the buffer will repeatedly cross the end of the buffer, and every time you will copy it back so that the data you have ends at the middle of the buffer.
And then eventually you might hit a very long line that does not fit in the buffer at all. Only then do you double the buffer, and copy the entire thing, to end at the middle of the newly doubled buffer.
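Roughly, the carry-over step looks like this -- a memory-only sketch with made-up names,
not the original code; reads always land at buffer + half, which stays 64K aligned:
#include <stdlib.h>
#include <string.h>

typedef struct
{
    char*  buffer;   // 2 * half bytes; VirtualAlloc in the real thing
    size_t half;     // aligned read size; data is always read into buffer + half
} LineBuffer;

// line_start points at the start of the line that straddles the end of the buffer.
// Move that partial line back so it ends exactly at buffer + half, doubling the
// buffer first if it does not fit. Returns the new start of the partial line;
// the next read goes to buffer + half.
char* CarryPartialLine(LineBuffer* lb, char* line_start)
{
    size_t partial = (size_t)(lb->buffer + 2 * lb->half - line_start);

    if (partial > lb->half)
    {
        // The line does not fit before the middle: double the buffer and copy
        // the partial line so it ends at the new middle.
        size_t new_half = 2 * lb->half;
        char* new_buffer = (char*)malloc(2 * new_half);
        memcpy(new_buffer + new_half - partial, line_start, partial);
        free(lb->buffer);
        lb->buffer = new_buffer;
        lb->half = new_half;
    }
    else
    {
        memmove(lb->buffer + lb->half - partial, line_start, partial);
    }
    return lb->buffer + lb->half - partial;
}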
Some pictures:
64K 128K |------------|------------| the\n quick brown fox\njumped really really high\n
First line fits no problem. Second line straddles. Copy back and read more.

64K 128K |------------|------------| quick brown fox\njumped really really high\n
Second line fits now. Third line straddles. Copy back and read more.

64K 128K |------------|------------| jumped really really high\n
Line still straddles and the first half of the buffer is already in use. Double the buffer
and read more. When you double, you also copy, and you copy to before the middle. It is
possible maybe to copy to a different place, but this maximizes buffer use and ensures you
have an aligned location to read to.

128K 256K |-------------------------|-------------------------| jumped really really high\n

Note that this example is contrived. Typically many lines will fit in the buffer, and
relatively little copying occurs. And doubling will be rare. But note that some copying
will occur at almost every buffer end.
- Jay