Friday, December 1, 2017

Epilogues and esp. prologues are probably not what you think.

  The NT/amd64 ABI speaks of function prologues and epilogues, and the rest of the function. 
  Epilogues might not be what you think and prologues almost definitely are not what you think. 


  First, the easier clarification is that a function can have any number of epilogues. 
  It can have zero epilogues, it can have one epilogue at the end, it can have 
  one epilogue not at the end, or it can have several. 

  An epilogue is not the code located at the end of the function, 
  it is the last thing a function runs -- it is about dynamic execution, 
  not static location. 

  A function will have zero epilogues if it never returns: 

  C:\> type no_epilogue.c
        cl /LD /O2 /GL /GS- no_epilogue.c /link /incremental:no /export:no_epilogue /nod /noentry
        link /dump /disasm no_epilogue.dll


   void no_epilogue(void (*f)()) { while (1) f(); } 

  Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64 

    0000000180001000: 40 53              push        rbx 
    0000000180001002: 48 83 EC 20        sub         rsp,20h 
    0000000180001006: 48 8B D9           mov         rbx,rcx 
    0000000180001009: 0F 1F 80 00 00 00  nop         dword ptr [rax+0000000000000000h] 
                      00 
    0000000180001010: FF D3              call        rbx 
    0000000180001012: EB FC              jmp         0000000180001010 


   A function can have multiple epilogues if it has an "early return":  
    
  C:\> type multiple_epilogues.c 
       cl /LD /O2 /GL /GS- multiple_epilogues.c /link /incremental:no /export:multiple_epilogues /nod /noentry
       link /dump /disasm multiple_epilogues.dll 

   int multiple_epilogues(int i, int (*f)(void), int (*g)(void)) 
   { 
     if (i) 
      return f(); 
     return g() + g() + g(); 
   } 

  Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64 
    0000000180001000:   push        rdi 
    0000000180001002:   sub         rsp,20h 
    0000000180001006:   mov         rdi,r8 
    0000000180001009:   test        ecx,ecx 
    000000018000100B:   je          0000000180001015 
    000000018000100D:   add         rsp,20h            <== possibly epilogue  
    0000000180001011:   pop         rdi                <== epilogue 
    0000000180001012:   jmp         rdx                <== epilogue  
    0000000180001015:   mov         qword ptr [rsp+30h],rbx  
    000000018000101A:   call        rdi  
    000000018000101C:   mov         ebx,eax  
    000000018000101E:   call        rdi  
    0000000180001020:   add         ebx,eax  
    0000000180001022:   call        rdi  
    0000000180001024:   add         eax,ebx  
    0000000180001026:   mov         rbx,qword ptr [rsp+30h]  
    000000018000102B:   add         rsp,20h               <== possibly epilogue  
    000000018000102F:   pop         rdi                   <== epilogue 
    0000000180001030:   ret                               <== epilogue
 

   And this multiple-epilogue case accidentally demonstrates the next point. 

   Just as epilogue is not instructions located at the end of a function, 
   prologue is not instructions located at the start of a function. 

   The prologue *instructions* are the instructions that save nonvolatile 
   registers, or adjust rsp (prior to frame pointer establishment -- not alloca), 
   or establish the frame pointer (mov x, rsp). 

   The prologue instructions can be, and are, interleaved with somewhat arbitrary 
   other instructions. The critical requirement is that nonvolatiles be saved 
   before they are changed -- and likewise that rsp adjustment and frame pointer 
   establishment be recorded -- where "recording" means executing the instruction 
   marked as such in the "xdata". 

   The multiple-epilogue example above has such a "dispersed" prologue. 
   Let's look at it again in more detail: 

  C:\> link /dump /unwindinfo /disasm multiple_epilogues.dll 

    Microsoft (R) COFF/PE Dumper Version 14.00.24215.1 

    0180001000:  push rdi               <=== prologue instruction, unsurprising  
    0180001002:  sub  rsp,20h           <=== prologue instruction, unsurprising  
    0180001006:  mov  rdi,r8 
    0180001009:  test ecx,ecx 
    018000100B:  je   0000000180001015 
    018000100D:  add  rsp,20h 
    0180001011:  pop  rdi 
    0180001012:  jmp  rdx 
    0180001015:  mov  qword ptr [rsp+30h],rbx  <=== also a prologue instruction  
    018000101A:  call rdi               <=== offset 1A in the unwind info below  
    018000101C:  mov  ebx,eax 
    018000101E:  call rdi 
    0180001020:  add  ebx,eax 
    0180001022:  call rdi 
    0180001024:  add  eax,ebx 
    0180001026:  mov  rbx,qword ptr [rsp+30h] 
    018000102B:  add  rsp,20h 
    018000102F:  pop  rdi 
    0180001030:  ret

  Function Table (1) 


             Begin    End      Info      Function Name 
    00000000 00001000 00001031 0000208C 
      Unwind version: 1 
      Unwind flags: None 
      Size of prologue: 0x1A     <== This is also telling.
      Count of codes: 4 
      Unwind codes: 
        1A: SAVE_NONVOL, register=rbx offset=0x30 
        06: ALLOC_SMALL, size=0x20 
        02: PUSH_NONVOL, register=rdi 
       /*\ 
        * 
        * 
        * look here 


   The critical information we want to look at is the leftmost column 
   of the unwind codes. These are the offsets just after the prologue instructions. 
   They are reverse sorted by offset, and the underlying data is not fixed 
   size per line shown -- you must always walk them linearly from the start. 
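
   Here is a minimal sketch of that linear walk, based on the documented x64
   UNWIND_CODE layout; the type and helper names are my own, and the less
   common opcodes are elided:

#include <stdint.h>

enum
{
    UWOP_PUSH_NONVOL     = 0,  /* 1 slot  */
    UWOP_ALLOC_LARGE     = 1,  /* 2 slots, or 3 if OpInfo == 1 */
    UWOP_ALLOC_SMALL     = 2,  /* 1 slot  */
    UWOP_SET_FPREG       = 3,  /* 1 slot  */
    UWOP_SAVE_NONVOL     = 4,  /* 2 slots: the second slot is the offset */
    UWOP_SAVE_NONVOL_FAR = 5,  /* 3 slots */
};

typedef union
{
    struct
    {
        uint8_t CodeOffset;    /* prologue offset -- the leftmost column above */
        uint8_t UnwindOp : 4;
        uint8_t OpInfo   : 4;
    };
    uint16_t FrameOffset;      /* extra slots are raw operands */
} UnwindCodeSlot;              /* mirrors the documented UNWIND_CODE */

static unsigned slots_for(UnwindCodeSlot code)
{
    /* Each logical operation consumes one, two, or three 16-bit slots,
       which is why the array cannot be indexed -- only walked. */
    switch (code.UnwindOp)
    {
    case UWOP_ALLOC_LARGE:     return (code.OpInfo == 0) ? 2 : 3;
    case UWOP_SAVE_NONVOL:     return 2;
    case UWOP_SAVE_NONVOL_FAR: return 3;
    default:                   return 1;  /* push_nonvol, alloc_small, set_fpreg */
    }
}

static void walk(const UnwindCodeSlot *codes, unsigned count)
{
    for (unsigned i = 0; i < count; i += slots_for(codes[i]))
    {
        /* inspect codes[i].CodeOffset, codes[i].UnwindOp, codes[i].OpInfo here */
    }
}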

   Offsets 2 and 6 are what you expect -- the first two instructions. 
   But offset 1A is quite a bit into the function -- that is a bit surprising when you first see it.

 And another thing. The specification says the offsets are just after the instruction that does the nonvolatile save, etc., but the requirement, and the reality, are looser. The offset can be later than the save, as long as it is before a change. As well, the rsp-relative location of a save might change between the save and the recorded offset. For example, the compiler will move nonvolatiles into home space, then adjust rsp, and only then (or at that same place) record that the nonvolatile was saved.

This can be demonstrated with the multiple_epilogues.c example just by compiling with /O1 instead of /O2. Let's see:


  cl /LD /O1 /GL /GS- multiple_epilogues.c /link /incremental:no /export:multiple_epilogues /nod /noentry 
  link /dump /disasm /unwindinfo multiple_epilogues.dll
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
Microsoft (R) COFF/PE Dumper Version 14.00.24215.1

  0180001000:   mov   qword ptr [rsp+8],rbx   <==== rbx saved here
  0180001005:   push  rdi
  0180001006:   sub   rsp,20h                 <==== but recorded here
  018000100A:   mov   rdi,r8
  018000100D:   test  ecx,ecx
  018000100F:   je    0000000180001015
  0180001011:   call  rdx
  0180001013:   jmp   0000000180001021
  0180001015:   call  rdi
  0180001017:   mov   ebx,eax
  0180001019:   call  rdi
  018000101B:   add   ebx,eax
  018000101D:   call  rdi
  018000101F:   add   eax,ebx
  0180001021:   mov   rbx,qword ptr [rsp+30h]
  0180001026:   add   rsp,20h
  018000102A:   pop   rdi
  018000102B:   ret

Function Table (1)
           Begin    End      Info      Function Name
  00000000 00001000 0000102C 0000208C
    Unwind version: 1
    Unwind flags: None
    Size of prologue: 0x0A
    Count of codes: 4
    Unwind codes:
      0A: SAVE_NONVOL, register=rbx offset=0x30     <=== rbx save
      0A: ALLOC_SMALL, size=0x20                    <=== two unwind codes with same offset
      06: PUSH_NONVOL, register=rdi



And see how rbx is saved at rsp+8 but recorded as rsp+30h, because
rsp changes by 28h between the save and the recorded position.

And this is all ok. If you take an exception between the save and recorded
position of the save, rbx has not been changed, and need not be restored.
Such an exception is rare -- maybe stack overflow -- but the ABI accounts
for exceptions and stack walks from arbitrary instructions.

Saturday, November 25, 2017

DLL_THREAD_ATTACH and DLL_THREAD_DETACH are not what they sound like (but are documented correctly)

 DllMain gets called for four specific reasons:
  DLL_PROCESS_ATTACH
  DLL_PROCESS_DETACH
  DLL_THREAD_ATTACH
  DLL_THREAD_DETACH

 Ostensibly, THREAD_ATTACH is where you initialize
 any thread locals and THREAD_DETACH is where you clean them up.

 However, they do not work as they sound.

 They are, however, documented correctly.

 DLL_THREAD_ATTACH is only called for threads created after a dll is loaded.
 DLL_THREAD_DETACH is only called for threads that exit while a dll is loaded.


 That leaves DLL_THREAD_ATTACH not called for threads that exist
 before the dll is loaded, and DLL_THREAD_DETACH not called for threads
 that still exist when a dll is unloaded.

 Therefore, if you have a thread local with a constructor, it can be used without being constructed.
 If you have a thread local with a destructor, it might not be called.


 Here is an example:

F:\1>type dll.cpp dll.h dll.def exe.cpp
dll.cpp

 #include <windows.h>
 #include <stdio.h>

 int constructs;
 int destroys;
 __declspec(thread) bool constructed;

 struct A
 {
 A() { constructed = true; printf(" constructed:%d on thread:%d \n", ++constructs, GetCurrentThreadId()); }

 ~A() { printf(" destroyed:%d on thread:%d \n", ++destroys, GetCurrentThreadId()); }

 void F1() { printf(" called on thread:%d constructed:%d \n", GetCurrentThreadId(), (int)constructed); }
 };

 __declspec(thread) A a;

 extern "C" void F1() { a.F1(); }

 PCSTR ReasonString(ULONG Reason)
 {
     switch (Reason)
     {
     case DLL_PROCESS_ATTACH: return "DLL_PROCESS_ATTACH";
     case DLL_PROCESS_DETACH: return "DLL_PROCESS_DETACH";
     case DLL_THREAD_ATTACH: return "DLL_THREAD_ATTACH";
     case DLL_THREAD_DETACH: return "DLL_THREAD_DETACH";
     }
     return "unknown";
 }

 BOOL __stdcall DllMain(HINSTANCE dll, ULONG  reason, PVOID reserved)
 {
     printf(" DllMain dll:%p, reason:%s reserved:%d thread:%d \n", dll, ReasonString(reason), !!reserved, GetCurrentThreadId());
     if (reason == DLL_PROCESS_DETACH)
         printf(" unloading with %d leaked constructions \n", constructs - destroys);
     return TRUE;
 }

dll.h

extern "C" void F1();

dll.def

EXPORTS
F1
exe.cpp

 #include <stdio.h>
 #include <windows.h>
 #include "dll.h"

 decltype(&F1) pf1;

 HANDLE thread[3];
 ULONG threadid[3];
 HANDLE event[3];

 ULONG __stdcall Thread(PVOID p)
 {
     size_t i = (size_t)p;

     // wait for dll/function to be available
     while (!pf1)
         Sleep(1);
     pf1();

     // Indicate this thread is done with the dll.
     SetEvent(event[i]);

     // keep doing more work on this thread -- at least pretend
     if (i == 2)
     {
         printf(" not exiting thread:%d \n", GetCurrentThreadId());
         Sleep(INFINITE);
     }

     printf(" exiting thread:%d \n", GetCurrentThreadId());
     return 0;
 }

 int main()
 {
     size_t i = 0;

     printf(" initial thread:%d \n", GetCurrentThreadId());

     // create thread before loading dll (current thread as well)
     event[i] = CreateEvent(0, 0, 0, 0);
     thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
     printf(" created thread:%d \n", threadid[i]);
     ++i;

     auto const dll = LoadLibrary("dll");
     (PROC&)pf1 = GetProcAddress(dll, "F1");
     pf1();

     // create threads after loading dll
     event[i] = CreateEvent(0, 0, 0, 0);
     thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
     printf(" created thread:%d \n", threadid[i]);
     ++i;

     event[i] = CreateEvent(0, 0, 0, 0);
     thread[i] = CreateThread(0, 0, &Thread, (PVOID)i, 0, &threadid[i]);
     printf(" created thread:%d \n", threadid[i]);
     ++i;

     // Wait for all calls to the dll to be done.
     WaitForMultipleObjects(3, event, TRUE, INFINITE);

     // Wait for some of the threads to exit.
     WaitForMultipleObjects(2, thread, TRUE, INFINITE);

     FreeLibrary(dll);
 }

F:\1>cl /LD /MD /Zi dll.cpp /link /def:dll.def
F:\1>cl  /MD /Zi exe.cpp

F:\1>.\exe.exe
 initial thread:80676
 created thread:70020
 constructed:1 on thread:80676
 DllMain dll:00007FFBDEFD0000, reason:DLL_PROCESS_ATTACH reserved:0 thread:80676
 called on thread:80676 constructed:1
 created thread:66288
 constructed:2 on thread:66288
 DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_ATTACH reserved:0 thread:66288
 created thread:49672
 called on thread:70020 constructed:0 called on thread:66288 constructed:1
 exiting thread:66288
 exiting thread:70020
 constructed:3 on thread:49672
 DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_ATTACH reserved:0 thread:49672
 called on thread:49672 constructed:1
 not exiting thread:49672
 destroyed:1 on thread:66288
 DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_DETACH reserved:0 thread:66288
 DllMain dll:00007FFBDEFD0000, reason:DLL_THREAD_DETACH reserved:0 thread:70020
 destroyed:2 on thread:80676
 DllMain dll:00007FFBDEFD0000, reason:DLL_PROCESS_DETACH reserved:0 thread:80676
 unloading with 1 leaked constructions 

 - Jay

Saturday, October 28, 2017

What's in a name -- "structured" exception handling.

Consider that Posix has signals. Such as SIGSEGV.

Consider that Windows has "structured exceptions"
or "SEH" -- "structured exception handling".
Such as STATUS_ACCESS_VIOLATION -- same as SIGSEGV.
 divide by zero
 stack overflow
 invalid instruction
 Many predefined 32bit values, and you can RaiseException your own.
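
For example, here is a small sketch of raising and handling an application-defined
exception code; 0xE0000001 is an arbitrary made-up value, not anything standard:

#include <windows.h>
#include <stdio.h>

#define MY_EXCEPTION 0xE0000001   /* arbitrary application-defined code */

int main(void)
{
    __try
    {
        RaiseException(MY_EXCEPTION, 0, 0, NULL);
    }
    __except (GetExceptionCode() == MY_EXCEPTION ? EXCEPTION_EXECUTE_HANDLER
                                                 : EXCEPTION_CONTINUE_SEARCH)
    {
        printf("caught 0x%lX\n", (unsigned long)GetExceptionCode());
    }
    return 0;
}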

Consider that both signals and exceptions can be "handled".

For example, you can simulate mmap and memory mapped reading
of files by associating a range of address space with a file,
and as pages are touched, VirtualProtect/mprotect the pages
and read in the file contents. (Writing is more elaborate
as a worker thread or such has to issue the writes.)

Never mind whether this is slow or debugger-unfriendly or just somehow bad,
or whether you should just use mmap directly.
There are enough other reasons for these mechanisms to exist.
They are costly to implement and nobody did this lightly.

Consider that Posix lets you install signal handlers
for a process with sigaction. Is there a way to do it per thread?
Clearly what Posix calls a "signal", Windows calls an "exception".
So why does Windows call them "structured" exceptions, vs. just plain "exceptions"?

"SEH" structured exception handling has become imbued with very negative connoations,
because you can catch access violations, and this causes bugs, security bugs.

However, it is simply a reasonable general purpose mechanism, and underlies C++ exceptions and C# exceptions. Don't fault it for providing a bit more functionality than most people want.
If this functionality were not provided here, anybody who wanted it would be hard pressed
to get their job done well. And there are cases where it is crucial, which I might cover later.
The operating system's job is to provide very general purpose mechanisms that most people
never use to their full functionality, aiming to provide the union of everyone's needs.
Satisfying everyone is a big feature set.


Anyway, I believe what "structured" means is in fact:
  Instead of calling a function to set or unset a thread's or process's handler,
  "structured" exception handlers are established/unestablished very efficiently by
  virtue of static lexical scoping and dynamic scope i.e. stack walking.
  "structured" means almost the same thing as "scoped".


Consider:

void f1();

void f2()
{
  __try
  {
     f1();
     __try
     {
         f1();
     }
     __except(...)
     {
     }
  }
  __except(...)
  {
     ...
  }
}

void f1()
{
  __try
  {
     ... ;
  }
  __except(...)
  {
     ...
  }
}

What would this look like using Posix signals?
Every entry and exit of a __try, and roughly every exit of an __except,
would require an expensive call to establish or unestablish a handler.

Assuming there is a way to do it per-thread.
Absent that, you would establish a handler per process that you communicate
with somehow so it can determine lexical scope and walk the stack for dynamic scope.
Maybe via thread locals, maybe optimized "thread local storage", which is a far cry
efficiency-wise from how Windows achieves this.

On Windows, the cost of entering and exiting a __try is very small
on 32bit x86, and essentially zero on all other architectures.
The cost on the other architectures is generating some cold data on the side
at compile/link time, and *perhaps*, but perhaps not, some inhibition
of compiler optimizations.

On non-x86 platforms, communication of lexical scope is achieved by mapping
the instruction pointer (or relative virtual address within a module) to static data.

Dynamic scope is achieved by having an ABI that ensures the stack is always efficiently
walkable, again, without severely (or at all) compromising code quality.

On x86, instead of mapping the instruction pointer, functions have one volatile local integer
to indicate scope, and guaranteed stack-walkability -- in the absence of an adequate ABI --
is achieved via a highly optimized per-thread (or per-fiber) linked list of frames.
While it is indeed highly optimized, it is slower in most scenarios than the other architectures.

Stepping through almost any code will show the linked list is reached through FS:[0].
FS points at the start of a bunch of per-thread (or per-fiber) data, and this linked list
is the very first thing in it. That data is referred to as the "thread environment block" (TEB)
or "thread information block" (TIB), which to me just sounds like "Thread".

x86 may achieve faster throw/raise/dispatch of exceptions, but it spreads a "peanut butter tax"
throughout all the non-exceptional code paths.

As well, on Windows, if you really do want process-wide handlers, there are "vectored"
exception handlers -- "vectored" here meaning "global", instead of scoped or structured.
This was added circa Windows XP.

Therefore Windows provides Posix-like parity with a little-used
mechanism, and far surpasses it with a heavily used mechanism (again, recall that
SEH is the basis of C++ and C# exceptions, and by default interacts with setjmp/longjmp).

 - Jay

Friday, October 27, 2017

Windows AMD64 ABI nuances part 3 -- why so concerned with exception handling?

When speaking about an ABI, exception handling comes up a lot.

Many programmers will roll their eyes and protest.
"I have barely any try/catch or throw or Windows __try/__except/__finally in my program, so why so concerned with exception handling?"

"One is all you need."

If all of your exceptions are "fail fast" (exit(), ExitProcess(), TerminateProcess(), etc.),
then sure, nevermind.

More common than try/catch or __try/__except/__finally are functions with locals that have destructors.
Every successful construction is conceptually paired with a "finally destroy".

 struct A
 {
   A();
   ~A();
 };

 void f() { A a, b; }

 This function has an exception handler.
 It looks like:

 void f()
 {
   construct a
   try
   {
      construct b
       destroy b
   }
   finally
   {
      destroy a
   }
 }
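
 One hedged way to convince yourself (assuming link /dump /unwindinfo accepts an
 object file; if not, link a small test program and dump that instead): save the
 snippet above as, say, a.cpp, compile it for x64 with C++ exceptions enabled,
 and dump its unwind info. f should come out with an exception handler referenced
 from its xdata, even though the source contains no try/catch:

  cl /c /O2 /EHsc a.cpp
  link /dump /unwindinfo a.obj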

 - Jay

Windows AMD64 ABI nuances part 2 -- function types

The ABI speaks of two types of functions.

 https://docs.microsoft.com/en-us/cpp/build/function-types

They are called "nested" or "frame" or "non-leaf" functions, and leaf functions.

A common misconception is that a leaf function is one that makes no calls.

This is a natural misunderstanding, because people think of call trees, and leaves of it.

While a function that makes calls is indeed not a leaf, that is not the only characteristic that makes a function not a leaf.

A frame function is a function that changes non-volatile registers or has an exception handler (or both). These are the only two underlying reasons a function is a frame function. Everything else follows.

Two common examples of changing non-volatile registers are changing RSP to allocate stack space, or calling another function, which also changes RSP.

But these are not additional conditions, merely common examples of changing non-volatile registers. RSP is just another non-volatile (https://docs.microsoft.com/en-us/cpp/build/register-usage). A function is expected to return with the same RSP as it started with.
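
For instance (a sketch of my own, not taken from the documentation):

/* leaf: makes no calls, changes no nonvolatile registers (RSP included),
   and has no exception handler -- it can run with no pdata/xdata at all. */
int leaf(int a, int b)
{
    return a + b;
}

/* frame: it calls through f, which requires adjusting RSP (a nonvolatile) to
   provide home space and keep the stack aligned -- so it needs pdata/xdata
   describing that adjustment, even though it looks just as simple. */
int frame(int (*f)(int, int))
{
    return f(1, 2) + 1;
}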

Another reason to change non-volatiles is to use them for local variables.

If you change a non-volatile, then you must first save it, and describe how you saved it. Alternatively, instead of saving it, you can describe how you changed it -- that is, you can state how much you subtracted from RSP, so that restore is not loading the old value, but just adding.

And then, the description of how you saved non-volatiles goes into xdata, which is found via pdata.

As well, if you have an exception handler, that is also described in the xdata.

So, by virtue of the base reasons of changing non-volatiles or having an exception handler, a frame function has pdata/xdata.

 - Jay

Thursday, October 26, 2017

Windows AMD64 ABI nuances part 1 -- the point of pdata/xdata.

The Windows AMD64 ABI has several surprising nuances.

The ABI documentation is required reading material for this blog entry.

I am focusing here not on calling convention -- where
in registers/stack to place parameters, or sizeof(long) --
but on exception handling and "pdata" and "xdata".

"p" means procedure, or what most programmers now call functions.
Pascal calls them "procedures" for example.

"x" presumably means "exception", or "arbitrary but not p".

So, what is the point of all the pdata/xdata?

There are one or two or three basic purposes, depending
on what you consider the same thing.

pdata/xdata lets debuggers walk the stack.
This data could be relegated to symbols, if that
was the only point, and if symbol-less debugging
was allowed to degrade so much as to break stack walking.

Keep in mind that you usually only have some symbols, like
for your code, but don't have all the symbols for functions
on the stack. So carrying around a small amount of metadata
at runtime can greatly improve the debugging experience.

As well, pdata/xdata lets other components walk the stack at runtime.
Such as profilers or sampling profilers (ETW).
It is not particularly practical to expect ETW to find and read
symbols while profiling, let alone for all code on the stack.

pdata/xdata let exception handling dispatch walk the stack.

Now, "walk the stack" -- is that just retrieving return addresses?
For strictly stack walking, no, not exactly, and for exception dispatch,
definitely not.


Other than return addresses, stack walking must recover non-volatile registers
in order to retrieve frame pointers, in order to recover return addresses.


The basic stack walk method is "recover RSP and then dereference and increment it".
However "recover RSP" is not trivial.


This point about nonvolatile restoration feeding into frame/stack/return restoration
is left as kind of a "hint" and not fully elaborated here.


Think about it. Given that a function can leave rsp in some frame pointer --
what we might think of as rbp, but it can be any nonvolatile -- and then the function
can alloca() freely, and then call another function or arbitrary functions that
save and change arbitrary nonvolatiles, how do you walk the stack? You must
restore all nonvolatiles, iteratively, to restore frame pointers, to discover
stack pointers, to discover return addresses.


As well, when exceptions are dispatched, and handlers are called, and
execution is resumed somewhere ("the exception is caught"), then, beyond
a correct stack pointer, code needs non-volatiles restored,
because locals can live in non-volatiles and are expected to survive exceptions.

I claim these use-cases are all really one slightly general thing -- restore nonvolatile registers.
RSP and RIP can be considered essentially non-volatile.

When you return "normally" from a function, RIP, RSP, and all non-volatiles are restored to what they were before you were called. (You can quibble about off-by-oneness.) Likewise, a debugger or exception dispatch can simulate returning from a function, referred to as "unwinding", without running any of the "remaining" code in the function that would normally restore the registers. They can do this via the pdata/xdata.

pdata describes the start/end of a function, and refers to the xdata.
xdata holds the "unwind codes", which describe how to undo the effects of the function's prologue, restoring all non-volatile registers, including RSP, and therefore RIP (the return address) as well.
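
To make that concrete, here is a minimal sketch (my own, x64 only) of walking
one's own stack purely through pdata/xdata, using the documented
RtlLookupFunctionEntry and RtlVirtualUnwind:

#include <windows.h>
#include <stdio.h>

__declspec(noinline) void walk(void)
{
    CONTEXT context;                 /* CONTEXT is 16-byte aligned on x64 */
    RtlCaptureContext(&context);

    for (int frame = 0; frame < 32 && context.Rip; ++frame)
    {
        DWORD64 image_base = 0;
        /* pdata lookup: RUNTIME_FUNCTION is { BeginAddress, EndAddress, UnwindData } RVAs. */
        PRUNTIME_FUNCTION function = RtlLookupFunctionEntry(context.Rip, &image_base, NULL);

        printf("rip:%p rsp:%p\n", (void*)context.Rip, (void*)context.Rsp);

        if (!function)
        {
            /* A leaf function: no pdata; the return address is simply at [rsp]. */
            context.Rip = *(DWORD64*)context.Rsp;
            context.Rsp += 8;
            continue;
        }

        PVOID handler_data = NULL;
        DWORD64 establisher_frame = 0;
        /* Undo this frame's prologue via its xdata: restores nonvolatiles,
           RSP, and therefore RIP (the return address). */
        RtlVirtualUnwind(UNW_FLAG_NHANDLER, image_base, context.Rip, function,
                         &context, &handler_data, &establisher_frame, NULL);
    }
}

int main(void)
{
    walk();
    return 0;
}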

Let's see an example where exception handling can be seen to restore non-volatile registers...
well, I was unable to get the C compiler to do it,
so this took a while and will end this first installment.
I do have more planned.

First let's provide a minimal C runtime for our assembly.
We are only building an import library, so we just need function names.
Calling printf in the modern C runtime is more involved so we will use the old one.

msvcrt.c:

void printf() { }
void exit() { }
void __C_specific_handler() { }

msvcrt.def:

EXPORTS
printf
exit
__C_specific_handler

To build this:
cl /LD msvcrt.c /link /def:msvcrt.def /noentry /nod
del msvcrt.dll


And now the assembly nvlocala.asm:

include ksamd64.inc

    extern printf:proc
    extern exit:proc
    extern RtlUnwindEx:proc
    altentry resume

.const
str1 db "hello %X %X %X %X", 10, 0

.code

; int handler(ExceptionRecord, EstablisherFrame, ContextRecord, DispatcherContext)
; RtlUnwindEx(TargetFrame, TargetIp, ExceptionRecord, ReturnValue, OriginalContext, HistoryTable)
;                 0            8          10           18           20                28
  nested_entry handler, _text
  ;int 3
  ; Save nonvolatiles just so we can trash them, to demonstrate the point.
  ; Note that even when we call RtlUnwindEx, our frame is properly unwound.
  push_reg r12  ; the last 4 registers are nonvolatile -- easy rule to remember
  push_reg r13
  push_reg r14
  push_reg r15
  alloc_stack 038h  ; establish room for 6 parameters and align

  end_prologue

; Trash nonvolatiles to help demonstrate the point.
  xor r12, r12
  xor r13, r13
  xor r14, r14
  xor r15, r15

; Dispatch or unwind?
  mov eax, ErExceptionFlags[rcx]
  test eax, EXCEPTION_UNWIND
  jne unwind

; dispatch -- always handle it, resuming at hardcoded location
  xor eax, eax
  mov [rsp + 028h], rax     ; HistoryTable is optional
  mov [rsp + 020h], r8      ; OriginalContext = ContextRecord
  mov r8, rcx               ; ExceptionRecord = ExceptionRecord
  mov rcx, rdx              ; TargetFrame = EstablisherFrame
  lea rdx, resume           ; TargetIp = resume
  mov r9, 05678h            ; ReturnValue, just to demonstrate the feature
  call RtlUnwindEx
  int 3 ; should never get here

unwind: ; We are called for unwind as about the last thing RtlUnwindEx
        ; does before restoring context to the Rip we specify.
  mov eax, ExceptionContinueSearch
  add rsp, 038h
  begin_epilogue
  pop r15
  pop r14
  pop r13
  pop r12
  ret

  nested_end handler, _text

  nested_entry entry, _text, handler
  ;int 3
  push_reg r12  ; the last 4 registers are nonvolatile -- easy rule to remember
  push_reg r13
  push_reg r14
  push_reg r15
  alloc_stack 028h  ; room for 5 parameters and aligned
  end_prologue

  ; Cache some values in nonvolatiles -- the point of this exercise.
  mov r12, 0123h
  mov r13, 0234h
  mov r14, 0456h
  mov r15, 0789h

  lea rcx, str1 ; 0
  mov rdx, r12  ; 8
  mov r8,  r13  ; 10
  mov r9,  r14  ; 18
  mov [rsp + 020h], r15
  call printf

  lea rcx, str1
  mov rdx, r12
  mov r8,  r13
  mov r9,  r14
  mov [rsp + 020h], r15
  call printf

; Produce an access violation, which will be caught.
  call qword ptr[0]
  int 3 ; should never get here

; Exception will resume here, because this is hardcoded in the handler.
 resume:
  lea rcx, str1
  mov rdx, r12
  mov r8,  r13
  mov r9,  rax      ; ReturnValue to RtlUnwindEx
  mov [rsp + 020h], r15
  call printf

  lea rcx, str1
  mov rdx, r13
  mov r8,  r12
  mov r9,  r14      ; again with the other nonvolatile
  mov [rsp + 020h], r15
  call printf

  mov ecx, 3
  call exit
  int 3 ; should not get here

  add rsp, 028h  ; matches alloc_stack 028h above
  begin_epilogue
  pop r15
  pop r14
  pop r13
  pop r12
  ret
  nested_end entry, _text

end

Build and run:

ml64 nvlocala.asm /link /entry:entry /subsystem:console .\msvcrt.lib kernel32.lib
nvlocala.exe


 - Jay

Thursday, April 6, 2017

Unbuffered I/O of arbitrary record sizes.

A few years ago I wrote a plain text indexer.

It was a bit of an experiment, so I had fun with it.

The files were all around 10MB each.

There were around 40GB of logs per day.

So, the overall data did not fit in memory (this was years ago), but individual files did.

Something I had always wanted to use successfully was unbuffered I/O.

Unbuffered I/O was appropriate here because an indexer tends to read things only once. Unbuffered is also uncached -- the data goes directly into user-supplied buffers, skipping the file system cache.

On Windows, at least, unbuffered I/O requires aligned file offsets, aligned read/write lengths, and aligned buffer addresses. This makes sense -- the alignment allows the block-level driver to work directly -- no buffering. The alignment requirement is not constant; it is up to some underlying driver and can be queried. But I just use 64K.

To align the buffers I just use VirtualAlloc. This aligns at 64K boundaries and is not interesting or challenging.
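
For reference, a minimal sketch of the setup (names are my own; it assumes, as the text does, that 64K covers the device's alignment requirement):

#include <windows.h>

#define GRANULARITY 0x10000  /* 64K: used for file offsets, lengths, and buffer alignment */

HANDLE open_unbuffered(const char *path)
{
    /* FILE_FLAG_NO_BUFFERING skips the file system cache; reads must then use
       aligned offsets, aligned lengths, and aligned buffer addresses. */
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);
}

void *alloc_read_buffer(SIZE_T size)
{
    /* VirtualAlloc returns memory aligned to the 64K allocation granularity. */
    return VirtualAlloc(NULL, size, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
}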

The indexer was line oriented (spec suggested by someone else and I went with it). The lines were of arbitrary length. Not fixed length and not short.

And word oriented -- it would find words, and record the offset of the start of the line they occurred on.

Initially I said, hey, this is easy, whenever a line crosses the end of the buffer, double the buffer size.

This does work a little teeny tiny bit, but it is actually pretty dumb and bad.

What happens is that you will repeatedly hit a line that crosses the end of the buffer. It doesn't require a long line to do this.

The buffer doubles and doubles and doubles. The only thing stopping it is when the entire file fits in the buffer. Which, for 10MB isn't terrible, but defeats the point and doesn't scale well (line orientation or word orientation should lead to an ability to index documents that do not fit in memory; as well, the indexer was partly multi-threaded, so a 10MB buffer per thread is perhaps large).

So, the solution is to read into the middle of the buffer (that is, make the buffer perhaps double the read size).

If you then cross the end of the buffer, you copy the buffer contents to before the middle. Not the entire half buffer, just from the start of the line that crosses the end of the buffer.

This will happen repeatedly.

The last line in the buffer will repeatedly cross the end of the buffer, and every time you will copy it back so that the data you have ends at the middle of the buffer.

And then eventually you might hit a very long line that does not fit in the buffer at all. Only then do you double the buffer, and copy the entire thing, to end at the middle of the newly doubled buffer.
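
In code, the copy-back scheme might look roughly like this (a sketch under assumptions: the file was opened unbuffered as above, 64K granularity, error handling omitted, and process_line standing in for the word indexing):

#include <windows.h>
#include <string.h>

static void process_line(const char *line, size_t length)
{
    (void)line; (void)length;   /* stand-in for finding words, recording offsets */
}

void index_lines(HANDLE file)
{
    SIZE_T half = 0x10000;        /* read size; data is read into the second half */
    char *buffer = (char*)VirtualAlloc(NULL, half * 2, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
    SIZE_T pending = 0;           /* bytes of a straddling line, ending at buffer + half */
    DWORD bytes_read = 0;

    while (ReadFile(file, buffer + half, (DWORD)half, &bytes_read, NULL) && bytes_read)
    {
        char *line = buffer + half - pending;     /* start of unprocessed data */
        char *end = buffer + half + bytes_read;

        for (char *p = line; p < end; ++p)        /* hand out complete lines */
        {
            if (*p == '\n')
            {
                process_line(line, p - line);
                line = p + 1;
            }
        }

        pending = end - line;                     /* tail with no newline yet */

        if (pending >= half)
        {
            /* The straddling line no longer fits before the middle: double the
               buffer and move the tail so it again ends at the (new) middle. */
            SIZE_T new_half = half * 2;
            char *bigger = (char*)VirtualAlloc(NULL, new_half * 2, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            memcpy(bigger + new_half - pending, line, pending);
            VirtualFree(buffer, 0, MEM_RELEASE);
            buffer = bigger;
            half = new_half;
        }
        else if (pending)
        {
            /* Copy just the straddling line back so it ends at the middle. */
            memmove(buffer + half - pending, line, pending);
        }
    }

    /* A final partial line at EOF (no trailing newline) is ignored in this sketch. */
    VirtualFree(buffer, 0, MEM_RELEASE);
}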

Some pictures:

 

              64K          128K
 |------------|------------|
               the\n quick brown fox\njumped really really high\n

 
First line fits no problem. Second line straddles. Copy back and read more.

              64K          128K
 |------------|------------|
        quick brown fox\njumped really really high\n

 
Second line fits now. Third line straddles. Copy back and read more.

              64K          128K
 |------------|------------|
            jumped really really high\n            

 
Line still straddles and first half of buffer already in use. Double buffer and read more. When you double, you also copy, and you copy to before the middle. It is possible maybe to copy to a different place but this maximizes buffer use and ensures you have an aligned location to read to.

                           128K                      256K
 |-------------------------|-------------------------|
            jumped really really high\n


 

Note that this example is contrived. Typically many lines will fit in the buffer, and relatively little copying occurs. And doubling will be rare. But note that some copying will occur at almost every buffer end.

 - Jay