Saturday, October 28, 2017

What's in a name -- "structured" exception handling.

Consider that Posix has signals, such as SIGSEGV.

Consider that Windows has "structured exceptions"
or "SEH" -- "structured exception handling".
Such as STATUS_ACCESS_VIOLATION -- the counterpart of SIGSEGV.
 divide by zero
 stack overflow
 invalid instruction
 Many predefined 32-bit values, and you can RaiseException your own.

Consider that both signals and exceptions can be "handled".

For example, you can simulate mmap and memory mapped reading
of files by associating a range of address space with a file,
initially inaccessible, and as pages are touched, handling the fault:
VirtualProtect/mprotect the pages and read in the file contents.
(Writing is more elaborate as a worker thread or such has to issue the writes.)
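
Here is a minimal sketch of the read side of that idea on a Posix system. Everything in it is made up for illustration: lazy_region_create, on_fault and the globals are hypothetical names, and memset stands in for reading the file contents. It also calls mprotect from a signal handler, which works on common systems such as Linux but is not something Posix promises.

```cpp
#include <cassert>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical globals describing the one lazily populated region.
static char*  g_base = nullptr;
static size_t g_size = 0;
static size_t g_page = 0;
static volatile sig_atomic_t g_faults = 0;

// Fault handler: make the touched page accessible and fill it in.
// A real implementation would read file contents here instead of memset.
static void on_fault(int sig, siginfo_t* info, void*)
{
    char* addr = static_cast<char*>(info->si_addr);
    if (addr < g_base || addr >= g_base + g_size)
    {
        // Not our region: restore the default action and return,
        // so the fault happens again and terminates the process.
        signal(sig, SIG_DFL);
        return;
    }
    size_t index = static_cast<size_t>(addr - g_base) / g_page;
    char* page = g_base + index * g_page;
    mprotect(page, g_page, PROT_READ | PROT_WRITE);
    memset(page, 'A' + static_cast<int>(index), g_page); // page 0 is 'A', page 1 is 'B', ...
    g_faults = g_faults + 1;
}

// Reserve inaccessible pages and install the process-wide handler.
char* lazy_region_create(size_t pages)
{
    assert(pages > 0);
    g_page = static_cast<size_t>(sysconf(_SC_PAGESIZE));
    g_size = pages * g_page;
    void* p = mmap(nullptr, g_size, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return nullptr;
    g_base = static_cast<char*>(p);

    struct sigaction sa = {};
    sa.sa_sigaction = on_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, nullptr);
    sigaction(SIGBUS, &sa, nullptr); // some systems report this fault as SIGBUS
    return g_base;
}

int lazy_region_fault_count() { return g_faults; }
```

Reading the first byte of any page faults once, the handler fills the page, and the faulting load is restarted and succeeds. Later reads of the same page do not fault.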

Never mind if this is slow or debugger-unfriendly or just somehow bad,
or that you could just use mmap directly.
There are enough other reasons for these mechanisms to exist.
They are costly to implement and nobody did this lightly.

Consider that Posix lets you install signal handlers
for a process with sigaction. Is there a way to do it per thread?
Clearly what Posix calls a "signal", Windows calls an "exception".
So why does Windows call them "structured" exceptions, vs just plain "exceptions"?

"SEH" structured exception handling has become imbued with very negative connotations,
because you can catch access violations, and this causes bugs, security bugs.

However, it is simply a reasonable general purpose mechanism, and underlies C++ exceptions and C# exceptions. Don't fault it for providing a bit more functionality than most people want.
If this functionality were not provided here, and anybody wanted it, they would be hard pressed
to get their job done well. And there are cases where it is crucial, which I might cover later.
The operating system's job is to provide very general purpose mechanisms that most people
never use to their full functionality, aiming to provide the union of everyone's needs.
Satisfying everyone is a big feature set.


Anyway, I believe what "structured" means is in fact:
  Instead of calling a function to set or unset a thread's or process's handler,
  "structured" exception handlers are established/unestablished very efficiently by
  virtue of static lexical scoping and dynamic scope i.e. stack walking.
  "structured" means almost the same thing as "scoped".


Consider:

void f1();

void f2()
{
  __try
  {
     f1();
     __try
     {
         f1();
     }
     __except(...)
     {
     }
  }
  __except(...)
  {
     ...
  }
}

void f1()
{
  __try
  {
     ... ;
  }
  __except(...)
  {
     ...
  }
}

What would this look like using Posix signals?
Every enter and exit of a __try, and roughly, every exit of an __except,
would require an expensive call to establish or unestablish a handler.

Assuming there is a way to do it per-thread.
Absent that, you would establish a handler per process that you communicate
with somehow so it can determine lexical scope and walk the stack for dynamic scope.
Maybe via thread locals, maybe optimized "thread local storage", which is a far cry
efficiency-wise from how Windows achieves this.
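
To make that concrete, here is a rough sketch of f2 from above emulated over a single process-wide signal handler. The TryFrame record, the thread-local t_top list, and raise(SIGSEGV) standing in for a faulting instruction are all my own hypothetical choices, not anything Posix or Windows defines. Touching a thread local from a signal handler is not guaranteed safe either. Note how every scope enter and exit is explicit work at runtime.

```cpp
#include <cassert>
#include <setjmp.h>
#include <signal.h>
#include <string>

// Hypothetical per-scope record; one lives on the stack for each active "try".
struct TryFrame
{
    sigjmp_buf env;
    TryFrame*  outer;
};

static thread_local TryFrame* t_top = nullptr; // innermost active scope
static std::string g_log;                      // only touched outside the handler

// The one process-wide handler: find the innermost scope and jump to it.
static void on_signal(int sig)
{
    TryFrame* frame = t_top;
    if (frame == nullptr)
    {
        signal(sig, SIG_DFL); // nobody is guarding: fall back to the default action
        raise(sig);
        return;
    }
    t_top = frame->outer;     // leaving that scope
    siglongjmp(frame->env, 1);
}

void install_process_handler()
{
    struct sigaction sa = {};
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGSEGV, &sa, nullptr);
    assert(rc == 0);
    (void)rc;
}

void f2_emulated()
{
    TryFrame outer_frame;
    outer_frame.outer = t_top;
    if (sigsetjmp(outer_frame.env, 1) == 0)     // 1: restore the signal mask on jump
    {
        t_top = &outer_frame;                   // enter outer __try
        TryFrame inner_frame;
        inner_frame.outer = t_top;
        if (sigsetjmp(inner_frame.env, 1) == 0)
        {
            t_top = &inner_frame;               // enter inner __try
            raise(SIGSEGV);                     // "fault" inside the inner scope
            t_top = inner_frame.outer;          // normal exit, not reached here
        }
        else
        {
            g_log += "inner;";                  // inner __except
        }
        raise(SIGSEGV);                         // "fault" with only the outer scope active
        t_top = outer_frame.outer;              // normal exit, not reached here
    }
    else
    {
        g_log += "outer;";                      // outer __except
    }
}
```

After install_process_handler() and one call to f2_emulated(), g_log holds "inner;outer;" and t_top is back to null.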

On Windows, the cost of entering and exiting a __try is very small
on 32-bit x86, and essentially zero on all other architectures.
The cost on the other architectures is some cold data generated on the side
at compile/link time, and *perhaps*, but perhaps not, some inhibited
compiler optimizations.

On non-x86 platforms, communication of lexical scope is achieved by mapping
the instruction pointer (or relative virtual address within a module) to static data.
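
As a sketch of what that mapping could look like, here is a made-up scope table for f2 and a lookup by relative virtual address. The ScopeEntry layout, the addresses, and the handler ids are all hypothetical; the real tables the compiler emits have their own format. The point is only that entering a __try costs nothing at runtime, because the answer is computed from the instruction pointer when an exception actually occurs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scope record: a guarded code range and who handles it.
struct ScopeEntry
{
    uint32_t begin_rva;   // first byte of the guarded region
    uint32_t end_rva;     // one past the last byte
    uint32_t handler_id;  // stand-in for the address of the filter/handler
};

// Made-up table for f2: nested scopes are listed innermost first.
static const ScopeEntry kF2Scopes[] = {
    { 0x20, 0x30, 2 },    // inner __try
    { 0x10, 0x40, 1 },    // outer __try
};
static const size_t kF2ScopeCount = sizeof(kF2Scopes) / sizeof(kF2Scopes[0]);

// Returns the index of the innermost scope containing rva, or -1 if none.
// Relies on the innermost-first ordering of nested entries.
int find_innermost_scope(const ScopeEntry* table, size_t count, uint32_t rva)
{
    assert(table != nullptr || count == 0);
    for (size_t i = 0; i < count; ++i)
    {
        if (rva >= table[i].begin_rva && rva < table[i].end_rva)
            return static_cast<int>(i);
    }
    return -1;
}
```

With this table, an address of 0x25 resolves to the inner scope, 0x15 to the outer scope, and 0x50 to no scope at all.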

Dynamic scope is achieved by having an ABI that ensures the stack is always efficiently
walkable, again with little or no compromise to code quality.

On x86, instead of mapping the instruction pointer, functions have one volatile local integer
to indicate scope, and stack-walkability, which the ABI there does not guarantee,
is achieved via a highly optimized per-thread (or per-fiber) linked list of frames.
While it is indeed highly optimized, it is slower in most scenarios than the other architectures.

Stepping through almost any code will show the linked list is rooted at FS:0.
FS itself points to the start of a bunch of per-thread (or per-fiber) data, and this linked list
is the very first thing in it. That data is referred to as the "thread environment block" (TEB)
or "thread information block" (TIB), which to me just sounds like "Thread".

x86 may achieve faster throw/raise/dispatch of exceptions, but it spreads a "peanut butter tax"
throughout all the non-exceptional code paths.

As well, on Windows, if you do really want process-wide handlers, there are "vectored"
exception handlers. These were added in Windows XP.

Therefore Windows provides Posix-like parity with a little used
mechanism, and far surpasses it with a heavily used mechanism (again, recall that
SEH is the basis of C++ and C# exceptions, and by default also interacts with setjmp/longjmp).

"vectored" here meaning "global" instead of scoped or structured.

 - Jay

Friday, October 27, 2017

Windows AMD64 ABI nuances part 3 -- why so concerned with exception handling?

When speaking about an ABI, exception handling comes up a lot.

Many programmers will roll their eyes and protest.
"I have barely any try/catch or throw or Windows __try/__except/__finally in my program, so why so concerned with exception handling?"

"One is all you need."

If all of your exceptions are "fail fast" (exit(), ExitProcess(), TerminateProcess(), etc.),
then sure, never mind.

More common than try/catch or __try/__except/__finally are functions with locals with destructors.
Every successful construction is conceptually paired with a "finally destroy".

 struct A
 {
   A();
   ~A();
 };

 void f() { A a, b; }

 This function has an exception handler.
 It looks like:

 void f()
 {
   construct a
   try
   {
      construct b
      destroy b
   }
   finally
   {
      destroy a
   }
 }
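
 A small sketch to see the implicit "finally" in action. Tracked, g_events and run are hypothetical names of mine; the constructor of the second local can be told to throw, and the log shows that the first local is still destroyed on the way out.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

static std::string g_events; // records construction/destruction order

struct Tracked
{
    const char* name;

    explicit Tracked(const char* n, bool fail = false) : name(n)
    {
        if (fail)
            throw std::runtime_error("constructor failed");
        g_events += std::string("ctor ") + name + ";";
    }

    ~Tracked() { g_events += std::string("dtor ") + name + ";"; }
};

// Same shape as f() above: two locals, no visible try/catch.
void f(bool second_fails)
{
    Tracked a("a");
    Tracked b("b", second_fails); // if this throws, a must still be destroyed
}

// Runs f and returns the event log.
std::string run(bool second_fails)
{
    g_events.clear();
    try
    {
        f(second_fails);
    }
    catch (const std::runtime_error&)
    {
        g_events += "caught;";
    }
    return g_events;
}
```

 run(false) gives "ctor a;ctor b;dtor b;dtor a;" and run(true) gives "ctor a;dtor a;caught;". The "dtor a" in the second case is the handler the compiler generated for f.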

 - Jay

Windows AMD64 ABI nuances part 2 -- function types

The ABI speaks of two types of functions.

 https://docs.microsoft.com/en-us/cpp/build/function-types

They are called "nested" or "frame" or "non-leaf" functions, and leaf functions.

A common misconception is that a leaf function is one that makes no calls.

This is a natural misunderstanding, because people think of call trees, and leaves of it.

While a function that makes calls is indeed not a leaf, that is not the only characteristic that makes a function not a leaf.

A frame function is a function that changes non-volatile registers or has an exception handler (or both). These are the most basic and only two reasons a function is a frame function. Everything else follows.

Two common examples of changing non-volatile registers are changing RSP to allocate stack space, or calling another function, which also changes RSP.

But these are not additional conditions, merely common examples of changing non-volatile registers. RSP is just another non-volatile (https://docs.microsoft.com/en-us/cpp/build/register-usage). A function is expected to return with the same RSP as it started with.

Another reason to change non-volatiles is to use them for local variables.

If you change a non-volatile, then you must first save it, and describe how you saved it. Alternatively, instead of saving it, you can describe how you changed it -- that is, you can state how much you subtracted from RSP, so that restore is not loading the old value, but just adding.

And then, the description of how you saved non-volatiles goes into xdata, which is found via pdata.

As well, if you have an exception handler, that is also described in the xdata.

So, by virtue of the base reasons of changing non-volatiles or having an exception handler, a frame function has pdata/xdata.

 - Jay

Thursday, October 26, 2017

Windows AMD64 ABI nuances part 1 -- the point of pdata/xdata.

The Windows AMD64 ABI has several surprising nuances.

The ABI documentation is required reading material for this blog entry.

I am focusing here not on calling convention -- where
in registers/stack to place parameters, or sizeof(long) --
but on exception handling and "pdata" and "xdata".

"p" means procedure, or what most programmers now call functions.
Pascal calls them "procedures" for example.

"x" presumably means "exception", or "arbitrary but not p".

So, what is the point of all the pdata/xdata?

There are one or two or three basic purposes, depending
on what you consider the same thing.

pdata/xdata lets debuggers walk the stack.
This data could be relegated to symbols, if that
was the only point, and if symbol-less debugging
was allowed to degrade so much as to break stack walking.

Keep in mind that you usually only have some symbols, like
for your code, but don't have all the symbols for functions
on the stack. So carrying around a small amount of metadata
at runtime can greatly improve the debugging experience.

As well, pdata/xdata lets other components walk the stack at runtime.
Such as profilers or sampling profilers (ETW).
It is not particularly practical to expect ETW to find and read
symbols while profiling, let alone for all code on the stack.

pdata/xdata let exception handling dispatch walk the stack.

Now, "walk the stack" -- is that just retrieving return addresses?
For strictly stack walking, no, not exactly, and for exception dispatch,
definitely not.


Other than return addresses, stack walking must recover non-volatile registers
in order to retrieve frame pointers, in order to recover return addresses.


The basic stack walk method is "recover RSP and then dereference and increment it".
However "recover RSP" is not trivial.


This point about nonvolatile restoration feeding into frame/stack/return restoration
is left as kind of a "hint" and not fully elaborated here.


Think about it. Given that a function can leave rsp in some frame pointer --
what we might think of as rbp, but it can be any nonvolatile -- and then the function
can alloca() freely, and then call another function or arbitrary functions that
save and change arbitrary nonvolatiles, how do you walk the stack? You must
restore all nonvolatiles, iteratively, to restore frame pointers, to discover
stack pointers, to discover return addresses.


As well, when exceptions are dispatched, and handlers are called, and
execution is resumed somewhere ("exception is caught"), other than
a correct stack pointer, code needs non-volatiles restored
because locals can be in non-volatiles and are expected to survive exceptions.

I claim these use-cases are all really one slightly general thing -- restore nonvolatile registers.
RSP and RIP can be considered essentially non-volatile.

When you return "normally" from a function, RIP, RSP, and all non-volatiles are restored to what they were before you were called. (You can quibble off-by-oneness.) Likewise, a debugger or exceptions simulate returning from a function, referred to as "unwinding", without running any of the "remaining" code in the function that would normally restore the registers. They can do this via the pdata/xdata.

pdata describes the start/end of a function, and refers to the xdata.
xdata holds the "unwind codes", that describe how to undo the effects of the function's prologue, restoring all non-volatile registers, including RSP, and therefore RIP (return address) as well.
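
To illustrate "undo the effects of the prologue", here is a toy simulation. The UnwindOp/UnwindCode types, the Machine struct, and the register numbering are hypothetical simplifications of mine, and the real xdata encoding differs. The sketch lists the codes in prologue order and walks them in reverse, which is enough to show how RSP, the non-volatiles, and the return address all fall out of the same data without running the function's epilogue.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified unwind operations.
enum class UnwindOp { PushNonvol, AllocStack };

struct UnwindCode
{
    UnwindOp op;
    uint32_t operand; // register index for PushNonvol, byte count for AllocStack
};

constexpr int kRsp = 4; // index of the simulated stack pointer

struct Machine
{
    uint64_t regs[16] = {};
    uint64_t rip = 0;
    std::vector<uint64_t> memory = std::vector<uint64_t>(64, 0); // 8-byte slots
};

uint64_t load(const Machine& m, uint64_t address)
{
    assert(address % 8 == 0 && address / 8 < m.memory.size());
    return m.memory[address / 8];
}

void push(Machine& m, uint64_t value)
{
    m.regs[kRsp] -= 8;
    assert(m.regs[kRsp] / 8 < m.memory.size());
    m.memory[m.regs[kRsp] / 8] = value;
}

// Undo a complete prologue, then "return": codes are in prologue order.
void virtual_unwind(Machine& m, const std::vector<UnwindCode>& codes)
{
    for (auto it = codes.rbegin(); it != codes.rend(); ++it)
    {
        if (it->op == UnwindOp::AllocStack)
        {
            m.regs[kRsp] += it->operand;                   // undo sub rsp, N
        }
        else
        {
            m.regs[it->operand] = load(m, m.regs[kRsp]);   // undo push reg
            m.regs[kRsp] += 8;
        }
    }
    m.rip = load(m, m.regs[kRsp]);                         // return address
    m.regs[kRsp] += 8;
}

// A caller keeps values in r12/r13, calls a function that saves and trashes them.
Machine example_unwind()
{
    Machine m;
    m.regs[kRsp] = 64 * 8;
    m.regs[12] = 0x111;
    m.regs[13] = 0x222;

    push(m, 0xABC);         // the call pushes the return address
    push(m, m.regs[12]);    // prologue: push r12
    push(m, m.regs[13]);    // prologue: push r13
    m.regs[kRsp] -= 0x28;   // prologue: sub rsp, 28h
    m.regs[12] = 0;         // function body trashes the non-volatiles
    m.regs[13] = 0;

    const std::vector<UnwindCode> codes = {
        { UnwindOp::PushNonvol, 12 },
        { UnwindOp::PushNonvol, 13 },
        { UnwindOp::AllocStack, 0x28 },
    };
    virtual_unwind(m, codes);
    return m;
}
```

After example_unwind(), r12 and r13 hold the caller's 0x111 and 0x222 again, rip is the return address 0xABC, and the stack pointer is back where it was before the call.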

Let's see an example where exception handling can be seen to restore non-volatile registers.
Well, I was unable to get the C compiler to do it, so this is in assembly;
it took a while and will end this first installment.
I do have more planned.

First let's provide a minimal C runtime for our assembly.
We are only building an import library, so we just need function names.
Calling printf in the modern C runtime is more involved so we will use the old one.

msvcrt.c:

void printf() { }
void exit() { }
void __C_specific_handler() { }

msvcrt.def:

EXPORTS
printf
exit
__C_specific_handler

To build this:
cl /LD msvcrt.c /link /def:msvcrt.def /noentry /nod
del msvcrt.dll


And now the assembly nvlocala.asm:

include ksamd64.inc

    extern printf:proc
    extern exit:proc
    extern RtlUnwindEx:proc
    altentry resume

.const
str1 db "hello %X %X %X %X", 10, 0

.code

; int handler(ExceptionRecord, EstablisherFrame, ContextRecord, DispatcherContext)
; RtlUnwindEx(TargetFrame, TargetIp, ExceptionRecord, ReturnValue, OriginalContext, HistoryTable)
;                 0            8          10           18           20                28
  nested_entry handler, _text
  ;int 3
  ; Save nonvolatiles just so we can trash them, to demonstrate the point.
  ; Note that even when we call RtlUnwindEx, our frame is properly unwound.
  push_reg r12  ; the last 4 registers are nonvolatile -- easy rule to remember
  push_reg r13
  push_reg r14
  push_reg r15
  alloc_stack 038h  ; establish room for 6 parameters and align

  end_prologue

; Trash nonvolatiles to help demonstrate the point.
  xor r12, r12
  xor r13, r13
  xor r14, r14
  xor r15, r15

; Dispatch or unwind?
  mov eax, ErExceptionFlags[rcx]
  test eax, EXCEPTION_UNWIND
  jne unwind

; dispatch -- always handle it, resuming at hardcoded location
  xor eax, eax
  mov [rsp + 028h], rax     ; HistoryTable is optional
  mov [rsp + 020h], r8      ; OriginalContext = ContextRecord
  mov r8, rcx               ; ExceptionRecord = ExceptionRecord
  mov rcx, rdx              ; TargetFrame = EstablisherFrame
  lea rdx, resume           ; TargetIp = resume
  mov r9, 05678h            ; ReturnValue, just to demonstrate the feature
  call RtlUnwindEx
  int 3 ; should never get here

unwind: ; We are called for unwind as about the last thing RtlUnwindEx
        ; does before restoring context to the Rip we specify.
  mov eax, ExceptionContinueSearch
  add rsp, 038h
  begin_epilogue
  pop r15
  pop r14
  pop r13
  pop r12
  ret

  nested_end handler, _text

  nested_entry entry, _text, handler
  ;int 3
  push_reg r12  ; the last 4 registers are nonvolatile -- easy rule to remember
  push_reg r13
  push_reg r14
  push_reg r15
  alloc_stack 028h  ; room for 5 parameters and aligned
  end_prologue

  ; Cache some values in nonvolatiles -- the point of this exercise.
  mov r12, 0123h
  mov r13, 0234h
  mov r14, 0456h
  mov r15, 0789h

  lea rcx, str1 ; 0
  mov rdx, r12  ; 8
  mov r8,  r13  ; 10
  mov r9,  r14  ; 18
  mov [rsp + 020h], r15
  call printf

  lea rcx, str1
  mov rdx, r12
  mov r8,  r13
  mov r9,  r14
  mov [rsp + 020h], r15
  call printf

; Produce an access violation, which will be caught.
  call qword ptr[0]
  int 3 ; should never get here

; Exception will resume here, because this is hardcoded in the handler.
 resume:
  lea rcx, str1
  mov rdx, r12
  mov r8,  r13
  mov r9,  rax      ; ReturnValue to RtlUnwindEx
  mov [rsp + 020h], r15
  call printf

  lea rcx, str1
  mov rdx, r13
  mov r8,  r12
  mov r9,  r14      ; again with the other nonvolatile
  mov [rsp + 020h], r15
  call printf

  mov ecx, 3
  call exit
  int 3 ; should not get here

  add rsp, 028h     ; matches alloc_stack 028h in the prologue
  begin_epilogue
  pop r15
  pop r14
  pop r13
  pop r12
  ret
  nested_end entry, _text

end

Build and run:

ml64 nvlocala.asm /link /entry:entry /subsystem:console .\msvcrt.lib kernel32.lib
nvlocala.exe


 - Jay