Monday, April 18, 2022

Windows process start and dynamic linking

 I was looking at Apple dyld.

And it struck me as a little weird, er, surprising,

that they have to do work without the benefit of a heap allocator.


It is true, heap allocators require initialization

and code that runs before that initialization must do without heap.


But it struck me that the structure of Windows is a bit more

elegant here, and simpler, possibly smaller, faster, easier to maintain, more general, etc.

(OK, maybe not smaller, due to an "extra" C runtime, but that is not much, and it provides a lot of value. For example, you can printf to the debugger via this code: DbgPrint.)


So here is a brief description of Windows process startup and dynamic loading.


There are just a few basic aspects to the structure from which the rest follow.


 - ntdll.dll has a very special relationship with the kernel. (Yes, "dll dll", hereafter just "ntdll").


 - All usermode processes begin in ntdll. (I am ignoring Pico processes.)

   They do *not* begin in executables. No matter what flags

   the executable is built with. There is no "alternate dyld" or "ELF interpreter".

   There is no such thing as a statically linked executable on Windows.

   Well, yes, the executable can just ret or int3. It can even try to

   statically link system service stubs (their interface is sadly unstable).

   It need not have any imports. It can seem statically linked. 


   But execution still starts in ntdll no matter what the executable looks like, and such an executable cannot do much, given the unstable undocumented kernel interface (NtOpenFile, etc.)

(Prior to Windows XP, an executable with no imports would actually crash, because it tried to run an address in kernel32.dll that the process creator assumed would be mapped in all processes.)

 - More precisely, all usermode threads begin in ntdll. There is no specific kernel-to-user upcall for new processes, only for new threads. That suffices.


 - Process initialization occurs by virtue of thread start noticing

   the process has not been initialized. If you create a process suspended,

   and multiple threads in it suspended, and then resume them all "quickly" (NtResumeProcess?), they will race to initialize the process first.

   This is synchronized and safe.
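
To make the "race to initialize, but only one wins" idea concrete, here is a minimal sketch in portable C11. It is not ntdll's code, and how ntdll really synchronizes is not covered here: the names toy_thread_start and toy_initialize_process, the three-state flag, and the spin-wait are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical states for a one-time process initialization flag. */
enum { INIT_NOT_STARTED = 0, INIT_IN_PROGRESS = 1, INIT_DONE = 2 };

/* Zero-initialized, i.e. starts as INIT_NOT_STARTED. */
static atomic_int process_init_state;

/* Counts how many times the one-time initialization actually ran. */
int process_init_runs = 0;

/* Stand-in for the real work (heap setup, loading imports, ...). */
static void toy_initialize_process(void) {
    process_init_runs++;
}

/* Hypothetical common entry point that every new thread passes through. */
void toy_thread_start(void) {
    int expected = INIT_NOT_STARTED;
    if (atomic_compare_exchange_strong(&process_init_state, &expected,
                                       INIT_IN_PROGRESS)) {
        /* This thread won the race: it initializes the process. */
        toy_initialize_process();
        atomic_store(&process_init_state, INIT_DONE);
    } else {
        /* Another thread is initializing or already did. Wait until done.
           A real implementation would block rather than spin. */
        while (atomic_load(&process_init_state) != INIT_DONE) {
        }
    }
    /* By here, the process is initialized no matter who did the work. */
    assert(atomic_load(&process_init_state) == INIT_DONE);
    /* ... continue to the thread's own start routine ... */
}
```

However many threads pass through toy_thread_start, and in whatever order, toy_initialize_process runs exactly once.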


 - ntdll contains all the system service stubs (NtOpenFile, etc.), statically linked.

   They are exported from there to the rest of the usermode OS.

   Aside: win32u.dll contains the ones for win32k.sys, for use by user32.dll/gdi32.dll.

   In the past these were statically linked into user32.dll and gdi32.dll but got separated at some point. I don't know how DirectX works. Maybe via gdi32!Escape()?


 - ntdll has no imports. It probably could have some if it was careful, but this is kinda the point.


 - ntdll has thread locals, via an internal non-extensible mechanism. Not __declspec(thread), not TlsAlloc.

   All of ntdll's thread locals are allocated by the kernel along with every usermode thread. This is not really about ntdll per se. These "built in" thread locals are also very efficient to access: fixed register + offset. They should be spent wisely, at least each page of them, since every thread pays for them. This block is known as the TEB, the thread environment block. It is at e.g. GS:0 on AMD64, FS:0 on x86, and in dedicated registers on other processors. See NtCurrentTeb in winnt.h.


 - The special relationship between ntdll and the kernel extends, such that

   all kernel to user upcalls go through ntdll. For example exception dispatch (KiUserExceptionDispatcher), asynchronous I/O completion (KiUserApcDispatcher), win32k callbacks (KiUserCallbackDispatcher).

 - ntdll contains its own statically linked C runtime, with functions such as strcmp and bsearch.

   This is partially exported to the system, but is not generally reused. Usermode generally uses the universal C runtime (ucrt) or the older msvcrt.dll. They are more complete and support e.g. C++ exception handling. The ntdll C runtime is msvcrt.dll, but ifdef'ed ("not all lines") and selectively built ("not all files"), e.g. to avoid kernel dependencies, though they could work.

   That is, there is no fread or malloc, though malloc would be trivial. While this is somewhat wasteful, it is not terrible. This C runtime, libcntpr.lib, is also statically linked into and exported from the kernel, which is why it makes sense to omit e.g. malloc: the usermode heap is built on NtAllocateVirtualMemory / VirtualAlloc; the kernel has those too, but they are not usually what kernel code wants, and the kernel "heap" is historically ExAllocatePool, etc. (In truth, this C runtime was later ifdef'ed again to separate ntdll and kernel.)


 - Besides mapping ntdll into all processes, the kernel maps the executable and/or passes its path and/or mapped base to ntdll. Passing the path would suffice, since ntdll could map it, just as it maps .dlls.


 - ntdll process initialization then proceeds like so:


   initialize heap (process heap, i.e. GetProcessHeap())

   recursively walk the executable's imports,

    searching for .dlls

      mapping them (roughly: CreateFile + CreateFileMapping(SEC_IMAGE) + MapViewOfFile)

      resolving their imports

      calling DllMain

   call the executable's entry

Since ntdll has no imports, the only dependencies here, by static construction, are the system services and ntdll itself (being careful to initialize ntdll in the right order, e.g. heap first, but some other things too, like critical section support; critical sections optionally use heap).
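
The initialization steps above can be sketched as a toy in portable C. Everything in it is hypothetical: the module table, the module names (app.exe, a.dll, b.dll), and the function names are made up for illustration. "Mapping" is just setting a flag, and "calling DllMain" is just recording the module's name. It is meant only to show the shape of the recursion, not how ntdll actually implements it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_IMPORTS 4
#define MAX_MODULES 8

/* Hypothetical, simplified stand-in for an image and its import list. */
typedef struct ToyModule {
    const char *name;
    const char *imports[MAX_IMPORTS]; /* NULL-terminated import names */
    int mapped;                       /* nonzero once "mapped" */
    int initialized;                  /* nonzero once its DllMain analog ran */
} ToyModule;

/* Made-up dependency graph. ntdll is already mapped and has no imports. */
ToyModule toy_modules[] = {
    { "app.exe",   { "a.dll", "b.dll", NULL },     0, 0 },
    { "a.dll",     { "b.dll", "ntdll.dll", NULL }, 0, 0 },
    { "b.dll",     { "ntdll.dll", NULL },          0, 0 },
    { "ntdll.dll", { NULL },                       1, 1 },
};

/* Order in which entry points were "called", for inspection. */
const char *init_order[MAX_MODULES];
size_t init_count = 0;

static void record_init(const char *name) {
    assert(init_count < MAX_MODULES);
    init_order[init_count++] = name;
}

/* "searching for .dlls": here, a lookup in the toy table. */
static ToyModule *find_module(const char *name) {
    for (size_t i = 0; i < sizeof toy_modules / sizeof toy_modules[0]; i++) {
        if (strcmp(toy_modules[i].name, name) == 0) {
            return &toy_modules[i];
        }
    }
    return NULL;
}

/* Recursively walk a module's imports. Returns 0 on success, -1 if an
   import cannot be found. */
static int load_imports(ToyModule *m) {
    for (size_t i = 0; i < MAX_IMPORTS && m->imports[i] != NULL; i++) {
        ToyModule *dep = find_module(m->imports[i]);
        if (dep == NULL) {
            return -1;
        }
        if (dep->mapped) {
            continue;              /* already loaded, or load in progress */
        }
        dep->mapped = 1;           /* "mapping them" */
        if (load_imports(dep) != 0) {
            return -1;             /* "resolving their imports" */
        }
        dep->initialized = 1;      /* "calling DllMain" */
        record_init(dep->name);
    }
    return 0;
}

/* Toy process initialization: load all imports, then the executable's entry. */
int toy_process_init(const char *exe_name) {
    ToyModule *exe = find_module(exe_name);
    if (exe == NULL) {
        return -1;
    }
    exe->mapped = 1;
    if (load_imports(exe) != 0) {
        return -1;
    }
    record_init(exe->name);        /* "call the executable's entry" */
    return 0;
}
```

With this table, toy_process_init("app.exe") records b.dll, then a.dll, then app.exe: in this sketch, a module's imports are initialized before the module itself, and the executable's entry comes last. Marking a module as mapped before recursing is what keeps an import cycle from recursing forever.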


- You can step through all this. The "magic" is asking the debugger to stop on module loads, ntdll specifically:

 cdb /xe ld:ntdll.dll foo.exe (or maybe just /xe ld).

This breaks very early in a usermode process, long before main and long before the built-in initial breakpoint.


 - It should be noted that exception dispatch is also in ntdll.

   Dynamic loading has no problem using exceptions.

   (Glossing over: ntdll is written mostly in C. It can use C exceptions. It does

   not have a C++ runtime. The C++ runtime uses the "wrong" kind of thread locals (FlsAlloc), e.g. for rethrow, so it does not at present work here. It could, or the rethrow functionality could be omitted. Exceptions on Windows at least do not require heap, unlike on other systems, so they can be used to indicate out of memory.)


 - I think Apple could/should merge libSystem and dyld and therefore ease

   the development of dyld, but there may be reasons they are split,

   some functionality I am unaware of. Or maybe it is just too much

   work at this point. Maybe they have static executables that do not use dyld,

   or even libSystem?