Unicorn2 Devblog: mmap and 'overcommit' memory on Windows

Motivation

Unicorn recently released 2.1.0, and one of its exciting features is that it no longer asks for 2GB of memory per instance on Windows. Previously, when we ported Unicorn to Windows, we found that QEMU uses a global static memory region as its translation buffer, which doesn’t suit Unicorn’s design because Unicorn needs a separate translation buffer for every instance. For simplicity, we wrote VirtualAlloc(2GB, MEM_RESERVE | MEM_COMMIT, ...), and it seemed to work well until we received lots of reports like “Not starting with low memory”. This drove me to investigate possible solutions.

Why does it work on Linux?

First, the reports all came from Windows users, so there had to be some difference in how the mapped memory behaves. On Linux, Unicorn calls mmap(NULL, 2GB, MAP_ANON | MAP_PRIVATE, ...) instead. Unlike VirtualAlloc with MEM_COMMIT, mmap overcommits by default, which means the virtual address range is reserved but not yet backed by physical pages. This allows users to allocate memory regions much bigger than the amount of physical memory and only consume memory when it is actually accessed. This explains why Linux users didn’t complain: each Unicorn instance usually uses only a little memory, even though 2GB is allocated.
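The lazy behavior is easy to observe. The following is a minimal sketch, not Unicorn's code: touch_sparse_mapping is a hypothetical helper that maps a region the way described above, writes a single byte, and reads it back. It assumes a 64-bit Linux system with the default overcommit policy.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of anonymous memory, touch one
 * byte in the middle, and return the value read back (or -1 if the
 * mapping could not be created). Only the touched page needs backing. */
int touch_sparse_mapping(size_t size) {
    assert(size > 0); /* mmap rejects zero-length mappings */
    uint8_t *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    if (mem == MAP_FAILED) {
        return -1;
    }
    mem[size / 2] = 0xFF; /* first access: the kernel installs a page here */
    int value = mem[size / 2];
    munmap(mem, size);
    return value;
}
```

Calling touch_sparse_mapping((size_t)2 << 30) asks for 2GB, yet on a typical setup it should succeed even on a machine with far less free memory, because the rest of the range is never touched.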

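However, VirtualAlloc with MEM_COMMIT behaves differently: it immediately charges the full 2GB against the system commit limit (physical memory plus page file), so the call fails on machines that cannot back that much memory. Therefore, the key to solving this issue is: can we have an mmap-like implementation on Windows?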

MEM_RESERVE and MEM_COMMIT

Back to our call to VirtualAlloc: MEM_RESERVE and MEM_COMMIT caught my eye, and they sound like exactly what we need. A quick search turns up this article from Microsoft. However, it requires the code that touches the memory to be wrapped in a __try block, and we can’t assume users will always do this. Further digging into the Microsoft documentation turned up nothing about how to use it without __try. It looks like this is not a beloved feature of Win32.

After some trial and error, it turned out that SEH (Structured Exception Handling) is frame-based, meaning the handlers are tied to stack frames, as confirmed by cygwin. To handle exceptions raised anywhere, we need VEH (Vectored Exception Handling) instead. Unfortunately, the documentation here is again sparse, but I soon got a minimal example working.

#include <stdint.h>
#include <stdio.h>
#include <windows.h>

static LONG handler(PEXCEPTION_POINTERS ptr) {
    PEXCEPTION_RECORD record = ptr->ExceptionRecord;
    if (record->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
        // ExceptionInformation[1] holds the faulting address.
        uint8_t* base = (uint8_t*)(record->ExceptionInformation[1]);
        fprintf(stderr, "We are accessing and committing %p\n", base);
        // Commit the page containing the faulting address, then retry the access.
        if (VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_EXECUTE_READWRITE)) {
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
    // Reserve address space only; nothing is committed yet.
    uint8_t* mem = (uint8_t*)VirtualAlloc(NULL, 16384, MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    fprintf(stderr, "Reserved %p\n", mem);
    AddVectoredExceptionHandler(0, (PVECTORED_EXCEPTION_HANDLER)handler);
    mem[0] = 0xFF; // Faults; the handler commits the page and the write is retried.

    fprintf(stderr, "Memory content is %hhx\n", mem[0]);
}

This gives:

Reserved 00000199B2BE0000
We are accessing and committing 00000199B2BE0000
Memory content is ff

And it seems that our issue is resolved.

Global Exception Handler is Bad

Unfortunately, this is still not enough to solve our problem because the installed handler is global and shared between all instances. In other words, the handler can’t tell which memory regions belong to which instance, so we can’t blindly commit memory.

But wait, this situation looks pretty similar to my previous post “Cast a Closure to a Function Pointer – How libffi closure works”, which adds an arbitrary context to a raw function pointer. Here the goal is similar: we need the handler to know the exception context. Therefore, let’s have some assembly black magic:

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <windows.h>

// Same handler as before, plus a per-instance context as the 2nd argument.
static LONG handler(PEXCEPTION_POINTERS ptr, uint64_t context) {
    PEXCEPTION_RECORD record = ptr->ExceptionRecord;
    if (record->ExceptionCode == EXCEPTION_ACCESS_VIOLATION) {
        uint8_t* base = (uint8_t*)(record->ExceptionInformation[1]);
        fprintf(stderr, "We are accessing %p and context is %llx\n", base, (unsigned long long)context);
        if (VirtualAlloc(base, 4096, MEM_COMMIT, PAGE_EXECUTE_READWRITE)) {
            return EXCEPTION_CONTINUE_EXECUTION;
        }
    }
    return EXCEPTION_CONTINUE_SEARCH;
}

int main() {
    // One RWX page: trampoline code in the first half, data slots in the second.
    uint8_t* closure = (uint8_t*)VirtualAlloc(NULL, 4096, MEM_RESERVE | MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    uint8_t* data = closure + 4096 / 2;
    uint8_t* ptr = closure;

    uint64_t context = 0x114514;
    fprintf(stderr, "Our context is %llx\n", (unsigned long long)context);

    *ptr = 0x48; // REX.w
    ptr += 1;
    *ptr = 0xb8; // mov rax, imm64
    ptr += 1;
    memcpy(ptr, &data, 8); // imm64 = address of the data area
    ptr += 8;
    // ; rax = data
    // mov [rax], rdx       ; save rdx in data[0x0]
    // mov rdx, [rax+0x8]   ; load the context as the 2nd argument
    // sub rsp, 0x28        ; 32-byte shadow space + 8 bytes to keep rsp 16-byte aligned at the call
    // call [rax + 0x10]    ; go to handler
    const char tramp[] = "\x48\x89\x10\x48\x8b\x50\x08\x48\x83\xec\x28\xff\x50\x10";
    memcpy(ptr, (void*)tramp, sizeof(tramp) - 1); // -1 skips the terminating NUL
    ptr += sizeof(tramp) - 1;
    *ptr = 0x48; // REX.w
    ptr += 1;
    *ptr = 0xba; // mov rdx, imm64
    ptr += 1;
    memcpy(ptr, &data, 8); // imm64 = address of the data area
    ptr += 8;
    // ; rdx = data
    // add rsp, 0x28 ; clean stack
    // mov rdx, [rdx] ; restore rdx
    // ret            ; rax still holds the handler's return value
    const char tramp2[] = "\x48\x83\xc4\x28\x48\x8b\x12\xc3";
    memcpy(ptr, (void*)tramp2, sizeof(tramp2) - 1);

    // Fill the data slots read by the trampoline.
    void* handler_address = (void*)handler;
    memcpy(data + 0x8, (void*)&context, 8);
    memcpy(data + 0x10, (void*)&handler_address, 8);
    AddVectoredExceptionHandler(0, (PVECTORED_EXCEPTION_HANDLER)closure);

    uint8_t* mem = (uint8_t*)VirtualAlloc(NULL, 16384, MEM_RESERVE, PAGE_EXECUTE_READWRITE);
    fprintf(stderr, "Reserved %p\n", mem);
    mem[0] = 0xFF;

    fprintf(stderr, "Memory content is %hhx\n", mem[0]);
}

In short, this allocates a closure that wraps the context with a trampoline, thus avoiding a single global exception handler. Every instance has its own closure allocated and installed. If you feel confused, have a look at the previous post, which contains a sample project.
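Since each closure carries its own context, the handler can use it to decide whether a fault belongs to its instance. Below is a minimal sketch of that check, assuming the context points at a hypothetical struct region describing the range reserved for one instance. The names struct region, page_to_commit and PAGE_SIZE_BYTES are illustrative and not taken from Unicorn's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumed page size, as in the examples above */

/* Hypothetical per-instance context: the range reserved for one instance. */
struct region {
    uintptr_t base; /* start of the reserved range (page aligned) */
    size_t size;    /* length of the reserved range in bytes */
};

/* Returns the page-aligned address to commit if `fault` lies inside `r`.
 * Returns 0 if the fault belongs to someone else, in which case the
 * handler should give up with EXCEPTION_CONTINUE_SEARCH. */
uintptr_t page_to_commit(const struct region *r, uintptr_t fault) {
    assert(r != NULL);
    if (fault < r->base || fault - r->base >= r->size) {
        return 0;
    }
    /* Round down to the start of the page containing the fault. */
    return fault & ~(uintptr_t)(PAGE_SIZE_BYTES - 1);
}
```

In the handler above, one could cast the context back to a const struct region pointer and call VirtualAlloc only when page_to_commit returns a non-zero address, so that faults from other instances or from unrelated code are passed on untouched.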

Discussion: “Overcommit” or not

I don’t really think “overcommit” is a precise word for this behavior (note that even Linux cannot commit more memory than physical + swap!). Overall, it refers to the following logic:

  • Users ask to reserve some virtual memory
  • Kernel doesn’t allocate physical pages
  • Users access the reserved virtual memory
  • Kernel is interrupted by a page fault and installs the physical page
  • Users get the data as if no page fault happened

This is more formally called “demand paging” in academia. Linux, by default, resolves the fault inside the kernel by finding a new page, while on Windows an access to reserved but uncommitted memory is delivered to userspace as an access violation (which we handled in the exception handlers). Although there should be some performance difference, both systems are capable of demand paging, or “overcommit”.

Conclusion

Although “overcommit” is not a beloved feature on Windows, it is still possible to implement it with a little magic in Unicorn. The code of this post can be found here. The full implementation in Unicorn is available here.