Cast a Closure to a Function Pointer -- How libffi closure works

Lazymio's Blog

2021-07-29

C, C++, callback, closure, function pointer, libffi, mmap, python, raw function pointer, source reading

Callbacks

This post also has a Chinese version: 把闭包变成函数指针——libffi 闭包原理解析.

As is often the case, a lib written in C may expose an API which accepts a function pointer as a callback function like:

1
2
3

typedef void (*callback_t)(void* data);

void register_callback(callback_t callback, void* data);

And we could call it like:

typedef struct Context {
    int id;
} Context;

void callback(void* data) {
    Context* ctx = (Context*)data;

    printf("Context id=%d\n", ctx->id);
}

int main(){
    Context* ctx = malloc(sizeof(Context)); 
    ctx->id = 114514;

    // some operation

    register_callback(callback, ctx);

    // other operation

    free(ctx);
}

But still it’s inconvenient since we have to care about the lifetime of the data pointer manually which may cause memory leak.

Closures

If you are familiar with C++, it won’t take too much time to wrap it with std::function like:

typedef std::function<void()> callback_fn;

class CallbackManager {
public:
    static CallbackManager* New(callback_fn fn) {
        return new CallbackManager(fn);
    }

    void operator()() {
        return this->_fn();
    }

    ~CallbackManager(){
        delete this;
    }
private:
    callback_fn _fn;
    CallbackManager(callback_fn fn) : _fn(fn) {}
};

void register_callback_wrapper(callback_fn callback) {
    CallbackManager* mgr = CallbackManager::New(callback);
    register_callback([](void* data){ (*(CallbackManager*)data)(); }, mgr);
}

Then we could rewrite previous code in a little more modern and safe way:

typedef struct Context {
    int id;
} Context;

int main(){
    std::shared_ptr<Context> ctx = std::make_shared<Context>();
    ctx->id = 114514;

    register_callback_wrapper([ctx](){
        // play with Context without any concern.
        printf("Context id=%d\n", ctx->id);
    });

    // no need to free context.
}

Now we could pass a std::function, in other words, a closure with such a simple wrapper and enjoy various modern C++ utilities.

A closure typically consists of a pointer to the function and the corresponding context (captured variables etc.) which the parameter callback and data in the original function signature is exactly for.

Raw Function Pointer

But wait, not all callbacks API is designed like this. For example:

1	void register_callback_bad(callback_bad_t callback);

It only accepts a raw function pointer, which means we can’t pass a closure to it since there is no place for the context. Similarly, C++ won’t allow you to convert a lambda to a raw function pointer if it captures any variable.

So here comes the challenge: Could we pass a closure as a raw function pointer? It sounds impossible but I find libffi achieves it indeed:

After calling ffi_prep_closure_loc, you can cast codeloc to the appropriate pointer-to-function type.

Let’s see the tricks behind libffi.

libffi Implementation

Idea

As explained just now, a raw function pointer is just an address and doesn’t carry any callback-related context at runtime. Thus, the only solution is to allocate a static memory and access it within the callback function. However, we still need something to distinguish between callbacks and you may already guess it. Yes, the address of the function pointer itself can work as the unique identification, though such usage is quite uncommon.

Basically, the idea of the libffi is to use extra memory to store data for each closure but the design is pretty interesting.

Trampoline

The core concept of the implementation is called trampoline. Each trampoline consists of two important fields: code and data. The code is the entry point of the registered callback. It would have multiple copies in the memory to have different addresses. The data is the context data related to a callback. Below is the code snippet of the trampoline:

#ifdef ENDBR_PRESENT
# define X86_DATA_OFFSET	4077
# ifdef __ILP32__
#  define X86_CODE_OFFSET	4069
# else
#  define X86_CODE_OFFSET	4073
# endif
#else
# define X86_DATA_OFFSET	4081
# ifdef __ILP32__
#  define X86_CODE_OFFSET	4073
# else
#  define X86_CODE_OFFSET	4077
# endif
#endif

	.align	UNIX64_TRAMP_MAP_SIZE
	.globl	trampoline_code_table
	FFI_HIDDEN(C(trampoline_code_table))

C(trampoline_code_table):
	.rept	UNIX64_TRAMP_MAP_SIZE / UNIX64_TRAMP_SIZE
	_CET_ENDBR
	subq	$16, %rsp			/* Make space on the stack */
	movq	%r10, (%rsp)			/* Save %r10 on stack */
#ifdef __ILP32__
	movl	X86_DATA_OFFSET(%rip), %r10d	/* Copy data into %r10 */
#else
	movq	X86_DATA_OFFSET(%rip), %r10	/* Copy data into %r10 */
#endif
	movq	%r10, 8(%rsp)			/* Save data on stack */
#ifdef __ILP32__
	movl	X86_CODE_OFFSET(%rip), %r10d	/* Copy code into %r10 */
#else
	movq	X86_CODE_OFFSET(%rip), %r10	/* Copy code into %r10 */
#endif
	jmp	*%r10				/* Jump to code */
	.align	8
	.endr
ENDF(C(trampoline_code_table))
	.align	UNIX64_TRAMP_MAP_SIZE
#endif /* FFI_EXEC_STATIC_TRAMP */

So the secret is that: the offset between the code and data of a trampoline is fixed! Therefore, when a function pointer can be used as a callback, it is also the address (after plus the fix offset) of the callback data. After preparing both the real callback address and the callback data, it jumps to some other wrapper code by jmp *%r10.

Another question is that: how is the offset calculated? To keep the fixed offset between code and data, libffi allocates two continuous 4k (UNIX64_TRAMP_MAP_SIZE) memory regions. So the offset is 4k minus the total length of instructions from beginning to the current one. Assume we are in linux 86_64 with endbr64 available, we could calculate the X86_DATA_OFFSET as:

X86_DATA_OFFSET
= 4096 - sizeof(`endbr64`) - sizeof(`subq $16, %rsp`) - sizeof(`movq %r10, (%rsp)`) - sizeof(`movq X86_DATA_OFFSET(%rip), %r10`)
= 4096 - 4 - 4 - 4 - 7
= 4077

And that’s exactly the offset defined in the code snippet above.

Further Details

An interesting thing is how libffi finds the trampoline table:

static int
ffi_tramp_get_libffi (void)
{
  FILE *fp;
  char file[PATH_MAX], line[PATH_MAX+100], perm[10], dev[10];
  unsigned long start, end, offset, inode;
  uintptr_t addr = (uintptr_t) tramp_globals.text;
  int nfields, found;

  snprintf (file, PATH_MAX, "/proc/%d/maps", getpid());
  fp = fopen (file, "r");
  if (fp == NULL)
    return 0;

  found = 0;
  while (feof (fp) == 0) {
    if (fgets (line, sizeof (line), fp) == 0)
      break;

    nfields = sscanf (line, "%lx-%lx %9s %lx %9s %ld %s",
      &start, &end, perm, &offset, dev, &inode, file);
    if (nfields != 7)
      continue;

    if (addr >= start && addr < end) {
      tramp_globals.offset = offset + (addr - start);
      found = 1;
      break;
    }
  }
  fclose (fp);

  if (!found)
    return 0;

  tramp_globals.fd = open (file, O_RDONLY);
  if (tramp_globals.fd == -1)
    return 0;

  /*
   * Allocate a trampoline table just to make sure that the trampoline code
   * table can be mapped.
   */
  if (!tramp_table_alloc ())
    {
      close (tramp_globals.fd);
      tramp_globals.fd = -1;
      return 0;
    }
  return 1;
}

libffi reads all its mappings, locates the text segment and caculates the offset of the trampoline table. Note that tramp_globals.text is equal to &trampoline_code_table. Then, it maps the trampoline table by mmap.

static int
tramp_table_map (struct tramp_table *table)
{
  char *addr;

  /*
   * Create an anonymous mapping twice the map size. The top half will be used
   * for the code table. The bottom half will be used for the parameter table.
   */
  addr = mmap (NULL, tramp_globals.map_size * 2, PROT_READ | PROT_WRITE,
    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  if (addr == MAP_FAILED)
    return 0;

  /*
   * Replace the top half of the anonymous mapping with the code table mapping.
   */
  table->code_table = mmap (addr, tramp_globals.map_size, PROT_READ | PROT_EXEC,
    MAP_PRIVATE | MAP_FIXED, tramp_globals.fd, tramp_globals.offset);

  if (table->code_table == MAP_FAILED)
    {
      (void) munmap (addr, tramp_globals.map_size * 2);
      return 0;
    }
  table->parm_table = table->code_table + tramp_globals.map_size;
  return 1;
}

Note that the trampoline code is repeated multiple times by .rept to fulfill the 4k memory.

Summary

To summarize, see figure below:

Update(08/01/2021): The Trampoline mechanism also eliminates the need for Writable/Executable memory since the content of the tramp->code is generated at compile time.

My Implementation

To check whether my understanding is correct, I also write a demo. Check it at Github.

The core library is less than 100 lines and should be easy to understand compared to libffi. If you feel it helpful, don’t hesitate to give it a star.