GlusterFS Translator API

Introduction

Before we dive into the specifics of the API, it's important to understand two things about the context in which it exists. The first is that it's a filesystem API, exposing most functionality through a dispatch table. In GlusterFS this is xlator_fops, which sort of combines Linux's file_operations, inode_operations, and super_operations (all in fs.h) in one place. To understand how translators work, you'll first need to understand what these calls do and how they relate to each other - from open/close and read/write to opendir/readdir, truncate, symlink, etc.

The second essential aspect of the translator API is that it's asynchronous and callback-based. What this means is that your code for handling a particular request must be broken up into two parts, which get to see the request before and after the next translator. In other words, your dispatch (first-half) function calls the next translator's dispatch function and then returns without blocking. Your callback (second-half) function might be called immediately when you call the next translator's dispatch function, or it might be called some time later from a completely different thread (most often the network transport's polling thread). In neither case can the callback just pick up its context from the stack as would be the case with a synchronous API. GlusterFS does provide several ways to preserve and pass context between the dispatch function and its callback, but it's fundamentally something you have to deal with yourself instead of relying on the stack.

Dispatch Tables and Default Functions

The main dispatch table for a translator is always called fops (the translator loading code specifically looks up this name using dlsym) and contains pointers to all the "normal" filesystem functions. You only need to fill in the fields for functions that your particular translator cares about. Any others will be filled in at runtime with default values that just pass the request straight on to the next translator, specifying a callback that just passes the result back to the previous translator.
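The fill-in-the-gaps behavior can be sketched in plain C. This is a toy model, not GlusterFS code: toy_fops_t, toy_fill_defaults, and the toy_default_* functions are hypothetical stand-ins for the real table, the loader logic, and the real default functions, and requests are reduced to a single int.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these types are NOT the real GlusterFS definitions. */
typedef int (*toy_fop_t)(int request);

typedef struct {
    toy_fop_t open;
    toy_fop_t truncate;
} toy_fops_t;

/* Stand-ins for the default functions: pass the request through unchanged. */
static int toy_default_open(int request)     { return request; }
static int toy_default_truncate(int request) { return request; }

/* A translator that only cares about open. */
static int my_open(int request) { return request + 100; }

static toy_fops_t fops = {
    .open = my_open,
    /* .truncate is deliberately left NULL */
};

/* Hypothetical loader step: any entry left empty gets the default. */
static void toy_fill_defaults(toy_fops_t *table)
{
    assert(table != NULL);
    if (!table->open)
        table->open = toy_default_open;
    if (!table->truncate)
        table->truncate = toy_default_truncate;
}
```

After toy_fill_defaults(&fops), the open entry still points at my_open while truncate points at the pass-through default, which is the only behavior this sketch is meant to show.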

In addition to providing this default functionality, these default functions and callbacks serve another useful purpose. Any time you need to add a new function to your translators, the easiest way to start is to copy and rename the consistently-named default function for that same operation - e.g. default_open for open, default_truncate for truncate, etc. This ensures that you start with the correct argument list and a reasonable kind of default behavior. Just make sure you update your fops table to point to your copy.

When you copy and rename a default function, your copy will often use the default callback as well (e.g. default_open will refer to default_open_cbk). Often this will be exactly what you need; if you do all of your work before passing the request onward, you might not need to do anything at all in the callback and might as well use the default one. Even when that's not the case, copying and renaming the default callback works just as well as copying and renaming the dispatch function to ensure the correct argument list and so on.

Each translator may also have additional dispatch tables, including a table named cbk which is used to manage inode and file descriptor lifecycles; see the section on inode and file descriptor context for more details.

STACK_WIND and STACK_UNWIND

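The main functions that implement the callback-based translator API are called STACK_WIND and STACK_UNWIND. These operate not on the usual call stack as you'd see in gdb, but on a separately maintained stack of frames representing calls to translators. When your fops entry point gets called, that call represents a request on its way from FUSE on the client to a local filesystem on the server. Your entry point can do whatever processing it wants, then pass the request on to the next translator along that path using STACK_WIND. Its arguments are, in order: the frame for the current request ("frame"); the callback to invoke when the next translator is done ("rfn"); the translator object whose function is being invoked ("obj"); the function to call, taken from that translator's fops table; and any further arguments specific to the request.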

As mentioned in the previous section, your "rfn" callback might be invoked from within the STACK_WIND call, or it might be invoked later in a different context. To complete a request without invoking the next translator (e.g. returning data from cache), or to pass it back to the previous one from your callback when it's done, you use STACK_UNWIND. Actually, you're better off using STACK_UNWIND_STRICT, which allows you to specify what kind of request you're completing. Its arguments are the operation type ("op", e.g. open), the frame for the current request, and any further parameters ("params") specific to the request.

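In practice, almost all of the request types use two additional parameters ahead of the operation-specific params, even though these aren't apparent in the macro definition: op_ret, the status of the operation, and op_errno, the error code that applies when it failed.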

The specific arguments used by each dispatch function and its associated callback are operation-specific, but you can always count on the first two arguments to a dispatch function being the frame for the current request (frame) and the translator object being invoked (this).

Callbacks are similar, except that there's an additional argument between those two. This is the "cookie" which is an opaque pointer stored by the matching STACK_WIND. By default this is a pointer to the stack frame that was created by the STACK_WIND (which doesn't seem terribly useful) but there's also a STACK_WIND_COOKIE call that allows you to specify a different value. In this case, the extra argument comes between the rfn and obj arguments to STACK_WIND, and can be used to pass some context from the dispatch function to its callback. Note that this must not be a pointer to anything on the stack, because the stack might be gone by the time the callback is invoked.

One other important note: STACK_UNWIND might cause the entire call stack to be unwound, at which point the last call will free all of its frames. For this reason, you should never do anything that might require even the current frame to be intact after calling STACK_UNWIND.
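To make the wind/unwind flow concrete, here is a minimal sketch in plain C. It is a toy model under loud assumptions: toy_wind and toy_unwind are hypothetical stand-ins for STACK_WIND_COOKIE and STACK_UNWIND, the frame structure holds only three fields, and the caller supplies the child frame instead of having the macro allocate it. The lower translator completes at once, so the callback runs inside the wind call.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are NOT the real GlusterFS types or macros. */
typedef struct toy_frame toy_frame_t;
typedef int (*toy_cbk_t)(toy_frame_t *frame, void *cookie,
                         int op_ret, int op_errno);
typedef int (*toy_fop_t)(toy_frame_t *frame);

struct toy_frame {
    toy_frame_t *parent;  /* frame of the translator that wound to us */
    toy_cbk_t    ret;     /* callback to run when this frame unwinds */
    void        *cookie;  /* opaque value saved at wind time */
};

static int toy_result = -1;

/* Stand-in for STACK_WIND_COOKIE: remember rfn and cookie, then call down. */
static int toy_wind(toy_frame_t *frame, toy_frame_t *child,
                    toy_cbk_t rfn, void *cookie, toy_fop_t fn)
{
    child->parent = frame;
    child->ret = rfn;
    child->cookie = cookie;
    return fn(child);
}

/* Stand-in for STACK_UNWIND: hand the result to the previous translator. */
static int toy_unwind(toy_frame_t *frame, int op_ret, int op_errno)
{
    assert(frame != NULL && frame->ret != NULL);
    return frame->ret(frame->parent, frame->cookie, op_ret, op_errno);
}

/* Lower translator: completes immediately without winding any further. */
static int lower_open(toy_frame_t *frame)
{
    return toy_unwind(frame, 3, 0);
}

/* Static storage, so the cookie stays valid however late the callback runs. */
static int upper_tag = 40;

static int upper_open_cbk(toy_frame_t *frame, void *cookie,
                          int op_ret, int op_errno)
{
    (void)frame;
    (void)op_errno;
    toy_result = op_ret + *(int *)cookie;
    return 0;
}

static int upper_open(toy_frame_t *frame, toy_frame_t *child)
{
    return toy_wind(frame, child, upper_open_cbk, &upper_tag, lower_open);
}
```

The cookie points at static storage rather than a local variable for the reason given above: in the real API the dispatch function's stack may be long gone when the callback finally runs.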

Per Request Context

Part of each translator-stack frame is a "local" pointer which is used to store translator-specific context. This is the primary mechanism for saving context between your dispatch function and its callback, so you might as well get used to the following pattern:

    /* in dispatch function */
    local = (my_locals_t *)GF_CALLOC(1,sizeof(*local),...);
    if (!local) {
        /* STACK_UNWIND with an ENOMEM error, then return */
    }
    /* fill in my_locals_t fields */
    frame->local = local;

    /* in callback */
    local = frame->local;

The important thing to remember is that every frame's local field will be passed to GF_FREE if it's non-NULL when the stack is destroyed, but no other cleanup will be done. If your own local structure contains pointers or references to other objects, then you'll need to take care of those yourself. It would also be nice if memory (and other resources) could be freed before the stack is destroyed, so it's best not to rely on the automatic GF_FREE. Instead, the safest thing to do is define your own translator-specific "destructor" and then call it manually in every return path just before STACK_UNWIND:

    void my_destructor (call_frame_t *frame)
    {
        my_own_cleanup(frame->local);
        GF_FREE(frame->local);
        /* Make sure STACK_DESTROY doesn't double-free it. */
        frame->local = NULL;
    }

It would be nice if the call_frame_t structure held a pointer to the destructor and invoked it automatically from STACK_UNWIND, and if local structures were handled more efficiently than by requiring two trips through the glibc memory allocator per translator, but that's not the world we live in.
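The destructor pattern can be exercised end to end in a small standalone sketch. Everything here is a toy stand-in: toy_frame_t replaces call_frame_t, toy_unwind replaces STACK_UNWIND and merely records the result, plain calloc/free replace GF_CALLOC/GF_FREE, and the path field is a hypothetical example of a resource that a bare free of the local structure would leak.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: a frame reduced to the fields this pattern needs. */
typedef struct { void *local; int op_ret; int op_errno; } toy_frame_t;

typedef struct {
    char *path;   /* separately allocated, so freeing local alone would leak it */
    int   flags;
} my_locals_t;

/* Stand-in for STACK_UNWIND: just records what was returned. */
static void toy_unwind(toy_frame_t *frame, int op_ret, int op_errno)
{
    frame->op_ret = op_ret;
    frame->op_errno = op_errno;
}

static void my_destructor(toy_frame_t *frame)
{
    my_locals_t *local = frame->local;
    if (!local)
        return;
    free(local->path);
    free(local);
    frame->local = NULL;   /* nothing left for automatic cleanup to free */
}

static int my_dispatch(toy_frame_t *frame, const char *path, int flags)
{
    my_locals_t *local;

    assert(frame != NULL && path != NULL);
    local = calloc(1, sizeof(*local));
    frame->local = local;
    if (!local)
        goto enomem;
    local->path = malloc(strlen(path) + 1);
    if (!local->path)
        goto enomem;
    strcpy(local->path, path);
    local->flags = flags;
    /* A real translator would STACK_WIND to the next translator here. */
    return 0;

enomem:
    my_destructor(frame);            /* clean up before unwinding */
    toy_unwind(frame, -1, ENOMEM);
    return 0;
}

static int my_callback(toy_frame_t *frame, int op_ret, int op_errno)
{
    my_destructor(frame);            /* same rule on the success path */
    toy_unwind(frame, op_ret, op_errno);
    return 0;
}
```

Both return paths call the destructor before unwinding, never after, since nothing about the frame can be trusted once the unwind has happened.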

Inode and File Descriptor Context

Most dispatch functions and callbacks take either a file descriptor (fd_t) or an inode (inode_t) as an argument. Often, your translator might need to store some of its own context on these objects, in a way that persists beyond the lifetime of a single request. For example, DHT stores layout maps for directories and last known locations on inodes. There's a whole set of functions for storing this kind of context. In each case, the second argument is a pointer to the translator object with which values are being associated, and the values are unsigned 64-bit integers. They all return zero for success, using reference parameters instead of return values for the _get and _del functions.

The _del functions are really "destructive gets" which both return and delete values. Also, the inode functions have two-value forms (e.g. inode_ctx_put2) which allow manipulation of two values per translator instead of one.

The use of a translator-object pointer as a key/index for these calls is not merely cosmetic. When an inode_t or fd_t is being deleted, the delete code looks through the context slots. For each one that's used, it looks in the translator's cbk dispatch table and calls its forget entry point for inodes or release entry point for file descriptors. If the context is a pointer, this is your chance to free it and any other associated resources.

Lastly, it's important to remember that an inode_t or fd_t pointer passed to a dispatch function or callback represents only a borrowed reference. If you want to be sure that object is still around later, you need to call inode_ref or fd_ref to add a permanent reference, and then call inode_unref or fd_unref when the reference is no longer needed.
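The idea of context slots keyed by a translator pointer can be sketched as follows. This is a toy model, not the GlusterFS implementation: toy_inode_t, toy_ctx_put, toy_ctx_get, and toy_ctx_del are hypothetical names, the slot count is arbitrary, and returning -1 on failure is an assumption (the text above only promises zero on success). There is no locking and no forget callback here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_SLOTS 4

typedef struct {
    const void *key;    /* translator pointer owning this slot; NULL if free */
    uint64_t    value;
} toy_slot_t;

typedef struct {
    toy_slot_t slots[TOY_MAX_SLOTS];
} toy_inode_t;

/* Find the slot with the given key; a NULL key finds the first free slot. */
static toy_slot_t *toy_find(toy_inode_t *inode, const void *key)
{
    for (size_t i = 0; i < TOY_MAX_SLOTS; i++)
        if (inode->slots[i].key == key)
            return &inode->slots[i];
    return NULL;
}

static int toy_ctx_put(toy_inode_t *inode, const void *xl, uint64_t value)
{
    toy_slot_t *slot;

    assert(inode != NULL && xl != NULL);
    slot = toy_find(inode, xl);
    if (!slot)
        slot = toy_find(inode, NULL);
    if (!slot)
        return -1;                  /* no room left */
    slot->key = xl;
    slot->value = value;
    return 0;
}

/* The value comes back through a reference parameter, not the return value. */
static int toy_ctx_get(toy_inode_t *inode, const void *xl, uint64_t *value)
{
    toy_slot_t *slot;

    assert(inode != NULL && xl != NULL && value != NULL);
    slot = toy_find(inode, xl);
    if (!slot)
        return -1;
    *value = slot->value;
    return 0;
}

/* "Destructive get": return the value and release the slot. */
static int toy_ctx_del(toy_inode_t *inode, const void *xl, uint64_t *value)
{
    if (toy_ctx_get(inode, xl, value) != 0)
        return -1;
    toy_find(inode, xl)->key = NULL;
    return 0;
}
```

Because the values are plain 64-bit integers, a translator that wants richer context would typically store a pointer cast to uint64_t and free it when the object goes away.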

Dictionaries and Translator Options

Another common argument type is a dict_t, which is a sort of generic dictionary or hash-map data structure capable of holding arbitrary values associated with string-valued keys. For example, values might be various sizes of signed or unsigned integers, strings, or binary blobs. Strings and binary blobs might be marked to be freed with GlusterFS functions when no longer needed, to be freed with glibc functions, or not to be freed at all. Both the dict_t and the data_t objects that hold values are reference-counted and destroyed only when their reference counts reach zero. As with inodes and file descriptors, if you want to make sure that a dict_t you received as an argument will be around later, you need to add _ref and _unref calls to manage its lifecycle appropriately.
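The borrowed-reference rule is easiest to see with a toy reference-counted object. The sketch below is an assumption-laden stand-in, not the real dict_t: toy_dict_t holds nothing but a count, toy_dict_ref and toy_dict_unref are hypothetical, and the count is not protected against concurrent access as a real implementation would have to be.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: a reference-counted object with no payload. */
typedef struct { int refcount; } toy_dict_t;

static int toy_live_dicts = 0;   /* lets the example observe destruction */

static toy_dict_t *toy_dict_new(void)
{
    toy_dict_t *d = calloc(1, sizeof(*d));
    if (!d)
        return NULL;
    d->refcount = 1;             /* the creator holds the first reference */
    toy_live_dicts++;
    return d;
}

static toy_dict_t *toy_dict_ref(toy_dict_t *d)
{
    assert(d != NULL && d->refcount > 0);
    d->refcount++;
    return d;
}

static void toy_dict_unref(toy_dict_t *d)
{
    assert(d != NULL && d->refcount > 0);
    if (--d->refcount == 0) {
        toy_live_dicts--;
        free(d);
    }
}

/* Context that outlives the dispatch call (frame->local in a real translator). */
static toy_dict_t *saved_dict = NULL;

static void my_dispatch(toy_dict_t *borrowed)
{
    /* The argument is only borrowed; take our own reference to keep it. */
    saved_dict = toy_dict_ref(borrowed);
}

static void my_callback(void)
{
    /* Done with it: drop the reference taken in my_dispatch. */
    toy_dict_unref(saved_dict);
    saved_dict = NULL;
}
```

If my_dispatch stored the pointer without taking a reference, the caller's unref would destroy the object and my_callback would touch freed memory.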

Dictionaries are not only used as dispatch function and callback arguments. They are also used to pass options to various modules, including the options for your translator's init function. In fact, the bodies of existing translators' init functions are often mostly consumed with interpreting options contained in dictionaries. To add an option for your translator, you also need to add an entry in your translator's options array (another of those names that the translator-loading code looks up with dlsym). Each option can be a boolean, an integer, a string, a path, a translator name, or any of several other specialized types you can find by looking for GF_OPTION_TYPE_ in the code. If it's a string, you can even specify a list of valid values. The parsed options, plus any other information that's translator-wide, can be stored in a structure reached through the opaque private pointer in the xlator_t structure (the xlator_t is usually named this, so the pointer is this->private).
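As a rough illustration of what option handling in an init function amounts to, here is a self-contained sketch. None of it is the real GlusterFS option machinery: toy_option_t stands in for an options-array entry, an array of toy_pair_t stands in for the dictionary, the option names cache-enabled and window-size are invented, and the boolean parser accepts only "on" and "off".

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy model only: hypothetical option types and table layout. */
typedef enum { TOY_OPT_BOOL, TOY_OPT_INT } toy_opt_type_t;

typedef struct { const char *key; toy_opt_type_t type; } toy_option_t;
typedef struct { const char *key; const char *value; } toy_pair_t;

/* Translator-wide state, as would hang off the private pointer. */
typedef struct { int cache_enabled; int window_size; } my_private_t;

static const toy_option_t options[] = {
    { "cache-enabled", TOY_OPT_BOOL },   /* invented option name */
    { "window-size",   TOY_OPT_INT  },   /* invented option name */
};
#define TOY_NUM_OPTIONS (sizeof(options) / sizeof(options[0]))

static int toy_parse(toy_opt_type_t type, const char *text, int *out)
{
    char *end = NULL;
    long v;

    if (type == TOY_OPT_BOOL) {
        if (strcmp(text, "on") == 0)  { *out = 1; return 0; }
        if (strcmp(text, "off") == 0) { *out = 0; return 0; }
        return -1;
    }
    v = strtol(text, &end, 10);
    if (end == text || *end != '\0')
        return -1;                       /* not a clean integer */
    *out = (int)v;
    return 0;
}

static int my_init(my_private_t *priv, const toy_pair_t *pairs, size_t npairs)
{
    assert(priv != NULL && (pairs != NULL || npairs == 0));
    for (size_t i = 0; i < npairs; i++) {
        size_t j;
        int value;

        for (j = 0; j < TOY_NUM_OPTIONS; j++)
            if (strcmp(pairs[i].key, options[j].key) == 0)
                break;
        if (j == TOY_NUM_OPTIONS)
            return -1;                   /* unknown option */
        if (toy_parse(options[j].type, pairs[i].value, &value) != 0)
            return -1;                   /* value does not match its type */
        if (j == 0)
            priv->cache_enabled = value;
        else
            priv->window_size = value;
    }
    return 0;
}
```

The point of the table is that type checking happens in one place, so the init body only has to copy validated values into the private structure.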

Logging

Most logging in translators is done using the gf_log function. This takes as arguments a string (generally this->name for translator code), a log level, a printf-style format, and possibly additional arguments according to the format. Commonly used log levels include GF_LOG_ERROR, GF_LOG_WARNING, and GF_LOG_DEBUG. It's often useful to define your own macros which wrap gf_log, or your own levels which map to the official ones, so that the level of debug information coming out of your translator can be adjusted at run time. In the simplest case, this might mean tweaking the variables in gdb. If you're feeling a bit more ambitious, you can add a translator option for the debug level (several of the base translators do this). If you're feeling really ambitious, you can implement a "magic" xattr call to pass in new values to a running translator.
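A wrapper macro with a run-time adjustable threshold might look something like the sketch below. It is written against a stand-in logger so that it compiles on its own: toy_log and the TOY_LOG_* levels are hypothetical substitutes for gf_log and the GF_LOG_* levels, and MY_LOG and my_log_level are invented names.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Toy model only: hypothetical levels, lower value means more severe. */
typedef enum {
    TOY_LOG_ERROR = 1,
    TOY_LOG_WARNING,
    TOY_LOG_DEBUG
} toy_loglevel_t;

static int toy_lines_logged = 0;

/* Stand-in for gf_log: domain string, level, format, arguments. */
static void toy_log(const char *domain, toy_loglevel_t level,
                    const char *fmt, ...)
{
    va_list ap;

    assert(domain != NULL && fmt != NULL);
    fprintf(stderr, "[%s] level=%d ", domain, (int)level);
    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    fputc('\n', stderr);
    toy_lines_logged++;
}

/* A plain variable, so it can be changed from gdb, an option, or an xattr. */
static toy_loglevel_t my_log_level = TOY_LOG_WARNING;

/* Translator-private wrapper: drop anything less severe than the threshold. */
#define MY_LOG(name, level, ...)                         \
    do {                                                 \
        if ((level) <= my_log_level)                     \
            toy_log((name), (level), __VA_ARGS__);       \
    } while (0)
```

With the threshold at TOY_LOG_WARNING, debug messages are dropped before any formatting happens; raising my_log_level to TOY_LOG_DEBUG lets them through without rebuilding.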

Child Enumeration and Fan Out

One common pattern is for a translator to enumerate its children, either to find the one that meets some criterion or to operate on all of them. For example, DHT needs to gather hash-layout "maps" from all of its children to determine where files should go; AFR needs to fetch pending operation counts for the same file from children to determine replication status. The idiom for this is:

    xlator_list_t *trav;
    xlator_t *xl;

    for (trav = this->children; trav; trav = trav->next) {
        xl = trav->xlator;
        do_something(xl);
    }

If the goal is to "fan out" a request to each child, some additional gyrations are necessary. The most common approach is to do something like this in the original dispatch function:

    local->call_count = priv->num_children;

    for (trav = this->children; trav; trav = trav->next) {
        xl = trav->xlator;
        STACK_WIND(frame,my_callback,xl,xl->fops->whatever,...);
    }

Then, in my_callback:

    LOCK(&frame->lock);
    call_cnt = --local->call_count;
    UNLOCK(&frame->lock);

    /* Do whatever you do for every call */

    if (!call_cnt) {
        /* Do last-call processing. */
        STACK_UNWIND(frame,op_ret,op_errno,...);
    }

    return 0;

In some cases, you can also use STACK_WIND_COOKIE to let each callback know which of N calls has returned. Examples of this are legion in the AFR code.
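The counting logic, together with a cookie that identifies which child replied, can be tried out in a single-threaded toy model. The names below are hypothetical and nothing here is GlusterFS code: toy_child stands in for winding to a child that replies immediately, the child index is passed through the cookie as an integer cast to a pointer, and the locking that a real translator needs is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_CHILDREN 3

typedef struct {
    int call_count;                  /* replies still outstanding */
    int op_ret;                      /* aggregated result */
    int replied[TOY_NUM_CHILDREN];   /* which children have answered */
} my_locals_t;

/* Toy model only: a frame reduced to what the example observes. */
typedef struct {
    my_locals_t *local;
    int unwound;                     /* how many times we "unwound" */
    int final_ret;
} toy_frame_t;

static int my_callback(toy_frame_t *frame, void *cookie, int op_ret)
{
    my_locals_t *local = frame->local;
    size_t child = (size_t)(uintptr_t)cookie;   /* index carried by the cookie */
    int call_cnt;

    assert(child < TOY_NUM_CHILDREN);
    /* A real translator would hold frame->lock around this block. */
    local->replied[child] = 1;
    if (op_ret < 0)
        local->op_ret = op_ret;      /* remember any failure */
    call_cnt = --local->call_count;

    if (!call_cnt) {
        /* Last reply: this is where STACK_UNWIND would go. */
        frame->final_ret = local->op_ret;
        frame->unwound++;
    }
    return 0;
}

/* Stand-in for winding to one child; child 1 fails, the others succeed. */
static int toy_child(toy_frame_t *frame, size_t index)
{
    int op_ret = (index == 1) ? -1 : 0;
    return my_callback(frame, (void *)(uintptr_t)index, op_ret);
}

static int my_dispatch(toy_frame_t *frame, my_locals_t *local)
{
    local->call_count = TOY_NUM_CHILDREN;
    local->op_ret = 0;
    frame->local = local;
    /* Loop on a constant, not on local->call_count, which the callbacks change. */
    for (size_t i = 0; i < TOY_NUM_CHILDREN; i++)
        toy_child(frame, i);
    return 0;
}
```

Only the callback that brings the count to zero unwinds, so the previous translator sees exactly one reply no matter how many children were asked.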

Stubs and sync calls