I love Clang

Hello,

Here is a small post to say I just discovered a “killer optimization” in Clang that makes me confident that it will be a very good OpenCL compiler.

I’m currently implementing all the functions described in the “The OpenCL Platform Layer”chapter  of the OpenCL spec. This chapter contains three functions whose name ends with “Info”.

These functions have a signature like this :

cl_int clGetSomethingInfo(cl_something, cl_enum info, size_t len, void *buf, size_t *real_len);
  • cl_something is the object for which we want a piece of information
  • info is the info we want (CL_CONTEXT_DEVICES for example)
  • len is the size of the application-allocated buffer that will contain the info. This buffer must be large enough to contain what the function will return
  • buf is the buffer
  • real_len is returned by the function and says what size actually has the buffer

These functions are convenient to use from the application perspective, but are a pain to implement, because info can take tens of values, and each value has a type. And for each value, we must check that it will fit in the application-provided buffer. So, it’s a big switch full of copy/paste.

When I took the code from its original author, I saw he implemented this kind of function like this (lines to copy/paste marked with an @, lines originally written by the othors marked with a # because I added other to make this code on par with mine checks-wise) :

#define VERSION_STR "OpenCL 1.0"
#define VERSION_STR_LEN 10

cl_int foo() {
   switch(param_name) {
   case CL_PLATFORM_VERSION:
      if (param_value_size < VERSION_STR_LEN && param_value) // @
         return CL_INVALID_VALUE;                            // @
      if (param_value)                                       // @
         strcpy((char*)param_value, VERSION_STR);            // #
      if (param_value_size_ret)                              // @
         *param_value_size_ret = VERSION_STR_LEN;            // #
      break;

   default:
      return CL_INVALID_VALUE;
   }

   return CL_SUCCESS;

We can see that each case statement needs to be full of verification code. Did I say that nearly all the parameters are optional ?

So, I found an elegant solution to solve this problem : put the verification code out of the switch, to make case statements as empty as possible. The resulting code can be found here, and is like this :

cl_int foo() {
    void *value; // The pointer we'll use outside the switch
    int value_length; // Nearly each case has a different value size

    // Then, to save space on the stack, a small union
    union {
        cl_uint cl_uint_var;
        cl_device_type cl_device_type_var;
    };

    // Use some macros to clarify the code
#define SIMPLE_ASSIGN(type, _value) \
    value_length = sizeof(type);    \
    type##_var = _value;            \
    value = & type##_var;
    
#define STRING_ASSIGN(string)           \
{                                       \
    static const char str[] = string;   \
    value_length = sizeof(str);         \
    value = (void *)str;                \
}

    // And now the switch
    switch (param_name)
    {
        case CL_DEVICE_TYPE:
            SIMPLE_ASSIGN(cl_device_type, CL_DEVICE_TYPE_CPU);
            break;
        
        case CL_DEVICE_VENDOR_ID:
            SIMPLE_ASSIGN(cl_uint, 0);
            break;

        // Tens of cases

        case CL_DEVICE_OPENCL_C_VERSION:
            STRING_ASSIGN("OpenCL C 1.1 LLVM 3.0"); // TODO: LLVM version
            break;
            
        default:
            return CL_INVALID_VALUE;
    }
    
    // Now we know all we have to, we can check everything at one place
    if (param_value && param_value_size < value_length)
        return CL_INVALID_VALUE;
    
    if (param_value_size_ret)
        *param_value_size_ret = value_length;
        
    if (param_value)
        memcpy(param_value, value, value_length);
    
    return CL_SUCCESS;
}

For one or two cases, my code is longer, but it is more easy to read and less error-prone when there are more cases (copy/paste is always to avoid in programming).

So, I was happy with that. Then, I wanted to look at the code produced by Clang to see how it handles all these things. What I saw is that it is very good at optimizing, and that my solution was not yet the best. Here is a C version of what it does :

void foo() {
    // Yeah, Clang agrees that it's a good idea :)
    void *value;
    void value_length;

    // In C, we need my union because of static typing, but assembly code uses register 
    // overlapping (eax in rax, etc)
    union {
        uint32_t i32;
        uint64_t i64;
    } my_union;

    // Then, Clang saw that my code is nearly always using SIMPLE_ASSIGN, that's to say
    // my_union
    value = (void *)&my_union;
    // And it also knows that more than half the values are i32
    value_length = 4;

    // Now a stripped-down version of the switch
    switch (param_name)
    {
        case CL_DEVICE_TYPE:
            my_union.i32 = CL_DEVICE_TYPE_CPU; // Yes, each case is a simple assign operation !
            break;
        case CL_DEVICE_VENDOR_ID:
            my_union.i32 = 0;
            break;
        case CL_DEVICE_MAX_WORK_GROUP:
            my_union.i64 = 1;
            value_length = 8;   // Oh oh, size_t is bigger than an i32 !
            break;
        case CL_DEVICE_VERSION:
            value = "OpenCL 1.1 Mesa O.1"; //Oh oh, we don't use the union !
            value_length = 20;
            break;
    }

    // Then we have my checks and the memcpy, it isn't touched by the optimizer
}

I’m happy, Clang managed to make a code faster than mine, and without the need of macros to make it short. Congrats !

I didn’t test, but it is possible that even the old version, with all the copy/paste, would be optimized like that by Clang (the union is created by the optimizer pass that “unions” vars that are never used at the same time, and then it’s like my version).

So, hats off Clang and LLVM developers, you made a wonderful tool ! (And by the way, I use exclusively Clang to compile my projects, it’s faster and produce better warnings and messages).

I also looked at the code produced by GCC, but is is less beautiful that the one produced by Clang. Every case has three moves : the value length, the value in the union, and then the address of the union in *value.

Advertisements

7 responses to “I love Clang

  • GOUJON Alexandre

    First, I’m really impressed of what you’re trying to achieve. I really hope a *big* company will hire you, you’re talented and you would deserve it.

    That’s great Clang optimizes so well the code. Thus you can focus on features and readability.
    May I suggest using do {} while(0) in your macros ? It’s already used everywhere.
    You can have a look at http://kernelnewbies.org/FAQ/DoWhile0 for instance.

    And don’t forget to have some fun too, you’re human after all ^^

  • steckdenis

    Thank you 🙂 .

    I’ll add the while (0) tomorrow, and I’ll also try to see if I can have my macros declared only at one place, and not in the three functions like it’s the case for now.

    To give information about my progress, the whole chapter “The OpenCL Platform Layer” is implemented, that’s to say all the functions related to Platform, DeviceID and Context. The following classes to be implemented are CommandQueue and Buffer, then there is plenty of commands to manage buffers. I’ll implement some of them (or maybe all of them if it’s easier than I thought), then I’ll go to the 5.4 chapter : programs. I hope to be at the program stage by the beginning of June, to be on time regarding my plan. It will be the more interesting work : using Clang to compile kernels, and run these kernels.

    Best regards.

  • blabal

    Which commands did you use to generate that C code from clang and gcc?

  • steckdenis

    It was :

    clang -S -o cpudevice.s -O3 cpudevice.cpp -I../../build/src/core/
    gcc -S -o cpudevice.s -O3 cpudevice.cpp -I../../build/src/core/

  • Vitaly

    Just a few minor suggestions. Since you are using C++, it would make sense to include the C++ headers e.g. rather than . Also, inline functions provide type checking and should be used when possible instead of macros.

    You’ve also declared value_length as int. It should be std::size_t.

    Keep up the excellent work!

    • steckdenis

      Hello,

      I just committed your suggestion about size_t.

      For the headers, I need the C ones because I use memcpy, and I know no equivalent function in C++. The problem of inline functions is that my macros use local variables of the ::info() function. If I use inlines like simple_assign(value);, I would have to pass them references to all the variables I use, for instance simple_assign(value, value_length, my_union, value);

      It’s possible to avoid doing this with sub-functions or closures, but the problem of type##_var remains : I cannot do that without a preprocessor, out of a macro.

      I also think there are too many pieces of C code for a C++ implementation, but this implementation is tightly bound to a C binding, and C++ lacks sometimes features like memcpy.

    • Vitaly

      Hi,

      Most of the suggestions were general things to keep in mind when writing C++ code.

      I think you’ll find that C++ does have memcpy, so there really is no reason to use the standard C headers. It’s available in the cstring header (analogous to the string.h header in C).

      http://www.cplusplus.com/reference/clibrary/cstring/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: