Monthly Archives: May 2011

Progress so far


The official coding period started this monday, but I had already do some things before. I’ll now describe what we can currently achieve with my OpenCL implementation.


First, a bit of history. The Clover project seems to have been started in late 2008, and evolved at a slow pace during two years. Then, since November 2010, it seemed to have been abandoned.

This implementation was made by Zack Rusin, who I thank for his great work. Despite its lack of features, his implementation was well organized and full of good ideas.

When I started my work, I took Rusin’s work as a starting point. I looked carefully at it and saw its qualities, for example its structure, its unit tests, and the use of CMake (I like CMake and use it when I develop small KDE-based apps).

So, I decided to take its structure. I then looked at the implementation, but I thought I was better removing it and starting from scratch. The code was a bit aging and hackish, it implemented the bare minimal OpenCL functions to be able to run some demos, and I thought I would be more comfortable with code of my hand, and that what he already implemented should not take too much time to be reimplemented.

Then I downloaded the OpenCL 1.1 spec and read it carefully. I implemented one function at a time, with having read the next paragraphs to be able to take good architecture decisions. Every function is fully covered by unit tests.

Days after days, the implementation has grown, and is now in an interesting state : nearly all the infrastructure work is done, we can create contexts and command queues, and I’ll now implement buffers, commands, programs and kernels. The implementation is relatively small and straightforward to implement.

Current features

So, I’m glad to say that my code is available on Github. It can be built and installed. It contains a small example application, querying information about the platforms and devices of the OpenCL implementation. This application can also be used on a proprietary OpenCL implementation.

With this implementation, we can get the platform list, get information about these platforms, create devices, get information about them, create and destroy context and create and destroy command queues.

If you take an OpenCL application, I have high hope that all its initialization code already runs on my implementation, that’s to say the code creating a context with a specified device, and creating one or more command queues.

The next step is to implement the “useful” functions, the kernels, programs, memory objects, etc. Following my timeline, I have until early June to finish and completely debug the currently implemented functions (if needed) and some more functions needed to support some simple kernels. I think I’m comfortably in advance, so I hope I will begin to work on Clang and the interesting part sooner than anticipated, to be able to implement more stuff during the summer.


I love Clang


Here is a small post to say I just discovered a “killer optimization” in Clang that makes me confident that it will be a very good OpenCL compiler.

I’m currently implementing all the functions described in the “The OpenCL Platform Layer”chapter  of the OpenCL spec. This chapter contains three functions whose name ends with “Info”.

These functions have a signature like this :

cl_int clGetSomethingInfo(cl_something, cl_enum info, size_t len, void *buf, size_t *real_len);
  • cl_something is the object for which we want a piece of information
  • info is the info we want (CL_CONTEXT_DEVICES for example)
  • len is the size of the application-allocated buffer that will contain the info. This buffer must be large enough to contain what the function will return
  • buf is the buffer
  • real_len is returned by the function and says what size actually has the buffer

These functions are convenient to use from the application perspective, but are a pain to implement, because info can take tens of values, and each value has a type. And for each value, we must check that it will fit in the application-provided buffer. So, it’s a big switch full of copy/paste.

When I took the code from its original author, I saw he implemented this kind of function like this (lines to copy/paste marked with an @, lines originally written by the othors marked with a # because I added other to make this code on par with mine checks-wise) :

#define VERSION_STR "OpenCL 1.0"
#define VERSION_STR_LEN 10

cl_int foo() {
   switch(param_name) {
      if (param_value_size < VERSION_STR_LEN && param_value) // @
         return CL_INVALID_VALUE;                            // @
      if (param_value)                                       // @
         strcpy((char*)param_value, VERSION_STR);            // #
      if (param_value_size_ret)                              // @
         *param_value_size_ret = VERSION_STR_LEN;            // #

      return CL_INVALID_VALUE;

   return CL_SUCCESS;

We can see that each case statement needs to be full of verification code. Did I say that nearly all the parameters are optional ?

So, I found an elegant solution to solve this problem : put the verification code out of the switch, to make case statements as empty as possible. The resulting code can be found here, and is like this :

cl_int foo() {
    void *value; // The pointer we'll use outside the switch
    int value_length; // Nearly each case has a different value size

    // Then, to save space on the stack, a small union
    union {
        cl_uint cl_uint_var;
        cl_device_type cl_device_type_var;

    // Use some macros to clarify the code
#define SIMPLE_ASSIGN(type, _value) \
    value_length = sizeof(type);    \
    type##_var = _value;            \
    value = & type##_var;
#define STRING_ASSIGN(string)           \
{                                       \
    static const char str[] = string;   \
    value_length = sizeof(str);         \
    value = (void *)str;                \

    // And now the switch
    switch (param_name)
        case CL_DEVICE_TYPE:
            SIMPLE_ASSIGN(cl_device_type, CL_DEVICE_TYPE_CPU);
        case CL_DEVICE_VENDOR_ID:
            SIMPLE_ASSIGN(cl_uint, 0);

        // Tens of cases

            STRING_ASSIGN("OpenCL C 1.1 LLVM 3.0"); // TODO: LLVM version
            return CL_INVALID_VALUE;
    // Now we know all we have to, we can check everything at one place
    if (param_value && param_value_size < value_length)
        return CL_INVALID_VALUE;
    if (param_value_size_ret)
        *param_value_size_ret = value_length;
    if (param_value)
        memcpy(param_value, value, value_length);
    return CL_SUCCESS;

For one or two cases, my code is longer, but it is more easy to read and less error-prone when there are more cases (copy/paste is always to avoid in programming).

So, I was happy with that. Then, I wanted to look at the code produced by Clang to see how it handles all these things. What I saw is that it is very good at optimizing, and that my solution was not yet the best. Here is a C version of what it does :

void foo() {
    // Yeah, Clang agrees that it's a good idea :)
    void *value;
    void value_length;

    // In C, we need my union because of static typing, but assembly code uses register 
    // overlapping (eax in rax, etc)
    union {
        uint32_t i32;
        uint64_t i64;
    } my_union;

    // Then, Clang saw that my code is nearly always using SIMPLE_ASSIGN, that's to say
    // my_union
    value = (void *)&my_union;
    // And it also knows that more than half the values are i32
    value_length = 4;

    // Now a stripped-down version of the switch
    switch (param_name)
        case CL_DEVICE_TYPE:
            my_union.i32 = CL_DEVICE_TYPE_CPU; // Yes, each case is a simple assign operation !
        case CL_DEVICE_VENDOR_ID:
            my_union.i32 = 0;
            my_union.i64 = 1;
            value_length = 8;   // Oh oh, size_t is bigger than an i32 !
        case CL_DEVICE_VERSION:
            value = "OpenCL 1.1 Mesa O.1"; //Oh oh, we don't use the union !
            value_length = 20;

    // Then we have my checks and the memcpy, it isn't touched by the optimizer

I’m happy, Clang managed to make a code faster than mine, and without the need of macros to make it short. Congrats !

I didn’t test, but it is possible that even the old version, with all the copy/paste, would be optimized like that by Clang (the union is created by the optimizer pass that “unions” vars that are never used at the same time, and then it’s like my version).

So, hats off Clang and LLVM developers, you made a wonderful tool ! (And by the way, I use exclusively Clang to compile my projects, it’s faster and produce better warnings and messages).

I also looked at the code produced by GCC, but is is less beautiful that the one produced by Clang. Every case has three moves : the value length, the value in the union, and then the address of the union in *value.

Repository available


A quick blog post to give more news about my Google Summer of Code project. The first good news is that my work is now available on Github (before I possibly get an access to the Freedesktop git repository), in the Clover project.

Some days ago, I sent mails to Jerome Glisse and Stéphane Marchesin, two french-speaking Mesa developers. Jerome is also my mentor, by the way. I was discussing how to integrate my work in Mesa : a classic state tracker, or an independent library like Clover is now.

Finally, it was decided to create an independent library focused on software rendering. I’ll also add what’s needed to be easily able to add hardware acceleration after the summer.

So, my Github repository contains a clone of the mesa/clover repository on Freedesktop. I already made some small modifications to the code to be able to compile it, and I will clean it up in the following days. My target is to be able to query the platforms and devices in one or two weeks, with nice and well-organised code.

I’ll keep you informed of my progress.

Using Clang to compile OpenCL kernels


After about a week, here are some news about my Summer work. I will describe here how I want to implement the compiler part of OpenCL.

Using Clang to compile kernels

I’ll begin with a good news : an incredible Free C compiler exists : Clang. It is very easily used and embedded into applications, and runs very fast. It is also compatible with the OpenCL C subset I have to implement.

A few days ago, I tried to make Clang compile some simple kernels. I used for that the command line, not an application. My test kernel was the following :

/* Header to make Clang compatible with OpenCL */
#define __global __attribute__((address_space(1)))

int get_global_id(int index);

/* Test kernel */
__kernel void test(__global float *in, __global float *out) {
        int index = get_global_id(0);
        out[index] = 3.14159f * in[index] + in[index];

This code, placed in the file “”, can be compiled with Clang using the following command line :

clang -S -emit-llvm -o test.ll -x cl

The interesting part is “-x cl”, which tells the compiler to enable its OpenCL compatibility features. These features include more type checks (like a warning if I don’t put a “f” after the float value, saying that double precision floating point needs an OpenCL extension) and a very nice feature : the recognition of “__kernel”. All the kernels of a source file are placed in a special LLVM section. Here is the LLVM assembly output by Clang :

; ModuleID = ''
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @test(float addrspace(1)* nocapture %in, float addrspace(1)* nocapture %out) nounwind {
  %call = tail call i32 @get_global_id(i32 0) nounwind
  %idxprom = sext i32 %call to i64
  %arrayidx = getelementptr inbounds float addrspace(1)* %in, i64 %idxprom
  %tmp2 = load float addrspace(1)* %arrayidx, align 4, !tbaa !1
  %mul = fmul float %tmp2, 0x400921FA00000000
  %add = fadd float %mul, %tmp2
  %arrayidx11 = getelementptr inbounds float addrspace(1)* %out, i64 %idxprom
  store float %add, float addrspace(1)* %arrayidx11, align 4, !tbaa !1
  ret void

declare i32 @get_global_id(i32)

!opencl.kernels = !{!0}

!0 = metadata !{void (float addrspace(1)*, float addrspace(1)*)* @test}
!1 = metadata !{metadata !"float", metadata !2}
!2 = metadata !{metadata !"omnipotent char", metadata !3}
!3 = metadata !{metadata !"Simple C/C++ TBAA", null}

We can see that the “__global” macro is expanded and that address-space information can be found in the LLVM output. The “opencl.kernels” metadata entry also contains the signature of my kernel.

Running the LLVM code

Compiling a code is good, running it is better. For my Google Summer of Code, I want to implement a software OpenCL library, with all the needed stuff to be able to easily add an hardware accelerated path.

To do that, I will use an abstraction layer : a class named for instance “KernelRunner” that will be subclassed by “CPUKernelRunner” and “GalliumKernelRunner”. The CPUKernelRunner class will use the LLVM JIT to run the kernels. The Gallium one will transfort the LLVM bitcode generated by Clang to TGSI, or will pass it directly to the Gallium drivers if they support that when I will implement that.

This class will also translate some state information, for instance the bound arguments. I hope it will be fairly easy with the LLVM JIT, but it could be a bit more difficult for the accelerated path (transferring data to the GPU isn’t an easy task).

The end word

To close this post, I’ll say that I’m pretty confident that I will be able to do interesting things. I have plenty of Free Software project to rely upon, and many great people. I already asked the Clang mailing-list how to properly integrate Clang in a library and got a response. I have a question posted on the Mesa mailing list that currently has no response, but it doesn’t stop me.

Thanks for reading.