Author Archives: steckdenis

The GSoC is over

Hello,

Yesterday was the last Google Summer of Code day. This wonderful experience is now over, and I will see if I pass the final evaluation.

The GSoC is over, but not my work. These following days will be a bit more light on development, as I need holidays now that I have worked hard for nearly one year (for school since September 2010, for a big school project during the holidays, then for Google since May).

But after this break, I will continue to work for Clover, and the following feature will be built-in functions. I will implement them using a small preprocessor written in Python. I’ll give it an input in the form of :

# def <var> : [values]
def vecf : float2 float3 float4 float8 float16
def gentype : float $vecf

# Normal function, placed in stdlib.c, can use Clang features
# func <name> [types] : <return type> [args]
# - args : <name>:<type>
# - $type is replaced with the current type of the implementation
# - in the body, $vecdim is replaced by the vector dimension of the type
func min $gentype : $type x:$type y:$type
    return (x < y ? x : y)
end

# Native function, placed in src/core/cpu/builtins.cpp
native cos float : float x:float
    return std::cos(x);
end

# Function overloading possible
native cos $vecf : $type x:$type
    for (unsigned int i=0; i<$vecdim; ++i)
        result[i] = std::cos(x[i]);
end

The Python preprocessor will duplicate each function for each of the types given, either in stdlib.c (for functions that will be byte-compiled by Clang and executed using the LLVM JIT, allowing inlining) or in builtins.cpp (for functions that need to use external C or C++ functions, like STL or system ones).

The current preprocessor (yes, I already started working on it) currently can produce this code for the example above :

static float cos_floatfloat(float x) {
    return std::cos(x);
}

static void cos_float2float2(float *result, float *x) {
    for (unsigned int i=0; i<2; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float3float3(float *result, float *x) {
    for (unsigned int i=0; i<3; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float4float4(float *result, float *x) {
    for (unsigned int i=0; i<4; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float8float8(float *result, float *x) {
    for (unsigned int i=0; i<8; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float16float16(float *result, float *x) {
    for (unsigned int i=0; i<16; ++i)
        result[i] = std::cos(x[i]);
}

You can see that the function names are mangled, as the preprocessor knows more things about the function than the C++ compiler (for example, the dimension of the vectors). If they had the same name, these function (except the first) should have looked the same for a C++ compiler as they take the same parameters and all return void. These functions will be put in builtins.cpp. For “normal” functions, mangling will not be necessary as Clang is aware of the vector dimension (using __attribute__((__ext_vector_size(foo)))) and can mangle even C functions using __attribute__((overloadable)).

Working on this preprocessor is exciting and I hope to be able to do that in the following days, and I will try my best to keep the code clean (Python is a clean programming language but it is easy to have programs like shell scripts).

By the way, the Clover git repository is still online and you can read the code if you didn’t do so before. You can also enjoy the documentation that is now finished (link in the right menu of this blog). I hope you’ll find it helpful and understandable.


Documentation online

Hello,

I’m currently writing the documentation of Clover, and although it is not finished yet and maybe full of spelling mistakes, it’s already online here for the ones wanting to already read it.

The link to this documentation has also been added in the side bar. It will be updated every day or so to match the Doxygen doc in Git.


The End is Approaching

Hello,

Today is the “soft pencil down” day. It means that I have to mostly stop my work now and concentrate on writing a good documentation for my project.

The first step of that is this blog post, a summary of what I’ve done during the summer, what is still to do, etc.

The Project

My project was to work on an existing small code-base called Clover, an Open Source OpenCL implementation. This project was started by a Mesa developer, Zack Rusin.

Before I started my project, and as I said in one of my first blog post, the code was very small and did nearly nothing, but the structure was well in place and easy to extend.

This project having been started by a Mesa developer, many people wanted it to be able to take advantage of hardware acceleration.

Status

My target was do to as much as I could during the summer, in order to have a feature-rich code base with many difficult details already solved. I wanted Clover to be able to launch some simple OpenCL demos on the CPU, with all in place to be able to use Gallium3D to provide hardware acceleration.

I’m happy to say that Clover evolved fairly well during the summer. It’s now an API-complete library, relatively fast (but it will become even more in the following days, I just thought about a potentially huge optimization), and with all the needed pieces in place for hardware acceleration.

The only missing things are some built-in functions available to the kernels. I already implemented the most difficult of them, and the remaining ones are things like clamp(), etc.

Plans for this week

Now that most of the work is finished, I will be able to clean the code and write documentation. It will be in Doxygen format and will cover the DeviceInterface API (abstraction layer between the core API and the devices, for the moment only CPUDevice), the core classes, and will also explain some difficult parts of Clover :

  • The way memory objects work.
  • The structure of events, command queues and CPU worker threads
  • How I implemented barrier() (a cleaner and more comprehensible explanation than the one I posted on August 7)
  • How I use Clang and LLVM to compile and launch kernels.

I’ll also improve the “stub functions” of a kernel. They are currently small functions taking no argument at all and calling the kernel function with its arguments given as constants. It works fairly well but is a bit slow when __local pointers are used (these pointers must be allocated for every work group, so their value changes constantly, and the stub function needs to be recreated).

My plan is to create only one stub function taking one argument : void **locals. This table of pointers will be used by the stub to call the kernel with varying arguments without needing to be re-created and re-JITed every time.

Furthermore, a huge optimization is made possible by that : I implemented the image built-ins, and they are full of switch statements. The code is very big and could highly take advantage of aggressive inlining. I want to implement the images not as pointers to opaque structures (“typedef struct image2d image2d_t;”), but as real structures. When the stub function is created, it will put interesting data about the images in these structs, and the kernel will have the data at hand without needing to call native functions like __cpu_get_image_width. That will allow inlining.

Current stub:

void stub_for_one_work_group() {
    kernel(16, (image2d_t)0xfee80000, (float4 *)0x8945a000);
}

void kernel(uint size, __read_only image2d_t image, __local float4 *temp) {
    // Many read_imagef calls that cannot be inlined and reduced to say
    // pixel = int_to_float(uchar_to_int(*(uchar4 *)(image->data + 256)));
}

Future stub before optimization:

void stub_for_all_the_kernel(void **locals) {
    image2d image;
    image.data_order = CLK_RGBA;   // Constant built from info available in Image2D
    image.data_type = CLK_UNORM_INT8;
    image.width = 4;
    image.height = 4;
    image.data = (uchar4 *)0xdeadbeef;

    kernel(16, &image, (float4 *)locals[0]);
}

void kernel(...) {...}

After an LLVM optimization pass, kernel() can be inlined into the stub, and all the image functions flattened to become lightweight direct memory accesses. The fact that images are stored as constants and not opaque pointers allows LLVM to do its constant propagation pass.


When easy is difficult, and vice versa

Hello,

Some days ago, I was thinking of what I have to work on first for Clover. The API was nearly complete (only samplers and clFlush/clFinish remaining), there was built-ins to implement, etc.

I decided to take some time to read the OpenCL spec part speaking about the built-in functions. The first chunk is implemented, the next ones (mathematical functions) very easy to do (but boring because despite the fact that Clang allows function overloading in C, it doesn’t allow templates and each function can accept float, float2, float3, float4, float8, float16 and the same with int, short, etc). Then came the “memory fences” that don’t do anything on a CPU device. Then image functions, fairly easy (all the algorithms are given in the spec). I finally read the part entitled “Synchronization functions”.

This part contains only one function : barrier(). This function stops the current work-item and waits for the others to reach the same function call. This doesn’t seem difficult, but in fact it is, because Clover runs the work-items of a work-group sequentially.

So, I decided I was better beginning by implementing the Sampler objects, an easy task.

As the title says, it wasn’t so easy and barrier() wasn’t so difficult. I love challenging problems, so I couldn’t help having my brain thinking about barrier(). I thought that the best way to solve this problem is to launch a work-item, and when it encounters a barrier(), to stop it and start the next, until it reaches barrier(), launching the next, etc. The idea is good, but difficult to achieve. I wanted to use nested calls, but it could have resulted to stack overflows.

After a few minutes, I remembered what I’ve read in an old book dating from the Windows NT 4 days (it was a Microsoft Press book). In an appendix, it speaks about a strange thing called “fibers”. In fact, during my reflection, I wanted something like threads but managed cooperatively by the application. I don’t know why, but this small ten-pages appendix (in a book close to 1000 pages) I’ve read nearly 6 years ago came immediately in my mind.

After a quick Google search to check that fibers are still available in modern operating system, I found my dream : setcontext (and other functions in the same family).

These functions allow a thread to save its context (stack, CPU registers, flags, instruction pointer) and to jump into another context (like an interprocedural goto). It’s exactly what I want !

Each work-item will have its own context consisting of a small stack and CPU registers. When a barrier is encountered, the context is halted and CPUKernelWorkGroup jumps to the next context. It will also encounter the barrier, jump to the next context, and so forth. When the last context reached barrier(), it can continue, get to another barrier or finish its execution. When that arises, the unfinished contexts are resumed and can continue their execution.

It’s way easier and more efficient than having one thread per work-item, I don’t need to use any locking machinery, the context switch is exactly when the application needs it, etc.

So, this barrier() problem was relatively easily solved. It allowed me to read some papers about using Clang and LLVM to implement OpenCL (and it seems that either nVidia or AMD, I don’t remember, are using a complex LLVM IR rewriting to chain blocks, handling barriers, to allow complex auto-vectorization to take place). I hope that my solution will work and will not be too slow (the contexts will be created at the first barrier() call, so no overhead when no barriers are needed).

I then came back to my samplers, the “boring” thing I have to do before being able to work on barrier(), using very exciting POSIX functions.

The implementation of class Sampler was boring as expected. I need to convert three flag arguments to one bitfield, and then to convert it back to the flag arguments in clGetSamplerInfo. Boring, yes. The API was also easy : copy/paste of the one of Context, with only a small set of modifications.

Then, I had to plug the samplers in the Kernel code. I went to the big switch handling arguments type, and saw an horror : sampler_t isn’t a pointer to an opaque struct, it is an unsigned int !

What does it mean ? An opaque struct has a name, so I can use ” struct_type->getName() == “image2d”; ” to know that the argument is of type Image2D. With an integer, all I get from LLVM IR is a “i32” type, indistinguishable from a simple “uint” type.

The problem was bigger than expected, and no Microsoft book could help me : LLVM simply doesn’t supply enough type information, and I don’t want to use Clang’s debug informations (they are too big and complex).

After a few days of thinking (barrier() only too a few hours to sort out), I came to a solution that works in 99,99999% of the cases : storing in a std::list the known samplers. In setKernelArg, when the user tries to fit a pointer into an i32, I check if the pointer is a known sampler. On 32-bit architectures, sizeof(void *) == 4, so I always check if the i32 isn’t in fact a pointer to a sampler.

On 64-bit machines, the code is perfect. The samplers are detected, normal i32s are left untouched, and plain wrong i64-to-i32 conversions are spotted and result to an error. On 32-bit machines though, all valid i32s are checked against the table of known samplers, and it’s possible that a valid i32 corresponds to a valid pointer to a sampler. In this corner case, the i32 gets replaced by the sampler’s bitfield value, that is data corruption.

I have no better solution, and the infrastructure needed to do that allowed me to make Clover more robust, so I keep it for the moment. I’ll now implement clFlush and clFinish, then tests for the command queue events, then barrier().

I hope to have barrier() finished by August 15, the “soft pencil down date”. It will mean that I was able to code during my summer a nearly complete OpenCL implementation, lacking only some small built-ins. The following days until August 22 will be used to write the documentation of what I did, and maybe some builtins (the image ones, I want to do first the most complex thing, because it’s what I have to do).

By the way, if you know applications or examples using OpenCL (but nearly no built-in functions), I will be glad to test OpenCL with them. It will be especially interesting if these applications use many clEnqueueWaitForEvents, clEnqueueBarrier, barrier() and out-of-order command queues, that is the part of Clover the most difficult to test using testsuites.


Boring work for API completeness

Hello,

Clover’s development seemed slow these days, but in fact it wasn’t. I’m currently “polishing” all I’ve already done. Not because I’m near the end of the project, but because the last part of my Google Summer of Code project will begin in the following days, and I want the code upon which I’ll build it to be solid.

So, my first target for Clover was to be able to launch OpenCL-compiled kernels. In order to be able to do that, the implementation needed to support several things : buffers, events, command queues, contexts, etc. Now that the kernels can run (but without any interesting built-in function), I decided to finish the public API of OpenCL.

In the git repository, you can therefore see many commits like “Implement clFoo and clFooBar”. I’ve read all the APIs and implemented the missing functions.

Currently, I focused on the “enqueue” functions, that is the functions used to queue specific events, the actions OpenCL can perform. These functions are :

  • clEnqueueRead/WriteBufferRect: a complex function copying a buffer to another, but only a rectangle (if we say the buffer contains 2D data) or a cube. This event is particularly important because I built all the image-related events upon it.
  • clEnqueueCopyBuffer: a simple event copying a buffer to another.
  • clEnqueueCopyBufferRect.
  • clCreateImage2D and clCreateImage3D, to add image support to Clover.
  • clEnqueueReadImage and clEnqueueWriteImage, built upon CopyBufferRect.
  • clEnqueueCopyImage (really the mirror of CopyBufferRect).
  • clEnqueueCopyImageToBuffer and clEnqueueCopyBufferToImage.
  • clEnqueueMapImage.
  • clGetSupportedImageFormats.
  • And then clEnqueueBarrier, clEnqueueMarker and clEnqueueWaitForEvents

Now, all the “enqueue” API is completed. I have now to implement the samplers, and clFlush and clFinish. Then, I will be able to implement the interesting built-in functions (from simple mathematical functions to barrier(), the one that could take a fair amount of time thinking on how I could implement it).

The functions I just implemented are based on the “events” framework of Clover, a set of classes inheriting Coal::Event and organized in a complex heritage tree. This enabled me to implement all the events and their checks with only 1500 lines of code in events.cpp (the biggest file of Clover). All the “rectangle-related” events (that is to say Read/Write/CopyBufferRect, and image events) are implemented in less than 100 lines of worker code in CPUDevice (but the code isn’t really readable, I heavily used the testsuite to check my code). For the reference, here is the code doing all the 2D and 3D copies in CPUDevice :

case Event::ReadBufferRect:
case Event::WriteBufferRect:
case Event::CopyBufferRect:
case Event::ReadImage:
case Event::WriteImage:
case Event::CopyImage:
case Event::CopyBufferToImage:
case Event::CopyImageToBuffer:
{
    // src = buffer and dst = mem if note copy
    ReadWriteCopyBufferRectEvent *e = (ReadWriteCopyBufferRectEvent *)event;
    CPUBuffer *src_buf = (CPUBuffer *)e->source()->deviceBuffer(device);

    unsigned char *src = (unsigned char *)src_buf->data();
    unsigned char *dst;

    switch (t)
    {
        case Event::CopyBufferRect:
        case Event::CopyImage:
        case Event::CopyImageToBuffer:
        case Event::CopyBufferToImage:
        {
            CopyBufferRectEvent *cbre = (CopyBufferRectEvent *)e;
            CPUBuffer *dst_buf =
                (CPUBuffer *)cbre->destination()->deviceBuffer(device);

            dst = (unsigned char *)dst_buf->data();
            break;
        }
        default:
        {
            // dst = host memory location
            ReadWriteBufferRectEvent *rwbre = (ReadWriteBufferRectEvent *)e;

            dst = (unsigned char *)rwbre->ptr();
        }
    }

    // Iterate over the lines to copy and use memcpy
    for (size_t z=0; z<e->region(2); ++z)
    {
        for (size_t y=0; y<e->region(1); ++y)
        {
            unsigned char *s;
            unsigned char *d;

            d = imageData(dst,
                          e->dst_origin(0),
                          y + e->dst_origin(1),
                          z + e->dst_origin(2),
                          e->dst_row_pitch(),
                          e->dst_slice_pitch(),
                          1);

            s = imageData(src,
                          e->src_origin(0),
                          y + e->src_origin(1),
                          z + e->src_origin(2),
                          e->src_row_pitch(),
                          e->src_slice_pitch(),
                          1);

            // Copying an image to a buffer may need to add an offset
            // to the buffer address (its rectangular origin is
            // always (0, 0, 0)).
            if (t == Event::CopyBufferToImage)
            {
                CopyBufferToImageEvent *cptie = (CopyBufferToImageEvent *)e;
                s += cptie->offset();
            }
            else if (t == Event::CopyImageToBuffer)
            {
                CopyImageToBufferEvent *citbe = (CopyImageToBufferEvent *)e;
                d += citbe->offset();
            }

            if (t == Event::WriteBufferRect || t == Event::WriteImage)
                std::memcpy(s, d, e->region(0)); // Write dest (memory) in src
            else
                std::memcpy(d, s, e->region(0)); // Write src (buffer) in dest (memory), or copy the buffers
        }
    }

    break;
}

ImageData is a simple function returning the address of a pixel given its coordinates. It currently works only on little-endian architectures. You’ll see that bytes_per_pixel is always 1 in this code (the last argument of imageData). It’s normal, Event objects already did the multiplications where needed.

static unsigned char *imageData(unsigned char *base, size_t x, size_t y,
                                size_t z, size_t row_pitch, size_t slice_pitch,
                                unsigned int bytes_per_pixel)
{
    unsigned char *result = base;

    result += (z * slice_pitch) +
              (y * row_pitch) +
              (x * bytes_per_pixel);

    return result;

I’m nearing the end of my project. I don’t know if I will be able to implement all the built-in functions by August 25. I’ll start with the “difficult” ones (barrier(), image reading and writing) in the hope that I will be able to implement the remaining ones after the Summer of Code program. These are fairly simple functions already implemented in many third-party mathematical libraries, so I can simply call them or copy their code.


A fix and an optimization

Hello,

Last week, I was able to run an OpenCL-compiled kernel using Clover. The problem was that there was a problem : every once, the kernel failed.

I thought it was a locking problem. In fact, it was : the work groups called kernel->callFunction (to get a zero-argument function calling the actual kernel with its arguments) without any lock.

After a fix, the kernels were mostly working but continued to fail after a few iterations. The problem took one week to find and was tricky.

In fact, when an event is “finished”, it can be deleted. The logic was simple : run each work group of a kernel, and when the last work group has finished running, kill the kernel. This seemed to work on paper, but in reality, on multi-core machines running multi-threaded OSes, it doesn’t work. What happened is that the last work group finished before other work groups, because they were on busier CPU cores, or were preempted. The kernel deleted, clWaitForEvent returned, and the program began to read the buffer while it wasn’t already ready.

This fix was to add a counter of the finished work groups. When all the work groups are finished, we can delete the kernel. A simple fix for a complex bug.

I then added an optimization regarding the get_global_id builtin function. This function returns an ID based on the current work group and work item. OpenCL also adds a “work-offset” and some other things. The result was a function with two additions, a multiplication, and several getter calls (I hadn’t enabled link-time optimization).

Now, some of these computations are cached in a private variable, and get_global_id is free of any function call and contains only a simple addition. The speedup is interesting : for a kernel with two very slow modulo operations (what a processor does the slowest), the speedup is of 1.72x. I haven’t tried a simpler kernel, but it seems that the speedup for only get_global_id is already very interesting.

The other builtin functions are simpler and don’t need to be optimized. Another slow part of Clover is the thread-local storage. Each call to a builting needs a call to __tls_get_storage or a function like this.

The plan for the following days is to implement samplers and images (to have a complete API), and then a Qt-based example applications allowing to write kernels, add buffers and images and run programs. This will allow me to write nice test kernels for the testsuite, that will have a new testrunner taking a kernel as argument (and information about the buffers) and testing the kernel. With all of this, I will be able to write and test the builtin functions 🙂 .


Compile and launch simple OpenCL C kernels with Clover

Hello,

Another Clover milestone was hit today ! During my holidays (from July 4 to July 15), I worked on the OpenCL C compiler based on Clang that is used in Clover to compile OpenCL programs. I also implemented all the infrastructure to get and launch the kernels of a program.

All is in place and can be tested (but there are still some limitations). Clover is now able to fully execute functions like clSetKernelArg, clBuildProgram and the best one : clEnqueueNDRangeKernel.

This means that from now, my OpenCL implementation is very close from being complete API-wise (I’ve still a lot of work to do regarding the OpenCL C language and its standard set of functions).

For information, Clover is able to launch this program (found in tests/test_kernel.cpp) :

float simple_function(float a) {
    return a * 2.5f;
}

__kernel void kernel1(__global float *a, __global float *b, float f) {
    size_t i = get_global_id(0);

    a[i] = simple_function(f) * b[i];
}

__kernel void kernel2(__global int *buf) {
    size_t i = get_global_id(0);

    buf[i] = 2 * i;
}

It recognizes that kernel1 and kernel2 are two different kernels (and that simple_function cannot be called as a kernel). The testcase launches kernel2 as it has only one argument (it’s faster to test a simpler kernel, but I’ll also add testcases for the other kernel), and checks that its result is ok.

int buffer[64];

// Create a cl_mem buffer to give to the kernel
buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                     sizeof(buffer), buffer, &result);

// Fetch the list of the kernels in the program
result = clCreateKernelsInProgram(program, 2, kernels, 0);

// Set the argument of kernel2
result = clSetKernelArg(kernels[1], 0, sizeof(cl_mem), &buf);

// Execute the kernel
size_t global_size = sizeof(buffer) / sizeof(buffer[0]);
result = clEnqueueNDRangeKernel(queue, kernels[1], 1, 0, &global_size, 0, 0, 0, &event);
result = clWaitForEvents(1, &event);

The buffer contains after the kernel is run 64 ints : (0, 2, 4, 6, 8, …, 126).

The implementation of all of this is a bit tricky. I invite you to take a look at the CGit, but the commits are fairly difficult to understand. I hope the code isn’t too ugly, it works, I even find it somewhat nice at some places, but it’s too early to be sure it’s good.

To end this blog post, a summary of what I did during my holidays :

55 files changed, 5031 insertions(+), 1941 deletions(-)

The line numbers are not truth, but it show I was able to work many hours a day during my holidays (and I also very enjoyed them). I’m happy 🙂 . I promised to be able to work during my holidays, I kept my promise. I have now a month and a half to work on the builtin functions (only get_global_id is implemented, and there are plenty of other functions) and to finish the API (samplers and images aren’t yet completely implemented).

Then I will have to write tests, tests and more tests.


When LLVM is slow

Hello,

A quick blog post about what I’ve just seen in the LLVM codebase.

First, a note about the progress I made since the previous post. The kernels are done, I’ve implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. It means that API-wise, we can run compiled OpenCL kernels. But it isn’t the case yet. I still need to implement many things in the core of Clover, but every hour makes me closer to finish that.

One of the thing I have to implement is a fast mean to call a kernel. “Fast” is really important here, since most kernels are small (thus run fast) but are a lot to be run.

I naively began by envisaging using llvm::ExecutionEngine::runFunction. This function takes two parameters : the function to run, and a vector of its arguments. This vector didn’t make me happy, I found it too slow. So I looked at the code of this function for the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by this function !

// ... Many "fast" special cases

// Okay, this is not one of our quick and easy cases.  Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments.  Make this function and return.

// First, create the function.
FunctionType *STy=FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",
                                  F->getParent());

// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);

// ... Code generation for the args, to have a function like
//     T f() { return function(3, "foobar", (int *)0x9fffff90); }

// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());
// Erase it, since no other function can have a reference to it.
Stub->eraseFromParent();

The last line is the worst : the function is removed, each call will need to use code generation !

Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters. So, I can build this stub function (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple :

void (*func)() = device_kernel->nativeFunctionAddress();

for (...)
    func();

Simple, and really fast !


Thread-local storage

Hello,

Here is a post about my thoughts on how to implement functions like “get_global_id()” in a threaded environment.

This function, and others like it, returns an identifier of the current OpenCL kernel run. It’s used for instance to have the same kernel running in parallel on multiple elements of a data set. The trick is that its return value cannot be inferred from its arguments, but from some data provided either by the GPU, or the current CPU thread.

Let’s see how I want to implement that from top to bottom.

The LLVM IR binding

The very good thing about LLVM IR is that it allows calling remote functions. It’s a bit dangerous (I have to figure out how to disable a kernel from calling random C functions resolvable from the current application, like exec()) but very powerful to allow implementing functions like get_global_id() in pure compiled C++.

The first part is to define the function in the LLVM IR. Using Clang as the compiler, it’s as easy as writing a declaration in a header, like done in src/runtime/stdlib.h :

size_t get_global_size(uint dimindx);

It’s all for the LLVM IR part. Now, I’ll use the LLVM JIT to run kernels, so I dove in the LLVM documentation and saw the llvm::ExecutionEngine:: InstallLazyFunctionCreator() function. This function takes a function pointer that will get called everytime the JIT sees a function not defined in LLVM IR. Clover will be able to provide a native function for this name.

So, the code :

static void *createFunction(const std::string &name)
{
    if (name == "get_global_id")
        return &native_get_global_id;
    else if (name == "get_work_dim")
        return &native_get_work_dim;
    // Other...
}

So, it’s simple (but not really fast due to the string comparisons, fortunately this function is called once for each function call-site, not for every call). Let’s implement these functions.

The native part

My first plan was to put the thread-local variable in the stdlib bitcode, and implementing functions like get_global_id like this :

__thread void *tls;                    // Value set by native code, but stored by LLVM

size_t _get_global_id(void *tls, uint dimindx);   // The extern native function

size_t get_global_id(uint dimindx)
{
    return _get_global_id(tls, dimindx);
}

The idea was to have the native functions called with the thread-local param given as a parameter (its could be a pointer to a CPUKernelInstance). Some inlining during the program compilation would have replaced all the get_global_id calls with _get_global_id calls.

It was a bad idea.

The first reason is that TLS is not well supported by the LLVM JIT. It works on x86, not on x86_64, and I don’t know if it’s good on ARM and other targets. And I want OpenCL to be usable everywhere, even on Meego-powered ARM devices using Free drivers 🙂 .

So, after some minutes of reasoning, I found an evident and way better solution : putting the TLS value in native code, not in the LLVM IR.

So, I removed all the stub functions from stdlib.c. Now, a call to get_global_id remains a call to this function, without added parameters and without renaming.

So, it’s time to implement these functions :

__thread struct
{
    size_t work_dim;
    size_t current_global_id[CPUDEVICE_MAX_DIMS];
    // Other variables needed to implement the functions
} g_tls;

size_t native_get_work_dim()
{
    return g_tls.work_dim;
}

size_t native_get_global_id(uint dimindx)
{
    if (dimindx >= CPUDEVICE_MAX_DIMS)
        return 0; // Follow the OpenCL spec

    return g_tls.current_global_id[dimindx];
}

It’s not so complicated. It’s a pity that I cannot put the TLS variable in LLVM IR (inlining would have removed the if if dimindx is known to be good, and it would have avoided the need of an expensive x86 CALL only to get a variable), but I hope it will be fast enough.

The last part to implement is the CPUKernelInstance. I don’t know yet what this class will do, but I think it will represent a work-goup. OpenCL divides a task into work-groups, divided into work-items. We can also divide a work-group into sub-work-groups if needed (I think it’s only the case on GPUs). A work-group is a “hint” on how to divide the work on parts bigger that a work-item. All the work-groups must have the same size. If I have correctly understood the specification, a work-item could be a pixel in an image, and a work-group a line. “local” OpenCL storage is also shared between the work-items of a work_group, but not with the ones outside (again if I have correctly understood the spec that is a complex here).

These work-groups will be used by Clover not to run all the work-items in parallel. It would be way too slow (every work-item would require a fetch from the command-queue, and even modern processors don’t have thousands of cores)). I intend to run the work-groups in parallel (a command-queue fetch per work-group), but the work-items sequentially :

void KernelInstance::runWorkGroup()
{
    // Number of work-items in a work_group
    size_t items_per_work_group = getItemsPerWorkGroup();

    // Index of the current work-group (in fact, it is a CPUDEVICE_MAX_DIMS-dimensional array,
    // but I simplify here)
    size_t work_group_index = getWorkGroupIndex();

    // Function to call
    llvm::Function *func = p_kernel->getFunction(p_device);

    // Arguments
    std::vector<llvm::GenericValue> &args = p_kernel->llvmFunctionArgs();

    // Run the work-items
    for (size_t i=0; i<items_per_work_group; ++i)
    {
        // TLS data
        g_tls.current_global_id = (work_group_index * items_per_work_group) + i;

        // Call the function
        p_jit->runFunction(func, args);
    }
}

I think all is in place in order to run kernels. No code is in place yet, but it will come in the following days (I have some infrastructure to do first). I am open to any suggestion.


STL + Clang + LLVM, the pieces are coming together

Hello,

Yes, the pieces are coming together. Today, a new word appeared in the list of supported features of Clover : kernels. Clover can now extract them from programs and do some interesting things with them, for instance listing their parameters (and types). It’s for the moment the only  “big” thing it can do, but there is more to come.

Implementing the kernels was a nice experience. I fact, the current days of Clover development are very exciting. I have to use many technologies and projects, like Clang, LLVM, the STL, and event Check (Clover’s unit testing framework). All is very nice and well done, and pretty diversified. It goes from the thousand-classes LLVM to the “small” STL doing just what it needs to.

Speaking about the STL, I have a question for the few C++-experts reading this blog. The OpenCL specification says that: “Kernel objects are not created for any __kernel functions in program that do not have the same function definition across all devices for which a program executable has been successfully built.” This means that I need to keep a set of the kernels available in all the “per-device sub-programs” (the function definition is verified after).

Doing that is like a SELECT … INNER JOINT in SQL. We can also express it with the following pseudocode :

kernels is a set

for (subprogram in subprograms)
    if (kernels.empty)
        kernels = subprogram.kernels
    else
        kernels &= subprogram.kernels // This line

I’m currently doing this in C++ using std::set, but I’m not comfortable with my current code :

    std::set<std::string> kernels_set;

    for (int i=0; i<p_device_dependent.size(); ++i)
    {
        DeviceDependent &dep = p_device_dependent.at(i);
        std::vector<llvm::Function *> kernels = kernelFunctions(dep);
        std::set<std::string> set;

        // Add the kernels in the set
        for (int j=0; j<kernels.size(); ++j)
        {
            llvm::Function *func = kernels.at(j);
            set.insert(func->getNameStr());
        }

        // intersection of the sets
        if (kernels_set.empty())
        {
            kernels_set = set;
        }
        else
        {
            std::set<std::string> inter;

            set_intersection(set.begin(), set.end(), kernels_set.begin(),
                             kernels_set.end(),
                             std::inserter(inter, inter.begin()));

            kernels_set = inter;
        }
    }

The problem is that I don’t know if it’s the best way to do that (or a sufficiently good one). Qt has a good function to do this without a temporary set, but the STL lacks one. I don’t know if I can use the same std::set in the source parts of set_intersection and as the result set.

Except that, the code should not be too ugly, and it works fairly well. I thank Zack Rusin for having added unit tests in Clover (and I’m happy to have continued to implement them), it’s very pleasant to type “make test”, to see that all goes well, and to think “my code is ok 🙂 “.