Monthly Archives: June 2011

Compiling programs using Clang


In this title, when I write “programs”, I mean “OpenCL programs”. Since my last blog post, I took some days to enjoy the end of the school year, but I also thought about how to implement a Clang-based OpenCL C compiler.

After some days of coding, I have finally something that works : we can now compile an OpenCL C kernel and get LLVM IR. We cannot do anything with this IR except printing it on stderr, but it shows the infrastructure is already working.

The next days will see the implementation of kernels (exploring a program looking for __kernel functions), and maybe already a JIT.

I will be on holiday from July 3, but I will have two or three hours a day to work ,so work will continue to advance, even at a faster pace that during my exams.

To end this post, here is the testcase used to check that all goes right :

const char program_source[] =
    "#define __global __attribute__((address_space(1)))\n"
    "__kernel void test(__global float *a, __global float *b, int n) {\n"
    "   int i;\n"
    "   for (i=0; i<n; i++) {\n"
    "       a[i] = 3.1415926f * b[i] * b[i];\n"
    "   }\n"

program = clCreateProgramWithSource(ctx, 1, &src, 0, &result);
result = clBuildProgram(program, 1, &device, "", 0, 0);

A “module->dump()” in the Clover source code sends to stderr what Clang produces (currently unoptimized) :

; ModuleID = ''
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @test(float addrspace(1)* %a, float addrspace(1)* %b, i32 %n) nounwind {
  %a.addr = alloca float addrspace(1)*, align 8
  %b.addr = alloca float addrspace(1)*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store float addrspace(1)* %a, float addrspace(1)** %a.addr, align 8
  store float addrspace(1)* %b, float addrspace(1)** %b.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds =, %entry
  %tmp = load i32* %i, align 4
  %tmp1 = load i32* %n.addr, align 4
  %cmp = icmp slt i32 %tmp, %tmp1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %tmp2 = load i32* %i, align 4
  %idxprom = sext i32 %tmp2 to i64
  %tmp3 = load float addrspace(1)** %b.addr, align 8
  %arrayidx = getelementptr inbounds float addrspace(1)* %tmp3, i64 %idxprom
  %tmp4 = load float addrspace(1)* %arrayidx
  %mul = fmul float 0x400921FB40000000, %tmp4
  %tmp5 = load i32* %i, align 4
  %idxprom6 = sext i32 %tmp5 to i64
  %tmp7 = load float addrspace(1)** %b.addr, align 8
  %arrayidx8 = getelementptr inbounds float addrspace(1)* %tmp7, i64 %idxprom6
  %tmp9 = load float addrspace(1)* %arrayidx8
  %mul10 = fmul float %mul, %tmp9
  %tmp11 = load i32* %i, align 4
  %idxprom12 = sext i32 %tmp11 to i64
  %tmp13 = load float addrspace(1)** %a.addr, align 8
  %arrayidx14 = getelementptr inbounds float addrspace(1)* %tmp13, i64 %idxprom12
  store float %mul10, float addrspace(1)* %arrayidx14
  br label                                          ; preds = %for.body
  %tmp15 = load i32* %i, align 4
  %inc = add nsw i32 %tmp15, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret void

!opencl.kernels = !{!0}

!0 = metadata !{void (float addrspace(1)*, float addrspace(1)*, i32)* @test}

You can see that Clang even lists OpenCL kernels in a special metadata entry (very useful). Another nicety is that I got rid of the hack Zack Rusin had to do : my source code is read directly from memory, not put in a on-disk file that is read back by Clang. It is easily done this way :

// Fake source name for debugging and reporting purpose
frontend_opts.Inputs.push_back(std::make_pair(clang::IK_OpenCL, ""));

// Say to Clang this file is not on disk but in memory (source : llvm::MemoryBuffer)
clang::PreprocessorOptions &prep_opts = p_compiler.getPreprocessorOpts();
prep_opts.addRemappedFile("", source);

It’s easy and works well. Clang is also surprisingly fast at compiling kernels (without any standard lib or header, I admit). It takes 0.06 second on my single-core 1,66 Ghz Atom computer to compile the kernel I give in this blog post.

The code is already in Git, ready to be tested.


Exams finished


A small blog post to say I successfully passed my exams. I can now go to university to study IT.

On a more “professional” side, the bases of my OpenCL implementation are done for the most parts, I’m now reading the Clang documentation. I hope to have some experiments in a testable state by the following week.

Use OpenCL to execute native kernels


Today is a big day for Clover : we can finally use it to execute native kernels on the processor, in a command queue, asynchronously, and multiple one can be executed in parallel. A native kernel is a simple C/C++ function that we queue for execution on a CPU device, so there is no compiler, no bitcode, etc.

Here is a sample code executing a simple kernel (original in tests/test_kernel.cpp) :

#include <CL/cl.h>

struct args
    size_t buffer_size;
    char *buffer;

static void native_kernel(void *a)
    struct args *data = (struct args *)a;
    int i;
    // Not
    for (int i=0; i<data->buffer_size; ++i)
        data->buffer[i] = ~data->buffer[i];

int main(int argc, char **argv)
    cl_platform_id platform = 0;
    cl_device_id device;
    cl_context ctx;
    cl_command_queue queue;
    cl_event events[2];
    cl_mem buf1, buf2;
    char s1[] = "Lorem ipsum dolor sit amet";
    char s2[] = "I want to tell you that you rock";
    // Initialize the context
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, 0);
    ctx = clCreateContext(0, 1, &device, 0, 0, 0);
    // And the command queue
    queue = clCreateCommandQueue(ctx, device, 
                                 CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, 0);

    // Create two buffers
    buf1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, 
                          sizeof(s1), (void *)&s1, 0);
    buf2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                          sizeof(s2), (void *)&s2, 0);
    // Enqueue native kernels
    struct args a;
    const void *mem_loc = (const void *)&a.buffer; // Tell OpenCL to complete the struct
    a.buffer_size = sizeof(s1);
    clEnqueueNativeKernel(queue, &native_kernel, &a, sizeof(a),
                          1, &buf1, &mem_loc, 0, 0, &events[0]);
    a.buffer_size = sizeof(s2);
    clEnqueueNativeKernel(queue, &native_kernel, &a, sizeof(a),
                          1, &buf2, &mem_loc, 0, 0, &events[1]);
    // Wait for events
    clWaitForEvents(2, events);
    // Finished
    return 0;

The code has been pushed on git. You look at the diff to see how I implemented this. The code is full of casts, I don’t really like that, but it’s low-level and I try to keep a good code quality.

I’ll now polish a bit what I already made (for instance implementing clUnmapMemObject, an easy function), then I’ll begin the OpenCL C kernels. My exams will end on Tuesday, I will finally have plenty of time to work on Clover before my vacation (starting July 3)

Have fun parallelizing your applications !

More command queues work


During these past days, I studied my math exam I had on Friday and I also worked a bit on Clover. The good news is that even if I didn’t do anything very exciting like kernels, I highly stabilized my previous work.

If you go on the clover git page, you’ll see 8 new commits :

  • The first avoids potential crashes when an event is released by a worker thread. This commit removes code and makes Clover leak events, but also begins a correct implementation, finished by a later commit.
  • The second is a simple fix for a bug introduced by the previous commit. It was found by my testsuite, so write unit tests !
  • Then, a new OpenCL function : clEnqueueWriteBuffer, built upon ReadBuffer. A simple function added to see that what I’ve done actually works.
  • To be able to write good unit tests, I needed support for UserEvents. I had the bad surprise to discover that these events aren’t bound to a command queue, so the implementation is a bit hackish.
  • Then, finally, a heavy test suite for nearly 2000 lines of code added this week and the previous one. The test suite tests the functions, the buffer operations, how a command queue works when we use user events and do asynchronous things, etc.
  • This morning (10 a.m GMT+2), I applied a small modification suggested by someone on this blog, the first “external contribution” to Clover 🙂 .
  • Then, I continued and fixed the leak of events. With a few instrumentation, I could see that events got properly deleted.
  • I finished my day (before studying a bit for my exam of tomorrow) with the implementation of profiling.


If we refer to my original schedule, I’m a bit late, I should have begin my work on the compiler before the start of June, and we are already nearly at the middle of June.

In fact, I’m not late, I’m just doing things in advance. My original plan was to do as little as possible before the compiler. I wanted to implement only Context, Devices, Platforms and Command Queues. Then, I discovered how the command queues work, and I decided to do a proper implementation of them before.

I will begin working on kernels (at first native kernels) soon, but I have one or two things I want to see implemented before, for example support for clEnqueueMapBuffer (needs a hook in the abstraction layer that will be used by the kernels), and maybe support for images, to have a complete buffer implementation. I could even go as far as samplers, given the fact they are only bitfields.

By the end of June, Clover should be able to launch native kernels, so I will be able to test my command queues under heavy load. During my holidays of early July, I will use a notebook with a single core, so I want to be able to test all the implementation details related to multithreading before I left my main dual core computer.

So, I invite you to read the code and to launch the test suite. The “commandqueue” one takes one second, because I inserted a sleep to be sure everything continues to go well if the main thread of an application does something.

Command queues, mutexes and threads


Sorry for so little blog posts these time, but I’m in my exam period and I have work to do for school, but my biggest exam is friday and it’s followed by a long week-end of three days. Nevertheless, I had time to work on my project these days, and I implemented all the “events and command queues” stuff.

When reading the spec, it can seem an easy task, but it isn’t, really. The fact is that I want to do a proper and fast implementation, compatible with software and hardware rendering. That’s what makes it difficult.

To read from or write to a buffer object, you have to use a function like clEnqueueReadBuffer. This function creates an event, and pushes it in a command queue created with clCreateCommandQueue. When an event is pushed, it can be submitted to the device for execution.

The trick is that I implement software rendering, but I want it to be efficient and multi-threaded. So, I began by implementing a function returning the number of CPU cores in the computer (mine has only one core but two threads, so anyway I can test). Then, when a CPUDevice is created, Clover launches one thread per core.

One part of the work is done in CPUDevice : each thread waits for events to process, by reading them from a list of events assigned to the CPU and that can be run in any order.

But this list has to be fed, and here’s the difficulties. The principle isn’t too complex : when we enqueue events, the command queue checks what events it can push to the device (events with no not-already-completed dependencies, no events after a barrier, etc). It’s simple and it works, but not always. For example, an application can push two sequential events, and then do heavy I/O. The first event is pushed on the device, then we try to push the second but it fails because the first hasn’t finished. The device executes the first, “unlocks” the second but the main thread is waiting for I/O, so the second event isn’t pushed on the device. Worse, if the main thread doesn’t touch the command queue until it flushes it or release it (doing an implicit flush), the second event will never be pushed until the command queue is flushed. It’s inefficient, the worker threads sleeps for nothing.

So, when a worker thread finishes an event, it can itself ask the command queue to push another event. If there’s nothing to push, then it’ll wait, but we are sure that it’s because no event can be pushed and that we don’t waste time.

Ok, it’s solved, but here start the problems. The biggest is the synchronization of all of this. It took me hours to add mutexes and wait conditions everywhere it’s needed.

Another is a small line of code found in CommandQueue until I removed it :

// Called when an event is completed

Seems harmless ? The problem was that with all these threads, all these functions, it was possible that this line was called from an event queue, and it’s dangerous because the event queue may get deleted if the event’s reference count becomes zero (so it will delete itself and dereference its command queue) and the command queue’s reference count can also becomes zero, so it also deletes itself.

The solution is to delete the event from the worker thread, outside any command queue code. The only thing I still have to sort out is how I synchronize the event’s destruction with the other threads, I think I’ll change how I handle reference count, to be able to use a mutex and be sure that the event gets deleted only one time.

So, my code is probably full of ugly bugs that will take hours each to be solved, but I hope they will not be too many, and my early tests pass, including the nicest :

    char data[16];
    result = clEnqueueReadBuffer(queue, subbuf, 1, 0, 5, data, 0, 0, 0);
        result != CL_SUCCESS,
        "unable to read the buffer"
        strncmp(data, "world", 5),
        "the subbuffer must contain \"world\""

By passing, it tells me that clWaitForEvents works, and also all the machinery I wrote about in this blog post. More, it validates that what I coded before works. The primary signification of this test passing is that my OpenCL implementation started today to be useful and to do things ! (yes, we can use it to copy buffers around in a multi-threaded fashion, interesting isn’t it ?)

So, thanks for reading. I’ll work on my maths exam and give you some more news of my future progress this week-end.

Implementing buffer objects


Sorry for the delay between posts, but the part of the OpenCL spec on which I’m currently is difficult to implement right.

This part is chapter 5.2 : “Buffer objects”. When one reads this chapter, (s)he may think that it’s fairly easy to implement : there are only two functions. The trick is that buffer objects are way more spread in OpenCL than said by this sole chapter. The two next chapters speak about Images and Querying information about … buffer objects. Yes, three chapters are consecrated to these buffers.

But the amount of work isn’t the worst thing. The worst, that took me three days to sort out with the help of my mentor and Stéphane Marchesin, is that buffers are created context-wise. That means that when we create a buffer, we only know in which context it will be used. The problem is that a buffer needs to reside on a device, and that a context can operate several devices.

So, the problem was to find an elegant solution to this problem of creating one buffer in many devices. Finally, a nVidia forum gave me the solution : when we create a buffer object, we don’t allocate anything. When the buffer is first used on a device, it is allocated on this device.

This solution is elegant and works fairly well, but there are some problems, mainly of data availability. One problem I currently think about is that we can pass a host_ptr to clCreateBuffer. This pointer contains data to be uploaded to the buffer, to pre-populate it. The problem is that if we create the buffer after the clCreateBuffer call, for instance in a clEnqueueNDRangeKernel, how to be sure that the memory once pointed by host_ptr is still valid ?

It’s one of the big problems I must face to implement this part of the spec. Maybe the solution is to memcpy the content of host_ptr somewhere, but it would take memory.

So, it’s time to come to the facts : I did actually things these days. Mainly thinking about all these hairy problems, but also coding. I let you look at this before continuing :

 src/CMakeLists.txt         |    1 +
 src/api/api_memory.cpp     |  111 ++++++++++++++-
 src/core/cpudevice.cpp     |  184 ++++++++++++++++++++++++
 src/core/cpudevice.h       |   20 +++
 src/core/deviceinterface.h |   17 ++-
 src/core/memobject.cpp     |  340 ++++++++++++++++++++++++++++++++++++++++++++
 src/core/memobject.h       |  129 +++++++++++++++++
 tests/CMakeLists.txt       |    2 +
 tests/test_mem.cpp         |  163 +++++++++++++++++++++
 tests/test_mem.h           |   17 +++
 tests/tests.c              |    2 +
 11 files changed, 982 insertions(+), 4 deletions(-)

It’s one commit (sorry for so big commits, but I wasn’t able to break it into smaller pieces that continue to be usable).

Mainly, we can now create Buffer and SubBuffer objects, with functions like clCreateBuffer. We can give them host_ptr’s, but we cannot do anything with them. Reading, writing, copying and mapping them will come in the following days (I have a full day free of school tomorrow, so I hope Clover will advance).

Another note is that the Clover repository moved back to, and will soon be accessible at

I think it’s all for this blog post. If I have something to add, I’ll put it in the next post, that will also contain instructions to build and test Clover.