Author Archives: steckdenis

When LLVM is slow


A quick blog post about what I’ve just seen in the LLVM codebase.

First, a note about the progress I made since the previous post. The kernels are done: I’ve implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. API-wise, that means we can run compiled OpenCL kernels, but it doesn’t actually work yet. I still need to implement many things in the core of Clover, but every hour brings me closer to finishing them.

One of the things I have to implement is a fast means of calling a kernel. “Fast” is really important here, since most kernels are small (and thus run quickly) but are called a huge number of times.

I naively began by considering llvm::ExecutionEngine::runFunction. This function takes two parameters: the function to run, and a vector of its arguments. That vector didn’t make me happy; I found it too slow. So I looked at the code of this function in the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by what I found!

// ... Many "fast" special cases

// Okay, this is not one of our quick and easy cases.  Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments.  Make this function and return.

// First, create the function.
FunctionType *STy = FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",
                                  F->getParent());

// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);

// ... Code generation for the args, to have a function like
//     T f() { return function(3, "foobar", (int *)0x9fffff90); }

// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());
// Erase it, since no other function can have a reference to it.
Stub->eraseFromParent();

The last lines are the worst: the stub function is erased, so every single call needs a fresh round of code generation!

Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters. So I can build this stub function myself (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple:

void (*func)() = device_kernel->nativeFunctionAddress();

for (...)
    func();

Simple, and really fast!

Thread-local storage


Here is a post about my thoughts on how to implement functions like “get_global_id()” in a threaded environment.

This function, and others like it, returns an identifier of the current OpenCL kernel invocation. It’s used, for instance, to run the same kernel in parallel on multiple elements of a data set. The trick is that its return value cannot be inferred from its arguments, only from data provided either by the GPU or by the current CPU thread.

Let’s see how I want to implement that from top to bottom.

The LLVM IR binding

The very good thing about LLVM IR is that it allows calling external native functions. It’s a bit dangerous (I have to figure out how to prevent a kernel from calling arbitrary C functions resolvable from the current application, like exec()) but very powerful, since it allows implementing functions like get_global_id() in pure compiled C++.

The first part is to declare the function in the LLVM IR. Using Clang as the compiler, it’s as easy as writing a declaration in a header, as done in src/runtime/stdlib.h:

size_t get_global_size(uint dimindx);

That’s all for the LLVM IR part. Now, I’ll use the LLVM JIT to run kernels, so I dove into the LLVM documentation and found llvm::ExecutionEngine::InstallLazyFunctionCreator(). This function takes a function pointer that gets called every time the JIT sees a function not defined in LLVM IR. Clover can then provide a native function for this name.

So, the code :

static void *createFunction(const std::string &name)
{
    if (name == "get_global_id")
        return (void *)&native_get_global_id;
    else if (name == "get_work_dim")
        return (void *)&native_get_work_dim;
    // Other...

    return NULL; // Unknown name: leave the symbol unresolved
}

So, it’s simple (but not really fast, due to the string comparisons; fortunately, this function is called once per call site, not for every call). Let’s implement these functions.

The native part

My first plan was to put a thread-local variable in the stdlib bitcode, and to implement functions like get_global_id like this:

__thread void *tls;                    // Value set by native code, but stored by LLVM

size_t _get_global_id(void *tls, uint dimindx);   // The extern native function

size_t get_global_id(uint dimindx)
{
    return _get_global_id(tls, dimindx);
}

The idea was to have the native functions receive the thread-local value as a parameter (it could be a pointer to a CPUKernelInstance). Some inlining during program compilation would then have replaced all the get_global_id calls with _get_global_id calls.

It was a bad idea.

The first reason is that TLS is not well supported by the LLVM JIT. It works on x86, but not on x86_64, and I don’t know how good it is on ARM and other targets. And I want OpenCL to be usable everywhere, even on MeeGo-powered ARM devices using Free drivers 🙂 .

So, after a few minutes of thought, I found an obvious and much better solution: putting the TLS value in native code, not in the LLVM IR.

So, I removed all the stub functions from stdlib.c. Now, a call to get_global_id stays a call to that function, with no added parameters and no renaming.

So, it’s time to implement these functions:

__thread struct
{
    size_t work_dim;
    size_t current_global_id[CPUDEVICE_MAX_DIMS];
    // Other variables needed to implement the functions
} g_tls;

size_t native_get_work_dim()
{
    return g_tls.work_dim;
}

size_t native_get_global_id(uint dimindx)
{
    if (dimindx >= CPUDEVICE_MAX_DIMS)
        return 0; // Follow the OpenCL spec

    return g_tls.current_global_id[dimindx];
}

It’s not that complicated. It’s a pity that I cannot put the TLS variable in LLVM IR (inlining would have removed the if when dimindx is known to be valid, and it would have avoided an expensive x86 CALL just to read a variable), but I hope it will be fast enough.

The last part to implement is CPUKernelInstance. I don’t know yet exactly what this class will do, but I think it will represent a work-group. OpenCL divides a task into work-groups, themselves divided into work-items. A work-group can also be divided into sub-work-groups if needed (I think that’s only the case on GPUs). A work-group is a “hint” on how to divide the work into parts bigger than a work-item, and all the work-groups must have the same size. If I have understood the specification correctly, a work-item could be a pixel in an image, and a work-group a line. “local” OpenCL storage is also shared between the work-items of a work-group, but not with those outside it (again, if I have correctly understood the spec, which is a bit complex here).

These work-groups will be used by Clover to avoid running all the work-items in parallel. That would be way too slow (every work-item would require a fetch from the command queue, and even modern processors don’t have thousands of cores). I intend to run the work-groups in parallel (one command-queue fetch per work-group), but the work-items sequentially:

void KernelInstance::runWorkGroup()
{
    // Number of work-items in a work-group
    size_t items_per_work_group = getItemsPerWorkGroup();

    // Index of the current work-group (in fact, it is a CPUDEVICE_MAX_DIMS-dimensional
    // array, but I simplify here)
    size_t work_group_index = getWorkGroupIndex();

    // Function to call
    llvm::Function *func = p_kernel->getFunction(p_device);

    // Arguments
    std::vector<llvm::GenericValue> &args = p_kernel->llvmFunctionArgs();

    // Run the work-items
    for (size_t i=0; i<items_per_work_group; ++i)
    {
        // TLS data
        g_tls.current_global_id = (work_group_index * items_per_work_group) + i;

        // Call the function
        p_jit->runFunction(func, args);
    }
}

I think everything is in place to run kernels. No code exists yet, but it will come in the following days (I have some infrastructure to build first). I am open to any suggestion.

STL + Clang + LLVM, the pieces are coming together


Yes, the pieces are coming together. Today, a new word appeared in the list of supported features of Clover: kernels. Clover can now extract them from programs and do some interesting things with them, for instance listing their parameters (and their types). It’s for the moment the only “big” thing it can do, but there is more to come.

Implementing the kernels was a nice experience. In fact, these days of Clover development are very exciting. I get to use many technologies and projects, like Clang, LLVM, the STL, and even Check (Clover’s unit testing framework). All of them are very nice, well done, and pretty diverse, from the thousand-class LLVM to the “small” STL doing just what it needs to.

Speaking of the STL, I have a question for the few C++ experts reading this blog. The OpenCL specification says: “Kernel objects are not created for any __kernel functions in program that do not have the same function definition across all devices for which a program executable has been successfully built.” This means that I need to keep the set of kernels available in all the “per-device sub-programs” (the function definitions are verified afterwards).

Doing that is like a SELECT … INNER JOIN in SQL. We can also express it with the following pseudocode:

kernels is a set

for (subprogram in subprograms)
    if (kernels.empty)
        kernels = subprogram.kernels
    else
        kernels &= subprogram.kernels // This line

I’m currently doing this in C++ using std::set, but I’m not comfortable with my current code:

    std::set<std::string> kernels_set;

    for (size_t i=0; i<p_device_dependent.size(); ++i)
    {
        DeviceDependent &dep = p_device_dependent[i];
        std::vector<llvm::Function *> kernels = kernelFunctions(dep);
        std::set<std::string> set;

        // Add the kernels in the set
        for (size_t j=0; j<kernels.size(); ++j)
        {
            llvm::Function *func = kernels[j];
            set.insert(func->getName().str());
        }

        // Intersection of the sets
        if (kernels_set.empty())
        {
            kernels_set = set;
        }
        else
        {
            std::set<std::string> inter;

            std::set_intersection(set.begin(), set.end(),
                                  kernels_set.begin(), kernels_set.end(),
                                  std::inserter(inter, inter.begin()));

            kernels_set = inter;
        }
    }

The problem is that I don’t know whether it’s the best way to do that (or at least a sufficiently good one). Qt has a convenient function to do this without a temporary set, but the STL lacks one. I also don’t know whether I can use the same std::set both as a source of set_intersection and as its result set.

Apart from that, the code shouldn’t be too ugly, and it works fairly well. I thank Zack Rusin for having added unit tests to Clover (and I’m happy to have continued implementing them); it’s very pleasant to type “make test”, to see that all goes well, and to think “my code is ok 🙂 “.

Advanced LLVM : compile, link, optimize


Yesterday and today were interesting days for Clover: I read big parts of the LLVM documentation and implemented the first bits of the OpenCL standard library.

Clover is now able to compile a program and to link it against the standard library (if the device needs that, as CPUDevice does). Some work was also done to make Clover more standard-compliant regarding devices and programs. The code is now relatively clean, I think.

Now that I am on holiday, I can work several hours a day (today I worked from 1 pm to now, 10 pm, with some breaks; I think it comes to nearly 6 hours of actual work) and Clover is advancing fast.

The next step, now that compiling and linking work, is to implement kernels (a kernel is simply a function with a __kernel attribute, plus the means to set its arguments for execution on a GPU). Then comes the first main target: having some small compiled kernels running using Clang.

After that, I will have to implement the standard library and all its functions (from the simple geometric ones to tricky ones like barrier() that need tight integration with the host CPU). I have more than a month and a half to do that, so I hope it will suffice.



I’m writing this blog post from a wonderful chalet in the south of France, close to Mont Blanc. The landscape is beautiful, the days hot and sunny, all is really good. For information, here is where my computer is, where I work on my GSoC project:

It’s not very comfortable, but I have already managed to do interesting things on Clover. Yesterday I committed a few changes:

  • The first and biggest is the implementation of clGetProgramInfo. This function allows the client application to retrieve a binary form of a program; Clover returns LLVM bitcode. This new function also lets me test the implementation of clCreateProgramWithBinary. It took a few hours to implement because I had to install LLVM, which is tricky on openSUSE (you need to install llvm, llvm-devel, llvm-clang and llvm-clang-devel from devel:tools, and only this repository works; if you install another version, you get a broken llvm-config, or not all the needed libs).
  • Another one is just a cleanup: it removes all the trailing whitespace Kate left in the files (spaces at the end of lines, and lines made only of spaces). My Kate being broken, I now use Geany, which I recommend to everyone (who can’t use Kate). Geany removes trailing spaces, and I decided to do that for every file in Clover.
  • Doing this put me in “cleanup mode”, so I decided to write my C++ code in … C++. I replaced all the C functions like memcpy with their C++ equivalents, that is to say std::memcpy and friends. I also removed all the printf calls I had put here and there to debug tricky problems, so the code is clean and silent.

During my holidays, I can work about 4 to 5 hours a day; I’m very happy, Clover will advance! I have planned for today the implementation of clGetProgramBuildInfo and the OpenCL C compiler options. I could also begin some sort of standard library, but I have yet to decide whether I will implement it in LLVM bitcode (compiled from a C file by Clang), which is good for interprocedural optimizations and yields very fast LLVM IR, or whether I will only write C headers and leave the function calls unresolved in the LLVM IR. The latter solution is the slowest (no inlining, no optimization), but the easiest to implement for hardware-accelerated devices (just find all the unresolved calls and replace them with an intrinsic of the device). For a CPU device, the LLVM JIT has a wonderful function: InstallLazyFunctionCreator. It lets me register a callback that is given a function name (like get_global_id) and returns a function pointer. I will use that to implement functions like get_global_id, barrier and the fences.

I think I will do both: write the headers, and let the device choose whether it wants the binary linked against a standard library IR or not.

I’m very happy and very motivated: the interesting part of my project is beginning, and it is even more interesting than I expected. OpenCL is a good spec, but with small challenges that make the work even more enjoyable.

Compiling programs using Clang


In this title, when I write “programs”, I mean “OpenCL programs”. Since my last blog post, I took some days to enjoy the end of the school year, but I also thought about how to implement a Clang-based OpenCL C compiler.

After some days of coding, I finally have something that works: we can now compile an OpenCL C kernel and get LLVM IR. We cannot do anything with this IR yet except print it on stderr, but it shows the infrastructure is already working.

The next days will see the implementation of kernels (exploring a program looking for __kernel functions), and maybe already a JIT.

I will be on holiday from July 3, but I will still have two or three hours a day to work, so work will continue to advance, even at a faster pace than during my exams.

To end this post, here is the testcase used to check that all goes right:

const char program_source[] =
    "#define __global __attribute__((address_space(1)))\n"
    "__kernel void test(__global float *a, __global float *b, int n) {\n"
    "   int i;\n"
    "   for (i=0; i<n; i++) {\n"
    "       a[i] = 3.1415926f * b[i] * b[i];\n"
    "   }\n"
    "}\n";

const char *src = program_source;

program = clCreateProgramWithSource(ctx, 1, &src, 0, &result);
result = clBuildProgram(program, 1, &device, "", 0, 0);

A “module->dump()” in the Clover source code sends to stderr what Clang produces (currently unoptimized):

; ModuleID = ''
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-unknown-linux-gnu"

define void @test(float addrspace(1)* %a, float addrspace(1)* %b, i32 %n) nounwind {
entry:
  %a.addr = alloca float addrspace(1)*, align 8
  %b.addr = alloca float addrspace(1)*, align 8
  %n.addr = alloca i32, align 4
  %i = alloca i32, align 4
  store float addrspace(1)* %a, float addrspace(1)** %a.addr, align 8
  store float addrspace(1)* %b, float addrspace(1)** %b.addr, align 8
  store i32 %n, i32* %n.addr, align 4
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %tmp = load i32* %i, align 4
  %tmp1 = load i32* %n.addr, align 4
  %cmp = icmp slt i32 %tmp, %tmp1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %tmp2 = load i32* %i, align 4
  %idxprom = sext i32 %tmp2 to i64
  %tmp3 = load float addrspace(1)** %b.addr, align 8
  %arrayidx = getelementptr inbounds float addrspace(1)* %tmp3, i64 %idxprom
  %tmp4 = load float addrspace(1)* %arrayidx
  %mul = fmul float 0x400921FB40000000, %tmp4
  %tmp5 = load i32* %i, align 4
  %idxprom6 = sext i32 %tmp5 to i64
  %tmp7 = load float addrspace(1)** %b.addr, align 8
  %arrayidx8 = getelementptr inbounds float addrspace(1)* %tmp7, i64 %idxprom6
  %tmp9 = load float addrspace(1)* %arrayidx8
  %mul10 = fmul float %mul, %tmp9
  %tmp11 = load i32* %i, align 4
  %idxprom12 = sext i32 %tmp11 to i64
  %tmp13 = load float addrspace(1)** %a.addr, align 8
  %arrayidx14 = getelementptr inbounds float addrspace(1)* %tmp13, i64 %idxprom12
  store float %mul10, float addrspace(1)* %arrayidx14
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %tmp15 = load i32* %i, align 4
  %inc = add nsw i32 %tmp15, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret void
}

!opencl.kernels = !{!0}

!0 = metadata !{void (float addrspace(1)*, float addrspace(1)*, i32)* @test}

You can see that Clang even lists OpenCL kernels in a special metadata entry (very useful). Another nicety is that I got rid of the hack Zack Rusin had to use: my source code is read directly from memory, not put into an on-disk file that Clang then reads back. It is easily done this way:

// Fake source name for debugging and reporting purpose
frontend_opts.Inputs.push_back(std::make_pair(clang::IK_OpenCL, ""));

// Say to Clang this file is not on disk but in memory (source : llvm::MemoryBuffer)
clang::PreprocessorOptions &prep_opts = p_compiler.getPreprocessorOpts();
prep_opts.addRemappedFile("", source);

It’s easy and works well. Clang is also surprisingly fast at compiling kernels (without any standard lib or headers, I admit). It takes 0.06 seconds on my single-core 1.66 GHz Atom computer to compile the kernel I give in this blog post.

The code is already in Git, ready to be tested.

Exams finished


A small blog post to say I successfully passed my exams. I can now go to university to study IT.

On a more “professional” side, the foundations of my OpenCL implementation are for the most part done, and I’m now reading the Clang documentation. I hope to have some experiments in a testable state by next week.