Monthly Archives: July 2011

A fix and an optimization

Hello,

Last week, I was able to run an OpenCL-compiled kernel using Clover. There was just one problem: every once in a while, the kernel failed.

I thought it was a locking problem, and it was: the work groups called kernel->callFunction (to get a zero-argument function calling the actual kernel with its arguments) without holding any lock.

After a fix, the kernels mostly worked but continued to fail after a few iterations. This problem was tricky and took a week to find.

In fact, when an event is “finished”, it can be deleted. The logic was simple: run each work group of a kernel, and when the last work group has finished running, delete the kernel. This seemed to work on paper, but on multi-core machines running multi-threaded OSes, it doesn't. The last work group sometimes finished before the other work groups, because they were on busier CPU cores or had been preempted. The kernel was deleted, clWaitForEvents returned, and the program began to read the buffer before it was ready.

The fix was to add a counter of finished work groups. Only when all the work groups have finished can we delete the kernel. A simple fix for a complex bug.

I then added an optimization to the get_global_id builtin function. This function returns an ID based on the current work group and work item; OpenCL also adds a “work offset” and a few other things. The result was a function with two additions, a multiplication, and several getter calls (I hadn't enabled link-time optimization).

Now, some of these computations are cached in a private variable, and get_global_id is free of any function call and contains only a single addition. The speedup is interesting: for a kernel with two very slow modulo operations (among the slowest things a processor does), the speedup is 1.72x. I haven't tried a simpler kernel, but it seems the speedup from get_global_id alone is already very interesting.

The other builtin functions are simpler and don't need to be optimized. Another slow part of Clover is thread-local storage: each call to a builtin needs a call to __tls_get_storage or a similar function.

The plan for the following days is to implement samplers and images (to have a complete API), and then a Qt-based example application for writing kernels, adding buffers and images, and running programs. This will let me write nice test kernels for the testsuite, which will get a new testrunner that takes a kernel as argument (plus information about the buffers) and tests it. With all of this, I will be able to write and test the builtin functions 🙂 .


Compile and launch simple OpenCL C kernels with Clover

Hello,

Another Clover milestone was hit today! During my holidays (from July 4 to July 15), I worked on the OpenCL C compiler, based on Clang, that Clover uses to compile OpenCL programs. I also implemented all the infrastructure needed to get and launch the kernels of a program.

Everything is in place and can be tested (though there are still some limitations). Clover is now able to fully execute functions like clSetKernelArg, clBuildProgram and, best of all, clEnqueueNDRangeKernel.

This means that my OpenCL implementation is now very close to being complete API-wise (I still have a lot of work to do regarding the OpenCL C language and its standard set of functions).

For reference, Clover is able to launch this program (found in tests/test_kernel.cpp):

float simple_function(float a) {
    return a * 2.5f;
}

__kernel void kernel1(__global float *a, __global float *b, float f) {
    size_t i = get_global_id(0);

    a[i] = simple_function(f) * b[i];
}

__kernel void kernel2(__global int *buf) {
    size_t i = get_global_id(0);

    buf[i] = 2 * i;
}

It recognizes that kernel1 and kernel2 are two different kernels (and that simple_function cannot be called as a kernel). The testcase launches kernel2 since it has only one argument (it's faster to test a simpler kernel, but I'll also add testcases for the other one), and checks that its result is correct.

int buffer[64];

// Create a cl_mem buffer to give to the kernel
buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                     sizeof(buffer), buffer, &result);

// Fetch the list of the kernels in the program
result = clCreateKernelsInProgram(program, 2, kernels, 0);

// Set the argument of kernel2
result = clSetKernelArg(kernels[1], 0, sizeof(cl_mem), &buf);

// Execute the kernel
size_t global_size = sizeof(buffer) / sizeof(buffer[0]);
result = clEnqueueNDRangeKernel(queue, kernels[1], 1, 0, &global_size, 0, 0, 0, &event);
result = clWaitForEvents(1, &event);

After the kernel runs, the buffer contains 64 ints: (0, 2, 4, 6, 8, …, 126).

The implementation of all of this is a bit tricky. I invite you to take a look at the CGit, though the commits are fairly difficult to understand. I hope the code isn't too ugly; it works, and I even find it somewhat nice in places, but it's too early to be sure it's good.

To end this blog post, a summary of what I did during my holidays :

55 files changed, 5031 insertions(+), 1941 deletions(-)

Line counts aren't the whole truth, but they show I was able to work many hours a day during my holidays (which I also greatly enjoyed). I'm happy 🙂 . I promised I would be able to work during my holidays, and I kept that promise. I now have a month and a half to work on the builtin functions (only get_global_id is implemented, and there are plenty of others) and to finish the API (samplers and images aren't completely implemented yet).

Then I will have to write tests, tests and more tests.


When LLVM is slow

Hello,

A quick blog post about what I’ve just seen in the LLVM codebase.

First, a note about the progress made since the previous post. The kernels are done: I've implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. API-wise, this means we can run compiled OpenCL kernels. It doesn't actually work yet (I still need to implement many things in the core of Clover), but every hour brings me closer to finishing it.

One of the things I have to implement is a fast means of calling a kernel. “Fast” is really important here, since most kernels are small (and thus run quickly) but are run a huge number of times.

I naively began by considering llvm::ExecutionEngine::runFunction. This function takes two parameters: the function to run, and a vector of its arguments. That vector didn't make me happy; I found it too slow. So I looked at the code of this function for the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by this function!

// ... Many "fast" special cases

// Okay, this is not one of our quick and easy cases.  Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments.  Make this function and return.

// First, create the function.
FunctionType *STy=FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",
                                  F->getParent());

// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);

// ... Code generation for the args, to have a function like
//     T f() { return function(3, "foobar", (int *)0x9fffff90); }

// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());
// Erase it, since no other function can have a reference to it.
Stub->eraseFromParent();

The last line is the worst: the stub is erased, so every call will need to go through code generation again!

Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters, so I can build this stub function once (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple:

void (*func)() = device_kernel->nativeFunctionAddress();

for (...)
    func();

Simple, and really fast!


Thread-local storage

Hello,

Here is a post about my thoughts on how to implement functions like “get_global_id()” in a threaded environment.

This function, and others like it, returns an identifier of the current OpenCL kernel invocation. It's used, for instance, to run the same kernel in parallel on multiple elements of a data set. The trick is that its return value cannot be computed from its arguments; it comes from data provided either by the GPU or by the current CPU thread.

Let’s see how I want to implement that from top to bottom.

The LLVM IR binding

The very good thing about LLVM IR is that it allows calling external functions. It's a bit dangerous (I have to figure out how to prevent a kernel from calling arbitrary C functions resolvable from the current application, like exec()) but very powerful, as it allows implementing functions like get_global_id() in pure compiled C++.

The first part is to declare the function in the LLVM IR. Using Clang as the compiler, it's as easy as writing a declaration in a header, as done in src/runtime/stdlib.h:

size_t get_global_size(uint dimindx);

That's all for the LLVM IR part. Since I'll use the LLVM JIT to run kernels, I dove into the LLVM documentation and found llvm::ExecutionEngine::InstallLazyFunctionCreator(). This function takes a function pointer that gets called every time the JIT sees a function not defined in LLVM IR, so Clover can provide a native function for each such name.

So, the code:

static void *createFunction(const std::string &name)
{
    if (name == "get_global_id")
        return (void *)&native_get_global_id;
    else if (name == "get_work_dim")
        return (void *)&native_get_work_dim;
    // Other builtins...

    return 0;   // Unknown name: let the JIT report the missing symbol
}

So, it's simple (though not really fast due to the string comparisons; fortunately, this function is called once per call-site, not for every call). Now, let's implement these functions.

The native part

My first plan was to put the thread-local variable in the stdlib bitcode, and to implement functions like get_global_id like this:

__thread void *tls;                    // Value set by native code, but stored by LLVM

size_t _get_global_id(void *tls, uint dimindx);   // The extern native function

size_t get_global_id(uint dimindx)
{
    return _get_global_id(tls, dimindx);
}

The idea was to have the native functions receive the thread-local value as a parameter (it could be a pointer to a CPUKernelInstance). Some inlining during program compilation would then have replaced all the get_global_id calls with _get_global_id calls.

It was a bad idea.

The first reason is that TLS is not well supported by the LLVM JIT. It works on x86 but not on x86_64, and I don't know how good it is on ARM and other targets. And I want OpenCL to be usable everywhere, even on MeeGo-powered ARM devices using Free drivers 🙂 .

So, after a few minutes of thought, I found an obvious and much better solution: put the TLS value in native code, not in the LLVM IR.

I therefore removed all the stub functions from stdlib.c. Now, a call to get_global_id remains a call to that function, without added parameters and without renaming.

So, it's time to implement these functions:

__thread struct
{
    size_t work_dim;
    size_t current_global_id[CPUDEVICE_MAX_DIMS];
    // Other variables needed to implement the functions
} g_tls;

size_t native_get_work_dim()
{
    return g_tls.work_dim;
}

size_t native_get_global_id(uint dimindx)
{
    if (dimindx >= CPUDEVICE_MAX_DIMS)
        return 0; // Follow the OpenCL spec

    return g_tls.current_global_id[dimindx];
}

It's not that complicated. It's a pity that I cannot put the TLS variable in LLVM IR (inlining would have removed the if when dimindx is known to be valid, and it would have avoided the need for an expensive x86 CALL just to read a variable), but I hope it will be fast enough.

The last part to implement is the CPUKernelInstance. I don't know yet exactly what this class will do, but I think it will represent a work-group. OpenCL divides a task into work-groups, which are divided into work-items. A work-group can also be divided into sub-work-groups if needed (I think that's only the case on GPUs). A work-group is a “hint” on how to divide the work into parts bigger than a work-item, and all the work-groups must have the same size. If I have understood the specification correctly, a work-item could be a pixel in an image, and a work-group a line. “local” OpenCL storage is also shared between the work-items of a work-group, but not with those outside it (again, if I have correctly understood the spec, which is complex here).

Clover will use these work-groups precisely in order not to run all the work-items in parallel. That would be way too slow (every work-item would require a fetch from the command queue, and even modern processors don't have thousands of cores). I intend to run the work-groups in parallel (one command-queue fetch per work-group), but the work-items sequentially:

void KernelInstance::runWorkGroup()
{
    // Number of work-items in a work_group
    size_t items_per_work_group = getItemsPerWorkGroup();

    // Index of the current work-group (in fact, it is a CPUDEVICE_MAX_DIMS-dimensional array,
    // but I simplify here)
    size_t work_group_index = getWorkGroupIndex();

    // Function to call
    llvm::Function *func = p_kernel->getFunction(p_device);

    // Arguments
    std::vector<llvm::GenericValue> &args = p_kernel->llvmFunctionArgs();

    // Run the work-items
    for (size_t i=0; i<items_per_work_group; ++i)
    {
        // TLS data
        g_tls.current_global_id[0] = (work_group_index * items_per_work_group) + i;

        // Call the function
        p_jit->runFunction(func, args);
    }
}

I think all is in place in order to run kernels. No code is in place yet, but it will come in the following days (I have some infrastructure to do first). I am open to any suggestion.


STL + Clang + LLVM, the pieces are coming together

Hello,

Yes, the pieces are coming together. Today, a new word appeared in the list of supported features of Clover: kernels. Clover can now extract them from programs and do some interesting things with them, for instance listing their parameters (and their types). For the moment it's the only “big” thing it can do, but there is more to come.

Implementing the kernels was a nice experience. In fact, the current days of Clover development are very exciting. I get to use many technologies and projects, like Clang, LLVM, the STL, and even Check (Clover's unit-testing framework). All of them are very nice and well done, and pretty diverse, from the thousand-class LLVM to the “small” STL that does just what it needs to.

Speaking of the STL, I have a question for the few C++ experts reading this blog. The OpenCL specification says: “Kernel objects are not created for any __kernel functions in program that do not have the same function definition across all devices for which a program executable has been successfully built.” This means I need to keep a set of the kernels available in all the “per-device sub-programs” (the function definitions are verified afterwards).

Doing that is like a SELECT … INNER JOIN in SQL. We can also express it with the following pseudocode:

kernels is a set

for (subprogram in subprograms)
    if (kernels.empty)
        kernels = subprogram.kernels
    else
        kernels &= subprogram.kernels // This line

I'm currently doing this in C++ using std::set, but I'm not comfortable with my current code:

    std::set<std::string> kernels_set;

    for (size_t i=0; i<p_device_dependent.size(); ++i)
    {
        DeviceDependent &dep = p_device_dependent.at(i);
        std::vector<llvm::Function *> kernels = kernelFunctions(dep);
        std::set<std::string> set;

        // Add the kernels to the set
        for (size_t j=0; j<kernels.size(); ++j)
        {
            llvm::Function *func = kernels.at(j);
            set.insert(func->getNameStr());
        }

        // Intersection of the sets
        if (kernels_set.empty())
        {
            kernels_set = set;
        }
        else
        {
            std::set<std::string> inter;

            std::set_intersection(set.begin(), set.end(), kernels_set.begin(),
                                  kernels_set.end(),
                                  std::inserter(inter, inter.begin()));

            kernels_set = inter;
        }
    }

The problem is that I don't know whether it's the best way to do this (or a sufficiently good one). Qt has a good function that does it without a temporary set, but the STL lacks one. I don't know whether I can use the same std::set both as a source of set_intersection and as its result set.

Other than that, the code shouldn't be too ugly, and it works fairly well. I thank Zack Rusin for having added unit tests to Clover (and I'm happy to have kept writing them); it's very pleasant to type “make test”, see that all goes well, and think “my code is ok 🙂 ”.


Advanced LLVM : compile, link, optimize

Hello,

Yesterday and today were interesting days for Clover: I read large parts of the LLVM documentation and implemented the first bits of the OpenCL standard library.

Clover is now able to compile a program and link it with the standard library (if the device needs that, like CPUDevice). Some work was also done to make Clover more standard-compliant regarding devices and programs. The code is now relatively clean, I think.

Now that I am on holiday, I can work for hours every day (today I worked from 1 pm until now, 10 pm, with some breaks, which I think comes to nearly 6 hours of actual work) and Clover is advancing fast.

Now that compiling and linking work, the next step is to implement the kernel functions (a kernel is simply a function with a __kernel attribute, plus the means to set its arguments for execution on a GPU). Then comes the first main target: having some small compiled kernels running using Clang.

After that, I will have to implement the standard library and all its functions (from the simple geometric ones to tricky ones like barrier() that need tight integration with the host CPU). I have more than a month and a half to do that, so I hope it will suffice.


Holidays

Hello,

I'm writing this blog post from a wonderful chalet in the south of France, close to Mont Blanc. The landscape is beautiful, the days are hot and sunny; everything is really good. For reference, here is where my computer sits while I work on my GSoC project:

It's not very comfortable, but I've already managed to do interesting things on Clover. Yesterday I committed a few changes:

  • The first and biggest is the implementation of clGetProgramInfo. This function allows the client application to retrieve a binary form of a program; Clover returns LLVM bitcode. This new function also allows me to test the implementation of clCreateProgramWithBinary. It took a few hours to implement because I had to install LLVM, which is tricky on openSUSE (you need to install llvm, llvm-devel, llvm-clang and llvm-clang-devel from devel:tools, and only this repository works; if you install another version, you get a broken llvm-config, or not all the needed libs).
  • Another one is just a cleanup: it removes all the trailing spaces Kate keeps in files (spaces at the end of lines, and lines made only of spaces). My Kate being broken, I now use Geany, which I recommend to everyone (who can't use Kate). Geany removes trailing spaces, and I decided to do that for every file in Clover.
  • Doing this put me in “cleanup mode”, so I decided to write my C++ code in… C++. I replaced all the C functions like memcpy with their C++ equivalents, that is to say std::memcpy and friends. I also removed all the printf calls I had put here and there to debug tricky problems, so the code is clean and silent.

During my holidays, I can work about 4 to 5 hours a day; I'm very happy, and Clover will advance! I have planned for today the implementation of clGetProgramBuildInfo and the OpenCL C compiler's options. I could also begin some sort of standard library, but I have yet to decide whether to implement it in LLVM bitcode (compiled from a C file by Clang), which is good for interprocedural optimizations and very fast LLVM IR, or to only write C headers and leave the function calls unresolved in the LLVM IR. The latter is the slowest solution (no inlining, no optimization), but the easiest to implement for hardware-accelerated devices (just find all the unresolved calls and replace them with an intrinsic of the device). For a CPU device, the LLVM JIT has a wonderful function: InstallLazyFunctionCreator. It lets me register a callback that is given a function name (like get_global_id) and returns a function pointer. I will use that to implement functions like get_global_id, barrier and the fences.

I think I will do both: write the headers, and let the device choose whether it wants the binary linked to a standard-library IR or not.

I'm very happy and very motivated: the interesting part of my project is beginning, and it will be even more interesting than I expected. OpenCL is a good spec, but with small challenges that make the work even more enjoyable.