The GSoC is over

Hello,

Yesterday was the last Google Summer of Code day. This wonderful experience is now over, and I will see if I pass the final evaluation.

The GSoC is over, but not my work. These following days will be a bit more light on development, as I need holidays now that I have worked hard for nearly one year (for school since September 2010, for a big school project during the holidays, then for Google since May).

But after this break, I will continue to work for Clover, and the following feature will be built-in functions. I will implement them using a small preprocessor written in Python. I’ll give it an input in the form of :

# def <var> : [values]
def vecf : float2 float3 float4 float8 float16
def gentype : float $vecf

# Normal function, placed in stdlib.c, can use Clang features
# func <name> [types] : <return type> [args]
# - args : <name>:<type>
# - $type is replaced with the current type of the implementation
# - in the body, $vecdim is replaced by the vector dimension of the type
func min $gentype : $type x:$type y:$type
    return (x < y ? x : y)
end

# Native function, placed in src/core/cpu/builtins.cpp
native cos float : float x:float
    return std::cos(x);
end

# Function overloading possible
native cos $vecf : $type x:$type
    for (unsigned int i=0; i<$vecdim; ++i)
        result[i] = std::cos(x[i]);
end

The Python preprocessor will duplicate each function for each of the types given, either in stdlib.c (for functions that will be byte-compiled by Clang and executed using the LLVM JIT, allowing inlining) or in builtins.cpp (for functions that need to use external C or C++ functions, like STL or system ones).

The current preprocessor (yes, I already started working on it) currently can produce this code for the example above :

static float cos_floatfloat(float x) {
    return std::cos(x);
}

static void cos_float2float2(float *result, float *x) {
    for (unsigned int i=0; i<2; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float3float3(float *result, float *x) {
    for (unsigned int i=0; i<3; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float4float4(float *result, float *x) {
    for (unsigned int i=0; i<4; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float8float8(float *result, float *x) {
    for (unsigned int i=0; i<8; ++i)
        result[i] = std::cos(x[i]);
}

static void cos_float16float16(float *result, float *x) {
    for (unsigned int i=0; i<16; ++i)
        result[i] = std::cos(x[i]);
}

You can see that the function names are mangled, as the preprocessor knows more things about the function than the C++ compiler (for example, the dimension of the vectors). If they had the same name, these function (except the first) should have looked the same for a C++ compiler as they take the same parameters and all return void. These functions will be put in builtins.cpp. For “normal” functions, mangling will not be necessary as Clang is aware of the vector dimension (using __attribute__((__ext_vector_size(foo)))) and can mangle even C functions using __attribute__((overloadable)).

Working on this preprocessor is exciting and I hope to be able to do that in the following days, and I will try my best to keep the code clean (Python is a clean programming language but it is easy to have programs like shell scripts).

By the way, the Clover git repository is still online and you can read the code if you didn’t do so before. You can also enjoy the documentation that is now finished (link in the right menu of this blog). I hope you’ll find it helpful and understandable.


Documentation online

Hello,

I’m currently writing the documentation of Clover, and although it is not finished yet and maybe full of spelling mistakes, it’s already online here for the ones wanting to already read it.

The link to this documentation has also been added in the side bar. It will be updated every day or so to match the Doxygen doc in Git.


The End is Approaching

Hello,

Today is the “soft pencil down” day. It means that I have to mostly stop my work now and concentrate on writing a good documentation for my project.

The first step of that is this blog post, a summary of what I’ve done during the summer, what is still to do, etc.

The Project

My project was to work on an existing small code-base called Clover, an Open Source OpenCL implementation. This project was started by a Mesa developer, Zack Rusin.

Before I started my project, and as I said in one of my first blog post, the code was very small and did nearly nothing, but the structure was well in place and easy to extend.

This project having been started by a Mesa developer, many people wanted it to be able to take advantage of hardware acceleration.

Status

My target was do to as much as I could during the summer, in order to have a feature-rich code base with many difficult details already solved. I wanted Clover to be able to launch some simple OpenCL demos on the CPU, with all in place to be able to use Gallium3D to provide hardware acceleration.

I’m happy to say that Clover evolved fairly well during the summer. It’s now an API-complete library, relatively fast (but it will become even more in the following days, I just thought about a potentially huge optimization), and with all the needed pieces in place for hardware acceleration.

The only missing things are some built-in functions available to the kernels. I already implemented the most difficult of them, and the remaining ones are things like clamp(), etc.

Plans for this week

Now that most of the work is finished, I will be able to clean the code and write documentation. It will be in Doxygen format and will cover the DeviceInterface API (abstraction layer between the core API and the devices, for the moment only CPUDevice), the core classes, and will also explain some difficult parts of Clover :

  • The way memory objects work.
  • The structure of events, command queues and CPU worker threads
  • How I implemented barrier() (a cleaner and more comprehensible explanation than the one I posted on August 7)
  • How I use Clang and LLVM to compile and launch kernels.

I’ll also improve the “stub functions” of a kernel. They are currently small functions taking no argument at all and calling the kernel function with its arguments given as constants. It works fairly well but is a bit slow when __local pointers are used (these pointers must be allocated for every work group, so their value changes constantly, and the stub function needs to be recreated).

My plan is to create only one stub function taking one argument : void **locals. This table of pointers will be used by the stub to call the kernel with varying arguments without needing to be re-created and re-JITed every time.

Furthermore, a huge optimization is made possible by that : I implemented the image built-ins, and they are full of switch statements. The code is very big and could highly take advantage of aggressive inlining. I want to implement the images not as pointers to opaque structures (“typedef struct image2d image2d_t;”), but as real structures. When the stub function is created, it will put interesting data about the images in these structs, and the kernel will have the data at hand without needing to call native functions like __cpu_get_image_width. That will allow inlining.

Current stub:

void stub_for_one_work_group() {
    kernel(16, (image2d_t)0xfee80000, (float4 *)0x8945a000);
}

void kernel(uint size, __read_only image2d_t image, __local float4 *temp) {
    // Many read_imagef calls that cannot be inlined and reduced to say
    // pixel = int_to_float(uchar_to_int(*(uchar4 *)(image->data + 256)));
}

Future stub before optimization:

void stub_for_all_the_kernel(void **locals) {
    image2d image;
    image.data_order = CLK_RGBA;   // Constant built from info available in Image2D
    image.data_type = CLK_UNORM_INT8;
    image.width = 4;
    image.height = 4;
    image.data = (uchar4 *)0xdeadbeef;

    kernel(16, &image, (float4 *)locals[0]);
}

void kernel(...) {...}

After an LLVM optimization pass, kernel() can be inlined into the stub, and all the image functions flattened to become lightweight direct memory accesses. The fact that images are stored as constants and not opaque pointers allows LLVM to do its constant propagation pass.


When easy is difficult, and vice versa

Hello,

Some days ago, I was thinking of what I have to work on first for Clover. The API was nearly complete (only samplers and clFlush/clFinish remaining), there was built-ins to implement, etc.

I decided to take some time to read the OpenCL spec part speaking about the built-in functions. The first chunk is implemented, the next ones (mathematical functions) very easy to do (but boring because despite the fact that Clang allows function overloading in C, it doesn’t allow templates and each function can accept float, float2, float3, float4, float8, float16 and the same with int, short, etc). Then came the “memory fences” that don’t do anything on a CPU device. Then image functions, fairly easy (all the algorithms are given in the spec). I finally read the part entitled “Synchronization functions”.

This part contains only one function : barrier(). This function stops the current work-item and waits for the others to reach the same function call. This doesn’t seem difficult, but in fact it is, because Clover runs the work-items of a work-group sequentially.

So, I decided I was better beginning by implementing the Sampler objects, an easy task.

As the title says, it wasn’t so easy and barrier() wasn’t so difficult. I love challenging problems, so I couldn’t help having my brain thinking about barrier(). I thought that the best way to solve this problem is to launch a work-item, and when it encounters a barrier(), to stop it and start the next, until it reaches barrier(), launching the next, etc. The idea is good, but difficult to achieve. I wanted to use nested calls, but it could have resulted to stack overflows.

After a few minutes, I remembered what I’ve read in an old book dating from the Windows NT 4 days (it was a Microsoft Press book). In an appendix, it speaks about a strange thing called “fibers”. In fact, during my reflection, I wanted something like threads but managed cooperatively by the application. I don’t know why, but this small ten-pages appendix (in a book close to 1000 pages) I’ve read nearly 6 years ago came immediately in my mind.

After a quick Google search to check that fibers are still available in modern operating system, I found my dream : setcontext (and other functions in the same family).

These functions allow a thread to save its context (stack, CPU registers, flags, instruction pointer) and to jump into another context (like an interprocedural goto). It’s exactly what I want !

Each work-item will have its own context consisting of a small stack and CPU registers. When a barrier is encountered, the context is halted and CPUKernelWorkGroup jumps to the next context. It will also encounter the barrier, jump to the next context, and so forth. When the last context reached barrier(), it can continue, get to another barrier or finish its execution. When that arises, the unfinished contexts are resumed and can continue their execution.

It’s way easier and more efficient than having one thread per work-item, I don’t need to use any locking machinery, the context switch is exactly when the application needs it, etc.

So, this barrier() problem was relatively easily solved. It allowed me to read some papers about using Clang and LLVM to implement OpenCL (and it seems that either nVidia or AMD, I don’t remember, are using a complex LLVM IR rewriting to chain blocks, handling barriers, to allow complex auto-vectorization to take place). I hope that my solution will work and will not be too slow (the contexts will be created at the first barrier() call, so no overhead when no barriers are needed).

I then came back to my samplers, the “boring” thing I have to do before being able to work on barrier(), using very exciting POSIX functions.

The implementation of class Sampler was boring as expected. I need to convert three flag arguments to one bitfield, and then to convert it back to the flag arguments in clGetSamplerInfo. Boring, yes. The API was also easy : copy/paste of the one of Context, with only a small set of modifications.

Then, I had to plug the samplers in the Kernel code. I went to the big switch handling arguments type, and saw an horror : sampler_t isn’t a pointer to an opaque struct, it is an unsigned int !

What does it mean ? An opaque struct has a name, so I can use ” struct_type->getName() == “image2d”; ” to know that the argument is of type Image2D. With an integer, all I get from LLVM IR is a “i32″ type, indistinguishable from a simple “uint” type.

The problem was bigger than expected, and no Microsoft book could help me : LLVM simply doesn’t supply enough type information, and I don’t want to use Clang’s debug informations (they are too big and complex).

After a few days of thinking (barrier() only too a few hours to sort out), I came to a solution that works in 99,99999% of the cases : storing in a std::list the known samplers. In setKernelArg, when the user tries to fit a pointer into an i32, I check if the pointer is a known sampler. On 32-bit architectures, sizeof(void *) == 4, so I always check if the i32 isn’t in fact a pointer to a sampler.

On 64-bit machines, the code is perfect. The samplers are detected, normal i32s are left untouched, and plain wrong i64-to-i32 conversions are spotted and result to an error. On 32-bit machines though, all valid i32s are checked against the table of known samplers, and it’s possible that a valid i32 corresponds to a valid pointer to a sampler. In this corner case, the i32 gets replaced by the sampler’s bitfield value, that is data corruption.

I have no better solution, and the infrastructure needed to do that allowed me to make Clover more robust, so I keep it for the moment. I’ll now implement clFlush and clFinish, then tests for the command queue events, then barrier().

I hope to have barrier() finished by August 15, the “soft pencil down date”. It will mean that I was able to code during my summer a nearly complete OpenCL implementation, lacking only some small built-ins. The following days until August 22 will be used to write the documentation of what I did, and maybe some builtins (the image ones, I want to do first the most complex thing, because it’s what I have to do).

By the way, if you know applications or examples using OpenCL (but nearly no built-in functions), I will be glad to test OpenCL with them. It will be especially interesting if these applications use many clEnqueueWaitForEvents, clEnqueueBarrier, barrier() and out-of-order command queues, that is the part of Clover the most difficult to test using testsuites.


Boring work for API completeness

Hello,

Clover’s development seemed slow these days, but in fact it wasn’t. I’m currently “polishing” all I’ve already done. Not because I’m near the end of the project, but because the last part of my Google Summer of Code project will begin in the following days, and I want the code upon which I’ll build it to be solid.

So, my first target for Clover was to be able to launch OpenCL-compiled kernels. In order to be able to do that, the implementation needed to support several things : buffers, events, command queues, contexts, etc. Now that the kernels can run (but without any interesting built-in function), I decided to finish the public API of OpenCL.

In the git repository, you can therefore see many commits like “Implement clFoo and clFooBar”. I’ve read all the APIs and implemented the missing functions.

Currently, I focused on the “enqueue” functions, that is the functions used to queue specific events, the actions OpenCL can perform. These functions are :

  • clEnqueueRead/WriteBufferRect: a complex function copying a buffer to another, but only a rectangle (if we say the buffer contains 2D data) or a cube. This event is particularly important because I built all the image-related events upon it.
  • clEnqueueCopyBuffer: a simple event copying a buffer to another.
  • clEnqueueCopyBufferRect.
  • clCreateImage2D and clCreateImage3D, to add image support to Clover.
  • clEnqueueReadImage and clEnqueueWriteImage, built upon CopyBufferRect.
  • clEnqueueCopyImage (really the mirror of CopyBufferRect).
  • clEnqueueCopyImageToBuffer and clEnqueueCopyBufferToImage.
  • clEnqueueMapImage.
  • clGetSupportedImageFormats.
  • And then clEnqueueBarrier, clEnqueueMarker and clEnqueueWaitForEvents

Now, all the “enqueue” API is completed. I have now to implement the samplers, and clFlush and clFinish. Then, I will be able to implement the interesting built-in functions (from simple mathematical functions to barrier(), the one that could take a fair amount of time thinking on how I could implement it).

The functions I just implemented are based on the “events” framework of Clover, a set of classes inheriting Coal::Event and organized in a complex heritage tree. This enabled me to implement all the events and their checks with only 1500 lines of code in events.cpp (the biggest file of Clover). All the “rectangle-related” events (that is to say Read/Write/CopyBufferRect, and image events) are implemented in less than 100 lines of worker code in CPUDevice (but the code isn’t really readable, I heavily used the testsuite to check my code). For the reference, here is the code doing all the 2D and 3D copies in CPUDevice :

case Event::ReadBufferRect:
case Event::WriteBufferRect:
case Event::CopyBufferRect:
case Event::ReadImage:
case Event::WriteImage:
case Event::CopyImage:
case Event::CopyBufferToImage:
case Event::CopyImageToBuffer:
{
    // src = buffer and dst = mem if note copy
    ReadWriteCopyBufferRectEvent *e = (ReadWriteCopyBufferRectEvent *)event;
    CPUBuffer *src_buf = (CPUBuffer *)e->source()->deviceBuffer(device);

    unsigned char *src = (unsigned char *)src_buf->data();
    unsigned char *dst;

    switch (t)
    {
        case Event::CopyBufferRect:
        case Event::CopyImage:
        case Event::CopyImageToBuffer:
        case Event::CopyBufferToImage:
        {
            CopyBufferRectEvent *cbre = (CopyBufferRectEvent *)e;
            CPUBuffer *dst_buf =
                (CPUBuffer *)cbre->destination()->deviceBuffer(device);

            dst = (unsigned char *)dst_buf->data();
            break;
        }
        default:
        {
            // dst = host memory location
            ReadWriteBufferRectEvent *rwbre = (ReadWriteBufferRectEvent *)e;

            dst = (unsigned char *)rwbre->ptr();
        }
    }

    // Iterate over the lines to copy and use memcpy
    for (size_t z=0; z<e->region(2); ++z)
    {
        for (size_t y=0; y<e->region(1); ++y)
        {
            unsigned char *s;
            unsigned char *d;

            d = imageData(dst,
                          e->dst_origin(0),
                          y + e->dst_origin(1),
                          z + e->dst_origin(2),
                          e->dst_row_pitch(),
                          e->dst_slice_pitch(),
                          1);

            s = imageData(src,
                          e->src_origin(0),
                          y + e->src_origin(1),
                          z + e->src_origin(2),
                          e->src_row_pitch(),
                          e->src_slice_pitch(),
                          1);

            // Copying an image to a buffer may need to add an offset
            // to the buffer address (its rectangular origin is
            // always (0, 0, 0)).
            if (t == Event::CopyBufferToImage)
            {
                CopyBufferToImageEvent *cptie = (CopyBufferToImageEvent *)e;
                s += cptie->offset();
            }
            else if (t == Event::CopyImageToBuffer)
            {
                CopyImageToBufferEvent *citbe = (CopyImageToBufferEvent *)e;
                d += citbe->offset();
            }

            if (t == Event::WriteBufferRect || t == Event::WriteImage)
                std::memcpy(s, d, e->region(0)); // Write dest (memory) in src
            else
                std::memcpy(d, s, e->region(0)); // Write src (buffer) in dest (memory), or copy the buffers
        }
    }

    break;
}

ImageData is a simple function returning the address of a pixel given its coordinates. It currently works only on little-endian architectures. You’ll see that bytes_per_pixel is always 1 in this code (the last argument of imageData). It’s normal, Event objects already did the multiplications where needed.

static unsigned char *imageData(unsigned char *base, size_t x, size_t y,
                                size_t z, size_t row_pitch, size_t slice_pitch,
                                unsigned int bytes_per_pixel)
{
    unsigned char *result = base;

    result += (z * slice_pitch) +
              (y * row_pitch) +
              (x * bytes_per_pixel);

    return result;

I’m nearing the end of my project. I don’t know if I will be able to implement all the built-in functions by August 25. I’ll start with the “difficult” ones (barrier(), image reading and writing) in the hope that I will be able to implement the remaining ones after the Summer of Code program. These are fairly simple functions already implemented in many third-party mathematical libraries, so I can simply call them or copy their code.


A fix and an optimization

Hello,

Last week, I was able to run an OpenCL-compiled kernel using Clover. The problem was that there was a problem : every once, the kernel failed.

I thought it was a locking problem. In fact, it was : the work groups called kernel->callFunction (to get a zero-argument function calling the actual kernel with its arguments) without any lock.

After a fix, the kernels were mostly working but continued to fail after a few iterations. The problem took one week to find and was tricky.

In fact, when an event is “finished”, it can be deleted. The logic was simple : run each work group of a kernel, and when the last work group has finished running, kill the kernel. This seemed to work on paper, but in reality, on multi-core machines running multi-threaded OSes, it doesn’t work. What happened is that the last work group finished before other work groups, because they were on busier CPU cores, or were preempted. The kernel deleted, clWaitForEvent returned, and the program began to read the buffer while it wasn’t already ready.

This fix was to add a counter of the finished work groups. When all the work groups are finished, we can delete the kernel. A simple fix for a complex bug.

I then added an optimization regarding the get_global_id builtin function. This function returns an ID based on the current work group and work item. OpenCL also adds a “work-offset” and some other things. The result was a function with two additions, a multiplication, and several getter calls (I hadn’t enabled link-time optimization).

Now, some of these computations are cached in a private variable, and get_global_id is free of any function call and contains only a simple addition. The speedup is interesting : for a kernel with two very slow modulo operations (what a processor does the slowest), the speedup is of 1.72x. I haven’t tried a simpler kernel, but it seems that the speedup for only get_global_id is already very interesting.

The other builtin functions are simpler and don’t need to be optimized. Another slow part of Clover is the thread-local storage. Each call to a builting needs a call to __tls_get_storage or a function like this.

The plan for the following days is to implement samplers and images (to have a complete API), and then a Qt-based example applications allowing to write kernels, add buffers and images and run programs. This will allow me to write nice test kernels for the testsuite, that will have a new testrunner taking a kernel as argument (and information about the buffers) and testing the kernel. With all of this, I will be able to write and test the builtin functions :) .


Compile and launch simple OpenCL C kernels with Clover

Hello,

Another Clover milestone was hit today ! During my holidays (from July 4 to July 15), I worked on the OpenCL C compiler based on Clang that is used in Clover to compile OpenCL programs. I also implemented all the infrastructure to get and launch the kernels of a program.

All is in place and can be tested (but there are still some limitations). Clover is now able to fully execute functions like clSetKernelArg, clBuildProgram and the best one : clEnqueueNDRangeKernel.

This means that from now, my OpenCL implementation is very close from being complete API-wise (I’ve still a lot of work to do regarding the OpenCL C language and its standard set of functions).

For information, Clover is able to launch this program (found in tests/test_kernel.cpp) :

float simple_function(float a) {
    return a * 2.5f;
}

__kernel void kernel1(__global float *a, __global float *b, float f) {
    size_t i = get_global_id(0);

    a[i] = simple_function(f) * b[i];
}

__kernel void kernel2(__global int *buf) {
    size_t i = get_global_id(0);

    buf[i] = 2 * i;
}

It recognizes that kernel1 and kernel2 are two different kernels (and that simple_function cannot be called as a kernel). The testcase launches kernel2 as it has only one argument (it’s faster to test a simpler kernel, but I’ll also add testcases for the other kernel), and checks that its result is ok.

int buffer[64];

// Create a cl_mem buffer to give to the kernel
buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                     sizeof(buffer), buffer, &result);

// Fetch the list of the kernels in the program
result = clCreateKernelsInProgram(program, 2, kernels, 0);

// Set the argument of kernel2
result = clSetKernelArg(kernels[1], 0, sizeof(cl_mem), &buf);

// Execute the kernel
size_t global_size = sizeof(buffer) / sizeof(buffer[0]);
result = clEnqueueNDRangeKernel(queue, kernels[1], 1, 0, &global_size, 0, 0, 0, &event);
result = clWaitForEvents(1, &event);

The buffer contains after the kernel is run 64 ints : (0, 2, 4, 6, 8, …, 126).

The implementation of all of this is a bit tricky. I invite you to take a look at the CGit, but the commits are fairly difficult to understand. I hope the code isn’t too ugly, it works, I even find it somewhat nice at some places, but it’s too early to be sure it’s good.

To end this blog post, a summary of what I did during my holidays :

55 files changed, 5031 insertions(+), 1941 deletions(-)

The line numbers are not truth, but it show I was able to work many hours a day during my holidays (and I also very enjoyed them). I’m happy :) . I promised to be able to work during my holidays, I kept my promise. I have now a month and a half to work on the builtin functions (only get_global_id is implemented, and there are plenty of other functions) and to finish the API (samplers and images aren’t yet completely implemented).

Then I will have to write tests, tests and more tests.


Follow

Get every new post delivered to your Inbox.