The End is Approaching

Hello,

Today is the “soft pencils down” day, which means that I now have to mostly stop working on the code and concentrate on writing good documentation for my project.

The first step of that is this blog post: a summary of what I have done during the summer, what remains to be done, and so on.

The Project

My project was to work on an existing small code base called Clover, an open-source OpenCL implementation started by a Mesa developer, Zack Rusin.

Before I started my project, as I said in one of my first blog posts, the code was very small and did nearly nothing, but the structure was well in place and easy to extend.

Because the project was started by a Mesa developer, many people wanted it to be able to take advantage of hardware acceleration.

Status

My target was to do as much as I could during the summer, in order to have a feature-rich code base with many of the difficult details already solved. I wanted Clover to be able to run some simple OpenCL demos on the CPU, with everything in place to use Gallium3D to provide hardware acceleration.

I’m happy to say that Clover evolved fairly well during the summer. It is now an API-complete library, relatively fast (and it will become even faster in the coming days, as I just thought of a potentially huge optimization), with all the pieces needed for hardware acceleration in place.

The only missing things are some of the built-in functions available to kernels. I have already implemented the most difficult ones; the remaining ones are functions like clamp().
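
To give an idea of their scope, here is a rough sketch (not the actual Clover code) of what the scalar float variant of clamp() could look like, following its definition in the OpenCL specification and assuming fmin() and fmax() are already available as built-ins:

/* Hypothetical sketch of one of the remaining built-ins. The real
 * implementation also needs the integer and vector variants. */
float clamp(float x, float minval, float maxval)
{
    /* The spec defines clamp(x, minval, maxval) as fmin(fmax(x, minval), maxval) */
    return fmin(fmax(x, minval), maxval);
}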

Plans for this week

Now that most of the work is finished, I will be able to clean up the code and write documentation. It will be in Doxygen format and will cover the DeviceInterface API (the abstraction layer between the core API and the devices, currently only CPUDevice) and the core classes, and will also explain some of the more difficult parts of Clover:

  • The way memory objects work.
  • The structure of events, command queues and CPU worker threads.
  • How I implemented barrier() (a cleaner and more comprehensible explanation than the one I posted on August 7).
  • How I use Clang and LLVM to compile and launch kernels.

I’ll also improve the “stub functions” of a kernel. They are currently small functions taking no arguments at all and calling the kernel function with its arguments given as constants. This works fairly well but is a bit slow when __local pointers are used (these pointers must be allocated for every work group, so their values change constantly and the stub function has to be recreated).

My plan is to create only one stub function taking a single argument: void **locals. This array of pointers will let the stub call the kernel with varying arguments without having to be re-created and re-JITed every time.

Furthermore, this makes a huge optimization possible: I implemented the image built-ins, and they are full of switch statements. The code is very large and would benefit greatly from aggressive inlining. I want to implement images not as pointers to opaque structures (“typedef struct image2d image2d_t;”), but as real structures. When the stub function is created, it will put the interesting data about the images into these structs, and the kernel will have the data at hand without needing to call native functions like __cpu_get_image_width. That is what allows inlining.
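
To show why inlining matters so much here, this is a simplified sketch (not the actual Clover code) of the kind of switch a read_imagef-style built-in has to perform when the image format is only known at run time; int_to_float() and uchar_to_int() are the same illustrative helpers used in the snippets below:

/* Simplified, hypothetical sketch: convert one pixel to a float4
 * depending on the image data type. The real built-in also switches
 * on the channel order and handles many more formats. */
float4 read_pixel(const image2d *image, size_t offset)
{
    switch (image->data_type) {
    case CLK_UNORM_INT8:
        return int_to_float(uchar_to_int(*(uchar4 *)(image->data + offset)));
    case CLK_FLOAT:
        return *(float4 *)(image->data + offset);
    /* ... many other cases ... */
    default:
        return (float4)(0.0f);
    }
}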

Current stub:

void stub_for_one_work_group() {
    kernel(16, (image2d_t)0xfee80000, (float4 *)0x8945a000);
}

void kernel(uint size, __read_only image2d_t image, __local float4 *temp) {
    // Many read_imagef calls that cannot be inlined and reduced to say
    // pixel = int_to_float(uchar_to_int(*(uchar4 *)(image->data + 256)));
}

Future stub before optimization:

void stub_for_all_the_kernel(void **locals) {
    image2d image;
    image.data_order = CLK_RGBA;   // Constant built from info available in Image2D
    image.data_type = CLK_UNORM_INT8;
    image.width = 4;
    image.height = 4;
    image.data = (uchar4 *)0xdeadbeef;

    kernel(16, &image, (float4 *)locals[0]);
}

void kernel(...) {...}

After an LLVM optimization pass, kernel() can be inlined into the stub, and all the image functions are flattened into lightweight direct memory accesses. Because the image data is stored as constants rather than behind an opaque pointer, LLVM can run its constant-propagation pass on it.
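
For illustration, after inlining and constant propagation the stub above could collapse into something like this sketch (the exact result depends on the kernel, of course; int_to_float() and uchar_to_int() are the same illustrative helpers as above):

/* Hypothetical result once LLVM has inlined kernel() into the stub and
 * propagated the image constants: no switches and no calls to
 * __cpu_get_image_width() remain, only direct memory accesses. */
void stub_for_all_the_kernel(void **locals)
{
    uchar4 *data = (uchar4 *)0xdeadbeef;
    float4 *temp = (float4 *)locals[0];

    /* a read_imagef(image, sampler, (int2)(1, 1)) call becomes: */
    temp[0] = int_to_float(uchar_to_int(data[1 * 4 + 1]));
}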


7 responses to “The End is Approaching”

  • Dragomir

    I am eagerly looking forward to your documentation. The source looks very clear and comprehensive, even without deep knowledge of OpenCL.
    One question: where does the “__builtin_ia32_*” family come from? Is it from LLVM, or from not-yet-merged stuff?

  • Eric Anholt

    I’ve been amazed at the progress you’ve made here — “get CPU-side CL kernels working” sounded like far more than a one-summer GSOC project to me!

    I’m curious, what are you using for testing your CL work? Are there any testsuites out there already? If you’re building any regression tests yourself and looking for a decent framework, we’d be interested in incorporating them into piglit.

    That image data inlining sounds very aggressive to me. I would think that inlining the pointer to the image data would result in a lot of recompiles on apps, since that changes every time someone runs the kernel on a different image. width/height I’m not sure about, but still is something I’d expect to change pretty regularly. data_order/data_type I would expect to typically stay constant for the lifetime of a kernel, though, and seems valuable to expose as a constant to the compiler.

  • steckdenis

    Hello,

    Thanks for the comments (and sorry for the delay approving them, I still need to figure out how to disable comment moderation).

    > One question: Where “__builtin_ia32_*” family came from? Is it from LLVM or not yet merged stuff?

    They are Clang built-ins. If you have a Clang SVN copy somewhere, take a look at llvm/tools/clang/include/clang/Basic/BuiltinsX86.def

    > I’m curious, what are you using for testing your CL work?

    I’m using a custom testsuite based on Check. It works fairly well and is very simple (adding a test case is just a matter of adding a function, or a line in an existing function). I’ll take a look at piglit; it looks interesting, and I’ll see if I can contribute my tests to it after the GSoC.
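
    For example, adding a new test case with Check boils down to something like this skeleton (the Clover helpers and the exact suite layout differ; this is just a hypothetical sketch of the Check side):

    #include <check.h>
    #include <CL/cl.h>

    START_TEST (test_get_platform)
    {
        cl_platform_id platform;
        cl_uint num_platforms;

        cl_int err = clGetPlatformIDs(1, &platform, &num_platforms);
        fail_if(err != CL_SUCCESS, "unable to get a platform");
        fail_if(num_platforms == 0, "no platform reported");
    }
    END_TEST

    Suite *platform_suite(void)
    {
        Suite *s = suite_create("Platform");
        TCase *tc = tcase_create("core");
        tcase_add_test(tc, test_get_platform);  /* the "one line" per new test */
        suite_add_tcase(s, tc);
        return s;
    }

    int main(void)
    {
        SRunner *sr = srunner_create(platform_suite());
        srunner_run_all(sr, CK_NORMAL);
        int failed = srunner_ntests_failed(sr);
        srunner_free(sr);
        return failed == 0 ? 0 : 1;
    }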

    > since that changes every time someone runs the kernel on a different image.

    That’s very interesting! When I read the spec, it didn’t explicitly say that a kernel can be run multiple times with different arguments (though it’s logical now that you say it), so I implemented compilation as if a kernel could only be used once with a single set of arguments.

    I’ll now fix this small glitch (it should not take too much time), but it’s sad that I won’t be able to do this nice optimization without recompiling the kernel every time it is run. Recompilation takes time, so I need to see whether it is faster to recompile the kernel on each run and have it run fast, or not to recompile it and have it run more slowly.

  • Eric Anholt

    Looks like check has a lot of the same design qualities we go for in piglit (small, easy-to-add testcases, run in a separate address space so that assertions and crashes don’t block a test run). Very nice.

    The comment on images changing was based on assumptions given how other Khronos APIs are built, and how I imagine people would use CL. I’ll be the first to admit being ignorant about CL, though!

  • Aaron Watry

    “> since that changes every time someone runs the kernel on a different image.

    That’s very interesting! When I read the spec, it didn’t explicitly say that a kernel can be run multiple times with different arguments (though it’s logical now that you say it), so I implemented compilation as if a kernel could only be used once with a single set of arguments.”

    Yup. One example would be in the context of the CL-based WebM video decoder I was working on for my master’s thesis project:

    Example:

    cl_kernel kernel = /* ... created once, outside the loop ... */;

    for (frame = 0; frame < num_frames; frame++) {
        /* the same kernel object is re-used with different arguments */
        err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input_mem);
        err = clSetKernelArg(kernel, 1, sizeof(cl_mem), &output_mem);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
    }

    The WebM decoder software includes a program that calculates MD5-summed output, so that might work as a fairly good test of basic functionality/correctness.

  • Victor Oliveira

    I’m the student who added OpenCL support to GEGL, and I’m really looking forward to using this, very interesting job! Do you think it’s possible in its current state?

    http://meudepositodeideias.wordpress.com/2011/08/08/opencl-on-gegl-results-up-to-now/

  • steckdenis

    Hello,

    GEGL may not crash using my OpenCL implementation, but I don’t think the kernels will compile: Clover currently lacks too many built-ins.

    If your kernels use something like clamp(), dot() or any built-in other than barrier() and the image ones, their compilation will fail.

    So I hope that Clover will be testable with your code in the coming weeks, once the built-ins are implemented, but it isn’t yet.
