When easy is difficult, and vice versa

Hello,

Some days ago, I was thinking about what I should work on first for Clover. The API was nearly complete (only samplers and clFlush/clFinish remaining), there were built-ins to implement, etc.

I decided to take some time to read the part of the OpenCL spec that covers the built-in functions. The first chunk is already implemented, and the next ones (mathematical functions) are very easy to do, but boring: although Clang allows function overloading in C, it doesn’t allow templates, and each function can accept float, float2, float3, float4, float8, float16 and the same with int, short, etc. Then came the “memory fences”, which don’t do anything on a CPU device. Then the image functions, fairly easy (all the algorithms are given in the spec). I finally read the part entitled “Synchronization functions”.

This part contains only one function: barrier(). This function stops the current work-item and waits for the others to reach the same function call. This doesn’t seem difficult, but in fact it is, because Clover runs the work-items of a work-group sequentially.

So I decided I had better begin by implementing the Sampler objects, an easy task.

As the title says, it wasn’t so easy, and barrier() wasn’t so difficult. I love challenging problems, so I couldn’t stop my brain from thinking about barrier(). I thought that the best way to solve this problem was to launch a work-item and, when it encounters a barrier(), stop it and start the next one, which runs until it reaches barrier() too, and so on. The idea is good, but difficult to achieve. I wanted to use nested calls, but that could have resulted in stack overflows.

After a few minutes, I remembered something I had read in an old book dating from the Windows NT 4 days (it was a Microsoft Press book). An appendix talks about a strange thing called “fibers”. In fact, what I had been looking for was something like threads, but scheduled cooperatively by the application. I don’t know why, but this small ten-page appendix (in a book of close to 1000 pages), which I read nearly 6 years ago, came immediately to mind.

After a quick Google search to check that fibers are still available in modern operating systems, I found my dream: setcontext (and the other functions in the same family).

These functions allow a thread to save its context (stack, CPU registers, flags, instruction pointer) and to jump into another context (like an interprocedural goto). It’s exactly what I want!

Each work-item will have its own context consisting of a small stack and CPU registers. When a barrier is encountered, the context is suspended and CPUKernelWorkGroup jumps to the next context. That one will also encounter the barrier, jump to the next context, and so forth. When the last context reaches barrier(), it can continue until it gets to another barrier or finishes its execution. When that happens, the suspended contexts are resumed and can continue their execution.
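
To make the idea concrete, here is a minimal sketch of that scheme using the ucontext functions. It is not Clover’s code: every name in it (sketch, kernel, run_work_group, trace and so on) is hypothetical, the “kernel” is a stand-in that only records when it runs, and it assumes a platform where <ucontext.h> is available (Linux with glibc, for instance). For simplicity it creates the contexts up front rather than at the first barrier() call.

```cpp
#include <ucontext.h>

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

namespace sketch {  // hypothetical names, for illustration only

constexpr std::size_t kItems = 3;              // work-items in the group
constexpr std::size_t kStackSize = 64 * 1024;  // arbitrary small stack

ucontext_t scheduler;      // context of the loop that drives the group
ucontext_t items[kItems];  // one saved context per work-item
bool finished[kItems];
std::size_t current = 0;   // index of the work-item being run
std::string trace;         // records the order in which code runs

// Cooperative barrier: save this work-item and go back to the scheduler.
void barrier() {
    swapcontext(&items[current], &scheduler);
}

// Stand-in for a compiled kernel with one barrier in the middle.
void kernel() {
    assert(current < kItems);
    trace.push_back(static_cast<char>('a' + current));  // before the barrier
    barrier();
    trace.push_back(static_cast<char>('A' + current));  // after the barrier
    finished[current] = true;
    // Returning switches to uc_link, i.e. back to the scheduler.
}

std::string run_work_group() {
    std::vector<std::vector<char>> stacks(kItems,
                                          std::vector<char>(kStackSize));
    trace.clear();
    for (std::size_t i = 0; i < kItems; ++i) {
        finished[i] = false;
        getcontext(&items[i]);                       // initialise the struct
        items[i].uc_stack.ss_sp = stacks[i].data();  // private stack
        items[i].uc_stack.ss_size = kStackSize;
        items[i].uc_link = &scheduler;               // where to go on return
        makecontext(&items[i], kernel, 0);
    }
    // Round-robin: run each unfinished work-item until its next barrier
    // (or its end), and repeat until nothing is left to resume.
    bool resumed_any = true;
    while (resumed_any) {
        resumed_any = false;
        for (current = 0; current < kItems; ++current) {
            if (finished[current]) continue;
            swapcontext(&scheduler, &items[current]);
            resumed_any = true;
        }
    }
    return trace;
}

}  // namespace sketch
```

Calling sketch::run_work_group() returns "abcABC": all three work-items run the code before the barrier, and only then does any of them run the code after it, all on a single thread and without any lock.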

It’s way easier and more efficient than having one thread per work-item: I don’t need any locking machinery, the context switch happens exactly when the application needs it, etc.

So this barrier() problem was solved relatively easily. It led me to read some papers about using Clang and LLVM to implement OpenCL (it seems that either nVidia or AMD, I don’t remember which, uses a complex LLVM IR rewriting that chains blocks and handles barriers so that complex auto-vectorization can take place). I hope that my solution will work and will not be too slow (the contexts will be created at the first barrier() call, so there is no overhead when no barriers are needed).

I then came back to my samplers, the “boring” thing I have to do before being able to work on barrier() and its very exciting POSIX functions.

The implementation of class Sampler was boring, as expected. I need to convert three flag arguments to one bitfield, and then convert it back to the flag arguments in clGetSamplerInfo. Boring, yes. The API was also easy: a copy/paste of the Context one, with only a small set of modifications.
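
For illustration, the round trip between three flags and one bitfield can be sketched as follows. The bit layout, the enum values and the names (SamplerArgs, pack, unpack) are assumptions made up for this example; they are neither the OpenCL constants nor Clover’s actual encoding.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {  // hypothetical names and layout, for illustration only

enum class Addressing : std::uint32_t {
    None = 0, ClampToEdge = 1, Clamp = 2, Repeat = 3, MirroredRepeat = 4
};
enum class Filter : std::uint32_t { Nearest = 0, Linear = 1 };

// The three arguments a sampler is created with.
struct SamplerArgs {
    bool normalized;
    Addressing addressing;
    Filter filter;
};

// Assumed layout: bit 0 = normalized coords, bits 1-3 = addressing mode,
// bit 4 = filter mode.
inline std::uint32_t pack(const SamplerArgs& a) {
    const std::uint32_t addressing = static_cast<std::uint32_t>(a.addressing);
    const std::uint32_t filter = static_cast<std::uint32_t>(a.filter);
    assert(addressing < 8 && filter < 2);  // must fit in their bit ranges
    return (a.normalized ? 1u : 0u) | (addressing << 1) | (filter << 4);
}

// Inverse of pack(): recover the three arguments from the bitfield,
// as a query function would need to do.
inline SamplerArgs unpack(std::uint32_t bits) {
    return SamplerArgs{
        (bits & 1u) != 0,
        static_cast<Addressing>((bits >> 1) & 0x7u),
        static_cast<Filter>((bits >> 4) & 0x1u)
    };
}

}  // namespace sketch
```

With this layout, a normalized, repeating, linearly filtered sampler packs to 1 + (3 << 1) + (1 << 4) = 23, and unpack(23) gives back the same three values.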

Then I had to plug the samplers into the Kernel code. I went to the big switch handling the argument types, and saw a horror: sampler_t isn’t a pointer to an opaque struct, it is an unsigned int!

What does it mean? An opaque struct has a name, so I can test struct_type->getName() == "image2d" to know that the argument is of type Image2D. With an integer, all I get from LLVM IR is an “i32” type, indistinguishable from a simple “uint” type.

The problem was bigger than expected, and no Microsoft book could help me: LLVM simply doesn’t supply enough type information, and I don’t want to use Clang’s debug information (it is too big and complex).

After a few days of thinking (barrier() only took a few hours to sort out), I came to a solution that works in 99.99999% of cases: storing the known samplers in a std::list. In setKernelArg, when the user tries to fit a pointer into an i32, I check whether the pointer is a known sampler. On 32-bit architectures, sizeof(void *) == 4, so I always have to check whether the i32 is in fact a pointer to a sampler.

On 64-bit machines, the code is perfect. The samplers are detected, normal i32s are left untouched, and plain wrong i64-to-i32 conversions are spotted and result in an error. On 32-bit machines though, all valid i32s are checked against the table of known samplers, and it’s possible that a valid i32 happens to equal the address of a sampler. In this corner case, the i32 gets replaced by the sampler’s bitfield value, which is data corruption.
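
The check described above can be sketched like this. It is only an approximation of the logic, not Clover’s setKernelArg: the types and names (Sampler, SamplerRegistry, resolve_i32_arg, ArgResult) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <list>

namespace sketch {  // hypothetical names, for illustration only

struct Sampler { std::uint32_t bitfield; };

// Keeps track of every live sampler so a raw pointer can be recognised.
class SamplerRegistry {
public:
    void add(const Sampler* s) { known_.push_back(s); }
    void remove(const Sampler* s) { known_.remove(s); }
    // Returns the sampler living at this address, or nullptr if none does.
    const Sampler* find(const void* address) const {
        for (const Sampler* s : known_)
            if (s == address) return s;
        return nullptr;
    }
private:
    std::list<const Sampler*> known_;
};

enum class ArgResult { PlainValue, SamplerReplaced, Error };

// Handles the bytes the user passed for a kernel parameter that LLVM
// reports as i32. `out` receives the 32-bit value to hand to the kernel.
ArgResult resolve_i32_arg(const SamplerRegistry& registry, const void* value,
                          std::size_t size, std::uint32_t& out) {
    assert(value != nullptr);
    if (size == sizeof(void*)) {
        // Pointer-sized argument: is it the address of a known sampler?
        const void* candidate = nullptr;
        std::memcpy(&candidate, value, sizeof candidate);
        if (const Sampler* s = registry.find(candidate)) {
            out = s->bitfield;
            return ArgResult::SamplerReplaced;
        }
    }
    if (size == sizeof(std::uint32_t)) {
        // Ordinary 32-bit integer. On a 32-bit machine this branch is only
        // reached after the sampler check above, hence the corner case.
        std::memcpy(&out, value, sizeof out);
        return ArgResult::PlainValue;
    }
    return ArgResult::Error;  // e.g. a 64-bit value that is not a sampler
}

}  // namespace sketch
```

On a 64-bit machine the two branches can never both apply to the same argument, because a pointer is 8 bytes and an i32 is 4. On a 32-bit machine both sizes are 4, so the first branch may catch an integer that merely looks like a sampler’s address.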

I have no better solution, and the infrastructure needed for it allowed me to make Clover more robust, so I am keeping it for the moment. I’ll now implement clFlush and clFinish, then tests for the command queue events, then barrier().

I hope to have barrier() finished by August 15, the “soft pencil down date”. It will mean that I was able to code, during my summer, a nearly complete OpenCL implementation, lacking only some small built-ins. The following days until August 22 will be used to write the documentation of what I did, and maybe some built-ins (the image ones: I want to do the most complex thing first, because it’s what I have to do).

By the way, if you know applications or examples using OpenCL (but nearly no built-in functions), I will be glad to test Clover with them. It will be especially interesting if these applications use many clEnqueueWaitForEvents, clEnqueueBarrier, barrier() and out-of-order command queues, as these are the parts of Clover that are the most difficult to test with test suites.

4 responses to “When easy is difficult, and vice versa”

  • Karl

    I believe that the darktable project optionally uses OpenCL; it could be worth a try. :-)

  • michael bien

    Hi, I am the author of the Java OpenCL binding JOCL. JOCL covers large parts of the CL API with JUnit tests to make sure that the binding works. If you have implemented the platform model, I could run it over your CL implementation, if you are interested.

  • J-P

    Examples using the OpenCL C++ Bindings can pass a NULL pointer as the 4th arg (const char * options) of clBuildProgram(). This results in:
    terminate called after throwing an instance of ‘std::logic_error’ what(): basic_string::_S_construct NULL
    at bool Compiler::compile(const std::string &options, llvm::MemoryBuffer *source)

    When I compile Clover, everything is fine (LLVM 3.0 svn), but the program and kernel test cases result in a SIGSEGV.
    I have to re-order the Clang libs and remove libClang in
    cmake/modules/FindClang.cmake:

    FIND_AND_ADD_CLANG_LIB(clangFrontendTool)
    FIND_AND_ADD_CLANG_LIB(clangFrontend)
    FIND_AND_ADD_CLANG_LIB(clangDriver)
    FIND_AND_ADD_CLANG_LIB(clangSerialization)
    FIND_AND_ADD_CLANG_LIB(clangCodeGen)
    FIND_AND_ADD_CLANG_LIB(clangParse)
    FIND_AND_ADD_CLANG_LIB(clangSema)
    FIND_AND_ADD_CLANG_LIB(clangStaticAnalyzerFrontend)
    FIND_AND_ADD_CLANG_LIB(clangStaticAnalyzerCheckers)
    FIND_AND_ADD_CLANG_LIB(clangStaticAnalyzerCore)
    #FIND_AND_ADD_CLANG_LIB(clangChecker)
    FIND_AND_ADD_CLANG_LIB(clangAnalysis)
    FIND_AND_ADD_CLANG_LIB(clangIndex)
    FIND_AND_ADD_CLANG_LIB(clangRewrite)
    FIND_AND_ADD_CLANG_LIB(clangAST)
    FIND_AND_ADD_CLANG_LIB(clangLex)
    FIND_AND_ADD_CLANG_LIB(clangBasic)
    #FIND_AND_ADD_CLANG_LIB(clang)

    • steckdenis

      Hello,

      Thank you very much. I just pushed a fix to the Git repository, and I hope it will work for you (I copy/pasted your list of libs but removed the StaticAnalyzer ones, as they aren’t used).
