When LLVM is slow


A quick blog post about what I’ve just seen in the LLVM codebase.

First, a note about the progress I made since the previous post. The kernels are done, I’ve implemented clGetKernelInfo, clGetKernelWorkGroupInfo (currently a stub), clSetKernelArg, and even clEnqueueNDRangeKernel and clEnqueueTask. It means that API-wise, we can run compiled OpenCL kernels. But it isn’t the case yet. I still need to implement many things in the core of Clover, but every hour makes me closer to finish that.

One of the thing I have to implement is a fast mean to call a kernel. “Fast” is really important here, since most kernels are small (thus run fast) but are a lot to be run.

I naively began by envisaging using llvm::ExecutionEngine::runFunction. This function takes two parameters : the function to run, and a vector of its arguments. This vector didn’t make me happy, I found it too slow. So I looked at the code of this function for the JIT (the engine used on x86, x86_64 and ARM, the three targets I want to run fast). I was horrified by this function !

// ... Many "fast" special cases

// Okay, this is not one of our quick and easy cases.  Because we don't have a
// full FFI, we have to codegen a nullary stub function that just calls the
// function we are interested in, passing in constants for all of the
// arguments.  Make this function and return.

// First, create the function.
FunctionType *STy=FunctionType::get(RetTy, false);
Function *Stub = Function::Create(STy, Function::InternalLinkage, "",

// Insert a basic block.
BasicBlock *StubBB = BasicBlock::Create(F->getContext(), "", Stub);

// ... Code generation for the args, to have a function like
//     T f() { return function(3, "foobar", (int *)0x9fffff90); }

// Finally, call our nullary stub function.
GenericValue Result = runFunction(Stub, std::vector<GenericValue>());
// Erase it, since no other function can have a reference to it.

The last line is the worst : the function is removed, each call will need to use code generation !

Fortunately, I can avoid that for kernels. A kernel is a function called many times with the same parameters. So, I can build this stub function (avoiding some corner cases LLVM handles), keep it in the Coal::Kernel object, and call it many times with a simple :

void (*func)() = device_kernel->nativeFunctionAddress();

for (...)

Simple, and really fast !


One response to “When LLVM is slow

  • steckdenis


    The compilation of a stub is done, it was a bit more difficult than expected because I had to handle vector types (float4, etc).

    Now it works, and when I call CPUKernel::callFunction, it returns a zero-argument LLVM function implemented like that in the module :

    define internal void @0() {
    tail call void @kernel2(i32 addrspace(1)* inttoptr (i64 140734640601216 to i32 addrspace(1)*))
    ret void

    The buffer is resolved to its address in RAM, the local buffers are allocated, all normally works ok.

    Now, the only thing I have to implement before having the kernels running is feeding the execution engine with kernels to execute, and some builtin functions like get_global_id.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: