Boring work for API completeness

Hello,

Clover’s development may have seemed slow lately, but in fact it wasn’t. I’m currently “polishing” everything I’ve already done. Not because I’m near the end of the project, but because the last part of my Google Summer of Code project will begin in the next few days, and I want the code I’ll build it on to be solid.

So, my first target for Clover was to be able to launch OpenCL-compiled kernels. To do that, the implementation needed to support several things: buffers, events, command queues, contexts, etc. Now that kernels can run (but without any interesting built-in function), I decided to finish the public API of OpenCL.

In the git repository, you can therefore see many commits like “Implement clFoo and clFooBar”. I went through the whole API and implemented the missing functions.

Lately, I have focused on the “enqueue” functions, that is, the functions used to queue specific events (the actions OpenCL can perform). These functions are:

  • clEnqueueReadBufferRect and clEnqueueWriteBufferRect: complex functions copying data between a buffer and host memory, but only a rectangle (if we say the buffer contains 2D data) or a box (3D data). This event is particularly important because I built all the image-related events upon it.
  • clEnqueueCopyBuffer: a simple event copying a buffer to another.
  • clEnqueueCopyBufferRect.
  • clCreateImage2D and clCreateImage3D, to add image support to Clover.
  • clEnqueueReadImage and clEnqueueWriteImage, built upon CopyBufferRect.
  • clEnqueueCopyImage (really the mirror of CopyBufferRect).
  • clEnqueueCopyImageToBuffer and clEnqueueCopyBufferToImage.
  • clEnqueueMapImage.
  • clGetSupportedImageFormats.
  • And then clEnqueueBarrier, clEnqueueMarker and clEnqueueWaitForEvents.

Now, the whole “enqueue” API is complete. I still have to implement the samplers, clFlush and clFinish. Then I will be able to implement the interesting built-in functions (from simple mathematical functions to barrier(), which could take a fair amount of thinking before I find out how to implement it).

The functions I just implemented are based on the “events” framework of Clover, a set of classes inheriting Coal::Event and organized in a complex inheritance tree. This enabled me to implement all the events and their checks with only 1500 lines of code in events.cpp (the biggest file of Clover). All the “rectangle-related” events (that is to say Read/Write/CopyBufferRect, and the image events) are implemented in less than 100 lines of worker code in CPUDevice (but the code isn’t really readable, and I relied heavily on the test suite to check it). For reference, here is the code doing all the 2D and 3D copies in CPUDevice:

case Event::ReadBufferRect:
case Event::WriteBufferRect:
case Event::CopyBufferRect:
case Event::ReadImage:
case Event::WriteImage:
case Event::CopyImage:
case Event::CopyBufferToImage:
case Event::CopyImageToBuffer:
{
    // src = buffer and dst = host memory if this is not a copy
    ReadWriteCopyBufferRectEvent *e = (ReadWriteCopyBufferRectEvent *)event;
    CPUBuffer *src_buf = (CPUBuffer *)e->source()->deviceBuffer(device);

    unsigned char *src = (unsigned char *)src_buf->data();
    unsigned char *dst;

    switch (t)
    {
        case Event::CopyBufferRect:
        case Event::CopyImage:
        case Event::CopyImageToBuffer:
        case Event::CopyBufferToImage:
        {
            CopyBufferRectEvent *cbre = (CopyBufferRectEvent *)e;
            CPUBuffer *dst_buf =
                (CPUBuffer *)cbre->destination()->deviceBuffer(device);

            dst = (unsigned char *)dst_buf->data();
            break;
        }
        default:
        {
            // dst = host memory location
            ReadWriteBufferRectEvent *rwbre = (ReadWriteBufferRectEvent *)e;

            dst = (unsigned char *)rwbre->ptr();
        }
    }

    // Iterate over the lines to copy and use memcpy
    for (size_t z=0; z<e->region(2); ++z)
    {
        for (size_t y=0; y<e->region(1); ++y)
        {
            unsigned char *s;
            unsigned char *d;

            d = imageData(dst,
                          e->dst_origin(0),
                          y + e->dst_origin(1),
                          z + e->dst_origin(2),
                          e->dst_row_pitch(),
                          e->dst_slice_pitch(),
                          1);

            s = imageData(src,
                          e->src_origin(0),
                          y + e->src_origin(1),
                          z + e->src_origin(2),
                          e->src_row_pitch(),
                          e->src_slice_pitch(),
                          1);

            // Copying an image to a buffer may need to add an offset
            // to the buffer address (its rectangular origin is
            // always (0, 0, 0)).
            if (t == Event::CopyBufferToImage)
            {
                CopyBufferToImageEvent *cptie = (CopyBufferToImageEvent *)e;
                s += cptie->offset();
            }
            else if (t == Event::CopyImageToBuffer)
            {
                CopyImageToBufferEvent *citbe = (CopyImageToBufferEvent *)e;
                d += citbe->offset();
            }

            if (t == Event::WriteBufferRect || t == Event::WriteImage)
                std::memcpy(s, d, e->region(0)); // Write dest (memory) in src
            else
                std::memcpy(d, s, e->region(0)); // Write src (buffer) in dest (memory), or copy the buffers
        }
    }

    break;
}
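
The row-by-row technique used in this worker code can be illustrated outside of Clover. The following is only a minimal sketch under my own assumptions: RectSide, rowStart and copyRect are hypothetical names that do not exist in Clover, and origin[0] is assumed to be already expressed in bytes, as in the code above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical description of one side of a rectangular copy (not a Clover type).
struct RectSide {
    unsigned char *base;     // start of the underlying memory
    std::size_t origin[3];   // x in bytes, y in rows, z in slices
    std::size_t row_pitch;   // bytes between two consecutive rows
    std::size_t slice_pitch; // bytes between two consecutive slices
};

// Address of the first byte of row (y, z) of the rectangle.
static unsigned char *rowStart(const RectSide &s, std::size_t y, std::size_t z)
{
    return s.base + (z + s.origin[2]) * s.slice_pitch
                  + (y + s.origin[1]) * s.row_pitch
                  + s.origin[0];
}

// Copy region[0] bytes per row, region[1] rows per slice and region[2] slices,
// one memcpy per row. The two rectangles are assumed not to overlap.
void copyRect(const RectSide &src, const RectSide &dst,
              const std::size_t region[3])
{
    assert(src.base != nullptr && dst.base != nullptr);

    for (std::size_t z = 0; z < region[2]; ++z)
        for (std::size_t y = 0; y < region[1]; ++y)
            std::memcpy(rowStart(dst, y, z), rowStart(src, y, z), region[0]);
}
```

Only whole rows are contiguous in memory, which is why the sketch issues one memcpy per row instead of a single memcpy for the whole rectangle.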

imageData is a simple function returning the address of a pixel given its coordinates. It currently works only on little-endian architectures. You’ll see that bytes_per_pixel (the last argument of imageData) is always 1 in this code. This is expected: the Event objects have already done the multiplications where needed.

static unsigned char *imageData(unsigned char *base, size_t x, size_t y,
                                size_t z, size_t row_pitch, size_t slice_pitch,
                                unsigned int bytes_per_pixel)
{
    unsigned char *result = base;

    result += (z * slice_pitch) +
              (y * row_pitch) +
              (x * bytes_per_pixel);

    return result;
}

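To see why passing 1 as bytes_per_pixel gives the same address once x is already expressed in bytes, here is a small worked example. It is only a sketch: pixelOffset is a hypothetical helper that repeats the arithmetic of imageData but returns an offset, and the 8x4x2 image with 4 bytes per pixel is an assumption chosen for illustration.

```cpp
#include <cstddef>

// Same arithmetic as imageData, but returning the byte offset from the
// base address instead of a pointer, so the result is easy to inspect.
constexpr std::size_t pixelOffset(std::size_t x, std::size_t y, std::size_t z,
                                  std::size_t row_pitch, std::size_t slice_pitch,
                                  std::size_t bytes_per_pixel)
{
    return z * slice_pitch + y * row_pitch + x * bytes_per_pixel;
}

// Hypothetical tightly packed image: 8 pixels wide, 4 rows, 2 slices.
constexpr std::size_t kBpp = 4;
constexpr std::size_t kRowPitch = 8 * kBpp;        // 32 bytes per row
constexpr std::size_t kSlicePitch = 4 * kRowPitch; // 128 bytes per slice

// Pixel (3, 2, 1) with x given in pixels: 1*128 + 2*32 + 3*4 = 204.
static_assert(pixelOffset(3, 2, 1, kRowPitch, kSlicePitch, kBpp) == 204,
              "offset with x in pixels");

// Same pixel with x already multiplied by the pixel size, as the worker
// code receives it: bytes_per_pixel is then 1 and the offset is unchanged.
static_assert(pixelOffset(3 * kBpp, 2, 1, kRowPitch, kSlicePitch, 1) == 204,
              "offset with x in bytes");
```
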
I’m nearing the end of my project. I don’t know if I will be able to implement all the built-in functions by August 25. I’ll start with the “difficult” ones (barrier(), image reading and writing) in the hope that I will be able to implement the remaining ones after the Summer of Code program. Those remaining ones are fairly simple functions already implemented in many third-party mathematical libraries, so I can simply call them or copy their code.

4 responses to “Boring work for API completeness”

  • nobody

    thanks for your hard work

  • J-P

    is there a bugtracker for clover?
    you’re doing a really great job, but testing is an endless task, especially if Khronos does not make the conformance tests public :-(

    some opencl programs have ‘0’ as 2nd (num_devices) and 3rd (device_list) parameter of clBuildProgram().
    The API says:
    ‘If device_list is NULL value, the program executable is built for all devices associated with program for which a source or binary has been loaded.’
    but it seems that this is an issue for clover, because such programs failed (in my case first at the call to clSetKernelArg(), but changing the params of clBuildProgram() fixed this).

  • steckdenis

    Hello,

    Thanks for the bug report. Clover has no bug tracker yet, but I hope it will get one after the GSoC period, on Freedesktop.org.

    I have just pushed a possible fix for your problem; it was a small mistake. When clBuildProgram is called with num_devices=0, I take the devices given at clCreateProgramWithBinary and use them. I just forgot to do this small replacement at one place in the code.

    The specification doesn’t give much information about what to do when the program was created with clCreateProgramWithSource: fail, or use the whole list of devices used by the context?

    Thanks.

    • J-P

      i think in this case the context devices should be used, as:
      clGetProgramInfo -> CL_PROGRAM_DEVICES:
      ‘Return the list of devices associated with the program object. This can be the devices associated with context on which the program object has been created or can be a subset of devices that are specified when a program object is created using clCreateProgramWithBinary.’
