STL + Clang + LLVM, the pieces are coming together

Hello,

Yes, the pieces are coming together. Today, a new word appeared in the list of supported features of Clover : kernels. Clover can now extract them from programs and do some interesting things with them, for instance listing their parameters (and types). It’s for the moment the only  “big” thing it can do, but there is more to come.

Implementing the kernels was a nice experience. I fact, the current days of Clover development are very exciting. I have to use many technologies and projects, like Clang, LLVM, the STL, and event Check (Clover’s unit testing framework). All is very nice and well done, and pretty diversified. It goes from the thousand-classes LLVM to the “small” STL doing just what it needs to.

Speaking about the STL, I have a question for the few C++-experts reading this blog. The OpenCL specification says that: “Kernel objects are not created for any __kernel functions in program that do not have the same function definition across all devices for which a program executable has been successfully built.” This means that I need to keep a set of the kernels available in all the “per-device sub-programs” (the function definition is verified after).

Doing that is like a SELECT … INNER JOINT in SQL. We can also express it with the following pseudocode :

kernels is a set

for (subprogram in subprograms)
    if (kernels.empty)
        kernels = subprogram.kernels
    else
        kernels &= subprogram.kernels // This line

I’m currently doing this in C++ using std::set, but I’m not comfortable with my current code :

    std::set<std::string> kernels_set;

    for (int i=0; i<p_device_dependent.size(); ++i)
    {
        DeviceDependent &dep = p_device_dependent.at(i);
        std::vector<llvm::Function *> kernels = kernelFunctions(dep);
        std::set<std::string> set;

        // Add the kernels in the set
        for (int j=0; j<kernels.size(); ++j)
        {
            llvm::Function *func = kernels.at(j);
            set.insert(func->getNameStr());
        }

        // intersection of the sets
        if (kernels_set.empty())
        {
            kernels_set = set;
        }
        else
        {
            std::set<std::string> inter;

            set_intersection(set.begin(), set.end(), kernels_set.begin(),
                             kernels_set.end(),
                             std::inserter(inter, inter.begin()));

            kernels_set = inter;
        }
    }

The problem is that I don’t know if it’s the best way to do that (or a sufficiently good one). Qt has a good function to do this without a temporary set, but the STL lacks one. I don’t know if I can use the same std::set in the source parts of set_intersection and as the result set.

Except that, the code should not be too ugly, and it works fairly well. I thank Zack Rusin for having added unit tests in Clover (and I’m happy to have continued to implement them), it’s very pleasant to type “make test”, to see that all goes well, and to think “my code is ok :) “.

About these ads

7 responses to “STL + Clang + LLVM, the pieces are coming together

  • Anonymous

    You have to iterate over the vector anyway, so rather than making a temporary set, just filter the vector, removing any elements that don’t appear in the set.

  • Gustaw Smolarczyk

    1. There is a possible optimization. If you iterate over all elements of a vector (a_vector being e.g. p_device_dependent) using something like this:

    for (int i = 0; i < a_vector.size(); ++i) {
    // …
    }

    then you shouldn't use a_vector.at(i) inside a loop. std::vector::at(size_t position) function does range-checking, i.e. it checks every time whether position isn’t out of bounts. It’s not needed, because the loop always iterates over valid positions in a vector (unless you change the size of the vector inside the loop). Use just a_vector[i] instead, which is faster (decodes to less instructions and prevents any branch mispredictions).

    Additionally, you can store the size of the vector in a constant variable. Depending on particular compiler (I didn’t test any, however) it can result in more efficient code (it’s actually used troughout llvm code base, which makes heavy use of STL). Something like that:

    for (int i = 0, Size = a_vector.size(); i < Size; ++i) {
    // …
    }

    You can also declare "i" and "Size" as unsigned (or even size_t).

    2. If a function returns a container (i.e. std::vector) then try not to make an explicit copy of it for nothing. For example, you can change this line:

    std::vector kernels = kernelFunctions(dep);

    into:

    const std::vector &kernels = kernelFunctions(dep);

    This construct “catches” a copy of a vector that kernelFunctions() returns into a constant reference and prevents making an unneeded copy of an object, greatly increasing performance (well, you can use C++0x’s r-value reference here, but just don’t mess with unrealesed C++ standard now :P).

    3. std::set_intersection() function ([1]) expects both containers (set and kernels_set) to be already sorted. I think it’s not forced anywhere so your code is invalid. You should sort set and kernels_set (well, kernels_set should be already sorted, because set_intersection() returns sorted sequence) somewhere. Just sorting set before comment

    // intersection of the sets

    should suffice.

    4. If I understand llvm codebase correctly ([2]), you should use llvm::Value::getName instead of llvm::Value::getNameStr (this one returns llvm::StringRef instead of std::string), because it prevents making a copy of a string.

    In the end, I don’t exactly understand, what do you try to accomplish. The fragment of OpenCL specification you cited says that the kernel objects should not be created for any functions that have not the same definition accross all devices. However, you don’t really check the definitions at all – you only check the names. I’m not sure if I understand correctly the spec here (or llvm). Please, clarify a bit if you can.

    I tried to help as much as I could.

    [1] http://www.cplusplus.com/reference/algorithm/set_intersection/
    [2] http://llvm.org/doxygen/classllvm_1_1Value.html#a3c095aec2b1245bb8092e0bf63ae8b64

  • Nicolai Hähnle

    I can’t answer your specific question about std::set because I would just have to dive into the documentation like you would have to.

    However, your pseudo-code and your C++-code looks suspiciously like there is a bug. I don’t know much about OpenCL and those kernels, but it sounds to me like you are trying to compute the intersection of a number of sets, and you’re doing that wrong. What you need to do is:

    kernels = subprograms[0].kernels;

    for (j = 1..subprograms.size()-1) { kernels &= subprograms[j].kernels; }

    where &= is taking the intersection.

    Otherwise, suppose that you have three subprograms, and the intersection of the first two sets is empty. Then the resulting kernels set that your code computes will in fact be all kernels from the last subprogram, even though the total intersection is empty.

  • M.Pomme

    Hello Denis,

    I’m not an expert in STL, however…

    - Is it possible that two “per-device sub-programs” have no common kernel ? In that case there would be a bug (even in the pseudo code), because kernel_set might be empty at any time (not just at the first iteration). There are many ways to fix it, the “mathematical” way would be to use the union of all the sets as initial value, but of course using just the first one would be fine.

    - you might be better off set.erase()-ing all the kernels you don’t set.find() in the other set rather than using set_intersect. You might need to iterate in reverse to do that safely.

  • steckdenis

    Hello,

    Thanks to all for your very interesting comments ! The bug of kernels_set being empty at any time is a good catch, I will fix that.

    Iterating through the set and removing the elements not found in the other is also a good idea. In fact, it’s what Qt does, and what the STL snippet given in the C++ documentation does.

    Regarding sorting, a std::set is always sorted (it’s one of its difference with std::vector).

    I think I will take a look at the Qt code. My idea is to iterate through the items and to remove the ones that aren’t in the other set, but with a small optimization given that the sets are ordered.

    The comments of Gustaw Smolarczyk are very interesting and I will take them into account. I’ve planned to cleanup my code a bit (the work I did at the beginning of the coding period is a bit too C-ish). For the reference (avoiding a copy), the std::vector is built into the function as a local variable. I don’t know if I can return a reference to a local variable and having that working well. In Qt, these kind of copies have nearly no cost because Qt uses copy-on-write for all its containers.

    Thanks.

  • Gustaw Smolarczyk

    Yes, Qt uses COW for every container, which helps a lot. However, my snippet:

    const std::vector &kernels = kernelFunctions(dep);

    does not make function kernelFunctions() return a reference to local. The function still returns a temporary object. Temporary objects in C++ are objects returned by a function (not by pointer/reference). They are automatically destroyed at the end of line (well, statement) which called the function. In your code, the temporary is returned, then it is copied to a brand-new vector, and after that the temporary is destroyed.

    The construct above is called “catching” a temporary into a constant reference. The temporary is given a name, therefore it’s not destroyed until the name is valid (i.e. till the end of scope).

    And about sorting std::set’s – my bad, I forgot how set/map works. I was having a hash (unordered_set) in my head and missed this implementation detail.

    But my last question is still here – where do you check whether the functions have the same definition?

    P.S. WordPress didn’t show inequality signs around template instantiations. I hope it was readable.

  • steckdenis

    Hello,

    Ok, I will add the references. Thanks for the very good explanation :) .

    The implementation of this part of the spec must be done in two steps :

    The first one is the one I speak about in this blog post : I must find the common subset of kernels available on all devices, without signature checking.

    Once I have all the common kernels, I can run this code on them :

    for (it = kernels_set.begin(); it != kernels_set.end(); ++it)
    {
    cl_int result = CL_SUCCESS;
    Kernel *kernel = createKernel(*it, &result);

    if (result == CL_SUCCESS)
    {
    rs.push_back(kernel);
    }
    else
    {
    delete kernel;
    }
    }

    Each call to createKernel takes a kernel name and check that its signature is the same on all devices.

    But now that I re-read my code, I see that createKernel also verify that the name is available on all devices, so I did the check two times. I think I can greatly improve my code by removing all the stuff regarding kernels_set and simply call createKernel on every name found for the first device. If createKernel fails, it means that either the name is not found on another device or that the signature is not good. So, it’s better and yields the same result.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

%d bloggers like this: