Is there any benefit to replacing STL containers/algorithms with hand-rolled ones? In particular, my program includes a very large priority queue (currently a std::priority_queue) whose manipulation is taking a lot of total time. Is this something worth looking into, or is the STL implementation already likely the fastest possible?
I assume you’re aware that the STL containers rely on copying the elements. In certain cases, this can be a significant loss. Store pointers and you may see an increase in performance if you do a lot of container manipulation. On the other hand, it may reduce cache locality and hurt you. Another option is to use specialized allocators.
Certain containers (e.g. list) rely on lots of pointer manipulation. Although counterintuitive, replacing them with vector can often lead to faster code. The resulting algorithm might go from O(log n) to O(n), but due to cache locality it can be much faster in practice. Profile to be sure.
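As a rough illustration of the vector-instead-of-node-based-container idea, here is a minimal sketch of a sorted vector. The helper names sorted_insert and sorted_contains are made up for this example; whether this beats a list or tree in your program is something only a profiler can tell you.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inserts `value` into an already sorted vector, keeping it sorted.
// Finding the position is O(log n); shifting the tail is O(n), but the
// shift runs over contiguous memory, which caches handle well.
inline void sorted_insert(std::vector<int>& v, int value) {
    auto pos = std::lower_bound(v.begin(), v.end(), value);
    v.insert(pos, value);
}

// Binary search over contiguous storage: O(log n), no pointer chasing.
inline bool sorted_contains(const std::vector<int>& v, int value) {
    return std::binary_search(v.begin(), v.end(), value);
}
```

Calling sorted_insert with 5, 1, 3 on an empty vector leaves it holding 1, 3, 5.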
You mentioned you’re using priority_queue, which I would imagine pays a lot for rearranging the elements, especially if they’re large. You can try switching the underlying container (maybe deque or something specialized). I’d almost certainly store pointers; again, profile to be sure.
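A sketch of the store-pointers suggestion, assuming a hypothetical element type Job and a hypothetical comparator JobPtrLess (neither comes from the question). The heap then shuffles pointer-sized values instead of whole objects, at the cost of an extra indirection on every comparison.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

// Hypothetical large element type; only `priority` matters for ordering.
struct Job {
    int priority;
    std::string payload;  // stands in for data that is expensive to move around
};

// Orders pointers by the priority of the Job they point to (max-heap).
struct JobPtrLess {
    bool operator()(const Job* a, const Job* b) const {
        return a->priority < b->priority;
    }
};

using JobQueue =
    std::priority_queue<const Job*, std::vector<const Job*>, JobPtrLess>;

// Builds a queue of pointers into `jobs`. The caller must keep `jobs` alive
// and must not resize it while the queue is in use.
inline JobQueue make_queue(const std::vector<Job>& jobs) {
    JobQueue q;
    for (const Job& j : jobs) q.push(&j);
    return q;
}
```

The second template argument is where a different underlying container (std::deque, for instance) would be swapped in.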
Along similar lines, for std::vectors whose needed sizes are unknown but have a reasonably small upper bound, is it profitable to replace them with statically-allocated arrays?
Again, this may help a small amount, depending on the use case. You can avoid the heap allocation, but only if you don’t need your array to outlive the stack frame… or you could reserve() the size in the vector so there is less copying on reallocation.
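To make the effect of reserve() visible, the following sketch counts how often the vector's buffer address changes while elements are pushed. The function count_reallocations is a made-up helper for this illustration, not a standard facility.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns how many times the vector's buffer address changed while
// pushing `n` elements, with or without an up-front reserve().
inline std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);  // one allocation, done before the loop
    std::size_t moves = 0;
    const int* last = v.data();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.data() != last) {  // buffer address changed: a reallocation
            ++moves;
            last = v.data();
        }
    }
    return moves;
}
```

With reserve_first set, the count stays at zero for any n; without it, the exact count depends on the implementation's growth factor.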
I’ve found that dynamic memory allocation is often a severe bottleneck, and that eliminating it can lead to significant speedups. As a consequence I’m interested in the performance tradeoffs of returning large temporary data structures by value vs. returning by pointer vs. passing the result in by reference. Is there a way to reliably determine whether or not the compiler will use RVO for a given method (assuming the caller doesn’t need to modify the result, of course)?
You could look at the generated assembly to see if RVO is applied, but if you return a pointer or reference, you can be sure there’s no copy. Whether this will help depends on what you’re doing; for example, you can’t return references to temporaries. You can use arenas to allocate and reuse objects, so as not to pay a large heap penalty.
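A small sketch contrasting return-by-value with passing the result in by reference. The function names make_squares and fill_squares are invented for the example. The by-reference version lets the caller reuse one buffer across calls, which sidesteps the allocation rather than the copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns by value: simple to call; the compiler may elide the copy,
// and otherwise the vector is moved rather than copied (C++11 and later).
inline std::vector<double> make_squares(std::size_t n) {
    std::vector<double> result(n);
    for (std::size_t i = 0; i < n; ++i) {
        result[i] = static_cast<double>(i) * static_cast<double>(i);
    }
    return result;
}

// Fills a caller-provided buffer. clear() keeps the existing capacity,
// so repeated calls with sizes up to that capacity do not allocate.
inline void fill_squares(std::size_t n, std::vector<double>& out) {
    out.clear();
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(static_cast<double>(i) * static_cast<double>(i));
    }
}
```

Calling fill_squares in a loop with the same buffer pays for the heap allocation only when the requested size exceeds anything seen before.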
How cache-aware do compilers tend to be? For example, is it worth looking into reordering nested loops?
I’ve seen dramatic (seriously dramatic) speedups in this realm. I saw more improvements from this than I later saw from multithreading my code. Things may have changed in the five years since – only one way to be sure – profile.
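For a concrete picture of loop reordering, here is a sketch that sums a matrix stored row-major in a flat vector, once with each loop order. The layout and function names are assumptions for the example. Both functions compute the same result; only the memory access pattern differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// `m` holds a rows x cols matrix in row-major order: element (r, c)
// lives at index r * cols + c.

// Inner loop walks consecutive addresses: cache friendly.
inline long long sum_rows_outer(const std::vector<int>& m,
                                std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Inner loop jumps `cols` elements per step: same result, but each
// access may touch a different cache line when the matrix is large.
inline long long sum_cols_outer(const std::vector<int>& m,
                                std::size_t rows, std::size_t cols) {
    long long total = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Timing the two on a matrix much larger than the cache is a quick way to check how much this matters on your hardware and compiler.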
On the flip side, are there any techniques I should avoid because they are likely to interfere with the compiler’s ability to automatically optimize code?
Use explicit on your single-argument constructors. Temporary object construction and destruction may be hidden in your code.
Be aware of hidden copy constructor calls on large objects. In some cases, consider replacing with pointers.
Profile, profile, profile. Tune areas that are bottlenecks.
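To illustrate the points about explicit constructors and hidden copies, here is a sketch using a hypothetical Buffer type with a copy counter (g_buffer_copies exists only for demonstration). Passing by value silently copies the object; passing by const reference does not.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

int g_buffer_copies = 0;  // counts copy-constructor calls, for demonstration only

// Hypothetical large object.
struct Buffer {
    std::vector<char> bytes;
    // `explicit` stops a bare size from silently becoming a temporary Buffer.
    explicit Buffer(std::size_t n) : bytes(n) {}
    Buffer(const Buffer& other) : bytes(other.bytes) { ++g_buffer_copies; }
};

// Because the constructor is explicit, std::size_t does not convert implicitly.
static_assert(!std::is_convertible<std::size_t, Buffer>::value,
              "a size must not turn into a Buffer behind your back");

// Takes a copy of its argument when called with an lvalue.
inline std::size_t size_by_value(Buffer b) { return b.bytes.size(); }

// Takes no copy.
inline std::size_t size_by_ref(const Buffer& b) { return b.bytes.size(); }
```

With the explicit keyword removed, a call such as size_by_ref(16) would compile and build a temporary 16-byte Buffer just for that call.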