After some selection and benchmarking, I am finally ready to try optimizing a piece of software as part of my Software Optimization and Portability course’s final project.
I noted in my project plan that CityHash, the hash function I chose to optimize, was already fairly optimized. It used O3 compiler flags, meaning that most of the free compiler optimizations were already made. The algorithm itself is also very complex and streamlined, such that rewriting it on a basic level is probably beyond my current ability. So my choice was to try and use SIMD instructions to vectorize the code, using assembler, or otherwise using assembler to speed things up.
But after reviewing the code further, I can see some issues with this too. There aren’t many long repetitive loops that would make vectorizing it a profitable venture (speed-wise). Loops are generally tied to string length, so the utility of doing that is also questionable. So here is my revised strategy:
- Try various compiler flags beyond O3
- Profile Guided Optimization
- Using another compiler
- If time is left over, try to speed things up in assembler
Compiler Flags
Flag combinations:
-march=native
This flag allows the compiler to use CPU architecture specific commands, mening it will not necessarily run the same on another CPU (unless you compile it again on that machine). This lowered the minimum time taken to run slightly, though made it less consistent. Instead of taking a minimum time of about 1101ms, it started to take about 1093ms as a minimum time. This is only a 0.7% decreased time to run in the best case, and the same speed or slightly worse a lot of the time.
-Ofast
This one got a similar speedup to -march=native on the high end, but is more consistent. With 1091ms as the minimum time, thats a 1.0% speedup. Still not amazing, but interesting for the relative ease.
This small speedup could be within error, but my strategy for minimizing error is to assume that on a machine with almost nothing running, the shortest time (if consistent) vaguely represents the “true runtime” on that machine under those conditions. This logic might not be perfect, but if a program is running consistently within a very tight duration I think this comparison is sane. Larger fluctuations are a different matter.
-Ofast -march=native
Combining the two does not seem to offer twice the speedup. I think I prefer the consistency of just having -Ofast here. -march=native doesn’t seem to add anything (still 1% speedup like in -Ofast), and just lowers the consistency, occasionally slowing it down by a good 10%.
-Ofast -fno-signed-zeros -fno-trapping-math
I included a few extra flags that are similar to -Ofast but not within -Ofast itself. There was no noticeable difference from the original -O3 suggested by Google.
–msse4.2
Remembering this as a suggested option from googles documentation, I tried to combine this and some of the previous flags in different combinations. Once again, no speedup was found.
-frename-registers and f-unroll-loops
I tried these separately, combined with -Ofast. Nothing special, once again. I think they depend on the code being a certain way to make a significant difference.
Compiler flag summary:
So far, just the -O3 vs -Ofast flag is to my liking. I will try both versions going forward as I make the rest of my changes. But considering there is only about a 1% change, I’m not ready to call it.
Profile-Guided Compiler Optimizations
By setting the compiler flags on cityhash to be -fprofile-generate, running the program, and then switching it to -fprofile-use, I noticed another slight speedup. The range of speeds I was getting was consistently shifted down from the original 1101ms-1109ms down to 1082ms-1090ms. Although somtimes it still took 1090ms+ or even 1100ms, in general there was a consistent speedup. This ~20ms speedup amounts to about 2%. Still not amazing, but I’ll take what I can get.
Delighted with this change, I felt a little cheeky and tried to do this:
-O3 -fprofile-use
It didn’t work. I can only assume that the parts of O3 does that the profile didn’t want were left out for a reason. I’m not even sure how they interact, to be honest. But it was worth a shot. I skipped trying the same cute trick with -Ofast.
Using another compiler
I read somewhere that there is a compiler that can optimize more aggressively than gcc. This is the Intel C++ compiler from Intel Parallel Studio XE. Fortunately, it seems that Seneca College has a floating license, and I was able to get access for use in my school project. The cool thing here is that there is an auto-vectorizer (which of course, can only kick in if the code is already auto-vectorizable). It also has its own O2 and O3 capabilities, as well as PGO (which I used gcc for), but I ignored those since I already did those types of optimizations.
Unfortunately, even after installing and setting everything up, I could not get the compiler to compile anything. The cityhash makefile was quite long and complicated, and getting it to use another compiler was impossible. I tried to compile it manually myself, but whenever I attempt that without the makefile (even for gcc) it wants to import a file that cannot be located. Here is roughly what I had planned to do, based on this guide (page 7):
1) compile and run once with auto-vectorizer ( icc -O3 -no-vec city.cc -o city.o )
2) compiler and run once without auto-vectorizer ( icc -O3 -vec-report -o city.o )
3) compare runtimes
4) repeat with -Ofast if the intel compiler allows it
Using Assembler
To be honest, I couldn’t get anything to work in this department. My skill in assembler is is pretty low and my attempts to rewrite things in the language never succeeded on any level. Though while this could theoretically be possible, I doubt there would be much to gain here. The hash function is already pretty fast.