Today was the day I got my finally-installed copy of CityHash to drop some benchmarks and show me how fast it really is. The road has been bumpy, but I’ve made it past each hurdle. It was hard to tell, at times, what type of errors I was getting. But in the end I was able to inch forward past each one.
Since CityHash didn’t seem to have instructions to call a main program, and seemed to be more like a library, I wrote my own program to call CityHash64(), the function I was interested in optimizing. My program initially looked something like this:
CityHash didn’t come with a plethora of tests, but with a few runs, it was clear that it was producing hashes, and produced the same outputs every time for a fixed input. Initially, I had used a copy-paste file with the full text of Mary Shelley’s Frankenstein (easy to get since it’s in the public domain). However, it didn’t seem to run for very long, so I replaced it with something bigger. Now running on the order of ~15 seconds, I was happier with the benchmark-worthiness. The file itself contains many hundreds of thousands of strings of varying lengths.
The tests were run on two systems. A Fedora virtual machine on my personal laptop (with x86_64 architecture), and the Aarchie server provided by the school (which uses ARM). I might only end up using one of them if I use assembly language to optimize, but for now I thought I would cover all of my bases.
I ran some initial tests, but I was worried about something. My own program had so much downtime that it was difficult to see what was going with CityHash. It had such a tiny percentage of the runtime. So I first tried to trim it down. I got rid of the lines that printed the hashes. Strangely, this made my hash function super fast and finished instantly. Putting the line printing back, but piping it instead of using standard out made it still take only 1-2 seconds.
I had difficulty trying to make the file bigger, so I decided to just print out the lower order digits of the sum of all the hashes, since I was paranoid that the compiler would optimize everything away if I didn’t use the values. I didn’t want tons of cout calls since that would have dwarfed the rest of the program and hid the variance.
Most other programs were kept closed, especially those more likely to have spikes in resource usages. On the remote server, this was not something that could be controlled, but in both cases, an attempt to control interfering load was made by simply taking many, many trials.
Runtimes:
x86_64 (Fedora VM)
The vast majority of runs ran between 1150-1160 ms. There were some under or over. Although most ran in this range, only once out of about 20 runs did it go below 1150 ms. I’m confident that this lower bound is somewhat close to our ideal runtime, since the variance seems to be in only one direction. We will then say that the total runtime is about ~1155 ms (for the purpose of comparison, later).
ARM (Aarchie server)
The bulk of runs ran around 700ms, but at times there were some results in the 750-780 ms ranges. I believe based on timing, the higher cases were when someone else was trying to use the system. I say this because otherwise, I would have a expected a more even distribution of the results. There weren’t many 720 – 740 ms times, which would be expected if it were up to normal variance. I choose to take the smaller cluster of results as a rough estimate of the runtime on Aarchie (~700ms), since that suggest the least interference.
Perf
The perf report on x86 typically looked like this:
If you look closely, a lot of the time spent on this program is unrelated to CityHash. It was difficult to isolate CityHash behaviour just looking at this, so I will end up paying more attention to the graph. You can still see CityHash among the functions called (runtime about 1-2%), as well as HashLen0to16 (runtime usually 3-4%). Perhaps the call graph will illuminate more.
Strangely enough, the perf report on ARM calculated the values differently. The x86 doesn’t seem to show main or _start, which would be the proper head of the hierarchy at 96% time spent.
Call Graph:
Below is an example of one of the call-graphs, cropped to the relevant parts. Since We don’t really care about the other parts of my program (which just deals with preparing the strings from the input file and such) we can compare relative sizes of these. I’m not sure why CityHash64 is split into two calls, but HashLen0to16 is not lower in the hierarchy than CityHash64, either. My expectation is that it should be called by CityHash, and not by main. I’m not really sure what to make of this.
I was hoping to get a percentage value for the runtime of CityHash out of the total runtime for benchmarking program, but this seems too messy to resolve right now. So, I will stick to the total-runtime approach until it is fixed. Assuming all of the startup and string-related wizardry remains untouched, I am willing to bet that any optimizations I make will directly reduce the total runtime of the program. It will be impossible to tell how much faster I made the hash relative to itself, but hopefully by then I can produce a more accurate percentage value to compare it with.