Over the next few weeks I will put what I’ve learned in Software Optimization and Portability to the test by attempting to speed up some open-source hashing software. The hashing tool in question is called cityhash, which seems to come from Google.It hashes strings, and is written in C++. I chose it because it seems to be legitimate, and claims decent performance, but still doesn’t seem over-streamlined to the point where I am not capable of contributing anything. That may still be the case in the end, but I think the potential is there.
I hope to build the project on two systems. My own computer (an x86_64 system) and one of the school servers of the aarch64 architecture for variety. Using the test code included in the package as a guide, I will make a large set of strings to increase the runtime. Using the profiling tool perf, I will measure the overall and function-specific run-times of a specific hash function (probably CityHash32 or CityHash64) and take the average over several runs. The same data set will later be used to check for speedups.
There are a variety of options to explore when it comes to optimizing this function. The simplest way seems to be messing with compiler options, which I may play with, but the README suggests that they are already aware of the more general compiler optimizations available (such as gcc preferring “-g -O3” over the default “-g -O2”). The actual functions themselves seem too complex for me to significantly rewrite. I also find it likely that using in-line assembler will backfire on me, as I have read that except in very specific cases, it often leads to a slow-down.
Therefore, I have my eyes set on vectorizing the code. That is, to group similar operations and do them simultaneously rather than just one by one. There are even notes in the code that suggest that no SIMD (Single Instruction Multiple Data) operations are done, so I think provides a good candidate for improvement.
There process of doing so may be quite difficult. I am reading up on vectorization, and a conversation with my professor suggested that process of vectorizing this code may take me on a bit more complex path than I intended. I am however up for the challenge, and will carefully learn about masking and using vector registers. Fortunately, I have until stage 3 to bring myself up to speed. I’m kind of excited to be getting my hands dirty with something like this.
Since the project comes with a testing suite, I will run the tests to make sure nothing is broken. I will also do a cursory check to make sure the outputs are coming out the same. To prove performance increases, I will compare both overall runtime and relative runtime for the function in question, on both systems. I will focus mostly on speed, and not memory or other resource dimensions, though if I stumble onto something of interest I will include it.