Optimization Project part 3 (SPO600)

After some selection and benchmarking, I am finally ready to try optimizing a piece of software as part of my Software Optimization and Portability course’s final project.

I noted in my project plan that CityHash, the hash function I chose to optimize, was already fairly optimized. It is built with the -O3 compiler flag, meaning that most of the free compiler optimizations were already being made. The algorithm itself is also very complex and streamlined, such that rewriting it on a basic level is probably beyond my current ability. So my plan was to vectorize the code with SIMD instructions in assembler, or to otherwise use assembler to speed things up.

But after reviewing the code further, I can see some issues with this too. There aren’t many long repetitive loops that would make vectorizing it a profitable venture (speed-wise). Loops are generally tied to string length, so the utility of doing that is also questionable. So here is my revised strategy:

  1. Try various compiler flags beyond -O3
  2. Try profile-guided optimization
  3. Try another compiler
  4. If time is left over, try to speed things up in assembler

 

Compiler Flags

Flag combinations:

-march=native

This flag allows the compiler to use CPU-architecture-specific instructions, meaning the binary will not necessarily run the same on another CPU (unless you compile it again on that machine). This lowered the minimum runtime slightly, though it made the results less consistent. Instead of a minimum time of about 1101ms, runs started to take about 1093ms at minimum. That is only a 0.7% decrease in the best case, and the same speed or slightly worse a lot of the time.
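For each combination, I simply rebuilt and re-ran the benchmark. Roughly like this, though the exact invocation depends on the cityhash Makefile, and bench.cc is a placeholder name for my benchmarking program:

    g++ -g -O3 -march=native -c city.cc
    g++ -g -O3 -march=native bench.cc city.o -o bench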

-Ofast

This one got a similar speedup to -march=native on the high end, but was more consistent. With 1091ms as the minimum time, that's a 1.0% speedup. Still not amazing, but interesting for the relative ease.

This small speedup could be within error, but my strategy for minimizing error is to assume that on a machine with almost nothing running, the shortest time (if consistent) vaguely represents the “true runtime” on that machine under those conditions. This logic might not be perfect, but if a program is running consistently within a very tight range I think this comparison is sane. Larger fluctuations are a different matter.

-Ofast -march=native

Combining the two does not seem to offer twice the speedup. I think I prefer the consistency of just having -Ofast here. -march=native doesn’t seem to add anything (still a 1% speedup, as with -Ofast alone), and just lowers the consistency, occasionally slowing things down by a good 10%.

-Ofast -fno-signed-zeros -fno-trapping-math

I included a few extra floating-point flags in the same spirit as -Ofast. There was no noticeable difference from the original -O3 suggested by Google.

-msse4.2

Remembering this as a suggested option from Google's documentation, I tried combining it with some of the previous flags in different combinations. Once again, no speedup was found.

-frename-registers and -funroll-loops

I tried each of these separately, in combination with -Ofast. Nothing special, once again. I think they depend on the code having a certain shape to make a significant difference.

 

Compiler flag summary:

So far, only the switch from -O3 to -Ofast is to my liking. I will try both versions going forward as I make the rest of my changes. But considering there is only about a 1% change, I’m not ready to call it.

 

Profile-Guided Compiler Optimizations

By setting the compiler flags on cityhash to -fprofile-generate, running the program, and then switching to -fprofile-use, I noticed another slight speedup. The range of speeds I was getting shifted down consistently, from the original 1101ms-1109ms to 1082ms-1090ms. Although it sometimes still took 1090ms+ or even 1100ms, in general there was a consistent speedup. This ~20ms improvement amounts to about 2%. Still not amazing, but I’ll take what I can get.
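For reference, the sequence looked roughly like this (bench.cc and input.txt are placeholder names for my benchmarking program and data file):

1) build instrumented ( g++ -fprofile-generate city.cc bench.cc -o bench )

2) run it once, which writes .gcda profile data ( ./bench < input.txt )

3) rebuild using the profile ( g++ -fprofile-use city.cc bench.cc -o bench )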

Delighted with this change, I felt a little cheeky and tried to do this:

-O3 -fprofile-use

It didn’t work. I can only assume that whatever -O3 does that the profile-guided build left out was left out for a reason. I’m not even sure how the two interact, to be honest. But it was worth a shot. I skipped trying the same cute trick with -Ofast.

 

Using another compiler

I read somewhere that there is a compiler that can optimize more aggressively than gcc: the Intel C++ compiler from Intel Parallel Studio XE. Fortunately, it seems that Seneca College has a floating license, and I was able to get access for use in my school project. The cool thing here is that it has an auto-vectorizer (which, of course, can only kick in if the code is auto-vectorizable). It also has its own -O2 and -O3 capabilities, as well as PGO (which I used gcc for), but I ignored those since I had already done those types of optimizations.

Unfortunately, even after installing and setting everything up, I could not get the compiler to compile anything. The cityhash makefile is quite long and complicated, and I could not get it to use another compiler. I tried compiling manually, but whenever I attempt that without the makefile (even with gcc) it tries to include a file that cannot be located. Here is roughly what I had planned to do, based on this guide (page 7):

1) compile and run once without the auto-vectorizer ( icc -O3 -no-vec city.cc -o city.o )

2) compile and run once with the auto-vectorizer ( icc -O3 -vec-report city.cc -o city.o )

3) compare runtimes

4) repeat with -Ofast, if the Intel compiler allows it

 

Using Assembler

To be honest, I couldn’t get anything to work in this department. My skill in assembler is pretty low, and my attempts to rewrite things in it never succeeded on any level. While this could be possible in theory, I doubt there would be much to gain here. The hash function is already pretty fast.


Package Benchmarks (SPO600 Project Stage 2)

Today was the day I got my finally-installed copy of CityHash to drop some benchmarks and show me how fast it really is. The road has been bumpy, but I’ve made it past each hurdle. It was hard to tell, at times, what types of errors I was getting. But in the end I was able to inch forward past each one.

Since CityHash didn’t seem to have instructions to call a main program, and seemed to be more like a library, I wrote my own program to call CityHash64(), the function I was interested in optimizing. My program initially looked something like this:

[Screenshot: the initial version of my benchmark program]
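The original only survives as a screenshot, but it was essentially the following (file names and details reconstructed from memory, so treat this as an approximation rather than the exact code):

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>
    #include "city.h"

    int main() {
        // Read every line of the test file as a separate string to hash.
        std::ifstream in("input.txt");  // placeholder name for my test data
        std::vector<std::string> lines;
        std::string line;
        while (std::getline(in, line))
            lines.push_back(line);

        // Hash each string and print the result.
        for (const std::string& s : lines)
            std::cout << CityHash64(s.data(), s.size()) << "\n";
        return 0;
    }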

CityHash didn’t come with a plethora of tests, but with a few runs, it was clear that it was producing hashes, and produced the same outputs every time for a fixed input. Initially, I had used a copy-paste file with the full text of Mary Shelley’s Frankenstein (easy to get since it’s in the public domain). However, it didn’t seem to run for very long, so I replaced it with something bigger. With the program now running on the order of ~15 seconds, I was happier with its benchmark-worthiness. The file itself contains many hundreds of thousands of strings of varying lengths.

The tests were run on two systems: a Fedora virtual machine on my personal laptop (x86_64 architecture), and the Aarchie server provided by the school (which uses ARM). I might only end up using one of them if I use assembly language to optimize, but for now I thought I would cover all of my bases.

I ran some initial tests, but I was worried about something. My own program had so much overhead that it was difficult to see what was going on with CityHash; it accounted for such a tiny percentage of the runtime. So I first tried to trim the program down. I got rid of the lines that printed the hashes. Strangely, this made the program finish almost instantly. Putting the printing back, but piping the output instead of writing it to standard out, still brought it down to only 1-2 seconds.

I had difficulty making the input file bigger, so I decided to just print out the lower-order digits of the sum of all the hashes, since I was paranoid that the compiler would optimize everything away if I didn’t use the values. I didn’t want tons of cout calls, since those would have dwarfed the rest of the program and hidden the variance.
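The hashing loop therefore ended up looking something like this (again a reconstruction, replacing the printing loop in the sketch above):

    // Accumulate the hashes so the compiler can't optimize the calls away,
    // then print only the low-order digits of the sum.
    uint64_t sum = 0;
    for (const std::string& s : lines)
        sum += CityHash64(s.data(), s.size());
    std::cout << (sum % 1000000) << "\n";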

Most other programs were kept closed, especially those more likely to have spikes in resource usage. On the remote server this could not be controlled, but in both cases an attempt to control interfering load was made by simply taking many, many trials.

Runtimes:

x86_64 (Fedora VM)

The vast majority of runs took between 1150-1160 ms, with some under or over. Although most ran in this range, only once out of about 20 runs did it go below 1150 ms. I’m confident that this lower bound is somewhat close to the ideal runtime, since the variance seems to be in only one direction. We will then say that the total runtime is about 1155 ms (for the purpose of comparison, later).

ARM (Aarchie server)

The bulk of runs came in around 700ms, but at times there were results in the 750-780 ms range. Based on the timing, I believe the higher cases were when someone else was using the system. I say this because otherwise I would have expected a more even distribution of results; there weren’t many 720-740 ms times, which there would be if this were ordinary variance. I choose to take the lower cluster of results as a rough estimate of the runtime on Aarchie (~700ms), since that suggests the least interference.

Perf
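To gather these, I used the standard perf workflow (commands from memory, so approximate, with bench as the placeholder binary name):

    perf record ./bench     (sample a benchmark run)
    perf report             (view the flat profile)
    perf record -g ./bench  (record again with call-graph data)
    perf report -g          (view the call graph)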

The perf report on x86 typically looked like this:

[Screenshot: perf report on x86_64]

If you look closely, a lot of the time spent in this program is unrelated to CityHash. It was difficult to isolate CityHash’s behaviour just by looking at this, so I will end up paying more attention to the call graph. You can still see CityHash64 among the functions called (about 1-2% of runtime), as well as HashLen0to16 (usually 3-4% of runtime). Perhaps the call graph will illuminate more.

Strangely enough, the perf report on ARM presented the values differently. The x86 report doesn’t seem to show main or _start, which ought to sit at the head of the hierarchy with roughly 96% of the time spent.

[Screenshot: perf report on ARM]

Call Graph:

Below is an example of one of the call graphs, cropped to the relevant parts. Since we don’t really care about the other parts of my program (which just deal with preparing the strings from the input file and such), we can compare the relative sizes of these. I’m not sure why CityHash64 is split into two calls, and HashLen0to16 is not lower in the hierarchy than CityHash64, either. My expectation was that it would be called by CityHash64, not by main. I’m not really sure what to make of this.

[Screenshot: call graph, cropped to the relevant parts]

I was hoping to get a percentage value for CityHash’s share of the benchmarking program’s total runtime, but this seems too messy to resolve right now. So I will stick to the total-runtime approach until it is fixed. Assuming all of the startup and string-related wizardry remains untouched, I am willing to bet that any optimizations I make will directly reduce the total runtime of the program. It will be impossible to tell how much faster I made the hash relative to itself, but hopefully by then I can produce a more accurate percentage value to compare it with.

Building Issues (SPO600 project)

So, having selected Google’s old CityHash package as the focus of my optimization project, I had some issues with actually getting it to build out of the box. First, there were issues with configuring it. After those were solved, I had issues with my make commands: they pointed out all kinds of C++-related errors whenever I tried to build. After investigating gcc version issues (which didn’t end up being the problem), I found a part of the README that offered an alternative with an additional compiler flag, -msse4.2. From what I can tell, this allows the compiler to use SSE (Streaming SIMD Extensions) instructions. This was pretty unexpected for me, considering there is a comment in the code itself saying:

It's probably possible to create even faster hash functions [...] by using SIMD instructions[.]

My guess is that either the SIMD stuff is in another part of the package, or it is done by the compiler rather than hard-coded. I will still take my shot at speeding it up. Now, let’s benchmark this thing.

Software Optimization Project Plan (SPO600 Assignment Stage 1)

Over the next few weeks I will put what I’ve learned in Software Optimization and Portability to the test by attempting to speed up some open-source hashing software. The hashing tool in question is called cityhash, which comes from Google. It hashes strings, and is written in C++. I chose it because it seems to be legitimate and claims decent performance, but still doesn’t seem so streamlined that I am incapable of contributing anything. That may still turn out to be the case, but I think the potential is there.

Benchmarking

I hope to build the project on two systems: my own computer (an x86_64 system) and, for variety, one of the school’s aarch64 servers. Using the test code included in the package as a guide, I will make a large set of strings to increase the runtime. Using the profiling tool perf, I will measure the overall and function-specific runtimes of a specific hash function (probably CityHash32 or CityHash64) and take the average over several runs. The same data set will later be used to check for speedups.

 

Optimization

There are a variety of options to explore when it comes to optimizing this function. The simplest way seems to be messing with compiler options, which I may play with, but the README suggests that they are already aware of the more general compiler optimizations available (such as gcc preferring “-g -O3” over the default “-g -O2”). The actual functions themselves seem too complex for me to significantly rewrite. I also find it likely that using in-line assembler will backfire on me, as I have read that except in very specific cases, it often leads to a slow-down.

Therefore, I have my eyes set on vectorizing the code. That is, grouping similar operations and doing them simultaneously rather than one by one. There are even notes in the code suggesting that no SIMD (Single Instruction Multiple Data) operations are done, so I think it provides a good candidate for improvement.
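To make the idea concrete, here is a toy illustration on x86 (this has nothing to do with cityhash itself; it just shows the concept using SSE2 intrinsics):

    #include <emmintrin.h>  // SSE2 intrinsics

    // Add four 32-bit integers in a single instruction instead of
    // four separate scalar additions.
    void add4(const int* a, const int* b, int* out) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
    }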

The process of doing so may be quite difficult. I am reading up on vectorization, and a conversation with my professor suggested that vectorizing this code may take me down a more complex path than I intended. I am, however, up for the challenge, and will carefully learn about masking and using vector registers. Fortunately, I have until stage 3 to bring myself up to speed. I’m kind of excited to be getting my hands dirty with something like this.

Since the project comes with a testing suite, I will run the tests to make sure nothing is broken. I will also do a cursory check to make sure the outputs are coming out the same. To prove performance increases, I will compare both overall runtime and relative runtime for the function in question, on both systems. I will focus mostly on speed, and not memory or other resource dimensions, though if I stumble onto something of interest I will include it.

In-line assembly Lab (SPO600 Lab 6)

In this lab, we inspect an open source project that uses in-line assembly language in C files. The open source project I have selected is ardour.

This project contains 43 instances of assembly code used in-line (in .c, .h, and .cpp files). In one of these files, behaviors are defined for x86, ia64, s390, i386, and i586 systems. The file uses assembly to get a count of CPU cycles. This is done in a different way for each architecture, and the code decides which version to use with if/else logic. On platforms not included, such as sparc, arm, and m68k, it gives a warning that the project does not support the platform yet.

Example of in-line assembly for x86_64:

[Screenshot: in-line assembly from the ardour source]
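The screenshot aside, the x86_64 cycle-counter read is the classic rdtsc pattern; it looked something like this (a generic reconstruction, not ardour’s exact code):

    // Read the CPU's time-stamp counter using GCC extended inline assembly.
    // rdtsc places the low 32 bits in EAX and the high 32 bits in EDX.
    static inline unsigned long long get_cycles(void) {
        unsigned int lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((unsigned long long)hi << 32) | lo;
    }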

 

Since the information being accessed here is related to the CPU itself, I think it makes sense to use assembly (which depends on the architecture of the CPU). However, I did find one source saying that this particular use is out of date: nowadays there are ways to do this that are built into compilers. If that is true, then perhaps the project is simply old enough that using assembly was useful at the time.

Comparing Open Source Projects (SPO600 Lab 1)

This lab is about comparing two open source projects with different licenses, and comparing the process of contributing to them. I will be comparing the Bootstrap web framework (under the MIT license) and Mozilla Firefox (under the Mozilla Public License), by analyzing one software “patch” from each.

Bootstrap

The issue and patch. From start to finish, 4 people were involved in this patch. It began with someone reporting a bug, and other people verifying that it existed. The person who posted the bug posted a fix, and others commented on it and voiced any issues. One other person changed the code with their own commit. The issue was posted on May 23 and resolved on July 15, a total of 53 days.

 

Mozilla Firefox

This is the bug fix I chose to analyze. From the posting of the bug to its resolution took 7 days in total and involved 3 people: the person working on the fix and two people reviewing it.

According to Mozilla, the official process of contributing is:

  1. Build the project
  2. Find a bug
  3. Fix the bug
  4. Get code reviewed
  5. Respond to the review
  6. Automated testing

In this case, the patch does seem to match these steps. The review itself involved a reviewer pointing out errors and requesting certain changes. A second reviewer then clarified that some of the issues the first reviewer raised were not problems with the code being submitted, but a version problem on the reviewer’s side. The process seems relatively quick, with enough collaboration to catch issues well.

Software Profiling Investigation (SPO600)

Today’s investigation is into the world of software profiling, with the open source profiling tool gprof. Gprof is a tool for Unix systems that analyzes the performance of another program. It does this by having the compiler insert additional code into that program, allowing it to provide a great level of detail in its analysis.

I chose to try gprof on one of my programs from lab 5 of SPO600. The program scales fake “sound samples” (random integers in this case). I prepared the program to run extra samples, for a longer runtime and better profiling; with 1,000,000,000 samples, it takes about 53 seconds to run. I compiled it with the -g -pg -O3 options on the gcc compiler (the -pg flag adds the profiling instrumentation). After compiling, once we run the program, we find an extra file in the directory called gmon.out.
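The whole sequence was roughly as follows (vol.c standing in for my actual file name):

    gcc -g -pg -O3 vol.c -o vol    (build with profiling instrumentation)
    ./vol                          (run the program; this writes gmon.out)
    gprof ./vol gmon.out           (print the profile)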

Running gprof on the executable gave me this:

[Screenshot: initial gprof output, with an empty profile]

Unfortunately, there isn’t any content in the profile. Some searching led me to recompile the program differently to avoid a bug in gcc. Now one new line shows up:

[Screenshot: gprof output showing a single line, for main]

The bulk of this output is an explanation of what each stat means.

The name is main, the function being profiled on this line. It seems that you can look at each function individually and see its impact. Since there is only one function defined in this program, it takes up 100% of the time. And since no functions are called inside main, self seconds and cumulative seconds are the same. The other categories (involving the function being called) are left empty, perhaps because main is a special function and doesn’t get profiled in this way.

Algorithm Selection (SPO600 Lab 5)

In this lab we take a short C program that scales a set of random int16_t values, representing the amplitude of sound waves for audio encoding, by multiplying them by 0.75 (lowering the volume). This type of operation occurs often in audio work, since the computer must do these calculations in the background.

We begin with the naive algorithm: taking each value and multiplying it by 0.75 using floating-point arithmetic. Performing this over 10,000,000 samples on the aarch64 server supplied by our school gave our entire program a runtime of 514ms. Subtracting the setup, teardown, and unrelated code (which took 496ms to run), we get 18ms of runtime for the operations we care about.
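In outline, the naive version is just this (a sketch with invented names, not the lab’s exact code):

    #include <cstdint>
    #include <cstddef>

    // Naive scaling: one floating-point multiply per sample.
    void scale_naive(int16_t* samples, size_t count) {
        for (size_t i = 0; i < count; i++)
            samples[i] = static_cast<int16_t>(samples[i] * 0.75);
    }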

 

Let’s look at two alternative algorithms to see if we can speed things up:

 

Alternative Algorithm 1: Pre-scaled array

Since we are performing the same operation many times, and since we are likely to get repeat inputs and outputs, perhaps it would save time to do each possible calculation once (about 65,000 calculations, one for each possible value of an int16_t). This way, instead of doing a new calculation, we can just look up the output value that we already know to be correct.
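A sketch of the lookup-table approach (my actual lab code differed in its details):

    #include <cstdint>

    // Precompute value * 0.75 for every possible int16_t.
    // Index 0 corresponds to -32768, so lookups apply an offset of 32768.
    static int16_t table[65536];

    void build_table() {
        for (int i = 0; i < 65536; i++)
            table[i] = static_cast<int16_t>((i - 32768) * 0.75);
    }

    int16_t scale_lookup(int16_t s) {
        return table[s + 32768];  // the offset avoids a negative array index
    }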

Unfortunately, this did not result in a speedup. It actually slowed things down to a total of 42ms (vs the original 18). I was expecting it to be faster, and I think it would have been, except for the problem of our int16_t values being signed. The adding and subtracting of an offset to handle the negative portion (since you cannot use a negative index into an array) may have slowed down my implementation here. This method also requires quite a bit of memory overhead (an array storing around 65,000 16-bit integers). After 10 million samples seemed slower, I tested with 1 billion samples and found that it was still slower; the payoff didn’t get better in the long game, either.

 

Alternative Algorithm 2: Fixed-point arithmetic

Instead of using floating-point arithmetic to multiply by 0.75, we can use regular integer multiplication and bit-shifting to get the same result. Here, my algorithm multiplied each input sample by 6, then shifted the result right by 3 bits (truncating the last 3 binary digits, i.e., dividing by 2 three times). The end result is an integer value that is 6/8 = 3/4 of the original.
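As a sketch (names invented here):

    #include <cstdint>

    // Fixed-point scaling: multiply by 6, then shift right 3 bits (divide by 8).
    // 6/8 == 0.75, so this approximates s * 0.75 using only integer math.
    int16_t scale_fixed(int16_t s) {
        return static_cast<int16_t>((s * 6) >> 3);
    }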

Now, this algorithm has given us a significant speedup (at least, around the specific operations being measured) and is the best of the three. At 13ms compared to the naive algorithm’s 18ms, we are about 28% faster, with none of the memory overhead of the pre-calculated array.

 

 

Are they fast enough to process CD data? At 44,100 samples per second and 2 channels, a CD needs 88,200 samples processed per second; since even the slowest algorithm here handled 10 million samples in tens of milliseconds, each of them is more than sufficient. I will note, however, that we should use the best algorithm regardless, since it won’t be the only program running on the machine.

 

Further Optimizations

As stated before, I think treating the values as unsigned integers may speed things up, especially in the pre-calculated array algorithm. The repeated adding and subtracting of an offset is wasted work that could be avoided.

Building a software package (SPO600 Lab 2)

Part 1

In this lab we download and build an open source software package and get it running on our own machine. The following are the steps and comments on the process.

  1. I chose nano as the software package to install, expecting it to be relatively small and straightforward to set up. After a bit of searching, I downloaded the compressed source code from the downloads page of the nano website.
  2. Since nano is meant for use on Linux systems, I decided to build it on Seneca’s Matrix server remotely using PuTTY. I tried to locate the makefile in order to build the program, but was confused by the various files with names like Makefile.am and Makefile.in.
  3. After seeking some professor assistance, I discovered that despite all the files with “makefile” in their names, none of them was quite what I was looking for. It was necessary to run a configure command to let the package recognize what environment it was in and create a Makefile accordingly.
  4. New problem: permissions. It turns out that unzipping the file on my Windows machine and then dropping it into my Matrix account broke the file permissions. After unzipping it on Matrix instead…
  5. > ./configure
  6. > make
  7. Success!

An unexpectedly bumpy journey. Lessons learned:

  • Makefiles don’t come automatically. It may take some digging/configuring to get it to appear.
  • Be careful with unzipping.