Optimization Project part 3 (SPO600)

After some selection and benchmarking, I am finally ready to try optimizing a piece of software as part of the final project for my Software Portability and Optimization course.

I noted in my project plan that CityHash, the hash function I chose to optimize, was already fairly well optimized. It is built with the -O3 compiler flag, meaning that most of the free compiler optimizations were already in place. The algorithm itself is also very dense and streamlined, such that rewriting it at a fundamental level is probably beyond my current ability. So my plan was to vectorize the code with SIMD instructions in assembler, or otherwise use assembler to speed things up.

But after reviewing the code further, I can see some issues with this too. There aren't many long, repetitive loops that would make vectorization a profitable venture (speed-wise), and the loops that do exist are generally tied to string length, so the value of hand-vectorizing is questionable. So here is my revised strategy:

  1. Try various compiler flags beyond -O3
  2. Use profile-guided optimization
  3. Try another compiler
  4. If time is left over, try to speed things up in assembler

 

Compiler Flags

Flag combinations:
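Each combination below was tested the same way: swap the flags into the build's CXXFLAGS, rebuild, and then re-run the benchmark driver I wrote for stage 2 against the fresh build. Roughly (a sketch; the exact configure line is mine):

./configure CXXFLAGS="-O3 -march=native"
make clean && make
(then rebuild my benchmark driver against the new library and time it)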

-march=native

This flag allows the compiler to use instructions specific to the host CPU, meaning the binary will not necessarily run the same on another CPU (unless you recompile it on that machine). This lowered the minimum runtime slightly, though it made the runs less consistent: instead of bottoming out at about 1101 ms, the minimum dropped to about 1093 ms. That is only a 0.7% reduction in the best case, and the same speed or slightly worse much of the time.

-Ofast

This one got a similar speedup to -march=native on the high end, but is more consistent. With 1091 ms as the minimum time, that's a 1.0% speedup. Still not amazing, but interesting for the relative ease.

This small speedup could be within measurement error, but my strategy for minimizing error is to assume that on a machine with almost nothing else running, the shortest time (if it recurs consistently) roughly represents the "true runtime" on that machine under those conditions. The logic isn't perfect, but if a program consistently finishes within a very tight window, I think the comparison is sane. Larger fluctuations are a different matter.

-Ofast -march=native

Combining the two does not seem to offer twice the speedup. I think I prefer the consistency of plain -Ofast here: -march=native doesn't seem to add anything (still the ~1% speedup seen with -Ofast alone) and just lowers the consistency, occasionally slowing things down by a good 10%.

-Ofast -fno-signed-zeros -fno-trapping-math

I also tried adding a couple of extra math-related flags on top of -Ofast. There was no noticeable difference from the original -O3 build suggested by Google.

-msse4.2

Remembering this as a suggested option from Google's documentation, I tried combining it with some of the previous flags in various combinations. Once again, no speedup was found.

-frename-registers and -funroll-loops

I tried these separately, each combined with -Ofast. Nothing special, once again. I think they depend on the code having a particular shape to make a significant difference.

 

Compiler flag summary:

So far, only the -O3 vs. -Ofast change is to my liking. I will carry both versions forward as I make the rest of my changes, but considering the difference is only about 1%, I'm not ready to call it.

 

Profile-Guided Compiler Optimizations

By setting the compiler flags on CityHash to -fprofile-generate, running the program, and then switching to -fprofile-use, I noticed another slight speedup. The range of times I was getting shifted down consistently, from the original 1101–1109 ms to about 1082–1090 ms. Although it sometimes still took 1090 ms or even 1100 ms, in general there was a consistent speedup. This ~20 ms improvement amounts to about 2%. Still not amazing, but I'll take what I can get.
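For reference, the workflow went roughly like this (the configure lines are a sketch of how I fed the flags in):

1) rebuild with instrumentation ( ./configure CXXFLAGS="-fprofile-generate" && make clean && make )

2) run the benchmark once, so gcc writes out its .gcda profile data

3) rebuild using that profile ( ./configure CXXFLAGS="-fprofile-use" && make clean && make )

4) run and time the benchmark as usual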

Delighted with this change, I felt a little cheeky and tried to do this:

-O3 -fprofile-use

It didn't work. I can only assume that the parts of -O3 the profile didn't want were left out for a reason; I'm not even sure how the two interact, to be honest. But it was worth a shot. I skipped trying the same cute trick with -Ofast.

 

Using another compiler

I read somewhere that there is a compiler that can optimize more aggressively than gcc: the Intel C++ compiler from Intel Parallel Studio XE. Fortunately, Seneca College has a floating license, and I was able to get access for use in my school project. The cool thing here is the auto-vectorizer (which, of course, can only kick in if the code is auto-vectorizable in the first place). It also has its own -O2 and -O3 levels, as well as PGO (which I already did with gcc), but I ignored those since I had covered those types of optimizations already.

Unfortunately, even after installing and setting everything up, I could not get the compiler to build anything. The CityHash makefile is quite long and complicated, and getting it to use another compiler proved impossible. I tried to compile things manually myself, but whenever I attempt that without the makefile (even with gcc) it wants to include a file that cannot be located. Here is roughly what I had planned to do, based on this guide (page 7):

1) compile and run once without the auto-vectorizer ( icc -O3 -no-vec city.cc -o city.o )

2) compile and run once with the auto-vectorizer ( icc -O3 -vec-report city.cc -o city.o )

3) compare runtimes

4) repeat with -Ofast, if the Intel compiler allows it

 

Using Assembler

To be honest, I couldn't get anything to work in this department. My skill with assembler is pretty low, and my attempts to rewrite pieces of the code in it never succeeded on any level. While speedups here might be theoretically possible, I doubt there would be much to gain; the hash function is already pretty fast.


OSD600, Course in Review

Over the past few months, I’ve been taking a course called Topics in Open Source Development at Seneca College. This post will be about my experience in the course.

Going into this course, I didn't know much about it. I had read an online comment claiming it was only useful for someone interested in staying heavily involved in open source development, and I would have to say that this wasn't true of the course I took. Although it was a good first dive into open source programming, I found that a lot of what the course covered had universal value.

In fact, my favourite thing about the course was learning about the popular tools that facilitate collaboration on software projects: things like git, GitHub, and continuous integration tools. Even the methods of unifying code formatting were interesting.

These are all things that are quite popular in the industry and, to varying degrees, necessary. So much so that I am shocked I would not have learned them had I not taken this professional option. I have already begun using git in my other courses, and it has made my life easier. Combined with all of the other tools, I feel like this has been one of the most valuable courses I have taken at Seneca.

I’m not saying that I’m an expert in any of these things yet. But the curtain has parted, and I have taken my first steps. Those, I think, are the hardest ones. I feel like I am able to jump into any of these now, and deepen my understanding without being intimidated.

As for the open source part, I am glad I had a taste of it, but I am still not sure what to make of it. On the one hand it seems like a great way to get involved in programming, especially as an outsider. Joining big projects is a good way to gain experience. Creating small projects is a great way to turn the ideas you have into reality, too. But the decision to make a project open source versus keeping it private is one that I’m not sure how to make. Not only that, but if I do end up getting a job in industry, I wonder how much time I might have to dedicate to open source work. This is something that will answer itself in time.

The toughest part of the course for me has been selecting projects. I know there are plenty of cool projects out there, but when it comes time to find them, I have no clue. Perhaps it is my lack of imagination, or my lack of awareness of what is going on in the community. For example, a lot of the hot repositories on GitHub seem to be machine-learning related, and I have yet to even dabble in that field.

In summary, I loved this course. It might not be possible to run every course in the same style; its open-ended approach would not suit subjects where everyone needs to know exactly the same material and be accountable for performing specific tasks perfectly. But as a learning experience, I think it has seeded my growth in a big way, and it will continue to benefit me for a long time.

Getting started with Travis-CI (OSD600)

My second task for the 0.4 release has been to get Travis CI automated build testing set up on our internal project, DarkChatter. I'm excited about this one, because I have always wondered how those big fancy projects on GitHub set up such robust systems that automatically check newly submitted code for not being utterly broken. Having learned about it in class, I wanted to get my hands dirty with it; it would save a lot of time and effort on any project, present or future.

After getting started, I noticed some things that I hadn't expected. First of all, you need to give this third party access to your GitHub repositories, and only then can it run builds for you; it isn't part of the GitHub environment natively. For a moment I hesitated before signing away these permissions, but I'm trusting in the system.

The next thing I noticed was that it isn't plug and play. After making my .travis.yaml file with a relatively basic setup, builds still didn't trigger automatically. After attempting a manual build, I got an error: Build config file is required via repository settings, but config is empty.

Funny story. After paring down the file to just the basics, it still didn't work. The culprit turned out to be the file name… it should have been .yml, not .yaml. Sometimes it's the little things that eat up the hours.

Another funny story… it still didn't work, this time due to a parsing error.

After comparing my version to more example builds, I decided to pare it down even further. After getting it down to just three lines, I got my first build to trigger successfully (with eventual errors). Unfortunately, this is all that it did:

language: python
python:
  - "2.7"

Not super impressive. But this gave me a starting point, and I could now add one line at a time to make sure each section worked. I broke up our dependency installs line by line and fixed the ones that weren't working, then set up some basic tests with pytest to see if they would run. Between issues with import statements, getting pytest to run properly, and other little stumbling blocks, it wasn't until build #32 that I managed a fully successful build.
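For the curious, the direction the config ended up heading was roughly this (the dependency file and test command are placeholders for whatever we settle on):

language: python
python:
  - "2.7"
install:
  - pip install -r requirements.txt
script:
  - pytest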

[Image: travis first success]

Wonderful. This was, however, only done on my fork, so I made sure to enable Travis CI for our main repo too. Then, when sending the PR, there were no problems and the appropriate info showed up on GitHub.

[Image: travis success 2]

Working on something bigger (OSD600)

For most of my OSD600 work, I've contributed to relatively small projects. Now, I'm excited to say that I get to play a small role in a bigger one. Mozilla Firefox is trying to get ESLint running on its code base in order to have a cleaner, unified style across the project. But suddenly enforcing new rules on a project that asks you to have 30+ GB free just to clone the repo is not something that can be done instantly; each file might theoretically need to be altered to accommodate the linting rules.

Having claimed a small portion of the source code, I am responsible for making sure the files in the /dom/tests/unit and /dom/system directories abide by these new rules. In some cases, issues can be detected and fixed automatically. In other cases, it involves making simple adjustments (one was replacing a tab with spaces) or informing ESLint that they are not in fact actual problems. But several of these issues are a bit more complicated: they require rewriting a bit of code, which I am hesitant to plow through too quickly.

There is a type of rule that involves replacing getService() calls with something like Services.console. These exist for various services, not just the console. The problem is that I need to look up how each one works and decide what the equivalent call is and what parameters to use (I'll sketch the pattern below). I tried to fix one, and then ran ./mach test dom/system. No failed tests. Wonderful. Then, just to be sure, I commented out the entire line and ran the test again. No failed tests. What?

Running the tests on that folder doesn't seem to be a reliable way of checking whether I preserved the functionality of the line I changed. Was I running the wrong tests? Is there something I'm missing? Who knows. I'll keep reading and keep learning about the code I'm working on, and I'll discuss things with the fellow students who are enabling ESLint on the other directories.
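For reference, the general shape of the getService() rewrite looks something like this (a generic example of the pattern using the console service, not a line taken from my claimed files):

// before: grabbing the XPCOM service by hand
var console = Components.classes["@mozilla.org/consoleservice;1"]
                        .getService(Components.interfaces.nsIConsoleService);
console.logStringMessage("hello");

// after: the Services shortcut that the lint rule prefers
Services.console.logStringMessage("hello");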

In other news, I have decided to implement continuous integration on DarkChatter as part of our internal project contribution. Not only will it prove useful in the long run (if we continue to tinker with the project after the course is over), but it will allow me to practice something I see as essential to any project I will ever work on. If time permits, I will also update the command line argument guide I made in release 0.3’s PR 1 with the additional research I did in order to complete PR 2. Hopefully by the end of this week I’ll have both this and the Mozilla stuff done so that I can focus on my last issue next week.

Package Benchmarks (SPO600 Project Stage 2)

Today was the day I got my finally-installed copy of CityHash to drop some benchmarks and show me how fast it really is. The road has been bumpy, but I've made it past each hurdle. At times it was hard to tell what type of errors I was getting, but in the end I was able to inch forward past each one.

Since CityHash didn't seem to have instructions for calling a main program, and seemed to be more of a library, I wrote my own program to call CityHash64(), the function I was interested in optimizing. My program initially looked something like this:

[Image: bench1]
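In text form, the idea was roughly this (a simplified sketch rather than the exact code; the input file name is a placeholder):

#include <fstream>
#include <iostream>
#include <string>
#include "city.h"

int main() {
    std::ifstream in("input.txt");                        // big file of test strings, one per line
    std::string line;
    while (std::getline(in, line)) {
        uint64 h = CityHash64(line.data(), line.size());  // hash each string
        std::cout << h << "\n";                           // originally, every hash was printed
    }
    return 0;
}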

CityHash didn't come with a plethora of tests, but with a few runs it was clear that it was producing hashes, and it produced the same outputs every time for a fixed input. Initially, I used a file containing the full text of Mary Shelley's Frankenstein (easy to get, since it's in the public domain). However, the program didn't run for very long on it, so I replaced it with something bigger. Once runs were on the order of ~15 seconds, I was happier with the benchmark-worthiness. The file itself contains many hundreds of thousands of strings of varying lengths.

The tests were run on two systems: a Fedora virtual machine on my personal laptop (x86_64), and the Aarchie server provided by the school (which is ARM-based). I might only end up using one of them if I optimize in assembly language, but for now I thought I would cover all of my bases.

I ran some initial tests, but something worried me: my own program had so much overhead that it was difficult to see what was going on with CityHash, which accounted for only a tiny percentage of the runtime. So I first tried to trim the driver down. I got rid of the lines that printed the hashes; strangely, this made the program finish almost instantly. Putting the printing back, but piping the output instead of writing it to the terminal, still brought it down to only 1–2 seconds.

I had difficulty making the input file bigger, so I decided to just print out the lower-order digits of the sum of all the hashes, since I was paranoid that the compiler would optimize everything away if I never used the values. I didn't want tons of cout calls, since that would have dwarfed the rest of the program and hidden the variance.
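So the hashing loop ended up looking more like this (again a sketch; the modulus is arbitrary):

uint64 sum = 0;
while (std::getline(in, line)) {
    sum += CityHash64(line.data(), line.size());          // keep every result live
}
std::cout << (sum % 1000000) << "\n";                     // print low-order digits so nothing gets optimized away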

Most other programs were kept closed, especially those likely to have spikes in resource usage. On the remote server this could not be controlled, but in both cases an attempt was made to average out interfering load by simply taking many, many trials.

Runtimes:

x86_64 (Fedora VM)

The vast majority of runs took between 1150 and 1160 ms, with a few under or over; only once in about 20 runs did it go below 1150 ms. I'm confident that this lower bound is reasonably close to the ideal runtime, since the variance seems to go in only one direction. For the purpose of later comparison, we will say the total runtime is about ~1155 ms.

ARM (Aarchie server)

The bulk of runs came in around 700 ms, but at times there were results in the 750–780 ms range. Based on the timing, I believe the higher results occurred when someone else was using the system; otherwise I would have expected a more even distribution of results. There weren't many times in the 720–740 ms range, which is what normal variance would have produced. I choose to take the lower cluster of results as a rough estimate of the runtime on Aarchie (~700 ms), since it suggests the least interference.

Perf

The perf report on x86 typically looked like this:

[Image: bench3]

If you look closely, a lot of the time spent in this program is unrelated to CityHash. It was difficult to isolate CityHash's behaviour from this view alone, so I will end up paying more attention to the call graph. You can still see CityHash64 among the functions called (around 1–2% of runtime), as well as HashLen0to16 (usually 3–4%). Perhaps the call graph will illuminate more.
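For reference, the reports and call graphs come from the usual perf workflow (the binary and input names are placeholders for my driver program):

perf record -g ./citybench input.txt    # sample the run and record call-graph data
perf report                             # browse where the time went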

Strangely enough, the perf report on ARM calculated the values differently. The x86 report doesn't seem to show main or _start, which would be the proper head of the hierarchy at roughly 96% of the time spent.

[Image: bench4]

Call Graph:

Below is an example of one of the call graphs, cropped to the relevant parts. Since we don't really care about the other parts of my program (which just deal with preparing the strings from the input file and such), we can compare the relative sizes of these. I'm not sure why CityHash64 is split into two calls, and HashLen0to16 is not lower in the hierarchy than CityHash64 either; my expectation is that it should be called by CityHash64, not by main. I'm not really sure what to make of this.

[Image: bench5]

I was hoping to get a percentage of the total runtime attributable to CityHash for benchmarking purposes, but this seems too messy to resolve right now, so I will stick to the total-runtime approach until it is sorted out. Assuming all of the startup and string-preparation wizardry remains untouched, I am willing to bet that any optimizations I make will directly reduce the total runtime of the program. It will be impossible to tell how much faster I made the hash relative to itself, but hopefully by then I can produce a more accurate percentage to compare against.

Building Issues (SPO600 project)

So, having selected Google's old CityHash package as the focus of my optimization project, I had some issues actually getting it to build out of the box. First, there were issues with configuring it. After those were solved, I had issues with my make commands, which pointed out all kinds of C++-related errors whenever I tried to build. After investigating gcc version issues (which didn't end up being the problem), I found a part of the README that offered an alternative build using an additional compiler flag, -msse4.2. From what I can tell, this enables SSE instructions, where SSE stands for Streaming SIMD Extensions. This was pretty unexpected, considering there is a comment in the code itself saying:

It's probably possible to create even faster hash functions [...] by using SIMD instructions[.]

My guess is that either the SIMD work lives in another part of the package, or it is done by the compiler rather than hard-coded. I will still take my shot at speeding it up. Now, let's benchmark this thing.

Release 0.3 End (OSD600)

I have found an external open source project that I want to work on more seriously for my OSD600 class. Well, I didn't just find it. I mentioned before that I was interested in working on MTGATracker, which allows players of the online card game Magic: The Gathering Arena to keep track of what is in their deck (and many other useful things). When I had initially looked at it, I thought it might be interesting, but I was later scared away. It looked kind of complicated and didn't seem to have any low-hanging fruit for me to grab. Not only have I not used Python in a while, I have never used it in conjunction with this many other moving parts, and I don't know much about UI building either. But, after coming back to it several times during my search, I have decided to dive into the code and figure out how it works.

But there's no time to familiarize myself with it yet, so I decided to do something quicker. While mulling things over, I ended up looking at one of the repositories I had previously contributed documentation to, and considered doing some more of the same. I noticed, however, that none of the links in the README file actually worked. I expected this to be a 30-second fix I could tack onto the documentation I would later write, so I jumped in to make the quick change.

Except it wasn't that simple. I didn't understand how the link references worked, and even after looking it up, I still didn't quite get it. How did it know where to send the user? Eventually I found something unhelpful that nudged a gear in my brain and gave me an a-ha moment: there is a conversion rule linking things together. The link targets have to be written as Markdown headings, and the links have to be written in a specific way (example below). Rushing to put my newly tested theory into practice, I ended up breaking the entire bullet-point hierarchy too, so I had to reconcile two different structures that didn't work together and apply the new structure throughout the entire documentation. Was it the most difficult thing in the world? Perhaps not, but I do feel like I learned a little bit more than if I had just repeated the same documentation gig as last time.
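To make that concrete, the pattern boils down to pairs like this (the section name is invented for the example): GitHub generates an anchor from each heading by lowercasing it and replacing spaces with hyphens, and the link has to point at that generated anchor.

## Getting Started

- [Getting Started](#getting-started)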

 

On the internal project front, I'm still working on DarkChatter. I'm finally kitted out with the essentials for the project: a Fedora virtual machine (since Windows comes with extra roadblocks), a USB network card to hook into my VM, and the myriad software packages involved in advanced network wizardry.

My next task was to integrate the command-line argument research I had done last time into the actual project. As I started to edit the main Python script that Ryan had been fiddling with, I realized two things. First, it was a "test" file of sorts: ScappyTest.py need not resemble anything that finally gets into the final code, so piggybacking onto it and inserting my bits might be a waste of time. The other thing I realized was that I didn't have to put everything in one file. My memories of Python are a bit foggy, but back then I didn't know about the magic of writing maintainable code and separating unrelated concerns. Better to leave that world behind. I split the command-line argument logic into one file and the main code into another, then built out the overarching entry logic that determines whether our script is sending or receiving packets. The code we eventually add for that can then be slid into place.
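As a rough sketch of the split (the file, function, and argument names are placeholders, not necessarily what ends up in the repo):

# cli.py – all command-line argument handling in one place
import argparse

def parse_args():
    parser = argparse.ArgumentParser(description="DarkChatter")
    parser.add_argument("--send", action="store_true", help="send packets")
    parser.add_argument("--receive", action="store_true", help="listen for packets")
    parser.add_argument("--iface", default="wlan0", help="wireless interface to use")
    return parser.parse_args()

# main.py – entry point that decides what the script is doing
from cli import parse_args

def main():
    args = parse_args()
    if args.send:
        pass      # packet-sending code slides in here
    elif args.receive:
        pass      # packet-receiving code slides in here

if __name__ == "__main__":
    main()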