Optimization Project part 3 (SPO600)

After some selection and benchmarking, I am finally ready to try optimizing a piece of software as part of my Software Optimization and Portability course’s final project.

I noted in my project plan that CityHash, the hash function I chose to optimize, was already fairly optimized. It was built with the -O3 compiler flag, meaning most of the free compiler optimizations were already in place. The algorithm itself is also complex and streamlined enough that rewriting it at a basic level is probably beyond my current ability. So my plan was to vectorize the code with SIMD instructions in assembler, or otherwise use assembler to speed things up.

But after reviewing the code further, I can see some issues with this too. There aren’t many long, repetitive loops that would make vectorization profitable (speed-wise), and the loops that do exist are generally tied to string length, so the utility of doing that is also questionable. So here is my revised strategy:

  1. Try various compiler flags beyond -O3
  2. Try profile-guided optimization
  3. Try another compiler
  4. If time is left over, try to speed things up in assembler


Compiler Flags

Flag combinations:


-march=native

This flag allows the compiler to use architecture-specific instructions, meaning the binary will not necessarily run the same on another CPU (unless you recompile it on that machine). This lowered the minimum runtime slightly, though it made the results less consistent. Instead of a minimum time of about 1101ms, runs started to bottom out around 1093ms. That is only a 0.7% improvement in the best case, and the same speed or slightly worse much of the time.


-Ofast

This one achieved a similar speedup to -march=native on the high end, but was more consistent. With a minimum time of 1091ms, that’s a 1.0% speedup. Still not amazing, but interesting given the relative ease.

This small speedup could be within error, but my strategy for minimizing error is to assume that, on a machine with almost nothing else running, the shortest time (if consistent) roughly represents the “true runtime” on that machine under those conditions. This logic might not be perfect, but if a program consistently runs within a very tight window, I think the comparison is sane. Larger fluctuations are a different matter.

-Ofast -march=native

Combining the two does not seem to offer twice the speedup. I think I prefer the consistency of just having -Ofast here. -march=native doesn’t seem to add anything (still the ~1% speedup of -Ofast alone), and it just lowers the consistency, occasionally slowing things down by a good 10%.

-Ofast -fno-signed-zeros -fno-trapping-math

I included a few extra fast-math-style flags. There was no noticeable difference from the original -O3 suggested by Google, which makes sense in hindsight: -Ofast already enables both of these through -ffast-math.


Remembering this as a suggested option from Google’s documentation, I tried combining it with some of the previous flags in different combinations. Once again, no speedup was found.

-frename-registers and -funroll-loops

I tried each of these separately, combined with -Ofast. Nothing special, once again. I think they depend on the code having a certain shape to make a significant difference.


Compiler flag summary:

So far, only the switch from -O3 to -Ofast is to my liking. I will try both versions going forward as I make the rest of my changes, but considering the difference is only about 1%, I’m not ready to call it.


Profile-Guided Compiler Optimizations

By setting the compiler flags on CityHash to -fprofile-generate, running the program, and then switching to -fprofile-use, I noticed another slight speedup. The range of times was consistently shifted down from the original 1101ms–1109ms to 1082ms–1090ms. Although it sometimes still took 1090ms+ or even 1100ms, in general there was a consistent speedup. This ~20ms improvement amounts to about 2%. Still not amazing, but I’ll take what I can get.

Delighted with this change, I felt a little cheeky and tried to do this:

-O3 -fprofile-use

It didn’t work. I can only assume that the optimizations -O3 enables but the profile data didn’t want were left out for a reason. I’m not even sure how the two interact, to be honest. But it was worth a shot. I skipped trying the same cute trick with -Ofast.


Using another compiler

I read somewhere that there is a compiler that can optimize more aggressively than gcc: the Intel C++ Compiler, from Intel Parallel Studio XE. Fortunately, it seems that Seneca College has a floating license, and I was able to get access for my school project. The cool thing here is that it has an auto-vectorizer (which, of course, can only kick in if the code is auto-vectorizable). It also has its own -O2 and -O3 capabilities, as well as PGO (which I had already done with gcc), but I ignored those since I had already covered those types of optimizations.

Unfortunately, even after installing and setting everything up, I could not get the compiler to compile anything. The cityhash makefile is quite long and complicated, and I could not get it to use another compiler. I tried to compile manually myself, but whenever I attempt that without the makefile (even with gcc), it wants to include a file that cannot be located. Here is roughly what I had planned to do, based on this guide (page 7):

1) compile and run once without the auto-vectorizer ( icc -O3 -no-vec city.cc -o city.o )

2) compile and run once with the auto-vectorizer    ( icc -O3 -vec-report city.cc -o city.o )

3) compare runtimes

4) repeat with -Ofast if the intel compiler allows it


Using Assembler

To be honest, I couldn’t get anything to work in this department. My skill in assembler is pretty low, and my attempts to rewrite things in it never succeeded on any level. While a speedup could theoretically be possible, I doubt there would be much to gain here; the hash function is already pretty fast.


OSD600, Course in Review

Over the past few months, I’ve been taking a course called Topics in Open Source Development at Seneca College. This post will be about my experience in the course.

Going into this course, I didn’t know much about it. I had read an online comment claiming it is only useful for someone interested in staying heavily in open source development, and I would have to say that this wasn’t true of the course I took. Although it was a good first dive into open source programming, I found that a lot of things in the course had universal value.

In fact, my favourite thing about the course had more to do with learning the popular tools that facilitate collaboration on software projects: things like git, github, and continuous integration tools. Even the methods for unifying code formatting were interesting.

These are all things that are quite popular in the industry, and necessary, to varying degrees. So much so that I am shocked I wouldn’t have learned them had I not taken this professional option. I have already begun using git for my other courses, and it has made my life easier. Combined with all of the other tools, I feel it has been one of the most valuable courses I have taken at Seneca.

I’m not saying that I’m an expert in any of these things yet. But the curtain has parted, and I have taken my first steps. Those, I think, are the hardest ones. I feel like I am able to jump into any of these now, and deepen my understanding without being intimidated.

As for the open source part, I am glad I had a taste of it, but I am still not sure what to make of it. On the one hand it seems like a great way to get involved in programming, especially as an outsider. Joining big projects is a good way to gain experience. Creating small projects is a great way to turn the ideas you have into reality, too. But the decision to make a project open source versus keeping it private is one that I’m not sure how to make. Not only that, but if I do end up getting a job in industry, I wonder how much time I might have to dedicate to open source work. This is something that will answer itself in time.

The toughest part of the course for me has been selecting projects. I know there are plenty of cool projects out there, but when it comes time to find them, I have no clue. Perhaps it is my lack of imagination, or my lack of awareness of what is going on in the community. For example, a lot of the hot repositories on github seem to be machine-learning related, but I have yet to even dabble in that field.

In summary, I loved this course. It might not be possible to make every course in the same style; its open-ended nature would not suit subjects where everyone needs to know exactly the same thing and be accountable for performing specific tasks perfectly. But, as a learning experience, I think it has seeded my growth in a big way, and it will continue to benefit me for a long time.

Getting started with Travis-CI (OSD600)

My 2nd task for the 0.4 Release has been to get Travis CI automated build testing set up on our internal project, DarkChatter. I’m excited about this one, because I have always wondered how those big fancy projects on github set up such robust systems that automatically check newly submitted code for not being utterly broken. Having learned about it in class, I wanted to get my hands dirty with it. It would save a lot of time and effort on any project, present or future.

After getting started, I noticed some things that I hadn’t expected. First of all, you need to give this third party access to your github repositories, and only then can it run the builds for you. It isn’t part of the github environment natively. For a moment I hesitated before signing away these permissions, but I’m trusting in the system.

The next thing I noticed was that it isn’t as simple as plug and play. Even after giving my .travis.yaml file a relatively basic setup, it still didn’t trigger builds automatically. After attempting a manual build, I got an error: “Build config file is required via repository settings, but config is empty.”

Funny story. Even after paring the file down to just the basics, it didn’t work. The culprit turned out to be the file name… It should have been .yml, not .yaml. Sometimes it’s the little things that eat up the hours.

Another funny story… It still didn’t work due to a parsing error.

After comparing my version to more example builds, I decided to pare it down even more. After getting it down to just 3 lines, I got my first build to trigger successfully (with eventual errors). Unfortunately, this is all that it did:

language: python
python:
  - "2.7"

Not super impressive. But this gave me a starting point, and I could now add in one line at a time to make sure each section worked. I broke up our dependency installs line by line and fixed the ones that weren’t working. Then I set up some basic tests with pytest to see if they would run. Between the issues with import statements, getting pytest to run properly, and other little stumbling blocks, it wasn’t until build #32 that I managed to have a fully successful build.

travis first success

Wonderful. This was, however, only done on my fork, so I made sure to enable Travis CI for our main repo, too. Then, when sending the PR, there were no problems and the appropriate info showed up on github.

travis success 2

Working on something bigger (OSD600)

For most of the projects I’ve chosen to work on for OSD600, I’ve contributed to relatively small projects. Now, I’m excited to say that I get to play a small role in a bigger project. Mozilla Firefox is trying to get ESLint running on its code-base in order to have a cleaner, unified style across the project. But suddenly enforcing these new rules onto a project that asks for you to have 30+GB free to clone the repo is not something that can be done instantly. Each file might theoretically need to be altered to accommodate the linting rules implemented.

Claiming a small portion of the source code, I am thus responsible for making sure the files in the /dom/tests/unit and /dom/system directories abide by these new rules. In some cases, issues can be detected and fixed automatically. In other cases, it involves making simple adjustments (one was replacing a tab with spaces), or informing eslint that they are not in fact actual problems. But several of these issues are a bit more complicated. They require rewriting a bit of code, which I am hesitant to plow through too quickly.

There is a type of rule that involves replacing all getService() calls with something like Services.console. They exist for various services, and not just console. The problem is that I need to look up how each one works, and decide what an equivalent call is and what parameters to use. I tried to fix one, and then ran ./mach test dom/system. No failed tests. Wonderful. Then, just to be sure, I commented out the entire line and ran the test again. No failed tests. What?

Running the tests on that folder doesn’t seem to be a reliable metric in checking if I successfully preserved the functionality of the line I changed. Was I running the wrong tests? Is there something I’m missing? Who knows. I’ll keep reading, and keep learning about the code I’m working on. I’ll discuss things with fellow students who are working on enabling ESLint on the other directories.

In other news, I have decided to implement continuous integration on DarkChatter as part of our internal project contribution. Not only will it prove useful in the long run (if we continue to tinker with the project after the course is over), but it will allow me to practice something I see as essential to any project I will ever work on. If time permits, I will also update the command line argument guide I made in release 0.3’s PR 1 with the additional research I did in order to complete PR 2. Hopefully by the end of this week I’ll have both this and the Mozilla stuff done so that I can focus on my last issue next week.

Package Benchmarks (SPO600 Project Stage 2)

Today was the day I finally got my installed copy of CityHash to drop some benchmarks and show me how fast it really is. The road has been bumpy, but I’ve made it past each hurdle. It was hard to tell, at times, what kinds of errors I was getting, but in the end I was able to inch past each one.

Since CityHash didn’t seem to have instructions to call a main program, and seemed to be more like a library, I wrote my own program to call CityHash64(), the function I was interested in optimizing. My program initially looked something like this:


CityHash didn’t come with a plethora of tests, but after a few runs it was clear that it was producing hashes, and that it produced the same outputs every time for a fixed input. Initially, I used a file containing the full text of Mary Shelley’s Frankenstein (easy to get, since it’s in the public domain). However, it didn’t run for very long, so I replaced it with something bigger. Now running on the order of ~15 seconds, I was happier with its benchmark-worthiness. The file contains many hundreds of thousands of strings of varying lengths.

The tests were run on two systems: a Fedora virtual machine on my personal laptop (x86_64 architecture), and the Aarchie server provided by the school (which uses ARM). I might only end up using one of them if I optimize in assembly language, but for now I thought I would cover all of my bases.

I ran some initial tests, but I was worried about something. My own program had so much overhead that it was difficult to see what was going on with CityHash; it accounted for such a tiny percentage of the runtime. So I first tried to trim the program down. I got rid of the lines that printed the hashes. Strangely, this made the run finish almost instantly. Putting the printing back, but piping it instead of writing to standard out, made it take only 1–2 seconds.

I had difficulty trying to make the file bigger, so I decided to just print out the lower-order digits of the sum of all the hashes, since I was paranoid that the compiler would optimize everything away if I didn’t use the values. I didn’t want tons of cout calls, since that would have dwarfed the rest of the program and hidden the variance.

Most other programs were kept closed, especially those more likely to have spikes in resource usage. On the remote server this could not be controlled, but in both cases an attempt to control interfering load was made by simply taking many, many trials.


x86_64 (Fedora VM)

The vast majority of runs ran between 1150-1160 ms. There were some under or over. Although most ran in this range, only once out of about 20 runs did it go below 1150 ms. I’m confident that this lower bound is somewhat close to our ideal runtime, since the variance seems to be in only one direction. We will then say that the total runtime is about ~1155 ms (for the purpose of comparison, later).

ARM (Aarchie server)

The bulk of runs came in around 700ms, but at times there were results in the 750–780ms range. Based on the timing, I believe the higher cases occurred when someone else was trying to use the system. I say this because otherwise I would have expected a more even distribution of results: there weren’t many 720–740ms times, which there would be if it were just normal variance. I take the lower cluster of results as a rough estimate of the runtime on Aarchie (~700ms), since that suggests the least interference.


The perf report on x86 typically looked like this:


If you look closely, a lot of the time spent on this program is unrelated to CityHash. It was difficult to isolate CityHash behaviour just looking at this, so I will end up paying more attention to the graph. You can still see CityHash among the functions called (runtime about 1-2%), as well as HashLen0to16 (runtime usually 3-4%). Perhaps the call graph will illuminate more.

Strangely enough, the perf report on ARM calculated the values differently. The x86 report doesn’t seem to show main or _start, which should sit at the head of the hierarchy with ~96% of the time spent.


Call Graph:

Below is an example of one of the call graphs, cropped to the relevant parts. Since we don’t really care about the other parts of my program (which just deal with preparing the strings from the input file and such), we can compare the relative sizes of these. I’m not sure why CityHash64 is split into two calls, and HashLen0to16 is not shown below CityHash64 in the hierarchy, either. My expectation is that it should be called by CityHash64, not by main. I’m not really sure what to make of this.


I was hoping to get a percentage value for CityHash’s share of the benchmarking program’s total runtime, but this seems too messy to resolve right now. So I will stick to the total-runtime approach until it is fixed. Assuming all of the startup and string-related wizardry remains untouched, I am willing to bet that any optimizations I make will directly reduce the total runtime of the program. It will be impossible to tell how much faster I made the hash relative to itself, but hopefully by then I can produce a more accurate percentage to compare with.

Building Issues (SPO600 project)

So, having selected Google’s old CityHash package as the focus of my optimization project, I had some issues actually getting it to build out of the box. First, there were issues with configuring it. After those were solved, I had issues with my make commands; it pointed out all kinds of C++-related errors whenever I tried to build. After investigating gcc version issues (which didn’t end up being the problem), I found a part of the README that offered an alternative with an additional compiler flag, -msse4.2. From what I can tell, this enables SSE instructions, which stands for Streaming SIMD Extensions. This was unexpected, considering there is a comment in the code itself saying:

It's probably possible to create even faster hash functions [...] by using SIMD instructions[.]

My guess is that either the SIMD stuff is in another part of the package, or it is done by the compiler rather than hard-coded. I will still make my attempts to speed it up. Now, let’s benchmark this thing.

Release 0.3 End (OSD600)

I have found an external open source project that I want to work on more seriously for my OSD600 class. Well, I didn’t just find it. I mentioned before that I was interested in working on MTGATracker, which allows players of the online card game Magic: The Gathering Arena to keep track of what is in their deck (and many other useful things). When I initially looked at it, I thought it might be interesting, but I was later scared away. It looked kind of complicated and didn’t seem to have any low-hanging fruit for me to grab. Not only have I not used python in a while, I have never used it in conjunction with other things, and I don’t know much about UI building either. But, after coming back to it several times during my search, I have decided to dive into the code and figure out how it works.

But there’s no time to familiarize myself with it yet, so I decided to do something quicker. While mulling things over, I ended up looking at one of the repositories I had previously contributed documentation to, and considered doing more of the same. I noticed, however, that none of the links in the README file actually worked. I expected this to be a 30-second fix I could tack onto the documentation I would later write, so I jumped in to make the quick change.

Except it wasn’t that simple. I didn’t understand how the link references worked. Even after looking it up, I still didn’t quite get it. How did it know where to send the user? Eventually I found something unhelpful that nudged a gear in my brain and gave me an a-ha moment. There was a conversion algorithm linking things together. The link targets had to be using heading markdown, and they had to be written in a specific way. Rushing to put newly tested theory into practice, I ended up breaking the entire bullet-point hierarchy too. I had to reconcile two different structures that didn’t work together, and implement that new structure throughout the entire documentation. Was it the most difficult thing in the world? Perhaps not, but I do feel like I learned a little bit more than if I just repeated the same documentation gig as last time.


On the Internal project front, I’m still working on DarkChatter. I’m finally kitted out with the essentials for the project (a Fedora virtual machine, since Windows comes with extra roadblocks; a USB-network card to hook into my VM; and the myriad software packages involved with advanced network wizardry).

My next task was to integrate the command line argument research I had done last time into the actual project. As I started to edit the main python script that Ryan had been fiddling with, I realized two things. First, it was a “test” file of sorts; ScappyTest.py need not resemble anything that finally makes it into the final code, so piggybacking onto it and inserting my bits might be a waste of time. The other thing I realized was that I didn’t have to put everything in one file. My memories of Python are a bit foggy, but back then I didn’t know the magic of writing maintainable code and separating unrelated content. Better to leave that world behind. I split the command line argument logic into one file and put the main code in another. Then I built out the overarching entry logic that determines whether our script is sending or receiving packets. The code we eventually add for that can then be slid into place.

Release 0.3 Update 2 (OSD600)

This semester keeps going faster by the day. It seems like every course has its own project, and the internal project for my open-source course seems to be simultaneously the most and least intimidating. Getting a project off the ground within the last six weeks of class seems impossible, and scary. The fact that there is no hard final product to produce alleviates the pressure a bit, but I would much rather take a slow and thorough approach. My opinion on this comes from two things.

First is the chaos of the project being pulled in various directions. I had originally hoped to play a bit of a manager role on DarkChatter, the student project begun by Ryan and me. While Ryan seemed engrossed in the technical side of things, I felt that someone needed to chart a course on what exactly was being built. I will admit that I was slow in doing this. On top of juggling my other courses, I had to do some research to bring myself up to speed on the subject and to look ahead to the future. Hesitant to start building something that might not even be feasible, I really took my time on this.

In the meantime, several people hopped onto the project and started pulling it in new directions. Suddenly we were a multi-platform chat app with 3 different repositories, and I felt a little overwhelmed. All the plans I had brewing in my mind suddenly seemed not to fit into the whole, and the project had out-scoped any managerial aspirations of mine. This isn’t necessarily for the worse, though. It does seem possible for the three different parts of the project to co-exist. If someone else wants to work on an iOS messaging app that leverages the tech, more power to them. This lets me, in turn, focus on the part I find interesting and unique: the back-end. But I worry about the likelihood of it all coming together by the end of the semester.

The second issue I ran into recently is getting everything running. It turns out the stuff we are doing with wifi cards and monitoring for dangling packets is a lot easier in linux, meaning I had to try and get a linux machine running for myself. I didn’t want to do this on the school’s matrix server initially, because I wasn’t sure if I could or should be doing weird networking nonsense off of their machines. This meant falling back onto my usb copy of mint. Except it didn’t seem to work on my laptop. It worked on my desktop, but I couldn’t use it on there for other reasons. The time spent hunting for a linux machine just makes me lament the time crunch even more.

So, what did I end up getting done after all this research and linux-hunting? Well, I’m still no networking wizard, so I started to do some work elsewhere in the project. I learned how to implement command line arguments and started to set them up. While there is still the matter of hooking it up to something useful on the networking side, I took the opportunity to create a command line argument guide/template to aid any contributors (and probably future-me). Next goal: make the arguments actually worth using!

Software Optimization Project Plan (SPO600 Assignment Stage 1)

Over the next few weeks I will put what I’ve learned in Software Optimization and Portability to the test by attempting to speed up some open-source hashing software. The hashing tool in question is called CityHash, and it comes from Google. It hashes strings and is written in C++. I chose it because it seems to be legitimate and claims decent performance, but still doesn’t seem so streamlined that I am incapable of contributing anything. That may still be the case in the end, but I think the potential is there.


I hope to build the project on two systems: my own computer (an x86_64 system) and one of the school’s aarch64 servers, for variety. Using the test code included in the package as a guide, I will create a large set of strings to increase the runtime. Using the profiling tool perf, I will measure the overall and function-specific runtimes of a specific hash function (probably CityHash32 or CityHash64) and take the average over several runs. The same data set will later be used to check for speedups.



There are a variety of options to explore when it comes to optimizing this function. The simplest seems to be playing with compiler options, which I may do, but the README suggests that the authors are already aware of the more general compiler optimizations available (such as preferring “-g -O3” over the default “-g -O2” with gcc). The actual functions themselves seem too complex for me to significantly rewrite. I also find it likely that using inline assembler would backfire on me, as I have read that, except in very specific cases, it often leads to a slowdown.

Therefore, I have my eyes set on vectorizing the code; that is, grouping similar operations and doing them simultaneously rather than one by one. There are even notes in the code suggesting that no SIMD (Single Instruction Multiple Data) operations are used, so I think it provides a good candidate for improvement.

The process of doing so may be quite difficult. I am reading up on vectorization, and a conversation with my professor suggested that vectorizing this code may take me down a more complex path than I intended. I am, however, up for the challenge, and will carefully learn about masking and using vector registers. Fortunately, I have until stage 3 to bring myself up to speed. I’m kind of excited to be getting my hands dirty with something like this.

Since the project comes with a testing suite, I will run the tests to make sure nothing is broken. I will also do a cursory check to make sure the outputs are coming out the same. To prove performance increases, I will compare both overall runtime and relative runtime for the function in question, on both systems. I will focus mostly on speed, and not memory or other resource dimensions, though if I stumble onto something of interest I will include it.

Building a collaborative project from the ground up (OSD600 0.3 week 1)

A New Project

With the start of a new assignment cycle, the time has come to start an open source project. For the 0.3 release of my open source development course, my classmates began brainstorming new ideas for internal projects that we could envision making for the remainder of the course. A friend of mine had several ideas, but one appealed the most to me: an application that uses a device’s network card to send data directly to another network card, cutting out the middle-man (without internet, or even LAN). After telling him how much I liked the idea, we pitched it to the class.

Thus DarkChatter was born. In my eyes, the details are still a bit up in the air, however. Would this be a full-blown chat application, or just a message-in-a-bottle deal? Would it be fully anonymous, or somewhat anonymous with persistent identities? Would it be a mobile application, or would the permission issues or required rooting make PC the preferred platform? (Edit: it seems that this may be easier on an iPhone, but I’m an android user.)

I’m still not sure where it will go. But in all of this I know where my preference lies: I want to be closer to the lower levels. What is going on with the hardware and the packets, and what difficulties are there in doing this? Why is this not already a more popular thing? Whether it is a chat app or not is not the biggest factor for me. First, however, we need to get this off the ground and get people on board.

For that reason I consider this a research week. Some of the choices that need to be made must be done by someone who is informed on the topic. As I am not an expert at networking or app development, I want to prime myself on the relevant topics.

Research Begins

The project’s original mastermind pointed me to Scapy first. Scapy is a tool that allows you to craft packets in any way you want, even in ways that don’t seem to make sense. This seems useful, since we are likely to be doing some non-standard networking wizardry, and it will allow us to customize things to our needs. It is based in python, though, which makes me wonder how to get it running in the context of a larger codebase in another language.

After reading a bit about ad-hoc networks and going down the rabbit hole of mesh networking, I realized that there could be many nodes communicating at once. However, I am probably still against group chatting, in the interest of time and reliability. It also got me wondering about distance limits and how chaining would work. Could this be extremely useful in areas of high population density? Or would other people slow down my device by using me to pass data as a middleman? But, back on topic…

After looking up a list of similar applications, I found that Bridgefy offers an SDK for communicating without Internet for up to 100 ft. It doesn’t, however, say that it is free, so I am afraid it is likely not. The silver lining is that if they can deliver these services without jail-breaking anything, perhaps we can too. But the lack of existing open source projects in this realm is going to hurt, and it will require me to do much more research with less hand-holding. Hopefully I won’t be in over my head.