Software Optimization Project Plan (SPO600 Assignment Stage 1)

Over the next few weeks I will put what I’ve learned in Software Optimization and Portability to the test by attempting to speed up some open-source hashing software. The hashing tool in question is called cityhash, which comes from Google. It hashes strings and is written in C++. I chose it because it seems legitimate and claims decent performance, yet doesn’t appear so heavily streamlined already that there is nothing left for me to contribute. That may still turn out to be the case, but I think the potential is there.

Benchmarking

I hope to build the project on two systems: my own computer (an x86_64 machine) and, for variety, one of the school’s aarch64 servers. Using the test code included in the package as a guide, I will generate a large set of strings to increase the runtime. Using the profiling tool perf, I will measure the overall and per-function run times of a specific hash function (probably CityHash32 or CityHash64) and take the average over several runs. The same data set will later be used to check for speedups.
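Here is a rough sketch of the kind of timing harness I have in mind (not final code). I am assuming the CityHash64 signature declared in city.h; the string count and length below are placeholders I expect to tune once I see how long the runs take.

    // Rough sketch of a timing harness (assumes the CityHash64 signature from
    // city.h; the data sizes here are placeholders).
    #include <chrono>
    #include <cstdint>
    #include <cstdlib>
    #include <iostream>
    #include <string>
    #include <vector>
    #include "city.h"

    int main() {
        // Build a large set of pseudo-random strings to stretch the runtime.
        const size_t kStrings = 10000000;  // placeholder count
        const size_t kLength = 64;         // placeholder string length
        std::vector<std::string> data(kStrings, std::string(kLength, 'x'));
        for (auto &s : data)
            for (auto &c : s) c = 'a' + std::rand() % 26;

        // Time only the hashing loop; perf will give the per-function breakdown.
        uint64_t sink = 0;  // accumulate results so the loop isn't optimized away
        auto start = std::chrono::steady_clock::now();
        for (const auto &s : data)
            sink ^= CityHash64(s.data(), s.size());
        auto stop = std::chrono::steady_clock::now();

        std::chrono::duration<double> elapsed = stop - start;
        std::cout << "hashed " << kStrings << " strings in " << elapsed.count()
                  << " s (checksum " << sink << ")\n";
        return 0;
    }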

 

Optimization

There are a variety of options to explore when it comes to optimizing this function. The simplest seems to be adjusting compiler options, which I may play with, but the README suggests the authors are already aware of the more general compiler optimizations available (such as building with “-g -O3” in gcc rather than the default “-g -O2”). The functions themselves seem too complex for me to significantly rewrite. I also find it likely that using in-line assembler would backfire on me, as I have read that, except in very specific cases, it often leads to a slowdown.

Therefore, I have my eyes set on vectorizing the code: grouping similar operations and doing them simultaneously rather than one by one. There are even notes in the code suggesting that no SIMD (Single Instruction, Multiple Data) operations are used, so I think it is a good candidate for improvement.
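To give a sense of what I mean by grouping operations, here is a toy example (not cityhash code) that adds two arrays of integers four elements at a time using SSE2 intrinsics on x86_64; on the aarch64 server the equivalent would use NEON intrinsics.

    // Toy SIMD example (not cityhash code): add two arrays of ints four at a
    // time with SSE2 instead of one by one. NEON would be the aarch64 equivalent.
    #include <emmintrin.h>  // SSE2 intrinsics
    #include <cstdio>

    int main() {
        alignas(16) int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        alignas(16) int b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
        alignas(16) int c[8];

        for (int i = 0; i < 8; i += 4) {
            __m128i va = _mm_load_si128(reinterpret_cast<const __m128i *>(a + i));
            __m128i vb = _mm_load_si128(reinterpret_cast<const __m128i *>(b + i));
            __m128i vc = _mm_add_epi32(va, vb);  // four additions in one instruction
            _mm_store_si128(reinterpret_cast<__m128i *>(c + i), vc);
        }

        for (int i = 0; i < 8; ++i) printf("%d ", c[i]);
        printf("\n");
        return 0;
    }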

The process of doing so may be quite difficult. I am reading up on vectorization, and a conversation with my professor suggested that vectorizing this code may take me down a more complex path than I intended. I am up for the challenge, however, and will carefully learn about masking and using vector registers. Fortunately, I have until stage 3 to bring myself up to speed. I’m kind of excited to be getting my hands dirty with something like this.

Since the project comes with a testing suite, I will run the tests to make sure nothing is broken. I will also do a cursory check to make sure the outputs are coming out the same. To prove performance increases, I will compare both overall runtime and relative runtime for the function in question, on both systems. I will focus mostly on speed, and not memory or other resource dimensions, though if I stumble onto something of interest I will include it.


In-line assembly Lab (SPO600 Lab 6)

In this lab, we inspect an open source project that uses in-line assembly language in C files. The open source project I have selected is ardour.

This project contains 43 instances of in-line assembly code (in .c, .h, and .cpp files). In one of these files, behaviour is defined for x86, ia64, s390, i386, and i586 systems. The file uses assembly to get a count of CPU cycles; for each architecture this is done in a different way, and the code decides which version to use with if/else logic. On platforms not covered, such as sparc, arm, and m68k, it gives a warning that the project does not support the platform yet.

[Image: Example of in-line assembly for x86_64]

Since the information being accessed here is tied to the CPU itself, I think it makes sense to use assembly (which depends on the architecture of the CPU). However, I did find one source saying that this particular use is out of date: nowadays there are ways to do this that are built in to compilers. If that is true, then perhaps the project is simply old enough that assembly was the practical choice at the time.
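To make the comparison concrete, here is my own sketch (not code copied from ardour) of the usual x86_64 cycle-counter idiom next to the compiler-provided intrinsic that I believe the source was referring to.

    // My own sketch, not code copied from ardour: reading the x86_64 timestamp
    // counter with inline assembly, and with the gcc/clang __rdtsc() intrinsic.
    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // provides __rdtsc() on gcc and clang

    static inline uint64_t cycles_inline_asm() {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return (static_cast<uint64_t>(hi) << 32) | lo;
    }

    int main() {
        uint64_t a = cycles_inline_asm();
        uint64_t b = __rdtsc();  // built in, no hand-written assembly needed
        printf("asm: %llu  intrinsic: %llu\n",
               static_cast<unsigned long long>(a),
               static_cast<unsigned long long>(b));
        return 0;
    }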

Comparing Open Source Projects (SPO600 Lab 1)

This lab is about comparing two open source projects with different licenses, and comparing the process of contributing to them. I will be comparing the Bootstrap web framework (under the MIT license) and Mozilla Firefox (under the Mozilla Public License) by analyzing one software “patch” from each.

Bootstrap

The issue and patch. From start to finish, four people were involved in this patch. It began with someone reporting a bug and other people verifying that it exists. The person who posted the bug also posted a fix, and others commented on it and raised any issues. One other person changed the code with their own commit. The issue was posted on May 23 and was resolved on July 15, a total of 63 days.

 

Mozilla Firefox

This is the bug fix I chose to analyze. From the posting of the bug to its resolution took 7 days in total and involved 3 people: the person working on the fix and two people reviewing it.

According to Mozilla, the official contribution process is:

  1. Build the project
  2. Find a bug
  3. Fix the bug
  4. Get code reviewed
  5. Respond to the review
  6. Automated testing

In this case, the patch does seem to follow these steps. The review involved a reviewer pointing out errors and requesting certain changes. A second reviewer then clarified that some of the issues the first reviewer raised were not problems with the code being submitted, but a version problem on the reviewer’s side. The process seems relatively quick, with enough collaboration to catch issues well.

Software Profiling Investigation (SPO600)

Today’s investigation is into the world of software profiling, with the open source profiling tool gprof. Gprof is a tool for Unix systems that analyzes the performance of another program. It does this by having instrumentation code inserted when that program is compiled, which lets it provide a great level of detail in its analysis.

I chose to run gprof with one of my programs made for lab 5 of SPO600. The program scales fake “sound samples” (random integers in this case). I prepared the program to run extra samples, for a longer run time and better profiling; with 1,000,000,000 samples, it takes about 53 seconds to run. I compiled with the -g -pg -O3 options on the gcc compiler. After compiling, once we run the program, we find an extra file in the directory called gmon.out.

Running gprof on our executable gave me this:

[Image: gprof output showing an empty profile]

Unfortunately, there isn’t any content in the profile. Some searching led me to recompile it differently to avoid a bug in gcc. Now one new line shows up:

[Image: gprof output after recompiling, with one line for main]

The bulk of the output is an explanation of what each statistic means.

The name is main, the function being profiled on this line. It seems that you can look at each function individually and see its impact; since there is only one function defined in this program, it takes up 100% of the time. And since no functions are called inside of main, self seconds and cumulative seconds are the same. The other categories (involving calls to the function) are left empty, perhaps because main is a special function and doesn’t get profiled in this way.
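If I wanted more than one line in the profile, I suspect the work would need to be split out of main. A hypothetical restructuring of my lab program (not the original code) might look like this:

    // Hypothetical restructuring of my lab program (not the original code):
    // with the scaling in its own function, gprof should attribute time to
    // scale_samples separately from main. The noinline attribute keeps gcc
    // from folding it back into main at -O3.
    #include <cstdint>
    #include <cstdlib>
    #include <cstdio>

    __attribute__((noinline))
    static void scale_samples(int16_t *samples, size_t count) {
        for (size_t i = 0; i < count; ++i)
            samples[i] = static_cast<int16_t>(samples[i] * 0.75f);
    }

    int main() {
        const size_t count = 100000000;  // placeholder sample count
        int16_t *samples = static_cast<int16_t *>(malloc(count * sizeof(int16_t)));
        for (size_t i = 0; i < count; ++i)
            samples[i] = static_cast<int16_t>(rand());

        scale_samples(samples, count);  // should appear as its own line in the profile

        printf("last sample: %d\n", samples[count - 1]);
        free(samples);
        return 0;
    }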

Algorithm Selection (SPO600 Lab 5)

In this lab we take a short C program that scales a set of random int16_t values, representing the amplitude of sound waves for audio encoding, by multiplying them by 0.75 (lowering the volume). This type of operation occurs often when working with audio, since the computer must do these calculations in the background.

We begin with the naive algorithm: taking each value and multiplying it by 0.75 using floating-point arithmetic. Performing this over 10,000,000 samples on the aarch64 server supplied by our school gave our entire program a runtime of 514 ms. Subtracting the setup and teardown, as well as unrelated code (which took 496 ms to run), we get 18 ms of runtime for the operations we care about.
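For reference, the core of the naive version looks roughly like this (a simplified sketch, not the exact lab code):

    // Simplified sketch of the naive approach (not the exact lab code):
    // scale each sample by 0.75 using a floating-point multiplication.
    #include <cstdint>
    #include <cstddef>

    void scale_naive(int16_t *samples, size_t count) {
        for (size_t i = 0; i < count; ++i)
            samples[i] = static_cast<int16_t>(samples[i] * 0.75f);
    }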

 

Let’s look at two alternative algorithms to see if we can speed things up:

 

Alternative Algorithm 1: Pre-scaled array

Since we are performing the same operation many times, and since we are likely to get repeated inputs and outputs, perhaps it would save time to do each possible calculation only once (about 65,000 calculations, one for each possible value of an int16_t). This way, instead of doing a new calculation for every sample, we can just look up the output value that we already know to be correct.
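A simplified sketch of the idea (not my exact implementation): build the table once, offsetting by 32768 so that negative samples map to valid array indices.

    // Simplified sketch of the pre-scaled array idea (not my exact implementation).
    // One table entry per possible int16_t value; the +32768 offset maps negative
    // samples onto valid array indices.
    #include <cstdint>
    #include <cstddef>

    static int16_t table[65536];

    void build_table() {
        for (int v = -32768; v <= 32767; ++v)
            table[v + 32768] = static_cast<int16_t>(v * 0.75f);
    }

    void scale_lookup(int16_t *samples, size_t count) {
        for (size_t i = 0; i < count; ++i)
            samples[i] = table[samples[i] + 32768];  // extra addition on every sample
    }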

Unfortunately, this did not result in a speedup. It actually slowed things down to a total of 42 ms (versus the original 18). I was expecting it to be faster, and I think it would have been, except for the problem of our int16_t values being signed. The adding and subtracting of an offset for the negative portion (since you cannot use a negative index into an array) may have slowed down my implementation here. This method also requires a fair bit of memory overhead (an array storing around 65,000 16-bit integers). After 10 million samples seemed to be slower, I also tested with 1 billion samples and found that it was still slower; the payoff didn’t seem to get better in the long game either.

 

Alternative Algorithm 2: Fixed-point arithmetic

Instead of using floating-point arithmetic to multiply by 0.75, we can use regular integer multiplication and bit-shifting to get the same result. Here, my algorithm multiplied each input sample by 6 and shifted the result to the right by 3 bits (truncating the last 3 bits, thus dividing by 2 three times). The end result is an integer value that is 3/4 (6/8) of the original.
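Per sample, the operation is roughly this (a sketch of the approach described above, not the exact lab code):

    // Sketch of the fixed-point approach described above: multiply by 6 and shift
    // right by 3, giving 6/8 = 0.75 of the original with no floating-point math.
    // Relies on the arithmetic right shift gcc performs for negative values.
    #include <cstdint>
    #include <cstddef>

    void scale_fixed_point(int16_t *samples, size_t count) {
        for (size_t i = 0; i < count; ++i)
            samples[i] = static_cast<int16_t>((samples[i] * 6) >> 3);
    }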

Now, this algorithm has given us a significant speedup (at least around the specific operations being measured) and is the best of the three. At 13 ms compared to the naive algorithm’s 18 ms, we are roughly 28% faster, with no extra memory overhead (unlike the pre-calculated array).

 

 

Are they fast enough to process CD data? At 44,100 samples per second and 2 channels, CD audio requires about 88,200 samples to be processed per second. Since even the slowest of the three handles 10 million samples in tens of milliseconds (hundreds of millions of samples per second), each algorithm is easily sufficient. I will note, however, that we should use the best algorithm regardless, since it won’t be the only program running on the machine.

 

Further Optimizations

As stated before, I think treating the values as unsigned integers may speed things up, especially in the pre-calculated array algorithm. The offset addition and subtraction in that version is repetitive, avoidable work; see the sketch below.
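One hypothetical way to do this (a sketch I have not tested): reinterpret each sample’s bit pattern as a uint16_t and use it directly as the table index, so the inner loop needs no offset arithmetic at all.

    // Hypothetical, untested sketch: index the table by the sample's raw 16-bit
    // pattern instead of adding an offset, so the inner loop is just a lookup.
    #include <cstdint>
    #include <cstddef>

    static int16_t table[65536];

    void build_table_unsigned() {
        for (int v = -32768; v <= 32767; ++v)
            table[static_cast<uint16_t>(v)] = static_cast<int16_t>(v * 0.75f);
    }

    void scale_lookup_unsigned(int16_t *samples, size_t count) {
        for (size_t i = 0; i < count; ++i)
            samples[i] = table[static_cast<uint16_t>(samples[i])];
    }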

Building a software package (SPO600 Lab 2)

Part 1

In this lab we download and build an open source software package and get it running on our own machine. The following are the steps and comments on the process.

  1. I chose nano as the software package to install, expecting it to be relatively small and straightforward to set up. After a bit of searching, I downloaded the compressed source code from the downloads page of the nano website.
  2. Since nano is meant for use on Linux systems, I decided to build it remotely on Seneca’s Matrix server using PuTTY. I tried to locate the makefile in order to build the program, but was confused by the various files with names like Makefile.am and Makefile.in.
  3. After seeking some assistance from my professor, I discovered that despite all the files with “makefile” in their names, none of them was quite what I was looking for. It was necessary to run a configure script to let the package recognize what environment it was in and create a Makefile accordingly.
  4. New problem: permissions. It turns out that unzipping the file on my Windows machine and then dropping it into my Matrix account mangled the file permissions. After unzipping it on Matrix instead…
  5. > ./configure
  6. > make
  7. Success!

An unexpectedly bumpy journey. Lessons learned:

  • Makefiles don’t come automatically; it may take some digging and configuring to get one to appear.
  • Be careful with unzipping.