Code optimization

klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Code optimization

Post by klavierspieler21 »

I coded my solution to crossflip in C++ but I didn't get the speed I expected...
I know my algorithm is asymptotically near optimal (I didn't try to work out how to solve a linear system over the Z_2 field in under O(n^3) time), but I don't know much about low-level optimization techniques. Anybody care to elaborate, or to point me to resources?

Thank you very much!
CodeX
Posts: 350
Joined: Fri Oct 17, 2008 5:28 pm

Post by CodeX »

The quickest trick I'd suggest is to turn on optimizations when compiling, such as -O3 with GCC, -ohaltrock with wcl386, or the equivalent on other compilers.
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

heh tried that already :P with -O3 it's still taking me slightly more than 10 mins at lvl 220 something.. compiler can't help if i'm incompetent lol
CodeX
Posts: 350
Joined: Fri Oct 17, 2008 5:28 pm

Post by CodeX »

it can try :P Personally, if I want to bust out some speed I write code in C/ASM, but a good start is to try to reduce the number of function calls (some of these might be replaced with inline code by the compiler anyway). Here is a bit of a book on it which might help (I haven't read it). If you do try the ASM route you might actually end up with slower code if you don't know about your processor's features such as pipelining, caching, stalls and the faster alternatives to things, so you'd be better off with C if you aren't familiar with those. I can't imagine you'd get much faster than 10 minutes just by tinkering with the compilation, so you might want to look for alternative approaches: maybe parallel processing, if that's available to you, so you can spread the load over several threads, or, if you're using a generator for exhaustive searching, optimizing it so that the more probable attempts are generated first.
CodeX
Posts: 350
Joined: Fri Oct 17, 2008 5:28 pm

Post by CodeX »

I don't think optimizing your method will be enough to achieve level 643 in a reasonable amount of time unless you're doing I/O operations all over the place; you probably need to change your approach slightly.

I am also wondering why you haven't got the last six levels of Runaway Robot done yet. Python can do 513 without optimizations in 10 minutes and C can do it fairly crudely in 0.008, so unless your (I assume) C++ solution took 19 hours on 507 you should be able to do it?
Zeta
Posts: 62
Joined: Thu Apr 16, 2009 3:37 pm

Post by Zeta »

How big are your matrices? I have an example here with 40000 variables and 0.12% of the entries in the matrix equal to one. Using Gaussian elimination I can solve it single-threaded in under 4 minutes (Java, 3 GHz) without special optimizations for sparse matrices.
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

I'm at 1.87 GHz, -O3 C++; lvl 224 is 91x91, so 8281 variables (I kept the duds for simplicity), and I'm taking over 10 minutes. Something must be wrong... =[
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

I haven't finished Runaway because I'm lazy :P
I coded it up after my introductory computer science course at Waterloo (I had never programmed before, unless you count HTML lol). We used Scheme as an intro language. Clearly I did not understand memory management and call stacks after 4 months... because speed isn't where I'm failing. My Scheme code was busting out gigabytes of my RAM and crashed my computer xD
CodeX
Posts: 350
Joined: Fri Oct 17, 2008 5:28 pm

Post by CodeX »

:o My stack probably didn't go over 40.5 KiB with all calls and data structures (under 400 calls deep), but I think I've just about got to grips with managing marvellous mutable main memory modules in C. I guess you've got your memory issues sorted out now for C++? I think if you get up to 513 your rank will go up, since you'd be joint first on the game, if that kind of thing motivates you. If you didn't want to change your code, you might be able to get some running time on a 64-bit platform with 4GB+ of RAM, if your memory use isn't going up too rapidly :P
Zeta
Posts: 62
Joined: Thu Apr 16, 2009 3:37 pm

Post by Zeta »

You do realize that the addition and multiplication operations in F_2 are also the bitwise operators XOR and AND, respectively? Don't do OO design overkill. Every entry in the matrix is represented by one bit. For maximum performance, make use of SSE2 (pxor) to handle 128 bits in one instruction. On the other hand, if you have enough RAM, even one byte per matrix entry is a possible design choice, which leads to a simpler implementation.
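
For example, the byte-per-entry design boils down to something like this (a rough, untested sketch rather than my actual code):

Code: Select all

#include <algorithm>
#include <cstddef>
#include <vector>

// One byte per F_2 entry, holding 0 or 1. Row operations never need a carry,
// so "add row p to row i" is just a XOR of every entry.
typedef std::vector<std::vector<unsigned char> > Matrix;

// Gauss-Jordan elimination of an augmented matrix (rows x cols) over F_2.
void gaussF2(Matrix& m)
{
    const std::size_t rows = m.size();
    const std::size_t cols = m[0].size();
    std::size_t pivotRow = 0;

    for (std::size_t col = 0; col + 1 < cols && pivotRow < rows; ++col) {
        // find a row at or below pivotRow with a 1 in this column
        std::size_t r = pivotRow;
        while (r < rows && m[r][col] == 0) ++r;
        if (r == rows) continue;                 // no pivot here, free variable
        std::swap(m[pivotRow], m[r]);

        // clear this column in every other row by XOR-ing in the pivot row
        for (std::size_t i = 0; i < rows; ++i)
            if (i != pivotRow && m[i][col])
                for (std::size_t j = 0; j < cols; ++j)
                    m[i][j] ^= m[pivotRow][j];
        ++pivotRow;
    }
}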
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

@CodeX: Oh, I didn't know about the joint firstness lol. Yeah, one day I'll recode that thing and get to 513 easy peasy :) In fact, I didn't even code this version myself. It was exam time and my friends wanted me to give them practice questions, so I broke the problem down into individual functions, and by the end of the day I had assembled everything and had it running! It was an awesome day.

@Zeta: Yes, I did realize addition is XOR, but I don't see the point of multiplication in this problem (maybe I'm missing something...). My code is actually less than 175 lines long without the comments, so yeah, no OOP for sure! Here's something I'm not too knowledgeable about: I first tried to use vectors of bools to take advantage of the one-bit-per-cell thing, but there's something about them that makes the code very slow. Simply changing all the bools to chars sped the code up by a factor of 3-4, so now I'm using vectors of vectors of chars to represent the matrix I have to Gaussian-eliminate. I should have enough RAM to support the 8x increase in memory... it's more the speed that I don't understand how to get. In any case, I still have to figure out how to HTTP POST lol..
CodeX
Posts: 350
Joined: Fri Oct 17, 2008 5:28 pm

Post by CodeX »

XOR only matches binary addition when no two corresponding bits are both 1, since that would require a carry, which XOR doesn't provide, so whether you can substitute an ADD with an XOR depends on what you are doing. E.g.
01010100 +
10101000 =
11111100
which is the same as the XOR, but if any pair of corresponding bits are both set then
010101010 +
001010110 =
100000000
which isn't the same as the XOR of the two (11111100). As for multiplication, a shift in binary is the same as multiplying/dividing by 2, just as a shift in decimal is the same as multiplying/dividing by 10, which can be handy for multiplying/dividing by powers of two, e.g. a / 8 == a >> 3. Multiplication on 80x86 integer values isn't very nice as it uses two registers for output instead of one and takes 10-11 cycles, whereas a shift only takes 1-4 and uses a single register; combined with the fact that an add can take 1-3 cycles, it can be quicker to do something like (a << 4) + (a << 2) instead of a * 20. Here are a couple of handy identities to avoid the troublesome multiplication, division and modulus when n is a power of two:

Code: Select all

a * n = a << log2(n)
a / n = a >> log2(n)
a % n = a & (n - 1)
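If you want to convince yourself, a throwaway check like this will do (the value is arbitrary; note these identities assume unsigned/non-negative values):

Code: Select all

#include <assert.h>

int main(void)
{
    unsigned a = 1234;                       /* any unsigned value works */
    assert(a * 8  == a << 3);                /* multiply by a power of two */
    assert(a / 8  == a >> 3);                /* divide by a power of two */
    assert(a % 8  == (a & 7));               /* modulus by a power of two */
    assert(a * 20 == (a << 4) + (a << 2));   /* 20 = 16 + 4, shift-and-add */
    return 0;
}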
As for your speed issue, it helps to understand the low-level operations your computer performs (i.e. learn assembly language). Your computer (an 80x86 processor, I assume) normally works with units of memory in sizes of 1, 2, 4 and maybe 8 bytes (if you use a 64-bit OS), so the process of adding a series of these units is as simple as:

Code: Select all

char myArray[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
long Sum = 0;
for(int i = 0; i < 16 /* sizeof(myArray) / sizeof(myArray[0])*/; i++){
	Sum += myArray[i];
}
whereas when you use single bits it has to extract each value first:

Code: Select all

char myBits[] = {0xaa,0x55};
long Sum = 0;
for(int i = 0; i < 16 /* sizeof(myBits)*8 */; i++){
	Sum += (myBits[i/8] >> (7 - (i % 8))) & 1; // we have to extract the bit first
}
as you can see, that involves a lot more operations and takes a lot more time to get the sum of 16 units of data. The choice between the two is a time/memory tradeoff: using bits means you only use ⅛ of the memory, but it takes significantly longer to execute. Unless you have to work within memory constraints, it's best to operate on the largest unit of memory your processor can handle, so instead of this:

Code: Select all

char sthToZero[] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
for(int i = 0; i < 16 /*sizeof(sthToZero)*/; i++){
	sthToZero[i] ^= sthToZero[i];
}
you would be better off having something like this:

Code: Select all

long *ptrLong = (long*)sthToZero;
for(int i = 0; i < 4 /*sizeof(sthToZero) / sizeof(long)*/; i++){
	ptrLong[i] ^= ptrLong[i];
}
The second one takes ¼ the number of operations, and on most PCs it is actually faster to access a 32-bit value in one go than an 8-bit one. Instruction sets like MMX, 3DNow! and SSE provide SIMD (Single Instruction, Multiple Data) operations which let you speed up work on vectors/arrays of data; they are available through libraries (possibly), compiler optimizations (GCC's flags are -mmmx, -m3dnow, -msse, -msse2 etc.) or (inline) assembly. You might want to check what your processor supports so you can try to benefit from them; most 80x86 CPUs nowadays support at least SSE2 (both Intel and AMD).
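
If you do want to play with SSE2 directly, the intrinsics route looks roughly like this (an untested sketch with made-up names; it assumes 16-byte aligned buffers whose length is a multiple of 16, and needs -msse2):

Code: Select all

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>

// XOR the bytes of src into dst, 16 bytes per pxor instruction.
void xorBlocks(unsigned char* dst, const unsigned char* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; i += 16) {
        __m128i a = _mm_load_si128((const __m128i*)(dst + i));
        __m128i b = _mm_load_si128((const __m128i*)(src + i));
        _mm_store_si128((__m128i*)(dst + i), _mm_xor_si128(a, b));
    }
}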

The process of optimizing code can be a grotty business if you get pedantic about it, so you're probably best off sticking with catering to your compiler and its optimizations instead of taking up assembly to do it; it's quite likely you'd actually end up with slower code in assembly, since the compiler knows about architecture-specific speed issues.

ramble ramble ramble ramble ramble ramble ramble
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

Holy fuck I just cut down my running time for lvl150 from >100s to 17s and my memory usage from megabytes to kilobytes just by setting the right compiler flags and changing my chars to longs (32 bits)!! I guess with those flags on, it effectively does the multiple bit operations that I wanted.

There is so much to learn. Thank you CodeX, Zeta and hacker.org

EDIT: Wait no.. memory usage is the same.. I misread lol... and lvl232 is still taking 4 mins (better than 15+ mins though)
Zeta
Posts: 62
Joined: Thu Apr 16, 2009 3:37 pm

Post by Zeta »

@CodeX:
There is no carry in F_2 (the finite field with two elements). You can pack a chunk of a matrix row sizeof(int)*8 entries long into a single int. Later you have to do lots of operations like 'add one row to the other'; this can be done with XOR operations on the integer values, looping over the array of ints that represents a row of the matrix, so you get SIMD practically for free. Compared to using one byte per matrix entry, the performance gain was a factor of 10 for me.
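
In code the row addition then looks roughly like this (a sketch, not my actual code; it assumes each row is padded up to a whole number of ints):

Code: Select all

#include <cstddef>

// A matrix row packed into unsigned ints: bit (col % WORD_BITS) of
// row[col / WORD_BITS] is the entry in column col.
const std::size_t WORD_BITS = sizeof(unsigned int) * 8;

// "Add one row to the other" over F_2: one XOR handles WORD_BITS entries.
void addRow(unsigned int* dst, const unsigned int* src, std::size_t words)
{
    for (std::size_t w = 0; w < words; ++w)
        dst[w] ^= src[w];
}

// Reading a single entry, e.g. while searching for a pivot:
bool getEntry(const unsigned int* row, std::size_t col)
{
    return (row[col / WORD_BITS] >> (col % WORD_BITS)) & 1u;
}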

@klavierspieler21:
1) You're right. No multiplication in F_2 required here.

2) -mtune=native is another compiler switch you could try.

3) You can get information about software optimization here.

4) vector<bool> is optimized for memory usage. As you experienced, the runtime penalty is severe compared to vector<char>. I suspect most of your performance issues stem from using the STL; simply use char[].
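
To illustrate point 4, a flat char buffer with manual indexing looks roughly like this (just a sketch, the names are made up):

Code: Select all

#include <cstddef>
#include <cstring>

// One flat, contiguous char buffer instead of vector<vector<bool>>:
// entry (r, c) lives at m[r * cols + c].
char* allocMatrix(std::size_t rows, std::size_t cols)
{
    char* m = new char[rows * cols];
    std::memset(m, 0, rows * cols);   // start with an all-zero matrix
    return m;                         // caller releases it with delete[]
}

inline char& entry(char* m, std::size_t cols, std::size_t r, std::size_t c)
{
    return m[r * cols + c];
}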
klavierspieler21
Posts: 16
Joined: Fri Jul 13, 2007 12:21 am
Location: Waterloo

Post by klavierspieler21 »

@Zeta: Yeah, I think I'm inherently really lazy lol... I was hoping not to have to pack multiple bools into an int. It becomes a hassle when I want to find the right pivot and check for stuff like that, and on top of that there's making sure the padding is handled correctly when your data length isn't a multiple of the number of bits in an int. I hope to be joining club 643 soon; working from 8 to 6 kinda leaves little time for this :). I actually coded up crossflip a few months ago in Scheme, and there I did pack up the bits, because it supports unbounded integers, so it was quick and easy and under 150 lines. The problem is that I only had an interpreter; Scheme isn't popular enough to have a quality, easy-to-use, standard compiler.

One last question: those SSE optimizations simply take advantage of doing multiple independent operations at a time, right? So it's XORing (let's say) 4 bytes at the same time instead of 1 byte at a time (if I'm using chars). If that's the case, why isn't this part of normal compilation/optimization? I don't understand how this wouldn't be useful.