
Posted: Wed Aug 11, 2010 2:02 pm
by Zeta
gcc is conservative about the instruction set it uses. When compiling for 32 bit you get i386-compatible code. You have to provide the right switches:

-O3 -march=native -ftree-vectorizer-verbose=2

-O3 enables, among other things, the vectorizer (-ftree-vectorize)
-march=native enables every instruction set extension available on your computer
-ftree-vectorizer-verbose=2 gives some information about how successful the vectorizer was
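
For illustration, the kind of loop the vectorizer handles well looks like this (just a made-up example, not anyone's actual solver code):

// xor_rows.cpp -- compile with: g++ -O3 -march=native -ftree-vectorizer-verbose=2 -c xor_rows.cpp
// A plain loop over 64-bit words; with the switches above the vectorizer
// will normally report that this loop was vectorized.
#include <cstdint>
#include <cstddef>

void xor_rows(uint64_t *dst, const uint64_t *src, std::size_t words)
{
    for (std::size_t i = 0; i < words; ++i)
        dst[i] ^= src[i];   // XOR is addition mod 2, 64 cells per word
}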

Posted: Mon Nov 15, 2010 10:14 pm
by MyNameIsAlreadyTaken
How fast are your algorithms?


I've gotten to lvl 489 by now, and it took me approx. 10 minutes to calculate its solution. Assuming that my algorithm runs in O(n^3) and that 643 is twice the size of 489, it is going to take at least 1.5 hours and 1.2 GB of RAM to complete just the last one :shock:


Is it possible to optimize my code a lot more, or do I just have to run my script every day for two weeks?

Posted: Tue Nov 16, 2010 9:10 am
by tails
Hi, it took me 18 seconds to solve the last level.
I think it can be done much quicker with a better algorithm and a better computer.

Posted: Tue Nov 16, 2010 3:00 pm
by MyNameIsAlreadyTaken
18 seconds? Holy shit oO


I used Gauss on an average computer [dual-core 2.2 GHz, 64-bit Ubuntu 10.10] and may be able to double the speed if I use threads [gonna try that one out next]. I guess you used a way better algorithm?


Edit: Ok, threading seems to slow down my program :[ Looks like I can't optimize anything without having to use a completely different approach.

Posted: Tue Nov 16, 2010 3:48 pm
by tails
Threading sounds good. I couldn't do that because I used a laptop with Celeron M (single core, single thread).
Also, I didn't even use SSE (just because I was lazy :-)).
So, I'm pretty sure it can be done much quicker.

Posted: Wed Nov 17, 2010 3:06 am
by Zeta
Threading is not easy if you want to gain speed. Multi-core processors have a 1st-level cache for every core and a shared 2nd-level cache. Several copies of a cache line may exist in different 1st-level caches. In this case a coherence protocol ensures some form of consistency between these copies. The necessary memory operations may slow down your program considerably.

In your case the data should be aligned to cache-line boundaries, and different threads should not access data in the same cache line: each thread should work on a vertical slice of the matrix.
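
To make the slice idea concrete, here is a minimal sketch (made-up names, not tuned solver code): each thread owns a contiguous block of 64-bit words of every row, rounded to whole cache lines, so no two threads ever write into the same line.

// One elimination step of Gauss mod 2 with the XOR work split into
// vertical slices (column blocks), one per thread. Spawning threads per
// step is only for illustration; a real solver would reuse a thread pool.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

typedef std::vector<std::vector<uint64_t> > Matrix;   // bit-packed rows

// rows_to_fix: rows (other than the pivot row) that have a 1 in the
// pivot column, collected in one pass before the threads are started.
static void xor_slice(Matrix &m, const std::vector<std::size_t> &rows_to_fix,
                      std::size_t pivot_row, std::size_t first, std::size_t last)
{
    for (std::size_t r : rows_to_fix)
        for (std::size_t w = first; w < last; ++w)
            m[r][w] ^= m[pivot_row][w];                // addition mod 2
}

static void eliminate_step(Matrix &m, const std::vector<std::size_t> &rows_to_fix,
                           std::size_t pivot_row, unsigned nthreads)
{
    const std::size_t words = m[pivot_row].size();
    // round the slice width up to whole cache lines (8 x 8-byte words)
    const std::size_t chunk =
        ((words + nthreads - 1) / nthreads + 7) & ~std::size_t(7);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t first = t * chunk;
        const std::size_t last  = std::min(words, first + chunk);
        if (first >= last) break;
        pool.emplace_back(xor_slice, std::ref(m), std::cref(rows_to_fix),
                          pivot_row, first, last);
    }
    for (auto &th : pool)
        th.join();
}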

I believe it's possible to get the time for the last level down to 0.1s with an optimized (multithreaded, SIMD) C implementation. But always optimize the algorithm first and then the implementation :wink:.

Posted: Sun May 08, 2011 2:39 am
by helly0d
How does your algorithm run in O(n^3)? Mine is in O(n^6), or by n do you mean width*length? I too use Gaussian elimination, with some freak combination of C and ASM, and at 224 it takes about 1 min to solve, but at 460 it takes 1 h and 10 min. I also tried playing with flags. How could you get 128 bits in one instruction? I can only manage 32 bits. Any hints for optimization from this point onward?

Posted: Sat May 21, 2011 4:03 pm
by MyNameIsAlreadyTaken
Yeah, by n I meant length * height.

Solving #224 takes 4.78s on my computer [not counting the time to fetch the level and send the solution]. Each instruction manipulates 64 bits at a time, since I use unsigned long long as the type for storing the data [which speeds up my code a lot since I use a 64-bit OS].

My code is basically an implementation of Gauss specialized to work mod 2. I can't tell why it runs faster than yours; the only compiler optimization I used is -O3 [and I didn't use any ASM].
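
It's not my literal code, but the core of it looks roughly like this (the right-hand side of each equation would be carried along as one extra bit, left out here for brevity):

// Sketch of Gaussian elimination over GF(2) with bit-packed rows.
#include <cstddef>
#include <utility>
#include <vector>

typedef unsigned long long u64;        // 64 cells per word
typedef std::vector<u64> Row;          // one packed equation

void gauss_mod2(std::vector<Row> &m, std::size_t cols)
{
    std::size_t pivot_row = 0;
    for (std::size_t col = 0; col < cols && pivot_row < m.size(); ++col) {
        const std::size_t word = col >> 6;
        const u64 bit = 1ULL << (col & 63);

        // find a row at or below pivot_row with a 1 in this column
        std::size_t r = pivot_row;
        while (r < m.size() && !(m[r][word] & bit))
            ++r;
        if (r == m.size())
            continue;                  // no pivot: free variable, skip column
        std::swap(m[pivot_row], m[r]);

        // XOR the pivot row into every other row that has a 1 in this
        // column; one XOR handles 64 matrix cells at once. All words
        // before 'word' are already zero in the pivot row, so start there.
        for (std::size_t i = 0; i < m.size(); ++i)
            if (i != pivot_row && (m[i][word] & bit))
                for (std::size_t w = word; w < m[i].size(); ++w)
                    m[i][w] ^= m[pivot_row][w];
        ++pivot_row;
    }
}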

Posted: Sat Jul 02, 2011 10:29 am
by bsguedes
On my computer it takes 15 minutes to solve level 220.

I'm using Java, and I'm solving the linear system using Gauss's method. My solution is very raw; I've made no simplifications in my solver. I know that we are dealing with a sparse matrix, but I don't know how I can use this to improve my solver's time. (yeah, I'm not a mathematician lol)

Posted: Sat Jul 02, 2011 3:34 pm
by Adzeye
Gauss is a good start, of course, but what you have to remember is to apply it to the context of this puzzle. We have huge matrices of 0's and 1's, but only a tiny fraction of the entries are actually '1', and those are what you are interested in. Most of the time you spend iterating through the matrix does not advance you towards a solution. So ask yourself: how can you make it so that you only read/write your matrix when it actually serves a purpose?

This puzzle has a ton of small optimizations you can make; I started at about 15 minutes for a level around 220, and in the end it took me 15 minutes for lvl 643. Benchmarking the different steps your solver takes is quite important to know what actually takes time. I still have a few optimizations that could speed the thing up by about 4x, but my solver already finished the game before I could implement them :P
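
For the benchmarking, even a crude made-up helper like this (C++ here, but the idea is the same in any language) is enough to see where the time goes:

// Wrap each phase of the solver and print how long it took, so you know
// whether parsing, elimination or substitution dominates.
#include <chrono>
#include <cstdio>

template <typename Fn>
void timed(const char *label, Fn &&fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("%-16s %8.1f ms\n", label, ms);
}

// usage (function names are hypothetical):
//   timed("build matrix", [&]{ build_matrix(board, m); });
//   timed("elimination",  [&]{ gauss_mod2(m, cols); });
//   timed("substitution", [&]{ back_substitute(m, solution); });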

Posted: Sat Jul 02, 2011 5:21 pm
by bsguedes
Thanks Adzeye !

I've thought about what you said and tried to change the data structure of my matrix. I was using a two-dimensional array full of zeros, and I'm now trying an implementation using linked lists. I thought the second one would be better, but it fails horribly.

I think I'm going to follow your advice about benchmarking the different steps of my solver... that should give me a hint where I need to improve my algorithm. Thanks again!

PS: level 236: 56 minutes. And level 237 destroys my allocated memory. Sweet.

Posted: Sat Jul 02, 2011 5:52 pm
by Adzeye
I think that a bitfield representation with bitwise operations is still the fastest option. The idea is more that, among your thousands of "equations", both during the elimination and substitution steps of Gaussian elimination, only a few will contain the parameter/variable you are trying to do something with.
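
To put it concretely (made-up helper, not my actual code): with packed rows, checking whether an equation contains a given variable is a single AND, so picking out the handful of rows that actually need an XOR costs almost nothing, even among thousands of equations.

// Return the indices of the equations that contain a 1 in column 'col'.
#include <cstddef>
#include <vector>

typedef unsigned long long u64;
typedef std::vector<u64> Row;          // one bit-packed equation

std::vector<std::size_t> rows_containing(const std::vector<Row> &m, std::size_t col)
{
    const std::size_t word = col >> 6;
    const u64 bit = 1ULL << (col & 63);
    std::vector<std::size_t> hits;
    for (std::size_t r = 0; r < m.size(); ++r)
        if (m[r][word] & bit)          // one AND per equation
            hits.push_back(r);
    return hits;                       // typically only a handful of rows
}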

Posted: Mon Jul 04, 2011 1:53 pm
by bsguedes
I innocently thought that implementing the Gauss method with all the adaptations needed for this problem would be enough to reach level 643 (like Oneofus did, for example). However, it seems it is not so simple.

I've found the slowest part of my code, and it is the simplest portion of it... I cannot see how I can reduce the complexity of this little set of lines while still solving the puzzle... I guess I should think of a completely different approach :(

Posted: Sun Aug 28, 2011 1:26 am
by compudemon
I switched my code over to use BigInteger; it seems to have cut my memory use by a factor of 16 for now, and it should improve further as the memory needed for the matrix itself grows compared to the other variables. Basically I am just converting the matrix to an augmented identity matrix. It is so nice that binary ops are mathematically valid, so long as you obey the commutative rules.

Posted: Sat Sep 03, 2011 1:58 am
by compudemon
Java is slow.
C++ is full of memory pointer hassles.

So I converted all my code to C++,
created my own classes for bigints, stringbuffer, stringbufferarray,
as any true hacker might do.
Since I have no idea how to use sockets, I got libcurl.

The bigints class is inlined and optimized with no bounds checking,
basically an array of _int64 arrays.

setbit is data[index >> 6] |= 1LL<<(index & 63);
clearbit is data[index >> 6] &= ~(1LL<<(index & 63));
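
The rest of the bit operations are along the same lines (just sketching the idea, with long long standing in for _int64):

// test a bit, and XOR one bigint row into another, 64 bits per operation
static inline bool testbit(const long long *data, int index)
{
    return (data[index >> 6] >> (index & 63)) & 1;
}

static inline void xorrow(long long *dst, const long long *src, int words)
{
    for (int i = 0; i < words; i++)
        dst[i] ^= src[i];              // whole-row addition mod 2
}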

So far it's running 5~10x faster than Java, and it seems to have a memory leak somewhere.
Not a very impressive speed increase considering all the inlining and optimizing.
I may just stay with Java and its very impressive JIT compiler for other challenges.
C++ sucks with all the pointers.