Code optimization
gcc is conservative about the instruction set it uses. When compiling for 32 bit you get i386-compatible code. You have to provide the right switches:
-O3 -march=native -ftree-vectorizer-verbose=2
-O3 enables, among other things, the vectorizer (-ftree-vectorize)
-march=native enables every instruction set extension available on your computer
-ftree-vectorizer-verbose=2 gives some information about how successful the vectorizer was
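For example, a complete build line could look like this (solver.c is just a placeholder name):

gcc -O3 -march=native -ftree-vectorizer-verbose=2 -o solver solver.c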
- MyNameIsAlreadyTaken
How fast are your algorithms?
I got to lvl 489 by now and it took me approx 10 minutes to calculate its solution. Assuming that my algorithm runs in O(n^3) and that 643 is twice the size of 489, it is going to take at least 1 1/2 hours (2^3 = 8 times as long) and 1.2GB of RAM to complete just the last one :shock:
Is it possible to optimize my code a lot more, or do I just have to run my script every day for two weeks?
- MyNameIsAlreadyTaken
18 seconds? Holy shit oO
I used Gauss on an average computer [dual-core 2.2 GHz, 64-bit Ubuntu 10.10] and may be able to double the speed if I use threads [gonna try that one next]. I guess you used a way better algorithm?
Edit: Ok, threading seems to slow down my program :[ Looks like I can't optimize anything without switching to a completely different approach.
Threading is not easy if you want to gain speed. Multi-core processors have a 1st level cache for every core and a shared 2nd level cache. Several copies of a cache line may exist in different 1st level caches. In this case a coherence protocol ensures some form of consistency between these copies. The necessary memory operations may slow down your program considerably.
In your case the data should be aligned to cache-line boundaries and different threads should not access data in the same cache line: each thread should work on a vertical slice of the matrix.
I believe it's possible to get the time for the last level down to 0.1s with an optimized (multithreaded, SIMD) C implementation. But always optimize the algorithm first and then the implementation.
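To make the vertical-slice idea concrete, here is a minimal pthreads sketch under my own assumptions (matrix size, bit layout, and names are placeholders, not anyone's actual solver): the main thread decides which rows need the pivot XORed in, and each worker only writes inside its own word range, so two workers never touch the same cache line.

#include <stdint.h>
#include <pthread.h>

enum { NROWS = 4096, NWORDS = 64 };            /* example matrix size */

/* bit-packed rows, 64 columns per word, aligned to a 64-byte cache line */
static uint64_t rows[NROWS][NWORDS] __attribute__((aligned(64)));
static int      needs_elim[NROWS];             /* filled in serially by main */
static int      pivot_row;

struct slice { int word_lo, word_hi; };        /* this worker's columns, [lo, hi) */

static void *xor_slice(void *arg)
{
    const struct slice *s = arg;
    for (int r = pivot_row + 1; r < NROWS; r++) {
        if (!needs_elim[r])
            continue;
        for (int w = s->word_lo; w < s->word_hi; w++)
            rows[r][w] ^= rows[pivot_row][w];  /* XOR is addition mod 2 */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    struct slice s[2] = { { 0, NWORDS / 2 }, { NWORDS / 2, NWORDS } };

    pivot_row = 0;
    for (int r = pivot_row + 1; r < NROWS; r++)    /* decide once, before the threads run: */
        needs_elim[r] = (rows[r][0] >> 63) & 1;    /* assumes the pivot bit sits in word 0 */

    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, xor_slice, &s[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

The slice boundary (NWORDS / 2 words = 256 bytes) is a multiple of the 64-byte cache line, so the workers' writes never share a line; compile with something like gcc -O3 -pthread.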
How does your algorithm run in O(n^3)? Mine is in O(n^6), or by n do you mean width*length? I too use Gaussian elimination, with some freak combination of C and ASM, and at 224 it takes about 1 min to solve, but at 460 it takes 1h and 10 min. I still tried to play with flags. How could you get 128 bits in one instruction? I can only do 32 bits. Any hints for optimization from this point on?
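For reference, one way to touch 128 bits per instruction is SSE2 through compiler intrinsics instead of hand-written ASM. A minimal sketch, with a hypothetical function name and assuming the packed row length is an even number of 64-bit words:

#include <emmintrin.h>   /* SSE2 intrinsics; enable with -msse2 or -march=native */
#include <stdint.h>

/* XOR one bit-packed row into another, 128 bits (two 64-bit words) per operation */
static void xor_row_sse2(uint64_t *dst, const uint64_t *src, int nwords)
{
    for (int w = 0; w < nwords; w += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + w));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + w));
        _mm_storeu_si128((__m128i *)(dst + w), _mm_xor_si128(a, b));
    }
}

With 16-byte-aligned rows the unaligned loads/stores can be swapped for _mm_load_si128/_mm_store_si128, and this works on a 32-bit build too, since the XOR happens in the XMM registers rather than the 32-bit general-purpose ones.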
- MyNameIsAlreadyTaken
Yeah, by n I meant length * height.
Solving #224 takes 4.78s on my computer [not counting the time to get the level and send the solution]. In every instruction I manipulate 64 bits at the same time by using unsigned long long as the type for storing the data [which improves my code a lot since I use a 64-bit OS].
My code is basically an implementation of Gauss specialized to work mod 2. I can't tell why it runs faster than yours; the only compiler optimization I used is -O3 [and I didn't use any ASM].
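For anyone wondering what "Gauss specialized to work mod 2" on packed 64-bit words can look like, here is a rough sketch under my own assumptions (flat row-major storage, the right-hand side kept as a separate bit per row); it is only an illustration, not the poster's actual code:

#include <stdint.h>

/* Forward elimination over GF(2). rows holds nrows bit-packed rows of
 * nwords 64-bit words each; rhs[r] is the right-hand-side bit of row r.
 * Back-substitution would follow the same pattern afterwards. */
static void gauss_mod2(uint64_t *rows, int *rhs, int nrows, int ncols)
{
    int nwords = (ncols + 63) / 64;
    int row = 0;

    for (int col = 0; col < ncols && row < nrows; col++) {
        int word = col >> 6;
        uint64_t mask = 1ULL << (col & 63);

        /* find a row at or below 'row' with a 1 in this column */
        int p = row;
        while (p < nrows && !(rows[p * nwords + word] & mask))
            p++;
        if (p == nrows)
            continue;                          /* free column, no pivot here */

        /* swap the pivot row into place */
        for (int w = 0; w < nwords; w++) {
            uint64_t tmp = rows[row * nwords + w];
            rows[row * nwords + w] = rows[p * nwords + w];
            rows[p * nwords + w] = tmp;
        }
        int tb = rhs[row]; rhs[row] = rhs[p]; rhs[p] = tb;

        /* XOR the pivot row into every later row that has a 1 in this column;
         * each 64-bit XOR handles 64 matrix entries at once */
        for (int r = row + 1; r < nrows; r++) {
            if (rows[r * nwords + word] & mask) {
                for (int w = word; w < nwords; w++)
                    rows[r * nwords + w] ^= rows[row * nwords + w];
                rhs[r] ^= rhs[row];
            }
        }
        row++;
    }
}

The inner loop is just word-wide XORs, which a compiler already handles well at -O3.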
On my computer it takes 15 minutes to solve level 220.
I'm using Java, and I'm solving the linear system using Gauss's method. My solution is very raw, I've made no simplifications in my solver. I know that we are dealing with a sparse matrix, but I don't know how I can use this to improve my solver's time. (yeah I'm not a mathematician lol)
Gauss is a good start, of course, but what you have to remember is to apply it to the context of this puzzle. We have huge matrices of 0's and 1's, but only a tiny fraction of them is actually a '1', which is what you are interested in. Most of the time you spend iterating through the matrix does not advance you towards a solution. So ask yourself: how can you make it so that you only read/write your matrix when it actually serves a purpose?
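One concrete way to "only touch the 1s", assuming the rows are bit-packed into 64-bit words (a sketch with a made-up callback interface): skip zero words entirely and walk only the set bits of the rest, e.g. with GCC's __builtin_ctzll.

#include <stdint.h>

/* Call visit(col) for every 1-bit in a packed row; zero words cost one test,
 * zero bits cost nothing. */
static void for_each_set_column(const uint64_t *row, int nwords,
                                void (*visit)(int col))
{
    for (int w = 0; w < nwords; w++) {
        uint64_t bits = row[w];
        while (bits) {
            int b = __builtin_ctzll(bits);   /* index of the lowest set bit */
            visit(w * 64 + b);
            bits &= bits - 1;                /* clear that bit */
        }
    }
}

The same principle applies inside the elimination itself: test the one pivot bit of a row before deciding whether to XOR, instead of scanning the whole row.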
This puzzle has a ton of small optimizations you can make. I started with about 15 minutes for a level around 220, and it took me 15 minutes for lvl 643. Benchmarking the different steps your solver takes is quite important for knowing what takes time. I still have a few optimizations that could speed the thing up by about 4x, but my solver already finished the game before I could implement them.
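A simple way to time the individual steps, sketched here with clock_gettime and placeholder step names (link with -lrt on older glibc):

#include <stdio.h>
#include <time.h>

/* wall-clock seconds; coarse, but fine for per-step benchmarking */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void)
{
    double t0 = now();
    /* build_matrix();  -- placeholder step */
    double t1 = now();
    /* eliminate();     -- placeholder step */
    double t2 = now();
    printf("build: %.3fs  eliminate: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}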
Thanks Adzeye!
I've thought about what you've said and I tried to change the data structure of my matrix. I was using a two-dimensional array full of zeros, and I'm now trying an implementation using linked lists. I thought the second one would be better, but it fails horribly.
I think I'm going to follow your advice about benchmarking the different steps of my solver... that should give me a hint where I need to improve my algorithm. Thanks again!
PS.: level 236: 56 minutes. And level 237 destroys my allocated memory. Sweet.
I had innocently thought that implementing the Gauss method with all the adaptations needed for this problem would be enough to reach level 643 (like Oneofus, for example). However, it seems it is not so simple.
I've found the slowest part of my code, and it is the simplest portion of it... I cannot see how I can reduce the complexity of this little set of lines while still solving the puzzle... I guess I should think of a completely different approach.
I switched my code over to use BigInteger. It seems to have cut my memory use by a factor of 16 for now, but the ratio should improve as the memory needed for the matrix itself grows compared to the other variables. Basically I am just converting the matrix to an augmented identity matrix. It is nice that binary ops are mathematically valid here, so long as you obey the commutative rules.
Java is slow
C++ is full of memory pointer hassles
so I converted all my code to C++
created my own classes for bigints, stringbuffer, stringbufferarray
as any true hacker might do
since I have no idea how to use sockets I got libcurl
the bigints are inlined and optimized with no bounds checking
basically an array of _int64 arrays
setbit is data[index >> 6] |= 1ULL << (index & 63);
clearbit is data[index >> 6] &= ~(1ULL << (index & 63));
so far it's running 5-10x faster than Java and it seems to have a memory leak somewhere
not a very impressive speed increase considering all the inlining and optimizing
I may just stay with Java and its very impressive JIT compiler for other challenges
C++ sucks with all the pointers