You could probably speed that up a lot by running the outer loop in parallel processes!
From what I can see, one iteration of the outer loop doesn't affect the next one, so it's a perfect fit for parallel programming.
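
Here's a rough sketch of what I mean, in Python since I don't know which language you're using. `inner_work` is a made-up stand-in for whatever your inner loop does for one outer index, so swap in your own logic. It only works like this if each outer iteration really is independent and doesn't write to shared state.

```python
from concurrent.futures import ProcessPoolExecutor


def inner_work(i, inner_n=1000):
    """Hypothetical placeholder: one full pass of the inner loop for outer index i."""
    total = 0
    for j in range(inner_n):
        total += (i * j) % 7
    return total


def run_serial(outer_n):
    # Original shape: the outer loop runs one iteration after another.
    return [inner_work(i) for i in range(outer_n)]


def run_parallel(outer_n, workers=None):
    # Each outer iteration is handed to a worker process.
    # pool.map() returns the results in the same order as the inputs.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(inner_work, range(outer_n)))


if __name__ == "__main__":
    # Sanity check: both versions should give identical results.
    assert run_parallel(8) == run_serial(8)
```

Keep in mind that starting processes and shipping data between them has a cost, so this tends to pay off only when each outer iteration does a decent amount of work. Worth timing both versions before committing to it.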