How did you do that?
My Perl solution uses a sliding window and takes just a few seconds to display the right answer (even though it then keeps running without displaying any further results).
I'm not quite sure there is a substantially faster way, because you need
to create all the hash elements first, and while doing so you lose time for calculation/comparison.
Hm... you could start with a pre-built hash instead of a one-million-character string...
But as you say: if the solution is calculated in a reasonable amount of time, it's OK.
On the other hand: we all seem to be geeks who care about those nasty little details
I hashed strings of length N to position within PI, found matches, and brute-force searched for the longest subsequent sequence. It was very fast in the end, but initially terribly slow - I had to use quite a large N value (lots of memory) to get the fast results. That was surprising because I expected the large N value to make the initial indexing step too slow, and the memory use to make the later searching steps slow too... but I was wrong.
I don't remember the exact hash string length I used in the end though, just that it was bigger than I had expected.
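A minimal Python sketch of that kind of N-gram index (the function and variable names are my own, and the short sample string stands in for the million digits):

```python
def longest_repeat(data, n):
    # Map each length-n substring to every position where it starts.
    index = {}
    for i in range(len(data) - n + 1):
        index.setdefault(data[i:i + n], []).append(i)

    best = ""
    for positions in index.values():
        # Brute-force: try to extend every pair of occurrences of the same n-gram.
        for a, x in enumerate(positions):
            for y in positions[a + 1:]:
                length = n  # the shared n-gram guarantees a match of length n
                while y + length < len(data) and data[x:x + length + 1] == data[y:y + length + 1]:
                    length += 1
                if length > len(best):
                    best = data[x:x + length]
    return best

print(longest_repeat("abcXYZdefXYZghi", 2))  # -> XYZ
```

Larger n means fewer positions per bucket (faster pairwise extension) but more memory for the index, which matches the trade-off described above.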
I used the sample program lrstext.c of the libstree library (http://www.icir.org/christian/libstree/) which computes the longest repeated substring of a string using a suffix tree.
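For comparison, the same longest-repeated-substring computation can be sketched without an external library using a sorted suffix list instead of libstree's suffix tree. This is my own illustration; it copies every suffix, so it only works for short inputs, not a million digits:

```python
def longest_repeated_substring(s):
    # Sort all suffixes; the longest repeated substring is then the longest
    # common prefix of two lexicographically adjacent suffixes.
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("banana"))  # -> ana
```

A real suffix tree or suffix array builds the same structure in linear space, which is why the library version handles the full input.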
#!/usr/bin/env python
fd = open("pimillion", "r")
pi = fd.readline()
fd.close()
for seqlen in range(1, 500001):
    print("len:", seqlen)
    for index in range(len(pi) - seqlen):
        sequence = pi[index:index + seqlen]
        if sequence in pi[index + 1:]:
            print("startindex", index, "sequence:", sequence)
            break
Maybe it would have been faster if I had used a dedicated substring-search method instead of the "in" operator.
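One cheap improvement along those lines (a sketch of my own, with a short stand-in string) is str.find with a start offset, which searches in place instead of allocating a copy of the tail the way pi[index+1:] does on every probe:

```python
pi = "31415926535897932384626433832795"  # stand-in for the million digits
seqlen = 2
for index in range(len(pi) - seqlen):
    sequence = pi[index:index + seqlen]
    # str.find with a start argument scans the original string directly,
    # so no tail copy is built on each iteration.
    if pi.find(sequence, index + 1) != -1:
        print("startindex", index, "sequence:", sequence)  # startindex 6 sequence: 26
        break
```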
I found my code - I knew it must have been there somewhere...
Tidying a few things up, and timing it for index lengths 5 through 9, it executes in about 12 seconds for length 5, and about 5.5 seconds for length 9. I didn't go higher, as it already needs about 256MB of memory to store the index (according to top, anyway) and the host only had 512MB. It also wasn't getting much faster any more.
This is all with garbage collection turned off during the critical phases - that saves a lot of time, if you don't run out of memory in the process. It seemed faster to do an explicit collect after the indexing phase though.
import gc

def make_index(data, indexlen):
    # Map every length-indexlen substring to the list of its start positions.
    index = {}
    for y in range(len(data) + 1 - indexlen):
        p = data[y:y + indexlen]
        if p not in index:
            index[p] = []
        index[p].append(y)
    return index

def find_longest(data, index):
    longestpos = None
    longestlen = 1
    for starts in index.values():
        for p1, x in enumerate(starts):
            for y in starts[p1 + 1:]:
                # Start just past the current record; equality at length l
                # implies equality at every shorter prefix.
                l = longestlen + 1
                while y + l <= len(data) and data[x:x + l] == data[y:y + l]:
                    l += 1
                # The last length that actually matched is l - 1.
                if l - 1 > longestlen:
                    longestpos = x
                    longestlen = l - 1
    return data[longestpos:longestpos + longestlen]

fp = open("pi.txt")
data = fp.read()
fp.close()
data = ''.join(x for x in data if x.isdigit())
gc.disable()
gc.collect()
index = make_index(data, 7)
gc.collect()
longest = find_longest(data, index)
print(longest)
I took successive 8-digit sequences as hash keys, mapping each to the set of already-encountered positions for that 8-digit sequence. When encountering an already-hashed sequence, I check whether the repeat is actually longer than the current record and remember the longest.
It takes 4 seconds to find the answer in unoptimized code.
The text of this challenge is really ambiguous; at first I was searching (with great success, and for quite some time) for the longest run of a single repeated digit, like 111111.
For the solution I used a map to count how often each "word" of length d appears. On every run of the program I changed the value of d until the maximum count was 1; the previous result was the right one.
import operator

f = open("D:\\tmp.txt")
s = f.read(10000000)
f.close()
counts = {}
d = 13
for i in range(len(s) - d + 1):
    a = s[i:i + d]
    if a in counts:
        counts[a] += 1
    else:
        counts[a] = 1
max_seq = max(counts.items(), key=operator.itemgetter(1))[0]
print(max_seq)
print(counts[max_seq])
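The manual re-running with different values of d can also be automated; here is a sketch (my own helper names, assuming the text contains at least one repeated digit) that increases d until no length-d word repeats, then reports the previous result:

```python
from collections import Counter

def most_common_word(s, d):
    # (word, count) of the most frequent length-d substring.
    counts = Counter(s[i:i + d] for i in range(len(s) - d + 1))
    return counts.most_common(1)[0]

def longest_repeated_word(s):
    # Increase d until no length-(d + 1) word repeats; the previous d wins.
    d = 1
    best = most_common_word(s, d)
    while True:
        word, count = most_common_word(s, d + 1)
        if count == 1:
            return best
        d += 1
        best = (word, count)

digits = "31415926535897932384626433832795"  # stand-in for the real file
print(longest_repeated_word(digits))
```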
I also used a "sliding window", starting with a specific length. The sequences found are stored in a list; afterwards I go through the list and check whether those strings, extended by x characters, are still found more than once. If not, the entry is removed from the list.
Most of the time is spent building the index (> 5 minutes); the rest is fast. So when I look at the other solutions above and read how long they take to find the answer, I can see a clear disadvantage of C#: it's f*cking slow
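That pruning idea can be sketched independently of the language; here it is in Python (my own code, not the poster's C#), keeping only the window positions that still repeat when extended by one character:

```python
def longest_by_pruning(data, start_len=2):
    # Seed: start positions of every length-start_len window that occurs again later.
    candidates = [i for i in range(len(data) - start_len + 1)
                  if data.find(data[i:i + start_len], i + 1) != -1]
    length = start_len
    while candidates:
        # Keep only the windows that still repeat after growing by one character.
        survivors = [i for i in candidates
                     if i + length + 1 <= len(data)
                     and data.find(data[i:i + length + 1], i + 1) != -1]
        if not survivors:
            return data[candidates[0]:candidates[0] + length]
        candidates = survivors
        length += 1
    return ""

print(longest_by_pruning("abcXYZdefXYZghi"))  # -> XYZ
```

The candidate list shrinks rapidly at each round, so most of the cost is in the initial seeding pass, which matches the ">5 minutes indexing, rest is fast" observation.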
Here is my bit of Python. I got the solution after some minutes (15? 30? I don't know, as I was eating cake with my father and my grandmother... reminds me of xkcd):
I like your zsh version... going back to the roots... (even though almost none of it is actually zsh-specific).
I often forget that the paste command exists. It's very useful here; nice algorithm.
The "real" execution time is at least 2 minutes, since you can't know beforehand that the solution is 12 characters long... but it's still impressive.