How did you do that?
My Perl solution uses a sliding window and takes just a few seconds to display the right answer (even though it then keeps running without displaying any further results).
I'm not quite sure there is a substantially faster way, because you need
to create all the hash elements first, and while doing so you lose time for calculation/comparison.
Hm... you could start with a pre-built hash instead of a one-million-character string...
But as you say: if the solution is calculated in a reasonable amount of time, it's OK.
On the other hand: we all seem to be geeks who care about those nasty little details
I hashed strings of length N to position within PI, found matches, and brute-force searched for the longest subsequent sequence. It was very fast in the end, but initially terribly slow - I had to use quite a large N value (lots of memory) to get the fast results. That was surprising because I expected the large N value to make the initial indexing step too slow, and the memory use to make the later searching steps slow too... but I was wrong.
I don't remember the exact hash string length I used in the end though, just that it was bigger than I had expected.
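A minimal Python sketch of that kind of N-gram index (the function and variable names are my own, and the short sample string stands in for the million digits):

```python
def longest_repeat(data, n):
    # Map each length-n substring to every position where it starts.
    index = {}
    for i in range(len(data) - n + 1):
        index.setdefault(data[i:i + n], []).append(i)

    best = ""
    for positions in index.values():
        # Brute-force: try to extend every pair of occurrences of the same n-gram.
        for a, x in enumerate(positions):
            for y in positions[a + 1:]:
                length = n  # the shared n-gram guarantees a match of length n
                while y + length < len(data) and data[x:x + length + 1] == data[y:y + length + 1]:
                    length += 1
                if length > len(best):
                    best = data[x:x + length]
    return best

print(longest_repeat("abcXYZdefXYZghi", 2))  # -> XYZ
```

Larger n means fewer positions per bucket (faster pairwise extension) but more memory for the index, which matches the trade-off described above.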
I used the sample program lrstext.c of the libstree library (http://www.icir.org/christian/libstree/) which computes the longest repeated substring of a string using a suffix tree.
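For comparison, the same longest-repeated-substring computation can be sketched without an external library using a sorted suffix list instead of libstree's suffix tree. This is my own illustration; it copies every suffix, so it only works for short inputs, not a million digits:

```python
def longest_repeated_substring(s):
    # Sort all suffixes; the longest repeated substring is then the longest
    # common prefix of two lexicographically adjacent suffixes.
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        if k > len(best):
            best = a[:k]
    return best

print(longest_repeated_substring("banana"))  # -> ana
```

A real suffix tree or suffix array builds the same structure in linear space, which is why the library version handles the full input.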
#!/usr/bin/env python
fd = open("pimillion", "r")
pi = fd.readline()
fd.close()
for seqlen in range(1, 500001):
    print("len:", seqlen)
    for index in range(len(pi) - seqlen):
        sequence = pi[index:index + seqlen]
        if sequence in pi[index + 1:]:
            print("startindex", index, "sequence:", sequence)
            break
Maybe it would have been faster if I had used a dedicated substring-search method instead of the "in" operator.
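One cheap improvement along those lines (a sketch of my own, with a short stand-in string) is str.find with a start offset, which searches in place instead of allocating a copy of the tail the way pi[index+1:] does on every probe:

```python
pi = "31415926535897932384626433832795"  # stand-in for the million digits
seqlen = 2
for index in range(len(pi) - seqlen):
    sequence = pi[index:index + seqlen]
    # str.find with a start argument scans the original string directly,
    # so no tail copy is built on each iteration.
    if pi.find(sequence, index + 1) != -1:
        print("startindex", index, "sequence:", sequence)  # startindex 6 sequence: 26
        break
```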
I found my code - I knew it must have been there somewhere...
Tidying a few things up, and timing it for index lengths 5 through 9, it executes in about 12 seconds for length 5, and about 5.5 seconds for length 9. I didn't go higher, as it already needs about 256MB of memory to store the index (according to top, anyway) and the host only had 512MB. It also wasn't getting much faster any more.
This is all with garbage collection turned off during the critical phases - that saves a lot of time, if you don't run out of memory in the process. It seemed faster to do an explicit collect after the indexing phase though.
import gc

def make_index(data, indexlen):
    # Map every length-indexlen substring to the list of its start positions.
    index = {}
    for y in range(len(data) + 1 - indexlen):
        p = data[y:y + indexlen]
        if p not in index:
            index[p] = []
        index[p].append(y)
    return index

def find_longest(data, index):
    longestpos = None
    longestlen = 1
    for starts in index.values():
        for p1, x in enumerate(starts):
            for y in starts[p1 + 1:]:
                # Start just past the current record; equality at length l
                # implies equality at every shorter prefix.
                l = longestlen + 1
                while y + l <= len(data) and data[x:x + l] == data[y:y + l]:
                    l += 1
                # The last length that actually matched is l - 1.
                if l - 1 > longestlen:
                    longestpos = x
                    longestlen = l - 1
    return data[longestpos:longestpos + longestlen]

fp = open("pi.txt")
data = fp.read()
fp.close()
data = ''.join(x for x in data if x.isdigit())
gc.disable()
gc.collect()
index = make_index(data, 7)
gc.collect()
longest = find_longest(data, index)
print(longest)
I took successive 8-digit sequences as hash keys, mapping each to the set of already-encountered positions for that 8-digit sequence. When encountering an already-hashed sequence, I check whether the repeat is actually longer than the current record and remember the longest.
It takes 4 seconds to find the answer in unoptimized code.
The text of this challenge is really ambiguous; at first I was searching (with great success, and for quite some time) for the longest run of a single repeated digit, like 111111.
For the solution I used a map to count how often each "word" of length d appears. On every run of the program I changed the value of d until the maximum count was 1; the previous result was the right one.
import operator

f = open("D:\\tmp.txt")
s = f.read(10000000)
f.close()
counts = {}
d = 13
for i in range(len(s) - d + 1):
    a = s[i:i + d]
    if a in counts:
        counts[a] += 1
    else:
        counts[a] = 1
max_seq = max(counts.items(), key=operator.itemgetter(1))[0]
print(max_seq)
print(counts[max_seq])
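The manual re-running with different values of d can also be automated; here is a sketch (my own helper names, assuming the text contains at least one repeated digit) that increases d until no length-d word repeats, then reports the previous result:

```python
from collections import Counter

def most_common_word(s, d):
    # (word, count) of the most frequent length-d substring.
    counts = Counter(s[i:i + d] for i in range(len(s) - d + 1))
    return counts.most_common(1)[0]

def longest_repeated_word(s):
    # Increase d until no length-(d + 1) word repeats; the previous d wins.
    d = 1
    best = most_common_word(s, d)
    while True:
        word, count = most_common_word(s, d + 1)
        if count == 1:
            return best
        d += 1
        best = (word, count)

digits = "31415926535897932384626433832795"  # stand-in for the real file
print(longest_repeated_word(digits))
```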
I also used a "sliding window", starting with a specific length. The sequences found are stored in a list; afterwards I go through the list and check whether those strings, extended by x characters, are still found more than once. If not, the entry is removed from the list.
Most of the time is spent building the index (> 5 minutes); the rest is fast. So when I look at the other solutions above and read how long they take to find the answer, I can see a clear disadvantage of C#: it's f*cking slow
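That pruning idea can be sketched independently of the language; here it is in Python (my own code, not the poster's C#), keeping only the window positions that still repeat when extended by one character:

```python
def longest_by_pruning(data, start_len=2):
    # Seed: start positions of every length-start_len window that occurs again later.
    candidates = [i for i in range(len(data) - start_len + 1)
                  if data.find(data[i:i + start_len], i + 1) != -1]
    length = start_len
    while candidates:
        # Keep only the windows that still repeat after growing by one character.
        survivors = [i for i in candidates
                     if i + length + 1 <= len(data)
                     and data.find(data[i:i + length + 1], i + 1) != -1]
        if not survivors:
            return data[candidates[0]:candidates[0] + length]
        candidates = survivors
        length += 1
    return ""

print(longest_by_pruning("abcXYZdefXYZghi"))  # -> XYZ
```

The candidate list shrinks rapidly at each round, so most of the cost is in the initial seeding pass, which matches the ">5 minutes indexing, rest is fast" observation.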
Here is my bit of Python. I got the solution after some minutes (15? 30? I don't know, as I was eating cake with my father and my grandmother... reminds me of xkcd):
I like your zsh version... going back to the roots... (even though almost none of it is actually zsh-specific).
I often forget that the paste command exists. It's very useful here; nice algorithm.
The "real" execution time is at least 2 minutes, since you can't know beforehand that the solution is 12 characters long... but it's still impressive.