Instr()

General FreeBASIC programming questions.
albert
Posts: 5701
Joined: Sep 28, 2006 2:41
Location: California, USA

Instr()

Postby albert » Mar 02, 2020 1:45

@Coders

Is there a way to speed up the instr() function...

Building a 4 byte dictionary of a input of 10,000,000 bytes takes hours...

chrs = 10,000,000 bytes input
if instr( 1 , dict , ( next 4 chrs ) ) = 0 then dict+= next 4 chrs

Searching the dict , to try to find the 4 chrs takes over a half hour... I opened Linux "TOP" and killed it after 40 minutes.
caseih
Posts: 1504
Joined: Feb 26, 2007 5:32

Re: Instr()

Postby caseih » Mar 02, 2020 3:25

Might help if you describe what you are trying to do and sought for input on how to create an algorithm to do it effectively. Showing a complete, working example of your present code might help also.
albert
Posts: 5701
Joined: Sep 28, 2006 2:41
Location: California, USA

Re: Instr()

Postby albert » Mar 02, 2020 3:33

@caseih

Try the below code , it takes forever.....

Code: Select all


screen 19

    dim as string str1
    for a as longint = 0 to 10000000
        str1+= chr( int( rnd * 256 )
    next
   
    dim as string dict = ""
    dim as string n1
    for a as longint = 1 to len( str1 ) step 4
       
        n1 = mid( str1 , a , 4 )
       
        if instr( 1 , dict , n1 ) = 0 then dict+= n1
       
        if a mod 1000000 = 0 then print len( str1 ) - a
       
    next
   
print len( dict )

sleep
end

Munair
Posts: 886
Joined: Oct 19, 2017 15:00
Location: 't Zand, NL
Contact:

Re: Instr()

Postby Munair » Mar 02, 2020 8:52

It is not just the Instr() function. You use mid() and you create new strings like str1+= chr( int( rnd * 256 ) and dict+= n1. That isn't fast code by definition.

A faster way would be to use predefined string buffers to avoid string allocation with each iteration. And if you want to make it really fast, access string positions with pointers.
jj2007
Posts: 1520
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Instr()

Postby jj2007 » Mar 02, 2020 9:32

Munair is correct. You need to create the strings differently, like this:

Code: Select all

    dim as string str1
    str1=space(10000000)
    for a as longint = 1 to 10000000
        mid(str1, a, 1)= chr( int( rnd * 256 ))
    next
fxm
Posts: 9695
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

Re: Instr()

Postby fxm » Mar 02, 2020 10:52

You are both right, and therefore rather for speed:

Code: Select all

    dim as string str1
    str1 = space(10000000)
    for a as longint = 0 to 10000000-1
        str1[a] = int( rnd * 256 )
    next
grindstone
Posts: 725
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Postby grindstone » Mar 02, 2020 13:55

Double post, erased. (Sorry)
Last edited by grindstone on Mar 02, 2020 14:00, edited 1 time in total.
grindstone
Posts: 725
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Postby grindstone » Mar 02, 2020 13:56

Double post, erased.
Last edited by grindstone on Mar 02, 2020 13:59, edited 1 time in total.
grindstone
Posts: 725
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Postby grindstone » Mar 02, 2020 13:58

Just for pleasure, the snippet from above with pointer access:

Code: Select all

Dim As String str1
For a As LongInt = 0 To 10000000
   str1+= Chr( Int( Rnd * 256 ))
Next

'str1 = "abcdefghijklmnopqrstuvwx"

Dim As String dict = String(Len(str1) + 10, 0)
Dim As String n1, n2
Dim As Long Ptr lp = Cast(Long Ptr, StrPtr(str1))
Dim As Long Ptr dp = Cast(Long Ptr, StrPtr(dict))
Dim As Long b, c, dictlen

? Len( str1 )
?
For a As LongInt = 1 To Len( str1 ) Step 4
   n1 = Mid( str1 , a , 4 )
   b = (a-1)/4
   n2 = Mki(*(lp + b))
   
   Dim As Long n1v = cvi(n1)
   Do
      For c = 0 To dictlen / 4
         If *(dp + c) = n1v Then
            Exit Do
         EndIf
      Next
      c -= 1
      *(dp + c) = n1v
      dictlen += 4
   Loop until -1
   
   'If a Mod 1000000 = 0 Then Print Len( str1 ) - a
   If a/4 Mod 10000 = 0 Then Print Len( str1 ) - a
Next

?
'? dict
Print Len( dict )
? dictlen
dict = Left(dict, dictlen)
'? dict;"*"
Print Len( dict )

Sleep
End


It doesn't seem to be much faster.
grindstone
Posts: 725
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Postby grindstone » Mar 02, 2020 13:59

Double post, erased. (Sorry)
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Postby jimg » Mar 02, 2020 15:05

albert-

with your instr instruction, you are testing for any occurrence of a four byte string. Did you mean for the match to occur anywhere within the string, or only at four byte boundaries?

e.g.
if str1 looks like-

Code: Select all

abcdefghijklmnopbcdefghi
1   2   3   4   5   6    7   8   9   

and you pick up the fifth group (bcde) as n1 to test, it matches the second character of str1.
Was that your intention, or was it only supposed to match if it started on a four character boundary?
If it was only on a four character boundary, the search time could be shortened quite a bit I think.
albert
Posts: 5701
Joined: Sep 28, 2006 2:41
Location: California, USA

Re: Instr()

Postby albert » Mar 02, 2020 19:45

The dictionary is on a 4 byte boundary...

It steps through the input string by 4 bytes and tests the dictionary , to see if it's in the dictionary..
If it's not in the dictionary , then it adds the 4 bytes to the dictionary..

Dictionary searches , have to be on a 4 byte boundary.. location mod 4 = 1
So you can divide the dict location by 4 , output = dict( location \ 4 )


I'd like to somehow to determine if a dict location , occurs more than once in the input string....
So you can shorten the dictionary to just bytes that occur more than a certain number of times..( not sure?? )

Small lengths of input , would obviously expand...
But Gigabyte size inputs , might have several of a 4 byte set?? and might compress??
dodicat
Posts: 6534
Joined: Jan 10, 2006 20:30
Location: Scotland

Re: Instr()

Postby dodicat » Mar 02, 2020 21:12

Here are a couple of functions Albert.

Code: Select all




Function tally overload(somestring As String,partstring As String,arr() As Integer) As Integer
    Dim As Integer i,j,ln,lnp,count,num
    Dim As boolean filler=false
    ln=Len(somestring)
    lnp=Len(partstring)
    start:
    count=0
    i=-1
    Do
        i+=1
        If somestring[i] <> partstring[0] Then continue do
        If somestring[i] = partstring[0] Then
            For j=0 To lnp-1
                If somestring[j+i]<>partstring[j] Then continue do
            Next j
        End If
        count+=1
        If filler = true Then arr(count)=i+1
        i=i+lnp-1
    Loop Until i>=ln-1
    If filler = false Then Redim arr(count) 'size is now known, repeat the operation to fil arr()
    arr(0)=count                            '  save tally in (0)
    num=count
    If filler=true Then Goto _return
    filler=true
    Goto start
    _return:
    Return num
End Function

Function tally overload(somestring As String,partstring As String) As Integer
    Dim As Integer i,j,ln,lnp,count,num
    ln=Len(somestring)
    lnp=Len(partstring)
    count=0
    i=-1
    Do
        i+=1
        If somestring[i] <> partstring[0] Then continue do
        If somestring[i] = partstring[0] Then
            For j=0 To lnp-1
                If somestring[j+i]<>partstring[j] then continue do
            Next j
        End If
        count+=1
        i=i+lnp-1
    Loop Until i>=ln-1
    Return count
End Function

' //=========== Trial =========== //

Dim shared As String s:s="abababababab"

For i As Long=1 To 20
    s+=s
Next
print "string length ";len(s)

Redim As Integer arr()
Dim As Double t1,t2

print tally(s,"aba")," number of hits"
print "now with array"
t1=Timer
Var num=tally(s,"aba",arr())
t2=Timer
Print num," number of hits"
Print t2-t1,"time taken"
Print
print "positions of first 400 hits:"
Dim As String comma
For i As Long=1 To 400  'the tally is also held in arr(0)
    If i<arr(0) Then comma ="," Else comma =""
    Print arr(i);comma;
Next i
Print
sleep



 
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Postby jimg » Mar 02, 2020 22:05

If I understood you correctly, then your instr is not the right thing to use. As I showed in my example, it will find a match at byte two, which is not on a 4-byte boundary. If everything is to be in units of four bytes, then it should be faster to do everything in four-byte integers (longs), rather than characters.
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Postby jimg » Mar 02, 2020 23:30

Using integers would look something like this. Still not terribly fast, but I don't see how you could run any faster using brute force code.
To get any more speed, you would have to do some kind of hashing. Perhaps make 256 dictionaries, where all words in a dictionary would start with the same letter, so you would have much less to search for duplicates.

Anyway, this works-

Code: Select all

   Dim as string str1
   Dim As LongInt a
   Dim As Double tt,lt,st
   
   Randomize 0  ' get same string for comparison while testing
   str1 = space(10000000)
   For a = 0 to 10000000-1
     str1[a] = Int(Rnd*95+32)   ' get printable characters for testing, normally int( rnd * 256 )
   Next
   Mid(str1,5,4)="abcd"    ' ** for testing
   Mid(str1,33,4)="abcd"   ' ** for testing, make fake duplicate to delete
   Print "len(str1)=";Len(str1)
   Print "str1="
   Print Left(str1,80)
   print

   Dim As Long b = 0
   Dim as string dict = ""
   Dim as string n1
   Dim As Long d = 1
   Dim As ZString Ptr sptr,dptr
   Dim As Long dcount

   dict=Space(10000000)
   sptr=StrPtr(str1)
   dptr=StrPtr(dict)
   'Print sptr,dptr
   
   tt=Timer:lt=tt
   dcount=0
   n1=Left(str1,4)  ' always do first one
   Mid(dict,1,4)=n1
   sptr=sptr+4
   dcount=1
   For a = 1 to len(str1)/4
      'n1 = mid( str1 , a , 4 )
      Asm
         mov eax,[sptr]  ' address of next 4byte item in str1 wanted
         mov eax,[eax] 'pick up 4-byte string wanted
         mov edx,[dptr]  ' start at beginning of Dict
         mov ecx,[dcount]
         lpstrt:
            cmp eax,[edx]
            je  dfound     ' found it, skip out and set up next one
            add edx,4
            dec ecx
            jnz lpstrt  ' go try next one
            mov [edx],eax  ' didn't find, save it in dict
            mov ecx,[dcount]
            inc ecx
            mov [dcount],ecx
         dfound:
      End Asm
      sptr=sptr+4
     
      b=b+1
      If b=20000 Then    ' take a peek at results every 20,000 items
         lt=Timer-lt
         If a<21000 Then  ' print out test val first time
            Print "dict="
            Print Left(dict,80)  ' test print
            Print "check to see if second abcd is skipped in dict (col 32)"
            print
            'Sleep
         EndIf
         Print a;" dict count=";dcount;" t=";lt;" tt=";timer-tt
         b=0
         lt=Timer
      EndIf
   Next

   Print "final count=";dcount

   sleep
   sleep
End


Return to “General”

Who is online

Users browsing this forum: No registered users and 3 guests