Instr()
Instr()
@Coders
Is there a way to speed up the instr() function...
Building a 4 byte dictionary of a input of 10,000,000 bytes takes hours...
chrs = 10,000,000 bytes input
if instr( 1 , dict , ( next 4 chrs ) ) = 0 then dict+= next 4 chrs
Searching the dict , to try to find the 4 chrs takes over a half hour... I opened Linux "TOP" and killed it after 40 minutes.
Is there a way to speed up the instr() function...
Building a 4 byte dictionary of a input of 10,000,000 bytes takes hours...
chrs = 10,000,000 bytes input
if instr( 1 , dict , ( next 4 chrs ) ) = 0 then dict+= next 4 chrs
Searching the dict , to try to find the 4 chrs takes over a half hour... I opened Linux "TOP" and killed it after 40 minutes.
Re: Instr()
Might help if you describe what you are trying to do and sought for input on how to create an algorithm to do it effectively. Showing a complete, working example of your present code might help also.
Re: Instr()
@caseih
Try the below code , it takes forever.....
Try the below code , it takes forever.....
Code: Select all
screen 19
dim as string str1
for a as longint = 0 to 10000000
str1+= chr( int( rnd * 256 )
next
dim as string dict = ""
dim as string n1
for a as longint = 1 to len( str1 ) step 4
n1 = mid( str1 , a , 4 )
if instr( 1 , dict , n1 ) = 0 then dict+= n1
if a mod 1000000 = 0 then print len( str1 ) - a
next
print len( dict )
sleep
end
Re: Instr()
It is not just the Instr() function. You use mid() and you create new strings like str1+= chr( int( rnd * 256 ) and dict+= n1. That isn't fast code by definition.
A faster way would be to use predefined string buffers to avoid string allocation with each iteration. And if you want to make it really fast, access string positions with pointers.
A faster way would be to use predefined string buffers to avoid string allocation with each iteration. And if you want to make it really fast, access string positions with pointers.
Re: Instr()
Munair is correct. You need to create the strings differently, like this:
Code: Select all
dim as string str1
str1=space(10000000)
for a as longint = 1 to 10000000
mid(str1, a, 1)= chr( int( rnd * 256 ))
next
Re: Instr()
You are both right, and therefore rather for speed:
Code: Select all
dim as string str1
str1 = space(10000000)
for a as longint = 0 to 10000000-1
str1[a] = int( rnd * 256 )
next
-
- Posts: 862
- Joined: May 05, 2015 5:35
- Location: Germany
Re: Instr()
Double post, erased. (Sorry)
Last edited by grindstone on Mar 02, 2020 14:00, edited 1 time in total.
-
- Posts: 862
- Joined: May 05, 2015 5:35
- Location: Germany
Re: Instr()
Double post, erased.
Last edited by grindstone on Mar 02, 2020 13:59, edited 1 time in total.
-
- Posts: 862
- Joined: May 05, 2015 5:35
- Location: Germany
Re: Instr()
Just for pleasure, the snippet from above with pointer access:
It doesn't seem to be much faster.
Code: Select all
Dim As String str1
For a As LongInt = 0 To 10000000
str1+= Chr( Int( Rnd * 256 ))
Next
'str1 = "abcdefghijklmnopqrstuvwx"
Dim As String dict = String(Len(str1) + 10, 0)
Dim As String n1, n2
Dim As Long Ptr lp = Cast(Long Ptr, StrPtr(str1))
Dim As Long Ptr dp = Cast(Long Ptr, StrPtr(dict))
Dim As Long b, c, dictlen
? Len( str1 )
?
For a As LongInt = 1 To Len( str1 ) Step 4
n1 = Mid( str1 , a , 4 )
b = (a-1)/4
n2 = Mki(*(lp + b))
Dim As Long n1v = cvi(n1)
Do
For c = 0 To dictlen / 4
If *(dp + c) = n1v Then
Exit Do
EndIf
Next
c -= 1
*(dp + c) = n1v
dictlen += 4
Loop until -1
'If a Mod 1000000 = 0 Then Print Len( str1 ) - a
If a/4 Mod 10000 = 0 Then Print Len( str1 ) - a
Next
?
'? dict
Print Len( dict )
? dictlen
dict = Left(dict, dictlen)
'? dict;"*"
Print Len( dict )
Sleep
End
-
- Posts: 862
- Joined: May 05, 2015 5:35
- Location: Germany
Re: Instr()
Double post, erased. (Sorry)
Re: Instr()
albert-
with your instr instruction, you are testing for any occurrence of a four byte string. Did you mean for the match to occur anywhere within the string, or only at four byte boundaries?
e.g.
if str1 looks like-
and you pick up the fifth group (bcde) as n1 to test, it matches the second character of str1.
Was that your intention, or was it only supposed to match if it started on a four character boundary?
If it was only on a four character boundary, the search time could be shortened quite a bit I think.
with your instr instruction, you are testing for any occurrence of a four byte string. Did you mean for the match to occur anywhere within the string, or only at four byte boundaries?
e.g.
if str1 looks like-
Code: Select all
abcdefghijklmnopbcdefghi
1 2 3 4 5 6 7 8 9
Was that your intention, or was it only supposed to match if it started on a four character boundary?
If it was only on a four character boundary, the search time could be shortened quite a bit I think.
Re: Instr()
The dictionary is on a 4 byte boundary...
It steps through the input string by 4 bytes and tests the dictionary , to see if it's in the dictionary..
If it's not in the dictionary , then it adds the 4 bytes to the dictionary..
Dictionary searches , have to be on a 4 byte boundary.. location mod 4 = 1
So you can divide the dict location by 4 , output = dict( location \ 4 )
I'd like to somehow to determine if a dict location , occurs more than once in the input string....
So you can shorten the dictionary to just bytes that occur more than a certain number of times..( not sure?? )
Small lengths of input , would obviously expand...
But Gigabyte size inputs , might have several of a 4 byte set?? and might compress??
It steps through the input string by 4 bytes and tests the dictionary , to see if it's in the dictionary..
If it's not in the dictionary , then it adds the 4 bytes to the dictionary..
Dictionary searches , have to be on a 4 byte boundary.. location mod 4 = 1
So you can divide the dict location by 4 , output = dict( location \ 4 )
I'd like to somehow to determine if a dict location , occurs more than once in the input string....
So you can shorten the dictionary to just bytes that occur more than a certain number of times..( not sure?? )
Small lengths of input , would obviously expand...
But Gigabyte size inputs , might have several of a 4 byte set?? and might compress??
Re: Instr()
Here are a couple of functions Albert.
Code: Select all
Function tally overload(somestring As String,partstring As String,arr() As Integer) As Integer
Dim As Integer i,j,ln,lnp,count,num
Dim As boolean filler=false
ln=Len(somestring)
lnp=Len(partstring)
start:
count=0
i=-1
Do
i+=1
If somestring[i] <> partstring[0] Then continue do
If somestring[i] = partstring[0] Then
For j=0 To lnp-1
If somestring[j+i]<>partstring[j] Then continue do
Next j
End If
count+=1
If filler = true Then arr(count)=i+1
i=i+lnp-1
Loop Until i>=ln-1
If filler = false Then Redim arr(count) 'size is now known, repeat the operation to fil arr()
arr(0)=count ' save tally in (0)
num=count
If filler=true Then Goto _return
filler=true
Goto start
_return:
Return num
End Function
Function tally overload(somestring As String,partstring As String) As Integer
Dim As Integer i,j,ln,lnp,count,num
ln=Len(somestring)
lnp=Len(partstring)
count=0
i=-1
Do
i+=1
If somestring[i] <> partstring[0] Then continue do
If somestring[i] = partstring[0] Then
For j=0 To lnp-1
If somestring[j+i]<>partstring[j] then continue do
Next j
End If
count+=1
i=i+lnp-1
Loop Until i>=ln-1
Return count
End Function
' //=========== Trial =========== //
Dim shared As String s:s="abababababab"
For i As Long=1 To 20
s+=s
Next
print "string length ";len(s)
Redim As Integer arr()
Dim As Double t1,t2
print tally(s,"aba")," number of hits"
print "now with array"
t1=Timer
Var num=tally(s,"aba",arr())
t2=Timer
Print num," number of hits"
Print t2-t1,"time taken"
Print
print "positions of first 400 hits:"
Dim As String comma
For i As Long=1 To 400 'the tally is also held in arr(0)
If i<arr(0) Then comma ="," Else comma =""
Print arr(i);comma;
Next i
Print
sleep
Re: Instr()
If I understood you correctly, then your instr is not the right thing to use. As I showed in my example, it will find a match at byte two, which is not on a 4-byte boundary. If everything is to be in units of four bytes, then it should be faster to do everything in four-byte integers (longs), rather than characters.
Re: Instr()
Using integers would look something like this. Still not terribly fast, but I don't see how you could run any faster using brute force code.
To get any more speed, you would have to do some kind of hashing. Perhaps make 256 dictionaries, where all words in a dictionary would start with the same letter, so you would have much less to search for duplicates.
Anyway, this works-
To get any more speed, you would have to do some kind of hashing. Perhaps make 256 dictionaries, where all words in a dictionary would start with the same letter, so you would have much less to search for duplicates.
Anyway, this works-
Code: Select all
Dim as string str1
Dim As LongInt a
Dim As Double tt,lt,st
Randomize 0 ' get same string for comparison while testing
str1 = space(10000000)
For a = 0 to 10000000-1
str1[a] = Int(Rnd*95+32) ' get printable characters for testing, normally int( rnd * 256 )
Next
Mid(str1,5,4)="abcd" ' ** for testing
Mid(str1,33,4)="abcd" ' ** for testing, make fake duplicate to delete
Print "len(str1)=";Len(str1)
Print "str1="
Print Left(str1,80)
print
Dim As Long b = 0
Dim as string dict = ""
Dim as string n1
Dim As Long d = 1
Dim As ZString Ptr sptr,dptr
Dim As Long dcount
dict=Space(10000000)
sptr=StrPtr(str1)
dptr=StrPtr(dict)
'Print sptr,dptr
tt=Timer:lt=tt
dcount=0
n1=Left(str1,4) ' always do first one
Mid(dict,1,4)=n1
sptr=sptr+4
dcount=1
For a = 1 to len(str1)/4
'n1 = mid( str1 , a , 4 )
Asm
mov eax,[sptr] ' address of next 4byte item in str1 wanted
mov eax,[eax] 'pick up 4-byte string wanted
mov edx,[dptr] ' start at beginning of Dict
mov ecx,[dcount]
lpstrt:
cmp eax,[edx]
je dfound ' found it, skip out and set up next one
add edx,4
dec ecx
jnz lpstrt ' go try next one
mov [edx],eax ' didn't find, save it in dict
mov ecx,[dcount]
inc ecx
mov [dcount],ecx
dfound:
End Asm
sptr=sptr+4
b=b+1
If b=20000 Then ' take a peek at results every 20,000 items
lt=Timer-lt
If a<21000 Then ' print out test val first time
Print "dict="
Print Left(dict,80) ' test print
Print "check to see if second abcd is skipped in dict (col 32)"
print
'Sleep
EndIf
Print a;" dict count=";dcount;" t=";lt;" tt=";timer-tt
b=0
lt=Timer
EndIf
Next
Print "final count=";dcount
sleep
sleep
End