Instr()

General FreeBASIC programming questions.
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Post by jimg »

That last one takes 779 seconds on my computer.
By sorting the dictionary so words starting with the same letter are grouped together, and only searching within the group needed, it take 16 seconds, almost down to usable speed.
Here's the test program

Code: Select all

   Dim As ULong a,i,b,j
   Dim As UByte uu,vv,yy,onetime
   Dim as string str1
   Dim As Double tt,lt,st
   st=timer
   Dim As ULong dictcount(255) ' count of words starting with each character
   dim as uLong dictptr(255)  ' pointer into dict of start of each group
   Randomize 0  ' get same string for comparison while testing
   str1 = space(10000000)
   For a = 0 to 10000000-1 step 4
      If a=4 Or a=32 Then   ' stick in test strings we can find
         str1[a]=97:str1[a+1]=98:str1[a+2]=99:str1[a+3]=100
         dictcount(97)+=1  ' count
      Else
            uu=Int(Rnd*94+33)   ' get printable characters for testing, normally int( rnd * 256 )
            str1[a] = uu
            dictcount(uu)+=1  ' count
            str1[a+1]=Int(Rnd*94+33)
            str1[a+2]=Int(Rnd*94+33)
            str1[a+3]=Int(Rnd*94+33)
      EndIf
   Next
   Print "len(str1)=";Len(str1)
   Print "str1="
   Print Left(str1,80)
   Print
   a=dictcount(0)
   For i=1 To 255
      dictptr(i)=a*4   ' get offset into dict for each character
      a=a+(dictcount(i))
   Next
   If a<>10000000/4 then
      Print "error! a=";a
      sleep
   EndIf
   
   Dim as String dict = ""
   Dim as string n1
   Dim As ZString Ptr sptr,dptr
   Dim As ULong dcount,tempstart,tempsize,dcountx,dptrbase
   Dim As ULong dcountz(255)  ' current count of string for each character in dict
   dict=Space(10000000)
   sptr=StrPtr(str1)
   dptr=StrPtr(dict)
   Dim As ULong dictptrold(255)
   dptrbase=Culng(dptr)  ' convert from zstring ptr to 32bit integer for adding
   For i=0 To 255
      dictptrold(i)=dictptr(i)    ' save old for testing
      dictptr(i)=(dictptr(i))+dptrbase  ' get actual address
   Next
   Print "setup time=";Timer-st
   tt=Timer:lt=tt
   dcount=0
   For a = 0 to (len(str1)-1) Step 4
      'n1 = mid( str1 , a , 4 )
      uu=str1[a]  ' get first char of string
      tempstart=dictptr(uu) ' the location within dict that strings starting with uu begin
      dcountx=dcountz(uu)  ' counts so far for strings starting with this character
      Asm
         mov eax,[sptr]    ' address of next 4byte item in str1 wanted
         mov eax,[eax]     ' pick up 4-byte string wanted
         mov edx,[tempstart]  ' start at first word of Dict for the first letter found
         mov ecx,[dcountx] ' how many of these we've found so far
         cmp ecx,0
         je savit          ' we don't have any for this letter yet, just go save it
         lpstrt:
            cmp eax,[edx]  ' does the string wanted (in eax) match the word in the dictionary
            je  dfound     ' found it, skip out and set up next one
            add edx,4      ' next word of dictionary
            dec ecx        ' count down available words in dict to test
            jnz lpstrt     ' still some left, go try next one
         savit:            
            mov [edx],eax  ' didn't find, save it in next blank spot of dict
            inc dword Ptr [dcountx] ' count strings found for this starting letter
            inc dword Ptr [dcount]  ' count overall strings found
         dfound:
      End Asm
      dcountz(uu)=dcountx
      sptr=sptr+4
      
      b=b+1
      If b=100000 Then    ' take a peek at results every 100,000 items
         lt=Timer-lt
         If onetime=0 Then  ' print out test val first time
            Print "dict="
            Print Left(dict,80)
            j=dictptrold(97) ' index of start of words starting with "a"
            Print mid(dict,j+1,80)  ' test print
            Print "check to see if second abcd is skipped in dict (col 32)"
            print
            onetime=1
         EndIf
         Print a;" dict count=";dcount;" t=";lt;" tt=";timer-tt
         b=0
         lt=Timer
      EndIf
   Next
   print
   Print "final count=";dcount;" in ";Timer-tt;" seconds" 
   Print "(";2500000-dcount;" duplicates discarded)"
   sleep
   For i=32 To 127
      j=dictptrold(i)
      a=dictptrold(i+1)-j
      b=dcountz(i)
      print i;" count=";b;" size=";a;" size/4=";a/4;" ";Mid(dict,j+1,20)
   Next
   Sleep
   sleep
End
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Post by jimg »

And finally, albert, this one uses separate dictionaries for each of 256 possible starting letters (dicts(255)).
It runs in 8 seconds, so it takes longer to compile than to run. I'm done here unless you have other questions.

Code: Select all

   '#Define debug
   Dim As ULong a,i,b,j,testsize
   Dim As UByte uu,vv,yy,onetime
   Dim as string str1
   Dim As Double tt,lt,st
   st=timer
   Dim As ULong dictcount(255) ' count of words starting with each character
   Randomize 0  ' get same string for comparison while testing

   testsize=10000000
   str1 = String(testsize,0)

   For a = 0 to Len(str1)-3 step 4
      #Ifdef debug
         If a=4 Or a=32 Then   ' stick in test strings we can find
            str1[a]=97:str1[a+1]=98:str1[a+2]=99:str1[a+3]=100
            dictcount(97)+=1  ' count
         Else
      #EndIf
      uu=Int(Rnd*94+33)   ' get printable characters for testing, normally int( rnd * 256 )
      str1[a] = uu
      dictcount(uu)+=1  ' count
      str1[a+1]=Int(Rnd*94+33)
      str1[a+2]=Int(Rnd*94+33)
      str1[a+3]=Int(Rnd*94+33)
      #Ifdef debug
         EndIf
      #endif
   Next
   #Ifdef debug
      Print "len(str1)=";Len(str1)
      Print "str1="
      Print Left(str1,80)
      Print
   #EndIf
   Dim As String dicts(255)  ' create dictionaries
   a=dictcount(0)
   For i=1 To 255
      a=a+(dictcount(i))
      dicts(i)=String(dictcount(i)*4,0)  ' preset space in dictionaries
   Next
   If a<>testsize/4 then Print "error! a=":Sleep

   Dim As ZString Ptr sptr,dptr
   Dim As ULong dcount,dcountx
   Dim As ULong dcountz(255)  ' current count of string for each character in dict
   
   sptr=StrPtr(str1)
   Print "setup time=";Timer-st
   tt=Timer:lt=tt
   dcount=0
   For a = 0 to (Len(str1)-1) Step 4
      'n1 = mid( str1 , a , 4 )
      uu=str1[a]  ' get first char of string
      dptr=StrPtr(dicts(uu)) ' get start of dictionary for character uu
      dcountx=dcountz(uu)  ' counts so far for strings starting with this character
      Asm
         mov eax,[sptr]    ' address of next 4byte item in str1 wanted
         mov eax,[eax]     ' pick up 4-byte string wanted
         mov edx,[dptr]    ' start at first word of Dict for the first letter found
         mov ecx,[dcountx] ' how many of these we've found so far
         cmp ecx,0
         je savit          ' we don't have any for this letter yet, just go save it
         lpstrt:
            cmp eax,[edx]  ' does the string wanted (in eax) match the word in the dictionary
            je  dfound     ' found it, skip out and set up next one
            add edx,4      ' next word of dictionary
            dec ecx        ' count down available words in dict to test
            jnz lpstrt     ' still some left, go try next one
         savit:            
            mov [edx],eax  ' didn't find, save it in next blank spot of dict
            inc dword Ptr [dcountx] ' count strings found for this starting letter
            inc dword Ptr [dcount]  ' count overall strings found
         dfound:
      End Asm
      dcountz(uu)=dcountx
      sptr=sptr+4
      
      #Ifdef debug
         b=b+1   
         If b=100000 Then    ' take a peek at results every 100,000 items
            lt=Timer-lt
            If onetime=0 Then  ' print out test val first time
               Print left(dicts(97),80)  ' test print
               Print "check to see if second abcd is skipped in dict (col 32)"
               print
               onetime=1
            EndIf
            Print a;" dict count=";dcount;" t=";lt;" tt=";timer-tt
            b=0
            lt=Timer
         EndIf
      #endif
   Next
   print
   Print "final count=";dcount;" in ";Timer-tt;" seconds" 
   Print "(";2500000-dcount;" duplicates discarded)"
   #Ifdef debug
      sleep
      
      For i=32 To 127
         a=Len(dicts(i))
         b=dcountz(i)
         print i;" count=";b;" size=";a;" size/4=";a/4;" ";Left(dicts(i),40)
      Next
      Sleep
   #endif
   Sleep
End
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Post by jimg »

albert-

I know I said I was done, but I couldn't resist.

Here is a program that reads in a file and creates the dictionaries.

Using a 13 meg test file, on my machine ( with plenty of memory and fast cpu) it takes a grand total of 0.3 seconds. Is that fast enough?
( it did take 260 seconds for a 1.1 gigabyte file though.)

Change the filename in line 13 and try it out on your own test data :)

Code: Select all

   #Define debug
   Dim As ULong a,i,b,j,k,recordsize,ttt,tsize
   Dim As UByte uu,vv,yy,onetime
   Dim as string str1
   Dim As Double tt,lt,st,qt
   ttt=255
   Dim As ULong dictcount(255,ttt) ' count of words starting with each character
   Dim As String dicts(255,ttt)    ' create dictionaries

   recordsize = 65536  ' read buffer size
   str1 = String(recordsize,0)

   Open "CalcGuideLibre.odt" For binary As #1  ' my 13 meg test file.
   st=Timer
   tt=st   
   
   b=0:k=0  ' count records and bytes
   Do While Not Eof(1)
      str1=Input (recordsize, 1)
      k=k+1
      b=b+Len(str1)
      For a=0 To Len(str1)-1 Step 4   ' read through once to get sizes needed for dictionary
         uu=str1[a]
         vv=str1[a+1]
         dictcount(uu,vv)+=1  ' count strings that start with these two letters
      Next
   Loop
   Print "done, ";k;" records, total size=";b;" last record=";Len(str1)

   Seek #1,1   ' rewind file

   ' create space for dictionary words
   a=0
   For i=0 To 255
      For j=0 To 255
         a=a+(dictcount(i,j))  ' keep running count to be sure we got it right
         dicts(i,j)=String(dictcount(i,j)*4,0)  ' preset space in dictionaries
      Next
   Next
   Print "total size=";a;" words, ";a*4;" bytes"

   Dim As ZString Ptr sptr,dptr  ' to get the address of the strings in the input and dictionary
   Dim As ULong dcountz(255,ttt) ' current count of string for each character in dict
   Dim As ULong dcount,dcountx
   
   print
   b=0
   dcount=0
   tsize=0
   Do While Not Eof(1)

      str1=Input (recordsize, 1)
      tsize=tsize+Len(str1)
      
      sptr=StrPtr(str1) ' start of input data
      For a = 0 to (Len(str1)-1) Step 4
         uu=str1[a]   ' get first char of string to use as an index into the dictionaries
         vv=str1[a+1] ' get the second character of the string
         dptr=StrPtr(dicts(uu,vv)) ' get start of dictionary for character set uu,vv
         dcountx=dcountz(uu,vv)  ' counts so far for strings starting with these characters
         Asm
            mov eax,[sptr]    ' address of next 4byte item in str1 wanted
            mov eax,[eax]     ' pick up 4-byte string wanted
            mov edx,[dptr]    ' start at first word of Dict for the first letter found
            mov ecx,[dcountx] ' how many of these we've found so far
            jecxz savit       ' don't have any for these letters yet, just save new one
         lpstrt:
            cmp eax,[edx]  ' does the string wanted (in eax) match the word in the dictionary
            je  dfound     ' already have in dictionary, just skip this one
            Add edx,4      ' next word of dictionary
            Sub ecx,1      ' count down available words in dict to test   
            jnz lpstrt     ' still some left, go try next one   
         savit:            
            mov [edx],eax           ' didn't find, save it in next blank spot of dict
            inc dword Ptr [dcountx] ' count strings found for this starting letter
            inc dword Ptr [dcount]  ' count overall strings found
         dfound:
         End Asm
         dcountz(uu,vv)=dcountx  ' save count of words in this dictionary for these characters
         sptr=sptr+4             ' move to next input word
         
         #Ifdef debug
            b=b+1   
            If b=200000 Then    ' take a peek at results every 200,000 items
               lt=Timer-st
               Print tsize/4;" dict count=";dcount;" t=";lt;" tt=";timer-tt
               b=0
               st=Timer
            EndIf
         #EndIf
      Next
   
   Loop
   close
   lt=Timer-tt
   Print
   Print "final count=";dcount;" words saved in ";lt;" seconds" 
   Print "(";Int((tsize/4)-dcount);" duplicates discarded)"
   #Ifdef debug
      ?:? "press a key for a partial dictionary dump":?
      sleep
      For i=32 To 127  ' test print out some dictionaries
         'For k=0 To 255
         For k=97 To 99  ' show the ones with second char= a or b or c or d for a test
            a=Len(dicts(i,k))
            b=dcountz(i,k)
            print i;",";k;" count=";b;" size=";a;" size/4=";a/4;" tsize=";Len(RTrim(dicts(i,k),Chr(0)));" ";
            For j=0 To 30
               vv=dicts(i,k)[j]
               If (vv>31 and vv<127) or (vv>128 And vv<255) Then
                  Print Chr(vv);
               Else
                  Print " ";
               EndIf
            Next
            print
            Left(dicts(i,k),30)
         Next
      Next
      Sleep
   #endif
   Sleep
End
Thanks for the entertainment, and if you want to discuss this, feel free to contact me.
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Post by grindstone »

Very impressive!

I've tried another way to do the job: Writing the 4-byte-words into a tree structure and then traverse the tree to create the dictionary string. It works nearly as fast as jimg's code, but it runs out of memory (on my 32 bit machine) if the number of different words exceeds a certain amount.
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Post by grindstone »

A different approach (my thanks to dodicat for mentioning the tally string). I think this can stick with jimg's code - without assembly *grin* :

Code: Select all

Dim As String str1, substr
Dim As String dict
Dim As Integer strlen
Dim As UByte Ptr stp

Const tallylen  = 256^4/8

Dim As UByte Ptr tally = Callocate(tallylen)
Dim As ULong v, vs

strlen = 10000000
'strlen = 13*1024*1024
'strlen = 12
str1 = String(strlen, 0)

For a As LongInt = 0 To strlen - 1
	str1[a] = Int( Rnd * 256 )
Next

'some strings for testing
'str1 = "abcdefghijklmnopqrstuvwxefghabcd1234abcd"
'str1 = "abcdefgh"
'str1 = "abcd"

dict = String(Len(str1), 0)

Print Len(str1); " ("; Len(str1) / 4; " values )"
Print

Dim As Double timerem = Timer

Print "Building tally string..."
Dim As ULong Ptr von = Cast(ULong Ptr, StrPtr(str1))
Dim As ULong Ptr bis = von
Dim As Integer dummy = (Len(str1) / 4) - 1
bis += dummy

For a As Long Ptr = von To bis 
	v = *a
	vs = v Shr 3
	tally[vs] = BitSet(tally[vs], v And 7)
Next

Print Timer - timerem
Print

Print "Building dict string..."
Dim As ULong Ptr dictptr = Cast(ULong Ptr, StrPtr(dict))
Dim As Integer dc, tc
Dim As ULongInt Ptr ultp = Cast(ULongInt Ptr, tally)
Dim As ULongInt tv
Dim As UInteger ubp

For y As UInteger = 0 To tallylen / 8 - 1
	tv = ultp[y] 'the vastly most bytes will be 0, so it pays to check 8 bytes at a time
	If tv <> 0 Then 'at least one bit is set
		For z As UByte= 0 To 7 'all 8 bytes
			ubp = y * 8 + z 'index of the byte of the tally string
			If tally[ubp] <> 0 Then 'if at least one bit of the byte is set
				For b As UByte = 0 To 7 'check all 8 bits of the byte
					If Bit(tally[ubp], b) Then 'if bit is set
						dictptr[dc] = (ubp Shl 3) Or b 'write the corresponding value to the dict string
						dc += 1 'increase index of dict string
					EndIf
				Next
			EndIf
		Next
	EndIf
Next

Print Timer - timerem
Print

dict = Left(dict, dc * 4)
Print Len(dict)/4; " values ("; (Len(str1) - Len(dict))/4; " discarded )"
'? dict
DeAllocate tally
Print "OK"
Sleep
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Post by grindstone »

And now the same with assembly code. Anyone to beat this?

Code: Select all

Const tallylen  = 256^4/8

Dim As String str1, substr, dict
Dim As Integer strlen, dummy
Dim As ULong Ptr von, bis
Dim As UByte Ptr dictptr, tally = Callocate(tallylen)

strlen = 10000000
'strlen = 13*1024*1024
'strlen = 12
str1 = String(strlen, 0)

For a As LongInt = 0 To strlen - 1
	str1[a] = Int( Rnd * 256 )
Next

'some strings for testing
'str1 = "abcdefghijklmnopqrstuvwxefghabcd1234abcd"
'str1 = "abcdefghijklmnopqrstuvwxefghabcd1234abcd5678xyzw"
'str1 = "abcdefgh"
'str1 = "abcd"

dict = String(Len(str1), "*")

Print Len(str1); " ("; Len(str1) / 4; " values )"
Print

Dim As Double timerem = Timer

Print "Building tally string..."
von = Cast(ULong Ptr, StrPtr(str1))
bis = von
dummy = (Len(str1) / 4)
bis += dummy

Asm
	mov edx, [von]		'load pointer to str1
loop1:		
	mov eax, [edx]		'load value of str1 pointer position into eax
	mov ebx, eax			'copy value to ebx
	Shr eax, 3				'divide value by 8
	Add eax, [tally]	'calculate pointer to tally byte
	And ebx, 7				'index of bit to set
	bts [eax], ebx		'set corresponding bit in tally string
	Add edx, 4				'pointer to next value of str1
	cmp edx, [bis]		'test for end of str1
	jb loop1
End Asm

Print Timer - timerem
Print

Print "Building dict string..."
dictptr = Cast(ubyte Ptr, StrPtr(dict))
von = Cast(ULong Ptr, tally)
bis = von
dummy = tallylen / 4
bis += dummy - 1

Asm
	mov eax, [tally]	'get pointer to tally string
loop2:		
	mov edx, [eax]		'load value at tally pointer position
	cmp edx, 0
	jz zero						'all tally bits are 0 -> next value
notzero:
	bsf ebx, edx			'search for the least significant bit that is set
	mov ecx, eax			'pointer to tally string
	Sub ecx, [tally]	'get byte position in tally string
	Shl ecx, 3				'multiply with 8
	Add ecx, ebx			'add bit position
	btr edx, ebx			'reset bit
	mov ebx, [dictptr]'get pointer to dict string
	mov [ebx], ecx		'write value to dict string
	Add ebx, 4				'set pointer to next dict string position
	mov [dictptr], ebx'save dict pointer
	cmp edx, 0				'check if another bit is set
	jne notzero				
zero:
	Add eax, 4				'next tally string value
	cmp eax, [bis]		'check for end of tally string
	jb  loop2
End Asm
	
Print Timer - timerem
Print

dummy = Cast(UInteger, dictptr)
dummy -= Cast(UInteger, StrPtr(dict))
dict = Left(dict, dummy)
Print Len(dict)/4; " values ("; (Len(str1) - Len(dict))/4; " discarded )"
'? dict;"#"
DeAllocate tally
Print
Print "OK"
Sleep
EDIT: Fixed a bug (Did miss the very last value of the source string)
Last edited by grindstone on Mar 09, 2020 13:04, edited 1 time in total.
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Post by jimg »

I lifted your string generator to be sure we were working on the same input. I ran your last assembly version which I called tally, and my last version that wasn't reading a file (instrtest11) for comparison. This is what I got when running in a command window from a dos prompt:

Code: Select all

F:\FreeBasicProgs\Testing\instrtest>tally
 10000000 ( 2500000 values )

Building tally string...
 0.1963284000034804

Building dict string...
 0.318041300003479

 2499308 values ( 692 discarded )

OK

F:\FreeBasicProgs\Testing\instrtest>instrtest11
 10000000 ( 2500000 values )


final count=2499309 in  0.1459838000014351 seconds
( 691 duplicates discarded)

F:\FreeBasicProgs\Testing\instrtest>
So very close. For me, your test string was easier since there weren't as many duplicates.

Now here's an interesting part-- I went back and ran the test on twice as many characters and look-

Code: Select all

F:\FreeBasicProgs\Testing\instrtest>tally
 20000000 ( 5000000 values )

 0.2285514000028641

Building dict string...
 0.3475486000028631

 4997155 values ( 2845 discarded )

OK

F:\FreeBasicProgs\Testing\instrtest>instrtest11
 20000000 ( 5000000 values )


final count=4997156 in  0.4366541000009683 seconds
( 2844 duplicates discarded)

F:\FreeBasicProgs\Testing\instrtest>
You won handily for the larger number of characters! In fact, twice as many took nearly the same time.

but to really suprise me, I tried 90,000,000 or 9 times as many-

Code: Select all

F:\FreeBasicProgs\Testing\instrtest>tally
 90000000 ( 22500000 values )

 0.4611695000032121

Building dict string...
 0.7383165000032115

 22441197 values ( 58803 discarded )

OK

F:\FreeBasicProgs\Testing\instrtest>instrtest11
 90000000 ( 22500000 values )


final count=22441198 in  4.142522199998847 seconds
( 58802 duplicates discarded)

F:\FreeBasicProgs\Testing\instrtest>
If I didn't see it with my own eyes, I wouldn't have believed it!
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Post by grindstone »

Thank you for testing, due to it I found a bug in my code (and fixed it, see above).

But very strange - our machines seem to behave quite different:
  • 10000000 characters:
    instrtest11: 0.689609581455835
    tally: 0.4877191733044564

    20000000 characters:
    instrtest11: 1.343974237574243
    tally: 0.6782979671420524

    90000000 characters:
    instrtest11: 10.87436923649341 (!)
    tally: 1.980990909922488
I confess, my hardware and OS are quite old: Intel Core 2 Quad at 2.4GHz / 4GB RAM / WinXP32
albert
Posts: 6000
Joined: Sep 28, 2006 2:41
Location: California, USA

Re: Instr()

Post by albert »

I was trying to make a dictionary of 4 byte values.. Hoping to get data compression..

But the dictionary comes out to be the same size , as the input data..
grindstone
Posts: 862
Joined: May 05, 2015 5:35
Location: Germany

Re: Instr()

Post by grindstone »

I recoded my snippet to really count (instead of calculate) the number of values.

Code: Select all

Const tallylen  = 256^4/8

Dim As String str1, substr, dict
Dim As Integer strlen, dummy, inputcount, outputcount
Dim As ULong Ptr von, bis
Dim As UByte Ptr dictptr, tally = Callocate(tallylen)

Open ExePath + "\inputdata.dat" For Binary As #1 'replace with your input data file
str1 = Input(Lof(1), #1)
Close

'some strings for testing
'str1 = "abcdefghijklmnopqrstuvwxefghabcd1234abcd"
'str1 = "abcdefghijklmnopqrstuvwx"
'str1 = "abcdefghijklmnopqrstuvwxefghabcd1234abcd5678xyzw"
'str1 = "abcdefgh"
'str1 = "abcd"

If Len(str1) Mod 4 <> 0 Then
	Print "ERROR: Length of input string must be a multiple of 4"
	DeAllocate tally
	Sleep
	End
EndIf

dict = String(Len(str1), 0)

Dim As Double timerem = Timer

Print "Building tally string..."
von = Cast(ULong Ptr, StrPtr(str1))
bis = von
dummy = (Len(str1) / 4)
bis += dummy

Asm
	mov edx, [von]		'load pointer to str1
	mov ecx, 0				'initialize sample counter
loop1:		
	mov eax, [edx]		'load value of str1 pointer position into eax
	mov ebx, eax			'copy value to ebx
	Shr eax, 3				'divide value by 8
	Add eax, [tally]	'calculate pointer to tally byte
	And ebx, 7				'index of bit to set
	bts [eax], ebx		'set corresponding bit in tally string
	inc ecx						'increase sample counter
	Add edx, 4				'pointer to next value of str1
	cmp edx, [bis]		'test for end of str1
	jb loop1
	mov [inputcount], ecx	'store number of samples
End Asm

Print Timer - timerem
Print

Print "Building dict string..."
dictptr = Cast(UByte Ptr, StrPtr(dict))
von = Cast(ULong Ptr, tally)
bis = von
dummy = tallylen / 4
bis += dummy - 1

Asm
	mov eax, [tally]	'get pointer to tally string
loop2:		
	mov edx, [eax]		'load value at tally pointer position
	cmp edx, 0
	jz zero						'all tally bits are 0 -> next value
notzero:
	bsf ebx, edx			'search for the least significant bit that is set
	mov ecx, eax			'pointer to tally string
	Sub ecx, [tally]	'get byte position in tally string
	Shl ecx, 3				'multiply with 8
	Add ecx, ebx			'add bit position
	btr edx, ebx			'reset bit
	mov ebx, [dictptr]'get pointer to dict string
	mov [ebx], ecx		'write value to dict string
	Add ebx, 4				'set pointer to next dict string position
	mov [dictptr], ebx'save dict pointer
	mov ebx, [outputcount]
	inc ebx
	mov [outputcount], ebx
	cmp edx, 0				'check if another bit is set
	jne notzero				
zero:
	Add eax, 4				'next tally string value
	cmp eax, [bis]		'check for end of tally string
	jb  loop2
End Asm
	
Print Timer - timerem
Print

Open ExePath + "\dictionary.dat" For Output As #1
Print #1, dict;
Close

Print "Dictionary stored to: "; ExePath; "\dictionary.dat"
Print

'? dict;"#"
DeAllocate tally
Print
Print "     Number of input values "; inputcount
Print "Number of dictionary values "; outputcount; " ("; inputcount - outputcount; " discarded )
Print
Print "OK"
Sleep
If the sizes of input and dictionary are the same, you can be sure that there definitely are no duplicates. But maybe at least it helps that the values in the dictionary are sorted. ;-)
jimg
Posts: 24
Joined: Jan 16, 2020 19:43
Location: Oregon

Re: Instr()

Post by jimg »

albert-

The data we have been using for testing, random numbers of the range 0-255, is not compressible, otherwise it wouldn't be very good random numbers. For example, I ran the generator for str1 and saved it to a file. That file was 190,000,000 bytes long. I then ran that file through 7-zip to compress it as much as possible. The results were 190,000,156 bytes. So it grew instead of becoming smaller.

What is your input data really like? Is it compressible? We might be able to help with some more info.

For example, I ran the test again but only generating characters from 32 to 126, the normal text characters, and the 190 mbytes compressed to 157,688,900 bytes. Not great but some gain. If your data is some repeating patterns like a limited vocabulary, the compression could be much greater.
Post Reply