Duplicates

General FreeBASIC programming questions.
Provoni
Posts: 391
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 20, 2020 9:56

I just realized that multiple CRC32 buckets can be used like so:

Code: Select all

if crc32array(0,j)=0 then 'unique
   crc32array(0,j)=1
   print #2,s
else 'collision, go to 2nd bucket
   print #3,s
   j=crc32(s+"*") 'unused character in corpus
   if crc32array(1,j)=0 then 'unique
      crc32array(1,j)=1
      print #2,s
   else 'collision
      print #3,s
      'add 3rd bucket if you like etc...
   end if
end if

Probably need to use bits to save memory.
Provoni
Posts: 391
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 20, 2020 10:19

8 buckets:

For every new bucket I add a unique character to the string that is otherwise unused through the corpus. Okay or not?

EDIT: Not okay, there was an error in my application of the idea.

Code: Select all

screenres 640,480,32

dim as uinteger i,j,b,l
dim as string s
dim as double t=timer

redim shared as ubyte crc32array(2^32)

declare function crc32(buf as string) as uinteger<32>

open "h:\corpus.txt" for binary as #1
open "g:\uniq.txt" for binary as #2
open "g:\coll.txt" for binary as #3

do
   
   line input #1,s
   
   if len(s)>0 then
      
      j=crc32(s)
      if bit(crc32array(j),0)=0 then 'unique
         crc32array(j)=bitset(crc32array(j),0)
         print #2,s
      else 'collision, go to 2nd bucket
         j=crc32(s+"0") 'unused character in corpus
         if bit(crc32array(j),1)=0 then 'unique
            crc32array(j)=bitset(crc32array(j),1)
            print #2,s
         else 'collision, go to 3rd bucket
            j=crc32(s+"1") 'unused character in corpus
            if bit(crc32array(j),2)=0 then 'unique
               crc32array(j)=bitset(crc32array(j),2)
               print #2,s
            else 'collision, go to 4th bucket
               j=crc32(s+"2") 'unused character in corpus
               if bit(crc32array(j),3)=0 then 'unique
                  crc32array(j)=bitset(crc32array(j),3)
                  print #2,s
               else 'collision, go to 5th bucket
                  j=crc32(s+"3") 'unused character in corpus
                  if bit(crc32array(j),4)=0 then 'unique
                     crc32array(j)=bitset(crc32array(j),4)
                     print #2,s
                  else 'collision, go to 6th bucket
                     j=crc32(s+"4") 'unused character in corpus
                     if bit(crc32array(j),5)=0 then 'unique
                        crc32array(j)=bitset(crc32array(j),5)
                        print #2,s
                     else 'collision, go to 7th bucket
                        j=crc32(s+"5") 'unused character in corpus
                        if bit(crc32array(j),6)=0 then 'unique
                           crc32array(j)=bitset(crc32array(j),6)
                           print #2,s
                        else 'collision, go to 8th bucket
                           j=crc32(s+"6") 'unused character in corpus
                           if bit(crc32array(j),7)=0 then 'unique
                              crc32array(j)=bitset(crc32array(j),7)
                              print #2,s
                           else 'collision
                              print #3,s
                           end if
                        end if
                     end if
                  end if
               end if
            end if
         end if
      end if
      
      l+=1
   
   end if
   
   b+=len(s)
   if timer-t>1 then
      t=timer
      screenlock
      cls
      print "Lines: "+str(l)
      print "GB: "+str(b/1073741824)
      screenunlock
   end if

   if l=1000000 then exit do
   
loop until eof(1)

close #1
close #2
close #3

cls
print "Lines: "+str(l)
print "GB: "+str(b/1073741824)

beep
sleep

function crc32(buf as string) as uinteger<32> 'rosetta code
   
   static as uinteger<32> table(256)
   static as uinteger<32> have_table
   dim as uinteger<32> crc, k
   dim as ulong i, j
   if have_table = 0 then
      for i = 0 to 255
         k = i
         for j = 0 to 7
            if (k and 1) then
               k shr= 1
               k xor= &hedb88320
            else
               k shr= 1
            end if
            table(i) = k
         next
      next
      have_table = 1
   end if
   crc = not crc 'crc = &hffffffff
   for i = 0 to len(buf) -1
      crc = (crc shr 8) xor table((crc and &hff) xor buf[i])
   next
   return not crc

end function
Provoni
Posts: 391
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Postby Provoni » Jun 20, 2020 10:56

Don't think my idea will work.
Last edited by Provoni on Jun 20, 2020 11:10, edited 1 time in total.
Tourist Trap
Posts: 2958
Joined: Jun 02, 2015 16:24

Re: Duplicates

Postby Tourist Trap » Jun 20, 2020 10:59

marcov wrote:It also depends on your speed requirements and magnitude of precision you expect. I did a test if a beer-bottle label was well placed (no folds) and didn't have errors (parts were printed on), and for that just summing the differences of the channel |r-r2|+|g-g2| + |b-b2| was already quite ok. But that was not angled, just comparing to a master with a rough sense of registration, using an AND mask to avoid very big faults near places of dramatic color changes due to small variations in registration.

Hi marcov, thanks for the insights. I think I'll start a thread if I have more questions later. It's about satellite images of landscape that are not taken from the same point of view, and still needs to be compared. Ok let's see that later. Thanks again :)
dodicat wrote:Sorry it is not very hi tech.

No, you did good, the only thing is that there is a transform between image1 and image2 that depends on camera, so it's a key point to be taken inton account. Anyway thanks, I'll try to come back later with more accurate demands.
Lost Zergling
Posts: 335
Joined: Dec 02, 2011 22:51
Location: France

Re: Duplicates

Postby Lost Zergling » Jun 26, 2020 10:14

Hello,

TT wrote : No, you did good, the only thing is that there is a transform between image1 and image2 that depends on camera, so it's a key point to be taken inton account.

@TT. I have no knowledge in the domain considered, however, intuitively, I see a problem: the very small variances in the transformation which will generate weak geometric distortions and prevent exact matching. It would therefore be advisable to associate standardized geographic signatures as keys and not directly the images, so as not to lose precision or degrade the original information. Then there would be the management of contours, overlaps, positioning and scaling.
It therefore seems to me that the processing of duplicates on this type of data cannot or should not ignore the specificity linked to the nature of the data. The precision of the signatures could tend, in particular, towards a "derivative" of the contours.

On the other aspect (data duplicates), I revised my lzle pseudo-code (just a backbone) using only one list, this code would work on split files and therefore much smaller containing no duplicates between each other . These files can be obtained by distributing the source data between several output files according to a scale and a calculation on the ascii codes.
For each distribution file:
While not eof(smallbigfile)
key=ComputeShortHash(FileLine)
If MyList.HashTag(key)=1 Then
MyList.Check(3)
End If
Wend
MyList.NFRecursive(1)
MyList.NFMethod(-1)
MyList.Root
While MyList.HashStep
If MyList.Check<>3 Then
MyList.NodeFlat
End If
Wend
While not eof(smallbigfile)
If MyList.HasHashTag(ComputeShortHash(FileLine)) Then
key=ComputeFullHash(FileLine)
If MyList.HashTag(key)=1 Then
MyList.Check(4)
End If
End If
Wend
MyList.Root
While MyList.HashStep
If MyList.Check<>4 Then
MyList.NodeFlat
End If
Wend
' to this point list contains only full hash duplicates and some memory slots in garbage collector
' The list shall be then switched to multi value mode, then file is parsed a third time and exact matches for full hash will be stored as values for comparisons..
ps : using some more advanced features not fully tested in that use case..(edited, forgot some stuff)

Return to “General”

Who is online

Users browsing this forum: No registered users and 10 guests