Duplicates

General FreeBASIC programming questions.
Provoni
Posts: 513
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Post by Provoni »

I just realized that multiple CRC32 buckets can be used like so:

Code: Select all

if crc32array(0,j)=0 then 'unique
	crc32array(0,j)=1
	print #2,s
else 'collision, go to 2nd bucket
	print #3,s
	j=crc32(s+"*") 'unused character in corpus
	if crc32array(1,j)=0 then 'unique
		crc32array(1,j)=1
		print #2,s
	else 'collision
		print #3,s
		'add 3rd bucket if you like etc...
	end if
end if
Probably need to use bits to save memory.
Provoni
Posts: 513
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Post by Provoni »

8 buckets:

For every new bucket I add a unique character to the string that is otherwise unused through the corpus. Okay or not?

EDIT: Not okay, there was an error in my application of the idea.

Code: Select all

screenres 640,480,32

dim as uinteger i,j,b,l
dim as string s
dim as double t=timer

redim shared as ubyte crc32array(2^32)

declare function crc32(buf as string) as uinteger<32>

open "h:\corpus.txt" for binary as #1
open "g:\uniq.txt" for binary as #2
open "g:\coll.txt" for binary as #3

do
	
	line input #1,s
	
	if len(s)>0 then
		
		j=crc32(s)
		if bit(crc32array(j),0)=0 then 'unique
			crc32array(j)=bitset(crc32array(j),0)
			print #2,s
		else 'collision, go to 2nd bucket
			j=crc32(s+"0") 'unused character in corpus
			if bit(crc32array(j),1)=0 then 'unique
				crc32array(j)=bitset(crc32array(j),1)
				print #2,s
			else 'collision, go to 3rd bucket
				j=crc32(s+"1") 'unused character in corpus
				if bit(crc32array(j),2)=0 then 'unique
					crc32array(j)=bitset(crc32array(j),2)
					print #2,s
				else 'collision, go to 4th bucket
					j=crc32(s+"2") 'unused character in corpus
					if bit(crc32array(j),3)=0 then 'unique
						crc32array(j)=bitset(crc32array(j),3)
						print #2,s
					else 'collision, go to 5th bucket
						j=crc32(s+"3") 'unused character in corpus
						if bit(crc32array(j),4)=0 then 'unique
							crc32array(j)=bitset(crc32array(j),4)
							print #2,s
						else 'collision, go to 6th bucket
							j=crc32(s+"4") 'unused character in corpus
							if bit(crc32array(j),5)=0 then 'unique
								crc32array(j)=bitset(crc32array(j),5)
								print #2,s
							else 'collision, go to 7th bucket
								j=crc32(s+"5") 'unused character in corpus
								if bit(crc32array(j),6)=0 then 'unique
									crc32array(j)=bitset(crc32array(j),6)
									print #2,s
								else 'collision, go to 8th bucket
									j=crc32(s+"6") 'unused character in corpus
									if bit(crc32array(j),7)=0 then 'unique
										crc32array(j)=bitset(crc32array(j),7)
										print #2,s
									else 'collision
										print #3,s
									end if
								end if
							end if
						end if
					end if
				end if
			end if
		end if
		
		l+=1
	
	end if
	
	b+=len(s)
	if timer-t>1 then
		t=timer
		screenlock
		cls
		print "Lines: "+str(l)
		print "GB: "+str(b/1073741824)
		screenunlock
	end if

	if l=1000000 then exit do
	
loop until eof(1)

close #1
close #2
close #3

cls
print "Lines: "+str(l)
print "GB: "+str(b/1073741824)

beep
sleep

function crc32(buf as string) as uinteger<32> 'rosetta code
	
	static as uinteger<32> table(256)
	static as uinteger<32> have_table
	dim as uinteger<32> crc, k
	dim as ulong i, j
	if have_table = 0 then
		for i = 0 to 255
			k = i
			for j = 0 to 7
				if (k and 1) then
					k shr= 1
					k xor= &hedb88320
				else
					k shr= 1
				end if
				table(i) = k
			next
		next
		have_table = 1
	end if
	crc = not crc 'crc = &hffffffff
	for i = 0 to len(buf) -1
		crc = (crc shr 8) xor table((crc and &hff) xor buf[i])
	next
	return not crc

end function
Provoni
Posts: 513
Joined: Jan 05, 2014 12:33
Location: Belgium

Re: Duplicates

Post by Provoni »

Don't think my idea will work.
Last edited by Provoni on Jun 20, 2020 11:10, edited 1 time in total.
Tourist Trap
Posts: 2958
Joined: Jun 02, 2015 16:24

Re: Duplicates

Post by Tourist Trap »

marcov wrote: It also depends on your speed requirements and magnitude of precision you expect. I did a test if a beer-bottle label was well placed (no folds) and didn't have errors (parts were printed on), and for that just summing the differences of the channel |r-r2|+|g-g2| + |b-b2| was already quite ok. But that was not angled, just comparing to a master with a rough sense of registration, using an AND mask to avoid very big faults near places of dramatic color changes due to small variations in registration.
Hi marcov, thanks for the insights. I think I'll start a thread if I have more questions later. It's about satellite images of landscape that are not taken from the same point of view, and still needs to be compared. Ok let's see that later. Thanks again :)
dodicat wrote: Sorry it is not very hi tech.
No, you did good, the only thing is that there is a transform between image1 and image2 that depends on camera, so it's a key point to be taken inton account. Anyway thanks, I'll try to come back later with more accurate demands.
Lost Zergling
Posts: 534
Joined: Dec 02, 2011 22:51
Location: France

Re: Duplicates

Post by Lost Zergling »

Hello,
TT wrote : No, you did good, the only thing is that there is a transform between image1 and image2 that depends on camera, so it's a key point to be taken inton account.
@TT. I have no knowledge in the domain considered, however, intuitively, I see a problem: the very small variances in the transformation which will generate weak geometric distortions and prevent exact matching. It would therefore be advisable to associate standardized geographic signatures as keys and not directly the images, so as not to lose precision or degrade the original information. Then there would be the management of contours, overlaps, positioning and scaling.
It therefore seems to me that the processing of duplicates on this type of data cannot or should not ignore the specificity linked to the nature of the data. The precision of the signatures could tend, in particular, towards a "derivative" of the contours.

On the other aspect (data duplicates), I revised my lzle pseudo-code (just a backbone) using only one list, this code would work on split files and therefore much smaller containing no duplicates between each other . These files can be obtained by distributing the source data between several output files according to a scale and a calculation on the ascii codes.
For each distribution file:
While not eof(smallbigfile)
key=ComputeShortHash(FileLine)
If MyList.HashTag(key)=1 Then
MyList.Check(3)
End If
Wend
MyList.NFRecursive(1)
MyList.NFMethod(-1)
MyList.Root
While MyList.HashStep
If MyList.Check<>3 Then
MyList.NodeFlat
End If
Wend
While not eof(smallbigfile)
If MyList.HasHashTag(ComputeShortHash(FileLine)) Then
key=ComputeFullHash(FileLine)
If MyList.HashTag(key)=1 Then
MyList.Check(4)
End If
End If
Wend
MyList.Root
While MyList.HashStep
If MyList.Check<>4 Then
MyList.NodeFlat
End If
Wend
' to this point list contains only full hash duplicates and some memory slots in garbage collector
' The list shall be then switched to multi value mode, then file is parsed a third time and exact matches for full hash will be stored as values for comparisons..
ps : using some more advanced features not fully tested in that use case..(edited, forgot some stuff)
Post Reply