I would suggest a procedure: compute a "short" hash on each line and store this short value in a list (it must be short enough to fit in RAM), and write the short hashes seen more than once (the duplicate candidates) into a file, something like this:
While Not Eof(hugefile)
    key = ComputeShortHash(FileLine)
    If MyList.HashTag(key) = 1 Then ' short hash already seen: duplicate candidate
        Write key to new file
    End If
Wend
MyList.Recycle
Then load the duplicate candidates back into the recycled memory slots:
While Not Eof(new file)
    MyList.HashTag(shorthash read from new file)
Wend
At this point, the possible duplicates are stored in a list in memory.
Then parse the huge file again:
While Not Eof(huge file)
    If MyList.HasHashTag(ComputeShortHash(FileLine)) Then
        If MyList2.HashTag(ComputeFullHash(FileLine)) = 1 Then
            ' real duplicate, do nothing
        Else ' just a candidate, not a real duplicate
            Write Line to resultFile
        End If
    Else
        Write Line to resultFile
    End If
Wend
Indeed, the second index (MyList2) will require RAM space, but only for the subset of duplicate candidates (pre-filtered by the first list).
Two cascaded indexes in memory, which means the ComputeShortHash function has to be tuned very precisely.
Sounds heavy, and the balance between the two cascaded indexes is somewhat unpredictable...
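For illustration, ComputeShortHash could be something as simple as a 32-bit FNV-1a over the line, masked further if the key must be even shorter (the function below is just my own example, not part of any existing list engine):
Code: Select all
' Example "short" hash: 32-bit FNV-1a over the bytes of the line.
' Mask the result (e.g. h And &hFFFFF) if an even shorter key is needed.
Function ComputeShortHash(ByRef s As String) As ULong
    Dim As ULong h = &h811C9DC5          ' FNV offset basis
    For i As Integer = 0 To Len(s) - 1
        h = (h Xor s[i]) * &h01000193    ' FNV prime, wraps modulo 2^32
    Next i
    Return h
End Function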
Re: Duplicates
How many duplicates do you expect? If there are only 30 duplicate lines in a TB file (the rest unique), it is more complicated (heavier) than if there are only 30 unique lines in a TB. What kind of data does the text file contain? If you can discard certain data directly after reading, that also makes the task more manageable.
In the past I made something similar (in C) for firewall log-files, containing many duplicates (with timestamps and/or certain IP-addresses removed), but that tool processed files of a few GB not TB.
Re: Duplicates
Thanks everyone for all the replies. Still need to catch up.
badidea wrote:Isn't a checksum just a 'low quality' hash?
No clue, it could be.
counting_pine wrote:So we’re talking roughly a billion lines of text?
One question that occurs to me is, how many unique lines are there likely to be?
I never counted the lines, will follow up with a line count. Not sure about unique lines amount. Each line is a comment from a forum. My machine has 64 GB RAM so I guess if the line count is manageable then a low quality hash/checksum could do. Perhaps the lines can be split up in smaller files by length!
grindstone wrote:From the coding point of view it's not a big deal:
How long will that have to run?
integer wrote:Only those lines with same byte length were byte compared for duplication.
Great idea, thanks.
Re: Duplicates
Provoni wrote:I have a very large text file (near 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?
(Just my first idea, before reading the rest.) Use a machine with lots of memory (32 GB or more) and a 64-bit compiler.
Walk the file, and for every line:
- store the offset of the line in an index that, for each line length, keeps a list of checksums, and for each checksum a record with the offsets and a duplicate flag (CRC16 should be fine)
(so a map of maps of file records, <filesize, <checksum, offset + duplicate flag>>).
- Then start walking over the line sizes. If, within a bucket of lines of one constant size, one of the <checksum, offset + duplicate flag> lists has multiple entries, seek into the file and byte-compare those lines to make sure they really are duplicates. If they are, mark the relevant entries as such.
- After this is complete, add all entries that are NOT duplicates to a list ordered by offset (this is harder than it sounds for large lists).
- Walk the file again, and at the same time maintain a cursor in the list of offsets.
- If a line's offset is in the list of offsets, write it to a new file.
The new file is thus built from the list of offsets and contains no duplicates.
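To make the index concrete, a minimal FreeBASIC sketch of the first pass (the file name, the IndexEntry type and the Crc16 helper are all made up for this example; the map of maps is flattened into one array of <length, crc16, offset> records that you would sort on length and checksum afterwards to group the candidates; with billions of lines this index alone is tens of GB, so in practice you would bucket by line length or spill parts to disk):
Code: Select all
' Sketch only: one pass over the huge file building an in-memory index of
' <line length, CRC16, offset> records. Sorting this array on (length, crc)
' afterwards groups the duplicate candidates, which can then be
' byte-compared by seeking back into the file.

Type IndexEntry
    As ULongInt offset   ' byte offset of the line in the huge file
    As Long     length   ' exact line length
    As UShort   crc      ' 16-bit checksum of the line
End Type

Function Crc16(ByRef s As String) As UShort   ' CRC-16/CCITT, bitwise
    Dim As UShort crc = &hFFFF
    For i As Integer = 0 To Len(s) - 1
        crc Xor= CUShort(s[i]) Shl 8
        For b As Integer = 0 To 7
            If crc And &h8000 Then crc = (crc Shl 1) Xor &h1021 Else crc Shl= 1
        Next b
    Next i
    Return crc
End Function

ReDim As IndexEntry entries(0 To 1023)
Dim As LongInt count = 0

Dim As Long f = FreeFile
Open "huge.txt" For Input As #f          ' file name is a placeholder
Dim As String ln
While Not Eof(f)
    Dim As ULongInt ofs = Seek(f) - 1    ' offset of the line about to be read
    Line Input #f, ln
    If count > UBound(entries) Then ReDim Preserve entries(0 To UBound(entries) * 2 + 1)
    entries(count).offset = ofs
    entries(count).length = Len(ln)
    entries(count).crc    = Crc16(ln)
    count += 1
Wend
Close #f
Print count; " lines indexed"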
If your duplicates are not random (e.g. they might appear in clusters), then a block-level caching system for the comparisons might help. You can even run multiple sizes in different threads.
If you need extra memory, write the filtered <filesize, <checksum, offset>> list to disk, then read it back while inserting the entries into the <offset> list. This avoids having to keep two lists in memory. So in all, this means two passes over the file, plus the various seeks and reads for the comparisons (which depend on your duplicate percentage).
Yes, you can do it with less memory, but it will be more work to program all the merge sorting and it will be much slower, and both by orders of magnitude, not just a bit.
Ideally you want the index to fit into memory twice. If you really are interested, I'm willing to detail it, but in all scenarios you will want to optimise the hardware to as modern as reasonable first. Borrow it if needed, especially if it is a one-off. :-)
IT is as much about soft skills (convincing your boss to buy better hardware) as hard skills (programming) :-)
p.s. I have my own Pascal code for a duplicate file finder based on some of these principles. Since it works on whole files in the multi-megabyte range, it uses MD5, but it builds a list of <int64, <filerectlist>>, where <filerectlist> is a simple list with a checksum and some other params.
p.p.s. when replying, please specify whether this is a one-off case, or whether you have to do this e.g. every month/year for new measurement data or so.
Re: Duplicates
badidea wrote:Isn't a checksum just a 'low quality' hash?
No. A checksum is usually a hash meant to detect errors introduced by mis-sampling of bits in hardware, and those errors are often of the type where bits close to each other flip (since they are caused by the same disruption toggling multiple bits), so these checksums try to avoid that several bits varying close to each other lead to the same checksum.
In the old days a checksum also had to be implementable in low-clocked hardware (though that is debatable nowadays, with Intel processors having SHA hardware and microprocessors reaching 100 MIPS and having CRC32 hardware).
Usually any hash has a data length where the number of collisions starts rising. You want to use the hash that _just_ fits the data, so that you are in the area of relatively low collisions.
See, for example, the first paragraph of https://en.wikipedia.org/wiki/MD5
Anyway, I would also choose a hash function and a smart tree structure to find a hash quickly.
MD5 is a hash meant for whole files (32-bit sizes), so it is overkill for lines of a few thousand bytes. Note that collisions are never 100% excluded, so for nearly all real-world applications hashing is used to decrease the number of compares, not to do away with them completely.
If a cheap hash removes 99.99% of the compares, using a safer hash just eats CPU time. For data up to a few thousand bytes (a few sectors' worth) CRC16 is fine, especially since in the above algorithm it is combined with the exact size of the data. So the actual hash is <size, crc16>, not just crc16. But since we have a list per line size, we don't need to store the size separately for each line.
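(To make the <size, crc16> remark concrete: if you did want a single flat key instead of size-bucketed lists, the two values could be packed into one 32-bit integer. Just a sketch, assuming the line length fits in 16 bits; CombinedKey is a made-up name.)
Code: Select all
' Pack <size, crc16> into one 32-bit key (illustration only; lengths
' above 65535 would need a wider key or separate size buckets).
Function CombinedKey(ByVal length As Integer, ByVal crc As UShort) As ULong
    Return (CULng(length And &hFFFF) Shl 16) Or crc
End Function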
Tourist Trap wrote:jj2007 wrote:Have you ever tried to sort a billion lines of text? Any idea how much time it will take?
I think this kind of files is the reason why people use database systems.
No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.
Re: Duplicates
One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
Re: Duplicates
marcov wrote:Tourist Trap wrote:I think this kind of files is the reason why people use database systems.
No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.
What I meant was that the data should have been stored in a database system from the very start, so it would have been easier to maintain from the very beginning and less prone to some file disaster. I was not suggesting converting the file to a database now that it has become so huge. It would take a while on my machine anyway.
Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles? I think it's about image registration, but any clue would help me with some business at work.
Re: Duplicates
grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
It is already faster to compare Len(any_string) instead of comparing the strings themselves
(instead of a hash, for example, as a first step).
Second step: do the full string comparison only if they have equal length.
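Something like this, as a trivial illustration of that order of checks (SameLine is just a made-up helper name):
Code: Select all
' Cheapest test first: different lengths can never be equal strings,
' so the full byte-for-byte comparison only runs on equal lengths.
Function SameLine(ByRef a As String, ByRef b As String) As Boolean
    If Len(a) <> Len(b) Then Return False
    Return a = b
End Function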
Re: Duplicates
grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
The point is that you have to compare string #123 to the one billion strings that follow it. Feasible with a 32-bit integer comparison, but not with a direct comparison.
Re: Duplicates
Bottlenecks are: 1) the number of unique key values to check at a time (meaning the maximum index size must fit in the available memory), and 2) processing speed. To address this, the main data could first be dispatched into several files according to a computed value, so that no duplicate can end up in two different dispatched files (e.g.: first ascii+30, (second ascii+30)*10, and so on). This just costs time and disk space. Then apply a procedure to each file (the custom cascaded index described above, or another one) to create a result file. Finally, concatenate the result files.
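A rough sketch of such a dispatch pass in FreeBASIC, assuming buckets keyed on the first byte only (the file names and bucket formula are just examples, and a real run would keep the bucket files open instead of reopening one per line):
Code: Select all
' Dispatch pass: identical lines always share the same first byte, so they
' always land in the same bucket file, which can then be deduplicated on
' its own with a much smaller in-memory index.
Dim As Long fin = FreeFile
Open "huge.txt" For Input As #fin
Dim As String ln
While Not Eof(fin)
    Line Input #fin, ln
    Dim As Integer bucket = 0
    If Len(ln) > 0 Then bucket = ln[0]          ' 0..255 buckets
    Dim As Long fout = FreeFile
    Open "bucket_" & bucket & ".txt" For Append As #fout
    Print #fout, ln
    Close #fout
Wend
Close #fin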
Re: Duplicates
grindstone wrote:One (serious) question: Is it really faster to make CRC16 hashes of two strings and compare the hashes than directly comparing the strings for equality?
If you have n lines in a bucket (all lines of the same line size), you need to compare each of the n lines to the n-1 other lines, which means n*(n-1)/2 comparisons (divided by two, since swapping the arguments counts as the same comparison). For example, a bucket of 1,000 lines would need 499,500 direct comparisons.
Now you compute n hashes instead, and only compare lines that have the same hash, which will be only a few, and rarely quadratic.
In short, unless nearly all line-size buckets contain only a very small number (2, 3) of strings, it is almost certainly faster, even if you could fit them all in memory in the first place (you can store the CRC16 on the initial run rather than reloading the lines, but then you do compute a CRC16 for every line).
Re: Duplicates
Tourist Trap wrote:marcov wrote:Tourist Trap wrote:I think this kind of files is the reason why people use database systems.
No. Database systems are very bad for operations that touch all data, IOW queries must have at least one major condition indexed, and the number of queries must be much larger than the original insertion/preparation/indexing of the data to make such a thing worthwhile.
What I meant was that the data should have been stored in a database system from the very start, so it would have been easier to maintain from the very beginning and less prone to some file disaster. I was not suggesting converting the file to a database now that it has become so huge. It would take a while on my machine anyway.
Still, if the kind of query you want to do will typically touch everything, that is still hopeless. Also be very careful with very large datasets: DBMS costs and hardware have a tendency to increase non-linearly in such cases (indicator: instead of naming a price, they want to make an appointment to discuss a "solution").
Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles? I think it's about image registration, but any clue would help me with some business at work.
If you can match it, variance or stddev of the pixels on a LAB or HSV basis would be my initial hunch.
If your registration is not 100%, you might do this with a kernel, in other words calculate the difference of the color, but calculate an average color with a kernel operation first, e.g. multiply the pixels with a 3x3 matrix of weights centered on a pixel, e.g.
.025 .1 .025
.1 .5 .1
.025 .1 .025
(so the middle pixel counts for 50%, the horizontal/vertical neighbours for a total of 4*0.1 = 40%, and the diagonals for 4*0.025 = 10%, so the weights sum to 1)
It also depends on your speed requirements and the order of precision you expect. I did a test of whether a beer-bottle label was well placed (no folds) and had no printing errors, and for that just summing the channel differences |r-r2| + |g-g2| + |b-b2| was already quite OK. But that was not angled, just a comparison against a master with a rough sense of registration, using an AND mask to avoid very big faults near places of dramatic color changes caused by small variations in registration.
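For what it's worth, a minimal sketch of that 3x3 weighted average applied to one pixel of a raw 32-bit &hRRGGBB buffer (WeightedPixel and the buffer layout are just an example; border pixels are not handled):
Code: Select all
' Weighted 3x3 average of the pixel at (x, y) in a w-pixels-wide &hRRGGBB
' buffer; weights are the matrix quoted above (centre 0.5, edges 0.1,
' corners 0.025, summing to 1).
Function WeightedPixel(ByVal img As ULong Ptr, ByVal w As Integer, _
                       ByVal x As Integer, ByVal y As Integer) As ULong
    Dim As Single k(-1 To 1, -1 To 1) = _
        {{0.025, 0.1, 0.025}, {0.1, 0.5, 0.1}, {0.025, 0.1, 0.025}}
    Dim As Single r, g, b
    For dy As Integer = -1 To 1
        For dx As Integer = -1 To 1
            Dim As ULong c = img[(y + dy) * w + (x + dx)]
            r += k(dx, dy) * ((c Shr 16) And 255)
            g += k(dx, dy) * ((c Shr 8) And 255)
            b += k(dx, dy) * (c And 255)
        Next dx
    Next dy
    Return (CULng(r) Shl 16) Or (CULng(g) Shl 8) Or CULng(b)
End Function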
Re: Duplicates
Tourist Trap wrote:marcov wrote:Tourist Trap wrote:I think this kind of files is the reason why people use database systems.
. . .
Marcov, by the way, do you know how to quantify the error made when matching 2 photos shot from 2 different angles? I think it's about image registration, but any clue would help me with some business at work.
TT
Keeping it as simple as possible; sorry, I am not marcov, but maybe he is involved in this kind of thing.
Code: Select all
Screen 20,32
#define tolerance 20
Sub getsize(picture As String,Byref dimensionx As Long,Byref dimensiony As Long)
Open picture For Binary Access Read As #1
Get #1, 19, dimensionx
Get #1, 23, dimensiony
Close #1
End Sub
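' Filter: returns a new image in which every pixel is the (optionally faded)
' average colour of the surrounding pixels within 'rad' pixels.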
Function Filter(Byref tim As Ulong Pointer,_
Byval rad As Single,_
Byval destroy As Long=1,_
Byval fade As Long=0) As Ulong Pointer
#define map(a,b,_x_,c,d) ((d)-(c))*((_x_)-(a))/((b)-(a))+(c)
If fade<0 Then fade=0
If fade>100 Then fade=100
Type p2
As Long x,y
As Ulong col
End Type
#macro _point(_x,_y,colour)
pixel=row+pitch*(_y)+(_x)*4
(colour)=*pixel
#endmacro
#macro ppset(_x,_y,colour)
pixel=row+pitch*(_y)+(_x)*4
*pixel=(colour)
#endmacro
#macro average()
ar=0:ag=0:ab=0:inc=0
xmin=x:If xmin>rad Then xmin=rad
xmax=rad:If x>=(_x-1-rad) Then xmax=_x-1-x
ymin=y:If ymin>rad Then ymin=rad
ymax=rad:If y>=(_y-1-rad) Then ymax=_y-1-y
For y1 As Long=-ymin To ymax
For x1 As Long=-xmin To xmax
inc=inc+1
ar=ar+(NewPoints(x+x1,y+y1).col Shr 16 And 255)
ag=ag+(NewPoints(x+x1,y+y1).col Shr 8 And 255)
ab=ab+(NewPoints(x+x1,y+y1).col And 255)
Next x1
Next y1
If fade=0 Then
averagecolour=Rgb(ar/(inc),ag/(inc),ab/(inc))
Else
averagecolour=Rgb(fd*ar/(inc),fd*ag/(inc),fd*ab/(inc))
End If
#endmacro
Dim As Single fd=map(0,100,fade,1,0)
Dim As Integer _x,_y
Imageinfo tim,_x,_y
Dim As Ulong Pointer im=Imagecreate(_x,_y)
Dim As Integer pitch
Dim As Any Pointer row
Dim As Ulong Pointer pixel
Dim As Ulong col
Imageinfo tim,,,,pitch,row
Dim As p2 NewPoints(_x-1,_y-1)
For y As Long=0 To (_y)-1
For x As Long=0 To (_x)-1
_point(x,y,col)
NewPoints(x,y)=Type<p2>(x,y,col)
Next x
Next y
Dim As Ulong averagecolour
Dim As Long ar,ag,ab
Dim As Long xmin,xmax,ymin,ymax,inc
Imageinfo im,,,,pitch,row
For y As Long=0 To _y-1
For x As Long=0 To _x-1
average()
ppset((NewPoints(x,y).x),(NewPoints(x,y).y),averagecolour)
Next x
Next y
If destroy Then Imagedestroy tim: tim = 0
Function= im
End Function
Sub sort(array() As Ulong,begin As Long,Finish As Long)
Dim As Long i=begin,j=finish
Dim As Ulong x =array(((I+J)\2))
While I <= J
While array(I) < X :I+=1:Wend
While array(J) > X :J-=1:Wend
If I<=J Then Swap array(I),array(J): I+=1:J-=1
Wend
If J >begin Then sort(array(),begin,J)
If I <Finish Then sort(array(),I,Finish)
End Sub
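' compare: sorts both pixel arrays, then counts how many corresponding entries
' match within 'tolerance' per colour channel and returns that fraction
' (0 = completely different, 1 = identical).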
Function compare(a1() As Ulong,a2() As Ulong) As Double
sort(a1(),Lbound(a1),Ubound(a1))
sort(a2(),Lbound(a2),Ubound(a2))
Dim As Double x
For n As Long=Lbound(a1) To Ubound(a1)
if abs(cast(ubyte ptr,@a1(n))[0]-cast(ubyte ptr,@a2(n))[0]) <tolerance andalso _
abs(cast(ubyte ptr,@a1(n))[1]-cast(ubyte ptr,@a2(n))[1]) <tolerance andalso _
abs(cast(ubyte ptr,@a1(n))[2]-cast(ubyte ptr,@a2(n))[2]) <tolerance then x+=1
Next n
Return x/(Ubound(a1)-Lbound(a1)+1)
End Function
'=========================================================
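' Demo: two nearly identical sets of random filled circles (the second shifted
' by one pixel), blurred, saved as bitmaps, reloaded and compared.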
Dim As Any Ptr i1=Imagecreate(512, 500,Rgb(0,100,255))
Dim As Any Ptr i2=Imagecreate(512, 500,Rgb(0,100,255))
For n As Long=1 To 100000
Var x=Rnd*512,y=Rnd*500,r=2+Rnd*5,c=Rnd*Rgb(255,255,255)
Circle i1,(x,y),r,c,,,,f
Circle i2,(x-1,y-1),r,c,,,,f ' <<< slight difference
Next n
i1=filter(i1,10)
i2=filter(i2,10)
Put(0,0),i1,Pset
Put(512,0),i2,Pset
Bsave "bitmap1.bmp",i1
Bsave "bitmap2.bmp",i2
Dim As Long w,h
getsize "bitmap1.bmp",w,h 'same as bitmap2
Redim As Ulong a1((w+1)*(h+1)),a2((w+1)*(h+1))
Bload "bitmap1.bmp",@a1(0)
Bload "bitmap2.bmp",@a2(0)
Locate 36
Print compare(a1(),a2()),"(0 is completely different, 1 is exactly the same)"
Sleep
Kill "bitmap1.bmp"
Kill "bitmap2.bmp"
Sorry it is not very high-tech.
Re: Duplicates
Can't keep up with all the information, but thanks a lot!
marcov wrote:p.p.s. when replying please specify if this is for a one-off case, or that you have to do this e.g. every month/year etc for new measurement data or so.
Maybe about 1-3 times a year. I am creating letter n-gram frequencies for my solver http://zodiackillersite.com/viewtopic.php?f=81&t=3198 and want to test whether removing redundancy from the corpus improves things. So for now it is just a test.
Thanks for your rundown, marcov, it helps me a lot. The file is only 0.5 TB, has 4,143,722,588 lines, and the longest line has a length of 28,409 bytes.
I don't know enough about programming to put people's examples into practice, so I will try something simple such as the following:
Have a 1-dimensional array indexed by CRC32 (2^32 entries). For each line, calculate the CRC32 with the code from https://rosettacode.org/wiki/CRC-32#FreeBASIC and mark that CRC32 value in the array. If the CRC32 is already marked, output the line to a collision file; if not, output it to a non-dupe file. I still have to see about the collision file afterwards, and I am curious about its size. I may be able to do a second run on the collision file while reading the lines in a slightly different manner, so that the CRC32 is different each time.
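Something like this is what I have in mind (a sketch, with a 512 MiB bit array (2^32 bits) instead of a full array of values, a compact bitwise CRC32 standing in for the table-driven Rosetta Code routine, and placeholder file names; a set bit only marks a candidate duplicate, since different lines can share a CRC32):
Code: Select all
' Bitwise CRC-32 (same polynomial as the table-driven Rosetta Code version,
' just slower); one bit per possible CRC value = 2^32 bits = 512 MiB.
Function Crc32(ByRef s As String) As ULong
    Dim As ULong crc = &hFFFFFFFF
    For i As Integer = 0 To Len(s) - 1
        crc Xor= s[i]
        For b As Integer = 0 To 7
            If crc And 1 Then crc = (crc Shr 1) Xor &hEDB88320 Else crc Shr= 1
        Next b
    Next i
    Return Not crc
End Function

Dim As UByte Ptr seen = CAllocate(512 * 1024 * 1024)   ' zero-initialised bit array

Dim As Long fin  = FreeFile : Open "corpus.txt"     For Input  As #fin
Dim As Long fdup = FreeFile : Open "collisions.txt" For Output As #fdup
Dim As Long funi = FreeFile : Open "nondupes.txt"   For Output As #funi

Dim As String ln
While Not Eof(fin)
    Line Input #fin, ln
    Dim As ULong crc = Crc32(ln)
    Dim As ULong byteIdx = crc Shr 3
    Dim As UByte mask = 1 Shl (crc And 7)
    If seen[byteIdx] And mask Then
        Print #fdup, ln          ' CRC seen before: duplicate *candidate*
    Else
        seen[byteIdx] Or= mask
        Print #funi, ln
    End If
Wend

Close #funi : Close #fdup : Close #fin
DeAllocate(seen)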
Re: Duplicates
Can anyone make a CRC-n? For example CRC33 or CRC35.