Hey all,
I have a very large text file (near 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?
Thanks
Duplicates
Re: Duplicates
Looks like a case for hashing
Re: Duplicates
jj2007 wrote: Looks like a case for hashing
Thanks jj2007. I was thinking of using a checksum for each line of text.
Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.
A good hash function satisfies two basic properties: 1) it should be very fast to compute; 2) it should minimize duplication of output values (collisions).
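For illustration, a minimal FreeBASIC sketch of such a hash (FNV-1a, 32-bit). The function name and the two constants are just the standard FNV-1a parameters; this is only an example, not a recommendation of one particular hash.
Code: Select all
Function fnv1a32(s As String) As ULong
    Dim As ULong h = &h811C9DC5            'FNV-1a offset basis
    For i As Integer = 0 To Len(s) - 1
        h Xor= s[i]                        'mix in one byte of the line
        h *= &h01000193                    'FNV prime, wraps modulo 2^32
    Next
    Return h
End Function

Print Hex(fnv1a32("some line of text"))
Sleep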
Re: Duplicates
Isn't a checksum just a 'low quality' hash?
See for example first paragraph of https://en.wikipedia.org/wiki/MD5
Anyway, I would also choose a hash function and a smart tree structure to find a hash quickly.
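Just to sketch the idea (a rough illustration of my own; a real solution would still need to handle hash collisions by comparing the actual lines, and this sketch never frees its nodes):
Code: Select all
Type HashNode
    value As ULong
    lft As HashNode Ptr
    rgt As HashNode Ptr
End Type

'insert a hash into a binary search tree; returns FALSE if it was already present
Function InsertHash(ByRef node As HashNode Ptr, ByVal h As ULong) As Boolean
    If node = 0 Then
        node = New HashNode
        node->value = h
        node->lft = 0
        node->rgt = 0
        Return TRUE
    End If
    If h = node->value Then Return FALSE      'seen before: possible duplicate line
    If h < node->value Then Return InsertHash(node->lft, h)
    Return InsertHash(node->rgt, h)
End Function

Dim As HashNode Ptr root = 0
Print InsertHash(root, &h12345678)            'true: new hash
Print InsertHash(root, &h12345678)            'false: already in the tree
Sleep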
Re: Duplicates
So we’re talking roughly a billion lines of text?
One question that occurs to me is, how many unique lines are there likely to be?
I think any hash-based method will need a table with at least that many entries, which could easily be gigabytes in size and might not easily fit in memory. And I’m not sure how easy collision checking will be.
A good strategy might be to sort the array, then duplicate checking is easy.
Instead of creating temporary copies of the entire file, you can create an index of “pointers” into the file for each line. For less than 1TiB, 5-byte pointers will fit nicely.
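As a rough illustration of the 5-byte pointers (my own sketch, names included): pack each line's file offset into 5 bytes, which covers offsets up to 2^40 bytes, i.e. 1 TiB.
Code: Select all
'store a 40-bit file offset at slot 'index' of a packed index buffer
Sub PutOffset40(buf As UByte Ptr, ByVal index As LongInt, ByVal offset As ULongInt)
    Dim As UByte Ptr p = buf + index * 5
    For i As Integer = 0 To 4
        p[i] = (offset Shr (i * 8)) And &hFF     'little-endian, 5 bytes per entry
    Next
End Sub

Function GetOffset40(buf As UByte Ptr, ByVal index As LongInt) As ULongInt
    Dim As UByte Ptr p = buf + index * 5
    Dim As ULongInt offset
    For i As Integer = 0 To 4
        offset Or= CULngInt(p[i]) Shl (i * 8)
    Next
    Return offset
End Function

'a billion lines * 5 bytes is roughly a 5 GB index; here just a tiny demo buffer
Dim As UByte Ptr idx = Callocate(10 * 5)
PutOffset40(idx, 3, 123456789012ULL)
Print GetOffset40(idx, 3)
Deallocate(idx)
Sleep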
Re: Duplicates
From the coding point of view it's not a big deal (see the code below). But my guess is that Provoni does not have enough free disk space for a copy of that file and is therefore looking for a way to remove the duplicated data from the existing file. AFAIK there's no trivial way to do so, although it should be possible in principle.
Code: Select all
Dim As String src, dst, oldfile, newfile
Dim As Boolean duplicate
oldfile = "C:\oldfile.txt" 'replace with the desired file names
newfile = "C:\newfile.txt"
Open oldfile For Input As #1
Open newfile For Output As #2
'copy the 1st entry from source to destination file
Line Input #1, src
Print #2, src
Close 2
Do Until Eof(1)
    Line Input #1, src 'read the next entry from the source file
    Open newfile For Input As #2
    duplicate = FALSE
    Do 'scan the destination file if entry already exists
        Line Input #2, dst
        If dst = src Then 'duplication found
            duplicate = TRUE 'set flag
            Exit Do 'terminate scan
        EndIf
    Loop Until Eof(2)
    Close 2
    If duplicate = FALSE Then 'if no duplication then write entry into destination file
        Open newfile For Append As #2
        Print #2, src
        Close 2
    EndIf
Loop
Close
Re: Duplicates
Ok, I think that's a passable way to solve that task (if you find any bugs, please let me know): the code below rewrites the file in place. The only issue that remains is how to cut off and release the empty space at the end of the file and update the file information on the hard disk. If anyone has a clue how to do that, please share your wisdom.
Code: Select all
Dim As ZString*2 h
Dim As String src, dst, testfile, g
Dim As Integer filepointer1, filepointer2, readfrom, writeto, x, y, endofdata
testfile = "C:\testfile.txt"
'create testfile
Open testfile For Output As #1
Print #1, "one"
Print #1, "two"
Print #1, "three"
Print #1, "three"
Print #1, "three"
Print #1, "four"
Print #1, "five"
Print #1, "six"
Print #1, "seven"
Print #1, "three"
Print #1, "eight"
Print #1, "two"
Print #1, "nine"
Print #1, "ten"
Print #1, "one"
Close
Print "Open the original testfile in an editor for purpose of comparison"
Print
Print "Then press any key to continue"
Print
Sleep
Open testfile For Binary As #1
Print "Length of original file: "; Lof(1)
endofdata = Lof(1) 'pointer to 'real' end of valid data
Do
    filepointer1 = Seek(1) 'remember the file pointer before reading the source data
    Line Input #1, src
    filepointer2 = Seek(1) 'remember the file pointer after reading the source data
    Do Until Seek(1) >= endofdata 'until the end of valid data
        writeto = Seek(1) 'pointer to the beginning of the area where the remaining data will be written if a duplicate is found
        Line Input #1, dst
        readfrom = Seek(1) 'pointer to the beginning of the remaining data if a duplicate is found
        If src = dst Then 'duplicate found
            filepointer2 = filepointer1 'reset the file pointer to check the same entry again in case of multiple duplicates
            'shift the remaining data, overwriting the duplicate (to be improved)
            For x = 0 To endofdata - readfrom
                Get #1, readfrom + x, h
                Put #1, writeto + x, h
            Next
            'overwrite the now invalid data at the end of the file
            For y = writeto + x To writeto + x + Len(dst) + 1
                Put #1, y, "#"
            Next
            endofdata -= Len(dst) + 2 'set the new end of valid data
        EndIf
    Loop
    Seek 1, filepointer2 'reset the file pointer to the beginning of the entry to check next
Loop Until Seek(1) >= endofdata
Print "Amount of valid data (should be the length of the revised file)"; endofdata
Print
Print "Now open the same file in the editor to see the result of the revision"
Close
Sleep
Re: Duplicates
counting_pine wrote: So we're talking roughly a billion lines of text?
...
A good strategy might be to sort the array, then duplicate checking is easy.

Have you ever tried to sort a billion lines of text? Any idea how much time it will take?
Re: Duplicates
grindstone wrote: The only issue that remains is how to cut off and release the empty space at the end of the file and update the file information on the hard disk.

In the upcoming FBC 1.08.0 (currently in DEV state) that functionality is implemented.
See entries #567 and #568 in the changelog:
- Add new statement "FLUSH" - flush file buffers to disk (567)
- rtlib: FileSetEof, rename FileTruncate to FileSetEof (568)
For testing, download the latest build (2020-01-13), last change #601: 2020-01-12.
Re: Duplicates
jj2007 wrote: Have you ever tried to sort a billion lines of text? Any idea how much time it will take?
I think files like this are the reason why people use database systems.
Re: Duplicates
1 TB is a bit big for me.
Here is one of about 3 million characters (Grindstone's list doubled up a few times)
Code: Select all
Function tally(somestring As String,partstring As String) As Integer
Dim As Integer i,j,ln,lnp,count,num
ln=Len(somestring)
lnp=Len(partstring)
count=0
i=-1
Do
i+=1
If somestring[i] <> partstring[0] Then continue do
If somestring[i] = partstring[0] Then
For j=0 To lnp-1
If somestring[j+i]<>partstring[j] then continue do
Next j
End If
count+=1
i=i+lnp-1
Loop Until i>=ln-1
Return count
End Function
Function findAndReplace(original As String , find As String , replace As String) As String
If Len(find) = 0 Then Return original
Var t = tally(original,find) 'count occurrences of find
Dim As long found, n, staid,m
Var Lf = Len(find) , Lr = Len(replace) , Lo = Len(original)
Dim As long x = Len(original) - t * Lf + t * Lr 'length of output string
dim As String res = String(x,0) 'output string
Do
st:
If original[n] = find[0] Then 'got a possible
For m = 0 To Lf - 1
If original[n + m] <> find[m] Then Goto lbl 'nope
Next m
found = 1 'Bingo
End If
If found Then
For m = 0 To Lr - 1
res[staid] = replace[m] 'load the replacement
staid += 1
Next m
n += Lf
found = 0
goto fin
End If
lbl:
res[staid] = original[n]
staid += 1
n += 1
fin:
Loop Until n >= Lo
Return res
End Function
sub insert(s as string,char as string,position as long)
If position > 0 then
var part1=Mid(s,1,position-1)
var part2=trim(Mid(s,position),chr(13,10))
s=part1+char+chr(13,10)+part2
End if
end sub
#Include "file.bi"
Function loadfile overload(file as string) as String
If FileExists(file)=0 Then Print file;" not found":Sleep:end
dim as long f=freefile
Open file For Binary Access Read As #f
Dim As String text
If Lof(f) > 0 Then
text = String(Lof(f), 0)
Get #f, , text
End If
Close #f
return text
end Function
sub savefile overload(filename As String,p As String)
Dim As long n=freefile
If Open (filename For Binary Access Write As #n)=0 Then
Put #n,,p
Close
Else
Print "Unable to save " + filename:sleep:end
End If
End sub
Open "testfile" For Output As #1
Print #1, "one"
Print #1, "two"
Print #1, "three"
Print #1, "three"
Print #1, "three"
Print #1, "four"
Print #1, "five"
Print #1, "six"
Print #1, "seven"
Print #1, "three"
Print #1, "eight"
Print #1, "two"
Print #1, "nine"
Print #1, "ten"
Print #1, "one"
Close
var L=loadfile("testfile")
for n as long=1 to 15 'increase size and re save
L+=L
next
savefile("testfile",L)
print "File length = ";filelen("testfile")
L=loadfile("testfile") '' re load bigger file
'print "Sample"
'print mid(L,1,1000)
Open "testfile" For input As #2
dim as string s
while not eof(2)
line input #2,s
var p=instr(L,s)
L=findandreplace(L,s,"")
insert(L,s,p)
wend
close
print L
sleep
kill "testfile"
Re: Duplicates
Provoni wrote: Hey all,
I have a very large text file (near 1 TB) and on each line there is some text with a minimum length of twenty bytes and a maximum length of perhaps thousands of bytes. From this file I want to remove the duplicate entries. How would one approach this in FreeBASIC?
Thanks
@Provoni
Thank you for this question.
I do not have an answer, but it meshes with a problem that vexed me: the longest common substring problem.
I used a 4-byte index pointer to the first byte in the string and a 2-byte short for the length of the string.
A "slightly modified" merge short was used to sort and maintain the StringLength, Pointers, (and alphanumeric sequence)
Only those lines with same byte length were byte compared for duplication.
Interesting "little" problem.
Re: Duplicates
MrSwiss wrote: In the upcoming FBC 1.08.0 (currently in DEV state) that functionality is implemented.

Thank you, that's good news.
Re: Duplicates
jj2007 wrote:
counting_pine wrote: So we're talking roughly a billion lines of text?
...
A good strategy might be to sort the array, then duplicate checking is easy.
Have you ever tried to sort a billion lines of text? Any idea how much time it will take?

OK, fair enough.
Running the numbers, I guess a terabyte file of a billion lines would have to be iterated through roughly 30 times (an external merge sort needs about log2(1 billion) ≈ 30 passes over the data). That's a lot.
I was weighing it up against the alternative of a very large hash table. But I guess "very large" is basically just gigabytes. Too much to easily fit in RAM, but pretty easy in an era of terabyte hard disks.
So I guess a disk-based hash table of around a billion entries, plus a second table for collisions. Each hash table entry contains an index into the second table. The row of the second table contains the index of the matching line, and also a link into the same table that chains together any hash collisions.
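To make that layout concrete (names and record layout are my own sketch, not a finished design): the bucket file holds one 8-byte slot per hash value, each slot pointing at a row in a second file whose rows chain together all lines sharing that hash. Note that a full 32-bit hash space already means a 32 GB bucket file, so in practice the hash would be masked down or the table kept sparse.
Code: Select all
Type CollisionRow Field = 1
    lineIndex As ULongInt       'index of the matching line in the source file
    nextRow As ULongInt         'next row with the same hash, 0 = end of chain
End Type

'read/write one 8-byte bucket slot; file positions are 1-based
Function ReadBucket(ByVal f As Integer, ByVal h As ULong) As ULongInt
    Dim As ULongInt rowNo
    Get #f, CULngInt(h) * 8 + 1, rowNo
    Return rowNo
End Function

Sub WriteBucket(ByVal f As Integer, ByVal h As ULong, ByVal rowNo As ULongInt)
    Put #f, CULngInt(h) * 8 + 1, rowNo
End Sub

'tiny demo with a scratch bucket file
Dim As Integer f = FreeFile
Open "buckets.dat" For Binary As #f
WriteBucket(f, &h00FF, 42)
Print ReadBucket(f, &h00FF)     '42
Close #f
Kill "buckets.dat"
Sleep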
Re: Duplicates
You can't load a terabyte file into memory. So I would:
- read 2 MB chunks from source file into memory (roughly what the L2 cache can hold)
- write 32-bit hashes to a file
- write 64-bit values to another file, with the length (16 bits) and start position (48 bits) of each string in the source file (a small packing sketch follows this list)
- reopen the hash file and start with the first hash, checking all others for identity
- if a hash matches, check that it is not just a collision (compare the actual strings); if it really is a duplicate, mark it as bad in the hash file and set the string's length to -1 in the positions file
- repeat this for all hashes
- finally, open a destination file, and (based on positive length in the positions file) write all valid strings to this file.
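A small sketch of those 64-bit records (the exact packing order, high 16 bits for the length and low 48 bits for the start position, is my assumption; a length of -1 marks a duplicate, as described above):
Code: Select all
Function PackPos(ByVal start As ULongInt, ByVal length As Short) As ULongInt
    Return (CULngInt(CUShort(length)) Shl 48) Or (start And &hFFFFFFFFFFFFULL)
End Function

Sub UnpackPos(ByVal packed As ULongInt, ByRef start As ULongInt, ByRef length As Short)
    start = packed And &hFFFFFFFFFFFFULL
    length = CShort(packed Shr 48)          'wraps &hFFFF back to -1
End Sub

Dim As ULongInt start
Dim As Short length
UnpackPos(PackPos(123456789012ULL, 250), start, length)
Print start; " "; length                    '123456789012  250
UnpackPos(PackPos(123456789012ULL, -1), start, length)
Print start; " "; length                    'duplicate marker: length is -1
Sleep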