## remove duplicates intries in an array

MrSwiss
Posts: 3830
Joined: Jun 02, 2013 9:27
Location: Switzerland

### Re: remove duplicates intries in an array

Since strings are distinctly different from numeric/boolean data-types ...
They should therefore be treated somewhat differently too.
Because it should work correctly independent of "casing" (lower-/upper-/mixed-case),
we explicitly compare equally cased strings.
Below code is explicitly string specific and caters for two scenarios:
• default: mixed lengths of strings (in array), mostly expected.
• otherwise: equal lengths of strings expected/known (depends on input, e.g. *.csv file)
In default mode a two step approach is used (for speed). The method is called: "quit early".
(NO chronological order of the elements attempted.):

Code: Select all

`Sub MakeUnique( _                       ' case insensitive check          a()   As String,        _    ByVal chlen As Boolean = TRUE _     ' default: do len check first!    )    Var i = LBound(a), j = 0, _        k = UBound(a)    If chlen Then                       ' faster if: mixed string length        While i < k                     ' only if both string lengths are equal _            j = i + 1                   ' we do a full string comparison, other- _            While j <= k                ' wise: quit early                If Len(a(i)) = Len(a(j)) Then                    If LCase(a(i)) = LCase(a(j)) Then                        a(j) = a(k)                        k -= 1                    Else                        j += 1                    End If                Else                    j += 1                End If            Wend            i += 1        Wend    Else                                ' faster if: strings are mostly/always _        While i < k                     ' of the same length            j = i + 1            While j <= k                If LCase(a(i)) = LCase(a(j)) Then                    a(j) = a(k)                    k -= 1                Else                    j += 1                End If            Wend            i += 1        Wend    End If    Redim Preserve a(LBound(a) To k)    ' remove 'empties'End Sub`
Sorting should be dealt with, in a dedicated procedure (IMO).
aloberoger
Posts: 502
Joined: Jan 13, 2009 19:23

### Re: remove duplicates intries in an array

ok it's good now, i will test the speed later
MrSwiss
Posts: 3830
Joined: Jun 02, 2013 9:27
Location: Switzerland

### Re: remove duplicates intries in an array

aloberoger wrote:ok it's good now, i will test the speed later
Just remember that, at the top of the priority-list is: QUALITY, before anything else (including speed).

All the speed in the world is useless if the code produces nonsense even if only sporadically.
(Those are typically the worst bugs to find and thereafter fix.)
counting_pine
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

### Re: remove duplicates intries in an array

I wonder if skipping the length check will save enough time - compared to the lower case conversions and comparison - to be worth having an optimised routine for.
I think a better optimisation would be to use a case-insensitive comparison routine, even if you have to implement it yourself. This saves having to copy/convert both strings, and makes a preliminary length check unnecessary.

Code: Select all

`cat /tmp/equal.basfunction equal_lcase(byref s1 as const string, byref s2 as const string) as boolean  if len(s1) <> len(s2) then return false  for i as integer = 0 to len(s1)-1    if s1[i] = s2[i] then      '' same ASCII value    elseif ((s1[i] xor s2[i]) and not 32) = 0 then      if cunsg((s1[i] and not 32) - asc("A")) < 26 then        '' same letter, different case      else        '' not a letter        return false      end if    else      '' not equal      return false    end if  next i  return trueend functionassert(equal_lcase("", ""))assert(equal_lcase("ABCdef", "abcDEF"))assert(equal_lcase("Hello", "hello"))assert(not equal_lcase("123", "1234"))assert(not equal_lcase("A", "B"))assert(not equal_lcase("A", "b"))for i as integer = 0 to 255  assert(equal_lcase(chr(i), chr(i)))  select case (i and not 32)  case asc("A") to asc("Z")    assert(equal_lcase(chr(i), chr(i xor 32)))  case else    assert(not equal_lcase(chr(i), chr(i xor 32)))  end selectnext i`
MrSwiss
Posts: 3830
Joined: Jun 02, 2013 9:27
Location: Switzerland

### Re: remove duplicates intries in an array

counting_pine wrote:I wonder if skipping the length check will save enough time
The skipping of the len-check is exclusively for: "known to be equal len" strings (as from a formated *.csv or similar).
NOT for the typical string array containing strings of arbitrary length.

The lenght check is "optimisation" because it skips a lot of those string conversions/comparisons ...
Probably in the region of reducing them (in a typicall string-array) to <= 10%, skipping >= 90%.
(Just to explain the logic I'm employing.)

You seem to go far deeper into details here. I'll look into that later too.
counting_pine
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

### Re: remove duplicates intries in an array

Sorry, I phrased my suggestion badly, and made it sound like I meant the opposite of what I intended.
I agree, the length check is a great time saver. I also suggest that its cost is negligible enough that it’s not worth duplicating most of the code just for the purpose of skipping it.
MrSwiss
Posts: 3830
Joined: Jun 02, 2013 9:27
Location: Switzerland

### Re: remove duplicates intries in an array

@counting_pine, great idea/job ... thank you.

I've finally dug into your code, optimized it to the last CPU clock-tick and inserted it into MakeUnique():
• offhand speed gain: roughly a factor 10 (not timed, could be more)
• test-file size: about 400 KB (type: *.csv)
• twice tested:
- first: splitted into lines (equal string length, about 4'300 lines)
- second: splitted into words (different string length, much shorter, roughly 48'000 words)
The code now:

Code: Select all

`' was: Function equal_lcase() -- original code by counting_pine' renamed and recoded (for speed) -- MrSwissFunction Equal_XCase( _                 ' case insensitive string comp.    ByRef s1    As Const String, _    ByRef s2    As Const String  _    ) As Boolean    If Len(s1) <> Len(s2) Then Return FALSE    For i As UInteger = 0 to Len(s1) - 1        If s1[i] = s2[i] Then            ' same ASCII value        ElseIf ((s1[i] Xor s2[i]) And -33) = 0 Then            If CUnsg((s1[i] And -33) - 65) < 26 Then                ' same letter, different case            Else                ' not a letter                Return FALSE            End If        Else            ' not equal            Return FALSE        End If    Next    ' all checks passed = are equal (except: case may differ)    Return TRUEEnd Function' string treatment differs from numeric data-types ...Private Sub MakeUnique OverLoad( _      ' case insensitive check          a()   As String _    )    Var i = LBound(a), j = 0, _        k = UBound(a)    While i < k        j = i + 1        While j <= k            If Equal_XCase(a(i), a(j)) Then ' string equality check _                a(j) = a(k) : k -= 1    ' case insensitive comparison            Else                j += 1            End If        Wend                i += 1    Wend    Redim Preserve a(LBound(a) To k)    ' remove 'empties'End Sub`
counting_pine
Posts: 6287
Joined: Jul 05, 2005 17:32
Location: Manchester, Lancs

### Re: remove duplicates intries in an array

Another potential speedup, although slightly unintuitive, would be to use Swap instead of assignments to reorder the array.
Swapping string descriptors leaves the string data unchanged, and saves having to create or destroy new strings.

In conjunction with the string comparison function, that should mean that no string allocations at all are required inside the routine.
MrSwiss
Posts: 3830
Joined: Jun 02, 2013 9:27
Location: Switzerland

### Re: remove duplicates intries in an array

I've tested your suggestion using: Swap ...
This time I've implemented timing, expecting a smallish difference only. No optimisations used.

Result: the difference (if any) varies more in-between runs, than in-between implementations ...
Leading to the conclusion: Swap doesn't influence the result in a measurable way.
(this may in a different context be very much different)

This proves clearly, that proper testing always beats (possibly erroneous) assumptions.
fxm
Moderator
Posts: 10377
Joined: Apr 22, 2009 12:46
Location: Paris suburbs, FRANCE

### Re: remove duplicates intries in an array

It might be necessary to test this with many array elements containing long strings to see the difference.
counting_pine