Faster string copying/building

Iczer
Posts: 60
Joined: Jul 04, 2017 18:09

Faster string copying/building

Postby Iczer » Apr 04, 2018 22:13

Is there a faster method of copying/building strings than this?

Code: Select all

Dim As ZString Ptr sSubject
Dim As ZString * 64 sReplasement
Dim As String sOutPutString

Dim As Long iPosStart = 0, start_match, current_position

' -------------------------------------------------------------
' string building/replasement part:

iMatchCount += 1

sOutPutString &= Left(sSubject[iPosStart], start_match - iPosStart)
sOutPutString &= sReplasement

iPosStart = current_position

' -------------------------------------------------------------



I'm writing PCRE RegExp functions and they work, but after around 50-80 replacements per subject string, the code with the string-building part above starts to lose to the AutoIt functions. At 20k matches it takes 3-4 seconds, depending on the subject string, while the AutoIt function under the same conditions runs in about 19-20 milliseconds.
MrSwiss
Posts: 3661
Joined: Jun 02, 2013 9:27
Location: Switzerland

Re: Faster string copying/building

Postby MrSwiss » Apr 04, 2018 22:19

Try String (instead of ZString); tests a while ago showed it to be faster than ZString.
Tourist Trap
Posts: 2958
Joined: Jun 02, 2015 16:24

Re: Faster string copying/building

Postby Tourist Trap » Apr 04, 2018 22:30

Worth mentioning: this related topic may be of interest to you:
viewtopic.php?f=3&t=24563&hilit=split+string
Iczer
Posts: 60
Joined: Jul 04, 2017 18:09

Re: Faster string copying/building

Postby Iczer » Apr 04, 2018 22:34

I cannot change sSubject; it has to be a ZString Ptr because it is passed as a parameter to the pcre_exec() and pcre_callout() functions.
If I use sReplasement as a String, it gives strange results in this case:

Code: Select all

sOutPutString &= sReplasement

but if I use it like this it's OK:

Code: Select all

sOutPutString &= Chr(sReplasement[0])


Thanks, Tourist Trap!
cbruce
Posts: 136
Joined: Sep 12, 2007 19:13
Location: Dallas, Texas

Re: Faster string copying/building

Postby cbruce » Apr 09, 2018 4:20

Iczer, I believe you are looking for a String Builder function.

In regular string concatenation, you are utilizing small buffers and having to allocate memory and move small fragments around a LOT. That is where performance becomes an issue when you have many strings to concatenate.

String Builders start with a large buffer, place your string parts into it until it is full, then allocate another buffer at least twice as large as the first one and continue. No concatenation is involved in filling the buffers - just moves.
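The scheme described above can be sketched in C (the idea ports directly to FreeBASIC); the struct and function names here are illustrative, not code from this thread:

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *buf;
    size_t len;   /* bytes in use    */
    size_t cap;   /* bytes allocated */
} StrBuilder;

/* Start with one large-ish allocation (initial_cap must be >= 1). */
static void sb_init(StrBuilder *sb, size_t initial_cap) {
    sb->buf = malloc(initial_cap);
    sb->len = 0;
    sb->cap = initial_cap;
}

/* Append n bytes; when the buffer is full, at least double it, so the
   number of reallocations stays logarithmic in the final length. */
static void sb_append(StrBuilder *sb, const char *s, size_t n) {
    if (sb->len + n > sb->cap) {
        while (sb->len + n > sb->cap)
            sb->cap *= 2;
        sb->buf = realloc(sb->buf, sb->cap);  /* one move of the old contents */
    }
    memcpy(sb->buf + sb->len, s, n);          /* plain move, no concatenation */
    sb->len += n;
}
```

Appending N fragments this way costs O(log N) reallocations instead of one per fragment.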

It's odd that Tourist Trap didn't mention the String Builder that he posted himself at:
[smile]
PaulSquires
Posts: 883
Joined: Jul 14, 2005 23:41

Re: Faster string copying/building

Postby PaulSquires » Apr 09, 2018 13:35

cbruce wrote:Iczer, I believe you are looking for a String Builder function.

In regular string concatenation, you are utilizing small buffers and having to allocate memory and move small fragments around a LOT. That is where performance becomes an issue when you have many strings to concatenate.

String Builders start with a large buffer, place your string parts into it until it is full, then allocate another buffer at least twice as large as the first one and continue. No concatenation is involved in filling the buffers - just moves.

It's odd that Tourist Trap didn't mention the String Builder that he posted himself at:
[smile]

Coming from a background in PowerBasic, I also thought that speed improvements could be obtained by using a string builder routine. I ported my code from PB that gave incredible improvements concatenating PB's BSTR strings, but the results were slightly worse when dealing with FB strings. I am pretty sure the reason is that FB's strings already have a buffer built into them, so growing a string incurs very little performance penalty. FB string concatenation is pretty fast compared to, at least, PB strings. That was my experience in this area; maybe others have had a different one.
Iczer
Posts: 60
Joined: Jul 04, 2017 18:09

Re: Faster string copying/building

Postby Iczer » Apr 10, 2018 12:39

Thanks; indeed, the culprit was string concatenation. With memcpy() alone the speed became good.
I posted the RegExp class here https://www.freebasic.net/forum/viewtopic.php?f=8&t=26591&sid=f604f81164ed787f86896c9206e63280 so you can see the results.
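The fix described here (appending with memcpy() into a buffer that is grown ahead of time, instead of using `&=`) can be sketched in C; all names are illustrative, not code from the actual class:

```c
#include <stdlib.h>
#include <string.h>

/* Grow the output buffer by doubling until it can hold `need` bytes. */
static void ensure(char **out, size_t *cap, size_t need) {
    if (need > *cap) {
        while (need > *cap) *cap *= 2;
        *out = realloc(*out, *cap);
    }
}

/* One replacement step: append subject[pos_start .. start_match) plus the
   replacement text, with two memcpy()s instead of two concatenations.
   Returns the new output length. */
static size_t emit(char **out, size_t out_len, size_t *out_cap,
                   const char *subject, size_t pos_start, size_t start_match,
                   const char *repl, size_t repl_len) {
    size_t span = start_match - pos_start;           /* unmatched prefix */
    ensure(out, out_cap, out_len + span + repl_len);
    memcpy(*out + out_len, subject + pos_start, span);
    memcpy(*out + out_len + span, repl, repl_len);
    return out_len + span + repl_len;
}
```

Each match then costs two moves into an already-allocated buffer, rather than a reallocation per `&=`.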
Tourist Trap
Posts: 2958
Joined: Jun 02, 2015 16:24

Re: Faster string copying/building

Postby Tourist Trap » Apr 22, 2018 8:35

cbruce wrote:It's odd that Tourist Trap didn't mention the String Builder that he posted himself at:
[smile]

Ahah, that's right. The example is from the documentation; I never really tried to test it. Thanks for your very concise definition of string builders. I remember fighting to understand what they were and why they were fast in VB.NET; the documentation there is not always straightforward in its wording.
jj2007
Posts: 1827
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Faster string copying/building

Postby jj2007 » Apr 22, 2018 10:38

cbruce wrote:String Builders start with a large buffer and they place your string parts into that until it is full and then allocate another buffer at least twice as large as the first one and continues.
Why twice as large? The string builder should first gather the total length of the strings to be concatenated (easy in FB, as len is already available), then allocate one buffer with that size and copy all the small fragments there. At least, that's what I do with mine, and it works fine ;-)
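The measure-first approach can be sketched in C: sum the lengths in a first pass, allocate exactly once, then copy. This is an illustrative helper, not jj2007's actual implementation:

```c
#include <stdlib.h>
#include <string.h>

/* Two-pass join: pass 1 measures (cheap, since each part's length is
   already known), pass 2 does a single allocation and plain memcpy()s.
   Caller frees the returned string. */
char *join_exact(const char **parts, size_t nparts) {
    size_t total = 0;
    for (size_t i = 0; i < nparts; i++)
        total += strlen(parts[i]);          /* pass 1: measure */
    char *out = malloc(total + 1);          /* single precise allocation */
    size_t pos = 0;
    for (size_t i = 0; i < nparts; i++) {   /* pass 2: copy */
        size_t n = strlen(parts[i]);
        memcpy(out + pos, parts[i], n);
        pos += n;
    }
    out[total] = '\0';
    return out;
}
```

This trades a second scan of the inputs for the guarantee of exactly one allocation.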

Can somebody help me with the FB equivalent to this little attempt at string building? The manual does not mention string arrays...

Code: Select all

$Data Some, small, strings, that, serve, to, create, one, string, array
  Read x$()      ; create a string array from the $Data above
  MyString="Result: "+x$(0)+" "+x$(1)+" "+x$(2)+" "+x$(3)+" "+x$(4)+" "+x$(5)+" "+x$(6)+" "+x$(7)+" "+x$(8)+" "+x$(9)

The last line builds MyString from 20 components. MyString should print as Result: Some small strings that serve to create one string array
cbruce
Posts: 136
Joined: Sep 12, 2007 19:13
Location: Dallas, Texas

Re: Faster string copying/building

Postby cbruce » Apr 22, 2018 15:54

jj2007, if all of the strings that you want to concatenate are already available as literals or in variables, then go ahead and get their lengths and allocate a buffer... just as you suggest.

But that does not work if you have to concatenate many unknown strings being passed from other routines, read from data files, and so on. In this case, it is much more efficient to use a string builder.

If you allocate a new buffer that is double the size of the previous buffer every time you run out of space - you have a simple move to get the contents of the old buffer into the new buffer - and you also have room for adding at least as much data as you already had in the previous buffer.

The whole point of this is to minimize the number of memory allocations/deallocations that you need to do - because they kill your program's performance if they are done too often. (Which is why this whole thread got started).

So we create a buffer, stuff strings into it until it is (or would be) filled, create another, bigger buffer, stuff the old buffer's contents into the new buffer - and continue - with a minimal amount of allocations.

A string builder stops us from doing:
buf = buf + str1
buf = buf + str2
buf = buf + str3
buf = buf + str4
...
buf = buf + strN

Concatenations cause new allocations/deallocations.
Once is ok - but 1,000 or 1,000,000 times would be very bad.

A string builder uses a buffer and stuffs strings into it:

Code: Select all

B1 = SPACE(30)
S1 = "111111111" 'len = 9
S2 = "2222" 'len = 4
'
'B1 >>>                              <<<
MID(B1,1,9) = S1
'B1 >>>111111111                     <<<
MID(B1,10,4) = S2
'B1 >>>1111111112222                 <<<
'
'When strings won't fit, allocate a second, larger string B2 as your
'new buffer...
B2 = SPACE(LEN(B1) * 2)
'MID() the old B1 contents into the B2 buffer...
MID(B2, 1, end_of_data_in_buffer) = LEFT(B1, end_of_data_in_buffer)
'Deallocate your old buffer...
B1 = ""
'and continue
'...
'...
'switch back to B1 the next time the builder needs to grow.
'etc...
'
'when you are done string building...
'grab the *real* string contents out of the buffer you are working in...
myresult_string = MID(Bx, 1, end_of_data_in_buffer)


Moving strings into a large buffer is MUCH more efficient than doing a bunch of concatenations!

If you look at Microsoft's .NET stringbuilder class, you will see that they start with a buffer of 16 bytes, and then double it each time from there. Java's class is very similar, and so is everyone else's.

.NET StringBuilder Class (System.Text)
-- Refer to the "Memory Allocation" section:
https://msdn.microsoft.com/en-us/library/system.text.stringbuilder%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396

You can see from the MS doc that "double the buffer size each time" is just a basic rule of thumb - (one that has worked well for most generic string builder classes for the past 10-20 years). Specific needs should drive the parameters of your string builder if you are looking for the best performance.

I personally tend to start with a 1k buffer and grow by 4x... but that is because of the specific data that I work with a lot of the time.
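The effect of the growth factor is easy to quantify: growing from capacity c0 by factor g until the capacity reaches N takes about log_g(N/c0) reallocations, versus one per append for naive concatenation. A tiny check, using the parameters mentioned above:

```c
#include <stddef.h>

/* Count how many times a buffer that starts at `cap` bytes and multiplies
   by `factor` must be re-grown to hold at least `target` bytes. */
static int growth_steps(size_t cap, size_t factor, size_t target) {
    int steps = 0;
    while (cap < target) {
        cap *= factor;
        steps++;
    }
    return steps;
}
```

With a 1 KiB start and 4x growth, reaching a 600 MB result takes only 10 regrows; a 16-byte start with 2x growth takes 26. Either way, that is negligible next to the per-append reallocations of a naive concatenation loop.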
jj2007
Posts: 1827
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Faster string copying/building

Postby jj2007 » Apr 22, 2018 16:38

cbruce wrote:jj2007, if all of the strings that you want to concatenate are already available as literals or in variables, then go ahead and get their lengths and allocate a buffer... just as you suggest.

But that does not work if you have to concatenate many unknown strings being passed from other routines, read from data files, and so on. In this case, it is much more efficient to use a string builder.
Well, my method is a string builder, and it can use strings that are literals, in variables, or are passed from other routines. If the length is not known, it can be calculated, of course.

The only open question here is whether it's more efficient to:
- guess the size of the buffer needed, and reallocate if the guess was wrong (the Microsoft method),
- get the size in advance, i.e. before copying anything, for every string passed, and allocate only once for the precise amount needed.

It would be nice to have a testbed, and do some benchmarking. My example above builds a string from ten literal components (in this case: spaces), and ten elements of a string array. It takes a bit less than a second to do that a Million times, but I have no idea whether that's faster or slower than the other method. What kind of specific data are you using?
cbruce
Posts: 136
Joined: Sep 12, 2007 19:13
Location: Dallas, Texas

Re: Faster string copying/building

Postby cbruce » Apr 22, 2018 18:10

Sorry jj2007, but if you are concatenating, you are not using what is defined as a string builder. Your program, behind all of that concatenation, is doing allocations/deallocations for every string.

For some odd reason, almost every major language vendor out there has implemented a native string builder class in order to make string handling more efficient, without every programmer having to "roll their own" string builder.

I dealt with an enormous amount of variable-length text records read from files, from which I had to extract certain variable data and create variable-length result strings that went to another process.

I have a very large file buffer that can handle the maximum length of any text record that I will read. I use a custom string builder and move the variable length data that I search and parse from the individual file data directly into the buffers until I have my result. I then extract the shorter result from the larger buffer and pass that result along.

When I rewrote the program into its current version, I initially used concatenation to get a feel for all of the string handling that I would have to do. Once I optimized a string builder for my specific needs, that single change increased the performance of the program enormously. A run went from 27 hours down to 2 1/2 hours... just by using a real string builder that minimized the number of memory allocations caused by all of the string concatenations.

If you have written a real world application that can search and parse over 50 variable length data elements (using a dictionary of over 4000 terms), from (very dirty) variable length file records in (6) 2GB files, into millions of result strings, using string concatenation - and do it in 2 1/2 hours... then I think I have something to learn from you... and welcome your input.

Also jj2007, I apologize if I have misunderstood your postings... and you are not using string concatenation. But I would also love to see you use whatever else your method is on the amount of data that I am talking about, when you don't have any idea how many, or what size, the strings are that you will be extracting and building into a variable length result string.

Actually jj2007, I apologize again. I wrote the current version of that program in 2006 - (still in use today and no bugs - Yay!). I was limited by resources and system capabilities. I had to use a string builder and write intermediate large buffers out to disk to have enough system resources to work. If I wrote it today, with the amount of RAM available and the capability of large strings; I could simply allocate (1) 600MB buffer (the largest needed) and string build directly into that... which might be exactly what you are talking about with your small examples.

Errata:

FYI, it's a proprietary search system that is used internally by a large corporation. I started it as a consultant, with MUCH smaller datasets, back in 1987, using QuickBasic and a full text search package called Concordance on an IBM PC running OS/2 and Citrix Multi-User. When the system (with local users on Twinax connected terminals) became popular at the company's Dallas headquarters, I actually had to work with Citrix to get Ethernet cards and a new thing called a "TCP/IP" stack running on Multi-User so that the company's remote offices could access the system also.

My company was called in to evaluate a system they had purchased that ran on 2 big Sun systems. After a review, I explained to them (in a politically correct fashion) that they had gotten screwed. They asked me what they should be using. I researched and told them that all of the things on the market were pretty stupid, oversized, and highly overpriced. I told them to get a good PC programmer and I told them what they needed him to do.

When they asked if I could do it, we worked up a SOW, which I thought they would turn down - since it had my rate in there at $300.00 per hour (in 1987!!!) - (we were an IBM system programming house - and we were the best in the industry - we were able to charge high rates). We never thought a customer would pay that rate (at the time) for a PC programmer. So we were a little flabbergasted when they said, "Yes". By the way - the total cost for the PC system, infrastructure, and my time - came in at less than a tenth of the cost of the system it replaced - (they sued the original vendor... and won).

A funny note, when they first asked if I could do it, I told them, "yes", but I didn't know if they could afford me, because I worked for dark chocolate... and I ate a lot! When we got their version of the contract to sign, they had actually added a clause that read, "And all the dark chocolate he can eat!". When I showed up for work the first day, their engineering department had built a two foot tall plexiglass case with a little slot on the bottom... and that case was filled with just about every kind of dark chocolate candy that existed at that time!... I loved those people... That case was in my office and full every time that they needed me to show up back there and work on the program.
jj2007
Posts: 1827
Joined: Oct 23, 2016 15:28
Location: Roma, Italia
Contact:

Re: Faster string copying/building

Postby jj2007 » Apr 22, 2018 20:14

cbruce wrote:Sorry jj2007, but if you are concatenating, you are not using what is defined as a string builder. Your program, behind all of that concatenation, is doing allocates/deallocates of every string.
It does one single allocation of the n bytes that are needed for all strings to be copied into the single result string.

If you have written a real world application that can search and parse over 50 variable length data elements (using a dictionary of over 4000 terms), from (very dirty) variable length file records in (6) 2GB files, into millions of result strings, using string concatenation - and do it in 2 1/2 hours... then I think I have something to learn from you... and welcome your input.
The time needed is difficult to estimate without knowing a bit more about your data and process. To give you a rough indication, I've changed the example and am now creating an array of those result strings composed of 20 items each (apologies, this is obviously not FreeBasic):

Code: Select all

  Read x$()      ; create a source string array from the $Data above
  Dim My$()      ; create a dynamic string array to store the results of the string building (is it concatenation??)
  NanoTimer()    ; start timing
  xor ecx, ecx
  .Repeat
   Let My$(ecx)="Result: "+x$(0)+" "+x$(1)+" "+x$(2)+" "+x$(3)+" "+x$(4)+" "+x$(5)+" "+x$(6)+" "+x$(7)+" "+x$(8)+" "+x$(9)
   inc ecx
  .Until ecx>=1000000
  Store "Concat.txt", My$()
  Print Cpu$(), CrLf$, NanoTimer$(), " for building and storing one Million strings from 20 components each", CrLf$, esi
Output:

Code: Select all

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
819 ms for building and storing one Million strings from 20 components each
The result file is 71,000,000 bytes. Note the timing stops when the file is closed, i.e. the data are in the file cache but the OS will take more time to write them to disk.

70 MB is still pretty far from your 6*2 GB. Your case is huge, and I don't know how much getting the strings from your data files will cost. However, consider that my method - calculate the size first, then do a single precise allocation - keeps the number of allocation requests to the absolute minimum. The Microsoft method of guessing first, then doubling the allocation every now and then, might get you into trouble because of paging. Again, without knowing more about your case, this is rather speculative... but I am curious, and I like challenges ;-)
