Comments on: How to Efficiently Convert a String to an int in C++

By: Alexander Petrov

Alexander Petrov — Tue, 31 Jul 2018 07:49:00 +0000

also see discussion at https://www.reddit.com/r/cpp/comments/92bkxp/how_to_efficiently_convert_a_string_to_an_int_in_c/

By: jft

jft — Sun, 29 Jul 2018 09:00:00 +0000

A couple of questions have arisen about reading a file into memory (file_as_view()).

1) Will this process work with cin (i.e. can you redirect a file to stdin on the command line and read it in this way)?

Yes, but a couple of changes are needed. ifstream& should be replaced with istream& and, after the .seekg(), you need to test whether any data is present (i.e. that stdin has been redirected to a valid file). See the example below.

2) Rather than using the allocated memory in the same function, can you return it so that it can be used elsewhere?

Yes. You could just return buffer (as unique_ptr is moveable but not copyable), but then you lose the size. You could return a pair of buffer and ifs.gcount(), which gives you what you need, but perhaps the easiest is to return a string. Since C++17, string.data() has a non-const overload, so the referenced memory can be changed. So rather than allocating memory with make_unique(), create a string of the required size and use .data() as the parameter to .read(). The contents of the file are then read directly into the string’s buffer and the correct string size can be set after the read. The string is then returned by move semantics.

Consider

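#include <cstddef>
#include <istream>
#include <string>

std::string readFile(std::istream& ifs)
{
    ifs.clear();
    ifs.seekg(0, ifs.end);

    std::string fdata;

    // tellg() returns -1 if the stream is not seekable and 0 if it is empty
    if (ifs.tellg() > 0) {
        const auto fsz = static_cast<std::size_t>(ifs.tellg());

        fdata.assign(fsz, 0);

        ifs.seekg(0);
        ifs.read(fdata.data(), static_cast<std::streamsize>(fsz));
        // gcount() can be smaller than fsz, so trim to what was actually read
        fdata.resize(static_cast<std::size_t>(ifs.gcount()));
    }

    return fdata;
}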
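For comparison, the pair-returning alternative mentioned in point 2 might look something like the sketch below. The name readFileBuffer and the choice of std::size_t for the size are assumptions made for illustration; they are not from the article.

```cpp
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>
#include <string_view>
#include <utility>

// Hypothetical helper (not from the article): returns the buffer together
// with the number of bytes actually read, so the size is not lost.
std::pair<std::unique_ptr<char[]>, std::size_t> readFileBuffer(std::istream& is)
{
    is.clear();
    is.seekg(0, std::ios::end);

    // -1 means the stream is not seekable, 0 means there is no data
    const std::streamoff endPos = is.tellg();
    if (endPos <= 0)
        return {std::unique_ptr<char[]>{}, std::size_t{0}};

    const auto fsz = static_cast<std::size_t>(endPos);
    auto buffer = std::make_unique<char[]>(fsz);

    is.seekg(0);
    is.read(buffer.get(), static_cast<std::streamsize>(fsz));

    return {std::move(buffer), static_cast<std::size_t>(is.gcount())};
}
```

The caller then builds a string_view from the two parts, e.g. std::string_view(result.first.get(), result.second), and has to keep the unique_ptr alive for as long as that view is used.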

3) What can you do if the file is too large to read all its contents into memory?
This is a good question. The obvious answer is to use one of the other ways of reading data from a file (such as stream extraction), but this carries a performance penalty. To keep the performance benefit of reading from memory, there are a couple of options.

The easier option is to read the file into a buffer in ‘chunks’. Extract the required data from the buffer and, when you come to the end of the buffer, read the next chunk of the file into it, and so on until the end of the file is reached. The slight drawback is that processing has to wait each time the buffer is refilled. This small delay will probably be acceptable in most situations.

In very high performance situations, however, where you need the maximum possible throughput, you can use a method called ‘double buffering’. Essentially there are two buffers and two threads. The main thread processes the data in the primary buffer and, whilst this is happening, the other thread reads the next chunk of data into the other buffer. When the main thread comes to the end of the data in the primary buffer, the pointers to the two buffers are switched, so that the main thread now processes data from the other buffer (now the primary buffer) without any wait, and the other thread reads the next chunk of data into the buffer that was primary. This reading and switching of buffers continues until the actual end of file is reached. This is quite complicated so no code is shown here; it probably deserves an article of its own. There are also OS-specific approaches (such as Windows asynchronous file reads).
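As a rough illustration of the simpler, single-buffer ‘chunk’ approach only (not double buffering), here is a minimal sketch. It assumes the file holds whitespace-separated ints; the function name parseIntsChunked and the chunkSize parameter are made up for this example. The one subtlety is that a number can be split across two chunks, so anything after the last whitespace is carried over to the next read.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: parses whitespace-separated ints from `in`,
// reading at most `chunkSize` bytes from the stream at a time.
std::vector<int> parseIntsChunked(std::istream& in, std::size_t chunkSize)
{
    assert(chunkSize > 0);

    std::vector<int> out;
    std::vector<char> chunk(chunkSize);
    std::string pending;   // unparsed tail of earlier chunks + current chunk

    auto isSpace = [](char c) { return c == ' ' || c == '\n' || c == '\r' || c == '\t'; };

    bool eof = false;
    while (!eof) {
        in.read(chunk.data(), static_cast<std::streamsize>(chunk.size()));
        const auto got = static_cast<std::size_t>(in.gcount());
        eof = got < chunk.size();          // a short read means end of input
        pending.append(chunk.data(), got);

        // Unless at end of input, stop at the last whitespace so a number
        // split across two chunks is not parsed in two halves.
        std::size_t limit = pending.size();
        if (!eof)
            while (limit > 0 && !isSpace(pending[limit - 1])) --limit;

        const char* p = pending.data();
        const char* const end = p + limit;
        while (p < end) {
            if (isSpace(*p)) { ++p; continue; }
            int value = 0;
            const auto [next, ec] = std::from_chars(p, end, value);
            if (ec == std::errc{}) {
                out.push_back(value);
                p = next;
            } else {
                // Skip a token that is not a valid int
                while (p < end && !isSpace(*p)) ++p;
            }
        }
        pending.erase(0, limit);           // keep only the incomplete tail
    }
    return out;
}
```

In real use the chunk size would be much larger than the longest number (some KB or MB); a tiny value is only useful for exercising the carry-over logic.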

By: jft

jft — Sat, 28 Jul 2018 11:34:00 +0000

In reply to artana. For MS VS2017 15.7.5, disabling sync has no effect on any of the timings, as the standard streams aren't used. As far as I know (?), sync_with_stdio() only affects synchronisation between the standard C++ streams and their C stdio counterparts, so that if it is enabled (the default) then cin/stdin, cerr/stderr and cout/stdout are synced and code can switch between them without losing synchronisation.
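For anyone wanting to try this on code that does use the standard streams, the call itself is a one-liner. The wrapper name disableStdioSync below is hypothetical; std::ios_base::sync_with_stdio is the standard function, and it returns the previous synchronisation state. File streams such as ifstream are not affected by it.

```cpp
#include <ios>
#include <iostream>

// Hypothetical wrapper: turns off synchronisation between the standard C++
// streams (cin, cout, cerr, clog) and C stdio, and reports whether they
// were synchronised before the call. Best called before any other I/O.
bool disableStdioSync()
{
    return std::ios_base::sync_with_stdio(false);
}
```

After the call, mixing cout with printf (or cin with scanf) on the same program run is no longer guaranteed to interleave in order.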

By: artana

artana — Fri, 27 Jul 2018 23:43:00 +0000

In reply to Miguel Raggi. Yes, msvc's stream performance has been shown to be awful in the past, and most of it is tied to syncing stdio and stream output. If you disable the sync with sync_with_stdio(0), performance is several orders of magnitude better.

By: Miguel Raggi

Miguel Raggi — Fri, 27 Jul 2018 13:29:00 +0000

I think VS is probably doing something wrong with streams. They are way too slow. On my laptop, your string test with 50M with gcc 8.1.1 on linux takes 3.5s, not 40s. This is merely 7 times slower than from_chars (which takes half a second), not 50 as in your results.

BTW, Clang doesn’t build the code: it fails with an undefined reference to ‘__muloti4’.

By: jft

jft — Fri, 27 Jul 2018 10:45:00 +0000

In reply to Bartlomiej Filipek.

VS2017.8 is supposed to fix many things in MS’s existing VS2017.7, which ‘conforms to the C++17 standard’, with the introduction of the new parser. I’ve got outstanding VS.7 C++ bugs that are ‘fixed in .8’. That’s also supposed to allow Range v3 to compile, ‘sort out’ the issues with constexpr and provide floating point …
VS2017.8 is now at Preview 5!

By: jft

jft — Fri, 27 Jul 2018 09:37:00 +0000

In reply to Nikita Alekseev. If you manage to get some performance figures from tests, please do share.

By: Nikita Alekseev

Nikita Alekseev — Fri, 27 Jul 2018 08:43:00 +0000

In reply to jft. Yeah, thanks, I see that string_view is a temporary and will not outlive unique_ptr. I'm still interested in how mmap would perform with string_view.

By: jft

jft — Fri, 27 Jul 2018 08:39:00 +0000

In reply to Nikita Alekseev.

The memory will indeed be freed when returning from the function. However, at that point the vector nos has been filled by as_from_chars(), and its contents do not rely on that memory, so there is no issue with string_view. Note that the vector nums is unused here and is a ‘leftover’ from the original code from which this test function was extracted. Doh!
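To make the lifetime point concrete, here is a small sketch of the same pattern. It is not the article's test function; parseFromOwnedBuffer is a made-up name, and the input is copied into the buffer rather than read from a file. The string_view is only used while the unique_ptr is alive, and the returned vector owns its ints, so freeing the buffer on return is harmless.

```cpp
#include <algorithm>
#include <charconv>
#include <memory>
#include <string_view>
#include <system_error>
#include <vector>

// Hypothetical stand-in for the benchmark function: owns a buffer, views it
// through a string_view, and returns only values copied out of it.
std::vector<int> parseFromOwnedBuffer(std::string_view text)
{
    auto buffer = std::make_unique<char[]>(text.size());
    std::copy(text.begin(), text.end(), buffer.get());

    // Valid only for as long as `buffer` lives, i.e. inside this function
    const std::string_view view(buffer.get(), text.size());

    std::vector<int> nos;
    const char* p = view.data();
    const char* const end = p + view.size();
    while (p < end) {
        if (*p == ' ' || *p == '\n') { ++p; continue; }
        int value = 0;
        const auto res = std::from_chars(p, end, value);
        if (res.ec != std::errc{}) break;   // stop at the first invalid token
        nos.push_back(value);
        p = res.ptr;
    }

    // `nos` holds plain ints, not pointers into `buffer`, so the buffer
    // being freed here does not invalidate the result.
    return nos;
}
```

The problem Nikita describes would only arise if the function returned the string_view itself (or anything else pointing into the buffer) instead of the parsed values.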

Yes, the allocation and de-allocation of the memory does have an overhead. However in fairness to the other methods it was felt that this is part of the process and so wasn’t excluded.

“Also have you considered using mmap with string_view?”

No, not yet. The issue, as you mentioned above, is to avoid having a string_view which points to invalid memory. This isn’t always as easy to ensure as it might seem.
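One way to keep the mapping and the view tied together is to wrap the mapping in a small RAII class and only hand out views from it. The sketch below is untested for performance and assumes a POSIX system (open, fstat, mmap, munmap); the class name MappedFile is invented for this example. No timings are claimed for it.

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#include <cstddef>
#include <string_view>

// Hypothetical RAII wrapper (POSIX only): maps a file read-only and unmaps
// it in the destructor. Views returned by view() are valid only while the
// MappedFile object is alive.
class MappedFile {
public:
    explicit MappedFile(const char* path)
    {
        const int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;                      // leaves an empty view

        struct stat st{};
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            const auto len = static_cast<std::size_t>(st.st_size);
            void* p = ::mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const char*>(p);
                size_ = len;
            }
        }
        ::close(fd);                             // the mapping outlives the fd
    }

    ~MappedFile()
    {
        if (data_) ::munmap(const_cast<char*>(data_), size_);
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    std::string_view view() const
    {
        return data_ ? std::string_view(data_, size_) : std::string_view();
    }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};
```

The view can then be passed to a from_chars-based parser in the same way as the view over the heap buffer, as long as parsing finishes before the MappedFile goes out of scope.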

By: Nikita Alekseev

Nikita Alekseev — Fri, 27 Jul 2018 07:37:00 +0000

Thank you for a nice insight!

Regarding the code snippet where you read the file into a buffer and then create a string_view: since you create the buffer as a unique pointer, won’t the memory be freed when returning from the function? This may result in a string_view which points to invalid memory. Also, deallocation may slow down the benchmark.

Also have you considered using mmap with string_view?
