Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change usage of std::string to movable object #144

Open
olzhabay opened this issue Oct 5, 2018 · 10 comments
Open

Change usage of std::string to movable object #144

olzhabay opened this issue Oct 5, 2018 · 10 comments

Comments

@olzhabay
Copy link
Member

olzhabay commented Oct 5, 2018

Constructor of std::string does not allow any moving. Instead, it always copies all data.
Except if pointer std::string is not used. However, that is used minorly in project.
This leads to too many unnecessary copying around data, which can be size of block (64mb).

Proposal, use custom class that will not be copying data, but moving it. Also, it will control number of owners and gets destructed when owners<=0

class Slice {
  char* ptr;
  uint64_t size;
  atomic<int> owners;
}

Other way is to use std::shared_ptrstd::string

Solving this will do some improvement to performance

@vicentebolea
Copy link
Member

Thanks for pointing out @olzhabay.

I totally get your point, however, in many sections of the code whenever expensive string assignment was done I forced the move using std::move(std::string). In such case I understood that it will move the internal data. Yet I do not know the internals of std::move. What do you think?

@olzhabay
Copy link
Member Author

olzhabay commented Oct 8, 2018

It does not move if std::string goes out of stack scope. Also, std::string generates too much assembly code. Examples:

@olzhabay
Copy link
Member Author

olzhabay commented Oct 8, 2018

What if we make the data wrapper class serializable like this for now? Maybe later on, the implementation of raw block sending can be implemented.

@vicentebolea
Copy link
Member

I have reviewed the proposed slice class which is a class that wraps basically a string (length and char array) plus some directives to allow fast concurrent access.

I also understand the rationale of the proposed idea, you find std::string inefficient for our project and slices provides a custom way for us to tweek and optimize the block content handling.

std::move in C++ strings

Still we can totally use std::move in the strings in those two scenarios:

First one

 string a = "hello", b;
 b = std::move(a);

Second one

string fun(){
  string a = "100"  
  return  std::move(a); // std::move is useless, compiler will do RVO if possible 
}

string b = fun(); 

Summary

Using slice for block abstraction instead of string will allow us:

  • Concurrent access to its content. YAGNI, we use one thread and proactor pattern:-1:
  • Tailored class to our block case (large size). Good point 👍

Ditching std::string in favor of slices will cost us:

  • One more class to maintain and debug. KISS
  • There is so much support and optimization for std::string (inside BOOST and stdlib). Think about alignment and other things. Important

Verdict

I do not want to turn down good ideas and work. I think you have in mind a bigger refactor than slice custom class where slices class makes much more sense for that purpose. As I see it for now, it is premature optimization as we try to solve a problem that our architecture does not have (yet).

If it is part of a bigger optimization and you are thinking to change the architecture to multi-threaded, it will come super handy, I will happily accept slices along its change of architecture.

As for now I am reticent to ditch std::string

If you think I am mistaken, please point it out and lets keep the conversation. If you can't still convince me, do it by facts and benchmarks :)

signed-off: [email protected]

@olzhabay
Copy link
Member Author

@vicentebolea got it. Proceeding with tests and benchmarks to see the actual performance benefits

@olzhabay
Copy link
Member Author

olzhabay commented Oct 12, 2018

Performance benchmark results

tested two cases:

  • creation of an object
  • move of an object
    tested with [8, 64, 512, 1k, 8k, 64m] bytes

std::string performance

----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
BM_creation/8                34 ns         34 ns   20388586 bytes_per_second=222.224M/s
BM_creation/64               61 ns         61 ns   11668637 bytes_per_second=1008.09M/s
BM_creation/512             113 ns        113 ns    6118007 bytes_per_second=4.20737G/s
BM_creation/1024            142 ns        142 ns    4906628 bytes_per_second=6.72681G/s
BM_creation/8192            362 ns        362 ns    1933918 bytes_per_second=21.091G/s
BM_creation/67108864   20001119 ns   19945407 ns         33 bytes_per_second=3.13355G/s
BM_move/8                     1 ns          1 ns  626554280 bytes_per_second=6.68698G/s
BM_move/64                    1 ns          1 ns  626517599 bytes_per_second=53.4915G/s
BM_move/512                   1 ns          1 ns  627175796 bytes_per_second=427.968G/s
BM_move/1024                  1 ns          1 ns  627080058 bytes_per_second=855.979G/s
BM_move/8192                  1 ns          1 ns  626729370 bytes_per_second=6.68706T/s
BM_move/67108864              1 ns          1 ns  627153977 bytes_per_second=53.4822P/s

slice performance

----------------------------------------------------------------------------
Benchmark                     Time           CPU Iterations UserCounters...
----------------------------------------------------------------------------
BM_creation/8                31 ns         31 ns   22385795 bytes_per_second=245M/s
BM_creation/64               31 ns         31 ns   22250523 bytes_per_second=1.90128G/s
BM_creation/512              44 ns         44 ns   16038834 bytes_per_second=10.8969G/s
BM_creation/1024             48 ns         48 ns   14534327 bytes_per_second=19.8019G/s
BM_creation/8192            139 ns        139 ns    4993379 bytes_per_second=54.8901G/s
BM_creation/67108864    6466286 ns    6466127 ns         89 bytes_per_second=9.66575G/s
BM_move/8                     0 ns          0 ns 1000000000 bytes_per_second=31.5797P/s
BM_move/64                    0 ns          0 ns 1000000000 bytes_per_second=260.75P/s
BM_move/512                   0 ns          0 ns 1000000000 bytes_per_second=2.0278E/s
BM_move/1024                  0 ns          0 ns 1000000000 bytes_per_second=4.05561E/s
BM_move/8192                  0 ns          0 ns 1000000000 bytes_per_second=32.5937E/s
BM_move/67108864              0 ns          0 ns 1000000000 bytes_per_second=72.9697Z/s

@vicentebolea
Copy link
Member

vicentebolea commented Oct 12, 2018 via email

@vicentebolea
Copy link
Member

vicentebolea commented Oct 12, 2018 via email

@olzhabay
Copy link
Member Author

Update on performance benchmark results

Here are again comparison of slice and std::string
In this case we compare their performance on serialization and deserialization.
Sample benchmark code:

static void BM_serialization(benchmark::State& state) {
  char* data = new char[state.range(0)];
  srand(0);
  int ascii = (std::rand() % 255) + 1;
  memset(data, ascii, state.range(0));
  TYPE str(data, state.range(0));
  for (auto _ : state) {
    std::ofstream ofs("stream");
    {
      boost::archive::binary_oarchive oa(ofs);
      oa << str;
    }
    TYPE new_str;
    {
      std::ifstream ifs("stream");
      boost::archive::binary_iarchive ia(ifs);
      ia >> new_str;
    }
    assert(str == new_str);
  }
}
BENCHMARK(BM_serialization)->Arg(8)->Arg(64)->Arg(512)->Arg(1<<10)->Arg(8<<10)->Arg(64<<20);

Results for std::string

---------------------------------------------------------------------------------
Benchmark                          Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------
BM_serialization/8            257080 ns      96024 ns       7136
BM_serialization/64           272027 ns     102924 ns      10258
BM_serialization/512          259539 ns      90391 ns      10055
BM_serialization/1024         261867 ns     103443 ns       6890
BM_serialization/8192         312548 ns     124600 ns       6665
BM_serialization/67108864 1091854852 ns   92877561 ns         11

Results for slice

---------------------------------------------------------------------------------
Benchmark                          Time           CPU Iterations UserCounters...
---------------------------------------------------------------------------------
BM_serialization/8            265745 ns      94510 ns       7766
BM_serialization/64           295327 ns     107273 ns       8124
BM_serialization/512          297243 ns     106744 ns       7802
BM_serialization/1024         260830 ns      98153 ns       9907
BM_serialization/8192         260027 ns      97212 ns       9469
BM_serialization/67108864     300611 ns      87311 ns       7906

We can see significant result performance improvement on 64MB data serialization/deserialization

@vicentebolea
Copy link
Member

@olzhabay, you are doing a very fine job benchmarking this one. it looks like serialization exec time for str is linearly proportional to its size.

We can see significant result performance improvement on 64MB data serialization/deserialization

For slice it just takes a bit more exec time from 8 KB to 64MB, that is too much of a difference, even CPU time it drops a bit for a 10000 times bigger input. I believe it its faster but not this much.

How about moving this one to a pull request so that we can check the code together?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants