macos: fix missing pthread mutex init after calloc #21

Open
wants to merge 1 commit into
base: mariadb-4.x

@sitano sitano commented Jul 31, 2024

Calls the constructor for a mutex in a struct value initialized with gu_calloc().

In the path gcs_core_create() -> gcs_group_init(), the first function allocates gcs_core_t* core with gu_calloc(), whereas gcs_core_t contains gcs_group_t group, which has a gu::Mutex memb_mtx_ member. After the memory allocation the gu::Mutex constructor was never called, which led to an error on Darwin in a call to pthread mutex lock.
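
A minimal sketch of the failure mode and of the placement-new fix, assuming hypothetical stand-ins (`Mutex`, `Group`, `Core`, `core_create`, `core_destroy`, `lock_unlock`) rather than the real Galera types. Memory from `calloc()` is only zeroed, so a member with a non-trivial constructor has to be constructed explicitly before use and destroyed explicitly before `free()`:

```cpp
#include <cassert>
#include <cstdlib>
#include <new>
#include <stdexcept>
#include <pthread.h>

// Hypothetical stand-in for gu::Mutex: only the constructor makes the
// underlying pthread mutex valid.
class Mutex
{
public:
    Mutex()
    {
        if (pthread_mutex_init(&value_, NULL) != 0)
            throw std::runtime_error("pthread_mutex_init failed");
    }
    ~Mutex() { pthread_mutex_destroy(&value_); }
    int lock()   { return pthread_mutex_lock(&value_); }
    int unlock() { return pthread_mutex_unlock(&value_); }
private:
    Mutex(const Mutex&);            // non-copyable
    Mutex& operator=(const Mutex&);
    pthread_mutex_t value_;
};

// Hypothetical stand-ins for gcs_group_t and gcs_core_t.
struct Group { int num; Mutex memb_mtx; };
struct Core  { char other_state[256]; Group group; };

Core* core_create()
{
    // calloc() zeroes the bytes but runs no constructors.
    Core* core = static_cast<Core*>(std::calloc(1, sizeof(Core)));
    if (core == NULL) return NULL;
    try {
        // Construct just the mutex member in place, as the patch does.
        new (&core->group.memb_mtx) Mutex();
    } catch (...) {
        std::free(core);
        return NULL;
    }
    return core;
}

void core_destroy(Core* core)
{
    if (core == NULL) return;
    core->group.memb_mtx.~Mutex();  // explicit destructor call pairs with placement new
    std::free(core);
}

// Returns 0 when both lock and unlock succeed.
int lock_unlock(Core* core)
{
    int const rc = core->group.memb_mtx.lock();
    return rc != 0 ? rc : core->group.memb_mtx.unlock();
}
```

Without the placement new, `lock_unlock()` would operate on a zero-filled `pthread_mutex_t`, which is the situation the Darwin stack traces below come from.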

cherry-pick from #20.

Stacktraces from Darwin

Originally it failed on boot in:

[...libgalera_smm...]
    frame #0: 0x0000000104b3f0e0 libgalera_smm.so`gcs_open(conn=0x0000000154707500, channel="blah", url="gcomm://", bootstrap=true) at gcs.cpp:1643:10
    frame #1: 0x0000000104adac44 libgalera_smm.so`galera::Gcs::connect(this=0x000000015502a0e0, cluster_name="blah", cluster_url="gcomm://", bootstrap=true) at galera_gcs.hpp:115:20
    frame #2: 0x0000000104adaa44 libgalera_smm.so`galera::ReplicatorSMM::connect(this=0x0000000155029800, cluster_name="blah", cluster_url="gcomm://", state_donor="", bootstrap=true) at replicator_smm.cpp:356:21
  * frame #3: 0x0000000104a5ffc0 libgalera_smm.so`galera_connect(gh=0x00000001446040c0, cluster_name="blah", cluster_url="gcomm://", state_donor="", bootstrap=true) at wsrep_provider.cpp:203:22
[...mariadbd+wsrep border...]
    frame #4: 0x00000001012c7be0 mariadbd`wsrep::wsrep_provider_v26::connect(this=0x00006000038fc0c0, cluster_name="blah", cluster_url="gcomm://", state_donor="", bootstrap=true) at wsrep_provider_v26.cpp:818:29
    frame #5: 0x00000001012a0890 mariadbd`wsrep::server_state::connect(this=0x0000000154608730, cluster_name="blah", cluster_address="gcomm://", state_donor="", bootstrap=true) at server_state.cpp:525:23
    frame #6: 0x0000000100b856b4 mariadbd`wsrep_start_replication(wsrep_cluster_address="gcomm://") at wsrep_mysqld.cc:1207:46
    frame #7: 0x0000000100b85280 mariadbd`wsrep_init_startup(sst_first=true) at wsrep_mysqld.cc:1012:8
    frame #8: 0x000000010036c3bc mariadbd`init_server_components() at mysqld.cc:5233:9
    frame #9: 0x0000000100368640 mariadbd`mysqld_main(argc=26, argv=0x00000001547057f0) at mysqld.cc:5970:7
    frame #10: 0x00000001000038c0 mariadbd`main(argc=5, argv=0x000000016fdff048) at main.cc:34:10
    frame #11: 0x00000001884ae0e0 dyld`start + 2360

whereas the init was previously called from:

libgalera_smm.so`gcs_group_init(group=0x000000014d006f88, cnf=0x000000014d82ab10, cache=0x000000014d82af18, node_name="mariadb-primary-1", inc_addr="fe80:0", gcs_proto_ver='\x04', repl_proto_ver=11, appl_proto_ver=4) at gcs_group.cpp:64:27
    frame #1: 0x0000000104b37258 libgalera_smm.so`gcs_core_create(conf=0x000000014d82ab10, cache=0x000000014d82af18, node_name="mariadb-primary-1", inc_addr="fe80:0", repl_proto_ver=11, appl_proto_ver=4, gcs_proto_ver=4) at gcs_core.cpp:151:21
    frame #2: 0x0000000104b3e8ec libgalera_smm.so`gcs_create(conf=0x000000014d82ab10, gcache=0x000000014d82af18, progress_cb=0x000000014d82b2d0, node_name="mariadb-primary-1", inc_addr="fe80:0", repl_proto_ver=11, appl_proto_ver=4) at gcs.cpp:317:19
    frame #3: 0x0000000104af4f08 libgalera_smm.so`galera::Gcs::Gcs(this=0x000000014d82b2e0, config=0x000000014d82ab10, cache=0x000000014d82af18, cb=0x000000014d82b2d0, repl_proto_ver=11, appl_proto_ver=4, node_name="mariadb-primary-1", node_incoming="fe80:0") at galera_gcs.hpp:99:19
    frame #4: 0x0000000104ad9190 libgalera_smm.so`galera::Gcs::Gcs(this=0x000000014d82b2e0, config=0x000000014d82ab10, cache=0x000000014d82af18, cb=0x000000014d82b2d0, repl_proto_ver=11, appl_proto_ver=4, node_name="mariadb-primary-1", node_incoming="fe80:0") at galera_gcs.hpp:104:9
    frame #5: 0x0000000104ad7f4c libgalera_smm.so`galera::ReplicatorSMM::ReplicatorSMM(this=0x000000014d82aa00, args=0x000000016fdfbea8) at replicator_smm.cpp:129:5
    frame #6: 0x0000000104ad9a98 libgalera_smm.so`galera::ReplicatorSMM::ReplicatorSMM(this=0x000000014d82aa00, args=0x000000016fdfbea8) at replicator_smm.cpp:167:1
    frame #7: 0x0000000104a5f170 libgalera_smm.so`galera_init(gh=0x000000014bf08520, args=0x000000016fdfbea8) at wsrep_provider.cpp:50:23
    frame #8: 0x00000001012c664c mariadbd`wsrep::wsrep_provider_v26::wsrep_provider_v26(this=0x0000600001f440c0, server_state=0x000000014be04d90, provider_options="", provider_spec="libgalera_smm.so", services=0x00000001024b9478) at wsrep_provider_v26.cpp:783:9
    frame #9: 0x00000001012c7a80 mariadbd`wsrep::wsrep_provider_v26::wsrep_provider_v26(this=0x0000600001f440c0, server_state=0x000000014be04d90, provider_options="", provider_spec="libgalera_smm.so", services=0x00000001024b9478) at wsrep_provider_v26.cpp:747:1
    frame #10: 0x00000001012965a8 mariadbd`wsrep::provider::make_provider(server_state=0x000000014be04d90, provider_spec="libgalera_smm.so", provider_options="", services=0x00000001024b9478) at provider.cpp:37:20
    frame #11: 0x00000001012a06ac mariadbd`wsrep::server_state::load_provider(this=0x000000014be04d90, provider_spec="libgalera_smm.so", provider_options="", services=0x00000001024b9478) at server_state.cpp:505:17
    frame #12: 0x0000000100b84684 mariadbd`wsrep_init() at wsrep_mysqld.cc:907:38
    frame #13: 0x0000000100b851dc mariadbd`wsrep_init_startup(sst_first=true) at wsrep_mysqld.cc:988:7
    frame #14: 0x000000010036c3bc mariadbd`init_server_components() at mysqld.cc:5233:9
    frame #15: 0x0000000100368640 mariadbd`mysqld_main(argc=26, argv=0x000000014d0054f0) at mysqld.cc:5970:7
    frame #16: 0x00000001000038c0 mariadbd`main(argc=5, argv=0x000000016fdff048) at main.cc:34:10
    frame #17: 0x00000001884ae0e0 dyld`start + 2360

Signed-off-by: Ivan Prisyazhnyy <[email protected]>
@sitano sitano force-pushed the ivan/fix_gcs_group_mutex_init branch from 67981b9 to ec7d79d Compare July 31, 2024 13:57
@grooverdan
Member

created MDEV-34717 to refer to this.

@@ -61,6 +61,7 @@ gcs_group_init (gcs_group_t* group, gu::Config* const cnf, gcache_t* const cache
int const appl_proto_ver)
{
// here we also create default node instance.
new (&group->memb_mtx_) gu::Mutex(NULL);
This reminds me of the recent MDEV-34625 and MariaDB/server#3408. I understood that macOS would use a Clang-based compiler by default.

I think that the use of placement new after a call to a calloc-like operation may lead to surprises when using GCC 6 or later, if the constructor expects that some fields were already initialized by a previous write. GCC 6 and later could optimize away such pre-constructor writes, thanks to -flifetime-dse.

The fact that macOS (as well as AIX) is not happy with zero initialization of pthread_mutex_t bit me in MariaDB/server#3433.
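
A rough way to see why zero-filled memory is not a portable substitute for mutex initialization is to compare `PTHREAD_MUTEX_INITIALIZER` with an all-zero object. This is only a sketch: the helper names are hypothetical, and because padding bytes are unspecified, a `false` result is a hint rather than proof.

```cpp
#include <cassert>
#include <cstring>
#include <pthread.h>

// Hypothetical helper: reports whether this platform's static initializer
// happens to consist of zero bytes only. Padding bytes are unspecified, so
// treat the answer as a hint.
bool static_initializer_is_all_zero()
{
    pthread_mutex_t initialized = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_t zeroed;
    std::memset(&zeroed, 0, sizeof(zeroed));
    return std::memcmp(&initialized, &zeroed, sizeof(zeroed)) == 0;
}

// Hypothetical helper: the portable path. A mutex set up with the static
// initializer (or with pthread_mutex_init) can be locked on any POSIX system.
int lock_statically_initialized()
{
    static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
    int const rc = pthread_mutex_lock(&mtx);
    return rc != 0 ? rc : pthread_mutex_unlock(&mtx);
}
```

Where `static_initializer_is_all_zero()` returns `false`, a calloc-ed mutex differs from a properly initialized one, so code that skips `pthread_mutex_init()` is relying on platform-specific behavior.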

Author

> This reminds me of the recent MDEV-34625 and MariaDB/server#3408. I understood that macOS would use a clang based compiler by default.
>
> I think that use of placement new after a call to a calloc like operation may lead to surprises when using GCC 6 or later, if the constructor is expecting that some fields were already initialized by a previous write. GCC 6 and later could optimize away such pre-constructor writes, thanks to -flifetime-dse.

macOS uses Clang, yes. But MDEV-34625 is, IMHO, different. Moreover, the Godbolt example does not look correct: it is fine that the memset in https://gcc.godbolt.org/z/5n87z1raG is optimized out, because it is expected (I suppose) that after a class constructor is called all class variables are initialized. Thus, the compiler may conclude that void *buf = malloc(sizeof(S)); S *s = new (buf) S; is equivalent to S *s = new S;.

But it is different if the malloc() buffer is bigger than the memory that the constructor is expected to touch:

https://gcc.godbolt.org/z/Y43YW7vKj

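#include <cstdlib>  // malloc
#include <cstring>  // memset
#include <new>      // placement new

struct S {
  int i;     // left uninitialized by the constructor
  S() {};
};

struct A {
    char a[256];
    S s;
};

int bar() {
  A *buf = (A *)malloc(sizeof(A));
  memset(buf, 0, sizeof(A));   // zeroes all of A, not only the S member
  S* s = new(&buf->s) S;       // constructs only the S subobject
  return s->i;
}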

This becomes a call to calloc or a rep stosq in the assembly. So the memset is not eliminated in this case, and it will not be eliminated in the case of gcs_core_t* core = GU_CALLOC (1, gcs_core_t); either, as long as sizeof(gcs_core_t) >> sizeof(gcs_group_t).

It does not eliminate calloc() even with -flifetime-dse=0 (try saving *buf to an external volatile variable). The default is 2 (https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html).

If you still think it is not safe, I can offer an alternative: instead of gu::Mutex memb_mtx_ we could declare the member as a plain gu_mutex_t (like gu_mutex_t mutable value_ inside gu::Mutex) and initialize it without calling any class constructors. WDYT?

Author

> if the constructor is expecting that some fields were already initialized by a previous write.

I think it's forbidden by the spec (speculating): a constructor must initialize all class fields.

Yes, it is UB, with no doubt. If all fields of gu::Mutex are initialized by the constructor, then there should be no issue with the GCC -flifetime-dse on any platform.

Author

From what we can see in gu_mutex.hpp, they are all initialized:

namespace gu
{
    class Mutex
    {
    public:

        Mutex (const wsrep_mutex_key_t* key) : value_()
#ifdef GU_MUTEX_DEBUG
                 , owned_()
                 , locked_()
#endif /* GU_MUTEX_DEBUG */
        {
            if (gu_mutex_init (key, &value_))
                gu_throw_fatal;
        }

    protected:

        gu_mutex_t  mutable value_;
#ifdef GU_MUTEX_DEBUG
        gu_thread_t mutable owned_;
        bool        mutable locked_;
#endif /* GU_MUTEX_DEBUG */
    };
}

@janlindstrom

This issue is currently being worked on by Alexey, and his comment was "I'd rather make a constructor for group struct."

@sitano
Author

sitano commented Aug 8, 2024

@janlindstrom shall we close this PR then? Regarding "I'd rather make a constructor for group struct.": a less invasive option would be to replace gu::Mutex memb_mtx_ with gu_mutex_t memb_mtx_; and to call if (gu_mutex_init (NULL, &memb_mtx_)) gu_throw_fatal; in place of new (&group->memb_mtx_) gu::Mutex(NULL);. That is much less invasive than refactoring the init code into a constructor.
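
The raw-mutex alternative could look roughly like the sketch below. It uses plain `pthread_mutex_t` and `pthread_mutex_init()` as stand-ins for `gu_mutex_t` and `gu_mutex_init()` (whose real form, per gu_mutex.hpp above, also takes a key argument), and the `Group`, `group_create` and `group_destroy` names are hypothetical:

```cpp
#include <cassert>
#include <cstdlib>
#include <pthread.h>

// Hypothetical stand-in for gcs_group_t with a raw mutex member: the struct
// has no constructors, so calloc() plus an explicit init call is sufficient.
struct Group
{
    pthread_mutex_t memb_mtx;  // stand-in for gu_mutex_t memb_mtx_
    int             num;
};

Group* group_create()
{
    Group* group = static_cast<Group*>(std::calloc(1, sizeof(Group)));
    if (group == NULL) return NULL;
    // Explicit init replaces the placement new of gu::Mutex.
    if (pthread_mutex_init(&group->memb_mtx, NULL) != 0) {
        std::free(group);
        return NULL;
    }
    return group;
}

void group_destroy(Group* group)
{
    if (group == NULL) return;
    pthread_mutex_destroy(&group->memb_mtx);  // explicit destroy pairs with the init
    std::free(group);
}
```

The trade-off is that every code path must remember the init/destroy pair by hand, which is what a constructor and destructor would otherwise guarantee.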

@janlindstrom

@sitano Let's wait for Alexey's decision. Galera and MariaDB do not officially support macOS at the moment, so this is not a very high-priority issue currently.

@ayurchen

ayurchen commented Aug 15, 2024

@sitano This should be fixed (unfortunately I have no way to test it on macOS) in 5dc30e6 by introducing a proper constructor for struct gcs_group and struct gcs_core, so that there is no calloc there. The mutex is initialized in the initialization list and destroyed accordingly.
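
The constructor-based approach described here might be sketched as follows. This is not the code from the referenced commit, which is not shown in this thread; `Mutex`, `Group`, `Core` and `core_create` are hypothetical stand-ins:

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <pthread.h>

// Hypothetical stand-in for gu::Mutex.
class Mutex
{
public:
    Mutex()
    {
        if (pthread_mutex_init(&value_, NULL) != 0)
            throw std::runtime_error("pthread_mutex_init failed");
    }
    ~Mutex() { pthread_mutex_destroy(&value_); }
    Mutex(const Mutex&) = delete;
    Mutex& operator=(const Mutex&) = delete;
    int lock()   { return pthread_mutex_lock(&value_); }
    int unlock() { return pthread_mutex_unlock(&value_); }
private:
    pthread_mutex_t value_;
};

// Hypothetical stand-in for struct gcs_group: the mutex is constructed in
// the initialization list and destroyed by the implicit destructor.
struct Group
{
    Group() : memb_mtx(), num(0) {}
    Mutex memb_mtx;
    int   num;
};

// Hypothetical stand-in for struct gcs_core.
struct Core
{
    Core() : group() {}
    Group group;
};

// Allocation goes through new, so all constructors run and no calloc is needed.
std::unique_ptr<Core> core_create()
{
    return std::unique_ptr<Core>(new Core());
}
```

With this shape there is no window in which the struct exists but its mutex is unconstructed, and destruction order is handled by the language rather than by matching init/free calls.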

@sitano
Author

sitano commented Aug 16, 2024

@ayurchen I would love to check that it works on Mac on Monday, but I don't know where to find this commit: 5dc30e6 (no such commit). Is it perhaps in a non-public repo?

@janlindstrom

@sitano Try #23

@sitano
Author

sitano commented Aug 19, 2024

Sorry, I am a bit short on time. Hope I will do the trials tomorrow.

@sitano
Author

sitano commented Aug 22, 2024

tested 78f68e9 on MBP M3Max (23.5.0 Darwin Kernel Version 23.5.0: Wed May 1 20:17:33 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T6031 arm64) (/Library/Developer/CommandLineTools/SDKs/MacOSX14.4.sdk) built from https://github.com/sitano/galera/tree/ivan/galera-macos with

$ cmake .. -G Ninja -DCMAKE_C_COMPILER_LAUNCHER=sccache -DCMAKE_CXX_COMPILER_LAUNCHER=sccache -DCMAKE_BUILD_TYPE="Debug" -DCMAKE_INSTALL_PREFIX="install" -DWITH_DBUG_TRACE=OFF -DNOT_FOR_DISTRIBUTION=YES -DCMAKE_VERBOSE_MAKEFILE=ON -DMYSQL_MAINTAINER_MODE=OFF -DWITH_ZLIB=bundled -DWITH_PCRE=bundled -DOPENSSL_ROOT_DIR=$(brew --prefix)/opt/openssl@3
$ cmake --build . --parallel 16

Works well!

I managed to start a Galera node and join another node. (@janlindstrom, @ayurchen)
