Description of problem:

Several races exist in RPC communications:

* When rpc_clnt_reconfig() is called to change the port (to switch from glusterd to glusterfsd in protocol/client), a disconnection could be received and a reconnect attempted before protocol/client calls rpc_transport_disconnect(). This creates a spurious connection that is shortly closed again, but it lives long enough to send a handshake request; the request fails when the socket is closed and triggers a CHILD_DOWN event. As a result, some volumes come online with fewer bricks than expected, which can cause unnecessary damage that self-heal will need to repair.

* After calling rpc_clnt_reconfig() and rpc_transport_disconnect(), an rpc_clnt_notify() will be received to disconnect the client from glusterd. If the reconnect timer fires before rpc_clnt_cleanup_and_start() is called, an issue similar to the previous one occurs.

* When rpc_clnt_notify() is called with RPC_TRANSPORT_CLEANUP, rpc_clnt_destroy() is called immediately, but timer callbacks may still be running and using resources from the connection.

* In rpc_clnt_remove_ping_timer_locked(), the timer can be configured and already triggered for execution while its callback is not yet running. In this case the function still returns 1, causing a call to rpc_clnt_unref(). If that was the last reference, the timer callback will access already destroyed resources when it runs (see the sketch at the end of this report).

Version-Release number of selected component (if applicable):
mainline

How reproducible:
It's difficult.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

IMO, one thing that makes it very hard to make this work as expected, and to keep control of what is actually happening, is that we allow high level clients to directly call rpc_transport_disconnect() and other lower level functions, bypassing rpc-clnt. Since multiple threads can access the same connection in different ways, it's difficult to coordinate all accesses correctly.
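The following is a minimal standalone sketch of the fourth race, written as generic pthread code rather than the actual rpc-clnt implementation: struct conn, remove_ping_timer(), conn_unref() and ping_timer_cb() are illustrative stand-ins for rpc-clnt's connection object, rpc_clnt_remove_ping_timer_locked(), rpc_clnt_unref() and the ping timer callback. It only models the ordering problem, not the real code paths.

/* Simplified model of the ping-timer race: the timer has already been
 * dequeued for execution, but the "remove timer" path still reports
 * success and the caller drops the reference the callback relies on. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

struct conn {
    pthread_mutex_t lock;
    int refcount;
    int timer_armed;   /* timer registered and not yet cancelled */
};

/* Timer thread: once this has been dequeued for execution, cancelling
 * the timer from another thread can no longer stop it. */
static void *ping_timer_cb(void *arg)
{
    struct conn *c = arg;

    /* ... window where remove_ping_timer() may run concurrently ... */

    pthread_mutex_lock(&c->lock);
    /* In the real code, a refcount of 0 here means the connection has
     * already been destroyed and this access is a use-after-free. */
    printf("ping callback running, refcount=%d\n", c->refcount);
    pthread_mutex_unlock(&c->lock);
    return NULL;
}

/* Analogue of rpc_clnt_remove_ping_timer_locked(): returns 1 if it
 * believes it cancelled an armed timer, prompting the caller to unref. */
static int remove_ping_timer(struct conn *c)
{
    int was_armed;

    pthread_mutex_lock(&c->lock);
    was_armed = c->timer_armed;
    c->timer_armed = 0;
    pthread_mutex_unlock(&c->lock);

    /* Race: was_armed can still be 1 even though the callback has
     * already been dequeued, so the caller drops a reference that the
     * callback still needs. */
    return was_armed;
}

static void conn_unref(struct conn *c)
{
    int refs;

    pthread_mutex_lock(&c->lock);
    refs = --c->refcount;
    pthread_mutex_unlock(&c->lock);
    if (refs == 0)
        printf("last reference dropped: real code would destroy the "
               "connection here\n");
}

int main(void)
{
    struct conn *c = calloc(1, sizeof(*c));
    pthread_t t;

    pthread_mutex_init(&c->lock, NULL);
    c->refcount = 1;           /* reference held on behalf of the timer */
    c->timer_armed = 1;

    pthread_create(&t, NULL, ping_timer_cb, c);  /* timer "fires" */

    if (remove_ping_timer(c))  /* may still return 1 */
        conn_unref(c);         /* drops the last reference too early */

    pthread_join(t, NULL);
    free(c);
    return 0;
}

Built with gcc -pthread, this will typically report the last reference being dropped before (or while) the ping callback runs; in the real code that interleaving corresponds to the callback touching an already destroyed connection.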
https://github.com/gluster/glusterfs/issues/391 tracks the same problem. Will keep it open there; please follow it on GitHub.