Bug 1541032 - Races in network communications
Summary: Races in network communications
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: GlusterFS
Classification: Community
Component: rpc
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Milind Changire
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-02-01 14:53 UTC by Xavi Hernandez
Modified: 2019-06-20 05:14 UTC
CC List: 4 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2019-06-20 05:14:57 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Xavi Hernandez 2018-02-01 14:53:56 UTC
Description of problem:

Several races exist in RPC communications.

* When rpc_clnt_reconfig() is called to change the port (to switch from glusterd to glusterfsd in protocol/client), a disconnection can be received and a reconnect attempted before protocol/client calls rpc_transport_disconnect(). This creates a spurious connection that is closed shortly afterwards, but it lives long enough to send a handshake request, which fails when the socket is closed again and triggers a CHILD_DOWN event. As a result, some volumes come online with fewer bricks than expected, which can cause unnecessary damage that self-heal will later need to repair.

* After calling rpc_clnt_reconfig() and rpc_transport_disconnect(), an rpc_clnt_notify() is received to disconnect the client from glusterd. If the reconnect timer fires before rpc_clnt_cleanup_and_start() is called, an issue similar to the previous one occurs.

* When rpc_clnt_notify() is called with RPC_TRANSPORT_CLEANUP, rpc_clnt_destroy() is called immediately, but there may still be timer callbacks running that use resources belonging to the connection.

* In rpc_clnt_remove_ping_timer_locked(), the ping timer may be configured and have just been dequeued for execution while its callback has not yet started running. In this case the function still returns 1, causing a call to rpc_clnt_unref(). If that was the last reference, the timer callback will access already-destroyed resources when it eventually runs (a minimal model of this race is sketched below).
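
The fourth race can be illustrated with a small standalone model. This is not GlusterFS code: fake_conn, remove_ping_timer() and ping_timer_cb() are hypothetical stand-ins for the connection object, rpc_clnt_remove_ping_timer_locked() and the ping timeout callback. It assumes a timer whose callback has already been dispatched to a thread when the removal runs.

/*
 * Minimal model of the ping-timer race. Hypothetical names, not the
 * GlusterFS API. Build with: gcc -pthread ping_race_model.c
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_conn {
    int refcount;            /* the last unref frees the object        */
    int timer_armed;         /* "a ping timer is configured"           */
    pthread_mutex_t lock;
};

static void conn_unref(struct fake_conn *conn)
{
    int free_it;

    pthread_mutex_lock(&conn->lock);
    free_it = (--conn->refcount == 0);
    pthread_mutex_unlock(&conn->lock);
    if (free_it)
        free(conn);          /* connection resources are gone now      */
}

/* The timer callback: already dequeued for execution, but its body has
 * not started running yet when the removal below takes place.         */
static void *ping_timer_cb(void *arg)
{
    struct fake_conn *conn = arg;

    usleep(1000);                     /* delay so the remover wins      */
    pthread_mutex_lock(&conn->lock);  /* use-after-free if conn is gone */
    printf("ping timeout, refcount=%d\n", conn->refcount);
    pthread_mutex_unlock(&conn->lock);
    conn_unref(conn);                 /* drops the timer's reference    */
    return NULL;
}

/* Models rpc_clnt_remove_ping_timer_locked(): the timer still looks
 * armed, so this returns 1 even though the callback will run anyway.  */
static int remove_ping_timer(struct fake_conn *conn)
{
    int was_armed;

    pthread_mutex_lock(&conn->lock);
    was_armed = conn->timer_armed;
    conn->timer_armed = 0;
    pthread_mutex_unlock(&conn->lock);
    return was_armed;
}

int main(void)
{
    struct fake_conn *conn = calloc(1, sizeof(*conn));
    pthread_t cb;

    pthread_mutex_init(&conn->lock, NULL);
    conn->refcount = 2;               /* one ref for the owner, one for the timer  */
    conn->timer_armed = 1;

    pthread_create(&cb, NULL, ping_timer_cb, conn);   /* the timer "fires" here    */

    if (remove_ping_timer(conn))      /* returns 1, so the caller...               */
        conn_unref(conn);             /* ...drops the timer's reference            */
    conn_unref(conn);                 /* owner drops the last reference: freed     */

    pthread_join(cb, NULL);           /* callback now touches freed memory         */
    return 0;
}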

Version-Release number of selected component (if applicable): mainline


How reproducible:

Difficult; these races depend on timing and trigger only occasionally.

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

IMO, one thing that makes it very hard to get this right, and to keep control of what is happening, is that we allow high-level clients to call rpc_transport_disconnect() and other lower-level functions directly, bypassing rpc-clnt. Given that multiple threads can access the same connection in different ways, it is difficult to coordinate all accesses correctly.
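
As a sketch of that direction, the snippet below routes all connection state changes through a single per-connection lock owned by rpc-clnt. The types and functions (rpc_conn, rpc_conn_reconfig_and_disconnect(), rpc_conn_reconnect_cb(), rpc_conn_cleanup_and_start()) are hypothetical, not the existing API; the point is only that the reconnect timer re-checks state under the same lock used by reconfigure/disconnect, so it cannot start a connection that is about to be torn down.

/*
 * Hypothetical sketch, not GlusterFS code: serialize reconfigure,
 * disconnect and timer-driven reconnects behind one lock in rpc-clnt.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

enum conn_state { CS_DISCONNECTED, CS_CONNECTING, CS_CONNECTED };

struct rpc_conn {
    pthread_mutex_t lock;
    enum conn_state state;
    uint16_t        port;            /* current target port              */
    bool            reconnect_allowed;
};

/* Called by protocol/client instead of touching the transport directly:
 * change the port and tear down the old connection as one atomic step. */
void rpc_conn_reconfig_and_disconnect(struct rpc_conn *c, uint16_t new_port)
{
    pthread_mutex_lock(&c->lock);
    c->port = new_port;
    c->reconnect_allowed = false;    /* block timer-driven reconnects    */
    c->state = CS_DISCONNECTED;      /* transport teardown would go here */
    pthread_mutex_unlock(&c->lock);
}

/* Reconnect timer callback: re-checks the flag under the same lock, so
 * it can never race with the reconfigure/disconnect step above.        */
void rpc_conn_reconnect_cb(struct rpc_conn *c)
{
    pthread_mutex_lock(&c->lock);
    if (c->reconnect_allowed && c->state == CS_DISCONNECTED)
        c->state = CS_CONNECTING;    /* initiate connect to c->port      */
    pthread_mutex_unlock(&c->lock);
}

/* Counterpart of cleanup-and-start: explicitly re-enable reconnects.   */
void rpc_conn_cleanup_and_start(struct rpc_conn *c)
{
    pthread_mutex_lock(&c->lock);
    c->reconnect_allowed = true;
    c->state = CS_CONNECTING;
    pthread_mutex_unlock(&c->lock);
}

int main(void)
{
    struct rpc_conn c = { .state = CS_CONNECTED, .port = 24007 };

    pthread_mutex_init(&c.lock, NULL);
    rpc_conn_reconfig_and_disconnect(&c, 49152); /* example brick port   */
    rpc_conn_reconnect_cb(&c);       /* ignored: reconnects are blocked  */
    rpc_conn_cleanup_and_start(&c);  /* now reconnection may proceed     */
    printf("state=%d port=%u\n", c.state, (unsigned)c.port);
    return 0;
}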

Comment 1 Amar Tumballi 2019-06-20 05:14:57 UTC
https://github.com/gluster/glusterfs/issues/391 tracks the same issue. We will keep it open there; please follow it on GitHub.

