Bug 1408104

Summary: Fix potential socket_poller thread deadlock and resource leak
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: rpc
Reporter: Atin Mukherjee <amukherj>
Assignee: Milind Changire <mchangir>
QA Contact: Rahul Hinduja <rhinduja>
Status: CLOSED UPSTREAM
Severity: high
Priority: high
Version: rhgs-3.2
Keywords: ZStream
CC: atumball, bugs, kaushal, mchangir, nbalacha, nchilaka, rhs-bugs
Type: Bug
Clone Of: 1408101
Bug Depends On: 1408101, 1561332
Bug Blocks: 1647277
Last Closed: 2018-11-20 05:42:51 UTC

Description Atin Mukherjee 2016-12-22 06:38:19 UTC
+++ This bug was initially created as a clone of Bug #1408101 +++

The fix for bug #1404181 [1] has a potential deadlock and a resource leak of the socket_poller thread.

A disconnect caused by a PARENT_DOWN event during a fuse graph switch can leave the socket_poller thread deadlocked. The deadlock doesn't affect the fuse client, because no new fops are sent on the old graph.

In addition to the above, the race in gfapi solved by [1] can also occur in other codepaths and needs to be fixed there as well.

Quoting Raghavendra G's comment from the review,
"""
- The race addressed by this patch (race b/w socket_disconnect cleaning up resources in priv and socket_poller using the same, resulting in undefined behaviour - crash/corruption etc.) can potentially happen irrespective of the codepath socket_disconnect is invoked from (glusterd, client_portmap_cbk, handling of PARENT_DOWN, changelog etc.). Note the usage of the word "potential" here - I am not saying that this race happens in existing code. However, I would like this issue to be fixed for these potential cases too.
- If there are fops in progress at the time of a graph switch, sending the PARENT_DOWN event on the currently active (soon to be old) graph is deferred till all the fops are complete (though the new graph becomes active and new I/O is redirected to it). So, PARENT_DOWN can be sent after processing the last response (to a fop). This means PARENT_DOWN can be sent in the thread executing socket_poller itself. Since PARENT_DOWN triggers a disconnect and the disconnect waits for socket_poller to complete, we have a deadlock. Specifically, the deadlock is: socket_poller -> notify-msg-received -> fuse processes fop response -> fuse sends PARENT_DOWN -> rpc-clnt calls rpc_clnt_disable -> socket_disconnect -> wait for socket_poller to complete before returning from socket_disconnect. Luckily, we have a socket_poller thread for each transport, and the threads that deadlock belong to transports from older graphs on which no I/O is happening. So, at worst this is a case of resource leakage (threads/sockets etc.) of the old graph.

"""


[1] https://review.gluster.org/16141

Comment 5 Atin Mukherjee 2018-11-10 07:07:43 UTC
This BZ is more than 2 years old now. Is it still valid? Do we have any plans to address this issue in the coming months? If not, can we reach a conclusion on the bug (say, won't fix?) and take it to closure?

Comment 6 Amar Tumballi 2018-11-20 05:42:51 UTC
Atin, considering you raised this issue and it was found by code analysis, I will be closing the bug as CLOSED-UPSTREAM, as the upstream bug is still active.

Comment 7 Red Hat Bugzilla 2023-09-14 03:36:40 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days