Bug 1211866 - Concurrently detaching a peer on one node and stopping glusterd on the other node leads to a deadlock on the former node
Summary: Concurrently detaching a peer on one node and stopping glusterd on the other node leads to a deadlock on the former node
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Atin Mukherjee
QA Contact: Byreddy
URL:
Whiteboard: GlusterD
Depends On:
Blocks:
Reported: 2015-04-15 06:51 UTC by SATHEESARAN
Modified: 2016-01-25 07:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-01-25 05:45:43 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
cli.log file from node1 (39.60 KB, text/plain)
2015-04-15 07:00 UTC, SATHEESARAN

Description SATHEESARAN 2015-04-15 06:51:34 UTC
Description of problem:
-----------------------
In a two-node gluster cluster, detach node2 from node1 (gluster peer detach) while simultaneously stopping glusterd on node2. This consistently leads to a deadlock on node1.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
glusterfs-3.7 nightly ( glusterfs-3.7dev-0.994.gitf522001.el6.x86_64 )

How reproducible:
-----------------
Always

Steps to Reproduce:
--------------------
1. Create a 2 node cluster ( say, node1, node2 )
2. Perform the following steps concurrently. I used the TERMINATOR tool to broadcast the commands to both nodes at the same time; a minimal ssh-based alternative is sketched after these steps.
3. On node1, execute - "gluster peer detach <node2>"
4. On node2, execute - "service glusterd stop"
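
A minimal sketch of the same concurrent run without TERMINATOR, assuming passwordless ssh as root from a third machine and that node1/node2 are placeholder hostnames for the two cluster members:

# Fire both commands at (roughly) the same time and wait for both to return.
ssh root@node1 "gluster peer detach node2" &
ssh root@node2 "service glusterd stop" &
wait   # a hung CLI on node1 shows up as the first job never returning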

Actual results:
---------------
All the gluster commands executed on node1 resulted in "Error: cli timeout".

Expected results:
-----------------
There should be no problem executing gluster commands on node1.

Comment 1 SATHEESARAN 2015-04-15 06:52:29 UTC
The following logs are seen in cli.log:

<snip>
[2015-04-14 16:41:28.732861] I [cli-cmd-volume.c:1832:cli_check_gsync_present] 0-: geo-replication not installed
[2015-04-14 16:41:28.733712] I [event-epoll.c:629:event_dispatch_epoll_worker] 0-epoll: Started thread with index 1
[2015-04-14 16:41:28.733855] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-04-14 16:41:31.732510] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-04-14 16:41:34.733371] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-04-14 16:41:37.734100] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-04-14 16:41:40.734911] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now
[2015-04-14 16:41:43.735664] I [socket.c:2409:socket_event_handler] 0-transport: disconnecting now

</snip>

Comment 2 SATHEESARAN 2015-04-15 07:00:28 UTC
Created attachment 1014595 [details]
cli.log file from node1

Comment 3 krishnan parthasarathi 2015-04-16 03:30:03 UTC
Satheesaran,

Could you attach a core taken on the hung glusterd process? You could do that by attaching gdb to the process and issuing gcore inside gdb.
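
For anyone reproducing this, a minimal sketch of taking that core on the affected node, assuming gdb is installed, a single glusterd process, and /tmp as an arbitrary output path:

# Attach to the hung glusterd, dump a core and all thread backtraces
# (the backtraces are what usually expose the deadlocked locks), then exit.
pid=$(pidof glusterd)
gdb -batch -p "$pid" \
    -ex "gcore /tmp/glusterd-hung.core" \
    -ex "thread apply all bt"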

Comment 4 SATHEESARAN 2015-04-16 06:15:09 UTC
(In reply to krishnan parthasarathi from comment #3)
> Satheesaran,
> 
> Could you attach a core taken on the hung glusterd process? You could do
> that by attaching gdb to the process and issuing gcore inside gdb.

I already have the core file, but I couldn't attach it to the bug because BZ was slow to reach yesterday. Attaching it now.

Comment 7 Byreddy 2016-01-25 05:45:43 UTC
Checked this issue with the latest 3.1.2 build (glusterfs-3.7.5-17) on RHEL 7.

1. Created a two-node cluster (N1 and N2)
2. Performed the below steps using the terminator tool to broadcast commands concurrently to the 2 nodes.
a) On N1 > gluster peer detach N2
b) On N2 > systemctl stop glusterd

No issues were observed, and the gluster commands issued on N1 returned responses.
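
For completeness, a minimal sketch of the responsiveness check referred to above, run on N1 after the concurrent commands; the coreutils timeout command and the 30-second bound are my assumptions, not part of the original verification:

# If glusterd on N1 had deadlocked again, peer status would hang past the bound.
if timeout 30 gluster peer status; then
    echo "gluster CLI on N1 responded"
else
    echo "gluster CLI on N1 hung or failed"
fi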

Based on the above verification, closing this bug as working in the current release.

Comment 8 SATHEESARAN 2016-01-25 07:00:34 UTC
(In reply to Byreddy from comment #7)
> Checked this issue with the latest 3.1.2 build (glusterfs-3.7.5-17) on RHEL 7.
> 
> 1. Created a two-node cluster (N1 and N2)
> 2. Performed the below steps using the terminator tool to broadcast commands
> concurrently to the 2 nodes.
> a) On N1 > gluster peer detach N2
> b) On N2 > systemctl stop glusterd
> 
> No issues were observed, and the gluster commands issued on N1 returned
> responses.
> 
> Based on the above verification, closing this bug as working in the current
> release.

Thanks, Byreddy, for verifying this issue with the latest RHGS 3.1.2 nightly build.
I will re-open this bz if the issue happens again.

