Bug 1037851

Summary: glusterd hangs on big lock
Product: Red Hat Gluster Storage
Reporter: Anand Avati <aavati>
Component: glusterfs
Assignee: Vijaikumar Mallikarjuna <vmallika>
Status: CLOSED ERRATA
QA Contact: ssamanta
Severity: high
Docs Contact:
Priority: high
Version: 2.1
CC: bhubbard, chrisw, gluster-bugs, grajaiya, kparthas, mzywusko, nsathyan, psriniva, sasundar, sdharane, smohan, ssamanta, vagarwal, vbellur, vmallika
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 2.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.4.0.49rhs-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, glusterd would become unresponsive when it disconnected from one of its peers while a gluster CLI command was executing. With this fix, glusterd does not become unresponsive in such a scenario.
Story Points: ---
Clone Of: 1037849
Environment:
Last Closed: 2014-02-25 08:07:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1037849
Bug Blocks:    
Attachments:
Description          Flags
Verification logs    none
Verification Logs-2  none

Comment 3 Pavithra 2014-01-03 11:05:55 UTC
Hi Vijai, 

I made minor changes to the doc text. Please verify.

Comment 4 Vijaikumar Mallikarjuna 2014-01-03 11:14:58 UTC
Doc-text looks good to me

Comment 5 ssamanta 2014-01-13 07:21:59 UTC
Created attachment 849167 [details]
Verification logs

Comment 6 ssamanta 2014-01-13 09:01:30 UTC
The following steps were performed to reproduce the issue.

Testcase
=========
1. Create a distribute volume with 3 nodes and start the volume

[root@rhsauto015 brick1]# gluster volume info
 
Volume Name: testvol2
Type: Distribute
Volume ID: 61259fc6-d2f6-410e-8832-ea529cf709de
Status: Started
Number of Bricks: 3
Transport-type: tcp
Bricks:
Brick1: 10.70.37.7:/rhs/brick1/ex10
Brick2: 10.70.36.245:/rhs/brick1/ex9
Brick3: 10.70.37.10:/rhs/brick1/ex8


2. Block glusterd traffic by adding iptables rules for the glusterd port (24007) on all nodes.

 [root@rhsauto015 ~]# iptables -I INPUT 1 -p tcp --dport 24007 -j DROP
 [root@rhsauto015 ~]# iptables -I OUTPUT 1 -p tcp --dport 24007 -j DROP

3. Using a script, simultaneously connect to the nodes and execute the CLI commands (gluster volume info and gluster peer status, continuously, 10000 times).


4. glusterd hangs after some iterations while fetching the "gluster peer status" information from the other nodes. glusterd does not crash. I will raise a new bug for this issue.

Verification logs are attached.
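
The script used in step 3 is not included in this report. As a rough illustration only, the following Python sketch runs the two CLI commands from step 3 concurrently for a fixed number of iterations and counts the outcomes. The names `run_cli` and `stress` and the "ok"/"failed" labels are hypothetical, and the sketch runs the commands locally on one node instead of connecting to each node as the original script did.

```python
import subprocess
import threading

# The two CLI commands named in step 3. The gluster CLI must be installed
# for a real run; this list is the only gluster-specific part of the sketch.
COMMANDS = [["gluster", "volume", "info"], ["gluster", "peer", "status"]]


def run_cli(cmd):
    """Run one command and return "ok" or "failed" (hypothetical labels).

    No timeout is applied here, so a hung glusterd would show up as this
    call never returning.
    """
    try:
        proc = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError:
        # The binary is missing or could not be started.
        return "failed"
    return "ok" if proc.returncode == 0 else "failed"


def stress(commands, iterations, runner=run_cli):
    """Run every command `iterations` times, one thread per command."""
    counts = {"ok": 0, "failed": 0}
    counts_lock = threading.Lock()

    def worker(cmd):
        for _ in range(iterations):
            outcome = runner(cmd)
            with counts_lock:
                counts[outcome] += 1

    threads = [threading.Thread(target=worker, args=(cmd,)) for cmd in commands]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Calling `stress(COMMANDS, 10000)` would approximate the iteration count used in step 3. The `runner` parameter exists only so the loop can be exercised without a gluster installation.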

Comment 7 ssamanta 2014-01-13 09:39:33 UTC
Marking it "Assigned" as glusterd still hangs.

Comment 8 krishnan parthasarathi 2014-01-15 09:10:20 UTC
Steps to reproduce (since the description doesn't provide one at all):
1) Set the following iptables rule

#Blocks incoming messages to glusterd
iptables -I INPUT 1 -p tcp --dport 513:65535 -j DROP 

2) Execute the volume profile info command periodically, for example in a loop

3) Remove the above iptables rule.

#Unblocks
iptables -D INPUT 1

The issue was that the big lock was being held while submitting RPC requests. This code path was active in the volume profile command, and hence the hang (actually a deadlock) was observed. The fix addresses only this issue.
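
The pattern described above can be modelled in a few lines of Python. This is only a toy sketch, not glusterd code: `big_lock`, `notify_handler`, `submit_rpc` and the two `command_*` functions are hypothetical names, a `threading.Lock` stands in for the big lock, and a timeout stands in for what would otherwise be an indefinite wait. Releasing the lock around the submit is shown as one plausible way to avoid the deadlock; the actual patch is not shown in this report.

```python
import threading

big_lock = threading.Lock()  # stand-in for glusterd's big lock


def notify_handler(done):
    """A callback (for example a peer notification) that needs the big lock."""
    with big_lock:
        done.set()


def submit_rpc(wait):
    """Pretend to submit an RPC whose completion depends on the handler.

    Returns (finished, thread): `finished` is False if the handler could
    not run within `wait` seconds.
    """
    done = threading.Event()
    t = threading.Thread(target=notify_handler, args=(done,), daemon=True)
    t.start()
    finished = done.wait(timeout=wait)
    return finished, t


def command_holding_lock(wait=0.5):
    """Submit while holding the big lock: the handler can never get in."""
    with big_lock:
        finished, t = submit_rpc(wait)
    t.join()  # the handler only runs once the lock has been dropped
    return finished


def command_releasing_lock(wait=0.5):
    """Drop the big lock before submitting and retake it afterwards."""
    with big_lock:
        pass  # work on shared state under the lock
    finished, t = submit_rpc(wait)  # lock is not held here
    with big_lock:
        pass  # resume work under the lock
    t.join()
    return finished
```

In this model `command_holding_lock()` reports that the handler did not complete while the lock was held, whereas `command_releasing_lock()` reports that it did.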

The steps in comment#6 are not related to this issue.

Sobhan,
What do you mean when you say glusterd is hung? How did you confirm that glusterd was indeed hung?

Comment 9 Vivek Agarwal 2014-01-15 10:52:07 UTC
Based on https://bugzilla.redhat.com/show_bug.cgi?id=1037851#c6, running the command in a loop for 10000 times is a stress test :)

Comment 10 ssamanta 2014-01-15 16:12:49 UTC
Krishnan P,

I waited 10-15 seconds, by which time the response should definitely have been received by the local glusterd daemon, to make sure that glusterd was not hung.

I have tried a number of iterations (i.e. 100) with the glusterd port blocked using iptables rules, and glusterd did not hang.
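
The check described above (wait a fixed time for a response, otherwise treat the daemon as hung) could be automated along these lines. This is a minimal sketch, assuming that "hung" means "the CLI command did not exit within the wait"; `responds_within` is a hypothetical helper name and the 15 second value in the usage note simply mirrors the upper bound mentioned above.

```python
import subprocess
import sys


def responds_within(cmd, seconds):
    """Return True if `cmd` exits within `seconds`, False otherwise.

    A non-zero exit status still counts as a response; only a command that
    has to be killed after the timeout is treated as a hang.
    """
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=seconds,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return False
    return True
```

For example, `responds_within(["gluster", "peer", "status"], 15)` would return False if the command had not finished after 15 seconds. Any Python one-liner can be substituted through `sys.executable` to try the helper without gluster installed.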

Comment 11 ssamanta 2014-01-16 03:32:38 UTC
Verification Information:


Test-1
=======
1. Create a distribute volume using 3 server nodes

2. Set the following iptables rule (i.e. one which still allows incoming ssh, dns, http, etc.)

#Blocks incoming messages to glusterd
iptables -I INPUT 1 -p tcp --dport 513:65535 -j DROP

3. Executed volume profile info / volume profile start / volume profile info / volume profile stop along with other commands such as gluster peer status and gluster volume status, in parallel on different nodes (100 times).
Put a sleep of 2, 10, or 15 seconds between the commands to make sure there were no issues due to timeouts.

Test-2
======

1. Removed the iptables rules (iptables -F / iptables -D INPUT 1)

2. Executed volume profile info / volume profile start / volume profile info / volume profile stop along with other commands such as gluster peer status and gluster volume status, in parallel on different nodes (100 times).
Put a sleep of 2, 10, or 15 seconds between the commands to make sure there were no issues due to timeouts.

3. Verified that glusterd does not hang.

Verified with following Build:
=============================
glusterfs 3.4.0.57rhs

Number of RHSS nodes:
====================
3

Comment 12 ssamanta 2014-01-16 12:33:08 UTC
Created attachment 851033 [details]
Verification Logs-2

Comment 14 errata-xmlrpc 2014-02-25 08:07:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html