Bug 1108505

Summary: quota: peer probe fails after adding the new node to the existing cluster with quota enabled
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Saurabh <saujain>
Component: glusterd
Assignee: Kaushal <kaushal>
Status: CLOSED ERRATA
QA Contact: Saurabh <saujain>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.0
CC: amukherj, asrivast, kaushal, kparthas, mzywusko, nlevinki, nsathyan, ssamanta, vbellur
Target Milestone: ---   
Target Release: RHGS 3.0.0   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: glusterfs-3.6.0.19-1
Doc Type: Bug Fix
Doc Text:
Cause: The way quotad was started on the new peer during a peer probe led to glusterd becoming deadlocked. Consequence: Because glusterd was deadlocked, the peer probe command failed. Fix: Quotad is now started in a non-blocking way during peer probe and no longer blocks glusterd. Result: Peer probe completes successfully.
Story Points: ---
Clone Of:
: 1109872 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:41:07 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1109872    
Bug Blocks:    
Attachments:
sosreport of existing rhs node (Flags: none)
sosreport of new rhss node (Flags: none)

Description Saurabh 2014-06-12 07:20:07 UTC
Description of problem:
I have an existing cluster of four RHSS nodes. I created a volume of type 6x2 in this cluster and set some options that are used for NFS.
Now, if I try to peer probe a new RHSS node, the probe fails.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.15-1.el6rhs.x86_64

How reproducible:
Seen during this test.

Steps to Reproduce:
1. Create a volume of type 6x2 and start it.
2. Set the NFS-related options such as nfs.rpc-auth-allow/reject and nfs.export-dirs/nfs.export-dir, and enable quota.
3. gluster peer probe a new RHSS node --- step 3 fails (a minimal command sketch follows below).
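
For reference, a minimal sketch of the command sequence for the steps above, reusing the volume name, bricks, and option values from the volume info shown later in this report (the exact values used in the original run may differ):

gluster volume create dist-rep replica 2 \
    10.70.37.62:/bricks/d1r1 10.70.37.215:/bricks/d1r2 \
    10.70.37.44:/bricks/d2r1 10.70.37.201:/bricks/d2r2 \
    10.70.37.62:/bricks/d3r1 10.70.37.215:/bricks/d3r2 \
    10.70.37.44:/bricks/d4r1 10.70.37.201:/bricks/d4r2 \
    10.70.37.62:/bricks/d5r1 10.70.37.215:/bricks/d5r2 \
    10.70.37.44:/bricks/d6r1 10.70.37.201:/bricks/d6r2
gluster volume start dist-rep
gluster volume set dist-rep nfs.rpc-auth-allow '*.lab.eng.blr.redhat.com'
gluster volume set dist-rep nfs.rpc-auth-reject 10.70.35.33
gluster volume set dist-rep nfs.export-dirs on
gluster volume set dist-rep nfs.export-dir '/1(rhsauto054.lab.eng.blr.redhat.com),/2(172.16.0.0/27)'
gluster volume quota dist-rep enable
gluster peer probe 10.70.37.13    # this is the step that fails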

Actual results:
[root@nfs1 ~]# gluster peer probe 10.70.37.13
peer probe: failed: Probe returned with unknown errno -1

[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.215
Uuid: db4a5cde-f048-4796-84dd-19ba9ca98e6f
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: 7f8f341e-4274-40f0-ae83-bde70365d2f4
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 9512d008-9dd8-4a5b-bf8c-983862a86c4a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.13
Uuid: ccaeac50-ad54-43ef-a5a2-5a7e17666936
State: Probe Sent to Peer (Connected)


Expected results:
peer probe should be a success

Additional info:
IP of the node already part of the cluster, from where the command was executed:
inet 10.70.37.62/23 brd 10.70.37.255 scope global eth0

IP of the new node:
inet 10.70.37.13/23 brd 10.70.37.255 scope global eth0


Peer status from another node of the existing cluster:
[root@nfs2 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.37.62
Uuid: bd23f0cb-d64a-4ddb-8543-6e1bbc812c7d
State: Peer in Cluster (Connected)

Hostname: 10.70.37.44
Uuid: 7f8f341e-4274-40f0-ae83-bde70365d2f4
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 9512d008-9dd8-4a5b-bf8c-983862a86c4a
State: Peer in Cluster (Connected)

Comment 1 Saurabh 2014-06-12 07:22:42 UTC
Created attachment 907965 [details]
sosreport of existing rhs node

Comment 2 Saurabh 2014-06-12 07:24:57 UTC
Created attachment 907967 [details]
sosreport of new rhss node

Comment 5 Saurabh 2014-06-13 10:52:11 UTC
Santosh and I reran the tests to narrow down the issue; we have seen the peer probe fail only once or twice. Otherwise, the recent peer probe attempts have been successful. These latest trials were on a new volume with the same NFS options set.

Comment 6 Saurabh 2014-06-13 10:55:43 UTC
Based on comment 5, I would like to request removing the blocker flag and lowering the priority. However, we can't close the bug, since the issue did happen and we are not clear why it is not reproducing at this time.

Comment 7 Saurabh 2014-06-13 11:02:10 UTC
So, probably my mistake: I didn't test with quota enabled in the latest trials, whereas when filing the BZ I did have quota enabled on the volume.
Hence, I retried with quota enabled, and the peer probe failed, as can be seen in the results mentioned below.

Please do not lower the priority. Changing the summary as well.


Results of latest trial,
[root@nfs1 ~]# gluster peer probe 10.70.37.13
peer probe: failed: Probe returned with unknown errno -1
[root@nfs1 ~]# gluster volume info dist-rep
 
Volume Name: dist-rep
Type: Distributed-Replicate
Volume ID: 7ab235ad-a666-44b3-a46f-d3321f3eb4d6
Status: Started
Snap Volume: no
Number of Bricks: 7 x 2 = 14
Transport-type: tcp
Bricks:
Brick1: 10.70.37.62:/bricks/d1r1
Brick2: 10.70.37.215:/bricks/d1r2
Brick3: 10.70.37.44:/bricks/d2r1
Brick4: 10.70.37.201:/bricks/d2r2
Brick5: 10.70.37.62:/bricks/d3r1
Brick6: 10.70.37.215:/bricks/d3r2
Brick7: 10.70.37.44:/bricks/d4r1
Brick8: 10.70.37.201:/bricks/d4r2
Brick9: 10.70.37.62:/bricks/d5r1
Brick10: 10.70.37.215:/bricks/d5r2
Brick11: 10.70.37.44:/bricks/d6r1
Brick12: 10.70.37.201:/bricks/d6r2
Brick13: 10.70.37.62:/bricks/d1r1-add
Brick14: 10.70.37.215:/bricks/d1r2-add
Options Reconfigured:
features.quota: on
nfs.export-dir: /1(rhsauto054.lab.eng.blr.redhat.com),/2(172.16.0.0/27)
nfs.export-dirs: on
nfs.rpc-auth-reject: 10.70.35.33
nfs.rpc-auth-allow: *.lab.eng.blr.redhat.com
[root@nfs1 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.44
Uuid: 7f8f341e-4274-40f0-ae83-bde70365d2f4
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 9512d008-9dd8-4a5b-bf8c-983862a86c4a
State: Peer in Cluster (Connected)

Hostname: 10.70.37.215
Uuid: db4a5cde-f048-4796-84dd-19ba9ca98e6f
State: Peer in Cluster (Connected)

Hostname: 10.70.37.13
Uuid: ccaeac50-ad54-43ef-a5a2-5a7e17666936
State: Probe Sent to Peer (Disconnected)

Comment 8 santosh pradhan 2014-06-14 05:40:11 UTC
Vivek,
I don't know whom to assign this to, hence assigning it to you. Please reassign it to the Quota team.

Regards,
Santosh

Comment 10 Kaushal 2014-06-16 14:24:29 UTC
This is another instance of the quotad start causing glusterd to deadlock, similar to bug 1095585.
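
For anyone hitting a similar hang, a rough way to confirm that glusterd on the probed node is deadlocked (a diagnostic sketch, not part of the original report; it assumes gdb is installed, and the log path is the default glusterd log location):

# on the newly probed node, while the probe is stuck
ps aux | grep quotad                                        # was quotad spawned as part of the probe?
gdb -p $(pidof glusterd) -batch -ex 'thread apply all bt'   # a deadlocked glusterd shows threads blocked on locks
less /var/log/glusterfs/etc-glusterfs-glusterd.vol.log      # glusterd log around the time of the probe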

Comment 11 Atin Mukherjee 2014-06-17 04:03:35 UTC
Downstream patch - https://code.engineering.redhat.com/gerrit/#/c/27049/

Comment 12 Saurabh 2014-06-20 09:46:41 UTC
[root@nfs2 ~]# gluster volume info dist-rep | grep quota
features.quota-deem-statfs: off
features.quota: on

[root@nfs4 ~]# gluster peer probe rhsauto005.lab.eng.blr.redhat.com
peer probe: success. 


[root@nfs3 ~]# gluster peer status
Number of Peers: 4

Hostname: 10.70.37.62
Uuid: ad345a97-3d00-4960-a620-d89f1f715dc0
State: Peer in Cluster (Connected)

Hostname: 10.70.37.215
Uuid: b9eded1c-fbae-4e9b-aa31-26a06e747d83
State: Peer in Cluster (Connected)

Hostname: 10.70.37.201
Uuid: 542bf4aa-b6b5-40c3-82bf-f344fb637a99
State: Peer in Cluster (Connected)

Hostname: rhsauto005.lab.eng.blr.redhat.com
Uuid: 5f0ccbd1-bec3-4c37-be35-6ce38647398c
State: Peer in Cluster (Connected)


Hence, moving this BZ to Verified.

Comment 14 errata-xmlrpc 2014-09-22 19:41:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html