Bug 1123733

Summary: 'gluster volume status' looks like its hung, when there is no response from one of glusterd
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: SATHEESARAN <sasundar>
Component: glusterd Assignee: Atin Mukherjee <amukherj>
Status: CLOSED WONTFIX QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: unspecified    
Version: rhgs-3.0 CC: amukherj, asriram, mzywusko, nlevinki, smohan, vbellur
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Executing a command that involves glusterd-to-glusterd communication (for example, 'gluster volume status') immediately after one of the nodes goes down hangs and then fails after 2 minutes with a cli-timeout message. Subsequent commands fail with the error message 'Another transaction in progress' for 10 minutes (the frame timeout). Workaround: Set a non-zero value for 'ping-timeout' in the "/etc/glusterfs/glusterd.vol" file and restart glusterd.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-02-07 11:23:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description SATHEESARAN 2014-07-28 07:07:28 UTC
Description of problem:
-----------------------
This bug was already reported in RHS 2.1 as BZ 1034479.

When glusterd on one of the nodes does not respond, 'gluster volume status' appears to hang for 2 minutes (the cli-timeout).

Subsequent 'gluster volume status' commands, and other gluster commands that need information from the other glusterd instances, then fail with the error message "Another transaction in progress".

This issue was fixed with the introduction of the ping-timer in RHS 2.1.2.
But volume snapshot had some problems with the ping-timer (refer to BZ 1096729),
so the ping-timer was disabled by default by setting ping-timeout to 0 in the glusterd volfile.

With the ping-timer disabled, BZ 1034479 appears again.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.6.0.25-1.el6rhs

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a 2+ node cluster
2. Create a volume with bricks on 2 RHS Nodes & start the volume
3. Stop all data traffic to the other node
4. Execute 'gluster volume status <vol-name>' from one node
5. Execute 'gluster volume status <vol-name>' again

Actual results:
---------------
1. The first invocation of 'gluster volume status <vol-name>' fails with error code 146, without showing any output

2. The subsequent invocation of 'gluster volume status <vol-name>' fails with 'Another Transaction in progress'

After more than 10 minutes, 'gluster volume status' works successfully, ignoring the bricks on the node that is no longer reachable.
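
For scripted reproduction, the outcomes above can be told apart roughly as follows. This is only a sketch: the function names `classify` and `volume_status` and the state labels are made up for illustration, and it assumes the "Another transaction in progress" message shows up on stdout or stderr. The exit code 146 and the message text are the ones observed in this report.

```python
import subprocess

# Exit code seen in this report for the first invocation (cli-timeout).
CLI_TIMEOUT_EXIT = 146
# Message seen on subsequent invocations; matched case-insensitively
# because its capitalisation is not consistent in this report.
LOCK_MESSAGE = "another transaction in progress"


def classify(returncode, output):
    """Map a gluster CLI result onto the states described in this bug."""
    if returncode == 0:
        return "ok"
    if LOCK_MESSAGE in output.lower():
        return "locked"
    if returncode == CLI_TIMEOUT_EXIT:
        return "cli-timeout"
    return "error"


def volume_status(volname):
    """Run 'gluster volume status <volname>' and classify the result."""
    proc = subprocess.run(
        ["gluster", "volume", "status", volname],
        capture_output=True,
        text=True,
    )
    # Assumption: the message may land on either stream, so check both.
    return classify(proc.returncode, proc.stdout + proc.stderr)
```

On an affected cluster, the first call to `volume_status` would be expected to come back as "cli-timeout" after about 2 minutes, and later calls as "locked" until the frame timeout expires.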

Expected results:
-----------------
The user should not have to wait more than 10 minutes for the network disconnect to be identified.
Any network disconnect should be identified early.

Comment 2 SATHEESARAN 2014-07-28 07:22:43 UTC
There are 2 workarounds, each with its own cost:

1. Wait a little more than 10 minutes (~10 mins), after which gluster commands work without the error "Another transaction in progress"
Cost: The user needs to wait at least 10 minutes before executing any gluster command.
This happens only once. Once the network disconnect is identified, subsequent commands ignore the node that is not reachable.


2. Enable the ping-timer. This can be done as follows:
   i) Edit the glusterd volfile to set the ping-timeout option to 30
   ii) Restart glusterd on that node

Cost: volume snapshot fails with the ping-timer enabled. Refer to BZ 1096729.
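
Step i) of the second workaround could be scripted along these lines. This is a minimal sketch, not a supported tool: `set_ping_timeout` is a hypothetical helper, and it assumes the volfile keeps the usual 'option <name> <value>' lines inside a single volume ... end-volume block. It only transforms text; writing the result back to the volfile and restarting glusterd (step ii) are left to the operator.

```python
import re


def set_ping_timeout(volfile_text, seconds=30):
    """Return volfile text with 'option ping-timeout' set to `seconds`."""
    existing = re.compile(
        r"^([ \t]*)option[ \t]+ping-timeout[ \t]+\S+[ \t]*$", re.MULTILINE
    )
    if existing.search(volfile_text):
        # Rewrite the existing line, keeping its indentation.
        return existing.sub(
            lambda m: f"{m.group(1)}option ping-timeout {seconds}",
            volfile_text,
        )
    # No ping-timeout line yet: add one just before the first 'end-volume'.
    return re.sub(
        r"^([ \t]*)end-volume",
        lambda m: f"    option ping-timeout {seconds}\n{m.group(1)}end-volume",
        volfile_text,
        count=1,
        flags=re.MULTILINE,
    )


if __name__ == "__main__":
    sample = (
        "volume management\n"
        "    type mgmt/glusterd\n"
        "    option ping-timeout 0\n"
        "end-volume\n"
    )
    print(set_ping_timeout(sample, 30))
```

Setting the value back to 0 with the same helper restores the default described in this report, which matters if the snapshot problem from BZ 1096729 is hit.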

Comment 3 Atin Mukherjee 2014-07-28 07:32:13 UTC
The ideal solution would be to have the ping timer work in a separate epoll thread and then enable the ping timer; with that we would get rid of both this and the snapshot-related issues.
Can we mark this as a known issue for denali?

Comment 4 SATHEESARAN 2014-07-28 07:37:41 UTC
(In reply to Atin Mukherjee from comment #3)
> The ideal solution would be to have ping timer work in a separate e-poll
> thread and then enable ping timer, with that we would get rid of both this
> and snapshot related issues.
> Can we mark this as a known issue for denali?

Marked this bug as a known issue for Denali

Comment 5 Vijaikumar Mallikarjuna 2014-07-28 08:13:51 UTC
We have a patch for multi-threaded epoll. There are two approaches, and we need to choose one of them:
http://review.gluster.org/#/c/8098/
http://review.gluster.org/#/c/3842/

It is risky to take this patch into Denali, as it requires complete testing to be done.


It is always good to enable the ping-timer in the file '/etc/glusterfs/glusterd.vol'. Set ping-timeout to 30 or higher.
Disable it only if multiple snapshot operations are performed simultaneously from different nodes.

Comment 6 Shalaka 2014-09-21 04:18:12 UTC
Please review and sign-off edited doc text.

Comment 7 Vijaikumar Mallikarjuna 2014-09-22 06:06:55 UTC
Doc text looks good to me

Comment 10 Atin Mukherjee 2017-02-07 11:23:47 UTC
There is no plan to enable ping-timeout for glusterd-to-glusterd communication, so we will not be fixing this in GlusterD 1.0.