Description of problem:
-----------------------
This bug was already reported against RHS 2.1 as BZ 1034479.

When glusterd on one of the nodes does not respond, 'gluster volume status' appears to hang for 2 minutes (the CLI timeout). Subsequent 'gluster volume status' commands, and other gluster commands that need information from other glusterd instances, fail with the error message "Another transaction in progress".

This issue was fixed by the introduction of the ping-timer in RHS 2.1.2. However, volume snapshot had some problems with the ping-timer (refer to BZ 1096729), so the ping-timer was disabled by default by setting ping-timeout to 0 in the glusterd volfile. With the ping-timer disabled, BZ 1034479 appears again.

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
glusterfs-3.6.0.25-1.el6rhs

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Create a 2+ node cluster
2. Create a volume with bricks on 2 RHS nodes and start the volume
3. Stop all the data traffic to the other node
4. Execute 'gluster volume status <vol-name>' from one node
5. Execute 'gluster volume status <vol-name>' again

Actual results:
---------------
1. The first invocation of 'gluster volume status <vol-name>' fails with error code 146, without showing any output
2. The subsequent invocation of 'gluster volume status <vol-name>' fails with 'Another Transaction in progress'

After 10+ minutes, 'gluster volume status' works successfully, ignoring the bricks on the node that is no longer reachable.

Expected results:
-----------------
The user should not have to wait more than 10 minutes for a network disconnect to be identified. Any network disconnect should be identified early.
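The behaviour in the actual results can be handled from a script by retrying the status command until it succeeds. The following is only a minimal sketch, not something taken from gluster itself: the helper name `wait_for_volume_status`, the 60-second interval, and the attempt limit are assumptions chosen for illustration. The only thing taken from this report is the command 'gluster volume status <vol-name>' and the fact that it starts working again after roughly 10 minutes.

```python
import subprocess
import time


def wait_for_volume_status(volume, run=None, sleep=time.sleep,
                           interval=60, max_attempts=15):
    """Retry 'gluster volume status <volume>' until it exits with 0.

    Hypothetical helper: 'interval' and 'max_attempts' are illustrative
    values sized to outlast the ~10 minute window described in this bug.
    Returns (attempt_number, output) for the first successful attempt.
    """
    if run is None:
        # Default runner: invoke the real CLI and return (exit code, output).
        def run(cmd):
            proc = subprocess.run(cmd, capture_output=True, text=True)
            return proc.returncode, proc.stdout + proc.stderr

    cmd = ["gluster", "volume", "status", volume]
    for attempt in range(1, max_attempts + 1):
        code, output = run(cmd)
        if code == 0:
            return attempt, output
        # Non-zero exit (for example the CLI timeout or
        # "Another transaction in progress"): wait before retrying.
        if attempt < max_attempts:
            sleep(interval)
    raise TimeoutError(
        f"'gluster volume status {volume}' did not succeed "
        f"after {max_attempts} attempts"
    )
```

The `run` and `sleep` parameters exist only so that the loop can be exercised without a real cluster. In normal use, calling `wait_for_volume_status("<vol-name>")` with the defaults would invoke the gluster CLI directly.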
There are 2 workarounds, each with its own cost:

1. Wait more than 10 minutes for gluster commands to work without the error "Another transaction in progress".
   Cost: The user needs to wait for at least 10 minutes before executing any gluster command. This happens only once. Once the network disconnect is identified, subsequent commands ignore the node that is not reachable.

2. Enable the ping-timer. This can be done as follows:
   i) Edit the glusterd volfile to set the ping-timeout option to 30
   ii) Restart glusterd on that node
   Cost: Volume snapshot fails with the ping-timer enabled. Refer to BZ 1096729.
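Step i) of the second workaround can be sketched as a small text transformation. This is a minimal illustration under stated assumptions, not a supported tool: the function name `set_ping_timeout` is hypothetical, and the sample volfile below is a shortened stand-in for a real glusterd volfile, which typically carries more options. Restarting glusterd (step ii) is still required afterwards and is not covered here.

```python
import re

# Shortened, illustrative volfile contents (an assumption, not copied
# from a real node).
SAMPLE_VOLFILE = """volume management
    type mgmt/glusterd
    option ping-timeout 0
end-volume
"""


def set_ping_timeout(volfile_text, seconds):
    """Return volfile_text with 'option ping-timeout' set to 'seconds'.

    If the option line already exists, its value is replaced and the
    indentation is kept. Otherwise the option is added just before the
    first 'end-volume' line.
    """
    new_line = f"option ping-timeout {seconds}"
    option_re = re.compile(
        r"^([ \t]*)option[ \t]+ping-timeout[ \t]+\S+[ \t]*$", re.MULTILINE
    )
    if option_re.search(volfile_text):
        return option_re.sub(lambda m: m.group(1) + new_line, volfile_text)
    return re.sub(
        r"^end-volume",
        f"    {new_line}\nend-volume",
        volfile_text,
        count=1,
        flags=re.MULTILINE,
    )
```

For example, `set_ping_timeout(SAMPLE_VOLFILE, 30)` returns the same text with the line `option ping-timeout 30` in place of `option ping-timeout 0`. Working on a string rather than on the file keeps the sketch free of side effects. Reading and writing the actual volfile is left to the caller.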
The ideal solution would be to have the ping timer work in a separate epoll thread and then enable the ping timer. With that, we would get rid of both this issue and the snapshot-related issues.

Can we mark this as a known issue for Denali?
(In reply to Atin Mukherjee from comment #3)
> The ideal solution would be to have ping timer work in a separate e-poll
> thread and then enable ping timer, with that we would get rid of both this
> and snapshot related issues.
> Can we mark this as a known issue for denali?

Marked this bug as a known issue for Denali.
We have a patch for multi-threaded epoll. There are two approaches, and we need to choose one of them:

http://review.gluster.org/#/c/8098/
http://review.gluster.org/#/c/3842/

It is risky to take this patch into Denali, as it requires complete testing.

It is always good to enable the ping-timer in the file '/etc/glusterfs/glusterd.vol' by setting ping-timeout to 30 or more. Disable it only if multiple snapshot operations are performed simultaneously from different nodes.
Please review and sign off on the edited doc text.
Doc text looks good to me
There is no plan to enable ping-timeout for glusterd-to-glusterd communication, so we will not be fixing this in GlusterD 1.0.