Bug 847214 - glusterd operations hang if the other peers are down
Product: GlusterFS
Classification: Community
Component: glusterd
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: unspecified
Assigned To: krishnan parthasarathi
Depends On:
Blocks: 852147 918917
Reported: 2012-08-10 02:39 EDT by Pranith Kumar K
Modified: 2015-11-03 18:04 EST

See Also:
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 852147
Last Closed: 2013-07-24 13:43:19 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---

Description Pranith Kumar K 2012-08-10 02:39:51 EDT
Description of problem:
Ran a volume set operation while the other peers in the cluster were down, and the op-sm (operation state machine) hung.
The op-sm is stuck in an endless loop of transitions back into the same state:
Old State: [Ack drain]
New State: [Ack drain]
timestamp: [2012-08-10 06:10:25]

Old State: [Ack drain]
New State: [Ack drain]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
timestamp: [2012-08-10 06:10:28]

Old State: [Ack drain]
New State: [Ack drain]
timestamp: [2012-08-10 06:10:31]

Old State: [Ack drain]
New State: [Ack drain]
timestamp: [2012-08-10 06:10:31]

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
This is the setup in which I hit the problem, but I think it can be triggered with just 2 machines.
1. Have a cluster with 3 machines.
2. Bring two of the glusterds down.
3. Execute any glusterd operation command that uses op-sm; I used volume set.
Actual results:
The operation will hang after commit-op.
Expected results:
volume set operation should have been successful.

Additional info:
Comment 1 krishnan parthasarathi 2013-04-15 02:43:20 EDT
The infinitely looping state transitions in the operation state machine were fixed in http://review.gluster.org/4043.

This was happening because the notify function queued events into the operation state machine on every invocation (triggered on reconnect, once every 3 secs). Concurrently, the operation state machine processes all the events in the queue. So it is possible for the state machine to keep dequeuing events ad infinitum.

This is perceived as a hang because, at any point in time, the epoll thread could be executing glusterd_op_sm(), which processes all the events in the op_sm queue.
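
The effect can be reproduced in a small single-threaded model. This is only an illustrative sketch, not glusterd code: `OpStateMachine`, `noisy_notify`, and the `max_events` cap are hypothetical names, and the producer callback stands in for the notify function that runs concurrently in the real daemon. The cap exists only so the demonstration terminates; without it, the first run would never return.

```python
from collections import deque

ACK_DRAIN = "Ack drain"


class OpStateMachine:
    """Toy state machine that drains its queue until it is empty."""

    def __init__(self, producer=None):
        self.state = ACK_DRAIN
        self.queue = deque()
        self.transitions = []
        # Hypothetical stand-in for a notify callback that runs concurrently.
        self.producer = producer

    def inject(self, event):
        self.queue.append(event)

    def run(self, max_events):
        """Process queued events; stop when the queue is empty or the cap is hit."""
        processed = 0
        while self.queue and processed < max_events:
            event = self.queue.popleft()
            old_state = self.state
            # In this toy model every event leaves the machine in "Ack drain".
            self.state = ACK_DRAIN
            self.transitions.append((old_state, self.state, event))
            processed += 1
            if self.producer is not None:
                # Model the notify function queuing another event meanwhile.
                self.producer(self)
        return processed


def noisy_notify(sm):
    """Queues a new event on every invocation, as described above."""
    sm.inject("disconnect")


# With a producer that always re-queues, the drain loop never empties the queue.
hung = OpStateMachine(producer=noisy_notify)
hung.inject("disconnect")
hung_processed = hung.run(max_events=1000)

# Without the re-queuing producer, the same loop finishes after one event.
quiet = OpStateMachine(producer=None)
quiet.inject("disconnect")
quiet_processed = quiet.run(max_events=1000)

print(hung_processed, len(hung.queue))
print(quiet_processed, len(quiet.queue))
```

In the first run every recorded transition goes from "Ack drain" to "Ack drain", which matches the shape of the log in the description; the loop stops only because of the artificial cap.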

Moving it to ON_DEV, since this is fixed on both master and release-3.4

master, release-3.4: http://review.gluster.org/4043 - fixed before release-3.4 was branched from master.
Comment 2 Anand Avati 2013-04-22 03:37:48 EDT
REVIEW: http://review.gluster.org/4869 (glusterd: Removed 'proactive' failing of volume op) posted (#1) for review on master by Krishnan Parthasarathi (kparthas@redhat.com)
Comment 3 Anand Avati 2013-04-30 07:23:51 EDT
COMMIT: http://review.gluster.org/4869 committed in master by Vijay Bellur (vbellur@redhat.com) 
commit 3b1ecc6a7fd961c709e82862fd4760b223365863
Author: Krishnan Parthasarathi <kparthas@redhat.com>
Date:   Mon Apr 22 12:27:07 2013 +0530

    glusterd: Removed 'proactive' failing of volume op
    Volume operations were failed 'proactively', on the first disconnect of
    a peer that was participating in the transaction.
    The reason behind having this kludgey code in the first place was to
    'abort' an ongoing volume operation as soon as we perceive the first
    disconnect. But the rpc call backs themselves are capable of injecting
    appropriate state machine events, which would set things in motion for an
    eventual abort of the transaction.
    Change-Id: Iad7cb2bd076f22d89a793dfcd08c2d208b39c4be
    BUG: 847214
    Signed-off-by: Krishnan Parthasarathi <kparthas@redhat.com>
    Reviewed-on: http://review.gluster.org/4869
    Reviewed-by: Jeff Darcy <jdarcy@redhat.com>
    Tested-by: Gluster Build System <jenkins@build.gluster.com>
    Reviewed-by: Vijay Bellur <vbellur@redhat.com>
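
The idea in the commit message, that per-peer RPC callbacks are enough to drive a transaction to an eventual abort, can be sketched with a toy model. This is an assumption-laden illustration and not the glusterd implementation: `Transaction`, `on_rpc_reply`, and `outcome` are hypothetical names, and a disconnect is modelled simply as a callback invoked with `ok=False`.

```python
class Transaction:
    """Toy transaction that waits for one callback per participating peer."""

    def __init__(self, peers):
        self.pending = set(peers)  # peers we still expect a callback from
        self.failed = False
        self.done = False

    def on_rpc_reply(self, peer, ok):
        """RPC callback; ok=False models an error reply or a disconnect."""
        if peer not in self.pending:
            return
        self.pending.discard(peer)
        if not ok:
            # Record the failure, but do not abort yet.
            self.failed = True
        if not self.pending:
            self.done = True

    def outcome(self):
        if not self.done:
            return "in progress"
        return "aborted" if self.failed else "committed"


txn = Transaction(["peer-a", "peer-b", "peer-c"])
states = []

txn.on_rpc_reply("peer-a", ok=True)
states.append(txn.outcome())

# peer-b disconnects: its callback reports failure, but nothing is failed "proactively".
txn.on_rpc_reply("peer-b", ok=False)
states.append(txn.outcome())

# Once the last callback arrives, the transaction ends in an abort.
txn.on_rpc_reply("peer-c", ok=True)
states.append(txn.outcome())

print(states)
```

The point of the sketch is the middle step: the disconnect is recorded by the callback and the abort follows once all callbacks have run, so no separate code path is needed to fail the operation on the first disconnect.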
