Bug 1400546 - After ganesha node reboot/shutdown, portblock process goes to FAILED state
Summary: After ganesha node reboot/shutdown, portblock process goes to FAILED state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: common-ha
Version: 3.9
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On: 1399154
Blocks: 1398261
 
Reported: 2016-12-01 13:20 UTC by Soumya Koduri
Modified: 2017-03-08 10:20 UTC (History)

Fixed In Version: glusterfs-3.9.1
Clone Of: 1399154
Environment:
Last Closed: 2017-03-08 10:20:15 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Description Soumya Koduri 2016-12-01 13:20:22 UTC
+++ This bug was initially created as a clone of Bug #1399154 +++

Description of problem:
After a ganesha node reboot, the portblock process goes to FAILED state.

In a four-node cluster, if one of the nodes is rebooted or shut down, the portblock process on any of the nodes (not one particular node) ends up in FAILED state.

Even after the rebooted or shut-down node is brought back up, failback does not
happen while the portblock process is in FAILED state.


Version-Release number of selected component (if applicable):
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64

How reproducible:
Consistent

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Reboot one of the nodes.
3. Check pcs status.

Actual results:
The portblock process goes to FAILED state in pcs status.

Expected results:
All the processes should be up and running.

Additional info:

[root@dhcp46-139 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-124.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Thu Nov 24 13:12:49 2016		Last change: Thu Nov 24 12:32:19 2016 by root via cibadmin on dhcp46-111.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-111.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-124.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-115.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-115.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-115.lab.eng.blr.redhat.com
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-139.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-139.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-139.lab.eng.blr.redhat.com
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com

Failed Actions:
* dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-124.lab.eng.blr.redhat.com 'unknown error' (1): call=83, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=20004ms
* dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-124.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=0ms
* dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-115.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=0ms
* dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-139.lab.eng.blr.redhat.com 'unknown error' (1): call=71, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:41 2016', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@dhcp46-139 ~]# 
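
For triage, the "Failed Actions" entries above can be filtered for timeouts with a small script. This is only a sketch: the function name `timed_out_actions` is made up for illustration, and the regular expressions assume the two-line entry layout printed by the pacemaker 1.1.15 build in this report; other versions may format entries differently.

```python
import re

# First line of one "Failed Actions" entry, as laid out in the output above.
# Assumption: the layout matches this report; it is not a documented format.
ENTRY = re.compile(
    r"^\* (?P<action>\S+) on (?P<node>\S+) '(?P<error>[^']*)' \((?P<rc>\d+)\): "
    r"call=(?P<call>\d+), status=(?P<status>[^,]+),"
)
# The execution time sits on the continuation line of each entry.
EXEC = re.compile(r"exec=(?P<ms>\d+)ms")


def timed_out_actions(status_text):
    """Return (action, node, exec_ms) for every entry whose status is 'Timed Out'."""
    results = []
    current = None  # the timed-out entry still waiting for its exec= line
    for line in status_text.splitlines():
        entry = ENTRY.match(line.strip())
        if entry:
            # Remember the entry only if it timed out; drop any other status.
            current = entry if entry.group("status") == "Timed Out" else None
            continue
        exec_time = EXEC.search(line)
        if current is not None and exec_time:
            results.append(
                (current.group("action"), current.group("node"), int(exec_time.group("ms")))
            )
            current = None
    return results
```

Fed the output above, this would single out the `nfs_unblock` stop action that ran for 20004 ms before timing out, alongside the three monitor actions that timed out with exec=0ms.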


ganesha log snippet:
---------------------

Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0+0 records in ]
Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0+0 records out ]
Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0 bytes (0 B) copied, 0.0739975 s, 0.0 kB/s ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0+0 records in ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0+0 records out ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0 bytes (0 B) copied, 0.0539065 s, 0.0 kB/s ]

--- Additional comment from Worker Ant on 2016-11-28 07:40:45 EST ---

REVIEW: http://review.gluster.org/15947 (common-HA: Increase timeout for portblock RA of action=unblock) posted (#1) for review on master by soumya k (skoduri)

--- Additional comment from Worker Ant on 2016-11-28 07:53:30 EST ---

REVIEW: http://review.gluster.org/15947 (common-HA: Increase timeout for portblock RA of action=unblock) posted (#2) for review on master by soumya k (skoduri)

--- Additional comment from Worker Ant on 2016-11-29 11:46:23 EST ---

REVIEW: http://review.gluster.org/15947 (common-HA: Increase timeout for portblock RA of action=unblock) posted (#3) for review on master by soumya k (skoduri)

--- Additional comment from Worker Ant on 2016-12-01 05:46:37 EST ---

COMMIT: http://review.gluster.org/15947 committed in master by Kaleb KEITHLEY (kkeithle) 
------
commit 1b2b5be970f78cc32069516fa347d9943dc17d3e
Author: Soumya Koduri <skoduri>
Date:   Mon Nov 28 17:56:35 2016 +0530

    common-HA: Increase timeout for portblock RA of action=unblock
    
    Portblock RA of action type unblock stores the information about
    the client/server IPs connection in tickle_dir folder created in
    the shared storage. In case of node shutdown/reboot there could be
    cases wherein shared_storage may become unavailable for sometime.
    Hence increase the timeout to avoid that resource agent going into
    FAILED state.
    
    Change-Id: I4f98f819895cb164c3a82ba8084c7c11610f35ff
    BUG: 1399154
    Signed-off-by: Soumya Koduri <skoduri>
    Reviewed-on: http://review.gluster.org/15947
    Smoke: Gluster Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>
    Reviewed-by: Niels de Vos <ndevos>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: jiffin tony Thottan <jthottan>
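
The reasoning in the commit message can be illustrated with a toy model: an operation that needs the shared storage only succeeds if its timeout outlasts the period during which the storage is unavailable. This is a sketch under stated assumptions, not the portblock resource agent's code; `wait_for_dir`, its parameters, and the injectable `exists`/`clock`/`sleep` hooks are all hypothetical names introduced here.

```python
import os
import time


def wait_for_dir(path, timeout_s, poll_s=0.5, exists=os.path.isdir,
                 clock=time.monotonic, sleep=time.sleep):
    """Poll until `path` is a directory or `timeout_s` seconds have elapsed.

    Returns True if the directory became available in time, False on timeout.
    The exists/clock/sleep hooks are there only so the behaviour can be
    simulated without real waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if exists(path):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

In this model, if the shared storage stays unavailable for 30 seconds, a call with a 20-second limit (roughly where the stop action above gave up, exec=20004ms) returns False, while a call with a longer limit returns True once the storage is back.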

Comment 1 Worker Ant 2016-12-01 13:22:10 UTC
REVIEW: http://review.gluster.org/15994 (common-HA: Increase timeout for portblock RA of action=unblock) posted (#1) for review on release-3.9 by soumya k (skoduri)

Comment 2 Worker Ant 2016-12-01 18:36:56 UTC
COMMIT: http://review.gluster.org/15994 committed in release-3.9 by Kaleb KEITHLEY (kkeithle) 
------
commit b60607e88d27295bb61e0365e2b23bf54af109f3
Author: Soumya Koduri <skoduri>
Date:   Mon Nov 28 17:56:35 2016 +0530

    common-HA: Increase timeout for portblock RA of action=unblock
    
    Portblock RA of action type unblock stores the information about
    the client/server IPs connection in tickle_dir folder created in
    the shared storage. In case of node shutdown/reboot there could be
    cases wherein shared_storage may become unavailable for sometime.
    Hence increase the timeout to avoid that resource agent going into
    FAILED state.
    
    This is backport of below mainline patch -
      - http://review.gluster.org/15947
    
    >Change-Id: I4f98f819895cb164c3a82ba8084c7c11610f35ff
    >BUG: 1399154
    >Signed-off-by: Soumya Koduri <skoduri>
    >Reviewed-on: http://review.gluster.org/15947
    >Smoke: Gluster Build System <jenkins.org>
    >CentOS-regression: Gluster Build System <jenkins.org>
    >Reviewed-by: Kaleb KEITHLEY <kkeithle>
    >Reviewed-by: Niels de Vos <ndevos>
    >NetBSD-regression: NetBSD Build System <jenkins.org>
    >Reviewed-by: jiffin tony Thottan <jthottan>
    >(cherry picked from commit 1b2b5be970f78cc32069516fa347d9943dc17d3e)
    
    Change-Id: I8e590a1b21c2a73d324e0a20c17378c593c5ebd5
    BUG: 1400546
    Signed-off-by: Soumya Koduri <skoduri>
    Reviewed-on: http://review.gluster.org/15994
    Reviewed-by: jiffin tony Thottan <jthottan>
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Kaleb KEITHLEY <kkeithle>

Comment 3 Kaushal 2017-03-08 10:20:15 UTC
This bug is being closed because a release that should address the reported issue is now available. If the problem is still not fixed with glusterfs-3.9.1, please open a new bug report.

glusterfs-3.9.1 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-January/029725.html
[2] https://www.gluster.org/pipermail/gluster-users/

