Bug 1398261 - After ganesha node reboot/shutdown, portblock process goes to FAILED state
Summary: After ganesha node reboot/shutdown, portblock process goes to FAILED state
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: nfs-ganesha
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Soumya Koduri
QA Contact: Arthy Loganathan
URL:
Whiteboard:
Depends On: 1399154 1400546
Blocks: 1351528
 
Reported: 2016-11-24 11:09 UTC by Arthy Loganathan
Modified: 2017-03-23 05:50 UTC
CC List: 9 users

Fixed In Version: glusterfs-3.8.4-7
Doc Type: Known Issue
Doc Text:
Clone Of:
Clones: 1399154
Environment:
Last Closed: 2017-03-23 05:50:51 UTC


Attachments


Links
System ID: Red Hat Product Errata RHSA-2017:0486
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update
Last Updated: 2017-03-23 09:18:45 UTC

Description Arthy Loganathan 2016-11-24 11:09:02 UTC
Description of problem:
After a ganesha node reboot, the portblock process goes to FAILED state.

In a four-node cluster, if one of the nodes is rebooted or shut down, the portblock process on any one of the nodes (no particular node) goes to FAILED state.

Even after the shut-down/rebooted node is brought back up, failback does not happen while the portblock process remains in FAILED state.


Version-Release number of selected component (if applicable):
nfs-ganesha-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64

How reproducible:
Consistent

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Reboot one of the nodes.
3. Check pcs status (see the command sketch below).
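
A minimal shell sketch of the reproduction check (node names are placeholders for the cluster nodes shown in the output below):

  # reboot one cluster node, then on any surviving node:
  pcs status | grep -E 'portblock|FAILED'
  # the failure mode reported here looks like:
  #   <rebooted-node>-nfs_unblock  (ocf::heartbeat:portblock):  FAILED <peer-node> (blocked)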

Actual results:
The portblock process goes to FAILED state in pcs status output.

Expected results:
All the processes should be up and running.

Additional info:

[root@dhcp46-139 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-124.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Thu Nov 24 13:12:49 2016		Last change: Thu Nov 24 12:32:19 2016 by root via cibadmin on dhcp46-111.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-111.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	FAILED dhcp46-124.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-115.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-115.lab.eng.blr.redhat.com
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-115.lab.eng.blr.redhat.com
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-139.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-139.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-139.lab.eng.blr.redhat.com
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp46-124.lab.eng.blr.redhat.com
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock	(ocf::heartbeat:portblock):	Started dhcp46-124.lab.eng.blr.redhat.com

Failed Actions:
* dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-124.lab.eng.blr.redhat.com 'unknown error' (1): call=83, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=20004ms
* dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-124.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=0ms
* dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-115.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:40 2016', queued=0ms, exec=0ms
* dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-139.lab.eng.blr.redhat.com 'unknown error' (1): call=71, status=Timed Out, exitreason='none',
    last-rc-change='Thu Nov 24 13:09:41 2016', queued=0ms, exec=0ms
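
Note that exec=20004ms on the nfs_unblock stop operation shows it ran into a roughly 20-second operation timeout, which is what the fix in comment 12 raises. A sketch for inspecting the configured operation timeouts (the exact subcommand depends on the pcs version; newer releases use 'pcs resource config'):

  # show the resource definition, including op start/stop/monitor timeouts
  pcs resource show dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock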


Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@dhcp46-139 ~]# 


lrmd log snippet (portblock monitor stderr on dhcp46-124):
----------------------------------------------------------

Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0+0 records in ]
Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0+0 records out ]
Nov 24 13:11:13 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15272:stderr [ 0 bytes (0 B) copied, 0.0739975 s, 0.0 kB/s ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0+0 records in ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0+0 records out ]
Nov 24 13:11:23 dhcp46-124 lrmd[25436]:  notice: dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000:15428:stderr [ 0 bytes (0 B) copied, 0.0539065 s, 0.0 kB/s ]

Comment 11 Arthy Loganathan 2016-11-28 10:30:41 UTC
I have tried rebooting the nodes (a different node each time) on which the shared_storage bricks are present. The issue was not seen (5/5 attempts) on the 10.70.46.42 cluster setup.

Comment 12 Soumya Koduri 2016-11-28 12:54:11 UTC
Thanks Oyvind and Arthy. Posted a fix upstream to increase the timeout of the unblock RA to 60s during creation.

http://review.gluster.org/15947
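
For illustration, a sketch of what the unblock resource creation with the raised timeout might look like; the parameter names follow the ocf:heartbeat:portblock agent, but the VIP, shared-storage path, and group name are placeholders, not the exact downstream change:

  pcs resource create <node>-nfs_unblock ocf:heartbeat:portblock \
      protocol=tcp portno=2049 action=unblock ip=<VIP> \
      tickle_dir=<shared-storage>/nfs-ganesha/tickle_dir/ \
      reset_local_on_unblock_stop=true \
      op start timeout=60s op stop timeout=60s op monitor interval=10s timeout=60s \
      --group <node>-group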

Comment 15 Atin Mukherjee 2016-12-01 13:27:54 UTC
upstream mainline : http://review.gluster.org/15947
upstream 3.9 : http://review.gluster.org/15994
downstream patch : https://code.engineering.redhat.com/gerrit/#/c/91871/

Comment 17 Arthy Loganathan 2016-12-08 07:01:24 UTC
The portblock resource agent comes back to Started state after a node reboot/shutdown.

Verified the fix in builds:
glusterfs-ganesha-3.8.4-7.el7rhgs.x86_64
nfs-ganesha-2.4.1-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-2.el7rhgs.x86_64
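
A quick check to confirm the verified behavior after a node reboot (sketch; every portblock resource should report Started):

  pcs status | grep portblock
  # expect all *-nfs_block and *-nfs_unblock resources in 'Started' state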

Comment 23 errata-xmlrpc 2017-03-23 05:50:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

