Bug 1334092

Summary:	[NFS-Ganesha] : stonith-enabled option not set with new versions of cman,pacemaker,corosync and pcs
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Ambarish <asoman>
Component:	nfs-ganesha	Assignee:	Kaleb KEITHLEY <kkeithle>
Status:	CLOSED ERRATA	QA Contact:	Shashank Raj <sraj>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.1	CC:	asoman, asrivast, jthottan, kgaillot, kkeithle, ndevos, nlevinki, rcyriac, rhinduja, sashinde, skoduri, sraj
Target Milestone:	---	Keywords:	ZStream
Target Release:	RHGS 3.1.3
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:	glusterfs-3.7.9-7	Doc Type:	Bug Fix
Doc Text:	This update includes a new version of Pacemaker that contains changes related to the selection of the Designated Co-ordinator (DC). This updated Pacemaker version caused attempts to set the stonith-enabled property to fail, which meant that set-up and operation did not behave as expected. The setup process now waits for DC selection to complete before setting the stonith-enabled property and continuing with the remainder of the setup.	Story Points:	---
Clone Of:
Clones:	1336945		Environment:
Last Closed:	2016-06-23 05:21:34 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1311817, 1336945, 1336947, 1336948

Description Ambarish 2016-05-08 10:02:36 UTC

Description of problem:
-----------------------

I tried setting up Ganesha (v2.3.1-4) on RHGS 3.1.3 layered over RHEL 6.8. Everything goes through fine (cluster setup, authentication, enabling Ganesha, etc.), but pcs status shows the nfs_setup clone and the cluster_ip resources as "Stopped" on all nodes:

*Snippet from distaf logs* :

2016-05-06 23:30:50,961 INFO run root.lab.eng.bos.redhat.com (cp): pcs status
2016-05-06 23:30:54,313 INFO run RETCODE: 0
2016-05-06 23:30:54,314 INFO run STDOUT:
Cluster name: G1462557101.26
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Fri May  6 14:00:51 2016		Last change: Fri May  6 14:00:17 2016 by root via cibadmin on gqas001.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas015.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 gqas001.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
 gqas014.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
 gqas015.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped
 gqas016.sbu.lab.eng.bos.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Stopped

PCSD Status:
  gqas001.sbu.lab.eng.bos.redhat.com: Online
  gqas014.sbu.lab.eng.bos.redhat.com: Online
  gqas015.sbu.lab.eng.bos.redhat.com: Online
  gqas016.sbu.lab.eng.bos.redhat.com: Online


I tried downgrading pacemaker, cman, pcs and corosync. With the older versions the automation run is clean and the setup succeeds (pcs status is good).
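
An automation run could detect the symptom described above mechanically. The following is a minimal sketch, not part of distaf or pcs: check_pcs_status is a hypothetical helper that reads saved `pcs status` output on stdin and looks for the two markers visible in the snippet (the stonith warning and any "Stopped" entry).

```shell
# Sketch only: check_pcs_status is a hypothetical helper name.
# It reads `pcs status` output from stdin and returns non-zero when
# the symptom from this report is present.
check_pcs_status() {
    status=$(cat)
    rc=0

    # Warning printed when no fencing device exists and stonith-enabled
    # has not been set to false.
    if printf '%s\n' "$status" | grep -q 'no stonith devices and stonith-enabled is not false'; then
        echo "stonith warning present"
        rc=1
    fi

    # Any clone set or resource reported as Stopped.
    if printf '%s\n' "$status" | grep -q 'Stopped'; then
        echo "stopped resources present"
        rc=1
    fi

    return "$rc"
}
```

It would be used as `pcs status | check_pcs_status`, with the exit code deciding whether the run is treated as a failure.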


Version-Release number of selected component (if applicable):
--------------------------------------------------------------

[root@gqas001 yum.repos.d]# rpm -qa|grep cman
cman-3.0.12.1-78.el6.x86_64

[root@gqas001 yum.repos.d]# rpm -qa|grep pcs
pcs-0.9.148-7.el6.x86_64

[root@gqas001 yum.repos.d]# rpm -qa|grep pacemaker
pacemaker-libs-1.1.14-8.el6.x86_64

pacemaker-cli-1.1.14-8.el6.x86_64
pacemaker-cluster-libs-1.1.14-8.el6.x86_64
pacemaker-1.1.14-8.el6.x86_64

[root@gqas001 yum.repos.d]# rpm -qa|grep corosync
corosync-1.4.7-5.el6.x86_64
corosynclib-1.4.7-5.el6.x86_64
[root@gqas001 yum.repos.d]# 


How reproducible:
----------------

3/3

Steps to Reproduce:
-------------------

1. Run yum install pacemaker cman pcs ccs resource-agents corosync. This fetches the latest versions of all these packages.

2. Run the Ganesha setup via distaf. It fails with the error above.

3. Downgrade the packages and rerun.

Actual results:
--------------

Ganesha setup fails with the latest versions of the pacemaker, cman, pcs and corosync packages.


Expected results:
-----------------

Ganesha setup should succeed with the latest versions of the pacemaker, cman, pcs and corosync packages.

Additional info:
----------------

Testbed : RHEL 6.8

Comment 6 Shashank Raj 2016-05-09 14:27:14 UTC

I upgraded my setup to the same versions and I am able to reproduce the issue:

[root@dhcp43-33 ~]# rpm -qa|grep pacemaker
pacemaker-1.1.14-8.el6.x86_64
pacemaker-libs-1.1.14-8.el6.x86_64
pacemaker-cli-1.1.14-8.el6.x86_64
pacemaker-cluster-libs-1.1.14-8.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep pcs
pcsc-lite-libs-1.5.2-15.el6.x86_64
pcs-0.9.148-7.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep cman
cman-3.0.12.1-78.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep corosync
corosync-1.4.7-5.el6.x86_64
corosynclib-1.4.7-5.el6.x86_64

Ganesha setup is successful, but pcs status shows the nfs_setup clone and the cluster_ip resources as Stopped, with the message "WARNING: no stonith devices and stonith-enabled is not false":

[root@dhcp43-33 ~]# pcs status
Cluster name: G1462802414.82
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue May 10 01:06:25 2016          Last change: Tue May 10 01:01:23 2016 by root via cibadmin on dhcp43-33.lab.eng.blr.redhat.com
Stack: cman
Current DC: dhcp43-33.lab.eng.blr.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 dhcp43-33.lab.eng.blr.redhat.com-cluster_ip-1  (ocf::heartbeat:IPaddr):        Stopped
 dhcp43-40.lab.eng.blr.redhat.com-cluster_ip-1  (ocf::heartbeat:IPaddr):        Stopped
 dhcp42-11.lab.eng.blr.redhat.com-cluster_ip-1  (ocf::heartbeat:IPaddr):        Stopped
 dhcp42-78.lab.eng.blr.redhat.com-cluster_ip-1  (ocf::heartbeat:IPaddr):        Stopped

PCSD Status:
  dhcp43-33.lab.eng.blr.redhat.com: Online
  dhcp43-40.lab.eng.blr.redhat.com: Online
  dhcp42-11.lab.eng.blr.redhat.com: Online
  dhcp42-78.lab.eng.blr.redhat.com: Online

All the corresponding services are up and running as below:

[root@dhcp43-33 yum.repos.d]# service nfs-ganesha status
ganesha.nfsd (pid  14058) is running...
[root@dhcp43-33 yum.repos.d]# service pcsd status
pcsd (pid  14809) is running...
[root@dhcp43-33 yum.repos.d]# service pacemaker status
pacemakerd (pid  14755) is running...
[root@dhcp43-33 yum.repos.d]# service corosync status
corosync (pid  14472) is running...

The volume can be exported properly, but it cannot be mounted.

The following messages can be seen in /var/log/messages:

May 10 01:01:23 dhcp43-33 pengine[14765]:    error: Resource start-up disabled since no STONITH resources have been defined
May 10 01:01:23 dhcp43-33 pengine[14765]:    error: Either configure some or disable STONITH with the stonith-enabled option
May 10 01:01:23 dhcp43-33 pengine[14765]:    error: NOTE: Clusters with shared data need STONITH to ensure data integrity
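
To pull these messages out of a busy log, something like the following could be used. This is a rough sketch: find_stonith_errors is a hypothetical helper, and the pattern is written only against the three pengine lines quoted above, so other message formats may not match.

```shell
# Sketch only: find_stonith_errors is a hypothetical helper name.
# Prints pengine error lines that mention STONITH from a syslog file.
# The default path is the /var/log/messages location quoted above.
find_stonith_errors() {
    logfile=${1:-/var/log/messages}
    grep -E 'pengine\[[0-9]+\]: +error: .*STONITH' "$logfile"
}
```

Calling `find_stonith_errors` with no argument scans /var/log/messages; a different file can be passed as the first argument.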

>>>> pcs property show returns no value for stonith-enabled:

[root@dhcp43-33 ~]# pcs property show stonith-enabled
Cluster Properties:

>>>> Tried setting it manually:

[root@dhcp43-33 ~]# pcs property set stonith-enabled=false
[root@dhcp43-33 ~]# pcs property show stonith-enabled
Cluster Properties:
 stonith-enabled: false

After that, pcs status shows the resources as started:

[root@dhcp43-33 ~]# pcs status
Cluster name: G1462802414.82
Last updated: Tue May 10 01:19:18 2016		Last change: Tue May 10 01:18:21 2016 by root via cibadmin on dhcp43-33.lab.eng.blr.redhat.com
Stack: cman
Current DC: dhcp43-33.lab.eng.blr.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 dhcp43-33.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp43-33.lab.eng.blr.redhat.com
 dhcp43-40.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp43-40.lab.eng.blr.redhat.com
 dhcp42-11.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-11.lab.eng.blr.redhat.com
 dhcp42-78.lab.eng.blr.redhat.com-cluster_ip-1	(ocf::heartbeat:IPaddr):	Started dhcp42-78.lab.eng.blr.redhat.com

PCSD Status:
  dhcp43-33.lab.eng.blr.redhat.com: Online
  dhcp43-40.lab.eng.blr.redhat.com: Online
  dhcp42-11.lab.eng.blr.redhat.com: Online
  dhcp42-78.lab.eng.blr.redhat.com: Online

I am then able to mount the volume and perform I/O from the mount point.
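
The manual check-and-set sequence in this comment can be sketched as a small function. This is an illustration only, not the fix that shipped: ensure_stonith_disabled and the PCS override variable are hypothetical names, and the logic assumes that `pcs property show stonith-enabled` prints a " stonith-enabled:" line only when the property has been set, as in the two outputs above. Note that a later comment on this bug points out that Red Hat does not support HA clusters without properly configured fencing.

```shell
# Sketch only: ensure_stonith_disabled and PCS are hypothetical names.
# PCS defaults to the real pcs binary and can be pointed at a stub.
ensure_stonith_disabled() {
    pcs_cmd=${PCS:-pcs}

    # An unset property shows only the "Cluster Properties:" header,
    # so the presence of a value line means it is already set.
    if "$pcs_cmd" property show stonith-enabled | grep -q 'stonith-enabled:'; then
        echo "stonith-enabled already set"
    else
        "$pcs_cmd" property set stonith-enabled=false
    fi
}
```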

Comment 7 Shashank Raj 2016-05-09 14:30:14 UTC

Changing the title accordingly, since setting up Ganesha doesn't fail.

Comment 9 Kaleb KEITHLEY 2016-05-09 18:03:12 UTC

I set up a four node cluster with RHEL 6.8Beta. I used the default HA components available from the 6.8 HA channel, i.e. 
  pcs-0.9.139-9.el6_7.2.x86_64
  pacemaker-1.1.12-8.el6_7.2.x86_64
  corosync-1.4.7-5.el6.x86_64
  cman-3.0.12.1-78.el6.x86_64

I have nothing in my logs about failing to set stonith.

We did have an issue with some of our RHEL7 installs getting older versions than what was in the HA channel, and I requested that we make sure we were getting the correct (latest) versions.

But I'm not sure why we're trying to use newer versions than what's in the HA channel.

Requesting input from Ken Gaillot or Andy Beekhof. (Too bad there's no way to put needinfo on more than one person.)

Comment 10 Kaleb KEITHLEY 2016-05-09 18:07:19 UTC

To be clear(er):

But I'm not sure why we're trying to use newer versions than what's in the HA channel for RHEL6.

Comment 11 Ken Gaillot 2016-05-09 19:03:43 UTC

RHEL 6.8 does have 1.1.14-8; not sure why the channel isn't showing that.

Comment 13 Shashank Raj 2016-05-17 12:04:47 UTC

I tried installing/updating the pcs and pacemaker packages on an ISO-installed RHGS 3.1.2 system after subscribing to the RHEL-6 HA channel. It pulls the latest versions of pcs and pacemaker:

subscription-manager repos --enable=rhel-6-server-rpms --enable=rhel-scalefs-for-rhel-6-server-rpms --enable=rhs-3-for-rhel-6-server-rpms --enable=rh-gluster-3-nfs-for-rhel-6-server-rpms --enable=rhel-ha-for-rhel-6-server-rpms

--------------------------------------------------------------------------
[root@dhcp43-67 yum.repos.d]# yum install pacemaker


Dependencies Resolved

=====================================================================================================================================================
 Package                               Arch                  Version                             Repository                                     Size
=====================================================================================================================================================
Installing:
 pacemaker                             x86_64                1.1.14-8.el6                        rhel-ha-for-rhel-6-server-rpms                461 k
Installing for dependencies:
 cifs-utils                            x86_64                4.8.1-20.el6                        rhel-6-server-rpms                             65 k
 libqb                                 x86_64                0.17.1-2.el6                        rhel-ha-for-rhel-6-server-rpms                 71 k
 libtool-ltdl                          x86_64                2.2.6-15.5.el6                      rhel-6-server-rpms                             44 k
 pacemaker-cli                         x86_64                1.1.14-8.el6                        rhel-ha-for-rhel-6-server-rpms                230 k
 pacemaker-cluster-libs                x86_64                1.1.14-8.el6                        rhel-ha-for-rhel-6-server-rpms                 84 k
 pacemaker-libs                        x86_64                1.1.14-8.el6                        rhel-ha-for-rhel-6-server-rpms                478 k
 perl-TimeDate                         noarch                1:1.16-13.el6                       rhel-6-server-rpms                             37 k
 resource-agents                       x86_64                3.9.5-34.el6_8.2                    rhel-ha-for-rhel-6-server-rpms                386 k
 samba-common                          x86_64                3.6.509-169.6.el6rhs                rhs-3-for-rhel-6-server-rpms                   10 M
 samba-winbind                         x86_64                3.6.509-169.6.el6rhs                rhs-3-for-rhel-6-server-rpms                  2.2 M
 samba-winbind-clients                 x86_64                3.6.509-169.6.el6rhs                rhs-3-for-rhel-6-server-rpms                  2.0 M

-----------------------------------------------------------------------------

[root@dhcp43-67 yum.repos.d]# yum update pcs

Dependencies Resolved

=====================================================================================================================================================
 Package                          Arch                     Version                            Repository                                        Size
=====================================================================================================================================================
Updating:
 pcs                              x86_64                   0.9.148-7.el6                      rhel-ha-for-rhel-6-server-rpms                   5.3 M
Updating for dependencies:
 python-clufter                   x86_64                   0.56.2-1.el6                       rhel-ha-for-rhel-6-server-rpms                   352 k

-----------------------------------------------------------------------------

So we need the fix for this bug; otherwise all customers who update to the pcs and pacemaker packages available in the RHEL6 base HA channel will hit this issue.

Proposing it as a blocker for 3.1.3.

Comment 14 Ken Gaillot 2016-05-17 17:48:17 UTC

It's actually the previous behavior that could be considered a bug; Red Hat does not support HA clusters without properly configured fencing. The correct fix for this issue is to configure and test fencing devices.

Comment 20 Shashank Raj 2016-06-01 11:30:43 UTC

Verified this bug with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds; the issue reported in this bug is fixed.

Earlier, after setting up the Ganesha cluster, the stonith-enabled value was not being set, which left the resources in the Stopped state. With the latest build it is set, and no issues related to stonith-enabled are seen, as shown below:

[root@dhcp43-119 ~]# pcs property show
Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6-70404b0
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 dhcp42-33.lab.eng.blr.redhat.com: grace-active=1
 dhcp43-119.lab.eng.blr.redhat.com: grace-active=1

However, there is a new bug on RHEL 6.8 that still leaves nodes in the stopped state, along with other grace-related failures; it is being tracked under the bug below:

https://bugzilla.redhat.com/show_bug.cgi?id=1341567


Based on the above observations, marking this bug as Verified.

Comment 22 Kaleb KEITHLEY 2016-06-02 12:16:29 UTC

Requested doc text provided.

Comment 24 Kaleb KEITHLEY 2016-06-09 09:56:56 UTC

The user doesn't need to wait. (This isn't a user visible change, per se.)

The setup process (initiated by issuing a `gluster nfs-ganesha enable` command) has been fixed so that it waits as necessary.

I've made a slight change to the doc text. Otherwise it looks fine.
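
For readers who want a picture of what "waits as necessary" could look like, here is a minimal sketch. It is not the code shipped in the fix: wait_for_dc and STATUS_CMD are hypothetical names, and the check assumes that a cluster without an elected DC either omits the "Current DC:" line from its status output or reports NONE there.

```shell
# Sketch only: wait_for_dc and STATUS_CMD are hypothetical names.
# Polls the cluster status once per second until a DC is reported
# or the timeout (in seconds, default 60) expires.
wait_for_dc() {
    timeout=${1:-60}
    status_cmd=${STATUS_CMD:-pcs status}
    elapsed=0

    while [ "$elapsed" -lt "$timeout" ]; do
        # A "Current DC:" line naming anything other than NONE is
        # taken to mean the election has finished.
        if $status_cmd 2>/dev/null | grep 'Current DC:' | grep -qv 'NONE'; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done

    return 1
}
```

A setup script could then run `wait_for_dc 60 && pcs property set stonith-enabled=false` so that the property is written only once a DC exists.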

Comment 27 errata-xmlrpc 2016-06-23 05:21:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1240