Description of problem:
-----------------------
I tried setting up Ganesha (v2.3.1-4) on RHGS 3.1.3 layered over RHEL 6.8. Everything goes through fine (cluster setup, authentication, Ganesha enabling, etc.), but pcs status shows the nfs_setup clone and the cluster IP resources as "Stopped":

*Snippet from distaf logs*:

2016-05-06 23:30:50,961 INFO run root.lab.eng.bos.redhat.com (cp): pcs status
2016-05-06 23:30:54,313 INFO run RETCODE: 0
2016-05-06 23:30:54,314 INFO run STDOUT: Cluster name: G1462557101.26
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Fri May 6 14:00:51 2016
Last change: Fri May 6 14:00:17 2016 by root via cibadmin on gqas001.sbu.lab.eng.bos.redhat.com
Stack: cman
Current DC: gqas015.sbu.lab.eng.bos.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ gqas001.sbu.lab.eng.bos.redhat.com gqas014.sbu.lab.eng.bos.redhat.com gqas015.sbu.lab.eng.bos.redhat.com gqas016.sbu.lab.eng.bos.redhat.com ]
 gqas001.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 gqas014.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 gqas015.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 gqas016.sbu.lab.eng.bos.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped

PCSD Status:
  gqas001.sbu.lab.eng.bos.redhat.com: Online
  gqas014.sbu.lab.eng.bos.redhat.com: Online
  gqas015.sbu.lab.eng.bos.redhat.com: Online
  gqas016.sbu.lab.eng.bos.redhat.com: Online

I tried downgrading the versions of pacemaker, cman, pcs and corosync, and that gives a clean automation run; the setup is successful (pcs status looks good).

Version-Release number of selected component (if applicable):
--------------------------------------------------------------
[root@gqas001 yum.repos.d]# rpm -qa|grep cman
cman-3.0.12.1-78.el6.x86_64
[root@gqas001 yum.repos.d]# rpm -qa|grep pcs
pcs-0.9.148-7.el6.x86_64
[root@gqas001 yum.repos.d]# rpm -qa|grep pacemaker
pacemaker-libs-1.1.14-8.el6.x86_64
pacemaker-cli-1.1.14-8.el6.x86_64
pacemaker-cluster-libs-1.1.14-8.el6.x86_64
pacemaker-1.1.14-8.el6.x86_64
[root@gqas001 yum.repos.d]# rpm -qa|grep corosync
corosync-1.4.7-5.el6.x86_64
corosynclib-1.4.7-5.el6.x86_64
[root@gqas001 yum.repos.d]#

How reproducible:
----------------
3/3

Steps to Reproduce:
-------------------
1. Do a yum install pacemaker cman pcs ccs resource-agents corosync. This fetches the latest versions of all these packages.
2. Run the Ganesha setup via distaf. It fails with the error above.
3. Downgrade the packages and rerun; the setup then succeeds.

Actual results:
--------------
Ganesha setup fails on the latest versions of the pacemaker, cman, pcs and corosync packages.

Expected results:
-----------------
Ganesha setup should be successful with the latest versions of the pacemaker, cman, pcs and corosync packages.

Additional info:
----------------
Testbed: RHEL 6.8
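For quick triage, the builds that reproduce the issue can be flagged with a short script. In this sketch, `check_installed` is a stub so the example runs stand-alone; on a real node you would replace its body with `rpm -q --quiet "$1"`:

```shell
#!/bin/sh
# Builds that reproduced the issue, taken from the rpm -qa output above.
BAD_BUILDS="pacemaker-1.1.14-8.el6 pcs-0.9.148-7.el6 cman-3.0.12.1-78.el6 corosync-1.4.7-5.el6"

# Stub for illustration only; on a real node use: rpm -q --quiet "$1"
check_installed() {
    case "$1" in
        pacemaker-1.1.14-8.el6|pcs-0.9.148-7.el6) return 0 ;;
        *) return 1 ;;
    esac
}

# Report any affected build that is present.
for build in $BAD_BUILDS; do
    if check_installed "$build"; then
        echo "affected build installed: $build"
    fi
done
```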
I upgraded my setup with the same versions and I am able to reproduce the issue:

[root@dhcp43-33 ~]# rpm -qa|grep pacemaker
pacemaker-1.1.14-8.el6.x86_64
pacemaker-libs-1.1.14-8.el6.x86_64
pacemaker-cli-1.1.14-8.el6.x86_64
pacemaker-cluster-libs-1.1.14-8.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep pcs
pcsc-lite-libs-1.5.2-15.el6.x86_64
pcs-0.9.148-7.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep cman
cman-3.0.12.1-78.el6.x86_64
[root@dhcp43-33 ~]# rpm -qa|grep corosync
corosync-1.4.7-5.el6.x86_64
corosynclib-1.4.7-5.el6.x86_64

The Ganesha setup is successful, but if we check pcs status, the resources are shown in Stopped state with the message "WARNING: no stonith devices and stonith-enabled is not false":

[root@dhcp43-33 ~]# pcs status
Cluster name: G1462802414.82
WARNING: no stonith devices and stonith-enabled is not false
Last updated: Tue May 10 01:06:25 2016
Last change: Tue May 10 01:01:23 2016 by root via cibadmin on dhcp43-33.lab.eng.blr.redhat.com
Stack: cman
Current DC: dhcp43-33.lab.eng.blr.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 dhcp43-33.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 dhcp43-40.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 dhcp42-11.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
 dhcp42-78.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped

PCSD Status:
  dhcp43-33.lab.eng.blr.redhat.com: Online
  dhcp43-40.lab.eng.blr.redhat.com: Online
  dhcp42-11.lab.eng.blr.redhat.com: Online
  dhcp42-78.lab.eng.blr.redhat.com: Online

All the corresponding services are up and running:

[root@dhcp43-33 yum.repos.d]# service nfs-ganesha status
ganesha.nfsd (pid 14058) is running...
[root@dhcp43-33 yum.repos.d]# service pcsd status
pcsd (pid 14809) is running...
[root@dhcp43-33 yum.repos.d]# service pacemaker status
pacemakerd (pid 14755) is running...
[root@dhcp43-33 yum.repos.d]# service corosync status
corosync (pid 14472) is running...

The volume can be exported properly, but it cannot be mounted. The following messages can be seen in /var/log/messages:

May 10 01:01:23 dhcp43-33 pengine[14765]: error: Resource start-up disabled since no STONITH resources have been defined
May 10 01:01:23 dhcp43-33 pengine[14765]: error: Either configure some or disable STONITH with the stonith-enabled option
May 10 01:01:23 dhcp43-33 pengine[14765]: error: NOTE: Clusters with shared data need STONITH to ensure data integrity

>>>> pcs property shows that stonith-enabled is not set at all:

[root@dhcp43-33 ~]# pcs property show stonith-enabled
Cluster Properties:

>>>> Tried setting it manually:

[root@dhcp43-33 ~]# pcs property set stonith-enabled=false
[root@dhcp43-33 ~]# pcs property show stonith-enabled
Cluster Properties:
 stonith-enabled: false

After that, pcs status shows the status properly:

[root@dhcp43-33 ~]# pcs status
Cluster name: G1462802414.82
Last updated: Tue May 10 01:19:18 2016
Last change: Tue May 10 01:18:21 2016 by root via cibadmin on dhcp43-33.lab.eng.blr.redhat.com
Stack: cman
Current DC: dhcp43-33.lab.eng.blr.redhat.com (version 1.1.14-8.el6-70404b0) - partition with quorum
4 nodes and 16 resources configured

Online: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com
dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp42-11.lab.eng.blr.redhat.com dhcp42-78.lab.eng.blr.redhat.com dhcp43-33.lab.eng.blr.redhat.com dhcp43-40.lab.eng.blr.redhat.com ]
 dhcp43-33.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-33.lab.eng.blr.redhat.com
 dhcp43-40.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp43-40.lab.eng.blr.redhat.com
 dhcp42-11.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-11.lab.eng.blr.redhat.com
 dhcp42-78.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp42-78.lab.eng.blr.redhat.com

PCSD Status:
  dhcp43-33.lab.eng.blr.redhat.com: Online
  dhcp43-40.lab.eng.blr.redhat.com: Online
  dhcp42-11.lab.eng.blr.redhat.com: Online
  dhcp42-78.lab.eng.blr.redhat.com: Online

After this I am able to mount the volume and perform I/O from the mount point.
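The manual workaround above boils down to two commands. In this sketch, `pcs` is a tiny stub so the example runs stand-alone; on a real cluster node you would delete the stub and run the same two commands against the live CIB:

```shell
#!/bin/sh
# Stub standing in for the real pcs CLI, for illustration only;
# remove it on a real node.
pcs() {
    if [ "$1 $2" = "property set" ]; then
        # Remember the value given as stonith-enabled=<value>.
        STONITH_VALUE="${3#stonith-enabled=}"
    elif [ "$1 $2" = "property show" ]; then
        echo "Cluster Properties:"
        if [ -n "$STONITH_VALUE" ]; then
            echo " stonith-enabled: $STONITH_VALUE"
        fi
    fi
}

# Workaround from the comment above: disable STONITH, then verify.
pcs property set stonith-enabled=false
pcs property show stonith-enabled
```

On a real node, `pcs status` should then show the cluster IP resources moving to Started, as in the output above.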
Changing the title accordingly, as setting up Ganesha doesn't fail.
I set up a four node cluster with RHEL 6.8 Beta. I used the default HA components available from the 6.8 HA channel, i.e.:

pcs-0.9.139-9.el6_7.2.x86_64
pacemaker-1.1.12-8.el6_7.2.x86_64
corosync-1.4.7-5.el6.x86_64
cman-3.0.12.1-78.el6.x86_64

I have nothing in my logs about failing to set stonith.

We did have an issue with some of our RHEL 7 installs getting older versions than what was in the HA channel, and I requested that we be sure we were getting the correct (latest) versions. But I'm not sure why we're trying to use newer versions than what's in the HA channel.

Requesting input from Ken Gaillot or Andy Beekhof. (Too bad there's no way to put needinfo on more than one person.)
To be clear(er): I'm not sure why we're trying to use newer versions than what's in the HA channel for RHEL 6.
RHEL 6.8 does have 1.1.14-8; not sure why the channel isn't showing that.
I tried installing/updating the pcs and pacemaker packages on an ISO-installed RHGS 3.1.2 after subscribing to the RHEL 6 HA channel. It pulls the latest versions of pcs and pacemaker:

subscription-manager repos --enable=rhel-6-server-rpms --enable=rhel-scalefs-for-rhel-6-server-rpms --enable=rhs-3-for-rhel-6-server-rpms --enable=rh-gluster-3-nfs-for-rhel-6-server-rpms --enable=rhel-ha-for-rhel-6-server-rpms

--------------------------------------------------------------------------
[root@dhcp43-67 yum.repos.d]# yum install pacemaker

Dependencies Resolved

=====================================================================================================================================================
 Package                         Arch         Version                   Repository                        Size
=====================================================================================================================================================
Installing:
 pacemaker                       x86_64       1.1.14-8.el6              rhel-ha-for-rhel-6-server-rpms    461 k
Installing for dependencies:
 cifs-utils                      x86_64       4.8.1-20.el6              rhel-6-server-rpms                 65 k
 libqb                           x86_64       0.17.1-2.el6              rhel-ha-for-rhel-6-server-rpms     71 k
 libtool-ltdl                    x86_64       2.2.6-15.5.el6            rhel-6-server-rpms                 44 k
 pacemaker-cli                   x86_64       1.1.14-8.el6              rhel-ha-for-rhel-6-server-rpms    230 k
 pacemaker-cluster-libs          x86_64       1.1.14-8.el6              rhel-ha-for-rhel-6-server-rpms     84 k
 pacemaker-libs                  x86_64       1.1.14-8.el6              rhel-ha-for-rhel-6-server-rpms    478 k
 perl-TimeDate                   noarch       1:1.16-13.el6             rhel-6-server-rpms                 37 k
 resource-agents                 x86_64       3.9.5-34.el6_8.2          rhel-ha-for-rhel-6-server-rpms    386 k
 samba-common                    x86_64       3.6.509-169.6.el6rhs      rhs-3-for-rhel-6-server-rpms       10 M
 samba-winbind                   x86_64       3.6.509-169.6.el6rhs      rhs-3-for-rhel-6-server-rpms      2.2 M
 samba-winbind-clients           x86_64       3.6.509-169.6.el6rhs      rhs-3-for-rhel-6-server-rpms      2.0 M
-----------------------------------------------------------------------------
[root@dhcp43-67 yum.repos.d]# yum update pcs

Dependencies Resolved
=====================================================================================================================================================
 Package                         Arch         Version                   Repository                        Size
=====================================================================================================================================================
Updating:
 pcs                             x86_64       0.9.148-7.el6             rhel-ha-for-rhel-6-server-rpms    5.3 M
Updating for dependencies:
 python-clufter                  x86_64       0.56.2-1.el6              rhel-ha-for-rhel-6-server-rpms    352 k
-----------------------------------------------------------------------------

So we need the fix for this bug; otherwise, all customers updating to the pcs and pacemaker packages available in the RHEL 6 base HA channel will hit this issue.

Proposing it as a blocker for 3.1.3.
It's actually the previous behavior that could be considered a bug; Red Hat does not support HA clusters without properly configured fencing. The correct fix for this issue is to configure and test fencing devices.
Verified this bug with the latest glusterfs-3.7.9-7 and nfs-ganesha-2.3.1-7 builds, and the issue reported in this bug is fixed.

After setting up the Ganesha cluster, the stonith-enabled value was previously not getting set, because of which the resources remained in Stopped state. With the latest build it is getting set, and no issues related to stonith-enabled are seen:

[root@dhcp43-119 ~]# pcs property show
Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.14-8.el6-70404b0
 have-watchdog: false
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 dhcp42-33.lab.eng.blr.redhat.com: grace-active=1
 dhcp43-119.lab.eng.blr.redhat.com: grace-active=1

However, there is a new bug for RHEL 6.8 which still leaves the nodes in stopped state, along with other grace-related failures; it is being tracked under the bug below:

https://bugzilla.redhat.com/show_bug.cgi?id=1341567

Based on the above observation, marking this bug as Verified.
Requested doc text provided.
The user doesn't need to wait. (This isn't a user-visible change, per se.) The setup process (initiated by issuing a `gluster nfs-ganesha enable` command) has been fixed so that it waits as necessary. I've made a slight change to the doc text. Otherwise it looks fine.
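For illustration, "waits as necessary" amounts to polling the CIB until the property is visible before continuing. The sketch below is an assumption about the shape of that wait, not the shipped code: `pcs` is stubbed (it only reports the property from the third call onward) so the loop runs stand-alone, and the retry count and sleep are made up.

```shell
#!/bin/sh
# Stub pcs: pretend the property only becomes visible on the 3rd poll.
# The call count lives in a temp file because pipelines run in subshells.
COUNT_FILE=$(mktemp)
echo 0 > "$COUNT_FILE"
pcs() {
    n=$(($(cat "$COUNT_FILE") + 1))
    echo "$n" > "$COUNT_FILE"
    echo "Cluster Properties:"
    if [ "$n" -ge 3 ]; then
        echo " stonith-enabled: false"
    fi
}

# Hypothetical wait loop: retry until stonith-enabled=false is visible.
wait_for_stonith_disabled() {
    tries=0
    while [ "$tries" -lt 10 ]; do
        if pcs property show stonith-enabled | grep -q "stonith-enabled: false"; then
            return 0
        fi
        tries=$((tries + 1))
        # sleep 1   # a real script would pause between polls
    done
    return 1
}

wait_for_stonith_disabled && echo "stonith-enabled is set; safe to proceed"
rm -f "$COUNT_FILE"
```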
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240