Description of problem:
In a 4-node ganesha cluster, if 2 nodes are rebooted, pacemaker loses quorum and starts stopping the cluster services on all nodes. Intermittently, the nfs_unblock resource agent goes to the FAILED state on one of the nodes, and the following message is observed in pcs status:

Failed Actions:
* dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-42.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=92, status=complete, exitreason='none',
    last-rc-change='Tue Nov 29 17:28:10 2016', queued=0ms, exec=103ms

Version-Release number of selected component (if applicable):
glusterfs-ganesha-3.8.4-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64

How reproducible:
Intermittent

Steps to Reproduce:
1. Create a 4-node ganesha cluster.
2. Reboot 2 nodes in the ganesha cluster.

Actual results:
"Insufficient privileges" messages are observed in pcs status for the nfs_unblock resource agent.

Expected results:
No error messages should be seen.

Additional info:

[root@dhcp46-42 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Tue Nov 29 20:07:55 2016
Last change: Tue Nov 29 17:25:21 2016 by root via cibadmin on dhcp46-42.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-42.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Failed Actions:
* dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-42.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=92, status=complete, exitreason='none',
    last-rc-change='Tue Nov 29 17:28:10 2016', queued=0ms, exec=103ms
* dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock_monitor_10000 on dhcp46-42.lab.eng.blr.redhat.com 'unknown error' (1): call=73, status=Timed Out, exitreason='none',
    last-rc-change='Tue Nov 29 17:27:52 2016', queued=0ms, exec=0ms

sosreports are at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/ha_reboot_case/
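For reference, the quorum loss in this scenario follows from the majority rule that corosync's votequorum applies by default: a partition is quorate only when it holds more than half of the expected votes. The sketch below works through that arithmetic, assuming one vote per node and no special options such as two_node or last_man_standing. The function names are illustrative and not part of any cluster tool.

```shell
#!/bin/sh
# Minimum number of votes needed for quorum under a simple majority rule.
# Assumption: one vote per node; quorum_votes/has_quorum are hypothetical names.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Succeeds when a partition with $2 of $1 nodes online is quorate.
has_quorum() {
    [ "$2" -ge "$(quorum_votes "$1")" ]
}

# The case from this report: 4 nodes configured, 2 rebooted.
if has_quorum 4 2; then
    echo "4 nodes, 2 online: partition with quorum"
else
    echo "4 nodes, 2 online: partition WITHOUT quorum"
fi
```

With 4 nodes the threshold is 3 votes, so losing 2 nodes leaves the surviving pair without quorum, while losing only 1 node does not.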
I am hitting this issue even with selinux set to permissive mode.

[root@dhcp46-42 ~]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-42.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Wed Nov 30 19:34:09 2016
Last change: Wed Nov 30 19:25:42 2016 by root via crm_attribute on dhcp46-101.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp47-155.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Started: [ dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-101.lab.eng.blr.redhat.com dhcp47-155.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-101.lab.eng.blr.redhat.com dhcp46-42.lab.eng.blr.redhat.com dhcp47-167.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp47-155.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-42.lab.eng.blr.redhat.com-group
     dhcp46-42.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-42.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp46-42.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
 Resource Group: dhcp46-101.lab.eng.blr.redhat.com-group
     dhcp46-101.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-101.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-101.lab.eng.blr.redhat.com
     dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-101.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp47-155.lab.eng.blr.redhat.com-group
     dhcp47-155.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp47-155.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-42.lab.eng.blr.redhat.com
     dhcp47-155.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp46-42.lab.eng.blr.redhat.com
 Resource Group: dhcp47-167.lab.eng.blr.redhat.com-group
     dhcp47-167.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp47-167.lab.eng.blr.redhat.com
     dhcp47-167.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp47-167.lab.eng.blr.redhat.com
     dhcp47-167.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Started dhcp47-167.lab.eng.blr.redhat.com

Failed Actions:
* dhcp46-101.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-101.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=85, status=complete, exitreason='none',
    last-rc-change='Wed Nov 30 19:34:06 2016', queued=0ms, exec=294ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

[root@dhcp46-42 ~]# getenforce
Permissive
[root@dhcp46-42 ~]#
Arthy,

Could you provide the steps that lead to the above issue? Does it happen only when quorum is lost (it doesn't look like it from the above comment)? Also, is there any functional impact on the ongoing I/Os?

Thanks!
Steps:
1) Create a 4-node ganesha cluster.
2) Create a volume and enable ganesha on that volume. Run IO on the mount point.
3) Reboot the 4th node, where the shared storage brick is not present.
4) Ensure failover happens successfully (the 4th node fails over to the 3rd node).
5) Reboot the 3rd node.

The 'insufficient privileges' message is seen along with a FAILED portblock resource agent. It happens even when quorum is not lost.

IOs are running as expected, since the rebooted nodes fail over to another node. However, even after the nodes are brought back up (i.e., pacemaker and the nfs-ganesha service are started), failback does not happen as expected.

The following messages are observed in /var/log/glusterfs/run-gluster-shared_storage.log on the node that was failed over:

[root@dhcp46-101 ~]# tail -f /var/log/glusterfs/run-gluster-shared_storage.log
[2016-12-01 07:40:48.709169] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 1639: LOOKUP() /nfs-ganesha/tickle_dir/10.70.44.139 => -1 (Input/output error)
[2016-12-01 07:40:58.753192] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 1665: LOOKUP() /nfs-ganesha/tickle_dir/10.70.44.139 => -1 (Input/output error)
[2016-12-01 07:41:08.798482] W [fuse-bridge.c:471:fuse_entry_cbk] 0-glusterfs-fuse: 1692: LOOKUP() /nfs-ganesha/tickle_dir/10.70.44.139 => -1 (Input/output error)

[root@dhcp46-101 ~]# cat /var/run/gluster/shared_storage/nfs-ganesha/tickle_dir/10.70.44.139
cat: /var/run/gluster/shared_storage/nfs-ganesha/tickle_dir/10.70.44.139: Input/output error
sosreports are located at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/ha_reboot_case/new/
Hi Oyvind,

We seem to be hitting this issue intermittently: the portblock resource agent fails with an "insufficient privileges" error. We do not see many errors logged in /var/log/messages apart from this message. Is there any way to increase the log level and get more information about these errors?

Also, could you please suggest how we can recover the cluster state back to normal in case of such errors? Restarting the Pacemaker service seems to take quite a long time (around half an hour or more) when any RA goes to the FAILED state.

Thanks!
That's strange. It sounds like iptables is reporting insufficient privileges for some reason.

"pcs resource cleanup <resource>" should get rid of that status, but try it on a test setup first to ensure it doesn't restart the other resources that have constraints on it.

You can try e.g. "pcs resource debug-monitor --full <resource>" (or any other action) to see what's happening. If the failure is intermittent, you might need to add -x to the #!/bin/sh line at the beginning of the resource agent (/usr/lib/ocf/resource.d/heartbeat/portblock), though this will add lots of "spam" to your logs.
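As a rough illustration of the cleanup suggestion above, the following sketch reads pcs status output on stdin, collects the resources listed under "Failed Actions", and prints (without running) one "pcs resource cleanup" command per resource. It assumes the Failed Actions layout shown in this report, where each entry starts with "* <resource>_<operation>_<interval> on <node>". The helper names failed_resources and print_cleanup_commands are hypothetical.

```shell
#!/bin/sh
# List the resources named under "Failed Actions" in pcs status output (stdin).
# Assumption: entries look like "* <resource>_<op>_<interval> on <node> ...".
failed_resources() {
    awk '$1 == "*" && $3 == "on" { print $2 }' |
        sed 's/_[a-z]*_[0-9]*$//' |   # strip the trailing _<op>_<interval>
        sort -u
}

# Print, but do not execute, a cleanup command for each failed resource.
print_cleanup_commands() {
    failed_resources | while read -r res; do
        echo "pcs resource cleanup $res"
    done
}

# Typical use on a cluster node (review the output before running anything):
#   pcs status | print_cleanup_commands
```

Printing the commands instead of executing them keeps the sketch safe to try on a test setup first, in line with the caution above about resources that have constraints on the failed one.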
Thanks a lot, Oyvind. That seems to be working. I modified the RA and tried rebooting nodes in various orders, and haven't hit the issue. I shall clone this bug to the RHEL/resource-agents component for the fix in the portblock resource agent.

Meanwhile, I tried the workaround that you suggested. The resource agents came back to a normal state.

[root@dhcp46-111 ganesha]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-115.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Wed Dec 7 16:53:23 2016
Last change: Wed Dec 7 14:10:53 2016 by root via cibadmin on dhcp46-111.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Started: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com ]
     Stopped: [ dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Started dhcp46-111.lab.eng.blr.redhat.com
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): FAILED dhcp46-111.lab.eng.blr.redhat.com (blocked)
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Failed Actions:
* dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock_start_0 on dhcp46-115.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=109, status=complete, exitreason='none',
    last-rc-change='Wed Dec 7 14:13:30 2016', queued=0ms, exec=73ms
* dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock_stop_0 on dhcp46-111.lab.eng.blr.redhat.com 'insufficient privileges' (4): call=128, status=complete, exitreason='none',
    last-rc-change='Wed Dec 7 14:42:10 2016', queued=0ms, exec=191ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@dhcp46-111 ganesha]#
[root@dhcp46-111 ganesha]# pcs resource cleanup dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-nfs_block on dhcp46-111.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-nfs_block
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-nfs_block on dhcp46-115.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-nfs_block
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1 on dhcp46-111.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1 on dhcp46-115.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock on dhcp46-111.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock
Cleaning up dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock on dhcp46-115.lab.eng.blr.redhat.com, removing fail-count-dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock
Waiting for 6 replies from the CRMd...... OK
[root@dhcp46-111 ganesha]#
[root@dhcp46-111 ganesha]# pcs status
Cluster name: ganesha-ha-360
Stack: corosync
Current DC: dhcp46-115.lab.eng.blr.redhat.com (version 1.1.15-11.el7_3.2-e174ec8) - partition WITHOUT quorum
Last updated: Wed Dec 7 16:55:08 2016
Last change: Wed Dec 7 16:55:00 2016 by hacluster via crmd on dhcp46-115.lab.eng.blr.redhat.com

4 nodes and 24 resources configured

Online: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com ]
OFFLINE: [ dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]

Full list of resources:

 Clone Set: nfs_setup-clone [nfs_setup]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-mon-clone [nfs-mon]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Clone Set: nfs-grace-clone [nfs-grace]
     Stopped: [ dhcp46-111.lab.eng.blr.redhat.com dhcp46-115.lab.eng.blr.redhat.com dhcp46-124.lab.eng.blr.redhat.com dhcp46-139.lab.eng.blr.redhat.com ]
 Resource Group: dhcp46-111.lab.eng.blr.redhat.com-group
     dhcp46-111.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-111.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-111.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-115.lab.eng.blr.redhat.com-group
     dhcp46-115.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-115.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-115.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-139.lab.eng.blr.redhat.com-group
     dhcp46-139.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-139.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-139.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped
 Resource Group: dhcp46-124.lab.eng.blr.redhat.com-group
     dhcp46-124.lab.eng.blr.redhat.com-nfs_block (ocf::heartbeat:portblock): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-cluster_ip-1 (ocf::heartbeat:IPaddr): Stopped
     dhcp46-124.lab.eng.blr.redhat.com-nfs_unblock (ocf::heartbeat:portblock): Stopped

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled
[root@dhcp46-111 ganesha]#
The portblock resource no longer goes to the FAILED state, and the 'insufficient privileges' message is not seen in pcs status.

Verified the fix in the following builds:
nfs-ganesha-gluster-2.4.1-6.el7rhgs.x86_64
nfs-ganesha-2.4.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-12.el7rhgs.x86_64
resource-agents-3.9.5-82.el7_3.4.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2017-0486.html