Description of problem: So during IHA testing on OSP13 (rhel-7.5) we got into this weird state.

1) 3 controllers with 2 remotes:

[root@controller-2 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7-2b07d5c5a9) - partition with quorum
Last updated: Mon Apr 9 13:42:20 2018
Last change: Mon Apr 9 12:28:33 2018 by hacluster via crmd on controller-2

14 nodes configured
60 resources configured

Online: [ controller-0 controller-1 controller-2 ]
RemoteOnline: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
GuestOnline: [ galera-bundle-0@controller-2 galera-bundle-1@controller-0 galera-bundle-2@controller-1 rabbitmq-bundle-0@controller-2 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-1 redis-bundle-0@controller-2 redis-bundle-1@controller-0 redis-bundle-2@controller-1 ]

Full list of resources:

 overcloud-novacompute-0 (ocf::pacemaker:remote): Started controller-0
 overcloud-novacompute-1 (ocf::pacemaker:remote): Started controller-1

2) Everything is fine on overcloud-novacompute-0 (I hop on and can run pcs commands just fine).

3) On novacompute-1 'pcs' won't work and I see this in the log:

Apr 09 13:35:30 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:35:30 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-15184-15): Remote I/O error (121)

3.1) The authkey seems okay:

[root@overcloud-novacompute-1 ~]# A=/etc/pacemaker/authkey
[root@overcloud-novacompute-1 ~]# ls -l $A; md5sum $A
-rw-r-----. 1 hacluster haclient 4096 Apr 9 06:02 /etc/pacemaker/authkey
0e56b524b37d59770f71a9625990d03a /etc/pacemaker/authkey

[root@controller-2 ~]# A=/etc/pacemaker/authkey
[root@controller-2 ~]# ls -l $A; md5sum $A
-rw-r-----. 1 hacluster haclient 4096 Apr 9 06:03 /etc/pacemaker/authkey
0e56b524b37d59770f71a9625990d03a /etc/pacemaker/authkey

On controller-1 (which runs the resource managing novacompute-1) I see this:

Apr 09 13:47:54 [19658] controller-1 pengine: info: common_print: overcloud-novacompute-1 (ocf::pacemaker:remote): Started controller-1
Apr 09 13:47:54 [19658] controller-1 pengine: info: short_print: Started: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
Apr 09 13:47:54 [19658] controller-1 pengine: info: LogActions: Leave overcloud-novacompute-1 (Started controller-1)
Apr 09 13:47:54 [19658] controller-1 pengine: info: LogActions: Leave compute-unfence-trigger:1 (Started overcloud-novacompute-1)

There are no SELinux denials:

[root@overcloud-novacompute-1 ~]# grep -ir denied /var/log/audit/audit.log | grep -v ssh
[root@overcloud-novacompute-1 ~]#

Also, the problematic novacompute-1 node is running a dummy resource:

 Clone Set: compute-unfence-trigger-clone [compute-unfence-trigger]
     Started: [ overcloud-novacompute-0 overcloud-novacompute-1 ]
     Stopped: [ controller-0 controller-1 controller-2 ]

 Resource: compute-unfence-trigger (class=ocf provider=pacemaker type=Dummy)

The full journal on novacompute-1 is this:

[root@overcloud-novacompute-1 ~]# journalctl -u pacemaker_remote
-- Logs begin at Mon 2018-04-09 05:55:14 UTC, end at Mon 2018-04-09 14:01:01 UTC. --
Apr 09 06:02:25 overcloud-novacompute-1 systemd[1]: Started Pacemaker Remote Service.
Apr 09 06:02:25 overcloud-novacompute-1 systemd[1]: Starting Pacemaker Remote Service...
Apr 09 06:02:25 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: Additional logging available in /var/log/pacemaker.log
Apr 09 06:02:25 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: Starting TLS listener on port 3121
Apr 09 06:02:25 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: Listening on address ::
Apr 09 06:04:10 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: LRMD client connection established. 0x55c09e4621f0 id: 876b3865-105e-46ca-a36b-2dfeaf3205f0
Apr 09 06:16:14 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: ip-10.0.0.110_monitor_0:22391:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
Apr 09 06:17:21 overcloud-novacompute-1 pacemaker_remoted[16305]: notice: ip-172.17.4.17_monitor_0:22625:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
-- Reboot --
Apr 09 11:24:10 overcloud-novacompute-1 systemd[1]: Started Pacemaker Remote Service.
Apr 09 11:24:10 overcloud-novacompute-1 systemd[1]: Starting Pacemaker Remote Service...
Apr 09 11:24:10 overcloud-novacompute-1 pacemaker_remoted[2860]: notice: Additional logging available in /var/log/pacemaker.log
Apr 09 11:24:10 overcloud-novacompute-1 pacemaker_remoted[2860]: notice: Starting TLS listener on port 3121
Apr 09 11:24:10 overcloud-novacompute-1 pacemaker_remoted[2860]: notice: Listening on address ::
-- Reboot --
Apr 09 11:44:41 overcloud-novacompute-1 pacemaker_remoted[2896]: notice: Additional logging available in /var/log/pacemaker.log
Apr 09 11:44:41 overcloud-novacompute-1 pacemaker_remoted[2896]: notice: Starting TLS listener on port 3121
Apr 09 11:44:41 overcloud-novacompute-1 systemd[1]: Started Pacemaker Remote Service.
Apr 09 11:44:41 overcloud-novacompute-1 systemd[1]: Starting Pacemaker Remote Service...
Apr 09 11:44:41 overcloud-novacompute-1 pacemaker_remoted[2896]: notice: Listening on address ::
-- Reboot --
Apr 09 12:39:13 overcloud-novacompute-1 pacemaker_remoted[2919]: notice: Additional logging available in /var/log/pacemaker.log
Apr 09 12:39:13 overcloud-novacompute-1 pacemaker_remoted[2919]: notice: Starting TLS listener on port 3121
Apr 09 12:39:13 overcloud-novacompute-1 systemd[1]: Started Pacemaker Remote Service.
Apr 09 12:39:13 overcloud-novacompute-1 systemd[1]: Starting Pacemaker Remote Service...
Apr 09 12:39:13 overcloud-novacompute-1 pacemaker_remoted[2919]: notice: Listening on address ::
Apr 09 13:00:05 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:00:05 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-6354-15): Remote I/O error (121)
Apr 09 13:02:07 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:02:07 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-6703-15): Remote I/O error (121)
Apr 09 13:02:35 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:02:35 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-6795-15): Remote I/O error (121)
Apr 09 13:07:00 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:07:00 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-7775-15): Remote I/O error (121)
Apr 09 13:09:34 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:09:34 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-8140-15): Remote I/O error (121)
Apr 09 13:21:17 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:21:17 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-9522-15): Remote I/O error (121)
Apr 09 13:27:49 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:27:49 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-10290-15): Remote I/O error (121)
Apr 09 13:32:12 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:32:12 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-10918-15): Remote I/O error (121)
Apr 09 13:32:43 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:32:43 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-10974-15): Remote I/O error (121)
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-13951-15): Remote I/O error (121)
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-13960-15): Remote I/O error (121)
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:38 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-13961-15): Remote I/O error (121)
Apr 09 13:33:39 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:39 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-13964-15): Remote I/O error (121)
Apr 09 13:33:39 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:39 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-13970-15): Remote I/O error (121)
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-14174-15): Remote I/O error (121)
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-14176-15): Remote I/O error (121)
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:33:44 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-14231-15): Remote I/O error (121)
Apr 09 13:35:30 overcloud-novacompute-1 pacemaker_remoted[2919]: error: No ipc providers available for uid 0 gid 0
Apr 09 13:35:30 overcloud-novacompute-1 pacemaker_remoted[2919]: error: Error in connection setup (2919-15184-15): Remote I/O error (121)

Not sure if the reboots (fencing) brought us into this state?
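A note on the symptom: 'pcs' on a remote node talks to pacemaker_remoted, which proxies the IPC back over the TLS connection to whichever cluster node manages it; "No ipc providers available" means no cluster node currently holds a live connection, even though pcs status on the cluster still shows the resource as Started. A quick check from the controller side (a sketch, not from this session; the handshake itself is PSK-based, so this only proves the listener is up):

  [root@controller-1 ~]# timeout 3 bash -c '</dev/tcp/overcloud-novacompute-1/3121' && echo "TLS listener reachable"
  # Only confirms port 3121 is open on the remote; it says nothing about
  # whether a controller actually holds an established session to it.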
I see:

Apr 9 03:53:19 controller-1 crmd[19659]: error: Result of monitor operation for overcloud-novacompute-1 on controller-1: Error
Apr 9 03:53:19 controller-1 crmd[19659]: error: remote-node overcloud-novacompute-1 unexpectedly disconneced during monitor operation
Apr 9 03:53:19 controller-1 crmd[19659]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Apr 9 03:53:19 controller-1 crmd[19659]: notice: Initiating stop operation ip-10.0.0.110_stop_0 on controller-0

I see fencing failing:

Apr 9 03:53:19 controller-1 crmd[19659]: notice: Requesting fencing (reboot) of node overcloud-novacompute-1
Apr 9 03:53:19 controller-1 stonith-ng[19655]: notice: Client crmd.19659.946a26a5 wants to fence (reboot) 'overcloud-novacompute-1' with device '(any)'
Apr 9 03:53:32 controller-1 stonith-ng[19655]: notice: Operation 'off' [566658] (call 16 from crmd.19659) for host 'overcloud-novacompute-1' with device 'stonith-fence_ipmilan-5254008be2cc' returned: 0 (OK)
Apr 9 03:53:32 controller-1 stonith-ng[19655]: notice: Call to stonith-fence_ipmilan-5254008be2cc for 'overcloud-novacompute-1 off' on behalf of crmd.19659@controller-1: OK (0)
Apr 9 03:53:42 controller-1 stonith-ng[19655]: notice: Call to stonith-fence_compute-fence-nova for 'overcloud-novacompute-1 off' on behalf of crmd.19659@controller-1: Generic Pacemaker error (-201)
Apr 9 03:53:42 controller-1 stonith-ng[19655]: notice: All fencing options to fence overcloud-novacompute-1 for crmd.19659@controller-1.3823b3bd failed
Apr 9 03:53:42 controller-1 stonith-ng[19655]: error: Operation reboot of overcloud-novacompute-1 by controller-2 for crmd.19659@controller-1.3823b3bd: Generic Pacemaker error
Apr 9 03:53:42 controller-1 crmd[19659]: notice: Stonith operation 16/27:199:0:69d170fd-52b7-46b4-85cd-cd4a4d08c203: Generic Pacemaker error (-201)
Apr 9 03:53:42 controller-1 crmd[19659]: notice: Stonith operation 16 for overcloud-novacompute-1 failed (Generic Pacemaker error): aborting transition.
Apr 9 03:53:42 controller-1 crmd[19659]: notice: Transition aborted: Stonith failed

which can't be good. That repeats once or twice, but then for some reason the cluster forgets that the compute is in a failed state.
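For anyone retracing this, the fencing operations the cluster recorded for the node can be pulled from any cluster node (a sketch; stonith_admin ships with pacemaker and --history/-H takes a node name or '*'):

  [root@controller-1 ~]# stonith_admin --history overcloud-novacompute-1 --verbose
  # Shows the fencing operations recorded for overcloud-novacompute-1,
  # which should include the failed reboot attempts above.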
This is the "why": [root@controller-1 heat-admin]# crm_diff -n /var/lib/pacemaker/pengine/pe-input-97.bz2 -o /var/lib/pacemaker/pengine/pe-warn-4.bz2 <diff format="2"> <version> <source admin_epoch="0" epoch="104" num_updates="12"/> <target admin_epoch="0" epoch="104" num_updates="17"/> </version> <change operation="delete" path="/cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='overcloud-novacompute-1']/lrm_rsc_op[@id='overcloud-novacompute-1_last_failure_0']"/> <change operation="delete" path="/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-fail-count-overcloud-novacompute-1.monitor_20000']"/> <change operation="delete" path="/cib/status/node_state[@id='2']/transient_attributes[@id='2']/instance_attributes[@id='status-2']/nvpair[@id='status-2-last-failure-overcloud-novacompute-1.monitor_20000']"/> .... Which was caused by the crmd: Apr 09 07:55:35 [19654] controller-1 cib: info: cib_perform_op: Diff: --- 0.104.14 2 Apr 09 07:55:35 [19654] controller-1 cib: info: cib_perform_op: Diff: +++ 0.104.15 (null) Apr 09 07:55:35 [19654] controller-1 cib: info: cib_perform_op: -- /cib/status/node_state[@id='2']/lrm[@id='2']/lrm_resources/lrm_resource[@id='overcloud-novacompute-1']/lrm_rsc_op[@id='overcloud-novacompute-1_last_failure_0'] Apr 09 07:55:35 [19654] controller-1 cib: info: cib_perform_op: + /cib: @num_updates=15 Apr 09 07:55:35 [19654] controller-1 cib: info: cib_process_request: Completed cib_delete operation for section /cib/status/node_state[@uname='controller-1']/lrm/lrm_resources/lrm_resource[@id='overcloud-novacompute-1']/lrm_rsc_op[@id='overcloud-novacompute-1_last_failure_0']: OK (rc=0, origin=controller-1/crmd/2090, version=0.104.15) Apr 09 07:55:35 [19659] controller-1 crmd: info: do_te_invoke: Processing graph 203 (ref=pe_calc-dc-1523260535-572) derived from /var/lib/pacemaker/pengine/pe-warn-4.bz2 Apr 09 07:55:35 [19659] controller-1 crmd: info: te_crm_command: Executing crm-event (26): clear_failcount on controller-1 Which came from this action in pe-warn-4.bz4 <crm_event id="26" operation="clear_failcount" operation_key="overcloud-novacompute-1_clear_failcount_0" on_node="controller-1" on_node_uuid="2">
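The saved policy-engine inputs referenced above can also be replayed offline to see exactly what the DC decided (a sketch; crm_simulate ships with pacemaker and reads the bz2-compressed pe files directly):

  [root@controller-1 ~]# crm_simulate --simulate --xml-file /var/lib/pacemaker/pengine/pe-warn-4.bz2
  # Replays the saved input and prints the resulting cluster state and
  # transition, including the clear_failcount crm-event (26) shown above.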
Pretty clearly, that action should depend on a successful stonith operation; currently it has no prerequisites.
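In transition-graph terms, the missing piece would look something like the following (a hand-written, illustrative sketch with made-up synapse and action ids, not output from this cluster): the clear_failcount action gains an input that only triggers once the fencing of the remote has completed.

  <synapse id="3">
    <action_set>
      <crm_event id="26" operation="clear_failcount" operation_key="overcloud-novacompute-1_clear_failcount_0" on_node="controller-1" on_node_uuid="2"/>
    </action_set>
    <inputs>
      <trigger>
        <!-- hypothetical prerequisite: the reboot of the remote must succeed first -->
        <crm_event id="25" operation="stonith" operation_key="stonith-overcloud-novacompute-1-reboot" on_node="overcloud-novacompute-1"/>
      </trigger>
    </inputs>
  </synapse>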
Created attachment 1419616: Test case
Ohhh...

lib/pengine/complex.c:

    /* we want to override any default failure_timeout in use when remote
     * reconnect_interval is in use. */
    (*rsc)->failure_timeout = (*rsc)->remote_reconnect_ms / 1000;

So this bug actually demonstrates the danger of setting the reconnect interval too low.
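To make the mechanism concrete (a hedged illustration; the interval value here is made up, and the equivalence is just what the code above implements):

  [root@controller-1 ~]# pcs resource update overcloud-novacompute-1 reconnect_interval=60
  # Per complex.c above, this implicitly behaves like failure-timeout=60s on
  # the connection resource: a minute after the failed monitor, the policy
  # engine schedules clear_failcount, whether or not fencing has succeeded.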
https://github.com/ClusterLabs/pacemaker/commit/4e984bb

Requesting z-stream as this is impacting IHA testing.
Verified: the remote (ocf::pacemaker:remote) resources are alive and consistent across reboots and load on an IHA env.

Tested on: pacemaker-1.1.18-12, OSP13

Test job results: https://drive.google.com/file/d/1FNQVyfmnDkp6nUUuUnJDA7F4ErcUQSe8/view?usp=sharing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3055