Description of problem:
With Instance HA configured, simulating a node failure on a node with running instances causes the instances to fail to evacuate. I provided sosreports to Andrew Beekhof and Fabio Di Nitto; their comments are below:
>>> Subject: RE: Need your insight to what might be wrong
>>>
>>> I just tried spinning up a new instance on the stamp (worked OK) and
>>> noted where the instance was running. I then logged on to the
>>> compute node where the instance was running and simulated a crash on
>>> the node with the “echo c > /proc/sysrq-trigger” command. I observed
>>> the node panic and shut off.
>>>
>>> No changes to the status were observed while watching the “watch
>>> nova list --fields name,status,host” command on the director node
>>> until after I powered the node back on. Then, I observed the new
>>> instance status was “SHUTOFF”.
>>>
>>> I generated sosreports on controller0 and nova0, where the instance
>>> was and is still located. (See attached files; I had to split the
>>> controller file, so the second part will come in a follow-up email.)
>>>
>>> I will verify the domain info. Does the nova-evacuate resource
>>> require the domain to be set? The RHEL Instance HA guide didn’t
>>> list that option.
>>
>> Correct. It's only needed on the fence agent and the
>> nova-compute-checkevacuate resource.
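For reference, a sketch of where the domain would go in this topology. The resource names are taken from the report; `localdomain` is a placeholder value, and the exact option spelling should be checked against the installed agent versions:

```shell
# Hypothetical sketch: the 'domain' option lives on the fence agent and
# the check-evacuate resource, not on nova-evacuate. 'localdomain' is a
# placeholder for the deployment's actual DNS domain.
pcs stonith update fence-nova domain=localdomain
pcs resource update nova-compute-checkevacuate domain=localdomain
```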
>>
>> It looks like I'm missing the pacemaker.log files from the three
>> compute nodes and I only have sos reports from one of the controllers.
>>
>> The logs I do have worry me a great deal though.
>>
>> On compute-0, time appears to be a relative concept:
>>
>> Apr 7 14:07:04 localhost ceilometer-polling: 2016-04-07 18:07:04.969
>> 102896 WARNING oslo_config.cfg
>> [req-c240d133-e6ec-4d83-b3bc-03d50d618309 admin - - - -] Option
>> "alarm_history_time_to_live" from group "database" is deprecated for
>> removal. Its value may be silently ignored in the future.
>> Apr 7 14:36:59 localhost neutron-openvswitch-agent: 2016-04-07
>> 18:36:59.440 102658 ERROR oslo.messaging._drivers.impl_rabbit [-]
>> AMQP server on 192.168.140.94:5672 is unreachable: Too many
>> heartbeats missed. Trying again in 1 seconds.
>> Apr 7 14:37:03 localhost neutron-openvswitch-agent: 2016-04-07
>> 18:37:03.454 102658 ERROR oslo.messaging._drivers.impl_rabbit [-]
>> AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113]
>> EHOSTUNREACH. Trying again in 1 seconds.
>> Apr 7 14:37:06 localhost neutron-openvswitch-agent: 2016-04-07
>> 18:37:06.461 102658 ERROR oslo.messaging._drivers.impl_rabbit [-]
>> AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113]
>> EHOSTUNREACH. Trying again in 1 seconds.
>> Apr 7 14:37:09 localhost neutron-openvswitch-agent: 2016-04-07
>> 18:37:09.117 102658 ERROR oslo.messaging._drivers.impl_rabbit [-]
>> AMQP server on 192.168.140.94:5672 is unreachable: [Errno 111]
>> ECONNREFUSED. Trying again in 2 seconds.
>> Apr 7 14:37:11 localhost neutron-openvswitch-agent: 2016-04-07
>> 18:37:11.130 102658 INFO oslo.messaging._drivers.impl_rabbit [-]
>> Reconnected to AMQP server on 192.168.140.93:5672
>> Apr  7 14:04:50 overcloud-novacompute-0 kernel: qbr687024ad-fa: port
>> 2(tap687024ad-fa) entered disabled state
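The four-hour gap between the syslog prefix (14:07) and the service's own timestamp (18:07) is the classic local-time vs UTC split (e.g. a local zone of UTC-4) rather than genuine clock drift. A quick way to confirm which clock each log uses on the host:

```shell
# Compare the host's local clock with UTC; a constant four-hour offset
# in the logs lines up with a local zone of UTC-4, not a broken clock.
date         # local time, as used by the syslog prefix
date -u      # UTC, as used by the OpenStack service timestamps
# On systemd hosts, `timedatectl` additionally shows the configured
# time zone and whether NTP synchronization is active.
```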
>>
>> It looks like the ipmi devices need longer timeouts (corosync.log):
>>
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>> error: log_operation: Operation 'reboot' [13745] (call 2 from
>> crmd.73138) for host 'overcloud-controller-1' with device
>> 'overcloud-controller-1-ipmi' returned: 1 (Operation not permitted).
>> Trying: overcloud-controller-1-ipmi
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>> warning: log_operation: overcloud-controller-1-ipmi:13745 [
>> Connection timed out ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>> warning: log_operation: overcloud-controller-1-ipmi:13745 [ ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>> warning: log_operation: overcloud-controller-1-ipmi:13745 [ ]
>>
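Given the repeated "Connection timed out" results above, one mitigation sketch is to raise the fence device's timeouts. The parameter names below are the standard fence-agent options; the values are assumptions to tune against the actual BMC response times:

```shell
# Hypothetical values -- tune for the BMC actually being fenced.
pcs stonith update overcloud-controller-1-ipmi \
    power_timeout=60 shell_timeout=20 login_timeout=10
pcs stonith show overcloud-controller-1-ipmi   # verify the new options
```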
>> And again the time is hard to establish because in /var/log/messages
>> the same logs come out as having occurred 4 hours earlier:
>>
>> Apr 7 14:36:28 localhost stonith-ng[73134]: error: Operation
>> 'reboot' [13745] (call 2 from crmd.73138) for host
>> 'overcloud-controller-1' with device 'overcloud-controller-1-ipmi'
>> returned: 1 (Operation not permitted). Trying:
>> overcloud-controller-1-ipmi
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning:
>> overcloud-controller-1-ipmi:13745 [ Connection timed out ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning:
>> overcloud-controller-1-ipmi:13745 [ ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning:
>> overcloud-controller-1-ipmi:13745 [ ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: Agent
>> 'fence_compute' does not advertise support for 'reboot', performing
>> 'off' action instead
>> Apr  7 14:36:28 localhost stonith-ng[73134]: notice: Operation
>> 'reboot' [14733] (call 2 from crmd.73138) for host
>> 'overcloud-controller-1' with device 'fence-nova' returned: 0 (OK)
>> Apr  7 14:36:28 localhost stonith-ng[73134]: notice: Operation reboot
>> of overcloud-controller-1 by overcloud-controller-0 for
>> crmd.73138: OK
>>
>> Then things get really screwy:
>>
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain crmd:
>> notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>> Apr 07 21:38:03 [73138] overcloud-controller-0.localdomain crmd:
>> crit: tengine_stonith_notify: We were allegedly just fenced by
>> overcloud-controller-0 for overcloud-controller-0!
>>
>> overcloud-controller-0 wants to fence itself for some reason, ipmi
>> claims it did so, but overcloud-controller-0 is still around to get
>> the notification!!!
>>
>> The answer looks to be in these logs:
>>
>> Apr 07 21:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>> notice: log_operation: Operation 'reboot' [35674] (call 4 from
>> crmd.73138) for host 'overcloud-controller-2' with device 'fence-nova'
>> returned: 0 (OK)
>> Apr 07 21:38:03 [73134] overcloud-controller-0.localdomain stonith-ng:
>> notice: log_operation: Operation 'reboot' [36357] (call 5 from
>> crmd.73138) for host 'overcloud-controller-0' with device 'fence-nova'
>> returned: 0 (OK)
>>
>> which suggests fence_compute might be configured with the
>> undercloud's credentials.
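One way to check that theory is to inspect the fence-nova device's options and confirm they point at the overcloud's keystone rather than the undercloud's. The option names below are the usual fence_compute parameters; the output format depends on the pcs version:

```shell
pcs stonith show fence-nova
# Verify that auth-url, login, passwd and tenant-name refer to the
# *overcloud* keystone endpoint; undercloud credentials here would make
# "successful" fence operations act on the wrong cloud, matching
# controller-0 surviving its own fencing.
```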
>>
>> In total, I see the following fencing events:
>>
>> Apr 07 18:35:01 [73138] overcloud-controller-0.localdomain crmd:
>> notice: te_fence_node: Executing reboot fencing operation (541) on
>> overcloud-controller-1 (timeout=60000)
>> Apr 07 18:39:10 [73138] overcloud-controller-0.localdomain crmd:
>> notice: te_fence_node: Executing reboot fencing operation (604) on
>> overcloud-novacompute-1.localdomain (timeout=60000)
>> Apr 07 21:35:23 [73138] overcloud-controller-0.localdomain crmd:
>> notice: te_fence_node: Executing reboot fencing operation (214) on
>> overcloud-controller-2 (timeout=60000)
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain crmd:
>> notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>>
>> and only one of them is for a compute node.
>> The first seems like a genuine outage, but the remainder appear to be
>> due to rabbitmq failing to stop when requested by the cluster.
>>
>> There was a bug matching these symptoms, I will check with our Rabbit
>> maintainer to see if the fixes have gone out yet.