Bug 1325961 - Instance HA failing to evacuate Instance failure
Summary: Instance HA failing to evacuate Instance failure
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: beta
Target Release: 8.0 (Liberty)
Assignee: Peter Lemenkov
QA Contact: Shai Revivo
URL:
Whiteboard:
Depends On:
Blocks: 1310828
 
Reported: 2016-04-11 13:46 UTC by John Williams
Modified: 2019-10-10 11:50 UTC
CC List: 18 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-18 14:45:48 UTC
Target Upstream Version:


Attachments

Description John Williams 2016-04-11 13:46:07 UTC
Description of problem:

With Instance HA configured, after simulating a failure on a compute node with instances running, the instances fail to evacuate.  I provided sosreports to Andrew Beekhof and Fabio Di Nitto; their comments are below:


>>> Subject: RE: Need your insight to what might be wrong
>>>
>>> I just tried spinning up a new instance on the stamp (worked OK) and 
>>> noted where the instance was running.  I then logged on to the 
>>> compute node where the instance was running and simulated a crash on 
>>> the node with the “echo c > /proc/sysrq-trigger” command.  I observed
>>> the node panic and shut off.
>>>
>>> No changes to the status were observed while watching the “watch 
>>> nova list --fields name,status,host” command on the director node 
>>> until after I powered the node back on.  Then, I observed the new 
>>> instance status was “SHUTOFF”.
>>>
>>> I generated sosreports on controller0 and nova0, where the instance
>>> was and is still located (see attached files; I had to split the
>>> controller file, and the second part will come in a follow-up email).
>>>
>>> I will verify the domain info.  Does the nova-evacuate resource
>>> require the domain to be set?  The RHEL Use HA guide didn’t have the
>>> option listed.
>>
>> Correct. It's only needed on the fence agent and 
>> nova-compute-checkevacuate resource.
>>
>> It looks like I'm missing the pacemaker.log files from the three 
>> compute nodes and I only have sos reports from one of the controllers.
>>
>> The logs I do have worry me a great deal though.
>>
>> On compute-0, time appears to be a relative concept:
>>
>> Apr  7 14:07:04 localhost ceilometer-polling: 2016-04-07 18:07:04.969 102896 WARNING oslo_config.cfg [req-c240d133-e6ec-4d83-b3bc-03d50d618309 admin - - - -] Option "alarm_history_time_to_live" from group "database" is deprecated for removal.  Its value may be silently ignored in the future.
>> Apr  7 14:36:59 localhost neutron-openvswitch-agent: 2016-04-07 18:36:59.440 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
>> Apr  7 14:37:03 localhost neutron-openvswitch-agent: 2016-04-07 18:37:03.454 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.
>> Apr  7 14:37:06 localhost neutron-openvswitch-agent: 2016-04-07 18:37:06.461 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.
>> Apr  7 14:37:09 localhost neutron-openvswitch-agent: 2016-04-07 18:37:09.117 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.
>> Apr  7 14:37:11 localhost neutron-openvswitch-agent: 2016-04-07 18:37:11.130 102658 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.140.93:5672
>> Apr  7 14:04:50 overcloud-novacompute-0 kernel: qbr687024ad-fa: port 2(tap687024ad-fa) entered disabled state
>>
>> It looks like the ipmi devices need longer timeouts (corosync.log):
>>
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>    error: log_operation: Operation 'reboot' [13745] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'overcloud-controller-1-ipmi' returned: 1 (Operation not permitted). Trying: overcloud-controller-1-ipmi
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>  warning: log_operation: overcloud-controller-1-ipmi:13745 [ Connection timed out ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>  warning: log_operation: overcloud-controller-1-ipmi:13745 [  ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>  warning: log_operation: overcloud-controller-1-ipmi:13745 [  ]
>>
>> And again the time is hard to establish because in /var/log/messages 
>> the same logs come out as having occurred 4 hours earlier:
>>
>> Apr  7 14:36:28 localhost stonith-ng[73134]:   error: Operation 'reboot' [13745] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'overcloud-controller-1-ipmi' returned: 1 (Operation not permitted). Trying: overcloud-controller-1-ipmi
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [ Connection timed out ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [  ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [  ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: Agent 'fence_compute' does not advertise support for 'reboot', performing 'off' action instead
>> Apr  7 14:36:28 localhost stonith-ng[73134]:  notice: Operation 'reboot' [14733] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'fence-nova' returned: 0 (OK)
>> Apr  7 14:36:28 localhost stonith-ng[73134]:  notice: Operation reboot of overcloud-controller-1 by overcloud-controller-0 for crmd.73138: OK
>>
>> Then things get really screwy:
>>
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>> Apr 07 21:38:03 [73138] overcloud-controller-0.localdomain       crmd:
>>     crit: tengine_stonith_notify: We were allegedly just fenced by
>> overcloud-controller-0 for overcloud-controller-0!
>>
>> overcloud-controller-0 wants to fence itself for some reason, ipmi 
>> claims it did so, but overcloud-controller-0 is still around to get 
>> the notification!!!
>>
>> The answer looks to be in these logs:
>>
>> Apr 07 21:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>   notice: log_operation: Operation 'reboot' [35674] (call 4 from
>> crmd.73138) for host 'overcloud-controller-2' with device 'fence-nova'
>> returned: 0 (OK)
>> Apr 07 21:38:03 [73134] overcloud-controller-0.localdomain stonith-ng:
>>   notice: log_operation: Operation 'reboot' [36357] (call 5 from
>> crmd.73138) for host 'overcloud-controller-0' with device 'fence-nova'
>> returned: 0 (OK)
>>
>> which suggests fence-compute might be configured with the undercloud's 
>> credentials.
>>
>> In total, I see the following fencing events:
>>
>> Apr 07 18:35:01 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (541) on
>> overcloud-controller-1 (timeout=60000)
>> Apr 07 18:39:10 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (604) on 
>> overcloud-novacompute-1.localdomain (timeout=60000)
>> Apr 07 21:35:23 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (214) on
>> overcloud-controller-2 (timeout=60000)
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>>
>> and only one of them is for a compute node.
>> The first seems like a genuine outage, but the remainder appear to be 
>> due to rabbitmq failing to stop when requested by the cluster.
>>
>> There was a bug matching these symptoms; I will check with our Rabbit 
>> maintainer to see if the fixes have gone out yet.
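
For reference on the domain question above: the domain is set only on the fence_compute stonith device and the nova-compute-checkevacuate resource, not on nova-evacuate. A minimal sketch, assuming the stonith device is named fence-nova and the compute hostnames end in .localdomain; parameter names should be confirmed with `pcs stonith describe fence_compute` and the Instance HA guide:

    # Sketch only: set the domain on the fence agent and the check-evacuate resource.
    pcs stonith update fence-nova domain=localdomain
    pcs resource update nova-compute-checkevacuate domain=localdomain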
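On the observation that the IPMI devices need longer timeouts: one way to give them more headroom is to raise the fencing timeouts on the stonith resources and the cluster. The values below are purely illustrative; available options should be checked with `pcs stonith describe fence_ipmilan`:

    # Illustrative values only; tune per hardware.
    pcs stonith update overcloud-controller-1-ipmi power_wait=10 pcmk_reboot_timeout=120s
    pcs property set stonith-timeout=180s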
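On the suspicion that fence-compute is configured with the undercloud's credentials: the fence_compute device has to point at the overcloud keystone endpoint. A minimal sketch with placeholder values; the auth-url/login/passwd/tenant-name parameters belong to fence_compute, and the real values must come from the overcloud credentials file, not the undercloud's:

    # Inspect the current settings, then point the device at the overcloud (placeholders shown).
    pcs stonith show fence-nova
    pcs stonith update fence-nova auth-url=http://<overcloud-keystone-vip>:5000/v2.0 \
        login=admin passwd=<overcloud-admin-password> tenant-name=admin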

Comment 2 Andrew Beekhof 2016-04-12 04:40:02 UTC
I think we're good now.

There was a mismatch between how the remote nodes and the fencing agent were configured.

Comment 4 Fabio Massimo Di Nitto 2016-04-18 14:45:48 UTC
After a live session with the partner, RabbitMQ was working fine; some configuration options for Instance HA were missing. The problem has been solved.
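
A quick way to confirm the Instance HA pieces after correcting the configuration, assuming the standard resource names from the Instance HA setup (nova-evacuate, nova-compute-checkevacuate, fence-nova):

    # Verify that the fencing devices and Instance HA resources are defined and started.
    pcs status --full
    pcs stonith show
    # Then repeat the crash test on a compute node and watch for evacuation:
    #   echo c > /proc/sysrq-trigger              (on the compute node)
    #   nova list --fields name,status,host       (from the director node)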

