Bug 1325961

Summary: Instance HA failing to evacuate on instance failure
Product: Red Hat OpenStack
Reporter: John Williams <j_t_williams>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED NOTABUG
QA Contact: Shai Revivo <srevivo>
Severity: urgent
Priority: unspecified
Version: 8.0 (Liberty)
Target Release: 8.0 (Liberty)
Target Milestone: beta
Keywords: ZStream
Hardware: x86_64
OS: Linux
CC: abeekhof, apevec, arkady_kanevsky, cdevine, christopher_dearborn, fdinitto, jeckersb, John_walsh, j_t_williams, kurt_hey, lhh, morazi, rajini.karthik, randy_perryman, rsussman, sreichar, srevivo, wayne_allen
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-04-18 14:45:48 UTC
Bug Blocks: 1310828

Description John Williams 2016-04-11 13:46:07 UTC
Description of problem:

With Instance HA configured, when I simulate a failure of a node with instances running, the instances fail to evacuate. I provided sosreports to Andrew Beekhof and Fabio Di Nitto; their comments are below:


>>> Subject: RE: Need your insight to what might be wrong
>>>
>>> I just tried spinning up a new instance on the stamp (worked OK) and 
>>> noted where the instance was running.  I then logged on to the 
>>> compute node where the instance was running and simulated a crash on
>>> the node with the "echo c > /proc/sysrq-trigger" command. I observed
>>> the node panic and shut off.
>>>
>>> No changes to the status were observed while watching the
>>> "watch nova list --fields name,status,host" command on the director
>>> node until after I powered the node back on. Then, I observed the
>>> new instance status was "SHUTOFF".
>>>
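For reference, a minimal sketch of the reproduction procedure described above; the 5-second watch interval and the overcloudrc path are assumptions, not details from the report:

  # On the compute node hosting the instance: force an immediate kernel panic.
  # (Assumes sysrq is enabled; "echo 1 > /proc/sys/kernel/sysrq" enables it.)
  echo c > /proc/sysrq-trigger

  # On the director node: watch instance state while the node is down.
  source ~/overcloudrc                            # assumed credentials file
  watch -n 5 nova list --fields name,status,host
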
>>> I generated sosreports on controller0 and nova0, where the instance
>>> was and is still located. (See attached files; I had to split the
>>> controller file, and the second part will come in a follow-up email.)
>>>
>>> I will verify the domain info. Does the nova-evacuate resource
>>> require the domain to be set? The RHEL Use HA guide didn't have
>>> the option listed.
>>
>> Correct. It's only needed on the fence agent and
>> nova-compute-checkevacuate resource.
>>
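For context, this is roughly where the domain parameter lives in an RHOSP 8 Instance HA configuration. A hedged sketch following the parameter names in the Red Hat Instance HA guide; the endpoint, credentials, and domain value are placeholders, not taken from this deployment:

  # Fencing agent: domain is appended to short compute hostnames when
  # matching them against nova's service list.
  pcs stonith create fence-nova fence_compute \
      auth-url="http://192.0.2.1:5000/v2.0" login="admin" passwd="PASSWORD" \
      tenant-name="admin" domain="localdomain" record-only=1

  # The nova-compute-checkevacuate resource takes the same domain setting.
  pcs resource create nova-compute-checkevacuate ocf:openstack:nova-compute-wait \
      auth_url="http://192.0.2.1:5000/v2.0" username="admin" password="PASSWORD" \
      tenant_name="admin" domain="localdomain" \
      op start timeout=300 --clone interleave=true
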
>> It looks like I'm missing the pacemaker.log files from the three 
>> compute nodes and I only have sos reports from one of the controllers.
>>
>> The logs I do have worry me a great deal though.
>>
>> On compute-0, time appears to be a relative concept:
>>
>> Apr  7 14:07:04 localhost ceilometer-polling: 2016-04-07 18:07:04.969 102896 WARNING oslo_config.cfg [req-c240d133-e6ec-4d83-b3bc-03d50d618309 admin - - - -] Option "alarm_history_time_to_live" from group "database" is deprecated for removal.  Its value may be silently ignored in the future.
>> Apr  7 14:36:59 localhost neutron-openvswitch-agent: 2016-04-07 18:36:59.440 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: Too many heartbeats missed. Trying again in 1 seconds.
>> Apr  7 14:37:03 localhost neutron-openvswitch-agent: 2016-04-07 18:37:03.454 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.
>> Apr  7 14:37:06 localhost neutron-openvswitch-agent: 2016-04-07 18:37:06.461 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 113] EHOSTUNREACH. Trying again in 1 seconds.
>> Apr  7 14:37:09 localhost neutron-openvswitch-agent: 2016-04-07 18:37:09.117 102658 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 192.168.140.94:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 2 seconds.
>> Apr  7 14:37:11 localhost neutron-openvswitch-agent: 2016-04-07 18:37:11.130 102658 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 192.168.140.93:5672
>> Apr  7 14:04:50 overcloud-novacompute-0 kernel: qbr687024ad-fa: port 2(tap687024ad-fa) entered disabled state
>>
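The 4-hour offset is consistent with syslog stamping messages in local time (here UTC-4) while the OpenStack services log in UTC. A quick way to confirm which clock each node is using; these commands are a suggested check, not something taken from the sosreports:

  # Compare local time, UTC, and timezone/NTP settings on each node:
  date; date -u
  timedatectl        # shows "Time zone" and "NTP synchronized" on RHEL 7
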
>> It looks like the ipmi devices need longer timeouts (corosync.log):
>>
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng: error: log_operation: Operation 'reboot' [13745] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'overcloud-controller-1-ipmi' returned: 1 (Operation not permitted). Trying: overcloud-controller-1-ipmi
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng: warning: log_operation: overcloud-controller-1-ipmi:13745 [ Connection timed out ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng: warning: log_operation: overcloud-controller-1-ipmi:13745 [  ]
>> Apr 07 18:36:28 [73134] overcloud-controller-0.localdomain stonith-ng: warning: log_operation: overcloud-controller-1-ipmi:13745 [  ]
>>
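If the IPMI devices do need longer timeouts, the standard fence-agent timeout options can be raised on the existing stonith resources. A sketch with illustrative values only:

  # Give slow BMCs more time to complete power actions:
  pcs stonith update overcloud-controller-1-ipmi power_timeout=60 power_wait=10
  pcs stonith show overcloud-controller-1-ipmi    # verify the change
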
>> And again the time is hard to establish because in /var/log/messages 
>> the same logs come out as having occurred 4 hours earlier:
>>
>> Apr  7 14:36:28 localhost stonith-ng[73134]:   error: Operation 'reboot' [13745] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'overcloud-controller-1-ipmi' returned: 1 (Operation not permitted). Trying: overcloud-controller-1-ipmi
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [ Connection timed out ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [  ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: overcloud-controller-1-ipmi:13745 [  ]
>> Apr  7 14:36:28 localhost stonith-ng[73134]: warning: Agent 'fence_compute' does not advertise support for 'reboot', performing 'off' action instead
>> Apr  7 14:36:28 localhost stonith-ng[73134]:  notice: Operation 'reboot' [14733] (call 2 from crmd.73138) for host 'overcloud-controller-1' with device 'fence-nova' returned: 0 (OK)
>> Apr  7 14:36:28 localhost stonith-ng[73134]:  notice: Operation reboot of overcloud-controller-1 by overcloud-controller-0 for crmd.73138: OK
>>
>> Then things get really screwy:
>>
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>> Apr 07 21:38:03 [73138] overcloud-controller-0.localdomain       crmd:
>>     crit: tengine_stonith_notify: We were allegedly just fenced by
>> overcloud-controller-0 for overcloud-controller-0!
>>
>> overcloud-controller-0 wants to fence itself for some reason, ipmi 
>> claims it did so, but overcloud-controller-0 is still around to get 
>> the notification!!!
>>
>> The answer looks to be in these logs:
>>
>> Apr 07 21:36:28 [73134] overcloud-controller-0.localdomain stonith-ng:
>>   notice: log_operation: Operation 'reboot' [35674] (call 4 from
>> crmd.73138) for host 'overcloud-controller-2' with device 'fence-nova'
>> returned: 0 (OK)
>> Apr 07 21:38:03 [73134] overcloud-controller-0.localdomain stonith-ng:
>>   notice: log_operation: Operation 'reboot' [36357] (call 5 from
>> crmd.73138) for host 'overcloud-controller-0' with device 'fence-nova'
>> returned: 0 (OK)
>>
>> which suggests fence-compute might be configured with the undercloud's
>> credentials.
>>
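One way to test that theory is to dump the fence-nova configuration and compare its credentials against the overcloud's rather than the undercloud's. The commands below are a suggested check, with placeholder values:

  # Inspect the credentials currently configured on the fencing resource:
  pcs stonith show fence-nova

  # If they point at the undercloud keystone, repoint them at the overcloud:
  pcs stonith update fence-nova \
      auth-url="http://OVERCLOUD_KEYSTONE:5000/v2.0" \
      login="admin" passwd="PASSWORD" tenant-name="admin"
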
>> In total, I see the following fencing events:
>>
>> Apr 07 18:35:01 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (541) on
>> overcloud-controller-1 (timeout=60000)
>> Apr 07 18:39:10 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (604) on 
>> overcloud-novacompute-1.localdomain (timeout=60000)
>> Apr 07 21:35:23 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (214) on
>> overcloud-controller-2 (timeout=60000)
>> Apr 07 21:38:02 [73138] overcloud-controller-0.localdomain       crmd:
>>   notice: te_fence_node: Executing reboot fencing operation (255) on
>> overcloud-controller-0 (timeout=60000)
>>
>> and only one of them is for a compute node.
>> The first seems like a genuine outage, but the remainder appear to be 
>> due to rabbitmq failing to stop when requested by the cluster.
>>
>> There was a bug matching these symptoms; I will check with our Rabbit
>> maintainer to see if the fixes have gone out yet.
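
For anyone retracing this, failed rabbitmq stop operations are what escalate into fencing, and they show up in the cluster's failure counts and logs. A hedged sketch of where to look; the resource name "rabbitmq" is assumed:

  # Failed stop actions are what trigger the extra fencing events:
  pcs status | grep -i fail
  pcs resource failcount show rabbitmq

  # The resource agent's stop attempts are logged here:
  grep -i 'rabbitmq' /var/log/cluster/corosync.log | grep -i stop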

Comment 2 Andrew Beekhof 2016-04-12 04:40:02 UTC
I think we're good now.

There was a mismatch between how the remote nodes and the fencing agent were configured.

Comment 4 Fabio Massimo Di Nitto 2016-04-18 14:45:48 UTC
After a live session with the partner, rabbit turned out to be working fine; some config options for Instance HA were missing. The problem has been solved.