Bug 1230249

Summary: Instance evacuation not initated
Product: Red Hat OpenStack Reporter: Fabio Massimo Di Nitto <fdinitto>
Component: openstack-novaAssignee: Eoghan Glynn <eglynn>
Status: CLOSED NOTABUG QA Contact: nlevinki <nlevinki>
Severity: high Docs Contact:
Priority: high    
Version: 7.0 (Kilo)CC: berrange, dasmith, eglynn, kchamart, ndipanov, pbrady, sbauza, sferdjao, sgordon, vromanso, yeylon
Target Milestone: ---   
Target Release: 7.0 (Kilo)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-06-16 08:58:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1185030, 1251948, 1261487    

Description Fabio Massimo Di Nitto 2015-06-10 13:43:50 UTC
Description of problem:

while testing instance HA, we failed a compute node hard. There were 3 or 4 instances running on given compute node.

All but one instance have been rebuilt on another compute node.

One specific instance was still "running" on the fail compute node, based on nova list, nova show <instance>.

Rebuilt was not started/completed till the compute node started again 10/15 minutes later.

Version-Release number of selected component (if applicable):

openstack-nova-common-2015.1.0-4.el7ost.noarch
openstack-nova-console-2015.1.0-4.el7ost.noarch
openstack-nova-scheduler-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch


How reproducible:

randomly


Steps to Reproduce:
1. deploy N instance
2. fail a compute node
3. request evacuation using fence_compute code/logic as written by Sylvain
4. notice instances not being rebuilt.

Actual results:

Instance was in SHUTDOWN after the compute node was able to start again. The instance had to be recovered manually

Expected results:

Instance starts happily by itself.

Additional info:

http://mrg-01.mpc.lab.eng.bos.redhat.com/sosreports/

This event happened within the last 14 hours in those sosreports. Please make sure to download them fast because they will vanish at somepoint (Eoghan is aware of it).

Comment 3 Fabio Massimo Di Nitto 2015-06-16 08:58:02 UTC
After more in depth investigation, the problem has been isolated to pacemaker. Not a nova bug (we already have it under the radar in pacemaker).