Description of problem:
While testing instance HA, we hard-failed a compute node. There were 3 or 4 instances running on that compute node. All but one of them were rebuilt on another compute node. One specific instance was still reported as "running" on the failed compute node, according to nova list and nova show <instance>. The rebuild was not started or completed until the compute node came back up 10 to 15 minutes later.

Version-Release number of selected component (if applicable):
openstack-nova-common-2015.1.0-4.el7ost.noarch
openstack-nova-console-2015.1.0-4.el7ost.noarch
openstack-nova-scheduler-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch

How reproducible:
Randomly

Steps to Reproduce:
1. Deploy N instances.
2. Fail a compute node.
3. Request evacuation using the fence_compute code/logic as written by Sylvain.
4. Notice that some instances are not rebuilt.

Actual results:
The instance was in SHUTDOWN after the compute node was able to start again. The instance had to be recovered manually.

Expected results:
The instance starts by itself on another compute node.

Additional info:
http://mrg-01.mpc.lab.eng.bos.redhat.com/sosreports/
This event happened within the last 14 hours in those sosreports. Please download them soon, because they will vanish at some point (Eoghan is aware of it).
After more in-depth investigation, the problem has been isolated to pacemaker. This is not a nova bug (the issue is already being tracked on the pacemaker side).