DescriptionFabio Massimo Di Nitto
2015-06-10 13:43:50 UTC
Description of problem:
while testing instance HA, we failed a compute node hard. There were 3 or 4 instances running on given compute node.
All but one instance have been rebuilt on another compute node.
One specific instance was still "running" on the fail compute node, based on nova list, nova show <instance>.
Rebuilt was not started/completed till the compute node started again 10/15 minutes later.
Version-Release number of selected component (if applicable):
openstack-nova-common-2015.1.0-4.el7ost.noarch
openstack-nova-console-2015.1.0-4.el7ost.noarch
openstack-nova-scheduler-2015.1.0-4.el7ost.noarch
openstack-nova-novncproxy-2015.1.0-4.el7ost.noarch
openstack-nova-conductor-2015.1.0-4.el7ost.noarch
openstack-nova-api-2015.1.0-4.el7ost.noarch
python-nova-2015.1.0-4.el7ost.noarch
python-novaclient-2.23.0-1.el7ost.noarch
How reproducible:
randomly
Steps to Reproduce:
1. deploy N instance
2. fail a compute node
3. request evacuation using fence_compute code/logic as written by Sylvain
4. notice instances not being rebuilt.
Actual results:
Instance was in SHUTDOWN after the compute node was able to start again. The instance had to be recovered manually
Expected results:
Instance starts happily by itself.
Additional info:
http://mrg-01.mpc.lab.eng.bos.redhat.com/sosreports/
This event happened within the last 14 hours in those sosreports. Please make sure to download them fast because they will vanish at somepoint (Eoghan is aware of it).
Comment 3Fabio Massimo Di Nitto
2015-06-16 08:58:02 UTC
After more in depth investigation, the problem has been isolated to pacemaker. Not a nova bug (we already have it under the radar in pacemaker).