Bug 1308876

Summary: Nodes are not being started at deploy time
Product: Red Hat OpenStack
Component: openstack-ironic
Version: 7.0 (Kilo)
Target Release: 7.0 (Kilo)
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED WORKSFORME
Severity: urgent
Priority: unspecified
Keywords: ZStream
Reporter: Amit Ugol <augol>
Assignee: Lucas Alvares Gomes <lmartins>
QA Contact: Toure Dunnon <tdunnon>
CC: augol, mburns, rhel-osp-director-maint, srevivo
Doc Type: Bug Fix
Type: Bug
Last Closed: 2016-09-06 15:00:52 UTC
Attachments: ironic logs

Description Amit Ugol 2016-02-16 10:39:52 UTC
Created attachment 1127557 [details]
ironic logs

Description of problem:
Redeploying a second time after "heat stack-delete overcloud" sometimes fails to start the VMs; from Nova's point of view the instances stay in the spawning state forever, and the deployment ultimately fails on timeout.

Version-Release number of selected component (if applicable):
ironic 2015.1.2-2

How reproducible:
50%

Steps to Reproduce:
1. delete an overcloud deployment
2. re-run the same deployment
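
On a Kilo-era director setup the two steps above correspond to roughly the following commands; this is a generic sketch, and the deploy flags/template options are placeholders rather than the exact command used in this report:

```shell
# 1. Delete the existing overcloud stack (Kilo still used the heat CLI)
heat stack-delete overcloud

# Wait until the stack is actually gone before redeploying
heat stack-list

# 2. Re-run the same deployment (flags are placeholders, not from this report)
openstack overcloud deploy --templates
```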

Actual results:
VMs remain in power-off state 


Expected results:
All needed VMs start


Additional info:

This is the status after ~30 minutes into starting the deployment:
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 57459d9b-6b74-4aef-9218-6678a76bb787 | None | b4174381-57af-43e4-9124-1f9f52f89741 | power on    | active          | False       |
| a474aa51-d537-4829-8586-ada9c22e75c6 | None | None                                 | power off   | available       | False       |
| e83c125d-f8b5-4d06-9ace-0b491bafae1a | None | 8d529d51-5035-4671-afd0-594d64804cca | power off   | deploying       | False       |
| 01921b20-f5e9-4298-a157-901cedfafdab | None | 0402aeeb-bbb9-4844-9eb9-c8dc93cd27ee | power on    | active          | False       |
| 32d2b69c-79aa-403e-8bd8-888d76dfbf5e | None | 3b7757d8-332b-4d86-b343-4789b6d9050c | power off   | deploying       | False       |
| 6689daae-ba1b-467e-bf48-833c4aec3dad | None | None                                 | power off   | available       | False       |
| e8f68851-e1cb-46ab-ae40-d2d52483b5fe | None | 51fe1ffc-e3d4-4fbd-a83e-2eae06a1b1a4 | power on    | active          | False       |
| 3945694d-0e54-4211-8b3a-f65d3d0e5f47 | None | 73f87561-9262-46de-b2a6-4daaff6f34ee | power on    | active          | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
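
The symptom is visible in the node-list output above: nodes whose provision state is "deploying" but whose power state is still "power off". A quick way to pick those rows out of the table is a small parsing helper; the function below is a hypothetical sketch for triage, not part of Ironic:

```python
# Hypothetical helper: flag Ironic nodes that claim to be deploying
# but whose backing VM was never powered on.
def stuck_nodes(node_list_output):
    stuck = []
    for line in node_list_output.splitlines():
        if not line.startswith("|"):
            continue  # skip the +----+ border lines
        cols = [c.strip() for c in line.strip("|").split("|")]
        if len(cols) != 6 or cols[0] == "UUID":
            continue  # skip the header row and malformed lines
        uuid, name, instance, power, provision, maintenance = cols
        if provision == "deploying" and power == "power off":
            stuck.append(uuid)
    return stuck
```

Run against the table above, this would report the two "deploying"/"power off" nodes (e83c125d-... and 32d2b69c-...).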
Ironic logs are attached.

Comment 1 Lucas Alvares Gomes 2016-08-18 13:39:17 UTC
Looking at the logs, this seems to be something to do with the hypervisor:

2016-02-16 04:22:40.455 1230 DEBUG oslo_concurrency.processutils [-] Result was 1 ssh_execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:363
2016-02-16 04:22:40.456 1230 ERROR ironic.drivers.modules.ssh [-] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2
Exit code: 1
Stdout: u'\n'
Stderr: u'2016-02-16 09:22:07.728+0000: 81748: info : libvirt version: 1.2.17, package: 13.el7_2.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-11-23-07:46:04, x86-019.build.eng.bos.redhat.com)\n2016-02-16 09:22:07.728+0000: 81748: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\n2016-02-16 09:22:07.728+0000: 81764: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\nerror: Failed to destroy domain baremetalbrbm_brbm1_2\nerror: internal error: received hangup / error event on socket\n'.

...

Could you verify if you can start these VMs manually using virsh?
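
For reference, the manual check could look like the following. Only the domain name (baremetalbrbm_brbm1_2) comes from the error above; the rest is a generic virsh sketch, not a command sequence taken from this report:

```shell
# List all domains known to the system libvirt instance
virsh --connect qemu:///system list --all

# Try to start the node's backing VM by hand
virsh --connect qemu:///system start baremetalbrbm_brbm1_2

# The keepalive timeouts in the traceback suggest libvirtd itself
# may be unresponsive, so its state is worth checking too
systemctl status libvirtd
```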

Comment 2 Amit Ugol 2016-09-06 15:00:52 UTC
Things have been running more smoothly since. Also, the CI method has changed, so the error that led to this issue is now bypassed.