Bug 1308876 - Nodes are not being started at deploy time
Status: CLOSED WORKSFORME
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assigned To: Lucas Alvares Gomes
QA Contact: Toure Dunnon
Keywords: ZStream
Depends On:
Blocks:
 
Reported: 2016-02-16 05:39 EST by Amit Ugol
Modified: 2016-09-06 11:00 EDT
CC List: 4 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-09-06 11:00:52 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
ironic logs (1021.89 KB, application/x-xz)
2016-02-16 05:39 EST, Amit Ugol

Description Amit Ugol 2016-02-16 05:39:52 EST
Created attachment 1127557 [details]
ironic logs

Description of problem:
Redeploying a second time after heat stack-delete overcloud will sometimes fail to start the VMs; from nova's point of view the instances stay in the spawning state forever, and the deployment ultimately fails on timeout.
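
For context, a minimal sketch of how the stuck state shows up from the undercloud, assuming the standard Kilo-era nova and ironic CLIs (commands added for illustration, not taken from the original report):

# instances stuck in spawning stay in BUILD with task_state "spawning"
nova list

# cross-check the corresponding bare-metal nodes
ironic node-list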

Version-Release number of selected component (if applicable):
ironic 2015.1.2-2

How reproducible:
50%

Steps to Reproduce:
1. Delete an overcloud deployment
2. Re-run the same deployment (a command sketch follows below)
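
With the OSP 7 director tooling this roughly amounts to the following (the exact deploy arguments depend on the environment and are not given in the report, so the placeholder below is illustrative):

# tear down the existing overcloud and wait until the stack is gone
heat stack-delete overcloud

# re-run the deployment with the same arguments as the first run
openstack overcloud deploy --templates <same arguments as before>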

Actual results:
VMs remain in the power-off state.


Expected results:
All needed VMs start


Additional info:

This is the status after ~30 minutes into starting the deployment:
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 57459d9b-6b74-4aef-9218-6678a76bb787 | None | b4174381-57af-43e4-9124-1f9f52f89741 | power on    | active          | False       |
| a474aa51-d537-4829-8586-ada9c22e75c6 | None | None                                 | power off   | available       | False       |
| e83c125d-f8b5-4d06-9ace-0b491bafae1a | None | 8d529d51-5035-4671-afd0-594d64804cca | power off   | deploying       | False       |
| 01921b20-f5e9-4298-a157-901cedfafdab | None | 0402aeeb-bbb9-4844-9eb9-c8dc93cd27ee | power on    | active          | False       |
| 32d2b69c-79aa-403e-8bd8-888d76dfbf5e | None | 3b7757d8-332b-4d86-b343-4789b6d9050c | power off   | deploying       | False       |
| 6689daae-ba1b-467e-bf48-833c4aec3dad | None | None                                 | power off   | available       | False       |
| e8f68851-e1cb-46ab-ae40-d2d52483b5fe | None | 51fe1ffc-e3d4-4fbd-a83e-2eae06a1b1a4 | power on    | active          | False       |
| 3945694d-0e54-4211-8b3a-f65d3d0e5f47 | None | 73f87561-9262-46de-b2a6-4daaff6f34ee | power on    | active          | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
Ironic logs are attached.
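
A hedged way to dig into the nodes stuck in "deploying" (UUIDs taken from the table above) would be to check their last_error and power state with the ironic CLI, e.g.:

ironic node-show e83c125d-f8b5-4d06-9ace-0b491bafae1a
ironic node-show 32d2b69c-79aa-403e-8bd8-888d76dfbf5e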
Comment 1 Lucas Alvares Gomes 2016-08-18 09:39:17 EDT
Looking at the logs, this seems to be something to do with the hypervisor:

2016-02-16 04:22:40.455 1230 DEBUG oslo_concurrency.processutils [-] Result was 1 ssh_execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:363
2016-02-16 04:22:40.456 1230 ERROR ironic.drivers.modules.ssh [-] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2
Exit code: 1
Stdout: u'\n'
Stderr: u'2016-02-16 09:22:07.728+0000: 81748: info : libvirt version: 1.2.17, package: 13.el7_2.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-11-23-07:46:04, x86-019.build.eng.bos.redhat.com)\n2016-02-16 09:22:07.728+0000: 81748: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\n2016-02-16 09:22:07.728+0000: 81764: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\nerror: Failed to destroy domain baremetalbrbm_brbm1_2\nerror: internal error: received hangup / error event on socket\n'.

...

Could you verify if you can start these VMs manually using virsh?
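
For reference, manually checking and starting the domains on the virt host would look roughly like this (domain name taken from the log excerpt above; assuming virsh is run on the hypervisor itself and libvirtd is systemd-managed):

# confirm the domains exist and see their current state
virsh --connect qemu:///system list --all

# try starting one of the stuck nodes' backing VMs
virsh --connect qemu:///system start baremetalbrbm_brbm1_2

# the keepalive timeout above suggests libvirtd itself may be unresponsive
systemctl status libvirtd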
Comment 2 Amit Ugol 2016-09-06 11:00:52 EDT
Things have been running more smoothly since then. The CI method has also changed, so the error that led to this issue is now bypassed.
