Bug 1308876 - Nodes are not being started at deploy time
Summary: Nodes are not being started at deploy time
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Lucas Alvares Gomes
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-16 10:39 UTC by Amit Ugol
Modified: 2016-09-06 15:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-06 15:00:52 UTC
Target Upstream Version:


Attachments
ironic logs (1021.89 KB, application/x-xz)
2016-02-16 10:39 UTC, Amit Ugol

Description Amit Ugol 2016-02-16 10:39:52 UTC
Created attachment 1127557 [details]
ironic logs

Description of problem:
Redeploying a second time after heat stack-delete overcloud will sometimes fail to start the VMs; from Nova's point of view the instances stay in the spawning state indefinitely, and the deployment ultimately fails on timeout.

Version-Release number of selected component (if applicable):
ironic 2015.1.2-2

How reproducible:
50%

Steps to Reproduce:
1. Delete an overcloud deployment
2. Re-run the same deployment (command sketch below)
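
As a rough sketch, this is the sequence run from the undercloud, assuming the standard OSP 7 (Kilo) director CLI; the exact deploy arguments on the affected setup may differ, and "openstack overcloud deploy --templates" is only the default form:

  heat stack-delete overcloud
  # wait for the stack to be fully deleted, then redeploy with the same arguments
  openstack overcloud deploy --templates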

Actual results:
VMs remain in power-off state 


Expected results:
All needed VMs start


Additional info:

This is the status ~30 minutes after starting the deployment:
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 57459d9b-6b74-4aef-9218-6678a76bb787 | None | b4174381-57af-43e4-9124-1f9f52f89741 | power on    | active          | False       |
| a474aa51-d537-4829-8586-ada9c22e75c6 | None | None                                 | power off   | available       | False       |
| e83c125d-f8b5-4d06-9ace-0b491bafae1a | None | 8d529d51-5035-4671-afd0-594d64804cca | power off   | deploying       | False       |
| 01921b20-f5e9-4298-a157-901cedfafdab | None | 0402aeeb-bbb9-4844-9eb9-c8dc93cd27ee | power on    | active          | False       |
| 32d2b69c-79aa-403e-8bd8-888d76dfbf5e | None | 3b7757d8-332b-4d86-b343-4789b6d9050c | power off   | deploying       | False       |
| 6689daae-ba1b-467e-bf48-833c4aec3dad | None | None                                 | power off   | available       | False       |
| e8f68851-e1cb-46ab-ae40-d2d52483b5fe | None | 51fe1ffc-e3d4-4fbd-a83e-2eae06a1b1a4 | power on    | active          | False       |
| 3945694d-0e54-4211-8b3a-f65d3d0e5f47 | None | 73f87561-9262-46de-b2a6-4daaff6f34ee | power on    | active          | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
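(The listing above appears to be ironic node-list output. As a sketch, a node left in "deploying" / "power off" can be inspected further with the ironic CLI, for example:

  ironic node-list
  # UUID below is one of the stuck nodes from the table above
  ironic node-show e83c125d-f8b5-4d06-9ace-0b491bafae1a
)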
Ironic logs are attached.

Comment 1 Lucas Alvares Gomes 2016-08-18 13:39:17 UTC
Looking at the logs, this seems to be something to do with the hypervisor:

2016-02-16 04:22:40.455 1230 DEBUG oslo_concurrency.processutils [-] Result was 1 ssh_execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:363
2016-02-16 04:22:40.456 1230 ERROR ironic.drivers.modules.ssh [-] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2
Exit code: 1
Stdout: u'\n'
Stderr: u'2016-02-16 09:22:07.728+0000: 81748: info : libvirt version: 1.2.17, package: 13.el7_2.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-11-23-07:46:04, x86-019.build.eng.bos.redhat.com)\n2016-02-16 09:22:07.728+0000: 81748: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\n2016-02-16 09:22:07.728+0000: 81764: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\nerror: Failed to destroy domain baremetalbrbm_brbm1_2\nerror: internal error: received hangup / error event on socket\n'.

...

Could you verify if you can start these VMs manually using virsh?
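
For example, something along these lines (same libvirt URI and domain name as in the log above; adjust the domain name for the other stuck nodes):

  # check whether libvirtd responds at all and list all domains
  LC_ALL=C virsh --connect qemu:///system list --all
  # try to power the VM on by hand
  LC_ALL=C virsh --connect qemu:///system start baremetalbrbm_brbm1_2

If virsh itself hangs or reports keepalive errors here as well, the problem is on the libvirt/hypervisor side rather than in Ironic.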

Comment 2 Amit Ugol 2016-09-06 15:00:52 UTC
Things have been running more smoothly since. Also, the CI method used to do this has changed, so the error leading to this issue is now being bypassed.

