Bug 1308876 - Nodes are not being started at deploy time
Summary: Nodes are not being started at deploy time
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 7.0 (Kilo)
Assignee: Lucas Alvares Gomes
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-16 10:39 UTC by Amit Ugol
Modified: 2016-09-06 15:00 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-09-06 15:00:52 UTC
Target Upstream Version:


Attachments
ironic logs (1021.89 KB, application/x-xz)
2016-02-16 10:39 UTC, Amit Ugol

Description Amit Ugol 2016-02-16 10:39:52 UTC
Created attachment 1127557 [details]
ironic logs

Description of problem:
Redeploying a second time after heat stack-delete overcloud will sometimes fail to start the VMs; from Nova's point of view the instances stay in the spawning state indefinitely, and the deployment ultimately fails on timeout.

Version-Release number of selected component (if applicable):
ironic 2015.1.2-2

How reproducible:
50%

Steps to Reproduce:
1. Delete an overcloud deployment
2. Re-run the same deployment (command sketch below)
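
As a rough sketch, this is the sequence run from the undercloud, assuming the standard OSP 7 (Kilo) director CLI; the exact deploy arguments on the affected setup may differ, and "openstack overcloud deploy --templates" is only the default form:

  heat stack-delete overcloud
  # wait for the stack to be fully deleted, then redeploy with the same arguments
  openstack overcloud deploy --templates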

Actual results:
VMs remain in power-off state 


Expected results:
All needed VMs start


Additional info:

This is the status ~30 minutes after starting the deployment:
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID                        | Power State | Provision State | Maintenance |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
| 57459d9b-6b74-4aef-9218-6678a76bb787 | None | b4174381-57af-43e4-9124-1f9f52f89741 | power on    | active          | False       |
| a474aa51-d537-4829-8586-ada9c22e75c6 | None | None                                 | power off   | available       | False       |
| e83c125d-f8b5-4d06-9ace-0b491bafae1a | None | 8d529d51-5035-4671-afd0-594d64804cca | power off   | deploying       | False       |
| 01921b20-f5e9-4298-a157-901cedfafdab | None | 0402aeeb-bbb9-4844-9eb9-c8dc93cd27ee | power on    | active          | False       |
| 32d2b69c-79aa-403e-8bd8-888d76dfbf5e | None | 3b7757d8-332b-4d86-b343-4789b6d9050c | power off   | deploying       | False       |
| 6689daae-ba1b-467e-bf48-833c4aec3dad | None | None                                 | power off   | available       | False       |
| e8f68851-e1cb-46ab-ae40-d2d52483b5fe | None | 51fe1ffc-e3d4-4fbd-a83e-2eae06a1b1a4 | power on    | active          | False       |
| 3945694d-0e54-4211-8b3a-f65d3d0e5f47 | None | 73f87561-9262-46de-b2a6-4daaff6f34ee | power on    | active          | False       |
+--------------------------------------+------+--------------------------------------+-------------+-----------------+-------------+
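(The listing above appears to be ironic node-list output. As a sketch, a node left in "deploying" / "power off" can be inspected further with the ironic CLI, for example:

  ironic node-list
  # UUID below is one of the stuck nodes from the table above
  ironic node-show e83c125d-f8b5-4d06-9ace-0b491bafae1a
)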
Ironic logs are attached.

Comment 1 Lucas Alvares Gomes 2016-08-18 13:39:17 UTC
Looking at the logs, this seems to be something to do with the hypervisor:

2016-02-16 04:22:40.455 1230 DEBUG oslo_concurrency.processutils [-] Result was 1 ssh_execute /usr/lib/python2.7/site-packages/oslo_concurrency/processutils.py:363
2016-02-16 04:22:40.456 1230 ERROR ironic.drivers.modules.ssh [-] Cannot execute SSH cmd LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2. Reason: Unexpected error while running command.
Command: LC_ALL=C /usr/bin/virsh --connect qemu:///system destroy baremetalbrbm_brbm1_2
Exit code: 1
Stdout: u'\n'
Stderr: u'2016-02-16 09:22:07.728+0000: 81748: info : libvirt version: 1.2.17, package: 13.el7_2.2 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-11-23-07:46:04, x86-019.build.eng.bos.redhat.com)\n2016-02-16 09:22:07.728+0000: 81748: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\n2016-02-16 09:22:07.728+0000: 81764: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f2d25f7cf40 after 6 keepalive messages in 35 seconds\nerror: Failed to destroy domain baremetalbrbm_brbm1_2\nerror: internal error: received hangup / error event on socket\n'.

...

Could you verify if you can start these VMs manually using virsh?
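
For example, something along these lines (same libvirt URI and domain name as in the log above; adjust the domain name for the other stuck nodes):

  # check whether libvirtd responds at all and list all domains
  LC_ALL=C virsh --connect qemu:///system list --all
  # try to power the VM on by hand
  LC_ALL=C virsh --connect qemu:///system start baremetalbrbm_brbm1_2

If virsh itself hangs or reports keepalive errors here as well, the problem is on the libvirt/hypervisor side rather than in Ironic.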

Comment 2 Amit Ugol 2016-09-06 15:00:52 UTC
Things have been running more smoothly since. Also, the CI method used to do this has changed, so the error leading to this issue is now being bypassed.

