Bug 1271289 - overcloud-novacompute stuck in spawning state
Status: CLOSED EOL
Product: RDO
Classification: Community
Component: rdo-manager
Version: Liberty
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: GA
Target Release: Liberty
Assignee: Hugh Brock
QA Contact: Shai Revivo
 
Reported: 2015-10-13 14:29 UTC by Tzach Shefi
Modified: 2017-06-18 06:21 UTC
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2017-06-18 06:21:52 UTC


Attachments
logs (4.12 MB, text/plain), 2015-10-13 14:31 UTC, Tzach Shefi
Ironic journal output (4.91 MB, text/plain), 2015-10-14 14:08 UTC, Tzach Shefi

Description Tzach Shefi 2015-10-13 14:29:18 UTC
Description of problem: While deploying the overcloud on VMs running on a CentOS server, the controller node is listed as active but the compute node is stuck in the spawning state, even after a few hours.


Version-Release number of selected component (if applicable):


How reproducible:
Unsure 

Steps to Reproduce:
1. Deploy the overcloud (1 compute + 1 controller) in a CentOS virt environment using the guide: https://repos.fedorapeople.org/repos/openstack-m/rdo-manager-docs/liberty/environments/virtual.html

2. Wait for the compute node to reach the spawning state, then check the VM states with virsh.

Actual results:
The compute node doesn't complete installation:

[stack@puma53 ~]$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     instack                        running
 9     baremetalbrbm_2                running
 -     baremetalbrbm_0                shut off
 -     baremetalbrbm_1                shut off

Expected results:
The compute node should complete installation and its VM should be running.

Additional info:
I'm not sure exactly when overcloudrc is created, but I noticed it's not found; maybe it hasn't been created yet. I figured that if the controller node is active, I might already see this file.

Adding nova logs from the instack server, as well as virsh logs, in case they help.

Comment 1 Tzach Shefi 2015-10-13 14:31:09 UTC
Created attachment 1082468 [details]
logs

Comment 2 Tzach Shefi 2015-10-14 08:21:10 UTC
More debugging info:

sshd and the firewall rules on the virt host are OK; I tested the following:
I can ssh into the virt host from my laptop as root, checking the 10.X.X.X net.
I can also ssh from the instack VM to the virt host, checking the 192.168.122.X net.
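
For reference, the connectivity checks were along these lines (the X octets are placeholders, as above):

ssh root@10.X.X.X hostname         # from the laptop to the virt host
ssh root@192.168.122.X hostname    # from the instack VM to the virt host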

If the overcloud controller node was created successfully, and it uses the same ssh virt power-on method, I doubt this suddenly stopped working for the compute nodes.
My guess is the problem is something else.

Comment 4 Tzach Shefi 2015-10-14 14:06:17 UTC
Adding the ironic journal output (ironic.log); it might shed some more light. I started going over this file a few minutes ago.

The stuck spawning compute node's ID is: 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
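
To pull the full ironic record for that node (a sketch using the same era's ironic CLI as elsewhere in this bug; the UUID is the one above):

ironic node-show 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7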

Comment 5 Tzach Shefi 2015-10-14 14:08:07 UTC
Created attachment 1082835 [details]
Ironic journal output

Comment 6 Alexander Chuzhoy 2015-10-14 22:40:17 UTC
So all nodes except one aren't able to get an IP during PXE boot.
`nova list` for them shows  status: BUILD  and task-state: spawning.

Running ironic node-port-list on the respective node(s), I see that there are 2 MAC addresses. One of them (the top one) belongs to the NIC that's used for PXE:

+--------------------------------------+-------------------+                    
| UUID                                 | Address           |                    
+--------------------------------------+-------------------+                    
| cfe91f4c-add7-439b-8df9-998f653710e5 | 00:0a:f7:79:93:2a |                    
| 2ece3a6e-2326-4ecc-b303-b1391e0d259c | c8:1f:66:c7:e9:2b |                    
+--------------------------------------+-------------------+  
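
A quick way to spot nodes that have picked up an extra port (a sketch; the awk pattern assumes the default table output of the ironic CLI):

for i in $(ironic node-list | awk '/power/ {print $2}'); do
  echo "== node $i"
  ironic node-port-list $i
done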

iptables has this entry (among others):

Chain ironic-inspector (1 references)
target     prot opt source               destination         

DROP       all  --  anywhere             anywhere             MAC 00:0A:F7:79:93:2A
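
To see the whole chain with rule numbers (a sketch; run on the undercloud, where ironic-inspector manages this chain to filter PXE/DHCP by MAC):

sudo iptables -L ironic-inspector -n -v --line-numbers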

Running tcpdump on the undercloud I see the attempted bootp:
02:59:10.438750 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:0a:f7:79:93:2a (oui Unknown), length 548
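
The capture was along these lines (the interface name is an assumption for a default undercloud; adjust to your setup):

sudo tcpdump -n -i br-ctlplane port 67 or port 68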


At some point I removed the bottom MAC and re-attempted the deployment; it completed successfully.
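
The removal itself is a single call per stray port, e.g. for the bottom port UUID from the node-port-list output above:

ironic port-delete 2ece3a6e-2326-4ecc-b303-b1391e0d259c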

Comment 7 Tzach Shefi 2015-10-15 06:39:14 UTC
Something is odd: my overcloud VMs, including the "active" controller, have only one MAC per VM, which is also evident in the virsh dumpxml output.
   
ironic node-port-list 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 0c48962d-8b6a-440c-8545-92acd6f89aec | 00:60:0c:4e:e2:0f |
+--------------------------------------+-------------------+

[stack@instack ~]$ ironic node-port-list 8738f24c-45b4-4a17-b6ad-8963723c62df
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| a74b757b-09fd-494b-9dd6-1c20d4efabc2 | 00:d7:c4:3f:c9:73 |
+--------------------------------------+-------------------+

[stack@instack ~]$ ironic node-port-list 9f39b3fc-6670-4ee3-9376-ad197bf00760
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 65f29507-b464-4f38-acf3-ea81563bad8a | 00:02:9d:99:2f:60 |
+--------------------------------------+-------------------+

Shouldn't instack-virt-setup have built the VMs with the needed NICs/networks? Maybe something went wrong during that stage?
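
One way to check what instack-virt-setup actually wired up (a sketch; domain names as in the virsh list output above):

for vm in baremetalbrbm_0 baremetalbrbm_1 baremetalbrbm_2; do
  echo "== $vm"
  virsh dumpxml $vm | grep 'mac address'
done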

Comment 8 Alexander Chuzhoy 2015-10-16 03:45:41 UTC
Reproduced the behavior in comment #6.
One machine wasn't able to PXE boot during the deployment.
Removed the second MAC shown in ironic node-port-list <node>.
Re-attempted deployment - completed successfully.

Comment 9 Omri Hochman 2015-10-21 16:46:56 UTC
Encountered the same issue when I tried an HA deployment on bare metal:

It looks like 2 out of 3 controllers in the overcloud deployment were stuck in "build" with no progress, and the deployment eventually failed with a timeout.

Workaround (after a failed deployment):
---------------
(1) heat stack-delete overcloud 
(2) for j in $(for i in `ironic node-list|awk '/power/ {print $2}'`; do ironic node-port-list $i|awk '/c8:1f/ {print $2}'; done); do ironic port-delete $j; done
    (an expanded, readable version of this loop is sketched below)
(3) re-run the overcloud deployment command
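
The step (2) one-liner, expanded for readability (same logic: delete every ironic port whose MAC starts with c8:1f):

for i in $(ironic node-list | awk '/power/ {print $2}'); do
  for j in $(ironic node-port-list $i | awk '/c8:1f/ {print $2}'); do
    ironic port-delete $j
  done
done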

Environment: 
------------
rdo-release-liberty-1.noarch
instack-0.0.8-1.el7.noarch
instack-undercloud-2.1.3-1.el7.noarch
openstack-tripleo-heat-templates-0.8.7-1.el7.noarch
openstack-ironic-inspector-2.2.2-1.el7.noarch

Comment 13 Christopher Brown 2017-06-17 17:33:34 UTC
I think this can be closed as it's very old and I don't personally encounter this issue.

