Bug 1271289

Summary: overcloud-novacompute stuck in spawning state
Product: [Community] RDO
Component: rdo-manager
Version: Liberty
Hardware: x86_64
OS: Linux
Status: CLOSED EOL
Severity: high
Priority: urgent
Target Milestone: GA
Target Release: Liberty
Reporter: Tzach Shefi <tshefi>
Assignee: Hugh Brock <hbrock>
QA Contact: Shai Revivo <srevivo>
CC: chris.brown, jcoufal, mburns, ohochman, sasha
Doc Type: Bug Fix
Type: Bug
Last Closed: 2017-06-18 06:21:52 UTC
Attachments:
  logs
  Ironic journal output

Description Tzach Shefi 2015-10-13 14:29:18 UTC
Description of problem: While deploying an overcloud on VMs running on a CentOS server, the controller node is listed as active, but the compute node is stuck in the spawning state even after a few hours.


Version-Release number of selected component (if applicable):


How reproducible:
Unsure 

Steps to Reproduce:
1. Set up a CentOS virt environment as described in the guide: https://repos.fedorapeople.org/repos/openstack-m/rdo-manager-docs/liberty/environments/virtual.html
2. Deploy an overcloud (1 compute + 1 controller; a minimal deploy command sketch follows these steps) and wait for the compute node to reach the spawning state, then check the VM states with virsh.
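For reference, a minimal Liberty-era deploy command for this topology (a sketch only; the exact flags are those given in the linked guide):

# Run on the undercloud as the stack user, with stackrc sourced:
openstack overcloud deploy --templates --control-scale 1 --compute-scale 1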

Actual results:
The compute node doesn't complete installation:

[stack@puma53 ~]$ virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     instack                        running
 9     baremetalbrbm_2                running
 -     baremetalbrbm_0                shut off
 -     baremetalbrbm_1                shut off

Expected results:
The compute node should complete installation and its VM should be running.

Additional info:
I'm not sure exactly when overcloudrc is created, but I noticed it's not present; maybe it hasn't been created yet. I figured that if the controller node is active, I might already see this file.

Attaching nova logs from the instack server, as well as virsh logs, in case they help.

Comment 1 Tzach Shefi 2015-10-13 14:31:09 UTC
Created attachment 1082468 [details]
logs

Comment 2 Tzach Shefi 2015-10-14 08:21:10 UTC
More debugging info:

sshd and firewall rules on the virt host are OK, based on the following tests:
I can ssh into the virt host from my laptop as root, which exercises the 10.X.X.X net.
I can also ssh from the instack VM to the virt host, which exercises the 192.168.122.X net (a minimal check follows below).

If the overcloud controller node was created successfully, and it uses the same ssh virt power-on method, I doubt this suddenly stopped working for the compute nodes.
My guess is the problem is something else.
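A minimal manual check of the ssh power-driver path, assuming root@192.168.122.1 is the virt host as reachable from the instack VM (the address is hypothetical for this setup):

# Run from the instack VM; this is roughly what the pxe_ssh driver does to manage VM power:
ssh root@192.168.122.1 'virsh list --all'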

Comment 4 Tzach Shefi 2015-10-14 14:06:17 UTC
Adding ironic journal output (ironic.log); it might shed some more light. I started going over this file a few minutes ago.

The stuck spawning compute node's ID is: 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7
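To filter the journal down to that node (assuming the undercloud conductor service is named openstack-ironic-conductor on this Liberty install):

sudo journalctl -u openstack-ironic-conductor --no-pager | grep 7f9f4f52-3ee6-42d9-9275-ff88582dd6e7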

Comment 5 Tzach Shefi 2015-10-14 14:08:07 UTC
Created attachment 1082835 [details]
Ironic journal output

Comment 6 Alexander Chuzhoy 2015-10-14 22:40:17 UTC
So all nodes except one aren't able to get an IP during PXE boot.
`nova list` shows status: BUILD and task-state: spawning for them.

Running ironic node-port-list on the respective node(s), I see that there are 2 MAC addresses. One of them (top) belongs to the NIC that's used for PXE:

+--------------------------------------+-------------------+                    
| UUID                                 | Address           |                    
+--------------------------------------+-------------------+                    
| cfe91f4c-add7-439b-8df9-998f653710e5 | 00:0a:f7:79:93:2a |                    
| 2ece3a6e-2326-4ecc-b303-b1391e0d259c | c8:1f:66:c7:e9:2b |                    
+--------------------------------------+-------------------+  
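To spot nodes with extra ports in one pass (a sketch reusing the awk idiom from comment 9 below; not verified on this exact setup):

# List the ports registered for every ironic node:
for node in $(ironic node-list | awk '/power/ {print $2}'); do
  echo "=== node $node ==="
  ironic node-port-list $node
done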

The iptables output has this entry (among others):

Chain ironic-inspector (1 references)
target     prot opt source               destination         

DROP       all  --  anywhere             anywhere             MAC 00:0A:F7:79:93:2A
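This chain is maintained by ironic-inspector; a quick way to check whether a node's PXE MAC is being dropped (run on the undercloud):

sudo iptables -L ironic-inspector -n | grep -i 00:0a:f7:79:93:2a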

Running tcpdump on the undercloud, I see the attempted bootp:
02:59:10.438750 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 00:0a:f7:79:93:2a (oui Unknown), length 548
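A capture along these lines shows such requests (br-ctlplane as the undercloud provisioning bridge is an assumption for this setup):

# Show DHCP/BOOTP traffic, including source MACs, on the provisioning network:
sudo tcpdump -i br-ctlplane -en udp port 67 or udp port 68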


At some point I removed the bottom MAC and re-attempted deployment; it completed successfully.
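In this case, that removal amounts to deleting the second port by its UUID from the table above:

ironic port-delete 2ece3a6e-2326-4ecc-b303-b1391e0d259c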

Comment 7 Tzach Shefi 2015-10-15 06:39:14 UTC
Something is odd: my overcloud VMs, including the "active" controller, have only one MAC per VM, which is also evident in the virsh dumpxml files.
   
ironic node-port-list 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 0c48962d-8b6a-440c-8545-92acd6f89aec | 00:60:0c:4e:e2:0f |
+--------------------------------------+-------------------+

[stack@instack ~]$ ironic node-port-list 8738f24c-45b4-4a17-b6ad-8963723c62df
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| a74b757b-09fd-494b-9dd6-1c20d4efabc2 | 00:d7:c4:3f:c9:73 |
+--------------------------------------+-------------------+

[stack@instack ~]$ ironic node-port-list 9f39b3fc-6670-4ee3-9376-ad197bf00760
+--------------------------------------+-------------------+
| UUID                                 | Address           |
+--------------------------------------+-------------------+
| 65f29507-b464-4f38-acf3-ea81563bad8a | 00:02:9d:99:2f:60 |
+--------------------------------------+-------------------+

Shouldn't instack-virt-setup have built the VMs with the needed NICs/networks? Maybe something went wrong during that stage?
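One way to cross-check is to compare the MACs libvirt assigned against the ports ironic registered (the pairing of VM name and node UUID here is illustrative, taken from the output above):

# On the virt host: MACs defined on the VM
virsh dumpxml baremetalbrbm_0 | grep 'mac address'
# On the undercloud: ports ironic knows about for the matching node
ironic node-port-list 4626bf90-7f95-4bd7-8bee-5f5b0a0981c6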

Comment 8 Alexander Chuzhoy 2015-10-16 03:45:41 UTC
Reproduced the behavior in comment #6.
One machine wasn't able to PXE boot during the deployment.
Removed the second MAC shown in ironic node-port-list <node>.
Re-attempted deployment - completed successfully.

Comment 9 Omri Hochman 2015-10-21 16:46:56 UTC
Encountered the same issue when I tried an HA deployment on bare metal:

It looks like 2 out of 3 controllers in the overcloud deployment were stuck in "build", there was no progress, and the deployment eventually failed with a timeout.

Workarounds (after failed deployment):
---------------
(1) heat stack-delete overcloud
(2) for j in $(for i in `ironic node-list|awk '/power/ {print $2}'`; do ironic node-port-list $i|awk '/c8:1f/ {print $2}'; done); do ironic port-delete $j; done
(3) re-run the overcloud deployment command (step 2 is expanded for readability below)
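Step (2) expanded (same behavior; it assumes the stray NICs all share the c8:1f MAC prefix, as in this environment):

# Delete every ironic port whose MAC starts with c8:1f:
for node in $(ironic node-list | awk '/power/ {print $2}'); do
  for port in $(ironic node-port-list $node | awk '/c8:1f/ {print $2}'); do
    ironic port-delete $port
  done
done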

Environment: 
------------
rdo-release-liberty-1.noarch
instack-0.0.8-1.el7.noarch
instack-undercloud-2.1.3-1.el7.noarch
openstack-tripleo-heat-templates-0.8.7-1.el7.noarch
openstack-ironic-inspector-2.2.2-1.el7.noarch

Comment 13 Christopher Brown 2017-06-17 17:33:34 UTC
I think this can be closed as it's very old and I don't personally encounter this issue.