Bug 1492204

Summary: Unable to perform cleaning - ipxe error 040ee119 in a simple flat network setup
Product: Red Hat OpenStack Reporter: Dan Yasny <dyasny>
Component: openstack-ironic Assignee: RHOS Maint <rhos-maint>
Status: CLOSED NOTABUG QA Contact: Dan Yasny <dyasny>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 12.0 (Pike) CC: bfournie, dtantsur, dyasny, mburns, mlammon, rhel-osp-director-maint, srevivo
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-09-19 14:35:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Dan Yasny 2017-09-15 19:06:05 UTC
Description of problem:
A simple installation of overcloud Ironic, with the ctlplane network used for provisioning baremetal nodes.

When cleaning is attempted, the nodes boot into iPXE error http://ipxe.org/err/040ee1 and remain in "clean wait" indefinitely:
(overcloud) [stack@undercloud-0 ~]$ openstack baremetal node list
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| UUID                                 | Name     | Instance UUID | Power State | Provisioning State | Maintenance |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+
| 6d0f3803-d1e5-4565-a8bc-4bf92a5cb1db | ironic-1 | None          | power on    | clean wait         | False       |
| d5471d52-bb98-4919-9729-aa6337c10aca | ironic-0 | None          | power on    | clean wait         | False       |
+--------------------------------------+----------+---------------+-------------+--------------------+-------------+

On the controller:
[root@controller-0 ~]# tcpdump -n port 67 and port 68
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:02:00.970424 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:04.910861 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:12.819908 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:28.638163 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:55.939528 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ca:34:c4, length 396
19:02:59.882079 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ca:34:c4, length 396
19:03:07.791158 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ca:34:c4, length 396
19:03:23.609301 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ca:34:c4, length 396
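
The capture above contains only requests. As a minimal sketch (not part of the original report), the following Python helper tallies DHCP requests per client MAC and counts replies in `tcpdump -n` text output. The function name `summarize_dhcp` and the `SAMPLE` excerpt are illustrative, and the pattern assumes that reply lines carry the word "Reply" in the same position as "Request".

```python
import re
from collections import defaultdict

# Matches tcpdump -n lines such as:
#   19:02:00.970424 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
# Assumption: replies appear as "BOOTP/DHCP, Reply, length N".
LINE_RE = re.compile(
    r"^\d{2}:\d{2}:\d{2}\.\d+ IP \S+ > \S+: BOOTP/DHCP, "
    r"(?P<kind>Request|Reply)(?: from (?P<mac>[0-9a-f:]{17}))?"
)


def summarize_dhcp(capture_text):
    """Return ({client_mac: request_count}, reply_count) for a capture."""
    requests = defaultdict(int)
    replies = 0
    for line in capture_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip tcpdump banner lines and unrelated traffic
        if match.group("kind") == "Request":
            requests[match.group("mac")] += 1
        else:
            replies += 1
    return dict(requests), replies


# Illustrative excerpt taken from the capture in this report.
SAMPLE = """\
19:02:00.970424 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:04.910861 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:64:88:48, length 396
19:02:55.939528 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:ca:34:c4, length 396
"""

if __name__ == "__main__":
    per_mac, reply_count = summarize_dhcp(SAMPLE)
    for mac, count in per_mac.items():
        print(f"{mac}: {count} request(s)")
    print(f"replies seen: {reply_count}")
```

A reply count of zero alongside repeated requests from the same MAC is consistent with a client that is broadcasting on a segment where no DHCP server answers it.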



Version-Release number of selected component (if applicable):
python-ironic-lib-2.9.0-0.20170821163713.dcc5a47.el7ost.noarch
openstack-ironic-common-9.1.1-0.20170824135903.d783dff.el7ost.noarch
openstack-ironic-api-9.1.1-0.20170824135903.d783dff.el7ost.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7ost.noarch
puppet-ironic-11.3.1-0.20170825175845.407b7d8.el7ost.noarch
python-ironicclient-1.16.0-0.20170821151022.835c5d4.el7ost.noarch
python-ironic-inspector-client-2.0.0-0.20170814165407.0ccc767.el7ost.noarch
openstack-ironic-conductor-9.1.1-0.20170824135903.d783dff.el7ost.noarch


How reproducible:
always

Steps to Reproduce:
1. Deploy with overcloud (oc) Ironic enabled
2. Try to clean a node

Actual results:
failed as described above

Expected results:
clean should work

Additional info:

Comment 1 Bob Fournier 2017-09-19 02:33:40 UTC
This looks like it's not a bug, just a configuration issue with the virtual network that the baremetal node is using. We see DHCP requests received at the controller's eth0 interface, which is attached to the ctlplane. We do not see DHCP requests received in the qdhcp namespace, which is on the baremetal network.

From the virt-host we can see the following:

This is the interface that the BM host is using. It is attached only to
the "data" network, which corresponds to the ctlplane:
[root@sealusa5 ~]# virsh domiflist ironic-0
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet12     network    data       virtio      52:54:00:64:88:48

From the controller and undercloud-0, we can see that the "data" network
maps to eth0, which is on the ctlplane:

[root@sealusa5 ~]# virsh domiflist controller-0
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet6      network    data       virtio      52:54:00:ab:01:d3  <== eth0
vnet8      network    management virtio      52:54:00:47:e6:2e  <== eth1 & br-isolated
vnet10     network    external   virtio      52:54:00:a9:fc:7a <== eth2 & br-ex

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:ab:01:d3 brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.10/24 brd 192.168.24.255 scope global eth0

[root@sealusa5 ~]# virsh domiflist undercloud-0
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet0      network    data       virtio      52:54:00:0f:4d:2b  <== eth0 & br-ctlplane
vnet1      network    management virtio      52:54:00:a5:0f:66  <== eth1
vnet2      network    external   virtio      52:54:00:f0:02:5a <== eth2

8: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:0f:4d:2b brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.1/24 brd 192.168.24.255 scope global br-ctlplane
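
The comparison of the `virsh domiflist` listings above can be sketched in Python. This is only an illustration, not a tool used in this bug: `parse_domiflist` and `networks_of` are hypothetical helper names, and the choice of "management" as the network the node would need follows the eth1 / br-isolated annotation in the controller listing.

```python
def parse_domiflist(text):
    """Return (interface, source_network, mac) tuples from `virsh domiflist` output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have at least five columns with "network" as the type;
        # this skips the header row and the dashed separator.
        if len(fields) >= 5 and fields[1] == "network":
            rows.append((fields[0], fields[2], fields[4]))
    return rows


def networks_of(text):
    """Return the set of libvirt networks a domain is attached to."""
    return {source for _, source, _ in parse_domiflist(text)}


# Listings copied from this comment (annotations after the MAC are ignored).
IRONIC_0 = """\
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet12     network    data       virtio      52:54:00:64:88:48
"""

CONTROLLER_0 = """\
Interface  Type       Source     Model       MAC
-------------------------------------------------------
vnet6      network    data       virtio      52:54:00:ab:01:d3  <== eth0
vnet8      network    management virtio      52:54:00:47:e6:2e  <== eth1 & br-isolated
vnet10     network    external   virtio      52:54:00:a9:fc:7a <== eth2 & br-ex
"""

if __name__ == "__main__":
    needed = "management"  # assumption: the network behind br-isolated on the controller
    print("ironic-0 networks:", sorted(networks_of(IRONIC_0)))
    print("controller-0 networks:", sorted(networks_of(CONTROLLER_0)))
    print("ironic-0 attached to", needed + ":", needed in networks_of(IRONIC_0))
```

With the listings above, ironic-0 is attached only to "data", so the final check reports False.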

So in order for ironic-0 to access the baremetal network, it should be attached to the management network,
which maps to eth1 and br-isolated on the controller. That would allow its requests to reach tap48ce1bef-b0 on br-int, and it should therefore get a DHCP response back.

    Bridge br-int
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port "tap48ce1bef-b0"
            tag: 1
            Interface "tap48ce1bef-b0"
                type: internal
        Port br-int
            Interface br-int
                type: internal
        Port int-br-isolated
            Interface int-br-isolated
                type: patch
                options: {peer=phy-br-isolated}
        Port patch-tun
            Interface patch-tun
                type: patch
                options: {peer=patch-int}
        Port int-br-ex
            Interface int-br-ex
                type: patch
                options: {peer=phy-br-ex}

Comment 2 Bob Fournier 2017-09-19 14:35:25 UTC
I'm closing this for now, as the reason the BM node is not getting a DHCP response and is logging the iPXE code 040ee119 is that it's sending DHCP requests on the ctlplane, not the baremetal network. This is due to the infrared virtual network setup. I'd like to understand more about how the infrared virsh setup is done for virt networks so we can get cleaning to complete, but I don't see evidence of this being an Ironic bug. We can revisit this once the virt networks are changed.

BTW, including for reference the infrared virtual networking setup Dan had sent: http://infrared.readthedocs.io/en/latest/virsh.html#network-layout