Bug 1294598 - overcloud nodes keep on rebooting on deploy awaiting DHCP offers
Status: CLOSED CURRENTRELEASE
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 7.0 (Kilo)
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Lucas Alvares Gomes
QA Contact: Toure Dunnon
Keywords: ZStream
Reported: 2015-12-29 01:54 EST by Anand Nande
Modified: 2016-10-31 07:33 EDT
Doc Type: Bug Fix
Last Closed: 2016-10-31 07:33:45 EDT
Type: Bug


Attachments:
Virtual Console Comment #4 Testing (11.96 KB, image/png)
2015-12-31 14:53 EST, Benjamin Schmaus

Description Anand Nande 2015-12-29 01:54:18 EST
On a freshly installed director system, while deploying overcloud nodes,
the overcloud nodes are not receiving a DHCPOFFER and keep rebooting.

Strangely, httpboot/boot.ipxe does not name any interface in the chain pxelinux.cfg line:

 #!ipxe
 # load the MAC-specific file or fail if it's not found
 chain pxelinux.cfg/${mac:hexhyp} || goto error_no_config
 :error_no_config
 echo PXE boot failed. No configuration found for MAC ${net0/mac}
 echo Press any key to reboot...
 prompt --timeout 180
 reboot

I followed the recommendation in BZ 1234601 comment #47, which points to BZ 1267030,
and also tried the iPXE RPMs attached to that BZ:

ipxe-bootimgs-20151005-1.git6847232.el7.test.noarch.rpm
ipxe-roms-20151005-1.git6847232.el7.test.noarch.rpm
ipxe-roms-qemu-20151005-1.git6847232.el7.test.noarch.rpm

and :

$ sudo cp -afv /usr/share/ipxe/undionly.kpxe /tftpboot/undionly.kpxe
‘/usr/share/ipxe/undionly.kpxe’ -> ‘/tftpboot/undionly.kpxe’

and

$ sed -i 's|${mac}|${net0/mac}|g' \
/usr/share/instack-undercloud/ironic-discoverd/os-apply-config/httpboot/discoverd.ipxe \
/usr/lib/python2.7/site-packages/ironic/drivers/modules/boot.ipxe \
/usr/lib/python2.7/site-packages/ironic/drivers/modules/ipxe_config.template \
/httpboot/*.ipxe

With these changes introspection went fine, but the deploy does not succeed:
the nodes keep waiting for a DHCPOFFER and keep rebooting.

During the deploy, I ran the following on the director:

# tcpdump -s 0 -i any -n port 67 and port 68

I could see the DHCP requests and replies.

FWIW : I am using the following deploy command :

$ openstack overcloud deploy --templates  \
--control-flavor control --compute-flavor compute \
--neutron-network-type vxlan --neutron-tunnel-types vxlan


 [stack@osp7 ~]$ ironic node-list
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
| UUID                                 | Name | Instance UUID | Power State | Provision State | Maintenance |
+--------------------------------------+------+---------------+-------------+-----------------+-------------+
| 33b12666-047c-4af4-9b02-d50380b41e48 | None | None          | power off   | available       | False       |
| b850b5f1-d653-4c5d-a25b-2e97c4fc705e | None | None          | power off   | available       | False       |
| 14b91d24-6d7c-401a-8627-ada0512ddb8d | None | None          | power off   | available       | False       |
| cbfa7271-f240-4478-a5ee-f0b7cf283a46 | None | None          | power off   | available       | False       |
+--------------------------------------+------+---------------+-------------+-----------------+-------------+


Actual results: The overcloud nodes keep on rebooting awaiting DHCP offers.


Expected results: The overcloud deploy should complete.


Additional info:
================
Logs are located at:

$ ssh kerb-username@collab-shell.usersys.redhat.com    (enter your kerb-password)
$ cd /cases/01543475

Awaiting the output of the following from the customer:

for i in {79437c02-5e91-43b5-af53-4226b5d8b59e,6b665b11-16ae-4c09-960b-191e06c0e801,1df7919e-1b0b-409d-94af-c5c49aac6d37,e8e9c083-9abc-4a39-ad89-5a3056618a4f}; do ironic node-validate $i && ironic node-port-list $i --detail && ironic node-get-boot-device $i && ironic node-show $i; done
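
The same set of diagnostics can be laid out per node so that no subcommand is left without its UUID. This is only a sketch: `diagnostic_commands` is a hypothetical helper that builds the argument lists and runs nothing.

```python
import shlex

# Subcommands taken from the one-liner above; {uuid} is filled in per node.
TEMPLATES = [
    "ironic node-validate {uuid}",
    "ironic node-port-list {uuid} --detail",
    "ironic node-get-boot-device {uuid}",
    "ironic node-show {uuid}",
]


def diagnostic_commands(node_uuids):
    """Return one argv list per (node, subcommand) pair, in order."""
    return [
        shlex.split(template.format(uuid=uuid))
        for uuid in node_uuids
        for template in TEMPLATES
    ]


commands = diagnostic_commands(["79437c02-5e91-43b5-af53-4226b5d8b59e"])
for argv in commands:
    print(" ".join(argv))
```

Each list could then be handed to `subprocess.run` on the director, with the usual credentials sourced beforehand.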
Comment 3 Anand Nande 2015-12-30 07:41:43 EST
On the director, ovs-vsctl shows that br-ctlplane (untagged) sits on eno53:

[stack@osp7 ~]$ sudo ovs-vsctl show 
cae8ff96-12ca-4bed-b294-44468bb05a0f
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
        Port "tap16b37905-83"
            tag: 1
            Interface "tap16b37905-83"
                type: internal
        Port int-br-ctlplane
            Interface int-br-ctlplane
                type: patch
                options: {peer=phy-br-ctlplane}
    Bridge br-ctlplane
        Port "eno53"
            Interface "eno53"
        Port br-ctlplane
            Interface br-ctlplane
                type: internal
        Port phy-br-ctlplane
            Interface phy-br-ctlplane
                type: patch
                options: {peer=int-br-ctlplane}
    ovs_version: "2.4.0"

And eno53 is not 'tagged' using native vlan:

$ cat etc/sysconfig/network-scripts/ifcfg-eno53
# This file is autogenerated by os-net-config
DEVICE=eno53
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
DEVICETYPE=ovs
TYPE=OVSPort
OVS_BRIDGE=br-ctlplane
BOOTPROTO=none

The switch port has been set to access mode.

The customer also tested whether an IP in the overcloud nodes' VLAN segment
can reach the undercloud system by installing RHEL 7.2 on a prospective
overcloud node. Pinging the director from that node succeeded. The customer
then wiped the disk of the node where RHEL had been installed for testing.
Comment 4 Benjamin Schmaus 2015-12-31 14:51:03 EST
I am able to reproduce this issue in a virtual lab.  What I see is that the MAC address passed at the PXE prompt is that of my second NIC (see the attached screenshot), while the pxelinux.cfg/ directory contains the following symlinks for the correct MAC, the one that should be booted:

[root@ospd pxelinux.cfg]# ls -lart
total 4
lrwxrwxrwx. 1 ironic ironic   53 Dec 31 13:38 52-54-00-5a-0f-28 -> /httpboot/31fff9e3-d536-4d58-966c-763161c966a1/config
lrwxrwxrwx. 1 ironic ironic   53 Dec 31 13:38 5254005a0f28 -> /httpboot/31fff9e3-d536-4d58-966c-763161c966a1/config
drwxr-xr-x. 5 ironic ironic 4096 Dec 31 13:38 ..
lrwxrwxrwx. 1 ironic ironic   53 Dec 31 13:38 52-54-00-ed-8e-1d -> /httpboot/17c3ad04-1e3c-4019-be78-97aa4ea8a48c/config
lrwxrwxrwx. 1 ironic ironic   53 Dec 31 13:38 525400ed8e1d -> /httpboot/17c3ad04-1e3c-4019-be78-97aa4ea8a48c/config
drwxr-xr-x. 2 ironic ironic   92 Dec 31 13:38 .
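
Both link names for a node follow directly from the MAC address of its Ironic port. A small Python sketch of that mapping, useful for checking by hand which file a given NIC would request (`pxe_link_names` is a hypothetical helper, not an Ironic function; the hyphenated form is what iPXE's `${mac:hexhyp}` expands to):

```python
def pxe_link_names(mac):
    """Return the (hyphenated, plain-hex) names seen in pxelinux.cfg/."""
    octets = mac.strip().lower().replace("-", ":").split(":")
    if len(octets) != 6 or not all(len(octet) == 2 for octet in octets):
        raise ValueError("expected six two-digit octets, got %r" % mac)
    for octet in octets:
        int(octet, 16)  # raises ValueError on non-hex input
    return "-".join(octets), "".join(octets)


print(pxe_link_names("52:54:00:5A:0F:28"))
# ('52-54-00-5a-0f-28', '5254005a0f28')
```

If the MAC shown at the PXE prompt maps to a name that is not in the listing, the chain command has nothing to load.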

I had no issues with introspection, but I did have to resort to setting net0 in the discoverd.ipxe file as per BZ#1234601 comment #14.

If I manually change those softlink files above to the correct mac, the host does boot up.  I have yet to see if the deploy is successful, though.
Comment 5 Benjamin Schmaus 2015-12-31 14:51:49 EST
I should note that the setup above is using the 7.2 director and the latest images.
Comment 6 Benjamin Schmaus 2015-12-31 14:53 EST
Created attachment 1110815
Virtual Console Comment #4 Testing
Comment 7 Lucas Alvares Gomes 2016-01-05 07:06:47 EST
(In reply to Anand Nande from comment #0)
> On a freshly installed director system - while deploying overcloud nodes,
> the overcloud nodes are not receiving a DHCPOFFER and keep rebooting.
> 
> In the httpboot/boot.ipxe strangely does not have any interface in chain
> pxelinux.cfg line :
> 
>  #!ipxe
>  # load the MAC-specific file or fail if it's not found
>  chain pxelinux.cfg/${mac:hexhyp} || goto error_no_config
>  :error_no_config
>  echo PXE boot failed. No configuration found for MAC ${net0/mac}
>  echo Press any key to reboot...
>  prompt --timeout 180
>  reboot
> 
> I used the recommendation in : 1234601#c47 which points to BZ 1267030, also
> from that,
> I tried the ipxe rpms attached to the BZ :
> 
> ipxe-bootimgs-20151005-1.git6847232.el7.test.noarch.rpm
> ipxe-roms-20151005-1.git6847232.el7.test.noarch.rpm
> ipxe-roms-qemu-20151005-1.git6847232.el7.test.noarch.rpm
> 
> and :
> 
> $ sudo cp -afv /usr/share/ipxe/undionly.kpxe /tftpboot/undionly.kpxe
> ‘/usr/share/ipxe/undionly.kpxe’ -> ‘/tftpboot/undionly.kpxe’
> 
> and
> 
> $ sed -i 's|${mac}|${net0/mac}|g' \
> /usr/share/instack-undercloud/ironic-discoverd/os-apply-config/httpboot/discoverd.ipxe \
> /usr/lib/python2.7/site-packages/ironic/drivers/modules/boot.ipxe \
> /usr/lib/python2.7/site-packages/ironic/drivers/modules/ipxe_config.template \
> /httpboot/*.ipxe
> 
> Using this the introspection went perfectly fine - but deploy does not
> succeed,
> the nodes keep searching for DHCPOFFER and keep rebooting.
> 

A few things here.

Yeah, inspection will usually go fine because it has a wildcard to boot from any MAC address, whereas during deployment the DHCP server in Neutron will only send a DHCPOFFER if the node boots from the MAC address registered in the Ironic port.
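
The difference between the two phases can be pictured with a toy model. It only illustrates the rule described above and is not Neutron's or dnsmasq's actual logic; `dhcp_would_offer` and the second MAC address are made up for the example.

```python
def dhcp_would_offer(requesting_mac, phase, registered_macs):
    """Toy rule: inspection answers anyone, deployment only known ports."""
    if phase == "inspection":
        return True  # wildcard: any MAC address may boot
    known = {mac.lower() for mac in registered_macs}
    return requesting_mac.lower() in known


# MAC registered as the Ironic port (taken from the listing in comment 4).
ironic_ports = ["52:54:00:5a:0f:28"]
# Hypothetical MAC of a second NIC on the same node.
second_nic = "52:54:00:aa:bb:cc"

print(dhcp_would_offer(second_nic, "inspection", ironic_ports))
print(dhcp_would_offer(second_nic, "deployment", ironic_ports))
print(dhcp_would_offer("52:54:00:5A:0F:28", "deployment", ironic_ports))
```

Under this model, a node that PXE-boots from the unregistered NIC passes inspection but never gets an offer during deployment, which matches the reboot loop reported here.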

...


The replace command above won't affect all the ${mac} variables in the templates, because its pattern matches only the literal text ${mac} and not variants such as ${mac:hexhyp}. See boot.ipxe for example, the chain command:

chain pxelinux.cfg/${mac:hexhyp} || goto error_no_config

If you want to force the use of net0 you have to change it to:

chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config
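
A minimal Python sketch of why the sed command from comment #0 leaves the chain line untouched. The script text is an assumed pre-edit excerpt modeled on the boot.ipxe quoted in this report, `literal_replace` mimics the sed expression, and `force_net0` is a hypothetical helper that also covers the `${mac:<format>}` form.

```python
import re

BOOT_IPXE = """#!ipxe

# load the MAC-specific file or fail if it's not found
chain pxelinux.cfg/${mac:hexhyp} || goto error_no_config

:error_no_config
echo PXE boot failed. No configuration found for MAC ${mac}
"""


def literal_replace(text):
    # Same effect as: sed 's|${mac}|${net0/mac}|g'
    return text.replace("${mac}", "${net0/mac}")


def force_net0(text):
    # Also rewrites ${mac:hexhyp} and similar formatted forms.
    return re.sub(r"\$\{mac(:[a-z]+)?\}", r"${net0/mac\1}", text)


after_sed = literal_replace(BOOT_IPXE)
after_fix = force_net0(BOOT_IPXE)

print("${mac:hexhyp}" in after_sed)       # chain line was not changed
print("${net0/mac:hexhyp}" in after_fix)  # chain line now names net0
```

Only the echo line changes under the literal replacement; the chain line, which decides which config file is loaded, needs the broader pattern or a manual edit.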


...

If you force net0, you have to make sure that the MAC address registered as an Ironic port is the MAC address of the first interface, because Ironic passes that address to Neutron, which configures the DHCP server to answer the DHCP requests from that address.
Comment 8 Benjamin Schmaus 2016-01-05 09:28:16 EST
I have tried the workaround in comment #7 but that does not seem to work.  Even with my ironic port list reflecting the net0 mac address, it still wants to look for the mac address file of the second nic (net1).
Comment 9 Benjamin Schmaus 2016-01-05 09:31:24 EST
I noticed that the boot.ipxe got overwritten again, after I started the deploy.  Once the nodes failed to boot, if I did update boot.ipxe again with comment #7 changes it does seem to boot.  Is it normal for boot.ipxe to get overwritten upon deploy?
Comment 10 Lucas Alvares Gomes 2016-01-05 10:17:43 EST
(In reply to Benjamin Schmaus from comment #9)
> I noticed that the boot.ipxe got overwritten again, after I started the
> deploy.  Once the nodes failed to boot, if I did update boot.ipxe again with
> comment #7 changes it does seem to boot.  Is it normal for boot.ipxe to get
> overwritten upon deploy?

Hi Benjamin,

Yes, the boot.ipxe gets overwritten at deploy time. You can create a custom boot.ipxe script, point the ironic.conf file at it and restart the ironic-conductor service, e.g.:

[pxe]
ipxe_boot_script=/path/to/boot.ipxe

Or, you can change the boot.ipxe script shipped by Ironic (the one the ipxe_boot_script configuration option points to by default).
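
Before restarting the conductor it can help to confirm which script the configuration actually points to. A rough sketch using Python's standard `configparser` on a sample string; `configured_boot_script` is a hypothetical helper, the path is the placeholder from the snippet above, and a real ironic.conf is read by oslo.config, which may treat some edge cases differently.

```python
import configparser

SAMPLE_CONF = """
[pxe]
ipxe_boot_script = /path/to/boot.ipxe
"""


def configured_boot_script(conf_text):
    """Return the [pxe] ipxe_boot_script value, or None if it is unset."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    return parser.get("pxe", "ipxe_boot_script", fallback=None)


print(configured_boot_script(SAMPLE_CONF))
```

A result of None means the option is not set in the file, in which case the script shipped with Ironic is the one that gets used.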

...

For context, boot.ipxe is overwritten because of updates: a patch fixing a bug in the boot.ipxe script would not take effect unless we make sure that the boot.ipxe script passed to the DHCP server is the same as the one pointed to by the ipxe_boot_script configuration option.
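
The effect on hand edits can be modeled in a few lines. This is an illustrative sketch, not Ironic's code: `publish_boot_script` is a hypothetical stand-in for the deploy-time copy, and the file names and directories are temporary ones created for the example.

```python
import os
import shutil
import tempfile


def publish_boot_script(configured_script, httpboot_dir):
    """Copy the configured script over the published boot.ipxe."""
    target = os.path.join(httpboot_dir, "boot.ipxe")
    shutil.copyfile(configured_script, target)
    return target


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "source.ipxe")
    httpboot = os.path.join(tmp, "httpboot")
    os.mkdir(httpboot)
    with open(source, "w") as handle:
        handle.write("chain pxelinux.cfg/${mac:hexhyp}\n")

    published = publish_boot_script(source, httpboot)

    # Hand-edit the published copy, as was done in this report.
    with open(published, "w") as handle:
        handle.write("chain pxelinux.cfg/${net0/mac:hexhyp}\n")

    # The next deploy publishes the configured script again.
    publish_boot_script(source, httpboot)
    with open(published) as handle:
        after_redeploy = handle.read()

print(after_redeploy.strip())
```

The edit survives only if it is made in the file that the configuration option points to.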

We've attempted to not overwrite it on every deploy [0] but we reverted that patch later on [1].

[0] https://review.openstack.org/#/c/218290/
[1] https://review.openstack.org/#/c/219749/
Comment 11 Anand Nande 2016-01-06 01:58:02 EST
(In reply to Benjamin Schmaus from comment #9)
> I noticed that the boot.ipxe got overwritten again, after I started the
> deploy.  Once the nodes failed to boot, if I did update boot.ipxe again with
> comment #7 changes it does seem to boot.  Is it normal for boot.ipxe to get
> overwritten upon deploy?

Hi Benjamin,

So you

- update the boot.ipxe with the following line, setting net0:

chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config

- then you run deploy, which fails to boot the overcloud nodes,
  (nodes are continuously rebooting trying to find a DHCP ip)
  and over-writes boot.ipxe removing the net0 from it?

- (while the deploy is still running) you again modify the boot.ipxe
  and the overcloud nodes receive the DHCP ip and the deploy succeeds?

Is this correct?
Comment 13 Lucas Alvares Gomes 2016-01-06 06:23:35 EST
(In reply to Anand Nande from comment #11)
> (In reply to Benjamin Schmaus from comment #9)
> > I noticed that the boot.ipxe got overwritten again, after I started the
> > deploy.  Once the nodes failed to boot, if I did update boot.ipxe again with
> > comment #7 changes it does seem to boot.  Is it normal for boot.ipxe to get
> > overwritten upon deploy?
> 
> Hi Benjamin,
> 
> So you
> 
> - update the boot.ipxe with following setting the net0:
> 
> chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config
> 
> - then you run deploy, which fails to boot the overcloud nodes,
>   (nodes are continuously rebooting trying to find a DHCP ip)
>   and over-writes boot.ipxe removing the net0 from it?
> 
> - (while the deploy is still running) you again modify the boot.ipxe
>   and the overcloud nodes receive the DHCP ip and the deploy succeeds?
> 

Hi Anand, did you change the [pxe]ipxe_boot_script configuration option to an already modified script and restart the ironic-conductor service?

Ironic will always overwrite the script with the file pointed to by that configuration option, so you need to point it to an already modified boot.ipxe script.
Comment 15 Benjamin Schmaus 2016-01-06 08:00:48 EST
Anand - Correct on your question in comment #11 - but I also learned that it's better to define a custom boot.ipxe with the net0/mac portion changed, due to the overwrite pointed out in comment #10.
Comment 16 chris alfonso 2016-01-06 11:33:21 EST
Anand, after following comment #14 carefully, any change in results?
Comment 17 Anand Nande 2016-01-13 02:01:18 EST
(In reply to Lucas Alvares Gomes from comment #13)
> (In reply to Anand Nande from comment #11)
> > (In reply to Benjamin Schmaus from comment #9)
> > > I noticed that the boot.ipxe got overwritten again, after I started the
> > > deploy.  Once the nodes failed to boot, if I did update boot.ipxe again with
> > > comment #7 changes it does seem to boot.  Is it normal for boot.ipxe to get
> > > overwritten upon deploy?
> > 
> > Hi Benjamin,
> > 
> > So you
> > 
> > - update the boot.ipxe with following setting the net0:
> > 
> > chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config
> > 
> > - then you run deploy, which fails to boot the overcloud nodes,
> >   (nodes are continuously rebooting trying to find a DHCP ip)
> >   and over-writes boot.ipxe removing the net0 from it?
> > 
> > - (while the deploy is still running) you again modify the boot.ipxe
> >   and the overcloud nodes receive the DHCP ip and the deploy succeeds?
> > 
> 
> Hi Anand, did you change the [pxe]ipxe_boot_script configuration option to
> an already modified script and restarted the ironic-conductor service ?
> 
> Ironic will always overwrite the script with the file pointed by that
> configuration option, so you need to point it to an already modified
> boot.ipxe script.

Hi Lucas,

We tried to :

$ cp boot.ipxe custom.ipxe

$ vi custom.ipxe

chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config

- Changed ironic.conf to point to custom.ipxe.

- The ironic logs complained that custom.ipxe was the same as custom.ipxe
  (not sure what this meant).

- Reverted to using boot.ipxe with the fix in https://bugzilla.redhat.com/show_bug.cgi?id=1234601#c29

- The deploy then failed with "no valid host found". I saw that the disk size (GB) did not match between the node and the flavor being assigned to it, corrected the flavors and ran the deploy again.

- It worked this time!
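
The flavor mismatch can be pictured as a simple comparison between a node's registered properties and what the flavor asks for. This is a toy check under an assumed "flavor must fit within the node" rule; the real scheduler filters may be stricter (for example, exact matches), `flavor_fits` is a hypothetical helper, and all the numbers are invented.

```python
# Flavor field -> Ironic node property it is compared against.
FIELD_MAP = {"vcpus": "cpus", "ram": "memory_mb", "disk": "local_gb"}


def flavor_fits(node_properties, flavor):
    """Return True if every flavor requirement fits within the node."""
    return all(
        node_properties.get(node_key, 0) >= flavor[flavor_key]
        for flavor_key, node_key in FIELD_MAP.items()
    )


node = {"cpus": 4, "memory_mb": 8192, "local_gb": 40}
good_flavor = {"vcpus": 4, "ram": 8192, "disk": 40}
bad_flavor = {"vcpus": 4, "ram": 8192, "disk": 80}  # disk larger than node

print(flavor_fits(node, good_flavor))
print(flavor_fits(node, bad_flavor))
```

When no registered node satisfies the flavor, the scheduler has nothing to pick from, which is reported as "no valid host found".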
Comment 18 Lucas Alvares Gomes 2016-01-19 13:50:59 EST
Thanks for the reply

> Hi Lucas,
> 
> We tried to :
> 
> $ cp boot.ipxe custom.ipxe
> 
> $ vi custom.ipxe
> 
> chain pxelinux.cfg/${net0/mac:hexhyp} || goto error_no_config
> 
> - changed  the ironic.conf to point to custom.ipxe
> 
> - the ironic logs complained about custom.ipxe was same as custom.ipxe
>   (not sure what this meant). 

Ouch, AFAIK we don't diff iPXE boot scripts to check their contents. Do you have that log output, please? I will investigate the issue.

Did you restart ironic-conductor after changing ironic.conf?

> 
> - Reverted back to using boot.ipxe with the fix in
> https://bugzilla.redhat.com/show_bug.cgi?id=1234601#c29
> 

Did you revert because of the log message? Or did you try to deploy after pointing it to the custom script, and it failed?

I ask because I don't think we should need a .service that replaces the contents of a file when you can set a custom script in Ironic that already contains the customization.

Cheers,
Lucas
Comment 20 Dave Maley 2016-01-25 18:19:50 EST
Based on comment 17, the issue reported here appears to be resolved. However, it's not clear whether there are any remaining tasks, so I will defer to ENG on closing (NOTABUG).
Comment 21 Dmitry Tantsur 2016-10-31 07:33:45 EDT
Hi!

We've updated the iPXE ROM we ship, and it is supposed to fix this and many more similar issues. If you encounter an issue again, please feel free to open another report.
