Bug 1225621 - Provisioning fails for one of the overcloud nodes on baremetal env
Summary: Provisioning fails for one of the overcloud nodes on baremetal env
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ga
Target Release: Director
Assignee: Lucas Alvares Gomes
QA Contact: Marius Cornea
URL:
Whiteboard:
Duplicates: 1220933
Depends On:
Blocks:
Reported: 2015-05-27 19:48 UTC by Marius Cornea
Modified: 2015-08-05 13:52 UTC (History)

Fixed In Version: openstack-ironic-2015.1.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
Incorrect ordering of DHCP options in the configuration file caused machines to fail on boot. This fix adds tags to the DHCP options to provide the correct ordering. Machines now chainload into the iPXE ROM and then invoke the HTTP URL to continue the boot. This results in a successful boot.
Clone Of:
Environment:
Last Closed: 2015-08-05 13:52:05 UTC
Target Upstream Version:


Attachments
pxe boot error (36.49 KB, image/png)
2015-05-27 19:48 UTC, Marius Cornea
ironic.conf (1.40 KB, text/plain)
2015-05-28 08:46 UTC, Marius Cornea


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2015:1549 normal SHIPPED_LIVE Red Hat Enterprise Linux OpenStack Platform director Release 2015-08-05 17:49:10 UTC

Description Marius Cornea 2015-05-27 19:48:28 UTC
Created attachment 1030778 [details]
pxe boot error

Description of problem:
My instackenv.json consists of 3 baremetal servers. After I run 'instack-deploy-overcloud --tuskar', one of the nodes gets provisioned and another one enters the 'wait call-back' provision state. The console shows a TFTP file not found error for that node. After some time the 3rd node is used for provisioning and the overcloud deployment can continue.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-0.0.0.post4-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.4-2.el7ost.noarch
openstack-tripleo-image-elements-0.9.3-1.el7ost.noarch
openstack-tripleo-0.0.5-999.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1.dev55-1.el7ost.noarch
openstack-ironic-conductor-2015.1.0-2.el7ost.noarch
python-ironicclient-0.5.1-5.el7ost.noarch
openstack-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-common-2015.1.0-2.el7ost.noarch
python-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-api-2015.1.0-2.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Install undercloud
2. Register nodes
3. Discover nodes 
5. Run instack-deploy-overcloud --tuskar

Actual results:
Provisioning fails for one of the nodes.

Expected results:
Node gets provisioned.

Additional info:
I deleted the overcloud heat stack / ironic nodes multiple times and always get the same result for the same node. I am attaching the console error that's output when the node is trying to boot.

Comment 3 Marius Cornea 2015-05-28 08:46:41 UTC
Created attachment 1031084 [details]
ironic.conf

Attaching the ironic.conf file.

Comment 4 Lucas Alvares Gomes 2015-05-28 12:58:40 UTC
This happens because the order in which Neutron writes the DHCP options to the file may vary, e.g.:

$ cat /var/lib/neutron/dhcp/8e6c5607-fc9a-4479-a616-cdbfb49019ba/opts
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:server-ip-address,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:tftp-server,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:server-ip-address,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:tftp-server,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

You can see that we have two rules for sending the bootfile in response to the PXE request:

1) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

You can see that we have an "!ipxe" tag there, which basically means: if the request doesn't come
from iPXE, ACK the DHCP request with the undionly.kpxe file (the "!" in the tag is a negation). PXE
will then chainload into iPXE and send a fresh DHCP request, which will now come from iPXE.

The DHCP server should then send the iPXE URL (http://10.3.58.1:8088/boot.ipxe).

2) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe

But you can see that rule 2) doesn't explicitly check whether the request actually comes from iPXE (there is no "ipxe" tag), so depending on the order in which Neutron lays down this configuration, a plain PXE request can be answered with rule 2).
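A quick way to spot such rules in an opts file is to look for bootfile-name entries that carry neither an "ipxe" nor an "!ipxe" tag. The snippet below is only a rough sketch of that check: the function names are made up for illustration, and the parsing assumes the simple comma-separated layout shown in the listing above, not every form the file may take.

```python
def parse_opt_line(line):
    """Split one opts line into (tags, option_name, value)."""
    fields = line.strip().split(",")
    # The first field starting with "option:" separates tags from the value.
    idx = next(i for i, f in enumerate(fields) if f.startswith("option:"))
    tags = [f[len("tag:"):] for f in fields[:idx] if f.startswith("tag:")]
    name = fields[idx][len("option:"):]
    value = ",".join(fields[idx + 1:])
    return tags, name, value


def unconditional_bootfile_rules(lines):
    """Return bootfile-name rules that do not test the ipxe tag at all."""
    flagged = []
    for line in lines:
        if not line.strip():
            continue
        tags, name, _value = parse_opt_line(line)
        if name == "bootfile-name" and not {"ipxe", "!ipxe"} & set(tags):
            flagged.append(line.strip())
    return flagged


# Rules for one port, copied from the listing above.
OPTS = """\
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:server-ip-address,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:tftp-server,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe
"""

for rule in unconditional_bootfile_rules(OPTS.splitlines()):
    print("no ipxe check:", rule)
```

For the rules above, only the line carrying the HTTP URL is reported, which is the rule 2) case.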

The patch [1] fixes this problem by telling the DHCP server to ACK with the iPXE URL only if the request comes from an iPXE image (by adding a tag). The rule should then look like:

tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:ipxe,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
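To illustrate why the extra tag removes the ambiguity, here is a simplified model of the tag matching. It is not the DHCP server's real implementation and not the patch itself: it assumes that a rule applies only when all of its tags hold for the request, that "!" negates a tag, and that requests coming from iPXE carry an "ipxe" tag. The function and variable names are hypothetical.

```python
def matching_bootfiles(rules, request_tags):
    """Return the bootfile values whose tag conditions all hold for a request."""
    matches = []
    for rule in rules:
        fields = rule.split(",")
        idx = next(i for i, f in enumerate(fields) if f.startswith("option:"))
        if fields[idx] != "option:bootfile-name":
            continue
        conditions = [f[len("tag:"):] for f in fields[:idx] if f.startswith("tag:")]
        # "!name" holds when the tag is absent, "name" when it is present.
        if all(
            (c[1:] not in request_tags) if c.startswith("!") else (c in request_tags)
            for c in conditions
        ):
            matches.append(",".join(fields[idx + 1:]))
    return matches


PORT = "ee56e5a7-9a80-4e1e-82a8-30701aa06b56"
URL = "http://10.3.58.1:8088/boot.ipxe"

BEFORE = [
    f"tag:{PORT},option:bootfile-name,{URL}",
    f"tag:{PORT},tag:!ipxe,option:bootfile-name,undionly.kpxe",
]
AFTER = [
    f"tag:{PORT},tag:ipxe,option:bootfile-name,{URL}",
    f"tag:{PORT},tag:!ipxe,option:bootfile-name,undionly.kpxe",
]

plain_pxe = {PORT}          # first request, from the PXE ROM
from_ipxe = {PORT, "ipxe"}  # second request, after chainloading

print(matching_bootfiles(BEFORE, plain_pxe))  # two candidates, so order decides
print(matching_bootfiles(AFTER, plain_pxe))   # only undionly.kpxe
print(matching_bootfiles(AFTER, from_ipxe))   # only the iPXE URL
```

In this model, a plain PXE request matches both bootfile rules before the fix, so the answer depends on the order of the file. With the "ipxe" tag added, each kind of request matches exactly one rule.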

The patch [1] has been applied to rdo-manager (branches: mgt-master and mgt-kilo).

Lemme know if it's now fixed for you.

[1] https://github.com/rdo-management/ironic/commit/445132c9152e5ae528c907887b2b943424a9fa55

Comment 5 Marius Cornea 2015-05-28 13:01:39 UTC
Deployment went fine multiple times after applying the provided patch. Thanks!

Comment 6 chris alfonso 2015-06-02 17:49:50 UTC
*** Bug 1220933 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2015-08-05 13:52:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549

