Bug 1225621 - Provisioning fails for one of the overcloud nodes on baremetal env
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ga
Target Release: Director
Assigned To: Lucas Alvares Gomes
QA Contact: Marius Cornea
Keywords: Triaged
Duplicates: 1220933
Depends On:
Blocks:
Reported: 2015-05-27 15:48 EDT by Marius Cornea
Modified: 2015-08-05 09:52 EDT

See Also:
Fixed In Version: openstack-ironic-2015.1.0-3.el7ost
Doc Type: Bug Fix
Doc Text:
Incorrect ordering of DHCP options in the configuration file caused machines to fail on boot. This fix adds a tag to the DHCP option so that the correct option is served regardless of ordering. Machines now chainload into the iPXE ROM and then request the HTTP URL to continue the boot. This results in a successful boot.
Last Closed: 2015-08-05 09:52:05 EDT
Type: Bug


Attachments
pxe boot error (36.49 KB, image/png)
2015-05-27 15:48 EDT, Marius Cornea
ironic.conf (1.40 KB, text/plain)
2015-05-28 04:46 EDT, Marius Cornea


External Trackers
Tracker: Red Hat Product Errata
ID: RHEA-2015:1549
Priority: normal
Status: SHIPPED_LIVE
Summary: Red Hat Enterprise Linux OpenStack Platform director Release
Last Updated: 2015-08-05 13:49:10 EDT

Description Marius Cornea 2015-05-27 15:48:28 EDT
Created attachment 1030778
pxe boot error

Description of problem:
My instackenv.json lists 3 baremetal servers. After I run 'instack-deploy-overcloud --tuskar', one of the nodes gets provisioned and another one enters the 'wait call-back' provision state. The console shows a TFTP file not found error for that node. After some time the 3rd node is used for provisioning and the overcloud deployment can continue.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-0.0.0.post4-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.4-2.el7ost.noarch
openstack-tripleo-image-elements-0.9.3-1.el7ost.noarch
openstack-tripleo-0.0.5-999.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1.dev55-1.el7ost.noarch
openstack-ironic-conductor-2015.1.0-2.el7ost.noarch
python-ironicclient-0.5.1-5.el7ost.noarch
openstack-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-common-2015.1.0-2.el7ost.noarch
python-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-api-2015.1.0-2.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Install undercloud
2. Register nodes
3. Discover nodes 
5. Run instack-deploy-overcloud --tuskar

Actual results:
Provisioning fails for one of the nodes.

Expected results:
Node gets provisioned.

Additional info:
I deleted the overcloud heat stack / ironic nodes multiple times and always get the same result for the same node. I am attaching the console error that's output when the node is trying to boot.
Comment 3 Marius Cornea 2015-05-28 04:46:41 EDT
Created attachment 1031084
ironic.conf

Attaching the ironic.conf file.
Comment 4 Lucas Alvares Gomes 2015-05-28 08:58:40 EDT
This happens because, when Neutron lays down the DHCP options, the order in which they are written to the file may vary, e.g.:

$ cat /var/lib/neutron/dhcp/8e6c5607-fc9a-4479-a616-cdbfb49019ba/opts
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:server-ip-address,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:tftp-server,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:server-ip-address,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:tftp-server,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

You can see that we have 2 rules for sending the bootfile in response to a PXE request:

1) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

Here there is an "!ipxe" tag, which basically means: if the request doesn't come from iPXE, ACK the DHCP request with the undionly.kpxe file (the "!" in the tag is a negation). PXE will then chainload into iPXE and send a fresh DHCP request, which will now come from iPXE.

The DHCP server should then send the iPXE URL (http://10.3.58.1:8088/boot.ipxe):

2) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe

But 2) doesn't explicitly check whether the request actually comes from iPXE (there is no "ipxe" tag), so depending on the order in which Neutron lays down this configuration, a plain PXE request can be answered with 2).
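
The ambiguity described above can be checked mechanically. What follows is only a minimal sketch, not code from Ironic or Neutron: it parses lines in the layout of the opts file shown earlier and reports every bootfile-name rule that carries neither an "ipxe" nor an "!ipxe" tag. The function names parse_opt_line and find_unqualified_bootfile_rules are hypothetical, and the parser assumes the simple comma-separated layout seen in this report.

```python
def parse_opt_line(line):
    """Split one opts line into (tags, option_name, value).

    Assumes the layout seen in this report:
    tag:<id>[,tag:<id>...],option:<name>,<value>
    """
    fields = line.strip().split(",")
    tags = []
    idx = 0
    # Collect every leading "tag:..." field, keeping a "!" prefix if present.
    while idx < len(fields) and fields[idx].startswith("tag:"):
        tags.append(fields[idx][len("tag:"):])
        idx += 1
    if idx >= len(fields) or not fields[idx].startswith("option:"):
        raise ValueError("unexpected line format: %r" % line)
    name = fields[idx][len("option:"):]
    # Re-join the remainder in case the value itself contains commas.
    value = ",".join(fields[idx + 1:])
    return tags, name, value


def find_unqualified_bootfile_rules(lines):
    """Return bootfile-name rules with no "ipxe" or "!ipxe" tag."""
    unqualified = []
    for line in lines:
        if not line.strip():
            continue
        tags, name, _value = parse_opt_line(line)
        if name == "bootfile-name" and "ipxe" not in tags and "!ipxe" not in tags:
            unqualified.append(line.strip())
    return unqualified


# Two rules copied from the opts file quoted in this comment.
sample = [
    "tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe",
    "tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe",
]

for rule in find_unqualified_bootfile_rules(sample):
    print("unqualified bootfile rule:", rule)
```

Run against the two sample rules, only the rule pointing at the boot.ipxe URL is reported, since it is the one that can match a request from any client.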

The patch [1] fixes this problem by telling the DHCP server to only ACK with the iPXE URL if the request comes from an iPXE image (by adding a tag). The rule should then look like:

tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:ipxe,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
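
To make the intended end state concrete, here is a small sketch that builds the pair of bootfile-name rules for one port so that each rule carries an explicit iPXE condition. It is an illustration only and is not the code from the patch; the function name bootfile_rules and its parameters are hypothetical.

```python
def bootfile_rules(port_tag, ipxe_url, chainload_file, ipxe_tag="ipxe"):
    """Build two mutually exclusive bootfile-name rules for one port.

    The first rule matches requests that do NOT carry the iPXE tag and
    hands out the chainload image; the second matches only requests that
    DO carry it and hands out the iPXE script URL.
    """
    return [
        "tag:%s,tag:!%s,option:bootfile-name,%s" % (port_tag, ipxe_tag, chainload_file),
        "tag:%s,tag:%s,option:bootfile-name,%s" % (port_tag, ipxe_tag, ipxe_url),
    ]


# Values taken from the opts file quoted in this comment.
rules = bootfile_rules(
    "ee56e5a7-9a80-4e1e-82a8-30701aa06b56",
    "http://10.3.58.1:8088/boot.ipxe",
    "undionly.kpxe",
)
for rule in rules:
    print(rule)
```

Because every rule now names either "ipxe" or "!ipxe", the result no longer depends on the order in which the lines are written to the file.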

The patch [1] has been applied to rdo-manager (branches: mgt-master and mgt-kilo).

Lemme know if it's now fixed for you.

[1] https://github.com/rdo-management/ironic/commit/445132c9152e5ae528c907887b2b943424a9fa55
Comment 5 Marius Cornea 2015-05-28 09:01:39 EDT
Deployment went fine multiple times after applying the provided patch. Thanks!
Comment 6 chris alfonso 2015-06-02 13:49:50 EDT
*** Bug 1220933 has been marked as a duplicate of this bug. ***
Comment 10 errata-xmlrpc 2015-08-05 09:52:05 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549
