Bug 1225621

Summary: Provisioning fails for one of the overcloud nodes on baremetal env
Product: Red Hat OpenStack Reporter: Marius Cornea <mcornea>
Component: rhosp-director Assignee: Lucas Alvares Gomes <lmartins>
Status: CLOSED ERRATA QA Contact: Marius Cornea <mcornea>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo) CC: calfonso, dmacpher, lmartins, mburns, rhel-osp-director-maint, sasha, sclewis
Target Milestone: ga Keywords: Triaged
Target Release: Director   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-ironic-2015.1.0-3.el7ost Doc Type: Bug Fix
Doc Text:
Incorrect ordering of DHCP options in the configuration file caused machines to fail on boot. This fix adds tags to the DHCP options to provide the correct ordering. Machines now chainload the boot with the iPXE ROM and then invoke the HTTP URL to continue the boot. This results in a successful boot.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-08-05 13:52:05 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description       Flags
pxe boot error    none
ironic.conf       none

Description Marius Cornea 2015-05-27 19:48:28 UTC
Created attachment 1030778 [details]
pxe boot error

Description of problem:
My instackenv.json consists of 3 baremetal servers. After I run 'instack-deploy-overcloud --tuskar', one of the nodes gets provisioned and another one enters the 'wait call-back' provision state. The console shows a TFTP "file not found" error for that node. After some time the 3rd node is used for provisioning and the overcloud deployment can continue.

Version-Release number of selected component (if applicable):
openstack-tripleo-common-0.0.0.post4-1.el7ost.noarch
openstack-tripleo-heat-templates-0.8.4-2.el7ost.noarch
openstack-tripleo-image-elements-0.9.3-1.el7ost.noarch
openstack-tripleo-0.0.5-999.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1.dev55-1.el7ost.noarch
openstack-ironic-conductor-2015.1.0-2.el7ost.noarch
python-ironicclient-0.5.1-5.el7ost.noarch
openstack-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-common-2015.1.0-2.el7ost.noarch
python-ironic-discoverd-1.1.0-1.el7ost.noarch
openstack-ironic-api-2015.1.0-2.el7ost.noarch

How reproducible:


Steps to Reproduce:
1. Install undercloud
2. Register nodes
3. Discover nodes 
5. Run instack-deploy-overcloud --tuskar

Actual results:
Provision fails for one of the nodes. 

Expected results:
Node gets provisioned.

Additional info:
I deleted the overcloud heat stack / ironic nodes multiple times and always get the same result for the same node. I am attaching the console error that's output when the node is trying to boot.

Comment 3 Marius Cornea 2015-05-28 08:46:41 UTC
Created attachment 1031084 [details]
ironic.conf

Attaching the ironic.conf file.

Comment 4 Lucas Alvares Gomes 2015-05-28 12:58:40 UTC
This happens because, when Neutron is laying down the DHCP options, the order in which they are written to the file may vary, e.g.:

$ cat /var/lib/neutron/dhcp/8e6c5607-fc9a-4479-a616-cdbfb49019ba/opts
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:server-ip-address,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,option:tftp-server,10.3.58.1
tag:f59d3d7e-5cf3-49b3-9a38-6dc3b9887e7c,tag:!ipxe,option:bootfile-name,undionly.kpxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:server-ip-address,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:tftp-server,10.3.58.1
tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

You can see that we have 2 rules for sending the bootfile in response to a PXE request:

1) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:!ipxe,option:bootfile-name,undionly.kpxe

You can see that we have an "!ipxe" tag there, which basically means: if the request doesn't come
from iPXE, ACK the DHCP request with the undionly.kpxe file (the "!" in the tag is a negation). So PXE
will then chainload into iPXE and send a fresh DHCP request, which will now come from iPXE.

And then the DHCP server should send the iPXE URL (http://10.3.58.1:8088/boot.ipxe)

2) tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe

But you can see that 2) doesn't explicitly check if the request actually comes from iPXE (no "ipxe" tag), so depending on the order in which Neutron lays down this configuration, a plain PXE request can be answered with 2).
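
To make the ambiguity concrete, here is a minimal Python sketch of the tag matching described above. It is not dnsmasq's or Neutron's actual code: the helper names (parse_rule, matching_bootfiles) are made up for illustration, and it assumes a rule applies when all of its plain tags are set on the request and none of its negated ("!") tags are. It only lists which bootfile-name rules apply; it does not model which one the DHCP server ends up picking.

```python
def parse_rule(line):
    """Split one opts line into (required_tags, forbidden_tags, option, value)."""
    required, forbidden = set(), set()
    fields = line.strip().split(",")
    # Leading "tag:..." fields are conditions; "tag:!x" means "x must NOT be set".
    while fields and fields[0].startswith("tag:"):
        tag = fields.pop(0)[len("tag:"):]
        if tag.startswith("!"):
            forbidden.add(tag[1:])
        else:
            required.add(tag)
    option = fields[0][len("option:"):]
    value = ",".join(fields[1:])
    return required, forbidden, option, value


def matching_bootfiles(opts_text, request_tags):
    """Return every bootfile-name value whose tag conditions fit the request."""
    matches = []
    for line in opts_text.splitlines():
        if not line.strip():
            continue
        required, forbidden, option, value = parse_rule(line)
        if option != "bootfile-name":
            continue
        if required <= request_tags and not (forbidden & request_tags):
            matches.append(value)
    return matches


PORT = "ee56e5a7-9a80-4e1e-82a8-30701aa06b56"

# The two bootfile-name rules for this port, as written before the fix.
OPTS = f"""\
tag:{PORT},option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
tag:{PORT},tag:!ipxe,option:bootfile-name,undionly.kpxe
"""

# Plain PXE firmware: only the port tag is set, so both rules apply.
print(matching_bootfiles(OPTS, {PORT}))
# ['http://10.3.58.1:8088/boot.ipxe', 'undionly.kpxe']

# Request coming from iPXE: the "!ipxe" rule is excluded, only the URL is left.
print(matching_bootfiles(OPTS, {PORT, "ipxe"}))
# ['http://10.3.58.1:8088/boot.ipxe']
```

For the plain PXE request two rules apply, so the answer depends on how the file happens to be ordered.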

The patch [1] fixes this problem by telling the DHCP server to only ACK with the iPXE URL if the request is coming from an iPXE image (by adding a tag). So it should look like:

tag:ee56e5a7-9a80-4e1e-82a8-30701aa06b56,tag:ipxe,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe
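
One way to check whether an opts file already has the fix is to look for bootfile-name rules that hand out an HTTP URL without requiring the "ipxe" tag. The Python sketch below does that; the function name is made up for illustration and it assumes the simple comma-separated layout shown in this comment.

```python
def untagged_ipxe_url_rules(opts_text):
    """Return bootfile-name rules that send an HTTP URL without requiring tag:ipxe."""
    suspicious = []
    for line in opts_text.splitlines():
        fields = line.strip().split(",")
        if "option:bootfile-name" not in fields:
            continue
        value = fields[-1]
        # A URL bootfile only makes sense for iPXE, so the rule should carry tag:ipxe.
        if value.startswith("http://") and "tag:ipxe" not in fields:
            suspicious.append(line.strip())
    return suspicious


PORT = "ee56e5a7-9a80-4e1e-82a8-30701aa06b56"
BEFORE = f"tag:{PORT},option:bootfile-name,http://10.3.58.1:8088/boot.ipxe"
AFTER = f"tag:{PORT},tag:ipxe,option:bootfile-name,http://10.3.58.1:8088/boot.ipxe"

print(untagged_ipxe_url_rules(BEFORE))  # the unfixed rule is reported
print(untagged_ipxe_url_rules(AFTER))   # []
```

On an undercloud the input would be the content of the opts file under /var/lib/neutron/dhcp/ for the provisioning network; an empty result means every URL rule carries the tag.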

The patch [1] has been applied to rdo-manager (branches: mgt-master and mgt-kilo).

Lemme know if it's now fixed for you.

[1] https://github.com/rdo-management/ironic/commit/445132c9152e5ae528c907887b2b943424a9fa55

Comment 5 Marius Cornea 2015-05-28 13:01:39 UTC
Deployment went fine multiple times after applying the provided patch. Thanks!

Comment 6 chris alfonso 2015-06-02 17:49:50 UTC
*** Bug 1220933 has been marked as a duplicate of this bug. ***

Comment 10 errata-xmlrpc 2015-08-05 13:52:05 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549