Bug 1395819 - DRAC out-of-band inspection should correctly set pxe_enabled on ports
Summary: DRAC out-of-band inspection should correctly set pxe_enabled on ports
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: z2
Target Release: 14.0 (Rocky)
Assignee: Chris Dearborn
QA Contact: mlammon
URL: https://review.openstack.org/#/c/617951/
Whiteboard:
Depends On:
Blocks: 1476900 1519552 1665980
 
Reported: 2016-11-16 18:30 UTC by Chris Dearborn
Modified: 2019-05-24 21:06 UTC (History)

Fixed In Version: openstack-ironic-11.1.2-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1665980 (view as bug list)
Environment:
Last Closed: 2019-05-24 21:06:11 UTC
Target Upstream Version:


Attachments
PXE boot failure (103.42 KB, image/gif)
2016-11-17 19:49 UTC, Chris Dearborn


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 617951 0 None MERGED Fix OOB introspection to use pxe_enabled flag in idrac driver 2020-07-29 10:30:06 UTC
OpenStack gerrit 624272 0 None MERGED Fix OOB introspection to use pxe_enabled flag in idrac driver 2020-07-29 10:30:06 UTC
Storyboard 2004340 0 None None None 2018-11-14 22:38:54 UTC

Description Chris Dearborn 2016-11-16 18:30:03 UTC
Description of problem:
Overcloud nodes fail to pxe boot during overcloud deployment.

Version-Release number of selected component (if applicable):
OSP10, 11/14 puddle.

How reproducible:
Install 11/14 puddle.  Attempt to deploy overcloud.  Note that overcloud nodes fail to PXE boot.

Steps to Reproduce:
1. See above.

Actual results:
Overcloud nodes fail to PXE boot with the following error on their consoles:
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting Broadcom PXE ROM.

Expected results:
Overcloud nodes should PXE boot.

Additional info:
On the director node:
sudo tcpdump -i br-ctlplane port 67 or port 68 -e -n
Shows many DHCP requests coming in, but no DHCP responses going out:
18:14:55.536557 24:6e:96:11:87:c4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 24:6e:96:11:87:c4, length 548
18:14:57.266428 ec:f4:bb:db:a4:f4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from ec:f4:bb:db:a4:f4, length 548
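
To make the "requests in, no responses out" observation easier to check on a longer capture, the tcpdump output can be tallied. This is only a rough sketch: it assumes each request line contains "BOOTP/DHCP, Request from <MAC>" as in the two lines above, and it assumes replies would show up with the text "BOOTP/DHCP, Reply". The function name tally_dhcp is made up for illustration.

```python
import re
from collections import Counter

# Matches the "Request from <mac>" part of a tcpdump BOOTP/DHCP summary line.
REQUEST_RE = re.compile(r"BOOTP/DHCP, Request from ([0-9a-f:]{17})")


def tally_dhcp(lines):
    """Count DHCP requests per client MAC and the total number of replies."""
    requests = Counter()
    replies = 0
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            requests[match.group(1)] += 1
        elif "BOOTP/DHCP, Reply" in line:
            # Assumption: this is how a response line would be labelled.
            replies += 1
    return requests, replies


# The two lines captured on br-ctlplane in this report.
capture = [
    "18:14:55.536557 24:6e:96:11:87:c4 > Broadcast, ethertype IPv4 (0x0800), "
    "length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, "
    "Request from 24:6e:96:11:87:c4, length 548",
    "18:14:57.266428 ec:f4:bb:db:a4:f4 > Broadcast, ethertype IPv4 (0x0800), "
    "length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, "
    "Request from ec:f4:bb:db:a4:f4, length 548",
]

requests, replies = tally_dhcp(capture)
for mac, count in requests.items():
    print(mac, count)
print("replies:", replies)
```

A reply count of zero alongside a non-empty request tally matches what was seen on the director node.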

No idea if this is relevant, but in /var/log/ironic-inspector/ironic-inspector.log there are many messages stating that DHCP is already disabled:
2016-11-16 18:25:01.056 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:16.054 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:16.057 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:31.055 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:31.058 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142

Comment 1 Mike Orazi 2016-11-16 20:24:20 UTC
Moved to Ironic/DFG:HardProv based on the log in the report.

Comment 2 Mike Burns 2016-11-17 19:37:50 UTC
Lucas/Dmitry, can one of you look at this?

Chris also seems to think this is related to out-of-band introspection.

Comment 3 Mike Burns 2016-11-17 19:38:44 UTC
Chris, can you confirm my comment 2?

Comment 4 Chris Dearborn 2016-11-17 19:48:33 UTC
Some additional information:

If out-of-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, all overcloud nodes fail to pxe boot due to no DHCP response coming back.

To reproduce:
- Load nodes into ironic
- Transition all nodes to the manageable state
- Launch OOB introspection on all nodes: openstack baremetal node inspect <guid>
- Transition all nodes to the available state
- Launch overcloud deployment
- Note overcloud nodes do not PXE boot and show error in the attachment

If in-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, the overcloud nodes pxe boot successfully and deployment proceeds.

To reproduce:
- Load nodes into ironic
- openstack baremetal introspection bulk start
- Launch overcloud deployment
- Note overcloud nodes successfully PXE boot

Comment 5 Chris Dearborn 2016-11-17 19:49:13 UTC
Created attachment 1221632
PXE boot failure

Comment 6 Dmitry Tantsur 2016-11-18 10:17:36 UTC
Hi!

I think I know the root cause. There is one unpleasant thing about OOB inspection in Ironic (at least how it's implemented now): it does not distinguish between PXE booting and other ports. And there is an unpleasant thing about Nova: it configures a random Ironic port for deployment, which means it might end up NOT the port configured for PXE.

I think we have an upstream bug hanging in an indefinite state for that. Now, ironic-inspector actually works around the problem by leaving only the PXE booting port in Ironic.

Could you please check if removing all ports but PXE booting one actually fixes the problem?

Comment 7 Chris Dearborn 2016-11-19 12:58:54 UTC
Yes, deleting all the ports on each node except for the PXE boot port causes the nodes to PXE boot successfully.
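
For anyone repeating this workaround on several nodes, the selection step could look roughly like the sketch below. It assumes the node's ports are available as dicts with "uuid" and "address" keys and that the PXE boot MAC is already known; the helper name ports_to_remove and the sample UUIDs and MACs are invented for illustration. The actual deletion is not shown and would still be done through the baremetal API or CLI (for example `openstack baremetal port delete <uuid>`).

```python
def ports_to_remove(ports, pxe_mac):
    """Return the ports whose MAC address differs from the PXE boot MAC."""
    wanted = pxe_mac.lower()
    return [port for port in ports if port["address"].lower() != wanted]


# Hypothetical ports as created by OOB inspection for a single node.
node_ports = [
    {"uuid": "port-a", "address": "AA:BB:CC:00:00:01"},
    {"uuid": "port-b", "address": "aa:bb:cc:00:00:02"},
    {"uuid": "port-c", "address": "aa:bb:cc:00:00:03"},
]

# Everything except the PXE boot port is a candidate for deletion.
doomed = ports_to_remove(node_ports, "aa:bb:cc:00:00:01")
for port in doomed:
    print(port["uuid"], port["address"])
```

The MAC comparison is case-insensitive so that differently formatted addresses still match.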

Comment 8 Dmitry Tantsur 2016-11-21 10:46:29 UTC
Got it. I guess we need an RFE to only create one port; I know that HPE folks are planning something similar. Do you think it's even possible via iDRAC?

Comment 9 Chris Dearborn 2016-11-22 15:01:12 UTC
Not sure how to determine which is the PXE boot port.  The only way I can think of is that the user would have to provide the MAC of that port.  Could you point me at the code in ironic-inspector that deals with only creating the 1 port?

Comment 10 Dmitry Tantsur 2016-11-22 15:14:49 UTC
Inspector uses information that we pipe from iPXE ROM: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/utils.py#L41-L47. Then we have some logic to figure out which ports to create based on configuration: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/plugins/standard.py#L220-L229

Comment 11 Sean Merrow 2016-12-09 15:19:27 UTC
Moving this RFE to JS tracker for OSP 12 as per target.

Comment 13 Dmitry Tantsur 2017-04-06 14:37:58 UTC
Based on the PTG discussion, we only have to correctly set pxe_enabled for Ironic ports. This makes it not an RFE any more.

Nonetheless, it has to be possible for iDRAC to report which ports are set for PXE booting.

Comment 14 Chris Dearborn 2017-04-20 15:16:05 UTC
The mode that the NIC is set to can be obtained from DCIM_NICEnumeration, the LegacyBootProto attribute.  If the NIC is set to PXE boot, then the CurrentValue will be "PXE".
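
As a rough illustration of how that attribute could drive pxe_enabled, the sketch below maps each NIC to a boolean. It assumes the DCIM_NICEnumeration data has already been fetched and flattened into a dict keyed by NIC id; that layout, the helper name pxe_enabled_by_nic, and the sample NIC ids and the "NONE" value are assumptions for illustration only, not the python-dracclient API.

```python
def pxe_enabled_by_nic(nic_attributes):
    """Map each NIC id to True when its LegacyBootProto CurrentValue is "PXE"."""
    return {
        nic_id: attrs.get("LegacyBootProto", {}).get("CurrentValue") == "PXE"
        for nic_id, attrs in nic_attributes.items()
    }


# Hypothetical, already-parsed enumeration data for one node.
sample = {
    "NIC.Integrated.1-1-1": {"LegacyBootProto": {"CurrentValue": "PXE"}},
    "NIC.Integrated.1-2-1": {"LegacyBootProto": {"CurrentValue": "NONE"}},
    "NIC.Slot.2-1-1": {},  # attribute missing: treated as not PXE enabled
}

flags = pxe_enabled_by_nic(sample)
for nic_id, enabled in flags.items():
    print(nic_id, enabled)
```

Each resulting flag would then be the value to set as pxe_enabled on the Ironic port created for that NIC.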

Comment 15 Sean Merrow 2017-05-04 14:48:30 UTC
For clarity and tracking:

- Is the information in Comment 14 sufficient to move forward on what is suggested in Comment 13?
- Will the fix be in Ironic or iDRAC driver?
- Who will drive this upstream and own QA, Red Hat or Dell EMC?

Comment 18 Sean Merrow 2017-05-30 15:52:01 UTC
Hi Chris, 

Regarding this BZ, apparently the required support for LegacyBootProto has to be implemented in python-dracclient, then Ironic can be updated to use it. The expectation is that Dell EMC will drive this work in the upstream. 

Is there an OpenStack Gerrit that we can link to this BZ for monitoring?

Thanks,
Sean

Comment 19 Chris Dearborn 2017-09-05 14:46:37 UTC
No, there is no upstream bug or upstream submission for this at the present time.  We implemented a workaround for this issue nearly a year ago so that we could use OOB introspection.  We have a downstream story to switch from our workaround to using pxe_enabled as mentioned above; however, other things currently have a higher priority.  Hopefully we will be able to get to this soon.

Comment 21 Dan Sneddon 2018-08-15 22:14:07 UTC
Chris, I am closing this bug for now, as we have had no activity for over a year. If you submit an upstream patch for this please feel free to respond on this bug and we will reopen it.

Comment 22 Chris Dearborn 2018-11-14 22:37:41 UTC
Reopening.  We have a contractor working on this defect now.  A patch has been submitted upstream for this and is going through the review process.

Comment 23 Chris Dearborn 2018-11-14 22:39:23 UTC
See: https://review.openstack.org/#/c/617951/

Comment 26 Bob Fournier 2019-05-06 19:33:13 UTC
I originally put the wrong FixedInVersion for this bug, so it wasn't automatically moved to ON_QA.  Fixing the FixedInVersion and setting to ON_QA.

Comment 27 Jon Schlueter 2019-05-08 11:26:46 UTC
The build mentioned in fixed in version has been released

