Bug 1395819

Summary: DRAC out-of-band inspection should correctly set pxe_enabled on ports
Product: Red Hat OpenStack Reporter: Chris Dearborn <christopher_dearborn>
Component: openstack-ironic Assignee: Chris Dearborn <christopher_dearborn>
Status: CLOSED CURRENTRELEASE QA Contact: mlammon
Severity: medium Docs Contact:
Priority: high    
Version: 10.0 (Newton) CC: arkady_kanevsky, bfournie, cdevine, christopher_dearborn, dbecker, dcain, dsneddon, dtantsur, jschluet, kurt_hey, lmartins, mburns, morazi, rhel-osp-director-maint, smerrow, sreichar, srevivo, tvignaud
Target Milestone: z2 Keywords: OtherQA, Reopened, TestOnly, Triaged, ZStream
Target Release: 14.0 (Rocky)   
Hardware: Unspecified   
OS: Unspecified   
URL: https://review.openstack.org/#/c/617951/
Whiteboard:
Fixed In Version: openstack-ironic-11.1.2-1.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1665980 (view as bug list) Environment:
Last Closed: 2019-05-24 21:06:11 UTC Type: Feature Request
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1476900, 1519552, 1665980    
Attachments:
Description Flags
PXE boot failure none

Description Chris Dearborn 2016-11-16 18:30:03 UTC
Description of problem:
Overcloud nodes fail to PXE boot during overcloud deployment.

Version-Release number of selected component (if applicable):
OSP10, 11/14 puddle.

How reproducible:
Install 11/14 puddle.  Attempt to deploy overcloud.  Note that overcloud nodes fail to PXE boot.

Steps to Reproduce:
1. See above.

Actual results:
Overcloud nodes fail to PXE boot with the following error on their consoles:
PXE-E51: No DHCP or proxyDHCP offers were received.

PXE-M0F: Exiting Broadcom PXE ROM.

Expected results:
Overcloud nodes should PXE boot.

Additional info:
On the director node:
sudo tcpdump -i br-ctlplane port 67 or port 68 -e -n
Shows many DHCP requests coming in, but no DHCP responses going out:
18:14:55.536557 24:6e:96:11:87:c4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 24:6e:96:11:87:c4, length 548
18:14:57.266428 ec:f4:bb:db:a4:f4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from ec:f4:bb:db:a4:f4, length 548

No idea if this is relevant, but in /var/log/ironic-inspector/ironic-inspector.log there are many messages stating that DHCP is already disabled:
2016-11-16 18:25:01.056 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:16.054 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:16.057 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:31.055 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:31.058 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142

Comment 1 Mike Orazi 2016-11-16 20:24:20 UTC
Moved to Ironic/DFG:HardProv based on the log in the report.

Comment 2 Mike Burns 2016-11-17 19:37:50 UTC
Lucas/Dmitry, can one of you look at this?

Chris also seems to think this is related to out-of-band introspection.

Comment 3 Mike Burns 2016-11-17 19:38:44 UTC
Chris, can you confirm my comment 2?

Comment 4 Chris Dearborn 2016-11-17 19:48:33 UTC
Some additional information:

If out-of-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, all overcloud nodes fail to PXE boot because no DHCP response comes back.

To reproduce:
- Load nodes into ironic
- Transition all nodes to the manageable state
- Launch OOB introspection on all nodes: openstack baremetal node inspect <guid>
- Transition all nodes to the available state
- Launch overcloud deployment
- Note overcloud nodes do not PXE boot and show error in the attachment

If in-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, the overcloud nodes PXE boot successfully and deployment proceeds.

To reproduce:
- Load nodes into ironic
- openstack baremetal introspection bulk start
- Launch overcloud deployment
- Note overcloud nodes successfully PXE boot

Comment 5 Chris Dearborn 2016-11-17 19:49:13 UTC
Created attachment 1221632
PXE boot failure

Comment 6 Dmitry Tantsur 2016-11-18 10:17:36 UTC
Hi!

I think I know the root cause. There is one unpleasant thing about OOB inspection in Ironic (at least how it's implemented now): it does not distinguish between PXE booting and other ports. And there is an unpleasant thing about Nova: it configures a random Ironic port for deployment, which means it might end up NOT the port configured for PXE.

I think we have an upstream bug hanging in an indefinite state for that. Now, ironic-inspector actually works around the problem by leaving only the PXE booting port in Ironic.

Could you please check if removing all ports but PXE booting one actually fixes the problem?
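
The check requested above can be sketched roughly as follows. This is only an illustration, not code from Ironic or ironic-inspector: `ports_to_delete` is a hypothetical helper, the port records are made-up dictionaries, and the MAC addresses are borrowed from the tcpdump output purely as sample values.

```python
def ports_to_delete(ports, pxe_mac):
    """Return the UUIDs of every port whose MAC is not the PXE-booting one."""
    wanted = pxe_mac.lower()
    return [p["uuid"] for p in ports if p["address"].lower() != wanted]


# Hypothetical port records for one node; "uuid" and "address" mirror the
# field names Ironic uses for ports, but the values are invented.
ports = [
    {"uuid": "port-1", "address": "24:6E:96:11:87:C4"},
    {"uuid": "port-2", "address": "ec:f4:bb:db:a4:f4"},
]

# Assume the operator knows that 24:6e:96:11:87:c4 is the PXE NIC.
stale = ports_to_delete(ports, "24:6e:96:11:87:c4")
print(stale)
```

Each UUID returned could then be removed by hand with `openstack baremetal port delete <uuid>` before retrying the deployment.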

Comment 7 Chris Dearborn 2016-11-19 12:58:54 UTC
Yes, deleting all the ports on each node except for the PXE boot port causes the nodes to PXE boot successfully.

Comment 8 Dmitry Tantsur 2016-11-21 10:46:29 UTC
Got it. I guess we need an RFE to only create one port. I know that HPE folks are planning on something similar. Do you think it's even possible via iDRAC?

Comment 9 Chris Dearborn 2016-11-22 15:01:12 UTC
Not sure how to determine which is the PXE boot port.  The only way I can think of is that the user would have to provide the MAC of that port.  Could you point me at the code in ironic-inspector that deals with only creating the 1 port?

Comment 10 Dmitry Tantsur 2016-11-22 15:14:49 UTC
Inspector uses information that we pipe from iPXE ROM: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/utils.py#L41-L47. Then we have some logic to figure out which ports to create based on configuration: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/plugins/standard.py#L220-L229
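
A simplified sketch of the two steps described above, written from memory rather than copied from the linked files, so treat the details as assumptions: `boot_interface` is taken to be the key carrying the PXE MAC in PXELINUX form (`01-` followed by the dash-separated MAC), `select_macs` is a hypothetical stand-in for the port-selection logic, and the fallback from `pxe` to `active` when no PXE MAC is known is assumed.

```python
def get_pxe_mac(introspection_data):
    """Normalise the PXE MAC passed from the boot ROM, if any."""
    pxe_mac = introspection_data.get("boot_interface")
    if pxe_mac and "-" in pxe_mac:
        # PXELINUX-style value: drop the "01-" hardware-type prefix,
        # then turn the remaining dashes into colons.
        pxe_mac = pxe_mac.split("-", 1)[1].replace("-", ":").lower()
    return pxe_mac


def select_macs(interfaces, pxe_mac, add_ports="pxe"):
    """Pick which MACs should become Ironic ports.

    interfaces maps MAC -> True if the NIC obtained an IP address.
    add_ports is one of "pxe", "active" or "all".
    """
    mode = add_ports
    if mode == "pxe" and not pxe_mac:
        mode = "active"  # assumed fallback when the PXE MAC is unknown
    if mode == "pxe":
        return [mac for mac in interfaces if mac.lower() == pxe_mac]
    if mode == "active":
        return [mac for mac, has_ip in interfaces.items() if has_ip]
    return list(interfaces)


# Sample data only; the MACs are reused from the tcpdump output.
data = {"boot_interface": "01-24-6E-96-11-87-C4"}
nics = {"24:6e:96:11:87:c4": True, "ec:f4:bb:db:a4:f4": False}
print(select_macs(nics, get_pxe_mac(data)))
```

Out-of-band inspection has no boot ROM to report `boot_interface`, which is why the DRAC driver needs another source for the same information.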

Comment 11 Sean Merrow 2016-12-09 15:19:27 UTC
Moving this RFE to JS tracker for OSP 12 as per target.

Comment 13 Dmitry Tantsur 2017-04-06 14:37:58 UTC
Based on the PTG discussion, we only have to correctly set pxe_enabled for Ironic ports. This makes it not an RFE any more.

Nonetheless, it has to be possible for iDRAC to report which ports are set for PXE-booting.

Comment 14 Chris Dearborn 2017-04-20 15:16:05 UTC
The mode that the NIC is set to can be obtained from DCIM_NICEnumeration, the LegacyBootProto attribute.  If the NIC is set to PXE boot, then the CurrentValue will be "PXE".
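
As a rough sketch of how that attribute could drive pxe_enabled: the function below is hypothetical and does not use python-dracclient. It assumes the DCIM_NICEnumeration instances have already been fetched and flattened into dictionaries with `FQDD`, `AttributeName` and `CurrentValue` keys; the FQDDs and the "NONE" value are sample data.

```python
def pxe_enabled_by_nic(nic_attributes):
    """Map each NIC FQDD to True when its LegacyBootProto is set to PXE."""
    result = {}
    for attr in nic_attributes:
        if attr["AttributeName"] == "LegacyBootProto":
            result[attr["FQDD"]] = attr["CurrentValue"] == "PXE"
    return result


# Invented DCIM_NICEnumeration rows for two ports of one node.
rows = [
    {"FQDD": "NIC.Integrated.1-1-1", "AttributeName": "LegacyBootProto", "CurrentValue": "PXE"},
    {"FQDD": "NIC.Integrated.1-2-1", "AttributeName": "LegacyBootProto", "CurrentValue": "NONE"},
    {"FQDD": "NIC.Integrated.1-1-1", "AttributeName": "LnkSpeed", "CurrentValue": "AutoNeg"},
]

print(pxe_enabled_by_nic(rows))
```

The resulting booleans would then be written to the pxe_enabled field of the matching Ironic ports.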

Comment 15 Sean Merrow 2017-05-04 14:48:30 UTC
For clarity and tracking:

- Is the information in Comment 14 sufficient to move forward on the suggestion in Comment 13?
- Will the fix be in Ironic or iDRAC driver?
- Who will drive this upstream and own QA, Red Hat or Dell EMC?

Comment 18 Sean Merrow 2017-05-30 15:52:01 UTC
Hi Chris, 

Regarding this BZ, apparently the required support for LegacyBootProto has to be implemented in python-dracclient, then Ironic can be updated to use it. The expectation is that Dell EMC will drive this work in the upstream. 

Is there an OpenStack Gerrit that we can link to this BZ for monitoring?

Thanks,
Sean

Comment 19 Chris Dearborn 2017-09-05 14:46:37 UTC
No, there is no upstream bug or upstream submission for this at the present time.  We implemented a workaround for this issue nearly a year ago so that we could use OOB introspection.  We have a downstream story to switch from our workaround to using pxe_enabled as mentioned above; however, other things currently have a higher priority.  Hopefully we will be able to get to this soon.

Comment 21 Dan Sneddon 2018-08-15 22:14:07 UTC
Chris, I am closing this bug for now, as we have had no activity for over a year. If you submit an upstream patch for this, please feel free to respond on this bug and we will reopen it.

Comment 22 Chris Dearborn 2018-11-14 22:37:41 UTC
Reopening.  We have a contractor working on this defect now.  A patch has been submitted upstream for this and is going through the review process.

Comment 23 Chris Dearborn 2018-11-14 22:39:23 UTC
See: https://review.openstack.org/#/c/617951/

Comment 26 Bob Fournier 2019-05-06 19:33:13 UTC
I originally put the wrong FixedInVersion for this bug, so it wasn't automatically moved to ON_QA.  Fixing the FixedInVersion and setting to ON_QA.

Comment 27 Jon Schlueter 2019-05-08 11:26:46 UTC
The build mentioned in Fixed In Version has been released.