Description of problem:
Overcloud nodes fail to PXE boot during overcloud deployment.

Version-Release number of selected component (if applicable):
OSP10, 11/14 puddle.

How reproducible:
Install 11/14 puddle. Attempt to deploy overcloud. Note that overcloud nodes fail to PXE boot.

Steps to Reproduce:
1. See above.

Actual results:
Overcloud nodes fail to PXE boot with the following error on their consoles:

PXE-E51: No DHCP or proxyDHCP offers were received.
PXE-M0F: Exiting Broadcom PXE ROM.

Expected results:
Overcloud nodes should PXE boot.

Additional info:
On the director node:

sudo tcpdump -i br-ctlplane port 67 or port 68 -e -n

Shows many DHCP requests coming in, but no DHCP responses going out:

18:14:55.536557 24:6e:96:11:87:c4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 24:6e:96:11:87:c4, length 548
18:14:57.266428 ec:f4:bb:db:a4:f4 > Broadcast, ethertype IPv4 (0x0800), length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from ec:f4:bb:db:a4:f4, length 548

No idea if this is relevant, but in /var/log/ironic-inspector/ironic-inspector.log there are many messages stating that DHCP is already disabled:

2016-11-16 18:25:01.056 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:16.054 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:16.057 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
2016-11-16 18:25:31.055 744 DEBUG futurist.periodics [-] Submitting periodic function 'ironic_inspector.main.periodic_update' _process_scheduled /usr/lib/python2.7/site-packages/futurist/periodics.py:614
2016-11-16 18:25:31.058 744 DEBUG ironic_inspector.firewall [-] DHCP is already disabled, not updating _disable_dhcp /usr/lib/python2.7/site-packages/ironic_inspector/firewall.py:142
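For anyone triaging a similar capture, a small script can confirm the "requests in, no responses out" pattern without reading every line. This is only a sketch: it assumes the tcpdump summary text looks like the lines quoted above, and that replies, if present, would be printed as "BOOTP/DHCP, Reply". The function name summarize_dhcp is made up for this example.

```python
import re
from collections import Counter

# Matches the BOOTP/DHCP summary at the end of each tcpdump line.
# Assumption: replies would show up as "BOOTP/DHCP, Reply".
_DHCP_RE = re.compile(
    r"BOOTP/DHCP, (Request|Reply)(?: from ([0-9a-fA-F:]{17}))?"
)


def summarize_dhcp(lines):
    """Count DHCP requests per client MAC and the total number of replies."""
    requests = Counter()
    replies = 0
    for line in lines:
        match = _DHCP_RE.search(line)
        if not match:
            continue  # not a BOOTP/DHCP packet line
        kind, mac = match.groups()
        if kind == "Request":
            requests[(mac or "unknown").lower()] += 1
        else:
            replies += 1
    return requests, replies


# The two packets quoted in this report.
capture = [
    "18:14:55.536557 24:6e:96:11:87:c4 > Broadcast, ethertype IPv4 (0x0800), "
    "length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, "
    "Request from 24:6e:96:11:87:c4, length 548",
    "18:14:57.266428 ec:f4:bb:db:a4:f4 > Broadcast, ethertype IPv4 (0x0800), "
    "length 590: 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, "
    "Request from ec:f4:bb:db:a4:f4, length 548",
]

requests, replies = summarize_dhcp(capture)
for mac, count in sorted(requests.items()):
    print(mac, "sent", count, "request(s)")
print("replies seen:", replies)
```

A MAC with requests while the reply count stays at zero matches the failure described in this report.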
Moved to Ironic/DFG:HardProv based on the log in the report.
Lucas/Dmitry, can one of you look at this? Chris also seems to think this is related to out-of-band introspection.
Chris, can you confirm my comment 2?
Some additional information:

If out-of-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, all overcloud nodes fail to PXE boot because no DHCP response comes back.

To reproduce:
- Load nodes into ironic
- Transition all nodes to the manageable state
- Launch OOB introspection on all nodes: openstack baremetal node inspect <guid>
- Transition all nodes to the available state
- Launch overcloud deployment
- Note overcloud nodes do not PXE boot and show the error in the attachment

If in-band introspection is used with the pxe_drac driver, then when overcloud deployment is launched, the overcloud nodes PXE boot successfully and deployment proceeds.

To reproduce:
- Load nodes into ironic
- openstack baremetal introspection bulk start
- Launch overcloud deployment
- Note overcloud nodes successfully PXE boot
Created attachment 1221632 [details] PXE boot failure
Hi! I think I know the root cause. There is one unpleasant thing about OOB inspection in Ironic (at least how it's implemented now): it does not distinguish between the PXE booting port and other ports. And there is an unpleasant thing about Nova: it configures a random Ironic port for deployment, which means it might NOT end up being the port configured for PXE. I think we have an upstream bug hanging in an indefinite state for that. Now, ironic-inspector actually works around the problem by leaving only the PXE booting port in Ironic. Could you please check whether removing all ports but the PXE booting one actually fixes the problem?
Yes, deleting all the ports on each node except for the PXE boot port causes the nodes to PXE boot successfully.
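The workaround just confirmed can be expressed as a small selection step. This is a sketch only: the port records are plain dicts carrying the uuid and address fields of an Ironic port, the UUIDs and the second MAC's role are invented sample data, and the function ports_to_delete is hypothetical. It only computes which ports to remove; the actual deletion (for example with openstack baremetal port delete) is left to the operator.

```python
def ports_to_delete(ports, pxe_mac):
    """Return the UUIDs of every port whose MAC differs from the PXE boot MAC.

    Refuses to proceed if the PXE MAC is not among the ports, so a typo
    cannot lead to deleting every port on the node.
    """
    wanted = pxe_mac.lower()
    if not any(port["address"].lower() == wanted for port in ports):
        raise ValueError("PXE MAC %s is not among the node's ports" % pxe_mac)
    return [port["uuid"] for port in ports if port["address"].lower() != wanted]


# Invented sample data for one node.
node_ports = [
    {"uuid": "port-a", "address": "24:6E:96:11:87:C4"},
    {"uuid": "port-b", "address": "24:6e:96:11:87:c5"},
    {"uuid": "port-c", "address": "24:6e:96:11:87:c6"},
]

print(ports_to_delete(node_ports, "24:6e:96:11:87:c4"))
```

Comparing MACs case-insensitively avoids mismatches between tools that report addresses in different cases.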
Got it. I guess we need an RFE to only create one port; I know that HPE folks are planning on something similar. Do you think it's even possible via iDRAC?
Not sure how to determine which is the PXE boot port. The only way I can think of is that the user would have to provide the MAC of that port. Could you point me at the code in ironic-inspector that deals with only creating the 1 port?
Inspector uses information that we pipe from iPXE ROM: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/utils.py#L41-L47. Then we have some logic to figure out which ports to create based on configuration: https://github.com/openstack/ironic-inspector/blob/master/ironic_inspector/plugins/standard.py#L220-L229
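To illustrate the idea described in this comment, here is a rough sketch, not the inspector's actual code. It assumes the boot interface is reported either as a plain MAC or in the PXELINUX-style BOOTIF form (a hardware type prefix such as 01 followed by the MAC with dashes), and that the port policy is a simple "pxe" versus "all" switch. The function names, the policy names, and the interface record shape are all assumptions made for this example.

```python
def normalize_boot_mac(boot_interface):
    """Turn '01-24-6e-96-11-87-c4' or a plain MAC into '24:6e:96:11:87:c4'."""
    value = boot_interface.strip().lower()
    if "-" in value:
        parts = value.split("-")
        if len(parts) == 7:
            # Drop the leading hardware type field (01 for Ethernet).
            parts = parts[1:]
        value = ":".join(parts)
    return value


def select_ports(interfaces, boot_interface, policy="pxe"):
    """Pick the MACs to create ports for, under a hypothetical policy switch.

    With policy 'all', or when no boot interface was reported, every MAC is
    kept. With policy 'pxe', only the MAC the node booted from is kept; the
    result is empty if that MAC is not among the interfaces.
    """
    macs = [nic["mac"].lower() for nic in interfaces]
    if policy == "all" or not boot_interface:
        return macs
    pxe_mac = normalize_boot_mac(boot_interface)
    return [mac for mac in macs if mac == pxe_mac]


# Invented sample inventory for one node.
inventory = [
    {"name": "em1", "mac": "24:6e:96:11:87:c4"},
    {"name": "em2", "mac": "24:6e:96:11:87:c5"},
]

print(select_ports(inventory, "01-24-6e-96-11-87-c4"))
print(select_ports(inventory, "01-24-6e-96-11-87-c4", policy="all"))
```

OOB inspection has no such boot-time signal, which is why it ends up with a port for every NIC.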
Moving this RFE to JS tracker for OSP 12 as per target.
Based on the PTG discussion, we only have to correctly set pxe_enabled for Ironic ports. This makes it not an RFE any more. Nonetheless, it has to be possible for iDRAC to report which ports are set for PXE booting.
The mode that the NIC is set to can be obtained from DCIM_NICEnumeration, the LegacyBootProto attribute. If the NIC is set to PXE boot, then the CurrentValue will be "PXE".
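A minimal sketch of how that attribute could drive pxe_enabled follows. It assumes the DCIM_NICEnumeration entries have already been fetched and flattened into dicts; the key names (fqdd, name, current_value), the sample FQDDs, the non-PXE sample values, and the function pxe_enabled_by_nic are illustrative assumptions, not the python-dracclient API.

```python
def pxe_enabled_by_nic(nic_attributes):
    """Map each NIC identifier to True when its LegacyBootProto is 'PXE'."""
    result = {}
    for attr in nic_attributes:
        if attr["name"] != "LegacyBootProto":
            continue  # ignore unrelated NIC attributes
        result[attr["fqdd"]] = attr["current_value"] == "PXE"
    return result


# Invented sample records standing in for DCIM_NICEnumeration entries.
sample = [
    {"fqdd": "NIC.Integrated.1-1-1", "name": "LegacyBootProto", "current_value": "PXE"},
    {"fqdd": "NIC.Integrated.1-2-1", "name": "LegacyBootProto", "current_value": "NONE"},
    {"fqdd": "NIC.Integrated.1-1-1", "name": "LinkStatus", "current_value": "Connected"},
]

for fqdd, enabled in sorted(pxe_enabled_by_nic(sample).items()):
    print(fqdd, "pxe_enabled =", enabled)
```

The resulting flags would then be applied to the Ironic port that corresponds to each NIC, matched by MAC address.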
For clarity and tracking:
- Is the information in Comment 14 sufficient to move forward on the suggestion in Comment 13?
- Will the fix be in Ironic or the iDRAC driver?
- Who will drive this upstream and own QA, Red Hat or Dell EMC?
Hi Chris,

Regarding this BZ, apparently the required support for LegacyBootProto has to be implemented in python-dracclient, then Ironic can be updated to use it. The expectation is that Dell EMC will drive this work upstream. Is there an OpenStack Gerrit that we can link to this BZ for monitoring?

Thanks,
Sean
No, there is no upstream bug or upstream submission for this at the present time. We implemented a workaround for this issue nearly a year ago so that we could use OOB introspection. We have a downstream story to switch from our workaround to using pxe_enabled as mentioned above; however, other things currently have a higher priority. Hopefully we will be able to get to this soon.
Chris, I am closing this bug for now, as we have had no activity for over a year. If you submit an upstream patch for this please feel free to respond on this bug and we will reopen it.
Reopening. We have a contractor working on this defect now. A patch has been submitted upstream for this and is going through the review process.
See: https://review.openstack.org/#/c/617951/
I originally put the wrong FixedInVersion for this bug, so it wasn't automatically moved to ON_QA. Fixing the FixedInVersion and setting to ON_QA.
The build mentioned in fixed in version has been released