Red Hat Bugzilla – Bug 1301694
PXE boot timed out on baremetal nodes during overcloud introspection
Last modified: 2016-04-18 03:11:43 EDT
Description of problem:
In openstack-director 7.3, introspection of overcloud nodes timed out.
The package ipxe-bootimgs was updated last Thursday, 21.01.2016 (ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch). The package contained an iPXE ROM that caused the boot process to time out on the baremetal nodes.
Switching to the latest ROM from http://boot.ipxe.org/undionly.kpxe solved the problem.
Version-Release number of selected component (if applicable):
Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
ethtool -i enp5s0f1
Steps to Reproduce:
1. Install openstack director 7.3 on BM setup
2. Start introspection
Introspection failed due to a timeout.
/var/log/ dir is attached.
Created attachment 1118144
var log dir
Reproduced the issue and the workaround (http://boot.ipxe.org/undionly.kpxe) on BM.
The issue doesn't reproduce on a virtual setup.
Another workaround: http://etherpad.corp.redhat.com/ironic-ipxe-to-pxe
Can you please elaborate?
Namely, for non-virt purposes, two packages are built from the ipxe SRPM: ipxe-roms, and ipxe-bootimgs.
The former (= ipxe-roms) is what actually contains PCI expansion ROMs, which are meant as *replacements* for the PCI expansion ROMs that are already burned into physical NICs. See:
Whereas the latter (= ipxe-bootimgs) contains standalone iPXE images that can be booted / bootstrapped with various *existing* boot mechanisms: USB, CD-ROM, or the preexistent PXE boot capability (= factory installed PCI expansion ROM) of your NIC. See:
In light of the above, the bug report confuses me:
(a) It references "ipxe-bootimgs", and it states that replacing "undionly.kpxe" on the TFTP server (which file indeed comes from "ipxe-bootimgs") with a fresh upstream binary fixes things.
These points consistently imply that there's a problem with "undionly.kpxe" from the ipxe-bootimgs package. Also they imply that there is no intent to reflash physical NICs with ROM files retrieved from "ipxe-roms".
(b) However, the comments also imply that *downloading* "undionly.kpxe" from the TFTP server runs into issues now.
I don't understand how that's possible, since in that phase the only relevance the ipxe rebase may have is the changed *size* of the file being downloaded ("undionly.kpxe"). Since the same factory-installed PCI oprom of the physical NIC is used for this download as before, I don't see how the ipxe rebase can have any effect here.
Especially this comment: "Eliminating networking we found that the iPXE ROM is having trouble" is hard to understand:
- If you fully eliminate the network, you can't even download "undionly.kpxe"
- If you keep the local subnet alive (so that TFTP works and "undionly.kpxe"
  is downloaded successfully), but prevent "undionly.kpxe" from loading further
  stuff (e.g., via HTTP), then the statement "iPXE ROM is having trouble" is
  hard to interpret:
  - The NIC's oprom obviously managed to load "undionly.kpxe", so it is not
    having trouble (and that ROM doesn't even originate from iPXE),
  - "undionly.kpxe", which could have trouble, is *not a ROM*.
Anyway, assuming this is a network driver issue in iPXE, and because comment 0 named Intel 82576, and because fresh upstream iPXE works, we can look for upstream commits our latest rebase lacks:
$ git log --oneline --reverse 4e03af8e..master -- src/drivers/net/intel.c
d5f7ee6 [intel] Add PCI IDs for i210/i211 flashless operation
fff9281 [intel] Forcibly skip PHY reset on some models
d694592 [intel] Add INTEL_NO_PHY_RST for I217-LM
My guess is either fff9281 or d694592. (The bug report doesn't contain exact vendor ID / device ID, so it's just a guess.)
In attachment 1118144 I found the "dmesg" file. It says:
[ 1.090871] pci 0000:05:00.1: [8086:10c9] type 00 class 0x020000
Searching the iPXE source for 10c9, it is found in "src/drivers/net/intel.c", but it is not affected by the commits listed in comment 8:
src/drivers/net/intel.c: PCI_ROM ( 0x8086, 0x10c9, "82576", "82576", 0 ),
(It doesn't have the INTEL_NO_PHY_RST flag.)
So I have to think this is not a NIC driver issue in iPXE; probably something more generic.
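The lookup above (take the vendor:device pair from dmesg, then check the flags on the matching PCI_ROM entry) can be sketched roughly as follows. This is only an illustration: the two input strings are copied from this bug, while the function names and regular expressions are hypothetical helpers, not part of iPXE or any existing tool.

```python
import re

# Both strings are copied from this bug report.
DMESG_LINE = "[ 1.090871] pci 0000:05:00.1: [8086:10c9] type 00 class 0x020000"
PCI_ROM_LINE = 'src/drivers/net/intel.c: PCI_ROM ( 0x8086, 0x10c9, "82576", "82576", 0 ),'

# Hypothetical patterns, written only for the two line shapes shown above.
_DMESG_ID_RE = re.compile(r"pci \S+: \[([0-9a-f]{4}):([0-9a-f]{4})\]")
_PCI_ROM_RE = re.compile(
    r'PCI_ROM\s*\(\s*0x([0-9a-f]{4}),\s*0x([0-9a-f]{4}),'
    r'\s*"([^"]*)",\s*"([^"]*)",\s*([^)]*?)\s*\)'
)


def dmesg_pci_id(line):
    """Return (vendor, device) as lowercase hex strings, or None."""
    m = _DMESG_ID_RE.search(line)
    return (m.group(1), m.group(2)) if m else None


def parse_pci_rom(line):
    """Return the IDs, name and flags field of a PCI_ROM() entry, or None."""
    m = _PCI_ROM_RE.search(line)
    if not m:
        return None
    vendor, device, name, _description, flags = m.groups()
    return {"id": (vendor, device), "name": name, "flags": flags}


nic_id = dmesg_pci_id(DMESG_LINE)
entry = parse_pci_rom(PCI_ROM_LINE)

# The NIC from dmesg matches the driver entry, and the entry's flags
# field does not mention INTEL_NO_PHY_RST.
same_device = entry is not None and entry["id"] == nic_id
has_phy_rst_flag = entry is not None and "INTEL_NO_PHY_RST" in entry["flags"]
```

Here same_device is true and has_phy_rst_flag is false, which matches the manual reading of the source line.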
The 7.3 installation from 29 Jan has the latest ROM from http://boot.ipxe.org/undionly.kpxe:
cksum /usr/share/ipxe/undionly.kpxe
3260852374 64047 /usr/share/ipxe/undionly.kpxe
Documentation on falling back to PXE has been drafted as a knowledgebase article.
(In reply to Udi Shkalim from comment #18)
> The 7.3 installation from 29 Jan has the latest ROM from
> http://boot.ipxe.org/undionly.kpxe:
> cksum /usr/share/ipxe/undionly.kpxe
> 3260852374 64047 /usr/share/ipxe/undionly.kpxe
Please ignore the above comment. I used a borrowed setup.
I'm currently re-testing with the package from brew.
Can you please regenerate the rpm in brew? It's empty and there is no other source.
I used your repos to update IPXE to:
But the deployment failed. I checked the cksum of undionly.kpxe under my /tftpboot against http://boot.ipxe.org/undionly.kpxe and saw that they differ. After I replaced the file, the deployment passed the ironic phase.
[root@puma33 ~]# cksum undionly.kpxe
1521140302 64074 undionly.kpxe
[root@puma33 ~]# cksum /tftpboot/undionly.kpxe
750298637 63517 /tftpboot/undionly.kpxe
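For setups where the comparison has to be scripted, the same check can be sketched in Python. The block below is a minimal re-implementation of the POSIX cksum CRC (polynomial 0x04C11DB7, with the input length appended and the result complemented), written for illustration only. The helper names are hypothetical, and any paths passed to same_image() are whatever the caller chooses.

```python
from pathlib import Path

_POLY = 0x04C11DB7  # generator polynomial used by POSIX cksum


def _build_table():
    """Precompute the CRC of every possible top byte (MSB-first CRC)."""
    table = []
    for byte in range(256):
        crc = byte << 24
        for _ in range(8):
            crc = ((crc << 1) ^ _POLY) if crc & 0x80000000 else (crc << 1)
            crc &= 0xFFFFFFFF
        table.append(crc)
    return table


_TABLE = _build_table()


def _update(crc, byte):
    """Feed one byte into the running 32-bit CRC."""
    return ((crc << 8) & 0xFFFFFFFF) ^ _TABLE[(crc >> 24) ^ byte]


def posix_cksum(data: bytes):
    """Return (checksum, size) in the same form as the cksum utility prints."""
    crc = 0
    for byte in data:
        crc = _update(crc, byte)
    # POSIX appends the length, least significant octet first.
    length = len(data)
    while length:
        crc = _update(crc, length & 0xFF)
        length >>= 8
    return (~crc) & 0xFFFFFFFF, len(data)


def same_image(path_a, path_b):
    """True if both files have the same cksum value and size."""
    a = posix_cksum(Path(path_a).read_bytes())
    b = posix_cksum(Path(path_b).read_bytes())
    return a == b
```

For the two files compared above, same_image() would be expected to return False, since both the reported checksums and the reported sizes differ.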
Is it possible to get access to your setup to test? If not, can you test with the newer version of ipxe in the batcave repo (should be ipxe-20160127-0.git6366fa7a.el7)?
This bug has been addressed by a combination of a new KB article which describes the process of switching to PXE for users whose hardware doesn't work with iPXE, and by the shipping of an updated iPXE ROM, as tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1267030
In reply to comment 25:
We still have ipxe-bootimgs.noarch 0:20150821-1.git4e03af8e.el7.test, which fails our installations.
The "Fixed In Version" field of bug https://bugzilla.redhat.com/show_bug.cgi?id=1267030 is ipxe-20150821-1.git4e03af8e.el7, which still fails the installation from time to time.
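To make the version comparison explicit, the snapshot date and git commit embedded in these package strings can be pulled out and compared. The sketch below assumes the "<date>-<release>.git<commit>" layout seen in the strings quoted in this bug. The regular expression and helper names are hypothetical and are not an rpm API.

```python
import re
from typing import NamedTuple, Optional


class Snapshot(NamedTuple):
    date: str      # upstream snapshot date, YYYYMMDD
    release: str   # package release number
    commit: str    # abbreviated upstream git commit


# Assumed layout: <YYYYMMDD>-<release>.git<commit>, as in the strings below.
_SNAPSHOT_RE = re.compile(r"(\d{8})-(\d+)\.git([0-9a-f]{7,})")


def parse_snapshot(version: str) -> Optional[Snapshot]:
    """Extract the snapshot fields from a package version string."""
    m = _SNAPSHOT_RE.search(version)
    return Snapshot(*m.groups()) if m else None


# All three strings are quoted from comments in this bug.
installed = parse_snapshot("ipxe-bootimgs.noarch 0:20150821-1.git4e03af8e.el7.test")
fixed_in = parse_snapshot("ipxe-20150821-1.git4e03af8e.el7")
candidate = parse_snapshot("ipxe-20160127-0.git6366fa7a.el7")

# Same upstream commit means the "fixed" build carries the same iPXE snapshot.
same_snapshot = installed.commit == fixed_in.commit
# YYYYMMDD strings compare correctly as plain strings.
candidate_is_newer = candidate.date > installed.date
```

With these inputs, the installed package and the "Fixed In Version" build resolve to the same upstream commit, while the batcave build is based on a later snapshot.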
The workaround for 7.3 is documented; we will take the new iPXE when it is available and fixed (probably in OSP8). Closing.
Cloned for OSP8 for tracking purposes: https://bugzilla.redhat.com/show_bug.cgi?id=1308611