Bug 1301694

Summary: pxe boot timed out on baremetal nodes during overcloud introspection
Product: Red Hat OpenStack Reporter: Udi Shkalim <ushkalim>
Component: ipxe Assignee: Dmitry Tantsur <dtantsur>
Status: CLOSED WONTFIX QA Contact: yeylon <yeylon>
Severity: urgent Docs Contact:
Priority: urgent    
Version: 7.0 (Kilo) CC: ahirshbe, alex.williamson, apevec, athomas, fhubik, jcoufal, jen, kbasil, lersek, lhh, mburns, mcornea, mrezanin, oblaut, sasha, srevivo, ushkalim, yeylon
Target Milestone: y3 Keywords: Reopened
Target Release: 7.0 (Kilo)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1308611 (view as bug list) Environment:
Last Closed: 2016-02-15 15:37:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1308611    
Attachments:
Description Flags
var log dir none

Description Udi Shkalim 2016-01-25 17:37:23 UTC
Description of problem:
In openstack-director 7.3 introspection of overcloud nodes timed out. 
The package ipxe-bootimgs was updated last Thursday, 21.01.2016 (ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch). The package contained an iPXE ROM that caused the boot process to time out on baremetal nodes.
Switching to the latest ROM from http://boot.ipxe.org/undionly.kpxe solved the problem.

Version-Release number of selected component (if applicable):
ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch
Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)
ethtool -i enp5s0f1
driver: igb
version: 5.2.15-k
firmware-version: 2.1.3
bus-info: 0000:05:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

How reproducible:
100%

Steps to Reproduce:
1. Install openstack director 7.3 on a BM setup
2. Start introspection

Actual results:
Introspection failed due to a timeout

Expected results:
Introspection passes

Additional info:
/var/log/ dir is attached.

Comment 2 Udi Shkalim 2016-01-25 17:40:17 UTC
Created attachment 1118144
var log dir

Comment 5 Alexander Chuzhoy 2016-01-25 22:30:32 UTC
Reproduced the issue and the workaround (http://boot.ipxe.org/undionly.kpxe) on BM.
The issue doesn't reproduce on a virtual setup.

Comment 6 Udi Shkalim 2016-01-26 12:04:19 UTC
Another workaround: http://etherpad.corp.redhat.com/ironic-ipxe-to-pxe

Comment 7 Laszlo Ersek 2016-01-26 16:48:42 UTC
Hi,

can you please elaborate?

Namely, for non-virt purposes, two packages are built from the ipxe SRPM: ipxe-roms, and ipxe-bootimgs.

The former (= ipxe-roms) is what actually contains PCI expansion ROMs, which are meant as *replacements* for the PCI expansion ROMs that are already burned into physical NICs. See:

http://ipxe.org/howto/romburning

Whereas the latter (= ipxe-bootimgs) contains standalone iPXE images that can be booted / bootstrapped with various *existing* boot mechanisms: USB, CD-ROM, or the preexistent PXE boot capability (= factory installed PCI expansion ROM) of your NIC. See:

http://ipxe.org/howto/chainloading

In light of the above, the bug report confuses me:

(a) It references "ipxe-bootimgs", and it states that replacing "undionly.kpxe" on the TFTP server (which file indeed comes from "ipxe-bootimgs") with a fresh upstream binary fixes things.

These points consistently imply that there's a problem with "undionly.kpxe" from the ipxe-bootimgs package. Also they imply that there is no intent to reflash physical NICs with ROM files retrieved from "ipxe-roms".

(b) However, the comments also imply that *downloading* "undionly.kpxe" from the TFTP server runs into issues now.

I don't understand how that's possible, since in that phase the only relevance the ipxe rebase may have is the changed *size* of the file being downloaded ("undionly.kpxe"). Since the same factory-installed PCI oprom of the physical NIC is used for this download as before, I don't see how the ipxe rebase can have any effect here.

Especially this comment: "Eliminating networking we found that the iPXE ROM is having trouble" is hard to understand:

- If you fully eliminate the network, you can't even download "undionly.kpxe"
  via TFTP.

- If you keep the local subnet alive (so that TFTP works and "undionly.kpxe"
  is downloaded successfully), but prevent "undionly.kpxe" from loading further
  stuff (e.g., via HTTP), then the statement "iPXE ROM is having trouble" is
  hard to interpret:

  - The NIC's oprom obviously managed to load "undionly.kpxe", so it is not
    having trouble (and that ROM doesn't even originate from iPXE),

  - "undionly.kpxe", which could have trouble, is *not a ROM*.

Comment 8 Laszlo Ersek 2016-01-26 16:51:28 UTC
Anyway, assuming this is a network driver issue in iPXE, and because comment 0 named Intel 82576, and because fresh upstream iPXE works, we can look for upstream commits our latest rebase lacks:

$ git log --oneline --reverse 4e03af8e..master -- src/drivers/net/intel.c

d5f7ee6 [intel] Add PCI IDs for i210/i211 flashless operation
fff9281 [intel] Forcibly skip PHY reset on some models
d694592 [intel] Add INTEL_NO_PHY_RST for I217-LM

My guess is either fff9281 or d694592. (The bug report doesn't contain exact vendor ID / device ID, so it's just a guess.)

Comment 9 Laszlo Ersek 2016-01-26 16:57:36 UTC
In attachment 1118144 I found the "dmesg" file. It says:

[    1.090871] pci 0000:05:00.1: [8086:10c9] type 00 class 0x020000

Searching the iPXE source for 10c9, it is found in "src/drivers/net/intel.c", but it is not affected by the commits listed in comment 8:

src/drivers/net/intel.c:        PCI_ROM ( 0x8086, 0x10c9, "82576", "82576", 0 ),

(It doesn't have the INTEL_NO_PHY_RST flag.)

So I have to think this is not a NIC driver issue in iPXE; probably something more generic.
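
The lookup described in comment 9 can be sketched in a few lines of Python. This is only an illustration: the two input lines are copied from the comment above, while the regular expressions and the function names (parse_dmesg_pci, parse_pci_rom, has_no_phy_rst) are hypothetical helpers, not part of iPXE or any Red Hat tooling.

```python
import re

# Line copied from the attached dmesg (comment 9).
DMESG_LINE = "[    1.090871] pci 0000:05:00.1: [8086:10c9] type 00 class 0x020000"
# Line copied from src/drivers/net/intel.c as quoted in comment 9.
PCI_ROM_LINE = 'PCI_ROM ( 0x8086, 0x10c9, "82576", "82576", 0 ),'

# Hypothetical patterns, written only for the two line shapes shown above.
DMESG_RE = re.compile(r"pci (\S+): \[([0-9a-f]{4}):([0-9a-f]{4})\]")
PCI_ROM_RE = re.compile(
    r'PCI_ROM\s*\(\s*0x([0-9a-fA-F]+),\s*0x([0-9a-fA-F]+),'
    r'\s*"[^"]*",\s*"[^"]*",\s*([^)]*?)\s*\)'
)


def parse_dmesg_pci(line):
    """Return (bus address, vendor ID, device ID) from a dmesg PCI line."""
    m = DMESG_RE.search(line)
    if m is None:
        return None
    return m.group(1), int(m.group(2), 16), int(m.group(3), 16)


def parse_pci_rom(line):
    """Return (vendor ID, device ID, flags text) from a PCI_ROM entry."""
    m = PCI_ROM_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)


def has_no_phy_rst(rom_entry):
    """True if the entry's flags field mentions INTEL_NO_PHY_RST."""
    return "INTEL_NO_PHY_RST" in rom_entry[2]


if __name__ == "__main__":
    bus, vendor, device = parse_dmesg_pci(DMESG_LINE)
    entry = parse_pci_rom(PCI_ROM_LINE)
    print(f"{bus}: {vendor:04x}:{device:04x}")
    print("driver table match:", entry[:2] == (vendor, device))
    print("INTEL_NO_PHY_RST set:", has_no_phy_rst(entry))
```

For a different NIC, the same check would need that NIC's dmesg line and the matching PCI_ROM entry from the iPXE tree under test.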

Comment 18 Udi Shkalim 2016-02-02 09:58:13 UTC
The 7.3 installation from 29 Jan has the latest ROM from http://boot.ipxe.org/undionly.kpxe

cksum /usr/share/ipxe/undionly.kpxe
3260852374 64047 /usr/share/ipxe/undionly.kpxe

Comment 19 Angus Thomas 2016-02-03 13:50:27 UTC
Documentation on falling back to PXE is drafted as a knowledgebase article.

Comment 20 Udi Shkalim 2016-02-04 14:56:16 UTC
(In reply to Udi Shkalim from comment #18)
> The 7.3 installation from 29 Jan has the latest ROM from
> http://boot.ipxe.org/undionly.kpxe
> 
> cksum /usr/share/ipxe/undionly.kpxe
> 3260852374 64047 /usr/share/ipxe/undionly.kpxe

Please ignore the above comment. I used a borrowed setup.

I'm currently re-testing with the package from brew 
https://brewweb.devel.redhat.com/taskinfo?taskID=10401510

Comment 21 Asaf Hirshberg 2016-02-04 15:01:18 UTC
Laszlo,

Can you please regenerate the rpm in brew? It's empty and there is no other source.

Thanks.

Comment 24 Asaf Hirshberg 2016-02-07 12:47:04 UTC
Hey Miroslav,

I used your repos to update iPXE to:
 ipxe-bootimgs.noarch 0:20150821-1.git4e03af8e.el7.test
But the deployment failed. I checked the cksum of undionly.kpxe under my /tftpboot against http://boot.ipxe.org/undionly.kpxe and saw that they differ. After I replaced the file, the deployment passed the ironic phase.

[root@puma33 ~]# cksum undionly.kpxe
1521140302 64074 undionly.kpxe
[root@puma33 ~]# cksum /tftpboot/undionly.kpxe 
750298637 63517 /tftpboot/undionly.kpxe
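
The comparison above can also be scripted. The following is a minimal sketch, assuming the two files are available locally; it reimplements the POSIX cksum CRC in pure Python so that the result can be compared with the numbers printed by the cksum utility. The function names (posix_cksum, cksum_file, same_image) are hypothetical, and a matching checksum only shows that the files are very likely identical, not that the image works.

```python
def _feed(crc, byte):
    """Shift one byte into the MSB-first CRC-32 used by POSIX cksum."""
    crc ^= byte << 24
    for _ in range(8):
        if crc & 0x80000000:
            crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
        else:
            crc = (crc << 1) & 0xFFFFFFFF
    return crc


def posix_cksum(data):
    """Return the CRC that `cksum` prints for the given bytes."""
    crc = 0
    for byte in data:
        crc = _feed(crc, byte)
    # POSIX appends the length, least significant octet first,
    # using as few octets as needed (none for empty input).
    length = len(data)
    while length:
        crc = _feed(crc, length & 0xFF)
        length >>= 8
    return crc ^ 0xFFFFFFFF


def cksum_file(path):
    """Return (crc, size) for a file, mirroring the first two cksum columns."""
    with open(path, "rb") as handle:
        data = handle.read()
    return posix_cksum(data), len(data)


def same_image(path_a, path_b):
    """True if both files have the same cksum CRC and size."""
    return cksum_file(path_a) == cksum_file(path_b)


if __name__ == "__main__":
    import sys

    # Usage: python cksum_compare.py undionly.kpxe /tftpboot/undionly.kpxe
    for path in sys.argv[1:]:
        crc, size = cksum_file(path)
        print(crc, size, path)
    if len(sys.argv) == 3:
        print("identical" if same_image(sys.argv[1], sys.argv[2]) else "different")
```

The bit-by-bit loop is slow but adequate for boot images of this size (about 64 KB); a table-driven CRC would be the usual choice for larger files.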

Comment 25 Miroslav Rezanina 2016-02-08 13:44:07 UTC
Hi Asaf,
Is it possible to get access to your setup for testing? If not, can you test with the newer version of ipxe in the batcave repo (should be ipxe-20160127-0.git6366fa7a.el7)?

Mirek

Comment 27 Angus Thomas 2016-02-09 15:12:38 UTC
This bug has been addressed by a combination of a new KB article which describes the process of switching to PXE for users whose hardware doesn't work with iPXE, and by the shipping of an updated iPXE ROM, as tracked here: https://bugzilla.redhat.com/show_bug.cgi?id=1267030

Comment 28 Ofer Blaut 2016-02-15 13:40:47 UTC
Hi Angus

In Reply to comment 25

We still have ipxe-bootimgs.noarch 0:20150821-1.git4e03af8e.el7.test

This fails our installations.

Bug https://bugzilla.redhat.com/show_bug.cgi?id=1267030 lists Fixed In Version as ipxe-20150821-1.git4e03af8e.el7, which still fails the installation from time to time.

Ofer

Comment 29 Jaromir Coufal 2016-02-15 15:37:23 UTC
The workaround for 7.3 is documented; we will take the new iPXE when it is available and fixed (probably in OSP8). Closing.

Comment 30 Jaromir Coufal 2016-02-15 15:40:23 UTC
Cloned for OSP8 for tracking purposes: https://bugzilla.redhat.com/show_bug.cgi?id=1308611

Comment 31 Red Hat Bugzilla 2023-09-14 03:16:44 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days