Bug 1267030

Summary: ipxe timeout when performing introspection through Intel i350 NIC
Product: Red Hat Enterprise Linux 7 Reporter: Vincent S. Cojot <vcojot>
Component: ipxe Assignee: Lucas Alvares Gomes <lmartins>
ipxe sub component: ipxe-bootimgs QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED ERRATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: alex.williamson, apevec, arkady_kanevsky, bkopilov, cdevine, chayang, christopher_dearborn, dsavinea, dyocum, gael_rehault, ggillies, goneri, huding, ipilcher, jen, joherr, John_walsh, jraju, juzhang, knoel, kschinck, kurt_hey, lhh, lmiksik, lzap, mburns, mcornea, morazi, mrezanin, randy_perryman, rbartal, rhel-osp-director-maint, rsussman, sasha, sbaker, sreichar, srevivo, vcojot, wayne_allen, weliao, xdmoon, xfu
Version: 7.1 Keywords: Rebase
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: ipxe-20150821-1.git4e03af8e.el7 Doc Type: Rebase: Bug Fixes and Enhancements
Doc Text:
Story Points: ---
Clone Of:
: 1290569 1300702 (view as bug list) Environment:
Last Closed: 2016-11-04 00:36:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1298313    
Bug Blocks: 1290569, 1300702, 1300704    
Attachments:
Description Flags
ipxe timeout none
Screencast showing tcpdump of client's MAC on hypervisor and client console none

Description Vincent S. Cojot 2015-09-28 21:02:55 UTC
Description of problem:

On a few Dell R420 servers with both Broadcom and Intel NICs, iPXE works fine when netbooting from the Broadcom NIC but times out when netbooting from the Intel NIC.


Version-Release number of selected component (if applicable):

$ rpm -qf /usr/share/instack-undercloud/ipxe/post-install.d/88-setup-ipxe
instack-undercloud-2.1.2-23.el7ost.noarch

$ rpm -qf /usr/share/ipxe/undionly.kpxe
ipxe-bootimgs-20130517-6.gitc4bce43.el7.noarch


How reproducible:

Always (we tried flashing the firmware, to no avail).

Steps to Reproduce:
1. set the MAC to that of the Intel NIC in instackenv.json
2. start introspection


Actual results:

iPXE times out on the Intel NIC but works on the Broadcom NIC (both in the same VLAN and on the same switch).

Expected results:

Introspection should finish fine.

Additional info:

- From the node itself, on a pre-installed RHEL 7.0, 'dhclient' takes only a few seconds on the Broadcom NIC and close to 30 seconds on the Intel NICs.

- We also found a workaround: updating the iPXE payload (replacing the undionly.kpxe binary with the latest build available from ipxe.org).
On the instack machine:
# cd /tftpboot
# curl -O http://boot.ipxe.org/undionly.kpxe
# chmod 744 /tftpboot/undionly.kpxe
# chown ironic:ironic /tftpboot/undionly.kpxe
# chcon system_u:object_r:tftpdir_t:s0 /tftpboot/undionly.kpxe
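
The dhclient comparison mentioned above can be made repeatable with a small timing wrapper. This is only a sketch: the helper name elapsed_seconds and the interface names in the usage comment are placeholders, not something taken from the affected systems.

```shell
#!/bin/sh
# Hypothetical helper: run a command and print how many whole seconds it took.
elapsed_seconds() {
    start=$(date +%s)
    # Discard the command's own output so only the duration is printed.
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage as root on the node (em1 and p2p1 are placeholder interface names):
#   elapsed_seconds dhclient -1 em1     # Broadcom port
#   elapsed_seconds dhclient -1 p2p1    # Intel port
```

With dhclient, -1 makes the client try once to obtain a lease and then exit, so the printed number is the time to the first lease or to the failure.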

Comment 2 Vincent S. Cojot 2015-09-28 21:08:11 UTC
Created attachment 1078072 [details]
ipxe timeout

Comment 3 Dmitry Tantsur 2015-10-01 12:36:09 UTC
Hi! If you can confirm that newer iPXE firmware works for you, then updating ipxe-bootimgs to something newer than May 2013 (which is what we have, judging by the RPM version) is probably the only thing we can do. Mike, do you think we could retarget this bug to the ipxe-bootimgs package?
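
The age of the payload can be read from the package name itself, since the snapshot date is embedded in the version string. A minimal sketch, assuming the name-YYYYMMDD-release layout seen in this report; both function names are hypothetical, and the reference build is the one listed under "Fixed In Version" above.

```shell
#!/bin/sh
# Hypothetical helper: print the 8-digit snapshot date embedded in an ipxe package name.
# e.g. ipxe-bootimgs-20130517-6.gitc4bce43.el7.noarch -> 20130517
ipxe_snapshot_date() {
    printf '%s\n' "$1" | sed -n 's/.*-\([0-9]\{8\}\)-.*/\1/p'
}

# Exit status 0 when the first package's snapshot predates the second's.
ipxe_is_older_than() {
    pkg_date=$(ipxe_snapshot_date "$1")
    ref_date=$(ipxe_snapshot_date "$2")
    [ -n "$pkg_date" ] && [ -n "$ref_date" ] && [ "$pkg_date" -lt "$ref_date" ]
}

# Usage on a live system (not run here):
#   ipxe_is_older_than "$(rpm -q ipxe-bootimgs)" ipxe-20150821-1.git4e03af8e.el7 \
#       && echo "payload predates the fixed build"
```

The comparison is purely on the date in the name; it says nothing about which patches a given build carries.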

Comment 4 Mike Burns 2015-10-01 13:04:55 UTC
In this case, we're limited to what is shipped in RHEL. Adding Miroslav, who seems to own ipxe in RHEL.

Comment 5 Miroslav Rezanina 2015-10-02 07:55:29 UTC
Hi Mike,
we can try to rebase ipxe in 7.3 if no proper patch is found.

Comment 6 Mike Burns 2015-10-02 11:23:07 UTC
Great, moving this to RHEL, then.

Comment 8 Vincent S. Cojot 2015-10-05 20:03:22 UTC
Hi everyone,
I don't think this issue is related to OOO. The ipxe payload update is merely a workaround for the issue we ran into. We discovered that it works better (it does not time out) if we use the more recent ipxe payload.
In any case:
1) we're still looking into the base issue (DHCP timeout with Intel NICs and Nortel switches)
2) the ipxe payloads in RHEL7.x need an update (IMHO).

For the curious, here is a small screencast captured on my desktop, showing:

1) tcpdump for the client's MAC on the hypervisor hosting the instack VM.
2) the client machine's console. Notice the delay in obtaining the first lease through PXE, and the timeout with the default iPXE payload (the newer payload worked around that issue and allowed us to successfully introspect and deploy).

Kind regards,

Vincent
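
A capture like the one in the screencast can be narrowed to one client with a pcap filter on its MAC. This is a sketch only: the helper name is hypothetical, the MAC and interface in the usage comment are placeholders, and restricting the capture to DHCP (UDP ports 67 and 68) is an assumption rather than the exact filter used in the screencast.

```shell
#!/bin/sh
# Hypothetical helper: build a pcap filter for DHCP traffic to or from one MAC address.
dhcp_filter_for_mac() {
    # "ether host" matches the MAC as source or destination; ports 67/68 carry DHCP/BOOTP.
    printf 'ether host %s and udp and (port 67 or port 68)\n' "$1"
}

# Usage on the hypervisor (eth0 and the MAC are placeholders):
#   tcpdump -n -e -i eth0 "$(dhcp_filter_for_mac 00:11:22:33:44:55)"
```

Dropping the port clause would also show the TFTP and HTTP transfers that follow the lease.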

Comment 9 Vincent S. Cojot 2015-10-05 20:04:14 UTC
Created attachment 1080064 [details]
Screencast showing tcpdump of client's MAC on hypervisor and client console

Comment 10 Lukas Zapletal 2015-10-15 09:37:25 UTC
Satellite 6 customers hit this as well, please rebase.

Comment 19 Gonéri Le Bouder 2015-11-26 14:26:43 UTC
Enabling PortFast (STP) on the switch fixes the issue.
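
One plausible reading of why PortFast helps, assuming the switch runs classic 802.1D spanning tree with default timers (nothing in this report confirms the switch configuration): after link-up, a port waits one forward delay in the listening state and another in the learning state before it forwards traffic. The sketch below only does that arithmetic; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: seconds a classic 802.1D port stays non-forwarding after link-up.
# Argument: forward delay in seconds (defaults to the 802.1D default of 15).
stp_blocking_seconds() {
    forward_delay=${1:-15}
    # Listening state + learning state, each lasting one forward delay.
    echo $((2 * forward_delay))
}

stp_blocking_seconds      # default timers
stp_blocking_seconds 4    # a shortened forward delay
```

With default timers this comes to 30 seconds, which is in line with the roughly 30-second dhclient delay reported in the description; PortFast skips both states on edge ports.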

Comment 20 Mike Burns 2016-01-13 14:39:17 UTC
*** Bug 1290569 has been marked as a duplicate of this bug. ***

Comment 27 Chris Dearborn 2016-02-19 17:28:19 UTC
FYI, at Dell, we are not seeing timeout issues when PXE booting from Intel NICs.

Comment 29 Dan Yocum 2016-04-21 14:47:47 UTC
I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:

ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch

NB: the git hash should match the ipxe version hash displayed when chainloading.

Comment 30 Chao Yang 2016-08-23 10:45:19 UTC
Hi Raviv,

Would you please verify this bug as it is ON_QA now? Thanks!

Comment 31 Raviv Bar-Tal 2016-09-11 12:39:01 UTC
The problem is solved by the new ROMs, and there are no new failure reports related to it. This was verified with Udi, the owner of bug https://bugzilla.redhat.com/show_bug.cgi?id=1301694,
and as Dan wrote in comment #29.

Comment 34 errata-xmlrpc 2016-11-04 00:36:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2214.html