1322056 – ipxe freeze during HTTP download with last RPM

Summary: ipxe freeze during HTTP download with last RPM

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 7
Classification:	Red Hat
Component:	ipxe
Sub Component:
Version:	7.0
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	pre-dev-freeze
Target Release:	7.3
Assignee:	Ladi Prosek
QA Contact:	Raviv Bar-Tal
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-03-29 16:43 UTC by Gonéri Le Bouder
Modified:	2016-11-04 00:39 UTC
CC List:	12 users
Fixed In Version:	ipxe-20160127-4.git6366fa7a.el7
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	1336473
Environment:
Last Closed:	2016-11-04 00:39:08 UTC
Target Upstream Version:
Embargoed:

Attachments
iPXE with TCP keepalive (64.00 KB, application/octet-stream) 2016-05-27 15:11 UTC, Ladi Prosek	no flags	Details
iPXE with TCP keepalive (undionly) (62.77 KB, application/octet-stream) 2016-05-30 14:44 UTC, Ladi Prosek	no flags	Details
iPXE with final upstream TCP keepalive (undionly) (62.73 KB, application/octet-stream) 2016-06-13 11:54 UTC, Ladi Prosek	no flags	Details

Links
System	ID	Priority	Status	Summary	Last Updated
OpenStack gerrit	310283	None	None	None	2016-05-12 16:42:37 UTC
OpenStack gerrit	315255	None	None	None	2016-05-12 16:43:07 UTC
Red Hat Bugzilla	1310778	urgent	CLOSED	ipxe freeze during HTTP download in virtual and hardware env	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2016:2214	normal	SHIPPED_LIVE	ipxe bug fix and enhancement update	2016-11-03 13:24:33 UTC

Description Gonéri Le Bouder 2016-03-29 16:43:50 UTC

Description of problem:

OSP beta9 comes with ipxe-roms-qemu-20160127-1.git6366fa7a.el7.noarch.rpm. It seems to increase the number of freezes during the HTTP transfer.

Here, about 1/20 deployments fail. The nodes are DELL R730xd without UEFI.

This is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1310778; the main difference is that the problem occurs more often.

Comment 2 Mike Burns 2016-04-07 21:36:02 UTC

This bug did not make the OSP 8.0 release.  It is being deferred to OSP 10.

Comment 4 Gonéri Le Bouder 2016-05-12 16:42:38 UTC

https://review.openstack.org/#/c/310283/ resolves the introspection issue with iPXE. I backported the patch for Liberty:
https://review.openstack.org/#/c/315255/

Comment 5 Jason Guiditta 2016-05-16 14:39:10 UTC

Gonéri, thanks for the patches. I think we need to do a little cleanup on this BZ to get it targeted to the right places. Here is what I propose (and I can do this if you like). First, your patches are against puppet, not the ipxe component, even though they fix an issue _related_ to ipxe. So the first step is to reassign to openstack-puppet-modules. Next, we need to make 2 clones of this BZ against osp 7 and 8 (liberty and mitaka), so they can be properly assessed for possible inclusion. Let me know if there are any issues with this; otherwise, I will do these steps in a little while today.

Comment 6 Gonéri Le Bouder 2016-05-16 14:49:09 UTC

I don't (yet) plan to fix Kilo/OSP7. OK, I will clone the issue and reassign it to openstack-puppet-modules.

Comment 8 Ladi Prosek 2016-05-23 08:31:41 UTC

Gonéri, can you please elaborate on what's believed to be broken in iPXE and the best way to reproduce it?

Comment 9 Gonéri Le Bouder 2016-05-24 16:31:42 UTC

iPXE hangs as soon as something unexpected happens on the network. For example, if a packet is broken because of a wrong MTU, iPXE won't request a resend but will just freeze.

This behavior is documented, and the --timeout parameter exists to avoid the freeze: http://lists.ipxe.org/pipermail/ipxe-devel/2014-October/003829.html

We now use this parameter in OSP8 to avoid the issue, but to do so we had to upgrade the iPXE RPM. The one from RHEL7 is too old and does not support the parameter.
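
As an illustration of the workaround described above (bound each download with a timeout and start over when it stalls), here is a minimal Python sketch. It is an analogy only, not iPXE code: `fetch_with_retry`, the `fetch` callable, and the attempt count are hypothetical names and values chosen for the example.

```python
def fetch_with_retry(fetch, timeout_s, max_attempts=3):
    """Call fetch(timeout_s) until it returns or the attempts run out.

    `fetch` is a hypothetical callable that performs one download and
    raises TimeoutError when the transfer stalls longer than timeout_s.
    """
    last_error = None
    for _attempt in range(max_attempts):
        try:
            # One bounded attempt: a stall becomes an error, not a hang.
            return fetch(timeout_s)
        except TimeoutError as exc:
            # Remember the failure and start the transfer over.
            last_error = exc
    raise RuntimeError(
        f"download failed after {max_attempts} attempts"
    ) from last_error
```

The point of the sketch is only that a timeout turns a silent freeze into a failure that can be retried; it does not address why the transfer stalled.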

Comment 10 Gonéri Le Bouder 2016-05-24 16:38:18 UTC

A better long term solution is probably to fix the TCP stack of iPXE. It should request a resend instead of freezing.

Comment 11 Ladi Prosek 2016-05-26 09:31:22 UTC

Thank you! Yes, the HTTP/TCP implementation seems to be pretty bad.

iPXE> imgfetch http://speedtest.reliableservers.com/100MBtest.bin

usually freezes within a few MB and rarely runs to completion on my VM.

Comment 12 Ladi Prosek 2016-05-27 12:26:15 UTC

(In reply to Gonéri Le Bouder from comment #0)
> This is very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1310778;
> the main difference is that the problem occurs more often.

One thing I noticed in my testing is that if I let the VM use the default MAC address (52:54:00:12:34:56 for virtio-net), the chances of the connection freezing are extremely high. There's likely more than one endpoint claiming this MAC within the local network infrastructure and the switches get confused. This is basically equivalent to pulling the cable as described by the ipxe-devel thread.

This may be a long shot, but is it possible that the reason why the problem occurs more often is MAC address conflicts? Or maybe just reusing MAC addresses aggressively while there's still stray traffic associated with the previous physical location of the MAC?

Comment 13 Gonéri Le Bouder 2016-05-27 14:14:00 UTC

If I break the MTU (e.g. 1550 for the server, 1400 for the client) and so send too-large frames to the iPXE client, it will also freeze. This time, the problem is at the TCP level.

Comment 14 Ladi Prosek 2016-05-27 14:39:24 UTC

(In reply to Gonéri Le Bouder from comment #13)
> If I break the MTU (e.g. 1550 for the server, 1400 for the client) and so
> send too-large frames to the iPXE client, it will also freeze. This time, the
> problem is at the TCP level.

Not sure I'm following. If the client sits on a lower MTU network than the server, it is the responsibility of the infrastructure between them to do fragmentation as needed. Are you saying that iPXE doesn't correctly compute the link MTU? Can you please elaborate?

I have prototyped TCP keepalive in iPXE and the results on my test setup are promising - no more freezes. How difficult would it be for you to give it a try?

Comment 15 Gonéri Le Bouder 2016-05-27 14:55:18 UTC

No, sorry if I was not clear enough. This is just a way to generate a broken TCP frame. In this case, the frame checksum is broken and iPXE should request a resend and maybe end up with a timeout.

I can test an iPXE build on my environment if you want.

Comment 16 Ladi Prosek 2016-05-27 15:11:40 UTC

Created attachment 1162513 [details]
iPXE with TCP keepalive

Thanks - I'm assuming that 1af41000.rom (attached) is all you need, please let me know otherwise.

Comment 17 Gonéri Le Bouder 2016-05-30 13:47:31 UTC

Can I get an undionly.kpxe file instead? OpenStack Ironic chainloads from the regular PXE client to iPXE with:

tag:!ipxe,option:bootfile-name,undionly.kpxe

Comment 18 Ladi Prosek 2016-05-30 14:44:38 UTC

Created attachment 1162836 [details]
iPXE with TCP keepalive (undionly)

Sure thing, attaching undionly.kpxe.

Comment 19 Gonéri Le Bouder 2016-05-30 20:06:24 UTC

I did about 90 successful boots with the patched iPXE. The patch seems to improve the situation.

Comment 20 Ladi Prosek 2016-05-31 11:59:19 UTC

(In reply to Gonéri Le Bouder from comment #19)
> I did about 90 successful boots with the patched iPXE. The patch seems to
> improve the situation.

Thank you. May I ask what the confidence level of this statement is? Would you expect it to fail in 1/20 cases as mentioned in the description?

I will pursue getting the TCP keepalive functionality into iPXE. At the same time, though, I suspect that there may be underlying networking problems on your end probably worth a closer look.

The patch simply sends a TCP keepalive packet every 5 seconds if the connection stalls. This makes sure that the L2/L3 path to the server is "refreshed". Switch tables, NAT tables, whatever is in the way and needs to see a client->server packet before server->client starts working could be the culprit.
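
The timing rule described above (send a keepalive every 5 seconds while the connection is stalled) can be sketched in Python as follows. This is not the iPXE patch itself: it models only when a probe would be sent, not what a TCP keepalive segment contains, and `keepalive_due`, `simulate`, and the one-second tick are hypothetical names and values chosen for the illustration.

```python
KEEPALIVE_INTERVAL = 5.0  # seconds, the interval quoted in the comment above


def keepalive_due(now, last_rx, last_keepalive, interval=KEEPALIVE_INTERVAL):
    """Return True when the connection has been quiet for a full interval."""
    last_activity = max(last_rx, last_keepalive)
    return now - last_activity >= interval


def simulate(rx_times, duration, tick=1.0):
    """Step a fake clock and record the times at which probes would be sent.

    rx_times lists the moments (in seconds) at which data arrives from
    the server; any gap of at least one interval counts as a stall.
    """
    probes = []
    pending = sorted(rx_times)
    last_rx = 0.0
    last_keepalive = 0.0
    now = 0.0
    while now <= duration:
        # Consume every packet that has arrived by the current tick.
        while pending and pending[0] <= now:
            last_rx = pending.pop(0)
        if keepalive_due(now, last_rx, last_keepalive):
            probes.append(now)
            last_keepalive = now
        now += tick
    return probes
```

In this model a transfer that receives data every second never triggers a probe, while a transfer that goes silent gets one probe per interval until data resumes.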

Comment 21 Gonéri Le Bouder 2016-05-31 13:06:00 UTC

I did something like 200 iPXE boots of the 9-node platform during the night. So far so good.
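
As a rough check on the confidence question raised in comment 20, the following Python sketch computes how likely a run of clean boots would be if the failure rate were still the 1/20 reported in the description. It assumes each boot fails independently with the same probability, which is a simplification and not something established in this bug; `prob_all_succeed` is a hypothetical helper name.

```python
def prob_all_succeed(failure_rate, boots):
    """Probability that `boots` independent boots all succeed."""
    return (1.0 - failure_rate) ** boots


# 90 boots (comment 19) and 200 boots (comment 21) at a 1/20 failure rate.
p90 = prob_all_succeed(1 / 20, 90)
p200 = prob_all_succeed(1 / 20, 200)
print(f"90 clean boots:  {p90:.4f}")
print(f"200 clean boots: {p200:.2e}")
```

Under that independence assumption, 90 clean boots in a row would happen about 1% of the time and 200 clean boots far less often, so the runs reported here would be unlikely if the old failure rate still applied.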

Comment 22 Gonéri Le Bouder 2016-06-01 13:39:12 UTC

The platform has been rebooting happily since yesterday without any issue.

Comment 23 Ladi Prosek 2016-06-01 15:05:50 UTC

Excellent, thanks for testing it! I'll work on getting the fix in.

Comment 24 Ladi Prosek 2016-06-13 11:54:45 UTC

Created attachment 1167467 [details]
iPXE with final upstream TCP keepalive (undionly)

I am attaching iPXE with the final fix committed upstream by Michael Brown:
http://git.ipxe.org/ipxe.git/commitdiff/188789e

Gonéri, could you please give this one a try also? It is slightly different from the prototype so it warrants retesting. Thanks!

Comment 25 Ladi Prosek 2016-06-24 08:28:51 UTC

Moving to POST and adding Lucas to cc.

Comment 27 Gonéri Le Bouder 2016-07-14 08:00:10 UTC

Hi Ladi,

I gave the new ROM a try. It works fine, as expected, but I don't have enough time to do advanced testing.

Comment 28 Ladi Prosek 2016-07-14 08:11:34 UTC

(In reply to Gonéri Le Bouder from comment #27)
> Hi Ladi,
> 
> I gave the new ROM a try. It works fine, as expected, but I don't have
> enough time to do advanced testing.

Thank you!

Comment 30 Miroslav Rezanina 2016-08-02 10:02:11 UTC

Fix included in ipxe-20160127-4.git6366fa7a.el7

Comment 33 Raviv Bar-Tal 2016-09-11 11:33:35 UTC

There are fewer freezes with the new iPXE images, and the --timeout parameter restarts the few remaining freezes, which can be caused by many kinds of network issues, even in virtual and dedicated-switch environments.

Comment 35 errata-xmlrpc 2016-11-04 00:39:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2214.html
