Bug 1310778 - ipxe freeze during HTTP download in virtual and hardware env
Summary: ipxe freeze during HTTP download in virtual and hardware env
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: low
Target Milestone: async
Target Release: 8.0 (Liberty)
Assignee: Lucas Alvares Gomes
QA Contact: Toure Dunnon
URL:
Whiteboard:
Depends On:
Blocks: 1261979 1310828 1337206
 
Reported: 2016-02-22 16:09 UTC by Gonéri Le Bouder
Modified: 2016-06-09 19:40 UTC (History)

Fixed In Version: openstack-ironic-4.2.3-1.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1337206 (view as bug list)
Environment:
Last Closed: 2016-06-09 19:40:04 UTC
Target Upstream Version:
Embargoed:


Attachments
screenshot (19.26 KB, image/png)
2016-02-22 16:09 UTC, Gonéri Le Bouder


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1567449 0 None None None 2016-04-07 14:10:32 UTC
OpenStack gerrit 283893 0 None None None 2016-02-24 01:13:06 UTC
OpenStack gerrit 306196 0 None None None 2016-05-18 13:32:12 UTC
Red Hat Product Errata RHBA-2016:1220 0 normal SHIPPED_LIVE openstack-ironic bug fix advisory 2016-06-09 23:39:56 UTC

Internal Links: 1322056 1336473 1337250

Description Gonéri Le Bouder 2016-02-22 16:09:36 UTC
Created attachment 1129398
screenshot

Description of problem:


I already had some freezes during the HTTP download of the kernel/initrd
with the ipxe from the ipxe rpm, including the one from BZ1267030. In this case the Ironic introspection will fail with a timeout.

Today I managed to reproduce the issue with the latest upstream ipxe ( git
f468f12b1eca15e703aa2a79f1c82969c04c2322 ) on an OVB environment. IMO,
ipxe should be able to raise a timeout in such a case.

I attached a screenshot.


Version-Release number of selected component (if applicable):


How reproducible:

Boot a VM and have it fetch the kernel/initrd from an HTTP share.

Actual results:

99% of the cases will end up with a success.

Expected results:


Additional info:

Comment 2 Gonéri Le Bouder 2016-02-22 16:26:40 UTC
According to http://lists.ipxe.org/pipermail/ipxe-devel/2014-October/003829.html, this is the purpose of the --timeout parameter, but I don't see this argument in the configuration file generated by Ironic.

-------------------
#!ipxe

dhcp

goto deploy

:deploy
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_kernel selinux=0 disk=cciss/c0d0,sda,hda,vda iscsi_target_iqn=iqn.2008-10.org.openstack:bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_id=bfc53b24-9f79-4e17-8bc0-b9657047a3c4 deployment_key=2Q1KKOTSH42OS9FYPYYHF0DT564PDWCU ironic_api_url=http://192.0.2.240:6385 troubleshoot=0 text nofb nomodeset vga=normal boot_option=local ip=${ip}:${next-server}:${gateway}:${netmask} BOOTIF=${mac}  ipa-api-url=http://192.0.2.240:6385 ipa-driver-name=pxe_ipmitool coreos.configdrive=0

initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/deploy_ramdisk
boot

:boot_partition
kernel http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/kernel root={{ ROOT }} ro text nofb nomodeset vga=normal
initrd http://192.0.2.240:8088/bfc53b24-9f79-4e17-8bc0-b9657047a3c4/ramdisk
boot

:boot_whole_disk
kernel chain.c32
append mbr:{{ DISK_IDENTIFIER }}
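
For illustration, the --timeout argument mentioned above could be added to a generated script like the one in this comment with a small shell filter. This is only a sketch: the add_ipxe_timeout function name and the 60000 value are assumptions made for the example, not what Ironic or any proposed patch does, and it relies on iPXE's kernel and initrd commands accepting --timeout with a value in milliseconds.

```shell
# Hypothetical helper: read an iPXE script on stdin and add
# "--timeout <ms>" to every "kernel http..." and "initrd http..." line.
# Lines that do not fetch over HTTP(S), such as "kernel chain.c32",
# are passed through unchanged.
add_ipxe_timeout() {
    timeout_ms="$1"
    sed -E "s#^(kernel|initrd) (https?://)#\1 --timeout ${timeout_ms} \2#"
}
```

Example use, with an assumed file name: add_ipxe_timeout 60000 < config.ipxe > config-with-timeout.ipxe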

Comment 4 Steve Baker 2016-02-23 03:28:07 UTC
If you're using OVB you'll likely need to adjust the undercloud MTU settings, in my environment I do it by:

  mtu=$(ifconfig eth0 |egrep -o "mtu [^ ]*")
  ifconfig eth1 $mtu
  ifconfig eth2 $mtu

This comment may not be helpful if you actually want iPXE to have better timeout handling.
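
The same MTU copy can be sketched with the ip tool instead of ifconfig. This is a hedged variant, not a tested fix for this bug: parse_mtu and copy_mtu are hypothetical helper names, and the interface names are supplied by the caller.

```shell
# Extract the numeric MTU from "ip -o link show <dev>" output on stdin.
# Prints nothing if no "mtu N" field is present.
parse_mtu() {
    sed -n -E 's/.* mtu ([0-9]+).*/\1/p'
}

# Copy the MTU of the first interface onto all remaining ones.
# Needs root privileges to change interface settings.
copy_mtu() {
    src="$1"; shift
    mtu=$(ip -o link show "$src" | parse_mtu)
    [ -n "$mtu" ] || { echo "no MTU found for $src" >&2; return 1; }
    for dev in "$@"; do
        ip link set dev "$dev" mtu "$mtu"
    done
}
```

With the interfaces from the snippet above, the call would be: copy_mtu eth0 eth1 eth2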

Comment 5 Gonéri Le Bouder 2016-02-23 15:35:52 UTC
Thanks Steve. Indeed this seems to be the root of the problem here.

Comment 6 Gonéri Le Bouder 2016-02-24 01:11:23 UTC
https://review.openstack.org/283893 I pushed a review to add a default timeout to reduce the impact of this kind of problem.

Comment 9 Gonéri Le Bouder 2016-03-03 18:15:51 UTC
I opened https://review.openstack.org/#/c/288041 to be able to adjust the MTU through the undercloud.conf

Comment 10 Gonéri Le Bouder 2016-03-05 16:24:38 UTC
Chris, I'm pretty sure this is not a blocker in your case. The problem happens only when the MTU is < 1500. We had the issue once with Gael but it was because of a misconfiguration.

Comment 11 Gonéri Le Bouder 2016-03-12 18:22:40 UTC
OK, I just had a freeze during the agent.ramdisk download. This time the MTU was ok (1400). I use the upstream ipxe image, not the one provided by Red Hat.

Comment 12 Gonéri Le Bouder 2016-04-07 14:07:17 UTC
The final patch to get the --timeout merged is here:
https://review.openstack.org/#/c/294787
and is blocked by:
https://bugs.launchpad.net/ironic/+bug/1567449

Comment 13 Gael Rehault 2016-04-14 13:07:05 UTC
This is where we stand with this: the bug went away for a while, and reappeared in the Beta9/RC/GA releases.
In https://bugzilla.redhat.com/show_bug.cgi?id=1322056, Goneri highlighted the fact that the ipxe packages got updated, so we tried a workaround (using both beta9 and GA): installing the previous packages and locking them prior to installing the undercloud.

[osp_admin@director ~]$ rpm -qa | grep ipxe
ipxe-bootimgs-20150821-1.git4e03af8e.el7.noarch
ipxe-roms-qemu-20150821-1.git4e03af8e.el7.noarch

Unfortunately, that does not prevent the freezes from happening for me (I ran into it in 2 out of 3 deployments today with the above in place), so there might be something else at play besides those package versions.

Also, to clarify, this bug can occur in 2 different steps of the deployment process:
1 - during the bulk introspection of the nodes
2 - during the overcloud deployment

Comment 17 errata-xmlrpc 2016-06-09 19:40:04 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1220

