Bug 1326086 - Introspection fails. dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe
Summary: Introspection fails. dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic-discoverd
Version: 8.0 (Liberty)
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: RHOS Maint
QA Contact: Raviv Bar-Tal
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-04-11 18:33 UTC by Dan Yocum
Modified: 2016-05-09 12:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-09 12:28:51 UTC
Target Upstream Version:


Attachments
introspection logs - v7.3.1 (12.36 KB, text/plain)
2016-04-11 18:33 UTC, Dan Yocum
screen capture of error on console (620.32 KB, image/png)
2016-04-11 18:53 UTC, Dan Yocum
ipxe boot screencap - version 1.0.0+ (4e03af8e) (452.16 KB, image/png)
2016-04-18 19:33 UTC, Dan Yocum

Description Dan Yocum 2016-04-11 18:33:58 UTC
Created attachment 1146077 [details]
introspection logs - v7.3.1

Description of problem:

When attempting to perform node introspection, it fails ~75% of the time.

The error in logs is the following:

Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: error 0 TFTP Aborted received from 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: sent /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:17 ops2 ironic-api: 10.3.3.1 - - [11/Apr/2016 10:35:17] "GET / HTTP/1.0" 200 354

(more logs attached)

This causes the iPXE boot to fail, and the node then continues to boot from an old deployment on disk.
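
To put a number on how often the transfer is aborted, the dnsmasq-tftp lines can be tallied per client address. This is only a minimal sketch that assumes the message format shown in the excerpt above; the function name tftp_tally is hypothetical and not part of any package.

```shell
# tftp_tally (hypothetical helper): read dnsmasq-tftp log lines on stdin and
# print, for each client address, how many transfers failed and how many
# completed. Assumes the client address is the last field of each message,
# as in the log excerpt in this report.
tftp_tally() {
    awk '
        /dnsmasq-tftp/ && / failed sending / { failed[$NF]++; seen[$NF] = 1 }
        /dnsmasq-tftp/ && / sent /           { sent[$NF]++;   seen[$NF] = 1 }
        END {
            for (ip in seen)
                printf "%s failed=%d sent=%d\n", ip, failed[ip], sent[ip]
        }
    '
}
```

It could be fed from the journal, for example with: journalctl -u openstack-ironic-discoverd-dnsmasq | tftp_tally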


Version-Release number of selected component (if applicable):

[stack@ops2 ~]$ rpm -qa | grep 'ironic\|tripleo\|oscplug' | sort
openstack-ironic-api-2015.1.2-2.el7ost.noarch
openstack-ironic-common-2015.1.2-2.el7ost.noarch
openstack-ironic-conductor-2015.1.2-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-8.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-6.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-123.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
python-ironicclient-0.5.1-12.el7ost.noarch
python-ironic-discoverd-1.1.0-8.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-28.el7ost.noarch


How reproducible:

75% of the time

Steps to Reproduce:
1. Open a console on the node to be inspected, then run the following script

### inspect-node.sh ####
#!/bin/bash
#
# pass in the node uuid as parameter 1 on the cli
uuid=$1

echo "Starting introspection on $uuid"
# keep the node in maintenance mode while introspection runs
ironic node-set-maintenance "$uuid" true
openstack baremetal introspection start "$uuid"
RET_VAL=1
# poll every 2 minutes until the node is listed as powered off
while [ $RET_VAL -ne 0 ]
do
  sleep 120
  ironic node-list | grep "${uuid}.*power off" &> /dev/null
  RET_VAL=$?
  if [ $RET_VAL -ne 0 ]; then
    echo "still running, sleeping..."
  fi
done
echo "introspection is complete. exiting"
ironic node-set-maintenance "$uuid" false

2. Run 'journalctl -l -u openstack-ironic-discoverd -u openstack-ironic-discoverd-dnsmasq -u openstack-ironic-conductor -f' in a separate terminal
3. Get ready to hit Ctrl-B to drop to the iPXE> prompt on the console
4. Execute the script
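
The polling loop in inspect-node.sh waits indefinitely if the node never reaches "power off". A bounded variant might look like the sketch below. The names poll_until and node_is_off are hypothetical helpers introduced for illustration, and the attempt budget is an arbitrary assumption.

```shell
# poll_until (hypothetical helper): run a command repeatedly until it
# succeeds or the attempt budget is used up.
# Usage: poll_until <max_attempts> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if every attempt failed.
poll_until() {
    max_attempts=$1
    interval=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "still running (attempt $attempt of $max_attempts), sleeping..." >&2
        sleep "$interval"
        attempt=$((attempt + 1))
    done
    return 1
}

# node_is_off (hypothetical helper): the same check the script above uses,
# succeeding when the node with the given uuid is listed as powered off.
node_is_off() {
    ironic node-list | grep -q "$1.*power off"
}
```

For example, poll_until 15 120 node_is_off "$uuid" would give up after 15 checks spaced two minutes apart instead of looping forever.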

Actual results:

see error above

Expected results:

success!

Additional info:

Comment 2 Dan Yocum 2016-04-11 18:53:44 UTC
Created attachment 1146091 [details]
screen capture of error on console

Comment 3 Marius Cornea 2016-04-12 08:48:13 UTC
I've seen something similar in BZ#1301694 (not a connection reset but a timeout). The workaround was to update the iPXE ROM or fall back to PXE, so maybe it's worth checking.

Comment 4 Dan Yocum 2016-04-13 01:23:09 UTC
Marius - I've tried "changing" the undionly.kpxe to 3 different versions now with no luck. 

I used (and extended) the guide here to use PXE boot instead of iPXE: http://etherpad.corp.redhat.com/ironic-ipxe-to-pxe.  I was able to get the nodes to boot, at least, but they fail to complete and drop to the dracut# prompt instead, so I'm about to open another BZ for that.  :(

Comment 5 Dan Yocum 2016-04-13 01:23:30 UTC
Also, this affects OSP-d v8.

Comment 6 Dan Yocum 2016-04-13 01:24:21 UTC
Hardware being used is Dell r630 and r730xd with BIOS v1.3.6 and firmware 2.15.10.10.

Comment 7 Dmitry Tantsur 2016-04-18 08:08:24 UTC
I wonder if this NIC has its own iPXE firmware. Could you please take a screenshot of the iPXE ROM booting? There should be a version hash displayed.

Comment 8 Dan Yocum 2016-04-18 19:33:07 UTC
Created attachment 1148274 [details]
ipxe boot screencap - version 1.0.0+ (4e03af8e)

Comment 9 Dan Yocum 2016-04-18 19:37:25 UTC
It should still chainload whatever is on the director, right?

Comment 10 Steve Baker 2016-04-19 01:14:11 UTC
This screenshot shows 4e03af8e (21-08-2015)
RHOS-7 has            c4bce43  (16-05-2013)
RHOS-8 has            dc795b9f (27-01-2016)

If iPXE boots from firmware, then it *does not* chainload to Director's iPXE. However, I have a WIP patch which does this if the version doesn't match some desired version. I plan to talk to the Ironic folks in Austin to see if I should take that idea further.

In the meantime, I would recommend you update the NIC firmware to iPXE dc795b9f from the ipxe packages in RHOS-8.

Comment 11 Dan Yocum 2016-04-21 14:47:10 UTC
I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:

ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch

NB: the git hash should match the ipxe version hash displayed when chainloading.
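
The check in the note above can be sketched as follows: pull the git hash out of the package name and compare it with the hash read off the iPXE banner. The function names rpm_git_hash and hashes_match are hypothetical, and treating one hash being a prefix of the other as a match is an assumption, made because abbreviated commit IDs can be shown at different lengths.

```shell
# rpm_git_hash (hypothetical helper): print the hash from the ".git<hex>."
# component of a package name such as
# ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch.
# Prints nothing and returns non-zero if no such component is found.
rpm_git_hash() {
    printf '%s\n' "$1" |
        sed -n 's/.*\.git\([0-9a-f][0-9a-f]*\)\..*/\1/p' |
        grep .
}

# hashes_match (hypothetical helper): succeed when both hashes are non-empty
# and one is a prefix of the other (assumption: the two sources may
# abbreviate the same commit ID to different lengths).
hashes_match() {
    [ -n "$1" ] && [ -n "$2" ] || return 1
    case $1 in "$2"*) return 0 ;; esac
    case $2 in "$1"*) return 0 ;; esac
    return 1
}
```

A possible use: hashes_match "$(rpm_git_hash "$(rpm -q ipxe-bootimgs)")" "<hash from the boot screen>", where the second argument is typed in by hand from the console.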

Comment 12 Dmitry Tantsur 2016-05-09 12:28:51 UTC
Hi, sorry for not following up earlier. I'd like to close this bug, as it seems that the update we carry resolves the issue. Unfortunately, we can't fix iPXE firmware that we don't control. Steve's idea is worth discussing, but I don't think we can backport it to OSPd8 even if we agree on it.

Please feel free to reopen if I'm missing something here. Thanks for reporting!

