Bug 1326086

Summary: Introspection fails. dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe
Product: Red Hat OpenStack
Reporter: Dan Yocum <dyocum>
Component: openstack-ironic-discoverd
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED WORKSFORME
QA Contact: Raviv Bar-Tal <rbartal>
Severity: high
Docs Contact:
Priority: urgent
Version: 8.0 (Liberty)
CC: apevec, dtantsur, dyocum, lhh, mburns, mcornea, rhel-osp-director-maint, sbaker
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-05-09 12:28:51 UTC Type: Bug
Attachments (description, flags):
introspection logs - v7.3.1 (flags: none)
screen capture of error on console (flags: none)
ipxe boot screencap - version 1.0.0+ (4e03af8e) (flags: none)

Description Dan Yocum 2016-04-11 18:33:58 UTC
Created attachment 1146077 [details]
introspection logs - v7.3.1

Description of problem:

When attempting to perform node introspection, it fails ~75% of the time.

The error in logs is the following:

Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: error 0 TFTP Aborted received from 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: sent /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:17 ops2 ironic-api: 10.3.3.1 - - [11/Apr/2016 10:35:17] "GET / HTTP/1.0" 200 354

(more logs attached)
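To get a rough idea of how often the transfer fails, the dnsmasq TFTP messages can be tallied. The following is a minimal sketch, not part of the original report: the function name tftp_summary is made up for illustration, and it assumes the message wording matches the excerpt above ("failed sending ... to ..." and "sent ... to ...").

```shell
#!/bin/bash
# Tally dnsmasq TFTP transfer results in log text read from stdin.
# Assumption: the log wording matches the excerpt in this report.
# tftp_summary is a hypothetical helper name, not an existing tool.
tftp_summary() {
  awk '
    /dnsmasq-tftp/ && /failed sending/ { failed++; next }
    /dnsmasq-tftp/ && / sent /         { sent++ }
    END { printf "failed=%d sent=%d\n", failed, sent }
  '
}

# Example using three lines from the log excerpt above.
tftp_summary <<'EOF'
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: error 0 TFTP Aborted received from 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: sent /tftpboot/undionly.kpxe to 10.3.3.70
EOF
```

On the undercloud, the same function could be fed from the journal, for example with journalctl -u openstack-ironic-discoverd-dnsmasq | tftp_summary. The counts are only an indicator; they do not say which node or which attempt failed.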

This causes the iPXE boot to fail, and the node then continues to boot from an old deployment on disk.


Version-Release number of selected component (if applicable):

[stack@ops2 ~]$ rpm -qa | grep 'ironic\|tripleo\|oscplug' | sort
openstack-ironic-api-2015.1.2-2.el7ost.noarch
openstack-ironic-common-2015.1.2-2.el7ost.noarch
openstack-ironic-conductor-2015.1.2-2.el7ost.noarch
openstack-ironic-discoverd-1.1.0-8.el7ost.noarch
openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
openstack-tripleo-common-0.0.1.dev6-6.git49b57eb.el7ost.noarch
openstack-tripleo-heat-templates-0.8.6-123.el7ost.noarch
openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
python-ironicclient-0.5.1-12.el7ost.noarch
python-ironic-discoverd-1.1.0-8.el7ost.noarch
python-rdomanager-oscplugin-0.0.10-28.el7ost.noarch


How reproducible:

75% of the time

Steps to Reproduce:
1. Open a console on the node to be inspected, and save the following script as inspect-node.sh:

#!/bin/bash
#
# inspect-node.sh
# Pass in the node UUID as parameter 1 on the CLI.
uuid=$1
if [ -z "$uuid" ]; then
  echo "usage: $0 <node-uuid>" >&2
  exit 1
fi

echo "Starting introspection on $uuid"
ironic node-set-maintenance "$uuid" true
openstack baremetal introspection start "$uuid"

# Check every 120 seconds until the node is listed as powered off,
# which this script treats as the end of introspection.
RET_VAL=1
while [ "$RET_VAL" -ne 0 ]
do
  sleep 120
  ironic node-list | grep "${uuid}.*power off" &> /dev/null
  RET_VAL=$?
  if [ "$RET_VAL" -ne 0 ]; then
    echo "still running, sleeping..."
  fi
done
echo "introspection is complete. exiting"
ironic node-set-maintenance "$uuid" false

2. Run 'journalctl -l -u openstack-ironic-discoverd -u openstack-ironic-discoverd-dnsmasq -u openstack-ironic-conductor -f' in a separate terminal
3. Get ready to hit Ctrl-B to drop to the iPXE> prompt on the console
4. Execute the script
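Note that the polling loop in inspect-node.sh keeps waiting for as long as the node is not reported as powered off, which is what happens when the node boots from disk instead. A bounded wait is one way around that. This is only a sketch: wait_until and node_is_powered_off are hypothetical helper names, and the power-off check simply reuses the grep from the script.

```shell
#!/bin/bash
# wait_until MAX_TRIES INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds; returns 1 after MAX_TRIES failed attempts.
# Hypothetical helper, not part of the original script.
wait_until() {
  local max_tries=$1 interval=$2 try
  shift 2
  for (( try = 1; try <= max_tries; try++ )); do
    if "$@"; then
      return 0
    fi
    # No need to sleep after the final attempt.
    if (( try < max_tries )); then
      sleep "$interval"
    fi
  done
  return 1
}

# Hypothetical helper: the same check inspect-node.sh uses.
node_is_powered_off() {
  ironic node-list | grep -q "$1.*power off"
}
```

In the script, the while loop could then be replaced with something like: wait_until 15 120 node_is_powered_off "$uuid" || echo "introspection did not finish, giving up". The values 15 and 120 are arbitrary examples.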

Actual results:

see error above

Expected results:

success!

Additional info:

Comment 2 Dan Yocum 2016-04-11 18:53:44 UTC
Created attachment 1146091 [details]
screen capture of error on console

Comment 3 Marius Cornea 2016-04-12 08:48:13 UTC
I've seen something similar in BZ#1301694 (not a connection reset but a timeout). The workaround was to update the iPXE ROM or fall back to PXE, so it may be worth checking that.

Comment 4 Dan Yocum 2016-04-13 01:23:09 UTC
Marius - I've tried "changing" the undionly.kpxe to 3 different versions now with no luck. 

I used (and extended) the guide here to use PXE boot instead of iPXE: http://etherpad.corp.redhat.com/ironic-ipxe-to-pxe. I was able to get the nodes to boot, at least, but I'm about to open another BZ about them failing to complete and dropping to the dracut# prompt instead. :(

Comment 5 Dan Yocum 2016-04-13 01:23:30 UTC
Also, this affects OSP-d v8.

Comment 6 Dan Yocum 2016-04-13 01:24:21 UTC
Hardware being used is Dell r630 and r730xd with BIOS v1.3.6 and firmware 2.15.10.10.

Comment 7 Dmitry Tantsur 2016-04-18 08:08:24 UTC
I wonder if this NIC has its own iPXE firmware. Could you please make a screenshot of the iPXE ROM booting? There should be a version hash displayed.

Comment 8 Dan Yocum 2016-04-18 19:33:07 UTC
Created attachment 1148274 [details]
ipxe boot screencap - version 1.0.0+ (4e03af8e)

Comment 9 Dan Yocum 2016-04-18 19:37:25 UTC
It should still chainload whatever is on the director, right?

Comment 10 Steve Baker 2016-04-19 01:14:11 UTC
This screenshot shows 4e03af8e (21-08-2015)
RHOS-7 has            c4bce43  (16-05-2013)
RHOS-8 has            dc795b9f (27-01-2016)

If iPXE boots from firmware, then it *does not* chainload to Director's iPXE. However, I have a WIP patch which does this if the version doesn't match some desired version. I plan to talk to Ironic folk in Austin to see if I should take that idea further.

In the meantime, I would recommend you update the NIC firmware to iPXE dc795b9f from the ipxe packages in RHOS-8.

Comment 11 Dan Yocum 2016-04-21 14:47:10 UTC
I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs:

ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch

NB: the git hash should match the ipxe version hash displayed when chainloading.
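The hash comparison in the NB above can be scripted. This is a sketch under assumptions: pkg_git_hash and hash_matches are hypothetical helper names, and it assumes the package name carries the hash in the ".git<hash>." form seen in this comment. The hash shown on the console still has to be read off the boot screen by hand.

```shell
#!/bin/bash
# Print the short git hash embedded in an ipxe package name,
# e.g. "...-1.git6366fa7a.el7.noarch" gives "6366fa7a".
# Assumption: the name contains ".git<hex hash>." as in this report.
pkg_git_hash() {
  sed -n 's/.*\.git\([0-9a-f]\{7,\}\)\..*/\1/p' <<< "$1"
}

# Succeed if the hash read from the console matches the package hash.
# Short hashes may differ in length, so a prefix match either way is accepted.
hash_matches() {
  local pkg_hash shown=$2
  pkg_hash=$(pkg_git_hash "$1")
  if [ -z "$pkg_hash" ] || [ -z "$shown" ]; then
    return 1
  fi
  case "$pkg_hash" in "$shown"*) return 0 ;; esac
  case "$shown" in "$pkg_hash"*) return 0 ;; esac
  return 1
}

pkg_git_hash "ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch"
```

For example, hash_matches "$(rpm -q ipxe-bootimgs)" 6366fa7a would succeed on a system with the package above installed, while the same call with 4e03af8e (the firmware hash from the earlier screenshot) would fail.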

Comment 12 Dmitry Tantsur 2016-05-09 12:28:51 UTC
Hi, sorry for not following up earlier. I'd like to close this bug, as it seems that the update we carry resolves the issue. Unfortunately, we can't fix iPXE firmware that we don't control. Steve's idea is worth discussing, but I don't think we can backport it to OSPd8 even if we agree on it.

Please feel free to reopen if I'm missing something here. Thanks for reporting!