| Summary: | Introspection fails. dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Dan Yocum <dyocum> |
| Component: | openstack-ironic-discoverd | Assignee: | RHOS Maint <rhos-maint> |
| Status: | CLOSED WORKSFORME | QA Contact: | Raviv Bar-Tal <rbartal> |
| Severity: | high | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 8.0 (Liberty) | CC: | apevec, dtantsur, dyocum, lhh, mburns, mcornea, rhel-osp-director-maint, sbaker |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-09 12:28:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Created attachment 1146091 [details]
screen capture of error on console
I've seen something similar in BZ#1301694 (a timeout rather than a connection reset). The workaround there was to update the iPXE ROM or fall back to PXE, so it may be worth checking.

Marius - I've tried "changing" the undionly.kpxe to 3 different versions now with no luck. I used (and extended) the guide here to use PXE boot instead of iPXE: http://etherpad.corp.redhat.com/ironic-ipxe-to-pxe. I was able to get the nodes to boot, at least, but I'm about to open another BZ re: them failing to complete and dropping to the dracut# prompt instead. :(

Also, this affects OSP-d v8. Hardware being used is Dell r630 and r730xd with BIOS v1.3.6 and firmware 2.15.10.10.

I wonder if this NIC has its own iPXE firmware. Could you please make a screenshot of the iPXE ROM booting? There should be a version hash displayed.

Created attachment 1148274 [details]
ipxe boot screencap - version 1.0.0+ (4e03af8e)
It should still chainload whatever is on the director, right?

This screenshot shows 4e03af8e (21-08-2015)
RHOS-7 has c4bce43 (16-05-2013)
RHOS-8 has dc795b9f (27-01-2016)

If iPXE boots from firmware then it *does not* chainload to Director's iPXE. However, I have a WIP patch which does this if the version doesn't match some desired version. I plan to talk to Ironic folk in Austin to see if I should take that idea further. In the meantime I would recommend you update the NIC firmware to iPXE dc795b9f from the ipxe packages in RHOS-8.

I can verify that the Dell R630 and R730xd systems with Intel X520 i350 NICs are booting properly using the following ROMs: ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch
NB: the git hash should match the iPXE version hash displayed when chainloading.

Hi, sorry for not following up earlier. I'd like to close this bug, as it seems that the update we carry resolves the issue. Unfortunately, we can't fix iPXE firmware that we don't control. Steve's idea is worth discussing, but I don't think we can backport it to OSPd8 even if we agree on it. Please feel free to reopen if I'm missing something here. Thanks for reporting!
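The hash comparison mentioned in the NB above can be done by eye, but a small helper makes it less error-prone. The following is only a sketch: `pkg_git_hash` and `hashes_match` are hypothetical helper names, the package name is the one quoted in this thread, and the "shown" hash is the one from the earlier screen capture, typed in by hand. It assumes the package release field carries the hash as `.git<hex>.`, as in the name quoted here.

```shell
#!/bin/bash
# Sketch only: compare the git hash embedded in an ipxe-bootimgs package
# name with the hash read off the iPXE banner. Helper names are hypothetical.

pkg_git_hash() {
    # Print the hex string that follows ".git" in the package name, if any.
    printf '%s\n' "$1" | sed -n 's/.*\.git\([0-9a-f]\{1,\}\)\..*/\1/p'
}

hashes_match() {
    # Succeed when one hash is a prefix of the other, since short hashes
    # are not always abbreviated to the same length.
    local a=$1 b=$2
    [ -n "$a" ] && [ -n "$b" ] || return 1
    case "$a" in "$b"*) return 0 ;; esac
    case "$b" in "$a"*) return 0 ;; esac
    return 1
}

pkg="ipxe-bootimgs-20160127-1.git6366fa7a.el7.noarch"  # from this thread
shown="4e03af8e"  # hash from the earlier screen capture, entered by hand

if hashes_match "$(pkg_git_hash "$pkg")" "$shown"; then
    echo "match"
else
    echo "mismatch"
fi
```

With the values above the script reports a mismatch, which is consistent with the NIC still running its own older iPXE build at the time of the screen capture.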
Created attachment 1146077 [details]
introspection logs - v7.3.1

Description of problem:
When attempting to perform node introspection, it fails ~75% of the time. The error in the logs is the following:

    Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
    Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
    Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPREQUEST(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
    Apr 11 10:35:16 ops2 dnsmasq-dhcp[1490]: DHCPACK(br-ctlplane) 10.3.3.70 ec:f4:bb:e7:06:cc
    Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: error 0 TFTP Aborted received from 10.3.3.70
    Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe to 10.3.3.70
    Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: sent /tftpboot/undionly.kpxe to 10.3.3.70
    Apr 11 10:35:17 ops2 ironic-api: 10.3.3.1 - - [11/Apr/2016 10:35:17] "GET / HTTP/1.0" 200 354

(more logs attached)

This causes iPXE boot to fail (and the node continues to boot from an old deployment on disk).

Version-Release number of selected component (if applicable):

    [stack@ops2 ~]$ rpm -qa | grep 'ironic\|tripleo\|oscplug' | sort
    openstack-ironic-api-2015.1.2-2.el7ost.noarch
    openstack-ironic-common-2015.1.2-2.el7ost.noarch
    openstack-ironic-conductor-2015.1.2-2.el7ost.noarch
    openstack-ironic-discoverd-1.1.0-8.el7ost.noarch
    openstack-tripleo-0.0.7-0.1.1664e566.el7ost.noarch
    openstack-tripleo-common-0.0.1.dev6-6.git49b57eb.el7ost.noarch
    openstack-tripleo-heat-templates-0.8.6-123.el7ost.noarch
    openstack-tripleo-image-elements-0.9.6-10.el7ost.noarch
    openstack-tripleo-puppet-elements-0.0.1-5.el7ost.noarch
    python-ironicclient-0.5.1-12.el7ost.noarch
    python-ironic-discoverd-1.1.0-8.el7ost.noarch
    python-rdomanager-oscplugin-0.0.10-28.el7ost.noarch

How reproducible:
75% of the time

Steps to Reproduce:
1. Open a console on the node to be inspected and have the following script ready:

    #!/bin/bash
    # inspect-node.sh
    #
    # pass in the node uuid as parameter 1 on the cli
    uuid=$1
    echo "Starting introspection on $uuid"
    ironic node-set-maintenance "$uuid" true
    openstack baremetal introspection start "$uuid"
    RET_VAL=1
    # poll every 2 minutes until the node shows up as powered off
    while [ $RET_VAL -ne 0 ]; do
        sleep 120
        ironic node-list | grep "${uuid}.*power off" &> /dev/null
        RET_VAL=$?
        if [ $RET_VAL -eq 1 ]; then
            echo "still running, sleeping..."
        fi
    done
    echo "introspection is complete. exiting"
    ironic node-set-maintenance "$uuid" false

2. Run 'journalctl -l -u openstack-ironic-discoverd -u openstack-ironic-discoverd-dnsmasq -u openstack-ironic-conductor -f' in a separate terminal
3. Get ready to hit Ctrl-B to drop to the iPXE> prompt on the console
4. Execute the script

Actual results:
see error above

Expected results:
success!

Additional info:
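To put a number on the "~75% of the time" figure, the failure signature from the log excerpt in the description can be counted mechanically. This is a rough sketch, not part of the original report: `tftp_summary` is a hypothetical helper name, and it assumes dnsmasq logs the `failed sending <file> to <ip>` and `sent <file> to <ip>` lines exactly as they appear in the excerpt.

```shell
#!/bin/bash
# Sketch only: count failed vs. completed TFTP transfers of a boot file in
# dnsmasq log text read from stdin. tftp_summary is a hypothetical helper.

tftp_summary() {
    # $1: boot file path to look for (defaults to the file from this report)
    local file=${1:-/tftpboot/undionly.kpxe}
    awk -v f="$file" '
        index($0, "dnsmasq-tftp") && index($0, "failed sending " f " ") { failed++ }
        index($0, "dnsmasq-tftp") && index($0, ": sent " f " ")         { sent++ }
        END { printf "failed=%d sent=%d\n", failed, sent }
    '
}

# Demo on the three dnsmasq-tftp lines quoted in the description.
tftp_summary <<'EOF'
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: error 0 TFTP Aborted received from 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: failed sending /tftpboot/undionly.kpxe to 10.3.3.70
Apr 11 10:35:16 ops2 dnsmasq-tftp[1490]: sent /tftpboot/undionly.kpxe to 10.3.3.70
EOF
```

On a live undercloud the same function could be fed from the journal, for example `journalctl -u openstack-ironic-discoverd-dnsmasq | tftp_summary`. A single boot attempt can produce both a failed and a successful transfer, as it does in the excerpt, so the counts are a hint rather than a per-node verdict.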