Bug 1234343

Summary: 75 % success for introspection (VM)
Product: Red Hat OpenStack Reporter: Jaromir Coufal <jcoufal>
Component: python-rdomanager-oscplugin Assignee: Dmitry Tantsur <dtantsur>
Status: CLOSED ERRATA QA Contact: Toure Dunnon <tdunnon>
Severity: unspecified Docs Contact:
Priority: high    
Version: DirectorCC: bnemec, calfonso, dnavale, jcoufal, jliberma, jslagle, jtrowbri, lmartins, mandreou, mburns, psedlak, rhel-osp-director-maint, tdunnon, yeylon
Target Milestone: ga Keywords: Triaged
Target Release: Director   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: python-rdomanager-oscplugin-0.0.8-14.el7ost Doc Type: Bug Fix
Doc Text:
Issues in the KVM PXE code caused failures when too many nodes tried to PXE-boot simultaneously, resulting in some nodes failing to connect to DHCP. With this update, the sleep value between node boots is increased, allowing introspection to complete on all nodes. As a result, DHCP is no longer an issue, although introspection takes a little longer.
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-08-05 13:55:07 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                              Flags
no bootable device for one of the nodes  none
dnsmasq log output                       none

Description Jaromir Coufal 2015-06-22 12:08:09 UTC
Created attachment 1041777 [details]
no bootable device for one of the nodes

Description of problem:
I am constantly getting failures during introspection of virtual machines. The success rate is about 75 %. The remaining nodes do not get discovered and return "No bootable device." (screenshot attached). A few minutes ago I was able to discover 15 nodes out of 20.

Version-Release number of selected component (if applicable):
2015-06-17.2 http://openstack.etherpad.corp.redhat.com/rhel-osp-director-puddle-2015-06-17-2

How reproducible:
75 % of the time

Steps to Reproduce:
1. Trigger introspection on multiple virtual nodes (for example, 20 nodes, as in the description above)

Actual results:
About 75 % success rate

Expected results:
100 % success rate

Comment 4 Dmitry Tantsur 2015-06-24 08:34:17 UTC
A couple of questions: is it always the same node? Does the same thing happen with deploy?

Also, please attach the output of: $ sudo journalctl -u openstack-ironic-discoverd-dnsmasq

CC'ing Lucas as he may know more about iPXE.

Comment 5 Jaromir Coufal 2015-06-24 10:52:01 UTC
Hey, so...

is it always the same node?
-- no, various nodes, not always the same ones

does the same thing happen with deploy?
-- no, deploy never got stuck with similar issue

I don't have the machine available anymore, so I cannot provide any other logs.

Comment 6 Dmitry Tantsur 2015-06-24 10:55:38 UTC
Ok, I will try to reproduce it myself. In the meantime, if someone has the same issue, I badly need logs, so please provide some!

Comment 7 Ben Nemec 2015-06-24 18:15:48 UTC
Created attachment 1042800 [details]
dnsmasq log output

This is the openstack-ironic-discoverd-dnsmasq log output from a failing run.  The MAC address of the failed node is fa:16:3e:4e:ee:38, and it looks like it's the same "address in use" problem you had mentioned to me before.
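
For reference, here is a quick way to pull the relevant lines out of a saved copy of that journal output. This is only a triage sketch: the log file name and the exact "in use" wording are assumptions, not taken from the attachment, so adjust them to match what dnsmasq actually logged.

# Triage sketch, assuming the journalctl output was saved to a text file.
# The file name and the "in use" message wording are assumptions.
FAILED_MAC = "fa:16:3e:4e:ee:38"

with open("discoverd-dnsmasq.log") as log:
    for line in log:
        # Print any line mentioning the failed node or an address conflict.
        if FAILED_MAC in line or "in use" in line.lower():
            print(line.rstrip())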

Comment 8 Dmitry Tantsur 2015-06-25 07:58:08 UTC
Exactly. Do you think it's a good time to redirect this bug to KVM, or whatever component manages the PXE firmware? I think everybody here has reproduced this bug at least once...

Comment 9 Dmitry Tantsur 2015-06-25 08:18:39 UTC
Oh btw, we had a sleep in our scripts:
https://github.com/rdo-management/instack-undercloud/blob/master/scripts/instack-ironic-deployment#L134
which is no longer present in the new CLI:
https://github.com/rdo-management/python-rdomanager-oscplugin/blob/master/rdomanager_oscplugin/v1/baremetal.py#L136-L145

We have to bring it back, I'll submit a patch.
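
For anyone reading along, here is a minimal sketch of what that sleep buys us. The node list, the introspect() placeholder, and the delay value are illustrative assumptions only, not the actual CLI code; the real change is in the review linked in the next comment.

import time

# Illustrative assumptions only: placeholder UUIDs and delay, plus a stub in
# place of whatever call actually starts introspection in the CLI.
NODE_UUIDS = ["uuid-1", "uuid-2", "uuid-3"]
BOOT_DELAY_SECONDS = 15


def introspect(node_uuid):
    """Stub standing in for the real introspection call."""
    print("starting introspection on %s" % node_uuid)


for uuid in NODE_UUIDS:
    introspect(uuid)
    # Staggering the boots keeps the VMs from all requesting a DHCP lease at
    # the same moment, which is what produced the "No bootable device" errors.
    time.sleep(BOOT_DELAY_SECONDS)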

Comment 10 Dmitry Tantsur 2015-06-25 08:34:44 UTC
And here's the patch: https://review.gerrithub.io/#/c/237591/

Comment 11 Marios Andreou 2015-06-26 10:00:53 UTC
Heh Dmitry, it's like deja vu. I was having issues with VM introspection for the last 2 days, especially yesterday (lots of poking). I remember when this happened the first time round and the sleep was added ;)

Glad I came across this; I will try it out since I am refreshing envs today for the puddle. (I can only do VM envs.)

Comment 13 Dmitry Tantsur 2015-06-29 08:18:05 UTC
Patch merged; I believe it will work around the problem.

Comment 14 James Slagle 2015-07-01 14:01:23 UTC
*** Bug 1234956 has been marked as a duplicate of this bug. ***

Comment 19 errata-xmlrpc 2015-08-05 13:55:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2015:1549