Bug 1247722

Summary: messages report Introspection for one of the nodes 'has timed out' while the command returns ' Discovery completed.'
Product: Red Hat OpenStack Reporter: Omri Hochman <ohochman>
Component: python-rdomanager-oscpluginAssignee: Brad P. Crochet <brad>
Status: CLOSED ERRATA QA Contact: Omri Hochman <ohochman>
Severity: high Docs Contact:
Priority: high    
Version: unspecifiedCC: akrivoka, calfonso, dmacpher, jslagle, kbasil, mburns, nbarcet, ohochman, rhel-osp-director-maint
Target Milestone: y1Keywords: ZStream
Target Release: 7.0 (Kilo)   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: python-rdomanager-oscplugin-0.0.9-1.el7ost Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-10-08 12:15:50 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
log1
none
valid_instaclk_.json
none
instackenv.json
none
messages none

Description Omri Hochman 2015-07-28 16:46:18 UTC
messages report Introspection for nodes 'has timed out' while the command returns ' Discovery completed.' 

Environment :
-------------
python-rdomanager-oscplugin-0.0.8-43.el7ost.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
python-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-ironic-api-2015.1.0-9.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
puppet-3.6.2-2.el7.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch


Description: 
-------------
The 'openstack baremetal introspection bulk start'  --> took  ~30 minutes      t and returned 'Discovery completed.'  - so It seems like everything is  : OK , 
I proceed with the overcloud deployment that failed, looking at /var/log/messages it seems that ironic discovery of one of the nodes  has timed out.

Jul 28 12:10:47 rhos-compute-node-18 ironic-discoverd: ERROR:ironic_discoverd.node_cache:Introspection for nodes [u'f619570d-6ce4-4891-acfa-f5ba4cf6f7a1', u'4
ffc6413-ad56-4080-bb78-2dc82762a355', u'51929ba4-c49a-4f71-8f7d-2d0d350249be', u'e0ac4748-b4bd-44cc-81be-ece57cd56d65'] has timed out


The question is - why the discovery returned 'Discovery completed.' when one of the nodes has timed out.


var/log/messages (attached ) 
-----------------------------
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: tail = self.send(data, flags)
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: total_sent += fd.send(data[total_sent:], flags)
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: error: [Errno 32] Broken pipe
Jul 28 12:10:47 rhos-compute-node-18 ironic-discoverd: ERROR:ironic_discoverd.node_cache:Introspection for nodes [u'f619570d-6ce4-4891-acfa-f5ba4cf6f7a1', u'4
ffc6413-ad56-4080-bb78-2dc82762a355', u'51929ba4-c49a-4f71-8f7d-2d0d350249be', u'e0ac4748-b4bd-44cc-81be-ece57cd56d65'] has timed out
Jul 28 12:10:47 rhos-compute-node-18 ironic-api: 2015-07-28 12:10:47.818 1612 DEBUG keystonemiddleware.auth_token [-] Removing headers from request environmen
t: X-Service-Catalog,X-Identity-Status,X-Service-Identity-Status,X-Roles,X-Service-Roles,X-Domain-Name,X-Service-Domain-Name,X-Project-Id,X-Service-Project-Id
,X-Project-Domain-Name,X-Service-Project-Domain-Name,X-User-Id,X-Service-User-Id,X-User-Name,X-Service-User-Name,X-Project-Name,X-Service-Project-Name,X-User-
Domain-Id,X-Service-User-Domain-Id,X-Domain-Id,X-Service-Domain-Id,X-User-Domain-Name,X-Service-User-Domain-Name,X-Project-Domain-Id,X-Service-Project-Domain-
Id,X-Role,X-User,X-Tenant-Name,X-Tenant-Id,X-Tenant _remove_auth_headers /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:672
Jul 28 12:10:47 rhos-compute-node-18 ironic-api: 2015-07-28 12:10:47.819 1612 DEBUG keystonemiddleware.auth_token [-] Authenticating user token __call__ /usr/
lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:581

Comment 1 Omri Hochman 2015-07-28 16:49:40 UTC
Created attachment 1057062 [details]
log1

adding file :  sudo journalctl -u openstack-ironic-discoverd -u openstack-ironic-discoverd-dnsmasq > log1

Comment 2 Omri Hochman 2015-07-28 16:53:41 UTC
Created attachment 1057074 [details]
valid_instaclk_.json

running validator on the instackenv.json returns fine .

Comment 3 Omri Hochman 2015-07-28 16:54:19 UTC
Created attachment 1057075 [details]
instackenv.json

Comment 4 Omri Hochman 2015-07-28 16:57:21 UTC
Created attachment 1057076 [details]
messages

Comment 6 Omri Hochman 2015-07-28 17:14:49 UTC
possible solution for the discovery timeout :https://bugzilla.redhat.com/show_bug.cgi?id=1235255#c5  

but it still not fixing the bad reporting.

Comment 7 Omri Hochman 2015-07-28 17:15:22 UTC
possible solution for the discovery timeout :https://bugzilla.redhat.com/show_bug.cgi?id=1235255#c5  

but it still not fixing the bad reporting.

Comment 9 Rafael Rosa 2015-07-29 12:11:04 UTC
Is this reproducible 100% of the time?

Comment 10 Omri Hochman 2015-07-29 15:35:52 UTC
(In reply to Rafael Rosa from comment #9)
> Is this reproducible 100% of the time?

just to emphasis - the problem is when node-time outs during discovery, this situation should not happen if everything is OK . 

In case there is a node time out the discovery should not return 'Discovery completed.'     

To reproduce we'll need to find a negative scenario that will cause node to time-out, it reproduced on my setup while I opened the bug (since I had a problem ),  but I assume that another way to reproduce it will be to tamper with the instackenv.json file with false parameters, in a way that will cause timeout.

Comment 12 Omri Hochman 2015-09-15 20:34:54 UTC
unable to reproduce with python-rdomanager-oscplugin-0.0.10-3.el7ost.noarch

when attempted to set a wrong MAC address in the instack.json - it seems that the introspection succeeded anyway, and when used a false password for IPMI nic the operation failed immediately as expected.

Comment 14 errata-xmlrpc 2015-10-08 12:15:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1862