Bug 1247722 - messages report Introspection for one of the nodes 'has timed out' while the command returns ' Discovery completed.'
messages report Introspection for one of the nodes 'has timed out' while the ...
Status: CLOSED ERRATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-rdomanager-oscplugin (Show other bugs)
unspecified
x86_64 Linux
high Severity high
: y1
: 7.0 (Kilo)
Assigned To: Brad P. Crochet
Omri Hochman
: ZStream
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2015-07-28 12:46 EDT by Omri Hochman
Modified: 2015-10-08 08:15 EDT (History)
10 users (show)

See Also:
Fixed In Version: python-rdomanager-oscplugin-0.0.9-1.el7ost
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-10-08 08:15:50 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
log1 (412.69 KB, text/x-vhdl)
2015-07-28 12:49 EDT, Omri Hochman
no flags Details
valid_instaclk_.json (4.02 KB, text/plain)
2015-07-28 12:53 EDT, Omri Hochman
no flags Details
instackenv.json (2.36 KB, text/plain)
2015-07-28 12:54 EDT, Omri Hochman
no flags Details
messages (20.00 KB, application/x-tar)
2015-07-28 12:57 EDT, Omri Hochman
no flags Details


External Trackers
Tracker ID Priority Status Summary Last Updated
Gerrithub.io 243621 None None None Never

  None (edit)
Description Omri Hochman 2015-07-28 12:46:18 EDT
messages report Introspection for nodes 'has timed out' while the command returns ' Discovery completed.' 

Environment :
-------------
python-rdomanager-oscplugin-0.0.8-43.el7ost.noarch
instack-undercloud-2.1.2-22.el7ost.noarch
instack-0.0.7-1.el7ost.noarch
openstack-ironic-conductor-2015.1.0-9.el7ost.noarch
python-ironicclient-0.5.1-9.el7ost.noarch
python-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-ironic-common-2015.1.0-9.el7ost.noarch
openstack-ironic-discoverd-1.1.0-5.el7ost.noarch
openstack-ironic-api-2015.1.0-9.el7ost.noarch
openstack-tripleo-puppet-elements-0.0.1-4.el7ost.noarch
openstack-puppet-modules-2015.1.8-8.el7ost.noarch
puppet-3.6.2-2.el7.noarch
openstack-heat-templates-0-0.6.20150605git.el7ost.noarch


Description: 
-------------
The 'openstack baremetal introspection bulk start'  --> took  ~30 minutes      t and returned 'Discovery completed.'  - so It seems like everything is  : OK , 
I proceed with the overcloud deployment that failed, looking at /var/log/messages it seems that ironic discovery of one of the nodes  has timed out.

Jul 28 12:10:47 rhos-compute-node-18 ironic-discoverd: ERROR:ironic_discoverd.node_cache:Introspection for nodes [u'f619570d-6ce4-4891-acfa-f5ba4cf6f7a1', u'4
ffc6413-ad56-4080-bb78-2dc82762a355', u'51929ba4-c49a-4f71-8f7d-2d0d350249be', u'e0ac4748-b4bd-44cc-81be-ece57cd56d65'] has timed out


The question is - why the discovery returned 'Discovery completed.' when one of the nodes has timed out.


var/log/messages (attached ) 
-----------------------------
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 376, in sendall
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: tail = self.send(data, flags)
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: total_sent += fd.send(data[total_sent:], flags)
Jul 28 12:10:46 rhos-compute-node-18 ironic-api: error: [Errno 32] Broken pipe
Jul 28 12:10:47 rhos-compute-node-18 ironic-discoverd: ERROR:ironic_discoverd.node_cache:Introspection for nodes [u'f619570d-6ce4-4891-acfa-f5ba4cf6f7a1', u'4
ffc6413-ad56-4080-bb78-2dc82762a355', u'51929ba4-c49a-4f71-8f7d-2d0d350249be', u'e0ac4748-b4bd-44cc-81be-ece57cd56d65'] has timed out
Jul 28 12:10:47 rhos-compute-node-18 ironic-api: 2015-07-28 12:10:47.818 1612 DEBUG keystonemiddleware.auth_token [-] Removing headers from request environmen
t: X-Service-Catalog,X-Identity-Status,X-Service-Identity-Status,X-Roles,X-Service-Roles,X-Domain-Name,X-Service-Domain-Name,X-Project-Id,X-Service-Project-Id
,X-Project-Domain-Name,X-Service-Project-Domain-Name,X-User-Id,X-Service-User-Id,X-User-Name,X-Service-User-Name,X-Project-Name,X-Service-Project-Name,X-User-
Domain-Id,X-Service-User-Domain-Id,X-Domain-Id,X-Service-Domain-Id,X-User-Domain-Name,X-Service-User-Domain-Name,X-Project-Domain-Id,X-Service-Project-Domain-
Id,X-Role,X-User,X-Tenant-Name,X-Tenant-Id,X-Tenant _remove_auth_headers /usr/lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:672
Jul 28 12:10:47 rhos-compute-node-18 ironic-api: 2015-07-28 12:10:47.819 1612 DEBUG keystonemiddleware.auth_token [-] Authenticating user token __call__ /usr/
lib/python2.7/site-packages/keystonemiddleware/auth_token/__init__.py:581
Comment 1 Omri Hochman 2015-07-28 12:49:40 EDT
Created attachment 1057062 [details]
log1

adding file :  sudo journalctl -u openstack-ironic-discoverd -u openstack-ironic-discoverd-dnsmasq > log1
Comment 2 Omri Hochman 2015-07-28 12:53:41 EDT
Created attachment 1057074 [details]
valid_instaclk_.json

running validator on the instackenv.json returns fine .
Comment 3 Omri Hochman 2015-07-28 12:54:19 EDT
Created attachment 1057075 [details]
instackenv.json
Comment 4 Omri Hochman 2015-07-28 12:57:21 EDT
Created attachment 1057076 [details]
messages
Comment 6 Omri Hochman 2015-07-28 13:14:49 EDT
possible solution for the discovery timeout :https://bugzilla.redhat.com/show_bug.cgi?id=1235255#c5  

but it still not fixing the bad reporting.
Comment 7 Omri Hochman 2015-07-28 13:15:22 EDT
possible solution for the discovery timeout :https://bugzilla.redhat.com/show_bug.cgi?id=1235255#c5  

but it still not fixing the bad reporting.
Comment 9 Rafael Rosa 2015-07-29 08:11:04 EDT
Is this reproducible 100% of the time?
Comment 10 Omri Hochman 2015-07-29 11:35:52 EDT
(In reply to Rafael Rosa from comment #9)
> Is this reproducible 100% of the time?

just to emphasis - the problem is when node-time outs during discovery, this situation should not happen if everything is OK . 

In case there is a node time out the discovery should not return 'Discovery completed.'     

To reproduce we'll need to find a negative scenario that will cause node to time-out, it reproduced on my setup while I opened the bug (since I had a problem ),  but I assume that another way to reproduce it will be to tamper with the instackenv.json file with false parameters, in a way that will cause timeout.
Comment 12 Omri Hochman 2015-09-15 16:34:54 EDT
unable to reproduce with python-rdomanager-oscplugin-0.0.10-3.el7ost.noarch

when attempted to set a wrong MAC address in the instack.json - it seems that the introspection succeeded anyway, and when used a false password for IPMI nic the operation failed immediately as expected.
Comment 14 errata-xmlrpc 2015-10-08 08:15:50 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2015:1862

Note You need to log in before you can comment on or make changes to this bug.