Bug 1466339 - Overcloud nodes hangs on deploy intermittently. After reboot it will sometimes be imaged [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ironic
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assigned To: RHOS Maint
QA Contact: mlammon
Keywords: Unconfirmed
Depends On:
Blocks:
Reported: 2017-06-29 09:09 EDT by Jeremy
Modified: 2018-02-06 15:09 EST (History)
CC List: 5 users

Last Closed: 2017-09-28 11:19:21 EDT
Type: Bug
bfournie: needinfo? (jmelvin)


Attachments
ironic error on the overcloud node console. (508.99 KB, image/png)
2017-06-29 09:09 EDT, Jeremy

Description Jeremy 2017-06-29 09:09:39 EDT
Created attachment 1292877
ironic error on the overcloud node console.

Description of problem: During deployment, some nodes hang before they receive the overcloud image. The node console shows ironic errors (screenshot attached). We also see errors in ironic-conductor.log:
2017-06-28 18:09:04.575 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 65660b90-60f8-45e1-8bb8-216d5c6211a0


Version-Release number of selected component (if applicable):
openstack-ironic-api-6.2.2-2.el7ost.noarch                  Wed Jun 14 01:23:25 2017
openstack-ironic-common-6.2.2-2.el7ost.noarch   
openstack-ironic-conductor-6.2.2-2.el7ost.noarch            Wed Jun 14 01:23:19 2017

How reproducible:
intermittent

Steps to Reproduce:
1. Deploy the overcloud.
2. Notice that some nodes hang.

Actual results:
Some nodes hang. The node has to be reset and PXE booted again, after which it sometimes works.

Expected results:
Nodes always PXE boot and deploy the overcloud image.

Additional info:
Comment 1 Red Hat Bugzilla Rules Engine 2017-06-29 09:09:49 EDT
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.
Comment 3 Bob Fournier 2017-08-24 16:20:57 EDT
In the IPA screenshot we see "ironic_python_agent.ironic_api_client ConnectionError: HTTPConnectionPool (host=45.12.159.21, port=6385)... Failed to establish a new connection: [Errno 101] Network is unreachable"

In ironic-conductor.log there are a few node timeouts logged; I assume these are for the nodes that exhibited deploy failures.
2017-06-28 14:40:04.306 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d
2017-06-28 15:12:04.327 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d
2017-06-28 16:27:04.417 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 8e7c299b-7a52-4ca7-8101-bc3ab1c51ebe
2017-06-28 16:58:04.464 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 8e7c299b-7a52-4ca7-8101-bc3ab1c51ebe
2017-06-28 18:09:04.575 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 65660b90-60f8-45e1-8bb8-216d5c6211a0

It would be useful to get the logs in /var/log/ironic/deploy for the node when an error occurs, e.g. for 65660b90-60f8-45e1-8bb8-216d5c6211a0. I'm not sure if it's possible to retrieve them at this point.
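
The conductor timeout lines quoted in this comment can be grouped per node to see which nodes timed out and when. The following is only a minimal sketch: the regular expression is inferred from the log lines pasted in this bug, and the function name `timeouts_by_node` and the `SAMPLE` list are hypothetical, not part of ironic.

```python
import re
from collections import defaultdict

# Pattern inferred from the ironic-conductor.log lines quoted in this bug
# (an assumption; other ironic versions may format the message differently).
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ .*"
    r"Timeout reached while waiting for callback for node "
    r"(?P<node>[0-9a-f-]{36})"
)


def timeouts_by_node(lines):
    """Return {node UUID: [timestamps]} for every callback-timeout line."""
    result = defaultdict(list)
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            result[match.group("node")].append(match.group("ts"))
    return dict(result)


# Three of the lines from this comment, used as sample input.
SAMPLE = [
    "2017-06-28 14:40:04.306 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d",
    "2017-06-28 15:12:04.327 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d",
    "2017-06-28 18:09:04.575 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 65660b90-60f8-45e1-8bb8-216d5c6211a0",
]

if __name__ == "__main__":
    for node, stamps in timeouts_by_node(SAMPLE).items():
        print(node, len(stamps), "timeout(s):", ", ".join(stamps))
```

To run it against a real log, the lines of the conductor log file can be passed in place of `SAMPLE`.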
Comment 4 Bob Fournier 2017-09-25 17:46:07 EDT
Hi Jeremy - it looks like the related case is closed. There wasn't much info in the conductor log; we'd really need the ramdisk logs to figure out what is going on here. From the conductor logs it just looks like a temporary network issue. Is there any more info we can get to make progress? Thanks.
Comment 5 Bob Fournier 2017-09-28 11:19:21 EDT
Jeremy - closing this for now as the case is closed and we don't have much to go on besides the network instability in the screenshot. Please reopen if this occurs again; we will need to debug the PXE boot process. Thank you.
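
If the problem recurs, one starting point is to check whether the ironic API port from the screenshot (6385) is reachable over TCP from a machine on the same network as the deploying node. This is a rough sketch only: `probe_tcp` is a hypothetical helper, not an ironic or ironic-python-agent function, and a successful probe from another host does not prove that the ramdisk itself had a working route.

```python
import errno
import socket


def probe_tcp(host, port, timeout=3.0):
    """Attempt one TCP connection and return a short status string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Errno 101 in the IPA screenshot corresponds to ENETUNREACH on Linux.
        if exc.errno == errno.ENETUNREACH:
            return "network unreachable"
        if exc.errno == errno.ECONNREFUSED:
            return "connection refused"
        return f"error: {exc}"


if __name__ == "__main__":
    # Host and port taken from the error message quoted in comment 3.
    print(probe_tcp("45.12.159.21", 6385))
```

A result of "network unreachable" would match the Errno 101 seen on the console, while "connection refused" or "timeout" would point at the API service or at filtering along the path instead.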
