Bug 1466339

Summary: Overcloud nodes hang on deploy intermittently; after a reboot they are sometimes imaged successfully
Product: Red Hat OpenStack
Reporter: Jeremy <jmelvin>
Component: openstack-ironic
Assignee: RHOS Maint <rhos-maint>
Status: CLOSED INSUFFICIENT_DATA
QA Contact: mlammon
Severity: medium
Priority: medium
Version: 10.0 (Newton)
CC: bfournie, jmelvin, mburns, rhel-osp-director-maint, srevivo
Target Milestone: ---
Keywords: Unconfirmed
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2017-09-28 15:19:21 UTC
Type: Bug
Attachments:
ironic error on the overcloud node console

Description Jeremy 2017-06-29 13:09:39 UTC
Created attachment 1292877 [details]
ironic error on the overcloud node console.

Description of problem: During deployment, some nodes hang before they receive the overcloud image. Ironic errors appear on the node console (screenshot attached). We also see errors in ironic-conductor.log:
2017-06-28 18:09:04.575 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 65660b90-60f8-45e1-8bb8-216d5c6211a0


Version-Release number of selected component (if applicable):
openstack-ironic-api-6.2.2-2.el7ost.noarch                  Wed Jun 14 01:23:25 2017
openstack-ironic-common-6.2.2-2.el7ost.noarch   
openstack-ironic-conductor-6.2.2-2.el7ost.noarch            Wed Jun 14 01:23:19 2017

How reproducible:
intermittent

Steps to Reproduce:
1. Deploy the overcloud.
2. Observe that some nodes hang.

Actual results:

Some nodes hang. The node has to be reset and re-PXE'd, after which it will sometimes work.

Expected results:
Nodes always PXE boot and deploy the overcloud image.

Additional info:

Comment 1 Red Hat Bugzilla Rules Engine 2017-06-29 13:09:49 UTC
This bugzilla has been removed from the release and needs to be reviewed and triaged for another target release.

Comment 3 Bob Fournier 2017-08-24 20:20:57 UTC
For the IPA screenshot we see "ironic_python_agent.ironic_api_client ConnectionError: HTTPConnectionPool (host=45.12.159.21, port=6385)... Failed to establish a new connection: [Errno 101] Network is unreachable"

In the ironic-conductor.log there are a few node timeouts logged; I assume these are for the nodes that exhibited deploy failures.
2017-06-28 14:40:04.306 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d
2017-06-28 15:12:04.327 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node d620b044-d848-42bf-86b3-e3ba98cec67d
2017-06-28 16:27:04.417 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 8e7c299b-7a52-4ca7-8101-bc3ab1c51ebe
2017-06-28 16:58:04.464 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 8e7c299b-7a52-4ca7-8101-bc3ab1c51ebe
2017-06-28 18:09:04.575 15646 ERROR ironic.conductor.utils [req-2aa38d75-31d9-4f6b-9bc7-86fdbedf8174 - - - - -] Timeout reached while waiting for callback for node 65660b90-60f8-45e1-8bb8-216d5c6211a0

It would be useful to get the logs in /var/log/ironic/deploy for the node when an error occurs, e.g. for 65660b90-60f8-45e1-8bb8-216d5c6211a0. I'm not sure if it's possible to retrieve them at this point.
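[Editorial note, not from the original report: the "Timeout reached while waiting for callback" errors logged above correspond to the conductor's deploy callback timeout, configured in /etc/ironic/ironic.conf on the undercloud. In Newton-era ironic (6.2.x) the option is `deploy_callback_timeout` under `[conductor]`, defaulting to 1800 seconds. A sketch of the relevant fragment, assuming the default file layout:]

```ini
# /etc/ironic/ironic.conf on the undercloud -- sketch, assuming Newton defaults
[conductor]
# Seconds the conductor waits for the deploy ramdisk (IPA) to call back
# before logging "Timeout reached while waiting for callback for node ...".
# Default is 1800. Raising it only widens the window for slow PXE/network
# environments; it does not fix the underlying connectivity problem seen
# in the console screenshot ("Network is unreachable").
deploy_callback_timeout = 3600
```

[The openstack-ironic-conductor service would need a restart for a change here to take effect.]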

Comment 4 Bob Fournier 2017-09-25 21:46:07 UTC
Hi Jeremy - it looks like the related case is closed. There wasn't much info in the conductor log; we'd really need the ramdisk logs to figure out what is going on here. From the conductor logs it just looks like a temporary network issue. Is there any more info we can get to make progress? Thanks.

Comment 5 Bob Fournier 2017-09-28 15:19:21 UTC
Jeremy - closing this for now, as the case is closed and we don't have much to go on besides the network instability in the screenshot. Please reopen if this occurs again; we will need to debug the PXE boot process. Thank you.

Comment 6 Red Hat Bugzilla 2023-09-14 04:00:11 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days