Bug 1251117 - [rhel-osp-director] Undercloud nova/ironic rpc_response_timeout is too low for deployment
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: Director
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: async
Target Release: Director
Assignee: Ben Nemec
QA Contact: Amit Ugol
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-08-06 13:17 UTC by Rhys Oxenham
Modified: 2019-09-09 13:48 UTC
CC: 19 users

Fixed In Version: instack-undercloud-2.1.2-23.el7ost
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-08-13 20:04:07 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gerrithub.io 242450 0 None None None Never
Red Hat Product Errata RHBA-2015:1624 0 normal SHIPPED_LIVE Red Hat Enterprise Linux OSP 7 director Bug Fix Advisory 2015-08-14 00:03:44 UTC

Description Rhys Oxenham 2015-08-06 13:17:45 UTC
Description of problem:

When deploying an overcloud, the undercloud can become heavily loaded; in some circumstances this causes the deployment to fail. The problem is particularly exacerbated in virtual environments where the undercloud and overcloud nodes reside on the same hypervisor.

The deployment primarily fails because Nova fails to spawn an instance on top of its designated node. I'm led to believe that this is because Nova is experiencing messaging timeouts.

2015-08-05 13:47:10.324 1930 TRACE nova.servicegroup.drivers.db MessagingTimeout: Timed out waiting for a reply to message ID 27060ef48bed45beb783127a569342b4

2015-08-05 13:49:33.011 1930 ERROR nova.compute.manager [req-8d96b585-f033-4c35-a544-41d0de7994c5 - - - - -] [instance: 209d0efc-b1f7-45cc-9887-35d9c448dcdf] Instance failed to spawn

2015-08-05 13:49:34.100 1930 WARNING nova.virt.ironic.driver [req-8d96b585-f033-4c35-a544-41d0de7994c5 - - - - -] Destroy called on non-existing instance 209d0efc-b1f7-45cc-9887-35d9c448dcdf.

Nova then unsuccessfully attempts to clean up the nodes, eventually stripping the instance_uuid from the ironic node, leaving Ironic nodes in an orphaned state, partially deployed with the overcloud-full image, and the heat-stack failed.

Also, Ironic itself seems to time out, often causing related deployment failures; in this example the conductor gets out of sync:

{"error_message": "{\"debuginfo\": null, \"faultcode\": \"Client\", \"faultstring\": \"No valid host was found. Reason: No conductor service registered which supports driver pxe_ssh.\"}"}

Version-Release number of selected component (if applicable):

RHEL OSP 7.0 GA + Director GA

How reproducible:

In my tests, the deployment failed 75% of the time. 20% of the time, some of the nodes would deploy correctly but others would not, and 5% of the time the deployment would succeed without fault.

Steps to Reproduce:
1. Create an environment where the undercloud and overcloud nodes are VMs on the same host
2. Attempt to provision a 5x node overcloud deployment
3. Watch /var/log/nova/nova-compute.log for messaging timeouts followed by failures to spawn, eventually leading to the Ironic deployments failing and the stack going CREATE_FAILED.
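The log watch in step 3 can be sketched as a small scan for the failure signatures quoted earlier in this report. The message strings come from the log excerpts above; the helper name and structure are illustrative, not part of any OpenStack tool:

```python
import re

# Failure signatures taken from the nova-compute log excerpts in this
# report (messaging timeout, spawn failure, orphaned-instance destroy).
PATTERNS = [
    re.compile(r"MessagingTimeout: Timed out waiting for a reply"),
    re.compile(r"Instance failed to spawn"),
    re.compile(r"Destroy called on non-existing instance"),
]

def find_failure_lines(log_lines):
    """Return the log lines matching any of the failure signatures."""
    return [line for line in log_lines
            if any(p.search(line) for p in PATTERNS)]

# Usage against the compute log named in the reproduction steps:
# with open("/var/log/nova/nova-compute.log") as f:
#     for hit in find_failure_lines(f):
#         print(hit.rstrip())
```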

Actual results:

Vast majority of deployments failing.

Expected results:

Deployment succeeds, every time, unless user error ;-)

Additional info:

This can be successfully worked around with the following settings. I've attempted to deploy 6 times with these changes and it's worked every time:

undercloud# openstack-config --set /etc/nova/nova.conf DEFAULT rpc_response_timeout 600
undercloud# openstack-config --set /etc/ironic/ironic.conf DEFAULT rpc_response_timeout 600
undercloud# openstack-service restart nova
undercloud# openstack-service restart ironic

Then attempt the redeployment after ensuring there's a clean 'ironic node-list', 'nova list --all-tenants', and 'heat stack-list'.
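The two openstack-config calls above simply set a key in the [DEFAULT] section of an INI-style config file. As a rough equivalent, here is a minimal sketch using Python's configparser (the function name is hypothetical; interpolation is disabled because OpenStack configs may contain literal % characters). Note that, unlike openstack-config, configparser rewrites the whole file and drops comments, so this is only an illustration of the edit, not a drop-in replacement:

```python
import configparser

def set_rpc_response_timeout(conf_path, seconds=600):
    """Set DEFAULT/rpc_response_timeout in an INI-style config file,
    mirroring what `openstack-config --set` does for this key."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read(conf_path)
    cfg["DEFAULT"]["rpc_response_timeout"] = str(seconds)
    with open(conf_path, "w") as f:
        cfg.write(f)

# After editing /etc/nova/nova.conf and /etc/ironic/ironic.conf this
# way, the services still need restarting (openstack-service restart).
```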

Comment 4 Amit Ugol 2015-08-10 11:34:42 UTC
I've tried HA deployments and thus far with 600 it looks much better.

Comment 6 errata-xmlrpc 2015-08-13 20:04:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1624

