1410977 – deployment failed with timeout due to network configuration of a node

Summary: deployment failed with timeout due to network configuration of a node

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rhosp-director
Sub Component:
Version:	10.0 (Newton)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Emilien Macchi
QA Contact:	Gurenko Alex
Docs Contact:
URL:
Whiteboard:	scale_lab
Depends On:
Blocks:	1414467 1481685

Reported:	2017-01-07 02:52 UTC by Yogev Rabl
Modified:	2023-09-14 03:40 UTC (History)
CC List:	17 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-01-28 12:23:22 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
undercloud logs and /var/log/messages from the problematic node (5.81 MB, application/x-xz) 2017-01-07 02:55 UTC, Yogev Rabl	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-28622	0	None	None	None	2023-09-14 03:40:16 UTC

Description Yogev Rabl 2017-01-07 02:52:53 UTC

Description of problem:
A deployment of 26 overcloud nodes failed with a timeout when Heat did not receive the 'done' signal from one of the Ceph storage nodes while setting the network configuration of that node.
The node's network configuration is set as it should be, but Heat doesn't acknowledge it.
The deployment consists of 3 controller nodes, 3 compute nodes and 20 Ceph storage nodes.

This deployment has been tested twice with identical results.


Version-Release number of selected component (if applicable):
openstack-tripleo-puppet-elements-5.1.0-2.el7ost.noarch
openstack-tripleo-ui-1.0.5-3.el7ost.noarch
openstack-tripleo-image-elements-5.1.0-1.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
python-tripleoclient-5.4.0-2.el7ost.noarch
puppet-tripleo-5.4.0-3.el7ost.noarch
openstack-tripleo-common-5.4.0-3.el7ost.noarch
openstack-tripleo-validations-5.1.0-5.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-heat-common-7.0.0-7.el7ost.noarch
openstack-heat-templates-0-0.9.1e6015dgit.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
openstack-heat-api-7.0.0-7.el7ost.noarch
puppet-heat-9.4.1-1.el7ost.noarch
openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
openstack-heat-engine-7.0.0-7.el7ost.noarch
python-heatclient-1.5.0-1.el7ost.noarch
python-heat-agent-0-0.9.1e6015dgit.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
puppet-ironic-9.4.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
python-nova-14.0.2-7.el7ost.noarch
openstack-nova-api-14.0.2-7.el7ost.noarch
openstack-nova-compute-14.0.2-7.el7ost.noarch
openstack-nova-scheduler-14.0.2-7.el7ost.noarch
openstack-nova-cert-14.0.2-7.el7ost.noarch
python-novaclient-6.0.0-1.el7ost.noarch
openstack-nova-common-14.0.2-7.el7ost.noarch
puppet-nova-9.4.0-1.el7ost.noarch
openstack-nova-conductor-14.0.2-7.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with the same number of nodes as specified above


Actual results:
The deployment fails with a timeout 4 hours after starting.

Expected results:
Either the deployment should fail sooner, with a specific error about the problematic node, or it should discard the node, choose another one (if available) and continue.

Additional info:
Attached are the logs from the undercloud

Comment 1 Yogev Rabl 2017-01-07 02:55:13 UTC

Created attachment 1238161 [details]
undercloud logs and /var/log/messages from the problematic node

Comment 5 Ben England 2017-05-08 18:51:53 UTC

Changed priority and severity.

There are multiple problems:
- Why did it take 4 hours to give up on the defective node?
- Is partial success an option here? It should be, especially since it is possible to fix the problem with the missing node and add it into the cluster later, right?

In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc. The problem should be reported at the end, but it should not prevent the deployment from succeeding in a reasonable amount of time. So this problem is in fact an OpenStack scalability problem, which makes it a high priority.

Comment 6 Joe Talerico 2017-06-19 14:02:39 UTC

(In reply to Ben England from comment #5)
> changed priority and severity.
> 
> There are multiple problems 
> - why did it take 4 hours to give up on the defective node?

This is not specific to Ceph; 4 hours is the default timeout, Ben. If you feel that this is not acceptable, suggest a suitable timeout. Larger deployments get pretty close to the 4-hour timeout. If the timeout is increased, we need to update keystone as well.

> - is partial success an option here?  It should be. Especially since it is
> possible to fix the problem with the missing node and add it into the
> cluster later, right?  

This is not an option today. We have had multiple discussions with the DF DFG team on how to have acceptable tolerance for failures during deployment.

> 
> In any sufficiently large cluster there will *almost always* be a missing
> block device, network interface, etc. and the problem should be reported at
> the end but should not prevent the deployment from succeeding in a
> reasonable amount of time.  So this problem is in fact an OpenStack
> scalability problem, which makes it a high priority.

Comment 12 Ben England 2019-01-14 00:51:16 UTC

The timeout should not be on the entire deployment; it should be on the next batch of nodes. If you are doing a 1000-node deployment, 4 hours is not enough, nor should it be. We really don't care if it takes 12 hours, or 24 hours, as long as the deployment is steadily making progress at a reasonable rate.
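To make the per-batch idea concrete, here is a minimal sketch in Python. It is not TripleO or Heat code: the names `BatchResult` and `run_batches`, and the batch durations used in the example, are hypothetical and exist only for illustration. The point it shows is that the timeout is checked against each batch rather than against the sum of all batches.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class BatchResult:
    index: int          # position of the batch in the deployment order
    duration_s: float   # how long the batch took (simulated input here)
    timed_out: bool     # True if this batch alone exceeded the batch timeout


def run_batches(batch_durations_s: Sequence[float],
                batch_timeout_s: float) -> List[BatchResult]:
    """Apply the timeout to each batch, never to the whole deployment.

    Hypothetical helper: real durations would come from the deployment
    tool; here they are passed in so the sketch stays self-contained.
    """
    results: List[BatchResult] = []
    for index, duration in enumerate(batch_durations_s):
        timed_out = duration > batch_timeout_s
        results.append(BatchResult(index, duration, timed_out))
        if timed_out:
            # Stop at the first batch that stops making progress.
            break
    return results


# Ten batches of one hour each: 10 hours in total, well past a 4-hour
# global timeout, yet no single batch exceeds a 2-hour batch timeout.
hour = 3600.0
results = run_batches([hour] * 10, batch_timeout_s=2 * hour)
total_s = sum(r.duration_s for r in results)
```

With these assumed numbers the deployment runs for 10 hours without any batch timing out, whereas a single stuck batch would be detected after 2 hours instead of 4.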

We don't care if every single host gets deployed successfully. For really large deployments it is likely that *some* host(s) will malfunction. That's just expected behavior in a very large cluster. Perhaps there could be a percentage of hosts that we expect to deploy successfully, defaulting to something like 95% (and round up, so if there are only 4 hosts in the cluster, you expect them all to deploy).
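The threshold arithmetic could look something like the sketch below. This is only an illustration of the rounding rule, not an existing director option: `min_successful_hosts` and `deployment_succeeded` are hypothetical names, and the 95% default is the figure suggested above. Integer arithmetic is used for the round-up so that floating-point error cannot change the result.

```python
def min_successful_hosts(total_hosts: int, required_percent: int = 95) -> int:
    """Smallest number of hosts that must deploy, rounding up.

    Hypothetical helper; required_percent is a whole-number percentage.
    """
    if total_hosts < 0:
        raise ValueError("total_hosts must not be negative")
    if not 0 <= required_percent <= 100:
        raise ValueError("required_percent must be between 0 and 100")
    # Ceiling of total_hosts * required_percent / 100 without floats.
    return (total_hosts * required_percent + 99) // 100


def deployment_succeeded(total_hosts: int, deployed_hosts: int,
                         required_percent: int = 95) -> bool:
    """True if enough hosts deployed to meet the required percentage."""
    return deployed_hosts >= min_successful_hosts(total_hosts, required_percent)


print(min_successful_hosts(4))      # 4: a small cluster needs every host
print(min_successful_hosts(26))     # 25: the 26-node case tolerates one failure
print(min_successful_hosts(1000))   # 950
```

Under this rule the 26-node deployment from this report would have counted as a success with one Ceph storage node missing, provided the failure was reported at the end.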

Comment 15 Ben England 2019-01-28 12:23:22 UTC

Dropping this bz on advice of the Perf & Scale team dealing with OpenStack.

Joe Talerico's comment: "The timeout is configurable, unless something else has changed? Regardless, we would never recommend deploying all 1000 nodes at once. We have seen where building large number of hosts is error prone."

Comment 16 Red Hat Bugzilla 2023-09-14 03:37:01 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
