Bug 1410977

Summary: deployment failed with timeout due to network configuration of a node
Product: Red Hat OpenStack
Reporter: Yogev Rabl <yrabl>
Component: rhosp-director
Assignee: Emilien Macchi <emilien>
Status: CLOSED NOTABUG
QA Contact: Gurenko Alex <agurenko>
Severity: medium
Priority: high
Version: 10.0 (Newton)
CC: abond, bengland, dbecker, emacchi, emilien, flucifre, jefbrown, johfulto, jomurphy, jtaleric, mburns, morazi, rhel-osp-director-maint, rsussman, slinaber, tvignaud, twilkins
Keywords: Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Whiteboard: scale_lab
Last Closed: 2019-01-28 12:23:22 UTC
Type: Bug
Bug Blocks: 1414467, 1481685    
Attachments:
undercloud logs and /var/log/messages from the problematic node

Description Yogev Rabl 2017-01-07 02:52:53 UTC
Description of problem:
A 26-node overcloud deployment failed with a timeout when Heat did not receive the 'done' signal from one of the Ceph storage nodes while the node's network configuration was being applied.
The node's network configuration is set as it should be, but Heat does not acknowledge it.
The deployment consists of 3 controller nodes, 3 compute nodes, and 20 Ceph storage nodes.

This deployment has been tested twice, with identical results.


Version-Release number of selected component (if applicable):
openstack-tripleo-puppet-elements-5.1.0-2.el7ost.noarch
openstack-tripleo-ui-1.0.5-3.el7ost.noarch
openstack-tripleo-image-elements-5.1.0-1.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
python-tripleoclient-5.4.0-2.el7ost.noarch
puppet-tripleo-5.4.0-3.el7ost.noarch
openstack-tripleo-common-5.4.0-3.el7ost.noarch
openstack-tripleo-validations-5.1.0-5.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-heat-common-7.0.0-7.el7ost.noarch
openstack-heat-templates-0-0.9.1e6015dgit.el7ost.noarch
openstack-heat-api-7.0.0-7.el7ost.noarch
puppet-heat-9.4.1-1.el7ost.noarch
openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
openstack-heat-engine-7.0.0-7.el7ost.noarch
python-heatclient-1.5.0-1.el7ost.noarch
python-heat-agent-0-0.9.1e6015dgit.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
puppet-ironic-9.4.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
python-nova-14.0.2-7.el7ost.noarch
openstack-nova-api-14.0.2-7.el7ost.noarch
openstack-nova-compute-14.0.2-7.el7ost.noarch
openstack-nova-scheduler-14.0.2-7.el7ost.noarch
openstack-nova-cert-14.0.2-7.el7ost.noarch
python-novaclient-6.0.0-1.el7ost.noarch
openstack-nova-common-14.0.2-7.el7ost.noarch
puppet-nova-9.4.0-1.el7ost.noarch
openstack-nova-conductor-14.0.2-7.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with the same number of nodes as specified above
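The node counts above would normally be supplied to the deploy command through a Heat environment file. A minimal sketch, assuming the standard tripleo-heat-templates count parameters (the file name is hypothetical, and the deploy command itself needs a configured undercloud, so it is shown commented out):

```shell
# Sketch: environment file matching the node counts in this report
# (3 controller, 3 compute, 20 Ceph storage). File name is hypothetical.
cat > node-counts.yaml <<'EOF'
parameter_defaults:
  ControllerCount: 3
  ComputeCount: 3
  CephStorageCount: 20
EOF

# Requires a configured undercloud, so shown for illustration only:
# openstack overcloud deploy --templates -e node-counts.yaml
echo "environment file written"
```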


Actual results:
The deployment fails with a timeout 4 hours after starting.

Expected results:
Either the deployment should fail after a shorter time with a specific error identifying the problematic node, or it should discard that node, choose another one (if available), and continue.

Additional info:
Attached are the logs from the undercloud

Comment 1 Yogev Rabl 2017-01-07 02:55:13 UTC
Created attachment 1238161 [details]
undercloud logs and /var/log/messages from the problematic node

Comment 5 Ben England 2017-05-08 18:51:53 UTC
changed priority and severity.

There are multiple problems:
- Why did it take 4 hours to give up on the defective node?
- Is partial success an option here? It should be, especially since it is possible to fix the problem with the missing node and add it to the cluster later.

In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc. The problem should be reported at the end, but it should not prevent the deployment from succeeding in a reasonable amount of time. So this is in fact an OpenStack scalability problem, which makes it a high priority.

Comment 6 Joe Talerico 2017-06-19 14:02:39 UTC
(In reply to Ben England from comment #5)
> changed priority and severity.
> 
> There are multiple problems 
> - why did it take 4 hours to give up on the defective node?

This is nothing specific to Ceph; 4 hours is the default timeout, Ben. If you feel that this is not acceptable, suggest a suitable timeout. Larger deployments get fairly close to the 4-hour timeout, and if the timeout is increased we need to update Keystone as well.

> - is partial success an option here?  It should be. Especially since it is
> possible to fix the problem with the missing node and add it into the
> cluster later, right?  

This is not an option today. We have had multiple discussions with the DF DFG team on how to have acceptable tolerance for failures during deployment.

> 
> In any sufficiently large cluster there will *almost always* be a missing
> block device, network interface, etc. and the problem should be reported at
> the end but should not prevent the deployment from succeeding in a
> reasonable amount of time.  So this problem is in fact a OpenStack
> scalability problem, which makes it a high priority.

Comment 12 Ben England 2019-01-14 00:51:16 UTC
The timeout should not apply to the entire deployment; it should apply to the next batch of nodes. If you are doing a 1000-node deployment, 4 hours is not enough, nor should it need to be. We really don't care whether it takes 12 hours or 24 hours, as long as the deployment is steadily making progress at a reasonable rate.

We don't care if every single host gets deployed successfully. For really large deployments it is likely that *some* host(s) will malfunction; that is just expected behavior in a very large cluster. Perhaps there could be a percentage of hosts that we expect to deploy successfully, defaulting to something like 95% (rounding up, so if there are only 4 hosts in the cluster, you expect all of them to deploy).
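The per-batch timeout and success-percentage ideas above are not an existing TripleO feature; a minimal sketch of the proposed arithmetic, with the batch size and totals as assumed example values:

```shell
# Proposed policy sketch: deploy in fixed-size batches, with the timeout
# applying per batch, and accept the deployment if at least MIN_PCT
# percent of hosts come up. Rounding is up, so very small clusters
# still require every host. Hypothetical values; nothing here is a
# TripleO feature today.
TOTAL=20          # hosts requested
MIN_PCT=95        # proposed success threshold
BATCH=5           # hypothetical batch size

# ceil(TOTAL * MIN_PCT / 100) with integer arithmetic
REQUIRED=$(( (TOTAL * MIN_PCT + 99) / 100 ))
echo "require $REQUIRED of $TOTAL hosts to succeed"

for start in $(seq 1 $BATCH $TOTAL); do
  end=$(( start + BATCH - 1 ))
  [ $end -gt $TOTAL ] && end=$TOTAL
  echo "batch: hosts $start-$end (timeout applies per batch)"
done
```

With 20 hosts at 95%, 19 must succeed; with only 4 hosts the ceiling makes all 4 required, matching the rounding rule described above.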

Comment 15 Ben England 2019-01-28 12:23:22 UTC
Dropping this bz on the advice of the Perf & Scale team dealing with OpenStack.

Joe Talerico's comment: "The timeout is configurable, unless something else has changed? Regardless, we would never recommend deploying all 1000 nodes at once. We have seen that building large numbers of hosts is error-prone."

Comment 16 Red Hat Bugzilla 2023-09-14 03:37:01 UTC
The needinfo request[s] on this closed bug have been removed, as they have been unresolved for 1000 days.