Bug 1410977 - deployment failed with timeout due to network configuration of a node [NEEDINFO]
Summary: deployment failed with timeout due to network configuration of a node
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Emilien Macchi
QA Contact: Gurenko Alex
URL:
Whiteboard: scale_lab
Depends On:
Blocks: 1481685 1414467
 
Reported: 2017-01-07 02:52 UTC by Yogev Rabl
Modified: 2019-01-28 12:23 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-28 12:23:22 UTC
Target Upstream Version:
slinaber: needinfo? (emilien)


Attachments:
undercloud logs and /var/log/messages from the problematic node (5.81 MB, application/x-xz)
2017-01-07 02:55 UTC, Yogev Rabl

Description Yogev Rabl 2017-01-07 02:52:53 UTC
Description of problem:
A 26-node overcloud deployment failed with a timeout when Heat did not receive the 'done' signal from one of the Ceph storage nodes while applying that node's network configuration.
The node's network configuration is applied as it should be, but Heat does not acknowledge it.
The deployment consists of 3 controller nodes, 3 compute nodes and 20 Ceph storage nodes.

This deployment was attempted twice, with identical results.


Version-Release number of selected component (if applicable):
openstack-tripleo-puppet-elements-5.1.0-2.el7ost.noarch
openstack-tripleo-ui-1.0.5-3.el7ost.noarch
openstack-tripleo-image-elements-5.1.0-1.el7ost.noarch
openstack-tripleo-heat-templates-5.1.0-7.el7ost.noarch
python-tripleoclient-5.4.0-2.el7ost.noarch
puppet-tripleo-5.4.0-3.el7ost.noarch
openstack-tripleo-common-5.4.0-3.el7ost.noarch
openstack-tripleo-validations-5.1.0-5.el7ost.noarch
openstack-tripleo-0.0.8-0.2.4de13b3git.el7ost.noarch
openstack-heat-common-7.0.0-7.el7ost.noarch
openstack-heat-templates-0-0.9.1e6015dgit.el7ost.noarch
openstack-heat-api-7.0.0-7.el7ost.noarch
puppet-heat-9.4.1-1.el7ost.noarch
openstack-heat-api-cfn-7.0.0-7.el7ost.noarch
openstack-heat-engine-7.0.0-7.el7ost.noarch
python-heatclient-1.5.0-1.el7ost.noarch
python-heat-agent-0-0.9.1e6015dgit.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-ironic-conductor-6.2.2-2.el7ost.noarch
python-ironicclient-1.7.0-1.el7ost.noarch
python-ironic-lib-2.1.1-2.el7ost.noarch
python-ironic-inspector-client-1.9.0-2.el7ost.noarch
puppet-ironic-9.4.1-1.el7ost.noarch
openstack-ironic-common-6.2.2-2.el7ost.noarch
openstack-ironic-api-6.2.2-2.el7ost.noarch
openstack-ironic-inspector-4.2.1-1.el7ost.noarch
python-nova-14.0.2-7.el7ost.noarch
openstack-nova-api-14.0.2-7.el7ost.noarch
openstack-nova-compute-14.0.2-7.el7ost.noarch
openstack-nova-scheduler-14.0.2-7.el7ost.noarch
openstack-nova-cert-14.0.2-7.el7ost.noarch
python-novaclient-6.0.0-1.el7ost.noarch
openstack-nova-common-14.0.2-7.el7ost.noarch
puppet-nova-9.4.0-1.el7ost.noarch
openstack-nova-conductor-14.0.2-7.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy an overcloud with the same number of nodes as specified above


Actual results:
The deployment fails with a timeout 4 hours after starting.

Expected results:
Either the deployment should fail sooner, with a specific error about the problematic node, or it should discard the node, choose another one (if available) and continue.

Additional info:
Attached are the logs from the undercloud

Comment 1 Yogev Rabl 2017-01-07 02:55:13 UTC
Created attachment 1238161 [details]
undercloud logs and /var/log/messages from the problematic node

Comment 5 Ben England 2017-05-08 18:51:53 UTC
Changed priority and severity.

There are multiple problems:
- Why did it take 4 hours to give up on the defective node?
- Is partial success an option here? It should be, especially since it is possible to fix the problem with the missing node and add it to the cluster later, right?

In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc. The problem should be reported at the end, but it should not prevent the deployment from succeeding in a reasonable amount of time. So this is in fact an OpenStack scalability problem, which makes it a high priority.

Comment 6 Joe Talerico 2017-06-19 14:02:39 UTC
(In reply to Ben England from comment #5)
> changed priority and severity.
> 
> There are multiple problems 
> - why did it take 4 hours to give up on the defective node?

This is not specific to Ceph; 4 hours is the default timeout, Ben. If you feel that this is not acceptable, please suggest a suitable timeout. Larger deployments already get pretty close to the 4-hour timeout. If the timeout is increased, we need to update Keystone as well.
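
For reference, the overall deployment timeout is given to the deploy command in minutes, so the 4-hour default corresponds to 240. Below is a minimal Python sketch that only builds such a command line, assuming the `--timeout` option of `openstack overcloud deploy` from python-tripleoclient. The helper name `build_deploy_command` is hypothetical, and nothing is executed.

```python
import shlex

DEFAULT_TIMEOUT_MINUTES = 240  # the 4-hour default discussed in this comment


def build_deploy_command(timeout_hours=None):
    """Return the argv list for an overcloud deploy with an explicit timeout.

    Hypothetical helper for illustration; it only builds the command and
    does not run it.
    """
    if timeout_hours is None:
        minutes = DEFAULT_TIMEOUT_MINUTES
    else:
        minutes = int(timeout_hours * 60)
    if minutes <= 0:
        raise ValueError("timeout must be positive")
    return ["openstack", "overcloud", "deploy", "--templates",
            "--timeout", str(minutes)]


if __name__ == "__main__":
    # Example: request a 6-hour timeout instead of the default.
    print(shlex.join(build_deploy_command(6)))
```

Any other limit that assumes a 4-hour run (such as the Keystone setting mentioned above) would have to be raised separately; this sketch does not cover that.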

> - is partial success an option here?  It should be. Especially since it is
> possible to fix the problem with the missing node and add it into the
> cluster later, right?  

This is not a option today. We have had multiple discussions with the DF DFG team on how to have acceptable tolerance for failures during deployment.

> 
> In any sufficiently large cluster there will *almost always* be a missing
> block device, network interface, etc. and the problem should be reported at
> the end but should not prevent the deployment from succeeding in a
> reasonable amount of time.  So this problem is in fact a OpenStack
> scalability problem, which makes it a high priority.

Comment 12 Ben England 2019-01-14 00:51:16 UTC
The timeout should not apply to the entire deployment; it should apply to the next batch of nodes. If you are doing a 1000-node deployment, 4 hours is not enough, nor should it be. We really don't care if it takes 12 hours or 24 hours, as long as the deployment is steadily making progress at a reasonable rate.
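
The per-batch idea can be sketched in Python as follows. This is only an illustration of the proposal, not how director or Heat works today; `deploy_in_batches`, `deploy_batch`, the batch size and the 30-minute batch timeout are all hypothetical.

```python
import time


def deploy_in_batches(nodes, deploy_batch, batch_size=10,
                      batch_timeout=1800.0, clock=time.monotonic):
    """Deploy nodes batch by batch, timing each batch separately.

    deploy_batch(batch, deadline) is a hypothetical callable that returns
    the nodes of `batch` that finished before `deadline` (a clock() value).
    Returns (succeeded, failed) lists of nodes.
    """
    succeeded, failed = [], []
    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]
        deadline = clock() + batch_timeout  # the timer restarts per batch
        done = set(deploy_batch(batch, deadline))
        succeeded.extend(n for n in batch if n in done)
        failed.extend(n for n in batch if n not in done)
    return succeeded, failed


if __name__ == "__main__":
    def fake_deploy(batch, deadline):
        # Pretend that node "ceph-7" never reports 'done'.
        return [n for n in batch if n != "ceph-7"]

    nodes = [f"ceph-{i}" for i in range(20)]
    ok, bad = deploy_in_batches(nodes, fake_deploy, batch_size=5)
    print(len(ok), bad)
```

With this structure a stuck node costs at most one batch timeout, and the failed nodes are reported at the end instead of aborting the whole run.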

We don't care if every single host gets deployed successfully. For really large deployments it is likely that *some* host(s) will malfunction. That's just expected behavior in a very large cluster. Perhaps there could be a percentage of hosts that we expect to deploy successfully, defaulting to something like 95% (rounded up, so if there are only 4 hosts in the cluster, you expect them all to deploy).
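
The rounding rule can be made concrete with a short Python sketch. The function names and the integer-percentage parameter are assumptions made for illustration; no such setting exists in the product as far as this report shows.

```python
def required_successes(total_hosts, min_percent=95):
    """Smallest host count that meets min_percent of total_hosts, rounded up."""
    if total_hosts < 0 or not 0 <= min_percent <= 100:
        raise ValueError("invalid host count or percentage")
    # Integer ceiling division avoids floating-point rounding surprises.
    return -(-total_hosts * min_percent // 100)


def deployment_ok(total_hosts, deployed_hosts, min_percent=95):
    """True if enough hosts deployed to satisfy the success threshold."""
    return deployed_hosts >= required_successes(total_hosts, min_percent)


if __name__ == "__main__":
    for hosts in (4, 26, 1000):
        print(hosts, required_successes(hosts))
```

For the 26-node deployment in this report the threshold would be 25 hosts, so the single Ceph node that never reported 'done' would not have failed the deployment under this rule.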

Comment 15 Ben England 2019-01-28 12:23:22 UTC
Dropping this bz on advice of the Perf & Scale team dealing with OpenStack.

Joe Talerico's comment: "The timeout is configurable, unless something else has changed? Regardless, we would never recommend deploying all 1000 nodes at once. We have seen where building large number of hosts is error prone."

