Bug 1410977
| Field | Value |
|---|---|
| Summary | deployment failed with timeout due to network configuration of a node |
| Product | Red Hat OpenStack |
| Component | rhosp-director |
| Status | CLOSED NOTABUG |
| Severity | medium |
| Priority | high |
| Version | 10.0 (Newton) |
| Reporter | Yogev Rabl <yrabl> |
| Assignee | Emilien Macchi <emilien> |
| QA Contact | Gurenko Alex <agurenko> |
| CC | abond, bengland, dbecker, emacchi, emilien, flucifre, jefbrown, johfulto, jomurphy, jtaleric, mburns, morazi, rhel-osp-director-maint, rsussman, slinaber, tvignaud, twilkins |
| Target Milestone | --- |
| Target Release | --- |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | scale_lab |
| Keywords | Triaged, ZStream |
| Doc Type | If docs needed, set a value |
| Last Closed | 2019-01-28 12:23:22 UTC |
| Type | Bug |
| Bug Blocks | 1414467, 1481685 |
Description (Yogev Rabl, 2017-01-07 02:52:53 UTC)

Created attachment 1238161 [details]
undercloud logs and /var/log/messages from the problematic node
---

Ben England (comment #5):

Changed priority and severity. There are multiple problems:

- Why did it take 4 hours to give up on the defective node?
- Is partial success an option here? It should be, especially since it is possible to fix the problem with the missing node and add it into the cluster later, right?

In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc. The problem should be reported at the end, but it should not prevent the deployment from succeeding in a reasonable amount of time. So this is in fact an OpenStack scalability problem, which makes it a high priority.

---

(In reply to Ben England from comment #5)

> - why did it take 4 hours to give up on the defective node?

This is nothing specific to Ceph; 4 hours is the default timeout, Ben. If you feel that this is not acceptable, suggest a suitable timeout. Larger deployments get pretty close to the 4-hour timeout, and if the timeout is increased, we need to update keystone as well.

> - is partial success an option here? It should be. Especially since it is possible to fix the problem with the missing node and add it into the cluster later, right?

This is not an option today. We have had multiple discussions with the DF DFG team on how to have an acceptable tolerance for failures during deployment.

> In any sufficiently large cluster there will *almost always* be a missing block device, network interface, etc. and the problem should be reported at the end but should not prevent the deployment from succeeding in a reasonable amount of time.

The timeout should not be on the entire deployment; it should be on the next batch of nodes. If you are doing a 1000-node deployment, 4 hours is not enough, nor should it be.
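For reference, the 4-hour default discussed above corresponds to the deploy command's `--timeout` option (in minutes). Below is a minimal sketch of raising it, assuming a TripleO undercloud; the keystone edit reflects the note that the token lifetime must also cover the longer window. Verify flag names, config paths, and service names against your OSP release before relying on this.

```shell
# Sketch only: raise the overcloud deploy timeout from the 240-minute
# default to 6 hours (360 minutes).
openstack overcloud deploy --templates --timeout 360

# As noted in the comment, keystone should be updated to match:
# token lifetime is in seconds, so 6 hours = 21600.
# (crudini and the keystone service name here are assumptions about
# the undercloud; adjust for your environment.)
sudo crudini --set /etc/keystone/keystone.conf token expiration 21600
sudo systemctl restart openstack-keystone
```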
---

We really don't care if it takes 12 hours or 24 hours, as long as the deployment is steadily making progress at a reasonable rate. We don't care whether every single host gets deployed successfully; for really large deployments it is likely that *some* host(s) will malfunction. That's just expected behavior in a very large cluster. Perhaps there could be a percentage of hosts that we expect to deploy successfully, defaulting to something like 95% (and rounded up, so if there are only 4 hosts in the cluster, you expect them all to deploy).

---

Dropping this bz on the advice of the Perf & Scale team dealing with OpenStack. Joe Talerico's comment: "The timeout is configurable, unless something else has changed? Regardless, we would never recommend deploying all 1000 nodes at once. We have seen where building large numbers of hosts is error prone."

---

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
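The success-percentage idea proposed above can be sketched as a simple threshold check. This is a hypothetical policy helper for illustration, not an actual director parameter; the 95% default and the round-up rule come from the comment itself.

```python
import math

def required_successes(total_nodes: int, success_fraction: float = 0.95) -> int:
    """Minimum number of nodes that must deploy successfully.

    Rounds up, so small clusters still require every node to succeed,
    matching the "round up, so if only 4 hosts ... you expect them all
    to deploy" suggestion. Hypothetical sketch, not a director option.
    """
    return math.ceil(total_nodes * success_fraction)

# A 4-node cluster requires all 4 nodes to succeed; a 1000-node
# deployment would tolerate up to 50 failed hosts.
print(required_successes(4))     # -> 4
print(required_successes(1000))  # -> 950
```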