| Summary: | Unable to recover stack after deleting a node from nova | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | bigswitch <rhosp-bugs-internal> |
| Component: | rhosp-director | Assignee: | Angus Thomas <athomas> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 8.0 (Liberty) | CC: | bnemec, dbecker, josorior, jtaleric, mburns, mcornea, morazi, nkinder, rhel-osp-director-maint, sbaker, shardy, srevivo, zbitter |
| Target Milestone: | async | Keywords: | ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-11-21 19:28:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |

Description
bigswitch 2016-04-13 23:24:25 UTC
We really need to see the status_reason to know what it is failing on. Is there a reason you deleted the server from Nova and scaled the whole deployment down, rather than following the documented procedure for removing a single failed node?

Zane - I did not delete the node, BSN did. I suggested allowing Heat to reschedule the instance. However, to recover from this, I suggested going down below 129 nodes (to 128). Node 129 is the one BSN deleted. This did not fix the problem, however.

OK. It would have been less likely to go wrong had they not deleted the node from Nova. It's also generally more efficient to remove a single node using the instructions at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html-single/Director_Installation_and_Usage/index.html#sect-Removing_Nodes_from_the_Overcloud than to scale down the whole cloud, but that isn't a problem as such. There's a good chance that the environment was somehow recoverable, but since that is now moot, we'll have to wait for the logs to try to figure out the cause.

Zane, I did not follow the documented procedure. This is what I did previously in RHOSP 7 to remove nodes that are in an error state, and it seemed to work. The Heat and Nova logs are uploaded to https://bigswitch.box.com/s/jclkl4aorrg8lnqrleviaz0it9s3sj6k. As the file was too large, I removed the previously archived files from the Nova log.

I see these lines in the log, which correspond to the failure in Joe's paste:

```
2016-04-13 19:58:10.859 32159 INFO heat.engine.stack [-] Stack UPDATE FAILED (overcloud-Compute-4k5xawnt7nxb-69-vw37eergl5yb): StackValidationFailed: resources.InternalApiPort: Property error: InternalApiPort.Properties.ControlPlaneIP: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-77505425-1038-4587-97fc-e1db7204bca0)
2016-04-13 19:58:31.460 32154 INFO heat.engine.stack [-] Stack UPDATE FAILED (overcloud-Compute-4k5xawnt7nxb-73-khfk3yampjnq): ClientException: resources.NetIpMap: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-fe72e7d6-689c-4381-8ff0-6db719c1fb74)
```

Note that these are Compute-69 and Compute-73, so not even the node that was being replaced. From the Nova log (and this is consistent with many earlier errors in Heat too), the 500 errors were due to a timeout in Keystone. I would guess that these Keystone timeouts are at the root of all of the issues you have seen. I also saw https://bugs.launchpad.net/heat/+bug/1562042 appear in the Heat log, but I don't believe it's related to any of the symptoms described.

Reassigning the component, as the root cause appears to be Keystone timeouts, not anything in Heat.

There have been changes in the way Keystone is configured by default with regard to threads/workers in the past year, which may well have addressed the timeouts seen with the deployment mentioned in this bug. Given the age of this bug, and the fact that it concerns a particular deployment that is very likely no longer around, I am not sure this bug is still relevant. I am going to close this issue, but please feel free to reopen it if the problem is still occurring.
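For reference, the documented single-node removal the thread refers to is driven from the undercloud with python-tripleoclient, rather than by deleting the instance from Nova. A hedged sketch (the stack name `overcloud` and `<NODE_UUID>` are placeholders, and exact flags vary by RHOSP release):

```
# On the undercloud, as the stack user (illustrative; check the release docs)
source ~/stackrc

# Find the Nova server UUID of the failed overcloud node
openstack server list

# Ask Heat to remove that single node from the overcloud stack; the stack
# update then drops the resource cleanly instead of leaving an orphan
openstack overcloud node delete --stack overcloud <NODE_UUID>
```

Deleting the server directly in Nova, as happened here, leaves the Heat resource pointing at a server that no longer exists, which is why subsequent stack updates can fail.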
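On the Keystone side, the closing comment mentions default worker/thread tuning. In the Liberty timeframe, when Keystone could still run under eventlet, worker counts were set in `keystone.conf`; a hypothetical fragment (the values are illustrative examples, not a recommendation):

```
# /etc/keystone/keystone.conf -- illustrative Liberty-era eventlet tuning;
# values are examples only, and deployments running under httpd/mod_wsgi
# tune workers in the Apache/WSGI configuration instead
[eventlet_server]
# Processes serving the public API (port 5000)
public_workers = 8
# Processes serving the admin API (port 35357)
admin_workers = 8
```

Too few workers under a large (128+ node) stack update can cause token requests to queue and time out, which would surface exactly as the HTTP 500 errors Heat logged above.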