Description of problem:
This was seen on a Dell scaling setup. While attempting to scale from 100 to 150 compute nodes, one node's uplink was down, which caused the deployment to fail. I deleted the offending node from Nova and put it into maintenance mode in Ironic. The attempt to deploy 150 nodes then failed to update the Heat stack. I increased the count to 160 nodes, but it still failed. Next I dropped the number of compute nodes to 128, and the Heat stack updated successfully. I then attempted to scale to 150 nodes, but no new nodes were spawned. Joe suggested dropping the number of nodes to 100; the Heat stack failed to update and no nodes were removed. It seems there is no way to recover the Heat stack and continue the deployment. I will upload the Heat and Nova logs from the undercloud to Box and provide a link.

Version-Release number of selected component (if applicable):
RHOSP 8 beta 9

How reproducible:
Seen once

Additional info:
We really need to see the status_reason to know what it is failing on. Is there a reason you deleted the server from Nova and scaled the whole deployment down rather than following the documented procedure for removing a single failed node?
Zane - I did not delete the node, BSN did. I suggested allowing Heat to reschedule the instance. However, to recover from this, I suggested going below 129 nodes (to 128). Node 129 is the one BSN deleted. This did not fix the problem, however.
OK. It would have been less likely to have gone wrong had they not deleted the node from Nova. It's also generally more efficient to remove a single node using the instructions at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html-single/Director_Installation_and_Usage/index.html#sect-Removing_Nodes_from_the_Overcloud than to scale down the whole cloud, but that isn't a problem as such. There's a good chance that the environment was somehow recoverable, but since that is now moot we'll have to wait for the logs to try to figure out the cause.
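For reference, the documented single-node removal goes through the director (tripleoclient) so that Heat's view of the Compute resource group stays consistent, rather than deleting the server directly in Nova. A rough sketch of that flow, assuming an OSP 7/8-era undercloud; the stack name and UUIDs are placeholders, and the exact commands should be checked against the linked guide:

```shell
# Hedged sketch of the documented removal flow (not the exact guide text);
# <instance-uuid> and <ironic-node-uuid> are placeholders.

# 1. Find the Nova instance UUID of the failed compute node.
nova list

# 2. Let the director remove just that node and update the Heat stack.
#    This keeps Heat's resource group consistent, unlike a direct
#    "nova delete" of the server.
openstack overcloud node delete --stack overcloud <instance-uuid>

# 3. Once the node is removed from the overcloud, take the baremetal
#    node out of rotation in Ironic.
ironic node-set-maintenance <ironic-node-uuid> true
```

The key difference from what was done here is step 2: because the delete is driven through a stack update, Heat never ends up referencing a server that no longer exists.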
Zane, I did not follow the documented procedure. This is what I did previously in RHOSP 7 to remove nodes that were in an error state, and it seemed to work. The Heat and Nova logs are uploaded to https://bigswitch.box.com/s/jclkl4aorrg8lnqrleviaz0it9s3sj6k As the file was too large, I removed the previously archived files from the Nova log.
I see these lines in the log, which correspond to the failure in Joe's paste:

2016-04-13 19:58:10.859 32159 INFO heat.engine.stack [-] Stack UPDATE FAILED (overcloud-Compute-4k5xawnt7nxb-69-vw37eergl5yb): StackValidationFailed: resources.InternalApiPort: Property error: InternalApiPort.Properties.ControlPlaneIP: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-77505425-1038-4587-97fc-e1db7204bca0)
2016-04-13 19:58:31.460 32154 INFO heat.engine.stack [-] Stack UPDATE FAILED (overcloud-Compute-4k5xawnt7nxb-73-khfk3yampjnq): ClientException: resources.NetIpMap: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-fe72e7d6-689c-4381-8ff0-6db719c1fb74)

Note that these are Compute-69 and Compute-73, so not even the node that was being replaced. From the Nova log (and this is consistent with many earlier errors in Heat too), the 500 errors were due to a timeout in Keystone. I would guess that these Keystone timeouts are at the root of all of the issues you have seen. I also saw https://bugs.launchpad.net/heat/+bug/1562042 appear in the Heat log, but I don't believe it's related to any of the symptoms described.
Reassigning component as the root cause appears to be keystone timeouts, not anything in Heat.
There have been changes in the way keystone is configured by default with regards to threads/workers in the past year, which may have very well addressed the timeouts that were seen with the deployment mentioned in this bug. Given the age of this bug and the fact that it was related to a particular deployment that is very likely not around anymore, I am not sure that this bug is still relevant. I am going to close this issue, but please feel free to reopen it if this issue is still occurring.
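For anyone hitting similar Keystone timeouts on an undercloud of this vintage, the tuning in question is the worker/process counts. A hedged illustration of the kind of change involved, assuming an eventlet-based Keystone as shipped in older releases; the section and option names moved between releases (and later releases run Keystone under httpd/mod_wsgi instead), and the values shown are placeholders, not recommendations:

```shell
# Illustrative only: raise Keystone API worker counts on the undercloud.
# Verify the section/option names against your release's keystone.conf;
# on mod_wsgi deployments the equivalent knob is the WSGI process count.
sudo crudini --set /etc/keystone/keystone.conf eventlet_server public_workers 4
sudo crudini --set /etc/keystone/keystone.conf eventlet_server admin_workers 4
sudo systemctl restart openstack-keystone
```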