Description of problem:
I made a configuration mistake by adding a non-existent node to a deployment while deploying a split stack. I then tried to retry by putting the stack into a failed state and starting over. To fail the stack, I ran:

sudo systemctl restart openstack-heat-engine

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:
1. Start the overcloud_deploy.sh script.
2. Soon after that, run sudo systemctl restart openstack-heat-engine on a controller.

Actual results:
mistral-server was waiting for a timeout from the non-existent node, so the stack was still in the CREATE_IN_PROGRESS state after several attempts, and that took more than 20 minutes.

Expected results:
I expect the stack to fail right away if openstack-heat-engine is restarted.

Additional info:
The environment is gone now, but I'm happy to try to reproduce it again and grab whatever logs can help. In the end I killed mistral-server and restarted openstack-heat-engine again; the deployment then moved on and the stack failed.
I actually have an environment right now, with all nodes accessible, where the stack still does not fail after multiple restarts of openstack-heat-engine. Is there anything in particular I can collect from any of the nodes?
heat-engine log from the undercloud would always be the first step.
Created attachment 1348048
Heat logs from the undercloud-0

(In reply to Zane Bitter from comment #2)
> heat-engine log from the undercloud would always be the first step.

Please find the logs attached.
Logs show that several stacks were reset at startup.

At 11:27:
  overcloud-ControllerDeployedServer-zxqkr7yjuspt-2-ok3t3nxxgooh
  overcloud-ControllerDeployedServer-zxqkr7yjuspt-0-kpkxpear7v6a
  overcloud-NetworkerDeployedServer-en3vik4inlyf-0-vtezrjjmoxgv
  overcloud-ControllerDeployedServer-zxqkr7yjuspt-1-sryydr53kbkz
  overcloud-NetworkerDeployedServer-en3vik4inlyf-1-bxaj4q44o6wb

At 11:39:
  overcloud-DatabaseDeployedServer-ko4krg6khwjf-2-aewwpayccrb7
  overcloud-ControllerDeployedServer-zxqkr7yjuspt
  overcloud-DatabaseDeployedServer-ko4krg6khwjf-0-o26mghkf53fr
  overcloud-ComputeDeployedServer-lpbd6odarfmr-0-ravoce76qtyt
  overcloud-NetworkerDeployedServer-en3vik4inlyf

Notably, there was no stack update going on between those two restarts, so in theory we should have reset all of those the first time. Conspicuously missing from either list is the top-level 'overcloud' stack itself.

Taken together, this tends to suggest that the reset code is working but that we're not picking up every in-progress stack in the initial DB query.
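For anyone who wants to double-check the two lists above, here is a small Python sketch that compares them. It only uses the stack names quoted from the log; the helper name `summarize` is made up for illustration, and the idea that a nested stack's name starts with its parent's name plus a hyphen is an assumption drawn from the names themselves, not something confirmed against Heat.

```python
# Stack names reset at each heat-engine restart, copied from the log excerpt.
first_restart = {
    "overcloud-ControllerDeployedServer-zxqkr7yjuspt-2-ok3t3nxxgooh",
    "overcloud-ControllerDeployedServer-zxqkr7yjuspt-0-kpkxpear7v6a",
    "overcloud-NetworkerDeployedServer-en3vik4inlyf-0-vtezrjjmoxgv",
    "overcloud-ControllerDeployedServer-zxqkr7yjuspt-1-sryydr53kbkz",
    "overcloud-NetworkerDeployedServer-en3vik4inlyf-1-bxaj4q44o6wb",
}
second_restart = {
    "overcloud-DatabaseDeployedServer-ko4krg6khwjf-2-aewwpayccrb7",
    "overcloud-ControllerDeployedServer-zxqkr7yjuspt",
    "overcloud-DatabaseDeployedServer-ko4krg6khwjf-0-o26mghkf53fr",
    "overcloud-ComputeDeployedServer-lpbd6odarfmr-0-ravoce76qtyt",
    "overcloud-NetworkerDeployedServer-en3vik4inlyf",
}


def summarize(first, second, top_level="overcloud"):
    """Hypothetical helper: compare the stacks reset by two restarts."""
    # Stacks reset by both restarts (none expected if each reset sticks).
    overlap = first & second
    # Whether the top-level stack was reset by either restart.
    top_level_reset = top_level in (first | second)
    # Assumption: a nested stack's name begins with "<parent name>-".
    # These are second-restart stacks whose children were reset earlier.
    parents = sorted(
        p for p in second if any(c.startswith(p + "-") for c in first)
    )
    return overlap, top_level_reset, parents


overlap, top_level_reset, parents = summarize(first_restart, second_restart)
print("reset twice:", sorted(overlap))
print("top-level stack reset:", top_level_reset)
print("parents reset only after their children:", parents)
```

If the naming assumption holds, the two parent stacks listed by the last line were reset only on the second restart, after their children had been reset on the first one. That fits the idea that each startup pass sees only part of the in-progress stacks.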
I couldn't draw any conclusions from the logs, but I tested the reset and it doesn't work properly. I'm going to fix that.