Description of problem: Background: This upgrade included adding octavia. Initially this failure was due to a template issue with the public endpoint TLS configuration (missing tls-endpoints-public-dns.yaml environment file). This caused the failure during WorkflowTasks_Step5_Execution because the public endpoints were not correctly defined and the octavia amphora image upload failed. However, after resolving the endpoint issue, subsequent converge deployments fail/hang on this WorkflowTasks_Step5_Execution resource. The mistral workflow associated with this resource never gets re-executed (mistral: tripleo.octavia_post.v1.octavia_post_deploy). I can reproduce the initial failure in a lab; however, once I resolve the template issue the deployment will complete successfully. The issue seems to be isolated to this environment. I will provide additional details and logs in a private comment. My thoughts on how to proceed (looking for feedback here): - backup the heat database on the undercloud. - delete the AllNodesDeploySteps nested stack and mark the resource as unhealthy heat stack-delete <uuid_for_AllNodesDeploySteps_nested_stack> heat resource-mark-unhealthy overcloud <uuid_for_AllNodesDeploySteps_nested_stack> - run the upgrade converge step again.
This was resolved by restarting heat-engine on the overcloud. Sorry for the noise.
I mean restarting heat-engine on the undercloud :)