Description of problem:
Background: This upgrade included adding octavia. Initially this failure was due to a template issue with the public endpoint TLS configuration (missing tls-endpoints-public-dns.yaml environment file). This caused the failure during WorkflowTasks_Step5_Execution because the public endpoints were not correctly defined and the octavia amphora image upload failed. However, after resolving the endpoint issue, subsequent converge deployments fail/hang on this WorkflowTasks_Step5_Execution resource. The mistral workflow associated with this resource never gets re-executed (mistral: tripleo.octavia_post.v1.octavia_post_deploy).
I can reproduce the initial failure in a lab; however, once I resolve the template issue the deployment will complete successfully. The issue seems to be isolated to this environment.
I will provide additional details and logs in a private comment.
My thoughts on how to proceed (looking for feedback here):
- backup the heat database on the undercloud.
- delete the AllNodesDeploySteps nested stack and mark the resource as unhealthy
heat stack-delete <uuid_for_AllNodesDeploySteps_nested_stack>
heat resource-mark-unhealthy overcloud <uuid_for_AllNodesDeploySteps_nested_stack>
- run the upgrade converge step again.
This was resolved by restarting heat-engine on the overcloud. Sorry for the noise.
I mean restarting heat-engine on the undercloud :)