1732900 – OSP 10->13 FFU, during ffwd-upgrade converge step, WorkflowTasks_Step5_Execution is stuck in CREATE_IN_PROGRESS

Bug 1732900 - OSP 10->13 FFU, during ffwd-upgrade converge step, WorkflowTasks_Step5_Execution is stuck in CREATE_IN_PROGRESS

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-heat-templates
Sub Component:
Version:	13.0 (Queens)
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	RHOS Maint
QA Contact:	Sasha Smolyak
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-24 16:29 UTC by Matt Flusche
Modified:	2023-09-07 20:21 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-24 18:47:58 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Issue Tracker	OSP-28248	0	None	None	None	2023-09-07 20:21:41 UTC

Description Matt Flusche 2019-07-24 16:29:43 UTC

Description of problem:


Background:  This upgrade included adding Octavia.  The initial failure was caused by a template issue in the public endpoint TLS configuration: the tls-endpoints-public-dns.yaml environment file was missing.  Because the public endpoints were not correctly defined, the Octavia amphora image upload failed, which in turn caused WorkflowTasks_Step5_Execution to fail.  However, after the endpoint issue was resolved, subsequent converge deployments still fail or hang on this WorkflowTasks_Step5_Execution resource.  The Mistral workflow associated with this resource (tripleo.octavia_post.v1.octavia_post_deploy) never gets re-executed.

I can reproduce the initial failure in a lab; however, once I resolve the template issue there, the deployment completes successfully. The hang therefore seems to be isolated to this environment.
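
To see the state described above, something along these lines could be run on the undercloud. This is a minimal sketch, not the exact commands used in this case: the `run` wrapper and the `show_stuck_state` function are hypothetical helpers, the nesting depth of 5 is an assumption, and the stack name defaults to `overcloud` as used elsewhere in this report. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print commands by default, execute only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: show where the converge step is stuck.
show_stuck_state() {
  stack="${1:-overcloud}"
  # List resources including nested stacks (depth 5 is an assumption);
  # look for WorkflowTasks_Step5_Execution in CREATE_IN_PROGRESS.
  run openstack stack resource list --nested-depth 5 "$stack"
  # List Mistral executions; look for a recent run of
  # tripleo.octavia_post.v1.octavia_post_deploy.
  run openstack workflow execution list
}
```

Calling `DRY_RUN=0 show_stuck_state` with the undercloud credentials sourced would run both listings; if no new execution of the octavia_post_deploy workflow appears after a converge attempt, that matches the behaviour reported here.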

I will provide additional details and logs in a private comment.

My thoughts on how to proceed (looking for feedback here):

- Back up the heat database on the undercloud.
- Delete the AllNodesDeploySteps nested stack and mark the resource as unhealthy:

  heat stack-delete <uuid_for_AllNodesDeploySteps_nested_stack>
  heat resource-mark-unhealthy overcloud <uuid_for_AllNodesDeploySteps_nested_stack>

- Run the upgrade converge step again.
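
The proposed steps could be strung together roughly as follows. This is an untested sketch of the proposal only, under several assumptions: `run` and `recover_converge` are hypothetical helpers, the backup file name is illustrative, and the heat database is assumed to be reachable with `sudo mysqldump` on the undercloud. Because the steps are destructive, commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print commands by default, execute only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper wrapping the proposed steps.
#   $1: UUID of the AllNodesDeploySteps nested stack
#   $2: backup file (illustrative default)
recover_converge() {
  nested_uuid="$1"
  backup="${2:-heat-backup.sql}"
  if [ -z "$nested_uuid" ]; then
    echo "usage: recover_converge <nested_stack_uuid> [backup_file]" >&2
    return 1
  fi
  # 1. Back up the heat database (assumes root DB access via sudo).
  run sudo mysqldump --result-file="$backup" heat
  # 2. Delete the nested stack; the client may ask for confirmation.
  run heat stack-delete "$nested_uuid"
  # 3. Mark the resource unhealthy in the parent stack, as proposed above.
  run heat resource-mark-unhealthy overcloud "$nested_uuid"
  # 4. The converge step is re-run separately with its original arguments.
  echo "Now re-run the ffwd-upgrade converge step."
}
```

Running `recover_converge <uuid>` first in the default dry-run mode prints the three commands for review before anything is changed.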

Comment 4 Matt Flusche 2019-07-24 18:47:58 UTC

This was resolved by restarting heat-engine on the overcloud.  Sorry for the noise.

Comment 5 Matt Flusche 2019-07-24 18:49:50 UTC

I mean restarting heat-engine on the undercloud :)
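
For reference, the restart might look like the sketch below. The comments above do not give the exact command, so the systemd unit name `openstack-heat-engine` is an assumption (it is the usual unit on a non-containerized undercloud), and `run` and `restart_heat_engine` are hypothetical helpers. Commands are only printed unless DRY_RUN=0 is set.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print commands by default, execute only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper: restart heat-engine on the undercloud and check it.
restart_heat_engine() {
  # Unit name is an assumption; adjust if the service is named differently.
  run sudo systemctl restart openstack-heat-engine
  # Confirm the service came back before re-running the converge step.
  run sudo systemctl is-active openstack-heat-engine
}
```

After `DRY_RUN=0 restart_heat_engine` reports the service as active, the converge step can be run again.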
