1508521 – restarting openstack-heat-engine didn't put stack in failed state

Summary: restarting openstack-heat-engine didn't put stack in failed state

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-heat
Sub Component:
Version:	12.0 (Pike)
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	12.0 (Pike)
Assignee:	Thomas Hervé
QA Contact:	Ronnie Rasouli
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-01 15:33 UTC by Gurenko Alex
Modified:	2018-10-23 13:05 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-10-23 13:05:54 UTC
Target Upstream Version:
Embargoed:

Attachments
Heat logs from the undercloud-0 (4.35 MB, application/x-xz) 2017-11-05 08:44 UTC, Gurenko Alex

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Launchpad	1735755	0	None	None	None	2017-12-01 14:23:21 UTC
OpenStack gerrit	525933	0	None	MERGED	Fix reset_stack_status	2020-11-04 04:49:25 UTC

Description Gurenko Alex 2017-11-01 15:33:49 UTC

Description of problem:
I made a configuration mistake by adding a non-existent node to a deployment while deploying a split stack, and I tried to recover by putting the stack into a failed state and starting over. To fail the stack, I executed sudo systemctl restart openstack-heat-engine


Version-Release number of selected component (if applicable):


How reproducible: unknown


Steps to Reproduce:
1. Start the overcloud_deploy.sh script.
2. Soon afterwards, run sudo systemctl restart openstack-heat-engine on a controller.

Actual results:

mistral-server was waiting for a timeout from the non-existent node, so the stack remained in the create_in_progress state for more than 20 minutes, even after several attempts.

Expected results:

I expect the stack to fail right away when openstack-heat-engine is restarted.


Additional info:
The environment is gone now, but I'm happy to try to reproduce the problem and collect whatever logs would help. In the end I killed mistral-server and restarted openstack-heat-engine again, after which the deployment moved on and the stack failed.

Comment 1 Gurenko Alex 2017-11-02 16:15:37 UTC

I actually have an environment right now, with all nodes accessible, where the stack is still not failing after multiple restarts of openstack-heat-engine. Is there anything in particular I can collect from any of the nodes?

Comment 2 Zane Bitter 2017-11-02 17:52:12 UTC

heat-engine log from the undercloud would always be the first step.

Comment 3 Gurenko Alex 2017-11-05 08:44:39 UTC

Created attachment 1348048
Heat logs from the undercloud-0

(In reply to Zane Bitter from comment #2)
> heat-engine log from the undercloud would always be the first step.

Please find the logs attached.

Comment 4 Zane Bitter 2017-11-13 20:01:00 UTC

Logs show that several stacks were reset at startup. At 11:27:

overcloud-ControllerDeployedServer-zxqkr7yjuspt-2-ok3t3nxxgooh
overcloud-ControllerDeployedServer-zxqkr7yjuspt-0-kpkxpear7v6a
overcloud-NetworkerDeployedServer-en3vik4inlyf-0-vtezrjjmoxgv
overcloud-ControllerDeployedServer-zxqkr7yjuspt-1-sryydr53kbkz
overcloud-NetworkerDeployedServer-en3vik4inlyf-1-bxaj4q44o6wb

At 11:39:
overcloud-DatabaseDeployedServer-ko4krg6khwjf-2-aewwpayccrb7
overcloud-ControllerDeployedServer-zxqkr7yjuspt
overcloud-DatabaseDeployedServer-ko4krg6khwjf-0-o26mghkf53fr
overcloud-ComputeDeployedServer-lpbd6odarfmr-0-ravoce76qtyt
overcloud-NetworkerDeployedServer-en3vik4inlyf

Notably, there was no stack update going on between those two restarts - so in theory we should have reset all of those the first time. Conspicuously missing from either list is the top-level 'overcloud' stack itself. Taken together, this tends to suggest that the reset code is working but that we're not picking up every in-progress stack in the initial DB query.
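
As a rough illustration of that hypothesis (not Heat's actual code), the sketch below models a startup reset pass over an in-memory list of stacks. The Stack class, the reset_in_progress function, the status_reason text and the "incomplete" query are all hypothetical names invented for this example. It only shows that a correct reset loop fed by an incomplete query leaves some stacks IN_PROGRESS until a later pass, which matches the symptom in the logs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

IN_PROGRESS = "IN_PROGRESS"
FAILED = "FAILED"


@dataclass
class Stack:
    """Hypothetical in-memory stand-in for a stack record; not Heat's model."""
    name: str
    action: str              # e.g. "CREATE"
    status: str              # e.g. "IN_PROGRESS"
    status_reason: str = ""


def query_all_in_progress(stacks: Iterable[Stack]) -> List[Stack]:
    """A complete query: every stack whose status is IN_PROGRESS."""
    return [s for s in stacks if s.status == IN_PROGRESS]


def reset_in_progress(
    stacks: List[Stack],
    query: Callable[[Iterable[Stack]], List[Stack]],
) -> List[str]:
    """Mark every stack returned by `query` as FAILED and return their names."""
    reset = []
    for stack in query(stacks):
        stack.status = FAILED
        stack.status_reason = f"Engine restarted during stack {stack.action}"
        reset.append(stack.name)
    return reset


def demo() -> dict:
    names = [
        "overcloud",
        "overcloud-ControllerDeployedServer-zxqkr7yjuspt",
        "overcloud-NetworkerDeployedServer-en3vik4inlyf",
    ]
    stacks = [Stack(n, "CREATE", IN_PROGRESS) for n in names]

    # Hypothetical faulty query that silently drops the first matching row.
    def incomplete_query(rows: Iterable[Stack]) -> List[Stack]:
        return query_all_in_progress(rows)[1:]

    first_pass = reset_in_progress(stacks, incomplete_query)
    missed = [s.name for s in stacks if s.status == IN_PROGRESS]
    second_pass = reset_in_progress(stacks, query_all_in_progress)
    return {
        "first_pass": first_pass,
        "missed": missed,
        "second_pass": second_pass,
        "final": {s.name: s.status for s in stacks},
    }


if __name__ == "__main__":
    for key, value in demo().items():
        print(key, value)
```

In this toy model the reset loop itself is correct; only the query decides which stacks are failed. Comparing the list of stacks that are still IN_PROGRESS after a restart against the list that was reset is one way to tell the two cases apart.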

Comment 5 Thomas Hervé 2017-12-01 14:20:52 UTC

I couldn't draw any conclusion from the logs, but I tested the reset and it doesn't work properly. I'm going to fix that.
