Bug 1508521 - restarting openstack-heat-engine didn't put stack in failed state
Summary: restarting openstack-heat-engine didn't put stack in failed state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-heat
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 12.0 (Pike)
Assignee: Thomas Hervé
QA Contact: Ronnie Rasouli
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-11-01 15:33 UTC by Gurenko Alex
Modified: 2018-10-23 13:05 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-23 13:05:54 UTC
Target Upstream Version:
Embargoed:


Attachments
Heat logs from the undercloud-0 (4.35 MB, application/x-xz)
2017-11-05 08:44 UTC, Gurenko Alex


Links
Launchpad 1735755: status None, last updated 2017-12-01 14:23:21 UTC
OpenStack gerrit 525933: Fix reset_stack_status (MERGED), last updated 2020-11-04 04:49:25 UTC

Description Gurenko Alex 2017-11-01 15:33:49 UTC
Description of problem:
I made a mistake in the configuration by adding a non-existent node to a deployment while deploying split stack, and I tried to recover by putting the stack into a failed state and starting over. To fail the stack, I executed sudo systemctl restart openstack-heat-engine.


Version-Release number of selected component (if applicable):


How reproducible: unknown


Steps to Reproduce:
1. Start the overcloud_deploy.sh script
2. Soon after that, execute sudo systemctl restart openstack-heat-engine on a controller

Actual results:

mistral-server was waiting for a timeout from the non-existent node, so the stack was still in the CREATE_IN_PROGRESS state after several restart attempts, and failing it took more than 20 minutes.

Expected results:

I expect the stack to fail right away when openstack-heat-engine is restarted.


Additional info:
The environment is gone now, but I'm happy to try to reproduce it again and grab whatever logs can help with that. In the end I killed mistral-server and restarted openstack-heat-engine again, and then it moved on and failed the stack.
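
For reference, a minimal way to check whether a restart actually failed the stacks, using standard systemctl and python-openstackclient commands (the stack name 'overcloud' is the one from this deployment):

  sudo systemctl restart openstack-heat-engine
  # After the restart, no stack should remain in a *_IN_PROGRESS state.
  openstack stack list --nested
  openstack stack show overcloud -c stack_status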

Comment 1 Gurenko Alex 2017-11-02 16:15:37 UTC
I actually have an environment right now, with all nodes accessible, where the stack is still not failing after multiple restarts of openstack-heat-engine. Is there anything in particular I can get from any of the nodes?

Comment 2 Zane Bitter 2017-11-02 17:52:12 UTC
heat-engine log from the undercloud would always be the first step.

Comment 3 Gurenko Alex 2017-11-05 08:44:39 UTC
Created attachment 1348048 [details]
Heat logs from the undercloud-0

(In reply to Zane Bitter from comment #2)
> heat-engine log from the undercloud would always be the first step.

Please find the logs attached.

Comment 4 Zane Bitter 2017-11-13 20:01:00 UTC
Logs show that several stacks were reset at startup. At 11:27:

overcloud-ControllerDeployedServer-zxqkr7yjuspt-2-ok3t3nxxgooh
overcloud-ControllerDeployedServer-zxqkr7yjuspt-0-kpkxpear7v6a
overcloud-NetworkerDeployedServer-en3vik4inlyf-0-vtezrjjmoxgv
overcloud-ControllerDeployedServer-zxqkr7yjuspt-1-sryydr53kbkz
overcloud-NetworkerDeployedServer-en3vik4inlyf-1-bxaj4q44o6wb

At 11:39:
overcloud-DatabaseDeployedServer-ko4krg6khwjf-2-aewwpayccrb7
overcloud-ControllerDeployedServer-zxqkr7yjuspt
overcloud-DatabaseDeployedServer-ko4krg6khwjf-0-o26mghkf53fr
overcloud-ComputeDeployedServer-lpbd6odarfmr-0-ravoce76qtyt
overcloud-NetworkerDeployedServer-en3vik4inlyf

Notably, there was no stack update going on between those two restarts - so in theory we should have reset all of those the first time. Conspicuously missing from either list is the top-level 'overcloud' stack itself. Taken together, this tends to suggest that the reset code is working but that we're not picking up every in-progress stack in the initial DB query.
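
To make that concrete, here is a minimal illustrative sketch of a startup reset of the kind described. This is not Heat's actual code; the db helper names are hypothetical:

  IN_PROGRESS = 'IN_PROGRESS'
  FAILED = 'FAILED'

  def reset_stack_status(db):
      # The reset itself appears to work; the suspicion is that this
      # initial query does not return every IN_PROGRESS stack (in the
      # logs above, the top-level 'overcloud' stack never shows up).
      stacks = db.get_stacks(status=IN_PROGRESS)
      for stack in stacks:
          # Skip stacks whose engine lock is held by a live engine;
          # those may legitimately still be in progress.
          holder = db.get_lock_holder(stack.id)
          if holder is not None and db.engine_alive(holder):
              continue
          db.set_stack_status(stack.id, FAILED,
                              'Engine went down during stack %s' % stack.action)

If the initial query behaves like this but misses some stacks, that would explain why the child stacks were only reset piecemeal across restarts while the top-level 'overcloud' stack was never reset. The merged change linked above ("Fix reset_stack_status", gerrit 525933) is the eventual fix for this path.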

Comment 5 Thomas Hervé 2017-12-01 14:20:52 UTC
I didn't reach any conclusion from the logs, but I tested the reset and it doesn't work properly. Going to fix that.

