Bug 1414779
Summary: | [UPDATES] ERROR: The "pre-update" hook is not defined on SoftwareDeployment "UpdateDeployment" | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Yurii Prokulevych <yprokule> | ||||
Component: | openstack-heat | Assignee: | Zane Bitter <zbitter> | ||||
Status: | CLOSED ERRATA | QA Contact: | Yurii Prokulevych <yprokule> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 11.0 (Ocata) | CC: | aschultz, cpaquin, cwolfe, ddomingo, jcoufal, jschluet, lbezdick, mburns, mcornea, pneedle, radoslaw.smigielski, ramishra, randy_perryman, rhel-osp-director-maint, sathlang, sbaker, sclewis, shardy, srevivo, therve, yprokule, zbitter | ||||
Target Milestone: | rc | Keywords: | Triaged | ||||
Target Release: | 11.0 (Ocata) | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | openstack-heat-8.0.0-6.el7ost | Doc Type: | Bug Fix | ||||
Doc Text: |
Previously, when a pre-update hook was set on a resource that was in a FAILED state, the Orchestration service recorded an event indicating the hook was active. The service would then immediately create a replacement resource without waiting for the hook to be cleared by the user. As a result, the tripleoclient service believed the hook to be pending (based on the event), but failed when trying to clear it, because the replacement resource did not have a hook set. This, in turn, prevented the director from completing an overcloud update, with the following message:
ERROR: The "pre-update" hook is not defined on SoftwareDeployment "UpdateDeployment"
This also affected other client-side applications that used hooks. In the director, this could have also resulted in UpdateDeployment executing on two Controller nodes simultaneously, instead of serialized so that only one Controller is updated at a time.
With this release, the Orchestration service now pauses until the hook is cleared by the user, regardless of the state of the resource. This allows director overcloud updates to complete even when there is an UpdateDeployment resource in a FAILED state.
|
Story Points: | --- | ||||
Clone Of: | |||||||
: | 1428845 1428877 1428879 (view as bug list) | Environment: | |||||
Last Closed: | 2017-05-17 19:40:49 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1394025, 1428845, 1428877, 1428879 | ||||||
Attachments: |
|
Description
Yurii Prokulevych
2017-01-19 12:27:35 UTC
Also seeing this same error on RHEL 7.3, OSP 8 when attempting to perform a minor update.

'heat hook-poll -n3 overcloud' shows:

+------------------+--------------------------------------+------------------------------------------------+-----------------+---------------------+--------------------------------------------------+
| resource_name    | id                                   | resource_status_reason                         | resource_status | event_time          | stack_name                                       |
+------------------+--------------------------------------+------------------------------------------------+-----------------+---------------------+--------------------------------------------------+
| UpdateDeployment | 0b664e55-909f-45aa-b71a-b5a76f59699e | UPDATE paused until Hook pre-update is cleared | CREATE_FAILED   | 2017-01-31T22:08:51 | overcloud-Controller-rdppxbh4bla3-1-qjgvcvnvfbyb |
+------------------+--------------------------------------+------------------------------------------------+-----------------+---------------------+--------------------------------------------------+

but when clearing the hook we see:

ERROR: The "pre-update" hook is not defined on SoftwareDeployment "UpdateDeployment" [98363e1a-78e6-44f5-887e-4df05d07eb0b] Stack "overcloud-Controller-rdppxbh4b1a3-1-qjgvcvnvfbyb" [b34c7474-0247-418d-b591-2c241144749b]

Trivia: when the resource state is FAILED, the hook gets set in the database but Heat does *not* wait. Oddly, we're seeing the opposite here: the state is indeed FAILED, but Heat is waiting, even though when we try to clear the hook we're told it does not exist in the database.

So a possible scenario is:
- Heat sets the hook in the DB and creates an event, but does not wait for it to be cleared due to being in the FAILED state.
- Heat starts creating a replacement resource (also due to being in a FAILED state).
- The replacement resource has no hooks set, so clearing the hook fails even though it appears to exist based on the event list (which hook-poll uses).
- The replacement deployment never succeeds, due to some problem on the server that caused it to be in the FAILED state in the first place (in this instance we were seeing strange messages from os-collect-config).
- Eventually the whole stack times out, but in the meantime everything looks frozen.

Looking at the event list should help to confirm this.

Created attachment 1246503 [details]
heat-event-list-output

See attachment.

Thanks. Looking at only the last few events for the UpdateDeployment in overcloud-Controller-rdppxbh4b1a3-1-qjgvcvnvfbyb:

Engine went down during resource CREATE        | CREATE_FAILED      | 22:07:05
UPDATE paused until Hook pre-update is cleared | CREATE_FAILED      | 22:08:51
state changed                                  | CREATE_IN_PROGRESS | 22:08:53
Signal: deployment succeeded                   | SIGNAL_IN_PROGRESS | 23:54:33
state changed                                  | CREATE_COMPLETE    | 23:54:34
Unknown                                        | SIGNAL_COMPLETE    | 00:01:01

So it was in CREATE_FAILED due to the engine restart during the previous update. The hook was recorded in the database, but Heat immediately proceeded to creating a replacement because it was in the FAILED state (I'd consider this a bug, because the point of using breakpoints here is that we don't want the UpdateDeployment happening on two controllers simultaneously).

Eventually the create actually succeeded(!), but it took 1 3/4 hours. Looking at the rest of the log, it appears that everything proceeded normally after that until heat-engine was restarted shortly afterwards (at 00:06:04).

Zane - does this mean that we should be able to kick off the update and expect not to run into the hook issue?

(In reply to Chris Paquin from comment #8)
> Zane - does this mean that we should be able to kick off the update and
> expect not to run into the hook issue?

Yes.

Hi Yurii, upstream has merged in stable/ocata, moving to POST.
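
The failure sequence discussed in the comments above (a hook recorded in an event, the FAILED resource replaced without waiting, and the later clear failing) can be illustrated with a small toy model. This is only a sketch: `Resource`, `Stack`, `HookNotDefined` and the `wait_when_failed` flag are hypothetical names invented for illustration and are not Heat's actual classes or code paths; the "is cleared" event text is likewise made up.

```python
# Toy model only: every class and method here is hypothetical,
# written to illustrate the scenario, not taken from Heat.
from dataclasses import dataclass, field


class HookNotDefined(Exception):
    """Raised when clearing a hook that the current resource does not have."""


@dataclass
class Resource:
    name: str
    status: str = "COMPLETE"
    hooks: set = field(default_factory=set)


class Stack:
    def __init__(self, resource, wait_when_failed):
        self.resource = resource
        # False models the behaviour reported in this bug,
        # True models the behaviour described for the fix.
        self.wait_when_failed = wait_when_failed
        self.events = []

    def update(self, hook="pre-update"):
        res = self.resource
        res.hooks.add(hook)
        self.events.append(f"UPDATE paused until Hook {hook} is cleared")
        if res.status == "FAILED" and not self.wait_when_failed:
            # Reported behaviour: no waiting, the failed resource is replaced
            # at once, and the replacement starts without any hooks.
            self.resource = Resource(res.name, status="IN_PROGRESS")

    def hook_pending(self):
        # Clients decide from the event list, as hook-poll is said to do.
        pending = False
        for event in self.events:
            if event.startswith("UPDATE paused until Hook"):
                pending = True
            elif event.startswith("Hook") and event.endswith("is cleared"):
                pending = False
        return pending

    def clear_hook(self, hook="pre-update"):
        if hook not in self.resource.hooks:
            raise HookNotDefined(
                f'The "{hook}" hook is not defined on '
                f'SoftwareDeployment "{self.resource.name}"'
            )
        self.resource.hooks.discard(hook)
        self.events.append(f"Hook {hook} is cleared")


if __name__ == "__main__":
    for wait in (False, True):
        stack = Stack(Resource("UpdateDeployment", status="FAILED"), wait)
        stack.update()
        try:
            stack.clear_hook()
            print(f"wait_when_failed={wait}: hook cleared")
        except HookNotDefined as exc:
            print(f"wait_when_failed={wait}: ERROR: {exc}")
```

In the model, the event list and the resource's hook data only disagree when the failed resource is replaced without waiting, which is the mismatch the client trips over.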
Verified with:
openstack-heat-engine-8.0.0-7.el7ost.noarch
openstack-heat-api-cfn-8.0.0-7.el7ost.noarch
openstack-heat-common-8.0.0-7.el7ost.noarch
openstack-heat-api-8.0.0-7.el7ost.noarch

openstack stack list
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| ID                                   | Stack Name | Stack Status    | Creation Time        | Updated Time         |
+--------------------------------------+------------+-----------------+----------------------+----------------------+
| a906c7a0-3e20-49f4-acfc-585dfefe9452 | overcloud  | UPDATE_COMPLETE | 2017-04-07T08:08:02Z | 2017-04-07T10:58:01Z |
+--------------------------------------+------------+-----------------+----------------------+----------------------+

nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| 469c7b43-b4e8-44fb-8c2c-3e41da55aa3f | ceph-0       | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 4f397ab2-8e9d-429c-b839-930bfe506b22 | ceph-1       | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| e66945fa-9441-4815-a00d-259bedb9f34e | ceph-2       | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 3e7ce3c9-1b1b-4869-adf7-11968cfcf549 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| 038bb09b-4a0a-4d84-adf8-2f465a4d16a9 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
| 409fd440-527c-46f0-9295-71d0e0810493 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| 3635c8e6-ecbb-4d23-9f23-cd1dc52d868b | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.15 |
| 5f35c248-a1b4-473b-be2a-e991f1a44c70 | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.16 |
| 6df65fb3-fcaa-4c59-b8b2-68e3af0aca1c | galera-0     | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
| 9d75252c-c6e9-4a66-9bbd-6c22db341580 | galera-1     | ACTIVE | -          | Running     | ctlplane=192.168.24.17 |
| b7dc4b7c-0c9f-4e94-b9b4-d403b47b3063 | galera-2     | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| 4626ab11-0983-4991-bd1d-c6691908a3fe | messaging-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| f931298e-4927-4bee-bcad-296d0da5bb92 | messaging-1  | ACTIVE | -          | Running     | ctlplane=192.168.24.23 |
| c0f4407c-89f5-4e25-8c0c-129f6698cc9c | messaging-2  | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| d8587085-ddf8-40d7-8236-b8923f7ef0ff | networker-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| a39935f4-76fe-453d-9a33-de5657582473 | networker-1  | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

Was this backported to OSP 8, 9 and 10? If so, what are the bug numbers?

This was backported, and you can find the bug numbers in the clones list.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2017:1245

I am still on OSP 8 and doing a minor upgrade from 8.0 to the latest available 8.0. I am also subscribed to the rhel-7-server-openstack-8-director-rpms and rhel-7-server-openstack-8-rpms repos. So the fix is supposed to be in openstack-heat-5.0.3-2.el7ost, but the latest packages I see in the official repository are:

[stack@undercloud ~]$ rpm -qa | grep openstack-heat
openstack-heat-api-5.0.1-9.el7ost.noarch
openstack-heat-engine-5.0.1-9.el7ost.noarch
openstack-heat-api-cloudwatch-5.0.1-9.el7ost.noarch
openstack-heat-api-cfn-5.0.1-9.el7ost.noarch
openstack-heat-templates-0-0.1.20151019.el7ost.noarch
openstack-heat-common-5.0.1-9.el7ost.noarch

So was this fix released for OSP 8.0 too? Or maybe it's been forgotten and hasn't been added to the official OSP 8.0 repos rhel-7-server-openstack-8-director-rpms, rhel-7-server-openstack-8-rpms?

(In reply to Radosław Śmigielski from comment #25)
> So was this fix released for OSP 8.0 too? or maybe it's been forgotten and
> hasn't been added to official OSP 8.0 repos
> rhel-7-server-openstack-8-director-rpms, rhel-7-server-openstack-8-rpms?

The bugzilla for this issue in OSP 8 is bug 1428845; it hasn't been released yet.
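
The check made in the last comments, comparing the installed openstack-heat packages against the build expected to carry the fix, can be sketched as below. This is a deliberately naive comparison over the numeric parts of version and release, not the full RPM version-comparison algorithm, and the helper names (`parse_nvr`, `sort_key`, `has_fix`) are hypothetical. The build string 5.0.3-2.el7ost is the one quoted by the commenter, not a confirmed released version.

```python
import re


def parse_nvr(package):
    """Split 'name-version-release[.arch]' into (name, version, release)."""
    for arch in (".noarch", ".x86_64"):
        if package.endswith(arch):
            package = package[: -len(arch)]
    name, version, release = package.rsplit("-", 2)
    return name, version, release


def sort_key(version, release):
    # Simplification: dotted numeric version, then the leading integer of the
    # release (for example '9' from '9.el7ost'). Real RPM ordering is richer.
    version_parts = tuple(int(part) for part in version.split("."))
    release_number = int(re.match(r"\d+", release).group())
    return version_parts, release_number


def has_fix(installed, fixed):
    """Return True if the installed build is at least the fixed build."""
    _, installed_version, installed_release = parse_nvr(installed)
    _, fixed_version, fixed_release = parse_nvr(fixed)
    return sort_key(installed_version, installed_release) >= sort_key(
        fixed_version, fixed_release
    )


if __name__ == "__main__":
    installed = "openstack-heat-engine-5.0.1-9.el7ost.noarch"
    fixed = "openstack-heat-5.0.3-2.el7ost"  # as quoted in the comment above
    print(has_fix(installed, fixed))
```

Package names are ignored in the comparison on purpose, since the sub-packages listed by rpm share the version and release of the base openstack-heat build.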