Bug 1382146 - overcloud update from osp8 to osp9 fails when ceilometer is in a weird state [NEEDINFO]
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rhosp-director
Version: 9.0 (Mitaka)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Angus Thomas
QA Contact: Omri Hochman
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-10-05 21:27 UTC by Jeremy
Modified: 2019-12-16 07:03 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-02-09 15:13:31 UTC
Target Upstream Version:
jmelvin: needinfo? (athomas)



Description Jeremy 2016-10-05 21:27:26 UTC
Description of problem:
Following https://access.redhat.com/documentation/en/red-hat-openstack-platform/9/single/upgrading-red-hat-openstack-platform/#sect-Major-Upgrading_the_Overcloud-Aodh, I ran the first portion of the deploy with the /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml environment file. It failed the first time because ceilometer-alarm-evaluator was not stopped. I stopped ceilometer-alarm-evaluator in pcs and re-ran the deploy. This time it errored out, saying ceilometer-alarm-evaluator could not be found. The update script had deleted the resource, but it does not seem resilient enough to cope with the resource already being gone when the deploy is run again. All the logs and commands are posted below.


Version-Release number of selected component (if applicable):
Director OSP 8, updated to OSP 9 before the overcloud deploy

How reproducible:
unknown

Steps to Reproduce:
1. Follow the steps in the description above.



Additional info:

[root@overcloud-controller-0 heat-admin]# pcs status |grep -A3 ceilometer
 Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier] (unmanaged)
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine] (unmanaged)
     openstack-heat-engine	(systemd:openstack-heat-engine):	Started overcloud-controller-1 (unmanaged)
--
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api] (unmanaged)
     openstack-ceilometer-api	(systemd:openstack-ceilometer-api):	Started overcloud-controller-1 (unmanaged)
     openstack-ceilometer-api	(systemd:openstack-ceilometer-api):	Started overcloud-controller-0 (unmanaged)
     openstack-ceilometer-api	(systemd:openstack-ceilometer-api):	Started overcloud-controller-2 (unmanaged)
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent] (unmanaged)
     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup] (unmanaged)
--
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector] (unmanaged)
     openstack-ceilometer-collector	(systemd:openstack-ceilometer-collector):	Started overcloud-controller-1 (unmanaged)
     openstack-ceilometer-collector	(systemd:openstack-ceilometer-collector):	Started overcloud-controller-0 (unmanaged)
     openstack-ceilometer-collector	(systemd:openstack-ceilometer-collector):	Started overcloud-controller-2 (unmanaged)
 Clone Set: openstack-keystone-clone [openstack-keystone] (unmanaged)
     openstack-keystone	(systemd:openstack-keystone):	Started overcloud-controller-1 (unmanaged)
     openstack-keystone	(systemd:openstack-keystone):	Started overcloud-controller-0 (unmanaged)
--
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification] (unmanaged)
     openstack-ceilometer-notification	(systemd:openstack-ceilometer-notification):	Started overcloud-controller-1 (unmanaged)
     openstack-ceilometer-notification	(systemd:openstack-ceilometer-notification):	Started overcloud-controller-0 (unmanaged)
     openstack-ceilometer-notification	(systemd:openstack-ceilometer-notification):	Started overcloud-controller-2 (unmanaged)
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api] (unmanaged)
     openstack-cinder-api	(systemd:openstack-cinder-api):	Started overcloud-controller-1 (unmanaged)
     openstack-cinder-api	(systemd:openstack-cinder-api):	Started overcloud-controller-0 (unmanaged)
--
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central] (unmanaged)
     openstack-ceilometer-central	(systemd:openstack-ceilometer-central):	Started overcloud-controller-1 (unmanaged)
     openstack-ceilometer-central	(systemd:openstack-ceilometer-central):	Started overcloud-controller-0 (unmanaged)
     openstack-ceilometer-central	(systemd:openstack-ceilometer-central):	Started overcloud-controller-2 (unmanaged)
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn] (unmanaged)
     openstack-heat-api-cfn	(systemd:openstack-heat-api-cfn):	Started overcloud-controller-1 (unmanaged)
     openstack-heat-api-cfn	(systemd:openstack-heat-api-cfn):	Started overcloud-controller-0 (unmanaged)
[root@overcloud-controller-0 heat-admin]# 

### controller 1 os-collect-config
(heat-config) [WARNING] Skipping config d98d6438-6b4d-46fd-8a51-7bb4aa7e3bf9, already deployed
45:13,731] (heat-config) [WARNING] To force-deploy, rm /var/run/heat-config/deployed/d98d6438-6b4d-46fd-8a51-7bb4aa7e3bf9.json
45:13,732] (heat-config) [DEBUG] Running /var/lib/heat-config/hooks/script < /var/run/heat-config/deployed/e55975c3-8ad7-430e-bf52-34a46b23474a.json
45:15,044] (heat-config) [INFO] {"deploy_stdout": " Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier] (unmanaged)\n", "deploy_stderr": "Error
45:15,045] (heat-config) [DEBUG] [2016-10-05 19:45:13,767] (heat-config) [INFO] deploy_server_id=e393527a-042b-4cac-ae3f-660595551211
45:13,767] (heat-config) [INFO] deploy_action=CREATE
45:13,767] (heat-config) [INFO] deploy_stack_id=overcloud-UpdateWorkflow-scacp7uesqfa-AodhPreUpgradeDeployment-se43m2criqnr/a5a8f420-86e9-4e81-8c11-4226917d5c77
45:13,767] (heat-config) [INFO] deploy_resource_name=0
45:13,767] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL
45:13,768] (heat-config) [INFO] deploy_signal_id=http://192.0.3.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3Ae67e4f8a71b7460e8c83d4bce82571e7%3Astacks%2Fovercloud-UpdateWorkflow-scac
45:13,768] (heat-config) [INFO] deploy_signal_verb=POST
45:13,768] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/e55975c3-8ad7-430e-bf52-34a46b23474a
45:15,040] (heat-config) [INFO]  Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier] (unmanaged)
45:15,041] (heat-config) [DEBUG] Error: unable to find a resource/clone/master/group: openstack-ceilometer-alarm-evaluator
45:15,041] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/e55975c3-8ad7-430e-bf52-34a46b23474a. [1]
45:15,045] (heat-config) [INFO] Completed /var/lib/heat-config/hooks/script

[stack@cctg-aci-cloud-host1 ~]$ heat resource-list a5a8f420-86e9-4e81-8c11-4226917d5c77
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
+---------------+--------------------------------------+------------------------------+-----------------+---------------------+
| resource_name | physical_resource_id                 | resource_type                | resource_status | updated_time        |
+---------------+--------------------------------------+------------------------------+-----------------+---------------------+
| 1             | e2b66dc0-ddd4-4401-96d1-e6fe22a563f9 | OS::Heat::SoftwareDeployment | CREATE_COMPLETE | 2016-10-05T19:14:59 |
| 2             | dfc72353-aa10-4f13-b7a0-e9531d6fae59 | OS::Heat::SoftwareDeployment | CREATE_COMPLETE | 2016-10-05T19:14:59 |
| 0             | fa29ee25-457c-44ff-89ba-29c9b82c17fa | OS::Heat::SoftwareDeployment | CREATE_FAILED   | 2016-10-05T20:09:48 |
+---------------+--------------------------------------+------------------------------+-----------------+---------------------+
[stack@cctg-aci-cloud-host1 ~]$ heat deployment-show  fa29ee25-457c-44ff-89ba-29c9b82c17fa
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED",
  "server_id": "e393527a-042b-4cac-ae3f-660595551211",
  "config_id": "9d949012-129a-4372-964c-3a54e18fd3ed",
  "output_values": {
    "deploy_stdout": " Clone Set: openstack-ceilometer-alarm-notifier-clone [openstack-ceilometer-alarm-notifier] (unmanaged)\n",
    "deploy_stderr": "Error: unable to find a resource/clone/master/group: openstack-ceilometer-alarm-evaluator\n",
    "deploy_status_code": 1
  },
  "creation_time": "2016-10-05T20:09:48",
  "updated_time": "2016-10-05T20:10:41",
  "input_values": {},
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1",
  "id": "fa29ee25-457c-44ff-89ba-29c9b82c17fa"
}

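The ID of the failed deployment can be pulled out of the resource table rather than copied by hand. The following is a minimal sketch that only filters the table layout shown in the resource-list output above; `failed_deployment_ids` is a hypothetical helper name, not part of any client.

```shell
#!/bin/bash
# Sketch only: read a "heat resource-list" style table on stdin and print the
# physical_resource_id of every row whose resource_status contains FAILED.
# Column positions are taken from the table in this report (an assumption
# about the layout, not a documented interface).
failed_deployment_ids() {
    awk -F'|' '$5 ~ /FAILED/ {
        gsub(/ /, "", $3)   # strip the padding around the ID column
        print $3
    }'
}
```

It would be used as `heat resource-list <stack id> | failed_deployment_ids`, and each printed ID can then be passed to `heat deployment-show`.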

Notes: I hacked the update script in /var/lib/heat-config/heat-config-script/ to comment out the alarm-evaluator lines. This allowed the script to run locally on the controller. That change is probably not persistent and only applies to one controller, so I made the same change on the undercloud in /usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh. The update for that portion then completed successfully.
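A less invasive alternative to commenting the lines out might be to guard each resource individually. In the failed run the deploy output still lists the alarm-notifier clone while the error is about the missing alarm-evaluator, so a check that covers both resources at once appears to let the evaluator's disable call run after the resource is gone. The following is a minimal sketch, assuming `pcs resource show <id>` exits non-zero for a missing resource (pcs 0.9 behaviour); `resource_exists` and `remove_if_present` are hypothetical names, not part of the shipped script.

```shell
#!/bin/bash
# Sketch only: make the disable/delete step safe to re-run.

# Assumption: "pcs resource show <id>" returns non-zero when <id> is unknown.
resource_exists() {
    pcs resource show "$1" >/dev/null 2>&1
}

remove_if_present() {
    local name=$1
    if ! resource_exists "$name"; then
        echo "$name: not defined in the cluster, skipping"
        return 0
    fi
    pcs resource disable "$name" || return 1
    # Wait for the stop only if the templates' check_resource helper is loaded.
    if declare -F check_resource >/dev/null; then
        check_resource "$name" stopped 600 || return 1
    fi
    pcs resource delete "$name"
}
```

Calling `remove_if_present openstack-ceilometer-alarm-evaluator` and then `remove_if_present openstack-ceilometer-alarm-notifier` would give the same end state on a first run and on a re-run after a partial failure.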

Comment 1 Jeremy 2016-10-05 21:35:06 UTC
I am seeing similar behavior for the keystone portion of the upgrade. The script seems to try to stop keystone in pcs, but it doesn't fully stop. The same thing happened with ceilometer, which led to the problem.

[root@overcloud-controller-0 heat-admin]# pcs status |grep -A3 keystone
 Clone Set: openstack-keystone-clone [openstack-keystone] (unmanaged)
     openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-1 (unmanaged)
     openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-0 (unmanaged)
     openstack-keystone	(systemd:openstack-keystone):	(target-role:Stopped) Started overcloud-controller-2 (unmanaged)
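The state above (a stop was requested, but the resource is still started and unmanaged) can be spotted mechanically before re-running the deploy. The following is a minimal sketch that only matches the `pcs status` line format pasted in this comment; `stuck_unmanaged` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: read "pcs status" output on stdin and print "<resource> <node>"
# for every instance that has target-role:Stopped but is still Started and
# unmanaged. Field positions follow the lines pasted in this report.
stuck_unmanaged() {
    awk '/\(target-role:Stopped\)/ && /Started/ && /\(unmanaged\)/ { print $1, $5 }'
}
```

It would be used as `pcs status | stuck_unmanaged`; any output means the stop request has not taken effect on the listed nodes.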

Comment 2 Andrew Blum 2016-10-31 15:21:56 UTC
I am seeing the same error in our training environment.  The heat templates put the whole cluster in maintenance (the unmanaged state):

./extraconfig/tasks/pacemaker_maintenance_mode.sh:    pcs property set maintenance-mode=true

But then later try to disable a couple of ceilometer resources:

/usr/share/openstack-tripleo-heat-templates/extraconfig/tasks/major_upgrade_pacemaker_migrations.sh

        if pcs status | grep openstack-ceilometer-alarm; then
            # Disable pacemaker resources for ceilometer-alarms
            pcs resource disable openstack-ceilometer-alarm-evaluator
            check_resource openstack-ceilometer-alarm-evaluator stopped 600
            pcs resource delete openstack-ceilometer-alarm-evaluator
            pcs resource disable openstack-ceilometer-alarm-notifier
            check_resource openstack-ceilometer-alarm-notifier stopped 600
            pcs resource delete openstack-ceilometer-alarm-notifier

My workaround was to disable those ceilometer resources manually before running the upgrade so this validation check passed:

[stack@director ~]$ ssh heat-admin@control0 "sudo pcs resource disable openstack-ceilometer-alarm-evaluator"
[stack@director ~]$ ssh heat-admin@control0 "sudo pcs resource disable openstack-ceilometer-alarm-notifier"

This resulted in a successful upgrade. In my opinion this only works around an issue in the upgrade logic, since you can't disable a resource if it's unmanaged.
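The ordering problem described in this comment could be made explicit with a check before the disable call. The following is a minimal sketch, assuming that `pcs property show maintenance-mode` on pcs 0.9 prints a line of the form ` maintenance-mode: true` when the property is set; `in_maintenance_mode` and `disable_when_managed` are hypothetical names and this is not what the templates do.

```shell
#!/bin/bash
# Sketch only: refuse to disable a resource while the cluster is in
# maintenance mode, because the stop would not be carried out.

# Assumption: the property is printed as " maintenance-mode: true".
in_maintenance_mode() {
    pcs property show maintenance-mode 2>/dev/null \
        | grep -q 'maintenance-mode: *true'
}

disable_when_managed() {
    local name=$1
    if in_maintenance_mode; then
        echo "cluster is in maintenance mode; disabling $name would not stop it" >&2
        return 1
    fi
    pcs resource disable "$name"
}
```

Failing early like this would surface the conflict immediately instead of after the 600 second wait in check_resource.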

Comment 3 Red Hat Bugzilla Rules Engine 2017-02-07 20:53:03 UTC
This bugzilla has been removed from the release and needs to be reviewed and Triaged for another Target Release.

Comment 4 Sofer Athlan-Guyot 2017-02-09 15:13:31 UTC
Hi,

I was able to upgrade aodh successfully, so I am closing this one. If you still have the issue, feel free to re-open it.

Regards,

