Bug 1451101

Summary:	All pcs services unmanaged during aodh phase (step 1) of OSP 8 -> 9 upgrade
Product:	Red Hat OpenStack	Reporter:	jliberma <jliberma>
Component:	openstack-tripleo-heat-templates	Assignee:	Sofer Athlan-Guyot <sathlang>
Status:	CLOSED ERRATA	QA Contact:	Marius Cornea <mcornea>
Severity:	high	Docs Contact:
Priority:	high
Version:	9.0 (Mitaka)	CC:	amuller, dbecker, dmaley, emacchi, jjoyce, jliberma, jmelvin, lruzicka, mbultel, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, sathlang, yroblamo
Target Milestone:	zstream	Keywords:	Triaged, ZStream
Target Release:	9.0 (Mitaka)
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	openstack-tripleo-heat-templates-2.0.0-59.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-09-27 13:08:31 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1464464
Bug Blocks:

Description jliberma@redhat.com 2017-05-15 19:23:57 UTC

Created attachment 1279086 [details]
Steps to reproduce in their environment

Description of problem:

Customer is upgrading from OSP 8 to OSP 9. During the AODH upgrade step all pcs services go unmanaged. This is reproducible every time. It appears to be due to a pacemaker timeout while stopping various services on all nodes. Fencing is not configured.

NOTE: This step is recoverable by putting the cluster back in a managed state, manually disabling and removing the ceilometer alarm and event notifier along with all their constraints, then re-running the upgrade step.

Version-Release number of selected component (if applicable):

OSP director 8

How reproducible:

Every time

Steps to Reproduce:
1. Attached PDF

Actual results:

All pcs services go unmanaged, requiring manual intervention at the first step of the upgrade.

Complete errors documented here:
https://access.redhat.com/support/cases/#/case/01819909

Warning: 'openstack-ceilometer-alarm-evaluator' is unmanaged
May 15 11:23:02 ocd00-controller-0.localdomain os-collect-config[4435]: ERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.
May 15 11:23:02 ocd00-controller-0.localdomain os-collect-config[4435]: [2017-05-15 11:23:02,428] (heat-config) [DEBUG]
May 15 11:23:02 ocd00-controller-0.localdomain os-collect-config[4435]: [2017-05-15 11:23:02,428] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/70e88030-db11-47a9-8303-cbf3d3980f6d

Expected results:

Deploy command executes successfully.

Additional info:

Currently verifying that both the OSP 8 and OSP 9 repos are available during the first two steps of the upgrade and that all content is at the latest version from the CDN.

Comment 1 jliberma@redhat.com 2017-05-15 20:10:46 UTC

NOTE: 

Customer has os-collect-config-0.1.37-6.el7ost (https://bugzilla.redhat.com/show_bug.cgi?id=1350489)

Both OSP 8 and 9 repos are available on the overcloud nodes during the aodh migration step.

Comment 2 jliberma@redhat.com 2017-05-15 20:12:44 UTC

Here are the notes I made for correcting this issue when I encountered it in my test environment:

Optional: Manually correcting Ceilometer errors during Aodh update

 
NOTE -- I had to manually delete the ceilometer pcs resources and constraints then re-run the update command

NOTE -- I also saw one instance where the overcloud deploy completed successfully but the cluster services did not restart cleanly


This is the error from controller node os-collect-config:

Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: ERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.
Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: ERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.
Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: [2017-04-09 10:50:36,136] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/1011c63b-49c2-4f17-9332-6df4c8d54494. [1]
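
To see quickly which resources the upgrade script is complaining about, the error lines above can be filtered. This is only a sketch: the function name failed_stop_resources is made up for illustration, and it assumes the message wording is exactly as shown in the log excerpt.

```shell
# Hypothetical helper (not part of any package): read os-collect-config
# log lines on stdin and print each resource that was reported as
# "not in stopped state", one name per line, without duplicates.
failed_stop_resources() {
    sed -n 's/.*cluster finished transition but \([^ ]*\) was not in stopped state.*/\1/p' | sort -u
}

# Example use on a controller (not run here):
#   journalctl -u os-collect-config | failed_stop_resources
```

Deduplicating with sort -u matters because the same error line can be logged more than once, as in the excerpt above.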
 

In addition, all pcs services are unmanaged because the cluster is in maintenance mode:

$ pcs status
$ pcs property list --all | grep maintenance-mode
$ pcs cluster cib | grep maintenance-mode
$ grep maintenance /var/log/messages

 
Unset maintenance mode:
$ pcs property unset maintenance-mode
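
The check and the unset can be tied together so that the property is only touched when it is really set. A minimal sketch, assuming the property listing prints a line of the form "maintenance-mode: true"; the function name maintenance_mode_enabled is hypothetical.

```shell
# Hypothetical helper: exit 0 if the property listing on stdin shows
# maintenance-mode set to true, non-zero otherwise.
# Assumes lines look like " maintenance-mode: true".
maintenance_mode_enabled() {
    grep -Eq '^[[:space:]]*maintenance-mode:[[:space:]]*true[[:space:]]*$'
}

# Example use on a controller (not run here):
#   if pcs property list --all | maintenance_mode_enabled; then
#       pcs property unset maintenance-mode
#   fi
```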


Clean up ceilometer on the bootstrap node:

$ pcs resource disable openstack-ceilometer-alarm-evaluator
$ pcs status | grep openstack-ceilometer-alarm-evaluator -A 1

 
Delete the openstack-ceilometer-alarm-evaluator resource:
$ pcs resource delete openstack-ceilometer-alarm-evaluator
$ pcs status | grep openstack-ceilometer-alarm-evaluator -A 1

 
Repeat for openstack-ceilometer-alarm-notifier:
$ pcs resource disable openstack-ceilometer-alarm-notifier
$ pcs status | grep openstack-ceilometer-alarm-notifier -A 2
$ pcs resource delete openstack-ceilometer-alarm-notifier
$ pcs status | grep openstack-ceilometer-alarm-notifier -A 2

 

Remove ceilometer constraints:

$ if pcs constraint order show | grep "start delay-clone then start openstack-ceilometer-alarm-evaluator-clone"; then pcs constraint remove order-delay-clone-openstack-ceilometer-alarm-evaluator-clone-mandatory; fi

$ if pcs constraint order show | grep "start openstack-ceilometer-alarm-notifier-clone then start openstack-ceilometer-notification-clone"; then pcs constraint remove order-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-notification-clone-mandatory; fi

$ if pcs constraint order show | grep "start openstack-ceilometer-alarm-evaluator-clone then start openstack-ceilometer-alarm-notifier-clone"; then pcs constraint remove order-openstack-ceilometer-alarm-evaluator-clone-openstack-ceilometer-alarm-notifier-clone-mandatory; fi

$ if pcs constraint colocation show | grep "openstack-ceilometer-notification-clone with openstack-ceilometer-alarm-notifier-clone"; then pcs constraint remove colocation-openstack-ceilometer-notification-clone-openstack-ceilometer-alarm-notifier-clone-INFINITY; fi

$ if pcs constraint colocation show | grep "openstack-ceilometer-alarm-notifier-clone with openstack-ceilometer-alarm-evaluator-clone"; then pcs constraint remove colocation-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-alarm-evaluator-clone-INFINITY; fi

$ if pcs constraint colocation show | grep "openstack-ceilometer-alarm-evaluator-clone with delay-clone"; then pcs constraint remove colocation-openstack-ceilometer-alarm-evaluator-clone-delay-clone-INFINITY; fi

$ pcs constraint list | grep ceilometer-alarm
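
The six conditional removals above could also be driven from the constraint ids themselves. The sketch below is an assumption-laden alternative, not the procedure that was tested: it assumes "pcs constraint --full" prints each constraint with a trailing "(id:...)" marker, and the function name ceilometer_alarm_constraint_ids is made up for illustration.

```shell
# Hypothetical helper: read `pcs constraint --full` style text on stdin
# and print the id of every constraint that mentions ceilometer-alarm.
# Assumes each constraint line ends with a marker such as "(id:some-id)".
ceilometer_alarm_constraint_ids() {
    grep 'ceilometer-alarm' | sed -n 's/.*(id:\([^)]*\)).*/\1/p'
}

# Example use on the bootstrap node (not run here):
#   pcs constraint --full | ceilometer_alarm_constraint_ids |
#   while read -r id; do
#       pcs constraint remove "$id"
#   done
```

Review the printed ids before piping them into the removal loop, since the pattern matches any constraint whose description contains ceilometer-alarm.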


From undercloud, remove ceilometer alarm package:
$ source ~/stackrc
$ run-on-overcloud sudo yum -y remove openstack-ceilometer-alarm
$ ctl-health 

Now re-run the deploy command.


NOTE -- It is being tested whether these steps should be done BEFORE running the AODH migration script.

Comment 3 Jeremy 2017-05-16 13:29:56 UTC

The following update was recently posted by one of our engineers:

I believe we've uncovered a root cause, a crash in one of the cluster daemons that makes it look like one of the services fails to stop.
With no fencing configured, there is nothing the cluster can do to recover by itself, which leaves the upgrade in a partial state.

I will co-ordinate with folks here on the best way to move forward.

Comment 4 jliberma@redhat.com 2017-05-16 13:46:55 UTC

I've reproduced this failure multiple times in libvirt/kvm and baremetal environments.

Manually disabling the pcs ceilometer-[alarm,notifier] services, removing them, removing the constraints, and uninstalling the rpms prior to running the aodh migration script avoids this issue.

Comment 5 Omri Hochman 2017-05-16 17:46:18 UTC

Could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1451170

The fix should be in: pacemaker-1.1.16-9.el7

Comment 6 Dave Maley 2017-05-18 15:01:38 UTC

Requesting hotfix approval for when the build is ready.

Comment 7 Sofer Athlan-Guyot 2017-05-30 17:04:43 UTC

Hi,

Moving this to POST as a hotfix has been packaged.  Do we need something else to get it out?


It should be noted that the root cause of those pacemaker failures is related to https://bugzilla.redhat.com/show_bug.cgi?id=1278181 .

It happens that OSP-8 GA may not have the fix that keeps the heat agent script status in /var/lib (where it persists across reboots) rather than in /var/run (where it does not).  The net effect is that *all* the heat scripts are run again, causing all sorts of dysfunction during the upgrade.

The correct procedure is described at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html-single/Director_Installation_and_Usage/index.html#sect-Updating_Overcloud_Packages for osp7. osp8 shouldn't be impacted if *deployed* with openstack-heat-templates-0-0.8.20150605git.el7ost.noarch.rpm.  Earlier versions are impacted.

The step described in the osp7 documentation should happen after the undercloud upgrade but before any reboot.

I've created https://bugzilla.redhat.com/show_bug.cgi?id=1456928 to track the update in the documentation.

Comment 8 Sofer Athlan-Guyot 2017-06-01 09:27:44 UTC

Hi Dave,

Did you get the hotfix?

Comment 10 Sofer Athlan-Guyot 2017-06-26 11:47:56 UTC

Hi,

So we may have found a race condition here.  This is followed up in https://bugzilla.redhat.com/show_bug.cgi?id=1464464 .  Dave, is it OK if I close this one as a duplicate?

Comment 11 jliberma@redhat.com 2017-07-24 11:46:48 UTC

The other bug is a duplicate of this bug, so we should keep this one open and close the other.

Comment 13 Sofer Athlan-Guyot 2017-07-25 09:47:51 UTC

*** Bug 1464464 has been marked as a duplicate of this bug. ***

Comment 14 Sofer Athlan-Guyot 2017-07-25 09:51:11 UTC

Hi,

We need to adjust this one to target osp9; I'm not sure how to proceed.

Thanks,

Comment 15 mathieu bultel 2017-08-25 07:46:38 UTC

I'm setting the default priority and severity here; correct them if you think these are not the right settings.
Thank you.

Comment 19 Marius Cornea 2017-09-22 10:24:39 UTC

[stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-2.0.0-60.el7ost.noarch

After running:

#!/bin/bash

openstack overcloud deploy --force-postconfig \
--templates \
--libvirt-type kvm \
--ntp-server clock.redhat.com \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 3 \
--ceph-storage-flavor ceph \
-e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
-e /home/stack/virt/internal.yaml \
-e /home/stack/virt/network/network-environment.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
-e /home/stack/virt/hostnames.yml \
-e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml


[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Fri Sep 22 10:22:20 2017
Last change: Fri Sep 22 10:05:43 2017 by root via cibadmin on controller-0

3 nodes configured
115 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-172.17.4.10	(ocf::heartbeat:IPaddr2):	Started controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 ip-172.17.3.10	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.10	(ocf::heartbeat:IPaddr2):	Started controller-2
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.11	(ocf::heartbeat:IPaddr2):	Started controller-1
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ controller-0 controller-1 controller-2 ]
 ip-192.168.24.13	(ocf::heartbeat:IPaddr2):	Started controller-2
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Stopped: [ controller-0 controller-1 controller-2 ]
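
A status listing this long is easier to verify mechanically. The sketch below assumes that pacemaker tags affected entries with the word "unmanaged" in the "pcs status" output, as in the warning quoted in the description; the function name unmanaged_resources is hypothetical.

```shell
# Hypothetical helper: read `pcs status` output on stdin and print every
# line carrying the "unmanaged" marker. Prints nothing, and still exits 0,
# when no resource is unmanaged.
unmanaged_resources() {
    grep -i 'unmanaged' || true
}

# Example use on a controller (not run here):
#   if [ -z "$(pcs status | unmanaged_resources)" ]; then
#       echo "no unmanaged resources"
#   fi
```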

Comment 21 errata-xmlrpc 2017-09-27 13:08:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2813