Bug 1451101
| Summary: | All pcs services unmanaged during aodh phase (step 1) of OSP 8 -> 9 upgrade | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | jliberma <jliberma> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Sofer Athlan-Guyot <sathlang> |
| Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 9.0 (Mitaka) | CC: | amuller, dbecker, dmaley, emacchi, jjoyce, jliberma, jmelvin, lruzicka, mbultel, mburns, mcornea, morazi, ohochman, rhel-osp-director-maint, sathlang, yroblamo |
| Target Milestone: | zstream | Keywords: | Triaged, ZStream |
| Target Release: | 9.0 (Mitaka) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-2.0.0-59.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-27 13:08:31 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1464464 | | |
| Bug Blocks: | | | |

Description
jliberma@redhat.com
2017-05-15 19:23:57 UTC

NOTE: Customer has os-collect-config-0.1.37-6.el7ost (https://bugzilla.redhat.com/show_bug.cgi?id=1350489)

Both OSP 8 and 9 repos are available on the overcloud nodes during the aodh migration step.

Here are the notes I made for correcting this issue when I encountered it in my test environment:

Optional: Manually correcting Ceilometer errors during Aodh update

NOTE -- I had to manually delete the ceilometer pcs resources and constraints, then re-run the update command.
NOTE -- I also saw one instance where the overcloud deploy completed successfully but the cluster services did not restart cleanly.

This is the error from os-collect-config on the controller node:

Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: ERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.
Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: ERROR: cluster finished transition but openstack-ceilometer-alarm-evaluator was not in stopped state, exiting.
Apr 09 10:50:36 overcloud-controller-0.localdomain os-collect-config[2956]: [2017-04-09 10:50:36,136] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/1011c63b-49c2-4f17-9332-6df4c8d54494. [1]

Plus, all pcs services are unmanaged because pcs is in maintenance mode:

$ pcs status
$ pcs property list --all | grep maintenance-mode
$ pcs cluster cib | grep maintenance-mode
$ grep maintenance /var/log/messages

Unset maintenance mode:

$ pcs property unset maintenance-mode

Clean up ceilometer on the bootstrap node:

$ pcs resource disable openstack-ceilometer-alarm-evaluator
$ pcs status | grep openstack-ceilometer-alarm-evaluator -A 1

Delete the openstack-ceilometer-alarm-evaluator resource:

$ pcs resource delete openstack-ceilometer-alarm-evaluator
$ pcs status | grep openstack-ceilometer-alarm-evaluator -A 1

Repeat for openstack-ceilometer-alarm-notifier:

$ pcs resource disable openstack-ceilometer-alarm-notifier
$ pcs status | grep openstack-ceilometer-alarm-notifier -A 2
$ pcs resource delete openstack-ceilometer-alarm-notifier
$ pcs status | grep openstack-ceilometer-alarm-notifier -A 2

Remove ceilometer constraints:

$ if pcs constraint order show | grep "start delay-clone then start openstack-ceilometer-alarm-evaluator-clone"; then pcs constraint remove order-delay-clone-openstack-ceilometer-alarm-evaluator-clone-mandatory; fi
$ if pcs constraint order show | grep "start openstack-ceilometer-alarm-notifier-clone then start openstack-ceilometer-notification-clone"; then pcs constraint remove order-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-notification-clone-mandatory; fi
$ if pcs constraint order show | grep "start openstack-ceilometer-alarm-evaluator-clone then start openstack-ceilometer-alarm-notifier-clone"; then pcs constraint remove order-openstack-ceilometer-alarm-evaluator-clone-openstack-ceilometer-alarm-notifier-clone-mandatory; fi
$ if pcs constraint colocation show | grep "openstack-ceilometer-notification-clone with openstack-ceilometer-alarm-notifier-clone"; then pcs constraint remove colocation-openstack-ceilometer-notification-clone-openstack-ceilometer-alarm-notifier-clone-INFINITY; fi
$ if pcs constraint colocation show | grep "openstack-ceilometer-alarm-notifier-clone with openstack-ceilometer-alarm-evaluator-clone"; then pcs constraint remove colocation-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-alarm-evaluator-clone-INFINITY; fi
$ if pcs constraint colocation show | grep "openstack-ceilometer-alarm-evaluator-clone with delay-clone"; then pcs constraint remove colocation-openstack-ceilometer-alarm-evaluator-clone-delay-clone-INFINITY; fi
$ pcs constraint list | grep ceilometer-alarm

From the undercloud, remove the ceilometer alarm package:

$ source ~/stackrc
$ run-on-overcloud sudo yum -y remove openstack-ceilometer-alarm
$ ctl-health

Now re-run the deploy command.

NOTE -- It is being tested whether these steps should be done BEFORE running the AODH migration script.

This most recent comment was updated by one of our engineers:

I believe we've uncovered a root cause, a crash in one of the cluster daemons that makes it look like one of the services fails to stop. With no fencing configured, there is nothing the cluster can do to recover by itself, leaving the upgrade in a partial state. I will co-ordinate with folks here on the best way to move forward.

I've reproduced this failure multiple times in libvirt/kvm and baremetal environments. Manually disabling the pcs ceilometer-[alarm,notifier] services, removing them, removing the constraints, and uninstalling the rpms prior to running the aodh migration script avoids this issue.

Could be related to: https://bugzilla.redhat.com/show_bug.cgi?id=1451170
Fix should be in: pacemaker-1.1.16-9.el7
Requesting hotfix approval for once we have the build ready.

Hi, moving this to POST as a hotfix has been packaged. Do we need something else to get it out?

It should be noted that the root cause of those pacemaker issues is related to https://bugzilla.redhat.com/show_bug.cgi?id=1278181 .
It happens that OSP-8 GA may not have the fix that keeps the heat agent script status in /var/lib (where it persists across reboots) rather than in /var/run (where it does not). The net effect is that *all* the heat scripts are run again, causing all sorts of dysfunction during the upgrade. The correct procedure is described at https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux_OpenStack_Platform/7/html-single/Director_Installation_and_Usage/index.html#sect-Updating_Overcloud_Packages for osp7, and osp8 shouldn't be impacted if *deployed* with openstack-heat-templates-0-0.8.20150605git.el7ost.noarch.rpm. Earlier versions are impacted. The step described in the osp7 documentation should happen after the undercloud upgrade but before any reboot. I've created https://bugzilla.redhat.com/show_bug.cgi?id=1456928 to track the update in the documentation.

Hi Dave, did you get the hotfix?

Hi, so we may have found a race condition here. This is followed up in https://bugzilla.redhat.com/show_bug.cgi?id=1464464 . Dave, is it OK if I close this one as a duplicate?

The other bug is a duplicate of this bug, so we should keep this one open and close the other.

*** Bug 1464464 has been marked as a duplicate of this bug. ***

Hi, we need to adjust this one to target osp9, not sure how to proceed. Thanks,

I'm setting default priority and severity here; correct them if you think they are not the right settings. Thank you.

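The manual cleanup in the description repeats the same check-then-remove pattern for each constraint. A minimal sketch of those steps as shell functions follows, assuming it is sourced as root on the bootstrap controller node. The function names `remove_alarm_resources` and `remove_alarm_constraints` are hypothetical; the pcs subcommands, resource names, grep patterns, and constraint ids are copied from the description. This is an illustration of the workaround, not the fix shipped in the erratum.

```shell
#!/bin/bash
# Sketch of the manual ceilometer-alarm cleanup from the description.
# Function names are hypothetical; pcs commands and ids come from the report.

# Disable, then delete, both ceilometer alarm resources.
remove_alarm_resources() {
    local res
    for res in openstack-ceilometer-alarm-evaluator \
               openstack-ceilometer-alarm-notifier; do
        pcs resource disable "$res" || return 1
        pcs resource delete "$res" || return 1
    done
}

# For each entry (kind|text to look for|constraint id), remove the
# constraint only if "pcs constraint <kind> show" still lists it.
remove_alarm_constraints() {
    local kind pattern id
    while IFS='|' read -r kind pattern id; do
        [ -n "$kind" ] || continue
        # </dev/null keeps pcs from consuming the here-document on stdin.
        if pcs constraint "$kind" show </dev/null | grep -q -- "$pattern"; then
            pcs constraint remove "$id" </dev/null
        fi
    done <<'EOF'
order|start delay-clone then start openstack-ceilometer-alarm-evaluator-clone|order-delay-clone-openstack-ceilometer-alarm-evaluator-clone-mandatory
order|start openstack-ceilometer-alarm-notifier-clone then start openstack-ceilometer-notification-clone|order-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-notification-clone-mandatory
order|start openstack-ceilometer-alarm-evaluator-clone then start openstack-ceilometer-alarm-notifier-clone|order-openstack-ceilometer-alarm-evaluator-clone-openstack-ceilometer-alarm-notifier-clone-mandatory
colocation|openstack-ceilometer-notification-clone with openstack-ceilometer-alarm-notifier-clone|colocation-openstack-ceilometer-notification-clone-openstack-ceilometer-alarm-notifier-clone-INFINITY
colocation|openstack-ceilometer-alarm-notifier-clone with openstack-ceilometer-alarm-evaluator-clone|colocation-openstack-ceilometer-alarm-notifier-clone-openstack-ceilometer-alarm-evaluator-clone-INFINITY
colocation|openstack-ceilometer-alarm-evaluator-clone with delay-clone|colocation-openstack-ceilometer-alarm-evaluator-clone-delay-clone-INFINITY
EOF
}

# Usage, in the order given in the description:
#   pcs property unset maintenance-mode
#   remove_alarm_resources
#   remove_alarm_constraints
#   pcs constraint list | grep ceilometer-alarm   # expect no output
```

Keeping the pattern and the id side by side in one table makes it easier to spot a mismatch between what is searched for and what is removed. The package removal from the undercloud (`run-on-overcloud sudo yum -y remove openstack-ceilometer-alarm`) is left as a separate manual step.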
[stack@undercloud-0 ~]$ rpm -qa | grep tripleo-heat-templates
openstack-tripleo-heat-templates-2.0.0-60.el7ost.noarch

After running:

#!/bin/bash
openstack overcloud deploy --force-postconfig \
    --templates \
    --libvirt-type kvm \
    --ntp-server clock.redhat.com \
    --control-scale 3 \
    --control-flavor controller \
    --compute-scale 1 \
    --compute-flavor compute \
    --ceph-storage-scale 3 \
    --ceph-storage-flavor ceph \
    -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml \
    -e /home/stack/virt/internal.yaml \
    -e /home/stack/virt/network/network-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /home/stack/virt/hostnames.yml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-aodh.yaml

[root@controller-0 ~]# pcs status
Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7-94ff4df) - partition with quorum
Last updated: Fri Sep 22 10:22:20 2017
Last change: Fri Sep 22 10:05:43 2017 by root via cibadmin on controller-0

3 nodes configured
115 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 ip-172.17.4.10 (ocf::heartbeat:IPaddr2): Started controller-0
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 ip-172.17.3.10 (ocf::heartbeat:IPaddr2): Started controller-1
 ip-172.17.1.10 (ocf::heartbeat:IPaddr2): Started controller-2
 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.11 (ocf::heartbeat:IPaddr2): Started controller-1
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]
 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: mongod-clone [mongod]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: memcached-clone [memcached]
     Started: [ controller-0 controller-1 controller-2 ]
 ip-192.168.24.13 (ocf::heartbeat:IPaddr2): Started controller-2
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-clone [openstack-heat-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-api-clone [openstack-nova-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-glance-api-clone [openstack-glance-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: delay-clone [delay]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: httpd-clone [httpd]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-keystone-clone [openstack-keystone]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: neutron-server-clone [neutron-server]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume (systemd:openstack-cinder-volume): Started controller-0
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier]
     Stopped: [ controller-0 controller-1 controller-2 ]
 Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator]
     Stopped: [ controller-0 controller-1 controller-2 ]

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2813
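
When checking a long `pcs status` listing like the one in the verification comment, it can help to filter it down to the sets that are not running and to look for unmanaged resources, which was the original symptom. A minimal sketch follows. The helper names `stopped_sets` and `has_unmanaged` are hypothetical, and the sketch assumes the usual multi-line `pcs status` layout, with each set name on a `Clone Set:` or `Master/Slave Set:` line followed by its state lines, and unmanaged resources carrying the literal word `unmanaged`.

```shell
#!/bin/bash
# Sketch: filters for "pcs status" text read on stdin.
# Helper names are hypothetical; the layout assumptions are stated above.

# Print the name of every clone or master/slave set that has a "Stopped:" line.
stopped_sets() {
    awk '
        $2 == "Set:"     { name = $3 }     # "Clone Set: NAME [...]" or "Master/Slave Set: NAME [...]"
        $1 == "Stopped:" { print name }    # state line of the most recently seen set
    '
}

# Succeed if any line mentions an unmanaged resource.
has_unmanaged() {
    grep -q 'unmanaged'
}

# Usage on a controller node:
#   pcs status | stopped_sets
#   pcs status | has_unmanaged && echo "some resources are unmanaged"
```

Applied to status output with the same content as the verification comment, `stopped_sets` would be expected to list only the three openstack-aodh-* clone sets, which that output reports as Stopped after the aodh migration step.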