rhosp-director: overcloud upgrade OSP9->OSP10 fails during major-upgrade-pacemaker phase. Stdout has the following: "WARNING: Waiting for Ceph cluster status to go HEALTH_OK"

Environment:
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-20.el7ost.noarch
instack-undercloud-5.3.0-1.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch
ceph-mon-0.94.9-9.el7cp.x86_64
ceph-common-0.94.9-9.el7cp.x86_64
ceph-radosgw-0.94.9-9.el7cp.x86_64
ceph-osd-0.94.9-9.el7cp.x86_64
ceph-selinux-0.94.9-9.el7cp.x86_64
ceph-0.94.9-9.el7cp.x86_64

Steps to reproduce:
1. The setup was deployed with:

cd ; openstack overcloud deploy \
  --debug \
  --log-file ~/pilot/overcloud_deployment.log \
  -t 400 \
  --stack overcloud \
  --templates ~/pilot/templates/overcloud \
  -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
  -e ~/pilot/templates/network-environment.yaml \
  -e ~/pilot/templates/node-placement.yaml \
  -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
  -e ~/pilot/templates/dell-environment.yaml \
  -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml \
  --control-flavor control \
  --compute-flavor compute \
  --ceph-storage-flavor ceph-storage \
  --swift-storage-flavor swift-storage \
  --block-storage-flavor block-storage \
  --neutron-public-interface bond1 \
  --neutron-network-type vlan \
  --neutron-disable-tunneling \
  --os-auth-url http://192.168.120.101:5000/v2.0 \
  --os-project-name admin \
  --os-user-id admin \
  --os-password 69345e1089ebd13bc7183a35269e3c060ff4c460 \
  --control-scale 3 \
  --compute-scale 3 \
  --ceph-storage-scale 3 \
  --ntp-server 0.centos.pool.ntp.org \
  --neutron-network-vlan-ranges physint:201:220,physext \
  --neutron-bridge-mappings physint:br-tenant,physext:br-ex

2. Upgraded the undercloud to OSP10.
3. Successfully completed the step with: -e major-upgrade-ceilometer-wsgi-mitaka-newton.yaml
4. Successfully completed the step with: -e major-upgrade-pacemaker-init.yaml
5.
Attempted the step with "-e major-upgrade-pacemaker.yaml"

Result:
I see no concrete error with the failure, but:
####################################################################################################
[stack@director ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| resource_name             | physical_resource_id                 | resource_type                      | resource_status | updated_time         | stack_name                                                                  |
| UpdateWorkflow            | 8a044aee-0bc7-4b9a-9c22-00a38e64c1ab | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_FAILED   | 2017-06-30T21:25:16Z | overcloud                                                                   |
| CephMonUpgradeDeployment  | 520e885c-e1fc-4094-a9d7-831c877f3fb3 | OS::Heat::SoftwareDeploymentGroup  | CREATE_FAILED   | 2017-06-30T21:25:21Z | overcloud-UpdateWorkflow-pz532jqylhv2                                       |
| 0                         | 34342ccf-8cfb-4098-ad9f-b5fe046015b3 | OS::Heat::SoftwareDeployment       | CREATE_FAILED   | 2017-06-30T21:27:21Z | overcloud-UpdateWorkflow-pz532jqylhv2-CephMonUpgradeDeployment-ydhwsxjjv3ga |
####################################################################################################
[stack@director ~]$ heat deployment-show 34342ccf-8cfb-4098-ad9f-b5fe046015b3
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED",
  "server_id": "c7e83ce4-945f-4342-82c5-8c4082dad060",
  "config_id": "c6f9e505-31c4-4c0c-a02c-136041b8917a",
  "output_values": {
    "deploy_stdout": "INFO: starting c6f9e505-31c4-4c0c-a02c-136041b8917a\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n",
    "deploy_stderr": "",
    "deploy_status_code": 124
  },
  "creation_time": "2017-06-30T21:27:22Z",
  "updated_time": "2017-06-30T21:34:03Z",
  "input_values": {
    "update_identifier": "",
    "deploy_identifier": "1498857598"
  },
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 124",
  "id": "34342ccf-8cfb-4098-ad9f-b5fe046015b3"
}
####################################################################################################
[stack@director ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.CephMonUpgradeDeployment.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 34342ccf-8cfb-4098-ad9f-b5fe046015b3
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
  deploy_stdout: |
    ...
WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK (truncated, view all with --long) deploy_stderr: | ################################################################################################################################################### All the pcs resources are UP: [root@overcloud-controller-0 ~]# pcs status Cluster name: tripleo_cluster Stack: corosync Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum Last updated: Fri Jun 30 22:02:41 2017 Last change: Fri Jun 30 21:15:31 2017 by hacluster via crmd on overcloud-controller-1 3 nodes and 124 resources configured Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Full list of resources: ip-192.168.140.121 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.120.127 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.120.126 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: haproxy-clone [haproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: galera-master [galera] Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: memcached-clone [memcached] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] ip-192.168.190.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.170.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.140.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-core-clone [openstack-core] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: redis-master [redis] Masters: [ overcloud-controller-1 ] Slaves: [ overcloud-controller-0 overcloud-controller-2 ] Clone Set: mongod-clone [mongod] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-l3-agent-clone [neutron-l3-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 Clone Set: openstack-heat-engine-clone [openstack-heat-engine] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: 
openstack-aodh-listener-clone [openstack-aodh-listener] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-clone [openstack-heat-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-glance-api-clone [openstack-glance-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-api-clone [openstack-nova-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-sahara-api-clone [openstack-sahara-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-glance-registry-clone [openstack-glance-registry] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-cinder-api-clone [openstack-cinder-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: delay-clone [delay] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-server-clone [neutron-server] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: httpd-clone [httpd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-cfn-clone 
[openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
####################################################################################################
Found the following for os-collect-config on controller-0:

Jun 30 21:34:00 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-06-30 21:34:00,646] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/c6f9e505-31c4-4c0c-a02c-136041b8917a. [124]
####################################################################################################
[root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/c6f9e505-31c4-4c0c-a02c-136041b8917a
#!/bin/bash
ignore_ceph_upgrade_warnings='False'
#!/bin/bash
set -eu
set -o pipefail

echo INFO: starting $(basename "$0")

# Exit if not running
if ! pidof ceph-mon &> /dev/null; then
    echo INFO: ceph-mon is not running, skipping
    exit 0
fi

# Exit if not Hammer
INSTALLED_VERSION=$(ceph --version | awk '{print $3}')
if ! [[ "$INSTALLED_VERSION" =~ ^0\.94.* ]]; then
    echo INFO: version of Ceph installed is not 0.94, skipping
    exit 0
fi

CEPH_STATUS=$(ceph health | awk '{print $1}')
if [ ${CEPH_STATUS} = HEALTH_ERR ]; then
    echo ERROR: Ceph cluster status is HEALTH_ERR, cannot be upgraded
    exit 1
fi

# Useful when upgrading with OSDs num < replica size
if [[ ${ignore_ceph_upgrade_warnings:-False} != [Tt]rue ]]; then
    timeout 300 bash -c "while [ ${CEPH_STATUS} != HEALTH_OK ]; do
      echo WARNING: Waiting for Ceph cluster status to go HEALTH_OK;
      sleep 30;
      CEPH_STATUS=$(ceph health | awk '{print $1}')
    done"
fi

MON_PID=$(pidof ceph-mon)
MON_ID=$(hostname -s)

# Stop daemon using Hammer sysvinit script
service ceph stop mon.${MON_ID}

# Ensure it's stopped
timeout 60 bash -c "while kill -0 ${MON_PID} 2> /dev/null; do
  sleep 2;
done"

# Update to Jewel
yum -y -q update ceph-mon ceph

# Restart/Exit if not on Jewel, only in that case we need the changes
UPDATED_VERSION=$(ceph --version | awk '{print $3}')
if [[ "$UPDATED_VERSION" =~ ^0\.94.* ]]; then
    echo WARNING: Ceph was not upgraded, restarting daemons
    service ceph start mon.${MON_ID}
elif [[ "$UPDATED_VERSION" =~ ^10\.2.* ]]; then
    # RPM could own some of these but we can't take risks on the pre-existing files
    for d in /var/lib/ceph/mon /var/log/ceph /var/run/ceph /etc/ceph; do
        chown -L -R ceph:ceph $d || echo WARNING: chown of $d failed
    done

    # Replay udev events with newer rules
    udevadm trigger

    # Enable systemd unit
    systemctl enable ceph-mon.target
    systemctl enable ceph-mon@${MON_ID}
    systemctl start ceph-mon@${MON_ID}

    # Wait for daemon to be back in the quorum
    timeout 300 bash -c "until (ceph quorum_status | jq .quorum_names | grep -sq ${MON_ID}); do
      echo WARNING: Waiting for mon.${MON_ID} to re-join quorum;
      sleep 10;
    done"

    # if tunables become legacy, cluster status will be HEALTH_WARN causing
    # upgrade to fail on following node
    ceph osd crush tunables default

    echo INFO: Ceph was upgraded to Jewel
else
    echo ERROR: Ceph was upgraded to an unknown release, daemon is stopped, need manual intervention
    exit 1
fi
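For reference on the 124 status code: GNU coreutils `timeout` exits with 124 when the wrapped command is still running at the deadline, which is exactly what happens when the HEALTH_OK wait loop in the script above (timeout 300, sleep 30 per iteration, hence the ten WARNING lines) never sees a healthy cluster. A minimal sketch, not from this report, that reproduces the status code:

# Reproduce the 124 exit status from GNU timeout.
timeout 5 bash -c 'while true; do
  echo "WARNING: Waiting for Ceph cluster status to go HEALTH_OK"
  sleep 1
done'
echo "exit status: $?"   # prints 124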
####################################################################################################
On controller-1 (the only controller where yum update actually occurred):

● ceph-radosgw.service not-found active exited ceph-radosgw.service
####################################################################################################
Here it's important to note that on OSP9 I found the following differences between the default templates and the ones used:

1)
[stack@director overcloud]$ diff ./puppet/manifests/overcloud_controller_pacemaker.pp /usr/share/openstack-tripleo-heat-templates/./puppet/manifests/overcloud_controller_pacemaker.pp
597d596
<   include ::ceph::profile::rgw
1066,1067c1065
<     # enabled => $non_pcmk_start,
<     enabled => false,
---
>     enabled => $non_pcmk_start,
2107,2113d2104
<   if $ceph::profile::params::enable_rgw
<   {
<     exec { 'create_radosgw_keyring':
<       command => "/usr/bin/ceph auth get-or-create client.radosgw.gateway mon 'allow rwx' osd 'allow rwx' -o /etc/ceph/ceph.client.radosgw.gateway.keyring" ,
<       creates => "/etc/ceph/ceph.client.radosgw.gateway.keyring" ,
<     }
<   }

2)
[stack@director overcloud]$ diff ./puppet/hieradata/ceph.yaml /usr/share/openstack-tripleo-heat-templates/./puppet/hieradata/ceph.yaml
1,19c1,3
< # Copyright (c) 2016 Dell Inc. or its subsidiaries.
< #
< # Licensed under the Apache License, Version 2.0 (the "License");
< # you may not use this file except in compliance with the License.
< # You may obtain a copy of the License at
< #
< #     http://www.apache.org/licenses/LICENSE-2.0
< #
< # Unless required by applicable law or agreed to in writing, software
< # distributed under the License is distributed on an "AS IS" BASIS,
< # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
< # See the License for the specific language governing permissions and
< # limitations under the License.
<
< ceph::profile::params::osd_journal_size: 10000
< # CHANGEME: Change osd_pool_default_pg_num and osd_pool_default_pgp_num to
< # be the smallest value in 'ceph_pool_pgs' (which appears below).
< ceph::profile::params::osd_pool_default_pg_num: 256
< ceph::profile::params::osd_pool_default_pgp_num: 256
---
> ceph::profile::params::osd_journal_size: 1024
> ceph::profile::params::osd_pool_default_pg_num: 32
> ceph::profile::params::osd_pool_default_pgp_num: 32
21c5,6
< ceph::profile::params::osd_pool_default_min_size: 2
---
> ceph::profile::params::osd_pool_default_min_size: 1
> ceph::profile::params::osds: {/srv/data: {}}
24,68d8
<
< # CHANGEME:
< # Modify the 'osds' parameter to reflect the list of drives to be used as
< # OSDs. A configuration that colocates Ceph journals on every OSD should look
< # like this:
< # ceph::profile::params::osds:
< #   '/dev/sdb': {}
< #   '/dev/sdc': {}
< # ... and so on.
< # A configuration that places Ceph journals on dedicated drives (such as SSDs)
< # should look like this:
< # ceph::profile::params::osds:
< #   '/dev/sde':
< #     journal: '/dev/sdb'
< #   '/dev/sdf':
< #     journal: '/dev/sdb'
< # ... and so on.
< ceph::profile::params::osds:
<   '/dev/sdd':
<     journal: '/dev/sdb'
<   '/dev/sde':
<     journal: '/dev/sdb'
<   '/dev/sdf':
<     journal: '/dev/sdb'
<   '/dev/sdg':
<     journal: '/dev/sdb'
<   '/dev/sdh':
<     journal: '/dev/sdc'
<   '/dev/sdi':
<     journal: '/dev/sdc'
<   '/dev/sdj':
<     journal: '/dev/sdc'
<   '/dev/sdk':
<     journal: '/dev/sdc'
<
< # CHANGEME: The following table lists the pg_num and pgp_num values for each
< # of the specified pools. Change the value for each pool based on the size of
< # Ceph cluster, using http://ceph.com/pgcalc for guidance. Small pools used by
< # the RADOS Gateway (any pool whose name begins with '.'), other than the
< # '.rgw.buckets' pool, should not be listed.
< ceph_pool_pgs:
<   'volumes': 1024
<   'vms': 256
<   'images': 256
<   '.rgw.buckets': 512

Checking the status of ceph on this setup with the failed upgrade:

overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_OK
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48179: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1120 MB used, 22331 GB / 22333 GB avail
                2112 active+clean
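Since a single `ceph status` only gives a point-in-time view, it can be worth confirming that the cluster actually stays HEALTH_OK for several minutes before re-running the step. A rough helper of that kind, not part of the original report (the controller address is an assumption; use any controller reachable as heat-admin from the undercloud):

# Hypothetical helper: poll Ceph health for ~5 minutes to confirm it stays
# HEALTH_OK rather than flipping in and out of HEALTH_WARN.
CONTROLLER=192.168.120.133
for i in $(seq 1 20); do
    state=$(ssh heat-admin@${CONTROLLER} "sudo ceph health" | awk '{print $1}')
    echo "$(date +%T) ${state}"
    sleep 15
done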
Additional run of the same step results again in failure, but looking for errors in logs on controllers I find this on controller-0: Jul 04 20:23:05 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-07-04 20:23:05,202] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/32d395fb-22ad-436a-8526-1aba7b75d735. [1] [root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/32d395fb-22ad-436a-8526-1aba7b75d735 #!/bin/bash set -eu DEBUG="true" # set false if the verbosity is a problem SCRIPT_NAME=$(basename $0) function log_debug { if [[ $DEBUG = "true" ]]; then echo "`date` $SCRIPT_NAME tripleo-upgrade $(facter hostname) $1" fi } function is_bootstrap_node { if [ "$(hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid)" = "$(facter hostname)" ]; then log_debug "Node is bootstrap" echo "true" fi } function check_resource_pacemaker { if [ "$#" -ne 3 ]; then echo_error "ERROR: check_resource function expects 3 parameters, $# given" exit 1 fi local service=$1 local state=$2 local timeout=$3 if [[ -z $(is_bootstrap_node) ]] ; then log_debug "Node isn't bootstrap, skipping check for $service to be $state here " return else log_debug "Node is bootstrap checking $service to be $state here" fi if [ "$state" = "stopped" ]; then match_for_incomplete='Started' else # started match_for_incomplete='Stopped' fi nodes_local=$(pcs status | grep ^Online | sed 's/.*\[ \(.*\) \]/\1/g' | sed 's/ /\|/g') if timeout -k 10 $timeout crm_resource --wait; then node_states=$(pcs status --full | grep "$service" | grep -v Clone | { egrep "$nodes_local" || true; } ) if echo "$node_states" | grep -q "$match_for_incomplete"; then echo_error "ERROR: cluster finished transition but $service was not in $state state, exiting." exit 1 else echo "$service has $state" fi else echo_error "ERROR: cluster remained unstable for more than $timeout seconds, exiting." 
exit 1 fi } function pcmk_running { if [[ $(systemctl is-active pacemaker) = "active" ]] ; then echo "true" fi } function is_systemd_unknown { local service=$1 if [[ $(systemctl is-active "$service") = "unknown" ]]; then log_debug "$service found to be unkown to systemd" echo "true" fi } function grep_is_cluster_controlled { local service=$1 if [[ -n $(systemctl status $service -l | grep Drop-In -A 5 | grep pacemaker) || -n $(systemctl status $service -l | grep "Cluster Controlled $service") ]] ; then log_debug "$service is pcmk managed from systemctl grep" echo "true" fi } function is_systemd_managed { local service=$1 #if we have pcmk check to see if it is managed there if [[ -n $(pcmk_running) ]]; then if [[ -z $(pcs status --full | grep $service) && -z $(is_systemd_unknown $service) ]] ; then log_debug "$service found to be systemd managed from pcs status" echo "true" fi else # if it is "unknown" to systemd, then it is pacemaker managed if [[ -n $(is_systemd_unknown $service) ]] ; then return elif [[ -z $(grep_is_cluster_controlled $service) ]] ; then echo "true" fi fi } function is_pacemaker_managed { local service=$1 #if we have pcmk check to see if it is managed there if [[ -n $(pcmk_running) ]]; then if [[ -n $(pcs status --full | grep $service) ]]; then log_debug "$service found to be pcmk managed from pcs status" echo "true" fi else # if it is unknown to systemd, then it is pcmk managed if [[ -n $(is_systemd_unknown $service) ]]; then echo "true" elif [[ -n $(grep_is_cluster_controlled $service) ]] ; then echo "true" fi fi } function is_managed { local service=$1 if [[ -n $(is_pacemaker_managed $service) || -n $(is_systemd_managed $service) ]]; then echo "true" fi } function check_resource_systemd { if [ "$#" -ne 3 ]; then echo_error "ERROR: check_resource function expects 3 parameters, $# given" exit 1 fi local service=$1 local state=$2 local timeout=$3 local check_interval=3 if [ "$state" = "stopped" ]; then match_for_incomplete='active' else # started match_for_incomplete='inactive' fi log_debug "Going to check_resource_systemd for $service to be $state" #sanity check is systemd managed: if [[ -z $(is_systemd_managed $service) ]]; then echo "ERROR - $service not found to be systemd managed." exit 1 fi tstart=$(date +%s) tend=$(( $tstart + $timeout )) while (( $(date +%s) < $tend )); do if [[ "$(systemctl is-active $service)" = $match_for_incomplete ]]; then echo "$service not yet $state, sleeping $check_interval seconds." sleep $check_interval else echo "$service is $state" return fi done echo "Timed out waiting for $service to go to $state after $timeout seconds" exit 1 } function check_resource { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "ERROR $service managed by both systemd and pcmk - SKIPPING" return fi if [[ -n $pcmk_managed ]]; then check_resource_pacemaker $@ return elif [[ -n $systemd_managed ]]; then check_resource_systemd $@ return fi log_debug "ERROR cannot check_resource for $service, not managed here?" } function manage_systemd_service { local action=$1 local service=$2 log_debug "Going to systemctl $action $service" systemctl $action $service } function manage_pacemaker_service { local action=$1 local service=$2 # not if pacemaker isn't running! 
if [[ -z $(pcmk_running) ]]; then echo "$(facter hostname) pacemaker not active, skipping $action $service here" elif [[ -n $(is_bootstrap_node) ]]; then log_debug "Going to pcs resource $action $service" pcs resource $action $service fi } function stop_or_disable_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "Skipping stop_or_disable $service due to management conflict" return fi log_debug "Stopping or disabling $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service disable $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service stop $service return fi log_debug "ERROR: $service not managed here?" } function start_or_enable_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "Skipping start_or_enable $service due to management conflict" return fi log_debug "Starting or enabling $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service enable $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service start $service return fi log_debug "ERROR $service not managed here?" } function restart_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "ERROR $service managed by both systemd and pcmk - SKIPPING" return fi log_debug "Restarting $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service restart $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service restart $service return fi log_debug "ERROR $service not managed here?" } function echo_error { echo "$@" | tee /dev/fd2 } # swift is a special case because it is/was never handled by pacemaker # when stand-alone swift is used, only swift-proxy is running on controllers function systemctl_swift { services=( openstack-swift-account-auditor openstack-swift-account-reaper openstack-swift-account-replicator openstack-swift-account \ openstack-swift-container-auditor openstack-swift-container-replicator openstack-swift-container-updater openstack-swift-container \ openstack-swift-object-auditor openstack-swift-object-replicator openstack-swift-object-updater openstack-swift-object openstack-swift-proxy ) local action=$1 case $action in stop) services=$(systemctl | grep openstack-swift- | grep running | awk '{print $1}') ;; start) enable_swift_storage=$(hiera -c /etc/puppet/hiera.yaml tripleo::profile::base::swift::storage::enable_swift_storage) if [[ $enable_swift_storage != "true" ]]; then services=( openstack-swift-proxy ) fi ;; *) echo "Unknown action $action passed to systemctl_swift" exit 1 ;; # shouldn't ever happen... 
esac for service in ${services[@]}; do manage_systemd_service $action $service done } # Special-case OVS for https://bugs.launchpad.net/tripleo/+bug/1635205 # Update condition and add --notriggerun for +bug/1669714 function special_case_ovs_upgrade_if_needed { if rpm -qa | grep "^openvswitch-2.5.0-14" || rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart" ; then echo "Manual upgrade of openvswitch - ovs-2.5.0-14 or restart in postun detected" rm -rf OVS_UPGRADE mkdir OVS_UPGRADE && pushd OVS_UPGRADE echo "Attempting to downloading latest openvswitch with yumdownloader" yumdownloader --resolve openvswitch for pkg in $(ls -1 *.rpm); do if rpm -U --test $pkg 2>&1 | grep "already installed" ; then echo "Looks like newer version of $pkg is already installed, skipping" else echo "Updating $pkg with --nopostun --notriggerun" rpm -U --replacepkgs --nopostun --notriggerun $pkg fi done popd else echo "Skipping manual upgrade of openvswitch - no restart in postun detected" fi } # update os-net-config before ovs see https://bugs.launchpad.net/tripleo/+bug/1695893 function update_network() { set +e yum -q -y update os-net-config return_code=$? echo "yum update os-net-config return code: $return_code" # Writes any changes caused by alterations to os-net-config and bounces the # interfaces *before* restarting the cluster. os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes RETVAL=$? if [[ $RETVAL == 2 ]]; then echo "os-net-config: interface configuration files updated successfully" elif [[ $RETVAL != 0 ]]; then echo "ERROR: os-net-config configuration failed" exit $RETVAL fi set -e # special case https://bugs.launchpad.net/tripleo/+bug/1635205 +bug/1669714 special_case_ovs_upgrade_if_needed } #!/bin/bash # Special pieces of upgrade migration logic go into this # file. E.g. Pacemaker cluster transitions for existing deployments, # matching changes to overcloud_controller_pacemaker.pp (Puppet # handles deployment, this file handles migrations). # # This file shouldn't execute any action on its own, all logic should # be wrapped into bash functions. Upgrade scripts will source this # file and call the functions defined in this file where appropriate. # # The migration functions should be idempotent. If the migration has # been already applied, it should be possible to call the function # again without damaging the deployment or failing the upgrade. # If the major version of mysql is going to change after the major # upgrade, the database must be upgraded on disk to avoid failures # due to internal incompatibilities between major mysql versions # https://bugs.launchpad.net/tripleo/+bug/1587449 # This function detects whether a database upgrade is required # after a mysql package upgrade. It returns 0 when no major upgrade # has to take place, 1 otherwise. function is_mysql_upgrade_needed { # The name of the package which provides mysql might differ # after the upgrade. Consider the generic package name, which # should capture the major version change (e.g. 5.5 -> 10.1) local name="mariadb" local output local ret set +e output=$(yum -q check-update $name) ret=$? set -e if [ $ret -ne 100 ]; then # no updates so we exit echo "0" return fi local currentepoch=$(rpm -q --qf "%{epoch}" $name) local currentversion=$(rpm -q --qf "%{version}" $name | cut -d. 
-f-2) local currentrelease=$(rpm -q --qf "%{release}" $name) local newoutput=$(repoquery -a --pkgnarrow=updates --qf "%{epoch} %{version} %{release}\n" $name) local newepoch=$(echo "$newoutput" | awk '{ print $1 }') local newversion=$(echo "$newoutput" | awk '{ print $2 }' | cut -d. -f-2) local newrelease=$(echo "$newoutput" | awk '{ print $3 }') # With this we trigger the dump restore/path if we change either epoch or # version in the package If only the release tag changes we do not do it # FIXME: we could refine this by trying to parse the mariadb version # into X.Y.Z and trigger the update only if X and/or Y change. output=$(python -c "import rpm; rc = rpm.labelCompare((\"$currentepoch\", \"$currentversion\", None), (\"$newepoch\", \"$newversion\", None)); print rc") if [ "$output" != "-1" ]; then echo "0" return fi echo "1" } # This function returns the list of services to be migrated away from pacemaker # and to systemd. The reason to have these services in a separate function is because # this list is needed in three different places: major_upgrade_controller_pacemaker_{1,2} # and in the function to migrate the cluster from full HA to HA NG function services_to_migrate { # The following PCMK resources the ones the we are going to delete PCMK_RESOURCE_TODELETE=" httpd-clone memcached-clone mongod-clone neutron-dhcp-agent-clone neutron-l3-agent-clone neutron-metadata-agent-clone neutron-netns-cleanup-clone neutron-openvswitch-agent-clone neutron-ovs-cleanup-clone neutron-server-clone openstack-aodh-evaluator-clone openstack-aodh-listener-clone openstack-aodh-notifier-clone openstack-ceilometer-central-clone openstack-ceilometer-collector-clone openstack-ceilometer-notification-clone openstack-cinder-api-clone openstack-cinder-scheduler-clone openstack-glance-api-clone openstack-glance-registry-clone openstack-gnocchi-metricd-clone openstack-gnocchi-statsd-clone openstack-heat-api-cfn-clone openstack-heat-api-clone openstack-heat-api-cloudwatch-clone openstack-heat-engine-clone openstack-nova-api-clone openstack-nova-conductor-clone openstack-nova-consoleauth-clone openstack-nova-novncproxy-clone openstack-nova-scheduler-clone openstack-sahara-api-clone openstack-sahara-engine-clone " echo $PCMK_RESOURCE_TODELETE } # This function will migrate a mitaka system where all the resources are managed # via pacemaker to a newton setup where only a few services will be managed by pacemaker # On a high-level it will operate as follows: # 1. Set the cluster in maintenance-mode so no start/stop action will actually take place # during the conversion # 2. Remove all the colocation constraints and then the ordering constraints, except the # ones related to haproxy/VIPs which exist in Newton as well # 3. Take the cluster out of maintenance-mode # 4. Remove all the resources that won't be managed by pacemaker in newton. The # outcome will be # that they are stopped and removed from pacemakers control # 5. Do a resource cleanup to make sure the cluster is in a clean state function migrate_full_to_ng_ha { if [[ -n $(pcmk_running) ]]; then pcs property set maintenance-mode=true # First we go through all the colocation constraints (except the ones # we want to keep, i.e. 
the haproxy/ip ones) and we remove those COL_CONSTRAINTS=$(pcs config show | sed -n '/^Colocation Constraints:$/,/^$/p' | grep -v "Colocation Constraints:" | egrep -v "ip-.*haproxy" | awk '{print $NF}' | cut -f2 -d: |cut -f1 -d\)) for constraint in $COL_CONSTRAINTS; do log_debug "Deleting colocation constraint $constraint from CIB" pcs constraint remove "$constraint" done # Now we kill all the ordering constraints (except the haproxy/ip ones) ORD_CONSTRAINTS=$(pcs config show | sed -n '/^Ordering Constraints:/,/^Colocation Constraints:$/p' | grep -v "Ordering Constraints:" | awk '{print $NF}' | cut -f2 -d: |cut -f1 -d\)) for constraint in $ORD_CONSTRAINTS; do log_debug "Deleting ordering constraint $constraint from CIB" pcs constraint remove "$constraint" done # At this stage all the pacemaker resources are removed from the CIB. # Once we remove the maintenance-mode those systemd resources will keep # on running. They shall be systemd enabled via the puppet converge # step later on pcs property set maintenance-mode=false # At this stage there are no constraints whatsoever except the haproxy/ip ones # which we want to keep. We now disable and then delete each resource # that will move to systemd. # We want the systemd resources be stopped before doing "yum update", # that way "systemctl try-restart <service>" is no-op because the # service was down already PCS_STATUS_OUTPUT="$(pcs status)" for resource in $(services_to_migrate) "delay-clone" "openstack-core-clone"; do if echo "$PCS_STATUS_OUTPUT" | grep "$resource"; then log_debug "Deleting $resource from the CIB" if ! pcs resource disable "$resource" --wait=600; then echo_error "ERROR: resource $resource failed to be disabled" exit 1 fi pcs resource delete --force "$resource" else log_debug "Service $resource not found as a pacemaker resource, not trying to delete." fi done # We need to do a pcs resource cleanup here + crm_resource --wait to # make sure the cluster is in a clean state before we stop everything, # upgrade and restart everything pcs resource cleanup # We are making sure here that the cluster is stable before proceeding if ! timeout -k 10 600 crm_resource --wait; then echo_error "ERROR: cluster remained unstable after resource cleanup for more than 600 seconds, exiting." 
exit 1 fi fi } function disable_standalone_ceilometer_api { if [[ -n $(is_bootstrap_node) ]]; then if [[ -n $(is_pacemaker_managed openstack-ceilometer-api) ]]; then # Disable pacemaker resources for ceilometer-api manage_pacemaker_service disable openstack-ceilometer-api check_resource_pacemaker openstack-ceilometer-api stopped 600 pcs resource delete openstack-ceilometer-api --wait=600 fi fi } #!/bin/bash set -eu if [[ -n $(is_bootstrap_node) ]]; then # run gnocchi upgrade gnocchi-upgrade fi ############################################################################## pcs status looks as following: [root@overcloud-controller-0 ~]# pcs status Cluster name: tripleo_cluster Stack: corosync Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum Last updated: Tue Jul 4 20:32:47 2017 Last change: Tue Jul 4 20:21:11 2017 by root via crm_resource on overcloud-controller-0 3 nodes and 19 resources configured Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Full list of resources: ip-192.168.140.121 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.120.127 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.120.126 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: haproxy-clone [haproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: galera-master [galera] Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] ip-192.168.190.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.170.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.140.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: redis-master [redis] Masters: [ overcloud-controller-1 ] Slaves: [ overcloud-controller-0 overcloud-controller-2 ] openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled
Hi Sasha,

Adding a comment from the mail thread: before this step was started, the Ceph status was flipping between HEALTH_WARN and HEALTH_OK, with this pattern:

[stack@director ~]$ ssh heat-admin.120.133 "hostname; sudo ceph status"
overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_WARN
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48626: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1121 MB used, 22331 GB / 22333 GB avail
                2112 active+clean

[stack@director ~]$ ssh heat-admin.120.133 "hostname; sudo ceph status"
overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_OK
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48628: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1121 MB used, 22331 GB / 22333 GB avail
                2112 active+clean

Re-running the command showed the status back at HEALTH_OK. So we can see the 124 error in the logs:

Jun 30 21:34:00 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-06-30 21:34:00,653] (heat-config) [INFO] {"deploy_stdout": "INFO: starting c6f9e505-31c4-4c0c-a02c-136041b8917a\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n", "deploy_stderr": "", "deploy_status_code": 124}

but this is from Jun 30, and was almost certainly caused by the flipping status. During the Ceph upgrade on the controller we run this code [0], which should have resolved the flipping. For reference, this is where the 124 error comes from: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_ceph_mon.sh#L28..L33

Now, from comment 2 it appears that you have reached step 5 of the controller upgrade, since the gnocchi upgrade is done there [1], well after the Ceph controller upgrade. So I think you no longer have the Ceph problem (can you confirm?). That error must have been triggered by the flipping status, which was resolved once this code [0] was run. What you have now is a problem with the gnocchi database upgrade. We don't have the log from this run in the sosreport. Could you provide the current controller-0 sosreport and make sure that everything in /var/log/gnocchi is included, or add that manually?

Thanks,

[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_ceph_mon.sh#L74..L77
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_5.sh#L5..L9
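If one wants to confirm that the legacy-tunables warning is indeed what makes the health flip, the following read-only checks (not from the report, plain upstream Ceph commands, run on a controller/monitor node) can help; the upgrade script itself already switches the profile with `ceph osd crush tunables default` after the mon upgrade [0]:

# Diagnostic only.
ceph health detail            # shows the exact warning text, e.g. the legacy tunables message
ceph osd crush show-tunables  # dumps the CRUSH tunables currently in effect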
Hi Sasha,

Thanks for the logs. The gnocchi upgrade fails because the 'metrics' pool is missing (from gnocchi-upgrade.log):

2017-07-04 20:23:04.984 22838 ERROR gnocchi   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
2017-07-04 20:23:04.984 22838 ERROR gnocchi ObjectNotFound: error opening pool 'metrics'
2017-07-04 20:23:04.984 22838 ERROR gnocchi

If you are using externally managed Ceph OSDs, you have to create the pool and user yourself; this is addressed in the documentation referenced here: https://bugzilla.redhat.com/show_bug.cgi?id=1412295

From this bz: the Ceph driver needs a Ceph user and a pool already created. They can be created, for example, with:

ceph osd pool create metrics 8 8
ceph auth get-or-create client.gnocchi mon "allow r" osd "allow rwx pool=metrics"

Can you confirm that the Ceph OSDs are external? If that's the case, running the above commands before the upgrade should solve the problem.
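For completeness, a sketch of that workaround with a check before and after (the 8/8 pg counts are the example values from the referenced BZ; size them with http://ceph.com/pgcalc for a real cluster, and run this on a node that has a Ceph admin keyring):

ceph osd lspools                   # confirm that no 'metrics' pool exists yet
ceph osd pool create metrics 8 8
ceph auth get-or-create client.gnocchi mon "allow r" osd "allow rwx pool=metrics"
ceph auth get client.gnocchi       # verify the user and caps were created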
Hi Sasha, what is the latest news on this one?
Used "IgnoreCephUpgradeWarnings: true" and the upgrade proceeded.
After discussion with Sebastien and DFG:Ceph, the conclusion was that it would be best for Ceph not to emit a WARN for the tunables at all... and, until then, to rely only on the quorum check for the monitors upgrade (as ceph-ansible already does). This means changing the upgrade scripts in TripleO to always ignore the Ceph health warnings, which is confirmed to work per comment #15.
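For reference, the quorum-only gate that would remain is essentially the check the mon upgrade script already performs after restarting the monitor (see the script quoted earlier); stripped down, it looks like this:

# Quorum-only readiness check, mirroring the existing script's post-upgrade
# wait; run on the monitor being upgraded.
MON_ID=$(hostname -s)
timeout 300 bash -c "until (ceph quorum_status | jq .quorum_names | grep -sq ${MON_ID}); do
  echo WARNING: Waiting for mon.${MON_ID} to re-join quorum;
  sleep 10;
done"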
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2654
closed, no need for needinfo.