rhosp-director: overcloud upgrade OSP9->OSP10 fails during major-upgrade-pacemaker phase. Stdout has the following: "WARNING: Waiting for Ceph cluster status to go HEALTH_OK"

Environment:
openstack-tripleo-heat-templates-compat-2.0.0-41.el7ost.noarch
openstack-tripleo-heat-templates-5.2.0-20.el7ost.noarch
instack-undercloud-5.3.0-1.el7ost.noarch
openstack-puppet-modules-9.3.0-1.el7ost.noarch
ceph-mon-0.94.9-9.el7cp.x86_64
ceph-common-0.94.9-9.el7cp.x86_64
ceph-radosgw-0.94.9-9.el7cp.x86_64
ceph-osd-0.94.9-9.el7cp.x86_64
ceph-selinux-0.94.9-9.el7cp.x86_64
ceph-0.94.9-9.el7cp.x86_64

Steps to reproduce:
1. The setup was deployed with:

cd ; openstack overcloud deploy \
  --debug \
  --log-file ~/pilot/overcloud_deployment.log \
  -t 400 \
  --stack overcloud \
  --templates ~/pilot/templates/overcloud \
  -e ~/pilot/templates/overcloud/environments/network-isolation.yaml \
  -e ~/pilot/templates/network-environment.yaml \
  -e ~/pilot/templates/node-placement.yaml \
  -e ~/pilot/templates/overcloud/environments/storage-environment.yaml \
  -e ~/pilot/templates/dell-environment.yaml \
  -e ~/pilot/templates/overcloud/environments/puppet-pacemaker.yaml \
  --control-flavor control \
  --compute-flavor compute \
  --ceph-storage-flavor ceph-storage \
  --swift-storage-flavor swift-storage \
  --block-storage-flavor block-storage \
  --neutron-public-interface bond1 \
  --neutron-network-type vlan \
  --neutron-disable-tunneling \
  --os-auth-url http://192.168.120.101:5000/v2.0 \
  --os-project-name admin \
  --os-user-id admin \
  --os-password 69345e1089ebd13bc7183a35269e3c060ff4c460 \
  --control-scale 3 \
  --compute-scale 3 \
  --ceph-storage-scale 3 \
  --ntp-server 0.centos.pool.ntp.org \
  --neutron-network-vlan-ranges physint:201:220,physext \
  --neutron-bridge-mappings physint:br-tenant,physext:br-ex

2. Upgraded the undercloud to OSP10.
3. Successfully completed the step with: -e major-upgrade-ceilometer-wsgi-mitaka-newton.yaml
4. Successfully completed the step with: -e major-upgrade-pacemaker-init.yaml
5.
Attempted the step with "-e major-upgrade-pacemaker.yaml"

Result:
I see no concrete error with the failure, but:
####################################################################################################
[stack@director ~]$ heat resource-list -n5 overcloud|grep -v COMPLE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
| resource_name             | physical_resource_id                 | resource_type                      | resource_status | updated_time         | stack_name                                                                  |
| UpdateWorkflow            | 8a044aee-0bc7-4b9a-9c22-00a38e64c1ab | OS::TripleO::Tasks::UpdateWorkflow | UPDATE_FAILED   | 2017-06-30T21:25:16Z | overcloud                                                                   |
| CephMonUpgradeDeployment  | 520e885c-e1fc-4094-a9d7-831c877f3fb3 | OS::Heat::SoftwareDeploymentGroup  | CREATE_FAILED   | 2017-06-30T21:25:21Z | overcloud-UpdateWorkflow-pz532jqylhv2                                       |
| 0                         | 34342ccf-8cfb-4098-ad9f-b5fe046015b3 | OS::Heat::SoftwareDeployment       | CREATE_FAILED   | 2017-06-30T21:27:21Z | overcloud-UpdateWorkflow-pz532jqylhv2-CephMonUpgradeDeployment-ydhwsxjjv3ga |
####################################################################################################
[stack@director ~]$ heat deployment-show 34342ccf-8cfb-4098-ad9f-b5fe046015b3
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED",
  "server_id": "c7e83ce4-945f-4342-82c5-8c4082dad060",
  "config_id": "c6f9e505-31c4-4c0c-a02c-136041b8917a",
  "output_values": {
    "deploy_stdout": "INFO: starting c6f9e505-31c4-4c0c-a02c-136041b8917a\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n",
    "deploy_stderr": "",
    "deploy_status_code": 124
  },
  "creation_time": "2017-06-30T21:27:22Z",
  "updated_time": "2017-06-30T21:34:03Z",
  "input_values": {
    "update_identifier": "",
    "deploy_identifier": "1498857598"
  },
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 124",
  "id": "34342ccf-8cfb-4098-ad9f-b5fe046015b3"
}
####################################################################################################
[stack@director ~]$ openstack stack failures list overcloud
overcloud.UpdateWorkflow.CephMonUpgradeDeployment.0:
  resource_type: OS::Heat::SoftwareDeployment
  physical_resource_id: 34342ccf-8cfb-4098-ad9f-b5fe046015b3
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
  deploy_stdout: |
    ...
WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK WARNING: Waiting for Ceph cluster status to go HEALTH_OK (truncated, view all with --long) deploy_stderr: | ################################################################################################################################################### All the pcs resources are UP: [root@overcloud-controller-0 ~]# pcs status Cluster name: tripleo_cluster Stack: corosync Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum Last updated: Fri Jun 30 22:02:41 2017 Last change: Fri Jun 30 21:15:31 2017 by hacluster via crmd on overcloud-controller-1 3 nodes and 124 resources configured Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Full list of resources: ip-192.168.140.121 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.120.127 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.120.126 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: haproxy-clone [haproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: galera-master [galera] Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: memcached-clone [memcached] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] ip-192.168.190.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.170.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.140.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-core-clone [openstack-core] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: redis-master [redis] Masters: [ overcloud-controller-1 ] Slaves: [ overcloud-controller-0 overcloud-controller-2 ] Clone Set: mongod-clone [mongod] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-aodh-evaluator-clone [openstack-aodh-evaluator] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-l3-agent-clone [neutron-l3-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-netns-cleanup-clone [neutron-netns-cleanup] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 Clone Set: openstack-heat-engine-clone [openstack-heat-engine] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: 
openstack-aodh-listener-clone [openstack-aodh-listener] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-aodh-notifier-clone [openstack-aodh-notifier] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-clone [openstack-heat-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-glance-api-clone [openstack-glance-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-api-clone [openstack-nova-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-sahara-api-clone [openstack-sahara-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-glance-registry-clone [openstack-glance-registry] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-cinder-api-clone [openstack-cinder-api] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: delay-clone [delay] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: neutron-server-clone [neutron-server] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: httpd-clone [httpd] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Clone Set: openstack-heat-api-cfn-clone 
[openstack-heat-api-cfn]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
 Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
####################################################################################################
Found the following for os-collect-config on controller-0:

Jun 30 21:34:00 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-06-30 21:34:00,646] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/c6f9e505-31c4-4c0c-a02c-136041b8917a. [124]
####################################################################################################
[root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/c6f9e505-31c4-4c0c-a02c-136041b8917a
#!/bin/bash
ignore_ceph_upgrade_warnings='False'
#!/bin/bash
set -eu
set -o pipefail

echo INFO: starting $(basename "$0")

# Exit if not running
if ! pidof ceph-mon &> /dev/null; then
    echo INFO: ceph-mon is not running, skipping
    exit 0
fi

# Exit if not Hammer
INSTALLED_VERSION=$(ceph --version | awk '{print $3}')
if ! [[ "$INSTALLED_VERSION" =~ ^0\.94.* ]]; then
    echo INFO: version of Ceph installed is not 0.94, skipping
    exit 0
fi

CEPH_STATUS=$(ceph health | awk '{print $1}')
if [ ${CEPH_STATUS} = HEALTH_ERR ]; then
    echo ERROR: Ceph cluster status is HEALTH_ERR, cannot be upgraded
    exit 1
fi

# Useful when upgrading with OSDs num < replica size
if [[ ${ignore_ceph_upgrade_warnings:-False} != [Tt]rue ]]; then
    timeout 300 bash -c "while [ ${CEPH_STATUS} != HEALTH_OK ]; do
      echo WARNING: Waiting for Ceph cluster status to go HEALTH_OK;
      sleep 30;
      CEPH_STATUS=$(ceph health | awk '{print $1}')
    done"
fi

MON_PID=$(pidof ceph-mon)
MON_ID=$(hostname -s)

# Stop daemon using Hammer sysvinit script
service ceph stop mon.${MON_ID}

# Ensure it's stopped
timeout 60 bash -c "while kill -0 ${MON_PID} 2> /dev/null; do
  sleep 2;
done"

# Update to Jewel
yum -y -q update ceph-mon ceph

# Restart/Exit if not on Jewel, only in that case we need the changes
UPDATED_VERSION=$(ceph --version | awk '{print $3}')
if [[ "$UPDATED_VERSION" =~ ^0\.94.* ]]; then
    echo WARNING: Ceph was not upgraded, restarting daemons
    service ceph start mon.${MON_ID}
elif [[ "$UPDATED_VERSION" =~ ^10\.2.* ]]; then
    # RPM could own some of these but we can't take risks on the pre-existing files
    for d in /var/lib/ceph/mon /var/log/ceph /var/run/ceph /etc/ceph; do
        chown -L -R ceph:ceph $d || echo WARNING: chown of $d failed
    done

    # Replay udev events with newer rules
    udevadm trigger

    # Enable systemd unit
    systemctl enable ceph-mon.target
    systemctl enable ceph-mon@${MON_ID}
    systemctl start ceph-mon@${MON_ID}

    # Wait for daemon to be back in the quorum
    timeout 300 bash -c "until (ceph quorum_status | jq .quorum_names | grep -sq ${MON_ID}); do
      echo WARNING: Waiting for mon.${MON_ID} to re-join quorum;
      sleep 10;
    done"

    # if tunables become legacy, cluster status will be HEALTH_WARN causing
    # upgrade to fail on following node
    ceph osd crush tunables default

    echo INFO: Ceph was upgraded to Jewel
else
    echo ERROR: Ceph was upgraded to an unknown release, daemon is stopped, need manual intervention
    exit 1
fi
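For reference on the 124 status code: GNU coreutils `timeout` exits with 124 when the wrapped command is still running at the deadline, which is exactly what happens when the HEALTH_OK wait loop in the script above (timeout 300, sleep 30 per iteration, hence the ten WARNING lines) never sees a healthy cluster. A minimal sketch, not from this report, that reproduces the status code:

# Reproduce the 124 exit status from GNU timeout.
timeout 5 bash -c 'while true; do
  echo "WARNING: Waiting for Ceph cluster status to go HEALTH_OK"
  sleep 1
done'
echo "exit status: $?"   # prints 124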
####################################################################################################
On controller-1 (the only controller where yum update actually occurred):

● ceph-radosgw.service not-found active exited ceph-radosgw.service
####################################################################################################
Here it's important to note that on OSP9 I found the following differences between the default templates and the ones used:

1)
[stack@director overcloud]$ diff ./puppet/manifests/overcloud_controller_pacemaker.pp /usr/share/openstack-tripleo-heat-templates/./puppet/manifests/overcloud_controller_pacemaker.pp
597d596
<   include ::ceph::profile::rgw
1066,1067c1065
<     # enabled => $non_pcmk_start,
<     enabled => false,
---
>     enabled => $non_pcmk_start,
2107,2113d2104
<   if $ceph::profile::params::enable_rgw
<   {
<     exec { 'create_radosgw_keyring':
<       command => "/usr/bin/ceph auth get-or-create client.radosgw.gateway mon 'allow rwx' osd 'allow rwx' -o /etc/ceph/ceph.client.radosgw.gateway.keyring" ,
<       creates => "/etc/ceph/ceph.client.radosgw.gateway.keyring" ,
<     }
<   }

2)
[stack@director overcloud]$ diff ./puppet/hieradata/ceph.yaml /usr/share/openstack-tripleo-heat-templates/./puppet/hieradata/ceph.yaml
1,19c1,3
< # Copyright (c) 2016 Dell Inc. or its subsidiaries.
< #
< # Licensed under the Apache License, Version 2.0 (the "License");
< # you may not use this file except in compliance with the License.
< # You may obtain a copy of the License at
< #
< #     http://www.apache.org/licenses/LICENSE-2.0
< #
< # Unless required by applicable law or agreed to in writing, software
< # distributed under the License is distributed on an "AS IS" BASIS,
< # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
< # See the License for the specific language governing permissions and
< # limitations under the License.
<
< ceph::profile::params::osd_journal_size: 10000
< # CHANGEME: Change osd_pool_default_pg_num and osd_pool_default_pgp_num to
< # be the smallest value in 'ceph_pool_pgs' (which appears below).
< ceph::profile::params::osd_pool_default_pg_num: 256
< ceph::profile::params::osd_pool_default_pgp_num: 256
---
> ceph::profile::params::osd_journal_size: 1024
> ceph::profile::params::osd_pool_default_pg_num: 32
> ceph::profile::params::osd_pool_default_pgp_num: 32
21c5,6
< ceph::profile::params::osd_pool_default_min_size: 2
---
> ceph::profile::params::osd_pool_default_min_size: 1
> ceph::profile::params::osds: {/srv/data: {}}
24,68d8
<
< # CHANGEME:
< # Modify the 'osds' parameter to reflect the list of drives to be used as
< # OSDs. A configuration that colocates Ceph journals on every OSD should look
< # like this:
< # ceph::profile::params::osds:
< #   '/dev/sdb': {}
< #   '/dev/sdc': {}
< # ... and so on.
< # A configuration that places Ceph journals on dedicated drives (such as SSDs)
< # should look like this:
< # ceph::profile::params::osds:
< #   '/dev/sde':
< #     journal: '/dev/sdb'
< #   '/dev/sdf':
< #     journal: '/dev/sdb'
< # ... and so on.
< ceph::profile::params::osds:
<   '/dev/sdd':
<     journal: '/dev/sdb'
<   '/dev/sde':
<     journal: '/dev/sdb'
<   '/dev/sdf':
<     journal: '/dev/sdb'
<   '/dev/sdg':
<     journal: '/dev/sdb'
<   '/dev/sdh':
<     journal: '/dev/sdc'
<   '/dev/sdi':
<     journal: '/dev/sdc'
<   '/dev/sdj':
<     journal: '/dev/sdc'
<   '/dev/sdk':
<     journal: '/dev/sdc'
<
< # CHANGEME: The following table lists the pg_num and pgp_num values for each
< # of the specified pools. Change the value for each pool based on the size of
< # Ceph cluster, using http://ceph.com/pgcalc for guidance. Small pools used by
< # the RADOS Gateway (any pool whose name begins with '.'), other than the
< # '.rgw.buckets' pool, should not be listed.
< ceph_pool_pgs:
<   'volumes': 1024
<   'vms': 256
<   'images': 256
<   '.rgw.buckets': 512

Checking the status of ceph on this setup with the failed upgrade:

overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_OK
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48179: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1120 MB used, 22331 GB / 22333 GB avail
                2112 active+clean
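Since a single `ceph status` only gives a point-in-time view, it can be worth confirming that the cluster actually stays HEALTH_OK for several minutes before re-running the step. A rough helper of that kind, not part of the original report (the controller address is an assumption; use any controller reachable as heat-admin from the undercloud):

# Hypothetical helper: poll Ceph health for ~5 minutes to confirm it stays
# HEALTH_OK rather than flipping in and out of HEALTH_WARN.
CONTROLLER=192.168.120.133
for i in $(seq 1 20); do
    state=$(ssh heat-admin@${CONTROLLER} "sudo ceph health" | awk '{print $1}')
    echo "$(date +%T) ${state}"
    sleep 15
done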
Additional run of the same step results again in failure, but looking for errors in logs on controllers I find this on controller-0: Jul 04 20:23:05 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-07-04 20:23:05,202] (heat-config) [ERROR] Error running /var/lib/heat-config/heat-config-script/32d395fb-22ad-436a-8526-1aba7b75d735. [1] [root@overcloud-controller-0 ~]# cat /var/lib/heat-config/heat-config-script/32d395fb-22ad-436a-8526-1aba7b75d735 #!/bin/bash set -eu DEBUG="true" # set false if the verbosity is a problem SCRIPT_NAME=$(basename $0) function log_debug { if [[ $DEBUG = "true" ]]; then echo "`date` $SCRIPT_NAME tripleo-upgrade $(facter hostname) $1" fi } function is_bootstrap_node { if [ "$(hiera -c /etc/puppet/hiera.yaml bootstrap_nodeid)" = "$(facter hostname)" ]; then log_debug "Node is bootstrap" echo "true" fi } function check_resource_pacemaker { if [ "$#" -ne 3 ]; then echo_error "ERROR: check_resource function expects 3 parameters, $# given" exit 1 fi local service=$1 local state=$2 local timeout=$3 if [[ -z $(is_bootstrap_node) ]] ; then log_debug "Node isn't bootstrap, skipping check for $service to be $state here " return else log_debug "Node is bootstrap checking $service to be $state here" fi if [ "$state" = "stopped" ]; then match_for_incomplete='Started' else # started match_for_incomplete='Stopped' fi nodes_local=$(pcs status | grep ^Online | sed 's/.*\[ \(.*\) \]/\1/g' | sed 's/ /\|/g') if timeout -k 10 $timeout crm_resource --wait; then node_states=$(pcs status --full | grep "$service" | grep -v Clone | { egrep "$nodes_local" || true; } ) if echo "$node_states" | grep -q "$match_for_incomplete"; then echo_error "ERROR: cluster finished transition but $service was not in $state state, exiting." exit 1 else echo "$service has $state" fi else echo_error "ERROR: cluster remained unstable for more than $timeout seconds, exiting." 
exit 1 fi } function pcmk_running { if [[ $(systemctl is-active pacemaker) = "active" ]] ; then echo "true" fi } function is_systemd_unknown { local service=$1 if [[ $(systemctl is-active "$service") = "unknown" ]]; then log_debug "$service found to be unkown to systemd" echo "true" fi } function grep_is_cluster_controlled { local service=$1 if [[ -n $(systemctl status $service -l | grep Drop-In -A 5 | grep pacemaker) || -n $(systemctl status $service -l | grep "Cluster Controlled $service") ]] ; then log_debug "$service is pcmk managed from systemctl grep" echo "true" fi } function is_systemd_managed { local service=$1 #if we have pcmk check to see if it is managed there if [[ -n $(pcmk_running) ]]; then if [[ -z $(pcs status --full | grep $service) && -z $(is_systemd_unknown $service) ]] ; then log_debug "$service found to be systemd managed from pcs status" echo "true" fi else # if it is "unknown" to systemd, then it is pacemaker managed if [[ -n $(is_systemd_unknown $service) ]] ; then return elif [[ -z $(grep_is_cluster_controlled $service) ]] ; then echo "true" fi fi } function is_pacemaker_managed { local service=$1 #if we have pcmk check to see if it is managed there if [[ -n $(pcmk_running) ]]; then if [[ -n $(pcs status --full | grep $service) ]]; then log_debug "$service found to be pcmk managed from pcs status" echo "true" fi else # if it is unknown to systemd, then it is pcmk managed if [[ -n $(is_systemd_unknown $service) ]]; then echo "true" elif [[ -n $(grep_is_cluster_controlled $service) ]] ; then echo "true" fi fi } function is_managed { local service=$1 if [[ -n $(is_pacemaker_managed $service) || -n $(is_systemd_managed $service) ]]; then echo "true" fi } function check_resource_systemd { if [ "$#" -ne 3 ]; then echo_error "ERROR: check_resource function expects 3 parameters, $# given" exit 1 fi local service=$1 local state=$2 local timeout=$3 local check_interval=3 if [ "$state" = "stopped" ]; then match_for_incomplete='active' else # started match_for_incomplete='inactive' fi log_debug "Going to check_resource_systemd for $service to be $state" #sanity check is systemd managed: if [[ -z $(is_systemd_managed $service) ]]; then echo "ERROR - $service not found to be systemd managed." exit 1 fi tstart=$(date +%s) tend=$(( $tstart + $timeout )) while (( $(date +%s) < $tend )); do if [[ "$(systemctl is-active $service)" = $match_for_incomplete ]]; then echo "$service not yet $state, sleeping $check_interval seconds." sleep $check_interval else echo "$service is $state" return fi done echo "Timed out waiting for $service to go to $state after $timeout seconds" exit 1 } function check_resource { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "ERROR $service managed by both systemd and pcmk - SKIPPING" return fi if [[ -n $pcmk_managed ]]; then check_resource_pacemaker $@ return elif [[ -n $systemd_managed ]]; then check_resource_systemd $@ return fi log_debug "ERROR cannot check_resource for $service, not managed here?" } function manage_systemd_service { local action=$1 local service=$2 log_debug "Going to systemctl $action $service" systemctl $action $service } function manage_pacemaker_service { local action=$1 local service=$2 # not if pacemaker isn't running! 
if [[ -z $(pcmk_running) ]]; then echo "$(facter hostname) pacemaker not active, skipping $action $service here" elif [[ -n $(is_bootstrap_node) ]]; then log_debug "Going to pcs resource $action $service" pcs resource $action $service fi } function stop_or_disable_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "Skipping stop_or_disable $service due to management conflict" return fi log_debug "Stopping or disabling $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service disable $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service stop $service return fi log_debug "ERROR: $service not managed here?" } function start_or_enable_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "Skipping start_or_enable $service due to management conflict" return fi log_debug "Starting or enabling $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service enable $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service start $service return fi log_debug "ERROR $service not managed here?" } function restart_service { local service=$1 local pcmk_managed=$(is_pacemaker_managed $service) local systemd_managed=$(is_systemd_managed $service) if [[ -n $pcmk_managed && -n $systemd_managed ]] ; then log_debug "ERROR $service managed by both systemd and pcmk - SKIPPING" return fi log_debug "Restarting $service" if [[ -n $pcmk_managed ]]; then manage_pacemaker_service restart $service return elif [[ -n $systemd_managed ]]; then manage_systemd_service restart $service return fi log_debug "ERROR $service not managed here?" } function echo_error { echo "$@" | tee /dev/fd2 } # swift is a special case because it is/was never handled by pacemaker # when stand-alone swift is used, only swift-proxy is running on controllers function systemctl_swift { services=( openstack-swift-account-auditor openstack-swift-account-reaper openstack-swift-account-replicator openstack-swift-account \ openstack-swift-container-auditor openstack-swift-container-replicator openstack-swift-container-updater openstack-swift-container \ openstack-swift-object-auditor openstack-swift-object-replicator openstack-swift-object-updater openstack-swift-object openstack-swift-proxy ) local action=$1 case $action in stop) services=$(systemctl | grep openstack-swift- | grep running | awk '{print $1}') ;; start) enable_swift_storage=$(hiera -c /etc/puppet/hiera.yaml tripleo::profile::base::swift::storage::enable_swift_storage) if [[ $enable_swift_storage != "true" ]]; then services=( openstack-swift-proxy ) fi ;; *) echo "Unknown action $action passed to systemctl_swift" exit 1 ;; # shouldn't ever happen... 
esac for service in ${services[@]}; do manage_systemd_service $action $service done } # Special-case OVS for https://bugs.launchpad.net/tripleo/+bug/1635205 # Update condition and add --notriggerun for +bug/1669714 function special_case_ovs_upgrade_if_needed { if rpm -qa | grep "^openvswitch-2.5.0-14" || rpm -q --scripts openvswitch | awk '/postuninstall/,/*/' | grep "systemctl.*try-restart" ; then echo "Manual upgrade of openvswitch - ovs-2.5.0-14 or restart in postun detected" rm -rf OVS_UPGRADE mkdir OVS_UPGRADE && pushd OVS_UPGRADE echo "Attempting to downloading latest openvswitch with yumdownloader" yumdownloader --resolve openvswitch for pkg in $(ls -1 *.rpm); do if rpm -U --test $pkg 2>&1 | grep "already installed" ; then echo "Looks like newer version of $pkg is already installed, skipping" else echo "Updating $pkg with --nopostun --notriggerun" rpm -U --replacepkgs --nopostun --notriggerun $pkg fi done popd else echo "Skipping manual upgrade of openvswitch - no restart in postun detected" fi } # update os-net-config before ovs see https://bugs.launchpad.net/tripleo/+bug/1695893 function update_network() { set +e yum -q -y update os-net-config return_code=$? echo "yum update os-net-config return code: $return_code" # Writes any changes caused by alterations to os-net-config and bounces the # interfaces *before* restarting the cluster. os-net-config -c /etc/os-net-config/config.json -v --detailed-exit-codes RETVAL=$? if [[ $RETVAL == 2 ]]; then echo "os-net-config: interface configuration files updated successfully" elif [[ $RETVAL != 0 ]]; then echo "ERROR: os-net-config configuration failed" exit $RETVAL fi set -e # special case https://bugs.launchpad.net/tripleo/+bug/1635205 +bug/1669714 special_case_ovs_upgrade_if_needed } #!/bin/bash # Special pieces of upgrade migration logic go into this # file. E.g. Pacemaker cluster transitions for existing deployments, # matching changes to overcloud_controller_pacemaker.pp (Puppet # handles deployment, this file handles migrations). # # This file shouldn't execute any action on its own, all logic should # be wrapped into bash functions. Upgrade scripts will source this # file and call the functions defined in this file where appropriate. # # The migration functions should be idempotent. If the migration has # been already applied, it should be possible to call the function # again without damaging the deployment or failing the upgrade. # If the major version of mysql is going to change after the major # upgrade, the database must be upgraded on disk to avoid failures # due to internal incompatibilities between major mysql versions # https://bugs.launchpad.net/tripleo/+bug/1587449 # This function detects whether a database upgrade is required # after a mysql package upgrade. It returns 0 when no major upgrade # has to take place, 1 otherwise. function is_mysql_upgrade_needed { # The name of the package which provides mysql might differ # after the upgrade. Consider the generic package name, which # should capture the major version change (e.g. 5.5 -> 10.1) local name="mariadb" local output local ret set +e output=$(yum -q check-update $name) ret=$? set -e if [ $ret -ne 100 ]; then # no updates so we exit echo "0" return fi local currentepoch=$(rpm -q --qf "%{epoch}" $name) local currentversion=$(rpm -q --qf "%{version}" $name | cut -d. 
-f-2) local currentrelease=$(rpm -q --qf "%{release}" $name) local newoutput=$(repoquery -a --pkgnarrow=updates --qf "%{epoch} %{version} %{release}\n" $name) local newepoch=$(echo "$newoutput" | awk '{ print $1 }') local newversion=$(echo "$newoutput" | awk '{ print $2 }' | cut -d. -f-2) local newrelease=$(echo "$newoutput" | awk '{ print $3 }') # With this we trigger the dump restore/path if we change either epoch or # version in the package If only the release tag changes we do not do it # FIXME: we could refine this by trying to parse the mariadb version # into X.Y.Z and trigger the update only if X and/or Y change. output=$(python -c "import rpm; rc = rpm.labelCompare((\"$currentepoch\", \"$currentversion\", None), (\"$newepoch\", \"$newversion\", None)); print rc") if [ "$output" != "-1" ]; then echo "0" return fi echo "1" } # This function returns the list of services to be migrated away from pacemaker # and to systemd. The reason to have these services in a separate function is because # this list is needed in three different places: major_upgrade_controller_pacemaker_{1,2} # and in the function to migrate the cluster from full HA to HA NG function services_to_migrate { # The following PCMK resources the ones the we are going to delete PCMK_RESOURCE_TODELETE=" httpd-clone memcached-clone mongod-clone neutron-dhcp-agent-clone neutron-l3-agent-clone neutron-metadata-agent-clone neutron-netns-cleanup-clone neutron-openvswitch-agent-clone neutron-ovs-cleanup-clone neutron-server-clone openstack-aodh-evaluator-clone openstack-aodh-listener-clone openstack-aodh-notifier-clone openstack-ceilometer-central-clone openstack-ceilometer-collector-clone openstack-ceilometer-notification-clone openstack-cinder-api-clone openstack-cinder-scheduler-clone openstack-glance-api-clone openstack-glance-registry-clone openstack-gnocchi-metricd-clone openstack-gnocchi-statsd-clone openstack-heat-api-cfn-clone openstack-heat-api-clone openstack-heat-api-cloudwatch-clone openstack-heat-engine-clone openstack-nova-api-clone openstack-nova-conductor-clone openstack-nova-consoleauth-clone openstack-nova-novncproxy-clone openstack-nova-scheduler-clone openstack-sahara-api-clone openstack-sahara-engine-clone " echo $PCMK_RESOURCE_TODELETE } # This function will migrate a mitaka system where all the resources are managed # via pacemaker to a newton setup where only a few services will be managed by pacemaker # On a high-level it will operate as follows: # 1. Set the cluster in maintenance-mode so no start/stop action will actually take place # during the conversion # 2. Remove all the colocation constraints and then the ordering constraints, except the # ones related to haproxy/VIPs which exist in Newton as well # 3. Take the cluster out of maintenance-mode # 4. Remove all the resources that won't be managed by pacemaker in newton. The # outcome will be # that they are stopped and removed from pacemakers control # 5. Do a resource cleanup to make sure the cluster is in a clean state function migrate_full_to_ng_ha { if [[ -n $(pcmk_running) ]]; then pcs property set maintenance-mode=true # First we go through all the colocation constraints (except the ones # we want to keep, i.e. 
the haproxy/ip ones) and we remove those COL_CONSTRAINTS=$(pcs config show | sed -n '/^Colocation Constraints:$/,/^$/p' | grep -v "Colocation Constraints:" | egrep -v "ip-.*haproxy" | awk '{print $NF}' | cut -f2 -d: |cut -f1 -d\)) for constraint in $COL_CONSTRAINTS; do log_debug "Deleting colocation constraint $constraint from CIB" pcs constraint remove "$constraint" done # Now we kill all the ordering constraints (except the haproxy/ip ones) ORD_CONSTRAINTS=$(pcs config show | sed -n '/^Ordering Constraints:/,/^Colocation Constraints:$/p' | grep -v "Ordering Constraints:" | awk '{print $NF}' | cut -f2 -d: |cut -f1 -d\)) for constraint in $ORD_CONSTRAINTS; do log_debug "Deleting ordering constraint $constraint from CIB" pcs constraint remove "$constraint" done # At this stage all the pacemaker resources are removed from the CIB. # Once we remove the maintenance-mode those systemd resources will keep # on running. They shall be systemd enabled via the puppet converge # step later on pcs property set maintenance-mode=false # At this stage there are no constraints whatsoever except the haproxy/ip ones # which we want to keep. We now disable and then delete each resource # that will move to systemd. # We want the systemd resources be stopped before doing "yum update", # that way "systemctl try-restart <service>" is no-op because the # service was down already PCS_STATUS_OUTPUT="$(pcs status)" for resource in $(services_to_migrate) "delay-clone" "openstack-core-clone"; do if echo "$PCS_STATUS_OUTPUT" | grep "$resource"; then log_debug "Deleting $resource from the CIB" if ! pcs resource disable "$resource" --wait=600; then echo_error "ERROR: resource $resource failed to be disabled" exit 1 fi pcs resource delete --force "$resource" else log_debug "Service $resource not found as a pacemaker resource, not trying to delete." fi done # We need to do a pcs resource cleanup here + crm_resource --wait to # make sure the cluster is in a clean state before we stop everything, # upgrade and restart everything pcs resource cleanup # We are making sure here that the cluster is stable before proceeding if ! timeout -k 10 600 crm_resource --wait; then echo_error "ERROR: cluster remained unstable after resource cleanup for more than 600 seconds, exiting." 
exit 1 fi fi } function disable_standalone_ceilometer_api { if [[ -n $(is_bootstrap_node) ]]; then if [[ -n $(is_pacemaker_managed openstack-ceilometer-api) ]]; then # Disable pacemaker resources for ceilometer-api manage_pacemaker_service disable openstack-ceilometer-api check_resource_pacemaker openstack-ceilometer-api stopped 600 pcs resource delete openstack-ceilometer-api --wait=600 fi fi } #!/bin/bash set -eu if [[ -n $(is_bootstrap_node) ]]; then # run gnocchi upgrade gnocchi-upgrade fi ############################################################################## pcs status looks as following: [root@overcloud-controller-0 ~]# pcs status Cluster name: tripleo_cluster Stack: corosync Current DC: overcloud-controller-0 (version 1.1.15-11.el7_3.5-e174ec8) - partition with quorum Last updated: Tue Jul 4 20:32:47 2017 Last change: Tue Jul 4 20:21:11 2017 by root via crm_resource on overcloud-controller-0 3 nodes and 19 resources configured Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Full list of resources: ip-192.168.140.121 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.120.127 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.120.126 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: haproxy-clone [haproxy] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: galera-master [galera] Masters: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] ip-192.168.190.5 (ocf::heartbeat:IPaddr2): Started overcloud-controller-0 ip-192.168.170.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-1 ip-192.168.140.120 (ocf::heartbeat:IPaddr2): Started overcloud-controller-2 Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] Master/Slave Set: redis-master [redis] Masters: [ overcloud-controller-1 ] Slaves: [ overcloud-controller-0 overcloud-controller-2 ] openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 Daemon Status: corosync: active/enabled pacemaker: active/enabled pcsd: active/enabled
Hi Sasha,

Adding a comment from the mail thread: before this step was started, the Ceph status was flipping between HEALTH_WARN and HEALTH_OK, with this pattern:

[stack@director ~]$ ssh heat-admin.120.133 "hostname; sudo ceph status"
overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_WARN
            crush map has legacy tunables (require bobtail, min is firefly)
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48626: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1121 MB used, 22331 GB / 22333 GB avail
                2112 active+clean

[stack@director ~]$ ssh heat-admin.120.133 "hostname; sudo ceph status"
overcloud-controller-1.fv1dci.org
    cluster 7cd26246-5d0d-11e7-9a49-525400d76882
     health HEALTH_OK
     monmap e2: 3 mons at {overcloud-controller-0=192.168.170.128:6789/0,overcloud-controller-1=192.168.170.124:6789/0,overcloud-controller-2=192.168.170.121:6789/0}
            election epoch 12, quorum 0,1,2 overcloud-controller-2,overcloud-controller-1,overcloud-controller-0
     osdmap e89: 24 osds: 24 up, 24 in
      pgmap v48628: 2112 pgs, 5 pools, 45659 kB data, 19 objects
            1121 MB used, 22331 GB / 22333 GB avail
                2112 active+clean

Re-running the command showed the status back at HEALTH_OK. So we can see the 124 error in the logs:

Jun 30 21:34:00 overcloud-controller-0.fv1dci.org os-collect-config[4122]: [2017-06-30 21:34:00,653] (heat-config) [INFO] {"deploy_stdout": "INFO: starting c6f9e505-31c4-4c0c-a02c-136041b8917a\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n", "deploy_stderr": "", "deploy_status_code": 124}

but this is from Jun 30, and was almost certainly caused by the flipping status. During the Ceph upgrade on the controller we run this code [0], which should have resolved the flipping. For reference, this is where the 124 error comes from: https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_ceph_mon.sh#L28..L33

Now, from comment 2 it appears that you have reached step 5 of the controller upgrade, since the gnocchi upgrade is done there [1], well after the Ceph controller upgrade. So I think you no longer have the Ceph problem (can you confirm?). That error must have been triggered by the flipping status, which was resolved once this code [0] was run. What you have now is a problem with the gnocchi database upgrade. We don't have the log from this run in the sosreport. Could you provide the current controller-0 sosreport and make sure that everything in /var/log/gnocchi is included, or add that manually?

Thanks,

[0] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_ceph_mon.sh#L74..L77
[1] https://github.com/openstack/tripleo-heat-templates/blob/stable/newton/extraconfig/tasks/major_upgrade_controller_pacemaker_5.sh#L5..L9
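If one wants to confirm that the legacy-tunables warning is indeed what makes the health flip, the following read-only checks (not from the report, plain upstream Ceph commands, run on a controller/monitor node) can help; the upgrade script itself already switches the profile with `ceph osd crush tunables default` after the mon upgrade [0]:

# Diagnostic only.
ceph health detail            # shows the exact warning text, e.g. the legacy tunables message
ceph osd crush show-tunables  # dumps the CRUSH tunables currently in effect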
Hi Sasha,

Thanks for the logs. The gnocchi upgrade fails because the 'metrics' pool is missing (from gnocchi-upgrade.log):

2017-07-04 20:23:04.984 22838 ERROR gnocchi   File "cradox.pyx", line 1047, in cradox.Rados.open_ioctx (cradox.c:12325)
2017-07-04 20:23:04.984 22838 ERROR gnocchi ObjectNotFound: error opening pool 'metrics'
2017-07-04 20:23:04.984 22838 ERROR gnocchi

If you are using externally managed Ceph OSDs, you have to create the pool and user yourself; this is addressed in the documentation referenced here: https://bugzilla.redhat.com/show_bug.cgi?id=1412295

From this bz: the Ceph driver needs a Ceph user and a pool already created. They can be created, for example, with:

ceph osd pool create metrics 8 8
ceph auth get-or-create client.gnocchi mon "allow r" osd "allow rwx pool=metrics"

Can you confirm that the Ceph OSDs are external? If that's the case, running the above commands before the upgrade should solve the problem.
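For completeness, a sketch of that workaround with a check before and after (the 8/8 pg counts are the example values from the referenced BZ; size them with http://ceph.com/pgcalc for a real cluster, and run this on a node that has a Ceph admin keyring):

ceph osd lspools                   # confirm that no 'metrics' pool exists yet
ceph osd pool create metrics 8 8
ceph auth get-or-create client.gnocchi mon "allow r" osd "allow rwx pool=metrics"
ceph auth get client.gnocchi       # verify the user and caps were created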
Hi Sasha, what is the latest news on this one?
Used "IgnoreCephUpgradeWarnings: true" and the upgrade proceeded.
After discussion with Sebastien and DFG:Ceph, the conclusion was that it would be best for Ceph not to emit a WARN for the tunables at all... and, until then, to rely only on the quorum check for the monitors upgrade (as ceph-ansible already does). This means changing the upgrade scripts in TripleO to always ignore the Ceph health warnings, which is confirmed to work per comment #15.
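For reference, the quorum-only gate that would remain is essentially the check the mon upgrade script already performs after restarting the monitor (see the script quoted earlier); stripped down, it looks like this:

# Quorum-only readiness check, mirroring the existing script's post-upgrade
# wait; run on the monitor being upgraded.
MON_ID=$(hostname -s)
timeout 300 bash -c "until (ceph quorum_status | jq .quorum_names | grep -sq ${MON_ID}); do
  echo WARNING: Waiting for mon.${MON_ID} to re-join quorum;
  sleep 10;
done"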
verified
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2654
closed, no need for needinfo.