Bug 1620699 - [docs] Major upgrade fails while checking OSD health and timing out
Summary: [docs] Major upgrade fails while checking OSD health and timing out
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Laura Marsh
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-08-23 10:39 UTC by PURANDHAR SAIRAM MANNIDI
Modified: 2019-03-26 16:04 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-26 16:04:08 UTC
Target Upstream Version:
Embargoed:
tshefi: automate_bug-



Description PURANDHAR SAIRAM MANNIDI 2018-08-23 10:39:56 UTC
Description of problem:

Major upgrade from OSP 11 to OSP 12 fails while checking OSD health.

Version-Release number of selected component (if applicable):
RH OSP 12

How reproducible:
Always

Steps to Reproduce:
1. Major upgrade with ceph (5 nodes, 4 OSDs/node)
2. Deployment fails while "waiting for clean pgs"

Actual results:
Deployment fails

Expected results:
Deployment should succeed

Additional info:
The issue appears to be in ceph-ansible/infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml, at the task "container - waiting for clean pgs...".

After increasing health_osd_check_delay and health_osd_check_retries from 15 seconds and 5 to 40 seconds and 30 respectively, the deployment succeeded.

The values were taken from ceph-ansible/infrastructure-playbooks/rolling_update.yml.
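
For context, the task in question follows the standard Ansible retry pattern: the cluster status is polled repeatedly, waiting health_osd_check_delay seconds between attempts and giving up after health_osd_check_retries attempts. A simplified sketch of such a task (illustrative only, not the literal playbook content):

  - name: waiting for clean pgs...
    command: ceph --cluster ceph -s --format json
    register: ceph_status
    # succeed only once every placement group reports active+clean
    until: >
      (ceph_status.stdout | from_json).pgmap.num_pgs ==
      ((ceph_status.stdout | from_json).pgmap.pgs_by_state
       | selectattr('state_name', 'equalto', 'active+clean')
       | map(attribute='count') | list | sum)
    retries: "{{ health_osd_check_retries }}"
    delay: "{{ health_osd_check_delay }}"

With the defaults (5 retries, 15-second delay), the check gives up after roughly 75 seconds of waiting, which may not be enough for a larger cluster to settle.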

Comment 1 Sébastien Han 2018-08-23 10:48:37 UTC
This does not look like a bug to me. Depending on the infrastructure, the timings need to be adjusted. We might need to include this in the documentation.

Comment 2 PURANDHAR SAIRAM MANNIDI 2018-08-23 11:44:55 UTC
Shouldn't those values be configurable via variables rather than by changing the default Ansible scripts?

Comment 3 Sébastien Han 2018-08-23 14:09:51 UTC
You can run Ansible with an extra variable, for example -e health_mon_check_retries=200, and this will work without editing the playbook file.
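
For illustration, a direct invocation of the migration playbook with such overrides might look like the following (the inventory path is a placeholder and the values are examples only):

  ansible-playbook -i inventory \
      infrastructure-playbooks/switch-from-non-containerized-to-containerized-ceph-daemons.yml \
      -e health_mon_check_retries=200 \
      -e health_osd_check_retries=30 \
      -e health_osd_check_delay=40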

Comment 4 PURANDHAR SAIRAM MANNIDI 2018-08-24 02:47:00 UTC
Changing the component from ceph-ansible to tripleo-common so that the same change can be made in the Mistral workbook, if that is where it should be set.

Comment 5 John Fulton 2018-08-27 20:25:07 UTC
Please update chapters 4.6 and 4.7 of the overcloud upgrade document [1] to include the following note.

"""
During the migration of Ceph to containers, each Ceph monitor and OSD is brought down sequentially, and the migration does not continue until the same service that was brought down is successfully brought back up. Ansible waits 15 seconds (the delay) and rechecks 5 times (the retries) for the service to come back; if the service does not come back, the migration stops so that the operator may intervene. Depending on the size of your Ceph cluster, you may need to increase the retry or delay values. The exact names of these parameters and their defaults are as follows:

 health_mon_check_retries: 5
 health_mon_check_delay: 15
 health_osd_check_retries: 5
 health_osd_check_delay: 15

For example, to have the cluster recheck 30 times and wait 40 seconds between each check, pass the following parameters in a YAML file with -e to the 'openstack overcloud deploy' command.

parameter_defaults:
  CephAnsibleExtraConfig:
    health_osd_check_delay: 40
    health_osd_check_retries: 30
"""

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/12/html/upgrading_red_hat_openstack_platform/assembly-preparing_for_overcloud_upgrade#preparing_for_ceph_storage_node_upgrades
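
As a usage sketch, assuming the parameter_defaults block above is saved to a file named ceph-health-checks.yaml (the file name is hypothetical), it would be passed alongside the deployment's existing environment files:

  openstack overcloud deploy --templates \
      -e ceph-health-checks.yaml   # plus the environment files already used by the deployment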

Comment 6 Tzach Shefi 2019-01-29 05:58:44 UTC
Doc bug; nothing for QE to test or automate with regard to the close-loop process.

