Bug 1763175

Summary: [FFU][ceph-ansible] Impossible to set health_osd_check_retries from THT when migrating to containers
Product: Red Hat OpenStack Reporter: Alex Stupnikov <astupnik>
Component: openstack-tripleo-common Assignee: Giulio Fidente <gfidente>
Status: CLOSED ERRATA QA Contact: Yogev Rabl <yrabl>
Severity: medium Docs Contact:
Priority: medium    
Version: 13.0 (Queens) CC: assingh, dsavinea, emacchi, fpantano, gabrioux, gfidente, mburns, slinaber
Target Milestone: z10 Keywords: Triaged, ZStream
Target Release: 13.0 (Queens)   
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: openstack-tripleo-common-8.7.1-6.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-10 11:22:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2019-10-18 12:09:11 UTC
Description of problem:

Our official documentation recommends increasing the restart delays for large Ceph clusters [1] (merged as a solution for bug #1620699). Basically, we recommend that the customer set the following parameters using THT:

parameter_defaults:
  CephAnsibleExtraConfig:
    health_osd_check_delay: 40
    health_osd_check_retries: 30
    health_mon_check_retries: 10
    health_mon_check_delay: 20

The truth is that this configuration change is not a silver bullet and doesn't actually work for bug #1620699 itself: the specified parameters are hard-coded in the rolling_update.yml playbook (where the values are reasonably high) and in the switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook (where they are quite low).
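
To make the impact concrete, here is a rough estimate of the wait budget the recommended values would give if they were honored. It is only a sketch: it assumes the playbooks apply these variables as Ansible `retries` and `delay` on the health check tasks, so the maximum wait is about retries × delay, ignoring the run time of each check. The helper names are hypothetical and not part of ceph-ansible or THT.

```python
def max_wait_seconds(retries: int, delay: int) -> int:
    """Approximate upper bound (in seconds) on how long a retried check waits.

    Assumption: the check is attempted up to `retries` times with `delay`
    seconds between attempts; the duration of each attempt is ignored.
    """
    return retries * delay


# Values recommended in the documentation quoted in this report.
recommended = {
    "osd": {"retries": 30, "delay": 40},
    "mon": {"retries": 10, "delay": 20},
}


def wait_budget(settings: dict) -> dict:
    """Map each daemon type to its approximate maximum wait in seconds."""
    return {
        daemon: max_wait_seconds(values["retries"], values["delay"])
        for daemon, values in settings.items()
    }


if __name__ == "__main__":
    for daemon, seconds in wait_budget(recommended).items():
        print(f"{daemon}: up to ~{seconds} s ({seconds / 60:.1f} min)")
```

With the recommended values this gives roughly 20 minutes for OSD checks and a little over 3 minutes for MON checks. A playbook with lower hard-coded values allows proportionally less time before the check fails.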

I understand that this issue should probably be handled in ceph-ansible (we can increase the hard-coded values) or in the documentation (we can tell the customer to adjust the playbook), but I wanted to ask the THT developers to take a first look and decide which way will work for us here.


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-preparing_for_overcloud_upgrade#increasing-the-restart-delay-for-large-ceph-clusters

Comment 12 errata-xmlrpc 2020-03-10 11:22:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0760