Bug 1763175

Summary:	[FFU][ceph-ansible] Impossible to set health_osd_check_retries from THT when migrating to containers
Product:	Red Hat OpenStack	Reporter:	Alex Stupnikov <astupnik>
Component:	openstack-tripleo-common	Assignee:	Giulio Fidente <gfidente>
Status:	CLOSED ERRATA	QA Contact:	Yogev Rabl <yrabl>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	13.0 (Queens)	CC:	assingh, dsavinea, emacchi, fpantano, gabrioux, gfidente, mburns, slinaber
Target Milestone:	z10	Keywords:	Triaged, ZStream
Target Release:	13.0 (Queens)
Hardware:	All
OS:	All
Whiteboard:
Fixed In Version:	openstack-tripleo-common-8.7.1-6.el7ost	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-03-10 11:22:02 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Alex Stupnikov 2019-10-18 12:09:11 UTC

Description of problem:

Our official documentation recommends increasing the restart delays for large Ceph clusters [1] (merged as a solution for bug #1620699). Basically, we recommend that customers set the following parameters using THT:

parameter_defaults:
  CephAnsibleExtraConfig:
    health_osd_check_delay: 40
    health_osd_check_retries: 30
    health_mon_check_retries: 10
    health_mon_check_delay: 20

In practice, this configuration change is not a silver bullet and does not actually work for bug #1620699 itself: the specified parameters are hard-coded in the rolling_update.yml playbook (where the values are reasonably high) and in the switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook (where they are quite low), so the values set through THT do not take effect there.
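
For a sense of scale, here is a rough sketch of the wait windows that the recommended values would give. It assumes each health check is an Ansible-style retry loop that makes up to `retries` attempts spaced `delay` seconds apart, so the upper bound is approximately retries * delay. The helper name max_wait_seconds is hypothetical and is not part of ceph-ansible or THT; the numbers are taken from the parameter_defaults block above.

```python
# Rough upper bound on how long an Ansible-style retry loop keeps waiting.
# Assumption: the check is retried up to `retries` times, `delay` seconds apart.
def max_wait_seconds(retries: int, delay: int) -> int:
    """Approximate worst-case wait in seconds (hypothetical helper)."""
    return retries * delay


# Values recommended in the FFU documentation, as quoted in this report.
recommended = {
    "osd": max_wait_seconds(retries=30, delay=40),
    "mon": max_wait_seconds(retries=10, delay=20),
}

for daemon, seconds in recommended.items():
    print(f"{daemon}: up to ~{seconds} s ({seconds / 60:.1f} min)")
```

Under that assumption the recommended settings allow roughly 20 minutes for OSDs and a little over 3 minutes for MONs. The same arithmetic can be applied to the values hard-coded in each playbook to see how much shorter the window is during the migration to containers.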

I understand that this issue should probably be handled in ceph-ansible (we can increase the hard-coded values) or in the documentation (we can tell customers to adjust the playbook), but I wanted to ask the THT developers to take a first look and decide which approach will work for us here.


[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-preparing_for_overcloud_upgrade#increasing-the-restart-delay-for-large-ceph-clusters

Comment 12 errata-xmlrpc 2020-03-10 11:22:02 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0760