Description of problem:

Our official documentation recommends increasing restart delays for large Ceph clusters [1] (merged as a solution for bug #1620699). Basically, we recommend that customers set the following parameters using THT:

  parameter_defaults:
    CephAnsibleExtraConfig:
      health_osd_check_delay: 40
      health_osd_check_retries: 30
      health_mon_check_retries: 10
      health_mon_check_delay: 20

The truth is that this configuration change is not a silver bullet and does not actually resolve bug #1620699 itself: the parameters in question are hard-coded in the rolling_update.yml playbook (where they are reasonably high) and in the switch-from-non-containerized-to-containerized-ceph-daemons.yml playbook (where they are quite low).

I understand that this issue should probably be handled either by ceph-ansible (we can increase the hard-coded values) or by documentation (we can tell customers to adjust the playbook), but I wanted to ask the THT developers to take a first look and decide which approach will work for us here.

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/13/html/fast_forward_upgrades/assembly-preparing_for_overcloud_upgrade#increasing-the-restart-delay-for-large-ceph-clusters
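For reference, a minimal sketch of how the documented override is typically applied: the parameter_defaults block goes into a custom environment file that is passed to the overcloud deploy/upgrade command with -e (the file name below is illustrative, and the exact set of other -e arguments depends on the deployment). Note that, per the issue described above, these variables do not change the values hard-coded in the two playbooks mentioned.

  # ceph-restart-delay.yaml (illustrative file name)
  parameter_defaults:
    CephAnsibleExtraConfig:
      health_osd_check_delay: 40
      health_osd_check_retries: 30
      health_mon_check_retries: 10
      health_mon_check_delay: 20

  # Passed alongside the deployment's existing environment files:
  $ openstack overcloud deploy --templates \
      -e <existing environment files> \
      -e ceph-restart-delay.yaml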
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0760