Description of problem:
In the rolling_update.yml playbook, the default "health_osd_check_retries: 40" is too low for a cluster with a high percentage of used capacity. It causes the rolling upgrade to fail even when following the recommended settings in the documentation: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/upgrading-a-red-hat-ceph-storage-cluster

We should recommend that users increase this value based on how full their cluster is. In my case, with the cluster 50% full (218T used out of 539T), I had to increase "health_osd_check_retries" to 50 for the rolling upgrade to succeed.

Error seen in the log:
...
FAILED - RETRYING: waiting for clean pgs... (2 retries left).
FAILED - RETRYING: waiting for clean pgs... (1 retries left).
fatal: [c06-h01-6048r -> c05-h33-6018r]: FAILED! => {"attempts": 40, "changed": true, "cmd": ["ceph", "--cluster", "ceph", "-s", "--format", "json"], "delta": "0:00:00.220174", "end": "2018-11-01 19:44:24.910327", "failed": true, "rc": 0, "start": "2018-11-01 19:44:24.690153", "stderr": "", "stderr_lines": [], "stdout": "\n{\"fsid\":\"3937e662-4872-4e7b-b9c9-14e09d85c7af\",\"health\":{\"checks\":{\"OSDMAP_FLAGS\":{\"severity\":\"HEALTH_WARN\",\"summary\":{\"message\":\"noout,noscrub,nodeep-scrub flag(s) set\"}},\"PG_DEGRADED\":{\"severity\":\"HEALTH_WARN\",\"summary\":{\"message\":\"Degraded data redundancy: 836/208940469 objects degraded (0.000%), 37 pgs degraded\"}}},\"status\":\"HEALTH_WARN\", ...

Version-Release number of selected component (if applicable):
ceph-ansible: 3.1.5-1.el7cp
ceph version 12.2.5-59.el7cp

How reproducible:
Run rolling_update.yml on a cluster with a high percentage of used capacity (50% full).
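As a possible workaround until the documentation is updated, the retry count can be raised without editing the playbook, assuming health_osd_check_retries (and its companion health_osd_check_delay) are defined as ordinary playbook variables in rolling_update.yml, so that `-e` extra-vars override them. A sketch of the invocation (the playbook path and variable values are illustrative, not prescriptive):

```shell
# Override the "waiting for clean pgs" check for a heavily filled cluster.
# -e gives these variables the highest Ansible precedence, overriding the
# defaults (40 retries) set inside rolling_update.yml.
ansible-playbook infrastructure-playbooks/rolling_update.yml \
  -e health_osd_check_retries=50 \
  -e health_osd_check_delay=30
```

With 50 retries at a 30-second delay, the playbook waits up to about 25 minutes per node for PGs to return to a clean state, which was enough for my 50%-full cluster.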
Doc looks good to me.