Description of problem:
In the rolling_update.yml playbook, the default "health_osd_check_retries: 40" is too low for a cluster with a high percentage of used capacity. It causes the rolling upgrade to fail even when following the recommended settings in the documentation: https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/3/html/installation_guide_for_red_hat_enterprise_linux/upgrading-a-red-hat-ceph-storage-cluster

We should recommend that users increase this value based on how full their cluster is. In my case, with the cluster 50% full (218T used out of 539T), I had to increase "health_osd_check_retries" to 50 for the rolling upgrade to succeed.

Error seen in the log:
...
FAILED - RETRYING: waiting for clean pgs... (2 retries left).
FAILED - RETRYING: waiting for clean pgs... (1 retries left).
fatal: [c06-h01-6048r -> c05-h33-6018r]: FAILED! => {"attempts": 40, "changed": true, "cmd": ["ceph", "--cluster", "ceph", "-s", "--format", "json"], "delta": "0:00:00.220174", "end": "2018-11-01 19:44:24.910327", "failed": true, "rc": 0, "start": "2018-11-01 19:44:24.690153", "stderr": "", "stderr_lines": [], "stdout": "\n{\"fsid\":\"3937e662-4872-4e7b-b9c9-14e09d85c7af\",\"health\":{\"checks\":{\"OSDMAP_FLAGS\":{\"severity\":\"HEALTH_WARN\",\"summary\":{\"message\":\"noout,noscrub,nodeep-scrub flag(s) set\"}},\"PG_DEGRADED\":{\"severity\":\"HEALTH_WARN\",\"summary\":{\"message\":\"Degraded data redundancy: 836/208940469 objects degraded (0.000%), 37 pgs degraded\"}}},\"status\":\"HEALTH_WARN\", ...

Version-Release number of selected component (if applicable):
ceph-ansible: 3.1.5-1.el7cp
ceph version 12.2.5-59.el7cp

How reproducible:
Run rolling_update.yml on a cluster with a high percentage of used capacity (50% full).
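As a possible workaround until the documentation is updated, the retry count can be raised without editing the playbook, assuming health_osd_check_retries (and its companion health_osd_check_delay) are defined as ordinary playbook variables in rolling_update.yml, so that `-e` extra-vars override them. A sketch of the invocation (the playbook path and variable values are illustrative, not prescriptive):

```shell
# Override the "waiting for clean pgs" check for a heavily filled cluster.
# -e gives these variables the highest Ansible precedence, overriding the
# defaults (40 retries) set inside rolling_update.yml.
ansible-playbook infrastructure-playbooks/rolling_update.yml \
  -e health_osd_check_retries=50 \
  -e health_osd_check_delay=30
```

With 50 retries at a 30-second delay, the playbook waits up to about 25 minutes per node for PGs to return to a clean state, which was enough for my 50%-full cluster.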
Doc looks good to me.