| Summary: | [ceph-ansible]: The timeout for "waiting for clean pgs" in rolling_update.yml is not sufficient in some cases | ||
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Tejas <tchandra> |
| Component: | Documentation | Assignee: | Aron Gunn <agunn> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Tejas <tchandra> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 2.1 | CC: | adeza, agunn, aschoen, ceph-eng-bugs, flucifre, gmeno, hnallurv, kdreyer, nthomas, sankarshan, seb |
| Target Milestone: | rc | ||
| Target Release: | 2.1 | ||
| Hardware: | Unspecified | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: Large clusters may take longer to recover when OSDs are restarted.
Consequence: ceph-ansible stops the upgrade process when the timeout values are reached.
Fix: Increase the timeout values in the rolling_update.yml playbook to wait up to 20 minutes:
health_osd_check_retries: 40
health_osd_check_delay: 30
Result: ceph-ansible waits long enough for clusters that take longer to recover from restarted OSDs, allowing it to continue the upgrade process.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-11-28 09:38:09 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
We should have variables to manage these timeouts, can you check? They should be health_osd_check_retries and health_osd_check_delay, and the same for the monitors.

Thanks Seb. I had failed to notice these parameters. We can ask users to configure them according to their requirements. What are the default values?

Would you please work with the core team to arrive at reasonable defaults for 80% of clusters?

Default values are here: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L163-L164

As Sebastian linked in https://bugzilla.redhat.com/show_bug.cgi?id=1395073#c8, the default value is 10 for both health_osd_check_retries and health_osd_check_delay. At this time we are recommending the following change to the rolling_update.yml playbook:
Change the values for OSD retries, from:
health_osd_check_retries: 10
health_osd_check_delay: 10
To:
health_osd_check_retries: 40
health_osd_check_delay: 30
This will make ceph-ansible wait up to 20 minutes (40 retries at 30-second intervals) *per host* for the cluster to reach a state in which ceph-ansible can continue the upgrade process, as sketched below.
Upstream PR: https://github.com/ceph/ceph-ansible/pull/1100

Looks good. |
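As a rough illustration of where the recommended values plug in, below is a simplified sketch of the "waiting for clean pgs" wait loop, reconstructed from the failing task shown in the Description below. The play structure, the register/until names, and the monitor delegation are illustrative and may differ from the shipped rolling_update.yml:

```yaml
# Simplified sketch, not the verbatim upstream task: the two variables feed
# Ansible's retries/delay loop, so 40 retries x 30 s delay = up to 20 minutes
# of waiting per host before the playbook gives up.
vars:
  health_osd_check_retries: 40
  health_osd_check_delay: 30

tasks:
  - name: waiting for clean pgs...
    shell: >
      test "$(ceph pg stat --cluster {{ cluster }} | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq
      "$(ceph pg stat --cluster {{ cluster }} | sed 's/pgs.*//;s/^.*://;s/ //')" &&
      ceph health --cluster {{ cluster }} | egrep -sq "HEALTH_OK|HEALTH_WARN"
    register: pg_check                        # name chosen for this sketch
    until: pg_check.rc == 0                   # stop retrying once every pg is active+clean
    retries: "{{ health_osd_check_retries }}"
    delay: "{{ health_osd_check_delay }}"
    delegate_to: "{{ groups['mons'][0] }}"    # run the check from a monitor node (illustrative)
```

Because extra vars take precedence over play-level vars in Ansible, it should also be possible to supply the higher values at run time with ansible-playbook's --extra-vars option instead of editing the playbook.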
Description of problem:
When running rolling_update.yml, after the OSDs start, the playbook waits for clean pgs. When IO is in progress on the cluster, the timeout value for this might not be sufficient. I observed this on my update.

TASK: [waiting for clean pgs...] **********************************************
failed: [magna077 -> magna061] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster slave | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster slave | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:00.469431", "end": "2016-11-15 05:00:03.576944", "rc": 2, "start": "2016-11-15 05:00:03.107513", "warnings": []}
stderr: /bin/sh: 1: test: Illegal number: 1 active+recovery_wait+degraded, 1 active+recovering+degraded, 534
FATAL: all hosts have already failed -- aborting

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-44.el7scon.noarch

Additional info:
After the OSDs were started:

TASK: [start ceph osds (systemd)] *********************************************
ok: [magna077] => (item=1)
ok: [magna077] => (item=3)
ok: [magna077] => (item=8)

TASK: [waiting for clean pgs...] **********************************************

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            7 pgs degraded
            1 pgs recovering
            6 pgs recovery_wait
            recovery 1108/139053 objects degraded (0.797%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53100: 536 pgs, 15 pools, 178 GB data, 46351 objects
            534 GB used, 7757 GB / 8291 GB avail
            1108/139053 objects degraded (0.797%)
                 529 active+clean
                   6 active+recovery_wait+degraded
                   1 active+recovering+degraded
  client io 30620 kB/s wr, 0 op/s rd, 66 op/s wr

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            2 pgs degraded
            1 pgs recovering
            1 pgs recovery_wait
            2 pgs stuck unclean
            recovery 729/140667 objects degraded (0.518%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53140: 536 pgs, 15 pools, 180 GB data, 46889 objects
            541 GB used, 7750 GB / 8291 GB avail
            729/140667 objects degraded (0.518%)
                 534 active+clean
                   1 active+recovering+degraded
                   1 active+recovery_wait+degraded
  client io 47533 kB/s wr, 0 op/s rd, 104 op/s wr

There was just a little more time needed to reach active+clean:

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53170: 536 pgs, 15 pools, 180 GB data, 47001 objects
            543 GB used, 7748 GB / 8291 GB avail
                 536 active+clean

But the playbook had already failed. Can we have a way to handle this timeout better in case of heavy IO on the cluster?
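For context on why the task errors out while recovery is still in progress: the check compares two numbers extracted from ceph pg stat and succeeds only when they match and the overall health is HEALTH_OK or HEALTH_WARN. The sketch below splits the one-liner from the failure above into its parts, using the "slave" cluster name from this log; the comments are an interpretation of the behaviour, not captured output:

```sh
# Whatever appears between "pgs:" and "active+clean" in the 'ceph pg stat' output.
# This is a single number (the active+clean count) only when no other pg states
# are present; during recovery it yields something like
# "1 active+recovery_wait+degraded, 1 active+recovering+degraded, 534",
# which makes 'test -eq' fail with the "Illegal number" error seen in the stderr above.
ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //'

# The total number of pgs (536 in this cluster).
ceph pg stat --cluster slave | sed 's/pgs.*//;s/^.*://;s/ //'

# The task only proceeds when the two values above are equal and the overall
# health is HEALTH_OK or HEALTH_WARN; otherwise it retries, health_osd_check_delay
# seconds apart, up to health_osd_check_retries times before aborting.
ceph health --cluster slave | egrep -sq "HEALTH_OK|HEALTH_WARN"
```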