Description of problem:

When running rolling_update.yml, after the OSDs start, the playbook waits for clean PGs. When I/O is in progress on the cluster, the timeout value for this wait might not be sufficient. I observed this during my update.

TASK: [waiting for clean pgs...] **********************************************
failed: [magna077 -> magna061] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster slave | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster slave | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:00.469431", "end": "2016-11-15 05:00:03.576944", "rc": 2, "start": "2016-11-15 05:00:03.107513", "warnings": []}
stderr: /bin/sh: 1: test: Illegal number: 1 active+recovery_wait+degraded, 1 active+recovering+degraded, 534

FATAL: all hosts have already failed -- aborting

Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-44.el7scon.noarch

Additional info:

After the OSDs were started:

TASK: [start ceph osds (systemd)] *********************************************
ok: [magna077] => (item=1)
ok: [magna077] => (item=3)
ok: [magna077] => (item=8)

TASK: [waiting for clean pgs...] **********************************************

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            7 pgs degraded
            1 pgs recovering
            6 pgs recovery_wait
            recovery 1108/139053 objects degraded (0.797%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53100: 536 pgs, 15 pools, 178 GB data, 46351 objects
            534 GB used, 7757 GB / 8291 GB avail
            1108/139053 objects degraded (0.797%)
                 529 active+clean
                   6 active+recovery_wait+degraded
                   1 active+recovering+degraded
  client io 30620 kB/s wr, 0 op/s rd, 66 op/s wr

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            2 pgs degraded
            1 pgs recovering
            1 pgs recovery_wait
            2 pgs stuck unclean
            recovery 729/140667 objects degraded (0.518%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53140: 536 pgs, 15 pools, 180 GB data, 46889 objects
            541 GB used, 7750 GB / 8291 GB avail
            729/140667 objects degraded (0.518%)
                 534 active+clean
                   1 active+recovering+degraded
                   1 active+recovery_wait+degraded
  client io 47533 kB/s wr, 0 op/s rd, 104 op/s wr

Only a little more time was needed to reach active+clean:

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53170: 536 pgs, 15 pools, 180 GB data, 47001 objects
            543 GB used, 7748 GB / 8291 GB avail
                 536 active+clean

But the playbook had already failed. Can we have a way to handle this timeout better in case of heavy I/O on the cluster?
We should have variables to manage these timeouts, can you check? They should be health_osd_check_retries and health_osd_check_delay, and the same exists for monitors.
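For reference, the "waiting for clean pgs..." check in the failure above is an Ansible retry loop gated by exactly these two variables. Below is a minimal sketch of such a task; the task name, the shell command, and the "attempts": 10 field come straight from the log, while the registered variable name, the cluster variable, and the delegation target are illustrative and may not match the actual rolling_update.yml verbatim.

# Sketch (not verbatim from rolling_update.yml) of the wait-for-clean-pgs loop.
- name: waiting for clean pgs...
  shell: >
    test "$(ceph pg stat --cluster {{ cluster }} | sed 's/^.*pgs://;s/active+clean.*//;s/ //')" -eq
    "$(ceph pg stat --cluster {{ cluster }} | sed 's/pgs.*//;s/^.*://;s/ //')" &&
    ceph health --cluster {{ cluster }} | egrep -sq "HEALTH_OK|HEALTH_WARN"
  register: pgs_check                           # hypothetical register name
  until: pgs_check.rc == 0                      # succeed only once all pgs are active+clean
  retries: "{{ health_osd_check_retries }}"     # "attempts": 10 in the failed run above
  delay: "{{ health_osd_check_delay }}"         # seconds to sleep between attempts
  delegate_to: "{{ groups.mons[0] }}"           # the log shows delegation to a mon (magna061)

The "Illegal number" stderr and rc=2 in the log just mean the first sed expression still sees non-clean PG states, i.e. "not clean yet"; each such attempt is expected to fail and be retried until recovery finishes or the retries are exhausted.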
Thanks Seb. I had failed to notice these parameters. We can ask users to configure them according to their requirements.
What are the default parameters? Would you please work with the core team to arrive at reasonable defaults for 80% of clusters?
Default values are here: https://github.com/ceph/ceph-ansible/blob/master/infrastructure-playbooks/rolling_update.yml#L163-L164
As Sebastian's link in https://bugzilla.redhat.com/show_bug.cgi?id=1395073#c8 shows, the default value is 10 for both health_osd_check_retries and health_osd_check_delay.
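In other words, the upstream playbook ships with roughly the following vars (the line numbers in the link may drift as the file changes; this excerpt is just to show the values):

health_osd_check_retries: 10   # number of attempts
health_osd_check_delay: 10     # seconds between attempts, i.e. roughly 100 s of waiting per host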
At this time we are recommending the following change to the rolling_update.yml playbook: change the values for the OSD health check, from:

health_osd_check_retries: 10
health_osd_check_delay: 10

to:

health_osd_check_retries: 40
health_osd_check_delay: 30

This will make ceph-ansible wait up to 20 minutes (at 30-second intervals) *per host* for the cluster to reach a state in which ceph-ansible can continue the upgrade process.
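Expressed as the edited vars in rolling_update.yml, the recommendation amounts to the following (a sketch of the edit, not a verbatim diff):

health_osd_check_retries: 40   # attempts
health_osd_check_delay: 30     # seconds between attempts; 40 x 30 s = 1200 s = 20 minutes per host

If editing the playbook is undesirable, the same two values can also be passed as extra vars on the ansible-playbook command line, since extra vars take precedence over play vars; the recommendation above is simply stated as an edit to the file itself.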
Upstream PR https://github.com/ceph/ceph-ansible/pull/1100
Looks good.