Bug 1395073 - [ceph-ansible]: The timeout for "waiting for clean pgs" in rolling_update.yml is not sufficient in a few cases
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Documentation
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2.1
Assignee: Aron Gunn
QA Contact: Tejas
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-15 05:10 UTC by Tejas
Modified: 2016-11-28 09:38 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A large cluster may take longer to recover when restarting OSDs.
Consequence: ceph-ansible will stop the upgrade process because the timeout values were reached.
Fix: Increase the timeout values in the rolling_update.yml playbook to wait up to 20 minutes:
health_osd_check_retries: 40
health_osd_check_delay: 30
Result: Clusters that take longer to recover from restarted OSDs will allow ceph-ansible to continue the upgrade process.
Clone Of:
Environment:
Last Closed: 2016-11-28 09:38:09 UTC
Target Upstream Version:


Description Tejas 2016-11-15 05:10:05 UTC
Description of problem:
When running rolling_update.yml, the playbook waits for clean PGs after the OSDs are restarted.
When I/O is in progress on the cluster, the timeout for this task might not be sufficient.

I observed this on my update.

TASK: [waiting for clean pgs...] ********************************************** 
failed: [magna077 -> magna061] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster slave  | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster slave | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:00.469431", "end": "2016-11-15 05:00:03.576944", "rc": 2, "start": "2016-11-15 05:00:03.107513", "warnings": []}
stderr: /bin/sh: 1: test: Illegal number: 1 active+recovery_wait+degraded, 1 active+recovering+degraded, 534 

FATAL: all hosts have already failed -- aborting
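
For reference, judging from the failed "cmd" string above, the task extracts the PG count listed before "active+clean" in the "ceph pg stat" output and compares it with the total PG count, retrying the check a fixed number of times (the "attempts": 10 in the output corresponds to the retries value). While PGs are still recovering, that extraction yields a string such as "1 active+recovery_wait+degraded, ... 534" rather than a bare number, so test prints "Illegal number" and the attempt counts as failed; the real problem is only that the retries run out before the cluster settles. A rough sketch of such a task is below; it is reconstructed from the log, not copied from rolling_update.yml, and the delegation target (the first monitor, suggested by "[magna077 -> magna061]" above) is an assumption:

    # Sketch only, reconstructed from the failed cmd above -- not the literal playbook task.
    - name: waiting for clean pgs...
      shell: >
        test "$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')"
        -eq "$(ceph pg stat --cluster slave | sed 's/pgs.*//;s/^.*://;s/ //')"
        && ceph health --cluster slave | egrep -sq "HEALTH_OK|HEALTH_WARN"
      register: pgs_check
      until: pgs_check.rc == 0
      retries: "{{ health_osd_check_retries }}"
      delay: "{{ health_osd_check_delay }}"
      delegate_to: "{{ groups.mons[0] }}"   # assumption: delegated to the first monitor host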



Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-44.el7scon.noarch


Additional info:

After the OSDs were started:

TASK: [start ceph osds (systemd)] ********************************************* 
ok: [magna077] => (item=1)
ok: [magna077] => (item=3)
ok: [magna077] => (item=8)

TASK: [waiting for clean pgs...] ********************************************** 

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            7 pgs degraded
            1 pgs recovering
            6 pgs recovery_wait
            recovery 1108/139053 objects degraded (0.797%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53100: 536 pgs, 15 pools, 178 GB data, 46351 objects
            534 GB used, 7757 GB / 8291 GB avail
            1108/139053 objects degraded (0.797%)
                 529 active+clean
                   6 active+recovery_wait+degraded
                   1 active+recovering+degraded
  client io 30620 kB/s wr, 0 op/s rd, 66 op/s wr




root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            2 pgs degraded
            1 pgs recovering
            1 pgs recovery_wait
            2 pgs stuck unclean
            recovery 729/140667 objects degraded (0.518%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53140: 536 pgs, 15 pools, 180 GB data, 46889 objects
            541 GB used, 7750 GB / 8291 GB avail
            729/140667 objects degraded (0.518%)
                 534 active+clean
                   1 active+recovering+degraded
                   1 active+recovery_wait+degraded
  client io 47533 kB/s wr, 0 op/s rd, 104 op/s wr


The cluster needed only a little more time to reach active+clean.

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53170: 536 pgs, 15 pools, 180 GB data, 47001 objects
            543 GB used, 7748 GB / 8291 GB avail
                 536 active+clean


But the playbook had already failed.
Can we have a way to handle this timeout better in case of heavy I/O on the cluster?

Comment 3 seb 2016-11-15 13:21:47 UTC
We should have variables to manage these timeouts; can you check?
Should be: health_osd_check_retries and health_osd_check_delay, same for monitors.

Comment 4 Tejas 2016-11-15 15:01:48 UTC
Thanks Seb. 
I had failed to notice these parameters.

We can ask users to configure these parameters according to their requirements.

Comment 7 Christina Meno 2016-11-16 16:33:13 UTC
What are the default parameters?
Would you please work with the core team to arrive at reasonable defaults for 80% of clusters?

Comment 9 Andrew Schoen 2016-11-16 16:49:13 UTC
As Sebastian linked to in https://bugzilla.redhat.com/show_bug.cgi?id=1395073#c8, the default value is 10 for both health_osd_check_retries and health_osd_check_delay.
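
These are ordinary playbook variables, so with the defaults the wait task gives up after 10 retries x 10 seconds, i.e. about 100 seconds per host. A minimal sketch of how they would look where they are defined (the exact placement inside the shipped rolling_update.yml may differ):

    vars:
      # Defaults per this comment; equivalent monitor-check variables exist as well (see comment 3).
      health_osd_check_retries: 10   # number of attempts
      health_osd_check_delay: 10     # seconds between attempts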

Comment 10 Alfredo Deza 2016-11-16 20:53:32 UTC
At this time we are recommending the following changes to the rolling_update.yml playbook:

Change the values for OSD retries, from:

    health_osd_check_retries: 10
    health_osd_check_delay: 10

To:

    health_osd_check_retries: 40
    health_osd_check_delay: 30


This will make ceph-ansible wait up to 20 minutes (at 30-second intervals) *per host* for the cluster to reach a state where ceph-ansible can continue the upgrade process.
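
Applied in the same place, the recommended values look like this (40 retries x 30 seconds = 1200 seconds, i.e. the 20 minutes per host mentioned above); since they are regular Ansible variables, they can also be overridden at run time with ansible-playbook's -e/--extra-vars option, which takes precedence over values set in the play:

    vars:
      health_osd_check_retries: 40   # 40 attempts ...
      health_osd_check_delay: 30     # ... every 30 seconds = up to 20 minutes per host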

Comment 11 Alfredo Deza 2016-11-17 15:49:41 UTC
Upstream PR https://github.com/ceph/ceph-ansible/pull/1100

Comment 13 Tejas 2016-11-18 07:43:20 UTC
Looks good.

