Bug 1395073 - [ceph-ansible]: The timeout for "waiting for clean pgs" in rolling_update.yml is not sufficient in a few cases
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: Documentation
Version: 2.1
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: 2.1
Assignee: Aron Gunn
QA Contact: Tejas
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-11-15 05:10 UTC by Tejas
Modified: 2016-11-28 09:38 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: A large cluster may take longer to recover when restarting OSDs.
Consequence: ceph-ansible will stop the upgrade process because the timeout values were reached.
Fix: Increase the timeout values in the rolling_update.yml playbook to wait up to 20 minutes:
health_osd_check_retries: 40
health_osd_check_delay: 30
Result: Clusters that take longer to recover from restarted OSDs will allow ceph-ansible to continue the upgrade process.
Clone Of:
Environment:
Last Closed: 2016-11-28 09:38:09 UTC
Target Upstream Version:


Description Tejas 2016-11-15 05:10:05 UTC
Description of problem:
When running rolling_update.yml, the playbook waits for clean PGs after the OSDs are restarted.
When I/O is in progress on the cluster, the timeout for this task might not be sufficient.

I observed this on my update.

TASK: [waiting for clean pgs...] ********************************************** 
failed: [magna077 -> magna061] => {"attempts": 10, "changed": true, "cmd": "test \"$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')\" -eq \"$(ceph pg stat --cluster slave  | sed 's/pgs.*//;s/^.*://;s/ //')\" && ceph health --cluster slave | egrep -sq \"HEALTH_OK|HEALTH_WARN\"", "delta": "0:00:00.469431", "end": "2016-11-15 05:00:03.576944", "rc": 2, "start": "2016-11-15 05:00:03.107513", "warnings": []}
stderr: /bin/sh: 1: test: Illegal number: 1 active+recovery_wait+degraded, 1 active+recovering+degraded, 534 

FATAL: all hosts have already failed -- aborting
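
For reference, judging from the failed "cmd" string above, the task extracts the PG count listed before "active+clean" in the "ceph pg stat" output and compares it with the total PG count, retrying the check a fixed number of times (the "attempts": 10 in the output corresponds to the retries value). While PGs are still recovering, that extraction yields a string such as "1 active+recovery_wait+degraded, ... 534" rather than a bare number, so test prints "Illegal number" and the attempt counts as failed; the real problem is only that the retries run out before the cluster settles. A rough sketch of such a task is below; it is reconstructed from the log, not copied from rolling_update.yml, and the delegation target (the first monitor, suggested by "[magna077 -> magna061]" above) is an assumption:

    # Sketch only, reconstructed from the failed cmd above -- not the literal playbook task.
    - name: waiting for clean pgs...
      shell: >
        test "$(ceph pg stat --cluster slave | sed 's/^.*pgs://;s/active+clean.*//;s/ //')"
        -eq "$(ceph pg stat --cluster slave | sed 's/pgs.*//;s/^.*://;s/ //')"
        && ceph health --cluster slave | egrep -sq "HEALTH_OK|HEALTH_WARN"
      register: pgs_check
      until: pgs_check.rc == 0
      retries: "{{ health_osd_check_retries }}"
      delay: "{{ health_osd_check_delay }}"
      delegate_to: "{{ groups.mons[0] }}"   # assumption: delegated to the first monitor host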



Version-Release number of selected component (if applicable):
ceph-ansible-1.0.5-44.el7scon.noarch


Additional info:

After the OSDs were started:

TASK: [start ceph osds (systemd)] ********************************************* 
ok: [magna077] => (item=1)
ok: [magna077] => (item=3)
ok: [magna077] => (item=8)

TASK: [waiting for clean pgs...] ********************************************** 

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            7 pgs degraded
            1 pgs recovering
            6 pgs recovery_wait
            recovery 1108/139053 objects degraded (0.797%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53100: 536 pgs, 15 pools, 178 GB data, 46351 objects
            534 GB used, 7757 GB / 8291 GB avail
            1108/139053 objects degraded (0.797%)
                 529 active+clean
                   6 active+recovery_wait+degraded
                   1 active+recovering+degraded
  client io 30620 kB/s wr, 0 op/s rd, 66 op/s wr




root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            2 pgs degraded
            1 pgs recovering
            1 pgs recovery_wait
            2 pgs stuck unclean
            recovery 729/140667 objects degraded (0.518%)
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53140: 536 pgs, 15 pools, 180 GB data, 46889 objects
            541 GB used, 7750 GB / 8291 GB avail
            729/140667 objects degraded (0.518%)
                 534 active+clean
                   1 active+recovering+degraded
                   1 active+recovery_wait+degraded
  client io 47533 kB/s wr, 0 op/s rd, 104 op/s wr


The cluster needed only a little more time to reach active+clean.

root@magna086:~# ceph -s --cluster slave
    cluster 4673f989-218e-4f64-bb71-f71ee2c828a1
     health HEALTH_WARN
            clock skew detected on mon.magna063, mon.magna067
            pool us-west.rgw.buckets.data has many more objects per pg than average (too few pgs?)
            noout,noscrub,nodeep-scrub,sortbitwise flag(s) set
            Monitor clock skew detected 
     monmap e1: 3 mons at {magna061=10.8.128.61:6789/0,magna063=10.8.128.63:6789/0,magna067=10.8.128.67:6789/0}
            election epoch 24, quorum 0,1,2 magna061,magna063,magna067
      fsmap e10: 1/1/1 up {0=magna086=up:active}
     osdmap e124: 9 osds: 9 up, 9 in
            flags noout,noscrub,nodeep-scrub,sortbitwise
      pgmap v53170: 536 pgs, 15 pools, 180 GB data, 47001 objects
            543 GB used, 7748 GB / 8291 GB avail
                 536 active+clean


But the playbook had already failed.
Can we have a way to handle this timeout better in case of heavy I/O on the cluster?

Comment 3 seb 2016-11-15 13:21:47 UTC
We should have variables to manage these timeouts; can you check?
Should be: health_osd_check_retries and health_osd_check_delay, same for monitors.

Comment 4 Tejas 2016-11-15 15:01:48 UTC
Thanks Seb. 
I had failed to notice these parameters.

We can ask users to configure these parameters according to their requirements.

Comment 7 Christina Meno 2016-11-16 16:33:13 UTC
What are the default parameters?
Would you please work with the core team to arrive at reasonable defaults for 80% of clusters?

Comment 9 Andrew Schoen 2016-11-16 16:49:13 UTC
As Sebastian linked to in https://bugzilla.redhat.com/show_bug.cgi?id=1395073#c8, the default value is 10 for both health_osd_check_retries and health_osd_check_delay.
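
These are ordinary playbook variables, so with the defaults the wait task gives up after 10 retries x 10 seconds, i.e. about 100 seconds per host. A minimal sketch of how they would look where they are defined (the exact placement inside the shipped rolling_update.yml may differ):

    vars:
      # Defaults per this comment; equivalent monitor-check variables exist as well (see comment 3).
      health_osd_check_retries: 10   # number of attempts
      health_osd_check_delay: 10     # seconds between attempts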

Comment 10 Alfredo Deza 2016-11-16 20:53:32 UTC
At this time we are recommending the following changes to the rolling_update.yml playbook:

Change the values for OSD retries, from:

    health_osd_check_retries: 10
    health_osd_check_delay: 10

To:

    health_osd_check_retries: 40
    health_osd_check_delay: 30


This will make ceph-ansible wait up to 20 minutes (at 30-second intervals) *per host* for the cluster to reach a state where ceph-ansible can continue the upgrade process.
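
Applied in the same place, the recommended values look like this (40 retries x 30 seconds = 1200 seconds, i.e. the 20 minutes per host mentioned above); since they are regular Ansible variables, they can also be overridden at run time with ansible-playbook's -e/--extra-vars option, which takes precedence over values set in the play:

    vars:
      health_osd_check_retries: 40   # 40 attempts ...
      health_osd_check_delay: 30     # ... every 30 seconds = up to 20 minutes per host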

Comment 11 Alfredo Deza 2016-11-17 15:49:41 UTC
Upstream PR https://github.com/ceph/ceph-ansible/pull/1100

Comment 13 Tejas 2016-11-18 07:43:20 UTC
Looks good.

