Bug 1627802 - [cee/sd][ceph-ansible] the tests (total num of PGS) == (num of active+clean PGs) should take into account PGs that are (deep)scrubbed
Summary: [cee/sd][ceph-ansible] the tests (total num of PGS) == (num of active+clean P...
Keywords:
Status: CLOSED DUPLICATE of bug 1616066
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Ceph-Ansible
Version: 3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 3.*
Assignee: Sébastien Han
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-11 13:51 UTC by Tomas Petr
Modified: 2018-09-11 19:46 UTC
6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-11 19:08:39 UTC
Embargoed:



Description Tomas Petr 2018-09-11 13:51:57 UTC
Description of problem:
We have encountered an issue on large clusters, where PGs may be scrubbed 24x7: running the site.yml playbook ends up waiting for all PGs to become active+clean even when they already are.

The playbook only checks whether the total number of PGs (12288) equals the number of PGs in the exact state active+clean, and ignores PGs that are active+clean but currently being scrubbed or deep-scrubbed (see the sketch below):
pgs:     12277 active+clean
         6     active+clean+scrubbing+deep
         5     active+clean+scrubbing
--------------------------------
         12288 total PGs
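
For illustration, a minimal sketch of the kind of strict check the restart script performs. This is not the literal restart_osd_daemon.sh.j2 template; it assumes jq is available and that 'ceph -s -f json' exposes pgmap.num_pgs and pgmap.pgs_by_state.

#!/usr/bin/env bash
# Sketch only, not the actual template: compare the total PG count
# against the number of PGs whose state is exactly "active+clean".
total_pgs=$(ceph -s -f json | jq '.pgmap.num_pgs')
clean_pgs=$(ceph -s -f json | jq '[.pgmap.pgs_by_state[]
             | select(.state_name == "active+clean") | .count] | add // 0')
# With 12277 active+clean, 6 active+clean+scrubbing+deep and
# 5 active+clean+scrubbing, clean_pgs is 12277 while total_pgs is 12288,
# so this check keeps failing until scrubbing stops or retries run out.
test "$total_pgs" -eq "$clean_pgs"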

The workaround is to disable the scrub/deep-scrub.

This behavior should be changed: on large clusters the playbook may take a long while to finish, and leaving scrub/deep-scrub disabled for that whole period is itself a problem.
-------------------------

failed: [osd1 -> osd1] (item=osd1) => {"changed": true, "cmd": ["/usr/bin/env", "bash", "/tmp/restart_osd_daemon.sh"], "delta": "0:20:53.733572", "end": "2018-09-07 13:24:09.880074", "item": "osd1", "msg": "non-zero return code", "rc": 1, "start": "2018-09-07 13:03:16.146502", "stderr": "", "stderr_lines": [], "stdout": "Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean\nIt is possible that the cluster has less OSDs than the replica configuration\nWill refuse to continue\n  cluster:\n    id:     XXX\n    health: HEALTH_WARN\n            1 MDSs have many clients failing to respond to cache pressure\n \n  services:\n    mon: 3 daemons, quorum mon1,mon2,mon3\n    mgr: mon1(active), standbys: mon2, mon3\n    mds: cephfs-1/1/1 up  {0=mon1=up:active}, 1 up:standby\n    osd: 188 osds: 188 up, 188 in\n \n  data:\n    pools:   3 pools, 12288 pgs\n    objects: 154M objects, 218 TB\n    usage:   656 TB used, 366 TB / 1023 TB avail\n    pgs:     12277 active+clean\n             6     active+clean+scrubbing+deep\n             5     active+clean+scrubbing\n \n  io:\n    client:   148 MB/s rd, 38526 kB/s wr, 6597 op/s rd, 175 op/s wr\n ", "stdout_lines": ["Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue", "  cluster:", "    id:     XXX", "    health: HEALTH_WARN", "            1 MDSs have many clients failing to respond to cache pressure", " ", "  services:", "    mon: 3 daemons, quorum mon1,mon2,mon3", "    mgr: mon1(active), standbys: mon2, mon3", "    mds: cephfs-1/1/1 up  {0=mon1=up:active}, 1 up:standby", "    osd: 188 osds: 188 up, 188 in", " ", "  data:", "    pools:   3 pools, 12288 pgs", "    objects: 154M objects, 218 TB", "    usage:   656 TB used, 366 TB / 1023 TB avail", "    pgs:     12277 active+clean", "             6     active+clean+scrubbing+deep", "             5     active+clean+scrubbing", " ", "  io:", "    client:   148 MB/s rd, 38526 kB/s wr, 6597 op/s rd, 175 op/s wr", " "]}


Version-Release number of selected component (if applicable):
ceph-ansible-3.0.39-1.el7cp.noarch

In upstream master the code still appears to be the same as in 3.0.39-1:
https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-defaults/templates/restart_osd_daemon.sh.j2#L8

How reproducible:
Always

Steps to Reproduce:
1. Run site.yml on a cluster where PGs are being scrubbed/deep-scrubbed all the time.
2. site.yml waits for all PGs to be active+clean until it reaches the maximum delay*retries and fails, even though all PGs are active+clean and only some of them are also being scrubbed at the same time.

Actual results:
site.yml waits for all PGs to be active+clean until it reaches the maximum delay*retries and fails, even though all PGs are active+clean and only some of them are also being scrubbed at the same time.

Expected results:
The playbook should also count "active+clean+scrubbing+deep" and "active+clean+scrubbing" PGs as "active+clean".
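
A minimal sketch of the expected behavior (an illustration, not an actual ceph-ansible patch; same assumptions about jq and the ceph JSON output as above): count every PG whose state string begins with "active+clean", so that scrubbing PGs are treated as clean.

#!/usr/bin/env bash
# Sketch only: treat any "active+clean*" state as clean.
total_pgs=$(ceph -s -f json | jq '.pgmap.num_pgs')
clean_pgs=$(ceph -s -f json | jq '[.pgmap.pgs_by_state[]
             | select(.state_name | startswith("active+clean")) | .count] | add // 0')
# 12277 + 6 + 5 = 12288 == total_pgs, so the check passes even while
# scrubbing or deep-scrubbing is in progress.
test "$total_pgs" -eq "$clean_pgs"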

Additional info:

Comment 3 Greg Farnum 2018-09-11 17:32:10 UTC
I previously reported this as #1616066 and I agree, it is a serious issue.

Comment 4 Tomas Petr 2018-09-11 19:08:39 UTC
(In reply to Greg Farnum from comment #3)
> I previously reported this as #1616066 and I agree, it is a serious issue.

Hi Greg,
thanks for pointing to that BZ; I was looking for an existing BZ but it seems I missed that one.
Marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1616066 ***

