Bug 1627802
| Summary: | [cee/sd][ceph-ansible] the tests (total num of PGS) == (num of active+clean PGs) should take into account PGs that are (deep)scrubbed | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Tomas Petr <tpetr> |
| Component: | Ceph-Ansible | Assignee: | Sébastien Han <shan> |
| Status: | CLOSED DUPLICATE | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | aschoen, ceph-eng-bugs, gfarnum, gmeno, nthomas, sankarshan |
| Target Milestone: | rc | | |
| Target Release: | 3.* | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-09-11 19:08:39 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment #3 (Greg Farnum):

I previously reported this as #1616066 and I agree, it is a serious issue.

Reply:

(In reply to Greg Farnum from comment #3)
> I previously reported this as #1616066 and I agree, it is a serious issue.

Hi Greg, thanks for pointing out that BZ. I was looking for an existing BZ but it seems I missed that one. Marking this as a duplicate.

*** This bug has been marked as a duplicate of bug 1616066 ***
Description of problem:

We have encountered an issue on large clusters where it can happen that the PGs are being scrubbed 24x7. Running the site.yml playbook then ends up waiting for all PGs to be active+clean, even when they already are: the playbook just checks whether the total number of PGs (12288) equals the count of PGs in the exact "active+clean" state, and ignores PGs that are active+clean but currently being scrubbed or deep-scrubbed. A sketch of a relaxed check is shown after the reproduction steps below.

```
pgs: 12277 active+clean
         6 active+clean+scrubbing+deep
         5 active+clean+scrubbing
--------------------------------
     12288 total PGs
```

The workaround is to disable scrub/deep-scrub (see the example under "Additional info" below). This behavior should be changed: on large clusters the playbook may take a long while to finish, and running without scrub/deep-scrub for that long leaves an active problem in place.

```
failed: [osd1 -> osd1] (item=osd1) => {"changed": true, "cmd": ["/usr/bin/env", "bash", "/tmp/restart_osd_daemon.sh"], "delta": "0:20:53.733572", "end": "2018-09-07 13:24:09.880074", "item": "osd1", "msg": "non-zero return code", "rc": 1, "start": "2018-09-07 13:03:16.146502", "stderr": "", "stderr_lines": [], "stdout": "Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean\nIt is possible that the cluster has less OSDs than the replica configuration\nWill refuse to continue\n cluster:\n id: XXX\n health: HEALTH_WARN\n 1 MDSs have many clients failing to respond to cache pressure\n \n services:\n mon: 3 daemons, quorum mon1,mon2,mon3\n mgr: mon1(active), standbys: mon2, mon3\n mds: cephfs-1/1/1 up {0=mon1=up:active}, 1 up:standby\n osd: 188 osds: 188 up, 188 in\n \n data:\n pools: 3 pools, 12288 pgs\n objects: 154M objects, 218 TB\n usage: 656 TB used, 366 TB / 1023 TB avail\n pgs: 12277 active+clean\n 6 active+clean+scrubbing+deep\n 5 active+clean+scrubbing\n \n io:\n client: 148 MB/s rd, 38526 kB/s wr, 6597 op/s rd, 175 op/s wr\n ", "stdout_lines": ["Error while running 'ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring --cluster ceph -s', PGs were not reported as active+clean", "It is possible that the cluster has less OSDs than the replica configuration", "Will refuse to continue", " cluster:", " id: XXX", " health: HEALTH_WARN", " 1 MDSs have many clients failing to respond to cache pressure", " ", " services:", " mon: 3 daemons, quorum mon1,mon2,mon3", " mgr: mon1(active), standbys: mon2, mon3", " mds: cephfs-1/1/1 up {0=mon1=up:active}, 1 up:standby", " osd: 188 osds: 188 up, 188 in", " ", " data:", " pools: 3 pools, 12288 pgs", " objects: 154M objects, 218 TB", " usage: 656 TB used, 366 TB / 1023 TB avail", " pgs: 12277 active+clean", " 6 active+clean+scrubbing+deep", " 5 active+clean+scrubbing", " ", " io:", " client: 148 MB/s rd, 38526 kB/s wr, 6597 op/s rd, 175 op/s wr", " "]}
```

Version-Release number of selected component (if applicable):
ceph-ansible-3.0.39-1.el7cp.noarch

In upstream master the code seems to be still the same as in 3.0.39-1:
https://github.com/ceph/ceph-ansible/blob/master/roles/ceph-defaults/templates/restart_osd_daemon.sh.j2#L8

How reproducible:
Always

Steps to Reproduce:
1. Run site.yml on a cluster where PGs are being deep-scrubbed/scrubbed all the time.
2. site.yml waits for all PGs to be active+clean until it reaches the maximum delay*retries and fails, even though all PGs are active+clean; some of them are just also being scrubbed at the same time.
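For illustration, here is a minimal sketch of what a relaxed check could look like, parsing the pgmap section of `ceph -s -f json` (the `num_pgs`, `pgs_by_state`, `state_name`, and `count` fields). This is a hypothetical rewrite for this report, not the actual restart_osd_daemon.sh.j2 template or a proposed patch:

```bash
#!/usr/bin/env bash
# Hypothetical relaxed PG check -- an illustration for this report, not
# the actual ceph-ansible code. It counts every PG whose state begins
# with "active+clean" (which includes active+clean+scrubbing and
# active+clean+scrubbing+deep) as clean.

CLUSTER="${CLUSTER:-ceph}"
status_json="$(ceph --cluster "$CLUSTER" -s -f json)"

# Total number of PGs in the cluster.
total_pgs="$(echo "$status_json" | python -c '
import sys, json
print(json.load(sys.stdin)["pgmap"]["num_pgs"])
')"

# PGs that are active+clean, with or without scrub states appended.
clean_pgs="$(echo "$status_json" | python -c '
import sys, json
states = json.load(sys.stdin)["pgmap"]["pgs_by_state"]
print(sum(s["count"] for s in states
          if s["state_name"].startswith("active+clean")))
')"

if [ "$total_pgs" -eq "$clean_pgs" ]; then
    echo "All $total_pgs PGs are active+clean (some may be scrubbing)"
    exit 0
fi

echo "Only $clean_pgs of $total_pgs PGs are strictly active+clean" >&2
exit 1
```

Matching on the "active+clean" prefix would also pick up states such as active+clean+snaptrim; whether those should count as clean is a policy decision for the real fix.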
Actual results:
site.yml waits for all PGs to be active+clean until it reaches the maximum delay*retries and fails, even though all PGs are active+clean; some of them are just also being scrubbed at the same time.

Expected results:
The playbook should also count "active+clean+scrubbing+deep" and "active+clean+scrubbing" PGs as "active+clean".

Additional info:
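An example of the workaround using the standard cluster flags (remember to unset them once the playbook has finished):

```bash
# Temporarily disable scrubbing so that all PGs settle in the exact
# "active+clean" state that the playbook's check expects.
ceph osd set noscrub
ceph osd set nodeep-scrub

# ... run the playbook, e.g.: ansible-playbook site.yml ...

# Re-enable scrubbing once the playbook has finished.
ceph osd unset noscrub
ceph osd unset nodeep-scrub
```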