Bug 1927128
Summary: [Tracker for BZ #1937088] When add capacity is performed on an arbiter-mode cluster, ceph health reports PG_AVAILABILITY Reduced data availability: 25 pgs inactive, 25 pgs incomplete

Product: [Red Hat Storage] Red Hat OpenShift Container Storage
Component: ceph
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: unspecified
Status: CLOSED ERRATA
Reporter: Pratik Surve <prsurve>
Assignee: Greg Farnum <gfarnum>
QA Contact: Pratik Surve <prsurve>
CC: bniver, ebenahar, gfarnum, madam, muagarwa, nberry, nojha, ocs-bugs, owasserm
Keywords: AutomationBackLog
Target Milestone: ---
Target Release: OCS 4.7.0
Doc Type: No Doc Update
Clones: 1937088 (view as bug list)
Bug Depends On: 1937088
Type: Bug
Last Closed: 2021-05-19 09:19:24 UTC
Description (Pratik Surve, 2021-02-10 07:02:58 UTC):

Since this concerns basic functionality (cluster expansion while the cluster is deployed in arbiter mode), it is flagged as a blocker.

I see the OSDs are distributed evenly in the CRUSH tree:

```
ID  CLASS  WEIGHT   TYPE NAME            STATUS  REWEIGHT  PRI-AFF
-1         1.56238  root default
-4         0.78119      zone a
-3         0.39059          host compute-0
 0    hdd  0.19530              osd.0    up      1.00000   1.00000
 6    hdd  0.19530              osd.6    up      1.00000   1.00000
-13        0.39059          host compute-3
 3    hdd  0.19530              osd.3    up      1.00000   1.00000
 7    hdd  0.19530              osd.7    up      1.00000   1.00000
-8         0.78119      zone b
-11        0.39059          host compute-1
 2    hdd  0.19530              osd.2    up      1.00000   1.00000
 5    hdd  0.19530              osd.5    up      1.00000   1.00000
-7         0.39059          host compute-2
 1    hdd  0.19530              osd.1    up      1.00000   1.00000
 4    hdd  0.19530              osd.4    up      1.00000   1.00000
```

For other ceph info, see the related output from the cluster: http://rhsqe-repo.lab.eng.blr.redhat.com/ocs4qe/pratik/bz/bz_1927128/feb_10/must-gather.local.5824928173416610658/quay-io-rhceph-dev-ocs-must-gather-sha256-8099d74217f9305c717cb1a157a6a89f5e810834edd9dfd80b89484263e6cc62/ceph/must_gather_commands/

@Neha: what could be the cause of the "remapped+incomplete" PG status?

Greg, can you take a look? Thanks!

I'm trying to reproduce this, as I'm going to need some real debug logs to understand what's happening here. I assume from what I see at https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/ocs/resources/storage_cluster.py#L430 that this add-capacity test adds a new OSD drive to each of the four hosts? (Also, if it's easy to just run it with "debug osd = 20", that would help!)
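The "debug osd = 20" request above corresponds to a ceph.conf fragment like the following sketch. (On an OCS/Rook-managed cluster, the equivalent would typically be applied via the `rook-config-override` ConfigMap, or at runtime with `ceph config set osd debug_osd 20`, rather than by editing the file directly.)

```ini
# Sketch: raise OSD logging to level 20 so peering state
# transitions are recorded in the OSD logs
[osd]
debug osd = 20
```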
Looking at the history of pg 1.1c displayed in OSD 2's logs, I'm seeing it assigned to:

- up [1,2,3,0]; acting [1,2,3,0] -- i.e., all the original OSDs.
- up [1,5,3,6]; acting [1,5,3,6] -- so, we've added new OSDs and consequently remapped; looks fine.
- up [1,5,7,6]; acting [1,3] -- so, more new OSDs; as there's not enough overlap from the old to the new calculated up set, we're going active with only the two nodes that already have the data (the others will be getting backfilled).
- up [1,5,7,6]; acting [1,5] -- and this is the confusing one! OSDs 1 and 5 are in the same zone, so the PG is not allowed to go active with these while the other zone survives. They are conspicuously the first two members of the up set, but I'm just not seeing in the source how this happens.

Is this reproducible? Otherwise we can remove the blocker flag.

I've finally managed to reproduce this. It's definitely a blocker for the stretch cluster feature -- we can't ship with this kind of peering issue; it breaks data availability, which is rather the opposite of the feature's goal. I've identified the issue, and the fix isn't too complicated, so I'm working on that and doing an audit for this category of error elsewhere.

Greg, do we have some update on the fix?

Yeah, I have a branch which fixed the issue I identified, but that has revealed a few knock-on issues I'm dealing with. I expect to have an upstream PR today (Thursday) during Pacific time business hours.

Branch approved upstream (https://github.com/ceph/ceph/pull/40049) and passed a rados suite run; I will do the backport and push into the RHCS 4.2 branch once I've slept and the workweek starts!

Merged into our ceph-4.2-rhel-patches branch now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.
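The zone rule behind the "confusing" acting set [1,5] can be illustrated with a small sketch. This is not Ceph code -- the `may_go_active` function and `OSD_ZONE` map are invented for illustration, with the map mirroring the CRUSH tree in this bug: in stretch mode, a PG whose acting set is confined to one zone must not go active while the other zone still has live OSDs.

```python
# Simplified illustration of the stretch-mode peering rule described
# above; real Ceph peering also involves min_size and more state.
# Zone "a" holds compute-0/compute-3 (osds 0, 6, 3, 7); zone "b"
# holds compute-1/compute-2 (osds 2, 5, 1, 4), per the CRUSH tree.
OSD_ZONE = {0: "a", 6: "a", 3: "a", 7: "a",
            2: "b", 5: "b", 1: "b", 4: "b"}

def may_go_active(acting, live_zones, osd_zone=OSD_ZONE):
    """A PG may go active only if its acting set covers every zone
    that still has live OSDs."""
    acting_zones = {osd_zone[o] for o in acting}
    return live_zones <= acting_zones

# acting [1,3] spans both zones: allowed to go active while the
# remaining members are backfilled.
assert may_go_active([1, 3], {"a", "b"})

# acting [1,5] is entirely in zone "b": must NOT go active while
# zone "a" survives -- yet the buggy peering code chose it, hence
# the incomplete PGs reported in this bug.
assert not may_go_active([1, 5], {"a", "b"})
```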
If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041