Description of problem (please be as detailed as possible and provide log snippets):

On a 9-OSD OCS cluster with KMS encryption enabled (capacity was added twice), performed a node drain of 1 worker node. Ceph health went to HEALTH_ERR:

  cluster:
    id:     24c2567d-5e14-4449-8924-7e5c09986569
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent

  services:
    mon: 3 daemons, quorum a,c,h (age 46m)
    mgr: a(active, since 21h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 9 osds: 9 up (since 41m), 9 in (since 19h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 272 pgs
    objects: 444 objects, 412 MiB
    usage:   11 GiB used, 4.5 TiB / 4.5 TiB avail
    pgs:     271 active+clean
             1   active+clean+inconsistent

  io:
    client: 7.6 KiB/s rd, 19 KiB/s wr, 8 op/s rd, 5 op/s wr

========================
Node status post node drain:

NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   44h   v1.20.0+5fbfd19
compute-1         Ready    worker   44h   v1.20.0+5fbfd19
compute-2         Ready    worker   44h   v1.20.0+5fbfd19
control-plane-0   Ready    master   44h   v1.20.0+5fbfd19
control-plane-1   Ready    master   45h   v1.20.0+5fbfd19
control-plane-2   Ready    master   45h   v1.20.0+5fbfd19

Version of all relevant components (if applicable):
OCP 4.7.0-0.nightly-2021-03-01-085007
ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
rook: 4.7-102.a0622de60.release_4.7
ocs-operator.v4.7.0-277.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
3

Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. On a new OCS cluster, perform add capacity, run IOs (Noobaa OBC create/delete), workloads (pgsql, couchbase), and node operations (node reboot, node drain, node network failure).
2. After performing add capacity twice, drain a worker node again. The drain took around 10 mins; the node was later recovered with *oc adm uncordon* (a sketch of the drain/uncordon sequence is shown below).
3. A couple of OSDs were re-spun after the node drain (later all OSDs were up and running), and Ceph health went from HEALTH_WARN to HEALTH_ERR.

Actual results:
Ceph health is in ERR state with:
  1 scrub errors
  Possible data damage: 1 pg inconsistent

Expected results:
Post node drain, Ceph health should have been HEALTH_OK rather than HEALTH_ERR.

Additional info:
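For reference, a minimal sketch of the drain/uncordon sequence and the health check. The node name (compute-1) and the rook-ceph-tools toolbox pod are assumptions; the exact flags may differ from what was actually run:

# Drain one worker node (node name is a placeholder)
$ oc adm drain compute-1 --ignore-daemonsets --delete-local-data --force

# ...wait for pods to reschedule, then bring the node back
$ oc adm uncordon compute-1

# Check Ceph health from the toolbox pod (assumes the OCS toolbox is deployed)
$ TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
$ oc -n openshift-storage rsh $TOOLS_POD ceph status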
This sounds related to the KMS keys not being loaded properly on restart, but I thought it was already fixed by https://github.com/rook/rook/pull/7240 a couple weeks ago. Seb PTAL
The error is clearly indicated in the OSD deployment logs: ['error performing token check: Vault is sealed']. Please fix your setup.
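If the KMS is the culprit, the Vault state can be confirmed directly with the Vault CLI. A minimal sketch, assuming VAULT_ADDR and VAULT_TOKEN point at the KMS instance configured for this cluster (placeholder values):

# Shows "Sealed: true" when the Vault instance still needs to be unsealed
$ vault status

# Unseal by providing unseal keys until the threshold is reached
$ vault operator unseal <unseal-key-1>
$ vault operator unseal <unseal-key-2>
$ vault operator unseal <unseal-key-3>

Once Vault is unsealed, the encrypted OSDs should be able to fetch their keys again on restart.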
The wrong kubeconfig was shared offline, so I'm re-opening. After looking at the logs, one PG state is active+clean+inconsistent. I ran "ceph pg deep-scrub 1.50" and then instructed Ceph to repair it with "ceph pg repair 1.50"; Ceph health is now OK.

I believe Ceph would have eventually repaired the PG during the next deep-scrub. Auto-repair works well on BlueStore. Josh/Neha, please confirm.

Thanks.
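For reference, the sequence used, run from the toolbox. PG 1.50 is the inconsistent PG on this cluster; the list-inconsistent-obj step is an optional inspection I am adding here, not something that was required:

# Identify the inconsistent PG
$ ceph health detail

# Optionally inspect which objects/shards are inconsistent
$ rados list-inconsistent-obj 1.50 --format=json-pretty

# Re-run a deep scrub, then repair the PG
$ ceph pg deep-scrub 1.50
$ ceph pg repair 1.50

# Confirm health returns to HEALTH_OK
$ ceph health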
(In reply to Sébastien Han from comment #7)
> The wrong kubeconfig was shared offline, so I'm re-opening. After looking at
> the logs, one PG state is active+clean+inconsistent. I ran "ceph pg deep-scrub
> 1.50" and then instructed Ceph to repair it with "ceph pg repair 1.50";
> Ceph health is now OK.
>
> I believe Ceph would have eventually repaired the PG during the next
> deep-scrub. Auto-repair works well on BlueStore. Josh/Neha, please confirm.
>
> Thanks.

This is true when osd_scrub_auto_repair is enabled; it repairs up to osd_scrub_auto_repair_num_errors errors.
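For reference, both options can be inspected and toggled at runtime through the mon config store. A sketch only; nothing was changed on this cluster:

# Check the current values (defaults in Nautilus: false and 5 respectively)
$ ceph config get osd osd_scrub_auto_repair
$ ceph config get osd osd_scrub_auto_repair_num_errors

# Enable automatic repair during scrubs
$ ceph config set osd osd_scrub_auto_repair true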
Thanks. As far as I can tell, osd_scrub_auto_repair is disabled by default; is it advisable to enable it for OCS by default?
(In reply to Sébastien Han from comment #9)
> Thanks. As far as I can tell, osd_scrub_auto_repair is disabled by default;
> is it advisable to enable it for OCS by default?

I think so - it is advisable to enable it in a cluster which has only BlueStore OSDs.
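To confirm a cluster is BlueStore-only before enabling it, something like the following can be run from the toolbox (a sketch):

# Counts OSDs per objectstore backend; only "bluestore" should appear
$ ceph osd count-metadata osd_objectstore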
Neha,

Can't we auto-detect that from the OSD startup and set osd_scrub_auto_repair to true?

Rook can force-enable it in the meantime.
(In reply to Sébastien Han from comment #11)
> Neha,
>
> Can't we auto-detect that from the OSD startup and set osd_scrub_auto_repair
> to true?

We are considering enabling it by default in the next release, so the extra complexity is not worth it.

> Rook can force-enable it in the meantime.

Sure.
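One way Rook-based deployments commonly force Ceph settings is the rook-config-override ConfigMap. A hedged sketch for OCS (the openshift-storage namespace and this mechanism are assumptions; the eventual fix may instead set the option through the mon config store in Rook's code):

$ oc -n openshift-storage apply -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-config-override
  namespace: openshift-storage
data:
  config: |
    [osd]
    osd_scrub_auto_repair = true
EOF

Settings from this ConfigMap are rendered into ceph.conf, so the OSD daemons need a restart to pick them up.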
I guess we are fixing this in Rook; please reply if that is not correct.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2041