Bug 1934990

Summary:	Ceph health ERR post node drain on KMS encryption enabled cluster
Product:	[Red Hat Storage] Red Hat OpenShift Container Storage	Reporter:	Persona non grata <nobody+410372>
Component:	rook	Assignee:	Sébastien Han <shan>
Status:	CLOSED ERRATA	QA Contact:	Persona non grata <nobody+410372>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	4.7	CC:	bniver, jthottan, madam, muagarwa, nberry, nojha, ocs-bugs, owasserm, shan
Target Milestone:	---	Keywords:	AutomationBackLog, Reopened
Target Release:	OCS 4.7.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-05-19 09:20:01 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Persona non grata 2021-03-04 07:17:47 UTC

Description of problem (please be detailed as possible and provide log
snippests):
On a 9 osd ocs with kms encryption enabled cluster ( did add capacity twice), performed node drain of 1 worker node.

Ceph health went to HEALTH_ERR

  cluster:
    id:     24c2567d-5e14-4449-8924-7e5c09986569
    health: HEALTH_ERR
            1 scrub errors
            Possible data damage: 1 pg inconsistent
 
  services:
    mon: 3 daemons, quorum a,c,h (age 46m)
    mgr: a(active, since 21h)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 9 osds: 9 up (since 41m), 9 in (since 19h)
    rgw: 1 daemon active (ocs.storagecluster.cephobjectstore.a)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 272 pgs
    objects: 444 objects, 412 MiB
    usage:   11 GiB used, 4.5 TiB / 4.5 TiB avail
    pgs:     271 active+clean
             1   active+clean+inconsistent
 
  io:
    client:   7.6 KiB/s rd, 19 KiB/s wr, 8 op/s rd, 5 op/s wr
 
========================
Node status post node drain

NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   44h   v1.20.0+5fbfd19
compute-1         Ready    worker   44h   v1.20.0+5fbfd19
compute-2         Ready    worker   44h   v1.20.0+5fbfd19
control-plane-0   Ready    master   44h   v1.20.0+5fbfd19
control-plane-1   Ready    master   45h   v1.20.0+5fbfd19
control-plane-2   Ready    master   45h   v1.20.0+5fbfd19

 
Version of all relevant components (if applicable):
OCP 4.7.0-0.nightly-2021-03-01-085007
ceph version 14.2.11-95.el8cp (1d6087ae858e7c8e72fe7390c3522c7e0d951240) nautilus (stable)
rook: 4.7-102.a0622de60.release_4.7
ocs-operator.v4.7.0-277.ci


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
3

Can this issue reproducible?
1/1

Can this issue reproduce from the UI?
NA

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. With new OCS cluster, performed add capacity, IOs like nooba obc create/delete, workloads like pgsql, couchbase and node operations like node reboot, node drain, node network failure.
2. After doing add capacity twice, did node drain again, worker node took around 10 mins to drain, later node recovered with *oc adm uncordon* cmd.
2. Couple of OSDs keep re-spun post node drain (later all osds were up and running), Ceph health went from WARN to ERR


Actual results:
Ceph health is in ERR state with 1 scrub errors
            Possible data damage: 1 pg inconsistent

Expected results:
Post node drain, ceph health should have been HEALTH OK than HEALTH_ERR


Additional info:

Comment 5 Travis Nielsen 2021-03-04 20:40:57 UTC

This sounds related to the KMS keys not being loaded properly on restart, but I thought it was already fixed by https://github.com/rook/rook/pull/7240 a couple weeks ago. 
Seb PTAL

Comment 6 Sébastien Han 2021-03-08 09:00:02 UTC

The error is clearly indicated in the osd deployment logs:

['error performing token check: Vault is sealed']%

Please fix your setup.

Comment 7 Sébastien Han 2021-03-08 09:42:11 UTC

The wrong kubeconfig was shared offline, so I'm re-opening, after looking at the logs, one PR state is active+clean+inconsistent.
I ran "ceph pg deep-scrub 1.50" and then instructed Ceph to repair it with "ceph pg repair 1.50", now Ceph health is ok.

I believe Ceph would have eventually repaired the PG during the next deep-scrub. Auto-repair works well on Bluestore.
Josh/Neha for confirmation.

Thanks.

Comment 8 Neha Ojha 2021-03-08 19:51:50 UTC

(In reply to Sébastien Han from comment #7)
> The wrong kubeconfig was shared offline, so I'm re-opening, after looking at
> the logs, one PR state is active+clean+inconsistent.
> I ran "ceph pg deep-scrub 1.50" and then instructed Ceph to repair it with
> "ceph pg repair 1.50", now Ceph health is ok.
> 
> I believe Ceph would have eventually repaired the PG during the next
> deep-scrub. Auto-repair works well on Bluestore.
> Josh/Neha for confirmation.
> 
> Thanks.

This is true when osd_scrub_auto_repair is enabled and it repairs up to osd_scrub_auto_repair_num_errors errors.

Comment 9 Sébastien Han 2021-03-09 08:23:28 UTC

Thanks, as far as I can tell osd_scrub_auto_repair is disabled by default, is it advisable to enable it for OCS by default?

Comment 10 Neha Ojha 2021-03-09 23:45:42 UTC

(In reply to Sébastien Han from comment #9)
> Thanks, as far as I can tell osd_scrub_auto_repair is disabled by default,
> is it advisable to enable it for OCS by default?

I think so - it is advisable to enable it in a cluster which has only BlueStore OSDs.

Comment 11 Sébastien Han 2021-03-11 14:48:33 UTC

Neha,

Can't we auto-detect that from the OSD startup and set osd_scrub_auto_repair to true?
Rook can force enable it in the meantime.

Comment 12 Neha Ojha 2021-03-11 21:39:35 UTC

(In reply to Sébastien Han from comment #11)
> Neha,
> 
> Can't we auto-detect that from the OSD startup and set osd_scrub_auto_repair
> to true?

We are considering enabling it by default in the next release, so not worth the extra complexity.

> Rook can force enable it in the meantime.

sure

Comment 13 Mudit Agarwal 2021-03-16 06:53:58 UTC

I guess we are fixing this in rook, please revert back if that is not correct.

Comment 21 errata-xmlrpc 2021-05-19 09:20:01 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2041