Bug 2091623
| Summary: | [MS Tracker] ceph status is in Warning after provider add-on upgrade from v2.0.1 to v2.0.2 | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | suchita <sgatfane> | |
| Component: | odf-managed-service | Assignee: | Nobody <nobody> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Neha Berry <nberry> | |
| Severity: | unspecified | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.10 | CC: | aeyal, dbindra, fbalak, ocs-bugs, odf-bz-bot, rchikatw, sapillai, ykukreja | |
| Target Milestone: | --- | |||
| Target Release: | --- | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2099212 (view as bug list) | Environment: | ||
| Last Closed: | 2023-03-13 11:58:44 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2099212 | |||
Possible root cause:
```
2022-05-30 11:38:45.089458 I | clusterdisruption-controller: all "zone" failure domains: [us-east-1a us-east-1b us-east-1c]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:1305} {StateName:active+recovery_wait+undersized+degraded+remapped Count:6} {StateName:active+recovering+undersized+remapped Count:2}]"
2022-05-30 11:39:16.400384 I | clusterdisruption-controller: all PGs are active+clean. Restoring default OSD pdb settings
2022-05-30 11:39:16.400402 I | clusterdisruption-controller: creating the default pdb "rook-ceph-osd" with maxUnavailable=1 for all osd
2022-05-30 11:39:16.431569 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-zone-us-east-1a" with maxUnavailable=0 for "zone" failure domain "us-east-1a"
2022-05-30 11:39:16.437856 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-zone-us-east-1b" with maxUnavailable=0 for "zone" failure domain "us-east-1b"
2022-05-30 11:39:16.442985 I | clusterdisruption-controller: deleting temporary blocking pdb with "rook-ceph-osd-zone-us-east-1c" with maxUnavailable=0 for "zone" failure domain "us-east-1c"
2022-05-30 11:39:16.454078 I | clusterdisruption-controller: reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0
2022-05-30 11:39:47.696835 I | clusterdisruption-controller: all "zone" failure domains: [us-east-1a us-east-1b us-east-1c]. osd is down in failure domain: "". active node drains: false. pg health: "cluster is not fully clean. PGs: [{StateName:active+clean Count:1312} {StateName:active+recovering+remapped Count:1}]"
```
`11:39:16.400384` suggests that all the PGs were active+clean, so the default PDB with `maxUnavailable=1` (allowed disruptions = 1) was created and the temporary blocking PDBs were removed.
But right after that, `11:39:16.454078` suggests that `AllowedDisruptions` in the OSD PDB resource is 0.
So either:
1. The PDB took some time to update the `AllowedDisruptions` value back to 1, or
2. An OSD went down temporarily for a very short duration.
IMO, `1` is more likely than `2`.
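If `1` is the cause, the delay should be observable directly on the PDB object. A minimal way to watch it (a sketch, assuming the `openshift-storage` namespace and the default PDB name `rook-ceph-osd` from the log above):
```
# Watch the default OSD PDB; ALLOWED DISRUPTIONS briefly staying at 0 after the
# PGs report active+clean would support hypothesis 1.
oc get pdb rook-ceph-osd -n openshift-storage -w

# Or query the status field directly:
oc get pdb rook-ceph-osd -n openshift-storage -o jsonpath='{.status.disruptionsAllowed}{"\n"}'
```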
As a result, the controller reconciled again. `11:39:47.696835` suggests that PGs were not active+clean during this reconcile, so it added the `noout` flag on the failure domain `us-east-1a`:
```
ceph osd dump -f json
"crush_node_flags":{"us-east-1a":["noout"]},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
```
As a result of this flag, we are seeing this warning message in the ceph status.
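For reference, the flagged failure domains can be pulled straight out of that dump; a quick sketch (assuming `jq` is available wherever the command is run, e.g. on a workstation piping the toolbox output):
```
# List only the failure domains that currently carry the "noout" flag.
ceph osd dump -f json | jq -r '.crush_node_flags
  | to_entries[]
  | select(.value | index("noout"))
  | .key'
# Prints "us-east-1a" for the output above.
```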
Workaround:
Manually unset the flag:
`ceph osd unset noout us-east-1a`
This looks like a negative case and won't be reproducible every time, so it is not a blocker and is lower priority, IMO.
But it should be handled in Rook.
(In reply to Santosh Pillai from comment #2)
> Workaround:
> Manually unset the flag:
> `ceph osd unset noout us-east-1a`

The correct command is `ceph osd unset-group noout us-east-1a`.

> This looks like negative case and won't be reproducible every time. So not a blocker and lower priority, IMO.
> But should be handled in rook.

Seems like a candidate for an SOP @ykukreja

Hi Sahina,
I agree that this would be worth tracking via an SOP.
Though right now, it seems very detailed and jargon-heavy from my point of view. Therefore, I'd suggest that someone from the ODF team be nominated to draft the SOP, considering the intricacies and granular product-level details associated with this problem (just like all the other product-level ODF SOPs written in the past).
Nothing too descriptive. Just the following items:
- Trigger of the problem/alert (CephClusterWarningState, it seems)
- How to recognize/confirm this problem - the output of `oc rsh -n openshift-storage $(oc get pods -o wide -n openshift-storage | grep tool | awk '{print $1}') ceph status`, I guess
- Troubleshooting - I have a few doubts around this: Should the `ceph osd unset-group` command be executed for each failure domain or just one? And how to recognize those failure domains? (A rough sketch follows this list.)
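For what it's worth, a rough, untested sketch of what the troubleshooting step could look like; the toolbox pod selector (`rook-ceph-tools`), the `openshift-storage` namespace, and locally available `jq` are assumptions, so whoever drafts the SOP should verify the details:
```
#!/usr/bin/env bash
set -euo pipefail

NS=openshift-storage
# Assumption: the toolbox pod name contains "rook-ceph-tools".
TOOLS=$(oc -n "$NS" get pods -o name | grep rook-ceph-tools | head -n1)

# 1. Confirm the warning really comes from a CRUSH-node flag.
oc -n "$NS" rsh "$TOOLS" ceph health detail

# 2. Unset "noout" for every failure domain that still carries it.
for fd in $(oc -n "$NS" rsh "$TOOLS" ceph osd dump -f json \
              | jq -r '.crush_node_flags | to_entries[] | select(.value | index("noout")) | .key'); do
    oc -n "$NS" rsh "$TOOLS" ceph osd unset-group noout "$fd"
done

# 3. Verify the cluster returns to HEALTH_OK.
oc -n "$NS" rsh "$TOOLS" ceph status
```
Running `unset-group` once per failure domain keeps the loop simple; whether several failure domains can also be passed in a single invocation would need to be confirmed against the Ceph CLI.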
And considering that MTSRE would be the reader/customer of this SOP, I'll review it and check whether I understand it well enough to follow it properly if this problem occurs.
How does that sound?
(In reply to Yashvardhan Kukreja from comment #5)
> - Troubleshooting - I have a few doubts around this: Should the ceph osd
> unset-group command executed for each failure domain or just one failure
> domain? How the recognize those failure domains?

I guess I get this. First, fetch the details of the crush node flags:
```
ceph osd dump -f json
```
Under those flags, fetch the failure domain associated with `["noout"]`. For example, for the following output:
```
"crush_node_flags":{"us-east-1a":["noout"]},"device_class_flags":{},"stretch_mode":{"stretch_mode_enabled":false,"stretch_bucket_count":0,"degraded_stretch_mode":0,"recovering_stretch_mode":0,"stretch_mode_bucket":0}}
```
"us-east-1a" would be that failure domain.

Finally, execute the `ceph osd unset-group noout <failure-domain>` command for each of those failure domains one by one, OR should it be executed like `ceph osd unset-group noout <failure-domain-1> <failure-domain-2> <failure-domain-3>`?

Also, where would this ceph command be executed? Is it going to be in the provider cluster under the toolbox pod?

(In reply to Yashvardhan Kukreja from comment #6)
> Also, where would this ceph command be executed? Is it going to be in the
> provider cluster under the toolbox pod?

Yes, we executed it in the provider cluster toolbox pod to come out of the health warn state.

Moving to ON_QA as the tracker bug is in closed state.

Last upgrade path was done from deployer 2.0.9 to 2.0.10 without any issues. --> VERIFIED

(In reply to Filip Balák from comment #16)
> Last upgrade path was done from deployer 2.0.9 to 2.0.10 without any issues.
> --> VERIFIED

As per the comment, this issue is fixed. Please feel free to reopen if it reproduces. Thanks.
Description of problem:
We have 2 setups: appliance mode and appliance mode clusters with a private link. Both provider clusters got upgraded. However, one of the provider clusters' Ceph status is in a warning state; it seems the NOOUT flag is not removed.

Version-Release number of selected component (if applicable):
========CSV ======
```
NAME                                      DISPLAY                       VERSION           REPLACES                                  PHASE
mcg-operator.v4.10.2                      NooBaa Operator               4.10.2            mcg-operator.v4.10.1                      Succeeded
ocs-operator.v4.10.2                      OpenShift Container Storage   4.10.2            ocs-operator.v4.10.0                      Succeeded
ocs-osd-deployer.v2.0.2                   OCS OSD Deployer              2.0.2             ocs-osd-deployer.v2.0.1                   Succeeded
odf-csi-addons-operator.v4.10.2           CSI Addons                    4.10.2            odf-csi-addons-operator.v4.10.0           Succeeded
odf-operator.v4.10.2                      OpenShift Data Foundation     4.10.2            odf-operator.v4.10.0                      Succeeded
ose-prometheus-operator.4.10.0            Prometheus Operator           4.10.0            ose-prometheus-operator.4.8.0             Succeeded
route-monitor-operator.v0.1.418-6459408   Route Monitor Operator        0.1.418-6459408   route-monitor-operator.v0.1.408-c2256a2   Succeeded
```
--------------
```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.14   True        False         8h      Error while reconciling 4.10.14: the cluster operator insights is degraded
```

How reproducible:
1/2

Steps to Reproduce:
1. Create an appliance provider cluster with 2 consumers
2. Upgrade the ODF deployer version
3.

Actual results:
=====ceph status ====
```
Mon May 30 02:14:34 PM UTC 2022
  cluster:
    id:     117ddfde-1253-49f8-8709-9a097124651e
    health: HEALTH_WARN
            1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set

  services:
    mon: 3 daemons, quorum a,b,c (age 8h)
    mgr: a(active, since 8h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 15 osds: 15 up (since 3h), 15 in (since 8h)

  data:
    volumes: 1/1 healthy
    pools:   9 pools, 1313 pgs
    objects: 69.54k objects, 270 GiB
    usage:   787 GiB used, 59 TiB / 60 TiB avail
    pgs:     1313 active+clean

  io:
    client:   153 KiB/s rd, 167 KiB/s wr, 39 op/s rd, 38 op/s wr
```

Expected results:
Ceph health should be OK

Additional info:
--------------Few OC output ---------------------
======= storagecluster ==========
```
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   8h    Ready              2022-05-30T06:03:29Z
```
--------------
======= cephcluster ==========
```
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE   PHASE   MESSAGE                        HEALTH        EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          8h    Ready   Cluster created successfully   HEALTH_WARN
```
======= cluster health status=====
```
HEALTH_WARN 1 OSDs or CRUSH {nodes, device-classes} have {NOUP,NODOWN,NOIN,NOOUT} flags set
Mon May 30 02:14:15 PM UTC 2022
```
======ceph osd tree ===
```
ID   CLASS  WEIGHT    TYPE NAME                            STATUS  REWEIGHT  PRI-AFF
 -1         60.00000  root default
 -5         60.00000      region us-east-1
 -4         20.00000          zone us-east-1a
 -3          4.00000              host default-0-data-0tdh8b
  0    ssd   4.00000                  osd.0                    up   1.00000  1.00000
-39          4.00000              host default-1-data-0ps629
  5    ssd   4.00000                  osd.5                    up   1.00000  1.00000
-29          4.00000              host default-1-data-4t9tmm
  4    ssd   4.00000                  osd.4                    up   1.00000  1.00000
-31          4.00000              host default-2-data-0fr967
  3    ssd   4.00000                  osd.3                    up   1.00000  1.00000
-25          4.00000              host default-2-data-1qkmdw
  2    ssd   4.00000                  osd.2                    up   1.00000  1.00000
-10         20.00000          zone us-east-1b
 -9          4.00000              host default-0-data-1z5w9r
  1    ssd   4.00000                  osd.1                    up   1.00000  1.00000
-35          4.00000              host default-0-data-28rhlr
  6    ssd   4.00000                  osd.6                    up   1.00000  1.00000
-37          4.00000              host default-0-data-3pc46c
  8    ssd   4.00000                  osd.8                    up   1.00000  1.00000
-27          4.00000              host default-1-data-1p4k2t
  9    ssd   4.00000                  osd.9                    up   1.00000  1.00000
-33          4.00000              host default-2-data-4d6pw2
  7    ssd   4.00000                  osd.7                    up   1.00000  1.00000
-14         20.00000          zone us-east-1c
-17          4.00000              host default-0-data-4hj5gs
 12    ssd   4.00000                  osd.12                   up   1.00000  1.00000
-13          4.00000              host default-1-data-25brsh
 14    ssd   4.00000                  osd.14                   up   1.00000  1.00000
-19          4.00000              host default-1-data-39gvhz
 11    ssd   4.00000                  osd.11                   up   1.00000  1.00000
-21          4.00000              host default-2-data-2pdzz6
 10    ssd   4.00000                  osd.10                   up   1.00000  1.00000
-23          4.00000              host default-2-data-3s5955
 13    ssd   4.00000                  osd.13                   up   1.00000  1.00000
```
=========ceph versions========
```
{
    "mon": {
        "ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)": 15
    },
    "mds": {
        "ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)": 21
    }
}
```
=========rados df=====
```
POOL_NAME                                                            USED     OBJECTS  CLONES  COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD       WR_OPS  WR       USED COMPR  UNDER COMPR
cephblockpool-storageconsumer-14d18631-37ea-4fbe-baca-f102ea34c6cf   47 KiB   5        0       15      0                   0        0         158     133 KiB  693008  11 GiB   0 B         0 B
cephblockpool-storageconsumer-53ca4e1e-e945-422e-8333-9dacf8c4029d   12 KiB   1        0       3       0                   0        0         0       0 B      0       0 B      0 B         0 B
cephblockpool-storageconsumer-6a8c7284-bb38-4f25-bd68-9e20f23773df   12 KiB   1        0       3       0                   0        0         0       0 B      0       0 B      0 B         0 B
cephblockpool-storageconsumer-a2233267-d7f3-449b-b33c-f9ed1e75f1d5   411 GiB  38410    0       115230  0                   0        0         189887  731 MiB  239417  1.0 GiB  0 B         0 B
cephblockpool-storageconsumer-ddc412aa-6102-4292-9e2f-05ce45c5ea68   12 KiB   1        0       3       0                   0        0         0       0 B      0       0 B      0 B         0 B
device_health_metrics                                                0 B      0        0       0       0                   0        0         0       0 B      0       0 B      0 B         0 B
ocs-storagecluster-cephblockpool                                     12 KiB   1        0       3       0                   0        0         0       0 B      0       0 B      0 B         0 B
ocs-storagecluster-cephfilesystem-data0                              364 GiB  31066    0       93198   0                   0        0         135215  528 MiB  133747  522 MiB  0 B         0 B
ocs-storagecluster-cephfilesystem-metadata                           155 MiB  55       0       165     0                   0        0         65373   87 MiB   16168   63 MiB   0 B         0 B

total_objects    69540
total_used       787 GiB
total_avail      59 TiB
total_space      60 TiB
```
=========ceph df=====
```
--- RAW STORAGE ---
CLASS  SIZE    AVAIL   USED     RAW USED  %RAW USED
ssd    60 TiB  59 TiB  787 GiB  787 GiB   1.28
TOTAL  60 TiB  59 TiB  787 GiB  787 GiB   1.28

--- POOLS ---
POOL                                                                 ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics                                                1   1    0 B      0        0 B      0      17 TiB
ocs-storagecluster-cephblockpool                                     2   128  19 B     1        12 KiB   0      17 TiB
ocs-storagecluster-cephfilesystem-metadata                           3   32   51 MiB   55       155 MiB  0      17 TiB
ocs-storagecluster-cephfilesystem-data0                              4   512  121 GiB  31.07k   364 GiB  0.70   17 TiB
cephblockpool-storageconsumer-6a8c7284-bb38-4f25-bd68-9e20f23773df   5   128  19 B     1        12 KiB   0      17 TiB
cephblockpool-storageconsumer-ddc412aa-6102-4292-9e2f-05ce45c5ea68   6   128  19 B     1        12 KiB   0      17 TiB
cephblockpool-storageconsumer-53ca4e1e-e945-422e-8333-9dacf8c4029d   7   128  19 B     1        12 KiB   0      17 TiB
cephblockpool-storageconsumer-a2233267-d7f3-449b-b33c-f9ed1e75f1d5   8   128  137 GiB  38.41k   411 GiB  0.79   17 TiB
cephblockpool-storageconsumer-14d18631-37ea-4fbe-baca-f102ea34c6cf   9   128  12 KiB   5        47 KiB   0      17 TiB
```
=========ceph osd df=====
```
ID  CLASS  WEIGHT   REWEIGHT  SIZE    RAW USE  DATA     OMAP     META     AVAIL    %USE  VAR   PGS  STATUS
 0  ssd    4.00000  1.00000   4 TiB   48 GiB   47 GiB   11 KiB   1.2 GiB  4.0 TiB  1.17  0.92  267  up
 5  ssd    4.00000  1.00000   4 TiB   58 GiB   57 GiB   14 KiB   780 MiB  3.9 TiB  1.41  1.10  271  up
 4  ssd    4.00000  1.00000   4 TiB   46 GiB   46 GiB   16 KiB   345 MiB  4.0 TiB  1.13  0.88  246  up
 3  ssd    4.00000  1.00000   4 TiB   57 GiB   57 GiB   21 KiB   751 MiB  3.9 TiB  1.40  1.10  254  up
 2  ssd    4.00000  1.00000   4 TiB   53 GiB   52 GiB   22 KiB   983 MiB  3.9 TiB  1.29  1.01  275  up
 1  ssd    4.00000  1.00000   4 TiB   54 GiB   53 GiB   16 KiB   599 MiB  3.9 TiB  1.32  1.03  264  up
 6  ssd    4.00000  1.00000   4 TiB   52 GiB   51 GiB   13 KiB   935 MiB  3.9 TiB  1.26  0.98  260  up
 8  ssd    4.00000  1.00000   4 TiB   50 GiB   49 GiB   17 KiB   1.0 GiB  4.0 TiB  1.23  0.96  256  up
 9  ssd    4.00000  1.00000   4 TiB   53 GiB   52 GiB   16 KiB   742 MiB  3.9 TiB  1.29  1.01  257  up
 7  ssd    4.00000  1.00000   4 TiB   54 GiB   53 GiB   23 KiB   584 MiB  3.9 TiB  1.32  1.03  276  up
12  ssd    4.00000  1.00000   4 TiB   50 GiB   49 GiB   19 KiB   667 MiB  4.0 TiB  1.22  0.95  267  up
14  ssd    4.00000  1.00000   4 TiB   55 GiB   55 GiB   12 KiB   577 MiB  3.9 TiB  1.35  1.05  265  up
11  ssd    4.00000  1.00000   4 TiB   57 GiB   56 GiB   12 KiB   779 MiB  3.9 TiB  1.39  1.08  264  up
10  ssd    4.00000  1.00000   4 TiB   52 GiB   51 GiB   21 KiB   812 MiB  3.9 TiB  1.26  0.98  261  up
13  ssd    4.00000  1.00000   4 TiB   49 GiB   48 GiB   17 KiB   903 MiB  4.0 TiB  1.19  0.93  256  up
                      TOTAL   60 TiB  787 GiB  776 GiB  256 KiB  11 GiB   59 TiB   1.28
MIN/MAX VAR: 0.88/1.10  STDDEV: 0.08
```
====ceph fs status===
```
ocs-storagecluster-cephfilesystem - 20 clients
=================================
RANK  STATE           MDS                                   ACTIVITY     DNS    INOS   DIRS   CAPS
 0    active          ocs-storagecluster-cephfilesystem-a   Reqs: 0 /s   156    159    58     100
 0-s  standby-replay  ocs-storagecluster-cephfilesystem-b   Evts: 0 /s   146    149    48     0
POOL                                         TYPE      USED   AVAIL
ocs-storagecluster-cephfilesystem-metadata   metadata  154M   16.7T
ocs-storagecluster-cephfilesystem-data0      data      363G   16.7T
MDS version: ceph version 16.2.7-98.el8cp (b20d33c3b301e005bed203d3cad7245da3549f80) pacific (stable)
```