Bug 2050057 - ODF4.10 : 1 osd down after stopping all nodes on provider cluster on odf to odf setup [NEEDINFO]
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Neha Ojha
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-03 06:30 UTC by suchita
Modified: 2023-08-09 16:37 UTC (History)
17 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-03 02:36:58 UTC
Embargoed:
vumrao: needinfo? (jijoy)
muagarwa: needinfo? (nberry)
tnielsen: needinfo? (sgatfane)
sheggodu: needinfo? (nojha)



Description suchita 2022-02-03 06:30:42 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
Created an ODF 4.10 setup on a ROSA cluster for the ODF-to-ODF (provider-consumer) configuration, so the provider ODF setup has host networking enabled.
When I stopped all the AWS instances of the provider cluster, they automatically came back to the running state within a few minutes. Initially I saw 3 OSDs up, and after some time one of the OSDs went down.



Version of all relevant components (if applicable):
OpenShift version:    4.9.15
ceph version 16.2.7-35.el8cp (51d904cb9b9eb82f2c11b4cf5252ab3f3ff0d6b4) pacific (stable)
OCS - 4.10.0-122



Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No; it is manageable with a workaround.

Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes, reproduced in 2 out of 2 attempts.

Can this issue be reproduced from the UI?
No


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install ROSA cluster. 
2. Install ODF operator ODF4.10-122
3. Add the below 4 inbound rules to the cluster's worker security group, from the AWS console -> Security Groups -> <worker security group ID> -> Edit inbound rules -> Add rules (see the CLI sketch after these steps):
Type           Port range     Source        Description
Custom TCP     6789           10.0.0.16     ceph mon v1
Custom TCP     3300           10.0.0.16     ceph mon v2
Custom TCP     6800-7300      10.0.0.16     osd
Custom TCP     9283           10.0.0.16     ceph manager

4. While creating the storage cluster, ensure host networking is enabled in the spec:
   spec:
     hostNetwork: true

5. Create an external ROSA consumer cluster.

6. Verify all pods and OSDs are running and up, and that ceph health is OK.
7. Stop all instances of the provider cluster from AWS (CLI/console).
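
For reference, a rough CLI equivalent of step 3. This is only a sketch: sg-XXXX stands for the worker security group ID, 10.0.0.16/32 mirrors the source shown in the table above, and aws ec2 authorize-security-group-ingress is the standard AWS CLI command for adding inbound rules.

$ aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 6789      --cidr 10.0.0.16/32   # ceph mon v1
$ aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 3300      --cidr 10.0.0.16/32   # ceph mon v2
$ aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 6800-7300 --cidr 10.0.0.16/32   # osd
$ aws ec2 authorize-security-group-ingress --group-id sg-XXXX --protocol tcp --port 9283      --cidr 10.0.0.16/32   # ceph manager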



Actual results:

1. All nodes go to NotReady status and, within a few minutes, return to Ready.

2. All AWS instances are in the running state.
3. During recovery the OSDs come up.

------------------------ ceph -s ------------------------
  cluster:
    id:     458b0172-00d8-4caa-84a7-5bfd259f18bd
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            Reduced data availability: 116 pgs inactive, 90 pgs peering
            6 slow ops, oldest one blocked for 781 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
 
  services:
    mon: 3 daemons, quorum a,b,c (age 13m)
    mgr: a(active, since 13m)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 13m), 3 in (since 21h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 129 pgs
    objects: 656 objects, 2.1 GiB
    usage:   6.4 GiB used, 3.0 TiB / 3 TiB avail
    pgs:     89.922% pgs not active
             90 peering
             26 activating
             13 active+clean

---------------------------------------------------
4. After some time, 1 OSD is down permanently.
------------ceph -s---------------------------

cluster:
    id:     458b0172-00d8-4caa-84a7-5bfd259f18bd
    health: HEALTH_WARN
            1 osds down
            1 host (1 osds) down
            1 zone (1 osds) down
            Degraded data redundancy: 656/1968 objects degraded (33.333%), 83 pgs degraded, 129 pgs undersized
 
  services:
    mon: 3 daemons, quorum a,b,c (age 35m)
    mgr: a(active, since 35m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 5m), 3 in (since 21h)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 129 pgs
    objects: 656 objects, 2.1 GiB
    usage:   6.4 GiB used, 3.0 TiB / 3 TiB avail
    pgs:     656/1968 objects degraded (33.333%)
             83 active+undersized+degraded
             46 active+undersized
 
  io:
    client:   853 B/s rd, 1 op/s rd, 0 op/s
------------------------------------------------------

Additional info:
$ ceph osd tree
ID   CLASS  WEIGHT   TYPE NAME                                     STATUS  REWEIGHT  PRI-AFF
 -1         3.00000  root default                                                           
 -5         3.00000      region us-east-1                                                   
-10         1.00000          zone us-east-1a                                                
 -9         1.00000              host ocs-deviceset-0-data-0qccqd                           
  1    ssd  1.00000                  osd.1                             up   1.00000  1.00000
-14         1.00000          zone us-east-1b                                                
-13         1.00000              host ocs-deviceset-2-data-0qd5cc                           
  2    ssd  1.00000                  osd.2                             up   1.00000  1.00000
 -4         1.00000          zone us-east-1c                                                
 -3         1.00000              host ocs-deviceset-1-data-0xxpjp                           
  0    ssd  1.00000                  osd.0                           down   1.00000  1.00000

Some command o/p here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/provider/
OCP must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocp_must_gather_p/
OCS Must Gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/
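
For anyone re-collecting the data above, a minimal sketch of the must-gather commands; the OCS must-gather image tag is a placeholder and should be matched to the installed ODF build (the directory names above suggest quay.io/rhceph-dev/ocs-must-gather was used here):

$ oc adm must-gather --dest-dir=ocp_must_gather_p                                                  # OCP must-gather
$ oc adm must-gather --image=quay.io/rhceph-dev/ocs-must-gather:<tag> --dest-dir=ocs_must_gather   # OCS/ODF must-gather; <tag> is a placeholder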

Comment 2 Travis Nielsen 2022-02-03 18:14:45 UTC
The osd.0 log [1] shows that it is running, and the OSD pod description [2] shows that the pod is running and passing its liveness probe. I don't see any logging that indicates the issue. Perhaps there was some issue with the AWS networking that wasn't restored after the VMs came back online and is preventing the OSD from communicating, so the other OSDs are marking it down.

Suchita 
- Does this repro consistently? It really seems like an AWS environment issue.
- Was this just a test scenario? Or why were the AWS instances stopped?

Neha, any other clues on where to look for the cause of the OSD being marked down?


[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-osd-0-784999fdff-4tkwx/osd/osd/logs/current.log

[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-osd-0-784999fdff-4tkwx/rook-ceph-osd-0-784999fdff-4tkwx.yaml
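
As a quick way to re-check what is described above (pod running, liveness probe passing) directly on the provider cluster, a hedged sketch assuming Rook's standard ceph-osd-id pod label and the openshift-storage namespace:

# Confirm the osd.0 pod is Running and look at its liveness probe definition and status
$ oc -n openshift-storage get pod -l ceph-osd-id=0 -o wide
$ oc -n openshift-storage describe pod -l ceph-osd-id=0 | grep -i -A3 liveness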

Comment 3 Neha Ojha 2022-02-03 23:56:42 UTC
From the mon logs [0], it looks like osd.0 stopped responding to beacons and the mon marked it down. This smells like a network issue.

2022-02-02T13:06:58.569247812Z debug 2022-02-02T13:06:58.568+0000 7f23cb41b700  0 log_channel(cluster) log [INF] : osd.0 marked down after no beacon for 900.407398 seconds
2022-02-02T13:06:58.569275923Z debug 2022-02-02T13:06:58.568+0000 7f23cb41b700 -1 mon.a@0(leader).osd e197 no beacon from osd.0 since 2022-02-02T12:51:58.161800+0000, 900.407398 seconds ago.  marking down

[0] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-mon-a-7546889649-dkt7h/mon/mon/logs/current.log
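
The 900-second figure in these messages matches the monitor's OSD beacon timeout. As a hedged sketch for confirming the relevant settings (mon_osd_report_timeout and osd_beacon_report_interval are the standard Ceph options; <toolbox-pod> is a placeholder for the rook-ceph-tools pod name):

$ oc -n openshift-storage rsh <toolbox-pod> ceph config get mon mon_osd_report_timeout       # mon marks an OSD down after this many seconds without a beacon (default 900)
$ oc -n openshift-storage rsh <toolbox-pod> ceph config get osd osd_beacon_report_interval   # how often OSDs send beacons (default 300)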

Comment 4 Sahina Bose 2022-02-04 06:24:25 UTC
@tnielsen Travis, this was a test scenario to see how ODF behaves with hostNetworking when one of the nodes get restarted. If this is a network issue, what should we be looking for?

Comment 5 Travis Nielsen 2022-02-04 15:25:41 UTC
Some thoughts:
- Does this repro consistently? Or was it a one-time issue? 
- Does restarting the VM one more time get it working again?
- Review the Ceph networking guide and dig into the network with someone who knows AWS networking: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
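
As a starting point for that network investigation, a rough sketch of probing the Ceph ports on the node hosting the down OSD from another worker. The node name and target IP are illustrative (taken from the listings later in this bug), and bash's built-in /dev/tcp is used so no extra tools are needed in the debug image:

$ oc debug node/ip-10-0-132-60.us-east-2.compute.internal
# OSD listening ports vary within 6800-7300; "ceph osd dump" shows the exact addresses each OSD binds to
sh-4.4# for p in 3300 6789 6800 6801 6802; do \
          (timeout 3 bash -c "</dev/tcp/10.0.204.163/$p") && echo "port $p open" || echo "port $p closed"; \
        done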

Comment 6 Yaniv Kaul 2022-02-07 07:18:31 UTC
(In reply to Sahina Bose from comment #4)
> @tnielsen Travis, this was a test scenario to see how ODF behaves
> with hostNetworking when one of the nodes get restarted. If this is a
> network issue, what should we be looking for?

The test step "Stop all instances of provider cluster from aws (CLI/console)" is not the same as one of the nodes getting restarted. It's a substantially less relevant test, IMHO.

Comment 7 suchita 2022-02-07 08:04:27 UTC
(In reply to Travis Nielsen from comment #5)
> Some thoughts:
> - Does this repro consistently? Or was it a one-time issue? 
> - Does restarting the VM one more time get it working again?
> - Review the Ceph networking guide and dig into the network with someone who
> knows AWS networking:
> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/


> - Does this repro consistently? Or was it a one-time issue?  
==> Yes, we tried 2 times and it was observed both times.

> - Does restarting the VM one more time get it working again? 
==> I will try this today and will update the observation here.

Comment 8 Orit Wasserman 2022-02-07 12:46:38 UTC
Neha,
From the logs, osd.0 seems to recover and is up and running.
Shouldn't the mons have detected it and marked it up again?
Are we missing a config setting?

Comment 9 suchita 2022-02-08 04:01:36 UTC
(In reply to suchita from comment #7)
> (In reply to Travis Nielsen from comment #5)
> > Some thoughts:
> > - Does this repro consistently? Or was it a one-time issue? 
> > - Does restarting the VM one more time get it working again?
> > - Review the Ceph networking guide and dig into the network with someone who
> > knows AWS networking:
> > https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
> 
> 
> > - Does this repro consistently? Or was it a one-time issue?  
> ==>  yes, we tried 2 times and both time it has been observed. 
> 
> > - Does restarting the VM one more time get it working again? 
> ==> I will try this today and will update the observation here.

For some reason I missed reproducing and checking this on the setup. I am doing it today and will try to keep the setup available for use.

Comment 10 suchita 2022-02-08 04:05:42 UTC
(In reply to Orit Wasserman from comment #8)
> Neha,
> For the logs OSD.0 seems to recover and is up and running.
> Should not the mons have detected it and marked it up again?
> Are we missing a config setting?

Which specific Config setting do we need?

Comment 11 Orit Wasserman 2022-02-08 07:48:16 UTC
(In reply to suchita from comment #10)
> (In reply to Orit Wasserman from comment #8)
> > Neha,
> > For the logs OSD.0 seems to recover and is up and running.
> > Should not the mons have detected it and marked it up again?
> > Are we missing a config setting?
> 
> Which specific Config setting do we need?

We are still investigating. I will update when we understand what is going on.

Comment 15 Jilju Joy 2022-02-09 21:24:06 UTC
I tried to reproduce this issue. In the first attempt, the OSD didn't go down after recovery. Ceph health was HEALTH_OK. I waited more than one hour to see whether any OSD would go down.

In the second attempt on the same provider, 3 OSDs were up for 29 minutes after all the nodes became Ready, but ceph health did not become HEALTH_OK. After 29 minutes one OSD went down.


This is the last ceph status when 3 OSDs were up (Thu Feb 10 01:36:38 AM IST 2022)

cluster:
    id:     ad79e075-d011-46a9-992c-a655b1689fed
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            Reduced data availability: 129 pgs inactive, 90 pgs peering
            14 slow ops, oldest one blocked for 1781 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
 
  services:
    mon: 3 daemons, quorum c,d,e (age 29m)
    mgr: a(active, since 29m)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 29m), 3 in (since 2d)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 129 pgs
    objects: 15.06k objects, 55 GiB
    usage:   78 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     100.000% pgs not active
             90 peering
             39 activating




After that, one OSD was marked down (Thu Feb 10 01:36:57 AM IST 2022)

cluster:
    id:     ad79e075-d011-46a9-992c-a655b1689fed
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            1 zone (1 osds) down
            Reduced data availability: 2 pgs inactive
            Degraded data redundancy: 15061/45183 objects degraded (33.333%), 95 pgs degraded
            6 slow ops, oldest one blocked for 1796 sec, daemons [osd.1,osd.2] have slow ops.
 
  services:
    mon: 3 daemons, quorum c,d,e (age 30m)
    mgr: a(active, since 30m)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 2 up (since 9s), 3 in (since 2d)
 
  data:
    volumes: 1/1 healthy
    pools:   5 pools, 129 pgs
    objects: 15.06k objects, 55 GiB
    usage:   78 GiB used, 2.9 TiB / 3 TiB avail
    pgs:     3.876% pgs not active
             15061/45183 objects degraded (33.333%)
             95 active+undersized+degraded
             29 active+undersized
             5  activating+undersized
 
  io:
    client:   11 MiB/s rd, 4 op/s rd, 0 op/s wr



OSD pods are running


NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
csi-addons-controller-manager-7b65c778df-784xl                    2/2     Running     5          4h27m   10.129.2.7     ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-2dnk8                                            3/3     Running     6          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
csi-cephfsplugin-ddpct                                            3/3     Running     6          2d9h    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
csi-cephfsplugin-l6s88                                            3/3     Running     6          26h     10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
csi-cephfsplugin-provisioner-6465d4c55-fb8p9                      6/6     Running     12         2d9h    10.131.0.26    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
csi-cephfsplugin-provisioner-6465d4c55-t2v66                      6/6     Running     16         4h27m   10.129.2.8     ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-7crdf                                               4/4     Running     8          2d9h    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
csi-rbdplugin-9h47g                                               4/4     Running     8          26h     10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-5d4f9f74d6-65jp5                        7/7     Running     18         4h27m   10.129.2.22    ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
csi-rbdplugin-provisioner-5d4f9f74d6-mxh2l                        7/7     Running     14         2d9h    10.131.0.22    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
csi-rbdplugin-vsjxx                                               4/4     Running     8          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
noobaa-operator-5d8bf7d5d8-97xxf                                  1/1     Running     2          4h27m   10.129.2.27    ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
ocs-metrics-exporter-684b49bfb4-rzss9                             1/1     Running     2          4h27m   10.129.2.3     ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
ocs-operator-65f46f66f5-t89kd                                     1/1     Running     4          4h27m   10.129.2.15    ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
odf-console-85fdf68fcc-5qbrt                                      1/1     Running     2          3d3h    10.131.0.4     ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
odf-operator-controller-manager-57756ff5d7-mtqvs                  2/2     Running     4          26h     10.131.0.12    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-crashcollector-11663c22d14d9287c7e129ee309d1401-v555r   1/1     Running     1          140m    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
rook-ceph-crashcollector-5ae34e94e91c7d5a246468cca7f76ed4-xlq9g   1/1     Running     2          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-crashcollector-9c0b4f54f434bc3f9febde270fb1e84c-j7bmr   1/1     Running     1          133m    10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-765d776b4pptd   2/2     Running     2          133m    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58bdfc9djw4kd   2/2     Running     4          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-mgr-a-69cdb8fb9f-j88x8                                  2/2     Running     2          133m    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-c-849859f89b-w7fnr                                  2/2     Running     4          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-mon-d-5545bfb4c9-gxpvt                                  2/2     Running     2          133m    10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-e-5fdc5f5c79-twwtq                                  2/2     Running     2          140m    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
rook-ceph-operator-6db885d965-dd9zg                               1/1     Running     2          4h27m   10.129.2.16    ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-7cfc878594-zstjh                                  2/2     Running     2          133m    10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-1-5bd9697469-mpl4c                                  2/2     Running     2          140m    10.0.132.60    ip-10-0-132-60.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-2-5669dccc65-c2lz8                                  2/2     Running     4          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lz4rr--1-58f8p        0/1     Completed   0          2d9h    10.0.175.33    ip-10-0-175-33.us-east-2.compute.internal    <none>           <none>
rook-ceph-tools-7c78f9db77-cpc8k                                  1/1     Running     2          4h27m   10.0.204.163   ip-10-0-204-163.us-east-2.compute.internal   <none>           <none>


All nodes in the provider are Ready
$ oc get nodes
NAME                                         STATUS   ROLES          AGE    VERSION
ip-10-0-132-60.us-east-2.compute.internal    Ready    worker         3d6h   v1.22.3+e790d7f
ip-10-0-152-132.us-east-2.compute.internal   Ready    master         3d6h   v1.22.3+e790d7f
ip-10-0-157-3.us-east-2.compute.internal     Ready    infra,worker   3d6h   v1.22.3+e790d7f
ip-10-0-175-33.us-east-2.compute.internal    Ready    worker         3d6h   v1.22.3+e790d7f
ip-10-0-179-252.us-east-2.compute.internal   Ready    master         3d6h   v1.22.3+e790d7f
ip-10-0-185-188.us-east-2.compute.internal   Ready    infra,worker   3d6h   v1.22.3+e790d7f
ip-10-0-204-163.us-east-2.compute.internal   Ready    worker         27h    v1.22.3+e790d7f
ip-10-0-217-74.us-east-2.compute.internal    Ready    master         3d6h   v1.22.3+e790d7f
ip-10-0-218-132.us-east-2.compute.internal   Ready    infra,worker   3d6h   v1.22.3+e790d7f


$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph health detail
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 15061/45183 objects degraded (33.333%), 100 pgs degraded, 129 pgs undersized
[WRN] OSD_DOWN: 1 osds down
    osd.0 (root=default,region=us-east-2,zone=us-east-2c,host=ocs-deviceset-2-data-09zr8m) is down
[WRN] OSD_HOST_DOWN: 1 host (1 osds) down
    host ocs-deviceset-2-data-09zr8m (root=default,region=us-east-2,zone=us-east-2c) (1 osds) is down
[WRN] OSD_ZONE_DOWN: 1 zone (1 osds) down
    zone us-east-2c (root=default,region=us-east-2) (1 osds) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 15061/45183 objects degraded (33.333%), 100 pgs degraded, 129 pgs undersized
    pg 1.13 is active+undersized+degraded, acting [1,2]
    pg 1.14 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.15 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.17 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 1.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 1.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 1.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 1.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 1.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.14 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 3.15 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 3.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.17 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 3.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 3.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 3.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 3.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 4.10 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.11 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.12 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 4.13 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 4.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 4.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 4.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 4.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 5.10 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.11 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.12 is stuck undersized for 74m, current state active+undersized, last acting [2,1]
    pg 5.13 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.17 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.18 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.19 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
    pg 5.1b is stuck undersized for 74m, current state active+undersized, last acting [2,1]
    pg 5.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
    pg 5.1d is stuck undersized for 74m, current state active+undersized, last acting [2,1]
    pg 5.1e is stuck undersized for 74m, current state active+undersized, last acting [1,2]
    pg 5.1f is stuck undersized for 74m, current state active+undersized, last acting [1,2]


$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph  osd tree
ID   CLASS  WEIGHT   TYPE NAME                                     STATUS  REWEIGHT  PRI-AFF
 -1         3.00000  root default                                                           
 -5         3.00000      region us-east-2                                                   
-10         1.00000          zone us-east-2a                                                
 -9         1.00000              host ocs-deviceset-1-data-08mh22                           
  1    ssd  1.00000                  osd.1                             up   1.00000  1.00000
-14         1.00000          zone us-east-2b                                                
-13         1.00000              host ocs-deviceset-0-data-0lz4rr                           
  2    ssd  1.00000                  osd.2                             up   1.00000  1.00000
 -4         1.00000          zone us-east-2c                                                
 -3         1.00000              host ocs-deviceset-2-data-09zr8m                           
  0    ssd  1.00000                  osd.0                           down   1.00000  1.00000



Provider must gather - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/

Consumer must-gather logs (consumer was not involved in  this test) - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/consumer/



Debug levels were set before stopping the nodes.

$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get osd debug_osd
20/20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get osd debug_ms
1/1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get mon debug_ms
1/1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get mon debug_mon
20/20
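
For completeness, a minimal sketch of how these levels can be raised before the test (ceph config set is the standard command; the toolbox pod name is the one used in the queries above):

$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set osd debug_osd 20/20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set osd debug_ms 1/1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set mon debug_mon 20/20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set mon debug_ms 1/1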


hostNetwork is enabled

$ oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml| grep hostNetwork -B 5
spec:
  arbiter: {}
  encryption:
    kms: {}
  externalStorage: {}
  hostNetwork: true


Tested in version:

ODF 4.10.0-143
OCP 4.9.17

Comment 17 Sahina Bose 2022-02-10 13:25:10 UTC
@nojha Any inputs that you can provide here? thanks!

Comment 18 Neha Ojha 2022-02-24 00:46:22 UTC
Hi Jiju,

(In reply to Jilju Joy from comment #15)
> I tried to reproduce this issue. In the first attempt OSD didn't go down
> after recovery. Ceph health was HEALTH_OK. Waited more than one hour to see
> if any OSD is going down. 

Can you please explain what recovery means in the above context?

> 
> In the second attempt on the same provider, 3 OSDs were up for 29 minutes
> after all the nodes became Ready. But ceph health did not become HEALTH_OK.
> After 29 minutes one OSD went down.

Do you mind sharing where I can find the corresponding logs? The mon logs in http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-cb2456b3eec615d652adc6afb1be14f928f1037725664f039aa6b0d2326ff145/namespaces/openshift-storage/pods/rook-ceph-mon-c-849859f89b-w7fnr/mon/mon/logs/ show that osd.0 was already down when we captured them (2022-02-09T20:38:03.921513472 - 2022-02-09T20:50:37.844620518).

> [...]

Comment 19 Jilju Joy 2022-02-24 09:53:32 UTC
(In reply to Neha Ojha from comment #18)
> Hi Jiju,
> 
> (In reply to Jilju Joy from comment #15)
> > I tried to reproduce this issue. In the first attempt OSD didn't go down
> > after recovery. Ceph health was HEALTH_OK. Waited more than one hour to see
> > if any OSD is going down. 
> 
> Can you please explain what recovery means in the above context?
I mean the state when the 3 OSDs are marked 'up' in ceph status after the nodes reach the Ready state.
> 
> > 
> > In the second attempt on the same provider, 3 OSDs were up for 29 minutes
> > after all the nodes became Ready. But ceph health did not become HEALTH_OK.
> > After 29 minutes one OSD went down.
> 
> Do you mind sharing where can I find the corresponding logs? The mon logs in
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/
> jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/ocs_must_gather/
> quay-io-ocs-dev-ocs-must-gather-sha256-
> cb2456b3eec615d652adc6afb1be14f928f1037725664f039aa6b0d2326ff145/namespaces/
> openshift-storage/pods/rook-ceph-mon-c-849859f89b-w7fnr/mon/mon/logs/ show
> that osd.0 was already down when we captured it

The logs were collected after reproducing the issue. One OSD was down at that time.
The date and time I added in the comment are from my system, to give an idea of how long it took for the one OSD to be marked down.

> (2022-02-09T20:38:03.921513472  - 2022-02-09T20:50:37.844620518). 
>

Comment 20 Neha Ojha 2022-03-02 20:36:34 UTC
(In reply to Jilju Joy from comment #19)
> (In reply to Neha Ojha from comment #18)
> > Hi Jiju,
> > 
> > (In reply to Jilju Joy from comment #15)
> > > I tried to reproduce this issue. In the first attempt OSD didn't go down
> > > after recovery. Ceph health was HEALTH_OK. Waited more than one hour to see
> > > if any OSD is going down. 
> > 
> > Can you please explain what recovery means in the above context?
> I mean the state when the 3 osd are marked 'up' in ceph status after the
> nodes reach Ready state.
> > 
> > > 
> > > In the second attempt on the same provider, 3 OSDs were up for 29 minutes
> > > after all the nodes became Ready. But ceph health did not become HEALTH_OK.
> > > After 29 minutes one OSD went down.
> > 
> > Do you mind sharing where can I find the corresponding logs? The mon logs in
> > http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/
> > jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/ocs_must_gather/
> > quay-io-ocs-dev-ocs-must-gather-sha256-
> > cb2456b3eec615d652adc6afb1be14f928f1037725664f039aa6b0d2326ff145/namespaces/
> > openshift-storage/pods/rook-ceph-mon-c-849859f89b-w7fnr/mon/mon/logs/ show
> > that osd.0 was already down when we captured it
> 
> The logs were collected after reproducing the issue. One osd was down at
> that time.
> Time date and time I added in the comment is the date in my system to give
> an idea about how long it take to mark the 1 osd as down. 

We need to capture the daemon logs (with increased debug log levels) from when the OSD gets marked down, and for the period it stays down. I also think we are having a lot of back and forth on this BZ; if it helps, I am happy to get on a call to explain why these logs are needed, and probably get a better understanding of what the test is trying to do.

> 
> > (2022-02-09T20:38:03.921513472  - 2022-02-09T20:50:37.844620518). 
> >
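
A rough sketch of capturing those daemon logs around the event, assuming the pod and container names from the comment #15 listing (the OSD container is "osd" and the mon container is "mon", as the must-gather paths above show) and that the log window of interest is still retained by the pods:

# Note the time of the "no beacon from osd.0 ... marking down" mon message, then pull logs covering it
$ oc -n openshift-storage logs rook-ceph-osd-0-7cfc878594-zstjh -c osd --since=1h > osd.0-after-down.log
$ oc -n openshift-storage logs rook-ceph-mon-c-849859f89b-w7fnr -c mon --since=1h > mon.c-after-down.log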

Comment 27 Jilju Joy 2022-03-25 13:00:29 UTC
Reproduced the issue again after setting the log levels as suggested in comment #26.

$ ceph config get osd debug_osd
20/20
sh-4.4$ ceph config get osd debug_ms 
1/1
sh-4.4$ ceph config get mon debug_mon   
20/20
sh-4.4$ ceph config get mon debug_ms 
1/1


Logs : http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ2050057_repro/

Collected the OSD and MON logs during the testing and added them in the directory "individual_osd_and_mon_logs".
Note: The files whose names contain "before-node-stop" were collected before stopping the nodes. The files whose names contain "after-node-back-ready" were collected after all the nodes became Ready again, so those files have the logs from when one of the OSDs was marked down.

3 OSDs were UP for 29 minutes after the nodes became Ready.

"ceph status" output just before one OSD marked down:

cluster:
    id:     c116bbb5-b8d1-4fee-a90a-10da42b7ad33
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            Reduced data availability: 108 pgs inactive, 79 pgs peering
            267 slow ops, oldest one blocked for 1779 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
 
  services:
    mon: 3 daemons, quorum a,b,c (age 29m)
    mgr: a(active, since 29m)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 29m), 3 in (since 4h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 129 pgs
    objects: 24.51k objects, 95 GiB
    usage:   286 GiB used, 5.7 TiB / 6 TiB avail
    pgs:     83.721% pgs not active
             79 peering
             29 activating
             21 active+clean
 
"ceph osd tree" output just before one OSD marked down:

ID   CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default                                                     
 -5         6.00000      region us-east-2                                             
-10         2.00000          zone us-east-2a                                          
 -9         2.00000              host default-1-data-09mht7                           
  1    ssd  2.00000                  osd.1                       up   1.00000  1.00000
 -4         2.00000          zone us-east-2b                                          
 -3         2.00000              host default-2-data-0s6v6m                           
  0    ssd  2.00000                  osd.0                       up   1.00000  1.00000
-14         2.00000          zone us-east-2c                                          
-13         2.00000              host default-0-data-05zjp9                           
  2    ssd  2.00000                  osd.2                       up   1.00000  1.00000



"ceph status" output just after one OSD marked down:

cluster:
    id:     c116bbb5-b8d1-4fee-a90a-10da42b7ad33
    health: HEALTH_WARN
            1 filesystem is degraded
            1 osds down
            1 host (1 osds) down
            1 zone (1 osds) down
            Degraded data redundancy: 24511/73533 objects degraded (33.333%), 84 pgs degraded
 
  services:
    mon: 3 daemons, quorum a,b,c (age 30m)
    mgr: a(active, since 29m)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 2 up (since 7s), 3 in (since 4h)
 
  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   5 pools, 129 pgs
    objects: 24.51k objects, 95 GiB
    usage:   286 GiB used, 5.7 TiB / 6 TiB avail
    pgs:     24511/73533 objects degraded (33.333%)
             84 active+undersized+degraded
             24 active+undersized
             21 active+undersized+wait
 
  io:
    client:   3.7 KiB/s wr, 0 op/s rd, 1 op/s wr
    recovery: 8.0 KiB/s, 1 objects/s


"ceph osd tree" output just after one OSD marked down:

ID   CLASS  WEIGHT   TYPE NAME                               STATUS  REWEIGHT  PRI-AFF
 -1         6.00000  root default                                                     
 -5         6.00000      region us-east-2                                             
-10         2.00000          zone us-east-2a                                          
 -9         2.00000              host default-1-data-09mht7                           
  1    ssd  2.00000                  osd.1                     down   1.00000  1.00000
 -4         2.00000          zone us-east-2b                                          
 -3         2.00000              host default-2-data-0s6v6m                           
  0    ssd  2.00000                  osd.0                       up   1.00000  1.00000
-14         2.00000          zone us-east-2c                                          
-13         2.00000              host default-0-data-05zjp9                           
  2    ssd  2.00000                  osd.2                       up   1.00000  1.00000


$ oc get pods -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE     IP             NODE                                         NOMINATED NODE   READINESS GATES
67356895c063b04b1f32f3e31d4ed40684fffe8513dfc7c344e8bb--1-f24pm   0/1     Completed   0          6h22m   10.131.0.66    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
alertmanager-managed-ocs-alertmanager-0                           2/2     Running     4          6h21m   10.131.0.14    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
alertmanager-managed-ocs-alertmanager-1                           2/2     Running     4          6h21m   10.128.2.21    ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
alertmanager-managed-ocs-alertmanager-2                           2/2     Running     4          6h21m   10.129.2.14    ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
b99fc36d3515cb5393c4870b496e2eeb8e124a63fd5bb8942defcb--1-6qrj5   0/1     Completed   0          6h22m   10.131.0.64    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
f35b6d7f1bfb86e8f99ace71e6f112001fecf0f91c4897edbe0717--1-2wvzg   0/1     Completed   0          6h22m   10.131.0.63    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
ocs-metrics-exporter-b55f6f77-tndjf                               1/1     Running     2          6h21m   10.131.0.8     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
ocs-operator-64b7598bb-r5hml                                      1/1     Running     4          6h21m   10.129.2.3     ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
ocs-osd-controller-manager-69d87b7c96-fd7qh                       3/3     Running     8          6h22m   10.128.2.18    ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
ocs-provider-server-bd5cd8458-tps2d                               1/1     Running     2          6h21m   10.131.0.23    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
odf-console-77b6ddffb8-9sxtw                                      1/1     Running     2          6h22m   10.129.2.4     ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
odf-operator-controller-manager-9f8898b5-8w88g                    2/2     Running     4          6h22m   10.131.0.15    ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
prometheus-managed-ocs-prometheus-0                               2/2     Running     5          6h21m   10.128.2.20    ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
prometheus-operator-5dc6c569-k9t2p                                1/1     Running     2          6h22m   10.129.2.5     ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
redhat-operators-8gjgb                                            1/1     Running     2          6h22m   10.129.2.12    ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-8697c21ca7172dbfcf664edd9ec2a586-q7dhk   1/1     Running     2          6h15m   10.0.210.238   ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-e5102b497044c9cd5fa09a64c19bddd7-4zvkk   1/1     Running     2          6h15m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-crashcollector-fc1cbae477521bd422880cb410cdadb2-f2nxh   1/1     Running     2          6h14m   10.0.191.7     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-84d84bd4fttct   2/2     Running     4          6h13m   10.0.191.7     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-57f8d75d4k9p6   2/2     Running     4          6h13m   10.0.210.238   ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
rook-ceph-mgr-a-59b44cc74-gzzfx                                   2/2     Running     4          6h15m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-a-7757547d87-qln6k                                  2/2     Running     4          6h20m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-mon-b-59f5f9ccc8-pbr7p                                  2/2     Running     4          6h18m   10.0.191.7     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-mon-c-55cd4fbd5d-nsfgd                                  2/2     Running     4          6h18m   10.0.210.238   ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
rook-ceph-operator-5cb764b9d9-cmknl                               1/1     Running     2          6h21m   10.129.2.15    ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-0-799f5865d7-5zvdg                                  2/2     Running     4          6h14m   10.0.191.7     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-osd-1-697f9d84b9-bshn9                                  2/2     Running     4          6h14m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-2-57f8566b76-csb9n                                  2/2     Running     4          6h14m   10.0.210.238   ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-0-data-05zjp9--1-pp7rc              0/1     Completed   0          6h14m   10.0.210.238   ip-10-0-210-238.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-1-data-09mht7--1-bq82q              0/1     Completed   0          6h14m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
rook-ceph-osd-prepare-default-2-data-0s6v6m--1-5ms4x              0/1     Completed   0          6h14m   10.0.191.7     ip-10-0-191-7.us-east-2.compute.internal     <none>           <none>
rook-ceph-tools-86c9fb5d54-zgrtt                                  1/1     Running     2          6h21m   10.0.136.166   ip-10-0-136-166.us-east-2.compute.internal   <none>           <none>
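
Since the provider cluster runs with hostNetwork: true, the OSD pods above report node IPs (10.0.x.x) rather than pod-network addresses. As a hedged sketch, the host-networking flag and the kubelet restart history of the affected OSD pod could be confirmed like this (pod name copied from the listing above; the openshift-storage namespace is assumed):

$ oc -n openshift-storage get pod rook-ceph-osd-1-697f9d84b9-bshn9 \
      -o jsonpath='{.spec.hostNetwork} {.status.startTime} {.status.containerStatuses[*].restartCount}{"\n"}'
# Prints "true" when the pod uses host networking, followed by the pod start
# time and the per-container restart counts recorded by kubelet.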

Comment 37 Mudit Agarwal 2022-04-05 13:45:52 UTC
Still being discussed; not a 4.10 blocker at the moment.

Comment 49 Travis Nielsen 2022-05-16 15:15:41 UTC
Neha, this issue seems different from https://bugzilla.redhat.com/show_bug.cgi?id=2072900; the OSD just isn't starting properly after the node restart. I'm not sure what we can do from Rook for this. The pod has started, but the OSD is not coming online. Are there sufficient logs, or do we need increased log levels?
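
A minimal sketch of one way the OSD debug levels could be raised ahead of another repro, with the ceph commands run from the rook-ceph-tools pod (the values below are illustrative; they are not necessarily the levels QE used):

$ ceph config set osd.1 debug_osd 20
$ ceph config set osd.1 debug_ms 1
$ ceph config set osd.1 debug_bluestore 20
# After the logs are collected, the overrides can be removed again:
$ ceph config rm osd.1 debug_osd
$ ceph config rm osd.1 debug_ms
$ ceph config rm osd.1 debug_bluestore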

Comment 50 Neha Ojha 2022-05-27 19:01:07 UTC
(In reply to Travis Nielsen from comment #49)
> Neha, this issue seems different from
> https://bugzilla.redhat.com/show_bug.cgi?id=2072900; the OSD just isn't
> starting properly after the node restart. I'm not sure what we can do
> from Rook for this. The pod has started, but the OSD is not coming online.
> Are there sufficient logs, or do we need increased log levels?

Travis, QE was able to capture logs with the desired log levels, and I provided my analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2050057#c35. From Ceph's perspective, osd.1 was never restarted.
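
For context, the restart history Ceph itself records for an OSD can be cross-checked against the osdmap and the daemon's admin socket. A sketch follows, with the ceph commands run from the toolbox and the oc command from a workstation with cluster access; the rook-ceph-osd-1 deployment name is taken from this report:

$ ceph osd dump | grep '^osd\.1 '   # up_from/up_thru/down_at epochs only change when the daemon (re)boots
$ ceph osd find 1                   # the CRUSH host/zone the monitors currently have for osd.1
$ oc -n openshift-storage exec deploy/rook-ceph-osd-1 -c osd -- ceph daemon osd.1 status
# The last command queries the admin socket inside the OSD pod (if reachable)
# and reports the daemon's state plus the oldest/newest osdmap epochs it holds.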

Comment 51 Mudit Agarwal 2022-06-13 09:35:56 UTC
Neha, is this still a blocker?

Comment 52 Travis Nielsen 2022-06-29 21:20:41 UTC
(In reply to Neha Ojha from comment #50)
> (In reply to Travis Nielsen from comment #49)
> > Neha, this issue seems different from
> > https://bugzilla.redhat.com/show_bug.cgi?id=2072900; the OSD just isn't
> > starting properly after the node restart. I'm not sure what we can do
> > from Rook for this. The pod has started, but the OSD is not coming online.
> > Are there sufficient logs, or do we need increased log levels?
> 
> Travis, QE was able to capture logs with the desired log levels, and I
> provided my analysis in
> https://bugzilla.redhat.com/show_bug.cgi?id=2050057#c35. From Ceph's
> perspective, osd.1 was never restarted.

If the logs don't show that osd.1 was ever restarted, something isn't adding up. When the OSD pod starts back up after a node outage, its log must show that the daemon restarted. Could we get another repro that shows the osd.1 pod actually restarting?
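
A hedged sketch of the evidence that could be captured on the next repro to show whether the osd.1 pod and daemon actually restarted (the pod label, deployment name, and namespace are assumed from the listing earlier in this report):

$ oc -n openshift-storage get pod -l ceph-osd-id=1 \
      -o jsonpath='{range .items[*]}{.metadata.name} {.status.startTime} {.status.containerStatuses[*].restartCount}{"\n"}{end}'
$ oc -n openshift-storage logs deploy/rook-ceph-osd-1 -c osd --previous | tail -n 50
$ oc -n openshift-storage logs deploy/rook-ceph-osd-1 -c osd | grep -iE 'ceph version|boot'
# A freshly started OSD normally prints its version banner and boot messages at
# the top of its log, which should (or should not) line up with the outage window.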

Comment 58 Mudit Agarwal 2022-11-03 02:36:58 UTC
Please reopen if this is still an issue and we have enough data to debug it.

