Bug 2050057
| Summary: | ODF 4.10: 1 OSD down after stopping all nodes on provider cluster on ODF-to-ODF setup | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | suchita <sgatfane> |
| Component: | ceph | Assignee: | Neha Ojha <nojha> |
| ceph sub component: | RADOS | QA Contact: | Elad <ebenahar> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | high | ||
| Priority: | high | CC: | bniver, ebenahar, jijoy, madam, mhackett, mmuench, muagarwa, nberry, nojha, ocs-bugs, odf-bz-bot, owasserm, pdhiran, rgeorge, sheggodu, tnielsen, vumrao |
| Version: | 4.10 | Flags: | vumrao: needinfo? (jijoy), muagarwa: needinfo? (nberry), tnielsen: needinfo? (sgatfane), sheggodu: needinfo? (nojha) |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-11-03 02:36:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
The osd.0 log [1] shows that it is running, and the OSD pod description [2] shows that the pod is running and passing its liveness probe. I don't see any logging that indicates the issue. Perhaps there was some issue with the AWS networking that wasn't restored after the VMs came back online and is preventing the OSD from communicating, so the other OSDs are marking it down.

Suchita:
- Does this repro consistently? It really seems like an AWS environment issue.
- Was this just a test scenario? Or why were the AWS instances stopped?

Neha, any other clues to look for the cause of the OSD being marked down?

[1] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-osd-0-784999fdff-4tkwx/osd/osd/logs/current.log
[2] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-osd-0-784999fdff-4tkwx/rook-ceph-osd-0-784999fdff-4tkwx.yaml

From the mon logs [0], it looks like osd.0 stopped responding to beacons and the mon marked it down. This smells like a network issue.

2022-02-02T13:06:58.569247812Z debug 2022-02-02T13:06:58.568+0000 7f23cb41b700 0 log_channel(cluster) log [INF] : osd.0 marked down after no beacon for 900.407398 seconds
2022-02-02T13:06:58.569275923Z debug 2022-02-02T13:06:58.568+0000 7f23cb41b700 -1 mon.a@0(leader).osd e197 no beacon from osd.0 since 2022-02-02T12:51:58.161800+0000, 900.407398 seconds ago. marking down

[0] http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-ac4648adce5e4f901613016460bff96b20f32f8554295d7d1bbed610ed1e1301/namespaces/openshift-storage/pods/rook-ceph-mon-a-7546889649-dkt7h/mon/mon/logs/current.log

@tnielsen Travis, this was a test scenario to see how ODF behaves with hostNetworking when one of the nodes gets restarted. If this is a network issue, what should we be looking for?

Some thoughts:
- Does this repro consistently? Or was it a one-time issue?
- Does restarting the VM one more time get it working again?
- Review the Ceph networking guide and dig into the network with someone who knows AWS networking: https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

(In reply to Sahina Bose from comment #4)
> @tnielsen Travis, this was a test scenario to see how ODF behaves
> with hostNetworking when one of the nodes gets restarted. If this is a
> network issue, what should we be looking for?

The test step "Stop all instances of provider cluster from aws (CLI/console)" is not the same as restarting one of the nodes. It's a substantially less relevant test, IMHO.

(In reply to Travis Nielsen from comment #5)
> Some thoughts:
> - Does this repro consistently? Or was it a one-time issue?
> - Does restarting the VM one more time get it working again?
> - Review the Ceph networking guide and dig into the network with someone who
> knows AWS networking:
> https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/

> - Does this repro consistently? Or was it a one-time issue?
==> Yes, we tried two times and it was observed both times.

> - Does restarting the VM one more time get it working again?
==> I will try this today and will update the observation here.

Neha,
From the logs, osd.0 seems to recover and is up and running.
Should the mons not have detected it and marked it up again?
Are we missing a config setting?
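For context on the 900-second figure in the mon log above: it corresponds to the mon's OSD beacon grace, which is believed to default to 900 seconds (mon_osd_report_timeout), with each OSD expected to send a beacon roughly every osd_beacon_report_interval seconds (believed to default to 300). A minimal sketch of checking the effective values from the toolbox; the pod name <rook-ceph-tools-pod> is a placeholder:

$ oc rsh <rook-ceph-tools-pod> ceph config get mon mon_osd_report_timeout      # grace before the mon marks a silent OSD down
$ oc rsh <rook-ceph-tools-pod> ceph config get osd osd_beacon_report_interval  # how often each OSD should send beacons to the mons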
(In reply to suchita from comment #7)
> (In reply to Travis Nielsen from comment #5)
> > Some thoughts:
> > - Does this repro consistently? Or was it a one-time issue?
> > - Does restarting the VM one more time get it working again?
> > - Review the Ceph networking guide and dig into the network with someone who
> > knows AWS networking:
> > https://docs.ceph.com/en/latest/rados/configuration/network-config-ref/
>
>
> > - Does this repro consistently? Or was it a one-time issue?
> ==> yes, we tried 2 times and both time it has been observed.
>
> > - Does restarting the VM one more time get it working again?
> ==> I will try this today and will update the observation here.
For some reason I missed reproducing and checking this on that setup. I am doing it today and will try to keep the setup available for use.
(In reply to Orit Wasserman from comment #8)
> Neha,
> From the logs, osd.0 seems to recover and is up and running.
> Should the mons not have detected it and marked it up again?
> Are we missing a config setting?

Which specific config setting do we need?

(In reply to suchita from comment #10)
> (In reply to Orit Wasserman from comment #8)
> > Neha,
> > From the logs, osd.0 seems to recover and is up and running.
> > Should the mons not have detected it and marked it up again?
> > Are we missing a config setting?
>
> Which specific config setting do we need?

We are still investigating. I will update when we understand what is going on.

I tried to reproduce this issue. In the first attempt the OSD didn't go down after recovery. Ceph health was HEALTH_OK. Waited more than one hour to see if any OSD would go down.
In the second attempt on the same provider, 3 OSDs were up for 29 minutes after all the nodes became Ready. But ceph health did not become HEALTH_OK. After 29 minutes one OSD went down.
This is the last ceph status when 3 OSDs were up (Thu Feb 10 01:36:38 AM IST 2022):
cluster:
id: ad79e075-d011-46a9-992c-a655b1689fed
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
Reduced data availability: 129 pgs inactive, 90 pgs peering
14 slow ops, oldest one blocked for 1781 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum c,d,e (age 29m)
mgr: a(active, since 29m)
mds: 1/1 daemons up, 1 standby
osd: 3 osds: 3 up (since 29m), 3 in (since 2d)
data:
volumes: 0/1 healthy, 1 recovering
pools: 5 pools, 129 pgs
objects: 15.06k objects, 55 GiB
usage: 78 GiB used, 2.9 TiB / 3 TiB avail
pgs: 100.000% pgs not active
90 peering
39 activating
After that, one OSD was marked down (Thu Feb 10 01:36:57 AM IST 2022):
cluster:
id: ad79e075-d011-46a9-992c-a655b1689fed
health: HEALTH_WARN
1 MDSs report slow metadata IOs
1 osds down
1 host (1 osds) down
1 zone (1 osds) down
Reduced data availability: 2 pgs inactive
Degraded data redundancy: 15061/45183 objects degraded (33.333%), 95 pgs degraded
6 slow ops, oldest one blocked for 1796 sec, daemons [osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum c,d,e (age 30m)
mgr: a(active, since 30m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 2 up (since 9s), 3 in (since 2d)
data:
volumes: 1/1 healthy
pools: 5 pools, 129 pgs
objects: 15.06k objects, 55 GiB
usage: 78 GiB used, 2.9 TiB / 3 TiB avail
pgs: 3.876% pgs not active
15061/45183 objects degraded (33.333%)
95 active+undersized+degraded
29 active+undersized
5 activating+undersized
io:
client: 11 MiB/s rd, 4 op/s rd, 0 op/s wr
OSD pods are running
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-addons-controller-manager-7b65c778df-784xl 2/2 Running 5 4h27m 10.129.2.7 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-2dnk8 3/3 Running 6 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-ddpct 3/3 Running 6 2d9h 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-l6s88 3/3 Running 6 26h 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-provisioner-6465d4c55-fb8p9 6/6 Running 12 2d9h 10.131.0.26 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
csi-cephfsplugin-provisioner-6465d4c55-t2v66 6/6 Running 16 4h27m 10.129.2.8 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
csi-rbdplugin-7crdf 4/4 Running 8 2d9h 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
csi-rbdplugin-9h47g 4/4 Running 8 26h 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
csi-rbdplugin-provisioner-5d4f9f74d6-65jp5 7/7 Running 18 4h27m 10.129.2.22 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
csi-rbdplugin-provisioner-5d4f9f74d6-mxh2l 7/7 Running 14 2d9h 10.131.0.22 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
csi-rbdplugin-vsjxx 4/4 Running 8 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
noobaa-operator-5d8bf7d5d8-97xxf 1/1 Running 2 4h27m 10.129.2.27 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
ocs-metrics-exporter-684b49bfb4-rzss9 1/1 Running 2 4h27m 10.129.2.3 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
ocs-operator-65f46f66f5-t89kd 1/1 Running 4 4h27m 10.129.2.15 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
odf-console-85fdf68fcc-5qbrt 1/1 Running 2 3d3h 10.131.0.4 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
odf-operator-controller-manager-57756ff5d7-mtqvs 2/2 Running 4 26h 10.131.0.12 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-11663c22d14d9287c7e129ee309d1401-v555r 1/1 Running 1 140m 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-5ae34e94e91c7d5a246468cca7f76ed4-xlq9g 1/1 Running 2 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-9c0b4f54f434bc3f9febde270fb1e84c-j7bmr 1/1 Running 1 133m 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-765d776b4pptd 2/2 Running 2 133m 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-58bdfc9djw4kd 2/2 Running 4 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-mgr-a-69cdb8fb9f-j88x8 2/2 Running 2 133m 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
rook-ceph-mon-c-849859f89b-w7fnr 2/2 Running 4 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-mon-d-5545bfb4c9-gxpvt 2/2 Running 2 133m 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
rook-ceph-mon-e-5fdc5f5c79-twwtq 2/2 Running 2 140m 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
rook-ceph-operator-6db885d965-dd9zg 1/1 Running 2 4h27m 10.129.2.16 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
rook-ceph-osd-0-7cfc878594-zstjh 2/2 Running 2 133m 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
rook-ceph-osd-1-5bd9697469-mpl4c 2/2 Running 2 140m 10.0.132.60 ip-10-0-132-60.us-east-2.compute.internal <none> <none>
rook-ceph-osd-2-5669dccc65-c2lz8 2/2 Running 4 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-ocs-deviceset-0-data-0lz4rr--1-58f8p 0/1 Completed 0 2d9h 10.0.175.33 ip-10-0-175-33.us-east-2.compute.internal <none> <none>
rook-ceph-tools-7c78f9db77-cpc8k 1/1 Running 2 4h27m 10.0.204.163 ip-10-0-204-163.us-east-2.compute.internal <none> <none>
All nodes in the provider are Ready
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-132-60.us-east-2.compute.internal Ready worker 3d6h v1.22.3+e790d7f
ip-10-0-152-132.us-east-2.compute.internal Ready master 3d6h v1.22.3+e790d7f
ip-10-0-157-3.us-east-2.compute.internal Ready infra,worker 3d6h v1.22.3+e790d7f
ip-10-0-175-33.us-east-2.compute.internal Ready worker 3d6h v1.22.3+e790d7f
ip-10-0-179-252.us-east-2.compute.internal Ready master 3d6h v1.22.3+e790d7f
ip-10-0-185-188.us-east-2.compute.internal Ready infra,worker 3d6h v1.22.3+e790d7f
ip-10-0-204-163.us-east-2.compute.internal Ready worker 27h v1.22.3+e790d7f
ip-10-0-217-74.us-east-2.compute.internal Ready master 3d6h v1.22.3+e790d7f
ip-10-0-218-132.us-east-2.compute.internal Ready infra,worker 3d6h v1.22.3+e790d7f
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph health detail
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 zone (1 osds) down; Degraded data redundancy: 15061/45183 objects degraded (33.333%), 100 pgs degraded, 129 pgs undersized
[WRN] OSD_DOWN: 1 osds down
osd.0 (root=default,region=us-east-2,zone=us-east-2c,host=ocs-deviceset-2-data-09zr8m) is down
[WRN] OSD_HOST_DOWN: 1 host (1 osds) down
host ocs-deviceset-2-data-09zr8m (root=default,region=us-east-2,zone=us-east-2c) (1 osds) is down
[WRN] OSD_ZONE_DOWN: 1 zone (1 osds) down
zone us-east-2c (root=default,region=us-east-2) (1 osds) is down
[WRN] PG_DEGRADED: Degraded data redundancy: 15061/45183 objects degraded (33.333%), 100 pgs degraded, 129 pgs undersized
pg 1.13 is active+undersized+degraded, acting [1,2]
pg 1.14 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.15 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.17 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 1.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 1.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 1.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 1.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 1.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.14 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 3.15 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 3.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.17 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 3.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 3.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 3.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 3.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 4.10 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.11 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.12 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 4.13 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.16 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.18 is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 4.19 is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.1b is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 4.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 4.1d is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.1e is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 4.1f is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 5.10 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.11 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.12 is stuck undersized for 74m, current state active+undersized, last acting [2,1]
pg 5.13 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.17 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.18 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.19 is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.1a is stuck undersized for 74m, current state active+undersized+degraded, last acting [2,1]
pg 5.1b is stuck undersized for 74m, current state active+undersized, last acting [2,1]
pg 5.1c is stuck undersized for 74m, current state active+undersized+degraded, last acting [1,2]
pg 5.1d is stuck undersized for 74m, current state active+undersized, last acting [2,1]
pg 5.1e is stuck undersized for 74m, current state active+undersized, last acting [1,2]
pg 5.1f is stuck undersized for 74m, current state active+undersized, last acting [1,2]
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 3.00000 root default
-5 3.00000 region us-east-2
-10 1.00000 zone us-east-2a
-9 1.00000 host ocs-deviceset-1-data-08mh22
1 ssd 1.00000 osd.1 up 1.00000 1.00000
-14 1.00000 zone us-east-2b
-13 1.00000 host ocs-deviceset-0-data-0lz4rr
2 ssd 1.00000 osd.2 up 1.00000 1.00000
-4 1.00000 zone us-east-2c
-3 1.00000 host ocs-deviceset-2-data-09zr8m
0 ssd 1.00000 osd.0 down 1.00000 1.00000
Provider must gather - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/
Consumer must-gather logs (consumer was not involved in this test) - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/consumer/
Debug levels were set before stopping the nodes.
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get osd debug_osd
20/20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get osd debug_ms
1/1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get mon debug_ms
1/1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config get mon debug_mon
20/20
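For completeness, a minimal sketch of how these levels could be raised from the toolbox before the test; the values mirror the outputs shown above, and the pod name is the same toolbox pod used in this comment:

$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set osd debug_osd 20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set osd debug_ms 1
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set mon debug_mon 20
$ oc rsh rook-ceph-tools-7c78f9db77-cpc8k ceph config set mon debug_ms 1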
hostNetwork is enabled
$ oc -n openshift-storage get storagecluster ocs-storagecluster -o yaml| grep hostNetwork -B 5
spec:
arbiter: {}
encryption:
kms: {}
externalStorage: {}
hostNetwork: true
Tested in version:
ODF 4.10.0-143
OCP 4.9.17
@nojha Any inputs that you can provide here? Thanks!

Hi Jiju,

(In reply to Jilju Joy from comment #15)
> I tried to reproduce this issue. In the first attempt the OSD didn't go down
> after recovery. Ceph health was HEALTH_OK. Waited more than one hour to see
> if any OSD would go down.

Can you please explain what recovery means in the above context?

> In the second attempt on the same provider, 3 OSDs were up for 29 minutes
> after all the nodes became Ready. But ceph health did not become HEALTH_OK.
> After 29 minutes one OSD went down.

Do you mind sharing where I can find the corresponding logs? The mon logs in http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-cb2456b3eec615d652adc6afb1be14f928f1037725664f039aa6b0d2326ff145/namespaces/openshift-storage/pods/rook-ceph-mon-c-849859f89b-w7fnr/mon/mon/logs/ show that osd.0 was already down when we captured it (2022-02-09T20:38:03.921513472 - 2022-02-09T20:50:37.844620518).

> [snip: ceph status, pod list, node list, health detail, osd tree, must-gather links, debug settings, and version details quoted from comment #15]
(In reply to Neha Ojha from comment #18)
> Can you please explain what recovery means in the above context?

I mean the state when the 3 OSDs are marked 'up' in ceph status after the nodes reach Ready state.

> Do you mind sharing where I can find the corresponding logs? The mon logs in
> http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-feb10/jijoy-feb10_20220209T201934/logs/bug_2050057_repro/provider/ocs_must_gather/quay-io-ocs-dev-ocs-must-gather-sha256-cb2456b3eec615d652adc6afb1be14f928f1037725664f039aa6b0d2326ff145/namespaces/openshift-storage/pods/rook-ceph-mon-c-849859f89b-w7fnr/mon/mon/logs/
> show that osd.0 was already down when we captured it
> (2022-02-09T20:38:03.921513472 - 2022-02-09T20:50:37.844620518).

The logs were collected after reproducing the issue. One OSD was down at that time.
The date and time I added in the comment are from my system, to give an idea of how long it took for the one OSD to be marked down.

(In reply to Jilju Joy from comment #19)
> The logs were collected after reproducing the issue. One OSD was down at that time.
> The date and time I added in the comment are from my system, to give an idea of how long it took for the one OSD to be marked down.

We need to capture the daemon logs (with increased debug log levels) from when the OSD gets marked down, and for the period it stays down. I also think we are having a lot of back and forth on this BZ; if it helps, I am happy to get on a call and explain why these logs are needed, and probably get a better understanding of what the test is trying to do.

Reproduced the issue again after setting the log level as suggested in comment #26.

$ ceph config get osd debug_osd
20/20
sh-4.4$ ceph config get osd debug_ms
1/1
sh-4.4$ ceph config get mon debug_mon
20/20
sh-4.4$ ceph config get mon debug_ms
1/1

Logs: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ2050057_repro/

The OSD and MON logs collected during the testing were added in the directory "individual_osd_and_mon_logs".
Note: The files whose names contain "before-node-stop" were collected before stopping the nodes. The files whose names contain "after-node-back-ready" were collected after all the nodes became Ready again, so those are the ones that cover the time when one of the OSDs was marked down.

3 OSDs were UP for 29 minutes after the nodes became Ready.

"ceph status" output just before one OSD was marked down:

cluster:
id: c116bbb5-b8d1-4fee-a90a-10da42b7ad33
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
Reduced data availability: 108 pgs inactive, 79 pgs peering
267 slow ops, oldest one blocked for 1779 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum a,b,c (age 29m)
mgr: a(active, since 29m)
mds: 1/1 daemons up, 1 standby
osd: 3 osds: 3 up (since 29m), 3 in (since 4h)
data:
volumes: 0/1 healthy, 1 recovering
pools: 5 pools, 129 pgs
objects: 24.51k objects, 95 GiB
usage: 286 GiB used, 5.7 TiB / 6 TiB avail
pgs: 83.721% pgs not active
79 peering
29 activating
21 active+clean

"ceph osd tree" output just before one OSD was marked down:

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.00000 root default
-5 6.00000 region us-east-2
-10 2.00000 zone us-east-2a
-9 2.00000 host default-1-data-09mht7
1 ssd 2.00000 osd.1 up 1.00000 1.00000
-4 2.00000 zone us-east-2b
-3 2.00000 host default-2-data-0s6v6m
0 ssd 2.00000 osd.0 up 1.00000 1.00000
-14 2.00000 zone us-east-2c
-13 2.00000 host default-0-data-05zjp9
2 ssd 2.00000 osd.2 up 1.00000 1.00000

"ceph status" output just after one OSD was marked down:

cluster:
id: c116bbb5-b8d1-4fee-a90a-10da42b7ad33
health: HEALTH_WARN
1 filesystem is degraded
1 osds down
1 host (1 osds) down
1 zone (1 osds) down
Degraded data redundancy: 24511/73533 objects degraded (33.333%), 84 pgs degraded
services:
mon: 3 daemons, quorum a,b,c (age 30m)
mgr: a(active, since 29m)
mds: 1/1 daemons up, 1 standby
osd: 3 osds: 2 up (since 7s), 3 in (since 4h)
data:
volumes: 0/1 healthy, 1 recovering
pools: 5 pools, 129 pgs
objects: 24.51k objects, 95 GiB
usage: 286 GiB used, 5.7 TiB / 6 TiB avail
pgs: 24511/73533 objects degraded (33.333%)
84 active+undersized+degraded
24 active+undersized
21 active+undersized+wait
io:
client: 3.7 KiB/s wr, 0 op/s rd, 1 op/s wr
recovery: 8.0 KiB/s, 1 objects/s

"ceph osd tree" output just after one OSD was marked down:

ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 6.00000 root default
-5 6.00000 region us-east-2
-10 2.00000 zone us-east-2a
-9 2.00000 host default-1-data-09mht7
1 ssd 2.00000 osd.1 down 1.00000 1.00000
-4 2.00000 zone us-east-2b
-3 2.00000 host default-2-data-0s6v6m
0 ssd 2.00000 osd.0 up 1.00000 1.00000
-14 2.00000 zone us-east-2c
-13 2.00000 host default-0-data-05zjp9
2 ssd 2.00000 osd.2 up 1.00000 1.00000

$ oc get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
67356895c063b04b1f32f3e31d4ed40684fffe8513dfc7c344e8bb--1-f24pm 0/1 Completed 0 6h22m 10.131.0.66 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
alertmanager-managed-ocs-alertmanager-0 2/2 Running 4 6h21m 10.131.0.14 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
alertmanager-managed-ocs-alertmanager-1 2/2 Running 4 6h21m 10.128.2.21 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
alertmanager-managed-ocs-alertmanager-2 2/2 Running 4 6h21m 10.129.2.14 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
b99fc36d3515cb5393c4870b496e2eeb8e124a63fd5bb8942defcb--1-6qrj5 0/1 Completed 0 6h22m 10.131.0.64 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
f35b6d7f1bfb86e8f99ace71e6f112001fecf0f91c4897edbe0717--1-2wvzg 0/1 Completed 0 6h22m 10.131.0.63 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
ocs-metrics-exporter-b55f6f77-tndjf 1/1 Running 2 6h21m 10.131.0.8 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
ocs-operator-64b7598bb-r5hml 1/1 Running 4 6h21m 10.129.2.3 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
ocs-osd-controller-manager-69d87b7c96-fd7qh 3/3 Running 8 6h22m 10.128.2.18 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
ocs-provider-server-bd5cd8458-tps2d 1/1 Running 2 6h21m 10.131.0.23 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
odf-console-77b6ddffb8-9sxtw 1/1 Running 2 6h22m 10.129.2.4 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
odf-operator-controller-manager-9f8898b5-8w88g 2/2 Running 4 6h22m 10.131.0.15 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
prometheus-managed-ocs-prometheus-0 2/2 Running 5 6h21m 10.128.2.20 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
prometheus-operator-5dc6c569-k9t2p 1/1 Running 2 6h22m 10.129.2.5 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
redhat-operators-8gjgb 1/1 Running 2 6h22m 10.129.2.12 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-8697c21ca7172dbfcf664edd9ec2a586-q7dhk 1/1 Running 2 6h15m 10.0.210.238 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-e5102b497044c9cd5fa09a64c19bddd7-4zvkk 1/1 Running 2 6h15m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-crashcollector-fc1cbae477521bd422880cb410cdadb2-f2nxh 1/1 Running 2 6h14m 10.0.191.7 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-84d84bd4fttct 2/2 Running 4 6h13m 10.0.191.7 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-57f8d75d4k9p6 2/2 Running 4 6h13m 10.0.210.238 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
rook-ceph-mgr-a-59b44cc74-gzzfx 2/2 Running 4 6h15m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-mon-a-7757547d87-qln6k 2/2 Running 4 6h20m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-mon-b-59f5f9ccc8-pbr7p 2/2 Running 4 6h18m 10.0.191.7 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
rook-ceph-mon-c-55cd4fbd5d-nsfgd 2/2 Running 4 6h18m 10.0.210.238 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
rook-ceph-operator-5cb764b9d9-cmknl 1/1 Running 2 6h21m 10.129.2.15 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-osd-0-799f5865d7-5zvdg 2/2 Running 4 6h14m 10.0.191.7 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
rook-ceph-osd-1-697f9d84b9-bshn9 2/2 Running 4 6h14m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-osd-2-57f8566b76-csb9n 2/2 Running 4 6h14m 10.0.210.238 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-0-data-05zjp9--1-pp7rc 0/1 Completed 0 6h14m 10.0.210.238 ip-10-0-210-238.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-1-data-09mht7--1-bq82q 0/1 Completed 0 6h14m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>
rook-ceph-osd-prepare-default-2-data-0s6v6m--1-5ms4x 0/1 Completed 0 6h14m 10.0.191.7 ip-10-0-191-7.us-east-2.compute.internal <none> <none>
rook-ceph-tools-86c9fb5d54-zgrtt 1/1 Running 2 6h21m 10.0.136.166 ip-10-0-136-166.us-east-2.compute.internal <none> <none>

Still being discussed, not a 4.10 blocker atm.

Neha, this issue seems different from https://bugzilla.redhat.com/show_bug.cgi?id=2072900; the OSD just isn't properly starting after restarting the node. I'm not sure what we can do from Rook for this. The pod has started, but the OSD is not coming online. Are there sufficient logs or do we need increased log levels?
(In reply to Travis Nielsen from comment #49)
> Are there sufficient logs or do we need increased log levels?

Travis, QE was able to capture logs with the desired log levels, and I provided my analysis in https://bugzilla.redhat.com/show_bug.cgi?id=2050057#c35. From Ceph's perspective, osd.1 was never restarted.

Neha, is this still a blocker?

(In reply to Neha Ojha from comment #50)
> (In reply to Travis Nielsen from comment #49)
> > Neha, this issue seems different from
> > https://bugzilla.redhat.com/show_bug.cgi?id=2072900; the OSD just isn't
> > properly starting after restarting the node. I'm not sure what we can do
> > from Rook for this. The pod has started, but the OSD is not coming online.
> > Are there sufficient logs or do we need increased log levels?
>
> Travis, QE was able to capture logs with the desired log levels, and I
> provided my analysis in
> https://bugzilla.redhat.com/show_bug.cgi?id=2050057#c35. From Ceph's
> perspective, osd.1 was never restarted.

If the logs don't show osd.1 was ever restarted, something isn't adding up. When the OSD pod starts back up after a node outage, the OSD must show logging that shows it was restarted. Could we get another repro that shows the osd.1 pod actually restarted?

Please reopen if this is still an issue and we have enough to debug.
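For any future repro, a hedged sketch of how to confirm whether an OSD daemon actually restarted (the pod name hash and toolbox pod name are placeholders): a fresh startup banner in the OSD container log and changed up_from/down_at epochs in the OSD map both indicate a real daemon restart.

$ oc logs rook-ceph-osd-1-<hash> -c osd | grep "ceph version"   # a new "ceph version ... process ceph-osd" banner is printed on every daemon start
$ oc rsh <rook-ceph-tools-pod> ceph osd dump | grep "^osd.1 "   # up_from/up_thru/down_at epochs advance when the OSD comes back into the map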
Description of problem (please be as detailed as possible and provide log snippets):

Created an ODF 4.10 setup on the ROSA cluster for the ODF-to-ODF (provider-consumer) setup; hence the provider ODF setup has host networking set to true. I stopped all the AWS instances of the provider cluster, and they automatically came back to the running state within a few minutes. Initially I saw 3 OSDs up, and after some time one of the OSDs went down.

Version of all relevant components (if applicable):
OpenShift version: 4.9.15
ceph version 16.2.7-35.el8cp (51d904cb9b9eb82f2c11b4cf5252ab3f3ff0d6b4) pacific (stable)
OCS - 4.10.0-122

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
No, manageable post workaround.

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
2/2

Can this issue be reproduced from the UI?
No

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Install a ROSA cluster.
2. Install the ODF operator, ODF 4.10-122.
3. Add the below 4 inbound rules to the cluster's worker security group from the AWS console -> Security group -> <worker security ID> -> Edit inbound rules -> Add rules (see the aws CLI sketch at the end of this report):
   Type | Port range | Source | Description
   Custom TCP | 6789 | 10.0.0.16 | ceph mon v1
   Custom TCP | 3300 | 10.0.0.16 | ceph mon v2
   Custom TCP | 6800-7300 | 10.0.0.16 | osd
   Custom TCP | 9283 | 10.0.0.16 | ceph manager
4. While creating the storage cluster, ensure host networking is true in the spec: "spec: hostNetwork: true".
5. Create the cluster - an external ROSA consumer cluster.
6. Verify all pods and OSDs are running and UP, and ceph health is okay.
7. Stop all instances of the provider cluster from AWS (CLI/console).

Actual results:
1. All nodes go to NotReady status, and within a few minutes the nodes are Ready again.
2. All AWS instances are in running status.
3. During recovery the OSDs come up.

------------------------ ceph -s ---------------
cluster:
id: 458b0172-00d8-4caa-84a7-5bfd259f18bd
health: HEALTH_WARN
1 filesystem is degraded
1 MDSs report slow metadata IOs
Reduced data availability: 116 pgs inactive, 90 pgs peering
6 slow ops, oldest one blocked for 781 sec, daemons [osd.0,osd.1,osd.2] have slow ops.
services:
mon: 3 daemons, quorum a,b,c (age 13m)
mgr: a(active, since 13m)
mds: 1/1 daemons up, 1 standby
osd: 3 osds: 3 up (since 13m), 3 in (since 21h)
data:
volumes: 0/1 healthy, 1 recovering
pools: 5 pools, 129 pgs
objects: 656 objects, 2.1 GiB
usage: 6.4 GiB used, 3.0 TiB / 3 TiB avail
pgs: 89.922% pgs not active
90 peering
26 activating
13 active+clean

Expected results:
---------------------------------------------------

4. After some time 1 OSD is down permanently.

------------ ceph -s ---------------------------
cluster:
id: 458b0172-00d8-4caa-84a7-5bfd259f18bd
health: HEALTH_WARN
1 osds down
1 host (1 osds) down
1 zone (1 osds) down
Degraded data redundancy: 656/1968 objects degraded (33.333%), 83 pgs degraded, 129 pgs undersized
services:
mon: 3 daemons, quorum a,b,c (age 35m)
mgr: a(active, since 35m)
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 2 up (since 5m), 3 in (since 21h)
data:
volumes: 1/1 healthy
pools: 5 pools, 129 pgs
objects: 656 objects, 2.1 GiB
usage: 6.4 GiB used, 3.0 TiB / 3 TiB avail
pgs: 656/1968 objects degraded (33.333%)
83 active+undersized+degraded
46 active+undersized
io:
client: 853 B/s rd, 1 op/s rd, 0 op/s
------------------------------------------------------

Additional info:

$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 3.00000 root default
-5 3.00000 region us-east-1
-10 1.00000 zone us-east-1a
-9 1.00000 host ocs-deviceset-0-data-0qccqd
1 ssd 1.00000 osd.1 up 1.00000 1.00000
-14 1.00000 zone us-east-1b
-13 1.00000 host ocs-deviceset-2-data-0qd5cc
2 ssd 1.00000 osd.2 up 1.00000 1.00000
-4 1.00000 zone us-east-1c
-3 1.00000 host ocs-deviceset-1-data-0xxpjp
0 ssd 1.00000 osd.0 down 1.00000 1.00000

Some command o/p here: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/provider/
OCP must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocp_must_gather_p/
OCS must gather: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rc130/sgatfane-rc130_20220130T184121/logs/AllNodeProviderOffON/ocs_must_gather/
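For reference, a hedged aws CLI equivalent of the "add inbound rules" and "stop all instances" steps above; the worker security group ID and instance IDs are placeholders, and the /32 suffix is an assumption since the description lists the source only as 10.0.0.16:

$ aws ec2 authorize-security-group-ingress --group-id <worker-sg-id> --protocol tcp --port 6789 --cidr 10.0.0.16/32        # ceph mon v1
$ aws ec2 authorize-security-group-ingress --group-id <worker-sg-id> --protocol tcp --port 3300 --cidr 10.0.0.16/32        # ceph mon v2
$ aws ec2 authorize-security-group-ingress --group-id <worker-sg-id> --protocol tcp --port 6800-7300 --cidr 10.0.0.16/32   # osds
$ aws ec2 authorize-security-group-ingress --group-id <worker-sg-id> --protocol tcp --port 9283 --cidr 10.0.0.16/32        # ceph manager
$ aws ec2 stop-instances --instance-ids <provider-instance-ids>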