Description of problem (please be as detailed as possible and provide log snippets):
One of the OSDs is in CrashLoopBackOff state.

Version of all relevant components (if applicable):
openshift installer (4.7.0-0.nightly-2021-02-08-052658)
ocs-registry:4.7.0-254.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
Yes, previous builds are working fine.

Steps to Reproduce:
1. Install OCS using ocs-ci.
2. After 1 day, check the Ceph health.

Actual results:
One of the OSDs is in CrashLoopBackOff state.

Expected results:
All OSDs should be up and Ceph health should be OK.

Additional info:

```
$ oc get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-88www                                            3/3     Running            0          40h
csi-cephfsplugin-j2ppc                                            3/3     Running            0          40h
csi-cephfsplugin-nwnt2                                            3/3     Running            0          40h
csi-cephfsplugin-provisioner-fdc478cc-477l8                       6/6     Running            3          7h41m
csi-cephfsplugin-provisioner-fdc478cc-tt4tw                       6/6     Running            36         8h
csi-rbdplugin-6tj4s                                               3/3     Running            0          40h
csi-rbdplugin-bgqkl                                               3/3     Running            0          40h
csi-rbdplugin-jlx8q                                               3/3     Running            0          40h
csi-rbdplugin-provisioner-64db99d598-kxpbb                        6/6     Running            39         8h
csi-rbdplugin-provisioner-64db99d598-nl4zq                        6/6     Running            8          7h41m
noobaa-core-0                                                     1/1     Running            0          7h29m
noobaa-db-pg-0                                                    1/1     Running            0          7h29m
noobaa-endpoint-75cb5fd98f-7rcpg                                  1/1     Running            0          8h
noobaa-operator-64d7d695c9-gqnb7                                  1/1     Running            6          8h
ocs-metrics-exporter-747f57d449-2rxc4                             1/1     Running            0          8h
ocs-operator-6b87d498cb-cdkrs                                     1/1     Running            8          8h
rook-ceph-crashcollector-compute-0-d56d7b498-c54qs                1/1     Running            0          7h41m
rook-ceph-crashcollector-compute-1-7bf885565c-gmz6j               1/1     Running            0          8h
rook-ceph-crashcollector-compute-2-6d55d685f6-kvbrw               1/1     Running            0          7h48m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5c76fdcfbx6nb   2/2     Running            0          7h41m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7ff6b975js6vc   2/2     Running            0          8h
rook-ceph-mgr-a-cbf74695b-hxpl2                                   2/2     Running            0          8h
rook-ceph-mon-c-864bfd778-sbfqd                                   2/2     Running            0          8h
rook-ceph-mon-e-6d88fd946b-j58qf                                  2/2     Running            0          15h
rook-ceph-mon-f-7ffdc88f94-skhvb                                  2/2     Running            0          7h30m
rook-ceph-operator-59659cc768-9m78h                               1/1     Running            0          8h
rook-ceph-osd-0-65fc4c96fb-657v9                                  2/2     Running            0          8h
rook-ceph-osd-1-67759bb447-pknwt                                  2/2     Running            0          8h
rook-ceph-osd-2-76d485f4f8-fczq5                                  1/2     CrashLoopBackOff   95         8h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d6995c9l2zcb   2/2     Running            16         8h
rook-ceph-tools-84bc476959-qx8qw                                  1/1     Running            0          8h
```

```
$ oc describe pod rook-ceph-osd-2-76d485f4f8-fczq5
Name:         rook-ceph-osd-2-76d485f4f8-fczq5
Namespace:    openshift-storage
Priority:     0
Node:         compute-1/10.1.160.98
Start Time:   Wed, 10 Feb 2021 06:42:10 +0530
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               8h                     default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-2-76d485f4f8-fczq5 to compute-1
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-config-override rook-ceph-log rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb ocs-deviceset-1-data-0ds9gc rook-data]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override ocs-deviceset-1-data-0ds9gc rook-ceph-log]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash ocs-deviceset-1-data-0ds9gc run-udev ocs-deviceset-1-data-0ds9gc-bridge]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc rook-config-override rook-ceph-log rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data]: timed out waiting for the condition
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge ocs-deviceset-1-data-0ds9gc rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash run-udev]: error processing PVC openshift-storage/ocs-deviceset-1-data-0ds9gc: failed to fetch PVC from API server: etcdserver: leader changed
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash run-udev]: error processing PVC openshift-storage/ocs-deviceset-1-data-0ds9gc: failed to fetch PV pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a from API server: etcdserver: leader changed
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data ocs-deviceset-1-data-0ds9gc rook-config-override rook-ceph-log rook-ceph-crash run-udev]: timed out waiting for the condition
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log ocs-deviceset-1-data-0ds9gc rook-ceph-crash]: timed out waiting for the condition
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log ocs-deviceset-1-data-0ds9gc rook-ceph-crash run-udev]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  8h                     attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a"
  Normal   SuccessfulMountVolume   8h                     kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/vavuthu-t254-ql8r9-dynamic-pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a.vmdk"
  Normal   SuccessfulMountVolume   8h                     kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" volumeMapPath "/var/lib/kubelet/pods/52cfa1dd-f3ac-439a-9dd3-39ba11ae07e2/volumeDevices/kubernetes.io~vsphere-volume"
  Normal   AddedInterface          8h                     multus                   Add eth0 [10.128.2.248/23]
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container blkdevmapper
  Normal   Started                 8h                     kubelet                  Started container blkdevmapper
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container activate
  Normal   Started                 8h                     kubelet                  Started container activate
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container expand-bluefs
  Normal   Started                 8h                     kubelet                  Started container expand-bluefs
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container chown-container-data-dir
  Normal   Started                 8h                     kubelet                  Started container chown-container-data-dir
  Normal   Created                 8h                     kubelet                  Created container osd
  Normal   Started                 8h                     kubelet                  Started container osd
  Warning  Unhealthy               8h (x19 over 8h)       kubelet                  Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
  Normal   Pulled                  27m (x99 over 8h)      kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Warning  BackOff                 2m27s (x2344 over 8h)  kubelet                  Back-off restarting failed container
```

```
$ oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph health
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 5422/16266 objects degraded (33.333%), 185 pgs degraded, 272 pgs undersized; 2 daemons have recently crashed
```

must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu-t254/vavuthu-t254_20210208T152856/logs/deployment_1612798645/ocs_must_gather_new/
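
For reference, a minimal sketch of the checks used to confirm which OSD is down and to capture the crashing container's logs. Pod and deployment names are taken from the listing above and will differ on another cluster:

```
# Expand the HEALTH_WARN summary and confirm which OSD/host/rack is down
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph health detail
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph osd tree

# List the "2 daemons have recently crashed" entries reported above
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph crash ls

# Logs of the previous (crashed) run of the osd container
oc -n openshift-storage logs rook-ceph-osd-2-76d485f4f8-fczq5 -c osd --previous
```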
The socket liveness probe was not the issue; the OSD had an odd problem authenticating with the cluster, which appears to have been transient. I removed the OSD pod, which triggered a restart of the entire deployment (patching the deployment would have had the same effect). After this operation the OSD is up and running again, so I cannot debug further. Closing for now, since we cannot make further progress. Feel free to re-open if you get another cluster in the same state, thanks.
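
For anyone who hits the same state, a rough sketch of the recovery step described above. The deployment name is inferred from the pod name and will differ on another cluster; this only restarts the OSD, it does not address the underlying authentication issue:

```
# Delete the stuck pod so its ReplicaSet recreates it ...
oc -n openshift-storage delete pod rook-ceph-osd-2-76d485f4f8-fczq5

# ... or, equivalently, restart the whole OSD deployment
oc -n openshift-storage rollout restart deployment/rook-ceph-osd-2

# Verify the OSD rejoined the cluster
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph osd tree
```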