Description of problem (please be as detailed as possible and provide log snippets):
One of the OSDs is in CrashLoopBackOff state.

Version of all relevant components (if applicable):
openshift installer (4.7.0-0.nightly-2021-02-08-052658)
ocs-registry:4.7.0-254.ci

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes

Is there any workaround available to the best of your knowledge?
NA

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
NA

If this is a regression, please provide more details to justify this:
Yes, previous builds are working fine.

Steps to Reproduce:
1. Install OCS using ocs-ci.
2. After 1 day, check the Ceph health.

Actual results:
One of the OSDs is in CrashLoopBackOff state.

Expected results:
All OSDs should be up and Ceph health should be OK.

Additional info:

```
$ oc get pods
NAME                                                              READY   STATUS             RESTARTS   AGE
csi-cephfsplugin-88www                                            3/3     Running            0          40h
csi-cephfsplugin-j2ppc                                            3/3     Running            0          40h
csi-cephfsplugin-nwnt2                                            3/3     Running            0          40h
csi-cephfsplugin-provisioner-fdc478cc-477l8                       6/6     Running            3          7h41m
csi-cephfsplugin-provisioner-fdc478cc-tt4tw                       6/6     Running            36         8h
csi-rbdplugin-6tj4s                                               3/3     Running            0          40h
csi-rbdplugin-bgqkl                                               3/3     Running            0          40h
csi-rbdplugin-jlx8q                                               3/3     Running            0          40h
csi-rbdplugin-provisioner-64db99d598-kxpbb                        6/6     Running            39         8h
csi-rbdplugin-provisioner-64db99d598-nl4zq                        6/6     Running            8          7h41m
noobaa-core-0                                                     1/1     Running            0          7h29m
noobaa-db-pg-0                                                    1/1     Running            0          7h29m
noobaa-endpoint-75cb5fd98f-7rcpg                                  1/1     Running            0          8h
noobaa-operator-64d7d695c9-gqnb7                                  1/1     Running            6          8h
ocs-metrics-exporter-747f57d449-2rxc4                             1/1     Running            0          8h
ocs-operator-6b87d498cb-cdkrs                                     1/1     Running            8          8h
rook-ceph-crashcollector-compute-0-d56d7b498-c54qs                1/1     Running            0          7h41m
rook-ceph-crashcollector-compute-1-7bf885565c-gmz6j               1/1     Running            0          8h
rook-ceph-crashcollector-compute-2-6d55d685f6-kvbrw               1/1     Running            0          7h48m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-5c76fdcfbx6nb   2/2     Running            0          7h41m
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-7ff6b975js6vc   2/2     Running            0          8h
rook-ceph-mgr-a-cbf74695b-hxpl2                                   2/2     Running            0          8h
rook-ceph-mon-c-864bfd778-sbfqd                                   2/2     Running            0          8h
rook-ceph-mon-e-6d88fd946b-j58qf                                  2/2     Running            0          15h
rook-ceph-mon-f-7ffdc88f94-skhvb                                  2/2     Running            0          7h30m
rook-ceph-operator-59659cc768-9m78h                               1/1     Running            0          8h
rook-ceph-osd-0-65fc4c96fb-657v9                                  2/2     Running            0          8h
rook-ceph-osd-1-67759bb447-pknwt                                  2/2     Running            0          8h
rook-ceph-osd-2-76d485f4f8-fczq5                                  1/2     CrashLoopBackOff   95         8h
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d6995c9l2zcb   2/2     Running            16         8h
rook-ceph-tools-84bc476959-qx8qw                                  1/1     Running            0          8h
```

```
$ oc describe pod rook-ceph-osd-2-76d485f4f8-fczq5
Name:         rook-ceph-osd-2-76d485f4f8-fczq5
Namespace:    openshift-storage
Priority:     0
Node:         compute-1/10.1.160.98
Start Time:   Wed, 10 Feb 2021 06:42:10 +0530
Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               8h                     default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-2-76d485f4f8-fczq5 to compute-1
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-config-override rook-ceph-log rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb ocs-deviceset-1-data-0ds9gc rook-data]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override ocs-deviceset-1-data-0ds9gc rook-ceph-log]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash ocs-deviceset-1-data-0ds9gc run-udev ocs-deviceset-1-data-0ds9gc-bridge]: timed out waiting for the condition
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc rook-config-override rook-ceph-log rook-ceph-crash run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data]: timed out waiting for the condition
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge ocs-deviceset-1-data-0ds9gc rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash run-udev]: error processing PVC openshift-storage/ocs-deviceset-1-data-0ds9gc: failed to fetch PVC from API server: etcdserver: leader changed
  Warning  FailedMount             8h                     kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log rook-ceph-crash run-udev]: error processing PVC openshift-storage/ocs-deviceset-1-data-0ds9gc: failed to fetch PV pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a from API server: etcdserver: leader changed
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data ocs-deviceset-1-data-0ds9gc rook-config-override rook-ceph-log rook-ceph-crash run-udev]: timed out waiting for the condition
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedAttachVolume      8h                     attachdetach-controller  Multi-Attach error for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" Volume is already used by pod(s) rook-ceph-osd-2-76d485f4f8-gxht7
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[run-udev ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log ocs-deviceset-1-data-0ds9gc rook-ceph-crash]: timed out waiting for the condition
  Warning  FailedMount             8h (x2 over 8h)        kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-1-data-0ds9gc], unattached volumes=[ocs-deviceset-1-data-0ds9gc-bridge rook-ceph-osd-token-7z9lb rook-data rook-config-override rook-ceph-log ocs-deviceset-1-data-0ds9gc rook-ceph-crash run-udev]: timed out waiting for the condition
  Normal   SuccessfulAttachVolume  8h                     attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a"
  Normal   SuccessfulMountVolume   8h                     kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] 66242d5f-cafa-91c3-8164-e4434bd7df48/vavuthu-t254-ql8r9-dynamic-pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a.vmdk"
  Normal   SuccessfulMountVolume   8h                     kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-98f35cad-9222-4b66-8721-2e7c9b2fdc4a" volumeMapPath "/var/lib/kubelet/pods/52cfa1dd-f3ac-439a-9dd3-39ba11ae07e2/volumeDevices/kubernetes.io~vsphere-volume"
  Normal   AddedInterface          8h                     multus                   Add eth0 [10.128.2.248/23]
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container blkdevmapper
  Normal   Started                 8h                     kubelet                  Started container blkdevmapper
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container activate
  Normal   Started                 8h                     kubelet                  Started container activate
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container expand-bluefs
  Normal   Started                 8h                     kubelet                  Started container expand-bluefs
  Normal   Pulled                  8h                     kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Normal   Created                 8h                     kubelet                  Created container chown-container-data-dir
  Normal   Started                 8h                     kubelet                  Started container chown-container-data-dir
  Normal   Created                 8h                     kubelet                  Created container osd
  Normal   Started                 8h                     kubelet                  Started container osd
  Warning  Unhealthy               8h (x19 over 8h)       kubelet                  Liveness probe failed: admin_socket: exception getting command descriptions: [Errno 2] No such file or directory
  Normal   Pulled                  27m (x99 over 8h)      kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:35e13c86bf5891b6db3386e74fc2be728906173a7aabb5d1aa11452a62d136e9" already present on machine
  Warning  BackOff                 2m27s (x2344 over 8h)  kubelet                  Back-off restarting failed container
```

```
$ oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph health
HEALTH_WARN 1 osds down; 1 host (1 osds) down; 1 rack (1 osds) down; Degraded data redundancy: 5422/16266 objects degraded (33.333%), 185 pgs degraded, 272 pgs undersized; 2 daemons have recently crashed
```

must-gather logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/vavuthu-t254/vavuthu-t254_20210208T152856/logs/deployment_1612798645/ocs_must_gather_new/
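
For reference, a minimal sketch of the checks used to confirm which OSD is down and to capture the crashing container's logs. Pod and deployment names are taken from the listing above and will differ on another cluster:

```
# Expand the HEALTH_WARN summary and confirm which OSD/host/rack is down
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph health detail
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph osd tree

# List the "2 daemons have recently crashed" entries reported above
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph crash ls

# Logs of the previous (crashed) run of the osd container
oc -n openshift-storage logs rook-ceph-osd-2-76d485f4f8-fczq5 -c osd --previous
```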
The socket liveness probe was not the issue; the OSD had an odd problem authenticating with the cluster, which appears to have been transient. I removed the OSD pod, which triggered a restart of the entire deployment (patching the deployment would have had the same effect). After this operation the OSD is up and running again, so I cannot debug further. Closing for now, since we cannot make further progress. Feel free to re-open if you get another cluster in the same state, thanks.
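
For anyone who hits the same state, a rough sketch of the recovery step described above. The deployment name is inferred from the pod name and will differ on another cluster; this only restarts the OSD, it does not address the underlying authentication issue:

```
# Delete the stuck pod so its ReplicaSet recreates it ...
oc -n openshift-storage delete pod rook-ceph-osd-2-76d485f4f8-fczq5

# ... or, equivalently, restart the whole OSD deployment
oc -n openshift-storage rollout restart deployment/rook-ceph-osd-2

# Verify the OSD rejoined the cluster
oc -n openshift-storage exec rook-ceph-tools-84bc476959-qx8qw -- ceph osd tree
```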