Bug 2078364 - OSDs are in CLBO when the test_node_maintenance test is performed
Summary: OSDs are in CLBO when the test_node_maintenance test is performed
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-25 06:15 UTC by Pratik Surve
Modified: 2023-08-09 17:03 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-25 15:14:26 UTC
Embargoed:



Description Pratik Surve 2022-04-25 06:15:46 UTC
Description of problem (please be as detailed as possible and provide log
snippets):

OSDs go into CrashLoopBackOff (CLBO) when the test_node_maintenance test is performed:

  Warning  Unhealthy               2m18s (x52 over 51m)   kubelet                  Liveness probe failed: no valid command found; 10 closest matches:

Version of all relevant components (if applicable):

OCS operator	4.8.10-2
Ceph Version	14.2.11-208.el8cp (6738ba96f296a41c24357c12e8d594fbde457abc) nautilus (stable)
Cluster Version	4.8.0-0.nightly-2022-04-13-042657

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy a 4.8 cluster
2. Run test_node_maintenance[worker]
3. Check the OSD pod status (see the example command below)
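
The OSD pod status can be checked with a standard oc query. This is a sketch: the openshift-storage namespace and the app=rook-ceph-osd label are the usual ODF/Rook defaults and are assumptions here, not taken from this report.

  # List the rook-ceph OSD pods and the nodes they are scheduled on
  oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide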


Actual results:
rook-ceph-osd-1-67bd59f5cc-7tw9t                                  1/2     CrashLoopBackOff   17         52m   10.129.4.17    ip-10-0-151-155.us-east-2.compute.internal   <none>           <none>

rook-ceph-osd-4-cff9467db-9prn4                                   1/2     CrashLoopBackOff   17         53m   10.128.6.9     ip-10-0-137-171.us-east-2.compute.internal   <none>           <none>

Events:
  Type     Reason                  Age                    From                     Message
  ----     ------                  ----                   ----                     -------
  Normal   Scheduled               53m                    default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-4-cff9467db-9prn4 to ip-10-0-137-171.us-east-2.compute.internal
  Warning  FailedAttachVolume      53m                    attachdetach-controller  Multi-Attach error for volume "pvc-ac2f5f1f-f347-4c92-858f-4756de4acaa7" Volume is already used by pod(s) rook-ceph-osd-4-cff9467db-pk4tk
  Normal   SuccessfulAttachVolume  52m                    attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-ac2f5f1f-f347-4c92-858f-4756de4acaa7"
  Normal   SuccessfulMountVolume   52m                    kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-ac2f5f1f-f347-4c92-858f-4756de4acaa7" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/aws-ebs/volumeDevices/aws:/us-east-2a/vol-0f220a5b652da1c04"
  Normal   SuccessfulMountVolume   52m                    kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-ac2f5f1f-f347-4c92-858f-4756de4acaa7" volumeMapPath "/var/lib/kubelet/pods/896d5816-cbef-4e06-96b3-7931d6fbd25d/volumeDevices/kubernetes.io~aws-ebs"
  Normal   AddedInterface          52m                    multus                   Add eth0 [10.128.6.9/23] from openshift-sdn
  Normal   Created                 52m                    kubelet                  Created container blkdevmapper
  Normal   Pulled                  52m                    kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Started                 52m                    kubelet                  Started container blkdevmapper
  Normal   Pulled                  52m                    kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Created                 52m                    kubelet                  Created container activate
  Normal   Started                 52m                    kubelet                  Started container activate
  Normal   Pulled                  52m                    kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Started                 52m                    kubelet                  Started container expand-bluefs
  Normal   Created                 52m                    kubelet                  Created container expand-bluefs
  Normal   Pulled                  52m                    kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Created                 52m                    kubelet                  Created container chown-container-data-dir
  Normal   Started                 52m                    kubelet                  Started container chown-container-data-dir
  Normal   Created                 52m                    kubelet                  Created container osd
  Normal   Started                 52m                    kubelet                  Started container osd
  Normal   Pulled                  52m                    kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Created                 52m                    kubelet                  Created container log-collector
  Normal   Started                 52m                    kubelet                  Started container log-collector
  Normal   Pulled                  51m (x2 over 52m)      kubelet                  Container image "quay.io/rhceph-dev/rhceph@sha256:4b16d6f54a9ae1e43ab0f9b76f1b0860cc4feebfc7ee0e797937fc9445c5bb0a" already present on machine
  Normal   Killing                 51m                    kubelet                  Container osd failed liveness probe, will be restarted
  Warning  BackOff                 7m34s (x131 over 45m)  kubelet                  Back-off restarting failed container
  Warning  Unhealthy               2m18s (x52 over 51m)   kubelet                  Liveness probe failed: no valid command found; 10 closest matches:
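
For context, the probe failure events and the crashing container's own output can be pulled the usual way. This is a sketch that reuses the pod name from the output above and assumes the openshift-storage namespace; the osd container name is taken from the events.

  # Show the pod events, including the Liveness probe failures
  oc -n openshift-storage describe pod rook-ceph-osd-4-cff9467db-9prn4

  # Inspect the osd container log from the previous (crashed) instance
  oc -n openshift-storage logs rook-ceph-osd-4-cff9467db-9prn4 -c osd --previous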


Expected results:
OSDs should be in the Running state.


Additional info:
must-gather:- http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j-162ai3c33-ua/j-162ai3c33-ua_20220413T083559/logs/failed_testcase_ocs_logs_1649858908/test_node_maintenance%5bworker%5d_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-2cd89ec7cf8a1f770a6f0b1c0ed478d1175e5b59b16cf5e2ab43c7098f5c8ef3/

Comment 7 Travis Nielsen 2022-04-25 15:14:26 UTC
We have added startup probes in a more recent release, which fixes this issue.
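
A quick way to confirm this on a newer build is to check whether the osd container in an OSD deployment now defines a startupProbe in addition to its livenessProbe. This is a sketch only: the deployment name and container name are taken from this report, the namespace is the ODF default, and all of them may differ per cluster.

  # Print the startup probe defined on the osd container of one OSD deployment
  oc -n openshift-storage get deployment rook-ceph-osd-1 \
    -o jsonpath='{.spec.template.spec.containers[?(@.name=="osd")].startupProbe}{"\n"}'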

