Description of problem (please be as detailed as possible and provide log snippets):
In the add capacity test, when trying to add 3 OSDs, mounting of the OSD PVC failed, which caused one OSD to be down.

Version of all relevant components (if applicable):
Product: vSphere
Cluster Version: 4.6.32
OCS version: v4.6.5
Ceph version: 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

You can find all the other details here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33s-uma/j006vu1cs33s-uma_20210611T143334/logs/test_report_1623421470.html

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. We can't add capacity properly to the cluster.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
Yes.

If this is a regression, please provide more details to justify this:
Yes. It seems that this problem did not occur in OCS 4.5.

Steps to Reproduce:
1. Add capacity via the UI, or run the "add_capacity" test.

Actual results:
Mounting of the OSD PVC failed, which caused one OSD to be down.

Expected results:
All the OSDs are up and running.
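For anyone reproducing this, a quick way to confirm which OSD is down after running add capacity (a sketch assuming the default openshift-storage namespace and that the rook-ceph-tools pod is deployed; label and pod names may differ per setup):

oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide
# the affected OSD pod stays stuck in ContainerCreating with the FailedMount events pasted below

TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh "$TOOLS_POD" ceph osd tree
# expected: every OSD "up"; actual: one of the three new OSDs stays "down"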
Additional info:

Link to the Jenkins job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1094/

Here is the pods' output where we can see that the mounting of the PVC failed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33s-uma/j006vu1cs33s-uma_20210611T143334/logs/failed_testcase_ocs_logs_1623785811/test_add_capacity_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-382e0c32f1b53ed43e2fbcfd0d9b20a0d77166a28f24c51369800c8b7961d6c4/namespaces/openshift-storage/oc_output/pods

Events:
  Type     Reason                    Age                   From                     Message
  ----     ------                    ----                  ----                     -------
  Warning  FailedScheduling          68m                   default-scheduler        0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling          68m                   default-scheduler        0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled                 67m                   default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-prepare-ocs-deviceset-2-data-1fv5rc-knt2f to compute-2
  Normal   SuccessfulAttachVolume    67m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc"
  Normal   SuccessfulMountVolume     67m                   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] d3718f5f-4047-3574-7837-e4434bd7dee2/j006vu1cs33s-uma-td6x9-dynamic-pvc-f94e955f-1078-4426-8f51-b092bac794dc.vmdk"
  Normal   SuccessfulMountVolume     67m                   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc" volumeMapPath "/var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumeDevices/kubernetes.io~vsphere-volume"
  Warning  FailedMount               66m (x8 over 67m)     kubelet                  MountVolume.SetUp failed for volume "ocs-deviceset-2-data-1fv5rc-bridge" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~empty-dir/ocs-deviceset-2-data-1fv5rc-bridge --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~empty-dir/ocs-deviceset-2-data-1fv5rc-bridge
    Output: Failed to start transient scope unit: Argument list too long
  Warning  FailedCreatePodContainer  62m (x25 over 67m)    kubelet                  unable to ensure pod container exists: failed to create container for [kubepods besteffort podadfe0df1-944f-4bf8-ba6c-9012ffe6141d] : Argument list too long
  Warning  FailedMount               51m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[rook-ceph-osd-token-t4vf4 ocs-deviceset-2-data-1fv5rc-bridge], unattached volumes=[rook-binaries ceph-conf-emptydir rook-ceph-log rook-data rook-ceph-osd-token-t4vf4 ocs-deviceset-2-data-1fv5rc-bridge rook-ceph-crash devices udev ocs-deviceset-2-data-1fv5rc]: timed out waiting for the condition
  Warning  FailedMount               42m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-2-data-1fv5rc-bridge rook-ceph-osd-token-t4vf4], unattached volumes=[ocs-deviceset-2-data-1fv5rc-bridge ceph-conf-emptydir rook-ceph-osd-token-t4vf4 rook-data rook-ceph-crash rook-binaries rook-ceph-log devices udev ocs-deviceset-2-data-1fv5rc]: timed out waiting for the condition
  Warning  FailedMount               6m47s (x36 over 67m)  kubelet                  MountVolume.SetUp failed for volume "rook-ceph-osd-token-t4vf4" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4
    Output: Failed to start transient scope unit: Argument list too long
  Warning  FailedMount               2m43s (x25 over 40m)  kubelet                  (combined from similar events): MountVolume.SetUp failed for volume "rook-ceph-osd-token-t4vf4" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4
    Output: Failed to start transient scope unit: Argument list too long

From what I understand from Jilju Joy, this error also happened in the test tests/manage/pv_services/test_raw_block_pv.py::TestRawBlockPV::test_raw_block_pv[Retain].

When Petr retriggered the job, the add_capacity test passed, as you can see here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1149/testReport/

Link to the relevant thread:
https://mail.google.com/chat/u/0/#chat/space/AAAAREGEba8/5mVX223n6Vg
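Not from the must-gather above, just a sketch of how the node-side symptom could be checked the next time this reproduces (assuming the affected node is compute-2, as in the Scheduled event, and that cluster-admin access is available). The "Failed to start transient scope unit: Argument list too long" comes from systemd-run on the node rather than from the PVC itself, so the node's systemd state is worth a look; a large number of leaked transient scope/mount units is one possible way systemd can start rejecting new transient units:

oc debug node/compute-2 -- chroot /host systemctl list-units --type=scope --all | wc -l
# an unusually high count of transient scope units on the node would be suspicious

oc debug node/compute-2 -- chroot /host journalctl -u kubelet --since "1 hour ago" | grep -i "argument list too long"
# confirms the kubelet itself keeps logging the same error, i.e. the failure is node-wide and not specific to this PVC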
Happened once on a 4.6 setup; not a 4.8 blocker. Moving out.
Can we close this bug?