Description of problem (please be as detailed as possible and provide log snippets):
In the add capacity test, when trying to add 3 OSDs, mounting of the OSD PVC failed, which caused one OSD to be down.

Version of all relevant components (if applicable):
Product: vSphere
Cluster Version: 4.6.32
OCS version: v4.6.5
Ceph version: 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)

You can find all the other details here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33s-uma/j006vu1cs33s-uma_20210611T143334/logs/test_report_1623421470.html

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. We can't add capacity properly to the cluster.

Is there any workaround available to the best of your knowledge?
No.

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
1

Is this issue reproducible?
Yes.

Can this issue be reproduced from the UI?
Yes.

If this is a regression, please provide more details to justify this:
Yes. It seems that this problem did not occur in OCS 4.5.

Steps to Reproduce:
1. Add capacity via the UI, or run the "add_capacity" test.

Actual results:
Mounting of the OSD PVC failed, which caused one OSD to be down.

Expected results:
All the OSDs are up and running.
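For anyone reproducing this, a quick way to confirm which OSD is down after running add capacity (a sketch assuming the default openshift-storage namespace and that the rook-ceph-tools pod is deployed; label and pod names may differ per setup):

oc -n openshift-storage get pods -l app=rook-ceph-osd -o wide
# the affected OSD pod stays stuck in ContainerCreating with the FailedMount events pasted below

TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)
oc -n openshift-storage rsh "$TOOLS_POD" ceph osd tree
# expected: every OSD "up"; actual: one of the three new OSDs stays "down"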
Additional info:

Link to the Jenkins job:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1094/

Here is the pods' output where we can see that the mounting of the PVC failed:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j006vu1cs33s-uma/j006vu1cs33s-uma_20210611T143334/logs/failed_testcase_ocs_logs_1623785811/test_add_capacity_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-382e0c32f1b53ed43e2fbcfd0d9b20a0d77166a28f24c51369800c8b7961d6c4/namespaces/openshift-storage/oc_output/pods

Events:
  Type     Reason                    Age                   From                     Message
  ----     ------                    ----                  ----                     -------
  Warning  FailedScheduling          68m                   default-scheduler        0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Warning  FailedScheduling          68m                   default-scheduler        0/6 nodes are available: 6 pod has unbound immediate PersistentVolumeClaims.
  Normal   Scheduled                 67m                   default-scheduler        Successfully assigned openshift-storage/rook-ceph-osd-prepare-ocs-deviceset-2-data-1fv5rc-knt2f to compute-2
  Normal   SuccessfulAttachVolume    67m                   attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc"
  Normal   SuccessfulMountVolume     67m                   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc" globalMapPath "/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/volumeDevices/[vsanDatastore] d3718f5f-4047-3574-7837-e4434bd7dee2/j006vu1cs33s-uma-td6x9-dynamic-pvc-f94e955f-1078-4426-8f51-b092bac794dc.vmdk"
  Normal   SuccessfulMountVolume     67m                   kubelet                  MapVolume.MapPodDevice succeeded for volume "pvc-f94e955f-1078-4426-8f51-b092bac794dc" volumeMapPath "/var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumeDevices/kubernetes.io~vsphere-volume"
  Warning  FailedMount               66m (x8 over 67m)     kubelet                  MountVolume.SetUp failed for volume "ocs-deviceset-2-data-1fv5rc-bridge" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~empty-dir/ocs-deviceset-2-data-1fv5rc-bridge --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~empty-dir/ocs-deviceset-2-data-1fv5rc-bridge
    Output: Failed to start transient scope unit: Argument list too long
  Warning  FailedCreatePodContainer  62m (x25 over 67m)    kubelet                  unable to ensure pod container exists: failed to create container for [kubepods besteffort podadfe0df1-944f-4bf8-ba6c-9012ffe6141d] : Argument list too long
  Warning  FailedMount               51m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[rook-ceph-osd-token-t4vf4 ocs-deviceset-2-data-1fv5rc-bridge], unattached volumes=[rook-binaries ceph-conf-emptydir rook-ceph-log rook-data rook-ceph-osd-token-t4vf4 ocs-deviceset-2-data-1fv5rc-bridge rook-ceph-crash devices udev ocs-deviceset-2-data-1fv5rc]: timed out waiting for the condition
  Warning  FailedMount               42m                   kubelet                  Unable to attach or mount volumes: unmounted volumes=[ocs-deviceset-2-data-1fv5rc-bridge rook-ceph-osd-token-t4vf4], unattached volumes=[ocs-deviceset-2-data-1fv5rc-bridge ceph-conf-emptydir rook-ceph-osd-token-t4vf4 rook-data rook-ceph-crash rook-binaries rook-ceph-log devices udev ocs-deviceset-2-data-1fv5rc]: timed out waiting for the condition
  Warning  FailedMount               6m47s (x36 over 67m)  kubelet                  MountVolume.SetUp failed for volume "rook-ceph-osd-token-t4vf4" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4
    Output: Failed to start transient scope unit: Argument list too long
  Warning  FailedMount               2m43s (x25 over 40m)  kubelet                  (combined from similar events): MountVolume.SetUp failed for volume "rook-ceph-osd-token-t4vf4" : mount failed: exit status 1
    Mounting command: systemd-run
    Mounting arguments: --description=Kubernetes transient mount for /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4 --scope -- mount -t tmpfs tmpfs /var/lib/kubelet/pods/adfe0df1-944f-4bf8-ba6c-9012ffe6141d/volumes/kubernetes.io~secret/rook-ceph-osd-token-t4vf4
    Output: Failed to start transient scope unit: Argument list too long

From what I understand from Jilju Joy, this error also happened in the test tests/manage/pv_services/test_raw_block_pv.py::TestRawBlockPV::test_raw_block_pv[Retain].

When Petr retriggered the job, the add_capacity test passed, as you can see here:
https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/1149/testReport/

Link to the relevant thread:
https://mail.google.com/chat/u/0/#chat/space/AAAAREGEba8/5mVX223n6Vg
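Not from the must-gather above, just a sketch of how the node-side symptom could be checked the next time this reproduces (assuming the affected node is compute-2, as in the Scheduled event, and that cluster-admin access is available). The "Failed to start transient scope unit: Argument list too long" comes from systemd-run on the node rather than from the PVC itself, so the node's systemd state is worth a look; a large number of leaked transient scope/mount units is one possible way systemd can start rejecting new transient units:

oc debug node/compute-2 -- chroot /host systemctl list-units --type=scope --all | wc -l
# an unusually high count of transient scope units on the node would be suspicious

oc debug node/compute-2 -- chroot /host journalctl -u kubelet --since "1 hour ago" | grep -i "argument list too long"
# confirms the kubelet itself keeps logging the same error, i.e. the failure is node-wide and not specific to this PVC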
Happened once on a 4.6 setup; not a 4.8 blocker. Moving out.
Can we close this bug?