Description of problem:

PVC expansion is taking more time than expected. This issue was observed in a Tier1 test run for RHV Tech qualification. The test verifies expansion of an already-expanded PVC.

Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There were resize failure events before the expansion succeeded. The issue was seen on the first attempt at expansion.

18:33:06 - MainThread - ocs_ci.ocs.resources.pvc - INFO - Capacity of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c is not 20Gi as expected, but 10Gi. Retrying.
18:33:07 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.5 will be used.

The csi-resizer container log from the csi-cephfsplugin-provisioner-5f8b66cc96-zqj9f pod shows resize failure events before the expansion succeeded. Expansion of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c from 10Gi to 20Gi took 10 minutes, which is much longer than expected.

Version-Release number of selected component (if applicable):

Ceph Version          14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)
Cluster Version       4.5.9
OCS operator          v4.5.0
Test Run Name         sgatfane-OCS4-5-Downstream--LSO-MON-HOSTPATH-OSD-HDD-RHV-IPI-1AZ-RHCOS-3M-3W-tier4

ceph                  rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8
cephfsplugin          132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
noobaa-operator       mcg-rhel8-operator@sha256:2cc0c9247ea1f1df0d3c930216d8c4880485033330beddb922c45132ff795169
noobaa_cor            mcg-core-rhel8@sha256:71757f0e1ac6b4092ff2ceb0e642424e6f99b7824f352d72846ecf88f45a5fe2
noobaa_db             mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95
rbdplugin             132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
rook-ceph-operator    rook-ceph-rhel8-operator@sha256:95b31d3e0bebf4f27b42f25fdde1fd06098ccfc833f20e3b9fe2ca7fb11bc104
rook_ceph             rook-ceph-rhel8-operator@sha256:95b31d3e0bebf4f27b42f25fdde1fd06098ccfc833f20e3b9fe2ca7fb11bc104
rook_csi_attacher     ose-csi-external-attacher@sha256:db168e03ea39c90691497b4755e231f7ac9bc1a767ed776c02165e2cf509f0b2
rook_csi_ceph         cephcsi-rhel8@sha256:132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
rook_csi_provisioner  ose-csi-external-provisioner-rhel7@sha256:69798ba68baf4d14a755a5f581e54a4c18ca3022d8a7e1e10f651ba2f8a0a597
rook_csi_registrar    ose-csi-driver-registrar@sha256:7a814c6d75c1dd7eb67977cacca379184b54750151dcd9ff698d7852263fbc9f
rook_csi_resizer      ose-csi-external-resizer-rhel7@sha256:279a12fb2095c7c7f7429135317c53a3f821d6a5a7b89b2f49fc4f84d5cfba42

How reproducible:
3/3

Steps to Reproduce:
The test case is automated: tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expansion

Actual results:
Resize failures before the expansion succeeded.

Expected results:
Expansion should not take this long; the resize attempt should succeed without failures.

Additional info:
Platform: RHV
Storage used: etcd on SSD storage; OS disks are thin provisioned. OSDs on an HDD storage domain, one 500GiB disk on each worker node, disk allocation policy: preallocated.

Test execution console log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13069/console
TestLogs: https://drive.google.com/drive/folders/1Ozoky1mX93IwYTfg8pVuEyG_xM8lrHbv?usp=sharing
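The "Capacity ... is not 20Gi as expected, but 10Gi. Retrying." lines in the test log boil down to polling the PVC's reported capacity until it matches the requested size or a timeout expires. A minimal sketch of that pattern, assuming a caller-supplied `get_capacity` callable as a stand-in for the real status lookup in ocs_ci (which this sketch does not reproduce):

```python
import time

def wait_for_pvc_capacity(get_capacity, expected, timeout=600, interval=5):
    """Poll a PVC's reported capacity until it matches `expected` or the
    timeout (seconds) expires. `get_capacity` is a stand-in for the real
    API lookup and returns strings like "10Gi" or "20Gi"."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_capacity() == expected:
            return True
        # The test logs "Capacity ... is not 20Gi as expected, but 10Gi.
        # Retrying." at this point before sleeping.
        time.sleep(interval)
    return False
```

In the failing run, a loop like this kept observing 10Gi for roughly 10 minutes before the resizer finally succeeded.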
> Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There are resize failure events before the expansion succeeded. The issue was seen in first attempt of expansion.

Can you mention the failure error messages noticed in this testing?
(In reply to Humble Chirammal from comment #2)

> > Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There are resize
> > failure events before the expansion succeeded. The issue was seen in first
> > attempt of expansion.
>
> Can you mention the failure error messages noticed in this testing ?

While we continue our analysis, I would like to understand: is there a specific reason to test this on an OCS 4.5 cluster? If possible, can we try this on an OCS 4.6 setup?
(In reply to Humble Chirammal from comment #4)

> (In reply to Humble Chirammal from comment #2)
>
> > > Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There are resize
> > > failure events before the expansion succeeded. The issue was seen in first
> > > attempt of expansion.
> >
> > Can you mention the failure error messages noticed in this testing ?
>
> While we continue our analysis, I would like to understand, is there a
> specific reason to test this on OCS 4.5 cluster? if possible, can we try
> this on OCS 4.6 setup ?

I also have a question: is this cluster healthy? Can you check the Ceph cluster status while you face this issue? I am asking because the expansion is happening on a CephFS PVC, so in the backend it is a very minimal operation of "quota" adjustment, which should not cause any delay; in other words, it should return pretty quickly. From the logs, it looks like an ongoing operation on this PVC is holding the lock for a long time:

2020-10-05T18:34:47.760064589Z I1005 18:34:47.760029 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
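The Aborted "operation with the given Volume ID ... already exists" error comes from the CSI driver's in-flight operation guard: each volume ID with an operation in progress is tracked, and any new request for the same ID is rejected until the first one finishes. A minimal sketch of that pattern, with illustrative names rather than the actual ceph-csi implementation:

```python
import threading

class VolumeLocks:
    """Track volume IDs that have an operation in flight, in the style of
    a CSI driver's in-flight operation guard (illustrative sketch, not
    the real ceph-csi code)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._inflight = set()

    def try_acquire(self, volume_id):
        """Return False when an operation for this volume ID is already
        running; the gRPC layer would then answer with Aborted:
        "an operation with the given Volume ID ... already exists"."""
        with self._mutex:
            if volume_id in self._inflight:
                return False
            self._inflight.add(volume_id)
            return True

    def release(self, volume_id):
        """Mark the volume's operation as finished."""
        with self._mutex:
            self._inflight.discard(volume_id)
```

This is consistent with the resizer log below, where retried expansion requests keep being rejected with Aborted while an earlier request for the same volume is still stuck.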
(In reply to Humble Chirammal from comment #4)

> While we continue our analysis, I would like to understand, is there a
> specific reason to test this on OCS 4.5 cluster? if possible, can we try
> this on OCS 4.6 setup ?

We test on OCS 4.5 since RHV is intended to be supported starting with 4.5.z.
Removing the Regression keyword, as this is seen with RHV, which is not yet supported.
---------------------------

Ceph Health:

  cluster:
    id:     527fc23c-f5ab-43d6-85d9-157474a2453b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 4h)
    mgr: a(active, since 9d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 4h), 3 in (since 10d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)

  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle

  data:
    pools:   10 pools, 176 pgs
    objects: 21.55k objects, 75 GiB
    usage:   196 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     176 active+clean

  io:
    client: 853 B/s rd, 21 KiB/s wr, 1 op/s rd, 2 op/s wr

--------------

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED
    hdd       1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07
    TOTAL     1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07

POOLS:
    POOL                                                     ID     STORED      OBJECTS     USED        %USED     MAX AVAIL
    ocs-storagecluster-cephblockpool                          1     59 GiB      19.82k      181 GiB     14.33       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.control            2     0 B              8      0 B             0       360 GiB
    ocs-storagecluster-cephfilesystem-metadata                3     117 MiB        350      280 MiB      0.03       360 GiB
    ocs-storagecluster-cephfilesystem-data0                   4     4.0 GiB      1.12k      12 GiB       1.10       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.meta               5     1.7 KiB          7      1.1 MiB         0       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.log                6     148 KiB        210      6.5 MiB         0       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.index      7     642 B           11      642 B           0       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec     8     0 B              0      0 B             0       360 GiB
    .rgw.root                                                 9     4.7 KiB         16      2.8 MiB         0       360 GiB
    ocs-storagecluster-cephobjectstore.rgw.buckets.data      10     1.0 KiB          2      384 KiB         0       360 GiB

-------------------

There are 'crashes' in Ceph:

1 crashes recorded
1 older than 1 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
1 older than 3 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
1 older than 7 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804

Also, the CSI subvolume group we use for PVCs has been marked for "Deletion":

[
    {
        "name": "_deleting"   ------> this is marked for "deletion"
    },
    {
        "name": "csi"
    }
]

-------------------------------

Suchita,

Were these logs captured right after, or near to, the issue timestamp? Was this 10-minute delay in the expansion operation noticed in only one or two instances, or is it 100% reproducible in this setup? Do we have a setup where we could take a look? One possible root cause is Ceph commands getting stuck; if we have access to the cluster, that is something we could check. With the above questions I am trying to isolate the issue.
--snip--
2020-10-05T18:31:37.687426159Z E1005 18:31:37.687306 1 controllerserver.go:340] ID: 14288 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:37.687426159Z E1005 18:31:37.687368 1 utils.go:163] ID: 14288 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.699191935Z E1005 18:31:39.699104 1 controllerserver.go:340] ID: 14289 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.699191935Z E1005 18:31:39.699161 1 utils.go:163] ID: 14289 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.710797075Z E1005 18:31:43.710648 1 controllerserver.go:340] ID: 14290 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.710797075Z E1005 18:31:43.710690 1 utils.go:163] ID: 14290 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.723209745Z E1005 18:31:51.722840 1 controllerserver.go:340] ID: 14291 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.723209745Z E1005 18:31:51.722913 1 utils.go:163] ID: 14291 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.735194864Z E1005 18:32:07.735128 1 controllerserver.go:340] ID: 14292 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.735194864Z E1005 18:32:07.735171 1 utils.go:163] ID: 14292 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.013445743Z E1005 18:32:23.013369 1 controllerserver.go:340] ID: 14293 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.013477263Z E1005 18:32:23.013433 1 utils.go:163] ID: 14293 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.748010186Z E1005 18:32:39.747957 1 controllerserver.go:340] ID: 14295 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.748010186Z E1005 18:32:39.747997 1 utils.go:163] ID: 14295 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.759815737Z E1005 18:34:47.759730 1 controllerserver.go:340] ID: 14298 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.759815737Z E1005 18:34:47.759772 1 utils.go:163] ID: 14298 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:36:21.739384142Z E1005 18:36:21.739227 1 volume.go:82] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 failed to get the rootpath for the vol csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7(an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in get_fs_handle
2020-10-05T18:36:21.739384142Z     conn.connect()
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 85, in connect
2020-10-05T18:36:21.739384142Z     self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
2020-10-05T18:36:21.739384142Z   File "cephfs.pyx", line 694, in cephfs.LibCephFS.mount
2020-10-05T18:36:21.739384142Z cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]
2020-10-05T18:36:21.739384142Z
2020-10-05T18:36:21.739384142Z During handling of the above exception, another exception occurred:
2020-10-05T18:36:21.739384142Z
2020-10-05T18:36:21.739384142Z Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/mgr_module.py", line 974, in _handle_command
2020-10-05T18:36:21.739384142Z     return self.handle_command(inbuf, cmd)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/module.py", line 270, in handle_command
2020-10-05T18:36:21.739384142Z     return handler(inbuf, cmd)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/module.py", line 338, in _cmd_fs_subvolume_getpath
2020-10-05T18:36:21.739384142Z     group_name=cmd.get('group_name', None))
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 213, in subvolume_getpath
2020-10-05T18:36:21.739384142Z     with open_volume(self, volname) as fs_handle:
2020-10-05T18:36:21.739384142Z   File "/lib64/python3.6/contextlib.py", line 81, in __enter__
2020-10-05T18:36:21.739384142Z     return next(self.gen)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 283, in open_volume
2020-10-05T18:36:21.739384142Z     fs_handle = vc.connection_pool.get_fs_handle(volname)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in get_fs_handle
2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0], e.args[1])
2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
2020-10-05T18:36:21.739384142Z
2020-10-05T18:36:21.739384142Z E1005 18:36:21.739310 1 controllerserver.go:354] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 validation and extraction of volume options failed: an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
2020-10-05T18:36:21.73945746Z E1005 18:36:21.739377 1 utils.go:163] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = InvalidArgument desc = an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
---snip-------
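The final `TypeError: bad operand type for unary -: 'str'` in the traceback above comes from `raise VolumeException(-e.args[0], e.args[1])`: negating `args[0]` only works when it holds a numeric errno, and here the cephfs error apparently carried a string there. A minimal standalone repro of that failure mode (plain tuples stand in for the real cephfs exception args):

```python
def negate_errno(args):
    # Mirrors mgr/volumes get_fs_handle's `VolumeException(-e.args[0], e.args[1])`.
    return (-args[0], args[1])

# Numeric errno in args[0], the shape the code expects: works fine.
ok = negate_errno((110, "Connection timed out"))  # (-110, 'Connection timed out')

# A string in args[0] reproduces the crash from the traceback:
# TypeError: bad operand type for unary -: 'str'
try:
    negate_errno(("error calling ceph_mount: Connection timed out", 110))
except TypeError as exc:
    crash_message = str(exc)
```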
(In reply to Humble Chirammal from comment #9)

> occurred while running ceph args: [fs subvolume getpath
> ocs-storagecluster-cephfilesystem
> csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m
> 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c
> /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner
> --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent
> call last):
..........
> 2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in get_fs_handle
> 2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0], e.args[1])
> 2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
> ---snip-------

The above stack trace closely resembles what was tracked in this upstream Ceph tracker: https://tracker.ceph.com/issues/46607

The fix upstream was provided via this tracker: https://tracker.ceph.com/issues/46464

The fix in the above trackers handles EDQUOT error returns; however, in this case the errno seems to be ETIMEDOUT (110), which is not handled. This needs a bug against CephFS to handle the same.

Further, the error should not have caused the 10-minute delay: CSI should have recovered and tried again. The error itself being a timeout, this looks like an issue where Ceph health was not ready to serve requests.

(In reply to Humble Chirammal from comment #8)

> ---------------------------
> Also the CSI subvolume group we use for PVCs has been marked for "Deletion"
>
> [
>     {
>         "name": "_deleting" ------> this is marked for "deletion"
>     },
>     {
>         "name": "csi"
>     }
> ]

The above lists two subvolumegroups, "_deleting" and "csi". It does not mean the status of the csi subvolumegroup is "deleting": the "_deleting" subvolume group is where subvolumes are placed for async deletion.
(In reply to Shyamsundar from comment #11)

> The fix in the above trackers was to handle EDQUOT error returns, however in
> this case the errno seems to be ETIMEDOUT (110) which is not handled. This
> needs a bug against CephFS to handle the same.

Filed an upstream tracker for this: https://tracker.ceph.com/issues/47798

The issue possibly only impacts Nautilus builds upstream.
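The defensive handling the tracker asks for can be sketched as: only negate `args[0]` when it really is an integer errno, and fall back to a generic error otherwise instead of crashing on `unary -`. This is a hypothetical helper illustrating the idea, not the actual upstream patch:

```python
import errno

def to_volume_exception_args(exc_args):
    """Map a cephfs error's args to a (negative errno, message) pair,
    guarding against a non-integer errno slot. Hypothetical sketch of
    the handling requested in the tracker, not the upstream fix."""
    code = exc_args[0] if exc_args else None
    if isinstance(code, int):
        # Known numeric errno (EDQUOT, ETIMEDOUT, ...): negate as usual.
        return (-code, exc_args[1] if len(exc_args) > 1 else "")
    # Unexpected shape (e.g. a message string in args[0]): report a
    # generic I/O error instead of dying with
    # "bad operand type for unary -: 'str'".
    return (-errno.EIO, str(exc_args))
```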
(In reply to Shyamsundar from comment #11)

> (In reply to Humble Chirammal from comment #9)
> > occurred while running ceph args: [fs subvolume getpath
> > ocs-storagecluster-cephfilesystem
> > ..........
> > 2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in get_fs_handle
> > 2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0], e.args[1])
> > 2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
> > ---snip-------
>
> The above stack trace closely resembles what was tracked in this upstream
> Ceph tracker: https://tracker.ceph.com/issues/46607
>
> The fix upstream was provided using the tracker:
> https://tracker.ceph.com/issues/46464
>
> The fix in the above trackers was to handle EDQUOT error returns, however in
> this case the errno seems to be ETIMEDOUT (110) which is not handled. This
> needs a bug against CephFS to handle the same.

True. Let me know if you want me to open a bug on this.

> Further, the error should not have caused the 10 minute delay. CSI should
> have recovered and tried again. The error itself being a timed out error,
> looks to be an issue where Ceph health was not ready to serve requests.

Yeah, but the must-gather output captured and shared here indicates that Ceph health is good. I looked at the available logs (mgr, etc.) but couldn't find any cluster health issues, at least in the logs.
However, there were many issues in this setup on the same day, which I captured from the logs. I didn't paste them in this bugzilla earlier as they occurred before this incident; here they are:

2020-10-05T06:00:31.109598333Z E1005 06:00:31.109444 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:01:11.86034908Z E1005 06:01:11.860243 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:49.458241922Z E1005 06:01:49.458177 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:56.441435721Z E1005 06:01:56.441330 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:05.548077331Z E1005 06:02:05.548019 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:15.787643537Z E1005 06:02:15.787565 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: no route to host
2020-10-05T06:02:31.146695203Z E1005 06:02:31.146627 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: no route to host
2020-10-05T06:02:40.931265643Z E1005 06:02:40.931158 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:51.19005191Z E1005 06:02:51.189984 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:59.66866016Z E1005 06:02:59.668590 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:03:34.724154606Z E1005 06:03:34.724052 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:03:45.580289453Z E1005 06:03:45.580155 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:03:55.722203879Z E1005 06:03:55.722131 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:04:31.022417917Z E1005 06:04:31.022326 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out

Also on 04th Oct:

2020-10-04T12:36:07.940091165Z E1004 12:36:07.939985 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-04T15:41:09.747847649Z E1004 15:41:09.747679 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:30.749830017Z E1004 15:41:30.749728 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:44.733828797Z E1004 15:41:44.733783 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:58.745777091Z E1004 15:41:58.745725 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:19.740076877Z E1004 15:42:19.740025 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:40.743889502Z E1004 15:42:40.743810 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:54.890380587Z E1004 15:42:54.890215 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=2787, ErrCode=NO_ERROR, debug=""
2020-10-04T15:43:30.69010146Z E1004 15:43:30.690002 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout

> The above is listing 2 subvolumegroups, "_deleting" and "csi". This does not
> reflect the status of the csi subvolumegroup as "deleting". The "_deleting"
> subvolume group is the location of the async deletion of subvolumes.

True; I noticed later that it was the subvolume group used for async deletion.
Suchita, can you respond to the questions raised in comment #8? Overall, it looks to me that this cluster is not healthy, which is likely the root cause of this issue.
I spent some time, without luck, looking at the Ceph-side logs (mgr, fs dump, distribution status, crashes [1], progress, etc.) trying to find any suspicious state in the cluster.

[1] The last crashes were reported on Sep 25th, which is old.

Suchita, it would be appreciated if you could share the details requested in the earlier comments, or if we could try to reproduce this again.
(In reply to Humble Chirammal from comment #4) > (In reply to Humble Chirammal from comment #2) > > > > > Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There are resize > > failure events before the expansion succeeded. The issue was seen in first > > attempt of expansion. > > > > > > > Can you mention the failure error messages noticed in this testing ? > > > While we continue our analysis,I would like to understand, is there a > specific reason to test this on OCS 4.5 cluster? if possible, can we try > this on OCS 4.6 setup ? We are qualifying OCS 4.5 for RHV Tech Preview, so we are testing it on the same version.
(In reply to Humble Chirammal from comment #8) > --------------------------- > > > Ceph Health: > > cluster: > id: 527fc23c-f5ab-43d6-85d9-157474a2453b > health: HEALTH_OK > > services: > mon: 3 daemons, quorum a,b,c (age 4h) > mgr: a(active, since 9d) > mds: ocs-storagecluster-cephfilesystem:1 > {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay > osd: 3 osds: 3 up (since 4h), 3 in (since 10d) > rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, > ocs.storagecluster.cephobjectstore.b) > > task status: > scrub status: > mds.ocs-storagecluster-cephfilesystem-a: idle > mds.ocs-storagecluster-cephfilesystem-b: idle > > data: > pools: 10 pools, 176 pgs > objects: 21.55k objects, 75 GiB > usage: 196 GiB used, 1.3 TiB / 1.5 TiB avail > pgs: 176 active+clean > > io: > client: 853 B/s rd, 21 KiB/s wr, 1 op/s rd, 2 op/s wr > > -------------- > > RAW STORAGE: > CLASS SIZE AVAIL USED RAW USED %RAW USED > hdd 1.5 TiB 1.3 TiB 193 GiB 196 GiB 13.07 > TOTAL 1.5 TiB 1.3 TiB 193 GiB 196 GiB 13.07 > > POOLS: > POOL ID STORED > OBJECTS USED %USED MAX AVAIL > ocs-storagecluster-cephblockpool 1 59 GiB > 19.82k 181 GiB 14.33 360 GiB > ocs-storagecluster-cephobjectstore.rgw.control 2 0 B > 8 0 B 0 360 GiB > ocs-storagecluster-cephfilesystem-metadata 3 117 MiB > 350 280 MiB 0.03 360 GiB > ocs-storagecluster-cephfilesystem-data0 4 4.0 GiB > 1.12k 12 GiB 1.10 360 GiB > ocs-storagecluster-cephobjectstore.rgw.meta 5 1.7 KiB > 7 1.1 MiB 0 360 GiB > ocs-storagecluster-cephobjectstore.rgw.log 6 148 KiB > 210 6.5 MiB 0 360 GiB > ocs-storagecluster-cephobjectstore.rgw.buckets.index 7 642 B > 11 642 B 0 360 GiB > ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec 8 0 B > 0 0 B 0 360 GiB > .rgw.root 9 4.7 KiB > 16 2.8 MiB 0 360 GiB > ocs-storagecluster-cephobjectstore.rgw.buckets.data 10 1.0 KiB > 2 384 KiB 0 360 GiB > > > ------------------- > > There are 'crashes' in Ceph: > > 1 crashes recorded > 1 older than 1 days old: > 
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804 > 1 older than 3 days old: > 2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804 > 1 older than 7 days old: > 2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804 > > > Also the CSI subvolume group we use for PVCs has been marked for "Deletion" > > [ > { > "name": "_deleting" ------> this is marked for "deletion" > }, > { > "name": "csi" > } > ] > > ------------------------------- > > Suchita, > > Are these logs captured right after or near to issue timestamp? --> Logs were collected during the ocs-ci test run, when the test failed. > Is this delay of "10mins" for expansion operation noticed just one or couple > of instances or is it 100% reproducible in this setup ? --> I executed the test 3 times and observed the same failure all 3 times: the 1st time while executing the tier1 test suite, then 2 more times when I explicitly reran only this test to confirm the failure. > > Do we have a setup where we could take a look at it ? because one of the > possible root cause is Ceph commands are stuck, if we have access to the > cluster its something we could check it out . --> No. We have only one cluster, and that cluster is not available now. > > With above questions I am trying to isolate the issue.
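For context, the ocs-ci check that failed here polls the PVC capacity until it matches the requested size (the "Capacity of PVC ... is not 20Gi as expected, but 10Gi. Retrying." messages). A simplified, hypothetical sketch of such a polling check (not the actual ocs-ci implementation; the function name and parameters are assumptions for illustration):

```python
import time

def wait_for_capacity(get_capacity, expected, timeout=240, interval=2,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll get_capacity() until it returns `expected` or `timeout` seconds pass."""
    deadline = clock() + timeout
    while True:
        if get_capacity() == expected:
            return True
        if clock() >= deadline:
            return False  # test would fail here with the "not as expected" message
        sleep(interval)

# Example with a fake PVC whose reported size flips after a few polls,
# mimicking the delayed expansion seen in this bug:
readings = iter(["10Gi"] * 3 + ["20Gi"])
assert wait_for_capacity(lambda: next(readings), "20Gi", sleep=lambda s: None)
```

In a check like this, a resize that Ceph only completes after ~10 minutes exhausts the timeout and the test reports a failure even though the expansion eventually succeeds.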
(In reply to Humble Chirammal from comment #2) > > > Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There are resize > failure events before the expansion succeeded. The issue was seen in first > attempt of expansion. > > > > Can you mention the failure error messages noticed in this testing ? 18:33:06 - MainThread - ocs_ci.ocs.resources.pvc - INFO - Capacity of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c is not 20Gi as expected, but 10Gi. Retrying. 18:33:07 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.5 will be used. From csi-cephfsplugin-provisioner-5f8b66cc96-zqj9f pod csi-resizer container log, there are resize failure events before the expansion succeeded. PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c expansion from 10Gi to 20Gi took 10 minutes which is much more than the expected time. ``` 2020-10-05T06:04:40.709925508Z I1005 06:04:40.709515 1 controller.go:190] Starting external resizer openshift-storage.cephfs.csi.ceph.com 2020-10-05T18:29:06.671879079Z I1005 18:29:06.671701 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10094283", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:31:36.675920623Z E1005 18:31:36.675628 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded 2020-10-05T18:31:36.675920623Z I1005 18:31:36.675772 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", 
UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10094283", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded 2020-10-05T18:31:37.683786113Z I1005 18:31:37.683698 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095459", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:31:37.68764307Z E1005 18:31:37.687559 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:37.68766698Z I1005 18:31:37.687633 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095459", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:39.696459133Z I1005 18:31:39.696342 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", 
ResourceVersion:"10095472", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:31:39.699381746Z E1005 18:31:39.699297 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:39.699401526Z I1005 18:31:39.699379 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095472", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:43.707458622Z I1005 18:31:43.707351 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095514", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:31:43.710924792Z E1005 18:31:43.710892 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:43.711068549Z I1005 18:31:43.711035 1 
event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095514", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:51.719531115Z I1005 18:31:51.719466 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095578", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:31:51.723229466Z E1005 18:31:51.723071 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:31:51.723229466Z I1005 18:31:51.723162 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095578", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 
2020-10-05T18:32:07.732017512Z I1005 18:32:07.731932 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095743", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:32:07.735405106Z E1005 18:32:07.735356 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:32:07.735479977Z I1005 18:32:07.735449 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095743", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:32:23.008425723Z I1005 18:32:23.008326 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095885", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:32:23.013612609Z E1005 18:32:23.013545 1 controller.go:361] Resize volume 
"pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:32:23.013651018Z I1005 18:32:23.013610 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095885", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:32:39.744212506Z I1005 18:32:39.744101 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10096016", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:32:39.748212052Z E1005 18:32:39.748137 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:32:39.748234731Z I1005 18:32:39.748217 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", 
ResourceVersion:"10096016", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:34:47.756282436Z I1005 18:34:47.756203 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:34:47.760025495Z E1005 18:34:47.759963 1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:34:47.760064589Z I1005 18:34:47.760029 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists 2020-10-05T18:39:03.768056288Z I1005 18:39:03.767964 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", 
UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10099185", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe 2020-10-05T18:39:23.091442803Z I1005 18:39:23.091336 1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10099185", FieldPath:""}): type: 'Normal' reason: 'VolumeResizeSuccessful' Resize volume succeeded ```
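The retry timestamps in the log above (18:31:37, 18:31:39, 18:31:43, 18:31:51, 18:32:07, 18:32:39, 18:34:47, 18:39:03) show intervals roughly doubling, i.e. the external-resizer's exponential backoff. A minimal sketch of such a backoff (the initial delay, factor, and cap here are assumptions, not the resizer's actual workqueue parameters) shows why a single Ceph call stuck early on stretches the overall expansion to around 10 minutes:

```python
from itertools import accumulate

def backoff_delays(initial=1.0, factor=2.0, cap=300.0, attempts=10):
    """Successive retry delays in seconds: initial, initial*factor, ..., capped."""
    delays, delay = [], initial
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= factor
    return delays

delays = backoff_delays()
elapsed = list(accumulate(delays))  # total wait accumulated before each later attempt
# With these (assumed) parameters, the first 9 retry gaps alone add up to
# 1+2+4+8+16+32+64+128+256 = 511 s, i.e. over 8 minutes. Once one RPC hits
# DeadlineExceeded and the volume lock stays held ("operation ... already
# exists"), the backoff schedule dominates the time to the eventual success.
```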
(In reply to Humble Chirammal from comment #15) > Without luck, spent some time looking at the Ceph Side logs ( mgr,fs dump, > distribution status, crashes [1]. progress..etc) to figure out any > suspicious state of the cluster. > > [1] Last crashes are reported at Sep 25th which is old. --> During the first tier1 suite run, on 25th Sept, the cluster crashed; however, the cluster did not crash after this test, so the test was run on a healthy cluster. When I reran only this test (retried 2 times), the cluster was in a healthy state before and after the test. > > Suchita, appreciated if you can share more details requested in earlier > comments or if we could try to reproduce this again.
(In reply to Humble Chirammal from comment #14) > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > that, this cluster is not healthy and leading to the root cause of this > issue. Before this test cluster Health is OK. From Console Log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13069/consoleFull 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.
(In reply to suchita from comment #20) > (In reply to Humble Chirammal from comment #14) > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > > that, this cluster is not healthy and leading to the root cause of this > > issue. > > Before this test cluster Health is OK. > From Console Log: > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs- > cluster/13069/consoleFull > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is > HEALTH_OK. Sure, that's just the Ceph cluster state. The logs in c#13 refer to issues in this kube/ocp cluster on the API, etcd, etc., reported on the same day. > Do we have a setup where we could take a look at it ? because one of the > possible root cause is Ceph commands are stuck, if we have access to the > cluster its something we could check it out . --> No We just have a one cluster and that cluster is not available now. Unfortunately, I don't think we can make out anything else from the logs beyond what we already analysed and shared in previous comments. We can keep this bugzilla open if we are going to try to reproduce this issue again. Please share your thoughts, and we can proceed accordingly.
(In reply to Humble Chirammal from comment #21) > (In reply to suchita from comment #20) > > (In reply to Humble Chirammal from comment #14) > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > > > that, this cluster is not healthy and leading to the root cause of this > > > issue. > > > > Before this test cluster Health is OK. > > From Console Log: > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs- > > cluster/13069/consoleFull > > > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK > > > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is > > HEALTH_OK. > > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the > issues in this kube/ocp > cluster on API, etcd..etc, reported on the same day. Would cluster state affect the PV resize? > > > Do we have a setup where we could take a look at it ? because one of the > > possible root cause is Ceph commands are stuck, if we have access to the > > cluster its something we could check it out . > --> No We just have a one cluster and that cluster is not available now. > > Unfortunately I don't think we can make out anything else from the logs than > which we already analysed > and shared in previous comments. > > > We can keep this bugzilla open if we are trying to reproduce this issue > again. > Please share your thought , accordingly we can proceed.
(In reply to Sahina Bose from comment #22) > (In reply to Humble Chirammal from comment #21) > > (In reply to suchita from comment #20) > > > (In reply to Humble Chirammal from comment #14) > > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > > > > that, this cluster is not healthy and leading to the root cause of this > > > > issue. > > > > > > Before this test cluster Health is OK. > > > From Console Log: > > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs- > > > cluster/13069/consoleFull > > > > > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK > > > > > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is > > > HEALTH_OK. > > > > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the > > issues in this kube/ocp > > cluster on API, etcd..etc, reported on the same day. > > Would cluster state affect the PV resize? > 2020-10-05T06:00:31.109598333Z E1005 06:00:31.109444 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused 2020-10-05T06:01:11.86034908Z E1005 06:01:11.860243 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout 2020-10-05T06:01:49.458241922Z E1005 06:01:49.458177 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get 
https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:56.441435721Z E1005 06:01:56.441330 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:05.548077331Z E1005 06:02:05.548019 1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443
........

The `external-resizer`, which is responsible for resizing the PVC, failed to acquire its leader election lease because of these errors. To be clear, though, these errors were reported on the same day, but not exactly at the time of the issue. The traceback below is the only pointer we have in the logs that matches the issue timestamp and the exact instance of the issue. On the CSI side, it clearly shows a failure to get a proper response for the command CSI executed against the Ceph cluster. This raises the suspicion that the Ceph cluster was not serving requests properly, or that Ceph cluster health was not good.
--snip-- from c#9
2020-10-05T18:36:21.739384142Z E1005 18:36:21.739227 1 volume.go:82] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 failed to get the rootpath for the vol csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 (an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in get_fs_handle
2020-10-05T18:36:21.739384142Z conn.connect()
2020-10-05T18:36:21.739384142Z File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 85, in connect
2020-10-05T18:36:21.739384142Z self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
2020-10-05T18:36:21.739384142Z File "cephfs.pyx", line 694, in cephfs.LibCephFS.mount
2020-10-05T18:36:21.739384142Z cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]
2020-10-05T18:36:21.739384142Z ..
--/snip--

--snip-- from c#11
The above stack trace closely resembles what was tracked in this upstream Ceph tracker: https://tracker.ceph.com/issues/46607
The fix upstream was provided using the tracker: https://tracker.ceph.com/issues/46464
The fix in the above trackers was to handle EDQUOT error returns; however, in this case the errno seems to be ETIMEDOUT (110), which is not handled. This needs a bug against CephFS to handle the same. Further, the error should not have caused the 10 minute delay. CSI should have recovered and tried again. The error itself, being a timed-out error, looks to be an issue where Ceph health was not ready to serve requests.
--snip--
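The trackers above describe the underlying pattern: the mgr volumes module translated only a known set of errnos from `cephfs.Error` into clean error returns, so an unexpected errno (here ETIMEDOUT) escaped as a raw Python traceback and the caller saw a generic EINVAL. This is not the actual ceph-mgr code, just an illustrative sketch of that error-translation pattern; the class, function names, and the handled-errno set are all assumptions:

```python
import errno

class CephfsError(Exception):
    """Stand-in for cephfs.Error; the real binding carries an errno similarly."""
    def __init__(self, eno, msg):
        super().__init__(msg)
        self.errno = eno

# Errnos the handler translated before the fix (illustrative set only, not
# the actual list from the ceph-mgr source); anything else leaked a traceback.
HANDLED = {errno.EDQUOT, errno.EPERM, errno.EINVAL}

def connect_before_fix(mount):
    try:
        mount()
        return 0, "", ""
    except CephfsError as e:
        if e.errno in HANDLED:
            return -e.errno, "", str(e)
        raise  # e.g. ETIMEDOUT (110): escapes as the traceback quoted above

def connect_after_fix(mount):
    # The fix amounts to translating *any* errno into a clean error return.
    try:
        mount()
        return 0, "", ""
    except CephfsError as e:
        return -e.errno, "", str(e)

def timeout_mount():
    raise CephfsError(errno.ETIMEDOUT,
                      "error calling ceph_mount: Connection timed out")
```

With the fixed variant, a mount timeout comes back as a plain negative errno that the caller can report and retry, instead of a stack trace wrapped in EINVAL.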
This fix for https://tracker.ceph.com/issues/46464 also handles missing errnos. With this fix, it no longer crashes. So we don't need a separate fix for ETIMEDOUT in my opinion. Please check https://tracker.ceph.com/issues/47798#note-4
(In reply to Humble Chirammal from comment #21) > (In reply to suchita from comment #20) > > (In reply to Humble Chirammal from comment #14) > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > > > that, this cluster is not healthy and leading to the root cause of this > > > issue. > > > > Before this test cluster Health is OK. > > From Console Log: > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs- > > cluster/13069/consoleFull > > > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK > > > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is > > HEALTH_OK. > > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the > issues in this kube/ocp > cluster on API, etcd..etc, reported on the same day. > > > Do we have a setup where we could take a look at it ? because one of the > > possible root cause is Ceph commands are stuck, if we have access to the > > cluster its something we could check it out . > --> No We just have a one cluster and that cluster is not available now. > > Unfortunately I don't think we can make out anything else from the logs than > which we already analysed > and shared in previous comments. > > > We can keep this bugzilla open if we are trying to reproduce this issue > again. > Please share your thought , accordingly we can proceed. I have created the fresh cluster and this issue is not reproduced.
(In reply to suchita from comment #25) > (In reply to Humble Chirammal from comment #21) > > (In reply to suchita from comment #20) > > > (In reply to Humble Chirammal from comment #14) > > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me > > > > that, this cluster is not healthy and leading to the root cause of this > > > > issue. > > > > > > Before this test cluster Health is OK. > > > From Console Log: > > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs- > > > cluster/13069/consoleFull > > > > > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK > > > > > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is > > > HEALTH_OK. > > > > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the > > issues in this kube/ocp > > cluster on API, etcd..etc, reported on the same day. > > > > > Do we have a setup where we could take a look at it ? because one of the > > > possible root cause is Ceph commands are stuck, if we have access to the > > > cluster its something we could check it out . > > --> No We just have a one cluster and that cluster is not available now. > > > > Unfortunately I don't think we can make out anything else from the logs than > > which we already analysed > > and shared in previous comments. > > > > > > We can keep this bugzilla open if we are trying to reproduce this issue > > again. > > Please share your thought , accordingly we can proceed. > I have created the fresh cluster and this issue is not reproduced. I have created another cluster and the issue is reproduced.
The cluster is now available.
Console logs: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13549/consoleFull
Test execution logs:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv-tp/sgatfane-rhv-tp_20200925T115616/logs/ocs-ci-logs-1602764201/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv-tp/sgatfane-rhv-tp_20200925T115616/logs/failed_testcase_ocs_logs_1602764201/
On 19 October, when I changed the log level to 5 on the same cluster and executed the test, I realized that all operator, provisioner, and plugin pods were missing from the OCS cluster, which looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1884318. I didn't check for this before changing the log level; all pods were running when I wrote comment#26. I used oc edit cm/rook-ceph-operator-config -n openshift-storage to update the data section with CSI_LOG_LEVEL: "5". @Madhu Rajanna, please let me know if this is the correct way to change the log level in an existing cluster. After that, I uninstalled and reinstalled OCS. Meanwhile, the cluster is down because the storage domain attached to the cluster went into an inactive state. We have only one cluster, and I will be able to rerun this test with log level 5 once the cluster issue is resolved.
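For reference, a non-interactive equivalent of the oc edit step described above is a ConfigMap patch; this is a sketch of the same change (the key name and namespace are taken from the comment above, not independently verified against this OCS version):

```shell
# Raise the CSI log level to 5 via the rook operator ConfigMap;
# the operator propagates this to the CSI provisioner/plugin pods.
oc -n openshift-storage patch configmap rook-ceph-operator-config \
  --type merge -p '{"data":{"CSI_LOG_LEVEL":"5"}}'
```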
Do we have new logs now?
(In reply to Yaniv Kaul from comment #30) > Do we have new logs now? Ping? This is in the top 10 bugs, yet there has been no response for 2 weeks.
Suchita, could you please provide the requested information from comment #30? Thanks.
We have been awaiting this data for a long time; this shouldn't be a blocker.
Closing, please reopen when we can get the information requested.
The scenario was not reproduced later, hence I am not able to provide the required logs from the failure. So it is okay to close this bug.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days