Bug 1886365 - [RHV] PVC expansion of already expanded PVC taking more time than expected
Summary: [RHV] PVC expansion of already expanded PVC taking more time than expected
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat Storage
Component: csi-driver
Version: 4.5
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Humble Chirammal
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-08 09:45 UTC by suchita
Modified: 2023-09-15 00:49 UTC
CC: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-01-06 14:23:43 UTC
Embargoed:



Description suchita 2020-10-08 09:45:43 UTC
Description of problem:

PVC expansion is taking more time than expected.

This issue was observed in a Tier1 test run for RHV Tech Preview qualification.
The test verifies expansion of an already expanded PVC.

Issue: PVC expansion from 10Gi to 20Gi took 10 minutes. There were resize failure events before the expansion succeeded. The issue was seen on the first attempt of expansion.


18:33:06 - MainThread - ocs_ci.ocs.resources.pvc - INFO - Capacity of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c is not 20Gi as expected, but 10Gi. Retrying.
18:33:07 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.5 will be used.

From the csi-resizer container log of the csi-cephfsplugin-provisioner-5f8b66cc96-zqj9f pod, there are resize failure events before the expansion succeeded. Expansion of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c from 10Gi to 20Gi took 10 minutes, which is much longer than expected.


Version-Release number of selected component (if applicable):
Ceph Version 	14.2.8-91.el8cp (75b4845da7d469665bd48d1a49badcc3677bf5cd) nautilus (stable)
Cluster Version 	4.5.9
OCS operator 	v4.5.0
Test Run Name 	sgatfane-OCS4-5-Downstream--LSO-MON-HOSTPATH-OSD-HDD-RHV-IPI-1AZ-RHCOS-3M-3W-tier4
ceph 	rhceph-4-rhel8@sha256:eafd1acb0ada5d7cf93699056118aca19ed7a22e4938411d307ef94048746cc8
cephfsplugin 	132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
noobaa-operator 	mcg-rhel8-operator@sha256:2cc0c9247ea1f1df0d3c930216d8c4880485033330beddb922c45132ff795169
noobaa_cor 	mcg-core-rhel8@sha256:71757f0e1ac6b4092ff2ceb0e642424e6f99b7824f352d72846ecf88f45a5fe2
noobaa_db 	mongodb-36-rhel7@sha256:ba74027bb4b244df0b0823ee29aa927d729da33edaa20ebdf51a2430cc6b4e95
rbdplugin 	132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
rook-ceph-operator 	rook-ceph-rhel8-operator@sha256:95b31d3e0bebf4f27b42f25fdde1fd06098ccfc833f20e3b9fe2ca7fb11bc104
rook_ceph 	rook-ceph-rhel8-operator@sha256:95b31d3e0bebf4f27b42f25fdde1fd06098ccfc833f20e3b9fe2ca7fb11bc104
rook_csi_attacher 	ose-csi-external-attacher@sha256:db168e03ea39c90691497b4755e231f7ac9bc1a767ed776c02165e2cf509f0b2
rook_csi_ceph 	cephcsi-rhel8@sha256:132d85ceeee09136f1f2048bd5f85cbb53e6b697a9f50557d05bb63959386bcd
rook_csi_provisioner 	ose-csi-external-provisioner-rhel7@sha256:69798ba68baf4d14a755a5f581e54a4c18ca3022d8a7e1e10f651ba2f8a0a597
rook_csi_registrar 	ose-csi-driver-registrar@sha256:7a814c6d75c1dd7eb67977cacca379184b54750151dcd9ff698d7852263fbc9f
rook_csi_resizer 	ose-csi-external-resizer-rhel7@sha256:279a12fb2095c7c7f7429135317c53a3f821d6a5a7b89b2f49fc4f84d5cfba42


How reproducible:
3/3

Steps to Reproduce:

The test case is automated:
tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expansion
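
For reference, a minimal manual equivalent of what the automated case exercises (a hedged sketch using the Kubernetes Python client, not the ocs-ci implementation; namespace, PVC name, and sizes below are placeholders): patch the PVC's storage request and time how long the reported capacity takes to reach the new size.

```python
# Hedged sketch of the manual reproduction flow; names/sizes are placeholders.
import time
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

ns, pvc_name, new_size = "namespace-test", "pvc-test", "20Gi"

# Request the expansion (the StorageClass must have allowVolumeExpansion: true).
core.patch_namespaced_persistent_volume_claim(
    name=pvc_name,
    namespace=ns,
    body={"spec": {"resources": {"requests": {"storage": new_size}}}},
)

# Poll until .status.capacity reflects the new size, recording how long it takes.
start = time.time()
while True:
    pvc = core.read_namespaced_persistent_volume_claim(pvc_name, ns)
    capacity = (pvc.status.capacity or {}).get("storage")
    if capacity == new_size:
        break
    time.sleep(5)
print(f"capacity reached {new_size} after {time.time() - start:.0f}s")
```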


Actual results:
Resize failure events occurred before the expansion eventually succeeded; the expansion took about 10 minutes.

Expected results:
Expansion should not take this long. The resize attempt should succeed without failures.

Additional info:

Platform: RHV
Storage used:
     etcd on SSD storage; OS disks are thin provisioned.
     OSDs on an HDD storage domain, 500 GiB disk on each worker node, disk allocation policy: preallocated.
  
Test execution console log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13069/console

TestLogs: https://drive.google.com/drive/folders/1Ozoky1mX93IwYTfg8pVuEyG_xM8lrHbv?usp=sharing

Comment 2 Humble Chirammal 2020-10-08 09:52:22 UTC
>
Issue: PVC  expansion from 10Gi to 20Gi took 10 minutes. There are resize failure events before the expansion succeeded. The issue was seen in first attempt of expansion. 
>

Can you share the failure error messages noticed in this testing?

Comment 4 Humble Chirammal 2020-10-08 11:38:03 UTC
(In reply to Humble Chirammal from comment #2)
> >
> Issue: PVC  expansion from 10Gi to 20Gi took 10 minutes. There are resize
> failure events before the expansion succeeded. The issue was seen in first
> attempt of expansion. 
> >
> 
> Can you mention the failure error messages noticed in this testing ?


While we continue our analysis, I would like to understand: is there a specific reason to test this on an OCS 4.5 cluster? If possible, can we try this on an OCS 4.6 setup?

Comment 5 Humble Chirammal 2020-10-08 11:49:18 UTC
(In reply to Humble Chirammal from comment #4)
> (In reply to Humble Chirammal from comment #2)
> > >
> > Issue: PVC  expansion from 10Gi to 20Gi took 10 minutes. There are resize
> > failure events before the expansion succeeded. The issue was seen in first
> > attempt of expansion. 
> > >
> > 
> > Can you mention the failure error messages noticed in this testing ?
> 
> 
> While we continue our analysis,I would like to understand, is there a
> specific reason to test this on OCS 4.5 cluster? if possible, can we try
> this on OCS 4.6 setup ?

I also have a question on whether this cluster is healthy. Can you check the Ceph cluster status while you face this issue?
The reason I am asking is that, since the expansion is happening on a CephFS PVC, in the backend it is a very minimal "quota" adjustment which should not cause any delay; in other words, it should return pretty quickly. From the logs, it looks like an ongoing operation on this PVC is holding the lock for a long time:

2020-10-05T18:34:47.760064589Z I1005 18:34:47.760029       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
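
(For illustration only: a minimal sketch, not the actual ceph-csi Go code, of the per-volume in-flight lock pattern that produces the "an operation with the given Volume ID ... already exists" Aborted error above. While the first expand call is still blocked on a slow backend command, every retried call for the same volume ID is rejected instead of queueing behind it.)

```python
# Hedged sketch of a per-volume operation lock; names are illustrative.
import threading

class VolumeLocks:
    """Tracks volume IDs that already have an operation in flight."""
    def __init__(self):
        self._mutex = threading.Lock()
        self._inflight = set()

    def try_acquire(self, volume_id):
        with self._mutex:
            if volume_id in self._inflight:
                return False                 # caller must return Aborted
            self._inflight.add(volume_id)
            return True

    def release(self, volume_id):
        with self._mutex:
            self._inflight.discard(volume_id)

locks = VolumeLocks()

def expand_volume(volume_id, new_size_bytes):
    if not locks.try_acquire(volume_id):
        # Mirrors: rpc error: code = Aborted desc = an operation with the
        # given Volume ID <id> already exists
        raise RuntimeError(
            f"an operation with the given Volume ID {volume_id} already exists")
    try:
        # The actual backend work (a CephFS subvolume quota adjustment) would
        # happen here; if it hangs, retries keep hitting the branch above.
        pass
    finally:
        locks.release(volume_id)
```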

Comment 6 Elad 2020-10-08 12:27:36 UTC
(In reply to Humble Chirammal from comment #4)

> While we continue our analysis,I would like to understand, is there a
> specific reason to test this on OCS 4.5 cluster? if possible, can we try
> this on OCS 4.6 setup ?

We test on OCS 4.5 since RHV is intended to be supported starting with 4.5.z.

Comment 7 Elad 2020-10-08 12:48:43 UTC
Removing the Regression keyword as this is seen with RHV, which is not yet supported.

Comment 8 Humble Chirammal 2020-10-08 13:43:23 UTC
---------------------------


Ceph Health:

  cluster:
    id:     527fc23c-f5ab-43d6-85d9-157474a2453b
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 4h)
    mgr: a(active, since 9d)
    mds: ocs-storagecluster-cephfilesystem:1 {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
    osd: 3 osds: 3 up (since 4h), 3 in (since 10d)
    rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a, ocs.storagecluster.cephobjectstore.b)
 
  task status:
    scrub status:
        mds.ocs-storagecluster-cephfilesystem-a: idle
        mds.ocs-storagecluster-cephfilesystem-b: idle
 
  data:
    pools:   10 pools, 176 pgs
    objects: 21.55k objects, 75 GiB
    usage:   196 GiB used, 1.3 TiB / 1.5 TiB avail
    pgs:     176 active+clean
 
  io:
    client:   853 B/s rd, 21 KiB/s wr, 1 op/s rd, 2 op/s wr
 
--------------

RAW STORAGE:
    CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED 
    hdd       1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07 
    TOTAL     1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07 
 
POOLS:
    POOL                                                      ID     STORED      OBJECTS     USED        %USED     MAX AVAIL 
    ocs-storagecluster-cephblockpool                           1      59 GiB      19.82k     181 GiB     14.33       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.control             2         0 B           8         0 B         0       360 GiB 
    ocs-storagecluster-cephfilesystem-metadata                 3     117 MiB         350     280 MiB      0.03       360 GiB 
    ocs-storagecluster-cephfilesystem-data0                    4     4.0 GiB       1.12k      12 GiB      1.10       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.meta                5     1.7 KiB           7     1.1 MiB         0       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.log                 6     148 KiB         210     6.5 MiB         0       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.index       7       642 B          11       642 B         0       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      8         0 B           0         0 B         0       360 GiB 
    .rgw.root                                                  9     4.7 KiB          16     2.8 MiB         0       360 GiB 
    ocs-storagecluster-cephobjectstore.rgw.buckets.data       10     1.0 KiB           2     384 KiB         0       360 GiB 


-------------------

There are 'crashes' in Ceph:

1 crashes recorded
1 older than 1 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
1 older than 3 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
1 older than 7 days old:
2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804


Also the CSI subvolume group we use for PVCs has been marked for "Deletion" 

[
    {
        "name": "_deleting"            ------> this is marked for "deletion" 
    },
    {
        "name": "csi"
    }
]

-------------------------------

Suchita, 

Are these logs captured right after, or close to, the issue timestamp?

Is this delay of ~10 minutes for the expansion operation noticed in just one or a couple of instances, or is it 100% reproducible in this setup?

Do we have a setup where we could take a look at it? One of the possible root causes is that Ceph commands are stuck; if we have access to the cluster, that is something we could check.

With the above questions I am trying to isolate the issue.

Comment 9 Humble Chirammal 2020-10-08 14:18:14 UTC
--snip--


2020-10-05T18:31:37.687426159Z E1005 18:31:37.687306       1 controllerserver.go:340] ID: 14288 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:37.687426159Z E1005 18:31:37.687368       1 utils.go:163] ID: 14288 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.699191935Z E1005 18:31:39.699104       1 controllerserver.go:340] ID: 14289 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.699191935Z E1005 18:31:39.699161       1 utils.go:163] ID: 14289 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.710797075Z E1005 18:31:43.710648       1 controllerserver.go:340] ID: 14290 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.710797075Z E1005 18:31:43.710690       1 utils.go:163] ID: 14290 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.723209745Z E1005 18:31:51.722840       1 controllerserver.go:340] ID: 14291 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.723209745Z E1005 18:31:51.722913       1 utils.go:163] ID: 14291 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.735194864Z E1005 18:32:07.735128       1 controllerserver.go:340] ID: 14292 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.735194864Z E1005 18:32:07.735171       1 utils.go:163] ID: 14292 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.013445743Z E1005 18:32:23.013369       1 controllerserver.go:340] ID: 14293 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.013477263Z E1005 18:32:23.013433       1 utils.go:163] ID: 14293 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.748010186Z E1005 18:32:39.747957       1 controllerserver.go:340] ID: 14295 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.748010186Z E1005 18:32:39.747997       1 utils.go:163] ID: 14295 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.759815737Z E1005 18:34:47.759730       1 controllerserver.go:340] ID: 14298 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.759815737Z E1005 18:34:47.759772       1 utils.go:163] ID: 14298 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:36:21.739384142Z E1005 18:36:21.739227       1 volume.go:82] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 failed to get the rootpath for the vol csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7(an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in get_fs_handle
2020-10-05T18:36:21.739384142Z     conn.connect()
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 85, in connect
2020-10-05T18:36:21.739384142Z     self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
2020-10-05T18:36:21.739384142Z   File "cephfs.pyx", line 694, in cephfs.LibCephFS.mount
2020-10-05T18:36:21.739384142Z cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]
2020-10-05T18:36:21.739384142Z 
2020-10-05T18:36:21.739384142Z During handling of the above exception, another exception occurred:
2020-10-05T18:36:21.739384142Z 
2020-10-05T18:36:21.739384142Z Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/mgr_module.py", line 974, in _handle_command
2020-10-05T18:36:21.739384142Z     return self.handle_command(inbuf, cmd)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/module.py", line 270, in handle_command
2020-10-05T18:36:21.739384142Z     return handler(inbuf, cmd)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/module.py", line 338, in _cmd_fs_subvolume_getpath
2020-10-05T18:36:21.739384142Z     group_name=cmd.get('group_name', None))
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 213, in subvolume_getpath
2020-10-05T18:36:21.739384142Z     with open_volume(self, volname) as fs_handle:
2020-10-05T18:36:21.739384142Z   File "/lib64/python3.6/contextlib.py", line 81, in __enter__
2020-10-05T18:36:21.739384142Z     return next(self.gen)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 283, in open_volume
2020-10-05T18:36:21.739384142Z     fs_handle = vc.connection_pool.get_fs_handle(volname)
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in get_fs_handle
2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0], e.args[1])
2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
2020-10-05T18:36:21.739384142Z 
2020-10-05T18:36:21.739384142Z E1005 18:36:21.739310       1 controllerserver.go:354] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 validation and extraction of volume options failed: an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]
2020-10-05T18:36:21.73945746Z E1005 18:36:21.739377       1 utils.go:163] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 GRPC error: rpc error: code = InvalidArgument desc = an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]

---snip-------

Comment 11 Shyamsundar 2020-10-08 15:25:44 UTC
(In reply to Humble Chirammal from comment #9)
> occurred while running ceph args: [fs subvolume getpath
> ocs-storagecluster-cephfilesystem
> csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m
> 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c
> /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner
> --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent
> call last):
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in
> get_fs_handle
> 2020-10-05T18:36:21.739384142Z     conn.connect()
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 85, in connect
> 2020-10-05T18:36:21.739384142Z    
> self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
> 2020-10-05T18:36:21.739384142Z   File "cephfs.pyx", line 694, in
> cephfs.LibCephFS.mount
> 2020-10-05T18:36:21.739384142Z cephfs.Error: error calling ceph_mount:
> Connection timed out [Errno 110]
> 2020-10-05T18:36:21.739384142Z 
> 2020-10-05T18:36:21.739384142Z During handling of the above exception,
> another exception occurred:
> 2020-10-05T18:36:21.739384142Z 
> 2020-10-05T18:36:21.739384142Z Traceback (most recent call last):
> 2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/mgr_module.py",
> line 974, in _handle_command
> 2020-10-05T18:36:21.739384142Z     return self.handle_command(inbuf, cmd)
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/module.py", line 270, in handle_command
> 2020-10-05T18:36:21.739384142Z     return handler(inbuf, cmd)
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/module.py", line 338, in
> _cmd_fs_subvolume_getpath
> 2020-10-05T18:36:21.739384142Z     group_name=cmd.get('group_name', None))
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/fs/volume.py", line 213, in subvolume_getpath
> 2020-10-05T18:36:21.739384142Z     with open_volume(self, volname) as
> fs_handle:
> 2020-10-05T18:36:21.739384142Z   File "/lib64/python3.6/contextlib.py", line
> 81, in __enter__
> 2020-10-05T18:36:21.739384142Z     return next(self.gen)
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 283, in
> open_volume
> 2020-10-05T18:36:21.739384142Z     fs_handle =
> vc.connection_pool.get_fs_handle(volname)
> 2020-10-05T18:36:21.739384142Z   File
> "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in
> get_fs_handle
> 2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0],
> e.args[1])
> 2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
> 2020-10-05T18:36:21.739384142Z 
> 2020-10-05T18:36:21.739384142Z E1005 18:36:21.739310       1
> ---snip-------

The above stack trace closely resembles what was tracked in this upstream Ceph tracker: https://tracker.ceph.com/issues/46607

The fix upstream was provided using the tracker: https://tracker.ceph.com/issues/46464

The fix in the above trackers was to handle EDQUOT error returns; however, in this case the errno seems to be ETIMEDOUT (110), which is not handled. This needs a bug against CephFS to handle the same.

Further, the error should not have caused the 10-minute delay; CSI should have recovered and tried again. The error itself being a timeout, this looks like an issue where Ceph was not ready to serve requests.
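
(Illustration only: a simplified Python sketch, not the actual mgr/volumes code, of why an error raised with a plain message string instead of an (errno, message) pair turns into the "bad operand type for unary -: 'str'" TypeError seen in the traceback, masking the underlying ETIMEDOUT.)

```python
# Hedged sketch: the re-raise fails when args[0] is a string, not an errno.
class VolumeException(Exception):
    def __init__(self, err, msg):
        self.errno, self.msg = err, msg

def reraise(e):
    # Mirrors mgr/volumes/fs/operations/volume.py:
    #     raise VolumeException(-e.args[0], e.args[1])
    raise VolumeException(-e.args[0], e.args[1])

# Works when the exception carries an (errno, message) pair:
try:
    reraise(OSError(110, "Connection timed out"))
except VolumeException as ve:
    print(ve.errno, ve.msg)              # -110 Connection timed out

# Fails when the exception carries only a message string, as in the traceback:
try:
    reraise(Exception("error calling ceph_mount: Connection timed out [Errno 110]"))
except TypeError as te:
    print(te)                            # bad operand type for unary -: 'str'
```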

(In reply to Humble Chirammal from comment #8)
> ---------------------------
> Also the CSI subvolume group we use for PVCs has been marked for "Deletion" 
> 
> [
>     {
>         "name": "_deleting"            ------> this is marked for "deletion" 
>     },
>     {
>         "name": "csi"
>     }
> ]

The above lists 2 subvolume groups, "_deleting" and "csi". It does not mean the csi subvolume group is marked for deletion; the "_deleting" subvolume group is the location used for asynchronous deletion of subvolumes.

Comment 12 Shyamsundar 2020-10-08 15:40:09 UTC
(In reply to Shyamsundar from comment #11)
> The fix in the above trackers was to handle EDQUOT error returns, however in
> this case the errno seems to be ETIMEDOUT (110) which is not handled. This
> needs a bug against CephFS to handle the same.

Filed an upstream tracker on this: https://tracker.ceph.com/issues/47798

The issue possibly only impacts Nautilus builds upstream.

Comment 13 Humble Chirammal 2020-10-08 15:41:27 UTC
(In reply to Shyamsundar from comment #11)
> (In reply to Humble Chirammal from comment #9)
> > occurred while running ceph args: [fs subvolume getpath
> > ocs-storagecluster-cephfilesystem
..........
> > 2020-10-05T18:36:21.739384142Z   File
> > "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 164, in
> > get_fs_handle
> > 2020-10-05T18:36:21.739384142Z     raise VolumeException(-e.args[0],
> > e.args[1])
> > 2020-10-05T18:36:21.739384142Z TypeError: bad operand type for unary -: 'str'
> > 2020-10-05T18:36:21.739384142Z 
> > 2020-10-05T18:36:21.739384142Z E1005 18:36:21.739310       1
> > ---snip-------
> 
> The above stack trace closely resembles what was tracked in this upstream
> Ceph tracker: https://tracker.ceph.com/issues/46607
> 
> The fix upstream was provided using the tracker:
> https://tracker.ceph.com/issues/46464
> 
> The fix in the above trackers was to handle EDQUOT error returns, however in
> this case the errno seems to be ETIMEDOUT (110) which is not handled. This
> needs a bug against CephFS to handle the same.

True, let me know if you want me to open a bug on this.

> 
> Further, the error should not have caused the 10 minute delay. CSI should
> have recovered and tried again. The error itself being a timed out error,
> looks to be an issue where Ceph health was not ready to serve requests.


Yeah, but the must-gather output captured and shared here indicates that the Ceph health is good.
I looked at the available logs (mgr, etc.) but couldn't find any cluster health issues, at least in the logs.

However, there were many issues in this setup on the same day, which I captured from the logs.
I didn't paste them in this bugzilla earlier as they were from before this incident; here they are:



 2020-10-05T06:00:31.109598333Z E1005 06:00:31.109444       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:01:11.86034908Z E1005 06:01:11.860243       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:49.458241922Z E1005 06:01:49.458177       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:56.441435721Z E1005 06:01:56.441330       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:05.548077331Z E1005 06:02:05.548019       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:15.787643537Z E1005 06:02:15.787565       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: no route to host
2020-10-05T06:02:31.146695203Z E1005 06:02:31.146627       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: no route to host
2020-10-05T06:02:40.931265643Z E1005 06:02:40.931158       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:51.19005191Z E1005 06:02:51.189984       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:59.66866016Z E1005 06:02:59.668590       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:03:34.724154606Z E1005 06:03:34.724052       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:03:45.580289453Z E1005 06:03:45.580155       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:03:55.722203879Z E1005 06:03:55.722131       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:04:31.022417917Z E1005 06:04:31.022326       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out


Also on 04th Oct:

2020-10-04T12:36:07.940091165Z E1004 12:36:07.939985       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-04T15:41:09.747847649Z E1004 15:41:09.747679       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:30.749830017Z E1004 15:41:30.749728       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:44.733828797Z E1004 15:41:44.733783       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:41:58.745777091Z E1004 15:41:58.745725       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:19.740076877Z E1004 15:42:19.740025       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:40.743889502Z E1004 15:42:40.743810       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: etcdserver: request timed out
2020-10-04T15:42:54.890380587Z E1004 15:42:54.890215       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: http2: server sent GOAWAY and closed the connection; LastStreamID=2787, ErrCode=NO_ERROR, debug=""
2020-10-04T15:43:30.69010146Z E1004 15:43:30.690002       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout

> 
> The above is listing 2 subvolumegroups, "_deleting" and "csi". This does not
> reflect the status of the csi subvolumegroup as "deleting". The "_deleting"
> subvolume group is the location of the async deletion of subvolumes.

True.. I later noticed it was the "_deleting" subvolume group.

Comment 14 Humble Chirammal 2020-10-08 15:46:09 UTC
Suchita, can you respond to the questions raised in c#8? Overall, it looks to me that this cluster is not healthy, and that is leading to the root cause of this issue.

Comment 15 Humble Chirammal 2020-10-09 12:22:54 UTC
Without luck, I spent some time looking at the Ceph-side logs (mgr, fs dump, distribution status, crashes [1], progress, etc.) to figure out whether the cluster was in any suspicious state.

[1] The last crashes were reported on Sep 25th, which is old.


Suchita, it would be appreciated if you could share the details requested in earlier comments, or if we could try to reproduce this again.

Comment 16 suchita 2020-10-09 14:13:22 UTC
(In reply to Humble Chirammal from comment #4)
> (In reply to Humble Chirammal from comment #2)
> > >
> > Issue: PVC  expansion from 10Gi to 20Gi took 10 minutes. There are resize
> > failure events before the expansion succeeded. The issue was seen in first
> > attempt of expansion. 
> > >
> > 
> > Can you mention the failure error messages noticed in this testing ?
> 
> 
> While we continue our analysis,I would like to understand, is there a
> specific reason to test this on OCS 4.5 cluster? if possible, can we try
> this on OCS 4.6 setup ?

We are qualifying OCS 4.5 for RHV Tech Preview, so we are testing on the same.

Comment 17 suchita 2020-10-12 12:53:18 UTC
(In reply to Humble Chirammal from comment #8)
> ---------------------------
> 
> 
> Ceph Health:
> 
>   cluster:
>     id:     527fc23c-f5ab-43d6-85d9-157474a2453b
>     health: HEALTH_OK
>  
>   services:
>     mon: 3 daemons, quorum a,b,c (age 4h)
>     mgr: a(active, since 9d)
>     mds: ocs-storagecluster-cephfilesystem:1
> {0=ocs-storagecluster-cephfilesystem-a=up:active} 1 up:standby-replay
>     osd: 3 osds: 3 up (since 4h), 3 in (since 10d)
>     rgw: 2 daemons active (ocs.storagecluster.cephobjectstore.a,
> ocs.storagecluster.cephobjectstore.b)
>  
>   task status:
>     scrub status:
>         mds.ocs-storagecluster-cephfilesystem-a: idle
>         mds.ocs-storagecluster-cephfilesystem-b: idle
>  
>   data:
>     pools:   10 pools, 176 pgs
>     objects: 21.55k objects, 75 GiB
>     usage:   196 GiB used, 1.3 TiB / 1.5 TiB avail
>     pgs:     176 active+clean
>  
>   io:
>     client:   853 B/s rd, 21 KiB/s wr, 1 op/s rd, 2 op/s wr
>  
> --------------
> 
> RAW STORAGE:
>     CLASS     SIZE        AVAIL       USED        RAW USED     %RAW USED 
>     hdd       1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07 
>     TOTAL     1.5 TiB     1.3 TiB     193 GiB      196 GiB         13.07 
>  
> POOLS:
>     POOL                                                      ID     STORED 
> OBJECTS     USED        %USED     MAX AVAIL 
>     ocs-storagecluster-cephblockpool                           1      59 GiB
> 19.82k     181 GiB     14.33       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.control             2         0 B
> 8         0 B         0       360 GiB 
>     ocs-storagecluster-cephfilesystem-metadata                 3     117 MiB
> 350     280 MiB      0.03       360 GiB 
>     ocs-storagecluster-cephfilesystem-data0                    4     4.0 GiB
> 1.12k      12 GiB      1.10       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.meta                5     1.7 KiB
> 7     1.1 MiB         0       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.log                 6     148 KiB
> 210     6.5 MiB         0       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.buckets.index       7       642 B
> 11       642 B         0       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.buckets.non-ec      8         0 B
> 0         0 B         0       360 GiB 
>     .rgw.root                                                  9     4.7 KiB
> 16     2.8 MiB         0       360 GiB 
>     ocs-storagecluster-cephobjectstore.rgw.buckets.data       10     1.0 KiB
> 2     384 KiB         0       360 GiB 
> 
> 
> -------------------
> 
> There are 'crashes' in Ceph:
> 
> 1 crashes recorded
> 1 older than 1 days old:
> 2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
> 1 older than 3 days old:
> 2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
> 1 older than 7 days old:
> 2020-09-25_20:47:51.317521Z_fddcda50-89d1-4f97-8485-15c5bbe43804
> 
> 
> Also the CSI subvolume group we use for PVCs has been marked for "Deletion" 
> 
> [
>     {
>         "name": "_deleting"            ------> this is marked for "deletion" 
>     },
>     {
>         "name": "csi"
>     }
> ]
> 
> -------------------------------
> 
> Suchita, 
> 
> Are these logs captured right after or near to issue timestamp?
--> Logs were collected during the ocs-ci test run, when the test failed.

> Is this delay of "10mins" for expansion operation noticed just one or couple
> of instances or is it  100% reproducible in this setup ?
--> I executed the test 3 times and observed the same failure all 3 times:
the 1st time while executing the tier1 test suite, and then 2 more times when I explicitly reran only this test to confirm the failure.
 
> 
> Do we have a setup where we could take a look at it ? because one of the
> possible root cause is Ceph commands are stuck, if we have access to the
> cluster its something we could check it out .
--> No. We have just one cluster, and that cluster is not available now.

> 
> With above questions I am trying to isolate the issue.

Comment 18 suchita 2020-10-12 12:56:53 UTC
(In reply to Humble Chirammal from comment #2)
> >
> Issue: PVC  expansion from 10Gi to 20Gi took 10 minutes. There are resize
> failure events before the expansion succeeded. The issue was seen in first
> attempt of expansion. 
> >
> 
> Can you mention the failure error messages noticed in this testing ?


18:33:06 - MainThread - ocs_ci.ocs.resources.pvc - INFO - Capacity of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c is not 20Gi as expected, but 10Gi. Retrying.
18:33:07 - MainThread - ocs_ci.ocs.utils - INFO - Must gather image: quay.io/rhceph-dev/ocs-must-gather:latest-4.5 will be used.

From the csi-resizer container log of the csi-cephfsplugin-provisioner-5f8b66cc96-zqj9f pod, there are resize failure events before the expansion succeeded. Expansion of PVC pvc-test-07684f4b3a51433bab9005b9cd2c9d7c from 10Gi to 20Gi took 10 minutes, which is much longer than expected.


```
2020-10-05T06:04:40.709925508Z I1005 06:04:40.709515       1 controller.go:190] Starting external resizer openshift-storage.cephfs.csi.ceph.com
2020-10-05T18:29:06.671879079Z I1005 18:29:06.671701       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10094283", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:31:36.675920623Z E1005 18:31:36.675628       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020-10-05T18:31:36.675920623Z I1005 18:31:36.675772       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10094283", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2020-10-05T18:31:37.683786113Z I1005 18:31:37.683698       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095459", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:31:37.68764307Z E1005 18:31:37.687559       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:37.68766698Z I1005 18:31:37.687633       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095459", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.696459133Z I1005 18:31:39.696342       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095472", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:31:39.699381746Z E1005 18:31:39.699297       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:39.699401526Z I1005 18:31:39.699379       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095472", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.707458622Z I1005 18:31:43.707351       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095514", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:31:43.710924792Z E1005 18:31:43.710892       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:43.711068549Z I1005 18:31:43.711035       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095514", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.719531115Z I1005 18:31:51.719466       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095578", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:31:51.723229466Z E1005 18:31:51.723071       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:31:51.723229466Z I1005 18:31:51.723162       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095578", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.732017512Z I1005 18:32:07.731932       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095743", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:32:07.735405106Z E1005 18:32:07.735356       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:07.735479977Z I1005 18:32:07.735449       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095743", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.008425723Z I1005 18:32:23.008326       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095885", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:32:23.013612609Z E1005 18:32:23.013545       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:23.013651018Z I1005 18:32:23.013610       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10095885", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.744212506Z I1005 18:32:39.744101       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10096016", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:32:39.748212052Z E1005 18:32:39.748137       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:32:39.748234731Z I1005 18:32:39.748217       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10096016", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.756282436Z I1005 18:34:47.756203       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:34:47.760025495Z E1005 18:34:47.759963       1 controller.go:361] Resize volume "pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe" by resizer "openshift-storage.cephfs.csi.ceph.com" failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:34:47.760064589Z I1005 18:34:47.760029       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10097120", FieldPath:""}): type: 'Warning' reason: 'VolumeResizeFailed' resize volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe failed: rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 already exists
2020-10-05T18:39:03.768056288Z I1005 18:39:03.767964       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10099185", FieldPath:""}): type: 'Normal' reason: 'Resizing' External resizer is resizing volume pvc-b3fb1def-1a61-44af-8af3-d8de6e9341fe
2020-10-05T18:39:23.091442803Z I1005 18:39:23.091336       1 event.go:281] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"namespace-test-f000d463c1a545b2ae5d2634bfd8b7a3", Name:"pvc-test-07684f4b3a51433bab9005b9cd2c9d7c", UID:"b3fb1def-1a61-44af-8af3-d8de6e9341fe", APIVersion:"v1", ResourceVersion:"10099185", FieldPath:""}): type: 'Normal' reason: 'VolumeResizeSuccessful' Resize volume succeeded
```

Comment 19 suchita 2020-10-12 13:05:39 UTC
(In reply to Humble Chirammal from comment #15)
> Without luck, spent some time looking at the Ceph Side logs ( mgr,fs dump,
> distribution status, crashes [1]. progress..etc) to figure out any
> suspicious state of the cluster. 
> 
> [1] Last crashes are reported at Sep 25th which is old.
--> The cluster crash was during the first tier1 test suite run, on 25th Sept; however, the cluster did not crash after this test, so the test was run on a healthy cluster.
When I reran only this test (retried 2 times), the cluster was in a healthy state both before and after the test.

> 
> Suchita, appreciated if you can share more details requested in earlier
> comments or if we could try to reproduce this again.

Comment 20 suchita 2020-10-12 13:16:30 UTC
(In reply to Humble Chirammal from comment #14)
> Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> that, this cluster is not healthy and leading to the root cause of this
> issue.

Before this test, the cluster health was OK.
From Console Log: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13069/consoleFull

18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 

19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is HEALTH_OK.

Comment 21 Humble Chirammal 2020-10-12 13:38:24 UTC
(In reply to suchita from comment #20)
> (In reply to Humble Chirammal from comment #14)
> > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> > that, this cluster is not healthy and leading to the root cause of this
> > issue.
> 
> Before this test cluster Health is OK. 
> From Console Log:
> https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-
> cluster/13069/consoleFull
> 
> 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
> 
> 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is
> HEALTH_OK.

Sure, that's just the Ceph cluster state. The logs in c#13 refer to issues in this kube/OCP cluster (API, etcd, etc.) reported on the same day.

> Do we have a setup where we could take a look at it ? because one of the
> possible root cause is Ceph commands are stuck, if we have access to the
> cluster its something we could check it out .
--> No We just have a one cluster and that cluster is not available now. 

Unfortunately, I don't think we can glean anything else from the logs beyond what we have already analysed
and shared in previous comments.


We can keep this bugzilla open if we are going to try to reproduce this issue again.
Please share your thoughts, and we can proceed accordingly.

Comment 22 Sahina Bose 2020-10-12 14:21:44 UTC
(In reply to Humble Chirammal from comment #21)
> (In reply to suchita from comment #20)
> > (In reply to Humble Chirammal from comment #14)
> > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> > > that, this cluster is not healthy and leading to the root cause of this
> > > issue.
> > 
> > Before this test cluster Health is OK. 
> > From Console Log:
> > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-
> > cluster/13069/consoleFull
> > 
> > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
> > 
> > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is
> > HEALTH_OK.
> 
> Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the
> issues in this kube/ocp
> cluster  on API, etcd..etc, reported on the same day. 

Would cluster state affect the PV resize?

> 
> > Do we have a setup where we could take a look at it ? because one of the
> > possible root cause is Ceph commands are stuck, if we have access to the
> > cluster its something we could check it out .
> --> No We just have a one cluster and that cluster is not available now. 
> 
> Unfortunately I don't think we can make out anything else from the logs than
> which we already analysed
> and shared in previous comments.
> 
> 
> We can keep this bugzilla open if we are trying to reproduce this issue
> again.
> Please share your thought , accordingly we can proceed.

Comment 23 Humble Chirammal 2020-10-12 14:57:47 UTC
(In reply to Sahina Bose from comment #22)
> (In reply to Humble Chirammal from comment #21)
> > (In reply to suchita from comment #20)
> > > (In reply to Humble Chirammal from comment #14)
> > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> > > > that, this cluster is not healthy and leading to the root cause of this
> > > > issue.
> > > 
> > > Before this test cluster Health is OK. 
> > > From Console Log:
> > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-
> > > cluster/13069/consoleFull
> > > 
> > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
> > > 
> > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is
> > > HEALTH_OK.
> > 
> > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the
> > issues in this kube/ocp
> > cluster  on API, etcd..etc, reported on the same day. 
> 
> Would cluster state affect the PV resize?
> 



 2020-10-05T06:00:31.109598333Z E1005 06:00:31.109444       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:01:11.86034908Z E1005 06:01:11.860243       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:49.458241922Z E1005 06:01:49.458177       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: i/o timeout
2020-10-05T06:01:56.441435721Z E1005 06:01:56.441330       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443/apis/coordination.k8s.io/v1/namespaces/openshift-storage/leases/external-resizer-openshift-storage-cephfs-csi-ceph-com: dial tcp 172.30.0.1:443: connect: connection refused
2020-10-05T06:02:05.548077331Z E1005 06:02:05.548019       1 leaderelection.go:331] error retrieving resource lock openshift-storage/external-resizer-openshift-storage-cephfs-csi-ceph-com: Get https://172.30.0.1:443
........

The `external resizer`, which is responsible for resizing the PVC, failed to acquire or renew leadership (leader election could not complete) due to these errors.
However, to be clear, these errors were reported on the same day but not at the exact time of the issue.
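
A hedged way to confirm whether leader election was the bottleneck is to look at the resizer's lease object and the resizer sidecar log; the lease name is taken from the errors above, while the label and container names are assumptions based on the standard rook/OCS deployment:

```
# Who currently holds the resizer lease, and when was it last renewed?
oc -n openshift-storage get lease \
  external-resizer-openshift-storage-cephfs-csi-ceph-com -o yaml

# Leader election errors in the resizer sidecar (label/container names assumed)
oc -n openshift-storage logs -l app=csi-cephfsplugin-provisioner \
  -c csi-resizer --tail=200 | grep -i leaderelection
```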


The traceback below is the only pointer in the logs from the CSI side that matches the issue timestamp and the exact instance of the issue. It clearly shows a failure to get a proper response to the command CSI executed against the Ceph cluster, which raises the suspicion that the Ceph cluster was not serving requests properly or that its health was not good.


--snip-- from c#9

2020-10-05T18:36:21.739384142Z E1005 18:36:21.739227       1 volume.go:82] ID: 14284 Req-ID: 0001-0011-openshift-storage-0000000000000001-9bb17c80-0737-11eb-8508-0a580a8300a7 failed to get the rootpath for the vol csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7(an error (exit status 22) occurred while running ceph args: [fs subvolume getpath ocs-storagecluster-cephfilesystem csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi -m 172.30.227.113:6789,172.30.65.204:6789,172.30.93.109:6789 -c /etc/ceph/ceph.conf -n client.csi-cephfs-provisioner --keyfile=***stripped***]) stdError Error EINVAL: Traceback (most recent call last):
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 158, in get_fs_handle
2020-10-05T18:36:21.739384142Z     conn.connect()
2020-10-05T18:36:21.739384142Z   File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 85, in connect
2020-10-05T18:36:21.739384142Z     self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
2020-10-05T18:36:21.739384142Z   File "cephfs.pyx", line 694, in cephfs.LibCephFS.mount
2020-10-05T18:36:21.739384142Z cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]
2020-10-05T18:36:21.739384142Z 

..

--/snip--
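
To tell whether the mgr `volumes` module was merely slow or actually hung, the same call from the traceback can be re-run by hand from the toolbox; a sketch, assuming the rook-ceph-tools pod is available (filesystem and subvolume names are taken from the log above):

```
# Locate the toolbox pod
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

# Re-run the getpath call that timed out; `timeout` makes a hang obvious
oc -n openshift-storage rsh "$TOOLS_POD" timeout 60 \
  ceph fs subvolume getpath ocs-storagecluster-cephfilesystem \
  csi-vol-9bb17c80-0737-11eb-8508-0a580a8300a7 --group_name csi
```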



--snip-- from c#11

The above stack trace closely resembles what was tracked in this upstream Ceph tracker: https://tracker.ceph.com/issues/46607

The fix upstream was provided using the tracker: https://tracker.ceph.com/issues/46464

The fix in the above trackers was to handle EDQUOT error returns; however, in this case the errno appears to be ETIMEDOUT (110), which is not handled. This needs a bug against CephFS to handle it.

Further, the error should not have caused the 10-minute delay; CSI should have recovered and retried. The error itself being a timeout suggests the Ceph cluster was not ready to serve requests.

--snip--

Comment 24 Kotresh HR 2020-10-13 10:33:51 UTC
The fix for https://tracker.ceph.com/issues/46464 also handles the missing errnos; with that fix, it no longer crashes.
So, in my opinion, we don't need a separate fix for ETIMEDOUT. Please check https://tracker.ceph.com/issues/47798#note-4
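
For reference, whether the mgr actually crashed (and which Ceph build is running, i.e. whether it already carries the tracker fix) can be checked from the toolbox; again a sketch assuming the rook-ceph-tools pod:

```
TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

# Any recorded daemon crashes (e.g. ceph-mgr tracebacks) and the running versions
oc -n openshift-storage rsh "$TOOLS_POD" ceph crash ls
oc -n openshift-storage rsh "$TOOLS_POD" ceph versions
```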

Comment 25 suchita 2020-10-15 12:28:44 UTC
(In reply to Humble Chirammal from comment #21)
> (In reply to suchita from comment #20)
> > (In reply to Humble Chirammal from comment #14)
> > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> > > that, this cluster is not healthy and leading to the root cause of this
> > > issue.
> > 
> > Before this test cluster Health is OK. 
> > From Console Log:
> > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-
> > cluster/13069/consoleFull
> > 
> > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
> > 
> > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is
> > HEALTH_OK.
> 
> Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the
> issues in this kube/ocp
> cluster  on API, etcd..etc, reported on the same day. 
> 
> > Do we have a setup where we could take a look at it ? because one of the
> > possible root cause is Ceph commands are stuck, if we have access to the
> > cluster its something we could check it out .
> --> No We just have a one cluster and that cluster is not available now. 
> 
> Unfortunately I don't think we can make out anything else from the logs than
> which we already analysed
> and shared in previous comments.
> 
> 
> We can keep this bugzilla open if we are trying to reproduce this issue
> again.
> Please share your thought , accordingly we can proceed.
I have created a fresh cluster and the issue is not reproduced.

Comment 26 suchita 2020-10-15 13:43:51 UTC
(In reply to suchita from comment #25)
> (In reply to Humble Chirammal from comment #21)
> > (In reply to suchita from comment #20)
> > > (In reply to Humble Chirammal from comment #14)
> > > > Suchita, Can you revert on questions raised in c#8 ? overall it looks to me
> > > > that, this cluster is not healthy and leading to the root cause of this
> > > > issue.
> > > 
> > > Before this test cluster Health is OK. 
> > > From Console Log:
> > > https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-
> > > cluster/13069/consoleFull
> > > 
> > > 18:21:27 - MainThread - tests.conftest - INFO - Checking for Ceph Health OK 
> > > 
> > > 19:04:39 - MainThread - ocs_ci.utility.utils - INFO - Ceph cluster health is
> > > HEALTH_OK.
> > 
> > Sure, thats just the Ceph Cluster state. The logs in c#13 refer to the
> > issues in this kube/ocp
> > cluster  on API, etcd..etc, reported on the same day. 
> > 
> > > Do we have a setup where we could take a look at it ? because one of the
> > > possible root cause is Ceph commands are stuck, if we have access to the
> > > cluster its something we could check it out .
> > --> No We just have a one cluster and that cluster is not available now. 
> > 
> > Unfortunately I don't think we can make out anything else from the logs than
> > which we already analysed
> > and shared in previous comments.
> > 
> > 
> > We can keep this bugzilla open if we are trying to reproduce this issue
> > again.
> > Please share your thought , accordingly we can proceed.
> I have created the fresh cluster and this issue is not reproduced.

I have created another cluster and the issue is reproduced.
The cluster is available now.
Console logs: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/13549/consoleFull

Test execution logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv-tp/sgatfane-rhv-tp_20200925T115616/logs/ocs-ci-logs-1602764201/
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/sgatfane-rhv-tp/sgatfane-rhv-tp_20200925T115616/logs/failed_testcase_ocs_logs_1602764201/

Comment 28 suchita 2020-10-23 07:26:52 UTC
On 19 October, when I changed the log level to 5 on the same cluster and executed the test, I realized that all operator, provisioner, and plugin pods were missing from the OCS cluster, which is similar to https://bugzilla.redhat.com/show_bug.cgi?id=1884318.
I didn't check this before making the change. All pods were running when I posted comment #26.

I used `oc edit cm/rook-ceph-operator-config -n openshift-storage` to update the data section with CSI_LOG_LEVEL: "5".

@Madhu Rajanna, please let me know whether this is the correct way to change the log level on an existing cluster.
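
For reference, a minimal non-interactive alternative (assumed equivalent to the `oc edit` above) is to patch the configmap directly; the operator should then roll the new log level out to the CSI pods:

```
# Set the CSI sidecar/driver log level via the operator configmap
oc -n openshift-storage patch cm rook-ceph-operator-config --type merge \
  -p '{"data":{"CSI_LOG_LEVEL":"5"}}'

# Confirm the value was applied
oc -n openshift-storage get cm rook-ceph-operator-config \
  -o jsonpath='{.data.CSI_LOG_LEVEL}'
```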

After that, I uninstalled and reinstalled OCS. Meanwhile, the cluster is down because the storage domain attached to it went into an inactive status. We have only one cluster, so I will be able to rerun this test with log level 5 once the cluster issue is resolved.

Comment 30 Yaniv Kaul 2020-11-11 11:50:45 UTC
Do we have new logs now?

Comment 31 Yaniv Kaul 2020-11-27 15:29:37 UTC
(In reply to Yaniv Kaul from comment #30)
> Do we have new logs now?

Ping? This is in the top 10 bugs, yet it has had no response for 2 weeks.

Comment 32 Raz Tamir 2020-11-29 08:00:54 UTC
suchita,

Could you please provide the requested information in comment #30?

Thanks

Comment 33 Mudit Agarwal 2021-01-06 13:46:27 UTC
We have been awaiting data for a long time; this shouldn't be a blocker.

Comment 34 Yaniv Kaul 2021-01-06 14:23:43 UTC
Closing, please reopen when we can get the information requested.

Comment 35 suchita 2021-01-06 15:05:38 UTC
The scenario was not reproduced later, hence I am not able to provide the required logs from the buggy run.
So it is okay to close this bug.

Comment 36 Red Hat Bugzilla 2023-09-15 00:49:22 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days

