Description of problem (please be as detailed as possible and provide log snippets):

Upon completion of volume expansion and file system resize, the new size is not reflected on the pod as expected. The issue is observed with an RBD-based volume, and in the acceptance runs of the three most recent builds.

Failed test case: tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expansion

The last check to verify the new size on the pod is done 240 seconds after expansion of the volume:

23:37:27 - MainThread - tests.manage.pv_services.pvc_resize.test_pvc_expansion - INFO - Expanded size of PVC pvc-test-89b1a2d6db734c778958874d09fbdf41 is not reflected on pod pod-test-rbd-b0f30f76c71f41e8b6423862f24a307d. New size on mount is not 25G as expected, but 9.8G. Checking again.

must-gather: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/17/artifact/logs/failed_testcase_ocs_logs_1597286339/test_pvc_expansion_ocs_logs/

Test case log: https://storage-jenkins-csb-ceph.cloud.paas.psi.redhat.com/job/ocs-ci/17/artifact/logs/ocs-ci-logs-1597286339/by_outcome/failed/tests/manage/pv_services/pvc_resize/test_pvc_expansion.py/TestPvcExpand/test_pvc_expansion/

Version of all relevant components (if applicable):
---------------------------------------------------
OCS operator:    v4.6.0-53.ci
Cluster version: 4.6.0-0.nightly-2020-08-12-155346
Ceph version:    14.2.8-91.el8cp
rook_csi_ceph:   cephcsi@sha256:9a78a22c3c375b61e5112468d1160afe5793e87dd1788b576ab9f05892d83e5e
rook_ceph:       rook-ceph@sha256:89a6a7c0fe5f45a07db7cb1e2e41a5177c61813dd30b2faa2aa428b236e85d90

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes, the expanded size is not available for use.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:
The test case passes on OCS 4.5 builds.

Steps to Reproduce:
1. Create an RBD-based PVC and attach it to a pod.
2. Expand the PVC (see the client-go sketch after this report).
3. Wait for the new size to be reflected on the pod.

The test case is automated: tests/manage/pv_services/pvc_resize/test_pvc_expansion.py::TestPvcExpand::test_pvc_expansion

Actual results:
Volume expansion succeeded, but the new size is not reflected on the pod.

Expected results:
The new size should be reflected on the pod and available for use.

Additional info:
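For reference, step 2 amounts to patching spec.resources.requests.storage on the PVC. Below is a minimal client-go sketch of that call (assuming client-go v0.18+); the namespace and PVC name are illustrative, the actual test is the Python ocs-ci case referenced above, and the storage class must have allowVolumeExpansion enabled for the API server to accept the resize.

----snip------------
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Patch only spec.resources.requests.storage; the external-resizer and the
	// CSI driver handle the rest (ControllerExpandVolume + NodeExpandVolume).
	patch := []byte(`{"spec":{"resources":{"requests":{"storage":"25Gi"}}}}`)
	pvc, err := client.CoreV1().PersistentVolumeClaims("my-namespace").Patch(
		context.TODO(), "my-rbd-pvc", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("requested size for %s is now %s\n", pvc.Name, pvc.Spec.Resources.Requests.Storage())
}
--------snip------------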
This is working fine in 4.5; we have seen this issue only in the recent 4.6 builds.
I just ran tests/manage/pv_services/pvc_resize/test_pvc_expansion.py on v4.6.0-530.ci and it passed without any issues.

The only difference between the setup on which it failed for QE and my setup is the OCP version: I am running OCP 4.5 while the QE setup had OCP 4.6.
(In reply to Mudit Agarwal from comment #4)
> I just ran tests/manage/pv_services/pvc_resize/test_pvc_expansion.py on
> v4.6.0-530.ci and it passed without any issues.
>
> The only difference with the setup on which it failed for QE and my setup is
> the OCP version, I am running OCP 4.5 while QE setup had 4.6

Thanks Mudit. In short:

OCS 4.5 + OCP 4.5 -> it works
OCS 4.6 + OCP 4.5 -> it works
OCS 4.6 + OCP 4.6 -> we see a failure in CI and are waiting to reproduce it manually.
Jilju++ was able to reproduce this issue today (OCP 4.6 + OCS 4.6) and provided system access. Here is my initial analysis of this issue.

Steps followed by QE:
---------------------
Expansion from 10Gi to 20Gi of PVC 'pvc-rbd-test-expand' was performed while a pod was consuming the volume; however, the new size (20Gi) was not reflected inside the pod, which stayed at 10Gi.

Problematic PVC and POD:
------------------------
[hchiramm@localhost ceph-csi]$ oc get pvc
NAME                  STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
pvc-rbd-test-expand   Bound    pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1   20Gi       RWO            ocs-storagecluster-ceph-rbd   97m

[hchiramm@localhost ceph-csi]$ oc get pods
NAME                      READY   STATUS    RESTARTS   AGE
pod-pvc-rbd-test-expand   1/1     Running   0          92m

PVC size: [1] and [2] match

spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi -----------------------------------> [1]
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Filesystem
  volumeName: pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 20Gi --------------------------------------> [2]
  phase: Bound

However, inside the POD:

Filesystem                             Size  Used  Avail  Use%  Mounted on
overlay                                120G  11G   109G   9%    /
tmpfs                                  64M   0     64M    0%    /dev
tmpfs                                  32G   0     32G    0%    /sys/fs/cgroup
shm                                    64M   0     64M    0%    /dev/shm
tmpfs                                  32G   52M   32G    1%    /etc/hostname
/dev/mapper/coreos-luks-root-nocrypt   120G  11G   109G   9%    /etc/hosts
/dev/rbd3                              9.8G  37M   9.8G   1%    /var/lib/www/html -----------------------------------> [NOT EXPECTED SIZE]
tmpfs                                  32G   28K   32G    1%    /run/secrets/kubernetes.io/serviceaccount
tmpfs                                  32G   0     32G    0%    /proc/acpi
tmpfs                                  32G   0     32G    0%    /proc/scsi
tmpfs                                  32G   0     32G    0%    /sys/firmware

NOTE: The PVC capacity lists the expected size (20Gi), but the pod mount does **NOT**.

CSI driver requests and responses:
----------------------------------
We received the ControllerExpandVolume and NodeExpandVolume requests as expected in this case, and the responses from the RBD plugin were **successful**:

I0818 07:46:45.017588 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerExpandVolume
I0818 07:46:45.017601 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":21474836480},"secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b"}
I0818 07:46:45.126362 1 connection.go:185] GRPC response: {"capacity_bytes":21474836480,"node_expansion_required":true} --------------> [It was successful and marked the volume for NODE EXPANSION]
I0818 07:46:45.127355 1 connection.go:186] GRPC error: <nil>

I0818 07:47:38.200963 52325 utils.go:165] ID: 119 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}}]}
I0818 07:47:38.201627 52325 utils.go:159] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC call: /csi.v1.Node/NodeExpandVolume
I0818 07:47:38.202855 52325 utils.go:160] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC request:
{"capacity_range":{"required_bytes":21474836480},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b","volume_path":"/var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount"} I0818 07:47:38.203842 52325 utils.go:165] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC response: {} ^^^^^^^^^^^^^^^ [Even NODE expansion response was successful] So we have successful events in API server: ----------------------------- Normal Resizing 1h external-resizer openshift-storage.rbd.csi.ceph.com External resizer is resizing volume pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1 Normal FileSystemResizeRequired 1h external-resizer openshift-storage.rbd.csi.ceph.com Require file system resize of volume on node Normal FileSystemResizeSuccessful 1h kubelet, ip-10-0-176-197.us-east-2.compute.internal MountVolume.NodeExpandVolume succeeded for volume "pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1" Now to the interesting part : ------------------------------------ The RBD image is resized to 20Gi in the OSD: rbd image 'csi-vol-fca27f94-e125-11ea-bc9f-0a580a81020b': size 20 GiB in 5120 objects -----------------------> 20Gi as expected order 22 (4 MiB objects) snapshot_count: 0 id: 5e40c41054cc block_name_prefix: rbd_data.5e40c41054cc format: 2 features: layering op_features: flags: create_timestamp: Tue Aug 18 07:39:45 2020 access_timestamp: Tue Aug 18 07:39:45 2020 modify_timestamp: Tue Aug 18 07:39:45 2020 The subjected RBD device was mapped to the kubelet/node with new size : 20 Gi rbd3 252:48 0 20G 0 disk /var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount -------------------- > However the Filesystem or mount of this RBD device ***havent*** resized, it still on 10Gi: /dev/rbd3 9.8G 37M 9.8G 1% /var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount The kernel ******detected****** the size change from 10Gi to 20Gi : [iic, this is when the mapping happens and not when FS resize happens , but good to confirm too] ------------------------------------------------------ [ 2064.163337] EXT4-fs (rbd3): mounted filesystem with ordered data mode. Opts: (null) [ 2064.278161] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready [ 2064.291371] IPv6: ADDRCONF(NETDEV_UP): veth7459a0ea: link is not ready [ 2064.294355] IPv6: ADDRCONF(NETDEV_CHANGE): veth7459a0ea: link becomes ready [ 2064.296993] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready [ 2064.311961] device veth7459a0ea entered promiscuous mode [ 2066.418585] SELinux: mount invalid. 
[ 2184.207518] rbd3: detected capacity change from 10737418240 to 21474836480 ------------------------------ [size change]
[ 2393.290915] device veth3fe217ac left promiscuous mode
[ 2393.367160] device veth15e4af39 left promiscuous mode
[ 2393.443839] device veth2e7d6841 left promiscuous mode
[ 2393.475271] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2393.503318] IPv6: ADDRCONF(NETDEV_UP): veth710f23a5: link is not ready
[ 2393.505853] IPv6: ADDRCONF(NETDEV_CHANGE): veth710f23a5: link becomes ready
[ 2393.508591] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 2393.511412] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2393.528066] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 2393.534415] device veth561cfc77 left promiscuous mode

Possibilities:
--------------
1) The interesting thing here is that the CSI driver did not log any errors [1] or hit any error [2] when it tried to resize. If extra logging had been enabled in this setup (-v=5 instead of -v=0), it would have captured a couple of additional INFO messages in the code below, which would reconfirm which stages of this process were reached; that is critical to understanding what is going on here. One thing we are sure of is that there was no error from these functions.

----snip------------
func (resizefs *ResizeFs) Resize(devicePath string, deviceMountPath string) (bool, error) {
	format, err := resizefs.mounter.GetDiskFormat(devicePath)
	if err != nil {
		formatErr := fmt.Errorf("ResizeFS.Resize - error checking format for device %s: %v", devicePath, err)
		return false, formatErr
	}

	// If disk has no format, there is no need to resize the disk because mkfs.*
	// by default will use whole disk anyways.
	if format == "" {
		return false, nil
	}

	klog.V(3).Infof("ResizeFS.Resize - Expanding mounted volume %s", devicePath)
	switch format {
	case "ext3", "ext4":
		return resizefs.extResize(devicePath)
	case "xfs":
		return resizefs.xfsResize(deviceMountPath)
	}
	return false, fmt.Errorf("ResizeFS.Resize - resize of format %s is not supported for device %s mounted at %s", format, devicePath, deviceMountPath)
}

func (resizefs *ResizeFs) extResize(devicePath string) (bool, error) {
	output, err := resizefs.mounter.Exec.Command("resize2fs", devicePath).CombinedOutput()
	if err == nil {
		klog.V(2).Infof("Device %s resized successfully", devicePath)
		return true, nil
	}

	resizeError := fmt.Errorf("resize of device %s failed: %v. resize2fs output: %s", devicePath, err, string(output))
	return false, resizeError
}
--------snip------------

2) Otherwise, we went wrong in this logic of ***finding out whether it is a BLOCK mount or an FS mount and returning success*** (which is being removed in https://github.com/ceph/ceph-csi/pull/1386):

// volumePath is targetPath for block PVC and stagingPath for filesystem.
// check the path is mountpoint or not, if it is
// mountpoint treat this as block PVC or else it is filesystem PVC
// TODO remove this once ceph-csi supports CSI v1.2.0 spec
notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
if err != nil {
	if os.IsNotExist(err) {
		return nil, status.Error(codes.NotFound, err.Error())
	}
	return nil, status.Error(codes.Internal, err.Error())
}
if !notMnt {
	return &csi.NodeExpandVolumeResponse{}, nil
}

3) It could be a bug in the latest kernel or ext4, or crio/SELinux could be denying this resize operation. I do not see any suspicious process on this node, in my initial check at least.
Noticed some AVC denials, but I am not sure whether they belong to this operation; that has to be clarified as well.

mkfs.ext4 version:
mke2fs 1.45.4 (23-Sep-2019)

Red Hat release: Red Hat Enterprise Linux release 8.2 (Ootpa)
Subject package: e2fsprogs-1.45.4-3.el8.x86_64

SELinux:
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      32

Denials:
type=AVC msg=audit(1597734655.534:53): avc: granted { setsecparam } for pid=3388 comm="tuned" scontext=system_u:system_r:spc_t:s0 tcontext=system_u:object_r:security_t:s0 tclass=security
type=AVC msg=audit(1597735654.149:83): avc: denied { write } for pid=53413 comm="cephcsi" name="pids.max" dev="cgroup" ino=5861 scontext=system_u:system_r:container_t:s0:c97,c508 tcontext=system_u:object_r:cgroup_t:s0 tclass=file permissive=0
type=AVC msg=audit(1597735965.441:99): avc: denied { associate } for pid=1628 comm="crio" name="/" dev="ceph" ino=1099511627779 scontext=system_u:object_r:container_file_t:s0:c2,c23 tcontext=system_u:object_r:cephfs_t:s0 tclass=filesystem permissive=0
type=AVC msg=audit(1597736218.771:111): avc: denied { associate } for pid=1628 comm="crio" name="/" dev="ceph" ino=1099511627782 scontext=system_u:object_r:container_file_t:s0:c0,c25 tcontext=system_u:object_r:cephfs_t:s0 tclass=filesystem permissive=0

Processes:
root       74431  0.0  0.0  0  0  ?  I<  07:31  0:00  [rbd]
root       74442  0.0  0.0  0  0  ?  I<  07:31  0:00  [rbd0-tasks]
root       74484  0.0  0.0  0  0  ?  S   07:31  0:00  [jbd2/rbd0-8]
root       77841  0.0  0.0  0  0  ?  I<  07:31  0:00  [rbd1-tasks]
root       77850  0.0  0.0  0  0  ?  S   07:31  0:00  [jbd2/rbd1-8]
root       79051  0.0  0.0  0  0  ?  I<  07:32  0:00  [rbd2-tasks]
root       79061  0.0  0.0  0  0  ?  S   07:32  0:00  [jbd2/rbd2-8]
root      141588  0.0  0.0  0  0  ?  I<  07:44  0:00  [rbd3-tasks]
root      141598  0.0  0.0  0  0  ?  S   07:44  0:00  [jbd2/rbd3-8]

Next step: I would like to reproduce this issue once again with extra debugging enabled.
The latest change in this code path on the ceph-csi side was https://github.com/ceph/ceph-csi/pull/1297, but it should not cause what we are seeing here. Just mentioning it for the record.
While looking at the plugin logs, I failed to notice any markers of a resize attempt in this cluster. It also looks like the cluster has been decommissioned.

Based on the analysis so far, here is the action plan I am coming up with:

We will validate the size of the mapped RBD device before we attempt the resize. We have the volume path and the new size in the request, which we can make use of, so the validation could happen before getting into the code below:

...
resizer := resizefs.NewResizeFs(diskMounter)
ok, err := resizer.Resize(devicePath, volumePath)
if !ok {
	return nil, fmt.Errorf("rbd: resize failed on path %s, error: %v", req.GetVolumePath(), err)
}

This will help isolate whether the FS resize was a no-op or was attempted before the new size was detected by the kernel. It should either solve the issue or at least isolate it to a lower-level component.
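To make that validation concrete, here is a minimal sketch (not ceph-csi code; the helper name is made up) of reading the size the kernel currently reports for the mapped device from sysfs. Inside NodeExpandVolume the returned value could be compared against req.GetCapacityRange().GetRequiredBytes() before calling resizer.Resize(), so that a resize attempted before the kernel has seen the new size fails loudly instead of silently succeeding on the old capacity.

----snip------------
// Standalone sketch: print the size the kernel currently reports for a block
// device; the same check could gate the FS resize in NodeExpandVolume.
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// blockDeviceSize returns the size in bytes of a device such as /dev/rbd3 by
// reading its 512-byte sector count from sysfs.
func blockDeviceSize(devicePath string) (int64, error) {
	sysPath := filepath.Join("/sys/block", filepath.Base(devicePath), "size")
	raw, err := ioutil.ReadFile(sysPath)
	if err != nil {
		return 0, err
	}
	sectors, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, err
	}
	return sectors * 512, nil
}

func main() {
	size, err := blockDeviceSize(os.Args[1]) // e.g. /dev/rbd3
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s: %d bytes\n", os.Args[1], size)
}
--------snip------------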
[Side note] One more thing we could do here is expand a PVC which is not attached to a pod, and then attach it to a pod while the FS resize is still pending. This should also help isolate, to an extent, whether the issue lies in the FS resize operation itself or in a delay in the new size being reflected along the path.
I followed Humble's suggestion from https://bugzilla.redhat.com/show_bug.cgi?id=1868703#c16:

1. Create a PVC of size 1Gi (do not attach it to a pod).
2. Expand it to a new size, 5Gi (not reflected, because there is no pod in the picture).
3. Attach it to a pod - the new size is reflected on both the PVC and the pod.

>>> Expand, but don't attach a pod

[muagarwa@dhcp53-217 ocp]$ oc get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
mudit-claim   Bound    pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944   1Gi        RWO            ocs-storagecluster-ceph-rbd   113s

[muagarwa@dhcp53-217 ocp]$ oc get pvc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
    volumeMode: Filesystem
    volumeName: pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 1Gi ------------------------------------------------------------------> Size not reflected because of the pending FS resize
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-08-23T08:10:34Z"
      message: Waiting for user to (re-)start a pod to finish file system resize of volume on node.
      status: "True"
      type: FileSystemResizePending
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

>> After attaching a pod

[muagarwa@dhcp53-217 ocp]$ oc get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
mudit-claim   Bound    pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944   5Gi        RWO            ocs-storagecluster-ceph-rbd   9m11s

[muagarwa@dhcp53-217 ocp]$ oc exec -it mudit-app sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # df -h /usr/share/busybox
Filesystem                Size      Used Available Use% Mounted on
/dev/rbd1                 4.9G     20.0M      4.8G   0% /usr/share/busybox
/ #

So this experiment more or less validates the theory in https://bugzilla.redhat.com/show_bug.cgi?id=1868703#c10. We still don't know what changed in OCP 4.6 that is causing this.
Thanks Mudit. The complete proof will come once we have a build per c#14, and hopefully that will help resolve the issue too!
https://github.com/kubernetes/kubernetes/commit/69613da0ae9743397ba7b2b3daebdc18d68e795e
There is a complete change in the way the NodeExpandVolume request is made upstream versus downstream.

Upstream:

I0824 07:29:10.172338 22867 utils.go:159] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC call: /csi.v1.Node/NodeExpandVolume
I0824 07:29:10.173992 22867 utils.go:160] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_id":"0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"}

>> [1] volume_path = "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"

Downstream:

I0824 09:24:45.735515 507082 utils.go:159] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC call: /csi.v1.Node/NodeExpandVolume
I0824 09:24:45.737321 507082 utils.go:160] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC request: {"capacity_range":{"required_bytes":2147483648},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229","volume_path":"/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"}

>> [2] volume_path = "/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"

Now, the getDevicePath() function always expects [1], but somehow that passes with [2] downstream. I tried replacing it with my upstream build and getDevicePath() started failing. Digging more.
(In reply to Mudit Agarwal from comment #20)
> There is a complete change in the way NodeExpandVolume request is being
> called in Upstream and Downstream:
>
> Upstream:
>
> I0824 07:29:10.172338 22867 utils.go:159] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC call: /csi.v1.Node/NodeExpandVolume
> I0824 07:29:10.173992 22867 utils.go:160] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_id":"0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"}
>
> >> [1] volume_path = "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"
>
> Downstream:
>
> I0824 09:24:45.735515 507082 utils.go:159] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC call: /csi.v1.Node/NodeExpandVolume
> I0824 09:24:45.737321 507082 utils.go:160] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC request: {"capacity_range":{"required_bytes":2147483648},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229","volume_path":"/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"}
>
> >> [2] volume_path = "/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"

Just to make it clear, there is a change in the node expansion request; also see the commit in c#19. Previously the request carried only volume_path (unlike the downstream logs in c#6), and IIUC that is what we are operating on in the Ceph CSI upstream and OCS 4.5 builds at the moment. Due to this change (Kubernetes 1.18+), in OCS 4.6 we get both staging_target_path and volume_path in the request. IIUC, our logic for deriving the device path and resizing becomes a no-op because of this change: the resize is either not attempted or becomes a no-op. With that theory in place, I assume the upstream test should also fail once we get the latest Kubernetes in the CI and run our E2E.
RCA:

1. As Humble mentioned, because of https://github.com/kubernetes/kubernetes/commit/69613da0ae9743397ba7b2b3daebdc18d68e795e Kubernetes is sending both volume_path and staging_target_path.

2. Because of that, from OCP 4.6 onwards volume_path no longer contains /globalmount, and the resize became a no-op because we are not able to get the device path information from the new volume_path.

Now, there were two questions:

1. Why is it passing upstream?
>> Upstream is still using a Kubernetes version which doesn't have the above commit.

2. Why is getDevicePath() not giving any error?
>> We never called getDevicePath(), because of this code block (which is removed by https://github.com/ceph/ceph-csi/pull/1386):

==================
notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
if err != nil {
	if os.IsNotExist(err) {
		return nil, status.Error(codes.NotFound, err.Error())
	}
	return nil, status.Error(codes.Internal, err.Error())
}
if !notMnt {
	return &csi.NodeExpandVolumeResponse{}, nil -----------------------------> We were always returning from here because of the wrong volumePath
}
==================

Now, I am raising a PR for this, but it has to wait until upstream moves to Kubernetes 1.18+.

Thanks Humble for all the help and lots of offline discussions during the day.
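For illustration, a minimal sketch of the direction such a fix can take (not the actual ceph-csi change; the helper name is made up, and it assumes a CSI spec version whose generated NodeExpandVolumeRequest exposes GetStagingTargetPath()): prefer staging_target_path, which still ends in /globalmount on Kubernetes 1.18+, and fall back to volume_path for older kubelets, where volume_path itself is the globalmount path.

----snip------------
package rbd // package name illustrative

import (
	"github.com/container-storage-interface/spec/lib/go/csi"
)

// pathForDeviceLookup picks the path from which the mapped RBD device should
// be derived in NodeExpandVolume. Illustrative helper, not the merged fix.
func pathForDeviceLookup(req *csi.NodeExpandVolumeRequest) string {
	// Kubernetes 1.18+ (OCP 4.6) populates staging_target_path, which points
	// at .../plugins/kubernetes.io/csi/pv/<pv-name>/globalmount.
	if staging := req.GetStagingTargetPath(); staging != "" {
		return staging
	}
	// Older kubelets (OCP 4.5, current upstream CI) only send volume_path,
	// which is the globalmount path there.
	return req.GetVolumePath()
}
--------snip------------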
(In reply to Mudit Agarwal from comment #22)
> RCA:
>
> 1. As Humble mentioned because of
> https://github.com/kubernetes/kubernetes/commit/69613da0ae9743397ba7b2b3daebdc18d68e795e
> Kubernetes is sending volume_path and staging_target_path both
>
> 2. Because of that OCP4.6 onwards volume_path doesn't consist of
> /globalmount and thus resize became a no-op because we are not able to get
> the device path info using the new volume_path
>
> Now, there were two questions:
>
> 1. Why it is passing in Upstream:
> >> Upstream is still using a Kube version which doesn't have above PR
>
> 2. Why getDevicePath() is not giving any error?
> >> we never called getDevicePath() because of this code block (which is removed by https://github.com/ceph/ceph-csi/pull/1386)
>
> ==================
> notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
> if err != nil {
>     if os.IsNotExist(err) {
>         return nil, status.Error(codes.NotFound, err.Error())
>     }
>     return nil, status.Error(codes.Internal, err.Error())
> }
> if !notMnt {
>     return &csi.NodeExpandVolumeResponse{}, nil -----------------------------> We were always returning from here because of the wrong volumePath
> }
> ==================

Yeah, possibility (2) in c#6 :)

> Now, I am raising a PR for this but it has to wait till upstream goes to
> Kube v18+

That's great. We will be able to reconfirm this once we have upgraded Ceph CSI to 1.19.

> Thanks Humble for all the help and lots of offline discussions during the
> day.

Thanks to you, Mudit, for staying on top of this and doing the experiments! It's not easy :)..
The release branch 4.6 has the fix now, moving to MODIFIED.
Considering that CI now passes on OCS 4.6 builds, I am moving this to ON_QA.

Mail in rhocs-eng with the subject below:
ocs-ci results for OCS4-6-Downstream-OCP4-6-AWS-IPI-1AZ-RHCOS-3M-3W-acceptance (BUILD ID: v4.6.0-75.ci RUN ID: 1599531194)
I re-tested with ocs-operator.v4.6.0-90.ci, and tests/manage/pv_services/pvc_resize/test_pvc_expansion.py passed without any issues.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5605
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days