Bug 1868703
| Summary: | [rbd] After volume expansion, the new size is not reflected on the pod | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Container Storage | Reporter: | Jilju Joy <jijoy> | |
| Component: | csi-driver | Assignee: | Mudit Agarwal <muagarwa> | |
| Status: | CLOSED ERRATA | QA Contact: | Elena Bondarenko <ebondare> | |
| Severity: | high | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.6 | CC: | assingh, hchiramm, hekumar, jdillama, jsafrane, madam, mrajanna, muagarwa, nberry, ocs-bugs | |
| Target Milestone: | --- | Keywords: | Automation, Regression | |
| Target Release: | OCS 4.6.0 | |||
| Hardware: | Unspecified | |||
| OS: | Unspecified | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | No Doc Update | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1873387 (view as bug list) | Environment: | ||
| Last Closed: | 2020-12-17 06:23:13 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1873387 | |||
Description
Jilju Joy
2020-08-13 14:24:11 UTC
This is working fine in 4.5; we have seen this issue only in the recent 4.6 builds.

I just ran tests/manage/pv_services/pvc_resize/test_pvc_expansion.py on v4.6.0-530.ci and it passed without any issues. The only difference between the setup on which it failed for QE and my setup is the OCP version: I am running OCP 4.5 while the QE setup had 4.6.

(In reply to Mudit Agarwal from comment #4)
> I just ran tests/manage/pv_services/pvc_resize/test_pvc_expansion.py on
> v4.6.0-530.ci and it passed without any issues.
>
> The only difference with the setup on which it failed for QE and my setup is
> the OCP version, I am running OCP 4.5 while QE setup had 4.6

Thanks Mudit. In short:

OCS 4.5 + OCP 4.5 -> it works
OCS 4.6 + OCP 4.5 -> it works
OCS 4.6 + OCP 4.6 -> we see a failure in CI and are waiting to reproduce it manually.

Jilju++ was able to reproduce this issue in their setup today (OCP 4.6 + OCS 4.6) and provided the system access:
Here is my initial analysis of this issue:
Steps followed by QE:
---------------
The PVC 'pvc-rbd-test-expand' was expanded from 10Gi to 20Gi while a pod was consuming the volume; however, the new size (20Gi) was not reflected inside the pod, which still showed 10Gi.
Problematic PVC and POD:
------------------------
[hchiramm@localhost ceph-csi]$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pvc-rbd-test-expand Bound pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1 20Gi RWO ocs-storagecluster-ceph-rbd 97m
[hchiramm@localhost ceph-csi]$ oc get pods
NAME READY STATUS RESTARTS AGE
pod-pvc-rbd-test-expand 1/1 Running 0 92m
PVC size: [1] and [2] match
spec:
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 20Gi -----------------------------------> [1]
storageClassName: ocs-storagecluster-ceph-rbd
volumeMode: Filesystem
volumeName: pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1
status:
accessModes:
- ReadWriteOnce
capacity:
storage: 20Gi --------------------------------------> [2]
phase: Bound
However, inside the pod:
Filesystem Size Used Avail Use% Mounted on
overlay 120G 11G 109G 9% /
tmpfs 64M 0 64M 0% /dev
tmpfs 32G 0 32G 0% /sys/fs/cgroup
shm 64M 0 64M 0% /dev/shm
tmpfs 32G 52M 32G 1% /etc/hostname
/dev/mapper/coreos-luks-root-nocrypt 120G 11G 109G 9% /etc/hosts
/dev/rbd3 9.8G 37M 9.8G 1% /var/lib/www/html -----------------------------------> [NOT EXPECTED SIZE]
tmpfs 32G 28K 32G 1% /run/secrets/kubernetes.io/serviceaccount
tmpfs 32G 0 32G 0% /proc/acpi
tmpfs 32G 0 32G 0% /proc/scsi
tmpfs 32G 0 32G 0% /sys/firmware
NOTE: The PVC capacity lists the expected size (20Gi), but the pod mount does **not** show it.
CSI driver requests and responses:
-------------------------------------
We received the ControllerExpandVolume and NodeExpandVolume requests as expected in this case, and the RBD plugin's responses to both were ***successful***:
I0818 07:46:45.017588 1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerExpandVolume
I0818 07:46:45.017601 1 connection.go:183] GRPC request: {"capacity_range":{"required_bytes":21474836480},"secrets":"***stripped***","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b"}
I0818 07:46:45.126362 1 connection.go:185] GRPC response: {"capacity_bytes":21474836480,"node_expansion_required":true} --------------> [It was successful and marked it for NODE EXPANSION]
I0818 07:46:45.127355 1 connection.go:186] GRPC error: <nil>
I0818 07:47:38.200963 52325 utils.go:165] ID: 119 GRPC response: {"capabilities":[{"Type":{"Rpc":{"type":1}}},{"Type":{"Rpc":{"type":2}}},{"Type":{"Rpc":{"type":3}}}]}
I0818 07:47:38.201627 52325 utils.go:159] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC call: /csi.v1.Node/NodeExpandVolume
I0818 07:47:38.202855 52325 utils.go:160] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC request: {"capacity_range":{"required_bytes":21474836480},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b","volume_path":"/var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount"}
I0818 07:47:38.203842 52325 utils.go:165] ID: 120 Req-ID: 0001-0011-openshift-storage-0000000000000001-fca27f94-e125-11ea-bc9f-0a580a81020b GRPC response: {}
^^^^^^^^^^^^^^^ [Even NODE expansion response was successful]
So we have successful events in the API server:
-----------------------------
Normal Resizing 1h external-resizer openshift-storage.rbd.csi.ceph.com External resizer is resizing volume pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1
Normal FileSystemResizeRequired 1h external-resizer openshift-storage.rbd.csi.ceph.com Require file system resize of volume on node
Normal FileSystemResizeSuccessful 1h kubelet, ip-10-0-176-197.us-east-2.compute.internal MountVolume.NodeExpandVolume succeeded for volume "pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1"
Now to the interesting part:
------------------------------------
The RBD image is resized to 20Gi in the OSD:
rbd image 'csi-vol-fca27f94-e125-11ea-bc9f-0a580a81020b':
size 20 GiB in 5120 objects -----------------------> 20Gi as expected
order 22 (4 MiB objects)
snapshot_count: 0
id: 5e40c41054cc
block_name_prefix: rbd_data.5e40c41054cc
format: 2
features: layering
op_features:
flags:
create_timestamp: Tue Aug 18 07:39:45 2020
access_timestamp: Tue Aug 18 07:39:45 2020
modify_timestamp: Tue Aug 18 07:39:45 2020
The RBD device in question was mapped on the kubelet/node with the new size, 20Gi:
rbd3 252:48 0 20G 0 disk /var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount
--------------------> However, the filesystem/mount of this RBD device has ***not*** been resized; it is still 10Gi:
/dev/rbd3 9.8G 37M 9.8G 1% /var/lib/kubelet/pods/75bb2400-b2d8-41f7-9c06-6a48c2a9cdf2/volumes/kubernetes.io~csi/pvc-8a9b20a7-7b60-4005-9d20-b2b58ecc77c1/mount
The kernel ***detected*** the size change from 10Gi to 20Gi [IIUC, this happens when the mapping is refreshed and not when the FS resize happens, but it is good to confirm too]:
------------------------------------------------------
[ 2064.163337] EXT4-fs (rbd3): mounted filesystem with ordered data mode. Opts: (null)
[ 2064.278161] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2064.291371] IPv6: ADDRCONF(NETDEV_UP): veth7459a0ea: link is not ready
[ 2064.294355] IPv6: ADDRCONF(NETDEV_CHANGE): veth7459a0ea: link becomes ready
[ 2064.296993] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 2064.311961] device veth7459a0ea entered promiscuous mode
[ 2066.418585] SELinux: mount invalid. Same superblock, different security settings for (dev mqueue, type mqueue)
[ 2184.207518] rbd3: detected capacity change from 10737418240 to 21474836480 ------------------------------ [size change]
[ 2393.290915] device veth3fe217ac left promiscuous mode
[ 2393.367160] device veth15e4af39 left promiscuous mode
[ 2393.443839] device veth2e7d6841 left promiscuous mode
[ 2393.475271] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2393.503318] IPv6: ADDRCONF(NETDEV_UP): veth710f23a5: link is not ready
[ 2393.505853] IPv6: ADDRCONF(NETDEV_CHANGE): veth710f23a5: link becomes ready
[ 2393.508591] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 2393.511412] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[ 2393.528066] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 2393.534415] device veth561cfc77 left promiscuous mode
Possibilities:
--------------------------
1) The interesting thing here is that the CSI driver did not capture any error [1] or encounter any error [2] when it tried the resize. If we had enabled extra logging (-v=5 instead of -v=0) in this setup, it would have captured a couple of additional INFO messages in the code below, which could reconfirm which stages this process passed through; that is critical to understanding what is going on here. One thing we are sure of, though, is that there was no error from these functions.
----snip------------
func (resizefs *ResizeFs) Resize(devicePath string, deviceMountPath string) (bool, error) {
format, err := resizefs.mounter.GetDiskFormat(devicePath)
if err != nil {
formatErr := fmt.Errorf("ResizeFS.Resize - error checking format for device %s: %v", devicePath, err)
return false, formatErr
}
// If disk has no format, there is no need to resize the disk because mkfs.*
// by default will use whole disk anyways.
if format == "" {
return false, nil
}
klog.V(3).Infof("ResizeFS.Resize - Expanding mounted volume %s", devicePath)
switch format {
case "ext3", "ext4":
return resizefs.extResize(devicePath)
case "xfs":
return resizefs.xfsResize(deviceMountPath)
}
return false, fmt.Errorf("ResizeFS.Resize - resize of format %s is not supported for device %s mounted at %s", format, devicePath, deviceMountPath)
}
func (resizefs *ResizeFs) extResize(devicePath string) (bool, error) {
output, err := resizefs.mounter.Exec.Command("resize2fs", devicePath).CombinedOutput()
if err == nil {
klog.V(2).Infof("Device %s resized successfully", devicePath)
return true, nil
}
resizeError := fmt.Errorf("resize of device %s failed: %v. resize2fs output: %s", devicePath, err, string(output))
return false, resizeError
}
--------snip------------
2) Otherwise, we would have gone wrong in this logic, which tries to ***find out whether the volume is a BLOCK mount or an FS mount*** and returned success prematurely (this check is being removed in https://github.com/ceph/ceph-csi/pull/1386):
// volumePath is targetPath for block PVC and stagingPath for filesystem.
// check the path is mountpoint or not, if it is
// mountpoint treat this as block PVC or else it is filesystem PVC
// TODO remove this once ceph-csi supports CSI v1.2.0 spec
notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
if err != nil {
if os.IsNotExist(err) {
return nil, status.Error(codes.NotFound, err.Error())
}
return nil, status.Error(codes.Internal, err.Error())
}
if !notMnt {
return &csi.NodeExpandVolumeResponse{}, nil
}
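To make possibility (2) concrete, below is a minimal standalone sketch (an assumption for illustration, not ceph-csi code; the per-pod path is a placeholder) of the same mountpoint check. For a filesystem PVC the per-pod volume_path is itself a bind mount, so the check reports it as a mountpoint and the early return above fires without any resize being attempted.
----snip------------
// Standalone sketch: run the same "is this a mountpoint?" check the snippet
// above relies on. On an affected node the per-pod bind mount is a mountpoint,
// so a filesystem PVC looks exactly like a block PVC target to this check.
package main

import (
	"fmt"

	mount "k8s.io/mount-utils"
)

func main() {
	mounter := mount.New("")
	// Placeholder per-pod path, shaped like the downstream NodeExpandVolume request.
	volumePath := "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount"

	notMnt, err := mount.IsNotMountPoint(mounter, volumePath)
	if err != nil {
		fmt.Println("mountpoint check failed:", err)
		return
	}
	// Prints "false" for a bind-mounted filesystem PVC, which is what triggers
	// the early, resize-free return in the driver code above.
	fmt.Println("not a mountpoint:", notMnt)
}
--------snip------------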
3) It could be a bug in the latest kernel or ext4, or cri-o/SELinux could be denying this resize operation:
I don't see any suspicious process on this node, at least in my initial check. I noticed some AVC denials, but I am not sure whether they belong to this operation; that has to be clarified as well.
mkfs.ext4 version
mke2fs 1.45.4 (23-Sep-2019)
redhat release:
Red Hat Enterprise Linux release 8.2 (Ootpa)
Subject package: e2fsprogs-1.45.4-3.el8.x86_64
Selinux:
SELinux status: enabled
SELinuxfs mount: /sys/fs/selinux
SELinux root directory: /etc/selinux
Loaded policy name: targeted
Current mode: enforcing
Mode from config file: enforcing
Policy MLS status: enabled
Policy deny_unknown status: allowed
Memory protection checking: actual (secure)
Max kernel policy version: 32
Denials:
type=AVC msg=audit(1597734655.534:53): avc: granted { setsecparam } for pid=3388 comm="tuned" scontext=system_u:system_r:spc_t:s0 tcontext=system_u:object_r:security_t:s0 tclass=security
type=AVC msg=audit(1597735654.149:83): avc: denied { write } for pid=53413 comm="cephcsi" name="pids.max" dev="cgroup" ino=5861 scontext=system_u:system_r:container_t:s0:c97,c508 tcontext=system_u:object_r:cgroup_t:s0 tclass=file permissive=0
type=AVC msg=audit(1597735965.441:99): avc: denied { associate } for pid=1628 comm="crio" name="/" dev="ceph" ino=1099511627779 scontext=system_u:object_r:container_file_t:s0:c2,c23 tcontext=system_u:object_r:cephfs_t:s0 tclass=filesystem permissive=0
type=AVC msg=audit(1597736218.771:111): avc: denied { associate } for pid=1628 comm="crio" name="/" dev="ceph" ino=1099511627782 scontext=system_u:object_r:container_file_t:s0:c0,c25 tcontext=system_u:object_r:cephfs_t:s0 tclass=filesystem permissive=0
Processes:
root 74431 0.0 0.0 0 0 ? I< 07:31 0:00 [rbd]
root 74442 0.0 0.0 0 0 ? I< 07:31 0:00 [rbd0-tasks]
root 74484 0.0 0.0 0 0 ? S 07:31 0:00 [jbd2/rbd0-8]
root 77841 0.0 0.0 0 0 ? I< 07:31 0:00 [rbd1-tasks]
root 77850 0.0 0.0 0 0 ? S 07:31 0:00 [jbd2/rbd1-8]
root 79051 0.0 0.0 0 0 ? I< 07:32 0:00 [rbd2-tasks]
root 79061 0.0 0.0 0 0 ? S 07:32 0:00 [jbd2/rbd2-8]
root 141588 0.0 0.0 0 0 ? I< 07:44 0:00 [rbd3-tasks]
root 141598 0.0 0.0 0 0 ? S 07:44 0:00 [jbd2/rbd3-8]
Next Step:
I would like to reproduce this issue once again with extra debugging enabled.
The latest change in this code path on the ceph-csi side was https://github.com/ceph/ceph-csi/pull/1297, but it should not cause what we are seeing here; just mentioning it for the record.

While looking at the plugin logs I failed to notice any markers of a resize attempt in this cluster. It also looks like the cluster has been decommissioned.

Based on the analysis so far, here is the action plan I am coming up with:
We will validate the size of the mapped RBD device before we attempt the resize; we have the volume path and the new size in the request, which we can make use of.
So before getting into the code below, the validation mentioned above could happen:
...
resizer := resizefs.NewResizeFs(diskMounter)
ok, err := resizer.Resize(devicePath, volumePath)
if !ok {
return nil, fmt.Errorf("rbd: resize failed on path %s, error: %v", req.GetVolumePath(), err)
}
This will help determine whether the FS resize was a no-op or was attempted before the kernel detected the new size. It should either solve the issue or at least isolate it to a lower-level component.
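As a rough illustration of what that validation could look like (a sketch under assumptions: the helper names and the sysfs-based lookup are mine, not the planned patch), the mapped device's size could be compared with the requested bytes before resize2fs is attempted:
----snip------------
// Sketch only: compare the mapped RBD device size with the requested bytes
// from the NodeExpandVolume request, so a premature or no-op resize can be
// detected and reported instead of silently succeeding.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// blockDeviceBytes reads the device size from sysfs;
// /sys/class/block/<name>/size is reported in 512-byte sectors.
func blockDeviceBytes(devicePath string) (int64, error) {
	name := filepath.Base(devicePath) // e.g. "rbd3" for /dev/rbd3
	raw, err := os.ReadFile(filepath.Join("/sys/class/block", name, "size"))
	if err != nil {
		return 0, err
	}
	sectors, err := strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
	if err != nil {
		return 0, err
	}
	return sectors * 512, nil
}

// validateBeforeResize fails loudly if the kernel has not yet seen the new
// size, instead of letting the filesystem resize become a no-op.
func validateBeforeResize(devicePath string, requiredBytes int64) error {
	devBytes, err := blockDeviceBytes(devicePath)
	if err != nil {
		return fmt.Errorf("could not read size of %s: %v", devicePath, err)
	}
	if devBytes < requiredBytes {
		return fmt.Errorf("device %s is %d bytes, expected at least %d: kernel has not detected the new size yet",
			devicePath, devBytes, requiredBytes)
	}
	return nil
}

func main() {
	// Values taken from this report: /dev/rbd3 and the 20Gi required_bytes.
	if err := validateBeforeResize("/dev/rbd3", 21474836480); err != nil {
		fmt.Println(err)
	}
}
--------snip------------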
[Side note] One more thing we could do here is expand a PVC that is not attached to a pod and then attach it to a pod while the FS resize is still pending. This should also help isolate the issue, to an extent, between the FS resize operation and a size-reflection delay in the path.

I followed Humble's suggestion from https://bugzilla.redhat.com/show_bug.cgi?id=1868703#c16:

1. Create a PVC - size 1Gi (do not attach it to a pod)
2. Expand to a new size - 5Gi (won't be reflected because there is no pod in the picture)
3. Attach to a pod - the new size is reflected on both the PVC and the pod.

>>> Expand but don't attach a pod

[muagarwa@dhcp53-217 ocp]$ oc get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
mudit-claim   Bound    pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944   1Gi        RWO            ocs-storagecluster-ceph-rbd   113s

[muagarwa@dhcp53-217 ocp]$ oc get pvc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  spec:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi
    volumeMode: Filesystem
    volumeName: pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944
  status:
    accessModes:
    - ReadWriteOnce
    capacity:
      storage: 1Gi ------------------------------------------------------------------> Size not reflected because of pending FS resize
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-08-23T08:10:34Z"
      message: Waiting for user to (re-)start a pod to finish file system resize of volume on node.
      status: "True"
      type: FileSystemResizePending
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

>> After attaching a pod

[muagarwa@dhcp53-217 ocp]$ oc get pvc
NAME          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                  AGE
mudit-claim   Bound    pvc-fc98a069-7c71-4446-a3d1-ea74fedb0944   5Gi        RWO            ocs-storagecluster-ceph-rbd   9m11s

[muagarwa@dhcp53-217 ocp]$ oc exec -it mudit-app sh
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
/ # df -h /usr/share/busybox
Filesystem                Size      Used Available Use% Mounted on
/dev/rbd1                 4.9G     20.0M      4.8G   0% /usr/share/busybox
/ #

So this experiment kind of validates the theory in https://bugzilla.redhat.com/show_bug.cgi?id=1868703#c10. We still don't know what changed in OCP 4.6 that is causing this.

Thanks Mudit. Complete proof will come once we have a build per c#14, and hopefully that should help resolve the issue too!

There is a complete change in the way the NodeExpandVolume request is being called in upstream vs. downstream:
Upstream:
I0824 07:29:10.172338 22867 utils.go:159] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC call: /csi.v1.Node/NodeExpandVolume
I0824 07:29:10.173992 22867 utils.go:160] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_id":"0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"}
>> [1] volume_path:= "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"
Downstream:
I0824 09:24:45.735515 507082 utils.go:159] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC call: /csi.v1.Node/NodeExpandVolume
I0824 09:24:45.737321 507082 utils.go:160] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC request: {"capacity_range":{"required_bytes":2147483648},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229","volume_path":"/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"}
>> [2] volume_path = "/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"
Now, the getDevicePath() function always expects [1], but somehow it is passing with [2] downstream.
I tried replacing it with my upstream build and getDevicePath() started failing. Digging more.
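For illustration, here is a minimal sketch of the direction such a fix could take (the helper name and fallback behaviour are assumptions, not the actual ceph-csi change): prefer staging_target_path when the request carries it, since that path still points at .../globalmount as in [1], and fall back to volume_path otherwise.
----snip------------
// Sketch only: choose the path used to locate the backing device for resize.
// Newer kubelets send staging_target_path (ending in /globalmount); older
// callers send only volume_path, so fall back to it when staging is empty.
package main

import (
	"fmt"

	"github.com/container-storage-interface/spec/lib/go/csi"
)

func pathForDeviceLookup(req *csi.NodeExpandVolumeRequest) string {
	if staging := req.GetStagingTargetPath(); staging != "" {
		return staging
	}
	return req.GetVolumePath()
}

func main() {
	// Placeholder values shaped like the downstream request above.
	req := &csi.NodeExpandVolumeRequest{
		VolumePath:        "/var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pv-name>/mount",
		StagingTargetPath: "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/<pv-name>/globalmount",
	}
	fmt.Println(pathForDeviceLookup(req)) // prints the globalmount path
}
--------snip------------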
(In reply to Mudit Agarwal from comment #20)
> There is a complete change in the way NodeExpandVolume request is being called in Upstream and Downstream:
>
> Upstream:
>
> I0824 07:29:10.172338 22867 utils.go:159] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC call: /csi.v1.Node/NodeExpandVolume
> I0824 07:29:10.173992 22867 utils.go:160] ID: 15 Req-ID: 0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013 GRPC request: {"capacity_range":{"required_bytes":2147483648},"volume_id":"0001-0009-rook-ceph-0000000000000001-168e6382-e5db-11ea-9c7e-0242ac110013","volume_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"}
>
> >> [1] volume_path := "/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-53226d06-3de6-4998-bb10-ba6e6ea479f5/globalmount"
>
> Downstream:
>
> I0824 09:24:45.735515 507082 utils.go:159] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC call: /csi.v1.Node/NodeExpandVolume
> I0824 09:24:45.737321 507082 utils.go:160] ID: 66 Req-ID: 0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229 GRPC request: {"capacity_range":{"required_bytes":2147483648},"staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_id":"0001-0011-openshift-storage-0000000000000001-cce6ebe3-e5ea-11ea-a920-0a580a810229","volume_path":"/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"}
>
> >> [2] volume_path = "/var/lib/kubelet/pods/43de2997-5f4e-4fb3-a0e4-7e0d303c6c37/volumes/kubernetes.io~csi/pvc-3c6426fa-7476-478f-b685-f8b7831ce70b/mount"

Just to make it clear, there is a change in the NodeExpandVolume request; see also the commit in c#19. Previously the request carried only one path (unlike the logs in c#6 from the downstream setup), and IIUC that is what we were/are operating on in Ceph CSI upstream and in the OCS 4.5 build at the moment. But due to this change (Kubernetes 1.18+), in OCS 4.6 we get both staging_target_path and volume_path in the request. IIUC, because of this change our logic for deriving the device path and resizing becomes a no-op: the resize is either not attempted or becomes a no-op. With that theory in place, I assume the upstream test should also fail if we pull the latest Kubernetes into the CI and run our E2E.

RCA:

1. As Humble mentioned, because of https://github.com/kubernetes/kubernetes/commit/69613da0ae9743397ba7b2b3daebdc18d68e795e Kubernetes is sending both volume_path and staging_target_path.

2. Because of that, from OCP 4.6 onwards volume_path does not contain /globalmount, and the resize became a no-op because we are not able to get the device path information from the new volume_path.

Now, there were two questions:

1. Why is it passing in upstream?
>> Upstream is still using a Kubernetes version which doesn't have the above PR.

2. Why is getDevicePath() not giving any error?
>> We never called getDevicePath() because of this code block (which is removed by https://github.com/ceph/ceph-csi/pull/1386):

==================
notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
if err != nil {
    if os.IsNotExist(err) {
        return nil, status.Error(codes.NotFound, err.Error())
    }
    return nil, status.Error(codes.Internal, err.Error())
}
if !notMnt {
    return &csi.NodeExpandVolumeResponse{}, nil -----------------------------> We were always returning from here because of the wrong volumePath
}
==================

Now, I am raising a PR for this, but it has to wait until upstream moves to Kubernetes 1.18+.

Thanks Humble for all the help and the many offline discussions during the day.

(In reply to Mudit Agarwal from comment #22)
> RCA:
>
> 1. As Humble mentioned, because of https://github.com/kubernetes/kubernetes/commit/69613da0ae9743397ba7b2b3daebdc18d68e795e Kubernetes is sending both volume_path and staging_target_path.
>
> 2. Because of that, from OCP 4.6 onwards volume_path does not contain /globalmount, and the resize became a no-op because we are not able to get the device path information from the new volume_path.
>
> Now, there were two questions:
>
> 1. Why is it passing in upstream?
> >> Upstream is still using a Kubernetes version which doesn't have the above PR.
>
> 2. Why is getDevicePath() not giving any error?
> >> We never called getDevicePath() because of this code block (which is removed by https://github.com/ceph/ceph-csi/pull/1386)
>
> ==================
> notMnt, err := mount.IsNotMountPoint(ns.mounter, volumePath)
> if err != nil {
>     if os.IsNotExist(err) {
>         return nil, status.Error(codes.NotFound, err.Error())
>     }
>     return nil, status.Error(codes.Internal, err.Error())
> }
> if !notMnt {
>     return &csi.NodeExpandVolumeResponse{}, nil -----------------------------> We were always returning from here because of the wrong volumePath
> }
> ==================

Yeah, possibility (2) in c#6 :)

> Now, I am raising a PR for this but it has to wait till upstream goes to
> Kube v18+

That's great. We will be able to reconfirm this once we have upgraded Ceph CSI to 1.19.

> Thanks Humble for all the help and lots of offline discussions during the
> day.

Thanks to you, Mudit, for sitting on top of this and doing the experiments! It's not easy :)

The release branch 4.6 has the fix now, moving to MODIFIED.

Considering the CI passes now on OCS 4.6 builds, I am moving this to ON_QA.

Mail in rhocs-eng with the below subject: ocs-ci results for OCS4-6-Downstream-OCP4-6-AWS-IPI-1AZ-RHCOS-3M-3W-acceptance (BUILD ID: v4.6.0-75.ci RUN ID: 1599531194)

I re-tested with ocs-operator.v4.6.0-90.ci; tests/manage/pv_services/pvc_resize/test_pvc_expansion.py passed without any issues.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat OpenShift Container Storage 4.6.0 security, bug fix, enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5605

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days