Bug 1956232 - [RHEL7][RBD] FailedMount error when using restored PVC on app pod
Summary: [RHEL7][RBD] FailedMount error when using restored PVC on app pod
Keywords:
Status: POST
Alias: None
Product: Red Hat OpenShift Container Storage
Classification: Red Hat
Component: csi-driver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: ---
Assignee: Rakshith
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-05-03 08:57 UTC by Jilju Joy
Modified: 2021-05-10 11:24 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: Known Issue
Doc Text:
.Newly restored PVC can not be mounted Newly restored PVC can not be mounted, if some of the OCP nodes are running on a Red Hat Enterprise Linux version of less than 8.2 and the snapshot from which it was restored is deleted. To avoid this issue, do not delete the snapshot from which the PVC was restored until the restored PVC is deleted.
Clone Of:
Environment:
Last Closed:
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github ceph ceph-csi pull 2045 0 None open Rbd: Fix image flattening when snapshot and parent PVC is deleted 2021-05-07 05:48:01 UTC

Description Jilju Joy 2021-05-03 08:57:56 UTC
Description of problem (please be detailed as possible and provide log
snippests):
While creating an app pod using a PVC which is restored from a volume snapshot, the pod is remaining in Pending state due to the below error:

Warning  FailedMount             4s (x8 over 68s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a" : rpc error: code = Internal desc = image not found: RBD image not found

The snapshot was restored after deleting the parent PVC. Another test case which is restoring snapshot without deleting the parent PVC succeeded in consuming the restored PVC.

oc describe output of pod:

Name:         pod-test-rbd-3d6799693e2d4ab2acc098e0844
Namespace:    namespace-test-051dbfaf22db424b8a7df29f8
Priority:     0
Node:         ip-10-0-89-171.us-east-2.compute.internal/10.0.89.171
Start Time:   Fri, 30 Apr 2021 23:21:44 +0000
Labels:       <none>
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:           
IPs:          <none>
Containers:
  web-server:
    Container ID:   
    Image:          quay.io/ocsci/nginx:latest
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lmwh5 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  restore-pvc-test-11d2071e9e6440-702d0295
    ReadOnly:   false
  default-token-lmwh5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lmwh5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               75s               default-scheduler        Successfully assigned namespace-test-051dbfaf22db424b8a7df29f8/pod-test-rbd-3d6799693e2d4ab2acc098e0844 to ip-10-0-89-171.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  76s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a"
  Warning  FailedMount             4s (x8 over 68s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a" : rpc error: code = Internal desc = image not found: RBD image not found


oc describe output of PV:

Name:            pvc-215ee066-b305-4c33-af22-2f8b393d6e7a
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Bound
Claim:           namespace-test-051dbfaf22db424b8a7df29f8/restore-pvc-test-11d2071e9e6440-702d0295
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:   <none>
Message:         
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411
    ReadOnly:          false
    VolumeAttributes:      clusterID=openshift-storage
                           csi.storage.k8s.io/pv/name=pvc-215ee066-b305-4c33-af22-2f8b393d6e7a
                           csi.storage.k8s.io/pvc/name=restore-pvc-test-11d2071e9e6440-702d0295
                           csi.storage.k8s.io/pvc/namespace=namespace-test-051dbfaf22db424b8a7df29f8
                           imageFeatures=layering
                           imageFormat=2
                           imageName=csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411
                           journalPool=ocs-storagecluster-cephblockpool
                           pool=ocs-storagecluster-cephblockpool
                           radosNamespace=
                           storage.kubernetes.io/csiProvisionerIdentity=1619799365086-8081-openshift-storage.rbd.csi.ceph.com
Events:                <none>


Image details from must-gather:

rbd image 'csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411':
	size 10 GiB in 2560 objects
	order 22 (4 MiB objects)
	snapshot_count: 0
	id: 5e6410ea0866
	block_name_prefix: rbd_data.5e6410ea0866
	format: 2
	features: layering, operations
	op_features: clone-child
	flags: 
	create_timestamp: Fri Apr 30 23:18:56 2021
	access_timestamp: Fri Apr 30 23:18:56 2021
	modify_timestamp: Fri Apr 30 23:18:56 2021
	parent: ocs-storagecluster-cephblockpool/csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411@1bb7d6e3-7f82-4932-be53-540c7d963c78 (trash 5e64bc87d124)
	overlap: 10 GiB


Error logs from csi-rbdplugin (http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/failed_testcase_ocs_logs_1619801927/test_snapshot_at_different_usage_level_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-0e929cb3857e60e2f154be3ce2f4a2aa2924b2e660e5cf96b9a5f64897a0d072/namespaces/openshift-storage/pods/csi-rbdplugin-5mm2x/csi-rbdplugin/csi-rbdplugin/logs/current.log): 

2021-04-30T23:21:52.534033525Z I0430 23:21:52.534001   24684 utils.go:132] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC call: /csi.v1.Node/NodeStageVolume
2021-04-30T23:21:52.534139593Z I0430 23:21:52.534114   24684 utils.go:133] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-215ee066-b305-4c33-af22-2f8b393d6e7a/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-215ee066-b305-4c33-af22-2f8b393d6e7a","csi.storage.k8s.io/pvc/name":"restore-pvc-test-11d2071e9e6440-702d0295","csi.storage.k8s.io/pvc/namespace":"namespace-test-051dbfaf22db424b8a7df29f8","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411","journalPool":"ocs-storagecluster-cephblockpool","pool":"ocs-storagecluster-cephblockpool","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1619799365086-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411"}
2021-04-30T23:21:52.534270944Z I0430 23:21:52.534257   24684 rbd_util.go:814] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 setting disableInUseChecks on rbd volume to: false
2021-04-30T23:21:52.535308640Z I0430 23:21:52.535278   24684 omap.go:84] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.6fc62615-aa0a-11eb-b36c-0a580a800411"): map[csi.imageid:5e6410ea0866 csi.imagename:csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411 csi.volname:pvc-215ee066-b305-4c33-af22-2f8b393d6e7a csi.volume.owner:namespace-test-051dbfaf22db424b8a7df29f8]
2021-04-30T23:21:52.535437394Z E0430 23:21:52.535423   24684 util.go:232] kernel 3.10.0-1160.25.1.el7.x86_64 does not support required features
2021-04-30T23:21:52.581603843Z E0430 23:21:52.581535   24684 rbd_util.go:589] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 failed to get image info for {%!s(*[]util.TopologyConstrainedPool=<nil>) <nil> map[] csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411   172.30.41.228:6789,172.30.166.153:6789,172.30.55.186:6789  ocs-storagecluster-cephblockpool    csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411 %!s(rbd.FeatureSet=257)           %!s(int64=10737418240) %!s(bool=false) %!s(bool=false) %!s(bool=false) %!s(bool=false) <nil>  seconds:1619824736 nanos:697795370 %!s(*util.ClusterConnection=&{0xc0001a71e0 0xc000111620}) %!s(*rados.IOContext=&{0x7f0490078b20})}: image not found: RBD image not found
2021-04-30T23:21:52.581713311Z E0430 23:21:52.581689   24684 utils.go:136] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC error: rpc error: code = Internal desc = image not found: RBD image not found


Test case: tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level


must-gather logs : http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/failed_testcase_ocs_logs_1619801927/test_snapshot_at_different_usage_level_ocs_logs/

ocs-ci test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/ocs-ci-logs-1619801927/tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py/TestSnapshotAtDifferentPvcUsageLevel/test_snapshot_at_different_usage_level/logs


Test run: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/665/


Version of all relevant components (if applicable):
OCP 4.7.0-0.nightly-2021-04-30-110828
OCS 4.7.0-364.ci
Ceph Version	14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
rook_csi_ceph	cephcsi@sha256:bbfe631e0e6d0e5ca98423eac3c16abaac1c781ea1139741ffa374a638d44e9c


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
Yes. The PVC which is restored from a volume snapshot after deleting the parent PVC is not usable.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?
2

Can this issue reproducible?
Yes


Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create an RBD PVC named "pvc-rbd" and attach it to an app pod.
2. Run I/O
3. Create snapshot from pvc-rbd. Ensure snapshot is Ready.
4. Delete the PVC pvc-rbd
5. Restore snapshot and verify the restored PVC is Bound.
6. Attach the restored PVC to an app pod.

OR run ocs-ci test case :
tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level

Actual results:
App pod did not reach Running state due to FailedMount error.

Expected results:
Mount should succeed and app pod should be Running. 

Additional info:

Comment 4 Mudit Agarwal 2021-05-03 10:21:07 UTC
As mentioned by Madhu, this will fail because of no kernel support.

Questions for QE:

1. This should exist in 4.6 too, have we tested?
2. Do we want to document it, do we know if customers have this kernel version?

Comment 5 Humble Chirammal 2021-05-03 10:27:09 UTC
>2021-04-30T23:21:52.534270944Z I0430 23:21:52.534257   24684 rbd_util.go:814] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 setting disableInUseChecks on rbd volume to: false
2021-04-30T23:21:52.535308640Z I0430 23:21:52.535278   24684 omap.go:84] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.6fc62615-aa0a-11eb-b36c-0a580a800411"): map[csi.imageid:5e6410ea0866 csi.imagename:csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411 csi.volname:pvc-215ee066-b305-4c33-af22-2f8b393d6e7a csi.volume.owner:namespace-test-051dbfaf22db424b8a7df29f8]
2021-04-30T23:21:52.535437394Z E0430 23:21:52.535423   24684 util.go:232] kernel 3.10.0-1160.25.1.el7.x86_64 does not support required features
>

Just to mention, missing flatten supported as mentioned above in kernel ' kernel 3.10.0-1160.25.1.el7.x86_64'  in this setup causes it.

Comment 7 Elad 2021-05-03 10:35:19 UTC
(In reply to Mudit Agarwal from comment #4)
> As mentioned by Madhu, this will fail because of no kernel support.
> 
> Questions for QE:
> 
> 1. This should exist in 4.6 too, have we tested?
I found a single run done as part of OCS4.6 testing over RHEL7 nodes - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j007icm1r3-t1/j007icm1r3-t1_20210218T132305/logs/
Jilju - I see the same test case failed there too, is it from the same reason?

Comment 8 Jilju Joy 2021-05-03 10:38:54 UTC
(In reply to Mudit Agarwal from comment #4)
> As mentioned by Madhu, this will fail because of no kernel support.
> 
> Questions for QE:
> 
> 1. This should exist in 4.6 too, have we tested?
> 2. Do we want to document it, do we know if customers have this kernel
> version?

There is a failure instance of this test case in OCS 4.6 on VSPHERE-UPI/RHEL configuration.
Do to some other issues, must-gather logs was not collected after the failure.

Run: ocs-ci results for OCS4-6-Downstream-OCP4-6-VSPHERE-UPI-ENCRYPTION-1AZ-RHEL-VSAN-3M-3W-tier1 (BUILD ID: v4.6.0-149.ci RUN ID: 1604372378)

Comment 10 Mudit Agarwal 2021-05-03 10:53:05 UTC
Elad,

>> Do we want to document it, do we know if customers have this kernel version?

Emphasizing because if we have customers then this becomes a serious issue and I want it to be backported to 4.6.z/4.7.z

Comment 13 Mudit Agarwal 2021-05-04 06:49:17 UTC
Madhu, can we add doc text for this? We might want to tell them that they should not delete the parent pvc (in case the policy is not 'retain')

Comment 18 Jilju Joy 2021-05-05 12:12:12 UTC
Updating the 
Steps to Reproduce:
1. Create an RBD PVC named "pvc-rbd" and attach it to an app pod.
2. Run I/O
3. Create snapshot from pvc-rbd. Ensure snapshot is Ready.
4. Delete the PVC pvc-rbd
5. Restore snapshot and verify the restored PVC is Bound.
6. Delete the volume snapshot.
7. Attach the restored PVC to an app pod.
The test case deletes snapshot before attaching the restored PVC to app pod.

Comment 23 Rakshith 2021-05-05 13:23:41 UTC
(In reply to Madhu Rajanna from comment #20)
> steps we need to verify
> 
> Create PVC
> Create snapshot
> Create Restore PVC from snapshot
> Delete Snapshot (don't Delete parent PVC)
> Mount restore PVC to app pod (mounting will fail)

Verified the above with live cluster and csi driver image with the upstream patch( https://github.com/ceph/ceph-csi/pull/2045 ).
The patch works. It will result rpc error: code = Internal desc = flatten in progress: flatten is in progress but will eventually succeed. 

> 
> Create PVC
> Create snapshot
> Delete parent PVC
> Create Restore PVC from snapshot
> Mount restore PVC to app pod (Mounting should be  successful)

This works as expected too.

> Create PVC
> Create PVC-PVC clone
> Delete parent PVC
> Mount PVC to app pod (Mounting should be  successful)  --> This is already
> verified at #17

Comment 24 Jilju Joy 2021-05-05 13:32:37 UTC
Madhu,

In the doc, is it possible to mention a specific RHEL release or kernel version below which this error can occur ?
From the comment #9 , this is applicable for RHEL8 also.

Comment 26 Mudit Agarwal 2021-05-05 14:18:47 UTC
Looks good to me.

Thanks Jilju/Madhu/Rakshith


Note You need to log in before you can comment on or make changes to this bug.