Description of problem (please be as detailed as possible and provide log snippets):

While creating an app pod using a PVC restored from a volume snapshot, the pod remains in Pending state due to the below error:

  Warning  FailedMount  4s (x8 over 68s)  kubelet  MountVolume.MountDevice failed for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a" : rpc error: code = Internal desc = image not found: RBD image not found

The snapshot was restored after deleting the parent PVC. Another test case, which restores the snapshot without deleting the parent PVC, succeeded in consuming the restored PVC.

oc describe output of pod:

Name:         pod-test-rbd-3d6799693e2d4ab2acc098e0844
Namespace:    namespace-test-051dbfaf22db424b8a7df29f8
Priority:     0
Node:         ip-10-0-89-171.us-east-2.compute.internal/10.0.89.171
Start Time:   Fri, 30 Apr 2021 23:21:44 +0000
Labels:       <none>
Annotations:  openshift.io/scc: anyuid
Status:       Pending
IP:
IPs:          <none>
Containers:
  web-server:
    Container ID:
    Image:          quay.io/ocsci/nginx:latest
    Image ID:
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /var/lib/www/html from mypvc (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-lmwh5 (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  mypvc:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  restore-pvc-test-11d2071e9e6440-702d0295
    ReadOnly:   false
  default-token-lmwh5:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-lmwh5
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason                  Age               From                     Message
  ----     ------                  ----              ----                     -------
  Normal   Scheduled               75s               default-scheduler        Successfully assigned namespace-test-051dbfaf22db424b8a7df29f8/pod-test-rbd-3d6799693e2d4ab2acc098e0844 to ip-10-0-89-171.us-east-2.compute.internal
  Normal   SuccessfulAttachVolume  76s               attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a"
  Warning  FailedMount             4s (x8 over 68s)  kubelet                  MountVolume.MountDevice failed for volume "pvc-215ee066-b305-4c33-af22-2f8b393d6e7a" : rpc error: code = Internal desc = image not found: RBD image not found

oc describe output of PV:

Name:            pvc-215ee066-b305-4c33-af22-2f8b393d6e7a
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: openshift-storage.rbd.csi.ceph.com
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    ocs-storagecluster-ceph-rbd
Status:          Bound
Claim:           namespace-test-051dbfaf22db424b8a7df29f8/restore-pvc-test-11d2071e9e6440-702d0295
Reclaim Policy:  Delete
Access Modes:    RWO
VolumeMode:      Filesystem
Capacity:        10Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            openshift-storage.rbd.csi.ceph.com
    FSType:            ext4
    VolumeHandle:      0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411
    ReadOnly:          false
    VolumeAttributes:  clusterID=openshift-storage
                       csi.storage.k8s.io/pv/name=pvc-215ee066-b305-4c33-af22-2f8b393d6e7a
                       csi.storage.k8s.io/pvc/name=restore-pvc-test-11d2071e9e6440-702d0295
                       csi.storage.k8s.io/pvc/namespace=namespace-test-051dbfaf22db424b8a7df29f8
                       imageFeatures=layering
                       imageFormat=2
                       imageName=csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411
                       journalPool=ocs-storagecluster-cephblockpool
                       pool=ocs-storagecluster-cephblockpool
                       radosNamespace=
                       storage.kubernetes.io/csiProvisionerIdentity=1619799365086-8081-openshift-storage.rbd.csi.ceph.com
Events:            <none>

Image details from must-gather:

rbd image 'csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411':
        size 10 GiB in 2560 objects
        order 22 (4 MiB objects)
        snapshot_count: 0
        id: 5e6410ea0866
        block_name_prefix: rbd_data.5e6410ea0866
        format: 2
        features: layering, operations
        op_features: clone-child
        flags:
        create_timestamp: Fri Apr 30 23:18:56 2021
        access_timestamp: Fri Apr 30 23:18:56 2021
        modify_timestamp: Fri Apr 30 23:18:56 2021
        parent: ocs-storagecluster-cephblockpool/csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411@1bb7d6e3-7f82-4932-be53-540c7d963c78 (trash 5e64bc87d124)
        overlap: 10 GiB

Error logs from csi-rbdplugin (http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/failed_testcase_ocs_logs_1619801927/test_snapshot_at_different_usage_level_ocs_logs/ocs_must_gather/quay-io-rhceph-dev-ocs-must-gather-sha256-0e929cb3857e60e2f154be3ce2f4a2aa2924b2e660e5cf96b9a5f64897a0d072/namespaces/openshift-storage/pods/csi-rbdplugin-5mm2x/csi-rbdplugin/csi-rbdplugin/logs/current.log):

2021-04-30T23:21:52.534033525Z I0430 23:21:52.534001 24684 utils.go:132] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC call: /csi.v1.Node/NodeStageVolume
2021-04-30T23:21:52.534139593Z I0430 23:21:52.534114 24684 utils.go:133] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC request: {"secrets":"***stripped***","staging_target_path":"/var/lib/kubelet/plugins/kubernetes.io/csi/pv/pvc-215ee066-b305-4c33-af22-2f8b393d6e7a/globalmount","volume_capability":{"AccessType":{"Mount":{"fs_type":"ext4"}},"access_mode":{"mode":1}},"volume_context":{"clusterID":"openshift-storage","csi.storage.k8s.io/pv/name":"pvc-215ee066-b305-4c33-af22-2f8b393d6e7a","csi.storage.k8s.io/pvc/name":"restore-pvc-test-11d2071e9e6440-702d0295","csi.storage.k8s.io/pvc/namespace":"namespace-test-051dbfaf22db424b8a7df29f8","imageFeatures":"layering","imageFormat":"2","imageName":"csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411","journalPool":"ocs-storagecluster-cephblockpool","pool":"ocs-storagecluster-cephblockpool","radosNamespace":"","storage.kubernetes.io/csiProvisionerIdentity":"1619799365086-8081-openshift-storage.rbd.csi.ceph.com"},"volume_id":"0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411"}
2021-04-30T23:21:52.534270944Z I0430 23:21:52.534257 24684 rbd_util.go:814] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 setting disableInUseChecks on rbd volume to: false
2021-04-30T23:21:52.535308640Z I0430 23:21:52.535278 24684 omap.go:84] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.6fc62615-aa0a-11eb-b36c-0a580a800411"): map[csi.imageid:5e6410ea0866 csi.imagename:csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411 csi.volname:pvc-215ee066-b305-4c33-af22-2f8b393d6e7a csi.volume.owner:namespace-test-051dbfaf22db424b8a7df29f8]
2021-04-30T23:21:52.535437394Z E0430 23:21:52.535423 24684 util.go:232] kernel 3.10.0-1160.25.1.el7.x86_64 does not support required features
2021-04-30T23:21:52.581603843Z E0430 23:21:52.581535 24684 rbd_util.go:589] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 failed to get image info for {%!s(*[]util.TopologyConstrainedPool=<nil>) <nil> map[] csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411 172.30.41.228:6789,172.30.166.153:6789,172.30.55.186:6789 ocs-storagecluster-cephblockpool csi-snap-bec4c10d-aa09-11eb-b36c-0a580a800411 %!s(rbd.FeatureSet=257) %!s(int64=10737418240) %!s(bool=false) %!s(bool=false) %!s(bool=false) %!s(bool=false) <nil> seconds:1619824736 nanos:697795370 %!s(*util.ClusterConnection=&{0xc0001a71e0 0xc000111620}) %!s(*rados.IOContext=&{0x7f0490078b20})}: image not found: RBD image not found
2021-04-30T23:21:52.581713311Z E0430 23:21:52.581689 24684 utils.go:136] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 GRPC error: rpc error: code = Internal desc = image not found: RBD image not found

Test case: tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level

must-gather logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/failed_testcase_ocs_logs_1619801927/test_snapshot_at_different_usage_level_ocs_logs/

ocs-ci test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j011au3r33-t1n/j011au3r33-t1n_20210430T150442/logs/ocs-ci-logs-1619801927/tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py/TestSnapshotAtDifferentPvcUsageLevel/test_snapshot_at_different_usage_level/logs

Test run: https://ocs4-jenkins-csb-ocsqe.apps.ocp4.prod.psi.redhat.com/job/qe-deploy-ocs-cluster-prod/665/

Version of all relevant components (if applicable):
OCP 4.7.0-0.nightly-2021-04-30-110828
OCS 4.7.0-364.ci
Ceph Version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
rook_csi_ceph cephcsi@sha256:bbfe631e0e6d0e5ca98423eac3c16abaac1c781ea1139741ffa374a638d44e9c

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?
Yes. A PVC restored from a volume snapshot after deleting the parent PVC is not usable.

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?
2

Can this issue be reproduced?
Yes

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create an RBD PVC named "pvc-rbd" and attach it to an app pod.
2. Run I/O.
3. Create a snapshot from pvc-rbd. Ensure the snapshot is Ready.
4. Delete the PVC pvc-rbd.
5. Restore the snapshot and verify the restored PVC is Bound.
6. Attach the restored PVC to an app pod.

OR run the ocs-ci test case: tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level

Actual results:
The app pod did not reach Running state due to the FailedMount error.

Expected results:
The mount should succeed and the app pod should reach Running state.

Additional info:
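For reference, the clone relationship behind the "image not found" error can be inspected from the rook-ceph toolbox pod. A minimal sketch, assuming the toolbox is deployed, using the pool/image names from the must-gather output above:

TOOLS_POD=$(oc -n openshift-storage get pod -l app=rook-ceph-tools -o name)

# The restored image is a clone; its 'parent:' line points at a snapshot
# of the deleted (trashed) parent image:
oc -n openshift-storage exec $TOOLS_POD -- rbd info ocs-storagecluster-cephblockpool/csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411

# The deleted parent image shows up in the pool's RBD trash:
oc -n openshift-storage exec $TOOLS_POD -- rbd trash ls ocs-storagecluster-cephblockpool

This matches the "parent: ... (trash 5e64bc87d124)" line in the must-gather output: the restored image still references a snapshot of a trashed parent, which appears to be the situation the old kernel client cannot handle (see the comments below).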
As mentioned by Madhu, this will fail because of no kernel support.

Questions for QE:

1. This should exist in 4.6 too, have we tested?
2. Do we want to document it, do we know if customers have this kernel version?
> 2021-04-30T23:21:52.534270944Z I0430 23:21:52.534257 24684 rbd_util.go:814] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 setting disableInUseChecks on rbd volume to: false
> 2021-04-30T23:21:52.535308640Z I0430 23:21:52.535278 24684 omap.go:84] ID: 2896 Req-ID: 0001-0011-openshift-storage-0000000000000001-6fc62615-aa0a-11eb-b36c-0a580a800411 got omap values: (pool="ocs-storagecluster-cephblockpool", namespace="", name="csi.volume.6fc62615-aa0a-11eb-b36c-0a580a800411"): map[csi.imageid:5e6410ea0866 csi.imagename:csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411 csi.volname:pvc-215ee066-b305-4c33-af22-2f8b393d6e7a csi.volume.owner:namespace-test-051dbfaf22db424b8a7df29f8]
> 2021-04-30T23:21:52.535437394Z E0430 23:21:52.535423 24684 util.go:232] kernel 3.10.0-1160.25.1.el7.x86_64 does not support required features

Just to mention, the missing flatten support in kernel 3.10.0-1160.25.1.el7.x86_64, called out in the log above, is what causes this on this setup.
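For illustration only, not a supported workaround: flattening the restored image detaches it from the trashed parent, after which a kernel without the newer clone features can map it. This is, in essence, what the eventual fix automates. A sketch run from the rook-ceph toolbox, using the image name from the logs above:

# Before flattening, 'rbd info' shows a 'parent:' line referencing the trashed image:
rbd info ocs-storagecluster-cephblockpool/csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411

# 'rbd flatten' copies the parent's data into the clone and removes the
# parent link, making the image self-contained:
rbd flatten ocs-storagecluster-cephblockpool/csi-vol-6fc62615-aa0a-11eb-b36c-0a580a800411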
(In reply to Mudit Agarwal from comment #4)
> As mentioned by Madhu, this will fail because of no kernel support.
>
> Questions for QE:
>
> 1. This should exist in 4.6 too, have we tested?

I found a single run done as part of OCS 4.6 testing over RHEL7 nodes - http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/j007icm1r3-t1/j007icm1r3-t1_20210218T132305/logs/

Jilju - I see that the same test case failed there too. Is it for the same reason?
(In reply to Mudit Agarwal from comment #4)
> As mentioned by Madhu, this will fail because of no kernel support.
>
> Questions for QE:
>
> 1. This should exist in 4.6 too, have we tested?
> 2. Do we want to document it, do we know if customers have this kernel version?

There is a failure instance of this test case in OCS 4.6 on a VSPHERE-UPI/RHEL configuration. Due to some other issues, must-gather logs were not collected after the failure.

Run: ocs-ci results for OCS4-6-Downstream-OCP4-6-VSPHERE-UPI-ENCRYPTION-1AZ-RHEL-VSAN-3M-3W-tier1 (BUILD ID: v4.6.0-149.ci RUN ID: 1604372378)
Elad,

>> Do we want to document it, do we know if customers have this kernel version?

Emphasizing this because if we have customers on this kernel version then this becomes a serious issue, and I want it to be backported to 4.6.z/4.7.z.
Madhu, can we add doc text for this? We might want to tell users that they should not delete the parent PVC (in case the reclaim policy is not 'Retain').
Updating the Steps to Reproduce:

1. Create an RBD PVC named "pvc-rbd" and attach it to an app pod.
2. Run I/O.
3. Create a snapshot from pvc-rbd. Ensure the snapshot is Ready.
4. Delete the PVC pvc-rbd.
5. Restore the snapshot and verify the restored PVC is Bound.
6. Delete the volume snapshot.
7. Attach the restored PVC to an app pod.

The test case deletes the snapshot before attaching the restored PVC to the app pod; a condensed shell sketch of these steps follows below.
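A condensed shell sketch of the steps above, with hypothetical object names (pvc-rbd, snap-rbd, restore-pvc-rbd); the actual test drives this through ocs-ci fixtures, and the API versions shown are those served on OCP 4.7:

# Steps 1-2 (create the RBD PVC "pvc-rbd", attach it to a pod, run I/O) omitted.

# 3. Snapshot the PVC, then poll until readyToUse reports "true":
cat <<EOF | oc apply -f -
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: snap-rbd
spec:
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
  source:
    persistentVolumeClaimName: pvc-rbd
EOF
oc get volumesnapshot snap-rbd -o jsonpath='{.status.readyToUse}'

# 4. Delete the parent PVC:
oc delete pvc pvc-rbd

# 5. Restore the snapshot into a new PVC and verify it becomes Bound:
cat <<EOF | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restore-pvc-rbd
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  dataSource:
    name: snap-rbd
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
EOF

# 6. Delete the volume snapshot:
oc delete volumesnapshot snap-rbd

# 7. Attach restore-pvc-rbd to an app pod; the pod then hits the
#    FailedMount / "image not found" error described above.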
(In reply to Madhu Rajanna from comment #20)
> Steps we need to verify:
>
> Create PVC
> Create snapshot
> Create restore PVC from snapshot
> Delete snapshot (don't delete parent PVC)
> Mount restore PVC to app pod (mounting will fail)

Verified the above on a live cluster with a CSI driver image containing the upstream patch (https://github.com/ceph/ceph-csi/pull/2045). The patch works. It results in "rpc error: code = Internal desc = flatten in progress" while the flatten is in progress, but the mount eventually succeeds.

> Create PVC
> Create snapshot
> Delete parent PVC
> Create restore PVC from snapshot
> Mount restore PVC to app pod (mounting should be successful)

This works as expected too.

> Create PVC
> Create PVC-PVC clone
> Delete parent PVC
> Mount PVC to app pod (mounting should be successful) --> This is already verified at #17
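For anyone reproducing the first scenario, the background flatten that produces the transient "flatten in progress" error can be observed from the rook-ceph toolbox. A sketch, assuming Ceph Nautilus or later (where the rbd task framework exists); the restored image name is hypothetical:

# The flatten is scheduled as a cluster-side background RBD task:
ceph rbd task list

# Once the flatten completes, the image no longer reports a 'parent:' line
# and the next NodeStageVolume retry succeeds:
rbd info ocs-storagecluster-cephblockpool/csi-vol-<restored-image-uuid> | grep parent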
Madhu,

In the doc, is it possible to mention a specific RHEL release or kernel version below which this error can occur? From comment #9, this is applicable to RHEL 8 as well.
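Related to the doc question: the kernel version of every node can be read directly from the node objects, which may help when checking whether a given cluster is exposed. A sketch:

# Print each node's kernel version as reported by the kubelet:
oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion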
Looks good to me. Thanks Jilju/Madhu/Rakshith
Doc text needs to be changed as we have fixed this issue now.
Verified using the test case tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py::TestSnapshotAtDifferentPvcUsageLevel::test_snapshot_at_different_usage_level. The test case deletes the parent PVC and the snapshot before attaching the restored RBD PVC to the pod.

Test case logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/jijoy-jun3/jijoy-jun3_20210603T181951/logs/ocs-ci-logs-1622752627/tests/manage/pv_services/pvc_snapshot/test_snapshot_at_different_pvc_utlilization_level.py/TestSnapshotAtDifferentPvcUsageLevel/test_snapshot_at_different_usage_level/logs

Worker nodes kernel version: 3.10.0-1160.25.1.el7.x86_64

Verified in version:
OCS operator v4.8.0-407.ci
Cluster Version 4.8.0-0.nightly-2021-06-03-101158
Ceph Version 14.2.11-147.el8cp (1f54d52f20d93c1b91f1ec6af4c67a4b81402800) nautilus (stable)
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenShift Container Storage 4.8.0 container images bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3003