Bug 2227066 - Recreation of the boot source images as cached snapshots may have issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.14.0
Assignee: Alex Kalenyuk
QA Contact: Harel Meir
URL:
Whiteboard:
Duplicates: 2228606
Depends On:
Blocks:
Reported: 2023-07-27 16:47 UTC by Jenia Peimer
Modified: 2023-11-08 14:06 UTC (History)

Fixed In Version: CNV v4.14.0.rhel9-1854
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-11-08 14:06:16 UTC
Target Upstream Version:
Embargoed:



Links
System                  ID                                              Private  Priority  Status  Summary                                                           Last Updated
Github                  kubevirt containerized-data-importer pull 2837  0        None      Merged  Avoid creating snapshot of old storage class DataImportCron PVCs  2023-09-04 13:41:15 UTC
Red Hat Issue Tracker   CNV-31467                                       0        None      None    None                                                              2023-07-27 16:48:10 UTC
Red Hat Product Errata  RHSA-2023:6817                                  0        None      None    None                                                              2023-11-08 14:06:32 UTC

Description Jenia Peimer 2023-07-27 16:47:46 UTC
Description of problem:
If the default storage class does not support snapshots,
the boot source images created by the DataImportCron in the openshift-virtualization-os-images namespace are imported as DVs/PVCs.

When you switch the default storage class to OCS, you can re-import the images by deleting the old DVs. Each DV/PVC will be re-imported, a VolumeSnapshot object will be created, and the DV/PVC will be removed automatically.

Alex (akalenyu) looked at it and sees 2 issues:

Issue 1: Snapshots are being created from PVCs on the previous storage class (when changing the SC from HPP to OCS)

Issue 2: When deleting the old storage class DVs, there may be a race where the snapshot gets created but the DV is not recreated
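
A minimal sketch of how Issue 1 could be spotted from the command line: it lists the PVCs in a namespace whose storage class differs from the current default. The function names are hypothetical, and the sketch assumes that `oc get storageclass` marks the default class with "(default)" in its second column.

```shell
# Print the name of the StorageClass currently marked "(default)".
# Assumes the "NAME (default)" layout of `oc get storageclass`.
default_storage_class() {
  oc get storageclass --no-headers | awk '$2 == "(default)" { print $1; exit }'
}

# Print the PVCs in namespace $1 whose storage class is not the default one.
pvcs_not_on_default_class() {
  ns="$1"
  default_sc="$(default_storage_class)"
  oc get pvc -n "$ns" --no-headers \
    -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName |
    awk -v sc="$default_sc" '$2 != sc { print $1 }'
}

# Usage (not run here):
#   pvcs_not_on_default_class openshift-virtualization-os-images
```

Any PVC printed by the second function is a candidate for the "snapshot of old storage class" behaviour described in Issue 1.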


Version-Release number of selected component (if applicable):
4.14

How reproducible:
Always


Steps to Reproduce:

1. Have a non-snapshotable default storage class (HPP)

2. See that DVs/PVCs were imported

   $ oc get dv -A
NAMESPACE                            NAME                          PHASE       PROGRESS   RESTARTS   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos7-680e9b4e0fba          Succeeded   100.0%                18h
openshift-virtualization-os-images   fedora-f7cc15256f08           Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel8-0da894200daa            Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel9-b006ef7856b6            Succeeded   100.0%                18h


3. Make HPP non-default, make OCS default

   oc patch storageclass ocs-storagecluster-ceph-rbd -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'


4. Delete one DV 

   $ oc delete dv -n openshift-virtualization-os-images rhel9-b006ef7856b6
datavolume.cdi.kubevirt.io "rhel9-b006ef7856b6" deleted


5. The DV was not recreated (but should have been); a VolumeSnapshot was created, but it is not Ready

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT   CREATIONTIME   AGE
openshift-virtualization-os-images   rhel9-b006ef7856b6   false        rhel9-b006ef7856b6                                         ocs-storagecluster-rbdplugin-snapclass                                    13s


[cloud-user@ocp-psi-executor ~]$ oc get VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6 -oyaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  annotations:
    cdi.kubevirt.io/storage.import.lastUseTime: "2023-07-27T14:31:32.631870881Z"
  creationTimestamp: "2023-07-27T14:31:32Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  generation: 1
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    cdi.kubevirt.io: ""
    cdi.kubevirt.io/dataImportCron: rhel9-image-cron
  name: rhel9-b006ef7856b6
  namespace: openshift-virtualization-os-images
  resourceVersion: "1182048"
  uid: d69181d0-4195-4b3f-91b4-ba3631f05249
spec:
  source:
    persistentVolumeClaimName: rhel9-b006ef7856b6
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  error:
    message: 'Failed to create snapshot content with error snapshot controller failed
      to update rhel9-b006ef7856b6 on API server: cannot get claim from snapshot'


6. See that 2 minutes later, other VolumeSnapshots are created while old DVs were not yet deleted

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                          READYTOUSE   SOURCEPVC                     SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   false        centos-stream8-b9b768dcd73b                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5                  23s
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   false        centos-stream9-362e1f1d9f11                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-3eec6ff1-f73f-493f-b61b-58abfeec5b65                  23s
openshift-virtualization-os-images   centos7-680e9b4e0fba          false        centos7-680e9b4e0fba                                                ocs-storagecluster-rbdplugin-snapclass   snapcontent-76229453-37ff-40f6-8ce0-94e15a5b912c                  23s
openshift-virtualization-os-images   fedora-f7cc15256f08           false        fedora-f7cc15256f08                                                 ocs-storagecluster-rbdplugin-snapclass   snapcontent-94d05d80-20f5-4861-a7af-344f19842a61                  23s
openshift-virtualization-os-images   rhel8-0da894200daa            false        rhel8-0da894200daa                                                  ocs-storagecluster-rbdplugin-snapclass   snapcontent-df7f9a06-4a2e-41b1-8f04-a16758daf4e8                  23s
openshift-virtualization-os-images   rhel9-b006ef7856b6            false        rhel9-b006ef7856b6                                                  ocs-storagecluster-rbdplugin-snapclass                                                                     2m47s

7. See the yaml of another VolumeSnapshot, whose DV/PVC wasn't deleted and is still using the non-snapshotable HPP:

spec:
  source:
    persistentVolumeClaimName: centos-stream8-b9b768dcd73b
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5
  error:
    message: 'Failed to check and update snapshot content: failed to take snapshot
      of the volume pvc-e59ee8cd-57d0-4ecf-906f-0ab7a1f8ba72: "rpc error: code = Internal
      desc = panic runtime error: invalid memory address or nil pointer dereference"'
    time: "2023-07-27T14:33:56Z"
  readyToUse: false


8. To fix the broken VolumeSnapshot of the first deleted DV: delete that VolumeSnapshot

   $ oc delete VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6
volumesnapshot.snapshot.storage.k8s.io "rhel9-b006ef7856b6" deleted

9. This triggers the DV/PVC to re-import on OCS and creates a VolumeSnapshot that becomes ReadyToUse; the DV/PVC is then deleted automatically.
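
The workaround in steps 8 and 9 could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and it relies on READYTOUSE being the second column of the `oc get volumesnapshot` table, as in the output above. Snapshots taken from PVCs that still sit on HPP (step 6) will presumably stay broken until those DVs are deleted as well.

```shell
# Print VolumeSnapshots in namespace $1 whose READYTOUSE column is not "true".
unready_snapshots() {
  oc get volumesnapshot -n "$1" --no-headers | awk '$2 != "true" { print $1 }'
}

# Delete each snapshot reported by unready_snapshots in namespace $1.
delete_unready_snapshots() {
  for snap in $(unready_snapshots "$1"); do
    oc delete volumesnapshot -n "$1" "$snap"
  done
}

# Usage (not run here):
#   delete_unready_snapshots openshift-virtualization-os-images
```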


Actual results:
Re-importing requires extra manual steps: the broken VolumeSnapshot has to be deleted before the DV/PVC is re-imported.

Expected results:
Re-importing should happen once we switch the storage class and delete the old DVs.
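
Step 3 shows only the patch that marks OCS as default; the verification in Comment 3 also sets the annotation to "false" on hostpath-csi-basic. A small sketch that does both in one call is given below. The function names are hypothetical; the annotation key and the `oc patch` form are taken from the commands in this report.

```shell
# Set the default-class annotation on StorageClass $1 to $2 ("true" or "false").
set_default_annotation() {
  oc patch storageclass "$1" -p \
    "{\"metadata\": {\"annotations\": {\"storageclass.kubernetes.io/is-default-class\": \"$2\"}}}"
}

# Move the default from StorageClass $1 to StorageClass $2.
# The old class is unset first so two classes are never default together.
switch_default_sc() {
  set_default_annotation "$1" false || return 1
  set_default_annotation "$2" true
}

# Usage (not run here):
#   switch_default_sc hostpath-csi-basic ocs-storagecluster-ceph-rbd
```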

Comment 1 Jenia Peimer 2023-08-09 13:58:54 UTC
We should also cover this situation:
OCS was the default, and the DataImportCron images were imported and kept as VolumeSnapshots.
Then we changed the default storage class to HPP: new DVs/PVCs are not created unless we delete the VolumeSnapshot.
There are also reconcile errors in the log.

Comment 2 Alex Kalenyuk 2023-09-13 12:55:17 UTC
*** Bug 2228606 has been marked as a duplicate of this bug. ***

Comment 3 Harel Meir 2023-09-18 11:00:44 UTC
Verified on 4.14.0:

Steps:

1. Made OCS the default storage class:
[cloud-user@ocp-psi-executor-xl ~]$ oc patch storageclass hostpath-csi-basic -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
storageclass.storage.k8s.io/hostpath-csi-basic patched
[cloud-user@ocp-psi-executor-xl ~]$ oc patch storageclass ocs-storagecluster-ceph-rbd -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
storageclass.storage.k8s.io/ocs-storagecluster-ceph-rbd patched

2. Deleted the rhel9 DV:


[cloud-user@ocp-psi-executor-xl ~]$ oc delete dv $NS rhel9-a1947a1edca5
datavolume.cdi.kubevirt.io "rhel9-a1947a1edca5" deleted

An import started, and the DV gets recreated:
[cloud-user@ocp-psi-executor-xl ~]$ oc get dv $NS rhel9-a1947a1edca5
NAME                 PHASE              PROGRESS   RESTARTS   AGE
rhel9-a1947a1edca5   ImportInProgress   82.94%                50s

3. After the import finished, a VolumeSnapshot was created from the OCS storage class:

[cloud-user@ocp-psi-executor-xl ~]$ oc get volumesnapshot $NS
NAME                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
rhel9-a1947a1edca5   true         rhel9-a1947a1edca5                           30Gi          ocs-storagecluster-rbdplugin-snapclass   snapcontent-e4a9b852-0b1c-4e47-b143-060428944515   51s            53s


4. The DV and the PVC are deleted:
[cloud-user@ocp-psi-executor-xl ~]$ oc get dv $NS
NAME                          PHASE       PROGRESS   RESTARTS   AGE
centos-stream8-894237fb27f8   Succeeded   100.0%                3d18h
centos-stream9-a37c5c3cb1d0   Succeeded   100.0%                3d18h
centos7-680e9b4e0fba          Succeeded   100.0%                3d18h
fedora-f7cc15256f08           Succeeded   100.0%                3d18h
rhel8-b8545b0b6174            Succeeded   100.0%                3d18h
[cloud-user@ocp-psi-executor-xl ~]$ oc get pvc $NS
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
centos-stream8-894237fb27f8   Bound    pvc-fb9907c2-3bf6-442c-834c-2e83e4916e7e   149Gi      RWO            hostpath-csi-basic   3d18h
centos-stream9-a37c5c3cb1d0   Bound    pvc-22920d27-fe5f-4f27-8532-a42cb3f523c7   149Gi      RWO            hostpath-csi-basic   3d18h
centos7-680e9b4e0fba          Bound    pvc-66459f77-eeff-49e6-9717-8f3a47d0a681   149Gi      RWO            hostpath-csi-basic   3d18h
fedora-f7cc15256f08           Bound    pvc-847e8a14-06c0-4c23-8e7a-f21919d2bdc0   149Gi      RWO            hostpath-csi-basic   3d18h
rhel8-b8545b0b6174            Bound    pvc-848690c2-73a6-434b-9735-9c8c25ce06c6   149Gi      RWO            hostpath-csi-basic   18m


5. Set HPP as the default SC again.

The result is that the snapshot is still available and the DV wasn't imported:
[cloud-user@ocp-psi-executor-xl ~]$ oc get dv $NS
NAME                          PHASE       PROGRESS   RESTARTS   AGE
centos-stream8-894237fb27f8   Succeeded   100.0%                3d18h
centos-stream9-a37c5c3cb1d0   Succeeded   100.0%                3d18h
centos7-680e9b4e0fba          Succeeded   100.0%                3d18h
fedora-f7cc15256f08           Succeeded   100.0%                3d18h
rhel8-b8545b0b6174            Succeeded   100.0%                3d18h
[cloud-user@ocp-psi-executor-xl ~]$ oc get volumesnapshot $NS
NAME                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
rhel9-a1947a1edca5   true         rhel9-a1947a1edca5                           30Gi          ocs-storagecluster-rbdplugin-snapclass   snapcontent-e4a9b852-0b1c-4e47-b143-060428944515   4m54s          4m56s


6. Deleted the rhel9 VolumeSnapshot

Result: The DV got imported, and once the import completed, the DV is Succeeded and the PVC is Bound:

[cloud-user@ocp-psi-executor-xl ~]$ oc delete volumesnapshot $NS rhel9-a1947a1edca5
volumesnapshot.snapshot.storage.k8s.io "rhel9-a1947a1edca5" deleted


NAME                          PHASE       PROGRESS   RESTARTS   AGE
rhel9-a1947a1edca5            Succeeded   100.0%                4m25s

[cloud-user@ocp-psi-executor-xl ~]$ ^dv^pvc
NAME                          STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS         AGE
rhel9-a1947a1edca5            Bound    pvc-885fd0d5-de6e-4bd0-836a-36e4df1df647   149Gi      RWO            hostpath-csi-basic   4m33s
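
The check in step 3 above could be automated by polling the snapshot until it reports readyToUse. The following is only a sketch: the function name, the attempt count, and the delay are hypothetical defaults, not values taken from this verification.

```shell
# Poll until VolumeSnapshot $2 in namespace $1 reports readyToUse=true.
# $3 = number of attempts (default 60), $4 = delay in seconds (default 5).
# A snapshot that does not exist yet counts as "not ready" and is retried.
wait_snapshot_ready() {
  ns="$1"; name="$2"; tries="${3:-60}"; delay="${4:-5}"
  while [ "$tries" -gt 0 ]; do
    ready="$(oc get volumesnapshot -n "$ns" "$name" \
      -o jsonpath='{.status.readyToUse}' 2>/dev/null)"
    [ "$ready" = "true" ] && return 0
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

# Usage (not run here):
#   wait_snapshot_ready openshift-virtualization-os-images rhel9-a1947a1edca5
```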

Comment 5 errata-xmlrpc 2023-11-08 14:06:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Virtualization 4.14.0 Images security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:6817

