Bug 2227066

Summary: Recreation of the boot source images as cached snapshots may have issues
Product: Container Native Virtualization (CNV)
Reporter: Jenia Peimer <jpeimer>
Component: Storage
Assignee: Alex Kalenyuk <akalenyu>
Status: ASSIGNED
QA Contact: Natalie Gavrielov <ngavrilo>
Severity: high
Docs Contact:
Priority: high
Version: 4.14.0
CC: akalenyu, alitke
Target Milestone: ---   
Target Release: 4.14.0   
Hardware: Unspecified   
OS: Unspecified   
Type: Bug

Description Jenia Peimer 2023-07-27 16:47:46 UTC
Description of problem:
If the default storage class does not support snapshots, the boot source images created by the DataImportCrons in the openshift-virtualization-os-images namespace are imported as DVs/PVCs.

After switching the default storage class to OCS, you can re-import the images by deleting the old DVs. Each DV/PVC is then expected to be re-imported, a VolumeSnapshot object created from it, and the DV/PVC removed automatically.

Alex (akalenyu) looked at it and sees two issues:

Issue 1: Snapshots are being taken of PVCs that are still on the previous storage class (when changing the default SC from HPP to OCS).

Issue 2: When deleting the DVs from the old storage class, there may be a race in which the snapshot is created but the DV is not recreated.


Version-Release number of selected component (if applicable):
4.14

How reproducible:
Always


Steps to Reproduce:

1. Have a non-snapshotable default storage class (HPP)

2. See that DVs/PVCs were imported

   $ oc get dv -A
NAMESPACE                            NAME                          PHASE       PROGRESS   RESTARTS   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos7-680e9b4e0fba          Succeeded   100.0%                18h
openshift-virtualization-os-images   fedora-f7cc15256f08           Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel8-0da894200daa            Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel9-b006ef7856b6            Succeeded   100.0%                18h


3. Make HPP non-default, make OCS default

   oc patch storageclass ocs-storagecluster-ceph-rbd -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
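
Step 3 shows only the command that marks OCS as default. The other half (clearing the annotation on HPP) could look like the sketch below. The HPP storage class name is not given in this bug, so HPP_SC is a hypothetical placeholder. The helper only builds the patch body; the oc patch calls are left as comments.

```shell
# Build the merge patch that sets or clears the default-class annotation.
# $1 must be "true" or "false".
default_class_patch() {
  printf '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "%s"}}}\n' "$1"
}

# Usage against a live cluster (HPP_SC is a placeholder, not a name from this bug):
#   oc patch storageclass "$HPP_SC" -p "$(default_class_patch false)"
#   oc patch storageclass ocs-storagecluster-ceph-rbd -p "$(default_class_patch true)"
```

Clearing the old default first avoids a window in which two storage classes are both annotated as default.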


4. Delete one DV 

   $ oc delete dv -n openshift-virtualization-os-images rhel9-b006ef7856b6
datavolume.cdi.kubevirt.io "rhel9-b006ef7856b6" deleted


5. The DV was not recreated (it should have been). A VolumeSnapshot was created, but it is not ReadyToUse

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT   CREATIONTIME   AGE
openshift-virtualization-os-images   rhel9-b006ef7856b6   false        rhel9-b006ef7856b6                                         ocs-storagecluster-rbdplugin-snapclass                                    13s


   $ oc get VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6 -oyaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  annotations:
    cdi.kubevirt.io/storage.import.lastUseTime: "2023-07-27T14:31:32.631870881Z"
  creationTimestamp: "2023-07-27T14:31:32Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  generation: 1
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    cdi.kubevirt.io: ""
    cdi.kubevirt.io/dataImportCron: rhel9-image-cron
  name: rhel9-b006ef7856b6
  namespace: openshift-virtualization-os-images
  resourceVersion: "1182048"
  uid: d69181d0-4195-4b3f-91b4-ba3631f05249
spec:
  source:
    persistentVolumeClaimName: rhel9-b006ef7856b6
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  error:
    message: 'Failed to create snapshot content with error snapshot controller failed
      to update rhel9-b006ef7856b6 on API server: cannot get claim from snapshot'


6. About 2 minutes later, VolumeSnapshots are also created for the other boot sources, even though their old DVs have not been deleted yet

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                          READYTOUSE   SOURCEPVC                     SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   false        centos-stream8-b9b768dcd73b                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5                  23s
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   false        centos-stream9-362e1f1d9f11                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-3eec6ff1-f73f-493f-b61b-58abfeec5b65                  23s
openshift-virtualization-os-images   centos7-680e9b4e0fba          false        centos7-680e9b4e0fba                                                ocs-storagecluster-rbdplugin-snapclass   snapcontent-76229453-37ff-40f6-8ce0-94e15a5b912c                  23s
openshift-virtualization-os-images   fedora-f7cc15256f08           false        fedora-f7cc15256f08                                                 ocs-storagecluster-rbdplugin-snapclass   snapcontent-94d05d80-20f5-4861-a7af-344f19842a61                  23s
openshift-virtualization-os-images   rhel8-0da894200daa            false        rhel8-0da894200daa                                                  ocs-storagecluster-rbdplugin-snapclass   snapcontent-df7f9a06-4a2e-41b1-8f04-a16758daf4e8                  23s
openshift-virtualization-os-images   rhel9-b006ef7856b6            false        rhel9-b006ef7856b6                                                  ocs-storagecluster-rbdplugin-snapclass                                                                     2m47s
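
A quick way to pull the not-ready snapshots out of a listing like the one above is a small awk filter. This is a sketch that assumes the column order printed by `oc get VolumeSnapshot -A` in this report (NAMESPACE, NAME, READYTOUSE); the function name is made up for illustration.

```shell
# Read `oc get VolumeSnapshot -A` output on stdin and print
# namespace/name for every snapshot whose READYTOUSE column is "false".
not_ready_snapshots() {
  # NR > 1 skips the header row; $3 is the READYTOUSE column.
  awk 'NR > 1 && $3 == "false" { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   oc get VolumeSnapshot -A | not_ready_snapshots
```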

7. See the YAML of another VolumeSnapshot, whose DV/PVC was not deleted and is still on the non-snapshotable HPP storage class:

spec:
  source:
    persistentVolumeClaimName: centos-stream8-b9b768dcd73b
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5
  error:
    message: 'Failed to check and update snapshot content: failed to take snapshot
      of the volume pvc-e59ee8cd-57d0-4ecf-906f-0ab7a1f8ba72: "rpc error: code = Internal
      desc = panic runtime error: invalid memory address or nil pointer dereference"'
    time: "2023-07-27T14:33:56Z"
  readyToUse: false
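
To see which boot source PVCs are still on the old storage class (the situation behind Issue 1), the PVC list can be filtered by storage class name. This is only a sketch: the function name is made up, and it expects two-column input (PVC name, storage class) as produced by the custom-columns command in the usage comment.

```shell
# Read "NAME STORAGECLASS" lines on stdin and print the names of PVCs
# whose storage class differs from the one passed as $1.
pvcs_off_default() {
  awk -v sc="$1" '$2 != sc { print $1 }'
}

# Usage against a live cluster:
#   oc get pvc -n openshift-virtualization-os-images --no-headers \
#     -o custom-columns=NAME:.metadata.name,SC:.spec.storageClassName \
#     | pvcs_off_default ocs-storagecluster-ceph-rbd
```

Any PVC printed here is still backed by the previous storage class, so a snapshot taken with the OCS snapshot class is not expected to succeed for it.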


8. To fix the broken VolumeSnapshot left behind by the first deleted DV, delete that VolumeSnapshot

   $ oc delete VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6
volumesnapshot.snapshot.storage.k8s.io "rhel9-b006ef7856b6" deleted

9. This triggers a re-import of the DV/PVC on OCS. A VolumeSnapshot is then created and becomes ReadyToUse, and the DV/PVC is deleted automatically.
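
The snapshot deleted in step 8 is one that is not ready and whose source PVC no longer exists. A sketch for finding such snapshots from saved listings follows. The function name and the two-file interface are assumptions made for illustration, and it assumes every snapshot has a PVC source, so that SOURCEPVC is the fourth column as in the output above.

```shell
# $1 = file holding `oc get pvc -A --no-headers` output
# $2 = file holding `oc get VolumeSnapshot -A --no-headers` output
# Prints namespace/name of snapshots that are not ready and whose
# source PVC is missing from the PVC listing.
orphaned_snapshots() {
  awk 'FILENAME == ARGV[1] { pvc[$1 "/" $2] = 1; next }  # remember existing PVCs
       { key = $1 "/" $4 }                               # namespace/SOURCEPVC
       $3 == "false" && !(key in pvc) { print $1 "/" $2 }' "$1" "$2"
}

# Usage against a live cluster:
#   oc get pvc -A --no-headers > pvcs.txt
#   oc get VolumeSnapshot -A --no-headers > snaps.txt
#   orphaned_snapshots pvcs.txt snaps.txt
```

Snapshots that are not ready but whose PVC still exists (as in step 7) are deliberately not listed, because deleting the snapshot alone does not move their PVC off the old storage class.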


Actual results:
Re-importing requires extra manual steps: besides deleting the old DV, the broken VolumeSnapshot has to be deleted as well.

Expected results:
Re-importing should happen once we switch the storage class and delete the old DVs.

Comment 1 Jenia Peimer 2023-08-09 13:58:54 UTC
We should also consider the reverse situation:
OCS was the default, and the DataImportCron images were imported and kept as VolumeSnapshots.
Then we changed the default storage class to HPP: new DVs/PVCs are not created unless we delete the VolumeSnapshot,
and there are reconcile errors in the log.