Bug 2227066 - Recreation of the boot source images as cached snapshots may have issues
Summary: Recreation of the boot source images as cached snapshots may have issues
Keywords:
Status: ASSIGNED
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Storage
Version: 4.14.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.14.0
Assignee: Alex Kalenyuk
QA Contact: Natalie Gavrielov
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-07-27 16:47 UTC by Jenia Peimer
Modified: 2023-08-09 18:04 UTC (History)
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CNV-31467 0 None None None 2023-07-27 16:48:10 UTC

Description Jenia Peimer 2023-07-27 16:47:46 UTC
Description of problem:
If your default storage class does not support snapshots, the boot source images created by the DataImportCron in the openshift-virtualization-os-images namespace are imported as DVs/PVCs.

When you switch the default storage class to OCS, you can re-import the images by deleting the old DVs. The DV/PVC is re-imported, a VolumeSnapshot object is created, and the DV/PVC is then removed automatically.

Alex (akalenyu) looked at it and sees two issues:

Issue 1: Snapshots are made from volumes of the previous storage class (when changing the SC from HPP to OCS)

Issue 2: When deleting the old storage class's DVs, there may be a race where the snapshot gets created but the DV is not recreated


Version-Release number of selected component (if applicable):
4.14

How reproducible:
Always


Steps to Reproduce:

1. Have a non-snapshotable default storage class (HPP)
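To confirm the precondition, you can check which class is the default and whether its provisioner has a matching VolumeSnapshotClass. A minimal sketch, assuming the oc CLI is installed and logged in to the cluster:

```shell
# The annotation that marks a storage class as the cluster default
# (the same one patched in step 3 below).
DEFAULT_SC_ANNOTATION="storageclass.kubernetes.io/is-default-class"
if command -v oc >/dev/null 2>&1; then
  oc get storageclass          # the default class is flagged "(default)"
  # A class supports CSI snapshots only if some VolumeSnapshotClass
  # exists whose .driver matches the class's .provisioner.
  oc get volumesnapshotclass
fi
```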

2. See that DVs/PVCs were imported

   $ oc get dv -A
NAMESPACE                            NAME                          PHASE       PROGRESS   RESTARTS   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   Succeeded   100.0%                18h
openshift-virtualization-os-images   centos7-680e9b4e0fba          Succeeded   100.0%                18h
openshift-virtualization-os-images   fedora-f7cc15256f08           Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel8-0da894200daa            Succeeded   100.0%                18h
openshift-virtualization-os-images   rhel9-b006ef7856b6            Succeeded   100.0%                18h


3. Make HPP non-default, make OCS default

   oc patch storageclass ocs-storagecluster-ceph-rbd -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
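The patch above only marks OCS as default; to actually make HPP non-default, the same annotation also has to be set to "false" on the old class. A sketch, assuming the HPP storage class is named hostpath-csi (substitute your actual class name):

```shell
# Unset the default-class annotation on the previous default (HPP).
# "hostpath-csi" is a placeholder; use your HPP storage class name.
PATCH='{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "false"}}}'
if command -v oc >/dev/null 2>&1; then
  oc patch storageclass hostpath-csi -p "$PATCH"
fi
```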


4. Delete one DV 

   $ oc delete dv -n openshift-virtualization-os-images rhel9-b006ef7856b6
datavolume.cdi.kubevirt.io "rhel9-b006ef7856b6" deleted


5. The DV did not get recreated (but should have been); a VolumeSnapshot was created, but it is not Ready

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                 READYTOUSE   SOURCEPVC            SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT   CREATIONTIME   AGE
openshift-virtualization-os-images   rhel9-b006ef7856b6   false        rhel9-b006ef7856b6                                         ocs-storagecluster-rbdplugin-snapclass                                    13s


[cloud-user@ocp-psi-executor ~]$ oc get VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6 -oyaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  annotations:
    cdi.kubevirt.io/storage.import.lastUseTime: "2023-07-27T14:31:32.631870881Z"
  creationTimestamp: "2023-07-27T14:31:32Z"
  finalizers:
  - snapshot.storage.kubernetes.io/volumesnapshot-as-source-protection
  generation: 1
  labels:
    app: containerized-data-importer
    app.kubernetes.io/component: storage
    app.kubernetes.io/managed-by: cdi-controller
    app.kubernetes.io/part-of: hyperconverged-cluster
    app.kubernetes.io/version: 4.14.0
    cdi.kubevirt.io: ""
    cdi.kubevirt.io/dataImportCron: rhel9-image-cron
  name: rhel9-b006ef7856b6
  namespace: openshift-virtualization-os-images
  resourceVersion: "1182048"
  uid: d69181d0-4195-4b3f-91b4-ba3631f05249
spec:
  source:
    persistentVolumeClaimName: rhel9-b006ef7856b6
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  error:
    message: 'Failed to create snapshot content with error snapshot controller failed
      to update rhel9-b006ef7856b6 on API server: cannot get claim from snapshot'


6. See that 2 minutes later, other VolumeSnapshots are created even though their old DVs were not yet deleted

   $ oc get VolumeSnapshot -A
NAMESPACE                            NAME                          READYTOUSE   SOURCEPVC                     SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                            SNAPSHOTCONTENT                                    CREATIONTIME   AGE
openshift-virtualization-os-images   centos-stream8-b9b768dcd73b   false        centos-stream8-b9b768dcd73b                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5                  23s
openshift-virtualization-os-images   centos-stream9-362e1f1d9f11   false        centos-stream9-362e1f1d9f11                                         ocs-storagecluster-rbdplugin-snapclass   snapcontent-3eec6ff1-f73f-493f-b61b-58abfeec5b65                  23s
openshift-virtualization-os-images   centos7-680e9b4e0fba          false        centos7-680e9b4e0fba                                                ocs-storagecluster-rbdplugin-snapclass   snapcontent-76229453-37ff-40f6-8ce0-94e15a5b912c                  23s
openshift-virtualization-os-images   fedora-f7cc15256f08           false        fedora-f7cc15256f08                                                 ocs-storagecluster-rbdplugin-snapclass   snapcontent-94d05d80-20f5-4861-a7af-344f19842a61                  23s
openshift-virtualization-os-images   rhel8-0da894200daa            false        rhel8-0da894200daa                                                  ocs-storagecluster-rbdplugin-snapclass   snapcontent-df7f9a06-4a2e-41b1-8f04-a16758daf4e8                  23s
openshift-virtualization-os-images   rhel9-b006ef7856b6            false        rhel9-b006ef7856b6                                                  ocs-storagecluster-rbdplugin-snapclass                                                                     2m47s

7. See the YAML of another VolumeSnapshot, whose DV/PVC was not deleted and is still using the non-snapshottable HPP:

spec:
  source:
    persistentVolumeClaimName: centos-stream8-b9b768dcd73b
  volumeSnapshotClassName: ocs-storagecluster-rbdplugin-snapclass
status:
  boundVolumeSnapshotContentName: snapcontent-8455f2ea-0d70-4998-9fa5-bbc42133b1f5
  error:
    message: 'Failed to check and update snapshot content: failed to take snapshot
      of the volume pvc-e59ee8cd-57d0-4ecf-906f-0ab7a1f8ba72: "rpc error: code = Internal
      desc = panic runtime error: invalid memory address or nil pointer dereference"'
    time: "2023-07-27T14:33:56Z"
  readyToUse: false


8. To fix the broken VolumeSnapshot of the first deleted DV, delete that VolumeSnapshot:

   $ oc delete VolumeSnapshot -n openshift-virtualization-os-images rhel9-b006ef7856b6
volumesnapshot.snapshot.storage.k8s.io "rhel9-b006ef7856b6" deleted

9. This triggers the DV/PVC to re-import on OCS and creates a VolumeSnapshot that becomes ReadyToUse; the DV/PVC is then deleted automatically.
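The recovery in steps 8-9 can be sketched as a short script. The namespace and snapshot name are the ones from this report; the `oc wait --for=jsonpath=...` form assumes a reasonably recent oc client.

```shell
# Workaround sketch for steps 8-9: delete the stuck VolumeSnapshot so
# that CDI re-imports the boot source on the new default storage class.
NS=openshift-virtualization-os-images
NAME=rhel9-b006ef7856b6
if command -v oc >/dev/null 2>&1; then
  oc delete volumesnapshot -n "$NS" "$NAME"
  # Wait for the re-created snapshot to become ready to use.
  oc wait volumesnapshot -n "$NS" "$NAME" \
    --for=jsonpath='{.status.readyToUse}'=true --timeout=10m
fi
```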


Actual results:
Re-importing requires extra manual steps (the broken VolumeSnapshot has to be deleted by hand).

Expected results:
Re-importing should happen automatically once we switch the default storage class and delete the old DVs.

Comment 1 Jenia Peimer 2023-08-09 13:58:54 UTC
We may also encounter the reverse situation:
OCS was the default, and the DataImportCron images were imported and stayed as VolumeSnapshots.
Then we changed the default storage class to HPP: new DVs/PVCs are not created unless we delete the VolumeSnapshots,
and there are reconcile errors in the log.

