Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1949018

Summary: [azure disk csi driver] volumesnapshot instance is stuck in not-ready-to-use status and no events are shown
Product: OpenShift Container Platform
Reporter: Qin Ping <piqin>
Component: Storage
Assignee: Christian Huffman <chuffman>
Storage sub component: Operators
QA Contact: Wei Duan <wduan>
Status: CLOSED UPSTREAM
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, chuffman
Version: 4.8
Target Milestone: ---
Target Release: 4.9.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-06-21 14:47:43 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Qin Ping 2021-04-13 09:13:13 UTC
Description of problem:
When a volumesnapshot instance is stuck in the not-ready-to-use status, no events are shown to indicate what happened.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-09-222447

How reproducible:
Always

Steps to Reproduce:
1. Create a PVC (test-pvc-5) with the azure disk csi driver provisioner and a pod that uses it.
2. Create a volumesnapshotclass with the following yaml file:
$ cat snapshotclass.yaml 
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshotclass
driver: disk.csi.azure.com
deletionPolicy: Delete
parameters:
  skuname: StandardSSD_LRS
3. Create a volumesnapshot instance with the following yaml file:
$ cat snapshot.yaml 
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mysnapshot
spec:
  volumeSnapshotClassName: csi-snapshotclass
  source:
    persistentVolumeClaimName: test-pvc-5


Actual results:
volumesnapshot/mysnapshot is stuck in the not-ready-to-use status, and no events are created.
Checking the logs showed:
I0413 08:43:55.764348       1 snapshot_controller.go:301] createSnapshotWrapper: CreateSnapshot for content snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2 returned error: rpc error: code = Unknown desc = AzureDisk - invalid option skuname in VolumeSnapshotClass
E0413 08:43:55.770853       1 snapshot_controller.go:105] createSnapshot for content [snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)
E0413 08:43:55.770923       1 snapshot_controller_base.go:264] could not sync content "snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2": failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)
I0413 08:43:55.770967       1 snapshot_controller.go:267] createSnapshotWrapper: Creating snapshot for content snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2 through the plugin ...
I0413 08:43:55.771019       1 event.go:282] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"", UID:"", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)


Expected results:
Some useful events should be created to explain the failure.

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 5 Christian Huffman 2021-04-21 17:40:51 UTC
We have two separate issues here, which I'll describe below.

1. If we encounter a fatal error during snapshot content creation, this error is overwritten when we attempt to remove the volume annotation on the content. I've submitted [1] to fix this upstream. The fix is fairly straightforward: simply use a different variable to track errors encountered while removing the annotation.
2. Once this is fixed, we still have an issue where potentially stale data causes status updates to fail. Upstream is attempting to move from update to patch at [2], but this is going slowly.
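
The error-shadowing pattern in point 1 and the one-variable fix can be sketched as follows (the function names are hypothetical stand-ins; the actual change is in external-snapshotter's snapshot_controller.go, see [1]):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the controller's operations.
func createSnapshotOperation() error {
	return errors.New("AzureDisk - invalid option skuname in VolumeSnapshotClass")
}

func removeAnnotation() error {
	return nil // annotation removal succeeds
}

func createSnapshotBuggy() error {
	err := createSnapshotOperation()
	// Bug: reusing err discards the fatal snapshot error once
	// annotation removal succeeds.
	if err = removeAnnotation(); err != nil {
		return err
	}
	return err // returns nil; the original error is lost
}

func createSnapshotFixed() error {
	err := createSnapshotOperation()
	// Fix: track the annotation-removal error in its own variable.
	if removeErr := removeAnnotation(); removeErr != nil {
		return removeErr
	}
	return err // the fatal error is preserved and can be reported
}

func main() {
	fmt.Println("buggy:", createSnapshotBuggy())
	fmt.Println("fixed:", createSnapshotFixed())
}
```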

This issue won't be fully fixed until both [1] and [2] are resolved, and even then the error will only be propagated to the VolumeSnapshotContent (not to the snapshot itself). As of right now, this is what we see:

  $ oc describe volumesnapshotcontent $NAME
  [...]
  Events:
  Type     Reason                  Age                     From                                Message
  ----     ------                  ----                    ----                                -------
  Warning  SnapshotCreationFailed  4m32s (x2003 over 24m)  csi-snapshotter disk.csi.azure.com  Failed to create snapshot: failed to take snapshot of the volume /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-wtx4i32-002ac-wsm2z-rg/providers/Microsoft.Compute/disks/pvc-a9d498e0-60d4-47a4-9c00-060a23fbc5b1: rpc error: code = Unknown desc = AzureDisk - invalid option hello in VolumeSnapshotClass

  $ oc describe volumesnapshot $NAME
  Normal  CreatingSnapshot  25m   snapshot-controller  Waiting for a snapshot default/snapshot to be created by the CSI driver.

Personally I think this is reasonable. The error is now consistently in the logs and on the content.

[1] https://github.com/kubernetes-csi/external-snapshotter/pull/502
[2] https://github.com/kubernetes-csi/external-snapshotter/pull/480

Comment 14 Christian Huffman 2021-06-21 14:47:43 UTC
This is addressed upstream in [1], so I'm going to close the bug as UPSTREAM. The fix is an improvement in error display; in the meantime, the error can still be seen in the logs. Once we rebase on upstream we'll pick up the fix, and the error will be correctly propagated.

[1] https://github.com/kubernetes-csi/external-snapshotter/pull/527