Bug 1949018
| Summary: | [azure disk csi driver] volumesnapshot instance is stuck in NotReady to use status and no events show | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qin Ping <piqin> |
| Component: | Storage | Assignee: | Christian Huffman <chuffman> |
| Storage sub component: | Operators | QA Contact: | Wei Duan <wduan> |
| Status: | CLOSED UPSTREAM | Docs Contact: | |
| Severity: | medium | | |
| Priority: | unspecified | CC: | aos-bugs, chuffman |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.9.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-06-21 14:47:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
We have two separate issues here, described below.

1. If we encounter a fatal error during snapshot content creation, this error is overwritten when we attempt to remove the volume annotation on the content. I've submitted [1] to fix this upstream. The fix is fairly straightforward: use a different variable to track errors obtained during removal of the annotation.

2. Once this is fixed, we still have an issue with potentially stale data causing status updates to fail. Upstream is attempting to move to patch instead of update at [2], but this is going slowly. This issue won't be fully fixed until both [1] and [2] are resolved, and even then the error will only be propagated to the VolumeSnapshotContent (not the snapshot itself). As of right now this is what we see:

```
$ oc describe volumesnapshotcontent $NAME
[...]
Events:
  Type     Reason                  Age                     From             Message
  ----     ------                  ----                    ----             -------
  Warning  SnapshotCreationFailed  4m32s (x2003 over 24m)  csi-snapshotter  disk.csi.azure.com Failed to create snapshot: failed to take snapshot of the volume /subscriptions/d38f1e38-4bed-438e-b227-833f997adf6a/resourceGroups/ci-ln-wtx4i32-002ac-wsm2z-rg/providers/Microsoft.Compute/disks/pvc-a9d498e0-60d4-47a4-9c00-060a23fbc5b1: rpc error: code = Unknown desc = AzureDisk - invalid option hello in VolumeSnapshotClass

$ oc describe volumesnapshot $NAME
  Normal  CreatingSnapshot  25m  snapshot-controller  Waiting for a snapshot default/snapshot to be created by the CSI driver.
```

Personally I think this is reasonable. The error is now consistently in the logs and on the content.

[1] https://github.com/kubernetes-csi/external-snapshotter/pull/502
[2] https://github.com/kubernetes-csi/external-snapshotter/pull/480

This is addressed upstream in [1], and I'm going to close the bug as UPSTREAM. This is an improvement in error display, but we can still see the error recorded in the logs.
Once we do a rebase on upstream we'll get the fix, and the error will be correctly propagated.

[1] https://github.com/kubernetes-csi/external-snapshotter/pull/527
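The error-shadowing pattern behind issue 1 can be illustrated with a minimal, hypothetical Go sketch (the function names are stand-ins for illustration, not the actual controller code): reusing a single `err` variable lets a successful annotation removal overwrite the earlier fatal error, while tracking the removal error in a separate variable preserves it.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-in for removing the volume annotation on the
// content; in the failing scenario it succeeds and returns nil.
func removeAnnotation() error { return nil }

// Buggy flow: reusing err discards the original fatal error.
func createSnapshotBuggy() error {
	err := errors.New("fatal: CreateSnapshot failed")
	err = removeAnnotation() // overwrites the fatal error with nil
	return err
}

// Fixed flow (the approach described for [1]): track the removal
// error separately so the fatal error survives.
func createSnapshotFixed() error {
	err := errors.New("fatal: CreateSnapshot failed")
	if removeErr := removeAnnotation(); removeErr != nil {
		return removeErr
	}
	return err
}

func main() {
	fmt.Println(createSnapshotBuggy()) // <nil> — fatal error lost
	fmt.Println(createSnapshotFixed()) // fatal: CreateSnapshot failed
}
```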
Description of problem:
When a volumesnapshot instance is stuck in the NotReady-to-use status, no events are shown that explain what happened.

Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-04-09-222447

How reproducible:
Always

Steps to Reproduce:
1. Create a PVC (test-pvc-5) with the azure disk csi driver provisioner and a pod to use it.
2. Create a volumesnapshotclass with the following yaml file (snapshotclass.yaml):

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapshotclass
driver: disk.csi.azure.com
deletionPolicy: Delete
parameters:
  skuname: StandardSSD_LRS
```

3. Create a volumesnapshot instance with the following yaml file (snapshot.yaml):

```yaml
apiVersion: snapshot.storage.k8s.io/v1beta1
kind: VolumeSnapshot
metadata:
  name: mysnapshot
spec:
  volumeSnapshotClassName: csi-snapshotclass
  source:
    persistentVolumeClaimName: test-pvc-5
```

Actual results:
volumesnapshot/mysnapshot is stuck in the NotReady-to-use status, and no events are created.
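The stuck state from the steps above can be confirmed by inspecting the snapshot's status fields (mysnapshot is the name from the reproducer; this is a suggested check, run against a live cluster):

```
# readyToUse stays "false" while the snapshot is stuck
oc get volumesnapshot mysnapshot -o jsonpath='{.status.readyToUse}'

# Find the bound VolumeSnapshotContent, where the error event lands,
# and inspect its events
oc get volumesnapshot mysnapshot -o jsonpath='{.status.boundVolumeSnapshotContentName}'
oc describe volumesnapshotcontent "$(oc get volumesnapshot mysnapshot -o jsonpath='{.status.boundVolumeSnapshotContentName}')"
```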
Checking the log shows:

```
I0413 08:43:55.764348 1 snapshot_controller.go:301] createSnapshotWrapper: CreateSnapshot for content snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2 returned error: rpc error: code = Unknown desc = AzureDisk - invalid option skuname in VolumeSnapshotClass
E0413 08:43:55.770853 1 snapshot_controller.go:105] createSnapshot for content [snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2]: error occurred in createSnapshotWrapper: failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)
E0413 08:43:55.770923 1 snapshot_controller_base.go:264] could not sync content "snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2": failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)
I0413 08:43:55.770967 1 snapshot_controller.go:267] createSnapshotWrapper: Creating snapshot for content snapcontent-e71e5dab-f337-4072-bb57-51dd0a5bf4b2 through the plugin ...
I0413 08:43:55.771019 1 event.go:282] Event(v1.ObjectReference{Kind:"VolumeSnapshotContent", Namespace:"", Name:"", UID:"", APIVersion:"snapshot.storage.k8s.io/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SnapshotCreationFailed' Failed to create snapshot: failed to take snapshot of the volume, /subscriptions/53b8f551-f0fc-4bea-8cba-6d1fefd54c8a/resourceGroups/piqin-0413-sc8mb-rg/providers/Microsoft.Compute/disks/pvc-dce6d9c9-0990-496b-8e1c-f0c46014fad2: %!q(<nil>)
```

Expected results:
Some useful event should be created.

Master Log:
Node Log (of failed PODs):
PV Dump:
PVC Dump:
StorageClass Dump (if StorageClass used by PV/PVC):
Additional info: