Bug 1814280

Summary: CSI Snapshot Controller panics in checkandRemoveSnapshotFinalizersAndCheckandDeleteContent
Product: OpenShift Container Platform Reporter: Christian Huffman <chuffman>
Component: Storage Assignee: Christian Huffman <chuffman>
Status: CLOSED ERRATA QA Contact: Qin Ping <piqin>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 4.4 CC: aos-bugs, bbennett, piqin, wking
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: The controller assumed that a VolumeSnapshotContent had always been created and used it without checking whether it was nil. Consequence: The CSI Snapshot Controller could panic and crash. Fix: The controller now checks whether the VolumeSnapshotContent is nil before using it. Result: The CSI Snapshot Controller no longer panics due to a nil VolumeSnapshotContent.
Story Points: ---
Clone Of:
: 1815563 Environment:
Last Closed: 2020-08-04 18:05:47 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1815563    

Description Christian Huffman 2020-03-17 14:41:43 UTC
The CSI Snapshot Controller can enter a crash loop in checkandRemoveSnapshotFinalizersAndCheckandDeleteContent. This was seen in the following test:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_kube-state-metrics/27/pull-ci-openshift-kube-state-metrics-master-e2e-aws/41

This seems to occur because, in certain cases, we don't check whether the snapshotContent is nil before using it. The stack trace is below:

I0316 19:01:12.229634       1 snapshot_controller.go:832] checkandRemovePVCFinalizer[snapshot-rt6ft]: Remove Finalizer for PVC pvc-f42nr as it is not used by snapshots in creation
...
E0316 19:01:14.124545       1 snapshot_controller_base.go:371] could not sync volume "e2e-snapshotting-5615/snapshot-rt6ft": failed to delete VolumeSnapshotContent snapcontent-2d11399d-a8c6-430a-b3da-747c853c1b55 from API server: "volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-2d11399d-a8c6-430a-b3da-747c853c1b55\" not found"
E0316 19:01:14.126116       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 136 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic(0x12df160, 0x1fd2770)
	/go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:74 +0xa3
k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:48 +0x82
panic(0x12df160, 0x1fd2770)
	/usr/local/go/src/runtime/panic.go:679 +0x1b2
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).checkandRemoveSnapshotFinalizersAndCheckandDeleteContent(0xc00003c400, 0xc000392780, 0x0, 0x0, 0x2, 0xc0001dc780)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller.go:265 +0x39f
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).processSnapshotWithDeletionTimestamp(0xc00003c400, 0xc000392780, 0x0, 0x0)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller.go:229 +0x35b
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).syncSnapshot(0xc00003c400, 0xc000392780, 0x1438b00, 0xc000392780)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller.go:170 +0x361
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).updateSnapshot(0xc00003c400, 0xc000392780)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:364 +0x250
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).snapshotWorker.func1(0x6738796f484a6300)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:220 +0x944
github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).snapshotWorker(0xc00003c400)
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:253 +0x4b
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc0002f4180)
	/go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152 +0x5e
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc0002f4180, 0x0, 0x0, 0x1, 0xc0003d43c0)
	/go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153 +0xf8
k8s.io/apimachinery/pkg/util/wait.Until(0xc0002f4180, 0x0, 0xc0003d43c0)
	/go/src/github.com/kubernetes-csi/external-snapshotter/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubernetes-csi/external-snapshotter/pkg/common-controller.(*csiSnapshotCommonController).Run
	/go/src/github.com/kubernetes-csi/external-snapshotter/pkg/common-controller/snapshot_controller_base.go:154 +0x2d9
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x1137c7f]
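
The fault address in the trace (addr=0x28) is consistent with reading a field at a small offset from a nil pointer. The following is a minimal sketch of that failure mode and of the kind of nil guard described in this bug. It is not the controller's actual code: the VolumeSnapshotContent struct here is a hypothetical stand-in with a single field, and unsafeShouldDelete, safeShouldDelete, and panics are illustrative names that do not appear in external-snapshotter.

```go
package main

import "fmt"

// VolumeSnapshotContent is a hypothetical stand-in for the real API type;
// only the one field needed for this illustration is included.
type VolumeSnapshotContent struct {
	DeletionPolicy string
}

// unsafeShouldDelete mirrors the failure mode: it reads a field through
// content without checking for nil, so a missing content object panics.
func unsafeShouldDelete(content *VolumeSnapshotContent) bool {
	return content.DeletionPolicy == "Delete"
}

// safeShouldDelete applies a nil guard: with no content object there is
// nothing to delete, so it returns false instead of panicking.
func safeShouldDelete(content *VolumeSnapshotContent) bool {
	if content == nil {
		return false
	}
	return content.DeletionPolicy == "Delete"
}

// panics reports whether calling f panics.
func panics(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	// Assumed scenario: the content object is already gone, so the
	// lookup yields a nil pointer.
	var missing *VolumeSnapshotContent

	fmt.Println("unguarded panics:", panics(func() { unsafeShouldDelete(missing) }))
	fmt.Println("guarded result:", safeShouldDelete(missing))
	fmt.Println("guarded, existing content:",
		safeShouldDelete(&VolumeSnapshotContent{DeletionPolicy: "Delete"}))
}
```

In this sketch the unguarded call panics on the nil pointer, while the guarded call treats a missing content object as "nothing to delete" and returns normally.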

Comment 1 Jan Safranek 2020-03-18 12:25:44 UTC
*** Bug 1814458 has been marked as a duplicate of this bug. ***

Comment 2 Christian Huffman 2020-03-18 15:21:39 UTC
I submitted https://github.com/kubernetes-csi/external-snapshotter/pull/278 to include this fix upstream. I haven't been able to reproduce the issue with this commit applied.

Comment 4 Christian Huffman 2020-03-18 18:31:30 UTC
Cherrypick PR to OpenShift - https://github.com/openshift/csi-external-snapshotter/pull/16

Comment 7 Qin Ping 2020-03-25 08:15:50 UTC
Checked the upstream CI jobs from the last 4 days (about 100 jobs) and did not find this error.

Comment 9 errata-xmlrpc 2020-08-04 18:05:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409