Bug 1894645 - Cinder volume provisioning crashes on nil cloud provider
Summary: Cinder volume provisioning crashes on nil cloud provider
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6.z
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Mike Fedosin
QA Contact: weiwei jiang
URL:
Whiteboard:
Duplicates: 1923064
Depends On: 1894561 1895022
Blocks: 1924785
 
Reported: 2020-11-04 17:06 UTC by Jan Safranek
Modified: 2021-05-24 20:44 UTC
CC: 18 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1894561
Cloned As: 1928978
Environment:
Last Closed: 2021-02-24 15:30:03 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift kubernetes pull 510 0 None closed Bug 1894645: UPSTREAM: 89885: SQUASH: Fix cinder crash 2021-02-21 08:25:29 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:30:33 UTC

Internal Links: 1923064

Description Jan Safranek 2020-11-04 17:06:26 UTC
+++ This bug was initially created as a clone of Bug #1894561 +++

E1104 13:47:34.766497       1 goroutinemap.go:150] Operation for "provision-openshift-image-registry/image-registry-storage[4405d840-b69e-486e-8136-dccb8ec2879e]" failed. No retries permitted until 2020-11-04 13:48:06.766363989 +0000 UTC m=+165.870074032 (durationBeforeRetry 32s). Error: "recovered from panic \"runtime error: invalid memory address or nil pointer dereference\". (err=<nil>) Call stack:
goroutine 7289 [running]:
k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime.RecoverFromPanic(0xc008003fd0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:158 +0xba
panic(0x3eb7be0, 0x6f2d2d0)
/usr/lib/golang/src/runtime/panic.go:969 +0x175
k8s.io/kubernetes/vendor/github.com/gophercloud/gophercloud/openstack.initClientOpts(0x0, 0x473cb62, 0x8, 0x0, 0x0, 0xc000d10dc8, 0x5, 0x47389dc, 0x6, 0x473cb62, ...)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/gophercloud/gophercloud/openstack/client.go:297 +0x96
k8s.io/kubernetes/vendor/github.com/gophercloud/gophercloud/openstack.NewBlockStorageV3(...)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/github.com/gophercloud/gophercloud/openstack/client.go:353
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack.(*OpenStack).NewBlockStorageV3(0xc001196780, 0x457be80, 0x0, 0xed7321efe)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack/openstack_client.go:74 +0xc7
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack.(*OpenStack).volumeService(0xc001196780, 0x0, 0x0, 0xc008003308, 0x8521f9, 0xc00cfa0c60, 0xc008003298)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack/openstack.go:913 +0x6cf
k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack.(*OpenStack).CreateVolume(0xc001196780, 0xc007999c80, 0x3e, 0x64, 0x0, 0x0, 0xc003f3cb15, 0x5, 0xc00bca4ee8, 0xc003e56380, ...)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/vendor/k8s.io/legacy-cloud-providers/openstack/openstack_volumes.go:465 +0x4b
k8s.io/kubernetes/pkg/volume/cinder.(*DiskUtil).CreateVolume(0x6f9fcb8, 0xc00cde5f80, 0xc000a1e000, 0x0, 0x0, 0x0, 0x4e4ad01, 0x4350940, 0xc0080036c0, 0x40e878, ...)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/cinder/cinder_util.go:221 +0x647
k8s.io/kubernetes/pkg/volume/cinder.(*cinderVolumeProvisioner).Provision(0xc00cde5f80, 0xc000a1e000, 0x0, 0x0, 0x0, 0x0, 0xc00c8f4390, 0x28)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/volume/cinder/cinder.go:568 +0x113
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaimOperation(0xc00521e000, 0xc00b522700, 0x7fb69f6e61e8, 0xc004d4f4e0, 0xc003c60600, 0x473e6ac, 0x9, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1555 +0xfe2
k8s.io/kubernetes/pkg/controller/volume/persistentvolume.(*PersistentVolumeController).provisionClaim.func1(0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/controller/volume/persistentvolume/pv_controller.go:1447 +0x1fd
k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run.func1(0xc004af3a10, 0xc008aa8540, 0x5f, 0xc00cf182a0, 0x0, 0x0)
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:115 +0xee
created by k8s.io/kubernetes/pkg/util/goroutinemap.(*goRoutineMap).Run
/go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/pkg/util/goroutinemap/goroutinemap.go:108 +0x2e7
"

--- Additional comment from Maciej Szulik on 2020-11-04 14:27:10 UTC ---

From checking the code, it looks like the nil pointer comes from an improperly initialized ProviderClient for OpenStack.
I'm moving this to the storage team for further investigation; since this sits at the intersection of OpenStack and
storage, I'll let them have a say on it.

--- Additional comment from Jan Safranek on 2020-11-04 15:18:34 UTC ---

There are several interesting things:

1. The OpenStack cloud provider delays its initialization until it can read its secret, i.e. it gets a Secrets lister in the SetInformers() call:
https://github.com/openshift/origin/blob/d9ca44ba958c2d7f647b791e28eeef694fe7406b/vendor/k8s.io/legacy-cloud-providers/openstack/openstack.go#L272

It then uses the informer to initialize itself in the ensureCloudProviderWasInitialized() calls.

2. The log says:
E1104 13:51:19.035242       1 openstack.go:284] Cannot get secret openstack-credentials in namespace kube-system. error: "secret \"openstack-credentials\" not found"

The secret exists:

$ oc get secrets -n kube-system openstack-credentials
NAME                    TYPE     DATA   AGE
openstack-credentials   Opaque   2      47h

This may imply the Secret lister was not synced yet at 13:51:19.035242 (see the cache-sync sketch after this list).

3. We can then see this:

E1104 13:51:30.162048       1 openstack_instances.go:55] cannot initialize cloud provider, only limited functionality is available : cloud provider is not initialized

This implies:
a. The secret is in the informer cache at 13:51:30.162048 (otherwise we would see "Cannot get secret openstack-credentials..." again)

b. ensureCloudProviderWasInitialized/setConfigFromSecret did find the secret, but failed on something else, most probably the provider initialization itself. Unfortunately, the error is not reported to the caller!
https://github.com/openshift/origin/blob/d9ca44ba958c2d7f647b791e28eeef694fe7406b/vendor/k8s.io/legacy-cloud-providers/openstack/openstack.go#L320


4. Most OpenStack cloud provider calls check that the cloud provider was fully initialized (by calling ensureCloudProviderWasInitialized), but the OpenStack storage parts (volumeService()) do not, so they crash.
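
Regarding point 2: the standard client-go guard against this race is to wait for the informer caches to sync before reading from a lister, since a secret that exists on the server still reads as "not found" from an unsynced cache. A hedged sketch of the generic pattern (this is not how openstack.go currently wires its informers):

package main

import (
    "fmt"
    "os"
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
    if err != nil {
        panic(err)
    }
    client := kubernetes.NewForConfigOrDie(cfg)

    stopCh := make(chan struct{})
    defer close(stopCh)

    factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
    secrets := factory.Core().V1().Secrets()
    factory.Start(stopCh)

    // Without this wait, the lister below can report "not found" for a
    // secret that exists on the server, exactly as in the 13:51:19 log line.
    if !cache.WaitForCacheSync(stopCh, secrets.Informer().HasSynced) {
        panic("informer cache never synced")
    }

    s, err := secrets.Lister().Secrets("kube-system").Get("openstack-credentials")
    if err != nil {
        panic(err)
    }
    fmt.Printf("found secret %s with %d data keys\n", s.Name, len(s.Data))
}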


To sum it up:
* volumeService() should call ensureCloudProviderWasInitialized(); this will fix the panic. kube-controller-manager still won't be very usable, but it will at least report a nicer error.
* ensureCloudProviderWasInitialized/setConfigFromSecret must report the error up the stack! Today we cannot see why the cloud provider failed to initialize. (See the sketch below.)
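
For concreteness, here is a minimal sketch of the shape of both fixes (hypothetical signatures, not the patch that eventually merged): volumeService() gains the initialization guard, and the initialization path returns its error instead of only logging it.

package main

import (
    "errors"
    "fmt"
)

type providerClient struct{ endpoint string }

type openStack struct {
    provider *providerClient // nil until initialized from the secret
}

// setConfigFromSecret stands in for the real initialization; it always
// fails in this sketch so that the error path is exercised.
func (o *openStack) setConfigFromSecret() error {
    return errors.New("secret \"openstack-credentials\" not found")
}

// ensureCloudProviderWasInitialized now returns *why* initialization
// failed instead of swallowing the error.
func (o *openStack) ensureCloudProviderWasInitialized() error {
    if o.provider != nil {
        return nil
    }
    if err := o.setConfigFromSecret(); err != nil {
        return fmt.Errorf("cloud provider is not initialized: %w", err)
    }
    return nil
}

// volumeService gains the guard the other entry points already have, so
// CreateVolume returns an error instead of panicking on a nil provider.
func (o *openStack) volumeService() (*providerClient, error) {
    if err := o.ensureCloudProviderWasInitialized(); err != nil {
        return nil, err
    }
    return o.provider, nil
}

func main() {
    o := &openStack{}
    if _, err := o.volumeService(); err != nil {
        fmt.Println("CreateVolume failed:", err) // sane error, no crash
    }
}

With this shape, the "cloud provider is not initialized" log line would carry the underlying initialization failure instead of a bare message.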

-----

This clone is only about not crashing kube-controller-manager and logging a sane error instead.

Comment 1 Jan Safranek 2020-11-13 12:12:06 UTC
Waiting for review

Comment 2 Jan Safranek 2020-11-20 13:08:09 UTC
PR https://github.com/openshift/kubernetes/pull/438 will likely not be merged into OCP; it must be fixed upstream first. This implies that the config map parsing we carry in OCP must be merged upstream too. This is out of scope for the storage team, so I'm assigning it to the cloud provider team.

Comment 3 Martin André 2020-12-17 10:17:06 UTC
Mike pushed Jan's changes to https://github.com/kubernetes/kubernetes/pull/96750. These changes now need to find their way into the downstream tree, since they will apparently not be merged upstream.

Comment 6 hgomes 2021-02-02 15:00:15 UTC
@m.andre

Thank you for confirming. Just a quick correction: my customer is at 4.5.27. Is a backport to 4.5 on the roadmap?

Comment 9 weiwei jiang 2021-02-04 07:06:27 UTC
Verified with Kubernetes Version: v1.20.0+cac2421

$ oc version
Client Version: 4.7.0-202102032256.p0-c66c03f
Server Version: 4.7.0-0.nightly-2021-02-04-012305
Kubernetes Version: v1.20.0+cac2421

$ oc -n openshift-kube-controller-manager logs kube-controller-manager-wj47ios204ay-jsscr-master-2 |grep -i openstack.go 
E0204 06:13:18.749187       1 openstack.go:284] cannot get secret openstack-credentials in namespace kube-system. error: "secret \"openstack-credentials\" not found"
I0204 06:13:19.015824       1 openstack.go:301] OpenStack cloud provider was initialized using data from the secret.

$ oc get co  storage
NAME      VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage   4.7.0-0.nightly-2021-02-04-012305   True        False         False      40m

$ oc get sc 
NAME                 PROVISIONER                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
csi-manila-ceph      manila.csi.openstack.org   Delete          Immediate              false                  63m
standard (default)   kubernetes.io/cinder       Delete          WaitForFirstConsumer   true                   64m
standard-csi         cinder.csi.openstack.org   Delete          WaitForFirstConsumer   true                   64m

Comment 12 Mike Fedosin 2021-02-17 11:13:03 UTC
*** Bug 1923064 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2021-02-24 15:30:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

