Bug 1869516 - cluster-storage-operator pod panics when upgrading OCP from 4.1.0-0.nightly-2020-07-29-210856 to 4.6.0-0.nightly-2020-08-04-193041
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Christian Huffman
QA Contact: Qin Ping
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-08-18 07:56 UTC by Qin Ping
Modified: 2020-10-27 16:28 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:28:38 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-storage-operator pull 77 0 None closed Bug 1869516: Prevent crash if PlatformStatus is nil 2021-01-14 08:54:06 UTC
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:28:53 UTC

Description Qin Ping 2020-08-18 07:56:01 UTC
Description of problem:
The cluster-storage-operator pod panics when upgrading OCP from 4.1.0-0.nightly-2020-07-29-210856 to 4.6.0-0.nightly-2020-08-04-193041.

Version-Release number of selected component (if applicable):
Upgrade from original build: 4.1.0-0.nightly-2020-07-29-210856 to target_build: 4.2.0-0.nightly-2020-08-04-161322, 4.3.0-0.nightly-2020-08-04-163159, 4.4.0-0.nightly-2020-08-03-123644, 4.5.0-0.nightly-2020-08-03-123303, 4.6.0-0.nightly-2020-08-04-193041

How reproducible:
Hit once in QE upgrade CI.

Steps to Reproduce:
1. Profile: 05_UPI on Baremetal with RHCOS (FIPS off)

Actual results:
2020-08-05T02:49:32.316531208Z E0805 02:49:32.316485       1 runtime.go:78] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
2020-08-05T02:49:32.316531208Z goroutine 608 [running]:
2020-08-05T02:49:32.316531208Z k8s.io/apimachinery/pkg/util/runtime.logPanic(0x1e69120, 0x36e6df0)
2020-08-05T02:49:32.316531208Z         k8s.io/apimachinery.0-rc.2/pkg/util/runtime/runtime.go:74 +0xa3
2020-08-05T02:49:32.316531208Z k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
2020-08-05T02:49:32.316531208Z         k8s.io/apimachinery.0-rc.2/pkg/util/runtime/runtime.go:48 +0x82
2020-08-05T02:49:32.316531208Z panic(0x1e69120, 0x36e6df0)
2020-08-05T02:49:32.316531208Z         runtime/panic.go:969 +0x166
2020-08-05T02:49:32.316531208Z github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass.newStorageClassForCluster(0xc0012bc000, 0x215e6a3, 0x7, 0xc0012bc000)
2020-08-05T02:49:32.316531208Z         github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass/controller.go:163 +0x2d
2020-08-05T02:49:32.316531208Z github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass.(*Controller).syncStorageClass(0xc000619310, 0x2183a12, 0x1d)
2020-08-05T02:49:32.316531208Z         github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass/controller.go:138 +0x7f
2020-08-05T02:49:32.316531208Z github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass.(*Controller).sync(0xc000619310, 0x2581320, 0xc000972b80, 0x257b2a0, 0xc00050aab0, 0x0, 0x0)
2020-08-05T02:49:32.316531208Z         github.com/openshift/cluster-storage-operator/pkg/operator/defaultstorageclass/controller.go:87 +0x33d
2020-08-05T02:49:32.316531208Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).reconcile(0xc0001747e0, 0x2581320, 0xc000972b80, 0x257b2a0, 0xc00050aab0, 0x427745, 0xc001c52660)
2020-08-05T02:49:32.316531208Z         github.com/openshift/library-go.0-20200724235449-b4f9ae5f0c51/pkg/controller/factory/base_controller.go:175 +0x76
2020-08-05T02:49:32.316531208Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).processNextWorkItem(0xc0001747e0, 0x2581320, 0xc000972b80)
2020-08-05T02:49:32.316531208Z         github.com/openshift/library-go.0-20200724235449-b4f9ae5f0c51/pkg/controller/factory/base_controller.go:215 +0x230
2020-08-05T02:49:32.316531208Z github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker.func1(0xc001c76710, 0x2581320, 0xc000972b80, 0xc0001747e0)
2020-08-05T02:49:32.316531208Z         github.com/openshift/library-go.0-20200724235449-b4f9ae5f0c51/pkg/controller/factory/base_controller.go:166 +0x99
2020-08-05T02:49:32.316531208Z created by github.com/openshift/library-go/pkg/controller/factory.(*baseController).runWorker
2020-08-05T02:49:32.316531208Z         github.com/openshift/library-go.0-20200724235449-b4f9ae5f0c51/pkg/controller/factory/base_controller.go:158 +0x8d


Expected results:

Master Log:

Node Log (of failed PODs):

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:

Comment 2 Christian Huffman 2020-08-18 20:26:43 UTC
It looks like the panic occurs because Infrastructure.Status.PlatformStatus is nil. From the must-gather, here is the Infrastructure resource; note that it has only the `platform` field and no `platformStatus`:

apiVersion: config.openshift.io/v1
items:
- apiVersion: config.openshift.io/v1
  kind: Infrastructure
  metadata:
    creationTimestamp: "2020-08-04T22:09:18Z"
    generation: 1
    name: cluster
    resourceVersion: "402"
    selfLink: /apis/config.openshift.io/v1/infrastructures/cluster
    uid: 24430978-d69f-11ea-b1fc-fa163e8ef816
  spec:
    cloudConfig:
      name: ""
  status:
    apiServerInternalURI: https://api-int.ugdci05054712.qe.devcluster.openshift.com:6443
    apiServerURL: https://api.ugdci05054712.qe.devcluster.openshift.com:6443
    etcdDiscoveryDomain: ugdci05054712.qe.devcluster.openshift.com
    infrastructureName: ugdci05054712-zxmkz
    platform: None

We return an unsupportedPlatformError in the default case, but we don't check for a `nil` PlatformStatus before using it. I've submitted [1], which should address this issue.

[1] https://github.com/openshift/cluster-storage-operator/pull/77

Comment 5 Qin Ping 2020-08-27 01:10:09 UTC
Verified with: 4.1.41->4.2.36->4.3.33->4.4.18->4.5.7->4.6.0-0.nightly-2020-08-26-032807

Comment 7 errata-xmlrpc 2020-10-27 16:28:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

