Bug 1798049 - CVO panics when downgrading to 4.2.10
Summary: CVO panics when downgrading to 4.2.10
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.3.z
Assignee: W. Trevor King
QA Contact: Gaoyun Pei
URL:
Whiteboard:
Depends On: 1783221
Blocks: 1800346
 
Reported: 2020-02-04 13:29 UTC by Scott Dodson
Modified: 2021-10-12 16:07 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1783221
Clones: 1800346
Environment:
Last Closed: 2020-02-25 06:18:00 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 313 0 None closed Bug 1798049: lib/resourcemerge/core: Fix panic on container/port removal 2020-12-22 03:21:27 UTC
Red Hat Product Errata RHBA-2020:0528 0 None None None 2020-02-25 06:18:18 UTC

Comment 3 Gaoyun Pei 2020-02-15 06:29:55 UTC
Verified this bug using payload 4.3.0-0.nightly-2020-02-14-234906; the downgrade to 4.2.19 still failed.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-02-14-234906   True        False         9m11s   Cluster version is 4.3.0-0.nightly-2020-02-14-234906

# oc adm upgrade --to-image='quay.io/openshift-release-dev/ocp-release@sha256:b51a0c316bb0c11686e6b038ec7c9f7ff96763f47a53c3443ac82e8c054bc035' --allow-explicit-upgrade
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:b51a0c316bb0c11686e6b038ec7c9f7ff96763f47a53c3443ac82e8c054bc035


# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6d78ff4f8f-ng6fl   0/1     Error       4          15m
version--qgqbz-29vmc                        0/1     Completed   0          15m

# oc logs cluster-version-operator-6d78ff4f8f-ng6fl -n openshift-cluster-version
...
...
I0215 06:08:49.510678       1 request.go:530] Throttling request took 793.363675ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-cluster-machine-approver/deployments/machine-approver
E0215 06:08:49.515290       1 runtime.go:69] Observed a panic: "index out of range" (runtime error: index out of range)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:44
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:69
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:28
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:23
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/apps.go:27
/go/src/github.com/openshift/cluster-version-operator/lib/resourceapply/apps.go:29
/go/src/github.com/openshift/cluster-version-operator/lib/resourcebuilder/apps.go:70
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:593
/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task.go:71
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:588
/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:591
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: index out of range [recovered]
        panic: runtime error: index out of range

goroutine 196 [running]:
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x13e1c20, 0x2540480)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensureContainers(0xc001b7f65f, 0xc0019e38b0, 0xc00095fe00, 0x1, 0x1)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:69 +0x799
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensurePodSpec(0xc001b7f65f, 0xc0019e3880, 0xc001223c20, 0x1, 0x1, 0x0, 0x0, 0x0, 0xc00095fe00, 0x1, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:28 +0xc6
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensurePodTemplateSpec(0xc001b7f65f, 0xc0019e3798, 0xc000d75560, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:23 +0xd0
github.com/openshift/cluster-version-operator/lib/resourcemerge.EnsureDeployment(0xc001b7f65f, 0xc0019e3680, 0x12b0c24, 0xa, 0xc000d75d30, 0x7, 0xc000d75180, 0x10, 0x0, 0x0, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/apps.go:27 +0x172
github.com/openshift/cluster-version-operator/lib/resourceapply.ApplyDeployment(0x17a4d60, 0xc0002c9ff0, 0xc0019e3200, 0x20, 0x2573d20, 0xa, 0xc00069d698)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourceapply/apps.go:29 +0x1b0
github.com/openshift/cluster-version-operator/lib/resourcebuilder.(*deploymentBuilder).Do(0xc00147b480, 0x17de660, 0xc0012f52c0, 0xc00147b480, 0xc00147b480)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcebuilder/apps.go:70 +0xeb
github.com/openshift/cluster-version-operator/pkg/cvo.(*resourceBuilder).Apply(0xc001027f80, 0x17de660, 0xc0012f52c0, 0xc00093a6e0, 0x0, 0x30, 0x200)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:593 +0xb6
github.com/openshift/cluster-version-operator/pkg/payload.(*Task).Run(0xc000ad5b80, 0x17de660, 0xc0012f52c0, 0xc000ceee00, 0x6, 0x17a3840, 0xc001027f80, 0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task.go:71 +0xb0
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).apply.func1(0x17de660, 0xc0012f52c0, 0xc0012305b8, 0x7, 0x149, 0x2, 0x2)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:588 +0x37d
github.com/openshift/cluster-version-operator/pkg/payload.RunGraph.func2(0xc000dad790, 0x17de660, 0xc0012f52c0, 0xc0011e8d20, 0xc001026b10, 0xc0010489a0, 0xc0011e8d80, 0xa)
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:591 +0x289
created by github.com/openshift/cluster-version-operator/pkg/payload.RunGraph
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:577 +0x23b

Comment 4 W. Trevor King 2020-02-16 06:06:12 UTC
> Verified this bug using payload 4.3.0-0.nightly-2020-02-14-234906; the downgrade to 4.2.19 still failed.

On upgrades and downgrades, the CVO that matters is the one from the target release.  So for 4.3.0-0.nightly-2020-02-14-234906 -> 4.2.19, you will still hit this failure mode because 4.2.19 does not contain the patch.  If you want to independently verify the fix for this 4.3 Bugzilla, you will need an update from some release to a 4.3 nightly in which a manifest change removes either a container or a service port that was not the final entry in its list (e.g. see the unit test removing the test-A container [1]).  I'm not sure an appropriate source release image exists off the shelf.  You could create one by adding additional ports to a service in your target 4.3 nightly.  Or you could just verify this bug by saying "we don't see any regressions" and then test the 4.3 -> 4.2 downgrade as part of verifying the 4.2.z bug 1800346.

[1]: https://github.com/openshift/cluster-version-operator/pull/282/files#diff-415c13f11ffc32696c5d69b900b3fe58R251-R268
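
For illustration, a rough sketch of the kind of regression test that exercises this path: the required deployment drops the first of two containers, so the removed entry is not the final one in the list.  The EnsureDeployment signature and package layout here are inferred from the stack trace in comment 3 and from the linked unit test, so treat this as an assumption-laden sketch rather than the actual test added in the PR.

package resourcemerge

import (
	"testing"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// Hypothetical test in the spirit of [1]: required drops kube-rbac-proxy,
// the first (non-final) container in the existing list.
func TestEnsureDeploymentRemovesNonFinalContainer(t *testing.T) {
	existing := appsv1.Deployment{}
	existing.Spec.Template.Spec.Containers = []corev1.Container{
		{Name: "kube-rbac-proxy"},
		{Name: "machine-approver-controller"},
	}
	required := appsv1.Deployment{}
	required.Spec.Template.Spec.Containers = []corev1.Container{
		{Name: "machine-approver-controller"},
	}

	modified := false
	// Before the fix this panicked with "index out of range"; afterwards it
	// should simply drop the kube-rbac-proxy entry.
	EnsureDeployment(&modified, &existing, required)

	if got := existing.Spec.Template.Spec.Containers; len(got) != 1 || got[0].Name != "machine-approver-controller" {
		t.Errorf("unexpected containers after merge: %v", got)
	}
}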

Comment 5 W. Trevor King 2020-02-16 06:30:26 UTC
Digging into the manifest change that triggered the initial issue: we no longer have 4.3.0-0.nightly-2019-12-12-155629 around, but we do have the temporally close 4.3.0-0.nightly-2019-12-13-072740.  Comparing that 4.3 nightly with 4.2.10:

$ oc adm release extract --to 4.2.10 quay.io/openshift-release-dev/ocp-release:4.2.10
$ oc adm release extract --to 4.3.0-0.nightly-2019-12-13-072740 quay.io/openshift-release-dev/ocp-release-nightly:4.3.0-0.nightly-2019-12-13-072740
$ diff -U3 4.2.10/0000_50_cluster-machine-approver_02-deployment.yaml 4.3.0-0.nightly-2019-12-13-072740/0000_50_cluster-machine-approver_04-deployment.yaml
--- 4.2.10/0000_50_cluster-machine-approver_02-deployment.yaml	2019-12-02 22:52:11.000000000 -0800
+++ 4.3.0-0.nightly-2019-12-13-072740/0000_50_cluster-machine-approver_04-deployment.yaml	2019-12-06 16:35:48.000000000 -0800
@@ -21,8 +23,31 @@
       hostNetwork: true
       serviceAccountName: machine-approver-sa
       containers:
+      - args:
+        ...
+        name: kube-rbac-proxy
+        ...
+          name: machine-approver-tls
       - name: machine-approver-controller
...

So the issue is that the kube-rbac-proxy container spec (the first entry in that array) is being removed, and the subsequent iteration into the machine-approver-controller container spec hits the panic.  Unless 4.4 -> 4.3 downgrades were already hitting a similar panic, you'd need to synthesize another change like this (or one adding a Service port) in order to verify this 4.3.z bug in an upgrade or downgrade from some release to a 4.3 nightly.
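
To make the mechanism concrete, here is a minimal, standalone Go sketch of this class of bug: entries are spliced out of a slice in place while later iterations still use indices from the original, longer slice, plus one safe alternative that rebuilds the list instead.  The names are hypothetical; this is not the actual lib/resourcemerge code, just an illustration of why removing the first container makes a later index lookup panic.

package main

import "fmt"

type container struct{ Name string }

// pruneBuggy mimics the suspect pattern: containers missing from required
// are spliced out of existing in place, but the loop keeps using indices
// from the original, longer slice.
func pruneBuggy(existing *[]container, required []container) {
	for i, c := range *existing { // slice header copied once, before any removal
		keep := false
		for _, r := range required {
			if r.Name == c.Name {
				keep = true
				_ = &(*existing)[i] // panics once the slice has shrunk below i+1
				break
			}
		}
		if !keep {
			*existing = append((*existing)[:i], (*existing)[i+1:]...)
		}
	}
}

// pruneSafe rebuilds the list instead of splicing during iteration.
func pruneSafe(existing *[]container, required []container) {
	kept := (*existing)[:0]
	for _, c := range *existing {
		for _, r := range required {
			if r.Name == c.Name {
				kept = append(kept, c)
				break
			}
		}
	}
	*existing = kept
}

func main() {
	existing := []container{{"kube-rbac-proxy"}, {"machine-approver-controller"}}
	required := []container{{"machine-approver-controller"}}
	pruneBuggy(&existing, required) // panic: runtime error: index out of range
	fmt.Println(existing)           // never reached
}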

Comment 6 Gaoyun Pei 2020-02-18 08:14:58 UTC
Thanks very much for the detailed explanation; I must have confused the CVO versions when running the downgrade. I actually already made the 4.4 -> 4.3 downgrade test in https://bugzilla.redhat.com/show_bug.cgi?id=1783221#c7. Since there is no issue with the 4.3 CVO, I'll move this bug to VERIFIED and will test the initial problem in BZ#1800346. Thanks.

Comment 8 errata-xmlrpc 2020-02-25 06:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0528

