Bug 1798049

Summary: CVO got panic when downgrading to 4.2.10
Product: OpenShift Container Platform
Component: Cluster Version Operator
Reporter: Scott Dodson <sdodson>
Assignee: W. Trevor King <wking>
QA Contact: Gaoyun Pei <gpei>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Version: 4.3.0
Target Release: 4.3.z
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
CC: aos-bugs, ccoleman, gpei, jokerman, padillon, wking
Clone Of: 1783221
Clones: 1800346 (view as bug list)
Bug Depends On: 1783221
Bug Blocks: 1800346
Last Closed: 2020-02-25 06:18:00 UTC

Comment 3 Gaoyun Pei 2020-02-15 06:29:55 UTC
Verified this bug using payload 4.3.0-0.nightly-2020-02-14-234906; the downgrade to 4.2.19 still failed.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2020-02-14-234906   True        False         9m11s   Cluster version is 4.3.0-0.nightly-2020-02-14-234906

# oc adm upgrade --to-image='quay.io/openshift-release-dev/ocp-release@sha256:b51a0c316bb0c11686e6b038ec7c9f7ff96763f47a53c3443ac82e8c054bc035' --allow-explicit-upgrade
Updating to release image quay.io/openshift-release-dev/ocp-release@sha256:b51a0c316bb0c11686e6b038ec7c9f7ff96763f47a53c3443ac82e8c054bc035


# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS      RESTARTS   AGE
cluster-version-operator-6d78ff4f8f-ng6fl   0/1     Error       4          15m
version--qgqbz-29vmc                        0/1     Completed   0          15m

# oc logs cluster-version-operator-6d78ff4f8f-ng6fl -n openshift-cluster-version
...
...
I0215 06:08:49.510678       1 request.go:530] Throttling request took 793.363675ms, request: GET:https://127.0.0.1:6443/apis/apps/v1/namespaces/openshift-cluster-machine-approver/deployments/machine-approver
E0215 06:08:49.515290       1 runtime.go:69] Observed a panic: "index out of range" (runtime error: index out of range)
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:76
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:44
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:69
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:28
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:23
/go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/apps.go:27
/go/src/github.com/openshift/cluster-version-operator/lib/resourceapply/apps.go:29
/go/src/github.com/openshift/cluster-version-operator/lib/resourcebuilder/apps.go:70
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:593
/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task.go:71
/go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:588
/go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:591
/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/asm_amd64.s:1337
panic: runtime error: index out of range [recovered]
        panic: runtime error: index out of range

goroutine 196 [running]:
github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x105
panic(0x13e1c20, 0x2540480)
        /opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/panic.go:522 +0x1b5
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensureContainers(0xc001b7f65f, 0xc0019e38b0, 0xc00095fe00, 0x1, 0x1)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:69 +0x799
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensurePodSpec(0xc001b7f65f, 0xc0019e3880, 0xc001223c20, 0x1, 0x1, 0x0, 0x0, 0x0, 0xc00095fe00, 0x1, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:28 +0xc6
github.com/openshift/cluster-version-operator/lib/resourcemerge.ensurePodTemplateSpec(0xc001b7f65f, 0xc0019e3798, 0xc000d75560, 0x10, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/core.go:23 +0xd0
github.com/openshift/cluster-version-operator/lib/resourcemerge.EnsureDeployment(0xc001b7f65f, 0xc0019e3680, 0x12b0c24, 0xa, 0xc000d75d30, 0x7, 0xc000d75180, 0x10, 0x0, 0x0, ...)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcemerge/apps.go:27 +0x172
github.com/openshift/cluster-version-operator/lib/resourceapply.ApplyDeployment(0x17a4d60, 0xc0002c9ff0, 0xc0019e3200, 0x20, 0x2573d20, 0xa, 0xc00069d698)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourceapply/apps.go:29 +0x1b0
github.com/openshift/cluster-version-operator/lib/resourcebuilder.(*deploymentBuilder).Do(0xc00147b480, 0x17de660, 0xc0012f52c0, 0xc00147b480, 0xc00147b480)
        /go/src/github.com/openshift/cluster-version-operator/lib/resourcebuilder/apps.go:70 +0xeb
github.com/openshift/cluster-version-operator/pkg/cvo.(*resourceBuilder).Apply(0xc001027f80, 0x17de660, 0xc0012f52c0, 0xc00093a6e0, 0x0, 0x30, 0x200)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/cvo.go:593 +0xb6
github.com/openshift/cluster-version-operator/pkg/payload.(*Task).Run(0xc000ad5b80, 0x17de660, 0xc0012f52c0, 0xc000ceee00, 0x6, 0x17a3840, 0xc001027f80, 0x0, 0x0, 0x0)
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task.go:71 +0xb0
github.com/openshift/cluster-version-operator/pkg/cvo.(*SyncWorker).apply.func1(0x17de660, 0xc0012f52c0, 0xc0012305b8, 0x7, 0x149, 0x2, 0x2)
        /go/src/github.com/openshift/cluster-version-operator/pkg/cvo/sync_worker.go:588 +0x37d
github.com/openshift/cluster-version-operator/pkg/payload.RunGraph.func2(0xc000dad790, 0x17de660, 0xc0012f52c0, 0xc0011e8d20, 0xc001026b10, 0xc0010489a0, 0xc0011e8d80, 0xa)
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:591 +0x289
created by github.com/openshift/cluster-version-operator/pkg/payload.RunGraph
        /go/src/github.com/openshift/cluster-version-operator/pkg/payload/task_graph.go:577 +0x23b

Comment 4 W. Trevor King 2020-02-16 06:06:12 UTC
> Verified this bug using payload 4.3.0-0.nightly-2020-02-14-234906; the downgrade to 4.2.19 still failed.

On upgrades and downgrades, the CVO that matters is the one from the target release.  So for 4.3.0-0.nightly-2020-02-14-234906 -> 4.2.19, you will still hit this failure mode, because 4.2.19 does not contain the patch.  If you want to independently verify the fix for this 4.3 Bugzilla, you will need a whatever -> 4.3-nightly upgrade in which a manifest change removes either a container or a service port that was not the final entry in its list (e.g. see the unit test removing the test-A container [1]).  I'm not sure an appropriate source release image exists off the shelf.  You could create one by adding additional ports to a service in your target 4.3 nightly.  Or you could simply verify this bug by noting "we don't see any regressions" and then test the 4.3 -> 4.2 downgrade as part of verifying the 4.2.z bug 1800346.

[1]: https://github.com/openshift/cluster-version-operator/pull/282/files#diff-415c13f11ffc32696c5d69b900b3fe58R251-R268

Comment 5 W. Trevor King 2020-02-16 06:30:26 UTC
Digging into the manifest change that triggered the initial issue.  We don't have 4.3.0-0.nightly-2019-12-12-155629 around anymore, but we do have the temporally close 4.3.0-0.nightly-2019-12-13-072740.  Comparing that 4.3 nightly with 4.2.10:

$ oc adm release extract --to 4.2.10 quay.io/openshift-release-dev/ocp-release:4.2.10
$ oc adm release extract --to 4.3.0-0.nightly-2019-12-13-072740 quay.io/openshift-release-dev/ocp-release-nightly:4.3.0-0.nightly-2019-12-13-072740
$ diff -U3 4.2.10/0000_50_cluster-machine-approver_02-deployment.yaml 4.3.0-0.nightly-2019-12-13-072740/0000_50_cluster-machine-approver_04-deployment.yaml
--- 4.2.10/0000_50_cluster-machine-approver_02-deployment.yaml	2019-12-02 22:52:11.000000000 -0800
+++ 4.3.0-0.nightly-2019-12-13-072740/0000_50_cluster-machine-approver_04-deployment.yaml	2019-12-06 16:35:48.000000000 -0800
@@ -21,8 +23,31 @@
       hostNetwork: true
       serviceAccountName: machine-approver-sa
       containers:
+      - args:
+        ...
+        name: kube-rbac-proxy
+        ...
+          name: machine-approver-tls
       - name: machine-approver-controller
...

so the issue is that the kube-rbac-proxy container spec (the first entry in that array) is being removed, and the subsequent iteration into the machine-approver-controller container spec hits the panic.  Unless 4.4 -> 4.3 downgrades were already hitting a similar panic, you'd need to synthesize another change like this (or add a Service port) in order to verify this 4.3.z bug in a whatever -> 4.3-nightly upgrade or downgrade.
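
The failure class above can be sketched in a few lines. This is not the actual resourcemerge code, just an illustration of the merge-by-name pattern where removing a non-final container (kube-rbac-proxy) must not invalidate the indices used for the remaining entries; iterating the deletion pass in reverse is one way to keep them valid:

```go
package main

import "fmt"

// Container is a minimal stand-in for corev1.Container.
type Container struct {
	Name  string
	Image string
}

// ensureContainers reconciles the existing container list against the
// required manifest: drop containers no longer required, then merge or
// append the required ones. Deleting the first entry while walking the
// slice forward by index is the class of bug that caused the
// "index out of range" panic; reverse iteration avoids it.
func ensureContainers(existing *[]Container, required []Container) {
	for i := len(*existing) - 1; i >= 0; i-- {
		found := false
		for _, r := range required {
			if r.Name == (*existing)[i].Name {
				found = true
				break
			}
		}
		if !found {
			*existing = append((*existing)[:i], (*existing)[i+1:]...)
		}
	}
	for _, r := range required {
		matched := false
		for i := range *existing {
			if (*existing)[i].Name == r.Name {
				(*existing)[i].Image = r.Image
				matched = true
			}
		}
		if !matched {
			*existing = append(*existing, r)
		}
	}
}

func main() {
	// Mirror the manifest change from this comment: the downgrade target
	// drops the kube-rbac-proxy container, the first entry in the list.
	existing := []Container{
		{Name: "kube-rbac-proxy", Image: "proxy:4.3"},
		{Name: "machine-approver-controller", Image: "approver:4.3"},
	}
	required := []Container{
		{Name: "machine-approver-controller", Image: "approver:4.2"},
	}
	ensureContainers(&existing, required)
	fmt.Println(existing)
}
```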

Comment 6 Gaoyun Pei 2020-02-18 08:14:58 UTC
Thanks very much for the detailed explanation; I must have confused the CVO versions when running the downgrade. Actually, I had already done the 4.4 -> 4.3 downgrade test in https://bugzilla.redhat.com/show_bug.cgi?id=1783221#c7. Since there's no issue with the 4.3 CVO, I'll move this bug to VERIFIED and test the initial problem in BZ#1800346. Thanks.

Comment 8 errata-xmlrpc 2020-02-25 06:18:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0528