+++ This bug was initially created as a clone of Bug #2005581 +++

Description of problem:
When upgrading from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126, the upgrade hung at "Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)". The cluster-version-operator pod is in CrashLoopBackOff with "error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable".

09-18 15:50:07.266  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 15:50:07.266  version   4.8.12    True        True          63s     Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)
......
......
09-18 17:35:04.378  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 17:35:04.378  version   4.8.12    True        True          106m    Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)

Version-Release number of the following components:
4.8.12 to 4.9.0-0.nightly-2021-09-17-210126

How reproducible:
3

Steps to Reproduce:
1. Upgrade a vSphere cluster from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126
2. Check the cluster-version-operator pod:

$ oc -n openshift-cluster-version logs cluster-version-operator-588cf597dd-vw4wk
I0918 09:22:13.359349       1 start.go:21] ClusterVersionOperator 4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0918 09:22:13.359611       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000468000, 0xb8, 0xd0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x2ad7120, 0xc000000003, 0x0, 0x0, 0xc0001d0380, 0x22e7998, 0x8, 0x18, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x2ad7120, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1c6ecc9, 0x9, 0xc000606490, 0x1, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1509
main.init.3.func1(0xc00045e000, 0xc0001d02a0, 0x0, 0x7)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc00045e000, 0xc0001d0230, 0x7, 0x7, 0xc00045e000, 0xc0001d0230)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x2ac3380, 0xc000000180, 0xc000066740, 0x46ef85)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x2ad7120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf

Actual results:
Upgrade hung

Expected results:
Upgrade should be successful

Additional info:
A must-gather from another cluster with the same failure:
http://10.73.131.57:9000/openshift-must-gather/2021-09-18-04-56-27/must-gather.local.8528734579890313358.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openshift%2F20210918%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210918T045641Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f8156800c652db3924b4c45ff7f48f3ee98f9a8e7d1d0e8703735ff8dfaf7b10

--- Additional comment from W. Trevor King on 2021-09-20 22:34:41 UTC ---

The same thing is going on in CI, e.g. [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).status.containerStatuses[] | .state.waiting.reason + " " + (.restartCount | tostring) + "\n\n" + .lastState.terminated.message'
CrashLoopBackOff 34

4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0920 13:23:23.565439       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344

--- Additional comment from W. Trevor King on 2021-09-20 22:42:39 UTC ---

Comparing with a healthy 4.8.11 -> 4.9.0-rc.1 job [1], the issue is the recent volume change from bug 2002834 (backported to 4.9 as bug 2004568):

$ diff -u \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784/artifacts/launch/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]') \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]')
--- /dev/fd/63	2021-09-20 15:39:31.090945777 -0700
+++ /dev/fd/62	2021-09-20 15:39:31.092945777 -0700
@@ -13,8 +13,3 @@
   "name": "serving-cert",
   "readOnly": true
 }
-{
-  "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
-  "name": "kube-api-access",
-  "readOnly": true
-}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784
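To illustrate why dropping that mount is fatal: client-go builds in-cluster configuration from the serviceaccount token and CA under /var/run/secrets/kubernetes.io/serviceaccount, and when that path is not mounted and no kubeconfig exists, the kubeconfig fallback fails with exactly the message seen in the CVO log. The following is a minimal sketch of that client-go behavior, not the CVO's actual startup code; the main function and printed strings are illustrative only.

// Minimal sketch, not the CVO's actual code: shows how client-go ends up with
// "no configuration has been provided, try setting KUBERNETES_MASTER environment
// variable" when the serviceaccount token is not mounted and no kubeconfig is set.
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// rest.InClusterConfig reads the token and CA from
	// /var/run/secrets/kubernetes.io/serviceaccount; without the kube-api-access
	// volume mount those files do not exist and this returns an error.
	if cfg, err := rest.InClusterConfig(); err != nil {
		fmt.Println("in-cluster config failed:", err)
	} else {
		fmt.Println("in-cluster config OK:", cfg.Host)
		return
	}

	// The usual fallback, clientcmd with no explicit kubeconfig, then fails with
	// "invalid configuration: no configuration has been provided, try setting
	// KUBERNETES_MASTER environment variable".
	if _, err := clientcmd.BuildConfigFromFlags("", ""); err != nil {
		fmt.Println("error creating clients:", err)
	}
}

With the kube-api-access projected volume mounted again, the in-cluster path succeeds and the fallback is never reached.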
Verifying with 4.9.0-0.nightly-2021-09-21-215600. The CVO pod is rolled out to 4.9 successfully. There are two issues in this upgrade testing, but they do not seem relevant to this bug:
1. Two nodes go to NotReady.
2. The old CVO pod is in Terminating status but does not get removed.

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.12    True        True          3h3m    Working towards 4.9.0-0.nightly-2021-09-21-215600: 71 of 734 done (9% complete)

# oc get node
NAME              STATUS                        ROLES    AGE     VERSION
compute-0         NotReady,SchedulingDisabled   worker   4h13m   v1.21.1+d8043e1
compute-1         Ready                         worker   4h13m   v1.21.1+d8043e1
control-plane-0   Ready                         master   4h23m   v1.21.1+d8043e1
control-plane-1   NotReady,SchedulingDisabled   master   4h22m   v1.21.1+d8043e1
control-plane-2   Ready                         master   4h23m   v1.21.1+d8043e1

# oc get po -n openshift-cluster-version
NAME                                       READY   STATUS        RESTARTS   AGE
cluster-version-operator-df4858cf7-sm996   1/1     Running       0          122m
cluster-version-operator-df4858cf7-whbpx   1/1     Terminating   0          126m
version--rzg6k-k7x2v                       0/1     Completed     0          3h4m

# oc get pod/cluster-version-operator-df4858cf7-sm996 -ojson | jq -r .spec.volumes[]
{
  "hostPath": {
    "path": "/etc/ssl/certs",
    "type": ""
  },
  "name": "etc-ssl-certs"
}
{
  "hostPath": {
    "path": "/etc/cvo/updatepayloads",
    "type": ""
  },
  "name": "etc-cvo-updatepayloads"
}
{
  "name": "serving-cert",
  "secret": {
    "defaultMode": 420,
    "secretName": "cluster-version-operator-serving-cert"
  }
}
{
  "name": "kube-api-access",
  "projected": {
    "defaultMode": 420,
    "sources": [
      {
        "serviceAccountToken": {
          "expirationSeconds": 3600,
          "path": "token"
        }
      },
      {
        "configMap": {
          "items": [
            {
              "key": "ca.crt",
              "path": "ca.crt"
            }
          ],
          "name": "kube-root-ca.crt"
        }
      },
      {
        "downwardAPI": {
          "items": [
            {
              "fieldRef": {
                "apiVersion": "v1",
                "fieldPath": "metadata.namespace"
              },
              "path": "namespace"
            }
          ]
        }
      }
    ]
  }
}

# oc get pod/cluster-version-operator-df4858cf7-qfg9n -ojson | jq -r .spec.containers[].volumeMounts
[
  {
    "mountPath": "/etc/ssl/certs",
    "name": "etc-ssl-certs",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cvo/updatepayloads",
    "name": "etc-cvo-updatepayloads",
    "readOnly": true
  },
  {
    "mountPath": "/etc/tls/serving-cert",
    "name": "serving-cert",
    "readOnly": true
  },
  {
    "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
    "name": "kube-api-access",
    "readOnly": true
  }
]
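For reference, the check the jq commands above do by hand can also be scripted. The following is a minimal client-go sketch (a hypothetical helper, not part of the CVO or the test tooling; the kubeconfig path is a placeholder) that confirms the running CVO pods mount the serviceaccount token again.

// Hypothetical verification helper: lists CVO pods and reports any container
// that mounts /var/run/secrets/kubernetes.io/serviceaccount.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder path: point this at an admin kubeconfig for the cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("openshift-cluster-version").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pod := range pods.Items {
		for _, c := range pod.Spec.Containers {
			for _, m := range c.VolumeMounts {
				if m.MountPath == "/var/run/secrets/kubernetes.io/serviceaccount" {
					fmt.Printf("%s mounts %s via volume %q\n", pod.Name, m.MountPath, m.Name)
				}
			}
		}
	}
}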
Moving this to the verified state because the upgrade is no longer stuck on CVO pod creation.
*** Bug 2007230 has been marked as a duplicate of this bug. ***
*** Bug 2007229 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759