Bug 2005581 - 4.8.12 to 4.9 upgrade hung due to cluster-version-operator pod CrashLoopBackOff: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.10.0
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Duplicates: 2007228
Depends On:
Blocks: 2006145
Reported: 2021-09-18 10:26 UTC by Wei Duan
Modified: 2022-03-10 16:12 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2006145
Environment:
Last Closed: 2022-03-10 16:11:55 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-version-operator pull 660 | 0 | None | open | Bug 2005581: install/0000_00_cluster-version-operator_03_deployment: Explicit kube-api-access | 2021-09-20 23:05:24 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-10 16:12:12 UTC

Description Wei Duan 2021-09-18 10:26:13 UTC
Description of problem:
When upgrading from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126, the upgrade hung at "Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)". The cluster-version-operator pod is in CrashLoopBackOff with the error "error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable".
09-18 15:50:07.266  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 15:50:07.266  version   4.8.12    True        True          63s     Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)
......
......
09-18 17:35:04.378  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 17:35:04.378  version   4.8.12    True        True          106m    Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)
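The two snapshots above report the same "9 of 734 done" count roughly 105 minutes apart. A minimal Python sketch of spotting that kind of stall from the STATUS text follows. The names parse_progress and is_stalled are made up for illustration, and the regular expression assumes the message format quoted in this report.

```python
import re

# Matches the progress fragment of a ClusterVersion STATUS message, e.g.
# "Working towards X: 9 of 734 done (1% complete)".
# The format is an assumption based on the output quoted in this report.
PROGRESS_RE = re.compile(r"Working towards (\S+): (\d+) of (\d+) done")


def parse_progress(status):
    """Return (target, done, total), or None if the text has no progress info."""
    match = PROGRESS_RE.search(status)
    if match is None:
        return None
    target, done, total = match.groups()
    return target, int(done), int(total)


def is_stalled(earlier, later):
    """True if both snapshots target the same release and the done count did not grow."""
    first = parse_progress(earlier)
    second = parse_progress(later)
    if first is None or second is None:
        return False
    return first[0] == second[0] and second[1] <= first[1]
```

The comparison ignores how much time passed between the snapshots, so the caller would still need to decide how long an unchanged count has to persist before treating it as a hang.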


Version-Release number of the following components:
4.8.12 to 4.9.0-0.nightly-2021-09-17-210126

How reproducible:
3

Steps to Reproduce:
1. Upgrade a vSphere cluster from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126
2. Check the logs of the cluster-version-operator pod:
$ oc -n openshift-cluster-version logs cluster-version-operator-588cf597dd-vw4wk
I0918 09:22:13.359349       1 start.go:21] ClusterVersionOperator 4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0918 09:22:13.359611       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000468000, 0xb8, 0xd0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x2ad7120, 0xc000000003, 0x0, 0x0, 0xc0001d0380, 0x22e7998, 0x8, 0x18, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x2ad7120, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1c6ecc9, 0x9, 0xc000606490, 0x1, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1509
main.init.3.func1(0xc00045e000, 0xc0001d02a0, 0x0, 0x7)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc00045e000, 0xc0001d0230, 0x7, 0x7, 0xc00045e000, 0xc0001d0230)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x2ac3380, 0xc000000180, 0xc000066740, 0x46ef85)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x2ad7120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf

Actual results:
Upgrade hung

Expected results:
Upgrade should be successful

Additional info:
Must-gather from another cluster with the same failure:
http://10.73.131.57:9000/openshift-must-gather/2021-09-18-04-56-27/must-gather.local.8528734579890313358.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openshift%2F20210918%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210918T045641Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f8156800c652db3924b4c45ff7f48f3ee98f9a8e7d1d0e8703735ff8dfaf7b10
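The fatal message in the log above is consistent with the client finding neither a kubeconfig nor in-cluster credentials. As a rough sketch, assuming the usual in-cluster convention (KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT in the environment, plus token and ca.crt files under /var/run/secrets/kubernetes.io/serviceaccount), the following Python check lists what is missing inside a container. The function name and its parameters are hypothetical, and it does not reproduce the CVO's actual client setup.

```python
import os

# Conventional service account mount path; it also appears in the
# volumeMounts quoted elsewhere in this report.
SERVICE_ACCOUNT_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"


def in_cluster_problems(env=None, sa_dir=SERVICE_ACCOUNT_DIR):
    """Return a list of human-readable reasons in-cluster config would be unavailable."""
    env = os.environ if env is None else env
    problems = []
    # The API server address is normally injected through these variables.
    for name in ("KUBERNETES_SERVICE_HOST", "KUBERNETES_SERVICE_PORT"):
        if not env.get(name):
            problems.append("environment variable %s is not set" % name)
    # The credentials are normally provided as files in the mounted volume.
    for filename in ("token", "ca.crt"):
        path = os.path.join(sa_dir, filename)
        if not os.path.isfile(path):
            problems.append("file %s is missing" % path)
    return problems


if __name__ == "__main__":
    for problem in in_cluster_problems():
        print(problem)
```

An empty result means the basic prerequisites are in place; it says nothing about whether the token is valid.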

Comment 1 W. Trevor King 2021-09-20 22:34:41 UTC
Same thing going on in CI, e.g. [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).status.containerStatuses[] | .state.waiting.reason + " " + (.restartCount | tostring) + "\n\n" + .lastState.terminated.message'
CrashLoopBackOff 34

4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0920 13:23:23.565439       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344
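For anyone who prefers a script over the jq one-liner above, here is a rough Python equivalent that pulls the waiting reason, restart count, and last termination message for the CVO containers out of a pods.json document. The function name is made up, and SAMPLE_PODS is a heavily trimmed, hypothetical stand-in for the real artifact.

```python
def cvo_container_summaries(pods):
    """Yield (waiting reason, restart count, last termination message) per CVO container."""
    for pod in pods.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        if not name.startswith("cluster-version-operator-"):
            continue
        for status in pod.get("status", {}).get("containerStatuses", []):
            # Either sub-object may be absent or null, so fall back to empty dicts.
            waiting = status.get("state", {}).get("waiting") or {}
            terminated = status.get("lastState", {}).get("terminated") or {}
            yield (
                waiting.get("reason", ""),
                status.get("restartCount", 0),
                terminated.get("message", ""),
            )


# Hypothetical sample data shaped like a pod list; not the real CI artifact.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "cluster-version-operator-588cf597dd-vw4wk"},
            "status": {
                "containerStatuses": [
                    {
                        "restartCount": 34,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                        "lastState": {
                            "terminated": {"message": "error creating clients"}
                        },
                    }
                ]
            },
        },
        {
            "metadata": {"name": "version--gvs6t--1-hxl8w"},
            "status": {"containerStatuses": [{"restartCount": 0, "state": {}}]},
        },
    ]
}

if __name__ == "__main__":
    for reason, restarts, message in cvo_container_summaries(SAMPLE_PODS):
        print(reason, restarts)
        print(message)
```

To run it against a downloaded artifact, load the file with json.load and pass the resulting dictionary to the function.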

Comment 2 W. Trevor King 2021-09-20 22:42:39 UTC
Comparing with a healthy 4.8.11 -> 4.9.0-rc.1 job [1], the issue is the recent volume change from bug 2002834 (backported to 4.9 as bug 2004568):

$ diff -u \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784/artifacts/launch/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]') \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]')
--- /dev/fd/63  2021-09-20 15:39:31.090945777 -0700
+++ /dev/fd/62  2021-09-20 15:39:31.092945777 -0700
@@ -13,8 +13,3 @@
   "name": "serving-cert",
   "readOnly": true
 }
-{
-  "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
-  "name": "kube-api-access",
-  "readOnly": true
-}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784
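The same comparison can be sketched in Python: given the volumeMounts lists of the healthy and the failing deployment, report the mounts that exist only in the healthy one. The function names are hypothetical, and the two sample lists are assembled from the mount entries quoted in this report rather than read from the CI artifacts.

```python
def mount_key(mount):
    """Identify a mount by volume name and mount path."""
    return (mount["name"], mount["mountPath"])


def missing_mounts(healthy, failing):
    """Return the mounts present in `healthy` but absent from `failing`."""
    present = {mount_key(mount) for mount in failing}
    return [mount for mount in healthy if mount_key(mount) not in present]


# Sample lists built from the entries quoted in this report (illustrative only).
FAILING_MOUNTS = [
    {"mountPath": "/etc/ssl/certs", "name": "etc-ssl-certs", "readOnly": True},
    {"mountPath": "/etc/cvo/updatepayloads", "name": "etc-cvo-updatepayloads", "readOnly": True},
    {"mountPath": "/etc/tls/serving-cert", "name": "serving-cert", "readOnly": True},
]
HEALTHY_MOUNTS = FAILING_MOUNTS + [
    {
        "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
        "name": "kube-api-access",
        "readOnly": True,
    },
]

if __name__ == "__main__":
    for mount in missing_mounts(HEALTHY_MOUNTS, FAILING_MOUNTS):
        print("missing:", mount["name"], "at", mount["mountPath"])
```

Keying on name and path rather than comparing whole dictionaries keeps the check insensitive to unrelated fields such as readOnly.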

Comment 4 Yang Yang 2021-09-22 07:01:57 UTC
Not reproduced on an upgrade from 4.9.0-rc.1 to 4.10.0-0.nightly-2021-09-21-181111.

Comment 5 Yang Yang 2021-09-22 09:04:10 UTC
Reproduced on an upgrade from 4.9.0-0.nightly-2021-09-21-215600 to 4.10.0-0.nightly-2021-09-21-102830. The 4.9.0-0.nightly-2021-09-21-215600 build contains the fix, in which the CVO explicitly sets the kube-api-access volume, while 4.10.0-0.nightly-2021-09-21-102830 does not.

Comment 6 Yang Yang 2021-09-22 09:05:03 UTC
Procedure to reproduce it:

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-21-215600   True        True          5m32s   Working towards 4.10.0-0.nightly-2021-09-21-102830: 18 of 739 done (2% complete)

# oc get po -n openshift-cluster-version
NAME                                        READY   STATUS             RESTARTS      AGE
cluster-version-operator-55cfc7966d-vgzt4   0/1     CrashLoopBackOff   4 (43s ago)   2m11s
version--gvs6t--1-hxl8w                     0/1     Completed          0             5m42s

# oc logs pod/cluster-version-operator-55cfc7966d-vgzt4 -n openshift-cluster-version
I0922 08:59:29.116670       1 start.go:21] ClusterVersionOperator 4.10.0-202109162043.p0.git.e4eefca.assembly.stream-e4eefca
F0922 08:59:29.116990       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc0001ba1c0, 0xb8, 0xd1)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x2ad7120, 0xc000000003, 0x0, 0x0, 0xc00013d9d0, 0x22e7978, 0x8, 0x18, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x2ad7120, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1c6ecc9, 0x9, 0xc000128fe0, 0x1, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1509
main.init.3.func1(0xc0004d6000, 0xc00013d960, 0x0, 0x7)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc0004d6000, 0xc00013d8f0, 0x7, 0x7, 0xc0004d6000, 0xc00013d8f0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x2ac3380, 0xc000000180, 0xc00005c740, 0x46ef85)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x2ad7120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf

Comment 10 Yang Yang 2021-09-22 12:03:44 UTC
Verified with 4.10.0-0.nightly-2021-09-22-061245; the upgrade passed.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-09-21-215600   True        True          11m     Working towards 4.10.0-0.nightly-2021-09-22-061245: 95 of 739 done (12% complete)

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-09-22-061245   True        False         20m     Cluster version is 4.10.0-0.nightly-2021-09-22-061245

# oc get pod/cluster-version-operator-c966c8955-nlc6v -ojson -n openshift-cluster-version | jq -r .spec.volumes[]
{
  "hostPath": {
    "path": "/etc/ssl/certs",
    "type": ""
  },
  "name": "etc-ssl-certs"
}
{
  "hostPath": {
    "path": "/etc/cvo/updatepayloads",
    "type": ""
  },
  "name": "etc-cvo-updatepayloads"
}
{
  "name": "serving-cert",
  "secret": {
    "defaultMode": 420,
    "secretName": "cluster-version-operator-serving-cert"
  }
}
{
  "name": "kube-api-access",
  "projected": {
    "defaultMode": 420,
    "sources": [
      {
        "serviceAccountToken": {
          "expirationSeconds": 3600,
          "path": "token"
        }
      },
      {
        "configMap": {
          "items": [
            {
              "key": "ca.crt",
              "path": "ca.crt"
            }
          ],
          "name": "kube-root-ca.crt"
        }
      },
      {
        "downwardAPI": {
          "items": [
            {
              "fieldRef": {
                "apiVersion": "v1",
                "fieldPath": "metadata.namespace"
              },
              "path": "namespace"
            }
          ]
        }
      }
    ]
  }
}

# oc get pod/cluster-version-operator-c966c8955-nlc6v -ojson -n openshift-cluster-version | jq -r .spec.containers[].volumeMounts
[
  {
    "mountPath": "/etc/ssl/certs",
    "name": "etc-ssl-certs",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cvo/updatepayloads",
    "name": "etc-cvo-updatepayloads",
    "readOnly": true
  },
  {
    "mountPath": "/etc/tls/serving-cert",
    "name": "serving-cert",
    "readOnly": true
  },
  {
    "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
    "name": "kube-api-access",
    "readOnly": true
  }
]

Moving it to verified state.
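The manual check above (a projected kube-api-access volume plus a matching mount) could be automated along these lines. This is only a sketch: the function name is hypothetical, the expected source kinds are simply the three seen in the verified pod output, and SAMPLE_SPEC is a trimmed, illustrative pod spec.

```python
EXPECTED_MOUNT_PATH = "/var/run/secrets/kubernetes.io/serviceaccount"
# Source kinds seen in the projected volume of the verified pod.
EXPECTED_SOURCES = {"serviceAccountToken", "configMap", "downwardAPI"}


def has_explicit_kube_api_access(pod_spec):
    """True if the spec declares a projected kube-api-access volume and every container mounts it."""
    volumes = {volume["name"]: volume for volume in pod_spec.get("volumes", [])}
    volume = volumes.get("kube-api-access")
    if volume is None or "projected" not in volume:
        return False
    # Collect the kind of each projected source (the single key of each entry).
    kinds = set()
    for source in volume["projected"].get("sources", []):
        kinds.update(source.keys())
    if not EXPECTED_SOURCES <= kinds:
        return False
    containers = pod_spec.get("containers", [])
    if not containers:
        return False
    for container in containers:
        mounted = any(
            mount["name"] == "kube-api-access"
            and mount["mountPath"] == EXPECTED_MOUNT_PATH
            for mount in container.get("volumeMounts", [])
        )
        if not mounted:
            return False
    return True


# Trimmed, illustrative pod spec modelled on the output quoted above.
SAMPLE_SPEC = {
    "volumes": [
        {
            "name": "kube-api-access",
            "projected": {
                "sources": [
                    {"serviceAccountToken": {"expirationSeconds": 3600, "path": "token"}},
                    {"configMap": {"name": "kube-root-ca.crt"}},
                    {"downwardAPI": {"items": []}},
                ]
            },
        }
    ],
    "containers": [
        {
            "volumeMounts": [
                {"mountPath": EXPECTED_MOUNT_PATH, "name": "kube-api-access", "readOnly": True}
            ]
        }
    ],
}

if __name__ == "__main__":
    print(has_explicit_kube_api_access(SAMPLE_SPEC))
```

The input would be the .spec object of the pod JSON returned by oc get ... -ojson.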

Comment 11 W. Trevor King 2021-09-22 19:44:02 UTC
Bug 2006145 fixed this in 4.9 before GA, and it didn't apply to 4.8, so we won't need to block any edges in graph-data on this bug, and I'm dropping UpgradeBlocker.

Comment 12 Lalatendu Mohanty 2021-09-23 12:43:49 UTC
*** Bug 2007228 has been marked as a duplicate of this bug. ***

Comment 16 errata-xmlrpc 2022-03-10 16:11:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

