Bug 2006145 - 4.8.12 to 4.9 upgrade hung due to cluster-version-operator pod CrashLoopBackOff: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.0
Assignee: W. Trevor King
QA Contact: Yang Yang
URL:
Whiteboard:
Duplicates: 2007229 2007230
Depends On: 2005581
Blocks:
 
Reported: 2021-09-20 23:41 UTC by W. Trevor King
Modified: 2021-10-18 17:52 UTC
CC: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2005581
Environment:
Last Closed: 2021-10-18 17:51:49 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub openshift/cluster-version-operator pull 661 (open): "Bug 2006145: install/0000_00_cluster-version-operator_03_deployment: Explicit kube-api-access", last updated 2021-09-20 23:50:09 UTC
- Red Hat Product Errata RHSA-2021:3759, last updated 2021-10-18 17:52:07 UTC

Description W. Trevor King 2021-09-20 23:41:16 UTC
+++ This bug was initially created as a clone of Bug #2005581 +++

Description of problem:
When upgrading from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126, the upgrade hung at "Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)". Checking the cluster-version-operator pod shows it is in CrashLoopBackOff with "error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable".
09-18 15:50:07.266  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 15:50:07.266  version   4.8.12    True        True          63s     Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)
......
......
09-18 17:35:04.378  NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
09-18 17:35:04.378  version   4.8.12    True        True          106m    Working towards 4.9.0-0.nightly-2021-09-17-210126: 9 of 734 done (1% complete)
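A rough sketch of how one might confirm the symptom from the CLI (the pod name is the crashing one from this reproduction; it will differ per cluster):

$ oc -n openshift-cluster-version get pods
$ oc -n openshift-cluster-version get pod cluster-version-operator-588cf597dd-vw4wk -o jsonpath='{.status.containerStatuses[0].lastState.terminated.message}'

The first command should show the new-version CVO pod in CrashLoopBackOff, and the second should print the "error creating clients: ..." fatal message quoted below.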


Version-Release number of the following components:
4.8.12 to 4.9.0-0.nightly-2021-09-17-210126

How reproducible:
3

Steps to Reproduce:
1. Upgrade vSphere cluster from 4.8.12 to 4.9.0-0.nightly-2021-09-17-210126
2. Check the cluster-version-operator pod logs:
$ oc -n openshift-cluster-version logs cluster-version-operator-588cf597dd-vw4wk
I0918 09:22:13.359349       1 start.go:21] ClusterVersionOperator 4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0918 09:22:13.359611       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
k8s.io/klog/v2.stacks(0xc000012001, 0xc000468000, 0xb8, 0xd0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1021 +0xb9
k8s.io/klog/v2.(*loggingT).output(0x2ad7120, 0xc000000003, 0x0, 0x0, 0xc0001d0380, 0x22e7998, 0x8, 0x18, 0x0)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:970 +0x191
k8s.io/klog/v2.(*loggingT).printf(0x2ad7120, 0xc000000003, 0x0, 0x0, 0x0, 0x0, 0x1c6ecc9, 0x9, 0xc000606490, 0x1, ...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:751 +0x191
k8s.io/klog/v2.Fatalf(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1509
main.init.3.func1(0xc00045e000, 0xc0001d02a0, 0x0, 0x7)
	/go/src/github.com/openshift/cluster-version-operator/cmd/start.go:24 +0x1ed
github.com/spf13/cobra.(*Command).execute(0xc00045e000, 0xc0001d0230, 0x7, 0x7, 0xc00045e000, 0xc0001d0230)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:854 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x2ac3380, 0xc000000180, 0xc000066740, 0x46ef85)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:958 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/cluster-version-operator/vendor/github.com/spf13/cobra/command.go:895
main.main()
	/go/src/github.com/openshift/cluster-version-operator/cmd/main.go:26 +0x53

goroutine 6 [chan receive]:
k8s.io/klog/v2.(*loggingT).flushDaemon(0x2ad7120)
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:1164 +0x8b
created by k8s.io/klog/v2.init.0
	/go/src/github.com/openshift/cluster-version-operator/vendor/k8s.io/klog/v2/klog.go:418 +0xdf

Actual results:
Upgrade hung

Expected results:
Upgrade should be successful

Additional info:
A must-gather from another cluster shows the same failure:
http://10.73.131.57:9000/openshift-must-gather/2021-09-18-04-56-27/must-gather.local.8528734579890313358.tar.gz?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=openshift%2F20210918%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210918T045641Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f8156800c652db3924b4c45ff7f48f3ee98f9a8e7d1d0e8703735ff8dfaf7b10

--- Additional comment from W. Trevor King on 2021-09-20 22:34:41 UTC ---

Same thing going on in CI, e.g. [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.name | startswith("cluster-version-operator-")).status.containerStatuses[] | .state.waiting.reason + " " + (.restartCount | tostring) + "\n\n" + .lastState.terminated.message'
CrashLoopBackOff 34

4.9.0-202109161743.p0.git.43d63b8.assembly.stream-43d63b8
F0920 13:23:23.565439       1 start.go:24] error: error creating clients: invalid configuration: no configuration has been provided, try setting KUBERNETES_MASTER environment variable
goroutine 1 [running]:
...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344

--- Additional comment from W. Trevor King on 2021-09-20 22:42:39 UTC ---

Comparing with a healthy 4.8.11 -> 4.9.0-rc.1 job [1], the issue is the recent volume change from bug 2002834 (backported to 4.9 as bug 2004568):

$ diff -u \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784/artifacts/launch/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]') \
>   <(curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1439895817182777344/artifacts/e2e-aws-upgrade/deployments.json | jq '.items[] | select(.metadata.name == "cluster-version-operator").spec.template.spec.containers[].volumeMounts[]')
--- /dev/fd/63  2021-09-20 15:39:31.090945777 -0700
+++ /dev/fd/62  2021-09-20 15:39:31.092945777 -0700
@@ -13,8 +13,3 @@
   "name": "serving-cert",
   "readOnly": true
 }
-{
-  "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
-  "name": "kube-api-access",
-  "readOnly": true
-}

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1437413099530358784
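A minimal sketch of how to check whether a live cluster's CVO deployment carries the explicit kube-api-access mount that the attached PR adds (deployment and volume names as used in the manifests above):

$ oc -n openshift-cluster-version get deployment cluster-version-operator -o json | jq '.spec.template.spec.containers[].volumeMounts[] | select(.name == "kube-api-access")'

An empty result would mean the container does not mount the projected service-account token at /var/run/secrets/kubernetes.io/serviceaccount, which is the mount shown missing in the diff above.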

Comment 3 Yang Yang 2021-09-22 07:24:50 UTC
Verifying with 4.9.0-0.nightly-2021-09-21-215600. The CVO pod is rolled out to 4.9 successfully. There are two issues in this upgrade testing, but they seem unrelated to this bug.

1. Two nodes go NotReady
2. The old CVO pod is stuck in Terminating status and doesn't get removed (a quick check is sketched after the pod listing below)

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.12    True        True          3h3m    Working towards 4.9.0-0.nightly-2021-09-21-215600: 71 of 734 done (9% complete)

# oc get node
NAME              STATUS                        ROLES    AGE     VERSION
compute-0         NotReady,SchedulingDisabled   worker   4h13m   v1.21.1+d8043e1
compute-1         Ready                         worker   4h13m   v1.21.1+d8043e1
control-plane-0   Ready                         master   4h23m   v1.21.1+d8043e1
control-plane-1   NotReady,SchedulingDisabled   master   4h22m   v1.21.1+d8043e1
control-plane-2   Ready                         master   4h23m   v1.21.1+d8043e1


# oc get po -n openshift-cluster-version
NAME                                       READY   STATUS        RESTARTS   AGE
cluster-version-operator-df4858cf7-sm996   1/1     Running       0          122m
cluster-version-operator-df4858cf7-whbpx   1/1     Terminating   0          126m
version--rzg6k-k7x2v                       0/1     Completed     0          3h4m
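One way to check whether the stuck Terminating pod sits on one of the NotReady nodes, which would be a likely reason it is not being removed (a sketch; the pod name is the Terminating one listed above):

# oc -n openshift-cluster-version get pod cluster-version-operator-df4858cf7-whbpx -o wide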

# oc get pod/cluster-version-operator-df4858cf7-sm996 -ojson | jq -r .spec.volumes[]
{
  "hostPath": {
    "path": "/etc/ssl/certs",
    "type": ""
  },
  "name": "etc-ssl-certs"
}
{
  "hostPath": {
    "path": "/etc/cvo/updatepayloads",
    "type": ""
  },
  "name": "etc-cvo-updatepayloads"
}
{
  "name": "serving-cert",
  "secret": {
    "defaultMode": 420,
    "secretName": "cluster-version-operator-serving-cert"
  }
}
{
  "name": "kube-api-access",
  "projected": {
    "defaultMode": 420,
    "sources": [
      {
        "serviceAccountToken": {
          "expirationSeconds": 3600,
          "path": "token"
        }
      },
      {
        "configMap": {
          "items": [
            {
              "key": "ca.crt",
              "path": "ca.crt"
            }
          ],
          "name": "kube-root-ca.crt"
        }
      },
      {
        "downwardAPI": {
          "items": [
            {
              "fieldRef": {
                "apiVersion": "v1",
                "fieldPath": "metadata.namespace"
              },
              "path": "namespace"
            }
          ]
        }
      }
    ]
  }
}

# oc get pod/cluster-version-operator-df4858cf7-qfg9n -ojson | jq -r .spec.containers[].volumeMounts
[
  {
    "mountPath": "/etc/ssl/certs",
    "name": "etc-ssl-certs",
    "readOnly": true
  },
  {
    "mountPath": "/etc/cvo/updatepayloads",
    "name": "etc-cvo-updatepayloads",
    "readOnly": true
  },
  {
    "mountPath": "/etc/tls/serving-cert",
    "name": "serving-cert",
    "readOnly": true
  },
  {
    "mountPath": "/var/run/secrets/kubernetes.io/serviceaccount",
    "name": "kube-api-access",
    "readOnly": true
  }
]
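As a final sanity check (a sketch, using the deployment name from the manifests above), the projected token and CA can be confirmed inside the running container:

# oc -n openshift-cluster-version exec deploy/cluster-version-operator -- ls /var/run/secrets/kubernetes.io/serviceaccount

Given the projected volume shown above, this should list ca.crt, namespace, and token.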

Comment 4 Yang Yang 2021-09-22 08:34:13 UTC
Moving it to the verified state because the upgrade is no longer stuck on CVO pod creation.

Comment 5 Lalatendu Mohanty 2021-09-29 16:53:43 UTC
*** Bug 2007230 has been marked as a duplicate of this bug. ***

Comment 6 Lalatendu Mohanty 2021-09-29 16:54:45 UTC
*** Bug 2007229 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2021-10-18 17:51:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

