Bug 1912820
Summary: | openshift-apiserver Available is False with 3 pods not ready for a while during upgrade | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia>
Component: | openshift-apiserver | Assignee: | Luis Sanchez <sanchezl>
Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | akashem, aos-bugs, dgautam, fabian, kewang, mf.flip, mfojtik, rgangwar, sanchezl, sttts, wking
Target Milestone: | --- | Keywords: | Upgrades
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1926867 (view as bug list) | Environment: |
Last Closed: | 2021-07-27 22:35:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1926867, 1946856 | |
Description
Xingxing Xia
2021-01-05 11:30:06 UTC
Please provide pod manifests (including status) and logs of the failing pods at the moment the failure happens; a must-gather taken long afterwards does not help.

Hmm, sorry, I wasn't watching by eye, only running the while-loop script, so I didn't capture logs at that moment. I will collect them manually if I hit it again. I did, however, check the must-gather events.yaml and found some apiserver-76b747b597-* events during the above window 2021-01-05T14:11:02+08:00 ~ 2021-01-05T14:24:26+08:00:

```
$ ~/auto/yaml2json.rb namespaces/openshift-apiserver/core/events.yaml > ~/my/events-upgrade-oas.json
$ jq -r '.items[] | select(.involvedObject.name | test("76b747b597")) | select(.type != "Normal") | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"' ~/my/events-upgrade-oas.json | grep 2021-01-05T06
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-7x2qd Back-off restarting failed container
2021-01-05T06:25:14Z 6 Warning Unhealthy apiserver-76b747b597-7x2qd Readiness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:25:20Z 3 Warning Unhealthy apiserver-76b747b597-7x2qd Liveness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:17:56Z 3 Warning BackOff apiserver-76b747b597-qlfps Back-off restarting failed container
2021-01-05T06:28:01Z 3 Warning Unhealthy apiserver-76b747b597-qlfps Liveness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:28:03Z 6 Warning Unhealthy apiserver-76b747b597-qlfps Readiness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-rmmg6 Back-off restarting failed container
2021-01-05T06:26:40Z 6 Warning Unhealthy apiserver-76b747b597-rmmg6 Readiness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
2021-01-05T06:26:42Z 3 Warning Unhealthy apiserver-76b747b597-rmmg6 Liveness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
```

Yes, saw those too, but these are just the kubelet messages about the probes. The apiserver does not start up. Maybe etcd cannot be reached, the port is occupied, or there is a panic. Interestingly, I also don't see a termination message in the status, which suggests the process is not started at all. What is surprising is that all 3 instances are crash-looping, which suggests that some dependency used by all of them (service network? etcd?) is down. So logs would tell.
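The while-loop script mentioned above is not attached to this report. A minimal sketch of such a watch, which polls the openshift-apiserver pods during the upgrade and saves manifests and previous-container logs the moment a pod stops being Ready (the output file names and the 10-second interval are illustrative, not from the bug report), could look like:

```
# Hypothetical watch loop (not the reporter's actual script): capture
# pod manifests and logs of openshift-apiserver pods as soon as one
# stops reporting Ready during the upgrade.
while true; do
  not_ready=$(oc get pods -n openshift-apiserver \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk '$2 != "True" {print $1}')
  for pod in $not_ready; do
    ts=$(date -u +%Y%m%dT%H%M%SZ)
    oc get pod "$pod" -n openshift-apiserver -o yaml > "${pod}-${ts}.yaml"
    oc describe pod "$pod" -n openshift-apiserver > "${pod}-${ts}.describe.txt"
    # --previous grabs the log of the last terminated container instance
    oc logs --previous "$pod" -n openshift-apiserver -c openshift-apiserver-check-endpoints \
      > "${pod}-${ts}.check-endpoints.previous.log" 2>&1
  done
  sleep 10
done
```

Capturing `--previous` logs matters here because the log of a crashed container instance is otherwise lost once the kubelet restarts it.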
Today I observed another upgrade (matrix: upi-on-gcp/versioned-installer-ovn, FIPS on, upgrading from 4.6.0-0.nightly-2021-01-05-203053 to 4.7.0-0.nightly-2021-01-07-034013). It also reproduced there (therefore changing Severity to Medium), and this time I collected logs at the moment I observed it:

```
oc get po -n openshift-apiserver -o yaml > openshift-apiserver-pods.yaml
# the restarting container is openshift-apiserver-check-endpoints, not openshift-apiserver
oc describe po apiserver-6464ff6474-mrvpp -n openshift-apiserver > apiserver-6464ff6474-mrvpp.oc-describe.txt
oc logs --previous ... -c openshift-apiserver-check-endpoints > openshift-apiserver-check-endpoints.log
```

The check-endpoints log ended with:

```
E0107 07:00:01.449494 1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I0107 07:00:02.840026 1 base_controller.go:72] Caches are synced for check-endpoints
I0107 07:00:02.840163 1 base_controller.go:109] Starting #1 worker of check-endpoints controller
...
I0107 07:02:22.825961 1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
```

These collections are uploaded to http://file.rdu.redhat.com/~xxia/1912820/2020-0107/ ; you can open:
openshift-apiserver-pods.yaml
apiserver-6464ff6474-mrvpp.oc-describe.txt
openshift-apiserver-check-endpoints.log
watch-apiserver-in-upgrade.log

In 4.6, the PodNetworkConnectivityCheck CRD is installed and uninstalled on demand. In 4.7, the CRD is installed permanently. During a 4.6 to 4.7 upgrade there is some contention between the 4.6 controllers that uninstall the CRD and the 4.7 controllers that re-create it for exactly this scenario, but everything settles down once the 4.6 pods (kube-apiserver-operator and openshift-apiserver-operator) have been upgraded.

*** Bug 1889900 has been marked as a duplicate of this bug. ***

All PRs merged.

Tested upgrade from 4.7.12 to 4.8.0-fc.5 (4.8.0-0.nightly-2021-05-21-101954); the reported issue is not seen again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
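As a closing note on the CRD contention described above: one hypothetical way to observe it during a 4.6 to 4.7 upgrade is to poll for the CRD and log when it disappears and reappears. The polling interval and output format below are illustrative and not part of the bug report:

```
# Hypothetical observer: log presence/absence of the PodNetworkConnectivityCheck
# CRD while the 4.6 and 4.7 operator controllers contend over it.
while true; do
  if oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io >/dev/null 2>&1; then
    state="present"
  else
    state="absent"
  fi
  echo "$(date -u +%FT%TZ) CRD ${state}"
  sleep 5
done
```

Once both operators are running 4.7, the CRD should remain present, matching the "settles down" behavior described in the comment above.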