Bug 1912820 - openshift-apiserver Available is False with 3 pods not ready for a while during upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Luis Sanchez
QA Contact: Xingxing Xia
URL:
Whiteboard:
Duplicates: 1889900 (view as bug list)
Depends On:
Blocks: 1926867 1946856
 
Reported: 2021-01-05 11:30 UTC by Xingxing Xia
Modified: 2022-08-22 03:08 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1926867 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:35:38 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 1036 0 None closed Bug 1926867: competing connectivitycheckcontrollers cause pod restarts during upgrades 2021-05-20 15:44:12 UTC
Github openshift cluster-network-operator pull 970 0 None closed Bug 1912820: competing connectivitycheckcontrollers cause pod restarts during upgrades 2021-05-20 15:44:09 UTC
Github openshift cluster-openshift-apiserver-operator pull 430 0 None closed Bug 1912820: competing connectivitycheckcontrollers cause pod restarts during upgrades 2021-05-20 15:44:10 UTC
Github openshift library-go pull 987 0 None closed Bug 1912820: competing connectivitycheckcontrollers cause pod restarts during upgrades 2021-05-20 15:44:14 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:36:04 UTC

Description Xingxing Xia 2021-01-05 11:30:06 UTC
Description of problem:
openshift-apiserver Available is False with 3 pods not ready for a while during upgrade

Version-Release number of selected component (if applicable):
Upgrade from 4.6.0-0.nightly-2021-01-03-162024 to 4.7.0-0.nightly-2021-01-04-215816 (matrix upi-on-baremetal/versioned-installer-openstack-all_in_one)

How reproducible:
Not sure

Steps to Reproduce:
1. Launch a 4.6.0-0.nightly-2021-01-03-162024 env
2. Upgrade it to 4.7.0-0.nightly-2021-01-04-215816; during the upgrade, run a while-loop script, watch-apiserver-in-upgrade.sh, that repeatedly checks the `oc get project.project` command (a minimal sketch of the loop is shown below): ./watch-apiserver-in-upgrade.sh | tee watch-apiserver-in-upgrade.log
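
A minimal sketch of such a watch loop, for illustration only (the actual script is maintained in test case OCP-34223 and may differ):

#!/bin/bash
# Sketch of a watch loop: poll the project API once per second and log a
# timestamped line for every failure.
while true; do
  ts=$(date +%FT%T%:z)
  if oc get project.project > /dev/null 2>&1; then
    echo "$ts oc get project.project succeeded"
  else
    oc get project.project 2>&1 | tail -n 1   # show the server error, e.g. ServiceUnavailable
    echo "$ts oc get project.project failed"
  fi
  sleep 1
done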

Actual results:
2. The upgrade succeeded, but the watch script recorded failures:
$ watch-apiserver-in-upgrade.sh | tee watch-apiserver-in-upgrade.log
Checking if upgrade started ...................................
2021-01-05T14:08:40+08:00, upgrade started
...
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:11:02+08:00 oc get project.project failed
...
2021-01-05T14:17:57+08:00 oc get project.project failed
...
apiserver-76b747b597-7x2qd   1/2   NotReady   1     3h21m   10.129.0.6    weinliu46upgrade-t4k5p-control-plane-1   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-qlfps   1/2   NotReady   1     3h19m   10.130.0.7    weinliu46upgrade-t4k5p-control-plane-2   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-rmmg6   1/2   NotReady   1     3h16m   10.128.0.11   weinliu46upgrade-t4k5p-control-plane-0   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
openshift-apiserver   4.6.0-0.nightly-2021-01-03-162024   False   False   False   2s
...
weinliu46upgrade-t4k5p-control-plane-0   Ready   master,worker   3h54m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-1   Ready   master,worker   3h55m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-2   Ready   master,worker   3h55m   v1.19.0+9c69bdc
2021-01-05T14:17:58+08:00 oc get cm succeeded
version   4.6.0-0.nightly-2021-01-03-162024   True   True   9m24s   Working towards 4.7.0-0.nightly-2021-01-04-215816: 14% complete
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:18:00+08:00 oc get project.project failed
...
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:24:26+08:00 oc get project.project failed
apiserver-76b747b597-7x2qd   1/2   CrashLoopBackOff   2     3h28m   10.129.0.6    weinliu46upgrade-t4k5p-control-plane-1   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-qlfps   1/2   CrashLoopBackOff   2     3h25m   10.130.0.7    weinliu46upgrade-t4k5p-control-plane-2   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-rmmg6   1/2   CrashLoopBackOff   2     3h23m   10.128.0.11   weinliu46upgrade-t4k5p-control-plane-0   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
openshift-apiserver   4.6.0-0.nightly-2021-01-03-162024   False   False   False   24s
weinliu46upgrade-t4k5p-control-plane-0   Ready   master,worker   4h1m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-1   Ready   master,worker   4h1m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-2   Ready   master,worker   4h1m   v1.19.0+9c69bdc
version   4.6.0-0.nightly-2021-01-03-162024   True   True   15m   Working towards 4.7.0-0.nightly-2021-01-04-215816: 17% complete
...

$ grep "failed" watch-apiserver-in-upgrade.log # totally 13 count
2021-01-05T14:11:02+08:00 oc get project.project failed
...
2021-01-05T14:24:26+08:00 oc get project.project failed

Expected results:
The upgrade should be zero-disruption for OAS (openshift-apiserver).

Additional info:
The while-loop script watch-apiserver-in-upgrade.sh is part of test case OCP-34223 "Check Zero-disruption upgrade for OAS and KAS". This bug was found while running that case.

Comment 2 Stefan Schimanski 2021-01-05 11:50:15 UTC
Please provide pod manifests (including status) and logs of the failing pods at the moment the failures happen (a must-gather collected long afterwards does not help).

Comment 3 Xingxing Xia 2021-01-05 12:32:37 UTC
Hmm, sorry, I only used the while-loop script to do the watch rather than watching with my own eyes, so I didn't capture the logs at the moment it happened. I will collect them manually if I hit it next time.
Still, I checked the must-gather events.yaml and found some apiserver-76b747b597-* events during the window 2021-01-05T14:11:02+08:00 ~ 2021-01-05T14:24:26+08:00 above:
$ ~/auto/yaml2json.rb namespaces/openshift-apiserver/core/events.yaml > ~/my/events-upgrade-oas.json

$ jq -r '.items[] | select(.involvedObject.name | test("76b747b597")) | select(.type != "Normal") | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"' ~/my/events-upgrade-oas.json | grep 2021-01-05T06
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-7x2qd Back-off restarting failed container
2021-01-05T06:25:14Z 6 Warning Unhealthy apiserver-76b747b597-7x2qd Readiness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:25:20Z 3 Warning Unhealthy apiserver-76b747b597-7x2qd Liveness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:17:56Z 3 Warning BackOff apiserver-76b747b597-qlfps Back-off restarting failed container
2021-01-05T06:28:01Z 3 Warning Unhealthy apiserver-76b747b597-qlfps Liveness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:28:03Z 6 Warning Unhealthy apiserver-76b747b597-qlfps Readiness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-rmmg6 Back-off restarting failed container
2021-01-05T06:26:40Z 6 Warning Unhealthy apiserver-76b747b597-rmmg6 Readiness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
2021-01-05T06:26:42Z 3 Warning Unhealthy apiserver-76b747b597-rmmg6 Liveness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
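
If this reproduces again, the same jq filter can be run against live events, without the must-gather/yaml2json conversion step (assuming jq is available locally):

$ oc get events -n openshift-apiserver -o json \
    | jq -r '.items[] | select(.type != "Normal") | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"'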

Comment 4 Stefan Schimanski 2021-01-05 12:48:49 UTC
Yes, I saw those too, but they are just the kubelet messages about the probes. The apiserver does not start up. Maybe etcd cannot be reached, or the port is occupied, or there is a panic. Interestingly, I also don't see a termination message in the status, which suggests the process is not being started at all.
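
For reference, whatever termination details do exist can be read directly from the pod status (generic example; substitute a real pod name):

$ oc get pod <apiserver-pod> -n openshift-apiserver \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated}{"\n"}{end}'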

Comment 6 Stefan Schimanski 2021-01-05 16:32:19 UTC
What is surprising is that all 3 instances are crash-looping. This suggests that some dependency used by all of them (service network? etcd?) is down. The logs would tell.

Comment 7 Xingxing Xia 2021-01-07 11:18:20 UTC
Today I observed another upgrade (matrix: upi-on-gcp/versioned-installer-ovn, FIPS on, upgrading from 4.6.0-0.nightly-2021-01-05-203053 to 4.7.0-0.nightly-2021-01-07-034013).
The issue also reproduced there (therefore changing Severity to Medium).
I collected logs at the moment I observed it:
oc get po -n openshift-apiserver -o yaml > openshift-apiserver-pods.yaml # found that the restarting container is openshift-apiserver-check-endpoints, not openshift-apiserver
oc describe po apiserver-6464ff6474-mrvpp -n openshift-apiserver > apiserver-6464ff6474-mrvpp.oc-describe.txt
oc logs --previous ... -c openshift-apiserver-check-endpoints > openshift-apiserver-check-endpoints.log # it ended with the lines below
...
E0107 07:00:01.449494       1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I0107 07:00:02.840026       1 base_controller.go:72] Caches are synced for check-endpoints
I0107 07:00:02.840163       1 base_controller.go:109] Starting #1 worker of check-endpoints controller ...
I0107 07:02:22.825961       1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

These collections were uploaded to http://file.rdu.redhat.com/~xxia/1912820/2020-0107/ ; you can open:
openshift-apiserver-pods.yaml
apiserver-6464ff6474-mrvpp.oc-describe.txt 
openshift-apiserver-check-endpoints.log
watch-apiserver-in-upgrade.log

Comment 8 Luis Sanchez 2021-01-13 15:45:59 UTC
In 4.6, the PodNetworkConnectivityCheck CRD is installed/uninstalled on demand. In 4.7 the CRD is installed permanently. During a 4.6 to 4.7 upgrade there will be some contention between the 4.6 controllers that uninstall the CRD, and the 4.7 controllers that re-create the CRD for just this scenario, but eventually it all settles down after the 4.6 pods (kube-apiserver-operator and openshift-apiserver-operator) have been upgraded.
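
One way to observe this contention during a 4.6-to-4.7 upgrade is to watch the CRD itself flip between present and absent (illustrative loop only):

while true; do
  echo -n "$(date -u +%FT%TZ) "
  oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io --no-headers 2>&1 | head -n 1
  sleep 5
done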

Comment 9 Luis Sanchez 2021-01-27 17:54:18 UTC
*** Bug 1889900 has been marked as a duplicate of this bug. ***

Comment 12 Stefan Schimanski 2021-05-20 15:45:32 UTC
All PRs merged.

Comment 14 Xingxing Xia 2021-05-25 09:18:50 UTC
Tested an upgrade from 4.7.12 to 4.8.0-fc.5 (4.8.0-0.nightly-2021-05-21-101954); the reported issue was not seen again.

Comment 17 errata-xmlrpc 2021-07-27 22:35:38 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

