Description of problem:
openshift-apiserver Available is False, with 3 pods not ready for a while, during upgrade.

Version-Release number of selected component (if applicable):
Upgrade from 4.6.0-0.nightly-2021-01-03-162024 to 4.7.0-0.nightly-2021-01-04-215816 (matrix: upi-on-baremetal/versioned-installer-openstack-all_in_one)

How reproducible:
Not sure

Steps to Reproduce:
1. Launch a 4.6.0-0.nightly-2021-01-03-162024 env.
2. Upgrade it to 4.7.0-0.nightly-2021-01-04-215816, during which the while-loop script watch-apiserver-in-upgrade.sh is run to watch the `oc get project.project` command:
   ./watch-apiserver-in-upgrade.sh | tee watch-apiserver-in-upgrade.log

Actual results:
2. The upgrade succeeded, but the watch script recorded failures:
$ watch-apiserver-in-upgrade.sh | tee watch-apiserver-in-upgrade.log
Checking if upgrade started ...................................
2021-01-05T14:08:40+08:00, upgrade started
...
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:11:02+08:00 oc get project.project failed
...
2021-01-05T14:17:57+08:00 oc get project.project failed
...
apiserver-76b747b597-7x2qd   1/2   NotReady   1   3h21m   10.129.0.6    weinliu46upgrade-t4k5p-control-plane-1   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-qlfps   1/2   NotReady   1   3h19m   10.130.0.7    weinliu46upgrade-t4k5p-control-plane-2   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-rmmg6   1/2   NotReady   1   3h16m   10.128.0.11   weinliu46upgrade-t4k5p-control-plane-0   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
openshift-apiserver   4.6.0-0.nightly-2021-01-03-162024   False   False   False   2s
...
weinliu46upgrade-t4k5p-control-plane-0   Ready   master,worker   3h54m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-1   Ready   master,worker   3h55m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-2   Ready   master,worker   3h55m   v1.19.0+9c69bdc
2021-01-05T14:17:58+08:00 oc get cm succeeded
version   4.6.0-0.nightly-2021-01-03-162024   True   True   9m24s   Working towards 4.7.0-0.nightly-2021-01-04-215816: 14% complete
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:18:00+08:00 oc get project.project failed
...
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get projects.project.openshift.io)
2021-01-05T14:24:26+08:00 oc get project.project failed
apiserver-76b747b597-7x2qd   1/2   CrashLoopBackOff   2   3h28m   10.129.0.6    weinliu46upgrade-t4k5p-control-plane-1   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-qlfps   1/2   CrashLoopBackOff   2   3h25m   10.130.0.7    weinliu46upgrade-t4k5p-control-plane-2   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
apiserver-76b747b597-rmmg6   1/2   CrashLoopBackOff   2   3h23m   10.128.0.11   weinliu46upgrade-t4k5p-control-plane-0   <none>   <none>   apiserver=true,app=openshift-apiserver-a,openshift-apiserver-anti-affinity=true,pod-template-hash=76b747b597,revision=2
openshift-apiserver   4.6.0-0.nightly-2021-01-03-162024   False   False   False   24s
weinliu46upgrade-t4k5p-control-plane-0   Ready   master,worker   4h1m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-1   Ready   master,worker   4h1m   v1.19.0+9c69bdc
weinliu46upgrade-t4k5p-control-plane-2   Ready   master,worker   4h1m   v1.19.0+9c69bdc
version   4.6.0-0.nightly-2021-01-03-162024   True   True   15m   Working towards 4.7.0-0.nightly-2021-01-04-215816: 17% complete
...
$ grep "failed" watch-apiserver-in-upgrade.log  # 13 occurrences in total
2021-01-05T14:11:02+08:00 oc get project.project failed
...
2021-01-05T14:24:26+08:00 oc get project.project failed

Expected results:
The upgrade should be zero-disruption for OAS.

Additional info:
The while-loop script watch-apiserver-in-upgrade.sh can be found in test case OCP-34223 "Check Zero-disruption upgrade for OAS and KAS". This bug was found while testing that case.
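The actual watch-apiserver-in-upgrade.sh lives in test case OCP-34223 and is not reproduced here; the following is only a minimal sketch of the same idea (the function name, iteration handling, and output wording are assumptions, not the real script):

```shell
#!/bin/bash
# watch_api: run a check command repeatedly and count how often it fails.
# The real script sleeps between checks and loops until the upgrade completes;
# a fixed iteration count keeps this sketch self-contained.
watch_api() {
  local check_cmd=$1 iterations=$2 failures=0
  for _ in $(seq 1 "$iterations"); do
    if $check_cmd >/dev/null 2>&1; then
      echo "$(date +%FT%T%z) $check_cmd succeeded"
    else
      failures=$((failures + 1))
      echo "$(date +%FT%T%z) $check_cmd failed"
    fi
  done
  echo "total failures: $failures"
}

# In the real test the command would be "oc get project.project",
# e.g.: watch_api "oc get project.project" 100 | tee watch-apiserver-in-upgrade.log
watch_api true 2
```

Grepping the resulting log for "failed", as done above, then gives the disruption count.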
Please provide pod manifests (including status) and logs of the failing pods, captured at the moment the failures happen; a must-gather taken long afterwards does not help.
Hmm, sorry, I didn't watch with my eyes, only with the while-loop script, so I didn't capture the logs at the moment of failure. I will collect them manually if I hit it next time. However, I checked the must-gather events.yaml and found some apiserver-76b747b597-* events during the above window 2021-01-05T14:11:02+08:00 ~ 2021-01-05T14:24:26+08:00:

$ ~/auto/yaml2json.rb namespaces/openshift-apiserver/core/events.yaml > ~/my/events-upgrade-oas.json
$ jq -r '.items[] | select(.involvedObject.name | test("76b747b597")) | select(.type != "Normal") | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"' ~/my/events-upgrade-oas.json | grep 2021-01-05T06
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-7x2qd Back-off restarting failed container
2021-01-05T06:25:14Z 6 Warning Unhealthy apiserver-76b747b597-7x2qd Readiness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:25:20Z 3 Warning Unhealthy apiserver-76b747b597-7x2qd Liveness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:17:56Z 3 Warning BackOff apiserver-76b747b597-qlfps Back-off restarting failed container
2021-01-05T06:28:01Z 3 Warning Unhealthy apiserver-76b747b597-qlfps Liveness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:28:03Z 6 Warning Unhealthy apiserver-76b747b597-qlfps Readiness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-rmmg6 Back-off restarting failed container
2021-01-05T06:26:40Z 6 Warning Unhealthy apiserver-76b747b597-rmmg6 Readiness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
2021-01-05T06:26:42Z 3 Warning Unhealthy apiserver-76b747b597-rmmg6 Liveness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
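For reference, here is a self-contained version of the jq filter used above, run against a hand-made two-event sample (the real input is the must-gather events.yaml converted to JSON; this sample is illustrative only):

```shell
# Two sample events: one Warning for a 76b747b597 pod, one Normal event
# that the filter should drop.
sample='{"items":[
  {"involvedObject":{"name":"apiserver-76b747b597-7x2qd"},"type":"Warning",
   "reason":"BackOff","firstTimestamp":"2021-01-05T06:17:55Z","count":3,
   "message":"Back-off restarting failed container"},
  {"involvedObject":{"name":"apiserver-76b747b597-7x2qd"},"type":"Normal",
   "reason":"Pulled","firstTimestamp":"2021-01-05T06:10:00Z","count":1,
   "message":"Container image already present"}
]}'

# Keep only abnormal events for pods of the 76b747b597 replica set and
# flatten each one to a single summary line.
echo "$sample" | jq -r '.items[]
  | select(.involvedObject.name | test("76b747b597"))
  | select(.type != "Normal")
  | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"'
# Only the Warning event survives the filter.
```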
Yes, I saw those too. But those are just the kubelet messages about the probes; the apiserver does not start up. Maybe etcd cannot be reached, or the port is occupied, or there is a panic. Interestingly, I also don't see a termination message in the status, which suggests the process is not started at all.
What is surprising is that all 3 instances are crash-looping. This suggests that some dependency used by all of them (the service network? etcd?) is down. The logs would tell.
Today I observed one upgrade (matrix: upi-on-gcp/versioned-installer-ovn, FIPS on, upgrading from 4.6.0-0.nightly-2021-01-05-203053 to 4.7.0-0.nightly-2021-01-07-034013) where it also reproduced (therefore changing Severity to Medium). This time I collected the logs at the moment I observed it:

oc get po -n openshift-apiserver -o yaml > openshift-apiserver-pods.yaml
# found the restarting container is openshift-apiserver-check-endpoints instead of openshift-apiserver
oc describe po apiserver-6464ff6474-mrvpp -n openshift-apiserver > apiserver-6464ff6474-mrvpp.oc-describe.txt
oc logs --previous ... -c openshift-apiserver-check-endpoints > openshift-apiserver-check-endpoints.log
# it ended with the below
...
E0107 07:00:01.449494 1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I0107 07:00:02.840026 1 base_controller.go:72] Caches are synced for check-endpoints
I0107 07:00:02.840163 1 base_controller.go:109] Starting #1 worker of check-endpoints controller
...
I0107 07:02:22.825961 1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".

These collections are uploaded to http://file.rdu.redhat.com/~xxia/1912820/2020-0107/ , where you can open:
openshift-apiserver-pods.yaml
apiserver-6464ff6474-mrvpp.oc-describe.txt
openshift-apiserver-check-endpoints.log
watch-apiserver-in-upgrade.log
In 4.6, the PodNetworkConnectivityCheck CRD is installed and uninstalled on demand; in 4.7 the CRD is installed permanently. During a 4.6-to-4.7 upgrade there is therefore some contention between the 4.6 controllers, which uninstall the CRD, and the 4.7 controllers, which re-create it precisely for this scenario, but eventually it all settles down once the 4.6 pods (kube-apiserver-operator and openshift-apiserver-operator) have been upgraded.
*** Bug 1889900 has been marked as a duplicate of this bug. ***
All PRs merged.
Tested upgrade from 4.7.12 to 4.8.0-fc.5 (4.8.0-0.nightly-2021-05-21-101954); the reported issue was not seen again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438