Bug 1912820
Summary: | openshift-apiserver Available is False with 3 pods not ready for a while during upgrade | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Xingxing Xia <xxia>
Component: | openshift-apiserver | Assignee: | Luis Sanchez <sanchezl>
Status: | CLOSED ERRATA | QA Contact: | Xingxing Xia <xxia>
Severity: | medium | Docs Contact: |
Priority: | unspecified | |
Version: | 4.7 | CC: | akashem, aos-bugs, dgautam, fabian, kewang, mf.flip, mfojtik, rgangwar, sanchezl, sttts, wking
Target Milestone: | --- | Keywords: | Upgrades
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1926867 (view as bug list) | Environment: |
Last Closed: | 2021-07-27 22:35:38 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1926867, 1946856 | |
Description
Xingxing Xia
2021-01-05 11:30:06 UTC
Please provide pod manifests (including status) and logs of the failing pods at the moment the failure happens; a must-gather taken long afterwards does not help.

Hmm, sorry, I wasn't watching by eye, only running the while-loop script, so I didn't capture logs at that moment. I will collect them manually if I hit it again. I did, however, check the must-gather events.yaml and found some apiserver-76b747b597-* events during the above window 2021-01-05T14:11:02+08:00 ~ 2021-01-05T14:24:26+08:00:

```
$ ~/auto/yaml2json.rb namespaces/openshift-apiserver/core/events.yaml > ~/my/events-upgrade-oas.json
$ jq -r '.items[] | select(.involvedObject.name | test("76b747b597")) | select(.type != "Normal") | "\(.firstTimestamp) \(.count) \(.type) \(.reason) \(.involvedObject.name) \(.message)"' ~/my/events-upgrade-oas.json | grep 2021-01-05T06
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-7x2qd Back-off restarting failed container
2021-01-05T06:25:14Z 6 Warning Unhealthy apiserver-76b747b597-7x2qd Readiness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:25:20Z 3 Warning Unhealthy apiserver-76b747b597-7x2qd Liveness probe failed: Get "https://10.129.0.6:8443/healthz": dial tcp 10.129.0.6:8443: connect: connection refused
2021-01-05T06:17:56Z 3 Warning BackOff apiserver-76b747b597-qlfps Back-off restarting failed container
2021-01-05T06:28:01Z 3 Warning Unhealthy apiserver-76b747b597-qlfps Liveness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:28:03Z 6 Warning Unhealthy apiserver-76b747b597-qlfps Readiness probe failed: Get "https://10.130.0.7:8443/healthz": dial tcp 10.130.0.7:8443: connect: connection refused
2021-01-05T06:17:55Z 3 Warning BackOff apiserver-76b747b597-rmmg6 Back-off restarting failed container
2021-01-05T06:26:40Z 6 Warning Unhealthy apiserver-76b747b597-rmmg6 Readiness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
2021-01-05T06:26:42Z 3 Warning Unhealthy apiserver-76b747b597-rmmg6 Liveness probe failed: Get "https://10.128.0.11:8443/healthz": dial tcp 10.128.0.11:8443: connect: connection refused
```

Yes, saw those too, but these are just the kubelet messages about the probes. The apiserver does not start up. Maybe etcd cannot be reached, the port is occupied, or there is a panic. Interestingly, I also don't see a termination message in the status, which suggests the process is not started at all. What is surprising is that all 3 instances are crash-looping, which suggests that some dependency used by all of them (service network? etcd?) is down. So logs would tell.
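The while-loop script mentioned above is not attached to this report. A minimal sketch of such a watch, which polls the openshift-apiserver pods during the upgrade and saves manifests and previous-container logs the moment a pod stops being Ready (the output file names and the 10-second interval are illustrative, not from the bug report), could look like:

```
# Hypothetical watch loop (not the reporter's actual script): capture
# pod manifests and logs of openshift-apiserver pods as soon as one
# stops reporting Ready during the upgrade.
while true; do
  not_ready=$(oc get pods -n openshift-apiserver \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | awk '$2 != "True" {print $1}')
  for pod in $not_ready; do
    ts=$(date -u +%Y%m%dT%H%M%SZ)
    oc get pod "$pod" -n openshift-apiserver -o yaml > "${pod}-${ts}.yaml"
    oc describe pod "$pod" -n openshift-apiserver > "${pod}-${ts}.describe.txt"
    # --previous grabs the log of the last terminated container instance
    oc logs --previous "$pod" -n openshift-apiserver -c openshift-apiserver-check-endpoints \
      > "${pod}-${ts}.check-endpoints.previous.log" 2>&1
  done
  sleep 10
done
```

Capturing `--previous` logs matters here because the log of a crashed container instance is otherwise lost once the kubelet restarts it.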
Today I observed another upgrade (matrix: upi-on-gcp/versioned-installer-ovn, FIPS on, upgrading from 4.6.0-0.nightly-2021-01-05-203053 to 4.7.0-0.nightly-2021-01-07-034013). It also reproduced there (therefore changing Severity to Medium), and this time I collected logs at the moment I observed it:

```
oc get po -n openshift-apiserver -o yaml > openshift-apiserver-pods.yaml
# the restarting container is openshift-apiserver-check-endpoints, not openshift-apiserver
oc describe po apiserver-6464ff6474-mrvpp -n openshift-apiserver > apiserver-6464ff6474-mrvpp.oc-describe.txt
oc logs --previous ... -c openshift-apiserver-check-endpoints > openshift-apiserver-check-endpoints.log
```

The check-endpoints log ended with:

```
E0107 07:00:01.449494 1 reflector.go:127] k8s.io/client-go.0/tools/cache/reflector.go:156: Failed to watch *v1alpha1.PodNetworkConnectivityCheck: failed to list *v1alpha1.PodNetworkConnectivityCheck: the server could not find the requested resource (get podnetworkconnectivitychecks.controlplane.operator.openshift.io)
I0107 07:00:02.840026 1 base_controller.go:72] Caches are synced for check-endpoints
I0107 07:00:02.840163 1 base_controller.go:109] Starting #1 worker of check-endpoints controller
...
I0107 07:02:22.825961 1 start_stop_controllers.go:70] The server doesn't have a resource type "podnetworkconnectivitychecks.controlplane.operator.openshift.io".
```

These collections are uploaded to http://file.rdu.redhat.com/~xxia/1912820/2020-0107/ ; you can open:
openshift-apiserver-pods.yaml
apiserver-6464ff6474-mrvpp.oc-describe.txt
openshift-apiserver-check-endpoints.log
watch-apiserver-in-upgrade.log

In 4.6, the PodNetworkConnectivityCheck CRD is installed and uninstalled on demand. In 4.7, the CRD is installed permanently. During a 4.6 to 4.7 upgrade there is some contention between the 4.6 controllers that uninstall the CRD and the 4.7 controllers that re-create it for exactly this scenario, but everything settles down once the 4.6 pods (kube-apiserver-operator and openshift-apiserver-operator) have been upgraded.

*** Bug 1889900 has been marked as a duplicate of this bug. ***

All PRs merged.

Tested upgrade from 4.7.12 to 4.8.0-fc.5 (4.8.0-0.nightly-2021-05-21-101954); the reported issue is not seen again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
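As a closing note on the CRD contention described above: one hypothetical way to observe it during a 4.6 to 4.7 upgrade is to poll for the CRD and log when it disappears and reappears. The polling interval and output format below are illustrative and not part of the bug report:

```
# Hypothetical observer: log presence/absence of the PodNetworkConnectivityCheck
# CRD while the 4.6 and 4.7 operator controllers contend over it.
while true; do
  if oc get crd podnetworkconnectivitychecks.controlplane.operator.openshift.io >/dev/null 2>&1; then
    state="present"
  else
    state="absent"
  fi
  echo "$(date -u +%FT%TZ) CRD ${state}"
  sleep 5
done
```

Once both operators are running 4.7, the CRD should remain present, matching the "settles down" behavior described in the comment above.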