Seen in a 4.7.14 -> 4.8.0-0.ci-2021-06-07-230231 OVN update [1]:

INFO[2021-06-08T03:58:34Z] Jun 08 03:33:48.000 W ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OpenShiftAPICheckFailed "apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
...
INFO[2021-06-08T03:58:35Z] Jun 08 03:40:18.354 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ip-10-0-250-206.us-west-1.compute.internal node/ip-10-0-250-206.us-west-1.compute.internal container/cluster-policy-controller reason/ContainerExit code/1 cause/Error https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/route.openshift.io/v1?timeout=32s
I0608 03:39:49.441062 1 request.go:645] Throttling request took 3.117986892s, request: GET:https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s
E0608 03:39:54.508929 1 reconciliation_controller.go:166] unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request, image.openshift.io/v1: the server is currently unable to handle the request, project.openshift.io/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, template.openshift.io/v1: the server is currently unable to handle the request
E0608 03:40:08.610617 1 leaderelection.go:321] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get "https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller": read tcp 10.0.250.206:52822->10.0.216.48:6443: read: connection timed out
W0608 03:40:08.649983 1 warnings.go:67] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0608 03:40:17.861508 1 leaderelection.go:278] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
I0608 03:40:17.861521 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-250-206 stopped leading
E0608 03:40:17.861618 1 leaderelection.go:297] Failed to release lock: resource name may not be empty
F0608 03:40:17.861657 1 policy_controller.go:107] leaderelection lost

Those probably get rolled into:

disruption_tests: [sig-api-machinery] OpenShift APIs remain available for new connections  1h33m31s

Jun 8 03:57:18.034: API "openshift-api-available-new-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 0s of 1h33m30s (0%):

Jun 08 03:33:55.147 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": dial tcp 54.151.58.47:6443: i/o timeout
Jun 08 03:33:55.147 - 203ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:33:55.350 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests
Jun 08 03:39:53.509 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": net/http: TLS handshake timeout
Jun 08 03:39:53.509 - 198ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:39:53.708 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests
Jun 08 03:40:23.445 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": dial tcp 54.151.58.47:6443: i/o timeout
Jun 08 03:40:23.445 - 206ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:40:23.651 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests

As described in bug 1940894, they also make must-gather mad. Seems somewhat popular over the past 24h, although the only affected release informer is the OVN minor updates:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=apps.openshift.io/v1:+the+server+is+currently+unable+to+handle+the+request' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
pull-ci-code-ready-snc-master-e2e-snc (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws (all) - 16 runs, 19% failed, 33% of failures match = 6% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 24 runs, 96% failed, 9% of failures match = 8% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 23 runs, 96% failed, 9% of failures match = 9% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 5 runs, 40% failed, 50% of failures match = 20% impact

I'm not sure if all of those involve control-plane node reboots or not; there may be other triggers for the OpenShift API server being unreachable mixed in as well.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560
Get "https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller": read tcp 10.0.250.206:52822->10.0.216.48:6443: read: connection timed out is a strong hint that the draining and the shutdown of the node was not graceful. Otherwise, we cannot see a TCP connection timeout, but it would be on a higher level in the network stack.
Taking a look at the MCO namespaced objects, nothing jumps out. At the time of the must-gather for the above CI run, the MCO had fully settled, and notably all of the MCD containers had been restarted at some point, so we no longer have any logs from the older MCDs. The oldest log now is from 03:36, which seems to be after the API issues. It's also possible that the API issues were happening as the first master was rebooting for its update. I'm not sure how to dig into this further; maybe we would need to look at the journal logs? Not sure what the root cause of the instability was.
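For reference, on a live cluster the journal can be pulled without SSH (not an option here since the CI cluster is already gone, but worth keeping in mind for future occurrences); the unit names below are just the usual suspects, not anything specific to this failure:

$ oc adm node-logs ip-10-0-250-206.us-west-1.compute.internal -u kubelet | grep -iE 'drain|shutdown|reboot'
$ oc adm node-logs --role=master -u crio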
> ...we no longer have any logs from the older MCDs...

We should have those in Loki, but I dunno if we retain them for three weeks. From [1] -> custom-link-tools -> Loki -> adjust the time range to cover 2021-06-08 00:00Z through 2021-06-08 06:00Z, and use:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator"

That seems to give me some MCD logs. Was there a particular MCD pod we wanted to dig into? An instant query for:

count by (pod_name) (count_over_time(({invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator")[24h]))

gives a list of options.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560
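For anyone who prefers the command line, the same range query should also work via logcli against whatever Loki endpoint the dashboard fronts (the address below is a placeholder, and authentication is whatever that instance requires):

$ export LOKI_ADDR=https://<ci-loki-endpoint>   # placeholder
$ logcli query --from='2021-06-08T00:00:00Z' --to='2021-06-08T06:00:00Z' --limit=5000 \
    '{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator"'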
machine-config-daemon-llgpj is probably the most likely one to have any useful information, if any exists. It does show up in the query you gave, but I don't really know how to pull its logs out of the dashboard.
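If the unpacked entries carry a pod_name label (which the count-by-pod query above suggests they do; that's my assumption here), pasting something like this into the dashboard's query box should narrow things down to just that pod:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator" | pod_name="machine-config-daemon-llgpj"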
The newer manifestations of this should be covered by https://bugzilla.redhat.com/show_bug.cgi?id=2019215. @Trevor, if you still see this in newer runs, please feel free to open a new bug or attach to that one. Closing this one since it's a bit stale at this point. Thanks!