Seen in a 4.7.14 -> 4.8.0-0.ci-2021-06-07-230231 OVN update [1]:

INFO[2021-06-08T03:58:34Z] Jun 08 03:33:48.000 W ns/openshift-apiserver-operator deployment/openshift-apiserver-operator reason/OpenShiftAPICheckFailed "apps.openshift.io.v1" failed with HTTP status code 503 (the server is currently unable to handle the request)
...
INFO[2021-06-08T03:58:35Z] Jun 08 03:40:18.354 E ns/openshift-kube-controller-manager pod/kube-controller-manager-ip-10-0-250-206.us-west-1.compute.internal node/ip-10-0-250-206.us-west-1.compute.internal container/cluster-policy-controller reason/ContainerExit code/1 cause/Error https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/route.openshift.io/v1?timeout=32s
I0608 03:39:49.441062 1 request.go:645] Throttling request took 3.117986892s, request: GET:https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s
E0608 03:39:54.508929 1 reconciliation_controller.go:166] unable to retrieve the complete list of server APIs: apps.openshift.io/v1: the server is currently unable to handle the request, authorization.openshift.io/v1: the server is currently unable to handle the request, image.openshift.io/v1: the server is currently unable to handle the request, project.openshift.io/v1: the server is currently unable to handle the request, route.openshift.io/v1: the server is currently unable to handle the request, template.openshift.io/v1: the server is currently unable to handle the request
E0608 03:40:08.610617 1 leaderelection.go:321] error retrieving resource lock openshift-kube-controller-manager/cluster-policy-controller: Get "https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller": read tcp 10.0.250.206:52822->10.0.216.48:6443: read: connection timed out
W0608 03:40:08.649983 1 warnings.go:67] extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress
I0608 03:40:17.861508 1 leaderelection.go:278] failed to renew lease openshift-kube-controller-manager/cluster-policy-controller: timed out waiting for the condition
I0608 03:40:17.861521 1 event.go:282] Event(v1.ObjectReference{Kind:"ConfigMap", Namespace:"", Name:"", UID:"", APIVersion:"v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' ip-10-0-250-206 stopped leading
E0608 03:40:17.861618 1 leaderelection.go:297] Failed to release lock: resource name may not be empty
F0608 03:40:17.861657 1 policy_controller.go:107] leaderelection lost

Those probably get rolled into:

disruption_tests: [sig-api-machinery] OpenShift APIs remain available for new connections  1h33m31s

Jun 8 03:57:18.034: API "openshift-api-available-new-connections" was unreachable during disruption (AWS has a known issue: https://bugzilla.redhat.com/show_bug.cgi?id=1943804) for at least 0s of 1h33m30s (0%):

Jun 08 03:33:55.147 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": dial tcp 54.151.58.47:6443: i/o timeout
Jun 08 03:33:55.147 - 203ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:33:55.350 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests
Jun 08 03:39:53.509 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": net/http: TLS handshake timeout
Jun 08 03:39:53.509 - 198ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:39:53.708 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests
Jun 08 03:40:23.445 E openshift-apiserver-new-connection openshift-apiserver-new-connection started failing: Get "https://api.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/apis/image.openshift.io/v1/namespaces/default/imagestreams": dial tcp 54.151.58.47:6443: i/o timeout
Jun 08 03:40:23.445 - 206ms E openshift-apiserver-new-connection openshift-apiserver-new-connection is not responding to GET requests
Jun 08 03:40:23.651 I openshift-apiserver-new-connection openshift-apiserver-new-connection started responding to GET requests

As described in bug 1940894, they also make must-gather mad. Seems somewhat popular over the past 24h, although the only affected release informer is the OVN minor updates:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=24h&type=build-log&search=apps.openshift.io/v1:+the+server+is+currently+unable+to+handle+the+request' | grep 'failures match' | sort
periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade (all) - 17 runs, 100% failed, 6% of failures match = 6% impact
pull-ci-code-ready-snc-master-e2e-snc (all) - 2 runs, 50% failed, 100% of failures match = 50% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws (all) - 16 runs, 19% failed, 33% of failures match = 6% impact
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-gcp-operator (all) - 16 runs, 25% failed, 25% of failures match = 6% impact
pull-ci-openshift-origin-master-e2e-aws-disruptive (all) - 24 runs, 96% failed, 9% of failures match = 8% impact
pull-ci-openshift-origin-master-e2e-gcp-disruptive (all) - 23 runs, 96% failed, 9% of failures match = 9% impact
pull-ci-openshift-ovn-kubernetes-master-e2e-openstack-ovn (all) - 5 runs, 40% failed, 50% of failures match = 20% impact

I'm not sure if all of those involve control-plane node reboots or not; there may be other triggers for the OpenShift API server being unreachable mixed in as well.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560
Get "https://api-int.ci-op-krp5m4vh-8d118.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/openshift-kube-controller-manager/configmaps/cluster-policy-controller": read tcp 10.0.250.206:52822->10.0.216.48:6443: read: connection timed out is a strong hint that the draining and the shutdown of the node was not graceful. Otherwise, we cannot see a TCP connection timeout, but it would be on a higher level in the network stack.
Taking a look at the MCO namespaced objects, nothing jumps out. At the time of the must-gather for the above CI run, the MCO had fully settled, and notably all of the MCD containers had been restarted at some point, so we no longer have any logs from the older MCDs. The oldest log now is from 03:36, which seems to be after the API issues. It's also possible that the API issues were happening as the first master was rebooting for its update. I'm not sure how to dig into this further; maybe we would need to look at the journal logs? Not sure what the root cause of the instability was.
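For reference, on a live cluster the journal can be pulled without SSH (not an option here since the CI cluster is already gone, but worth keeping in mind for future occurrences); the unit names below are just the usual suspects, not anything specific to this failure:

$ oc adm node-logs ip-10-0-250-206.us-west-1.compute.internal -u kubelet | grep -iE 'drain|shutdown|reboot'
$ oc adm node-logs --role=master -u crio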
> ...we no longer have any logs from the older MCDs...

We should have those in Loki, but I dunno if we retain them for three weeks. From [1] -> custom-link-tools -> Loki -> adjust the time range to cover 2021-06-08 00:00Z through 2021-06-08 06:00Z, and use:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator"

That seems to give me some MCD logs. Was there a particular MCD pod we wanted to dig into? An instant query for:

count by (pod_name) (count_over_time(({invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator")[24h]))

gives a list of options.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560
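For anyone who prefers the command line, the same range query should also work via logcli against whatever Loki endpoint the dashboard fronts (the address below is a placeholder, and authentication is whatever that instance requires):

$ export LOKI_ADDR=https://<ci-loki-endpoint>   # placeholder
$ logcli query --from='2021-06-08T00:00:00Z' --to='2021-06-08T06:00:00Z' --limit=5000 \
    '{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator"'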
machine-config-daemon-llgpj is probably the most likely one to have any useful information, if any exists. It does show up in the query you gave, but I don't really know how to pull its logs out of the dashboard.
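If the unpacked entries carry a pod_name label (which the count-by-pod query above suggests they do; that's my assumption here), pasting something like this into the dashboard's query box should narrow things down to just that pod:

{invoker="openshift-internal-ci/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-ovn-upgrade/1402078479163330560"} | unpack | namespace="openshift-machine-config-operator" | pod_name="machine-config-daemon-llgpj"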
The newer manifestations of this should be covered by https://bugzilla.redhat.com/show_bug.cgi?id=2019215. @Trevor, if you still see this in newer runs, please feel free to open a new bug or attach to that one. Closing this one since it's a bit stale at this point. Thanks!