Bug 1701291
Summary: | upgrade: openshift-apiserver OpenShift API is not responding to GET requests | ||||||
---|---|---|---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Steve Milner <smilner> | ||||
Component: | Master | Assignee: | Michal Fojtik <mfojtik> | ||||
Status: | CLOSED DUPLICATE | QA Contact: | Xingxing Xia <xxia> | ||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | 4.1.0 | CC: | aos-bugs, bleanhar, jokerman, mmccomas, wking, yinzhou | ||||
Target Milestone: | --- | ||||||
Target Release: | 4.1.0 | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2019-04-29 08:24:07 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
Description
Steve Milner
2019-04-18 14:56:56 UTC
"Cluster did not acknowledge request to upgrade in a reasonable time" shows up in a number of logs as well: 313, 326, 331, 352, 353, 368, 379 From a recent run [1]: Apr 18 13:50:11.873 E openshift-apiserver OpenShift API is not responding to GET requests despite that being well after the successful install: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/391/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Install complete' time="2019-04-18T13:42:54Z" level=info msg="Install complete!" So moving to the master team to triage why they aren't rolling their update out gracefully. [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/391 Possibly mitigated by https://github.com/openshift/installer/pull/1629 Created attachment 1557513 [details] Occurences of this error in CI from 2019-04-22T13:58 through 2019-04-23T08:49Z These are still showing up in just about every failed upgrade job (which is most upgrade jobs). For a recent example, see [1], where it's being promoted into the main summary page via occurrences during failing tests: $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581 | grep 'openshift-apiserver OpenShift API is not responding to GET requests' Apr 23 07:39:40.299 - 75s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:53:40.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:57:25.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:39:40.299 - 75s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:53:40.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:57:25.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Generating code seems to be [2]. [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581 [2]: https://github.com/openshift/origin/blob/9c28792c0aece4f9b7926746949111faf61a3be7/pkg/monitor/api.go#L55-L120 Expanding to include the Kubernetes monitors: $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581 | grep 'API is not responding to GET requests' | sort Apr 23 07:39:40.299 - 75s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:39:40.299 - 75s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:51:55.299 - 75s E kube-apiserver Kube API is not responding to GET requests Apr 23 07:51:55.299 - 75s E kube-apiserver Kube API is not responding to GET requests Apr 23 07:53:40.299 - 29s E kube-apiserver Kube API is not responding to GET requests Apr 23 07:53:40.299 - 29s E kube-apiserver Kube API is not responding to GET requests Apr 23 07:53:40.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:53:40.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:57:25.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests Apr 23 07:57:25.299 - 44s E openshift-apiserver OpenShift API is not responding to GET requests So this is also an issue for kube-apiserver. But if it were just a load-balancer issue, I'd guess that we'd have more overlap between the Kubernetes and OpenShift errors above. Probably worth checking events.json to get a more complete picture. Trying to get a handle on API-server roll-outs: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items | sort_by(.status.startTime)[] | select(.metadata.namespace | contains("-apiserver")) | .status.startTime + " " + .metadata.namespace + "/" + .metadata.name'2019-04-23T06:47:40Z openshift-apiserver-operator/openshift-apiserver-operator-79785cbd7c-spwwb 2019-04-23T06:49:38Z openshift-kube-apiserver/kube-apiserver-ip-10-0-137-157.ec2.internal 2019-04-23T06:51:05Z openshift-kube-apiserver/kube-apiserver-ip-10-0-147-83.ec2.internal 2019-04-23T06:52:58Z openshift-apiserver/apiserver-glrt9 2019-04-23T06:55:05Z openshift-apiserver/apiserver-lpjgt 2019-04-23T06:56:31Z openshift-kube-apiserver/installer-6-ip-10-0-165-250.ec2.internal 2019-04-23T06:56:36Z openshift-apiserver/apiserver-dh2w2 2019-04-23T06:56:46Z openshift-kube-apiserver/kube-apiserver-ip-10-0-165-250.ec2.internal 2019-04-23T06:57:11Z openshift-kube-apiserver/revision-pruner-6-ip-10-0-165-250.ec2.internal 2019-04-23T07:54:51Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-137-157.ec2.internal 2019-04-23T07:54:52Z openshift-kube-apiserver/installer-7-ip-10-0-165-250.ec2.internal 2019-04-23T07:57:10Z openshift-kube-apiserver-operator/kube-apiserver-operator-846ccd49c8-89tfh 2019-04-23T07:57:10Z openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-7fc474c755-lthrm 2019-04-23T07:58:34Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-147-83.ec2.internal 2019-04-23T07:58:38Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-165-250.ec2.internal So the 7:39 error from comment 7 seems to be off in space. The 7:51 through 7:57 errors are close to some 7:5* pod starts, but none of those pods look like the API servers themselves. So I'm currently leaning to towards this not being an issue with API-server roll-out, although it could be an issue with an auth roll-out or some such. Dropping the namespace filters: $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items | sort_by(.status.startTime)[] | .status.startTime + " " + .metadata.namespace + "/" + .metadata.name' ... 2019-04-23T07:28:54Z openshift-multus/multus-vm4dq 2019-04-23T07:28:54Z openshift-sdn/ovs-zl4sk 2019-04-23T07:38:54Z openshift-multus/multus-lc9nw 2019-04-23T07:38:54Z openshift-sdn/ovs-tt7vh 2019-04-23T07:39:32Z openshift-multus/multus-wbh7f 2019-04-23T07:39:36Z openshift-sdn/ovs-f2lfx 2019-04-23T07:40:14Z openshift-marketplace/certified-operators-f97547c89-mds6z ... 2019-04-23T07:48:05Z openshift-kube-controller-manager-operator/kube-controller-manager-operator-c9974dc6f-9gkwf 2019-04-23T07:49:29Z openshift-cloud-credential-operator/cloud-credential-operator-6676469cc6-q5vdc 2019-04-23T07:50:26Z openshift-kube-controller-manager/installer-3-ip-10-0-165-250.ec2.internal 2019-04-23T07:51:38Z openshift-machine-api/machine-api-controllers-7f77999d99-9msr2 2019-04-23T07:54:14Z openshift-kube-controller-manager/kube-controller-manager-ip-10-0-137-157.ec2.internal ... So at least the 7:39 error seems like it was due to the network roll-out. I still don't see anything that looks suspicious for the 7:51:55 error. PR: https://github.com/openshift/origin/pull/22651 The change will make sure that on master node reboot the apiserver exit with 0 and there is no chance of the container to be restarted. https://github.com/openshift/origin/pull/22651#event-2297673514 <- merged Although this doesn't address graceful shutdowns for static pods. Is that part of this issue too? Graceful term is bug 1702414. We still could see the error from the latest ci upgrade : [root@dhcp-140-138 ~]# curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/901 | grep 'API is not responding to GET requests' Apr 29 06:28:34.475 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:39:04.475 - 15s E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:39:49.475 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:41:19.475 - 90s E kube-apiserver Kube API is not responding to GET requests Apr 29 06:43:34.475 - 45s E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:43:34.475 - 105s E kube-apiserver Kube API is not responding to GET requests Apr 29 06:45:49.477 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:49:19.475 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:28:34.475 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:39:04.475 - 15s E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:39:49.475 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:41:19.475 - 90s E kube-apiserver Kube API is not responding to GET requests Apr 29 06:43:34.475 - 45s E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:43:34.475 - 105s E kube-apiserver Kube API is not responding to GET requests Apr 29 06:45:49.477 E openshift-apiserver OpenShift API is not responding to GET requests Apr 29 06:49:19.475 E openshift-apiserver OpenShift API is not responding to GET requests The payload: 4.1.0-0.ci-2019-04-28-235503 -> 4.1.0-0.ci-2019-04-29-031658 contain the PR: https://github.com/openshift/origin/pull/22651 |