Bug 1701291

Summary: upgrade: openshift-apiserver OpenShift API is not responding to GET requests
Product: OpenShift Container Platform
Reporter: Steve Milner <smilner>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED DUPLICATE
QA Contact: Xingxing Xia <xxia>
Severity: medium
Priority: medium
Version: 4.1.0
CC: aos-bugs, bleanhar, jokerman, mmccomas, wking, yinzhou
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2019-04-29 08:24:07 UTC
Type: Bug
Attachments:
Occurrences of this error in CI from 2019-04-22T13:58 through 2019-04-23T08:49Z

Description Steve Milner 2019-04-18 14:56:56 UTC
Looks to have occurred around 70 times over the last 48 hours. 

failed: (7m19s) 2019-04-18T06:45:12 "[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]"

How reproducible:

Happens most of the time, though a few runs fail on infra flakes instead

Notes:
"Got error testing for reachability" is a common line that pops up

Comment 2 Steve Milner 2019-04-18 15:03:13 UTC
"Cluster did not acknowledge request to upgrade in a reasonable time" shows up in a number of logs as well: 313, 326, 331, 352, 353, 368, 379

Comment 3 W. Trevor King 2019-04-22 17:46:53 UTC
From a recent run [1]:

  Apr 18 13:50:11.873 E openshift-apiserver OpenShift API is not responding to GET requests

despite that being well after the successful install:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/391/artifacts/e2e-aws-upgrade/installer/.openshift_install.log | grep 'Install complete'
  time="2019-04-18T13:42:54Z" level=info msg="Install complete!"

So moving to the master team to triage why they aren't rolling their update out gracefully.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/391

Comment 4 W. Trevor King 2019-04-22 17:53:51 UTC
Possibly mitigated by https://github.com/openshift/installer/pull/1629

Comment 6 W. Trevor King 2019-04-23 09:20:27 UTC
Created attachment 1557513 [details]
Occurrences of this error in CI from 2019-04-22T13:58 through 2019-04-23T08:49Z

These are still showing up in just about every failed upgrade job (which is most upgrade jobs).  For a recent example, see [1], where it's being promoted into the main summary page via occurrences during failing tests:

$ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581 | grep 'openshift-apiserver OpenShift API is not responding to GET requests'
Apr 23 07:39:40.299 - 75s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:53:40.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:57:25.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:39:40.299 - 75s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:53:40.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:57:25.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests

The code generating these messages seems to be [2].

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581
[2]: https://github.com/openshift/origin/blob/9c28792c0aece4f9b7926746949111faf61a3be7/pkg/monitor/api.go#L55-L120
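
For context, a rough shell approximation of what that monitor is doing (sketch only; the real code in [2] is Go, and the endpoint, interval, and ${API_SERVER} host below are guesses):

$ while sleep 1; do
    code=$(curl -sk --max-time 3 -o /dev/null -w '%{http_code}' "https://${API_SERVER}:6443/apis/apps.openshift.io/v1")   # a real check would also send a bearer token
    [ "$code" = 200 ] || echo "$(date -u '+%b %d %H:%M:%S') E openshift-apiserver OpenShift API is not responding to GET requests"
  done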

Comment 7 W. Trevor King 2019-04-23 09:23:59 UTC
Expanding to include the Kubernetes monitors:

$ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581 | grep 'API is not responding to GET requests' | sort
Apr 23 07:39:40.299 - 75s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:39:40.299 - 75s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:51:55.299 - 75s   E kube-apiserver Kube API is not responding to GET requests
Apr 23 07:51:55.299 - 75s   E kube-apiserver Kube API is not responding to GET requests
Apr 23 07:53:40.299 - 29s   E kube-apiserver Kube API is not responding to GET requests
Apr 23 07:53:40.299 - 29s   E kube-apiserver Kube API is not responding to GET requests
Apr 23 07:53:40.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:53:40.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:54:55.299 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:57:25.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 23 07:57:25.299 - 44s   E openshift-apiserver OpenShift API is not responding to GET requests

So this is also an issue for kube-apiserver.  But if it were just a load-balancer issue, I'd guess that we'd have more overlap between the Kubernetes and OpenShift errors above.  Probably worth checking events.json to get a more complete picture.
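
Something like this should pull the relevant window out of events.json (sketch only; assumes events.json sits next to pods.json in the e2e-aws-upgrade artifacts and is a standard Event list):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/events.json | jq -r '.items | sort_by(.lastTimestamp)[] | select(.lastTimestamp >= "2019-04-23T07:39:00Z" and .lastTimestamp <= "2019-04-23T07:58:00Z") | .lastTimestamp + " " + (.involvedObject.namespace // "") + "/" + .involvedObject.name + " " + (.reason // "") + ": " + (.message // "")'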

Comment 8 W. Trevor King 2019-04-23 11:03:09 UTC
Trying to get a handle on API-server roll-outs:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items | sort_by(.status.startTime)[] | select(.metadata.namespace | contains("-apiserver")) | .status.startTime + " " + .metadata.namespace + "/" + .metadata.name'
2019-04-23T06:47:40Z openshift-apiserver-operator/openshift-apiserver-operator-79785cbd7c-spwwb
2019-04-23T06:49:38Z openshift-kube-apiserver/kube-apiserver-ip-10-0-137-157.ec2.internal
2019-04-23T06:51:05Z openshift-kube-apiserver/kube-apiserver-ip-10-0-147-83.ec2.internal
2019-04-23T06:52:58Z openshift-apiserver/apiserver-glrt9
2019-04-23T06:55:05Z openshift-apiserver/apiserver-lpjgt
2019-04-23T06:56:31Z openshift-kube-apiserver/installer-6-ip-10-0-165-250.ec2.internal
2019-04-23T06:56:36Z openshift-apiserver/apiserver-dh2w2
2019-04-23T06:56:46Z openshift-kube-apiserver/kube-apiserver-ip-10-0-165-250.ec2.internal
2019-04-23T06:57:11Z openshift-kube-apiserver/revision-pruner-6-ip-10-0-165-250.ec2.internal
2019-04-23T07:54:51Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-137-157.ec2.internal
2019-04-23T07:54:52Z openshift-kube-apiserver/installer-7-ip-10-0-165-250.ec2.internal
2019-04-23T07:57:10Z openshift-kube-apiserver-operator/kube-apiserver-operator-846ccd49c8-89tfh
2019-04-23T07:57:10Z openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator-7fc474c755-lthrm
2019-04-23T07:58:34Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-147-83.ec2.internal
2019-04-23T07:58:38Z openshift-kube-apiserver/revision-pruner-7-ip-10-0-165-250.ec2.internal

So the 7:39 error from comment 7 seems to be off in space.  The 7:51 through 7:57 errors are close to some 7:5* pod starts, but none of those pods look like the API servers themselves.  So I'm currently leaning towards this not being an issue with API-server roll-out, although it could be an issue with an auth roll-out or some such.

Comment 9 W. Trevor King 2019-04-23 11:07:28 UTC
Dropping the namespace filters:

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items | sort_by(.status.startTime)[] | .status.startTime + " " + .metadata.namespace + "/" + .metadata.name'
...
2019-04-23T07:28:54Z openshift-multus/multus-vm4dq
2019-04-23T07:28:54Z openshift-sdn/ovs-zl4sk
2019-04-23T07:38:54Z openshift-multus/multus-lc9nw
2019-04-23T07:38:54Z openshift-sdn/ovs-tt7vh
2019-04-23T07:39:32Z openshift-multus/multus-wbh7f
2019-04-23T07:39:36Z openshift-sdn/ovs-f2lfx
2019-04-23T07:40:14Z openshift-marketplace/certified-operators-f97547c89-mds6z
...
2019-04-23T07:48:05Z openshift-kube-controller-manager-operator/kube-controller-manager-operator-c9974dc6f-9gkwf
2019-04-23T07:49:29Z openshift-cloud-credential-operator/cloud-credential-operator-6676469cc6-q5vdc
2019-04-23T07:50:26Z openshift-kube-controller-manager/installer-3-ip-10-0-165-250.ec2.internal
2019-04-23T07:51:38Z openshift-machine-api/machine-api-controllers-7f77999d99-9msr2
2019-04-23T07:54:14Z openshift-kube-controller-manager/kube-controller-manager-ip-10-0-137-157.ec2.internal
...

So at least the 7:39 error seems like it was due to the network roll-out.  I still don't see anything that looks suspicious for the 7:51:55 error.

Comment 10 Michal Fojtik 2019-04-24 13:18:04 UTC
PR: https://github.com/openshift/origin/pull/22651

The change will make sure that on master-node reboot the apiserver exits with code 0, so there is no chance of the container being restarted.
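
Once that merges, one way to check whether the apiserver containers are still being restarted during an upgrade run (sketch only; reuses the pods.json artifact from comment 8):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/581/artifacts/e2e-aws-upgrade/pods.json | jq -r '.items[] | select(.metadata.namespace == "openshift-apiserver") | .metadata.name + " restarts=" + ([.status.containerStatuses[]?.restartCount] | add | tostring)'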

Comment 11 W. Trevor King 2019-04-25 19:08:19 UTC
https://github.com/openshift/origin/pull/22651#event-2297673514 <- merged

Although this doesn't address graceful shutdowns for static pods.  Is that part of this issue too?

Comment 12 W. Trevor King 2019-04-25 19:24:09 UTC
Graceful termination is tracked as bug 1702414.

Comment 14 zhou ying 2019-04-29 07:54:27 UTC
We could still see the error in the latest CI upgrade:
[root@dhcp-140-138 ~]# curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/901 | grep 'API is not responding to GET requests' 
Apr 29 06:28:34.475 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:39:04.475 - 15s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:39:49.475 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:41:19.475 - 90s   E kube-apiserver Kube API is not responding to GET requests
Apr 29 06:43:34.475 - 45s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:43:34.475 - 105s  E kube-apiserver Kube API is not responding to GET requests
Apr 29 06:45:49.477 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:49:19.475 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:28:34.475 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:39:04.475 - 15s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:39:49.475 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:41:19.475 - 90s   E kube-apiserver Kube API is not responding to GET requests
Apr 29 06:43:34.475 - 45s   E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:43:34.475 - 105s  E kube-apiserver Kube API is not responding to GET requests
Apr 29 06:45:49.477 E openshift-apiserver OpenShift API is not responding to GET requests
Apr 29 06:49:19.475 E openshift-apiserver OpenShift API is not responding to GET requests

The payload upgrade 4.1.0-0.ci-2019-04-28-235503 -> 4.1.0-0.ci-2019-04-29-031658 contains the PR: https://github.com/openshift/origin/pull/22651
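
For reference, the origin commit each of those payloads pulls in can be listed with something like the following (sketch only; the pull spec below is a guess at the CI registry location, and it requires a 4.x oc):

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.ci-2019-04-29-031658 | grep 'openshift/origin'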