Bug 1694186
| Summary: | kube-apiserver started a redeploy during e2e run (should not do so), and rollout is not graceful (causing e2e failures) | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Master | Assignee: | Stefan Schimanski <sttts> |
| Status: | CLOSED ERRATA | QA Contact: | zhou ying <yinzhou> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, jokerman, mfojtik, mmccomas, rvokal, sttts, wsun |
| Target Milestone: | --- | Keywords: | BetaBlocker, Reopened |
| Target Release: | 4.1.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-06-04 10:46:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Clayton Coleman 2019-03-29 17:46:25 UTC
The test failed due to:

```
fail [k8s.io/kubernetes/test/e2e/framework/pods.go:77]: Error creating Pod
Expected error:
    <*url.Error | 0xc422d8e9f0>: {
        Op: "Post",
        URL: "https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods",
        Err: {
            Op: "read",
            Net: "tcp",
            Source: {IP: [172, 16, 16, 141], Port: 44432, Zone: ""},
            Addr: {IP: [18, 210, 77, 74], Port: 6443, Zone: ""},
            Err: {Syscall: "read", Err: 0x68},
        },
    }
Post https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods: read tcp 172.16.16.141:44432->18.210.77.74:6443: read: connection reset by peer
not to have occurred
```

which I would not have expected graceful shutdown to produce. It would be useful to see correlations with the SDN giving up on iptables updates for 30 seconds after hitting the iptables lock. Maybe that's the cause, maybe not.

----

Moving to the Pod team to investigate the scheduler backoff. I think there is a separate BZ to investigate the kube-apiserver rolling a new version out in the middle of / at the end of an e2e test (likely caused by the controller manager rotating kubelet certs faster, which was fixed last week).

----

The scheduler is not starting, with the same error as in bug 1691085.

*** This bug has been marked as a duplicate of bug 1691085 ***

----

I am reopening because the issue opened here is about the kube-apiserver doing a rollout during an e2e run that is not fully graceful. That is an urgent issue because it causes a large percentage of the e2e flakes and failures. Expectation: all graceful rollouts, except for watch disconnects, are *completely* transparent to the user if done correctly. If we see disconnect errors, it's a bug.

----

It looks like the symptoms are different / worse post-rebase - I have seen several e2e runs fail harder than they did pre-rebase, possibly due to different graceful-deletion behavior, a slow restart due to OpenAPI, or other causes.
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22439/pull-ci-openshift-origin-master-e2e-aws/6175?log#log failed due to an apiserver rolling out and connectivity drops - the rollout is right at the beginning of the e2e run.

This looks interesting but might just be unrelated:

```
Apr 02 19:15:20.556 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator (combined from similar events): Status for operator kube-apiserver changed: Failing message changed from "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready\nStaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-cert-syncer-5\" is not ready" to "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready" (51 times)
```

A separate set of failures happens at the end of the run and is probably unrelated to the rollout: the ELB was taking up to 40 seconds to take an endpoint out of rotation, while the kube-apiserver only guaranteed to stay up for 35 seconds, leaving a window in which the ELB could route traffic to an endpoint after it had terminated. Fixed in https://github.com/openshift/installer/pull/1537.

----

https://github.com/openshift/installer/pull/1537 was wrong. The (current) AWS theory says that a target is taken out of rotation after at most 30 seconds of /readyz failing. Meanwhile we merged a lot of stabilization fixes and we fixed graceful-termination timeouts, so this should happen less often. Two issues remain:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6733#openshift-tests-sig-api-machinery-customresourcepublishopenapi-featurecustomresourcepublishopenapi-works-for-crd-with-validation-schema-suiteopenshiftconformanceparallel-suitek8s looks similar to the above.
Need triage there.

The bulk of the failures look new: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6733#openshift-tests-featuredeploymentconfig-deploymentconfigs-with-multiple-image-change-triggers-conformance-should-run-a-successful-deployment-with-a-trigger-used-by-different-containers-suiteopenshiftconformanceparallelminimal. I will spawn a separate bug for the "Unauthorized" errors: https://bugzilla.redhat.com/show_bug.cgi?id=1700046

----

Per comment 10, moving back to MODIFIED to revert the bot's immediate comment 11.

----

Cannot reproduce the issue with payload 4.1.0-0.nightly-2019-05-04-210601, so will verify:

```
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        False         3h39m   Cluster version is 4.1.0-0.nightly-2019-05-04-210601
[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"e8d1fd69b", GitTreeState:"clean", BuildDate:"2019-04-29T06:42:38Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4a26b6", GitCommit:"d4a26b6", GitTreeState:"clean", BuildDate:"2019-05-04T18:12:07Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
```

----

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758