https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6264/artifacts/e2e-aws/e2e.log

Near the end of the e2e run, ~20m in, the apiserver is restarted, which causes a test to exit (whether the test should be resilient to that I will track as a separate bug). I had expected no rollouts, and it looks like kube-scheduler is crash looping on a master? If we are rolling out, that would cause other flakes.

Mar 28 16:34:33.813 W clusteroperator/kube-scheduler changed Failing to False
Mar 28 16:34:57.311 - 69s I test="[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" running
Mar 28 16:34:57.312 - 36s I test="[Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]" running
Mar 28 16:34:58.395 W ns/openshift-kube-scheduler pod/openshift-kube-scheduler-ip-10-0-164-176.ec2.internal Back-off restarting failed container (6 times)
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator (combined from similar events): Created Pod/revision-pruner-6-ip-10-0-164-176.ec2.internal -n openshift-kube-apiserver because it was missing
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Container image "registry.svc.ci.openshift.org/ocp/4.0-2019-03-28-160339@sha256:7201ae707e5345237aee6dfbdbde5f8cc6de4350bcee5b07e4d5d2996e566c6b" already present on machine
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Created container pruner
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Started container pruner
Mar 28 16:34:58.409 W ns/openshift-kube-scheduler pod/openshift-kube-scheduler-ip-10-0-164-176.ec2.internal Back-off restarting failed container (7 times)
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Container image "registry.svc.ci.openshift.org/ocp/4.0-2019-03-28-160339@sha256:7201ae707e5345237aee6dfbdbde5f8cc6de4350bcee5b07e4d5d2996e566c6b" already present on machine
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Created container installer
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Started container installer
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-137-120.ec2.internal Stopping container kube-apiserver-5
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-137-120.ec2.internal Stopping container kube-apiserver-cert-syncer-5
The test failed due to:

fail [k8s.io/kubernetes/test/e2e/framework/pods.go:77]: Error creating Pod
Expected error:
    <*url.Error | 0xc422d8e9f0>: {
        Op: "Post",
        URL: "https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods",
        Err: {
            Op: "read",
            Net: "tcp",
            Source: {IP: [172, 16, 16, 141], Port: 44432, Zone: ""},
            Addr: {IP: [18, 210, 77, 74], Port: 6443, Zone: ""},
            Err: {Syscall: "read", Err: 0x68},
        },
    }
    Post https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods: read tcp 172.16.16.141:44432->18.210.77.74:6443: read: connection reset by peer
not to have occurred

This is an error I would have expected graceful shutdown not to give me.
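For reference, here is a minimal sketch of what making the test resilient to this could look like - this is not the actual e2e framework code, the helper names are made up, and it assumes the 1.13-era (pre-1.17) client-go Create signature:

package e2eutil

import (
	"net"
	"strings"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// isTransientNetErr reports whether err looks like a connection-level
// failure (for example "read: connection reset by peer") rather than an
// API rejection that should fail the test immediately.
func isTransientNetErr(err error) bool {
	if err == nil {
		return false
	}
	if _, ok := err.(net.Error); ok {
		return true
	}
	return strings.Contains(err.Error(), "connection reset by peer")
}

// createPodWithRetry retries pod creation for up to 30s so a single
// apiserver endpoint dropping out mid-rollout does not fail the test.
func createPodWithRetry(c kubernetes.Interface, ns string, pod *corev1.Pod) (*corev1.Pod, error) {
	var created *corev1.Pod
	err := wait.PollImmediate(2*time.Second, 30*time.Second, func() (bool, error) {
		p, cerr := c.CoreV1().Pods(ns).Create(pod) // pre-1.17 client-go signature
		if cerr == nil {
			created = p
			return true, nil
		}
		if isTransientNetErr(cerr) {
			return false, nil // transient, retry
		}
		return false, cerr // real API error, give up
	})
	return created, err
}

Whether tests should paper over this at all is the separate question noted above; the stronger position is that a graceful rollout should never surface these errors in the first place.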
It would be awesome to see whether this correlates with the SDN giving up on iptables updates for 30 seconds after hitting the iptables lock. Maybe that's the cause, maybe not.
Moving to the Pod team to investigate the scheduler backoff. I think there is a separate BZ to investigate the kube-apiserver rolling a new version out in the middle of / at the end of an e2e test (likely caused by the controller manager rotating kubelet certs too fast, which was fixed last week).
The scheduler is not starting, with the same error as in bug 1691085.

*** This bug has been marked as a duplicate of bug 1691085 ***
I am reopening because the issue that was opened here is about the kube-apiserver doing a rollout during an e2e run that is not fully graceful. That's an urgent issue because it causes a large percentage of the e2e flakes and failures.

Expectation: all graceful rollouts, except for watch disconnects, are *completely* transparent to the user if done correctly. If we see disconnect errors, it's a bug.

----

It looks like the symptoms are different / worse post-rebase - I have seen several e2e runs fail harder than they did pre-rebase, possibly due to different graceful deletion behavior, a slow restart due to OpenAPI, or other causes.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22439/pull-ci-openshift-origin-master-e2e-aws/6175?log#log

This run failed due to an apiserver rolling out and connectivity drops - the rollout is right at the beginning of the e2e run. The following looks interesting but might just be unrelated:

Apr 02 19:15:20.556 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator (combined from similar events): Status for operator kube-apiserver changed: Failing message changed from "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready\nStaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-cert-syncer-5\" is not ready" to "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready" (51 times)

A separate set of failures happens at the end of the run that is probably unrelated to the rollout.
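For context on what "completely transparent" means in practice, here is a simplified sketch of the termination sequence we expect - plain net/http, not the actual kube-apiserver code, and the 35s hold is an assumption matching the number discussed in the following comments: fail /readyz first, keep serving until the load balancer has stopped sending traffic, then drain.

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

func main() {
	var shuttingDown int32

	mux := http.NewServeMux()
	// /readyz flips to failing as soon as shutdown starts, so the load
	// balancer marks the target unhealthy and deregisters it.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if atomic.LoadInt32(&shuttingDown) == 1 {
			http.Error(w, "shutting down", http.StatusInternalServerError)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", Handler: mux}
	go func() { _ = srv.ListenAndServe() }()

	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	// 1. Start failing /readyz, but keep serving normal requests.
	atomic.StoreInt32(&shuttingDown, 1)
	// 2. Hold long enough for the load balancer to deregister the target
	//    (assumed 35s here; see the timing discussion below).
	time.Sleep(35 * time.Second)
	// 3. Drain in-flight requests, then exit.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	_ = srv.Shutdown(ctx)
}

If the hold in step 2 is shorter than the load balancer's deregistration time, we get exactly the connection resets shown above.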
The ELB was taking up to 40 sec to take an endpoint out of the loop, while the kube-apiserver was only guaranteed to stay up for 35 sec, giving some chance that the ELB routes traffic to it after it has terminated. Fixed in https://github.com/openshift/installer/pull/1537.
https://github.com/openshift/installer/pull/1537 was wrong. The (current) AWS theory says that a target is taken out of the loop after at most 30 seconds of /readyz failing.
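To make the timing argument concrete, a back-of-the-envelope check - the health check interval and threshold below are assumptions for illustration, not the installer's actual ELB settings:

package main

import "fmt"

func main() {
	const (
		healthCheckIntervalSec = 10 // assumed ELB health check interval
		unhealthyThreshold     = 3  // assumed consecutive failures before deregistration
		servingWindowSec       = 35 // how long the apiserver keeps serving after /readyz starts failing
	)

	worstCaseRemovalSec := healthCheckIntervalSec * unhealthyThreshold // 30s
	fmt.Printf("worst-case removal: %ds, margin before the apiserver stops serving: %ds\n",
		worstCaseRemovalSec, servingWindowSec-worstCaseRemovalSec)
	// A ~40s removal, as first suspected, would exceed the 35s window and
	// the ELB would route new connections to an already-terminated apiserver.
}

Under the current theory a target is gone after at most ~30s, leaving ~5s of margin inside the 35s serving window; the ~40s removal originally suspected would exceed the window entirely, which would explain the connection resets above.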
Meanwhile, we merged a lot of stabilization fixes and fixed the graceful termination timeouts, so this should happen less often.
There are still two issues:

1. https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6733#openshift-tests-sig-api-machinery-customresourcepublishopenapi-featurecustomresourcepublishopenapi-works-for-crd-with-validation-schema-suiteopenshiftconformanceparallel-suitek8s looks similar to the above and still needs triage.

2. The bulk of the failures look new: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6733#openshift-tests-featuredeploymentconfig-deploymentconfigs-with-multiple-image-change-triggers-conformance-should-run-a-successful-deployment-with-a-trigger-used-by-different-containers-suiteopenshiftconformanceparallelminimal. I will spawn a separate bug for the "Unauthorized" errors: https://bugzilla.redhat.com/show_bug.cgi?id=1700046
Per comment 10, moving back to MODIFIED to revert the automated change from comment 11.
Can't reproduce the issue with payload 4.1.0-0.nightly-2019-05-04-210601, will verify the issue:

[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        False         3h39m   Cluster version is 4.1.0-0.nightly-2019-05-04-210601

[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"e8d1fd69b", GitTreeState:"clean", BuildDate:"2019-04-29T06:42:38Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4a26b6", GitCommit:"d4a26b6", GitTreeState:"clean", BuildDate:"2019-05-04T18:12:07Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758