Bug 1694186

Summary: kube-apiserver started a redeploy during e2e run (should not do so), and rollout is not graceful (causing e2e failures)
Product: OpenShift Container Platform
Component: Master
Version: 4.1.0
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: Clayton Coleman <ccoleman>
Assignee: Stefan Schimanski <sttts>
QA Contact: zhou ying <yinzhou>
CC: aos-bugs, jokerman, mfojtik, mmccomas, rvokal, sttts, wsun
Keywords: BetaBlocker, Reopened
Target Milestone: ---
Story Points: ---
Last Closed: 2019-06-04 10:46:40 UTC
Type: Bug

Description Clayton Coleman 2019-03-29 17:46:25 UTC
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/6264/artifacts/e2e-aws/e2e.log

Near the end of the e2e run (~20m in) the apiserver is restarted, which causes a test to exit (whether the test should be resilient to that I will track as a separate bug). I had expected no rollouts, and it looks like kube-scheduler is crash looping on a master? If we are rolling out, that would cause other flakes.


Mar 28 16:34:33.813 W clusteroperator/kube-scheduler changed Failing to False
Mar 28 16:34:57.311 - 69s   I test="[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]" running
Mar 28 16:34:57.312 - 36s   I test="[Feature:ImageInfo] Image info should display information about images [Suite:openshift/conformance/parallel]" running
Mar 28 16:34:58.395 W ns/openshift-kube-scheduler pod/openshift-kube-scheduler-ip-10-0-164-176.ec2.internal Back-off restarting failed container (6 times)
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator (combined from similar events): Created Pod/revision-pruner-6-ip-10-0-164-176.ec2.internal -n openshift-kube-apiserver because it was missing
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Container image "registry.svc.ci.openshift.org/ocp/4.0-2019-03-28-160339@sha256:7201ae707e5345237aee6dfbdbde5f8cc6de4350bcee5b07e4d5d2996e566c6b" already present on machine
Mar 28 16:34:58.395 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Created container pruner
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/revision-pruner-6-ip-10-0-164-176.ec2.internal Started container pruner
Mar 28 16:34:58.409 W ns/openshift-kube-scheduler pod/openshift-kube-scheduler-ip-10-0-164-176.ec2.internal Back-off restarting failed container (7 times)
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Container image "registry.svc.ci.openshift.org/ocp/4.0-2019-03-28-160339@sha256:7201ae707e5345237aee6dfbdbde5f8cc6de4350bcee5b07e4d5d2996e566c6b" already present on machine
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Created container installer
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/installer-6-ip-10-0-137-120.ec2.internal Started container installer
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-137-120.ec2.internal Stopping container kube-apiserver-5
Mar 28 16:34:58.409 I ns/openshift-kube-apiserver pod/kube-apiserver-ip-10-0-137-120.ec2.internal Stopping container kube-apiserver-cert-syncer-5

Comment 1 Clayton Coleman 2019-03-29 17:51:04 UTC
The test failed due to 

fail [k8s.io/kubernetes/test/e2e/framework/pods.go:77]: Error creating Pod
Expected error:
    <*url.Error | 0xc422d8e9f0>: {
        Op: "Post",
        URL: "https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods",
        Err: {
            Op: "read",
            Net: "tcp",
            Source: {
                IP: [172, 16, 16, 141],
                Port: 44432,
                Zone: "",
            },
            Addr: {
                IP: [18, 210, 77, 74],
                Port: 6443,
                Zone: "",
            },
            Err: {Syscall: "read", Err: 0x68},
        },
    }
    Post https://api.ci-op-41c5v8wf-5a633.origin-ci-int-aws.dev.rhcloud.com:6443/api/v1/namespaces/e2e-test-image-info-rhr5p/pods: read tcp 172.16.16.141:44432->18.210.77.74:6443: read: connection reset by peer
not to have occurred

That is ECONNRESET (errno 0x68 = 104), i.e. the connection was torn down mid-request - something I would have expected graceful shutdown to not give me.

Comment 3 Stefan Schimanski 2019-04-01 09:44:16 UTC
It would be awesome to see correlations with SDN giving up on iptables updates for 30 sec due to hitting the iptables lock. Maybe that's the cause, maybe not.

Comment 4 Michal Fojtik 2019-04-01 10:14:41 UTC
Moving to Pod team to investigate the Scheduler backoff.

I think there is a separate BZ to investigate the kube-apiserver rolling a new version out in the middle of, or at the end of, an e2e test (likely caused by the controller manager rotating kubelet certs faster, which was fixed last week).

Comment 5 Seth Jennings 2019-04-01 16:57:12 UTC
The scheduler is failing to start with the same error as in bug 1691085.

*** This bug has been marked as a duplicate of bug 1691085 ***

Comment 6 Clayton Coleman 2019-04-02 20:28:59 UTC
I am reopening because the issue that was opened here is about kube-apiserver doing a rollout during an e2e run that is not fully graceful.  That's an urgent issue because it causes a large percentage of the e2e flakes and failures.

Expectation: all graceful rollouts except for watch disconnects are *completely* transparent to the user if done correctly.  If we see disconnect errors, it's a bug.
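
For illustration only - this is a minimal Go sketch of that expectation, not the actual kube-apiserver termination code - the sequence a load-balanced API server needs so that a restart stays invisible to clients: start failing /readyz on SIGTERM, keep serving until the load balancer has had time to stop routing new connections to the instance, then drain in-flight requests before exiting. The 35s/60s durations and the :6443 listen address below are placeholders, not the real configuration.

    package main

    import (
        "context"
        "log"
        "net/http"
        "os"
        "os/signal"
        "sync/atomic"
        "syscall"
        "time"
    )

    func main() {
        var terminating atomic.Bool

        mux := http.NewServeMux()
        // /readyz starts failing as soon as termination begins, so the load
        // balancer can take this instance out of rotation.
        mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
            if terminating.Load() {
                http.Error(w, "shutting down", http.StatusInternalServerError)
                return
            }
            w.WriteHeader(http.StatusOK)
        })
        mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("ok\n"))
        })

        srv := &http.Server{Addr: ":6443", Handler: mux}
        go func() {
            if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
                log.Fatal(err)
            }
        }()

        sig := make(chan os.Signal, 1)
        signal.Notify(sig, syscall.SIGTERM, os.Interrupt)
        <-sig

        // 1. Report not-ready but keep serving existing and new requests.
        terminating.Store(true)

        // 2. Give the load balancer's health checks time to notice and stop
        //    sending new connections here (placeholder duration).
        time.Sleep(35 * time.Second)

        // 3. Stop accepting connections and drain what is still in flight, so
        //    clients see at worst a watch disconnect, never a connection reset.
        ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
        defer cancel()
        if err := srv.Shutdown(ctx); err != nil {
            log.Printf("drain incomplete: %v", err)
        }
    }

If the process instead exits (or closes its listeners) before step 2 has passed, a load balancer that has not yet deregistered the target keeps routing new requests to it, and the client sees exactly the "connection reset by peer" from comment 1.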

----

It looks like the symptoms are different / worse post rebase - I have seen several e2e runs fail harder than they did pre-rebase, possibly due to different graceful deletion behavior, a slow restart due to openapi, or other causes.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22439/pull-ci-openshift-origin-master-e2e-aws/6175?log#log

This run failed due to an apiserver rollout and connectivity drops - the rollout is right at the beginning of the e2e run.

This looks interesting but might just be unrelated:

Apr 02 19:15:20.556 I ns/openshift-kube-apiserver-operator deployment/kube-apiserver-operator (combined from similar events): Status for operator kube-apiserver changed: Failing message changed from "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready\nStaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-cert-syncer-5\" is not ready" to "StaticPodsFailing: nodes/ip-10-0-158-12.ec2.internal pods/kube-apiserver-ip-10-0-158-12.ec2.internal container=\"kube-apiserver-5\" is not ready" (51 times)

A separate set of failures, probably unrelated to the rollout, happens at the end of the run.

Comment 7 Stefan Schimanski 2019-04-05 10:12:53 UTC
The ELB was taking up to 40 sec to take an endpoint out of rotation, while the kube-apiserver is only guaranteed to stay up for 35 sec, leaving a window in which the ELB can route traffic to it after it has terminated.

Fixed in https://github.com/openshift/installer/pull/1537.

Comment 8 Stefan Schimanski 2019-04-12 12:42:17 UTC
https://github.com/openshift/installer/pull/1537 was wrong. The (current) AWS theory says that a target is taken out of the loop after at most 30 seconds of /readyz failing.
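
As a back-of-the-envelope sketch of that race (the numbers below are illustrative, not the installer's actual health-check settings): the worst case for a target to be taken out of rotation is roughly one probe interval of detection lag plus unhealthy-threshold consecutive failing probes, and that has to be compared against how long the apiserver keeps serving after /readyz starts failing.

    package main

    import (
        "fmt"
        "time"
    )

    // Worst case for a target to be removed from rotation after /readyz starts
    // failing: the first failing probe can land almost a full interval after the
    // last passing one, and removal happens only after `unhealthy` consecutive
    // failures.
    func worstCaseRemoval(interval time.Duration, unhealthy int) time.Duration {
        return interval + time.Duration(unhealthy)*interval
    }

    func main() {
        // Illustrative values only - not the actual installer configuration.
        interval := 10 * time.Second
        unhealthy := 3
        servingWindow := 35 * time.Second // how long the apiserver keeps serving after /readyz fails

        removal := worstCaseRemoval(interval, unhealthy)
        fmt.Printf("LB removal worst case: %v, apiserver serving window: %v\n", removal, servingWindow)
        if removal > servingWindow {
            fmt.Println("race: the LB can still route new connections to an apiserver that is already gone")
        }
    }

With these placeholder values the removal worst case (40s) exceeds the serving window (35s), which is the shape of the mismatch described in comment 7; closing it means making the serving window comfortably larger than the load balancer's worst-case removal time (or making the health checks faster).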

Comment 9 Michal Fojtik 2019-04-12 12:44:09 UTC
Meanwhile we have merged a lot of stabilization fixes and fixed the graceful termination timeouts, so this should happen less often.

Comment 12 Xingxing Xia 2019-04-16 06:43:28 UTC
Per comment 10, moving back to MODIFIED to revert the immediate robot comment 11.

Comment 14 zhou ying 2019-05-05 08:34:55 UTC
Can't reproduce the issue with payload 4.1.0-0.nightly-2019-05-04-210601; verifying the issue:
[root@dhcp-140-138 ~]# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        False         3h39m   Cluster version is 4.1.0-0.nightly-2019-05-04-210601
[root@dhcp-140-138 ~]# oc version
Client Version: version.Info{Major:"4", Minor:"1+", GitVersion:"v4.1.0", GitCommit:"e8d1fd69b", GitTreeState:"clean", BuildDate:"2019-04-29T06:42:38Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.4+d4a26b6", GitCommit:"d4a26b6", GitTreeState:"clean", BuildDate:"2019-05-04T18:12:07Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Comment 16 errata-xmlrpc 2019-06-04 10:46:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758