test: Application behind service load balancer with PDB is not disrupted is failing frequently in CI, see search results: https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted
Currently this is an OVS 3-second disruption due to the kernel rejecting new flows. 4.5 will move OVS to the node, and Michael Cambria is working on it. When that lands this should become zero, or the bug should be reopened and we find a new cause.
Assigning this to the 4.5 release to be worked on. We likely will not be able to backport the fix, but we can work out whether there is anything else we can do, or at least we can change the test to more accurately reflect what we tolerate.
Setting to urgent, we expect full availability during upgrades in 4.5.
Why was this moved to 4.6? This impacts application workloads and we were supposed to have fixed it.
This represents a fundamental product flaw (taking workload outages during an upgrade). It cannot be deferred w/o agreement from at least the group lead and preferably discussion with the architecture team. Setting back to 4.5.
(In reply to Clayton Coleman from comment #1)
> Current this is OVS 3 second disruption due to kernel rejecting new flows.
> 4.5 will move OVS to the node, and michael cambria is working on it. When
> that lands this should become zero, or be reopened and we find a new cause.

OVS is running on the host and OVNKubernetes has been changed to make use of this. Can you retest?
https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=48h&context=1&type=junit&name=.*4.6.*&maxMatches=5&maxBytes=20971520&groupBy=job shows it is still happening quite frequently.
@Tim You got assigned to this during the bug scrum. Could you please have a look at this and try to figure out why we are still experiencing networking downtime after the move to system OVS? Thanks! Alex
Looking at the logs we have 2 pods backing the test service. The test constantly does HTTP GET /echo?msg=Hello on the service. I can see both pods stop receiving these at the same time (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1285191438505611264/artifacts/e2e-aws-upgrade/):

#### service-test-sls6t (on ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw)

2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:39 GET /hostName
2020/07/20 06:27:00 GET /hostName
2020/07/20 06:27:03 GET /hostName

#### service-test-xfrrq

2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:40 GET /hostName
2020/07/20 06:27:01 GET /hostName
2020/07/20 06:27:04 GET /hostName

So we can see the requests stop coming at 06:21:39. Looking at worker 9hwqw:

Jul 20 06:15:45.350738 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw systemd[1]: Shutting down.
Jul 20 06:18:35.463612 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:18:35.463138 1539 config.go:412] Receiving a new pod "service-test-sls6t_e2e-k8s-service-lb-available-9610(2788de88-c1db-46dd-8f25-73aa3d464cd1)"
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900557 1539 prober.go:181] HTTP-Probe Host: http://10.128.2.13, Port: 80, Path: /hostName
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900619 1539 prober.go:184] HTTP-Probe Headers: map[]
Jul 20 06:21:00.901739 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.901662 1539 http.go:128] Probe succeeded for http://10.128.2.13:80/hostName, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Length:[18] Content-Type:[text/plain; charset=utf-8] Date:[Mon, 20 Jul 2020 06:21:00 GMT]] 0xc000992f00 18 [] true false map[] 0xc0000bd500 <nil>}

The worker has been upgraded and has been up for over 5 minutes.
The pod exists and keeps responding to health checks, but the service suddenly stops working. It looks to me like this doesn't have anything to do with a temporary network outage, and looks more like something went wrong with openshift-sdn. This bug should probably be re-assigned to someone who works on openshift-sdn.
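For reference, the disruption check this test performs can be approximated as follows (a minimal sketch in plain Python; the real monitor lives in the openshift/origin e2e code, and the URL and interval here are placeholders, not the test's actual values):

```python
# Sketch of a service-availability poller in the spirit of the e2e disruption
# monitor: GET the load-balancer endpoint on a fixed interval and record
# contiguous windows during which requests fail. Endpoint and timings are
# illustrative placeholders only.
import time
import urllib.error
import urllib.request

def poll_for_disruptions(url: str, duration_s: float, interval_s: float = 1.0):
    """Return a list of (start, end) monotonic timestamps during which GETs failed."""
    disruptions = []
    outage_start = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        now = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                ok = resp.status == 200
        except (urllib.error.URLError, OSError):
            ok = False
        if not ok and outage_start is None:
            outage_start = now                      # outage begins
        elif ok and outage_start is not None:
            disruptions.append((outage_start, now))  # outage ends
            outage_start = None
        time.sleep(interval_s)
    if outage_start is not None:                     # still down at the end
        disruptions.append((outage_start, time.monotonic()))
    return disruptions
```

With something like this, an upgrade-long run against the service's external IP yields the outage windows that the test then sums and compares against its tolerance.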
Possibly has gotten worse recently?

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-upgrade - 9 runs, 11% failed, 400% of failures match
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 13 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 600% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 88% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 3 runs, 100% failed, 33% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 167% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 133% of failures match

That^ will include jobs with both flaky and failing test-cases, which may overestimate the impact on throughput. We can also search on the fatal error message to exclude currently-acceptable flakes like:

  Service was unreachable during disruption for at least 23s of 30m10s (1%), this is currently sufficient to pass the test/job but not considered completely correct:

with:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Service+was+unreachable+during+disruption+for+at+least+.*+of.*%5C%29%3A&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 5 runs, 80% failed, 100% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 12 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 6 runs, 17% failed, 500% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 75% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 67% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 67% of failures match

Although it's possible that that^ is overestimating by looping in some other test-cases.
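As an aside on reading those numbers: "% of failures match" can exceed 100% because, as noted above, the search also matches flaky runs that ultimately passed, while the denominator counts only failed runs. A quick illustration (plain Python, invented numbers):

```python
# Why "% of failures match" can exceed 100%: the numerator counts runs whose
# output matched the search string (including flakes in passing runs), while
# the denominator counts only failed runs. The numbers below are invented.
def pct_of_failures_match(matching_runs: int, failed_runs: int) -> int:
    return round(100 * matching_runs / failed_runs)

# e.g. 9 runs, 1 failed, but 4 runs matched -> reported as "400% of failures match"
print(pct_of_failures_match(4, 1))
```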
GCP also seems overly represented in comment 13. Maybe we recently broke something there? Although gcp-upgrade-4.5 also shows up, so that suggests it's either a long-running issue on GCP or something we broke recently and backported at least to 4.5. I dunno. Would be great if we got it fixed everywhere :)
Ooh, Kirsten has discovered [1]. I don't understand what that does, but in [2], David is linking [3] where "Application behind service load balancer with PDB is not disrupted" is the only failing test-case. So maybe that will fix this bug :) [1]: https://github.com/openshift/kubernetes/pull/300 [2]: https://github.com/openshift/kubernetes/pull/300#issue-457945177 [3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25314/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1287993294235635712
*** Bug 1861944 has been marked as a duplicate of this bug. ***
No failure on the master-e2e-gcp-upgrade jobs since July 31 https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=336h&context=1&type=bug%2Bjunit&name=master-e2e-gcp-upgrade&maxMatches=5&maxBytes=20971520&groupBy=job it seems comment #15 is correct and https://github.com/openshift/kubernetes/pull/300 fixed the problem :-)
(In reply to Antonio Ojea from comment #17)
> No failure on the master-e2e-gcp-upgrade jobs since July 31
>
> https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=336h&context=1&type=bug%2Bjunit&name=master-e2e-gcp-upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
>
> it seems comment #15 is correct and
> https://github.com/openshift/kubernetes/pull/300 fixed the problem :-)

You'll need to unskip the test to confirm whether the PR you linked to solves the problem. The test was disabled as part of the 1.19 bump in the absence of David's fix:

https://github.com/openshift/origin/blob/master/test/e2e/upgrade/upgrade.go#L42

Unskipping the test is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1861944 and is proposed in https://github.com/openshift/origin/pull/25354

Note that the issue fixed by 300 was introduced in 1.19, and this bz was filed well before 1.19 landed.
> is it possible that this is a GCP thing?

First, let's try to verify this hypothesis: https://github.com/openshift/origin/pull/25354#issuecomment-671758837

AWS (Run #0: Failed, 34m14s):

Service was unreachable during disruption for at least 1s of 29m45s (0%), this is currently sufficient to pass the test/job but not considered completely correct:
Aug 11 07:39:29.916 E ns/e2e-k8s-service-lb-available-1640 svc/service-test Service stopped responding to GET requests over new connections
Aug 11 07:39:29.932 I ns/e2e-k8s-service-lb-available-1640 svc/service-test Service started responding to GET requests over new connections

GCP:

Aug 11 08:00:19.100: Service was unreachable during disruption for at least 49s of 32m30s (3%):

Will run again for double-checking: https://github.com/openshift/origin/pull/25354#issuecomment-671804696
* AWS: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-aws-upgrade/1293100591748222976

Service was unreachable during disruption for at least 8s of 30m50s (0%), this is currently sufficient to pass the test/job but not considered completely correct:
Aug 11 09:32:51.033 - 3s E ns/e2e-k8s-service-lb-available-4472 svc/service-test Service is not responding to GET requests over new connections

* GCP: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1293100591773388800

Aug 11 09:43:11.214: Service was unreachable during disruption for at least 1m8s of 31m10s (4%):
(snipped)
Aug 11 09:34:40.029 - 18s E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:37:42.029 - 11s E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests on reused connections
(snipped)
Aug 11 09:38:01.029 - 5s E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:40:40.029 - 22s E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections

The second round shows the same outcome: AWS passes and GCP fails. How do we want to proceed?
Interesting: setting the service's externalTrafficPolicy to Local passes the tests on both cloud providers. I filed a PR and you can see the results there: https://github.com/openshift/origin/pull/25406

I think that for cloud providers it makes more sense to use Local to avoid the possible double hop LB -> Node -> Pod. Is there any special reason for not using it?
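For reference, the change amounts to setting one field on the test's LoadBalancer Service. A sketch of such a manifest (names and ports here are illustrative; the actual Service is built in the e2e code, not from a YAML file):

```yaml
# Sketch of a LoadBalancer Service with externalTrafficPolicy: Local.
# With Local, the cloud LB health-checks each node and routes only to nodes
# that host a ready endpoint, avoiding the extra node-to-node hop (and
# preserving the client source IP). The default policy is Cluster.
apiVersion: v1
kind: Service
metadata:
  name: service-test
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    app: service-test
  ports:
    - port: 80
      targetPort: 8080
```

The trade-off is that Local can leave traffic imbalanced across nodes and depends on the LB's node health checks reacting quickly when endpoints move during an upgrade.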
I don't have any special knowledge of the test or the component. I think that change would need to be discussed with the networking-edge team (or possibly Clayton Coleman, as the test author) to ensure we're not losing coverage of an intended scenario.
cherry pick to 4.6 branch https://github.com/openshift/origin/pull/25482
I fear this is still happening: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5/1306525777306587136
(In reply to Devan Goodwin from comment #32)
> I fear this is still happening:
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5/1306525777306587136

The patch is not in that branch, right? https://github.com/openshift/origin/pull/25482
Fixing the infrastructure test link. That test failure is related to CI not starting the installer correctly.
Still seeing this in 4.6: https://search.ci.openshift.org/?search=%5C%5Bsig-network-edge%5C%5D+Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job For example, https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1314214514333323264
(In reply to Benjamin Gilbert from comment #36)
> Still seeing this in 4.6:
> https://search.ci.openshift.org/?search=%5C%5Bsig-network-edge%5C%5D+Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
>
> For example,
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1314214514333323264

The particular example that you link shows that the test failed the first time and passed the second time. Most of the results in the search query that you linked show:

> this is currently sufficient to pass the test/job but not considered completely correct

This test has to have a margin of error because there are multiple factors influencing the result, and not all of them are under our control:

- external factors: the cloud provider load balancer, the path from the client to the load balancer, the status of the VMs, ...
- internal factors: for example, the client that is polling the API. Is the client running alone or alongside more services (there is a known issue with one particular test cluster having io timeouts accessing APIs)? Does the VM where the client runs have enough resources? Is there congestion in that particular network?

Don't get me wrong, the test has to pass, but for that we have to know the real percentage of failures, I mean without flakes: the times that the test failed both attempts in the same job and caused the job to fail.

This test simulates a highly available application, and for that we need to define an SLA and the number of 9s we want to support. If we want to make this perfect we need to change more things, not only our code but also how the test runs, so there is a trade-off here; we should weigh that and define what is "good enough" and what the priority is.
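To make the SLA discussion concrete, here is a rough sketch (plain Python; the nines targets are illustrative, not thresholds the test actually uses) of how an availability target translates into a tolerated disruption budget over a ~30-minute upgrade window:

```python
# Rough sketch: translate an availability target ("number of nines") into a
# downtime budget for an upgrade window, then compare against an observed
# disruption. The targets are illustrative, not the test's real thresholds.

def allowed_downtime_seconds(window_seconds: float, nines: int) -> float:
    """Downtime budget for a window at an availability of `nines` nines."""
    availability = 1 - 10 ** (-nines)        # e.g. 3 nines -> 0.999
    return window_seconds * (1 - availability)

window = 30 * 60 + 10   # 30m10s, as in the "23s of 30m10s (1%)" message above

for nines in (1, 2, 3):
    budget = allowed_downtime_seconds(window, nines)
    print(f"{nines} nine(s): tolerate {budget:.1f}s of disruption")

# The run quoted earlier saw 23s of unreachability in 30m10s (~1.3%): that
# meets a one-nine target (~181s) but not a two-nine target (~18.1s).
observed = 23
print("meets 2 nines:", observed <= allowed_downtime_seconds(window, 2))
```

Picking the number of nines (and whether the budget applies to the whole window or only the node-reboot phases) is exactly the "define what is good enough" decision.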
Setting the target to 4.7 to investigate further. We will consider backports of any fixes we identify. Since the move to system ovs we should be able to tolerate the sdn restart. We need to identify where the outage comes from. If it is due to endpoint changes while the sdn is restarted, we may not be able to do anything about that (and we should make the test detect that case). But if there is any other cause, we need to identify it and work out how to fix it.
> Most of the results in the search query that you linked shows
>> this is currently sufficient to pass the test/job but not considered completely correct

I know. The build watcher rules require an open bug for every frequently failing test. Feel free to set a low priority.
Analyzing this failure https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5/1306525777306587136 and correlating some logs with the failure reason:

Sep 17 10:57:00.502 E ns/e2e-k8s-service-lb-available-3308 svc/service-test Service stopped responding to GET requests over new connections
Sep 17 10:57:00.502: INFO: Service service-test is unreachable on new connections: Get http://40.89.252.10:80/echo?msg=Hello&timeout=10s: dial tcp 40.89.252.10:80: connect: no route to host
Sep 17 10:57:00.594 I ns/e2e-k8s-service-lb-available-3308 svc/service-test Service started responding to GET requests over new connections
Sep 17 10:57:04.982 E ns/e2e-k8s-service-lb-available-3308 svc/service-test Service stopped responding to GET requests over new connections
Sep 17 10:57:04.982: INFO: Service service-test is unreachable on new connections: Get http://40.89.252.10:80/echo?msg=Hello&timeout=10s: dial tcp 40.89.252.10:80: connect: no route to host

Some of the errors are caused by:

> connect: no route to host

AFAIK this can mean that there is an error returned by the kernel on the client, or that an icmp-host-unreachable was received. Maybe we are overloading the client?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.5.21 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5194