Bug 1828858 - Application behind service load balancer with PDB is not disrupted
Summary: Application behind service load balancer with PDB is not disrupted
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.5.z
Assignee: Antonio Ojea
QA Contact: huirwang
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1861944
Depends On: 1891711
Blocks: 1868490
 
Reported: 2020-04-28 13:38 UTC by Ben Parees
Modified: 2020-12-01 10:49 UTC
CC List: 13 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1868489 1886620
Environment:
Last Closed: 2020-12-01 10:48:48 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/origin pull 25406 (closed): Bug 1828858: use LoadBalancers with ExternalTrafficPolicy set to Local (last updated 2021-01-14 20:04:26 UTC)
Github openshift/origin pull 25635 (closed): [release-4.5] Bug 1828858: use net/http instead of client-go for e2e PDB test (last updated 2021-01-14 20:04:25 UTC)
Red Hat Product Errata RHSA-2020:5194 (last updated 2020-12-01 10:49:34 UTC)

Description Ben Parees 2020-04-28 13:38:18 UTC
test:
Application behind service load balancer with PDB is not disrupted 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted

Comment 1 Clayton Coleman 2020-04-28 13:49:59 UTC
Currently this is an OVS 3-second disruption due to the kernel rejecting new flows.  4.5 will move OVS to the node, and Michael Cambria is working on it.  When that lands this should become zero, or this should be reopened and we'll find a new cause.

Comment 2 Ben Bennett 2020-04-28 15:21:24 UTC
Assigning this to the 4.5 release to get worked on.  We likely will not be able to backport the fix, but we can work out if there is anything else we can do, or at least we can change the test to make it more accurately reflect what we tolerate.

Comment 4 Ben Parees 2020-04-28 18:58:03 UTC
Setting to urgent, we expect full availability during upgrades in 4.5.

Comment 5 Clayton Coleman 2020-05-18 14:43:53 UTC
Why was this moved to 4.6?  This impacts application workloads and we were supposed to have fixed it.

Comment 6 Ben Parees 2020-05-18 15:56:35 UTC
This represents a fundamental product flaw (taking workload outages during an upgrade).  It cannot be deferred w/o agreement from at least the group lead and preferably discussion with the architecture team.

Setting back to 4.5.

Comment 9 mcambria@redhat.com 2020-07-13 13:00:48 UTC
(In reply to Clayton Coleman from comment #1)
> Currently this is an OVS 3-second disruption due to the kernel rejecting new
> flows.  4.5 will move OVS to the node, and Michael Cambria is working on it.
> When that lands this should become zero, or this should be reopened and we'll
> find a new cause.

OVS is running on the host and OVNKubernetes has been changed to make use of this.  Can you retest?

Comment 11 Alexander Constantinescu 2020-07-20 15:28:03 UTC
@Tim

You got assigned to this during the bug-scrum.

Could you please have a look at this and try to figure out why we are experiencing networking downtime after the move to system OVS? 

Thanks!
Alex

Comment 12 Tim Rozet 2020-07-20 21:56:42 UTC
Looking at the logs we have 2 pods backing the test service. The test constantly does HTTP GET /echo?msg=Hello on the service. I can see both pods stop receiving these at the same time (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1285191438505611264/artifacts/e2e-aws-upgrade/):

####
service-test-sls6t
ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw

2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:39 GET /hostName

2020/07/20 06:27:00 GET /hostName
2020/07/20 06:27:03 GET /hostName


####
service-test-xfrrq
2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:40 GET /hostName

2020/07/20 06:27:01 GET /hostName
2020/07/20 06:27:04 GET /hostName

So we can see the requests stop coming at 6:21:39 and looking at worker 9hwqw:
Jul 20 06:15:45.350738 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw systemd[1]: Shutting down.
Jul 20 06:18:35.463612 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:18:35.463138    1539 config.go:412] Receiving a new pod "service-test-sls6t_e2e-k8s-service-lb-available-9610(2788de88-c1db-46dd-8f25-73aa3d464cd1)"
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900557    1539 prober.go:181] HTTP-Probe Host: http://10.128.2.13, Port: 80, Path: /hostName
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900619    1539 prober.go:184] HTTP-Probe Headers: map[]
Jul 20 06:21:00.901739 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.901662    1539 http.go:128] Probe succeeded for http://10.128.2.13:80/hostName, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Length:[18] Content-Type:[text/plain; charset=utf-8] Date:[Mon, 20 Jul 2020 06:21:00 GMT]] 0xc000992f00 18 [] true false map[] 0xc0000bd500 <nil>}

The worker has been upgraded and has been up for over 5 minutes. The pod exists and keeps responding to health checks, but the service suddenly stops working. It looks to me like this doesn't have anything to do with a temporary network outage; it looks more like something went wrong with openshift-sdn. This bug should probably be reassigned to someone who works on openshift-sdn.
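
For illustration, the availability check behind this test boils down to a tight HTTP poll loop against the LoadBalancer address, recording any window where GET requests fail. A minimal Go sketch of that idea (the URL, interval, and output format are placeholders, not the actual origin monitor code):

  package main

  import (
      "fmt"
      "net/http"
      "time"
  )

  // pollService hits the service endpoint once per interval and prints how long
  // any window of consecutive failures lasted, roughly mirroring the
  // "Service stopped/started responding to GET requests" events in the job logs.
  func pollService(url string, interval time.Duration, stop <-chan struct{}) {
      client := &http.Client{Timeout: 10 * time.Second}
      var downSince time.Time
      for {
          select {
          case <-stop:
              return
          case <-time.After(interval):
          }
          resp, err := client.Get(url)
          if err != nil || resp.StatusCode != http.StatusOK {
              if resp != nil {
                  resp.Body.Close()
              }
              if downSince.IsZero() {
                  downSince = time.Now()
              }
              continue
          }
          resp.Body.Close()
          if !downSince.IsZero() {
              fmt.Printf("service was unreachable for %s\n", time.Since(downSince).Round(time.Second))
              downSince = time.Time{}
          }
      }
  }

  func main() {
      stop := make(chan struct{})
      // 203.0.113.10 is a documentation address standing in for the LoadBalancer
      // ingress IP; the real test discovers it from the Service status and keeps
      // polling for the whole upgrade.
      go pollService("http://203.0.113.10/echo?msg=Hello", time.Second, stop)
      time.Sleep(5 * time.Minute)
      close(stop)
  }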

Comment 13 W. Trevor King 2020-07-29 20:49:42 UTC
Possibly has gotten worse recently?

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-upgrade - 9 runs, 11% failed, 400% of failures match
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 13 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 600% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 88% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 3 runs, 100% failed, 33% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 167% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 133% of failures match

That^ will include jobs with both flaky and failing test-cases, which may overestimate the impact on throughput.  We can also search on the fatal error message to exclude currently-acceptable flakes like:

  Service was unreachable during disruption for at least 23s of 30m10s (1%), this is currently sufficient to pass the test/job but not considered completely correct:

with:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Service+was+unreachable+during+disruption+for+at+least+.*+of.*%5C%29%3A&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 5 runs, 80% failed, 100% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 12 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 6 runs, 17% failed, 500% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 75% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 67% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 67% of failures match

Although it's possible that that^ is overestimating by looping in some other test-cases.

Comment 14 W. Trevor King 2020-07-29 20:51:51 UTC
GCP also seems over-represented in comment 13.  Maybe we recently broke something there?  Although gcp-upgrade-4.5 also shows up, which suggests it's either a long-running issue or something we broke recently and backported at least to 4.5.  On GCP.  I dunno.  Would be great if we got it fixed everywhere :)

Comment 15 W. Trevor King 2020-07-29 21:15:40 UTC
Ooh, Kirsten has discovered [1].  I don't understand what that does, but in [2], David is linking [3] where "Application behind service load balancer with PDB is not disrupted" is the only failing test-case.  So maybe that will fix this bug :)

[1]: https://github.com/openshift/kubernetes/pull/300
[2]: https://github.com/openshift/kubernetes/pull/300#issue-457945177
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25314/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1287993294235635712

Comment 16 Andrew McDermott 2020-07-30 08:33:46 UTC
*** Bug 1861944 has been marked as a duplicate of this bug. ***

Comment 18 Maru Newby 2020-08-04 15:05:05 UTC
(In reply to Antonio Ojea from comment #17)
> No failure on the master-e2e-gcp-upgrade jobs since July 31
> 
> https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=336h&context=1&type=bug%2Bjunit&name=master-e2e-gcp-upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> it seems comment #15 is correct and
> https://github.com/openshift/kubernetes/pull/300 fixed the problem :-)

You'll need to unskip the test to confirm whether the PR you linked to solves the problem. The test was disabled as part of the 1.19 bump in the absence of David's fix:

https://github.com/openshift/origin/blob/master/test/e2e/upgrade/upgrade.go#L42

Unskipping the test is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1861944 and is proposed in https://github.com/openshift/origin/pull/25354

Note that the issue fixed by 300 was introduced in 1.19, and this bz was filed well before 1.19 landed.

Comment 22 Antonio Ojea 2020-08-11 08:24:02 UTC
> is it possible that this is a GCP thing?

First try to verify this hypothesis:

https://github.com/openshift/origin/pull/25354#issuecomment-671758837

AWS

Run #0: Failed  34m14s
Service was unreachable during disruption for at least 1s of 29m45s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Aug 11 07:39:29.916 E ns/e2e-k8s-service-lb-available-1640 svc/service-test Service stopped responding to GET requests over new connections
Aug 11 07:39:29.932 I ns/e2e-k8s-service-lb-available-1640 svc/service-test Service started responding to GET requests over new connections


GCP

Aug 11 08:00:19.100: Service was unreachable during disruption for at least 49s of 32m30s (3%):

Will run again to double-check:
https://github.com/openshift/origin/pull/25354#issuecomment-671804696

Comment 23 Antonio Ojea 2020-08-11 10:20:10 UTC
* AWS https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-aws-upgrade/1293100591748222976

Service was unreachable during disruption for at least 8s of 30m50s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Aug 11 09:32:51.033 - 3s    E ns/e2e-k8s-service-lb-available-4472 svc/service-test Service is not responding to GET requests over new connections


* GCP https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1293100591773388800

Aug 11 09:43:11.214: Service was unreachable during disruption for at least 1m8s of 31m10s (4%):

(snipped)
Aug 11 09:34:40.029 - 18s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:37:42.029 - 11s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests on reused connections
(snipped)
Aug 11 09:38:01.029 - 5s    E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:40:40.029 - 22s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections


The second round shows the same outcome: AWS passes and GCP fails.  How do we want to proceed?

Comment 24 Antonio Ojea 2020-08-11 18:01:09 UTC
Interesting: setting the service's ExternalTrafficPolicy to Local passes the tests on both cloud providers.

I filed a PR and you can see the results there https://github.com/openshift/origin/pull/25406

I think that for cloud providers it makes more sense to use Local to avoid the possible double hop LB -> Node -> Pod.

Are there any special reasons for not using it?
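
For reference, the change boils down to setting ExternalTrafficPolicy on the test's LoadBalancer Service. A minimal sketch of such a Service built with the upstream Go API types (the name, selector, and ports are illustrative, not the fixture from PR 25406):

  package main

  import (
      "fmt"

      corev1 "k8s.io/api/core/v1"
      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/apimachinery/pkg/util/intstr"
  )

  // serviceTestLB builds a LoadBalancer Service with ExternalTrafficPolicy set to
  // Local, so the cloud load balancer health-checks each node and only forwards
  // traffic to nodes that actually have a local endpoint, avoiding the extra
  // LB -> node -> node hop that the default Cluster policy allows.
  func serviceTestLB() *corev1.Service {
      return &corev1.Service{
          ObjectMeta: metav1.ObjectMeta{Name: "service-test"},
          Spec: corev1.ServiceSpec{
              Type:                  corev1.ServiceTypeLoadBalancer,
              ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
              Selector:              map[string]string{"app": "service-test"},
              Ports: []corev1.ServicePort{{
                  Port:       80,
                  TargetPort: intstr.FromInt(8080),
              }},
          },
      }
  }

  func main() {
      fmt.Println(serviceTestLB().Spec.ExternalTrafficPolicy) // prints "Local"
  }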

Comment 25 Ben Parees 2020-08-11 18:17:56 UTC
I don't have any special knowledge of the test or the component. I think that change is something that would need to be discussed with the networking-edge team (or possibly Clayton Coleman as the test author) to ensure we're not losing coverage of an intended scenario.

Comment 29 Antonio Ojea 2020-09-07 08:56:05 UTC
cherry pick to 4.6 branch https://github.com/openshift/origin/pull/25482

Comment 31 Stephen Greene 2020-09-14 21:15:09 UTC
*** Bug 1861944 has been marked as a duplicate of this bug. ***

Comment 33 Antonio Ojea 2020-09-17 13:24:03 UTC
(In reply to Devan Goodwin from comment #32)
> I fear this is still happening:
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> origin-installer-e2e-azure-upgrade-4.5/1306525777306587136

the patch is not in that branch, right?
https://github.com/openshift/origin/pull/25482

Comment 34 David Eads 2020-09-17 17:41:02 UTC
Fixing the infrastructure test link.  That test is related to CI not starting an installer correctly.

Comment 37 Antonio Ojea 2020-10-09 07:29:30 UTC
(In reply to Benjamin Gilbert from comment #36)
> Still seeing this in 4.6:
> https://search.ci.openshift.org/?search=%5C%5Bsig-network-edge%5C%5D+Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=168h&context=1&type=bug%2Bjunit&name=4.6&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> For example,
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-upgrade-4.6/1314214514333323264


The particular example that you linked shows that the test failed the first time and passed the second time.

Most of the results in the search query that you linked show:

>  this is currently sufficient to pass the test/job but not considered completely correct

This test has to have a margin of error because there are multiple factors influencing the result, and not all of them are under our control:
- external factors: the cloud provider load balancer, the path from the client to the load balancer, the status of the VMs, ...
- internal factors: for example, the client that is polling the API: is it running alone or alongside other services (there is a known issue with one particular test cluster having I/O timeouts accessing APIs), does the VM where the client runs have enough resources, and is there congestion on that particular network?

Don't get me wrong, the test has to pass, but for that we have to know the real failure rate, i.e. excluding flakes: the times the test failed both runs in the same job and caused the job to fail.

This test simulates a highly available application, and for that we need to define an SLA and the number of 9s we want to support. If we want to make this perfect we need to change more things, not only our code but also how the test runs, so there is a trade-off here; we should weigh that and define what is "good enough" and what the priority is.
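
To make the margin of error concrete, here is a minimal sketch of the kind of tolerance check this implies (the helper and the 2% budget are assumptions for illustration, not the threshold the origin test actually applies):

  package main

  import (
      "fmt"
      "time"
  )

  // withinBudget reports whether the observed unreachable time stays inside an
  // allowed fraction of the monitored window.
  func withinBudget(unreachable, monitored time.Duration, budgetFraction float64) bool {
      allowed := time.Duration(float64(monitored) * budgetFraction)
      return unreachable <= allowed
  }

  func main() {
      // Numbers taken from one of the "currently sufficient to pass" messages
      // above: 23s of unreachability over a 30m10s window, checked against a
      // hypothetical 2% budget.
      fmt.Println(withinBudget(23*time.Second, 30*time.Minute+10*time.Second, 0.02)) // true
  }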

Comment 38 Ben Bennett 2020-10-09 13:07:00 UTC
Setting the target to 4.7 to investigate further.  We will consider backports of any fixes we identify.

Since the move to system ovs we should be able to tolerate the sdn restart.  We need to identify where the outage comes from... if it is due to endpoint changes while the sdn is restarted we may not be able to do anything about that (and we should make it so the test detects that case).  But if there is any other cause, we need to identify it and work out how to fix it.

Comment 39 Benjamin Gilbert 2020-10-09 15:47:43 UTC
> Most of the results in the search query that you linked shows
>> this is currently sufficient to pass the test/job but not considered completely correct

I know.  The build watcher rules require an open bug for every frequently failing test.  Feel free to set a low priority.

Comment 41 Antonio Ojea 2020-10-13 14:01:37 UTC
Analyzing this failure 

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-azure-upgrade-4.5/1306525777306587136

and correlating some logs with the failure reason:

Sep 17 10:57:00.502 E ns/e2e-k8s-service-lb-available-3308 svc/service-test Service stopped responding to GET requests over new connections

Sep 17 10:57:00.502: INFO: Service service-test is unreachable on new connections: Get http://40.89.252.10:80/echo?msg=Hello&timeout=10s: dial tcp 40.89.252.10:80: connect: no route to host


Sep 17 10:57:00.594 I ns/e2e-k8s-service-lb-available-3308 svc/service-test Service started responding to GET requests over new connections


Sep 17 10:57:04.982 E ns/e2e-k8s-service-lb-available-3308 svc/service-test Service stopped responding to GET requests over new connections

> Sep 17 10:57:04.982: INFO: Service service-test is unreachable on new connections: Get http://40.89.252.10:80/echo?msg=Hello&timeout=10s: dial tcp 40.89.252.10:80: connect: no route to host


some of the errors are caused by

> connect: no route to host

AFAIK it can mean either that the kernel on the client returned an error, or that an ICMP host-unreachable was received.

maybe we are overloading the client?
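
"connect: no route to host" surfaces in Go as EHOSTUNREACH wrapped inside the error returned by the HTTP client, so the poller could in principle classify it. A small illustrative sketch (not part of the actual test):

  package main

  import (
      "errors"
      "fmt"
      "net/http"
      "syscall"
      "time"
  )

  // classify separates a kernel-level "no route to host" (EHOSTUNREACH, i.e. a
  // local routing failure or a received ICMP host-unreachable) from other
  // connection errors, which helps tell client-side problems apart from
  // load-balancer or endpoint problems.
  func classify(err error) string {
      if errors.Is(err, syscall.EHOSTUNREACH) {
          return "no route to host (kernel/ICMP)"
      }
      if errors.Is(err, syscall.ECONNREFUSED) {
          return "connection refused (no listener)"
      }
      return "other error"
  }

  func main() {
      client := &http.Client{Timeout: 10 * time.Second}
      // 203.0.113.10 is a documentation address standing in for the LB IP.
      if _, err := client.Get("http://203.0.113.10/echo?msg=Hello"); err != nil {
          fmt.Println(classify(err))
      }
  }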

Comment 49 errata-xmlrpc 2020-12-01 10:48:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.5.21 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5194

