Bug 1828858 - Application behind service load balancer with PDB is not disrupted
Summary: Application behind service load balancer with PDB is not disrupted
Keywords:
Status: VERIFIED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.z
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antonio Ojea
QA Contact: huirwang
URL:
Whiteboard: SDN-CI-IMPACT
Duplicates: 1861944
Depends On:
Blocks: 1868490
 
Reported: 2020-04-28 13:38 UTC by Ben Parees
Modified: 2020-09-17 17:41 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1868489
Environment:
Application behind service load balancer with PDB is not disrupted
Last Closed:
Target Upstream Version:




Links
System ID:    Github openshift origin pull 25406
Priority:     None
Status:       closed
Summary:      Bug 1828858: use LoadBalancers with ExternalTrafficPolicy set to Local
Last Updated: 2020-09-21 14:38:03 UTC

Description Ben Parees 2020-04-28 13:38:18 UTC
test:
Application behind service load balancer with PDB is not disrupted 

is failing frequently in CI, see search results:
https://search.svc.ci.openshift.org/?maxAge=48h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted

Comment 1 Clayton Coleman 2020-04-28 13:49:59 UTC
Currently this is a 3-second OVS disruption due to the kernel rejecting new flows.  4.5 will move OVS to the node; Michael Cambria is working on it.  When that lands this should drop to zero, or the bug should be reopened and we find a new cause.

Comment 2 Ben Bennett 2020-04-28 15:21:24 UTC
Assigning this to the 4.5 release to get worked on.  We likely will not be able to backport the fix, but we can work out whether there is anything else we can do, or at least change the test so that it more accurately reflects what we tolerate.

Comment 4 Ben Parees 2020-04-28 18:58:03 UTC
Setting to urgent, we expect full availability during upgrades in 4.5.

Comment 5 Clayton Coleman 2020-05-18 14:43:53 UTC
Why was this moved to 4.6?  This impacts application workloads and we were supposed to have fixed it.

Comment 6 Ben Parees 2020-05-18 15:56:35 UTC
This represents a fundamental product flaw (taking workload outages during an upgrade).  It cannot be deferred w/o agreement from at least the group lead and preferably discussion with the architecture team.

Setting back to 4.5.

Comment 9 mcambria@redhat.com 2020-07-13 13:00:48 UTC
(In reply to Clayton Coleman from comment #1)
> Currently this is a 3-second OVS disruption due to the kernel rejecting new
> flows.  4.5 will move OVS to the node; Michael Cambria is working on it.
> When that lands this should drop to zero, or the bug should be reopened and
> we find a new cause.

OVS is now running on the host and OVNKubernetes has been changed to make use of it.  Can you retest?

Comment 11 Alexander Constantinescu 2020-07-20 15:28:03 UTC
@Tim

You got assigned to this during the bug-scrum.

Could you please have a look at this and try to figure out why we are still experiencing networking downtime now that OVS runs on the host?

Thanks!
Alex

Comment 12 Tim Rozet 2020-07-20 21:56:42 UTC
Looking at the logs, we have 2 pods backing the test service. The test constantly does HTTP GET /echo?msg=Hello on the service. I can see both pods stop receiving these requests at the same time (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci/1285191438505611264/artifacts/e2e-aws-upgrade/):

####
service-test-sls6t
ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw

2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:39 GET /hostName

2020/07/20 06:27:00 GET /hostName
2020/07/20 06:27:03 GET /hostName


####
service-test-xfrrq
2020/07/20 06:21:39 GET /echo?msg=Hello
2020/07/20 06:21:40 GET /hostName

2020/07/20 06:27:01 GET /hostName
2020/07/20 06:27:04 GET /hostName

So we can see the requests stop coming at 06:21:39. Looking at worker 9hwqw:
Jul 20 06:15:45.350738 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw systemd[1]: Shutting down.
Jul 20 06:18:35.463612 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:18:35.463138    1539 config.go:412] Receiving a new pod "service-test-sls6t_e2e-k8s-service-lb-available-9610(2788de88-c1db-46dd-8f25-73aa3d464cd1)"
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900557    1539 prober.go:181] HTTP-Probe Host: http://10.128.2.13, Port: 80, Path: /hostName
Jul 20 06:21:00.900639 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.900619    1539 prober.go:184] HTTP-Probe Headers: map[]
Jul 20 06:21:00.901739 ci-op-81hrit9t-28de9-wkxgs-worker-c-9hwqw hyperkube[1539]: I0720 06:21:00.901662    1539 http.go:128] Probe succeeded for http://10.128.2.13:80/hostName, Response: {200 OK 200 HTTP/1.1 1 1 map[Content-Length:[18] Content-Type:[text/plain; charset=utf-8] Date:[Mon, 20 Jul 2020 06:21:00 GMT]] 0xc000992f00 18 [] true false map[] 0xc0000bd500 <nil>}

The worker has been upgraded and has been up for over 5 minutes. The pod exists and keeps responding to health checks, but the service suddenly stops working. It looks to me like this has nothing to do with a temporary network outage; it looks more like something went wrong with openshift-sdn. This bug should probably be reassigned to someone who works on openshift-sdn.
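
For context on the log excerpts above: the availability check boils down to continuously issuing GETs against the service's load-balancer address and recording any window in which they fail. A minimal sketch of that behavior, not the actual origin monitor code (the address, path, and intervals here are illustrative):

package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// pollService hits the load-balancer address once per second over a brand-new
// TCP connection, mirroring the "GET requests over new connections" half of
// the e2e check; the real monitor in openshift/origin also tracks reused
// (keep-alive) connections. Any failed request would surface as a disruption
// window in the test output.
func pollService(host string) {
	client := &http.Client{
		Timeout: 3 * time.Second,
		Transport: &http.Transport{
			// Disable keep-alives so every request opens a fresh connection.
			DisableKeepAlives: true,
			DialContext:       (&net.Dialer{Timeout: 2 * time.Second}).DialContext,
		},
	}
	for range time.Tick(1 * time.Second) {
		resp, err := client.Get(fmt.Sprintf("http://%s/echo?msg=Hello", host))
		if err != nil {
			fmt.Printf("%s service not responding: %v\n", time.Now().Format(time.RFC3339), err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s GET /echo?msg=Hello -> %s\n", time.Now().Format(time.RFC3339), resp.Status)
	}
}

func main() {
	// Hypothetical load-balancer address; substitute the real service LB.
	pollService("203.0.113.10")
}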

Comment 13 W. Trevor King 2020-07-29 20:49:42 UTC
Possibly this has gotten worse recently?

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-authentication-operator-master-e2e-aws-upgrade - 9 runs, 11% failed, 400% of failures match
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 13 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 7 runs, 14% failed, 600% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 88% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-rollback-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 3 runs, 100% failed, 33% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 2 runs, 100% failed, 50% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 167% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5-stable-to-4.6-ci - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 133% of failures match

That^ will include jobs with both flaky and failing test-cases, which may overestimate the impact on throughput.  We can also search on the fatal error message to exclude currently-acceptable flakes like:

  Service was unreachable during disruption for at least 23s of 30m10s (1%), this is currently sufficient to pass the test/job but not considered completely correct:

with:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=Service+was+unreachable+during+disruption+for+at+least+.*+of.*%5C%29%3A&maxAge=12h&name=upgrade' | grep 'failures match' | sort
pull-ci-openshift-cluster-etcd-operator-master-e2e-gcp-upgrade - 5 runs, 100% failed, 40% of failures match
pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn-upgrade - 5 runs, 80% failed, 100% of failures match
...
pull-ci-openshift-machine-config-operator-master-e2e-gcp-upgrade - 20 runs, 85% failed, 94% of failures match
pull-ci-openshift-origin-master-e2e-gcp-upgrade - 15 runs, 80% failed, 75% of failures match
pull-ci-openshift-router-master-e2e-upgrade - 12 runs, 92% failed, 100% of failures match
rehearse-10491-pull-ci-openshift-cluster-samples-operator-master-okd-e2e-aws-upgrade - 3 runs, 67% failed, 50% of failures match
release-openshift-okd-installer-e2e-aws-upgrade - 6 runs, 17% failed, 500% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-nightly-to-4.3-nightly - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.2-to-4.3 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 68 runs, 47% failed, 75% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 14 runs, 21% failed, 67% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.5 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade-4.6 - 6 runs, 50% failed, 67% of failures match

Although it's possible that that^ is overestimating by looping in some other test-cases.

Comment 14 W. Trevor King 2020-07-29 20:51:51 UTC
GCP also seems over-represented in comment 13.  Maybe we recently broke something there?  Although gcp-upgrade-4.5 also shows up, which suggests it's either a long-running issue or something we broke recently and backported at least as far as 4.5, on GCP.  I dunno.  Would be great if we got it fixed everywhere :)

Comment 15 W. Trevor King 2020-07-29 21:15:40 UTC
Ooh, Kirsten has discovered [1].  I don't understand what that does, but in [2], David is linking [3] where "Application behind service load balancer with PDB is not disrupted" is the only failing test-case.  So maybe that will fix this bug :)

[1]: https://github.com/openshift/kubernetes/pull/300
[2]: https://github.com/openshift/kubernetes/pull/300#issue-457945177
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25314/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1287993294235635712

Comment 16 Andrew McDermott 2020-07-30 08:33:46 UTC
*** Bug 1861944 has been marked as a duplicate of this bug. ***

Comment 18 Maru Newby 2020-08-04 15:05:05 UTC
(In reply to Antonio Ojea from comment #17)
> No failure on the master-e2e-gcp-upgrade jobs since July 31
> 
> https://search.ci.openshift.org/
> ?search=Application+behind+service+load+balancer+with+PDB+is+not+disrupted&ma
> xAge=336h&context=1&type=bug%2Bjunit&name=master-e2e-gcp-
> upgrade&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> it seems comment #15 is correct and
> https://github.com/openshift/kubernetes/pull/300 fixed the problem :-)

You'll need to unskip the test to confirm whether the PR you linked to solves the problem. The test was disabled as part of the 1.19 bump in the absence of David's fix:

https://github.com/openshift/origin/blob/master/test/e2e/upgrade/upgrade.go#L42

Unskipping the test is tracked by https://bugzilla.redhat.com/show_bug.cgi?id=1861944 and is proposed in https://github.com/openshift/origin/pull/25354

Note that the issue fixed by PR 300 was introduced in the 1.19 bump, and this bz was filed well before 1.19 landed.

Comment 22 Antonio Ojea 2020-08-11 08:24:02 UTC
> is it possible that this is a GCP thing?

First, let's try to verify this hypothesis:

https://github.com/openshift/origin/pull/25354#issuecomment-671758837

AWS

Run #0: Failed	34m14s
Service was unreachable during disruption for at least 1s of 29m45s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Aug 11 07:39:29.916 E ns/e2e-k8s-service-lb-available-1640 svc/service-test Service stopped responding to GET requests over new connections
Aug 11 07:39:29.932 I ns/e2e-k8s-service-lb-available-1640 svc/service-test Service started responding to GET requests over new connections


GCP

Aug 11 08:00:19.100: Service was unreachable during disruption for at least 49s of 32m30s (3%):

Will run again to double-check:
https://github.com/openshift/origin/pull/25354#issuecomment-671804696

Comment 23 Antonio Ojea 2020-08-11 10:20:10 UTC
* AWS https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-aws-upgrade/1293100591748222976

Service was unreachable during disruption for at least 8s of 30m50s (0%), this is currently sufficient to pass the test/job but not considered completely correct:

Aug 11 09:32:51.033 - 3s    E ns/e2e-k8s-service-lb-available-4472 svc/service-test Service is not responding to GET requests over new connections


* GCP https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25354/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1293100591773388800

Aug 11 09:43:11.214: Service was unreachable during disruption for at least 1m8s of 31m10s (4%):

(snipped)
Aug 11 09:34:40.029 - 18s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:37:42.029 - 11s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests on reused connections
(snipped)
Aug 11 09:38:01.029 - 5s    E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections
(snipped)
Aug 11 09:40:40.029 - 22s   E ns/e2e-k8s-service-lb-available-1592 svc/service-test Service is not responding to GET requests over new connections


The second round shows the same outcome: AWS passes and GCP fails. How do we want to proceed?

Comment 24 Antonio Ojea 2020-08-11 18:01:09 UTC
Interesting: setting the service's ExternalTrafficPolicy to Local passes the tests on both cloud providers.

I filed a PR, and you can see the results there: https://github.com/openshift/origin/pull/25406

I think that for cloud providers it makes more sense to use Local to avoid the possible double hop LB -> Node -> Pod.

Is there any special reason for not using it?
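
For reference, a minimal sketch of what the proposed change amounts to: building the e2e LoadBalancer Service with ExternalTrafficPolicy set to Local via the k8s.io/api types (the name, selector, and ports here are illustrative, not the actual origin manifest):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// serviceTestLB returns a LoadBalancer Service with ExternalTrafficPolicy set
// to Local. With Local, kube-proxy only answers the cloud LB's health check on
// nodes that host a ready local endpoint, so the LB stops sending traffic to a
// node as soon as its pod drains, and traffic is never forwarded through a
// second node (no LB -> Node -> Pod double hop).
func serviceTestLB() *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{Name: "service-test"},
		Spec: corev1.ServiceSpec{
			Type:                  corev1.ServiceTypeLoadBalancer,
			ExternalTrafficPolicy: corev1.ServiceExternalTrafficPolicyTypeLocal,
			Selector:              map[string]string{"app": "service-test"},
			Ports: []corev1.ServicePort{{
				Port:       80,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}

func main() {
	svc := serviceTestLB()
	fmt.Println(svc.Name, svc.Spec.ExternalTrafficPolicy) // service-test Local
}

The trade-off with Local is that a node without a ready local endpoint is dropped from the LB pool, so the backing pods need to stay spread across nodes; in this test the PDB-driven rolling update should keep at least one endpoint available, which is presumably why the disruption window shrinks.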

Comment 25 Ben Parees 2020-08-11 18:17:56 UTC
I don't have any special knowledge of the test or the component. I think that change is something that would need to be discussed with the networking-edge team (or possibly Clayton Coleman as the test author) to ensure we're not losing coverage of an intended scenario.

Comment 29 Antonio Ojea 2020-09-07 08:56:05 UTC
Cherry-pick to the 4.6 branch: https://github.com/openshift/origin/pull/25482

Comment 31 Stephen Greene 2020-09-14 21:15:09 UTC
*** Bug 1861944 has been marked as a duplicate of this bug. ***

Comment 33 Antonio Ojea 2020-09-17 13:24:03 UTC
(In reply to Devan Goodwin from comment #32)
> I fear this is still happening:
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> origin-installer-e2e-azure-upgrade-4.5/1306525777306587136

the patch is not in that branch, right?
https://github.com/openshift/origin/pull/25482

Comment 34 David Eads 2020-09-17 17:41:02 UTC
Fixing the infrastructure test link.  That test failure is related to CI not starting the installer correctly.

