Bug 1989169

Summary: unidling tests are flaky under ovn-kubernetes
Product: OpenShift Container Platform Reporter: Dan Winship <danw>
Component: Networking Assignee: jamo luhrsen <jluhrsen>
Networking sub component: ovn-kubernetes QA Contact: Anurag saxena <anusaxen>
Status: CLOSED CURRENTRELEASE Docs Contact:
Severity: medium    
Priority: medium CC: astoycos, cholman, mbooth, sippy, trozet, wking
Version: 4.9   
Target Milestone: ---   
Target Release: 4.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2003228 Environment:
job=periodic-ci-openshift-release-master-ci-4.9-e2e-openstack-ovn=all
Last Closed: 2022-11-17 22:36:16 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2003228    
Bug Blocks:    

Description Dan Winship 2021-08-02 15:13:31 UTC
It turns out that even though there were a bunch of unidling tests in origin/test/extended/idling/, none of them ever got run in CI, and so ovn-kubernetes had never been tested against any of them.

There is a single unidling test outside of that directory, "The HAProxy router should be able to connect to a service that is idled because a GET on the route will unidle it", and ovn-kubernetes does reliably pass that one.

Also, the UDP unidling test passes; it's only the TCP ones that fail.

It's _possible_ that the problem is a badly-written test, but it works fine under openshift-sdn...

Newly-added and failing on ovn-kubernetes:

  - [sig-network-edge][Feature:Idling] Unidling should work with TCP (when fully idled)
  - [sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling)

Newly-added and passing on ovn-kubernetes:

  - [sig-network-edge][Feature:Idling] Idling with a single service and ReplicationController should idle the service and ReplicationController properly
  - [sig-network-edge][Feature:Idling] Unidling should work with UDP

Newly-added but [Serial], so they don't run in e2e-aws-ovn / e2e-gcp-ovn (and OMG, do we still have no e2e-*-ovn-serial job anywhere?):

  - [sig-network-edge][Feature:Idling] Unidling should handle many TCP connections by possibly dropping those over a certain bound [Serial]
  - [sig-network-edge][Feature:Idling] Unidling should handle many UDP senders (by continuing to drop all packets on the floor) [Serial]

Comment 1 Dan Winship 2021-08-02 15:16:56 UTC
oh, these tests are currently still disabled everywhere, but will be re-enabled by https://github.com/openshift/origin/pull/26155

Comment 2 jamo luhrsen 2021-08-25 20:50:54 UTC
@danw, wanted to know if you had a specific plan for this bz? It's assigned to me,
but not sure if I'm the one to un-flake these tests (assuming it's ovn-ish things that cause
the flakes)?

I see your PR to re-enable the tests is still in progress, but some tests did fail in the
most recent job:
  https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26155/pull-ci-openshift-origin-master-e2e-gcp/1430255548804108288

Can you let me know how I can help here?

Comment 3 Dan Winship 2021-08-25 21:17:36 UTC
yeah, I was assuming that PR was going to merge sooner...

the tests failed in the latest e2e-gcp run because openshift/kubernetes#899 hasn't merged, so "oc idle" doesn't work.

so given that, there's no easy way to test it under ovn-kube for now...

Comment 4 jamo luhrsen 2021-08-25 21:24:37 UTC
ok, so plan is to wait for openshift/kubernetes#899, then /retest origin#26155 and see where
we are?

Comment 5 Dan Winship 2021-08-26 13:15:49 UTC
yes

Comment 6 Dan Winship 2021-09-07 15:59:50 UTC
ok, idling is fixed, the tests are merged (in 4.9), and they're skipped on ovn-kube

Comment 7 jamo luhrsen 2021-09-08 17:29:31 UTC
you can see the tests re-introduced after https://github.com/openshift/origin/pull/26155 was merged and
they are passing. here is a testgrid link to show it. marking this as verified:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-ci-4.9-e2e-aws-ovn&show-stale-tests=

Comment 8 Dan Winship 2021-09-08 19:11:25 UTC
sorry, I should have been clearer. 6 new tests were added. 2 are enabled everywhere, 2 are [Serial] and so don't get run on any of our current ovn-kube jobs, and 2 run under openshift-sdn but are disabled on ovn-kubernetes.

You can confirm that neither of these show up in the test grid:

  - [sig-network-edge][Feature:Idling] Unidling should work with TCP (when fully idled)
  - [sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling)

because 26155 marks them as skipped on ovn-kube, because they tend to not work there.
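
To make the 2 / 2 / 2 split described in this comment concrete, here is a minimal sketch. The test names are copied from the bug description; everything else (the `SKIPPED_ON_OVN` set, the `classify` and `summarize` helpers, and the category labels) is hypothetical and only illustrates the grouping. It is not how origin actually marks tests as skipped.

```python
# Test names are copied from the bug description. The skip set and the
# category labels are illustrative assumptions, not origin's real mechanism.
ALL_IDLING_TESTS = [
    "[sig-network-edge][Feature:Idling] Unidling should work with TCP (when fully idled)",
    "[sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling)",
    "[sig-network-edge][Feature:Idling] Idling with a single service and "
    "ReplicationController should idle the service and ReplicationController properly",
    "[sig-network-edge][Feature:Idling] Unidling should work with UDP",
    "[sig-network-edge][Feature:Idling] Unidling should handle many TCP connections "
    "by possibly dropping those over a certain bound [Serial]",
    "[sig-network-edge][Feature:Idling] Unidling should handle many UDP senders "
    "(by continuing to drop all packets on the floor) [Serial]",
]

# Hypothetical stand-in for "disabled on ovn-kubernetes": the two
# "Unidling should work with TCP" tests named in this bug.
SKIPPED_ON_OVN = {t for t in ALL_IDLING_TESTS if "should work with TCP" in t}


def classify(test: str) -> str:
    """Return which of the three groups from comment 8 a test falls into."""
    if "[Serial]" in test:
        return "serial-only"       # not run by the parallel ovn-kube jobs
    if test in SKIPPED_ON_OVN:
        return "skipped-on-ovn"    # runs under openshift-sdn only
    return "runs-everywhere"


def summarize(tests):
    """Count how many tests land in each group."""
    counts = {}
    for t in tests:
        group = classify(t)
        counts[group] = counts.get(group, 0) + 1
    return counts


if __name__ == "__main__":
    for group, count in sorted(summarize(ALL_IDLING_TESTS).items()):
        print(f"{group}: {count}")
```

Under these assumptions, a testgrid for a parallel ovn-kube job would only ever show the "runs-everywhere" group, which is why the two TCP tests are absent there.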

Comment 9 jamo luhrsen 2021-09-09 18:00:14 UTC
(In reply to Dan Winship from comment #8)

> You can confirm that neither of these show up in the test grid:
> 
>   - [sig-network-edge][Feature:Idling] Unidling should work with TCP (when
> fully idled)
>   - [sig-network-edge][Feature:Idling] Unidling should work with TCP (while
> idling)

correct, these two tests are not being run at this point.

change back to VERIFIED?

Comment 10 Dan Winship 2021-09-09 18:15:54 UTC
No, *that's the bug*. This bz is tracking the fact that we had to disable two tests on ovn-kubernetes because there is a bug in ovn-kubernetes that causes the tests to fail. It can be closed after the tests are re-enabled (which can only happen after the ovn-kubernetes bug is fixed).

Comment 11 jamo luhrsen 2021-09-09 18:30:07 UTC
ok. clearly I was clueless as to what's going on. Is there a different bug to track the fix that we need
in ovn-kubernetes? Or is it this one? If this bz is just to re-enable the tests once we get the fix,
I will keep it. but if this bz is to track the actual fix then I think we need to re-assign to
someone else.

Comment 12 Dan Winship 2021-09-09 19:24:21 UTC
There's currently only the one bug for both parts. It might make sense to clone it to have a second bug for fixing ovn-kube.

Comment 13 jamo luhrsen 2021-09-10 17:15:22 UTC
ok, here is https://bugzilla.redhat.com/show_bug.cgi?id=2003228 to track the dev work to get the tests
passing. There is a PR https://github.com/openshift/origin/pull/26460 to re-enable the tests once we have a fix.

Comment 14 jamo luhrsen 2021-09-22 02:02:09 UTC
I don't think these two "Unidling should work with TCP" tests are flaky anymore.
At least they are not failing in my PR that adds them back:
  https://github.com/openshift/origin/pull/26460

If that's correct, we can close this bz I guess.

Comment 15 jamo luhrsen 2021-09-23 18:00:01 UTC
(In reply to jamo luhrsen from comment #14)
> I don't think these two "Unidling should work with TCP" tests are flaky
> anymore.
> At least they are not failing in my PR that adds them back:
>   https://github.com/openshift/origin/pull/26460
> 
> If that's correct, we can close this bz I guess.

Incorrect. I was only running openshift-sdn jobs, where those tests do pass. The ovn-k8s version of the job
does flake. Need https://bugzilla.redhat.com/show_bug.cgi?id=2003228 resolved before this one can be closed.

Comment 16 Matthew Booth 2021-10-25 14:01:24 UTC
*** Bug 2017036 has been marked as a duplicate of this bug. ***

Comment 17 jamo luhrsen 2022-02-03 22:05:29 UTC
still no progress on https://bugzilla.redhat.com/show_bug.cgi?id=2003228, so moving the target release of this bug to 4.11

Comment 19 jamo luhrsen 2022-11-02 17:13:57 UTC
@mmahmoud, I was pinged about the status of this bug. It's blocked on https://bugzilla.redhat.com/show_bug.cgi?id=2003228 which is
assigned to you. They wanted a fresh comment here in this bz to say that, since this came up as a stale bug. just fyi.

Comment 20 Tim Rozet 2022-11-17 22:36:02 UTC
The unidling behavior has been fixed in ovnk via the retry mechanisms introduced in 4.11. The tests were re-enabled in https://bugzilla.redhat.com/show_bug.cgi?id=2003228 / https://github.com/openshift/origin/pull/27538. Closing this bug.