Bug 2004076

Summary: fix flaky test "Unidling should work with TCP (while idling)" on openshift-sdn
Product: OpenShift Container Platform Reporter: Dan Winship <danw>
Component: Networking Assignee: Mohamed Mahmoud <mmahmoud>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED WONTFIX Docs Contact:
Severity: medium    
Priority: medium CC: bbennett, rravaiol, trozet, zzhao
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 2004074 Environment:
Last Closed: 2022-11-17 22:40:25 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2004074    

Description Dan Winship 2021-09-14 13:17:40 UTC
+++ This bug was initially created as a clone of Bug #2004074 +++

The test

"[sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling) [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel]"

is currently very flaky. This is a test that was only recently un-disabled after having been disabled for all of 4.x, so this does not indicate a recent regression.

(Note that this test is also very flaky under ovn-kubernetes. In that case, however, the "(when fully idled)" version of the test is also flaky, whereas with openshift-sdn only the "(while idling)" version flakes.)

Comment 1 Dan Winship 2021-09-14 13:18:21 UTC
(This bug tracks fixing the underlying idling issue in openshift-sdn. Bug 2004074 tracks un-skipping the test once this bug is fixed.)

Comment 3 Scott Dodson 2022-05-17 15:58:12 UTC
Adjusting this to Version 4.9, as this test flakes or fails at a high rate there as well. I will pursue marking it as a broken test there.

Comment 4 Dan Winship 2022-05-17 16:46:45 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=2085327#c3, although that comment was written before I realized we'd already disabled this test in 4.10+.

The problem seems to be that idling the service takes longer than expected. The "(when fully idled)" test works because it waits for the service to idle before trying to unidle it, but the "(while idling)" test sometimes fails because it expects the service to have been successfully unidled before it has actually been idled in the first place. Right now it does:

  - idle the service
  - try to connect to the service every half a second for 10 seconds
  - fail if any of the connection attempts fail or the service still has the idle annotations

Instead it needs to do:

  - idle the service
  - try to connect to the service every half a second until N seconds after the idle annotation is removed from the service, up to a maximum of M seconds
  - fail if any of the connection attempts fail or the service was still not idled after M seconds

for some values of N and M, perhaps 5 and 60.
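
The proposed loop could be sketched roughly as follows. This is only an illustration in Python, not the actual test code: the function name, the `try_connect` and `has_idle_annotation` callbacks, and the injectable `clock`/`sleep` parameters are all hypothetical. It also assumes one particular reading of the proposal, namely that M is the deadline for the idle annotation to disappear and that the N-second grace period runs after that.

```python
import time
from typing import Callable, Tuple


def check_unidling_while_idling(
    try_connect: Callable[[], bool],          # hypothetical: True if one connection attempt succeeds
    has_idle_annotation: Callable[[], bool],  # hypothetical: True while the service still has idle annotations
    grace_n: float = 5.0,                     # "N": keep probing this long after the annotation is removed
    max_m: float = 60.0,                      # "M": give up if the annotation is still there after this long
    interval: float = 0.5,                    # probe every half a second, as in the current test
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[bool, str]:
    """Return (passed, reason) for the proposed "(while idling)" check."""
    start = clock()
    annotation_gone_at = None  # time at which the annotation was first seen to be absent

    while True:
        now = clock()

        # Every single connection attempt must succeed.
        if not try_connect():
            return False, "connection attempt failed"

        # Record the first moment the idle annotation is no longer present.
        if annotation_gone_at is None and not has_idle_annotation():
            annotation_gone_at = now

        if annotation_gone_at is not None:
            # Annotation removed: stop once N more seconds of probing have passed.
            if now - annotation_gone_at >= grace_n:
                return True, "all connections succeeded"
        elif now - start >= max_m:
            # Annotation never went away within M seconds.
            return False, "idle annotation still present after M seconds"

        sleep(interval)
```

Passing `clock` and `sleep` in as parameters is just a convenience so the timing logic can be exercised with a fake clock instead of waiting up to a minute in real time.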