[sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling) [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel] is failing frequently in CI, see: https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=%5Bsig-network-edge%5D%5BFeature%3AIdling%5D%20Unidling%20should%20work%20with%20TCP%20(while%20idling)%20%5BSkipped%3ANetwork%2FOVNKubernetes%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

This seems to have begun failing commonly in 4.9 CI and nightly aws-upgrade jobs and is the common cause of recent CI and nightly acceptance failures. Examples:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524605621343621120
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524454745392025600

Reviewing when the failures started occurring, I don't see any code changes in OVN or the router, but I do observe that CRI-O was bumped at the time the failures began.
Setting severity urgent given that this is blocking payload acceptance.
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840

Test logs from the first run show:

```
May 12 20:30:38.801: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 create -f /tmp/fixture-testdata-dir827686644/test/extended/testdata/idling-echo-server.yaml -o name'
...
STEP: Idling the service
May 12 20:30:56.781: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 idle --resource-names-file /tmp/idling-services-837630022'
STEP: Connecting to the service IP and repeatedly connecting, making sure we seamlessly idle and come back up
May 12 20:30:57.319: INFO: Creating new exec pod
May 12 20:31:03.462: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-lhpkj72z-24e09.aws-2.ci.openshift.org:6443 --kubeconfig=/tmp/kubeconfig-1865973082 --namespace=e2e-test-cli-idling-58bcz exec execpodlx45w -- /bin/sh -x -c curl --max-time 120 -s -g http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'
May 12 20:31:04.059: INFO: stderr: "+ curl --max-time 120 -s -g 'http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'\n"
May 12 20:31:04.059: INFO: stdout: "It is time to TCP."
...
May 12 20:31:14.564: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~LrnO7xN0uO2TVIWRmJUdQcULhnMpG_2wqVYPj4znAwk}, err: <nil>
[AfterEach] [sig-network-edge][Feature:Idling]
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-cli-idling-58bcz" for this suite.
...
fail [github.com/openshift/origin/test/extended/idling/idling.go:298]: Expected
    ...
not to have key
    <string>: idling.alpha.openshift.io/idled-at
```

I.e., it created the service at 20:30:38, idled it at 20:30:56, tried to unidle it at 20:31:03, and successfully connected on the first try.
Shortly before 20:31:14 it noticed that the idled-at annotation was still there and declared the test a failure, and shortly after that it started cleaning things up, including in particular the namespace.

The e2e test runs `oc idle` synchronously and fails if `oc` returns an error, so we know the service/endpoints were successfully annotated by 20:30:57.

The openshift-controller-manager logs show:

```
I0512 20:30:39.027049 1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"233818", FieldPath:""}): type: 'Normal' reason: 'DeploymentCreated' Created new replication controller "idling-echo-1" for version 1
...
I0512 20:31:21.026393 1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"237592", FieldPath:""}): type: 'Normal' reason: 'ReplicationControllerScaled' Scaled replication controller "idling-echo-1" from 2 to 0
E0512 20:31:21.028046 1 event.go:264] Server rejected event ...: 'events "idling-echo.16ee75a8c5da8318" is forbidden: unable to create new content in namespace e2e-test-cli-idling-58bcz because it is being terminated' (will not retry!)
```

Meaning, it observed the service creation when it happened, but the next time it tried to do anything with the service was 25 seconds after the service got marked idle (and 6 seconds after the test case gave up and started deleting things).

The sdn logs show:

```
I0512 20:30:39.055739 1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 2 ports
...
I0512 20:31:23.814243 1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 0 ports
```

So the sdn doesn't see the endpoints being deleted until the e2e test cleans up the test namespace, which matches what we saw in the openshift-controller-manager logs: OCM never idled the service.

---

So it appears that the problem is that openshift-controller-manager never actually idles the service in time. There is another test that waits for the service to become fully idled before it starts trying to connect ("Unidling should work with TCP (when fully idled)"). The test that is failing here ("Unidling should work with TCP (while idling)") tests that if connections come in during the process of idling the service, it still comes back up cleanly.

Given that this test is failing and the other one is not, my guess is that something has made openshift-controller-manager slower to respond, so the time it takes to act on an idling request is now longer than 15 seconds, and this test doesn't handle that possibility. (The other test calls exutil.WaitForNoPodsRunning(), which waits up to 3 minutes.)

We can make the test run longer, which will probably fix this. But this raises the question of why openshift-controller-manager has become slow, and whether this hints at some other bug there: 25 seconds to respond to an annotation change seems excessive in an e2e-sized cluster.
This test was skipped in 4.10 and higher, as it was flaking and failing and no fix was being worked on, afaik. We missed skipping it in 4.9, so that was just done with this PR [0] from @sdodson. I'm not the right person to actually fix the bug that causes this to flake or fail, though.

[0] https://github.com/openshift/origin/pull/27150
Moving this bz to OCP controller manager for follow-up. The flaky test is now skipped in 4.9, so this may no longer be an urgent issue.