Bug 2085327 - [sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling) [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel]
Summary: [sig-network-edge][Feature:Idling] Unidling should work with TCP (while idlin...
Keywords:
Status: CLOSED DUPLICATE of bug 2003228
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.9
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: medium
Target Milestone: ---
Assignee: obraunsh
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-13 02:04 UTC by Scott Dodson
Modified: 2022-11-10 12:25 UTC (History)
9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-11-10 12:25:01 UTC
Target Upstream Version:
Embargoed:
jluhrsen: needinfo-



Description Scott Dodson 2022-05-13 02:04:56 UTC
[sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling) [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel]

is failing frequently in CI, see:
https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=%5Bsig-network-edge%5D%5BFeature%3AIdling%5D%20Unidling%20should%20work%20with%20TCP%20(while%20idling)%20%5BSkipped%3ANetwork%2FOVNKubernetes%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

This test seems to have begun failing frequently in 4.9 CI and nightly aws-upgrade jobs, and it is the most common cause of recent CI and nightly payload-acceptance failures.

Examples:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524605621343621120

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524454745392025600

Reviewing the period when the failures started occurring, I don't see any code changes in OVN or the router, but I do observe that CRI-O was bumped at the time the failures began.

Comment 1 Scott Dodson 2022-05-13 12:47:17 UTC
Setting severity urgent given that this is blocking payload acceptance.

Comment 3 Dan Winship 2022-05-15 19:44:55 UTC
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-
> openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840

test logs from the first run show:

  May 12 20:30:38.801: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 create -f /tmp/fixture-testdata-dir827686644/test/extended/testdata/idling-echo-server.yaml -o name'

  ...

  STEP: Idling the service
  May 12 20:30:56.781: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 idle --resource-names-file /tmp/idling-services-837630022'
  STEP: Connecting to the service IP and repeatedly connecting, making sure we seamlessly idle and come back up
  May 12 20:30:57.319: INFO: Creating new exec pod
  May 12 20:31:03.462: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-lhpkj72z-24e09.aws-2.ci.openshift.org:6443 --kubeconfig=/tmp/kubeconfig-1865973082 --namespace=e2e-test-cli-idling-58bcz exec execpodlx45w -- /bin/sh -x -c curl --max-time 120 -s -g http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'
  May 12 20:31:04.059: INFO: stderr: "+ curl --max-time 120 -s -g 'http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'\n"
  May 12 20:31:04.059: INFO: stdout: "It is time to TCP."

  ...

  May 12 20:31:14.564: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens  sha256~LrnO7xN0uO2TVIWRmJUdQcULhnMpG_2wqVYPj4znAwk}, err: <nil>
  [AfterEach] [sig-network-edge][Feature:Idling]
    github.com/openshift/origin/test/extended/util/client.go:141
  STEP: Destroying namespace "e2e-test-cli-idling-58bcz" for this suite.

  ...

 fail [github.com/openshift/origin/test/extended/idling/idling.go:298]: Expected
 ...
 not to have key
     <string>: idling.alpha.openshift.io/idled-at

i.e., it created the service at 20:30:38, idled it at 20:30:56, tried to unidle it at 20:31:03, and successfully connected on the first try. Shortly before 20:31:14 it noticed that the idled-at annotation was still there and declared the test a failure, and shortly after that it started cleaning things up, including in particular the namespace.

The e2e test runs `oc idle` synchronously and fails if `oc` returns an error, so we know the service/endpoints were successfully annotated by 20:30:57.



The openshift-controller-manager logs show:

  I0512 20:30:39.027049       1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"233818", FieldPath:""}): type: 'Normal' reason: 'DeploymentCreated' Created new replication controller "idling-echo-1" for version 1

  ...

  I0512 20:31:21.026393       1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"237592", FieldPath:""}): type: 'Normal' reason: 'ReplicationControllerScaled' Scaled replication controller "idling-echo-1" from 2 to 0
  E0512 20:31:21.028046       1 event.go:264] Server rejected event ...: 'events "idling-echo.16ee75a8c5da8318" is forbidden: unable to create new content in namespace e2e-test-cli-idling-58bcz because it is being terminated' (will not retry!)

Meaning, it observed the service creation when it happened, but the next time it tried to do anything with the service was 25 seconds after the service got marked idle (and 6 seconds after the test case gave up and started deleting things).


The sdn logs show:

  I0512 20:30:39.055739    1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 2 ports

  ...

  I0512 20:31:23.814243    1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 0 ports

So, it doesn't see the endpoints being deleted until the e2e test cleans up the test namespace, which matches what we saw in the openshift-controller-manager logs: OCM did not idle the service until after the test had already given up.

---

So it appears that the problem is that openshift-controller-manager is not actually idling the service in time. There is another test that waits for the service to become fully idled before it starts trying to connect ("Unidling should work with TCP (when fully idled)"). The test that is failing here ("Unidling should work with TCP (while idling)") verifies that if connections come in while the service is still being idled, it still comes back up cleanly. Given that this test is failing and the other one is not, I suspect something has made openshift-controller-manager slower to respond, so that it now takes longer than 15 seconds to act on an idling request, and this test doesn't account for that possibility. (The other test calls exutil.WaitForNoPodsRunning(), which waits up to 3 minutes.)

We can make the test run longer, which will probably fix this. But that raises the question of why openshift-controller-manager has become slow, and whether this hints at some other bug there. 25 seconds to respond to an annotation change seems excessive in an e2e-sized cluster.

Comment 4 jamo luhrsen 2022-05-18 15:17:16 UTC
This test was skipped in 4.10 and higher because it was flaking and failing and, as far as I know, no fix was being worked on. We missed skipping it in 4.9, so that was just done with this
PR [0] from @sdodson

I'm not the right person to actually fix the bug that causes this test to flake or
fail, though.

[0] https://github.com/openshift/origin/pull/27150

Comment 5 ffernand 2022-05-23 13:38:01 UTC
Moving this BZ to the OCP Controller Manager component for follow-up.
The flaky test is now skipped in 4.9, so this may no longer be an urgent issue.

