[sig-network-edge][Feature:Idling] Unidling should work with TCP (while idling) [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel] is failing frequently in CI, see: https://sippy.ci.openshift.org/sippy-ng/tests/4.9/analysis?test=%5Bsig-network-edge%5D%5BFeature%3AIdling%5D%20Unidling%20should%20work%20with%20TCP%20(while%20idling)%20%5BSkipped%3ANetwork%2FOVNKubernetes%5D%20%5BSuite%3Aopenshift%2Fconformance%2Fparallel%5D

This seems to have begun failing commonly in 4.9 CI and nightly aws-upgrade jobs and is the common cause of recent CI and nightly acceptance failures. Examples:

- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524605621343621120
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524454745392025600

Reviewing when the failures started occurring, I don't see any code changes in OVN or the router, but I do observe that CRI-O was bumped at the time the failures began.
Setting severity urgent given that this is blocking payload acceptance.
> https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-aws-upgrade/1524813239098019840

Test logs from the first run show:

```
May 12 20:30:38.801: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 create -f /tmp/fixture-testdata-dir827686644/test/extended/testdata/idling-echo-server.yaml -o name'
...
STEP: Idling the service
May 12 20:30:56.781: INFO: Running 'oc --namespace=e2e-test-cli-idling-58bcz --kubeconfig=/tmp/configfile710036419 idle --resource-names-file /tmp/idling-services-837630022'
STEP: Connecting to the service IP and repeatedly connecting, making sure we seamlessly idle and come back up
May 12 20:30:57.319: INFO: Creating new exec pod
May 12 20:31:03.462: INFO: Running '/usr/bin/kubectl --server=https://api.ci-op-lhpkj72z-24e09.aws-2.ci.openshift.org:6443 --kubeconfig=/tmp/kubeconfig-1865973082 --namespace=e2e-test-cli-idling-58bcz exec execpodlx45w -- /bin/sh -x -c curl --max-time 120 -s -g http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'
May 12 20:31:04.059: INFO: stderr: "+ curl --max-time 120 -s -g 'http://172.30.119.216:8675/echo?msg=It+is+time+to+TCP.'\n"
May 12 20:31:04.059: INFO: stdout: "It is time to TCP."
...
May 12 20:31:14.564: INFO: Deleted {oauth.openshift.io/v1, Resource=oauthaccesstokens sha256~LrnO7xN0uO2TVIWRmJUdQcULhnMpG_2wqVYPj4znAwk}, err: <nil>
[AfterEach] [sig-network-edge][Feature:Idling]
  github.com/openshift/origin/test/extended/util/client.go:141
STEP: Destroying namespace "e2e-test-cli-idling-58bcz" for this suite.
...
fail [github.com/openshift/origin/test/extended/idling/idling.go:298]: Expected
    ...
not to have key
    <string>: idling.alpha.openshift.io/idled-at
```

I.e., it created the service at 20:30:38, idled it at 20:30:56, tried to unidle it at 20:31:03, and successfully connected on the first try.
Shortly before 20:31:14 it noticed that the idled-at annotation was still there and declared the test a failure, and shortly after that it started cleaning things up, including in particular the namespace.

The e2e test runs `oc idle` synchronously and fails if `oc` returns an error, so we know the service/endpoints were successfully annotated by 20:30:57.

The openshift-controller-manager logs show:

```
I0512 20:30:39.027049 1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"233818", FieldPath:""}): type: 'Normal' reason: 'DeploymentCreated' Created new replication controller "idling-echo-1" for version 1
...
I0512 20:31:21.026393 1 event.go:282] Event(v1.ObjectReference{Kind:"DeploymentConfig", Namespace:"e2e-test-cli-idling-58bcz", Name:"idling-echo", UID:"eaf735a8-f1ef-4168-b8c3-30ce3536bf90", APIVersion:"apps.openshift.io/v1", ResourceVersion:"237592", FieldPath:""}): type: 'Normal' reason: 'ReplicationControllerScaled' Scaled replication controller "idling-echo-1" from 2 to 0
E0512 20:31:21.028046 1 event.go:264] Server rejected event ...: 'events "idling-echo.16ee75a8c5da8318" is forbidden: unable to create new content in namespace e2e-test-cli-idling-58bcz because it is being terminated' (will not retry!)
```

Meaning, it observed the service creation when it happened, but the next time it tried to do anything with the service was 25 seconds after the service got marked idle (and 6 seconds after the test case gave up and started deleting things).

The sdn logs show:

```
I0512 20:30:39.055739 1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 2 ports
...
I0512 20:31:23.814243 1715 service.go:301] Service e2e-test-cli-idling-58bcz/idling-echo updated: 0 ports
```

So the sdn doesn't see the endpoints being deleted until the e2e test cleans up the test namespace, which matches what we saw in the openshift-controller-manager logs: OCM never idled the service.

---

So it appears that the problem is that openshift-controller-manager never actually idles the service in time. There is another test that waits for the service to become fully idled before it starts trying to connect ("Unidling should work with TCP (when fully idled)"). The test that is failing here ("Unidling should work with TCP (while idling)") tests that if connections come in during the process of idling the service, it still comes back up cleanly.

Given that this test is failing and the other one is not, my guess is that something has made openshift-controller-manager slower to respond, so the time it takes to act on an idling request is now longer than 15 seconds, and this test doesn't handle that possibility. (The other test calls exutil.WaitForNoPodsRunning(), which waits up to 3 minutes.)

We can make the test run longer, which will probably fix this. But this raises the question of why openshift-controller-manager has become slow, and whether this hints at some other bug there: 25 seconds to respond to an annotation change seems excessive in an e2e-sized cluster.
This test was skipped in 4.10 and higher, as it was flaking and failing and no fix was being worked on, afaik. We missed skipping it in 4.9, so that was just done with this PR [0] from @sdodson. I'm not the right person to actually fix the bug that causes this to flake or fail, though.

[0] https://github.com/openshift/origin/pull/27150
Moving this bz to OCP controller manager for follow-up. The flaky test is now skipped in 4.9, so this may no longer be an urgent issue.