test: [sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel] is failing frequently in CI. These are the occurrences from the last couple of days:
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2452
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2407
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2372
So certainly a fair number of failures, but it is intermittent. With search CI, I see perhaps 170 failure instances over the last 2 weeks in the main line e2e-gcp.

The test is validating the k8s foreground garbage collection function wrt templates and the objects they create. This space was last visited a month ago, when we re-enabled the templates tests. At that time, with the help of David Eads, an upstream regression was identified which led to consistent failure in all the template deletion tests, including this one. We tracked it via https://bugzilla.redhat.com/show_bug.cgi?id=1731222 and David had me copy the upstream fix inline into the openshift API server. But again, that was a consistent failure, so it was easy to reproduce locally and diagnose. This intermittent failure would appear to be something else.

In triaging the must-gather pod logs and event logs for one of the failures, https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2452, I'm not seeing anything interesting in relation to the 2 projects involved, e2e-test-templates-x82cq and e2e-test-templates2-2ljkz. Nor am I seeing anything related to the specific secrets, "secret1" or "secret2".

I suspect some new debug will need to be added to the e2e's to dump information not currently dumped, so we can get that data when the next failure occurs, but I don't know what that would be. Sending over to master team for guidance on e2e debug to add / additional triage of the data that exists.
@Adam yeah, "WHAT" debug to add is the problem. Per my #Comment 2, what is needed for the k8s foreground garbage collection is the unknown to me. Since my ask here in bugzilla failed, I'll try in slack when I circle back to this.
No luck reproducing this AM. Adding a retry in the test, in case this is a timing issue (i.e. the foreground gc took a bit longer), finally occurred to me. Will have a PR up shortly, and then we can cherry-pick as far back as needed.
Ignore #comment 5 ... I forgot what foreground meant for a moment. I'm simply adding debug around UIDs, foreground finalizers, deletion timestamps, and owner ref block flags. I will then wait for https://search.apps.build01.ci.devcluster.openshift.com/?search=templateinstance+cross-namespace+test+should+create+and+delete&maxAge=24h&context=2&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job to turn up a hit in the master branch.
No, don't ignore #comment 5 :-) ... templates has a finalizer to delete dependent objects, and it doesn't depend on ownerref deletions per se, so there could be a delay. Still adding debug.
So far https://search.apps.build01.ci.devcluster.openshift.com/?search=templateinstance+cross-namespace+test+should+create+and+delete&maxAge=24h&context=2&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job shows no failures in the master branch since the PR merged. Of course, I still see instances in the older branches. Good sign, but let's get a larger sample. As for whether the test case change is in effect, the line numbers in the error message are what matter. Prior to the fix, the ginkgo errors would cite line 136 or 139. Any errors now would report line 163 or 181. If all seems well after a sufficient time, we'll mark this verified and I'll start the backport process. Based on the occurrence in master branch jobs over the last 14 days, I'd say if we go to the end of July 3 with no instances in master branch jobs, let's mark as verified.
Clarification: no new instances between July 1 and July 3, where those line numbers are the litmus test for whether a failure occurred with the new tests.
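If it helps when scanning search CI hits, the line-number litmus test can be sketched as below. `classify` and `lineRe` are hypothetical helpers, and the only inputs assumed are the line numbers cited above (136/139 before the change, 163/181 after).

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lineRe captures the line number cited for the cross-namespace test file.
var lineRe = regexp.MustCompile(`templateinstance_cross_namespace\.go:(\d+)`)

// classify reports whether a failure message cites a line from the test
// before the debug/retry change ("old"), after it ("new"), or neither.
func classify(msg string) string {
	m := lineRe.FindStringSubmatch(msg)
	if m == nil {
		return "unknown"
	}
	line, err := strconv.Atoi(m[1])
	if err != nil {
		return "unknown"
	}
	switch line {
	case 136, 139:
		return "old"
	case 163, 181:
		return "new"
	}
	return "unknown"
}

func main() {
	msg := "fail [github.com/openshift/origin/test/extended/templates/templateinstance_cross_namespace.go:163]: Unexpected error"
	fmt.Println(classify(msg))
}
```

Anything classified as "old" comes from a branch or PR that does not have the test change yet, so it says nothing about whether the change helped.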
e2e test only change ... doc not needed
OK, search CI provided https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25069/pull-ci-openshift-origin-master-e2e-gcp/1278299478939406336

In that run, the template instance and the 2 secrets it created sat for 30 seconds (Jul 1 13:15:36.147 to 13:16:06.271) with their deletion timestamps set and the foregroundDeletion finalizer set. All the yaml dumps are in the associated logs for the [sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel] failure in that run. For convenience, the last of the yaml dumps is at https://gist.github.com/gabemontero/e079bfa1e113adf24f5cca272a0379e9

It would seem to Adam and me that the k8s foreground deletion should have occurred within 30 seconds ... sending over to apiserver for triage on the k8s foreground deletion piece.

This definitely is an intermittent issue though. I only see 2 instances with the fail [github.com/openshift/origin/test/extended/templates/templateinstance_cross_namespace.go:163]: Unexpected error over the last 14 days using search CI.
> Assigning to David as he was looking into this or similar template issues already.

I don't recognize what this is. But the last update was a month ago. Reopen if it recurs.
Confirmed with search.ci: any failure in the aforementioned templateinstance e2e over the last 14 days in the master branch was *NOT* k8s foreground deletion based, but some unrelated issue, usually of the "cannot even communicate" variety. So +1 on closing/worksforme ... also, fwiw, given the intermittent nature of this even in e2e's in the past, I'm also good with not worrying about 4.5.z unless a customer case comes in.