Bug 1844483 - [sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel]
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: David Eads
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-06-05 13:59 UTC by Petr Horáček
Modified: 2020-08-05 14:08 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should support a 'default-deny' policy [Feature:NetworkPolicy]
Last Closed: 2020-08-05 13:52:22 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 25191 0 None closed Bug 1844483: poll on deleted obj check; give controller finalizer deletes time to complete 2020-08-03 09:53:49 UTC

Description Petr Horáček 2020-06-05 13:59:58 UTC
test:
[sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel]

is failing frequently in CI.

These are the occurrences from last couple of days:
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2452
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2407
- https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2372

Comment 2 Gabe Montero 2020-06-05 18:43:53 UTC
So certainly a fair number of failures, but it is intermittent.  With search CI, I see perhaps 170 failure instances over the last 2 weeks in the main line e2e-gcp.

The test is validating the k8s foreground garbage collection function wrt templates and the objects they create.

This space was last visited a month ago, when we re-enabled the templates tests.

At that time, with the help of David Eads, an upstream regression was identified which led to consistent failure in all the template deletion tests, including this one.

We tracked it via https://bugzilla.redhat.com/show_bug.cgi?id=1731222
David had me copy the upstream fix in line into the openshift API server.

But again, that was a consistent failure, so easy to reproduce locally and diagnose.

So this intermittent failure would appear to be something else.

In triaging the must-gather pod logs and event logs for one of the failures, https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.5/2452, I'm not seeing anything interesting in relation to the 2 projects involved, 
e2e-test-templates-x82cq and e2e-test-templates2-2ljkz.

Nor am I seeing anything related to the specific secrets, "secret1" or "secret2".

I suspect some new debug will need to be added to the e2e's to dump information not currently dumped with the e2e, so we can get that data when the next failure occurs, but I don't know what that would be.

Sending over to master team for guidance on e2e debug to add / additional triage of the data that exists.

Comment 4 Gabe Montero 2020-06-18 20:58:25 UTC
@Adam yeah "WHAT" debug to add is the problem?

Per my #Comment 2 .... what debug is needed around the k8s foreground garbage collection is the unknown to me.

Since my ask here in bugzilla failed I'll try in slack when I circle back to this.

Comment 5 Gabe Montero 2020-06-23 14:39:32 UTC
No luck reproducing this AM

adding retry in the test in case this is a timing issue (i.e. the foreground gc took a bit longer) finally occurred to me

will have a PR up shortly and then we can cherrypick as far back as needed

Comment 6 Gabe Montero 2020-06-23 17:39:39 UTC
ignore #comment 5 ... forgot what foreground meant for a moment 

I'm simply adding debug around UIDs, foreground finalizers, deletion timestamps, owner ref block flags.

will then wait for 

https://search.apps.build01.ci.devcluster.openshift.com/?search=templateinstance+cross-namespace+test+should+create+and+delete&maxAge=24h&context=2&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

to turn up a hit in the master branch
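
As a rough illustration of the kind of debug described above, here is a minimal Go sketch that formats the UID, deletion timestamp, foreground finalizer, and owner reference block flag of an object into one log line. The objectMeta and ownerRef types are simplified, hypothetical stand-ins for the Kubernetes metadata fields, and the function names are made up for this example; the debug actually added to the e2e is not shown in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// ownerRef and objectMeta are simplified, hypothetical stand-ins for the
// Kubernetes object metadata fields mentioned in the comment.
type ownerRef struct {
	Name               string
	UID                string
	BlockOwnerDeletion bool
}

type objectMeta struct {
	Name              string
	UID               string
	DeletionTimestamp string // empty when the object is not being deleted
	Finalizers        []string
	OwnerReferences   []ownerRef
}

// hasForegroundFinalizer reports whether the foregroundDeletion finalizer
// is present on the object.
func hasForegroundFinalizer(m objectMeta) bool {
	for _, f := range m.Finalizers {
		if f == "foregroundDeletion" {
			return true
		}
	}
	return false
}

// debugLine renders the fields of interest as a single log line.
func debugLine(m objectMeta) string {
	var owners []string
	for _, o := range m.OwnerReferences {
		owners = append(owners, fmt.Sprintf("%s/%s(blockOwnerDeletion=%t)",
			o.Name, o.UID, o.BlockOwnerDeletion))
	}
	return fmt.Sprintf("name=%s uid=%s deletionTimestamp=%q foregroundFinalizer=%t owners=[%s]",
		m.Name, m.UID, m.DeletionTimestamp, hasForegroundFinalizer(m),
		strings.Join(owners, ","))
}

func main() {
	// All values below are invented for the example.
	secret := objectMeta{
		Name:              "secret1",
		UID:               "uid-1",
		DeletionTimestamp: "2020-07-01T13:15:36Z",
		Finalizers:        []string{"foregroundDeletion"},
		OwnerReferences:   []ownerRef{{Name: "templateinstance", UID: "uid-0", BlockOwnerDeletion: true}},
	}
	fmt.Println(debugLine(secret))
}
```

Logging these fields on every check makes it possible to tell, after the fact, whether an object was stuck with its deletion timestamp set and the finalizer still present.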

Comment 7 Gabe Montero 2020-06-23 19:05:02 UTC
no, don't ignore #comment 5 :-) .... templates has a finalizer to delete dependent objects and it doesn't depend on ownerref deletions per se .... so there could be a delay

still adding debug
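
For illustration only, the retry idea from comment 5 could look roughly like the Go sketch below: keep checking the object until the lookup reports "not found", so that finalizer-driven deletes get time to complete. The errNotFound value, the waitForDeletion helper, and the fakeGetter test double are all hypothetical names invented here, and the attempt-count loop is an assumption; the code in the actual PR may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotFound is a hypothetical stand-in for an API "not found" error.
var errNotFound = errors.New("not found")

// waitForDeletion calls get up to attempts times, sleeping interval between
// calls. It returns nil once get reports errNotFound, returns any other
// error immediately, and returns a timeout error if the object still exists
// after the last attempt.
func waitForDeletion(get func() error, interval time.Duration, attempts int) error {
	for i := 0; i < attempts; i++ {
		err := get()
		if errors.Is(err, errNotFound) {
			return nil // object is gone
		}
		if err != nil {
			return err // unexpected error, do not keep retrying
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for object deletion")
}

// fakeGetter returns a get function that finds the object for the first n
// calls and reports errNotFound afterwards.
func fakeGetter(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return nil
		}
		return errNotFound
	}
}

func main() {
	err := waitForDeletion(fakeGetter(3), 10*time.Millisecond, 10)
	fmt.Println("deleted:", err == nil)
}
```

Compared with a single immediate check, a bounded poll tolerates a slow finalizer while still failing the test if the object never goes away.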

Comment 10 Gabe Montero 2020-07-01 12:30:28 UTC
So far https://search.apps.build01.ci.devcluster.openshift.com/?search=templateinstance+cross-namespace+test+should+create+and+delete&maxAge=24h&context=2&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job
shows no failures in the master branch since the PR merged.  

Of course still see instances in the older branches.  

Good sign, but let's get a longer sample size.

On whether the test case change is in effect, the line numbers in the error message are of relevance.

Prior to the fix, the ginkgo errors would cite line 136 or 139.

Any errors now would report line 163 or 181.
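
A small Go sketch of that line-number check, assuming failure messages contain the test file name followed by a colon and the line number, as in the ginkgo output. The classifyFailure function and its "pre-fix"/"post-fix" labels are hypothetical and only meant to show how search CI hits could be sorted.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// lineRe extracts the line number that follows the test file name.
var lineRe = regexp.MustCompile(`templateinstance_cross_namespace\.go:(\d+)`)

// classifyFailure reports whether a failure message comes from the test
// code before or after the change, based on the cited line number.
func classifyFailure(msg string) string {
	m := lineRe.FindStringSubmatch(msg)
	if m == nil {
		return "unknown"
	}
	line, err := strconv.Atoi(m[1])
	if err != nil {
		return "unknown"
	}
	switch line {
	case 136, 139:
		return "pre-fix"
	case 163, 181:
		return "post-fix"
	default:
		return "unknown"
	}
}

func main() {
	msg := "fail [github.com/openshift/origin/test/extended/templates/templateinstance_cross_namespace.go:163]: Unexpected error"
	fmt.Println(classifyFailure(msg))
}
```

Run as written, this prints "post-fix" for the sample message.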

If all seems well after a sufficient time, we'll mark this verified and I'll start the backport process.

Based on the occurrence in master branch jobs over the last 14 days, I'd say if we go to the end of July 3
with no instances in master branch jobs, let's mark as verified.

Comment 11 Gabe Montero 2020-07-01 12:31:45 UTC
Clarification: no new instances between July 1 and July 3, where those line numbers are the litmus test for whether a failure occurred with the new tests.

Comment 12 Gabe Montero 2020-07-01 13:10:10 UTC
e2e test only change ... doc not needed

Comment 13 Gabe Montero 2020-07-06 17:41:09 UTC
OK search CI provided https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/25069/pull-ci-openshift-origin-master-e2e-gcp/1278299478939406336

In that run, the template instance and the 2 secrets it created sat for 30 seconds (Jul 1 13:15:36.147 to 13:16:06.271) with their deletion timestamps set and the foregroundDeletion finalizer set.

All the yaml dumps are in the associated logs for the [sig-devex][Feature:Templates] templateinstance cross-namespace test should create and delete objects across namespaces [Suite:openshift/conformance/parallel] 
failure in that run.

For convenience, the last of the yaml dumps is at https://gist.github.com/gabemontero/e079bfa1e113adf24f5cca272a0379e9

It would seem to Adam and me that the k8s foreground deletion should have occurred within 30 seconds...sending over to apiserver 
for triage on the k8s foreground deletion piece.

This definitely is an intermittent issue though.  I only see 2 instances with the fail [github.com/openshift/origin/test/extended/templates/templateinstance_cross_namespace.go:163]: Unexpected error
over the last 14 days using search CI.

Comment 16 David Eads 2020-08-05 13:52:22 UTC
> Assigning to David as he was looking onto this or similar template issues already.

I don't recognize what this is.  But last update was a month ago.  Reopen if it recurs.

Comment 17 Gabe Montero 2020-08-05 14:08:26 UTC
Confirmed with search.ci that any failure in the aforementioned templateinstance e2e over the last 14 days in the master branch was *NOT* 
k8s foreground deletion based, but some unrelated issue, usually of the "cannot even communicate" variety.

So +1 on closing/worksforme ... also fwiw at this point given the intermittent nature of this even in e2e's in the past, I'm also good with not worrying about 4.5.z
unless a customer case comes in.

