Bug 1695807

Summary: Unit test flake post rebase: k8s.io/kubernetes/pkg/controller/podautoscaler TestEventNotCreated 3m0s
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Node Assignee: Joel Smith <joelsmith>
Status: CLOSED ERRATA QA Contact: Sunil Choudhary <schoudha>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.1.0 CC: anpicker, aos-bugs, erooth, fbranczy, gblomqui, jokerman, mloibl, mmccomas, pkrupa, sjenning, surbania
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:47:03 UTC Type: Bug

Description Clayton Coleman 2019-04-03 18:43:21 UTC
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/podautoscaler TestEventNotCreated 3m0s

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22447/pull-ci-openshift-origin-master-unit/4610#githubcomopenshiftoriginvendork8siokubernetespkgcontrollerpodautoscaler-testeventnotcreated

goroutine 1200 [chan receive, 2 minutes]:
github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/podautoscaler.(*HorizontalController).Run(0xc420138f00, 0xc420048d20)
	/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal.go:166 +0x29f
created by github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/podautoscaler.(*testCase).runTestWithController
	/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/controller/podautoscaler/horizontal_test.go:701 +0xbf

Set to high because the test is flaking fairly often.

Comment 2 Frederic Branczyk 2019-04-08 14:21:24 UTC
Joel from the Pod team is taking care of pod autoscaler things. Re-assigning.

Comment 3 Seth Jennings 2019-04-10 14:47:36 UTC
Removing this as a blocker, but we still need to stay on it.

15 occurrences in the last 48h.

https://search.svc.ci.openshift.org/?search=TestEventNotCreated&maxAge=48h&context=2&type=all

Comment 4 Clayton Coleman 2019-04-10 19:07:30 UTC
This is absolutely still a blocker.  50% of origin merges / PR jobs were failing on this.

I am disabling the test here: https://github.com/openshift/origin/pull/22527

Please ensure you follow up and fix.

Comment 6 Joel Smith 2019-04-10 22:42:42 UTC
I think I have figured out the cause of the deadlock in the test. I've got a WIP PR that should address the deadlock and fix another potential flake. I'm going to test it over and over to see if I can get it to hit the flake again with my change. If it looks good, I'll work on getting it merged upstream and in Origin.

https://github.com/openshift/origin/pull/22531

Comment 7 Joel Smith 2019-04-11 14:22:16 UTC
No flakes on my tests that ran overnight. Hopefully that's a good sign.

Comment 9 Joel Smith 2019-04-17 14:46:05 UTC
Hi Sunil, 
Here's a reproducer for the flake. From the origin source directory, run this:

GOMAXPROCS=1 go test ./vendor/k8s.io/kubernetes/pkg/controller/podautoscaler/... -timeout 60s -count 1

Before the fix, it should time out after 60 seconds with a deadlock backtrace.

After the fix, it should give an "ok" result.

Unfortunately, there is a second flake, which we're fixing in https://github.com/openshift/origin/pull/22591. That PR needs to merge as well, or the reproducer may still fail because of the second flake. Once both fixes are in, you should get an "ok" result.

Comment 12 errata-xmlrpc 2019-06-04 10:47:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758