Bug 1838730

Summary:	[azure-disk] azure e2e fail with failed scheduling errors
Product:	OpenShift Container Platform	Reporter:	Hemant Kumar <hekumar>
Component:	Storage	Assignee:	Christian Huffman <chuffman>
Storage sub component:	Kubernetes	QA Contact:	Wei Duan <wduan>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	medium
Priority:	unspecified	CC:	aos-bugs, ffranz, jsafrane, wduan
Version:	4.5	Flags:	wduan: needinfo-
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	No Doc Update
Doc Text:		Story Points:	---
Clone Of:
Clones:	1861382		Environment:
Last Closed:	2020-10-27 16:00:29 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1861382

Description Hemant Kumar 2020-05-21 16:43:38 UTC

Some tests in test run https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1242 are failing with the following error:

May 21 12:18:33.682: INFO: Warning: Making PVC: VolumeMode specified as invalid empty string, treating as nil
May 21 12:18:33.738: INFO: Waiting up to 5m0s for PersistentVolumeClaims [azure-disklgqmm] to have phase Bound
May 21 12:18:33.789: INFO: PersistentVolumeClaim azure-disklgqmm found but phase is Pending instead of Bound.
May 21 12:18:35.830: INFO: PersistentVolumeClaim azure-disklgqmm found but phase is Pending instead of Bound.
May 21 12:18:37.872: INFO: PersistentVolumeClaim azure-disklgqmm found but phase is Pending instead of Bound.
May 21 12:18:39.916: INFO: PersistentVolumeClaim azure-disklgqmm found and phase=Bound (6.177903713s)
STEP: starting azure-injector
STEP: Deleting pod azure-injector in namespace e2e-volume-4484
May 21 12:23:40.240: INFO: Waiting for pod azure-injector to disappear
May 21 12:23:40.279: INFO: Pod azure-injector no longer exists
May 21 12:23:40.279: FAIL: Failed to create injector pod: timed out waiting for the condition

Full Stack Trace
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/volume.InjectContent(0xc00145fb80, 0xc00166e080, 0xf, 0x59d0dd4, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/volume/fixtures.go:518 +0x973
github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites.(*volumesTestSuite).DefineTests.func3()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/storage/testsuites/volumes.go:183 +0x405
github.com/openshift/origin/pkg/test/ginkgo.(*TestOptions).Run(0xc001b66570, 0xc00171a710, 0x1, 0x1, 0x0, 0x23d3400)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/pkg/test/ginkgo/cmd_runtest.go:59 +0x41f
main.newRunTestCommand.func1.1()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:239 +0x4e
github.com/openshift/origin/test/extended/util.WithCleanup(0xc001df1bd8)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/test/extended/util/test.go:166 +0x58
main.newRunTestCommand.func1(0xc001600a00, 0xc00171a710, 0x1, 0x1, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:239 +0x1be
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).execute(0xc001600a00, 0xc00171a6d0, 0x1, 0x1, 0xc001600a00, 0xc00171a6d0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:826 +0x460
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc001600280, 0x0, 0x6687640, 0xa1180f8)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:914 +0x2fb
github.com/openshift/origin/vendor/github.com/spf13/cobra.(*Command).Execute(...)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/github.com/spf13/cobra/command.go:864
main.main.func1(0xc001600280, 0x0, 0x0)
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:61 +0x9c
main.main()
	/go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/cmd/openshift-tests/openshift-tests.go:62 +0x36e
STEP: cleaning the environment after azure
STEP: Deleting pvc
May 21 12:23:40.280: INFO: Deleting PersistentVolumeClaim "azure-disklgqmm"
May 21 12:23:40.334: INFO: Waiting up to 5m0s for PersistentVolume pvc-c90588b4-4d55-4089-acf8-2767446e236e to get deleted
May 21 12:23:40.374: INFO: PersistentVolume pvc-c90588b4-4d55-4089-acf8-2767446e236e found and phase=Released (39.773857ms)
May 21 12:23:45.416: INFO: PersistentVolume pvc-c90588b4-4d55-4089-acf8-2767446e236e found and phase=Released (5.081553213s)
May 21 12:23:50.464: INFO: PersistentVolume pvc-c90588b4-4d55-4089-acf8-2767446e236e was removed
STEP: Deleting sc
May 21 12:23:50.528: INFO: In-tree plugin kubernetes.io/azure-disk is not migrated, not validating any metrics
[AfterEach] [Testpattern: Dynamic PV (ext3)] volumes
  /go/src/github.com/openshift/origin/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:179
STEP: Collecting events from namespace "e2e-volume-4484".
STEP: Found 3 events.
May 21 12:23:50.580: INFO: At 2020-05-21 12:18:39 +0000 UTC - event for azure-disklgqmm: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-c90588b4-4d55-4089-acf8-2767446e236e using kubernetes.io/azure-disk
May 21 12:23:50.580: INFO: At 2020-05-21 12:18:40 +0000 UTC - event for azure-injector: {default-scheduler } FailedScheduling: 0/5 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
May 21 12:23:50.580: INFO: At 2020-05-21 12:23:40 +0000 UTC - event for azure-injector: {default-scheduler } FailedScheduling: skip schedule deleting pod: e2e-volume-4484/azure-injector
May 21 12:23:50.622: INFO: POD  NODE  PHASE  GRACE  CONDITIONS
May 21 12:23:50.622: INFO: 
May 21 12:23:50.742: INFO: skipping dumping cluster info - cluster too large
May 21 12:23:50.742: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-volume-4484" for this suite.
May 21 12:23:50.869: INFO: Running AfterSuite actions on all nodes
May 21 12:23:50.869: INFO: Running AfterSuite actions on node 1
fail [k8s.io/kubernetes/test/e2e/framework/volume/fixtures.go:518]: May 21 12:23:40.279: Failed to create injector pod: timed out waiting for the condition

Comment 1 Hemant Kumar 2020-05-21 16:46:04 UTC

I think these tests are failing because the volume is being provisioned in a zone that has no worker node. This happens because the cluster has 3 master nodes but only 2 worker nodes. I think we had a similar problem in AWS for a bit, and it caused flakes.
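
To illustrate the suspected failure mode, here is a minimal sketch in Go. It is not the scheduler's actual code: the Node type, the helper names, and the zone labels "1", "2", "3" are all hypothetical, and it assumes the three masters are spread across three zones while the two workers cover only two of them. Under those assumptions, a volume provisioned in the zone without a worker cannot be used by any schedulable node, which would match the "volume node affinity conflict" plus master taint events above.

```go
package main

import (
	"fmt"
	"sort"
)

// Node is a simplified, hypothetical stand-in for a cluster node.
type Node struct {
	Name   string
	Zone   string // assumed zone label; the real zone names are not in this report
	Master bool   // true if the node carries the master taint the test pod does not tolerate
}

// schedulableZones returns the sorted set of zones that contain
// at least one untainted (worker) node.
func schedulableZones(nodes []Node) []string {
	seen := map[string]bool{}
	for _, n := range nodes {
		if !n.Master {
			seen[n.Zone] = true
		}
	}
	zones := make([]string, 0, len(seen))
	for z := range seen {
		zones = append(zones, z)
	}
	sort.Strings(zones)
	return zones
}

// canSchedule reports whether a pod that uses a volume pinned to
// volumeZone could be placed on any worker node.
func canSchedule(nodes []Node, volumeZone string) bool {
	for _, n := range nodes {
		if !n.Master && n.Zone == volumeZone {
			return true
		}
	}
	return false
}

func main() {
	// Assumed layout: 3 masters in 3 zones, 2 workers in 2 of those zones.
	nodes := []Node{
		{"master-0", "1", true},
		{"master-1", "2", true},
		{"master-2", "3", true},
		{"worker-0", "1", false},
		{"worker-1", "2", false},
	}
	fmt.Println("zones with workers:", schedulableZones(nodes))
	for _, zone := range []string{"1", "2", "3"} {
		fmt.Printf("volume in zone %s schedulable: %v\n", zone, canSchedule(nodes, zone))
	}
}
```

In this sketch only the volume in zone "3" is unschedulable. If the assumption holds, restricting provisioning to zones that have worker nodes would avoid the conflict.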

Comment 3 Christian Huffman 2020-06-01 18:59:48 UTC

The internal Azure tests have passed with this change. I've submitted an upstream PR [1] to include this in k8s.

[1] https://github.com/kubernetes/kubernetes/pull/91642

Comment 6 Wei Duan 2020-07-28 00:46:35 UTC

Some cases still fail with a volume node affinity conflict; we need to check whether there is another issue.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1284319209928527872
STEP: Found 3 events.
Jul 18 03:54:08.144: INFO: At 2020-07-18 03:48:55 +0000 UTC - event for azure-diskph482: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-9521d714-f364-48a1-baed-95a40db5b33f using kubernetes.io/azure-disk
Jul 18 03:54:08.144: INFO: At 2020-07-18 03:48:57 +0000 UTC - event for exec-volume-test-dynamicpv-xwb5: {default-scheduler } FailedScheduling: 0/6 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.


https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.5/1285226062380273664
STEP: Found 4 events.
Jul 20 16:10:38.369: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for exec-volume-test-dynamicpv-7fcc: {default-scheduler } FailedScheduling: 0/6 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Jul 20 16:10:38.369: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for exec-volume-test-dynamicpv-7fcc: {default-scheduler } FailedScheduling: 0/6 nodes are available: 2 node(s) had volume node affinity conflict, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Jul 20 16:10:38.369: INFO: At 0001-01-01 00:00:00 +0000 UTC - event for exec-volume-test-dynamicpv-7fcc: {default-scheduler } FailedScheduling: skip schedule deleting pod: e2e-volume-9629/exec-volume-test-dynamicpv-7fcc
Jul 20 16:10:38.369: INFO: At 2020-07-20 16:05:24 +0000 UTC - event for azure-diskdm6s9: {persistentvolume-controller } ProvisioningSucceeded: Successfully provisioned volume pvc-21d39c8e-0eeb-4251-9789-dfe97a0ce40e using kubernetes.io/azure-disk

Comment 7 Christian Huffman 2020-07-28 03:38:47 UTC

Both of the links provided are for 4.5; however, this change has not been backported to 4.5 yet - it only exists in 4.6 at this time. If we don't see any failures in 4.6, then we can proceed to backport it.

Is it possible to check whether these failures exist in 4.6, which contains the change?

Comment 8 Wei Duan 2020-07-28 06:33:51 UTC

Hi Huffman, it's my fault. I did not see these failures in 4.6.
Status changed.

Comment 10 errata-xmlrpc 2020-10-27 16:00:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196