1732901 – FailedScheduling after few deployments on azure

Summary: FailedScheduling after few deployments on azure

Keywords:
Status:	CLOSED DUPLICATE of bug 1731059
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Storage
Sub Component:
Version:	4.2.0
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	medium
Target Milestone:	---
Target Release:	4.2.0
Assignee:	ravig
QA Contact:	ge liu
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2019-07-24 16:32 UTC by Simon
Modified:	2019-07-26 10:30 UTC
CC List:	4 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2019-07-26 10:30:49 UTC
Target Upstream Version:
Embargoed:

Description Simon 2019-07-24 16:32:05 UTC

Description of problem:
Pods fail with FailedScheduling after a few deployments on Azure.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-07-22-160516

How reproducible:
100%

Steps to Reproduce:
1. Using openshift-install, create a cluster on Azure with the following parameters:

apiVersion: v1
baseDomain: qe.azure.devcluster.openshift.com
compute:
- hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_DS2_v2
  replicas: 3
controlPlane:
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      type: Standard_DS3_v2
  replicas: 3


2. Get template:

wget https://raw.githubusercontent.com/skordas/svt/starge_git_test_update/storage/git/files/oc/template_git.yaml

3. Deploy:
oc new-project test-1
oc process -f template_git.yaml -p PVC_SIZE=1Gi -p STORAGE_CLASS_NAME=managed-premium | oc create --namespace test-1 -f -

4. Repeat step 3 for projects test-2 ... test-9
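
The repetition in steps 3 and 4 could be scripted roughly as follows. This is only a sketch: project_names and deploy are hypothetical helper names, and the oc commands are copied from step 3.

```shell
# Sketch only: automates steps 3 and 4 for a range of projects.
# project_names and deploy are hypothetical helper names.

# Print the project names test-<first> .. test-<last>, one per line.
project_names() {
    i="$1"
    while [ "$i" -le "$2" ]; do
        echo "test-$i"
        i=$((i + 1))
    done
}

# Run the commands from step 3 against a single project.
deploy() {
    oc new-project "$1" &&
        oc process -f template_git.yaml -p PVC_SIZE=1Gi -p STORAGE_CLASS_NAME=managed-premium |
        oc create --namespace "$1" -f -
}

# Usage (requires a logged-in oc session and template_git.yaml in the current directory):
#   for p in $(project_names 2 9); do deploy "$p"; done
```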

Actual results:
In my case, after 8 deployments the rest are unsuccessful:

oc get pods -n test-9
NAME           READY   STATUS   RESTARTS   AGE
git-1-deploy   0/1     Error    0          20h

oc get events
LAST SEEN   TYPE      REASON                        OBJECT                        MESSAGE
115m        Warning   FailedScheduling              pod/git-2-deploy              Binding rejected: Operation cannot be fulfilled on pods/binding "git-2-deploy": pod git-2-deploy is already assigned to node "skordas0723-5lbrd-worker-centralus3-s4qn2"
115m        Warning   FailedScheduling              pod/git-2-deploy              Binding rejected: Operation cannot be fulfilled on pods/binding "git-2-deploy": pod git-2-deploy is already assigned to node "skordas0723-5lbrd-worker-centralus3-s4qn2"
115m        Normal    Scheduled                     pod/git-2-deploy              Successfully assigned test-9/git-2-deploy to skordas0723-5lbrd-worker-centralus3-s4qn2
115m        Normal    Pulled                        pod/git-2-deploy              Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:fcb3aa914fad34d05b60a957ff2a99c9d10b7c441916d0fe36d41b0f756440a3" already present on machine
115m        Normal    Created                       pod/git-2-deploy              Created container deployment
115m        Normal    Started                       pod/git-2-deploy              Started container deployment
105m        Warning   FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m        Warning   FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m        Warning   FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.
105m        Warning   FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
105m        Warning   FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
105m        Warning   FailedScheduling              pod/git-2-g8cg7               skip schedule deleting pod: test-9/git-2-g8cg7
115m        Normal    SuccessfulCreate              replicationcontroller/git-2   Created pod: git-2-g8cg7
105m        Normal    SuccessfulDelete              replicationcontroller/git-2   Deleted pod: git-2-g8cg7
115m        Normal    DeploymentCreated             deploymentconfig/git          Created new replication controller "git-2" for version 2
105m        Normal    ReplicationControllerScaled   deploymentconfig/git          Scaled replication controller "git-2" from 1 to 0


Expected results:
All deployments should be successful.

Comment 2 Simon 2019-07-25 14:25:16 UTC

It looks like this is storage related:

When there are multiple availability zones, the PV/PVC should be in the same zone as the pod.

oc get storageclass managed-premium -o yaml | grep volumeBindingMode

Actual:
volumeBindingMode: Immediate

Expected:
volumeBindingMode: WaitForFirstConsumer


With volumeBindingMode: Immediate, the volume for a PVC can be created in a different zone than the pod. WaitForFirstConsumer ensures that the PV, PVC, and pod are in the same zone.
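
As a possible way to try this out, a separate storage class with delayed binding could be created. volumeBindingMode cannot be changed on an existing StorageClass, so the sketch below uses a new one. The name managed-premium-wffc and the helper storageclass_wffc are hypothetical, and the provisioner and parameters are assumptions about what managed-premium uses on this cluster (compare with the output of oc get storageclass managed-premium -o yaml before relying on them).

```shell
# Sketch only: prints a StorageClass manifest that delays volume binding
# until a pod using the PVC is scheduled.
# Assumptions: the name "managed-premium-wffc" is made up, and the
# provisioner/parameters are assumed to match the existing managed-premium class.
storageclass_wffc() {
    cat <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-premium-wffc
provisioner: kubernetes.io/azure-disk
parameters:
  kind: Managed
  storageaccounttype: Premium_LRS
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
EOF
}

# Usage (requires a logged-in oc session with cluster-admin rights):
#   storageclass_wffc | oc create -f -
# Then deploy with -p STORAGE_CLASS_NAME=managed-premium-wffc instead of managed-premium.
```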

Comment 3 Jan Safranek 2019-07-26 10:30:49 UTC

> 105m        Warning   FailedScheduling              pod/git-2-g8cg7               0/6 nodes are available: 1 node(s) exceed max volume count, 2 node(s) had volume node affinity conflict, 3 node(s) had taints that the pod didn't tolerate.

It seems that one node is at its limit of attached Azure volumes.

> volumeBindingMode: WaitForFirstConsumer

This is already covered in bug #1731059, and it should help in your case. But please check the number of pods on the other nodes: are they reaching the volume attachment limit too? Kubernetes should distribute volumes across zones roughly evenly. Maybe it's time to scale the cluster up.
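
One rough way to check this is to count the volumes each node reports as attached. The helper below is a sketch (attached_per_node is a hypothetical name); it assumes the node status lists attached volumes under .status.volumesAttached.

```shell
# Sketch only: print "<node> <number of attached volumes>" for every node.
# attached_per_node is a hypothetical helper name.
attached_per_node() {
    for node in $(oc get nodes -o name); do
        # volumesAttached is empty or missing when nothing is attached, giving 0.
        count=$(( $(oc get "$node" -o jsonpath='{.status.volumesAttached[*].name}' | wc -w) ))
        echo "$node $count"
    done
}

# Usage (requires a logged-in oc session):
#   attached_per_node
# The per-node limit may be visible (assumption: in-tree Azure disk plugin) with:
#   oc describe nodes | grep -E '^Name:|attachable-volumes-azure-disk'
```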

*** This bug has been marked as a duplicate of bug 1731059 ***
