Bug 1493432

Summary:	Pod scheduling fails when it uses local storage
Product:	OpenShift Container Platform	Reporter:	Qin Ping <piqin>
Component:	Installer	Assignee:	Jan Safranek <jsafrane>
Status:	CLOSED ERRATA	QA Contact:	Qin Ping <piqin>
Severity:	medium	Docs Contact:
Priority:	unspecified
Version:	3.7.0	CC:	aos-bugs, aos-storage-staff, jokerman, jsafrane, mmccomas
Target Milestone:	---
Target Release:	3.7.0
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-11-28 22:12:03 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Qin Ping 2017-09-20 07:51:34 UTC

Description of problem:
A Pod using local storage was not scheduled to the node where the local storage exists.

Version-Release number of selected component (if applicable):
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd

How reproducible:
Always

Steps to Reproduce:
1. Create an OCP cluster with 2 nodes.
2. On each node, create directories (/mnt/disks/vol1).
3. Create PVCs, making sure they use PVs from different nodes.
4. Create Pods that use the PVCs created in the previous step.
5. All Pods should be scheduled to the correct node and be running.


Actual results:
One Pod's status is "ContainerCreating", and reports: FailedMount Storage node affinity check failed for volume "local-pv-dc7ed566" : NodeSelectorTerm [{Key:kubernetes.io/hostname Operator:In Values:[host-8-241-73.host.centralci.eng.rdu2.redhat.com]}] does not match node labels


Expected results:
Pod's status is "Running"


Master Log:

Node Log (of failed PODs):
Sep 20 03:50:18 host-8-241-40 atomic-openshift-node: E0920 03:50:18.903901   10056 reconciler.go:253] operationExecutor.MountVolume failed (controllerAttachDetachEnabled true) for volume "local-pv-ae231eff" (UniqueName: "kubernetes.io/local-volume/88c2a7b0-9dd5-11e7-b4ca-fa163e501ea4-local-pv-ae231eff") pod "local-volume-pod-2" (UID: "88c2a7b0-9dd5-11e7-b4ca-fa163e501ea4") : Storage node affinity check failed for volume "local-pv-ae231eff" (UniqueName: "kubernetes.io/local-volume/88c2a7b0-9dd5-11e7-b4ca-fa163e501ea4-local-pv-ae231eff") pod "local-volume-pod-2" (UID: "88c2a7b0-9dd5-11e7-b4ca-fa163e501ea4") : NodeSelectorTerm [{Key:kubernetes.io/hostname Operator:In Values:[host-8-241-73.host.centralci.eng.rdu2.redhat.com]}] does not match node labels

PV Dump:

PVC Dump:

StorageClass Dump (if StorageClass used by PV/PVC):

Additional info:
$ oc describe pod local-volume-pod-2
Name:			local-volume-pod-2
Namespace:		mytest
Security Policy:	restricted
Node:			host-8-241-40.host.centralci.eng.rdu2.redhat.com/172.16.120.57
Start Time:		Wed, 20 Sep 2017 15:30:10 +0800
Labels:			<none>
Status:			Pending
IP:			
Controllers:		<none>
Containers:
  myfront:
    Container ID:	
    Image:		aosqe/hello-openshift
    Image ID:		
    Port:		80/TCP
    State:		Waiting
      Reason:		ContainerCreating
    Ready:		False
    Restart Count:	0
    Volume Mounts:
      /mnt/local from pvol (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-9sr9w (ro)
    Environment Variables:	<none>
Conditions:
  Type		Status
  Initialized 	True 
  Ready 	False 
  PodScheduled 	True 
Volumes:
  pvol:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	localstorageclaim-2
    ReadOnly:	false
  default-token-9sr9w:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	default-token-9sr9w
QoS Class:	BestEffort
Tolerations:	<none>
Events:
  FirstSeen	LastSeen	Count	From								SubObjectPath	Type		Reason			Message
  ---------	--------	-----	----								-------------	--------	------			-------
  15m		15m		1	{default-scheduler }								Normal		Scheduled		Successfully assigned local-volume-pod-2 to host-8-241-40.host.centralci.eng.rdu2.redhat.com
  15m		15m		1	{kubelet host-8-241-40.host.centralci.eng.rdu2.redhat.com}			Normal		SuccessfulMountVolume	MountVolume.SetUp succeeded for volume "default-token-9sr9w" 
  15m		57s		8970	{kubelet host-8-241-40.host.centralci.eng.rdu2.redhat.com}			Warning		FailedMount		Storage node affinity check failed for volume "local-pv-ae231eff" : NodeSelectorTerm [{Key:kubernetes.io/hostname Operator:In Values:[host-8-241-73.host.centralci.eng.rdu2.redhat.com]}] does not match node labels
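
To make the failed check in the event above concrete, here is a minimal Python sketch of how a NodeSelectorTerm with the "In" operator could be evaluated against node labels. It is not the kubelet's actual code. The function name term_matches is hypothetical, and the sketch assumes each node's kubernetes.io/hostname label equals its node name.

```python
def term_matches(node_labels, match_expressions):
    """Return True only if every expression in the term matches the labels."""
    for expr in match_expressions:
        value = node_labels.get(expr["key"])
        if expr["operator"] == "In":
            if value not in expr["values"]:
                return False
        else:
            # Only "In" appears in this bug report; other operators are omitted.
            raise NotImplementedError(expr["operator"])
    return True


# Term copied from the FailedMount event for volume "local-pv-ae231eff".
pv_term = [{
    "key": "kubernetes.io/hostname",
    "operator": "In",
    "values": ["host-8-241-73.host.centralci.eng.rdu2.redhat.com"],
}]

# Assumption: the hostname label of each node equals its node name.
node_40 = {"kubernetes.io/hostname": "host-8-241-40.host.centralci.eng.rdu2.redhat.com"}
node_73 = {"kubernetes.io/hostname": "host-8-241-73.host.centralci.eng.rdu2.redhat.com"}

print(term_matches(node_40, pv_term))  # node the Pod was assigned to
print(term_matches(node_73, pv_term))  # node that holds the local volume
```

Under these assumptions the check fails on host-8-241-40 and would pass on host-8-241-73, which matches the event: the Pod landed on a node other than the one named in the PV's node affinity.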

Comment 1 Jan Safranek 2017-09-20 08:13:30 UTC

In controller-manager logs I can see that scheduler is started with these predicates and priorities:

Sep 19 22:29:11 host-8-241-62.host.centralci.eng.rdu2.redhat.com atomic-openshift-master-controllers[18383]: I0919 22:29:11.935972   18383 factory.go:351] Creating scheduler from configuration: {{ }[
{NoVolumeZoneConflict <nil>}
{MaxEBSVolumeCount <nil>}
{MaxGCEPDVolumeCount <nil>}
{MatchInterPodAffinity <nil>}
{NoDiskConflict <nil>}
{GeneralPredicates <nil>}
{PodToleratesNodeTaints <nil>}
{CheckNodeMemoryPressure <nil>}
{CheckNodeDiskPressure <nil>}
{Region 0xc4210392d0}]
[{SelectorSpreadPriority 1 <nil>}
{InterPodAffinityPriority 1 <nil>}
{LeastRequestedPriority 1 <nil>}
{BalancedResourceAllocation 1 <nil>}
{NodePreferAvoidPodsPriority 10000 <nil>}
{NodeAffinityPriority 1 <nil>}
{TaintTolerationPriority 1 <nil>}
{Zone 2 0xc421039990}] [] 0}


(edited for readability)

"NoVolumeNodeConflict" is missing there for some reason.
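
For anyone who wants to repeat this check on another cluster, a rough Python sketch that pulls predicate names out of a dump shaped like the one above and looks for a given name. The helper predicate_names and its regular expression are assumptions based only on the "{Name <nil>}" and "{Name 0x...}" layout seen in this log, not on any documented format. Priority entries carry a weight after the name, so the pattern skips them.

```python
import re


def predicate_names(dump):
    """Extract names from entries shaped like '{Name <nil>}' or '{Name 0x...}'."""
    return re.findall(r"\{(\w+) (?:<nil>|0x[0-9a-f]+)\}", dump)


# Excerpt of the dump above: all predicates plus two priority entries.
dump = """
{NoVolumeZoneConflict <nil>}
{MaxEBSVolumeCount <nil>}
{MaxGCEPDVolumeCount <nil>}
{MatchInterPodAffinity <nil>}
{NoDiskConflict <nil>}
{GeneralPredicates <nil>}
{PodToleratesNodeTaints <nil>}
{CheckNodeMemoryPressure <nil>}
{CheckNodeDiskPressure <nil>}
{Region 0xc4210392d0}]
[{SelectorSpreadPriority 1 <nil>}
{Zone 2 0xc421039990}] [] 0}
"""

names = predicate_names(dump)
print(names)
print("NoVolumeNodeConflict" in names)
```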

Comment 2 Jan Safranek 2017-09-20 11:27:21 UTC

The NoVolumeNodeConflict predicate is not configured in /etc/origin/master/scheduler.json. It must be added there by the installer.

As a workaround, you can manually add it there and restart atomic-openshift-master-controllers:

...
    "predicates": [
        {
            "name": "NoVolumeNodeConflict"
        },
...


Note that the predicate is enabled when running a local cluster with a plain "openshift start", i.e. unconfigured OpenShift works out of the box.
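
The manual workaround above could be scripted roughly as follows. This is a sketch, not what the installer does: the helper ensure_predicate is hypothetical, the sample policy is a made-up minimal document rather than a real scheduler.json, and placing the new entry first only mirrors the snippet above. Reading and writing /etc/origin/master/scheduler.json and restarting atomic-openshift-master-controllers are left out.

```python
import json


def ensure_predicate(policy, name):
    """Add {"name": name} to policy["predicates"] unless it is already there.

    Returns True if the policy was modified.
    """
    predicates = policy.setdefault("predicates", [])
    if any(p.get("name") == name for p in predicates):
        return False
    # Insert first to mirror the workaround snippet; ordering is an assumption.
    predicates.insert(0, {"name": name})
    return True


# Hypothetical minimal policy used only for illustration.
policy = json.loads("""
{
    "predicates": [
        {"name": "NoDiskConflict"}
    ],
    "priorities": []
}
""")

changed = ensure_predicate(policy, "NoVolumeNodeConflict")
print(changed)
print(json.dumps(policy, indent=4))
```

The function is idempotent, so running it again on an already patched policy leaves the file content unchanged.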

Comment 3 Jan Safranek 2017-09-20 13:41:20 UTC

I'll try to fix it on my own. I need to get familiar with openshift-ansible anyway, and this looks like a trivial template modification.

Comment 5 Jan Safranek 2017-09-25 12:09:27 UTC

Ansible PR: https://github.com/openshift/openshift-ansible/pull/5492

Comment 6 Qin Ping 2017-09-27 08:01:34 UTC

Verified the workaround in the following version:
openshift v3.7.0-0.126.4
kubernetes v1.7.0+80709908fd

Pod can be scheduled to the correct node.

Comment 7 Qin Ping 2017-09-27 08:02:47 UTC

Verified that the "NoVolumeNodeConflict" and "MaxAzureDiskVolumeCount" predicates were installed in the following version:
oc v3.7.0-0.131.0
kubernetes v1.7.0+80709908fd

Comment 11 errata-xmlrpc 2017-11-28 22:12:03 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188