Description of problem:
Stage pods created during stage migration currently use podAffinity to ensure that they are placed on the same node as the related application pods. In a cluster in which pod affinity is being ignored for some reason, these pods won't be scheduled properly, resulting in stage pods failing to start for pods that include volumes backed by plugins that depend on node placement (such as gp2/EBS and glusterblock).

How reproducible:
Difficult to reproduce unless you have access to a cluster in which affinity is being ignored.

Steps to Reproduce:
1. Find a cluster in which affinity is being ignored.
2. Use a namespace with running stateful pods with gp2 or glusterblock PVs.
3. Migrate this namespace to a new cluster with PVs configured for filesystem copy.
4. If multiple nodes are available, eventually a case will be hit where the stage pod is scheduled on the wrong node and it will fail to start.

It should be sufficient to verify that any fix for the above doesn't introduce regressions on normal clusters -- in other words, following the steps above while skipping step 1, the Expected results below occur.

Actual results:
When the stage pod is scheduled on a different node from the application pod, the pod fails to start.

Expected results:
The stage pod is always scheduled on the same node as the application pod, and the migration succeeds.
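For context, here is a minimal Go sketch of the kind of podAffinity constraint described above. The function and label names are illustrative only, not the actual mig-controller code; the point is that co-location relies on the scheduler honoring the affinity term.

    package stagepod

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // buildStagePodAffinity sketches a podAffinity term that asks the
    // scheduler to place the stage pod on the same node as pods matching
    // appLabels. If the scheduler ignores affinity, this constraint is
    // silently lost and the stage pod can land on the wrong node.
    func buildStagePodAffinity(appLabels map[string]string) *corev1.Affinity {
        return &corev1.Affinity{
            PodAffinity: &corev1.PodAffinity{
                RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
                    {
                        LabelSelector: &metav1.LabelSelector{
                            MatchLabels: appLabels,
                        },
                        // Co-locate at node granularity.
                        TopologyKey: "kubernetes.io/hostname",
                    },
                },
            },
        }
    }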
The fix is to replace the podAffinity scheduling with explicitly setting NodeName. Fixes are merged to master (and to the stable branch):
https://github.com/fusor/mig-controller/pull/327
https://github.com/fusor/mig-controller/pull/332
I expect that these will be pulled over to the release-1.0 branch tomorrow.
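A minimal sketch of the fixed approach, assuming access to both pod specs (names are illustrative, not the actual code from the linked PRs):

    package stagepod

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // setStagePodNode pins the stage pod to the exact node the application
    // pod is running on, bypassing the scheduler's affinity handling so
    // that node-local volumes (gp2/EBS, glusterblock) attach correctly.
    func setStagePodNode(stagePod, appPod *corev1.Pod) {
        // Drop any affinity-based placement and set the node directly.
        stagePod.Spec.Affinity = nil
        stagePod.Spec.NodeName = appPod.Spec.NodeName
    }

Setting NodeName bypasses the scheduler entirely, so the placement holds even on clusters where affinity is being ignored.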
Verified with the below steps.

1. Prepare 2 clusters, one OCP 3.11 and one OCP 4.3.
2. Remove the below section from /etc/origin/master/scheduler.json on the OCP 3.11 cluster, then restart the controllers with the command "master-restart controllers":
   {
       "name": "InterPodAffinityPriority",
       "weight": 1
   },
3. Create a new project named "test-affinity":
   # oc new-project test-affinity
4. Deploy the StatefulSet application:
   # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/statefulset/stable-storage.yaml
   # oc get pod
   NAME                  READY   STATUS    RESTARTS   AGE
   hello-statefulset-0   1/1     Running   0          32m
   hello-statefulset-1   1/1     Running   0          31m
   # oc get pvc
   NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
   www-hello-statefulset-0   Bound    pvc-52d77dca-3da8-11ea-affc-0e876d566add   1Gi        RWO            gp2            32m
   www-hello-statefulset-1   Bound    pvc-6acf33a6-3da8-11ea-affc-0e876d566add   1Gi        RWO            gp2            31m
5. Create a migplan and execute it.
6. The application is migrated to OCP 4.3 and the pods are running well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0440