Description of problem:
Stage pods created during stage migration currently use podAffinity to ensure that they are placed on the same node as the related application pods. In a cluster in which pod affinity is being ignored for some reason, these pods won't be scheduled properly, resulting in stage pods failing to start for pods that include volumes backed by plugins that depend on node placement (such as gp2/EBS and glusterblock).

How reproducible:
Difficult to reproduce unless you have access to a cluster in which affinity is being ignored.

Steps to Reproduce:
1. Find a cluster in which affinity is being ignored.
2. Use a namespace with running stateful pods with gp2 or glusterblock PVs.
3. Migrate this namespace to a new cluster with PVs configured for filesystem copy.
4. If multiple nodes are available, eventually a case will be hit where the stage pod is scheduled on the wrong node and it will fail to start.

It should be sufficient to verify that any fix for the above doesn't introduce regressions on normal clusters -- in other words, following the steps above while skipping step 1, the Expected results below occur.

Actual results:
When the stage pod is scheduled on a different node from the application pod, the pod fails to start.

Expected results:
The stage pod is always scheduled on the same node as the application pod, and the migration succeeds.
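For context, here is a minimal Go sketch of the kind of podAffinity constraint described above. The function and label names are illustrative only, not the actual mig-controller code; the point is that co-location relies on the scheduler honoring the affinity term.

    package stagepod

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // buildStagePodAffinity sketches a podAffinity term that asks the
    // scheduler to place the stage pod on the same node as pods matching
    // appLabels. If the scheduler ignores affinity, this constraint is
    // silently lost and the stage pod can land on the wrong node.
    func buildStagePodAffinity(appLabels map[string]string) *corev1.Affinity {
        return &corev1.Affinity{
            PodAffinity: &corev1.PodAffinity{
                RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
                    {
                        LabelSelector: &metav1.LabelSelector{
                            MatchLabels: appLabels,
                        },
                        // Co-locate at node granularity.
                        TopologyKey: "kubernetes.io/hostname",
                    },
                },
            },
        }
    }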
The fix is to replace the podAffinity scheduling with explicitly setting NodeName. Fixes are merged to master (and to the stable branch):
https://github.com/fusor/mig-controller/pull/327
https://github.com/fusor/mig-controller/pull/332
I expect that these will be pulled over to the release-1.0 branch tomorrow.
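A minimal sketch of the fixed approach, assuming access to both pod specs (names are illustrative, not the actual code from the linked PRs):

    package stagepod

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // setStagePodNode pins the stage pod to the exact node the application
    // pod is running on, bypassing the scheduler's affinity handling so
    // that node-local volumes (gp2/EBS, glusterblock) attach correctly.
    func setStagePodNode(stagePod, appPod *corev1.Pod) {
        // Drop any affinity-based placement and set the node directly.
        stagePod.Spec.Affinity = nil
        stagePod.Spec.NodeName = appPod.Spec.NodeName
    }

Setting NodeName bypasses the scheduler entirely, so the placement holds even on clusters where affinity is being ignored.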
Verified with the below steps.

1. Prepare 2 clusters, one OCP 3.11 and one OCP 4.3.
2. Remove the below section from /etc/origin/master/scheduler.json on the OCP 3.11 cluster, then restart the controllers with the command "master-restart controllers":
   {
       "name": "InterPodAffinityPriority",
       "weight": 1
   },
3. Create a new project named "test-affinity":
   # oc new-project test-affinity
4. Deploy the StatefulSet application:
   # oc create -f https://raw.githubusercontent.com/openshift-qe/v3-testfiles/master/statefulset/stable-storage.yaml
   # oc get pod
   NAME                  READY   STATUS    RESTARTS   AGE
   hello-statefulset-0   1/1     Running   0          32m
   hello-statefulset-1   1/1     Running   0          31m
   # oc get pvc
   NAME                      STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
   www-hello-statefulset-0   Bound    pvc-52d77dca-3da8-11ea-affc-0e876d566add   1Gi        RWO            gp2            32m
   www-hello-statefulset-1   Bound    pvc-6acf33a6-3da8-11ea-affc-0e876d566add   1Gi        RWO            gp2            31m
5. Create a migplan and execute it.
6. The application is migrated to OCP 4.3 and the pods are running well.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2020:0440