Created attachment 1642081 [details]
script to run with clusterloader under https://github.com/openshift/svt

Description of problem:
Pods fail due to a delay in NFS PVC binding after migration.

Version-Release number of selected component (if applicable):
# oc describe pod/migration-operator-56cbd64c4b-bwvfh -n openshift-migration | grep Image:
containerImage:
Image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-rhel7-operator@sha256:1a410c94063401e4a4d16e73a8c1efb1f9255c81e922e17ba6ae88860169bf60
Image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-rhel7-operator@sha256:1a410c94063401e4a4d16e73a8c1efb1f9255c81e922e17ba6ae88860169bf60
# oc describe pod/migration-controller-77459d677d-jnxhr -n openshift-migration | grep Image:
Image: image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-controller-rhel8@sha256:e2c3cbb61157605d8246496f77c76b9b2950eb951bd0a63d4f8e3ae6f1884c2c

How reproducible:
always

Steps to Reproduce:
1. Create 10 namespaces with more than 10 pods each that use NFS PVCs. Attaching the scripts and templates used.
2. Fill each PVC with data up to its maximum capacity. Attaching the script used.
3. Create the test plan and start migration.
4. Migration is successful

Actual results:
A few pods under the first 5-6 namespaces fail.
# oc get pods -n migtest-pv-4
NAME   READY   STATUS   RESTARTS   AGE
django-psql-example0-1-build   0/1   Completed   0   17h
django-psql-example0-2-deploy   0/1   Error   0   17h
django-psql-example1-1-build   0/1   Completed   0   17h
django-psql-example1-2-deploy   0/1   Error   0   17h
django-psql-example2-1-build   0/1   Completed   0   17h
django-psql-example2-2-deploy   0/1   Error   0   17h
django-psql-example3-1-build   0/1   Completed   0   17h
django-psql-example3-2-deploy   0/1   Error   0   17h
django-psql-example4-1-build   0/1   Completed   0   17h
django-psql-example4-2-deploy   0/1   Error   0   17h
django-psql-example5-1-build   0/1   Completed   0   17h
django-psql-example5-2-deploy   0/1   Error   0   17h
django-psql-example6-1-build   0/1   Completed   0   17h
django-psql-example6-2-deploy   0/1   Error   0   17h
django-psql-example7-1-build   0/1   Completed   0   17h
django-psql-example7-2-deploy   0/1   Error   0   17h
django-psql-example8-1-build   0/1   Completed   0   17h
django-psql-example8-2-deploy   0/1   Error   0   17h
django-psql-example9-1-build   0/1   Completed   0   17h
django-psql-example9-2-deploy   0/1   Error   0   17h
postgresql0-1-deploy   0/1   Completed   0   17h
postgresql0-1-zs52x   1/1   Running   0   17h
postgresql1-1-deploy   0/1   Completed   0   17h
postgresql1-1-r7fsz   1/1   Running   0   17h
postgresql2-1-deploy   0/1   Completed   0   17h
postgresql2-1-tgkqt   1/1   Running   0   17h
postgresql3-1-5j7n8   1/1   Running   0   17h
postgresql3-1-deploy   0/1   Completed   0   17h
postgresql4-1-deploy   0/1   Completed   0   17h
postgresql4-1-qmqbf   1/1   Running   0   17h
postgresql5-1-55vzm   1/1   Running   0   17h
postgresql5-1-deploy   0/1   Completed   0   17h
postgresql6-1-deploy   0/1   Completed   0   17h
postgresql6-1-n7szt   1/1   Running   0   17h
postgresql7-1-deploy   0/1   Completed   0   17h
postgresql7-1-qt65w   1/1   Running   0   17h
postgresql8-1-deploy   0/1   Completed   0   17h
postgresql8-1-tz69t   1/1   Running   0   17h
postgresql9-1-deploy   0/1   Completed   0   17h
postgresql9-1-jhwqd   1/1   Running   0   17h

[root@rpattath mig-operator]# oc get pods -n migtest-pv-5
NAME   READY   STATUS   RESTARTS   AGE
django-psql-example0-1-build   0/1   Completed   0   17h
django-psql-example0-2-deploy   0/1   Error   0   17h
django-psql-example1-1-build   0/1   Completed   0   17h
django-psql-example1-2-deploy   0/1   Error   0   17h
django-psql-example2-1-build   0/1   Completed   0   17h
django-psql-example2-2-deploy   0/1   Error   0   17h
django-psql-example3-1-build   0/1   Completed   0   17h
django-psql-example3-2-deploy   0/1   Error   0   17h
django-psql-example4-1-build   0/1   Completed   0   17h
django-psql-example4-2-deploy   0/1   Error   0   17h
django-psql-example5-1-build   0/1   Completed   0   17h
django-psql-example5-2-deploy   0/1   Error   0   17h
django-psql-example6-1-build   0/1   Completed   0   17h
django-psql-example6-2-deploy   0/1   Error   0   17h
django-psql-example7-1-build   0/1   Completed   0   17h
django-psql-example7-2-deploy   0/1   Error   0   17h
django-psql-example8-1-build   0/1   Completed   0   17h
django-psql-example8-2-deploy   0/1   Error   0   17h
django-psql-example9-1-build   0/1   Completed   0   17h
django-psql-example9-2-deploy   0/1   Error   0   17h
postgresql0-1-deploy   0/1   Completed   0   17h
postgresql0-1-rtldf   1/1   Running   0   17h
postgresql1-1-deploy   0/1   Completed   0   17h
postgresql1-1-m97gr   1/1   Running   0   17h
postgresql2-1-deploy   0/1   Completed   0   17h
postgresql2-1-ztr47   1/1   Running   0   17h
postgresql3-1-4b8hm   1/1   Running   0   17h
postgresql3-1-deploy   0/1   Completed   0   17h
postgresql4-1-deploy   0/1   Completed   0   17h
postgresql4-1-s2l74   1/1   Running   0   17h
postgresql5-1-2tzxl   1/1   Running   0   17h
postgresql5-1-deploy   0/1   Completed   0   17h
postgresql6-1-57q2z   1/1   Running   0   17h
postgresql6-1-deploy   0/1   Completed   0   17h
postgresql7-1-deploy   0/1   Completed   0   17h
postgresql7-1-lvpgh   1/1   Running   0   17h
postgresql8-1-deploy   0/1   Completed   0   17h
postgresql8-1-qktpw   1/1   Running   0   17h
postgresql9-1-deploy   0/1   Completed   0   17h
postgresql9-1-g4z5l   1/1   Running   0   17h

[root@rpattath mig-operator]# oc get pods -n migtest-pv-6
NAME   READY   STATUS   RESTARTS   AGE
django-psql-example0-1-build   0/1   Completed   0   17h
django-psql-example0-2-7w2gp   1/1   Running   6   17h
django-psql-example0-2-deploy   0/1   Completed   0   17h
django-psql-example1-1-build   0/1   Completed   0   17h
django-psql-example1-2-deploy   0/1   Error   0   17h
django-psql-example2-1-build   0/1   Completed   0   17h
django-psql-example2-2-4vm6n   1/1   Running   6   17h
django-psql-example2-2-deploy   0/1   Completed   0   17h
django-psql-example3-1-build   0/1   Completed   0   17h
django-psql-example3-2-deploy   0/1   Completed   0   17h
django-psql-example3-2-rvgpf   1/1   Running   6   17h
django-psql-example4-1-build   0/1   Completed   0   17h
django-psql-example4-2-deploy   0/1   Completed   0   17h
django-psql-example4-2-t7h9j   1/1   Running   6   17h
django-psql-example5-1-build   0/1   Completed   0   17h
django-psql-example5-2-deploy   0/1   Completed   0   17h
django-psql-example5-2-q28jj   1/1   Running   6   17h
django-psql-example6-1-build   0/1   Completed   0   17h
django-psql-example6-2-deploy   0/1   Completed   0   17h
django-psql-example6-2-v2wh2   1/1   Running   6   17h
django-psql-example7-1-build   0/1   Completed   0   17h
django-psql-example7-2-deploy   0/1   Completed   0   17h
django-psql-example7-2-l7prk   1/1   Running   6   17h
django-psql-example8-1-build   0/1   Completed   0   17h
django-psql-example8-2-deploy   0/1   Completed   0   17h
django-psql-example8-2-lfqnp   1/1   Running   6   17h
django-psql-example9-1-build   0/1   Completed   0   17h
django-psql-example9-2-6hrlg   1/1   Running   6   17h
django-psql-example9-2-deploy   0/1   Completed   0   17h
postgresql0-1-deploy   0/1   Completed   0   17h
postgresql0-1-nsfrs   1/1   Running   0   17h
postgresql1-1-deploy   0/1   Completed   0   17h
postgresql1-1-nkm4p   1/1   Running   0   17h
postgresql2-1-4g6cr   1/1   Running   0   17h
postgresql2-1-deploy   0/1   Completed   0   17h
postgresql3-1-deploy   0/1   Completed   0   17h
postgresql3-1-ws6g8   1/1   Running   0   17h
postgresql4-1-d78k4   1/1   Running   0   17h
postgresql4-1-deploy   0/1   Completed   0   17h
postgresql5-1-6sgcv   1/1   Running   0   17h
postgresql5-1-deploy   0/1   Completed   0   17h
postgresql6-1-deploy   0/1   Completed   0   17h
postgresql6-1-dllbs   1/1   Running   0   17h
postgresql7-1-deploy   0/1   Completed   0   17h
postgresql7-1-z4qsh   1/1   Running   0   17h
postgresql8-1-deploy   0/1   Completed   0   17h
postgresql8-1-jhqvh   1/1   Running   0   17h
postgresql9-1-deploy   0/1   Completed   0   17h
postgresql9-1-h8vpx   1/1   Running   0   17h

I was able to get these pods up and
running after trying:

# oc project migtest-pv-4
Now using project "migtest-pv-4" on server "https://api.rpattath-4-nfs-migration.perf-testing.devcluster.openshift.com:6443".
[root@rpattath mig-operator]# oc get dc
NAME   REVISION   DESIRED   CURRENT   TRIGGERED BY
django-psql-example0   2   1   0   config,image(django-psql-example0:latest)
django-psql-example1   2   1   0   config,image(django-psql-example1:latest)
django-psql-example2   2   1   0   config,image(django-psql-example2:latest)
django-psql-example3   2   1   0   config,image(django-psql-example3:latest)
django-psql-example4   2   1   0   config,image(django-psql-example4:latest)
django-psql-example5   2   1   0   config,image(django-psql-example5:latest)
django-psql-example6   2   1   0   config,image(django-psql-example6:latest)
django-psql-example7   2   1   0   config,image(django-psql-example7:latest)
django-psql-example8   2   1   0   config,image(django-psql-example8:latest)
django-psql-example9   2   1   0   config,image(django-psql-example9:latest)
postgresql0   1   1   1   config,image(postgresql:latest)
postgresql1   1   1   1   config,image(postgresql:latest)
postgresql2   1   1   1   config,image(postgresql:latest)
postgresql3   1   1   1   config,image(postgresql:latest)
postgresql4   1   1   1   config,image(postgresql:latest)
postgresql5   1   1   1   config,image(postgresql:latest)
postgresql6   1   1   1   config,image(postgresql:latest)
postgresql7   1   1   1   config,image(postgresql:latest)
postgresql8   1   1   1   config,image(postgresql:latest)
postgresql9   1   1   1   config,image(postgresql:latest)
[root@rpattath mig-operator]# oc rollout dc dc/django-psql-example0
error: unknown command "dc dc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout dc/django-psql-example0
error: unknown command "dc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout latestdc/django-psql-example0
error: unknown command "latestdc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout latest dc/django-psql-example0
deploymentconfig.apps.openshift.io/django-psql-example0 rolled out
[root@rpattath mig-operator]# oc get pods | grep django-psql-example0
django-psql-example0-1-build   0/1   Completed   0   17h
django-psql-example0-2-deploy   0/1   Error   0   17h
django-psql-example0-3-2b847   1/1   Running   0   38s
django-psql-example0-3-deploy   0/1   Completed   0   46s

Expected results:
All pods should be up and running and pvc bound.

Additional info:
Attaching events under one of the namespaces
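
For anyone triaging a namespace by hand, the deploymentconfigs that need a fresh rollout can be picked out of the pod listing by looking for deployer pods in Error state. The following is only a rough Python sketch, assuming the output of `oc get pods` has been captured as text; the function name `failed_deployers` is made up for illustration, and the sample rows are copied from the migtest-pv-4 listing in this report.

```python
def failed_deployers(listing: str) -> list[str]:
    """Return DC names whose deployer pod (<dc>-<revision>-deploy) is in Error."""
    failed = []
    for line in listing.splitlines():
        fields = line.split()
        # Expected columns: NAME READY STATUS RESTARTS AGE
        if len(fields) < 3 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[2]
        if name.endswith("-deploy") and status == "Error":
            # Drop the trailing "-<revision>-deploy" to recover the DC name.
            failed.append(name.rsplit("-", 2)[0])
    return failed


# Sample rows taken from the report; a real listing would come from `oc get pods`.
sample = """\
NAME READY STATUS RESTARTS AGE
django-psql-example0-1-build 0/1 Completed 0 17h
django-psql-example0-2-deploy 0/1 Error 0 17h
postgresql0-1-deploy 0/1 Completed 0 17h
postgresql0-1-zs52x 1/1 Running 0 17h
"""
print(failed_deployers(sample))
```

Each name returned would then be a candidate for `oc rollout latest dc/<name>`, the command that worked in the transcript above.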
Created attachment 1642082 [details] app template
Created attachment 1642083 [details] Fill the pvc's
Created attachment 1642084 [details] events under one of the projects that had failed pods
Created attachment 1642085 [details] migplan and migmigration
Now that I've reproduced it, here's what I'm seeing looking at one of the many failed deploymentconfigs.

Looking at logs for `django-psql-example0-2-deploy`, the message displayed is:
`error: update acceptor rejected django-psql-example0-2: pods for rc 'migtest-pv-3/django-psql-example0-2' took longer than 600 seconds to become available`

If I try to redeploy explicitly, nothing seems to happen. Looking at the deployment state:

$ oc rollout status deploymentconfig.apps.openshift.io/django-psql-example0
error: replication controller "django-psql-example0-2" has failed progressing

I deleted the replicationcontrollers for this pod and it deployed the application successfully. Then I deleted replicationcontrollers for the other 9 failing DCs and only one of them came up properly. Repeating the delete after this second failure seems to do the trick.

It looks like we may be dealing with some sort of scaling/timeout problem with OpenShift itself rather than the migration functionality, since I can force the failing applications into a working state by deleting the replicationcontrollers for the failing deploymentconfigs, although sometimes it takes more than one try to do so. Simply attempting to retry the rollout isn't sufficient.
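
The manual workaround described here (delete the replicationcontrollers, check, delete again if needed) amounts to a bounded retry loop. A minimal Python sketch of that loop follows. It is not a tested tool: `delete_rcs` and `is_available` are hypothetical callables the caller would supply (for instance thin wrappers around `oc`), and the attempt limit and wait time are arbitrary assumptions.

```python
import time
from typing import Callable


def retry_until_available(
    delete_rcs: Callable[[], None],
    is_available: Callable[[], bool],
    max_attempts: int = 3,
    wait_seconds: float = 0.0,
) -> int:
    """Delete the RCs and re-check until the application is available.

    Returns the number of deletions performed, or raises RuntimeError
    if the application is still unavailable after max_attempts deletions.
    """
    for attempt in range(1, max_attempts + 1):
        delete_rcs()
        time.sleep(wait_seconds)  # give the DC time to roll out again
        if is_available():
            return attempt
    raise RuntimeError(f"still unavailable after {max_attempts} attempts")


# Simulated DC that only comes up after the second deletion,
# mirroring the behaviour observed in this comment.
state = {"deletes": 0}


def fake_delete() -> None:
    state["deletes"] += 1


def fake_check() -> bool:
    return state["deletes"] >= 2


print(retry_until_available(fake_delete, fake_check))
```

Keeping the cluster calls behind injected callables means the retry logic can be exercised without a cluster; in real use `wait_seconds` would need to be long enough for the new rollout to finish.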
Fundamentally, this seems to be a cluster scalability issue. Creating hundreds of DeploymentConfigs all at once causes a large temporary increase in cluster load, so some pods take longer to become available than their dependent pods are willing to wait.

A short-term workaround is probably to migrate fewer namespaces at a time if each namespace has a large number of pods.

Longer-term, we may want to see if we can enhance the Velero restore processing to slow down a bit with large restores. If the number of deployments/deploymentconfigs/etc. is larger than some amount, we may want to introduce some deliberate pauses in the restores to give the cluster time to keep up. We'd have to be careful not to pause at the wrong time, though, as this could cause additional timeouts.

On the other hand, it may be that the real fix is not to mess with Velero restores but that the DeploymentConfig itself has the bug. Shouldn't the DC keep checking for the required pod until it's found rather than just giving up after 10 minutes?
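
To make the short-term workaround concrete, here is a rough Python sketch of migrating namespaces in small batches with a pause between batches. Everything in it is an assumption for illustration: `migrate` is a hypothetical callable standing in for whatever creates and runs a migration for a group of namespaces, the batch size and pause are arbitrary, and the namespace numbering 0-9 is guessed from the `migtest-pv-N` names seen in this report. It does not reflect how Velero or the migration controller schedules restores.

```python
import time
from typing import Callable, Iterator, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive groups of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def migrate_in_batches(
    namespaces: Sequence[str],
    migrate: Callable[[list[str]], None],
    batch_size: int = 2,
    pause_seconds: float = 0.0,
) -> int:
    """Run `migrate` once per batch, pausing between batches.

    Returns the number of batches processed.
    """
    count = 0
    for group in batches(namespaces, batch_size):
        if count:
            time.sleep(pause_seconds)  # let the cluster settle before the next batch
        migrate(group)
        count += 1
    return count


# Hypothetical namespace names; `seen.append` stands in for a real migration call.
namespaces = [f"migtest-pv-{i}" for i in range(10)]
seen: list[list[str]] = []
print(migrate_in_batches(namespaces, seen.append, batch_size=3))
```

The pause is deliberately placed between batches rather than inside one, which keeps it away from the window in which a deployer pod is already counting down its 600-second timeout.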
Aligning to next release to consider.
Closing as stale, please re-open if the issue persists.