Bug 1779692 - Pods fail due to delay in NFS PVC binding after migration
Summary: Pods fail due to delay in NFS PVC binding after migration
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.3.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 1.5.0
Assignee: Scott Seago
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-12-04 14:16 UTC by Roshni
Modified: 2021-04-07 20:58 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-04-07 20:58:16 UTC
Target Upstream Version:
Embargoed:


Attachments
script to run with clusterloader under https://github.com/openshift/svt (294 bytes, text/plain)
2019-12-04 14:16 UTC, Roshni
app template (16.62 KB, text/plain)
2019-12-04 14:16 UTC, Roshni
Fill the pvc's (572 bytes, application/x-shellscript)
2019-12-04 14:17 UTC, Roshni
events under one of the projects that had failed pods (121.68 KB, text/plain)
2019-12-04 14:19 UTC, Roshni
migplan and migmigration (48.22 KB, text/plain)
2019-12-04 14:21 UTC, Roshni

Description Roshni 2019-12-04 14:16:09 UTC
Created attachment 1642081
script to run with clusterloader under https://github.com/openshift/svt

Description of problem:
Pods fail due to a delay in NFS PVC binding after migration.

Version-Release number of selected component (if applicable):
# oc describe pod/migration-operator-56cbd64c4b-bwvfh -n openshift-migration | grep Image:
                    containerImage:
    Image:         image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-rhel7-operator@sha256:1a410c94063401e4a4d16e73a8c1efb1f9255c81e922e17ba6ae88860169bf60
    Image:          image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-rhel7-operator@sha256:1a410c94063401e4a4d16e73a8c1efb1f9255c81e922e17ba6ae88860169bf60

# oc describe pod/migration-controller-77459d677d-jnxhr -n openshift-migration | grep Image:
    Image:         image-registry.openshift-image-registry.svc:5000/rhcam-1-0/openshift-migration-controller-rhel8@sha256:e2c3cbb61157605d8246496f77c76b9b2950eb951bd0a63d4f8e3ae6f1884c2c

How reproducible:
always

Steps to Reproduce:
1. Create 10 namespaces, each with more than 10 pods that use NFS PVCs. The scripts and templates used are attached.
2. Fill each PVC with data up to its maximum capacity. The script used is attached.
3. Create the test plan and start the migration.
4. The migration completes successfully.

Actual results:
A few pods under the first 5-6 namespaces fail.

# oc get pods -n migtest-pv-4 
NAME                            READY   STATUS      RESTARTS   AGE
django-psql-example0-1-build    0/1     Completed   0          17h
django-psql-example0-2-deploy   0/1     Error       0          17h
django-psql-example1-1-build    0/1     Completed   0          17h
django-psql-example1-2-deploy   0/1     Error       0          17h
django-psql-example2-1-build    0/1     Completed   0          17h
django-psql-example2-2-deploy   0/1     Error       0          17h
django-psql-example3-1-build    0/1     Completed   0          17h
django-psql-example3-2-deploy   0/1     Error       0          17h
django-psql-example4-1-build    0/1     Completed   0          17h
django-psql-example4-2-deploy   0/1     Error       0          17h
django-psql-example5-1-build    0/1     Completed   0          17h
django-psql-example5-2-deploy   0/1     Error       0          17h
django-psql-example6-1-build    0/1     Completed   0          17h
django-psql-example6-2-deploy   0/1     Error       0          17h
django-psql-example7-1-build    0/1     Completed   0          17h
django-psql-example7-2-deploy   0/1     Error       0          17h
django-psql-example8-1-build    0/1     Completed   0          17h
django-psql-example8-2-deploy   0/1     Error       0          17h
django-psql-example9-1-build    0/1     Completed   0          17h
django-psql-example9-2-deploy   0/1     Error       0          17h
postgresql0-1-deploy            0/1     Completed   0          17h
postgresql0-1-zs52x             1/1     Running     0          17h
postgresql1-1-deploy            0/1     Completed   0          17h
postgresql1-1-r7fsz             1/1     Running     0          17h
postgresql2-1-deploy            0/1     Completed   0          17h
postgresql2-1-tgkqt             1/1     Running     0          17h
postgresql3-1-5j7n8             1/1     Running     0          17h
postgresql3-1-deploy            0/1     Completed   0          17h
postgresql4-1-deploy            0/1     Completed   0          17h
postgresql4-1-qmqbf             1/1     Running     0          17h
postgresql5-1-55vzm             1/1     Running     0          17h
postgresql5-1-deploy            0/1     Completed   0          17h
postgresql6-1-deploy            0/1     Completed   0          17h
postgresql6-1-n7szt             1/1     Running     0          17h
postgresql7-1-deploy            0/1     Completed   0          17h
postgresql7-1-qt65w             1/1     Running     0          17h
postgresql8-1-deploy            0/1     Completed   0          17h
postgresql8-1-tz69t             1/1     Running     0          17h
postgresql9-1-deploy            0/1     Completed   0          17h
postgresql9-1-jhwqd             1/1     Running     0          17h
[root@rpattath mig-operator]# oc get pods -n migtest-pv-5
NAME                            READY   STATUS      RESTARTS   AGE
django-psql-example0-1-build    0/1     Completed   0          17h
django-psql-example0-2-deploy   0/1     Error       0          17h
django-psql-example1-1-build    0/1     Completed   0          17h
django-psql-example1-2-deploy   0/1     Error       0          17h
django-psql-example2-1-build    0/1     Completed   0          17h
django-psql-example2-2-deploy   0/1     Error       0          17h
django-psql-example3-1-build    0/1     Completed   0          17h
django-psql-example3-2-deploy   0/1     Error       0          17h
django-psql-example4-1-build    0/1     Completed   0          17h
django-psql-example4-2-deploy   0/1     Error       0          17h
django-psql-example5-1-build    0/1     Completed   0          17h
django-psql-example5-2-deploy   0/1     Error       0          17h
django-psql-example6-1-build    0/1     Completed   0          17h
django-psql-example6-2-deploy   0/1     Error       0          17h
django-psql-example7-1-build    0/1     Completed   0          17h
django-psql-example7-2-deploy   0/1     Error       0          17h
django-psql-example8-1-build    0/1     Completed   0          17h
django-psql-example8-2-deploy   0/1     Error       0          17h
django-psql-example9-1-build    0/1     Completed   0          17h
django-psql-example9-2-deploy   0/1     Error       0          17h
postgresql0-1-deploy            0/1     Completed   0          17h
postgresql0-1-rtldf             1/1     Running     0          17h
postgresql1-1-deploy            0/1     Completed   0          17h
postgresql1-1-m97gr             1/1     Running     0          17h
postgresql2-1-deploy            0/1     Completed   0          17h
postgresql2-1-ztr47             1/1     Running     0          17h
postgresql3-1-4b8hm             1/1     Running     0          17h
postgresql3-1-deploy            0/1     Completed   0          17h
postgresql4-1-deploy            0/1     Completed   0          17h
postgresql4-1-s2l74             1/1     Running     0          17h
postgresql5-1-2tzxl             1/1     Running     0          17h
postgresql5-1-deploy            0/1     Completed   0          17h
postgresql6-1-57q2z             1/1     Running     0          17h
postgresql6-1-deploy            0/1     Completed   0          17h
postgresql7-1-deploy            0/1     Completed   0          17h
postgresql7-1-lvpgh             1/1     Running     0          17h
postgresql8-1-deploy            0/1     Completed   0          17h
postgresql8-1-qktpw             1/1     Running     0          17h
postgresql9-1-deploy            0/1     Completed   0          17h
postgresql9-1-g4z5l             1/1     Running     0          17h
[root@rpattath mig-operator]# oc get pods -n migtest-pv-6
NAME                            READY   STATUS      RESTARTS   AGE
django-psql-example0-1-build    0/1     Completed   0          17h
django-psql-example0-2-7w2gp    1/1     Running     6          17h
django-psql-example0-2-deploy   0/1     Completed   0          17h
django-psql-example1-1-build    0/1     Completed   0          17h
django-psql-example1-2-deploy   0/1     Error       0          17h
django-psql-example2-1-build    0/1     Completed   0          17h
django-psql-example2-2-4vm6n    1/1     Running     6          17h
django-psql-example2-2-deploy   0/1     Completed   0          17h
django-psql-example3-1-build    0/1     Completed   0          17h
django-psql-example3-2-deploy   0/1     Completed   0          17h
django-psql-example3-2-rvgpf    1/1     Running     6          17h
django-psql-example4-1-build    0/1     Completed   0          17h
django-psql-example4-2-deploy   0/1     Completed   0          17h
django-psql-example4-2-t7h9j    1/1     Running     6          17h
django-psql-example5-1-build    0/1     Completed   0          17h
django-psql-example5-2-deploy   0/1     Completed   0          17h
django-psql-example5-2-q28jj    1/1     Running     6          17h
django-psql-example6-1-build    0/1     Completed   0          17h
django-psql-example6-2-deploy   0/1     Completed   0          17h
django-psql-example6-2-v2wh2    1/1     Running     6          17h
django-psql-example7-1-build    0/1     Completed   0          17h
django-psql-example7-2-deploy   0/1     Completed   0          17h
django-psql-example7-2-l7prk    1/1     Running     6          17h
django-psql-example8-1-build    0/1     Completed   0          17h
django-psql-example8-2-deploy   0/1     Completed   0          17h
django-psql-example8-2-lfqnp    1/1     Running     6          17h
django-psql-example9-1-build    0/1     Completed   0          17h
django-psql-example9-2-6hrlg    1/1     Running     6          17h
django-psql-example9-2-deploy   0/1     Completed   0          17h
postgresql0-1-deploy            0/1     Completed   0          17h
postgresql0-1-nsfrs             1/1     Running     0          17h
postgresql1-1-deploy            0/1     Completed   0          17h
postgresql1-1-nkm4p             1/1     Running     0          17h
postgresql2-1-4g6cr             1/1     Running     0          17h
postgresql2-1-deploy            0/1     Completed   0          17h
postgresql3-1-deploy            0/1     Completed   0          17h
postgresql3-1-ws6g8             1/1     Running     0          17h
postgresql4-1-d78k4             1/1     Running     0          17h
postgresql4-1-deploy            0/1     Completed   0          17h
postgresql5-1-6sgcv             1/1     Running     0          17h
postgresql5-1-deploy            0/1     Completed   0          17h
postgresql6-1-deploy            0/1     Completed   0          17h
postgresql6-1-dllbs             1/1     Running     0          17h
postgresql7-1-deploy            0/1     Completed   0          17h
postgresql7-1-z4qsh             1/1     Running     0          17h
postgresql8-1-deploy            0/1     Completed   0          17h
postgresql8-1-jhqvh             1/1     Running     0          17h
postgresql9-1-deploy            0/1     Completed   0          17h
postgresql9-1-h8vpx             1/1     Running     0          17h

I was able to get these pods up and running by triggering a new rollout (oc rollout latest dc/<name>), after a few attempts at the right syntax:

# oc project migtest-pv-4
Now using project "migtest-pv-4" on server "https://api.rpattath-4-nfs-migration.perf-testing.devcluster.openshift.com:6443".
[root@rpattath mig-operator]# oc get dc
NAME                   REVISION   DESIRED   CURRENT   TRIGGERED BY
django-psql-example0   2          1         0         config,image(django-psql-example0:latest)
django-psql-example1   2          1         0         config,image(django-psql-example1:latest)
django-psql-example2   2          1         0         config,image(django-psql-example2:latest)
django-psql-example3   2          1         0         config,image(django-psql-example3:latest)
django-psql-example4   2          1         0         config,image(django-psql-example4:latest)
django-psql-example5   2          1         0         config,image(django-psql-example5:latest)
django-psql-example6   2          1         0         config,image(django-psql-example6:latest)
django-psql-example7   2          1         0         config,image(django-psql-example7:latest)
django-psql-example8   2          1         0         config,image(django-psql-example8:latest)
django-psql-example9   2          1         0         config,image(django-psql-example9:latest)
postgresql0            1          1         1         config,image(postgresql:latest)
postgresql1            1          1         1         config,image(postgresql:latest)
postgresql2            1          1         1         config,image(postgresql:latest)
postgresql3            1          1         1         config,image(postgresql:latest)
postgresql4            1          1         1         config,image(postgresql:latest)
postgresql5            1          1         1         config,image(postgresql:latest)
postgresql6            1          1         1         config,image(postgresql:latest)
postgresql7            1          1         1         config,image(postgresql:latest)
postgresql8            1          1         1         config,image(postgresql:latest)
postgresql9            1          1         1         config,image(postgresql:latest)
[root@rpattath mig-operator]# oc rollout dc dc/django-psql-example0
error: unknown command "dc dc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout dc/django-psql-example0
error: unknown command "dc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout latestdc/django-psql-example0
error: unknown command "latestdc/django-psql-example0"
See 'oc rollout -h' for help and examples
[root@rpattath mig-operator]# oc rollout latest dc/django-psql-example0
deploymentconfig.apps.openshift.io/django-psql-example0 rolled out
[root@rpattath mig-operator]# oc get pods | grep django-psql-example0
django-psql-example0-1-build    0/1     Completed   0          17h
django-psql-example0-2-deploy   0/1     Error       0          17h
django-psql-example0-3-2b847    1/1     Running     0          38s
django-psql-example0-3-deploy   0/1     Completed   0          46s
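
The manual retry above could be scripted for every stalled DC in a namespace. This is a minimal sketch, not a tested fix: it assumes the `oc get dc` column order shown above (NAME, REVISION, DESIRED, CURRENT), and the helper names `stalled_dcs` and `retry_stalled_dcs` are hypothetical.

```shell
# Print the names of DeploymentConfigs whose CURRENT count is below DESIRED.
# Reads `oc get dc` table output on stdin; assumes the column order
# NAME REVISION DESIRED CURRENT seen in this report.
stalled_dcs() {
  awk 'NR > 1 && $4 < $3 { print $1 }'
}

# Hypothetical wrapper: trigger a new rollout for each stalled DC in namespace $1.
retry_stalled_dcs() {
  ns="$1"
  oc get dc -n "$ns" | stalled_dcs | while read -r dc; do
    # </dev/null keeps oc from consuming the list the loop is reading.
    oc rollout latest "dc/$dc" -n "$ns" </dev/null
  done
}
```

For example, `retry_stalled_dcs migtest-pv-4` would retry the ten django-psql-example DCs listed above and leave the postgresql DCs alone.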


Expected results:
All pods should be up and running, with their PVCs bound.

Additional info:
Attaching the events from one of the namespaces.

Comment 1 Roshni 2019-12-04 14:16:48 UTC
Created attachment 1642082
app template

Comment 2 Roshni 2019-12-04 14:17:13 UTC
Created attachment 1642083
Fill the pvc's

Comment 3 Roshni 2019-12-04 14:19:04 UTC
Created attachment 1642084
events under one of the projects that had failed pods

Comment 4 Roshni 2019-12-04 14:21:57 UTC
Created attachment 1642085
migplan and migmigration

Comment 5 Scott Seago 2020-01-17 23:24:50 UTC
Now that I've reproduced it, here's what I'm seeing looking at one of the many failed deploymentconfigs. Looking at logs for `django-psql-example0-2-deploy`, the message displayed is:
`error: update acceptor rejected django-psql-example0-2: pods for rc 'migtest-pv-3/django-psql-example0-2' took longer than 600 seconds to become available`

If I try to redeploy explicitly, nothing seems to happen. Looking at the deployment state:

$ oc rollout status deploymentconfig.apps.openshift.io/django-psql-example0
error: replication controller "django-psql-example0-2" has failed progressing

I deleted the replicationcontrollers for this deploymentconfig, and the application then deployed successfully.

Then I deleted replicationcontrollers for the other 9 failing DCs and only one of them came up properly.

Repeating the delete after this second failure seems to do the trick.

It looks like we may be dealing with a scaling/timeout problem in OpenShift itself rather than in the migration functionality. I can force the failing applications into a working state by deleting the replicationcontrollers for the failing deploymentconfigs, although it sometimes takes more than one try. Simply retrying the rollout isn't sufficient.
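
The deletions described above could be sketched as follows. The comment does not say exactly which replicationcontrollers were deleted, so this sketch assumes only the latest one per failing DC, named `<dc-name>-<revision>` as in the error message above. The helper names `stalled_rcs` and `delete_stalled_rcs` are hypothetical, and the wrapper may need to be run more than once.

```shell
# Read `oc get dc` table output on stdin and print the replication controller
# for the latest revision of each DC with fewer CURRENT than DESIRED pods.
# Assumes the NAME REVISION DESIRED CURRENT column order and the
# <dc-name>-<revision> naming seen in this report.
stalled_rcs() {
  awk 'NR > 1 && $4 < $3 { print "rc/" $1 "-" $2 }'
}

# Hypothetical wrapper: delete those replication controllers in namespace $1.
delete_stalled_rcs() {
  ns="$1"
  oc get dc -n "$ns" | stalled_rcs | while read -r rc; do
    # </dev/null keeps oc from consuming the list the loop is reading.
    oc delete "$rc" -n "$ns" </dev/null
  done
}
```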

Comment 6 Scott Seago 2020-04-01 16:50:55 UTC
Fundamentally, this seems to be a cluster scalability issue. Creating hundreds of DeploymentConfigs all at once causes a large temporary increase in cluster load, so some pods take longer to become available than the pods that depend on them are willing to wait. A short-term workaround is probably to migrate fewer namespaces at a time if each namespace has a large number of pods.
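
The short-term workaround amounts to splitting the namespace list into smaller batches and migrating one batch per plan. A minimal sketch of the splitting step only; the helper name `batch_namespaces` is hypothetical and the batch size is an arbitrary choice, not a tested recommendation.

```shell
# Group namespace names (one per line on stdin) into batches of at most $1,
# printing one space-separated batch per line. With no command given,
# xargs runs echo on each group.
batch_namespaces() {
  xargs -n "$1"
}

# Example (namespace names are illustrative): ten namespaces in batches of three.
# printf 'migtest-pv-%s\n' 1 2 3 4 5 6 7 8 9 10 | batch_namespaces 3
```

Each output line would then become the namespace list of its own migration plan, run one after another rather than all at once.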

Longer-term, we may want to see if we can enhance the Velero restore processing to slow down for large restores. If the number of deployments, deploymentconfigs, etc. exceeds some threshold, we may want to introduce deliberate pauses in the restore to give the cluster time to keep up. We'd have to be careful not to pause at the wrong time, though, as this could cause additional timeouts.

On the other hand, it may be that the real fix is not to change the Velero restore at all, and that the bug is in the DeploymentConfig itself. Shouldn't the DC keep checking for the required pod until it's found rather than just giving up after 10 minutes?
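
The 10-minute limit matches the default `timeoutSeconds` of 600 in the DeploymentConfig strategy parameters, so as a stopgap the timeout could be raised per DC. A minimal sketch under assumptions: this report does not say which strategy the django-psql-example template uses, so the params field (`rollingParams` or `recreateParams`) is passed in, and 1800 seconds is an arbitrary value. `timeout_patch` is a hypothetical helper.

```shell
# Print a JSON merge patch that sets the rollout timeout of a DeploymentConfig.
# $1 is the strategy params field (rollingParams or recreateParams),
# $2 is the timeout in seconds.
timeout_patch() {
  printf '{"spec":{"strategy":{"%s":{"timeoutSeconds":%d}}}}\n' "$1" "$2"
}

# Example (not run here): allow 30 minutes instead of the default 600 seconds.
# oc patch dc/django-psql-example0 -n migtest-pv-4 \
#   --type=merge -p "$(timeout_patch rollingParams 1800)"
```

This only lengthens the wait; it does not make the DC retry indefinitely as the comment suggests it should.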

Comment 7 John Matthews 2020-04-14 19:34:31 UTC
Aligning to next release to consider.

Comment 8 Erik Nelson 2021-04-07 20:58:16 UTC
Closing as stale. Please reopen if the issue persists.

