Bug 1997127
| Summary: | Direct volume migration "retry" feature does not work correctly after a network failure | ||
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | Sergio <sregidor> |
| Component: | General | Assignee: | Pranav Gaikwad <pgaikwad> |
| Status: | CLOSED ERRATA | QA Contact: | Xin jiang <xjiang> |
| Severity: | medium | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | medium | ||
| Version: | 1.6.0 | CC: | ernelson, prajoshi, rjohnson, ssingla, whu, xjiang |
| Target Milestone: | --- | ||
| Target Release: | 1.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-29 14:35:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
@sregidor I am unable to reproduce this issue on the current master branch of the controller:
```json
[
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "catalogue-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "user-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "orders-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "carts-data-volume-claim",
"namespace": "sock-shop"
}
}
]
```
I can see that the DVM is attempting 20 retries as expected and that the MigMigration has the right error condition after the failure. Is it possible that another underlying issue caused this behavior in your environment?
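The bounded-retry behavior being tested here can be sketched as follows. This is a minimal illustration in Python (the actual controller is written in Go), and every name in it is hypothetical, not the controller's API; only the default of 20 attempts comes from the report above:

```python
# Minimal sketch of the expected DVM retry behavior: attempt rsync up to a
# fixed number of times, then report failure instead of hanging forever.
# All identifiers here are illustrative, not the real controller's API.
MAX_RETRIES = 20  # the DVM default discussed in this bug


def run_rsync_with_retries(attempt_rsync, max_retries=MAX_RETRIES):
    """Retry an rsync attempt up to max_retries times.

    Returns a status dict shaped like the rsync operation entries in the
    JSON above (currentAttempt / failed).
    """
    current_attempt = 0
    while current_attempt < max_retries:
        current_attempt += 1
        if attempt_rsync():
            # Success before exhausting the budget.
            return {"currentAttempt": current_attempt, "failed": False}
    # Budget exhausted: surface a terminal failure, never a hang.
    return {"currentAttempt": current_attempt, "failed": True}


# With a permanently blocked network every attempt fails, so the final
# status must show exactly 20 attempts and failed=true:
status = run_rsync_with_retries(lambda: False)
```

The key property, and the one this bug is about, is that the loop always terminates in one of two terminal states; a migration stuck after fewer than 20 attempts means neither branch was reached.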
I don't know exactly what triggers the issue, but it happens consistently in my 3.11 -> 4.9 cluster. I have just double-checked in my new cluster, and I can provide you with an environment where the issue occurs consistently. Verified using:
SOURCE CLUSTER: AWS OCP 3.11 (MTC 1.5.1) NFS
TARGET CLUSTER: AWS OCP 4.9 (MTC 1.6.0) OCS4
openshift-migration-rhel8-operator@sha256:ef00e934ed578a4acb429f8710284d10acf2cf98f38a2b2268bbea8b5fd7139c
```yaml
- name: MIG_CONTROLLER_REPO
  value: openshift-migration-controller-rhel8@sha256
- name: MIG_CONTROLLER_TAG
  value: 27f465b2cd38cee37af5c3d0fd745676086fe0391e3c459d4df18dd3a12e7051
- name: MIG_UI_REPO
  value: openshift-migration-ui-rhel8@sha256
- name: MIG_UI_TAG
```
The migration tried 20 times to run the rsync pod and then failed, as expected. Moved to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3694
Description of problem:
When there is a network problem while MTC is executing a DVM, it should retry 20 times by default and fail if it does not eventually succeed. The actual behavior is that the migration gets stuck forever after fewer than 20 retries.

Version-Release number of selected component (if applicable):
SOURCE CLUSTER: AWS 3.11 MTC 1.5.1
TARGET CLUSTER: AWS 4.9 MTC 1.6.0
REPLICATION REPOSITORY: AWS S3

How reproducible:
Always

Steps to Reproduce:
1. In the source cluster, deploy an application with a PVC:
```
oc new-project test-django
oc new-app django-psql-persistent
```
2. In this namespace, create a network policy blocking all egress traffic:
```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: denyall-test
spec:
  egress:
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny
```
3. Migrate this namespace using DVM.

Actual results:
The migration hits network problems and is stuck forever after fewer than 20 retries (randomly, always a different number of retries, always fewer than 20). The DVM resource reports 20 retries and a SourceToDestinationNetworkError error, but there are not actually 20 retries: the number of rsync pods created is fewer than 20, and the number of retries reported in the UI is also fewer than 20. The migration is stuck forever.

Expected results:
The migration should retry 20 times; after 20 retries it should report a SourceToDestinationNetworkError Critical condition in the DVM resource, and the migration should finish.

Additional info:
If we try to inspect the DVM resource on the UI debug screen, the UI goes blank, showing this error in the browser's console:
```
app.bundle.js:2 TypeError: Cannot read property 'length' of undefined
    at p (app.bundle.js:2)
    at app.bundle.js:2
    at t.default (app.bundle.js:2)
    at ci (app.bundle.js:2)
    at jr (app.bundle.js:2)
    at bs (app.bundle.js:2)
    at Ms (app.bundle.js:2)
    at Os (app.bundle.js:2)
    at hs (app.bundle.js:2)
    at app.bundle.js:2
```
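The core symptom reported here is a mismatch: the DVM status claims 20 attempts while fewer rsync pods were actually created. A small script can flag that inconsistency. This is a hypothetical sketch, assuming you have already collected the DVM's rsync operation list (the JSON shape shown earlier in this bug) and a per-PVC count of rsync pods you actually observed; the function name and inputs are illustrative, not part of MTC:

```python
# Hypothetical consistency check for the symptom in this bug: the DVM
# status reports N attempts for a PVC, but fewer than N rsync pods were
# ever created for it. Inputs are assumed to be gathered separately.
def find_inconsistent_operations(rsync_operations, pods_created_per_pvc):
    """Return PVC names whose reported attempt count exceeds the number
    of rsync pods actually observed for that PVC.

    rsync_operations: list of dicts shaped like the DVM status entries,
        e.g. {"currentAttempt": 20, "failed": True,
              "pvcReference": {"name": ..., "namespace": ...}}
    pods_created_per_pvc: dict mapping PVC name -> observed pod count.
    """
    inconsistent = []
    for op in rsync_operations:
        name = op["pvcReference"]["name"]
        reported = op["currentAttempt"]
        observed = pods_created_per_pvc.get(name, 0)
        if reported > observed:
            inconsistent.append(name)
    return inconsistent
```

On a healthy cluster this returns an empty list; in the broken scenario described above it would flag each stuck PVC, since the status reports 20 attempts while fewer pods exist.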