Description of problem:

When there is a network problem while MTC is executing a DVM, it should retry 20 times by default and fail if it still does not succeed. The actual behavior is that the migration gets stuck forever after fewer than 20 retries.

Version-Release number of selected component (if applicable):

SOURCE CLUSTER: AWS 3.11, MTC 1.5.1
TARGET CLUSTER: AWS 4.9, MTC 1.6.0
REPLICATION REPOSITORY: AWS S3

How reproducible:

Always

Steps to Reproduce:

1. In the source cluster, deploy an application with a PVC:

```
oc new-project test-django
oc new-app django-psql-persistent
```

2. In this namespace, create a network policy blocking all egress traffic:

```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: denyall-test
spec:
  egress:
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny
```

3. Migrate this namespace using DVM.

Actual results:

The migration has network problems and gets stuck forever after fewer than 20 retries (the number of retries varies randomly, but is always less than 20).

The DVM resource reports 20 retries and a SourceToDestinationNetworkError error, but 20 retries do not actually happen: the number of rsync pods created is less than 20, and the number of retries reported in the UI is less than 20. The migration is stuck forever.

Expected results:

The migration should retry 20 times and, after 20 retries, report a SourceToDestinationNetworkError Critical condition in the DVM resource, and the migration should finish.

Additional info:

If we try to inspect the DVM resource in the UI debug screen, the UI goes blank and shows this error in the browser's console:

```
app.bundle.js:2 TypeError: Cannot read property 'length' of undefined
    at p (app.bundle.js:2)
    at app.bundle.js:2
    at t.default (app.bundle.js:2)
    at ci (app.bundle.js:2)
    at jr (app.bundle.js:2)
    at bs (app.bundle.js:2)
    at Ms (app.bundle.js:2)
    at Os (app.bundle.js:2)
    at hs (app.bundle.js:2)
    at app.bundle.js:2
```
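Since the UI debug screen goes blank, the DVM can also be inspected from the CLI while reproducing. A minimal sketch, assuming the DVM resources live in the openshift-migration namespace and that the source-side rsync pods contain "rsync" in their names (both assumptions, not confirmed by this report):

```sh
# Inspect the DirectVolumeMigration resource directly instead of the UI debug screen
oc -n openshift-migration get directvolumemigration -o yaml

# Count the rsync pods created in the migrated namespace to compare with the
# retry count the DVM status reports (pod naming is an assumption)
oc -n test-django get pods | grep -c rsync
```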
@sregidor I am unable to reproduce this issue on the current master branch of the controller:

```json
[
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "catalogue-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "user-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "orders-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "carts-data-volume-claim",
      "namespace": "sock-shop"
    }
  }
]
```

I can see that the DVM is attempting 20 retries as expected, and the MigMigration has the right error condition after failure. Is it possible that another underlying issue caused this behavior in your environment?
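For anyone checking the same thing on an affected cluster, a hedged way to pull the per-PVC retry entries shown above out of the DVM status from the CLI (the field names come from the JSON above; the exact parent path inside .status is not confirmed here, so a grep is used instead of a jsonpath expression):

```sh
# Dump the DVM status and extract the per-PVC retry entries
# (currentAttempt / failed / pvcReference, as shown in the JSON above)
oc -n openshift-migration get directvolumemigration -o yaml \
  | grep -B2 -A6 'currentAttempt'
```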
I don't know exactly what triggers the issue, but it happens consistently in my 3.11 -> 4.9 clusters. I have just double-checked in my new cluster. I can provide you with an environment where the issue happens consistently.
Verified using:

SOURCE CLUSTER: AWS OCP 3.11 (MTC 1.5.1), NFS
TARGET CLUSTER: AWS OCP 4.9 (MTC 1.6.0), OCS4

openshift-migration-rhel8-operator@sha256:ef00e934ed578a4acb429f8710284d10acf2cf98f38a2b2268bbea8b5fd7139c

```yaml
- name: MIG_CONTROLLER_REPO
  value: openshift-migration-controller-rhel8@sha256
- name: MIG_CONTROLLER_TAG
  value: 27f465b2cd38cee37af5c3d0fd745676086fe0391e3c459d4df18dd3a12e7051
- name: MIG_UI_REPO
  value: openshift-migration-ui-rhel8@sha256
- name: MIG_UI_TAG
```

The migration tried 20 times to run the rsync pod and then failed, as expected.

Moved to VERIFIED status.
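As a rough verification recipe, the retry count and the resulting error condition can also be checked from the CLI. This is a sketch under the assumptions that the rsync pods contain "rsync" in their names and that the MigMigration condition text includes SourceToDestinationNetworkError, as described in this report:

```sh
# Count the rsync attempt pods created in the migrated namespace
# (should reach 20 before the migration gives up; pod naming is an assumption)
oc -n test-django get pods | grep -c rsync

# Confirm the Critical network error condition is reported after the retries
oc -n openshift-migration get migmigration -o yaml \
  | grep -B3 -A3 'SourceToDestinationNetworkError'
```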
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3694