Bug 1997127
| Summary: | Direct volume migration "retry" feature does not work correctly after a network failure | ||
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | Sergio <sregidor> |
| Component: | General | Assignee: | Pranav Gaikwad <pgaikwad> |
| Status: | CLOSED ERRATA | QA Contact: | Xin jiang <xjiang> |
| Severity: | medium | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | medium | ||
| Version: | 1.6.0 | CC: | ernelson, prajoshi, rjohnson, ssingla, whu, xjiang |
| Target Milestone: | --- | ||
| Target Release: | 1.6.0 | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-29 14:35:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
@sregidor I am unable to reproduce this issue on the current master branch of the controller:
```json
[
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "catalogue-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "user-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "orders-data-volume-claim",
"namespace": "sock-shop"
}
},
{
"currentAttempt": 20,
"failed": true,
"pvcReference": {
"name": "carts-data-volume-claim",
"namespace": "sock-shop"
}
}
]
```
I can see that the DVM is attempting 20 retries as expected and that the MigMigration has the right error condition after the failure. Is it possible that another underlying issue caused this behavior in your environment?
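The bounded-retry behavior being tested here can be sketched as follows. This is a minimal illustration in Python (the actual controller is written in Go), and every name in it is hypothetical, not the controller's API; only the default of 20 attempts comes from the report above:

```python
# Minimal sketch of the expected DVM retry behavior: attempt rsync up to a
# fixed number of times, then report failure instead of hanging forever.
# All identifiers here are illustrative, not the real controller's API.
MAX_RETRIES = 20  # the DVM default discussed in this bug


def run_rsync_with_retries(attempt_rsync, max_retries=MAX_RETRIES):
    """Retry an rsync attempt up to max_retries times.

    Returns a status dict shaped like the rsync operation entries in the
    JSON above (currentAttempt / failed).
    """
    current_attempt = 0
    while current_attempt < max_retries:
        current_attempt += 1
        if attempt_rsync():
            # Success before exhausting the budget.
            return {"currentAttempt": current_attempt, "failed": False}
    # Budget exhausted: surface a terminal failure, never a hang.
    return {"currentAttempt": current_attempt, "failed": True}


# With a permanently blocked network every attempt fails, so the final
# status must show exactly 20 attempts and failed=true:
status = run_rsync_with_retries(lambda: False)
```

The key property, and the one this bug is about, is that the loop always terminates in one of two terminal states; a migration stuck after fewer than 20 attempts means neither branch was reached.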
I don't know exactly what triggers the issue, but it happens consistently in my 3.11 -> 4.9 cluster. I have just double-checked in my new cluster, and I can provide you with an environment where the issue occurs consistently. Verified using:
SOURCE CLUSTER: AWS OCP 3.11 (MTC 1.5.1) NFS
TARGET CLUSTER: AWS OCP 4.9 (MTC 1.6.0) OCS4
openshift-migration-rhel8-operator@sha256:ef00e934ed578a4acb429f8710284d10acf2cf98f38a2b2268bbea8b5fd7139c
```yaml
- name: MIG_CONTROLLER_REPO
  value: openshift-migration-controller-rhel8@sha256
- name: MIG_CONTROLLER_TAG
  value: 27f465b2cd38cee37af5c3d0fd745676086fe0391e3c459d4df18dd3a12e7051
- name: MIG_UI_REPO
  value: openshift-migration-ui-rhel8@sha256
- name: MIG_UI_TAG
```
The migration tried 20 times to run the rsync pod and then failed, as expected. Moved to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3694
Description of problem:
When there is a network problem while MTC is executing a DVM, it should retry 20 times by default and fail if it does not eventually succeed. The actual behavior is that the migration gets stuck forever after fewer than 20 retries.

Version-Release number of selected component (if applicable):
SOURCE CLUSTER: AWS 3.11 MTC 1.5.1
TARGET CLUSTER: AWS 4.9 MTC 1.6.0
REPLICATION REPOSITORY: AWS S3

How reproducible:
Always

Steps to Reproduce:
1. In the source cluster, deploy an application with a PVC:
```
oc new-project test-django
oc new-app django-psql-persistent
```
2. In this namespace, create a network policy blocking all egress traffic:
```yaml
apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: denyall-test
spec:
  egress:
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny
```
3. Migrate this namespace using DVM.

Actual results:
The migration hits network problems and is stuck forever after fewer than 20 retries (randomly, always a different number of retries, always fewer than 20). The DVM resource reports 20 retries and a SourceToDestinationNetworkError error, but there are not actually 20 retries: the number of rsync pods created is fewer than 20, and the number of retries reported in the UI is also fewer than 20. The migration is stuck forever.

Expected results:
The migration should retry 20 times; after 20 retries it should report a SourceToDestinationNetworkError Critical condition in the DVM resource, and the migration should finish.

Additional info:
If we try to inspect the DVM resource on the UI debug screen, the UI goes blank, showing this error in the browser's console:
```
app.bundle.js:2 TypeError: Cannot read property 'length' of undefined
    at p (app.bundle.js:2)
    at app.bundle.js:2
    at t.default (app.bundle.js:2)
    at ci (app.bundle.js:2)
    at jr (app.bundle.js:2)
    at bs (app.bundle.js:2)
    at Ms (app.bundle.js:2)
    at Os (app.bundle.js:2)
    at hs (app.bundle.js:2)
    at app.bundle.js:2
```
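The core symptom reported here is a mismatch: the DVM status claims 20 attempts while fewer rsync pods were actually created. A small script can flag that inconsistency. This is a hypothetical sketch, assuming you have already collected the DVM's rsync operation list (the JSON shape shown earlier in this bug) and a per-PVC count of rsync pods you actually observed; the function name and inputs are illustrative, not part of MTC:

```python
# Hypothetical consistency check for the symptom in this bug: the DVM
# status reports N attempts for a PVC, but fewer than N rsync pods were
# ever created for it. Inputs are assumed to be gathered separately.
def find_inconsistent_operations(rsync_operations, pods_created_per_pvc):
    """Return PVC names whose reported attempt count exceeds the number
    of rsync pods actually observed for that PVC.

    rsync_operations: list of dicts shaped like the DVM status entries,
        e.g. {"currentAttempt": 20, "failed": True,
              "pvcReference": {"name": ..., "namespace": ...}}
    pods_created_per_pvc: dict mapping PVC name -> observed pod count.
    """
    inconsistent = []
    for op in rsync_operations:
        name = op["pvcReference"]["name"]
        reported = op["currentAttempt"]
        observed = pods_created_per_pvc.get(name, 0)
        if reported > observed:
            inconsistent.append(name)
    return inconsistent
```

On a healthy cluster this returns an empty list; in the broken scenario described above it would flag each stuck PVC, since the status reports 20 attempts while fewer pods exist.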