Bug 1997127 - Direct volume migration "retry" feature does not work correctly after a network failure
Status: CLOSED ERRATA
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.6.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 1.6.0
Assignee: Pranav Gaikwad
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
 
Reported: 2021-08-24 13:17 UTC by Sergio
Modified: 2021-09-29 14:35 UTC
CC List: 6 users

Last Closed: 2021-09-29 14:35:20 UTC


Links:
- GitHub: konveyor/mig-controller pull 1182 (last updated 2021-08-30 18:50:56 UTC)
- Red Hat Product Errata: RHSA-2021:3694 (last updated 2021-09-29 14:35:31 UTC)

Description Sergio 2021-08-24 13:17:45 UTC
Description of problem:
When a network problem occurs while MTC is executing a DVM, it should retry 20 times by default and fail if it still does not succeed. The actual behavior is that, after fewer than 20 retries, the migration gets stuck forever.


Version-Release number of selected component (if applicable):
SOURCE CLUSTER: AWS 3.11 MTC 1.5.1
TARGET CLUSTER: AWS 4.9 MTC 1.6.0
REPLICATION REPOSITORY: AWS S3

How reproducible:
Always

Steps to Reproduce:
1. In source cluster, deploy an application with a PVC

oc new-project test-django
oc new-app django-psql-persistent

2. In this namespace, create an EgressNetworkPolicy that blocks all egress traffic, then apply it as shown after the manifest

apiVersion: network.openshift.io/v1
kind: EgressNetworkPolicy
metadata:
  name: denyall-test
spec:
  egress:
  - to:
      cidrSelector: 0.0.0.0/0
    type: Deny

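To apply the policy, a minimal sketch, assuming the manifest above is saved as denyall-test.yaml (the file name is only an example):

oc apply -f denyall-test.yaml -n test-django

# Confirm the policy exists before starting the migration
oc get egressnetworkpolicy denyall-test -n test-django
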
3. Migrate this namespace using DVM

Actual results:
The migration hits network problems and gets stuck forever after fewer than 20 retries (the exact number of retries varies from run to run, but is always less than 20).

The DVM resource reports 20 retries and a SourceToDestinationNetworkError error, but 20 retries never actually happen: fewer than 20 rsync pods are created, the UI reports fewer than 20 retries, and the migration stays stuck forever.

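For reference, the actual retry count can be cross-checked by counting the rsync pods created in the migrated namespace on the source cluster. A rough sketch, assuming the rsync client pod names contain the string "rsync":

# Source cluster: list the rsync pods created for this namespace, oldest first
oc get pods -n test-django --sort-by=.metadata.creationTimestamp | grep -i rsync

# Count them (fewer than 20 when the bug reproduces)
oc get pods -n test-django --no-headers | grep -ic rsync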

Expected results:
The migration should retry 20 times; after the 20th retry it should report a SourceToDestinationNetworkError Critical condition in the DVM resource, and the migration should finish.

Additional info:
If we try to inspect the DVM resource on the UI debug screen, the UI goes blank and shows this error in the browser's console:

app.bundle.js:2 TypeError: Cannot read property 'length' of undefined
    at p (app.bundle.js:2)
    at app.bundle.js:2
    at t.default (app.bundle.js:2)
    at ci (app.bundle.js:2)
    at jr (app.bundle.js:2)
    at bs (app.bundle.js:2)
    at Ms (app.bundle.js:2)
    at Os (app.bundle.js:2)
    at hs (app.bundle.js:2)
    at app.bundle.js:2

Comment 1 Pranav Gaikwad 2021-08-26 17:40:43 UTC
@sregidor I am unable to reproduce this issue on the current master branch of the controller: 

```json
[
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "catalogue-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "user-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "orders-data-volume-claim",
      "namespace": "sock-shop"
    }
  },
  {
    "currentAttempt": 20,
    "failed": true,
    "pvcReference": {
      "name": "carts-data-volume-claim",
      "namespace": "sock-shop"
    }
  }
]
``` 

I can see that the DVM is attempting 20 retries as expected and the MigMigration has the right error condition post failure. Is it possible that another underlying issue caused this behavior in your environment?
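For reference, the per-PVC attempt records above come from the DVM status, in the openshift-migration namespace on the cluster running the MTC controller. A minimal sketch for pulling them, assuming jq is available and the records are exposed under a status field named rsyncOperations (the field name is an assumption based on the JSON shown above and may differ between MTC versions):

```bash
# Dump the per-PVC rsync attempt records from every DVM in the migration namespace;
# ".rsyncOperations" is an assumed field name inferred from the JSON above
oc get directvolumemigration -n openshift-migration -o json \
  | jq '.items[].status.rsyncOperations'
```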

Comment 2 Sergio 2021-08-30 11:47:05 UTC
I don't know exactly what triggers the issue, but in my 3.11 -> 4.9 clusters it happens consistently. I have just double-checked in my new cluster.

I can provide you with an environment where the issue happens consistently.

Comment 6 Sergio 2021-09-08 11:39:48 UTC
Verified using:
SOURCE CLUSTER: AWS OCP 3.11 (MTC 1.5.1) NFS
TARGET CLUSTER: AWS OCP 4.9 (MTC 1.6.0) OCS4

openshift-migration-rhel8-operator@sha256:ef00e934ed578a4acb429f8710284d10acf2cf98f38a2b2268bbea8b5fd7139c
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 27f465b2cd38cee37af5c3d0fd745676086fe0391e3c459d4df18dd3a12e7051
    - name: MIG_UI_REPO
      value: openshift-migration-ui-rhel8@sha256
    - name: MIG_UI_TAG


The migration tried 20 times to run the rsync pod and then failed, as expected.

Moved to VERIFIED status.

Comment 8 errata-xmlrpc 2021-09-29 14:35:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3694

