Bug 1925065 - Direct Volume Migration stuck when Rsync pods fail because of OutOfCpu error
Summary: Direct Volume Migration stuck when Rsync pods fail because of OutOfCpu error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Migration Toolkit for Containers
Classification: Red Hat
Component: General
Version: 1.4.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 1.4.2
Assignee: Alay Patel
QA Contact: Xin jiang
Docs Contact: Avital Pinnick
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-02-04 10:58 UTC by Sergio
Modified: 2021-03-15 08:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-15 08:15:36 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | konveyor mig-controller pull 982 | 0 | None | open | Bug 1925065: raise a warning when rsync pod fails and fail dvmp | 2021-03-04 16:39:28 UTC
Github | konveyor mig-controller pull 986 | 0 | None | open | Bug 1925065: raise a warning when rsync pod fails and fail dvmp (#982) | 2021-03-04 22:42:03 UTC
Red Hat Product Errata | RHBA-2021:0814 | 0 | None | None | None | 2021-03-15 08:15:54 UTC

Description Sergio 2021-02-04 10:58:24 UTC
Description of problem:
When the cluster does not have enough CPU resources to run the rsync pods, the migration gets stuck and the DVMP reports an error saying that the pod cannot be found. In fact, the pod exists but has failed with an OutOfcpu error.

Version-Release number of selected component (if applicable):
MTC 1.4.0

How reproducible:
Always

Steps to Reproduce:
1. Configure limits/requests for the rsync pod in the source cluster high enough to trigger an OutOfcpu error.

$ oc edit migrationcontroller -n openshift-migration
.....
  spec:
.....
    source_rsync_pod_cpu_limits: 10
    source_rsync_pod_cpu_requests: 9


2. Migrate any application that uses a PVC, using Direct Volume Migration.

Actual results:
The migration gets stuck, waiting forever for the rsync client pods. The pods exist in the migrated namespace of the source cluster, with the failed status OutOfcpu.

$ oc get pods
NAME                                              READY     STATUS     RESTARTS   AGE
directvolumemigration-rsync-transfer-nginx-html   0/1       OutOfcpu   0          3s
directvolumemigration-rsync-transfer-nginx-logs   0/1       OutOfcpu   0          3s
directvolumemigration-stunnel-transfer            1/1       Running    0          6s
nginx-deployment-578fd5c94c-9269c                 1/1       Running    0          2m2s
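
As an illustration of how the stuck state can be spotted from this listing, here is a minimal Python sketch that parses the text output of `oc get pods` and returns every pod whose STATUS is neither Running nor Completed. The function name `not_running_pods` and the `HEALTHY` set are hypothetical names chosen for this example; they are not part of MTC or `oc`. The parsing assumes whitespace-separated columns with no spaces inside a field, which holds for the output above.

```python
# Sketch only: scan `oc get pods` text output for pods that are not healthy.
LISTING = """\
NAME                                              READY     STATUS     RESTARTS   AGE
directvolumemigration-rsync-transfer-nginx-html   0/1       OutOfcpu   0          3s
directvolumemigration-rsync-transfer-nginx-logs   0/1       OutOfcpu   0          3s
directvolumemigration-stunnel-transfer            1/1       Running    0          6s
nginx-deployment-578fd5c94c-9269c                 1/1       Running    0          2m2s
"""

# Assumption for this example: only these two statuses count as healthy.
HEALTHY = {"Running", "Completed"}


def not_running_pods(listing):
    """Return (name, status) pairs for pods whose STATUS is not in HEALTHY."""
    lines = [line for line in listing.splitlines() if line.strip()]
    header = lines[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    result = []
    for line in lines[1:]:
        fields = line.split()
        if fields[status_col] not in HEALTHY:
            result.append((fields[name_col], fields[status_col]))
    return result


for name, status in not_running_pods(LISTING):
    print(f"{name}: {status}")
```

Against the listing in this report, the sketch flags the two rsync transfer pods and leaves the stunnel and nginx pods alone.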

The DVM will report this:

spec:
  createDestinationNamespaces: true
  destMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
  persistentVolumeClaims:
  - name: nginx-html
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  - name: nginx-logs
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  srcMigClusterRef:
    name: host
    namespace: openshift-migration
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: 'Step: 20/23'
    reason: WaitForRsyncClientPodsCompleted
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2021-02-03T17:05:59Z"
    message: Direct migration is ready
    status: "True"
    type: Ready


And the DVMP will complain about not being able to find the pods:

spec:
  clusterRef:
    name: host
    namespace: openshift-migration
  podRef:
    name: directvolumemigration-rsync-transfer-nginx-html
    namespace: ocp-24706-basicvolmig
status:
  conditions:
  - category: Critical
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: The spec.podRef ocp-24706-basicvolmig/directvolumemigration-rsync-transfer-nginx-html
      must reference a `Pod` with container name rsync-client
    reason: NotFound
    status: "True"
    type: InvalidPod
  observedDigest: ac76a6598caf9c61b1c9675c4dcc9265358895343cda1681b0860fc10932f508



Expected results:
The rsync pods actually exist, and their phase is "Failed".

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 9000, used: 1255,
    capacity: 3500'
  phase: Failed
  reason: OutOfcpu

So the migration should either fail or report a warning.
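
The numbers in the pod status message explain the failure. The following Python sketch parses that message and checks whether the request fits on the node. It assumes the values are CPU millicores, which is consistent with the 9-CPU request configured in the reproduction steps appearing as 9000. The helper names `parse_out_of_resource` and `fits` are hypothetical and exist only for this example.

```python
import re

# Message copied from the pod status shown in this report.
MESSAGE = ("Pod Node didn't have enough resource: cpu, requested: 9000, "
           "used: 1255, capacity: 3500")

PATTERN = re.compile(
    r"resource: (\w+), requested: (\d+), used: (\d+), capacity: (\d+)"
)


def parse_out_of_resource(message):
    """Extract the resource name and the three quantities from the message."""
    match = PATTERN.search(message)
    if match is None:
        raise ValueError("message does not look like an out-of-resource error")
    resource, requested, used, capacity = match.groups()
    return {
        "resource": resource,
        "requested": int(requested),
        "used": int(used),
        "capacity": int(capacity),
    }


def fits(info):
    """A request fits only if used + requested stays within capacity."""
    return info["used"] + info["requested"] <= info["capacity"]


info = parse_out_of_resource(MESSAGE)
available = info["capacity"] - info["used"]      # 3500 - 1255 = 2245
shortfall = info["requested"] - available        # 9000 - 2245 = 6755
print(f"fits={fits(info)} available={available} shortfall={shortfall}")
```

With the values from this report, only 2245 of the node's 3500 units are free, so a request of 9000 can never be satisfied and the pod goes straight to the Failed phase.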

Additional info:

Comment 5 Sergio 2021-03-08 11:38:38 UTC
Verified using MTC 1.4.2

openshift-migration-rhel7-operator@sha256:20cb66a4cc32bd51c111afda54461af5c2834fe0fccc93baed6134e8cbafc480
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 751f34a7dac7c9121792590b1661087970d71dd659c129b622626582031a61b2


The migration now reports a warning.

The DVM resource reports this error:

type: Failed
status: "True"
reason: WaitForRsyncClientPodsCompleted
category: Advisory
message: 'The migration has failed. See: Errors.'

errors:
- One or more pods are in error state
failedPods:
- namespace: ocp-django2
  name: directvolumemigration-rsync-transfer-postgresql
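
For anyone scripting a check of this result, here is a minimal Python sketch that turns the fields reported above into a one-line summary. It takes the condition, the errors list and the failedPods list as separate arguments, because this comment does not show how they are nested inside the DVM status. The function name `summarize_failure` is hypothetical and is not part of mig-controller.

```python
# Field values copied from the verification output in this comment.
CONDITION = {
    "type": "Failed",
    "status": "True",
    "reason": "WaitForRsyncClientPodsCompleted",
    "category": "Advisory",
    "message": "The migration has failed. See: Errors.",
}
ERRORS = ["One or more pods are in error state"]
FAILED_PODS = [
    {
        "namespace": "ocp-django2",
        "name": "directvolumemigration-rsync-transfer-postgresql",
    }
]


def summarize_failure(condition, errors, failed_pods):
    """Return a one-line summary, or None if the condition is not a failure."""
    if condition.get("type") != "Failed" or condition.get("status") != "True":
        return None
    pods = ", ".join(f"{pod['namespace']}/{pod['name']}" for pod in failed_pods)
    reasons = "; ".join(errors)
    return f"{condition.get('reason')}: {reasons} (pods: {pods})"


print(summarize_failure(CONDITION, ERRORS, FAILED_PODS))
```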



Moved to VERIFIED status.

Comment 9 errata-xmlrpc 2021-03-15 08:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) image release advisory 1.4.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0814

