Bug 1925065

Summary: Direct Volume Migration stuck when Rsync pods fail because of OutOfCpu error
Product: Migration Toolkit for Containers Reporter: Sergio <sregidor>
Component: General Assignee: Alay Patel <alpatel>
Status: CLOSED ERRATA QA Contact: Xin jiang <xjiang>
Severity: medium Docs Contact: Avital Pinnick <apinnick>
Priority: unspecified    
Version: 1.4.0 CC: alpatel, chezhang, ernelson, rjohnson, whu, xjiang
Target Milestone: ---   
Target Release: 1.4.2   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-03-15 08:15:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Sergio 2021-02-04 10:58:24 UTC
Description of problem:
When the cluster does not have enough CPU resources to run the rsync pods, the migration gets stuck and the DVMP reports an error saying that the pod cannot be found. In fact, the pod exists and has failed with an OutOfcpu error.

Version-Release number of selected component (if applicable):
MTC 1.4.0

How reproducible:
Always

Steps to Reproduce:
1. Configure limits/requests for the rsync pod in the source cluster high enough to trigger an OutOfcpu error.

$ oc edit migrationcontroller -n openshift-migration
.....
  spec:
.....
    source_rsync_pod_cpu_limits: 10
    source_rsync_pod_cpu_requests: 9


2. Migrate any application that uses a PVC, using Direct Volume Migration.

Actual results:
The migration gets stuck, waiting forever for the rsync client pods. The pods exist with a failed status of OutOfcpu in the migrated namespace on the source cluster.

$ oc get pods
NAME                                              READY     STATUS     RESTARTS   AGE
directvolumemigration-rsync-transfer-nginx-html   0/1       OutOfcpu   0          3s
directvolumemigration-rsync-transfer-nginx-logs   0/1       OutOfcpu   0          3s
directvolumemigration-stunnel-transfer            1/1       Running    0          6s
nginx-deployment-578fd5c94c-9269c                 1/1       Running    0          2m2s
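
A listing like the one above can be filtered to show only the problem pods. The following is a minimal sketch, not part of MTC: the helper name `list_unhealthy_pods` is hypothetical, and it assumes the default `oc get pods` column layout, with STATUS in the third column.

```shell
# Hypothetical helper: reads `oc get pods` tabular output on stdin and prints
# "<pod>: <status>" for every pod whose STATUS is neither Running nor Completed.
# Assumes the default column order (NAME READY STATUS RESTARTS AGE).
list_unhealthy_pods() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 ": " $3 }'
}

# Usage against a live cluster (namespace taken from this report):
#   oc get pods -n ocp-24706-basicvolmig | list_unhealthy_pods
```

Run against the listing in this report, it would print the two rsync transfer pods with their OutOfcpu status and skip the stunnel and nginx pods.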

The DVM will report this:

spec:
  createDestinationNamespaces: true
  destMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
  persistentVolumeClaims:
  - name: nginx-html
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  - name: nginx-logs
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  srcMigClusterRef:
    name: host
    namespace: openshift-migration
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: 'Step: 20/23'
    reason: WaitForRsyncClientPodsCompleted
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2021-02-03T17:05:59Z"
    message: Direct migration is ready
    status: "True"
    type: Ready


And the DVMP will complain about not being able to find the pods:

spec:
  clusterRef:
    name: host
    namespace: openshift-migration
  podRef:
    name: directvolumemigration-rsync-transfer-nginx-html
    namespace: ocp-24706-basicvolmig
status:
  conditions:
  - category: Critical
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: The spec.podRef ocp-24706-basicvolmig/directvolumemigration-rsync-transfer-nginx-html
      must reference a `Pod` with container name rsync-client
    reason: NotFound
    status: "True"
    type: InvalidPod
  observedDigest: ac76a6598caf9c61b1c9675c4dcc9265358895343cda1681b0860fc10932f508



Expected results:
The rsync pods actually exist, and their phase is "Failed".

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 9000, used: 1255,
    capacity: 3500'
  phase: Failed
  reason: OutOfcpu

So the migration should fail, or at least report a warning.
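
The figures in the pod status message show why the pod was rejected. The sketch below redoes that arithmetic; it assumes the values are millicores (a request of 9 CPUs appearing as 9000) and that a pod fits only when its request is no larger than capacity minus the CPU already in use. The function name `cpu_request_fits` is hypothetical and is not an MTC or OpenShift command.

```shell
# Hypothetical helper: exits 0 when a CPU request fits in what is left on the
# node, non-zero otherwise. All three arguments are assumed to be millicores.
cpu_request_fits() {
  requested=$1 used=$2 capacity=$3
  [ "$requested" -le $((capacity - used)) ]
}

# Values copied from the status message in this report.
if cpu_request_fits 9000 1255 3500; then
  echo "fits"
else
  echo "does not fit: only $((3500 - 1255))m free"
fi
```

With these values only 2245m is free, far below the 9000m requested, so the check fails.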

Additional info:

Comment 5 Sergio 2021-03-08 11:38:38 UTC
Verified using MTC 1.4.2

openshift-migration-rhel7-operator@sha256:20cb66a4cc32bd51c111afda54461af5c2834fe0fccc93baed6134e8cbafc480
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 751f34a7dac7c9121792590b1661087970d71dd659c129b622626582031a61b2


The migration now reports a warning.

The DVM resource reports this error:

type: Failed
status: "True"
reason: WaitForRsyncClientPodsCompleted
category: Advisory
message: 'The migration has failed. See: Errors.'

errors:
- One or more pods are in error state
failedPods:
- namespace: ocp-django2
  name: directvolumemigration-rsync-transfer-postgresql



Moved to VERIFIED status.

Comment 9 errata-xmlrpc 2021-03-15 08:15:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Migration Toolkit for Containers (MTC) image release advisory 1.4.2), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0814