Description of problem:

When the cluster does not have enough CPU resources to run the rsync pods, the migration gets stuck and the DVMP reports an error saying that the pod cannot be found. In fact the pod exists, and it has failed with an OutOfcpu error.

Version-Release number of selected component (if applicable):
MTC 1.4.0

How reproducible:
Always

Steps to Reproduce:
1. Configure limits/requests for the rsync pod in the source cluster high enough to trigger an OutOfcpu error.

$ oc edit migrationcontroller -n openshift-migration
.....
spec:
  .....
  source_rsync_pod_cpu_limits: 10
  source_rsync_pod_cpu_requests: 9

2. Migrate any application that uses a PVC. Use Direct Volume Migration.

Actual results:

The migration gets stuck, waiting forever for the rsync client pods. The pods exist in the migrated namespace of the source cluster, with a failed status of OutOfcpu.

$ oc get pods
NAME                                              READY   STATUS     RESTARTS   AGE
directvolumemigration-rsync-transfer-nginx-html   0/1     OutOfcpu   0          3s
directvolumemigration-rsync-transfer-nginx-logs   0/1     OutOfcpu   0          3s
directvolumemigration-stunnel-transfer            1/1     Running    0          6s
nginx-deployment-578fd5c94c-9269c                 1/1     Running    0          2m2s

The DVM reports this:

spec:
  createDestinationNamespaces: true
  destMigClusterRef:
    name: source-cluster
    namespace: openshift-migration
  persistentVolumeClaims:
  - name: nginx-html
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  - name: nginx-logs
    namespace: ocp-24706-basicvolmig
    targetAccessModes:
    - ReadWriteOnce
    targetStorageClass: gp2
    verify: false
  srcMigClusterRef:
    name: host
    namespace: openshift-migration
status:
  conditions:
  - category: Advisory
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: 'Step: 20/23'
    reason: WaitForRsyncClientPodsCompleted
    status: "True"
    type: Running
  - category: Required
    lastTransitionTime: "2021-02-03T17:05:59Z"
    message: Direct migration is ready
    status: "True"
    type: Ready

And the DVMP complains that it cannot find the pods:

spec:
  clusterRef:
    name: host
    namespace: openshift-migration
  podRef:
    name: directvolumemigration-rsync-transfer-nginx-html
    namespace: ocp-24706-basicvolmig
status:
  conditions:
  - category: Critical
    lastTransitionTime: "2021-02-03T17:06:38Z"
    message: The spec.podRef ocp-24706-basicvolmig/directvolumemigration-rsync-transfer-nginx-html must reference a `Pod` with container name rsync-client
    reason: NotFound
    status: "True"
    type: InvalidPod
  observedDigest: ac76a6598caf9c61b1c9675c4dcc9265358895343cda1681b0860fc10932f508

Expected results:

The rsync pods actually exist, and their phase is "Failed".

status:
  message: 'Pod Node didn''t have enough resource: cpu, requested: 9000, used: 1255, capacity: 3500'
  phase: Failed
  reason: OutOfcpu

So the migration should fail, or at least report a warning.

Additional info:
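
For illustration only, here is a minimal Python sketch of the kind of check that would surface the failure described above. It is not the MTC controller's code. It assumes the standard JSON layout printed by `oc get pods -o json` and the kubelet message format quoted in this report; the function names `find_failed_pods` and `cpu_shortfall` are hypothetical, and the sample data is copied from the output above.

```python
import re

# Matches the numbers in the kubelet message quoted in this report, e.g.
# "... requested: 9000, used: 1255, capacity: 3500".
_RESOURCE_RE = re.compile(r"requested: (\d+), used: (\d+), capacity: (\d+)")


def find_failed_pods(pod_list):
    """Return (namespace, name, reason, message) for every pod whose phase is Failed.

    pod_list is the dict obtained by parsing `oc get pods -o json`.
    """
    failed = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") == "Failed":
            meta = pod.get("metadata", {})
            failed.append(
                (
                    meta.get("namespace"),
                    meta.get("name"),
                    status.get("reason"),
                    status.get("message"),
                )
            )
    return failed


def cpu_shortfall(message):
    """Return requested + used - capacity, or None if the message has no numbers.

    A positive result means the request did not fit on the node.
    """
    match = _RESOURCE_RE.search(message or "")
    if match is None:
        return None
    requested, used, capacity = (int(group) for group in match.groups())
    return requested + used - capacity


# Sample data taken from the pod status shown in this report.
sample = {
    "items": [
        {
            "metadata": {
                "name": "directvolumemigration-rsync-transfer-nginx-html",
                "namespace": "ocp-24706-basicvolmig",
            },
            "status": {
                "phase": "Failed",
                "reason": "OutOfcpu",
                "message": "Pod Node didn't have enough resource: cpu, "
                "requested: 9000, used: 1255, capacity: 3500",
            },
        },
        {
            "metadata": {
                "name": "directvolumemigration-stunnel-transfer",
                "namespace": "ocp-24706-basicvolmig",
            },
            "status": {"phase": "Running"},
        },
    ]
}

for namespace, name, reason, message in find_failed_pods(sample):
    print(f"{namespace}/{name}: {reason} (short by {cpu_shortfall(message)})")
```

To run it against a live cluster, the output of `oc get pods -n ocp-24706-basicvolmig -o json` can be parsed with `json.load` and passed to `find_failed_pods` in place of `sample`.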
Verified using MTC 1.4.2

openshift-migration-rhel7-operator@sha256:20cb66a4cc32bd51c111afda54461af5c2834fe0fccc93baed6134e8cbafc480
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 751f34a7dac7c9121792590b1661087970d71dd659c129b622626582031a61b2

The migration now reports a warning.

The DVM resource reports this error:

"type": "Failed"
"status": "True"
"reason": "WaitForRsyncClientPodsCompleted"
"category": "Advisory"
"message": "The migration has failed. See: Errors."
"errors": ["One or more pods are in error state"]
"failedPods": [{"namespace": "ocp-django2", "name": "directvolumemigration-rsync-transfer-postgresql"}]

Moved to VERIFIED status.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) image release advisory 1.4.2), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:0814