Description of the problem:
When MCG Noobaa is used as the Replication Repository, the registry pod gets restarted several times, going through the CrashLoopBackOff state before finally reaching a stable Running state.

Severity: Medium

Version-Release number of selected component (if applicable):
MTC 1.7.0 + MCG Noobaa as Replication Repository
Source cluster OCP version: 3.11
Target cluster OCP version: 4.7 (controller) - MCG installed

Steps to reproduce:
1. Deploy any application in the source cluster
2. Log in to the MTC UI and create a migration plan
3. Execute a Full/Cutover migration

Actual Results:
During the StageBackup step, the registry pod crashes several times, which is also visible on the Migrations page.

Expected Results:
The registry pod should not crash.

Additional Info:
Even after increasing the liveness/readiness timeouts of the registry pod, it keeps crashing.

Logs:
time="2022-03-03T15:54:40.011958156Z" level=debug msg="authorizing request" go.version=go1.16.12 http.request.host="10.129.2.89:5000" http.request.id=2b2a85f8-f3bb-4e57-8a6f-82fc8d00295f http.request.method=GET http.request.remoteaddr="10.129.2.1:54922" http.request.uri="/v2/_catalog?n=5" http.request.useragent="kube-probe/1.20"
time="2022-03-03T15:54:40.160697615Z" level=debug msg="s3aws.ListObjectsV2Pages(automatic-registry-b9b2251f-bf91-4214-87e5-1ab5e5f27a9d/docker/registry/v2/repositories/django/django-psql-persistent/)" go.version=go1.16.12 http.request.host="10.129.2.89:5000" http.request.id=2b2a85f8-f3bb-4e57-8a6f-82fc8d00295f http.request.method=GET http.request.remoteaddr="10.129.2.1:54922" http.request.uri="/v2/_catalog?n=5" http.request.useragent="kube-probe/1.20" trace.duration=100.414272ms
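For context on the kube-probe requests in the logs: the kubelet's liveness and readiness probes hit the registry catalog endpoint, and each probe in turn drives paginated S3 ListObjectsV2 calls against the Noobaa backend; if a listing takes longer than the probe's timeoutSeconds, the probe fails and the kubelet eventually restarts the container, which is the CrashLoopBackOff seen here. A minimal sketch of the registry container's probe spec, with the path and port taken from the log lines and the 3-second default timeout noted in the fix below; the period and failure threshold values are illustrative assumptions, not confirmed defaults:

  containers:
    - name: registry
      livenessProbe:
        httpGet:
          path: /v2/_catalog?n=5   # endpoint seen in the kube-probe log lines
          port: 5000
        timeoutSeconds: 3          # pre-fix default; too short for Noobaa-backed listings
        periodSeconds: 10          # assumed value
        failureThreshold: 3        # assumed value; repeated failures trigger a restart
      readinessProbe:
        httpGet:
          path: /v2/_catalog?n=5
          port: 5000
        timeoutSeconds: 3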
I have seen this happen as well with Noobaa in the last week or so. Did you increase the readiness and liveness timeouts in the MigrationController CR on both clusters? The values are set per cluster, so setting them on the controller cluster only is not sufficient. If you did increase them on both, what did you increase them to? If you tried something smaller, can you try a large value such as 300 seconds on both clusters and see whether it resolves the issue?
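For anyone trying this, a sketch of where the values live. The parameter names below are hypothetical, used only for illustration; check the MigrationController CR in the openshift-migration namespace on each cluster for the exact keys before applying. The same edit has to be made on both the source and the target cluster:

  apiVersion: migration.openshift.io/v1alpha1
  kind: MigrationController
  metadata:
    name: migration-controller
    namespace: openshift-migration
  spec:
    # Hypothetical key names for the registry probe timeouts --
    # verify the exact spelling in your MigrationController CR.
    migration_registry_readiness_timeout: 300   # seconds (assumed key name)
    migration_registry_liveness_timeout: 300    # seconds (assumed key name)

After the CR is updated on each cluster, the operator should reconcile the migration registry deployment with the new timeouts.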
To remedy this for the typical case, we're increasing the default liveness and readiness probe timeouts from 3 to 300 seconds.
https://github.com/konveyor/mig-controller/pull/1269 / https://github.com/konveyor/mig-controller/pull/1270
https://github.com/konveyor/mig-controller/commit/0fda45f8771ed3ee4b9bd9a89ce49f50e2ee106f
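For reference, after that change the probes on the migration registry container should come out along these lines (same endpoint as in the logs above; this is a sketch of the intended result, not the literal rendered manifest):

  livenessProbe:
    httpGet:
      path: /v2/_catalog?n=5
      port: 5000
    timeoutSeconds: 300   # raised from 3 by the PRs above
  readinessProbe:
    httpGet:
      path: /v2/_catalog?n=5
      port: 5000
    timeoutSeconds: 300

The rendered values can be checked with oc -n openshift-migration get deploy -o yaml on the cluster where the registry pod runs.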
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.7.1 security and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:1734