Bug 2060717

Summary: [MTC] Registry pod goes into CrashLoopBackOff several times when MCG Noobaa is used as the Replication Repository
Product: Migration Toolkit for Containers
Reporter: ssingla
Component: General
Assignee: Jason Montleon <jmontleo>
Status: CLOSED ERRATA
QA Contact: mohamed <midays>
Severity: medium
Docs Contact: Richard Hoch <rhoch>
Priority: medium
Version: 1.7.0
CC: ernelson, midays, rjohnson
Target Milestone: ---
Target Release: 1.7.1
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-05 13:50:01 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description ssingla 2022-03-04 04:45:16 UTC
Description of the problem: When MCG Noobaa is used as the Replication Repository, the registry pod restarts several times, cycling through the CrashLoopBackOff state before finally reaching a stable Running state.

Severity: Medium

Version-Release number of selected component (if applicable):
MTC 1.7.0 + MCG Noobaa as Replication Repository
Source cluster OCP version: 3.11
Target cluster OCP version: 4.7 (Controller) - MCG installed 

Steps to reproduce:

1. Deploy any application in the source cluster 
2. Log in to the MTC UI and create a migplan
3. Execute Full/Cutover migration



Actual Results: During the StageBackup step, the registry pod crashes several times, which is also visible on the Migrations page.

Expected Results:  The registry pod should not crash.

Additional Info: Even after increasing the liveness/readiness timeout of the registry pod, it keeps crashing.

Logs
time="2022-03-03T15:54:40.011958156Z" level=debug msg="authorizing request" go.version=go1.16.12 http.request.host="10.129.2.89:5000" http.request.id=2b2a85f8-f3bb-4e57-8a6f-82fc8d00295f http.request.method=GET http.request.remoteaddr="10.129.2.1:54922" http.request.uri="/v2/_catalog?n=5" http.request.useragent="kube-probe/1.20" 
time="2022-03-03T15:54:40.160697615Z" level=debug msg="s3aws.ListObjectsV2Pages(automatic-registry-b9b2251f-bf91-4214-87e5-1ab5e5f27a9d/docker/registry/v2/repositories/django/django-psql-persistent/)" go.version=go1.16.12 http.request.host="10.129.2.89:5000" http.request.id=2b2a85f8-f3bb-4e57-8a6f-82fc8d00295f http.request.method=GET http.request.remoteaddr="10.129.2.1:54922" http.request.uri="/v2/_catalog?n=5" http.request.useragent="kube-probe/1.20" trace.duration=100.414272ms

Comment 1 Jason Montleon 2022-03-08 16:09:49 UTC
I have seen this happen as well with Noobaa in the last week or so. Did you increase the Readiness and Liveness timeouts in the MigrationController CR on both clusters? The values are set per cluster, so doing so on the controller node only is not sufficient.

If you did increase it on both, what did you increase it to? If you tried a smaller value, can you try a large value like 300 on both clusters and see whether it resolves the issue?
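For reference, such an override lives in the MigrationController CR on each cluster. The sketch below is illustrative only: the probe-timeout parameter names are assumptions, not confirmed keys, so check the CR on your cluster for the exact spelling before applying.

```yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  name: migration-controller
  namespace: openshift-migration
spec:
  # Hypothetical parameter names for the registry probe timeouts --
  # verify the exact keys in your MigrationController CR.
  # Must be set on BOTH the source and target clusters.
  migration_registry_readiness_timeout: 300
  migration_registry_liveness_timeout: 300
```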

Comment 3 Jason Montleon 2022-04-06 02:15:14 UTC
To remedy this for the typical case we're increasing the default liveness and readiness probe timeouts from 3 to 300 seconds.

https://github.com/konveyor/mig-controller/pull/1269 / https://github.com/konveyor/mig-controller/pull/1270

https://github.com/konveyor/mig-controller/commit/0fda45f8771ed3ee4b9bd9a89ce49f50e2ee106f
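In standard Kubernetes terms, the change amounts to bumping `timeoutSeconds` on the registry container's probes from 3 to 300. A rough sketch of the resulting probe spec (the `/v2/_catalog?n=5` endpoint on port 5000 is taken from the kube-probe requests visible in the logs above; other details of the generated deployment are omitted):

```yaml
# Illustrative probe settings on the migration registry container;
# only timeoutSeconds changes (3 -> 300).
livenessProbe:
  httpGet:
    path: /v2/_catalog?n=5
    port: 5000
  timeoutSeconds: 300   # previously 3
readinessProbe:
  httpGet:
    path: /v2/_catalog?n=5
    port: 5000
  timeoutSeconds: 300   # previously 3
```

A longer timeout matters here because listing registry objects against Noobaa-backed S3 storage can be slow enough that the probe's HTTP request does not complete within the old 3-second window, causing kubelet to kill the pod.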

Comment 9 errata-xmlrpc 2022-05-05 13:50:01 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.7.1 security and bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:1734