Bug 1986796

Summary: Controller pod takes too much time to restart in 3.10
Product: Migration Toolkit for Containers
Reporter: Sergio <sregidor>
Component: Documentation
Assignee: Avital Pinnick <apinnick>
Status: CLOSED CURRENTRELEASE
QA Contact: Xin jiang <xjiang>
Severity: low
Docs Contact: Avital Pinnick <apinnick>
Priority: high
Version: 1.4.6
CC: ernelson, jmontleo
Target Milestone: ---
Target Release: 1.6.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-01 14:16:54 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Sergio 2021-07-28 11:07:03 UTC
Description of problem:
After deploying the controller pod on an OCP 3.10 cluster and changing some items in the MigrationController, the controller pod takes too long to restart: approximately 20 minutes.

Version-Release number of selected component (if applicable):
MTC 1.4.6
source cluster: AWS OCP 3.10 (controller)
target cluster: AWS OCP 4.8  

How reproducible:

Steps to Reproduce:
1. Deploy migration_controller on an OCP 3.10 cluster.
2. Change one item in the MigrationController, such as "mig_namespace_limit": "5":
# oc edit  MigrationController migration-controller -n openshift-migration
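
For scripted reproduction, the same change can be made without an interactive editor. This is a minimal sketch, not the reporter's method: the helper name `set_migcontroller_field` is hypothetical, and it assumes the edited fields live under `spec` in the MigrationController CR.

```shell
# Hypothetical helper: set one string field on the MigrationController CR
# non-interactively. Assumes the field sits under .spec.
set_migcontroller_field() {
  local field="$1" value="$2"
  # A merge patch is used because custom resources do not support
  # strategic merge patches.
  oc patch migrationcontroller migration-controller \
    -n openshift-migration \
    --type merge \
    -p "{\"spec\":{\"${field}\":\"${value}\"}}"
}
```

Example: `set_migcontroller_field mig_namespace_limit 5`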

Actual results:
The controller pod takes too long to restart (approximately 20 minutes).

Expected results:
The controller pod should be restarted within 2 minutes.

Additional info:
The main problem with this issue is that it has a big impact on the automated tests, since they cannot wait 20 minutes for the pod to be restarted.
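
A test that waits for the restart could bound the wait with an explicit timeout instead of blocking indefinitely. The sketch below is only an illustration: the function name `wait_for_new_pod` is hypothetical, and the label selector for the controller pod must be supplied by the caller because it is not given in this bug. It only compares pod names and does not check readiness.

```shell
# Hypothetical helper: poll until the pod matching a label selector has a
# different name than the one recorded before the edit, or give up.
# Arguments: selector, old pod name, timeout (s, default 120), interval (s, default 5).
wait_for_new_pod() {
  local selector="$1" old_pod="$2" timeout="${3:-120}" interval="${4:-5}"
  local waited=0 current
  while [ "$waited" -lt "$timeout" ]; do
    current="$(oc get pods -n openshift-migration -l "$selector" -o name 2>/dev/null)"
    if [ -n "$current" ] && [ "$current" != "$old_pod" ]; then
      echo "$current"
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  return 1   # the pod name did not change within the timeout
}
```

The old pod name would be captured with the same `oc get pods ... -o name` call before the MigrationController is changed.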

Comment 1 Erik Nelson 2021-07-29 14:59:41 UTC
Hi Sergio, we need some more information to understand if this is expected and what is actually taking so long.

What fields are you adding/updating on the MigrationController?

Are you seeing those changes get applied to a ConfigMap, for example, and the restart of the controller pod afterwards is taking too long?

Comment 2 Sergio 2021-07-29 17:17:00 UTC

It does not matter which field. It happens with "mig_namespace_limit", for example, but with many other fields too. I didn't check whether the changes were in the configmap, I'm sorry. I only noticed that the migration-controller pod was taking approximately 20 minutes to restart.

Comment 3 Jason Montleon 2021-08-12 14:08:01 UTC
Does this happen on 3.10 and lower or just 3.10?

This may be due to limitations regarding status, especially prior to 3.11. We actually had to turn manageStatus off in the operator because the operator wouldn't install on these releases otherwise.

Comment 4 Jason Montleon 2021-08-23 17:03:32 UTC
Has anyone from QE had a chance to confirm whether this is happening only on 3.10 or any release less than or equal to 3.10?

Comment 5 Jason Montleon 2021-08-24 13:04:17 UTC
If this turns out to be an absence of status issue on <= 3.10 we could turn down the reconcilePeriod in the legacy operator. Right now we have it set to 30m which seems perfectly reasonable on 4.x. We could make it 5 or 10 minutes and it would force periodic reruns whether required or not. The downside is higher CPU usage for our operator.
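
For reference, a shorter interval would be a one-line change in the operator's watches.yaml. The fragment below is a sketch of an Ansible operator watch entry, not a copy of the real file: the `role` path is hypothetical, and 10m is just one of the values suggested above.

```yaml
# Sketch of a watches.yaml entry (role path is hypothetical)
- version: v1alpha1
  group: migration.openshift.io
  kind: MigrationController
  role: /opt/ansible/roles/migrationcontroller
  reconcilePeriod: 10m   # currently 30m; lower values mean more frequent reruns and more CPU
  manageStatus: false    # turned off so the operator installs on older releases
```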

Comment 6 Xin jiang 2021-08-25 08:20:48 UTC
I double-checked on 3.7 and 3.9 using the steps in this bug, and they don't have this problem. The controller pod was restarted in around 2 to 3 minutes.

Comment 7 Xin jiang 2021-08-25 08:23:09 UTC
$ oc get migrationcontroller migration-controller  -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  clusterName: ""
  creationTimestamp: 2021-08-25T07:42:47Z
  generation: 0
  name: migration-controller
  namespace: openshift-migration
  resourceVersion: "12718"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migrationcontrollers/migration-controller
  uid: 0ad21909-0578-11ec-b73d-0e9273977b11
spec:
  azure_resource_group: ""
  cluster_name: host
  mig_namespace_limit: "5"
  mig_ui_cluster_api_endpoint: https://ec2-174-129-53-97.compute-1.amazonaws.com:8443
  migration_controller: true
  migration_log_reader: true
  migration_ui: true
  migration_velero: true
  restic_timeout: 1h

Comment 8 Jason Montleon 2021-08-30 14:28:39 UTC
I can reproduce this. I don't have an explanation for it yet. I do notice some oddities when editing resources below 3.11.

The generation is always 0 on 3.9. On 3.10 it's always 1. On 3.11 when you edit the resource the generation increments as you'd expect.

On 3.9 editing the CR does trigger a run, as on 3.11. On 3.10 it does not.
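
The generation behavior can be checked directly by reading the field before and after an edit. A minimal sketch, with a hypothetical function name:

```shell
# Hypothetical helper: print .metadata.generation of the MigrationController CR.
get_migcontroller_generation() {
  oc get migrationcontroller migration-controller \
    -n openshift-migration \
    -o jsonpath='{.metadata.generation}'
}
```

Run it once, edit the CR, and run it again. Per the observations above, the value should increase on 3.11 and stay unchanged on 3.9 and 3.10.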

The operator does run on the interval we set regardless of version: https://github.com/konveyor/mig-operator/blob/master/watches.yaml#L6

I am not sure we can fix this, but I am still looking.

Comment 9 Jason Montleon 2021-08-30 16:28:50 UTC
This can be made to work, but it requires reconfiguring the cluster BEFORE installing Crane / MTC.

If Crane or MTC is already installed, remove all migrationcontroller CRs and the migrationcontrollers.migration.openshift.io CRD, and follow the steps below. When done, delete the migration-operator pod and let it redeploy. The removal can happen before or after the steps below. What is important is that the CRD is created after the steps have been followed; otherwise they have no effect.

1. Edit /etc/origin/master/master-config.yaml on all masters, enable the feature-gate under apiServerArguments and controllerArguments

    - CustomResourceSubresources=true
    - CustomResourceSubresources=true

2. master-restart api && master-restart controllers

3. Once this has been done on all masters install Crane or MTC normally and updates to the migrationcontroller CR should trigger an update.
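
The two identical lines in step 1 are one entry per argument block. A sketch of how the edit might look, assuming the usual OpenShift 3.x master-config.yaml layout in which both blocks sit under `kubernetesMasterConfig` and the gate is listed under a `feature-gates` key (these surrounding keys are not shown in the comment and are an assumption here):

```yaml
# /etc/origin/master/master-config.yaml (fragment, assumed layout)
kubernetesMasterConfig:
  apiServerArguments:
    feature-gates:
    - CustomResourceSubresources=true
  controllerArguments:
    feature-gates:
    - CustomResourceSubresources=true
```

If either block already has a `feature-gates` list, the entry would be appended to it rather than added as a second key.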

If you don't do this, you'll either have to wait for the 30-minute interval for the operator to update resources, or do something like delete the migration-operator pod to force it to restart if you want faster results.

Comment 10 Jason Montleon 2021-08-30 20:00:47 UTC
The above isn't very clear, I'll try again:

The migrationcontrollers.migration.openshift.io CRD needs to be (re)created after the feature-gate is enabled with the steps in the previous comment. In order to remove the CRD you'll first need to remove all migrationcontroller CRs, and once the CRD is recreated you'll need to restart the operator pod.

The easiest way is to update the masters first, then install MTC normally, but it is possible to recover from the situation if MTC was installed first.
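
The recovery path for an existing install could be scripted roughly as follows. This is a sketch under assumptions: both function names are hypothetical, the CRs are assumed to live only in the openshift-migration namespace, and the operator pod's label selector must be supplied by the caller. How the CRD gets recreated is not covered here.

```shell
# Hypothetical helper: remove all MigrationController CRs, then the CRD itself.
# Assumes the CRs exist only in the openshift-migration namespace.
remove_migcontroller_crd() {
  oc delete migrationcontroller --all -n openshift-migration
  oc delete crd migrationcontrollers.migration.openshift.io
}

# Hypothetical helper: delete the operator pod so that it redeploys.
# The label selector is passed in because it is not given in this bug.
restart_operator_pod() {
  local selector="$1"
  oc delete pod -n openshift-migration -l "$selector"
}
```

`remove_migcontroller_crd` would be run on a cluster where MTC is already installed; `restart_operator_pod` only after the feature-gate is enabled on all masters and the CRD has been recreated.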

Some additional references:
https://github.com/kubernetes/kubernetes/issues/58778 (discussing problems)
https://github.com/kubernetes/kubernetes/pull/55168 (adds the feature gate)
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ (documentation for enabling and disabling features via feature gates)

Comment 13 Xin jiang 2021-09-01 13:02:03 UTC
The PR looks good to me

Comment 14 Avital Pinnick 2021-09-01 14:16:54 UTC
Changes merged