Description of problem:
After deploying the controller pod on an OCP 3.10 cluster, changing items in the MigrationController CR causes the controller pod to take far too long to restart: roughly 20 minutes.
Version-Release number of selected component (if applicable):
source cluster: AWS OCP 3.10 (controller)
target cluster: AWS OCP 4.8
Steps to Reproduce:
1. deploy migration_controller on ocp 3.10 cluster.
2. change one item in the MigrationController, such as "mig_namespace_limit": "5"
# oc edit MigrationController migration-controller -n openshift-migration
Actual results:
The controller pod takes too much time (about 20 minutes) to restart.
Expected results:
The controller pod should restart within 2 minutes.
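For reproducing this non-interactively (e.g. from automation), the edit in step 2 can also be applied as a merge patch; this is a sketch assuming the default openshift-migration namespace:

```shell
# Hypothetical non-interactive alternative to `oc edit` for step 2:
# apply the same field change as a merge patch on the CR spec
# (namespace and field placement under .spec are assumptions).
PATCH='{"spec":{"mig_namespace_limit":"5"}}'
oc patch migrationcontroller migration-controller \
  -n openshift-migration --type=merge -p "$PATCH"
```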
The main problem with this issue is its impact on automated tests, which cannot wait 20 minutes for the pod to restart.
Hi Sergio, we need some more information to understand if this is expected and what is actually taking so long.
What fields are you adding/updating on the MigrationController?
Are you seeing those changes get applied to a ConfigMap, for example, and the restart of the controller pod afterwards is taking too long?
The field does not matter. It happens with "mig_namespace_limit", for example, but with many other fields too. I didn't check whether the changes landed in the ConfigMap, sorry; I only noticed that the migration-controller pod was taking roughly 20 minutes to restart.
Does this happen on 3.10 and lower or just 3.10?
This may be due to limitations regarding status, especially prior to 3.11. We actually had to turn manageStatus off in the operator because the operator wouldn't install on those releases otherwise.
Has anyone from QE had a chance to confirm whether this is happening only on 3.10 or any release less than or equal to 3.10?
If this turns out to be an absence-of-status issue on <= 3.10, we could lower the reconcilePeriod in the legacy operator. Right now it is set to 30m, which seems perfectly reasonable on 4.x. Making it 5 or 10 minutes would force periodic reruns whether required or not; the downside is higher CPU usage for our operator.
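For reference, lowering the interval would be a one-line change to the watch entry; a sketch of what that could look like (the group/version/kind match the migrationcontrollers.migration.openshift.io CRD, but the playbook path is an assumption):

```yaml
# Sketch of the operator's watches.yaml entry with a shorter reconcile period.
- version: v1alpha1
  group: migration.openshift.io
  kind: MigrationController
  playbook: /opt/ansible/main.yml   # path assumed
  reconcilePeriod: 5m               # was 30m; forces periodic reruns at higher CPU cost
  manageStatus: false
```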
I double-checked on 3.7 and 3.9 with the steps described in this bug; they don't have this problem. The controller pod was restarted in around 2-3 minutes.
$ oc get migrationcontroller migration-controller -o yaml
I can reproduce this. I don't have an explanation for it yet. I do notice some oddities when editing resources below 3.11.
The generation is always 0 on 3.9. On 3.10 it's always 1. On 3.11 when you edit the resource the generation increments as you'd expect.
On 3.9 editing the CR does trigger a run, as on 3.11. On 3.10 it does not.
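A quick way to observe the generation behavior (a sketch; assumes the CR lives in the openshift-migration namespace) is to print metadata.generation before and after an edit:

```shell
# Print metadata.generation for the CR; edit the CR and run it again.
# On 3.11+ the number should increment; on 3.10 it stays at 1, on 3.9 at 0.
oc get migrationcontroller migration-controller -n openshift-migration -o json \
  | python3 -c 'import json,sys; print(json.load(sys.stdin)["metadata"]["generation"])'
```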
The operator does run on the interval we set regardless of version: https://github.com/konveyor/mig-operator/blob/master/watches.yaml#L6
I am not sure we can fix this, but I am still looking.
This can be made to work, but it requires reconfiguring the cluster BEFORE installing Crane / MTC.
If Crane or MTC is already installed, first remove all migrationcontroller CRs and the migrationcontrollers.migration.openshift.io CRD, then proceed with the steps below; when done, delete the migration-operator pod and let it redeploy. The removal can happen before or after the steps below, but for those steps to have an effect the CRD must be created after they have been completed.
1. Edit /etc/origin/master/master-config.yaml on all masters and enable the feature gate under apiServerArguments and controllerArguments
2. master-restart api && master-restart controllers
3. Once this has been done on all masters install Crane or MTC normally and updates to the migrationcontroller CR should trigger an update.
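For step 1, the edit would look roughly like the fragment below. The gate name CustomResourceSubresources is an assumption based on the Kubernetes PR referenced later in this bug; the comment itself does not name the gate:

```yaml
# Sketch of /etc/origin/master/master-config.yaml with the feature gate
# enabled for both the API server and the controllers.
# The gate name (CustomResourceSubresources) is assumed, not stated in the bug.
kubernetesMasterConfig:
  apiServerArguments:
    feature-gates:
    - CustomResourceSubresources=true
  controllerArguments:
    feature-gates:
    - CustomResourceSubresources=true
```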
The ramification of not doing this is that you will either have to wait for the 30-minute interval for the operator to update resources, or you will need to do something like deleting the migration-operator pod to force a restart if you want faster results.
The above isn't very clear, I'll try again:
The migrationcontrollers.migration.openshift.io CRD needs to be (re)created after the feature gate is enabled with the steps in the previous comment. To remove the CRD you'll first need to remove all migrationcontroller CRs, and once the CRD is recreated you'll need to restart the operator pod.
The easiest way is to update the masters first and then install MTC normally, but it is possible to recover from the situation if MTC was installed first.
Some additional references:
https://github.com/kubernetes/kubernetes/issues/58778 (discussing problems)
https://github.com/kubernetes/kubernetes/pull/55168 (adds the feature gate)
https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ (documentation for enabling and disabling features via feature gates)
The PR looks good to me