Bug 1986796
| Summary: | Controller pod takes too much time to restart in 3.10 | | |
|---|---|---|---|
| Product: | Migration Toolkit for Containers | Reporter: | Sergio <sregidor> |
| Component: | Documentation | Assignee: | Avital Pinnick <apinnick> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Xin jiang <xjiang> |
| Severity: | low | Docs Contact: | Avital Pinnick <apinnick> |
| Priority: | high | | |
| Version: | 1.4.6 | CC: | ernelson, jmontleo |
| Target Milestone: | --- | | |
| Target Release: | 1.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-09-01 14:16:54 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description

Sergio, 2021-07-28 11:07:03 UTC

Hi Sergio, we need some more information to understand whether this is expected and what is actually taking so long. What fields are you adding or updating on the MigrationController? Are you seeing those changes get applied to a ConfigMap, for example, and the restart of the controller pod afterwards is taking too long?

Hello, it does not matter which field is changed. It happens with "mig_namespace_limit", for example, but with many other fields too. I didn't check whether the changes were applied to the ConfigMap, sorry; I only noticed that the migration-controller pod was taking approximately 20 minutes to restart.

Does this happen on 3.10 and lower, or just on 3.10? This may be due to limitations around status handling, especially prior to 3.11. We actually had to turn manageStatus off in the operator because the operator wouldn't install on those releases otherwise.

Has anyone from QE had a chance to confirm whether this is happening only on 3.10, or on any release less than or equal to 3.10? If this turns out to be an absence-of-status issue on <= 3.10, we could turn down the reconcilePeriod in the legacy operator. Right now we have it set to 30m, which seems perfectly reasonable on 4.x. We could make it 5 or 10 minutes, which would force periodic reruns whether required or not. The downside is higher CPU usage for our operator.

I double-checked on 3.7 and 3.9 with the steps described in the bug; they don't have this problem. The controller pod was restarted in around 2-3 minutes.

```
$ oc get migrationcontroller migration-controller -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigrationController
metadata:
  clusterName: ""
  creationTimestamp: 2021-08-25T07:42:47Z
  generation: 0
  name: migration-controller
  namespace: openshift-migration
  resourceVersion: "12718"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migrationcontrollers/migration-controller
  uid: 0ad21909-0578-11ec-b73d-0e9273977b11
spec:
  azure_resource_group: ""
  cluster_name: host
  mig_namespace_limit: "5"
  mig_ui_cluster_api_endpoint: https://ec2-174-129-53-97.compute-1.amazonaws.com:8443
  migration_controller: true
  migration_log_reader: true
  migration_ui: true
  migration_velero: true
  restic_timeout: 1h
```

I can reproduce this. I don't have an explanation for it yet, but I do notice some oddities when editing resources below 3.11. The generation is always 0 on 3.9. On 3.10 it is always 1. On 3.11, when you edit the resource, the generation increments as you'd expect. On 3.9, editing the CR does trigger a run, as on 3.11; on 3.10 it does not. The operator does run on the interval we set regardless of version: https://github.com/konveyor/mig-operator/blob/master/watches.yaml#L6. I am not sure we can fix this, but I am still looking.
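For context, the 30m interval and the manageStatus setting discussed above are configured in the operator's watches.yaml (linked in the previous comment). The following is a minimal sketch of what such an entry looks like for an Ansible-based operator; the playbook path is an assumption, while reconcilePeriod: 30m and manageStatus: false are the values mentioned in this thread:

```
# Sketch of a watches.yaml entry, assuming an Ansible-based operator.
# reconcilePeriod forces a periodic re-run even when no CR event arrives;
# manageStatus is turned off because the operator would not install on
# pre-3.11 clusters otherwise (see the discussion above).
- version: v1alpha1
  group: migration.openshift.io
  kind: MigrationController
  playbook: /opt/ansible/main.yml   # assumed path; see the linked watches.yaml for the real entry
  manageStatus: false
  reconcilePeriod: 30m
```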
This can be made to work, but it requires reconfiguring the cluster BEFORE installing Crane / MTC. If Crane or MTC is already installed, then before doing this, remove all migrationcontroller CRs and the migrationcontrollers.migration.openshift.io CRD, then proceed with the steps below. When done, delete the migration-operator pod and let it redeploy. These removal steps can be followed before or after the steps below; what matters is that, for the steps below to have an effect, the CRD must be created after they have been completed.

1. Edit /etc/origin/master/master-config.yaml on all masters and enable the feature gate under apiServerArguments and controllerArguments:

   ```
   kubernetesMasterConfig:
     apiServerArguments:
       feature-gates:
       - CustomResourceSubresources=true
     ...
     controllerArguments:
       feature-gates:
       - CustomResourceSubresources=true
     ...
   ```

2. Restart the API server and controllers on each master:

   ```
   master-restart api && master-restart controllers
   ```

3. Once this has been done on all masters, install Crane or MTC normally; updates to the migrationcontroller CR should then trigger an update.

The ramification of not doing this is that you will either have to wait for the 30-minute interval for the operator to update resources, or you will need to do something like delete the migration-operator pod to restart it if you want faster results.

The above isn't very clear, so to restate it: the migrationcontrollers.migration.openshift.io CRD needs to be (re)created after the feature gate is enabled with the steps in the previous comment. To remove the CRD you must first remove all migrationcontroller CRs, and once the CRD is recreated you need to restart the operator pod (a hedged command sketch for this recovery path is included at the end of this page). The easiest way is to update the masters first and then install MTC normally, but it is possible to recover if MTC was installed first.

Some additional references:

- https://github.com/kubernetes/kubernetes/issues/58778 (discussing problems)
- https://github.com/kubernetes/kubernetes/pull/55168 (adds the feature gate)
- https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/ (documentation for enabling and disabling features via feature gates)

The PR looks good to me.

Changes merged.
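For completeness, here is a hedged sketch of the recovery path described in the comments above, for clusters where Crane / MTC was installed before the feature gate was enabled. The openshift-migration namespace comes from the CR output earlier in this bug; the operator pod label selector is an assumption, so adjust it (or delete the pod by name) to match your deployment:

```
# Remove the existing CRs and then the CRD, so the CRD can be (re)created
# after the CustomResourceSubresources feature gate has been enabled.
oc -n openshift-migration delete migrationcontroller --all
oc delete crd migrationcontrollers.migration.openshift.io

# Enable the feature gate and restart the masters as described in the steps
# above, then recreate the CRD (for example by reinstalling Crane / MTC).

# Finally, restart the operator pod so it picks up the recreated CRD.
# The label selector below is an assumption; deleting the pod by name also works.
oc -n openshift-migration delete pod -l name=migration-operator
```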