Created attachment 1697818 [details]
all logs

Description of problem:

Sometimes the migration fails in the EnsureLabelsDeleted phase because of a conflict while updating the replicationcontroller resource. If the migration is run again, it ends successfully.

Version-Release number of selected component (if applicable):

CAM 1.2.2
SOURCE CLUSTER: OCP 3.11 AWS
TARGET CLUSTER: OCP 4.4 AWS
NOOBAA BUCKET

How reproducible:

Intermittent

Steps to Reproduce:
1. The problem happened while migrating a mysql DeploymentConfig. I attach the jinja2 template that we use to deploy this application, but it should happen with any DC when the race condition is hit.

Actual results:

The migration fails in the EnsureLabelsDeleted stage, and the MigMigration resource shows this failure status:

  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-17T12:54:49Z"
    message: 'The migration has failed. See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on replicationcontrollers "mysql-1": the object has been modified; please apply your changes to the latest version and try again'
  itenerary: Failed

If the migration is run again, the migration ends successfully.

Expected results:

The migration should complete without problems.

Additional info:

Full MigMigration resource:

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: baced36f-b099-11ea-8cbd-0a580a820268
  creationTimestamp: "2020-06-17T12:50:26Z"
  generation: 30
  labels:
    controller-tools.k8s.io: "1.0"
  name: ocp-28967-migplan-naming-mig-1592398188
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: ocp-28967-migplan-naming.migplan.1592398188
    uid: 2aeb54f6-8389-4fd3-9fa8-468135f57f5b
  resourceVersion: "117389"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/ocp-28967-migplan-naming-mig-1592398188
  uid: 1204a4ca-e660-4f5d-90ff-95418a197aea
spec:
  migPlanRef:
    name: ocp-28967-migplan-naming.migplan.1592398188
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-17T12:54:49Z"
    message: 'The migration has failed. See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on replicationcontrollers "mysql-1": the object has been modified; please apply your changes to the latest version and try again'
  itenerary: Failed
  observedDigest: d67b950bb9516f04ae2a6bdf60b0c8e70aea4ef3e52bf63f7278cd1826fa075a
  phase: Completed
  startTimestamp: "2020-06-17T12:50:26Z"
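For anyone hitting this: the error comes from Kubernetes optimistic concurrency. Every update carries the resourceVersion from the preceding read, and the API server rejects the write with a Conflict if the object changed in between. Below is a minimal client-go sketch of that race and the standard retry remedy; the function name is illustrative, not the actual controller code.

// Not the mig-controller code; a minimal client-go sketch of why
// "the object has been modified" appears and the standard remedy.
// The Update sends the resourceVersion from the Get, so if anything
// else touched the RC in between, the API server answers Conflict.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// removeLabel is a hypothetical helper: re-read the RC and retry the
// update whenever we lose the optimistic-concurrency race.
func removeLabel(ctx context.Context, c kubernetes.Interface, ns, name, label string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		rc, err := c.CoreV1().ReplicationControllers(ns).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		delete(rc.Labels, label)
		_, err = c.CoreV1().ReplicationControllers(ns).Update(ctx, rc, metav1.UpdateOptions{})
		return err // a Conflict here makes RetryOnConflict run the closure again
	})
}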
Created attachment 1697819 [details]
deployment template

Created attachment 1697820 [details]
Migmigration resource

Created attachment 1697821 [details]
migplan resource

Created attachment 1697822 [details]
controller logs
Fix is here: https://github.com/konveyor/mig-controller/pull/571

A resource conflict means the resource in OpenShift/Kubernetes was modified between the time our controller fetched it and the time we attempted the update. While running migration tasks, when the error is a conflict error like this, we should requeue the reconcile and try again rather than fail the migration.
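For context, a minimal sketch of that approach in controller-runtime terms; the names here are illustrative, not the actual code from the PR:

package migmigration

import (
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// handleTaskError is a hypothetical helper: a conflict only means we lost
// the optimistic-concurrency race, so requeue the reconcile and let the
// phase re-read the object and retry; any other error still fails the
// migration as before.
func handleTaskError(err error) (reconcile.Result, error) {
	if k8serrors.IsConflict(err) {
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, err
}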
Verified using CAM 1.2.3 stage, 3.11 -> 4.5 AWS with AWS Bucket. The error is still happening.

openshift-migration-rhel7-operator@sha256:58b41647b27dfc1791bacedd998b02725bce6ffe5ae9577c7b33a0cc9e33a408

- name: MIG_CONTROLLER_REPO
  value: openshift-migration-controller-rhel8@sha256
- name: MIG_CONTROLLER_TAG
  value: b7eadaaae8f2173328aa4782795e0911ac1e546d7d3dd72d4eb36e855fd4c6bf
- name: MIG_UI_REPO
  value: openshift-migration-ui-rhel8@sha256
- name: MIG_UI_TAG
  value: 6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11
- name: MIGRATION_REGISTRY_REPO
  value: openshift-migration-registry-rhel8@sha256
- name: MIGRATION_REGISTRY_TAG
  value: ea6301a15277d448c8756881c7e2e712893ca8041c913476640f52da9e76cad9
- name: VELERO_REPO
  value: openshift-migration-velero-rhel8@sha256
- name: VELERO_TAG
  value: 1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281
- name: VELERO_PLUGIN_REPO
  value: openshift-migration-plugin-rhel8@sha256
- name: VELERO_PLUGIN_TAG
  value: 8dbf92e2f0de49049cb376e6941ab49846ed122b6b9328881fe490fb0905fa38

The problem is still there and was reproduced twice out of 16 executions. This is the MigMigration resource showing the failure:

$ oc get migmigration ocp-25212-initcont-mig-1592578520 -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 3a461363-b23d-11ea-a9bf-0a580a810211
  creationTimestamp: "2020-06-19T14:55:47Z"
  generation: 15
  labels:
    controller-tools.k8s.io: "1.0"
  managedFields:
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:controller-tools.k8s.io: {}
      f:spec:
        .: {}
        f:migPlanRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:stage: {}
    manager: Swagger-Codegen
    operation: Update
    time: "2020-06-19T14:55:47Z"
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/touch: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"ecc5e993-954d-45cf-9051-5090b08b2b7e"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:status:
        .: {}
        f:conditions: {}
        f:errors: {}
        f:itenerary: {}
        f:observedDigest: {}
        f:phase: {}
        f:startTimestamp: {}
    manager: manager
    operation: Update
    time: "2020-06-19T14:57:42Z"
  name: ocp-25212-initcont-mig-1592578520
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: ocp-25212-initcont-migplan-1592578520
    uid: ecc5e993-954d-45cf-9051-5090b08b2b7e
  resourceVersion: "65772"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/ocp-25212-initcont-mig-1592578520
  uid: cfcbe616-189e-496c-8736-d9dfbfefbb91
spec:
  migPlanRef:
    name: ocp-25212-initcont-migplan-1592578520
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-19T14:57:42Z"
    message: 'The migration has failed. See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on deployments.apps "external-nginx-deployment": the object has been modified; please apply your changes to the latest version and try again'
  itenerary: Failed
  observedDigest: 7795ffbe75c4f149fc91d50bff959462462633922bcef51ef6900efce1678766
  phase: Completed
  startTimestamp: "2020-06-19T14:55:47Z"

The problem happened while migrating 2 deployments with init containers:

$ oc get all
NAME                                             READY   STATUS    RESTARTS   AGE
pod/external-nginx-deployment-66896f6fc6-lmdsw   1/1     Running   0          1h
pod/internal-nginx-deployment-7dc786fddb-cnwnb   1/1     Running   0          1h

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/external-nginx-service   ClusterIP   172.30.73.27     <none>        8081/TCP   1h
service/internal-nginx-service   ClusterIP   172.30.173.199   <none>        8081/TCP   1h

NAME                                        DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/external-nginx-deployment   1         1         1            1           1h
deployment.apps/internal-nginx-deployment   1         1         1            1           1h

NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/external-nginx-deployment-66896f6fc6   1         1         1       1h
replicaset.apps/internal-nginx-deployment-7dc786fddb   1         1         1       1h

NAME                                             DOCKER REPO                                                           TAGS   UPDATED
imagestream.image.openshift.io/internal-alpine   docker-registry.default.svc:5000/ocp-25212-initcont/internal-alpine   int    2 hours ago
It looks like this fix is on the master branch but not on the release-1.2.3 branch, so I wouldn't expect a 1.2.3 build to have it yet. Moving it back to POST. Tomorrow we'll need to cherry-pick it to the release branch and get a new build done.
The fix is now cherry-picked to the release-1.2.3 branch.
Verified using CAM 1.2.3 stage.

After running a reasonable number of migrations (~60), the issue was not reproduced, so I think we can safely consider it verified as fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2764