Bug 1847993 - Migration sometimes fails in EnsureLabelsDeleted phase
Summary: Migration sometimes fails in EnsureLabelsDeleted phase
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Migration Tooling
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.4.z
Assignee: Scott Seago
QA Contact: Xin jiang
URL:
Whiteboard:
Depends On: 1848041
Blocks:
 
Reported: 2020-06-17 14:04 UTC by Sergio
Modified: 2023-10-06 20:40 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned To: 1848041
Environment:
Last Closed: 2020-06-30 06:54:41 UTC
Target Upstream Version:
Embargoed:


Attachments
all logs (6.13 MB, application/zip), 2020-06-17 14:04 UTC, Sergio
deployment template (2.61 KB, text/plain), 2020-06-17 14:06 UTC, Sergio
Migmigration resource (1.39 KB, text/plain), 2020-06-17 14:06 UTC, Sergio
migplan resource (2.21 KB, text/plain), 2020-06-17 14:06 UTC, Sergio
controller logs (25.29 KB, text/plain), 2020-06-17 14:07 UTC, Sergio


Links
Red Hat Product Errata RHBA-2020:2764, last updated 2020-06-30 06:54:44 UTC

Description Sergio 2020-06-17 14:04:51 UTC
Created attachment 1697818 [details]
all logs

Description of problem:
Sometimes the migration fails during the EnsureLabelsDeleted phase because of a conflict while updating the replicationcontroller resource.

If the migration is run again, it completes successfully.
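
For context, here is a minimal, self-contained Go sketch of how this kind of conflict arises. This is not mig-controller code; the kubeconfig path, namespace, and label key are made-up placeholders. Every Kubernetes update is guarded by metadata.resourceVersion, so an Update sent from a stale copy is rejected with exactly the "object has been modified" error quoted below.

package main

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Placeholder kubeconfig path and namespace; "mysql-1" is the
	// replicationcontroller named in the error below.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)
	ctx := context.TODO()

	// 1. Fetch a copy of the RC; it carries metadata.resourceVersion.
	rc, err := client.CoreV1().ReplicationControllers("mysql").Get(ctx, "mysql-1", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// 2. Mutate our local copy, e.g. drop a migration label (hypothetical key).
	delete(rc.Labels, "example.openshift.io/migration-label")

	// 3. Write it back. If anything else (such as the DC controller) updated
	// the RC in between, the apiserver rejects the stale resourceVersion with
	// a 409 Conflict, which surfaces as the "object has been modified" error.
	_, err = client.CoreV1().ReplicationControllers("mysql").Update(ctx, rc, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		fmt.Println("conflict: re-read the object and retry the update")
	}
}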


Version-Release number of selected component (if applicable):
CAM 1.2.2 
SOURCE CLUSTER: OCP 3.11 AWS
TARGET CLUSTER: OCP 4.4 AWS
NOOBAA BUCKET

How reproducible:
Intermittent

Steps to Reproduce:
1. The problem happened while migrating a mysql DeploymentConfig. The Jinja2 template that we use to deploy this application is attached, but the failure should be reproducible with any DC when the race condition is hit.

Actual results:
The migration fails in the EnsureLabelsDeleted stage, and the MigMigration resource shows this failure:

status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-17T12:54:49Z"
    message: 'The migration has failed.  See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on replicationcontrollers "mysql-1": the object
    has been modified; please apply your changes to the latest version and try again'
  itenerary: Failed


If the migration is run again, it completes successfully.


Expected results:
The migration should complete without errors.

Additional info:

Full MigMigration resource:

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: baced36f-b099-11ea-8cbd-0a580a820268
  creationTimestamp: "2020-06-17T12:50:26Z"
  generation: 30
  labels:
    controller-tools.k8s.io: "1.0"
  name: ocp-28967-migplan-naming-mig-1592398188
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: ocp-28967-migplan-naming.migplan.1592398188
    uid: 2aeb54f6-8389-4fd3-9fa8-468135f57f5b
  resourceVersion: "117389"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/ocp-28967-migplan-naming-mig-1592398188
  uid: 1204a4ca-e660-4f5d-90ff-95418a197aea
spec:
  migPlanRef:
    name: ocp-28967-migplan-naming.migplan.1592398188
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-17T12:54:49Z"
    message: 'The migration has failed.  See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on replicationcontrollers "mysql-1": the object
    has been modified; please apply your changes to the latest version and try again'
  itenerary: Failed
  observedDigest: d67b950bb9516f04ae2a6bdf60b0c8e70aea4ef3e52bf63f7278cd1826fa075a
  phase: Completed
  startTimestamp: "2020-06-17T12:50:26Z"

Comment 1 Sergio 2020-06-17 14:06:02 UTC
Created attachment 1697819 [details]
deployment template

Comment 2 Sergio 2020-06-17 14:06:33 UTC
Created attachment 1697820 [details]
Migmigration resource

Comment 3 Sergio 2020-06-17 14:06:58 UTC
Created attachment 1697821 [details]
migplan resource

Comment 4 Sergio 2020-06-17 14:07:19 UTC
Created attachment 1697822 [details]
controller logs

Comment 5 Scott Seago 2020-06-17 20:32:46 UTC
Fix is here: https://github.com/konveyor/mig-controller/pull/571

When there's a resource conflict, it means the resource in OpenShift/Kubernetes was modified between the time our controller fetched it and the time we attempted to update it. When a migration task hits a conflict error like this, we should requeue the reconcile and try again rather than fail the migration.
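
To illustrate the idea, here is a hedged sketch of the approach described above. This is not the actual code from PR 571; the function name is hypothetical.

package migmigration

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// handlePhaseError sketches the behavior described above: a conflict error
// from a migration phase triggers a requeue instead of a failed migration.
func handlePhaseError(err error) (reconcile.Result, error) {
	if err == nil {
		return reconcile.Result{}, nil
	}
	if apierrors.IsConflict(err) {
		// The object changed between our Get and our Update; retrying with
		// a fresh copy normally succeeds, so requeue the reconcile.
		return reconcile.Result{Requeue: true}, nil
	}
	// Any other error is still treated as a migration failure (simplified).
	return reconcile.Result{}, err
}

For one-off updates, client-go also ships the standard retry.RetryOnConflict helper (k8s.io/client-go/util/retry), which re-reads the object and re-applies the change in a loop.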

Comment 9 Sergio 2020-06-19 16:39:27 UTC
Tested using CAM 1.2.3 stage, 3.11 -> 4.5 on AWS with an AWS bucket. The error is still happening.

openshift-migration-rhel7-operator@sha256:58b41647b27dfc1791bacedd998b02725bce6ffe5ae9577c7b33a0cc9e33a408

    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: b7eadaaae8f2173328aa4782795e0911ac1e546d7d3dd72d4eb36e855fd4c6bf
    - name: MIG_UI_REPO
      value: openshift-migration-ui-rhel8@sha256
    - name: MIG_UI_TAG
      value: 6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11
    - name: MIGRATION_REGISTRY_REPO
      value: openshift-migration-registry-rhel8@sha256
    - name: MIGRATION_REGISTRY_TAG
      value: ea6301a15277d448c8756881c7e2e712893ca8041c913476640f52da9e76cad9
    - name: VELERO_REPO
      value: openshift-migration-velero-rhel8@sha256
    - name: VELERO_TAG
      value: 1a33e327dd610f0eebaaeae5b3c9b4170ab5db572b01a170be35b9ce946c0281
    - name: VELERO_PLUGIN_REPO
      value: openshift-migration-plugin-rhel8@sha256
    - name: VELERO_PLUGIN_TAG
      value: 8dbf92e2f0de49049cb376e6941ab49846ed122b6b9328881fe490fb0905fa38



The problem is still there; it was reproduced twice in 16 executions. This is the MigMigration resource showing the failure:


$ oc get migmigration ocp-25212-initcont-mig-1592578520 -o yaml
apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 3a461363-b23d-11ea-a9bf-0a580a810211
  creationTimestamp: "2020-06-19T14:55:47Z"
  generation: 15
  labels:
    controller-tools.k8s.io: "1.0"
  managedFields:
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:controller-tools.k8s.io: {}
      f:spec:
        .: {}
        f:migPlanRef:
          .: {}
          f:name: {}
          f:namespace: {}
        f:stage: {}
    manager: Swagger-Codegen
    operation: Update
    time: "2020-06-19T14:55:47Z"
  - apiVersion: migration.openshift.io/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:openshift.io/touch: {}
        f:ownerReferences:
          .: {}
          k:{"uid":"ecc5e993-954d-45cf-9051-5090b08b2b7e"}:
            .: {}
            f:apiVersion: {}
            f:kind: {}
            f:name: {}
            f:uid: {}
      f:status:
        .: {}
        f:conditions: {}
        f:errors: {}
        f:itenerary: {}
        f:observedDigest: {}
        f:phase: {}
        f:startTimestamp: {}
    manager: manager
    operation: Update
    time: "2020-06-19T14:57:42Z"
  name: ocp-25212-initcont-mig-1592578520
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: ocp-25212-initcont-migplan-1592578520
    uid: ecc5e993-954d-45cf-9051-5090b08b2b7e
  resourceVersion: "65772"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/ocp-25212-initcont-mig-1592578520
  uid: cfcbe616-189e-496c-8736-d9dfbfefbb91
spec:
  migPlanRef:
    name: ocp-25212-initcont-migplan-1592578520
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-19T14:57:42Z"
    message: 'The migration has failed.  See: Errors.'
    reason: EnsureLabelsDeleted
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on deployments.apps "external-nginx-deployment":
    the object has been modified; please apply your changes to the latest version
    and try again'
  itenerary: Failed
  observedDigest: 7795ffbe75c4f149fc91d50bff959462462633922bcef51ef6900efce1678766
  phase: Completed
  startTimestamp: "2020-06-19T14:55:47Z"


The problem happened while migrating 2 deployments with init containers:

$ oc get all
NAME                                             READY     STATUS    RESTARTS   AGE
pod/external-nginx-deployment-66896f6fc6-lmdsw   1/1       Running   0          1h
pod/internal-nginx-deployment-7dc786fddb-cnwnb   1/1       Running   0          1h

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)    AGE
service/external-nginx-service   ClusterIP   172.30.73.27     <none>        8081/TCP   1h
service/internal-nginx-service   ClusterIP   172.30.173.199   <none>        8081/TCP   1h

NAME                                        DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/external-nginx-deployment   1         1         1            1           1h
deployment.apps/internal-nginx-deployment   1         1         1            1           1h

NAME                                                   DESIRED   CURRENT   READY     AGE
replicaset.apps/external-nginx-deployment-66896f6fc6   1         1         1         1h
replicaset.apps/internal-nginx-deployment-7dc786fddb   1         1         1         1h

NAME                                             DOCKER REPO                                                           TAGS      UPDATED
imagestream.image.openshift.io/internal-alpine   docker-registry.default.svc:5000/ocp-25212-initcont/internal-alpine   int       2 hours ago

Comment 10 Scott Seago 2020-06-22 00:16:24 UTC
It looks like this fix is on the master branch but not on the release-1.2.3 branch, so I wouldn't expect a 1.2.3 build to have it yet. Moving the bug back to POST. Tomorrow we'll need to cherry-pick the fix to the release branch and get a new build done.

Comment 11 Scott Seago 2020-06-22 12:08:58 UTC
The fix is now cherry-picked to the release-1.2.3 branch.

Comment 14 Sergio 2020-06-24 12:35:39 UTC
Verified using CAM 1.2.3 stage

After running a reasonable number of migrations (~60), the issue was not reproduced, so we can safely consider it verified as fixed.

Comment 16 errata-xmlrpc 2020-06-30 06:54:41 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2764

