Bug 1845157

Summary: Quiesced migrations fail when they have been previously staged
Product: OpenShift Container Platform Reporter: Sergio <sregidor>
Component: Migration Tooling Assignee: Scott Seago <sseago>
Status: CLOSED ERRATA QA Contact: Xin jiang <xjiang>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 4.5 CC: ernelson, jmatthew, jmontleo, mberube, rjohnson, whu
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1845997 1845998 Environment:
Last Closed: 2020-06-17 00:04:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1845997    
Bug Blocks:    

Description Sergio 2020-06-08 14:44:43 UTC
Description of problem:
When we execute a migration that has been previously run as "stage", the migration fails.

Version-Release number of selected component (if applicable):
CAM 1.2.2 stage
SOURCE: OCP 3.11 AWS
TARGET: OCP 4.4 AWS
AWS S3 BUCKET


How reproducible:
Always

Steps to Reproduce:
1. Create a namespace with an nginx application

oc process -p NAMESPACE=bztest -f  https://gitlab.cee.redhat.com/app-mig/cam-helper/raw/master/ocp-26160/nginx_with_pv_defaultsc_template.yml | oc create -f -

2. Feed the data

oc -n bztest rsh $(oc get pods -n bztest -o jsonpath='{.items[0].metadata.name}') sh -c 'echo "<h1>HELLO WORLD</h1>" > /usr/share/nginx/html/index.html'

3. Execute a "stage" migration with this namespace

4. Execute a migration with this namespace


Actual results:
The stage migration runs OK, but the actual migration fails, and the following error is displayed in the MigMigration resource:

status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-08T12:59:10Z"
    message: '[1] Stage pods created.'
    status: "True"
    type: StagePodsCreated
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-06-08T13:00:12Z"
    message: 'The migration has failed.  See: Errors.'
    reason: QuiesceApplications
    status: "True"
    type: Failed
  errors:
  - 'Operation cannot be fulfilled on replicasets.extensions "nginx-deployment-557dd97bf8":
    the object has been modified; please apply your changes to the latest version
    and try again'
  itenerary: Failed
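
For context, the error above is the standard Kubernetes optimistic-concurrency conflict: an update computed from a stale resourceVersion is rejected. The following is a minimal in-memory sketch of that behaviour, not mig-controller or API server code. `FakeStore`, `Conflict`, and the flat object layout are hypothetical names invented for illustration.

```python
import copy


class Conflict(Exception):
    """Raised when an update is based on a stale resourceVersion."""


class FakeStore:
    """Toy stand-in for the API server, holding a single object."""

    def __init__(self, obj):
        self._obj = copy.deepcopy(obj)
        self._obj["resourceVersion"] = 1

    def get(self):
        # Each reader gets its own copy, stamped with the current version.
        return copy.deepcopy(self._obj)

    def update(self, obj):
        # Reject writes that were computed from an older version.
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise Conflict("the object has been modified")
        self._obj = copy.deepcopy(obj)
        self._obj["resourceVersion"] += 1


store = FakeStore({"name": "nginx-deployment-557dd97bf8", "replicas": 1})

# Two writers read the same version of the ReplicaSet.
first = store.get()
second = store.get()

# The first writer scales it down and succeeds.
first["replicas"] = 0
store.update(first)

# The second writer's copy is now stale, so its update is rejected.
second["replicas"] = 0
try:
    store.update(second)
    outcome = "updated"
except Conflict:
    outcome = "conflict"
```

In this sketch the second writer would have to re-read the object and retry, which is what the "apply your changes to the latest version and try again" part of the message asks for.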


Expected results:
The migration should be executed without problems.

Additional info:

Comment 1 Sergio 2020-06-08 16:28:15 UTC
It doesn't seem to be necessary to "stage" the migration first. I could reproduce the failure consistently by staging migrations, but all of a sudden it stopped happening. Now it happens intermittently in quiesced migrations, and I cannot reproduce it consistently.

Comment 3 Scott Seago 2020-06-10 18:22:56 UTC
PR is here: https://github.com/konveyor/mig-controller/pull/565

Comment 4 Scott Seago 2020-06-10 18:23:31 UTC
ReplicaSets can exist standalone or as part of a Deployment.
Standalone ReplicaSets should be quiesced, but ReplicaSets that
belong to a Deployment should not, because the Deployment quiesce
already handles them. Without skipping them, we run into a race
condition with two things modifying the ReplicaSet at once.
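
The rule described in this comment can be sketched as an owner-reference check. This is only an illustration over plain dictionaries, assuming that a ReplicaSet created by a Deployment carries an ownerReferences entry of kind Deployment (standard Kubernetes behaviour). The function names are hypothetical, and the actual change is the one in the PR linked in comment 3, which may be implemented differently.

```python
def owned_by_deployment(replica_set):
    """Return True if any ownerReference on the ReplicaSet is a Deployment."""
    owners = replica_set.get("metadata", {}).get("ownerReferences") or []
    return any(owner.get("kind") == "Deployment" for owner in owners)


def replica_sets_to_quiesce(replica_sets):
    """Keep only standalone ReplicaSets.

    Deployment-owned ReplicaSets are skipped because quiescing the
    Deployment already scales them down.
    """
    return [rs for rs in replica_sets if not owned_by_deployment(rs)]


replica_sets = [
    {   # created by a Deployment
        "metadata": {
            "name": "nginx-deployment-557dd97bf8",
            "ownerReferences": [
                {"kind": "Deployment", "name": "nginx-deployment"},
            ],
        }
    },
    {   # created directly, no owner
        "metadata": {"name": "standalone-rs"}
    },
]

names = [rs["metadata"]["name"] for rs in replica_sets_to_quiesce(replica_sets)]
```

With this filter only the standalone ReplicaSet is scaled directly, so the Deployment quiesce is the single writer for the ReplicaSets it owns.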

Comment 8 whu 2020-06-15 11:41:26 UTC
Verified in the new CAM 1.2.2 stage image. The fix works.

CAM 1.2.2 information :
openshift-migration-rhel7-operator@sha256:ab124c3917a2ea22e03618f287c629e727bbcdf7ec76db5e7d0f8654064b7a52
openshift-migration-controller-rhel8@sha256:ca9ab7ecf0d939afa1aae2540bb3daf5d7ce651ad58b94c6987484d12af1d211
openshift-migration-ui-rhel8@sha256:6abfaea8ac04e3b5bbf9648a3479b420b4baec35201033471020c9cae1fe1e11

Comment 10 errata-xmlrpc 2020-06-17 00:04:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:2571