Created attachment 1730956 [details]
must-gather-file

Description of problem:

When we execute a migration, the stage pod creation can fail because the pod cannot mount a volume (FailedMount). When this happens, the migration still succeeds, but the volume data is not migrated.

Version-Release number of selected component (if applicable):

CAM 1.2.3
SOURCE CLUSTER: 3.11 AWS
TARGET CLUSTER: 4.2 AWS
REPLICATION REPOSITORY: S3

How reproducible:

Intermittent

Steps to Reproduce:

We have not been able to reproduce it.

Actual results:

When the stage pod is created in the target cluster and there are problems mounting the volume, the volume data is not migrated. The migration status is reported as successful, though.

Expected results:

The migration must fail (or at least raise a warning).

Additional info:

The must-gather file is attached. The migration with this problem is: ocp-24659-mysql-migplan-1605706788

These are the events in the target cluster's migrated namespace:

$ oc get events
LAST SEEN   TYPE      REASON                  OBJECT                          MESSAGE
115m        Normal    Scheduled               pod/mysql-1-deploy              Successfully assigned ocp-24659-mysql/mysql-1-deploy to ip-10-0-155-228.us-east-2.compute.internal
115m        Normal    Pulled                  pod/mysql-1-deploy              Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:dc4f29574e61490a3136c99f872027e2ed4a3502cb4aefab6353cccc499b9b7d" already present on machine
115m        Normal    Created                 pod/mysql-1-deploy              Created container deployment
115m        Normal    Started                 pod/mysql-1-deploy              Started container deployment
115m        Normal    Scheduled               pod/mysql-1-pjlj5               Successfully assigned ocp-24659-mysql/mysql-1-pjlj5 to ip-10-0-155-228.us-east-2.compute.internal
115m        Normal    SuccessfulAttachVolume  pod/mysql-1-pjlj5               AttachVolume.Attach succeeded for volume "pvc-ddd792c8-29a3-11eb-934a-02177daf6b80"
115m        Normal    Pulling                 pod/mysql-1-pjlj5               Pulling image "quay.io/openshifttest/mysql:5.7"
114m        Normal    Pulled                  pod/mysql-1-pjlj5               Successfully pulled image "quay.io/openshifttest/mysql:5.7"
114m        Normal    Created                 pod/mysql-1-pjlj5               Created container mysql
114m        Normal    Started                 pod/mysql-1-pjlj5               Started container mysql
115m        Normal    SuccessfulCreate        replicationcontroller/mysql-1   Created pod: mysql-1-pjlj5
116m        Normal    WaitForFirstConsumer    persistentvolumeclaim/mysql     waiting for first consumer to be created before binding
116m        Normal    ProvisioningSucceeded   persistentvolumeclaim/mysql     Successfully provisioned volume pvc-ddd792c8-29a3-11eb-934a-02177daf6b80 using kubernetes.io/aws-ebs
115m        Normal    DeploymentCreated       deploymentconfig/mysql          Created new replication controller "mysql-1" for version 1
116m        Normal    Scheduled               pod/stage-mysql-1-9bqt7-jfrw4   Successfully assigned ocp-24659-mysql/stage-mysql-1-9bqt7-jfrw4 to ip-10-0-155-228.us-east-2.compute.internal
116m        Warning   FailedAttachVolume      pod/stage-mysql-1-9bqt7-jfrw4   AttachVolume.Attach failed for volume "pvc-ddd792c8-29a3-11eb-934a-02177daf6b80" : "Error attaching EBS volume \"vol-0304082dd6c6a6561\"" to instance "i-0c12bf5d5091d3855" since volume is in "creating" state
116m        Normal    SuccessfulAttachVolume  pod/stage-mysql-1-9bqt7-jfrw4   AttachVolume.Attach succeeded for volume "pvc-ddd792c8-29a3-11eb-934a-02177daf6b80"
114m        Warning   FailedMount             pod/stage-mysql-1-9bqt7-jfrw4   Unable to mount volumes for pod "stage-mysql-1-9bqt7-jfrw4_ocp-24659-mysql(de0b1637-29a3-11eb-934a-02177daf6b80)": timeout expired waiting for volumes to attach or mount for pod "ocp-24659-mysql"/"stage-mysql-1-9bqt7-jfrw4". list of unmounted volumes=[mysql-data default-token-tbmbv]. list of unattached volumes=[mysql-data default-token-tbmbv]

We can see that the migration execution does not report a failure here:

apiVersion: migration.openshift.io/v1alpha1
kind: MigMigration
metadata:
  annotations:
    openshift.io/touch: 09007540-29a4-11eb-bc67-0a580a81020a
  creationTimestamp: "2020-11-18T13:40:15Z"
  generation: 30
  labels:
    controller-tools.k8s.io: "1.0"
    migration.openshift.io/migplan-name: ocp-24659-mysql-migplan-1605706788
  name: ocp-24659-mysql-mig-1605706788
  namespace: openshift-migration
  ownerReferences:
  - apiVersion: migration.openshift.io/v1alpha1
    kind: MigPlan
    name: ocp-24659-mysql-migplan-1605706788
    uid: 87d78219-29a3-11eb-934a-02177daf6b80
  resourceVersion: "95120"
  selfLink: /apis/migration.openshift.io/v1alpha1/namespaces/openshift-migration/migmigrations/ocp-24659-mysql-mig-1605706788
  uid: 9726dd4c-29a3-11eb-934a-02177daf6b80
spec:
  migPlanRef:
    name: ocp-24659-mysql-migplan-1605706788
    namespace: openshift-migration
  stage: false
status:
  conditions:
  - category: Advisory
    durable: true
    lastTransitionTime: "2020-11-18T13:43:26Z"
    message: The migration has completed successfully.
    reason: Completed
    status: "True"
    type: Succeeded
  itinerary: Final
  observedDigest: b5aa4b8db19eb80e26b8af29c725d849332f5ff9c561c2db78c3ad60df28a89f
  phase: Completed
  startTimestamp: "2020-11-18T13:40:15Z"
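For anyone triaging a similar run, the mismatch can be spotted with two quick checks against the migrated namespace before trusting the Succeeded condition. A minimal sketch, using the namespace and the "stage-" pod-name prefix taken from the events above:

# Warning events for volumes that failed to mount in the migrated namespace
$ oc get events -n ocp-24659-mysql --field-selector type=Warning,reason=FailedMount

# Stage pods still stuck (pods in ContainerCreating report phase Pending)
$ oc get pods -n ocp-24659-mysql --field-selector status.phase=Pending -o name | grep '^pod/stage-'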
As of MTC 1.4.z+, we expect a status condition to be raised that warns when pods are hung. There are also no stage pods in DVM (Direct Volume Migration), and we expect to deprecate indirect migration as a legacy state-transfer method. Let's verify this as fixed for 1.6.0.
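To check whether that warning condition is raised, the conditions on the MigMigration can be listed directly. A sketch using the migration name from this report; the exact condition type for hung pods is not confirmed here, so look for any Warn-category entry in the output:

$ oc get migmigration ocp-24659-mysql-mig-1605706788 -n openshift-migration \
    -o jsonpath='{range .status.conditions[*]}{.category}{"\t"}{.type}{"\t"}{.message}{"\n"}{end}'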
Indirect migrations will not be deprecated for 1.6.0; let's confirm this status condition on the engineering side and hand it to QE for verification.
Alay, I do not see any PR attached to this BZ. Can you share the PR which implements this fix?

Thanks,
Aziza
Hello,

We are having problems reproducing this issue: we are not able to force a failure in the stage pod so that it cannot mount the volume. We observed this behavior intermittently in MTC 1.2.3, but we do not see this volume-mount error now in our executions using MTC 1.6.0.
Verified using MTC 1.6.0

SOURCE CLUSTER: AWS 4.6 PROXY
TARGET CLUSTER: AWS 4.9 PROXY

(CONTROLLER + UI)
openshift-migration-rhel8-operator@sha256:7963e612abfe195c9d7781b45324c3af2d3b0fdca6900bbc9603a643b9b66cac
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: 2cfae5a025cad6e0ec421958ff9bdff1bceb6bec132d1992ea2a9e342be1c04f

The stage pod, once it cannot mount the volume, remains in the namespace in ContainerCreating status.

Moved to VERIFIED.
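For reference, the stuck-pod state described above can be confirmed with plain oc commands; a sketch, assuming the migrated namespace used earlier in this report:

# The stage pod stays in ContainerCreating and never becomes Ready
$ oc get pods -n ocp-24659-mysql | grep '^stage-'

# The Events section of describe shows the FailedMount reason
$ oc describe -n ocp-24659-mysql $(oc get pods -n ocp-24659-mysql -o name | grep '^pod/stage-')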
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Migration Toolkit for Containers (MTC) 1.6.0 security & bugfix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3694
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.