Description of problem:
In direct volume migration mode with the "Verify Copy" option checked, migrating an application whose volume data is corrupted during the migration does not raise `ResticVerifyErrors`.

Version-Release number of selected component (if applicable):
MTC 1.4.0
source cluster: ocp 4.4 aws
target cluster: ocp 4.7 aws

How reproducible:

Steps to Reproduce:
1. In the source cluster, deploy an nginx application that uses a PVC:
# oc process -f https://gitlab.cee.redhat.com/app-mig/cam-helper/raw/master/ocp-30240/nginx_with_pv_defaultsc_template.yml -p NAMESPACE=ocp-30240-datavalidation | oc create -f -
2. Create a Migration Plan with default values. Check "Verify Copy" in the "Copy options" screen of the Persistent Volumes step, and check "Use direct PV migration for filesystem copies" in Migration Options.
3. In the target cluster, create a pod that tries to corrupt the volume data:
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: pod-test
  namespace: ocp-30240-datavalidation
spec:
  containers:
  - name: podtest
    image: alpine
    command: [ "/bin/sh", "-c", "--" ]
    args: [ "while true; do >/data/vol/index.html; done;" ]
    volumeMounts:
    - name: testvolume
      mountPath: /data/vol
  volumes:
  - name: testvolume
    persistentVolumeClaim:
      claimName: nginx-html
EOF
4. Trigger the migration and look for `ResticVerifyErrors` in the MigMigration resource.

Actual results:
There was no `ResticVerifyErrors` in the MigMigration resource.

Expected results:
There should be a `ResticVerifyErrors` in the MigMigration resource.
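For step 4, a small helper along these lines could make the check repeatable. This is only a sketch: it assumes the MigMigration object has been exported as JSON (for example with `oc get migmigration <name> -o json`) and that the error appears as an entry in `status.conditions` whose `type` is `ResticVerifyErrors`. The function name and the sample object are made up for illustration and are not real cluster output.

```python
def restic_verify_errors(migmigration):
    """Return the conditions of type ResticVerifyErrors from a MigMigration dict.

    Assumes the usual Kubernetes layout, status.conditions[].type.
    """
    status = migmigration.get("status") or {}
    conditions = status.get("conditions") or []
    return [c for c in conditions if c.get("type") == "ResticVerifyErrors"]


# Hand-written sample object (hypothetical), shaped like `oc get ... -o json` output.
sample = {
    "kind": "MigMigration",
    "status": {
        "conditions": [
            {"type": "Running", "status": "True"},
            {"type": "ResticVerifyErrors", "status": "True"},
        ]
    },
}

found = restic_verify_errors(sample)
print("ResticVerifyErrors present:", bool(found))
```

An empty list from this helper corresponds to the "Actual results" reported above; a non-empty list corresponds to the "Expected results".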
Is this test one that correctly corrupted the data and threw the ResticVerifyError in the past? Curious if that test is actually sufficient or not. Scott, any thoughts on that?
Yeah, I'm not sure what's going on here. Is the implication that restic is logging errors and we're not catching it? If so, then this is a bug against that. If the implication is that there are errors that neither restic nor MTC are catching, then this is unrelated to that and a completely new request. If we're being asked to look for and report on errors that restic is not already reporting, then this feels like something that's certainly out of scope for 1.4.0. Basically, if this is not verified as a regression, we should probably push it post-1.4.0.
The issue is that the migration doesn't check the checksums of the source files when "Verify copy" is selected to verify data migrated with filesystem copy.
For indirect migration, it does work. But for direct migration, the checksum check for each of the source files seems to be missing.
Ahh, yes. So the issue is that this is a feature that we have in indirect migration but has not (yet) been implemented in direct migration. Got it.
For direct migration with rsync, I have a short-term solution that *should* catch the test scenario described in this bug, but we will want a further enhancement down the road.

Rsync performs transfer-level checksum verification out of the box: for every file it transfers, it verifies a checksum to ensure the transfer itself wasn't corrupted. Rsync also exposes a `--checksum` option, which adds a further comparison: it checks whether the checksums of files on the source differ from the checksums of files on the destination, and copies the data again to make them match.

This is different from a high-level "post-transfer" verification, where we would actually compute a checksum of the PV directory itself on the source and on the destination and compare the two. To do that, we would need to enhance the DVM transfer workflow to compare checksums of the source and destination PVCs after rsync has completed.

The latter approach requires significant changes, so for 1.4.0 I am proposing that the "verify" flag add `--checksum` to the rsync command. That gives us some additional checksum comparison on top of the default transfer-level checksums, and we can revisit this in a future release to add the post-transfer comparison.
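To make the "post-transfer" idea concrete, here is a minimal sketch of comparing per-file checksums of two directory trees. It is not the DVM implementation and not what the 1.4.0 change does. It assumes both volumes are mounted as local directories, which is not the case across two clusters; there, each side would have to compute its own digests and exchange them. The names `tree_checksums` and `diff_trees` are invented for this example.

```python
import hashlib
import os


def tree_checksums(root):
    """Map each file path (relative to root) to the SHA-256 digest of its contents."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(path, "rb") as fh:
                # Read in 1 MiB chunks so large files are not loaded into memory at once.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    digest.update(chunk)
            sums[os.path.relpath(path, root)] = digest.hexdigest()
    return sums


def diff_trees(source_root, dest_root):
    """Return sorted relative paths that are missing on one side or differ in content."""
    src = tree_checksums(source_root)
    dst = tree_checksums(dest_root)
    return sorted(p for p in src.keys() | dst.keys() if src.get(p) != dst.get(p))
```

In the scenario from this bug, where `index.html` keeps being truncated on the target, `diff_trees` would report `index.html` as long as the source copy is non-empty when the comparison runs. Note that this only compares file contents; ownership, permissions and timestamps are not covered.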
Tracking the long term changes here as a 1.4.z candidate: https://issues.redhat.com/browse/MIG-504
https://github.com/konveyor/mig-controller/pull/890
https://github.com/konveyor/mig-operator/pull/553
Verified using MTC 1.4.0.
AWS 3.11 -> AWS 4.5 (AWS S3)

openshift-migration-rhel7-operator@sha256:79f524931e7188bfbfddf1e3d23f491b627d691ef7849a42432c7aec2d5f8a54
    - name: MIG_CONTROLLER_REPO
      value: openshift-migration-controller-rhel8@sha256
    - name: MIG_CONTROLLER_TAG
      value: cdf1bd56e353f076693cb7373c0a876be8984593d664ee0d7e1aeae7a3c54c1f

When we check "Validate data" in the migration, the rsync command is executed with the --checksum flag. We can see that in the migration-controller pod's logs.

For instance, this is a command run with validate data and a limited rate:

{"level":"info","ts":1611587858.6720047,"logger":"direct|tqqph","msg":"Using Rsync command [rsync --bwlimit=2000 --archive --delete --recursive --hard-links --partial --info=COPY2,DEL2,REMOVE2,SKIP2,FLIST2,PROGRESS2,STATS2 --human-readable --port 2222 --log-file /dev/stdout --checksum /mnt/ocp-30240-datavalidation/nginx-html/ rsync://root.78.48/nginx-html]","direct":"openshift-migration/34387700-5f20-11eb-b0ca-a524f44d2dff-b5r2p"}

Moved to VERIFIED.
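The same check can be scripted against the controller logs. The sketch below assumes log lines are JSON objects shaped like the one quoted in this comment, with the rsync invocation in square brackets inside the `msg` field. The function name is made up, and the sample line is a shortened, illustrative stand-in rather than real log output.

```python
import json


def rsync_uses_checksum(log_line):
    """Return True if a 'Using Rsync command [...]' log line includes --checksum."""
    entry = json.loads(log_line)
    msg = entry.get("msg", "")
    if not msg.startswith("Using Rsync command"):
        return False
    start, end = msg.find("["), msg.rfind("]")
    if start == -1 or end <= start:
        return False
    # Split the bracketed command on whitespace and look for the exact flag.
    return "--checksum" in msg[start + 1:end].split()


# Shortened, hypothetical log line in the same shape as the controller output.
sample_line = (
    '{"level":"info","logger":"direct|example",'
    '"msg":"Using Rsync command [rsync --archive --checksum /mnt/src/ rsync://example/dst]"}'
)

print(rsync_uses_checksum(sample_line))
```

Matching on whole whitespace-separated tokens avoids false positives from other options that merely contain the word "checksum".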
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Migration Toolkit for Containers (MTC) tool image release advisory 1.4.0), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:5329
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days