Created attachment 1696477 [details]
Copy of the test run artifacts, for when the Prow URL expires

In the following CI run:
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/1226/pull-containerized-data-importer-e2e-k8s-1.17-ceph/1270060648532807680

we can see (in the artifacts) that the test failed due to filesystem corruption:

  MountVolume.MountDevice failed for volume "pvc-40f14e2f-ef22-4fcd-a0a4-b8730789e89a" :
  rpc error: code = Internal desc = 'xfs_repair' found errors on device /dev/rbd0 but could not correct them:
  Phase 1 - find and verify superblock...
  Phase 2 - using internal log
          - zero log...
  ERROR: The filesystem has valuable metadata changes in a log which needs to
  be replayed. Mount the filesystem to replay the log, and unmount it before
  re-running xfs_repair. If you are unable to mount the filesystem, then use
  the -L option to destroy the log and attempt a repair. Note that destroying
  the log may cause corruption -- please attempt a mount of the filesystem
  before doing this.

(Thanks to Niels de Vos) This happens because a snapshot was taken before the filesystem was unmounted.
Source volume:
- pv: pvc-4389fd29-37ed-43e6-877a-ec114a93857f
- volume: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Snapshot:
- snapshot-8e123d7a-3a51-4d95-86f5-8f11e0e976d1
- id: 0001-0009-rook-ceph-0000000000000001-c50abf73-a9b8-11ea-b649-fe137ae731b8
- source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Problem volume:
- source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8
- volume: 0001-0009-rook-ceph-0000000000000001-c675249e-a9b8-11ea-b649-fe137ae731b8

Timeline:
18:49:13.476232 successful mount of source volume
18:49:25.452584 snapshot of source volume created
18:49:27.700401 attempted snapshot delete (failed, not critical)
18:49:44.437652 corruption detected on problem volume
18:49:50.315836 unmounting source volume

I believe this issue is specific to smart cloning (or maybe read-write-many). If the cloning is done using a pod, Kubernetes will not allow a second mount of the filesystem before the first one is done.
After some discussion, it seems that we will need to update the controller logic for both smart cloning and host-assisted cloning to check that the source PVC is not still mounted. There are four combinations to consider:

- RWX/host-assisted clone - the bug can happen any time the source PVC is in use
- RWX/smart clone - the bug can happen any time the source PVC is in use
- RWO/host-assisted clone - can happen if the clone operation is scheduled on the same node as a previous user
- RWO/smart clone - can happen any time the source PVC is in use

Even if the controller checks for a user of the source PVC prior to cloning, there is still a race: we cannot reasonably stop someone else from starting to use the source PVC while the clone is in progress.
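The pre-clone check described above can be sketched in pseudologic as follows. This is only an illustration, not CDI's actual implementation: the pod objects are simplified dicts rather than real Kubernetes API objects, and the function names are invented here.

```python
# Illustrative sketch: decide whether a clone of `source_pvc` may start,
# given the pods currently visible in the namespace. A pod "uses" the PVC
# if any of its volumes reference that claim and the pod has not finished.

def pods_using_pvc(pods, source_pvc):
    """Return names of pods that still hold a reference to source_pvc."""
    users = []
    for pod in pods:
        if pod["phase"] in ("Succeeded", "Failed"):
            continue  # finished pods no longer hold the mount
        for vol in pod.get("volumes", []):
            claim = vol.get("persistentVolumeClaim", {})
            if claim.get("claimName") == source_pvc:
                users.append(pod["name"])
                break
    return users

def clone_may_start(pods, source_pvc):
    # Note: even when this returns True, a new pod can be scheduled
    # immediately afterwards -- this check narrows, but cannot close,
    # the race described above.
    return not pods_using_pvc(pods, source_pvc)

pods = [
    {"name": "virt-launcher-src", "phase": "Running",
     "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]},
    {"name": "importer-done", "phase": "Succeeded",
     "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]},
]
print(pods_using_pvc(pods, "dv-source"))  # ['virt-launcher-src']
print(clone_may_start(pods, "dv-source"))  # False
```

The check covers all four combinations above in the same way; the RWO/host-assisted case could in principle be relaxed to same-node users only, but a namespace-wide check is the conservative choice.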
According to the above, this bug also affects host-assisted clone, so we consider it an issue we need to fix in 2.4.
There is no reasonable way to eliminate the possibility of data corruption when cloning, as Kubernetes gives no guarantee that a PVC cannot be mounted/modified while the clone operation is in progress. But to address this issue, CDI will make sure that no pods are accessing the affected PVCs in read/write mode before the clone starts. Mounting a PVC read-only should be okay.
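The "read/write users only" rule above can be sketched as a small filter: a pod that mounts the source PVC with readOnly set does not block the clone, while any read/write mount does. Again, the pod and volume shapes here are simplified dicts invented for illustration, not real API objects or CDI code.

```python
# Sketch: collect only the pods that mount `pvc_name` read/write.
# A persistentVolumeClaim volume source with readOnly: true is ignored,
# since read-only access cannot corrupt the clone.

def rw_users(pods, pvc_name):
    users = []
    for pod in pods:
        for vol in pod.get("volumes", []):
            claim = vol.get("persistentVolumeClaim")
            if (claim
                    and claim.get("claimName") == pvc_name
                    and not claim.get("readOnly", False)):
                users.append(pod["name"])
                break
    return users

pods = [
    {"name": "reader", "volumes": [
        {"persistentVolumeClaim": {"claimName": "src", "readOnly": True}}]},
    {"name": "writer", "volumes": [
        {"persistentVolumeClaim": {"claimName": "src"}}]},
]
print(rw_users(pods, "src"))  # ['writer']
```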
Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV? I would expect any cloning that is susceptible to concurrent writing to first obtain such a lock. Is this something we can fix for 2.4, or does it require much bigger upstream work?
> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?

Nope. The linked PR does about all we can. I don't think there will ever be any such locking mechanism.
> Nope.
> The linked PR does about all we can. I don't think there will ever be any such locking mechanism.

Can you discuss this in upstream k8s, or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.
> Can you discuss this in upstream k8s or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.

If the goal is to have a consistent filesystem snapshot without errors, there are two options:

1. Make sure that no pods have the PVC mounted for the entirety of the snapshot operation.
2. Manually quiesce the app in the pod(s) that have the PVC mounted and freeze the filesystem.

The community is focused on the second, as the PVCs you care about snapshotting are probably in use. The current initiative is called Container Notifier:
https://docs.google.com/document/d/1Q4UMrx9r58LUNOHdn_4nDhqHN21iOOYQeViMKSPa_Ks/edit#heading=h.xh2wqx7h9cqm

Sounds like you are advocating for the first case, which I think has limited utility, but it could maybe be implemented with a "Lock" CRD and a validating webhook on Pods. Even then, there may still be race conditions.
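The hypothetical "Lock" CRD plus validating-webhook idea from option 1 could look roughly like this decision function. Everything here is invented for illustration (no such CRD exists): the webhook would reject any new Pod that references a PVC currently locked for a clone/snapshot operation.

```python
# Sketch of the admission decision for a hypothetical "Lock" CRD:
# given the set of locked PVC names, allow or deny a new Pod.
# Pod shape is a simplified dict, not a real AdmissionReview object.

def admit_pod(pod, locked_pvcs):
    """Return (allowed, message) for a Pod admission request."""
    for vol in pod.get("volumes", []):
        claim = vol.get("persistentVolumeClaim", {})
        name = claim.get("claimName")
        if name in locked_pvcs:
            return False, f"PVC {name} is locked for a clone/snapshot operation"
    return True, "allowed"

locked = {"dv-source"}
pod = {"name": "late-writer",
       "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]}
print(admit_pod(pod, locked))  # (False, 'PVC dv-source is locked for a clone/snapshot operation')
```

Note that even with such a webhook, pods admitted before the lock was created can still hold the PVC mounted, which is one of the race conditions mentioned above.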
I'm not sure how option 2 can work without higher-level orchestration forbidding new mounts while the snapshot takes place. I see this as a big challenge; I think it should be tracked elsewhere, but maybe I'm missing something in the big picture.
(In reply to Dan Kenigsberg from comment #6)

> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?
> I would expect any cloning that is susceptible to concurrent writing to
> first obtain such a lock.

No. The platform does not offer any guarantees for this. You are expected to know what runs in your namespace and to ensure the workload behaves (i.e. you can provide your own synchronization mechanism if required).

> Is this something we can fix for 2.4, or does it require much bigger
> upstream work?

Adding additional locking mechanisms would require a large upstream effort, and I don't really see it gaining much traction.

---

The way I see it, the causes of this type of corruption fall into two categories: 1) platform-caused and 2) application-caused.

#1 covers Kubernetes and OpenShift components, including KubeVirt. If these components cause corruption given normal, allowed user behavior, that is a bug.

#2 includes scenarios where a user causes corruption by starting additional Pods that write to PVCs when they should not. We cannot reasonably prevent users from shooting themselves in the foot. Namespaces and related RBAC can help ensure that only the workloads a user wants to run can actually access PVCs.

In summary, this bug identifies a situation where CDI steps on its own toes and causes corruption. We can and should correct this specific scenario.
Test RWO/host-assisted clone on OCP 4.5 / CNV 2.4, virt-cdi-operator-container-v2.4.0-26

Verify steps:

1. Create a source dv

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  source:
    http:
      url: http://<URL>/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

2. Create a vm with the source pvc

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: cirros-vmi-src
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv-disk
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv-source
    metadata:
      labels:
        kubevirt.io/vm: cirros-vm-src
        kubevirt.io/domain: cirros-vm-src
  running: true

3. Clone the source pvc

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-target2
spec:
  source:
    pvc:
      name: dv-source
      namespace: default
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

The clone did not proceed while the source PVC was in use by a VM. After deleting the VM, the clone continued.

@Michael @Adam Is this the expected behavior?

And a small question: where can I find the related log/warning that the clone will not proceed because the PVC is in use?
I think that without such a log message the user may get a little confused. I tried the pod logs and described the dv/pod, but it seems I didn't get any related message:

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get dv
NAME         PHASE             PROGRESS   RESTARTS   AGE
dv-source    Succeeded         100.0%     0          12m
dv-target2   CloneInProgress   N/A        0          8m28s

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get pod
NAME                                 READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2                1/1     Running   0          6m49s
virt-launcher-cirros-vmi-src-69tm7   1/1     Running   0          10m

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc logs cdi-upload-dv-target2
I0708 06:55:56.887881       1 uploadserver.go:63] Upload destination: /data/disk.img
I0708 06:55:56.888073       1 uploadserver.go:65] Running server on 0.0.0.0:8443
> After deleting the VM, the clone could be continued.
> @Michael @Adam Is this the expected behavior?

Yes, that is expected behavior.

> And a small question: where can I find the related log/warning that the clone
> will not proceed because the PVC is in use? I think that without such a log
> message the user may get a little confused. I tried the pod logs and described
> the dv/pod, but it seems I didn't get any related message.

There should be events generated.

For smart clone:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/datavolume-controller.go#L296-L297

For network clone:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/clone-controller.go#L227-L228
Thanks Michael. Got the "pvc in use" info in the events:

6s    Normal    CloneInProgress    datavolume/dv-target2              Cloning from default/dv-source into test/dv-target2 in progress
0s    Warning   CloneSourceInUse   persistentvolumeclaim/dv-target2   pod default/virt-launcher-cirros-vmi-src-xfnmw using PersistentVolumeClaim dv-source
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194