Bug 1845901

Summary: Filesystem corruption related to smart clone
Product: Container Native Virtualization (CNV)
Component: Storage
Version: 2.4.0
Hardware: All
OS: All
Status: CLOSED ERRATA
Severity: high
Priority: high
Target Milestone: ---
Target Release: 2.4.0
Reporter: Maya Rashish <mrashish>
Assignee: Michael Henriksen <mhenriks>
QA Contact: Yan Du <yadu>
CC: alitke, cnv-qe-bugs, danken, mhenriks, ncredi, ngavrilo
Fixed In Version: virt-cdi-operator-container-v2.4.0-26
Last Closed: 2020-07-28 19:10:33 UTC
Type: Bug
Attachments:
  Copy of the test run artifacts, for when the Prow URL expires

Description Maya Rashish 2020-06-10 11:48:47 UTC
Created attachment 1696477 [details]
Copy of the test run artifacts, for when the Prow URL expires

In the following CI run:
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/1226/pull-containerized-data-importer-e2e-k8s-1.17-ceph/1270060648532807680

We can see (in the artifacts) that the test failed due to filesystem corruption.

MountVolume.MountDevice failed for volume "pvc-40f14e2f-ef22-4fcd-a0a4-b8730789e89a" : rpc error: code = Internal desc = 'xfs_repair' found errors on device /dev/rbd0 but could not correct them: Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

(Thanks to Niels de Vos)

This happens because a snapshot was taken before the filesystem was unmounted.

Source volume:
 - pv: pvc-4389fd29-37ed-43e6-877a-ec114a93857f
 - volume: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Snapshot:
 - snapshot-8e123d7a-3a51-4d95-86f5-8f11e0e976d1
 - id: 0001-0009-rook-ceph-0000000000000001-c50abf73-a9b8-11ea-b649-fe137ae731b8
 - source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Problem volume:
 - source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8
 - volume: 0001-0009-rook-ceph-0000000000000001-c675249e-a9b8-11ea-b649-fe137ae731b8

18:49:13.476232 successful mount of source volume
18:49:25.452584 snapshot of source volume created
18:49:27.700401 attempted snapshot delete (failed, not critical)
18:49:44.437652 corruption detected on problem volume
18:49:50.315836 unmounting source volume


I believe this issue is specific to smart cloning (or maybe read-write-many).
If the cloning is done using a pod, Kubernetes will not allow a second mount of the filesystem until the first one is released.

Comment 2 Adam Litke 2020-06-10 13:19:34 UTC
After some discussions it seems that we will need to update the controller logic for both smart cloning and host-assisted cloning to check that the source PVC is not still mounted.

There are four combinations to consider:
- RWX/host-assisted clone
  - Bug can happen any time the source pvc is in use
- RWX/Smart clone
  - Bug can happen any time the source pvc is in use
- RWO/host-assisted clone
  - Can happen if clone operation is scheduled on the same node as a previous user
- RWO/Smart clone
  - Can happen any time the source PVC is in use

Even if the controller checks for a user of the source pvc prior to cloning, there is still a race and we cannot do anything reasonable to stop someone else from using the source pvc during clone.

Comment 4 Ying Cui 2020-06-17 12:19:57 UTC
Since this bug also affects host-assisted clone, we consider it an issue we need to fix in 2.4.

Comment 5 Michael Henriksen 2020-06-23 16:57:55 UTC
There is no reasonable way to eliminate the possibility of data corruption when cloning, as Kubernetes gives no guarantees that a PVC cannot be mounted/modified while the clone operation is in progress.  But to address this issue, CDI will make sure that there are no pods accessing the affected PVCs in read/write mode.  Mounting a PVC read-only should be okay.
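
For illustration, a minimal sketch of the kind of "pods using the PVC read/write" check described above, written against client-go. This is not the actual CDI code; the package and function names are made up, and init/ephemeral containers are skipped for brevity:

// Sketch only: not the actual CDI implementation.
package pvccheck

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// PodsUsingPVCReadWrite lists the pods in a namespace that mount the given
// PVC with at least one non-read-only volume mount. A clone controller could
// refuse to start (or requeue) while this list is non-empty.
func PodsUsingPVCReadWrite(ctx context.Context, c kubernetes.Interface, namespace, pvcName string) ([]corev1.Pod, error) {
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	var users []corev1.Pod
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			pvc := vol.PersistentVolumeClaim
			if pvc == nil || pvc.ClaimName != pvcName || pvc.ReadOnly {
				continue
			}
			// The volume references our PVC; check whether any container
			// mounts it writable.
			if mountedWritable(&pod, vol.Name) {
				users = append(users, pod)
				break
			}
		}
	}
	return users, nil
}

func mountedWritable(pod *corev1.Pod, volumeName string) bool {
	for _, ctr := range pod.Spec.Containers {
		for _, m := range ctr.VolumeMounts {
			if m.Name == volumeName && !m.ReadOnly {
				return true
			}
		}
	}
	return false
}

As noted in comment 2, this check is inherently racy: a pod can start using the source PVC right after the check passes.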

Comment 6 Dan Kenigsberg 2020-06-29 08:54:12 UTC
Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV? I would expect any cloning that is susceptible to concurrent writing to first obtain such a lock.

Is this something we can fix for 2.4, or does it require a much bigger upstream work?

Comment 7 Michael Henriksen 2020-06-29 11:48:22 UTC
> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?

Nope.

The linked PR does about all we can.  I don't think there will ever be any such locking mechanism.

Comment 8 Dan Kenigsberg 2020-06-29 12:00:27 UTC
> Nope.

> The linked PR does about all we can.  I don't think there will ever be any such locking mechanism.

Can you discuss this in upstream k8s or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.

Comment 9 Michael Henriksen 2020-06-29 14:05:06 UTC
> Can you discuss this in upstream k8s or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.

If the goal is to have a consistent filesystem snapshot without errors, there are two options:

1.  Make sure that no pods have the PVC mounted for the entirety of the snapshot operation.

2.  Manually quiesce the app in the pod(s) that have the PVC mounted and freeze the filesystem.

The community is focused on the second as PVCs that you care about snapshotting are probably in use.  The current initiative is called Container Notifier:  https://docs.google.com/document/d/1Q4UMrx9r58LUNOHdn_4nDhqHN21iOOYQeViMKSPa_Ks/edit#heading=h.xh2wqx7h9cqm
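
For reference, "freeze the filesystem" here means what fsfreeze(8) does, i.e. the FIFREEZE/FITHAW ioctls on the mount point. A minimal Linux-only sketch, assuming the golang.org/x/sys/unix package (the package name is illustrative):

// Sketch only: freeze/thaw a mounted filesystem the way fsfreeze(8) does.
package fsquiesce

import (
	"os"

	"golang.org/x/sys/unix"
)

// Freeze blocks new writes to the filesystem mounted at mountPoint until Thaw is called.
func Freeze(mountPoint string) error {
	f, err := os.Open(mountPoint)
	if err != nil {
		return err
	}
	defer f.Close()
	return unix.IoctlSetInt(int(f.Fd()), unix.FIFREEZE, 0)
}

// Thaw resumes writes on a previously frozen filesystem.
func Thaw(mountPoint string) error {
	f, err := os.Open(mountPoint)
	if err != nil {
		return err
	}
	defer f.Close()
	return unix.IoctlSetInt(int(f.Fd()), unix.FITHAW, 0)
}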

Sounds like you are advocating for the first case, which I think has limited utility, but could maybe be implemented with a "Lock" CRD and a validating webhook on Pods.  But there may still be race conditions.
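
A rough sketch of what such a validating webhook handler could look like, purely hypothetical: the set of "locked" PVCs is hard-coded here and would really be maintained by a controller for the Lock CRD.

// Sketch only: the "validating webhook on Pods" idea, not an existing CDI component.
package lockwebhook

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// lockedPVCs stands in for state a Lock CRD controller would maintain.
var lockedPVCs = map[string]bool{"dv-source": true}

// ServeValidate rejects Pod creation when the Pod mounts a locked PVC.
func ServeValidate(w http.ResponseWriter, r *http.Request) {
	body, _ := io.ReadAll(r.Body)
	var review admissionv1.AdmissionReview
	if err := json.Unmarshal(body, &review); err != nil || review.Request == nil {
		http.Error(w, "bad AdmissionReview", http.StatusBadRequest)
		return
	}

	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err != nil {
		http.Error(w, "bad Pod object", http.StatusBadRequest)
		return
	}

	resp := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	for _, vol := range pod.Spec.Volumes {
		if vol.PersistentVolumeClaim != nil && lockedPVCs[vol.PersistentVolumeClaim.ClaimName] {
			resp.Allowed = false
			resp.Result = &metav1.Status{
				Message: fmt.Sprintf("PVC %s is locked for cloning", vol.PersistentVolumeClaim.ClaimName),
			}
			break
		}
	}

	review.Response = resp
	out, _ := json.Marshal(review)
	w.Header().Set("Content-Type", "application/json")
	w.Write(out)
}

Even with such a webhook, pods admitted before the lock was created can still hold the PVC, so the race mentioned above remains.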

Comment 10 Dan Kenigsberg 2020-07-06 08:32:10 UTC
I'm not sure how option 2 can work without higher-level orchestration forbidding new mounts while the snapshot takes place. I see this as a big challenge; I think it should be tracked elsewhere, but maybe I'm missing something in the big picture.

Comment 11 Adam Litke 2020-07-07 12:19:26 UTC
(In reply to Dan Kenigsberg from comment #6)
> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?
> I would expect any cloning that is susceptible to concurrent writing to
> first obtain such a lock.

No.  The platform does not offer any guarantees for this.  You are expected to know what runs in your namespace and to ensure the workload behaves (i.e., you can provide your own synchronization mechanism if required).

> 
> Is this something we can fix for 2.4, or does it require a much bigger
> upstream work?

Adding additional locking mechanisms would require a large upstream effort and I don't really see it gaining much traction.

---

The way I see it, causes of this type of corruption fall into two categories: 1) platform-caused, and 2) application-caused.  #1 covers Kubernetes and OpenShift components, including KubeVirt.  If these components cause corruption given normal, allowed user behavior, that is a bug.  #2 includes scenarios where a user causes corruption by starting additional Pods that write to PVCs when they should not.  We cannot reasonably prevent users from shooting themselves in the foot.  Namespaces and related RBAC can help ensure that only the workloads a user wants to run can actually access PVCs.

In summary, this bug identifies a situation where CDI steps on its own toes and causes corruption.  We can and should correct this specific scenario.

Comment 12 Yan Du 2020-07-08 08:30:22 UTC
Tested RWO/host-assisted clone on OCP 4.5 / CNV 2.4
virt-cdi-operator-container-v2.4.0-26

Verify steps:

1. Create a source dv
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  source:
    http:
      url: http://<URL>/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

2. Create vm with the source pvc
---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: cirros-vmi-src
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv-disk
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv-source
    metadata:
      labels:
        kubevirt.io/vm: cirros-vm-src
        kubevirt.io/domain: cirros-vm-src
  running: true

3. Clone the source pvc
---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-target2
spec:
  source:
    pvc:
      name: dv-source
      namespace: default
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt


The clone did not proceed while the source PVC was in use by a VM.

After deleting the VM, the clone continued.

@Michael @Adam Is this the expected behavior?


And a small question: where can I find the log/warning that the clone won't proceed because the PVC is in use? Without the log, the user may get a little confused. I checked the pod log and described the dv/pod, but didn't find anything related:

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get dv
NAME         PHASE             PROGRESS   RESTARTS   AGE
dv-source    Succeeded         100.0%     0          12m
dv-target2   CloneInProgress   N/A        0          8m28s
[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get pod
NAME                                 READY     STATUS    RESTARTS   AGE
cdi-upload-dv-target2                1/1       Running   0          6m49s
virt-launcher-cirros-vmi-src-69tm7   1/1       Running   0          10m
[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc logs cdi-upload-dv-target2 
I0708 06:55:56.887881       1 uploadserver.go:63] Upload destination: /data/disk.img
I0708 06:55:56.888073       1 uploadserver.go:65] Running server on 0.0.0.0:8443

Comment 13 Michael Henriksen 2020-07-08 12:02:11 UTC
>After deleting the VM, the clone continued.

>@Michael @Adam Is this the expected behavior?

Yes, that is expected behavior.

>And a small question: where can I find the log/warning that the clone won't proceed because the PVC is in use? Without the log, the user may get a little confused. I checked the pod log and described the dv/pod, but didn't find anything related:

There should be events generated.

For smart clone:  https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/datavolume-controller.go#L296-L297

For network clone:  https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/clone-controller.go#L227-L228
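
For reference, a minimal sketch of how such an event gets emitted with client-go's record.EventRecorder (function and variable names are illustrative, not the actual CDI code):

// Sketch only: emits the kind of "source in use" event referenced by the links above.
package cloneevents

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/record"
)

// reportSourceInUse records a warning event on the target PVC naming the pod
// that currently holds the clone source open.
func reportSourceInUse(recorder record.EventRecorder, target *corev1.PersistentVolumeClaim, pod *corev1.Pod, sourcePVCName string) {
	recorder.Eventf(target, corev1.EventTypeWarning, "CloneSourceInUse",
		"pod %s/%s using PersistentVolumeClaim %s", pod.Namespace, pod.Name, sourcePVCName)
}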

Comment 14 Yan Du 2020-07-08 14:29:15 UTC
Thanks Michael

Got the PVC-in-use info in the events:

6s          Normal    CloneInProgress         datavolume/dv-target2                    Cloning from default/dv-source into test/dv-target2 in progress
0s          Warning   CloneSourceInUse        persistentvolumeclaim/dv-target2         pod default/virt-launcher-cirros-vmi-src-xfnmw using PersistentVolumeClaim dv-source

Comment 17 errata-xmlrpc 2020-07-28 19:10:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194