Created attachment 1696477 [details]
Copy of the test run artifacts, for when the Prow URL expires

In the following CI run:
https://prow.apps.ovirt.org/view/gcs/kubevirt-prow/pr-logs/pull/kubevirt_containerized-data-importer/1226/pull-containerized-data-importer-e2e-k8s-1.17-ceph/1270060648532807680

we can see (in the artifacts) that the test failed due to filesystem corruption:

  MountVolume.MountDevice failed for volume "pvc-40f14e2f-ef22-4fcd-a0a4-b8730789e89a" :
  rpc error: code = Internal desc = 'xfs_repair' found errors on device /dev/rbd0 but could not correct them:
  Phase 1 - find and verify superblock...
  Phase 2 - using internal log
          - zero log...
  ERROR: The filesystem has valuable metadata changes in a log which needs to
  be replayed. Mount the filesystem to replay the log, and unmount it before
  re-running xfs_repair. If you are unable to mount the filesystem, then use
  the -L option to destroy the log and attempt a repair. Note that destroying
  the log may cause corruption -- please attempt a mount of the filesystem
  before doing this.

(Thanks to Niels de Vos) This happens because a snapshot was taken before the filesystem was unmounted.
Source volume:
- pv: pvc-4389fd29-37ed-43e6-877a-ec114a93857f
- volume: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Snapshot:
- snapshot-8e123d7a-3a51-4d95-86f5-8f11e0e976d1
- id: 0001-0009-rook-ceph-0000000000000001-c50abf73-a9b8-11ea-b649-fe137ae731b8
- source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8

Problem volume:
- source: 0001-0009-rook-ceph-0000000000000001-b218a536-a9b8-11ea-b649-fe137ae731b8
- volume: 0001-0009-rook-ceph-0000000000000001-c675249e-a9b8-11ea-b649-fe137ae731b8

Timeline:
18:49:13.476232 successful mount of source volume
18:49:25.452584 snapshot of source volume created
18:49:27.700401 attempted snapshot delete (failed, not critical)
18:49:44.437652 corruption detected on problem volume
18:49:50.315836 unmounting source volume

I believe this issue is specific to smart cloning (or maybe read-write-many). If the cloning is done using a pod, Kubernetes will not allow a second mount of the filesystem before the first one is done.
After some discussion, it seems that we will need to update the controller logic for both smart cloning and host-assisted cloning to check that the source PVC is not still mounted. There are four combinations to consider:

- RWX/host-assisted clone - the bug can happen any time the source PVC is in use
- RWX/smart clone - the bug can happen any time the source PVC is in use
- RWO/host-assisted clone - can happen if the clone operation is scheduled on the same node as a previous user
- RWO/smart clone - can happen any time the source PVC is in use

Even if the controller checks for a user of the source PVC prior to cloning, there is still a race: we cannot reasonably stop someone else from starting to use the source PVC while the clone is in progress.
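The pre-clone check described above can be sketched in pseudologic as follows. This is only an illustration, not CDI's actual implementation: the pod objects are simplified dicts rather than real Kubernetes API objects, and the function names are invented here.

```python
# Illustrative sketch: decide whether a clone of `source_pvc` may start,
# given the pods currently visible in the namespace. A pod "uses" the PVC
# if any of its volumes reference that claim and the pod has not finished.

def pods_using_pvc(pods, source_pvc):
    """Return names of pods that still hold a reference to source_pvc."""
    users = []
    for pod in pods:
        if pod["phase"] in ("Succeeded", "Failed"):
            continue  # finished pods no longer hold the mount
        for vol in pod.get("volumes", []):
            claim = vol.get("persistentVolumeClaim", {})
            if claim.get("claimName") == source_pvc:
                users.append(pod["name"])
                break
    return users

def clone_may_start(pods, source_pvc):
    # Note: even when this returns True, a new pod can be scheduled
    # immediately afterwards -- this check narrows, but cannot close,
    # the race described above.
    return not pods_using_pvc(pods, source_pvc)

pods = [
    {"name": "virt-launcher-src", "phase": "Running",
     "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]},
    {"name": "importer-done", "phase": "Succeeded",
     "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]},
]
print(pods_using_pvc(pods, "dv-source"))  # ['virt-launcher-src']
print(clone_may_start(pods, "dv-source"))  # False
```

The check covers all four combinations above in the same way; the RWO/host-assisted case could in principle be relaxed to same-node users only, but a namespace-wide check is the conservative choice.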
According to the above, this bug also affects host-assisted clone, so we consider it an issue we need to fix in 2.4.
There is no reasonable way to eliminate the possibility of data corruption when cloning, as Kubernetes gives no guarantee that a PVC cannot be mounted/modified while the clone operation is in progress. But to address this issue, CDI will make sure that no pods are accessing the affected PVCs in read/write mode before the clone starts. Mounting a PVC read-only should be okay.
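The "read/write users only" rule above can be sketched as a small filter: a pod that mounts the source PVC with readOnly set does not block the clone, while any read/write mount does. Again, the pod and volume shapes here are simplified dicts invented for illustration, not real API objects or CDI code.

```python
# Sketch: collect only the pods that mount `pvc_name` read/write.
# A persistentVolumeClaim volume source with readOnly: true is ignored,
# since read-only access cannot corrupt the clone.

def rw_users(pods, pvc_name):
    users = []
    for pod in pods:
        for vol in pod.get("volumes", []):
            claim = vol.get("persistentVolumeClaim")
            if (claim
                    and claim.get("claimName") == pvc_name
                    and not claim.get("readOnly", False)):
                users.append(pod["name"])
                break
    return users

pods = [
    {"name": "reader", "volumes": [
        {"persistentVolumeClaim": {"claimName": "src", "readOnly": True}}]},
    {"name": "writer", "volumes": [
        {"persistentVolumeClaim": {"claimName": "src"}}]},
]
print(rw_users(pods, "src"))  # ['writer']
```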
Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV? I would expect any cloning that is susceptible to concurrent writing to first obtain such a lock. Is this something we can fix for 2.4, or does it require much bigger upstream work?
> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?

Nope. The linked PR does about all we can. I don't think there will ever be any such locking mechanism.
> Nope.
> The linked PR does about all we can. I don't think there will ever be any such locking mechanism.

Can you discuss this in upstream k8s, or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.
> Can you discuss this in upstream k8s or point to a place where it was already discussed? I would like to express in a PVC that I expect to be the only writer of the PV.

If the goal is to have a consistent filesystem snapshot without errors, there are two options:

1. Make sure that no pods have the PVC mounted for the entirety of the snapshot operation.
2. Manually quiesce the app in the pod(s) that have the PVC mounted and freeze the filesystem.

The community is focused on the second, as the PVCs you care about snapshotting are probably in use. The current initiative is called Container Notifier:
https://docs.google.com/document/d/1Q4UMrx9r58LUNOHdn_4nDhqHN21iOOYQeViMKSPa_Ks/edit#heading=h.xh2wqx7h9cqm

Sounds like you are advocating for the first case, which I think has limited utility, but it could maybe be implemented with a "Lock" CRD and a validating webhook on Pods. Even then, there may still be race conditions.
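The hypothetical "Lock" CRD plus validating-webhook idea from option 1 could look roughly like this decision function. Everything here is invented for illustration (no such CRD exists): the webhook would reject any new Pod that references a PVC currently locked for a clone/snapshot operation.

```python
# Sketch of the admission decision for a hypothetical "Lock" CRD:
# given the set of locked PVC names, allow or deny a new Pod.
# Pod shape is a simplified dict, not a real AdmissionReview object.

def admit_pod(pod, locked_pvcs):
    """Return (allowed, message) for a Pod admission request."""
    for vol in pod.get("volumes", []):
        claim = vol.get("persistentVolumeClaim", {})
        name = claim.get("claimName")
        if name in locked_pvcs:
            return False, f"PVC {name} is locked for a clone/snapshot operation"
    return True, "allowed"

locked = {"dv-source"}
pod = {"name": "late-writer",
       "volumes": [{"persistentVolumeClaim": {"claimName": "dv-source"}}]}
print(admit_pod(pod, locked))  # (False, 'PVC dv-source is locked for a clone/snapshot operation')
```

Note that even with such a webhook, pods admitted before the lock was created can still hold the PVC mounted, which is one of the race conditions mentioned above.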
I'm not sure how option 2 can work without higher-level orchestration forbidding new mounts while the snapshot takes place. I see this as a big challenge; I think it should be tracked elsewhere, but maybe I'm missing something in the big picture.
(In reply to Dan Kenigsberg from comment #6)

> Michael, doesn't k8s have any mechanism to obtain a unique write lock on a PV?
> I would expect any cloning that is susceptible to concurrent writing to
> first obtain such a lock.

No. The platform does not offer any guarantees for this. You are expected to know what runs in your namespace and to ensure the workload behaves (i.e. you can provide your own synchronization mechanism if required).

> Is this something we can fix for 2.4, or does it require much bigger
> upstream work?

Adding additional locking mechanisms would require a large upstream effort, and I don't really see it gaining much traction.

---

The way I see it, the causes of this type of corruption fall into two categories: 1) platform-caused and 2) application-caused.

#1 covers Kubernetes and OpenShift components, including KubeVirt. If these components cause corruption given normal, allowed user behavior, that is a bug.

#2 includes scenarios where a user causes corruption by starting additional Pods that write to PVCs when they should not. We cannot reasonably prevent users from shooting themselves in the foot. Namespaces and related RBAC can help ensure that only the workloads a user wants to run can actually access PVCs.

In summary, this bug identifies a situation where CDI steps on its own toes and causes corruption. We can and should correct this specific scenario.
Test RWO/host-assisted clone on OCP 4.5 / CNV 2.4, virt-cdi-operator-container-v2.4.0-26

Verify steps:

1. Create a source dv

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-source
spec:
  source:
    http:
      url: http://<URL>/cirros-images/cirros-0.4.0-x86_64-disk.qcow2
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

2. Create a vm with the source pvc

---
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: cirros-vmi-src
spec:
  template:
    spec:
      domain:
        resources:
          requests:
            memory: 1024M
        devices:
          rng: {}
          disks:
          - disk:
              bus: virtio
            name: dv-disk
      volumes:
      - name: dv-disk
        dataVolume:
          name: dv-source
    metadata:
      labels:
        kubevirt.io/vm: cirros-vm-src
        kubevirt.io/domain: cirros-vm-src
  running: true

3. Clone the source pvc

---
apiVersion: cdi.kubevirt.io/v1alpha1
kind: DataVolume
metadata:
  name: dv-target2
spec:
  source:
    pvc:
      name: dv-source
      namespace: default
  pvc:
    accessModes:
    - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
    storageClassName: hostpath-provisioner
    volumeMode: Filesystem
  contentType: kubevirt

The clone did not proceed while the source PVC was in use by a VM. After deleting the VM, the clone continued.

@Michael @Adam Is this the expected behavior?

And a small question: where can I find the related log/warning that the clone will not proceed because the PVC is in use?
I think that without such a log message the user may get a little confused. I tried the pod logs and described the dv/pod, but it seems I didn't get any related message:

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get dv
NAME         PHASE             PROGRESS   RESTARTS   AGE
dv-source    Succeeded         100.0%     0          12m
dv-target2   CloneInProgress   N/A        0          8m28s

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc get pod
NAME                                 READY   STATUS    RESTARTS   AGE
cdi-upload-dv-target2                1/1     Running   0          6m49s
virt-launcher-cirros-vmi-src-69tm7   1/1     Running   0          10m

[cnv-qe-jenkins@yadu24-4gw9j-executor ~]$ oc logs cdi-upload-dv-target2
I0708 06:55:56.887881       1 uploadserver.go:63] Upload destination: /data/disk.img
I0708 06:55:56.888073       1 uploadserver.go:65] Running server on 0.0.0.0:8443
> After deleting the VM, the clone could be continued.
> @Michael @Adam Is this the expected behavior?

Yes, that is expected behavior.

> And a small question: where can I find the related log/warning that the clone
> will not proceed because the PVC is in use? I think that without such a log
> message the user may get a little confused. I tried the pod logs and described
> the dv/pod, but it seems I didn't get any related message.

There should be events generated.

For smart clone:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/datavolume-controller.go#L296-L297

For network clone:
https://github.com/kubevirt/containerized-data-importer/blob/master/pkg/controller/clone-controller.go#L227-L228
Thanks Michael. Got the "pvc in use" info in the events:

6s    Normal    CloneInProgress    datavolume/dv-target2              Cloning from default/dv-source into test/dv-target2 in progress
0s    Warning   CloneSourceInUse   persistentvolumeclaim/dv-target2   pod default/virt-launcher-cirros-vmi-src-xfnmw using PersistentVolumeClaim dv-source
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194