Bug 2098657

Summary: VM workload - PVC Filesystem write performance is 95% lower compared to Block
Product: Container Native Virtualization (CNV)
Component: Storage
Reporter: Boaz <bbenshab>
Assignee: Alex Kalenyuk <akalenyu>
QA Contact: Kevin Alon Goldblatt <kgoldbla>
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Version: 4.10.2
Target Release: 4.14.0
Hardware: Unspecified
OS: All
CC: akamra, alitke, awels, ekuric, fdeutsch, jhopper, mimehta, mrashish, ndevos, yadu
Type: Bug
Last Closed: 2023-08-09 18:03:33 UTC

Description Boaz 2022-06-20 09:21:42 UTC
Description of problem:
I recently encountered a puzzling performance issue on our CI when comparing a VM I/O workload
writing to a PVC configured with volumeMode Filesystem versus Block, like so:


--- 
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    cdi.kubevirt.io/storage.preallocation: "true"
  name: vdbench-pvc-claim
  namespace: benchmark-runner
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: [ "ReadWriteOnce" ]
  volumeMode: Filesystem # or set to Block
  resources:
    requests:
      storage: 64Gi
---


After some investigation, we found that this happens because on block devices we automatically set
io=native, while on filesystem volumes we do not. Now, if we use a Filesystem-mode PVC within a DataVolume like so:

---
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1
    kind: DataVolume
    metadata:
      annotations:
        kubevirt.io/provisionOnNode: worker-0
      name: workload-disk
    spec:
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 65Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Filesystem
      source:
        blank: {}
---
 
we will still experience the ~95% degradation compared to Block, but if we add "preallocation: true" like so:

---
  dataVolumeTemplates:
  - apiVersion: cdi.kubevirt.io/v1
    kind: DataVolume
    metadata:
      annotations:
        kubevirt.io/provisionOnNode: worker-0
      name: workload-disk
    spec:
      preallocation: true
      pvc:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 65Gi
        storageClassName: ocs-storagecluster-ceph-rbd
        volumeMode: Filesystem
      source:
        blank: {}
---


it turns out that enabling "preallocation" (which was introduced as a tool to improve performance on thin-provisioned devices)
causes KubeVirt to set io=native for the filesystem-backed disk as well (https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-launcher/virtwrap/converter/converter.go#L480).
However, that workaround is only applicable to DataVolumes.

The plain-PVC scenario is a little more complicated: the workaround there is to manually create a fully
preallocated disk.img in the root directory of the PVC. CNV then correctly detects that it was preallocated and attaches it to the VM with io=native (see the sketch below).
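
For illustration, a minimal sketch of that manual PVC workaround, assuming the Filesystem PVC is mounted at /pvc inside a helper pod and that the image size leaves room for filesystem overhead (path and size are hypothetical):

# Fully preallocate a raw disk.img in the PVC root (no sparse regions),
# so that CNV detects it as preallocated and attaches it with io=native.
dd if=/dev/zero of=/pvc/disk.img bs=1M count=61440 oflag=direct conv=fsync
# Alternatively, if qemu-img is available:
# qemu-img create -f raw -o preallocation=full /pvc/disk.img 60G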


However, both of the above workarounds are far from user-friendly. Only a few people know that using Filesystem volume mode causes such severe performance issues, and even fewer know how to address it, which is why I suggest the following:

1. For DataVolumes - preallocation should be set to true by default.
2. For PVCs - we should implement a way to set io=native.

Comment 1 Fabian Deutsch 2022-06-29 14:12:53 UTC
FWIW the different modes can be specified in the API http://kubevirt.io/api-reference/main/definitions.html#_v1_disk
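
For reference, a minimal sketch of requesting io=native explicitly through that API, on a VirtualMachineInstance (or under spec.template.spec of a VirtualMachine); the disk name is hypothetical:

---
spec:
  domain:
    devices:
      disks:
      - name: workload-disk   # hypothetical disk name
        io: native            # explicitly request native AIO instead of relying on the default heuristic
        disk:
          bus: virtio
---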

The reasoning was as follows:
1. Preallocated files and block volumes (which are assumed to be preallocated) work better with io=native.
2. Sparse files were assumed to work better with the threaded I/O mode.

Thus if no io mode was set, then the above reasoning was used to make a smart decision.

If we now say that threaded I/O gives poor performance for sparse files, then we can check whether native, in general, also behaves better for sparse-file-backed disks.

Can the perf team provide this evaluation?

Comment 2 Adam Litke 2022-06-29 19:27:57 UTC
To add to what Fabian says...

I think we should conduct a study on io=threads vs io=native using various combinations of storage and volumeModes.  For example, the following combinations would be interesting:
- ceph rbd block: io=native vs io=threads
- ceph rbd filesystem: io=native vs io=threads
- hpp (filesystem): io=native vs io=threads
- NFS (filesystem): io=native vs io=threads
- Trident iSCSI block: io=native vs io=threads
- Trident iSCSI filesystem: io=native vs io=threads

This could help to determine if io=native is generally better or if certain storage backends perform better with one vs the other.

Comment 4 Yaniv Kaul 2022-07-17 10:02:03 UTC
While it's hard to block it on the CLI, I'm sure we've already removed it from the UI. If we haven't, this is what this BZ should be focused on.

Comment 5 Yaniv Kaul 2022-07-17 10:03:19 UTC
(side note: Ceph doesn't need preallocation, so I would advise not setting it for rbd)

Comment 6 Michey Mehta 2022-07-20 04:53:57 UTC
This issue can be reproduced without the use of CNV. We saw this issue in CNV under the following conditions:
- The underlying block storage for the DataVolume is rbd, and the volumeMode is Filesystem
- CNV creates an ext4 file system on the rbd device, and a raw disk.img file is created on the ext4 filesystem to use as the backing storage for the VM
- CNV automatically sets the disk I/O mode to "threads" instead of "native" since the disk.img file is thin-provisioned

To recreate this outside of CNV:
- Create a 10GB PVC from the ocs-storagecluster-ceph-rbd storage class, and use volumeMode Block
- Create a pod which uses the PVC
- Within the pod, create an ext4 filesystem on the block device, mount the filesystem, and create a raw disk.img file on the ext4 filesystem (use dd to create a fully allocated disk.img file; a sample invocation follows the fio example below)
- Run fio with the sync ioengine against this disk.img file using multiple jobs; the throughput will be low (~500 IOPS), equivalent to what a single job gives you
- Example fio invocation: fio --name=random-write --filename=/disk/disk.img --offset_increment=1g --direct=1 --ioengine=sync   --rw=randwrite --bs=64k --numjobs=9 --size=1g --iodepth=1 --runtime=600 --time_based --end_fsync=1 --group_reporting
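
For the "fully allocated disk.img" step above, a possible dd invocation (the 9 GiB size matches the fio example's nine 1g jobs; path and size are otherwise arbitrary):

# Write zeros end to end so no blocks are left sparse.
dd if=/dev/zero of=/disk/disk.img bs=1M count=9216 oflag=direct conv=fsync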

Note that we use ioengine=sync and multiple fio jobs in the fio example above because that is the equivalent of what happens when a VM runs fio in the guest while its disk uses io=threads instead of io=native.

This slow-performance issue only happens when using ext4 as the filesystem; it does not happen when using xfs.

It appears that ext4 is serializing the writes to the disk.img file: the writers are contending on a semaphore.

All but one of the fio processes is blocked on the ext4_file_write_iter wait channel:

[core@worker-0 ~]$ sudo ps -eo pid,ppid,user,stat,pcpu,comm,wchan:32 | grep fio
1088821 1031232 root     Sl+   4.0 fio  hrtimer_nanosleep
1088932 1088821 root     Ds    0.6 fio  ext4_file_write_iter
1088933 1088821 root     Ds    0.5 fio  ext4_file_write_iter
1088934 1088821 root     Ds    0.5 fio  ext4_file_write_iter
1088935 1088821 root     Ds    0.5 fio  -
1088936 1088821 root     Ds    0.6 fio  ext4_file_write_iter
1088937 1088821 root     Ds    0.5 fio  ext4_file_write_iter
1088938 1088821 root     Ds    0.5 fio  ext4_file_write_iter
1088939 1088821 root     Ds    0.5 fio  ext4_file_write_iter
1088940 1088821 root     Ds    0.5 fio  ext4_file_write_iter

The kernel stack tracebacks for the blocked fio processes look like this; they are waiting on a semaphore:

[<0>] rwsem_down_write_slowpath+0x32a/0x610
[<0>] ext4_file_write_iter+0x3cb/0x3e0 [ext4]
[<0>] new_sync_write+0x112/0x160
[<0>] vfs_write+0xa5/0x1a0
[<0>] ksys_write+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

We need to establish whether this serialized performance with ext4 on rbd also happens with NVMe storage, and if it does not happen with NVMe, then we need to figure out if this is an issue in ext4 or the kernel rbd driver.

Comment 7 Michey Mehta 2022-07-20 12:51:59 UTC
The serialized write performance of ext4 also happens on NVMe storage; most threads are waiting on a semaphore (the raw performance is higher than the ~500 IOPS we see on rbd, but that is only because the underlying storage is faster). So the question that comes out of all of this is whether it is possible to use xfs instead of ext4 when creating the underlying disk.img backing file. I'll defer to a filesystem expert to weigh in here; it seems to me that xfs outperforms ext4 for this particular use case (parallel I/O against a large file).

Comment 8 Ashish Kamra 2022-07-20 13:09:19 UTC
More context on xfs vs ext4 performance largely validating your findings - https://access.redhat.com/articles/3129891

Comment 9 Adam Litke 2022-11-23 18:21:46 UTC
So it would seem to me that switching to XFS as the default filesystem for the ceph-rbd provisioner would solve this problem.  If so, this should be moved to the ODF component.

Comment 10 Adam Litke 2022-11-23 18:22:35 UTC
Niels, what do you think?

Comment 13 Niels de Vos 2023-01-31 13:26:53 UTC
TLDR: There is a risk when deployments use RBD volumes on the same nodes where the OSDs are running. The risk is lower with ext4 than with the more memory-hungry xfs; hence the default of ext4, as problems are not seen regularly with it.

xfs requires lots of memory when running certain workloads. OSDs receive I/O from the RBD volume, and in order to process the network packets that contain that I/O and write it to the actual disks, the OSD needs to allocate memory. If the node is under memory pressure, dirty pages get flushed to free up the memory that the OSD requested. Local filesystems, including xfs, receive the request to flush I/O. If the xfs filesystem sits on an RBD device, that I/O is sent over the network to the OSD, which in turn requires (again) more memory to process the write requests. This can end up in an endless loop, and a hung_task might be reported in the kernel logs.

https://rook.io/docs/rook/v1.10/Troubleshooting/ceph-common-issues/#a-worker-node-using-rbd-devices-hangs-up

Because of this, it is not trivial to use xfs as the default filesystem on top of RBD devices.

Comment 23 Adam Litke 2023-08-09 18:03:33 UTC
Yes, closing as suggested.  Using the storage API is the recommended approach and yields the correct configuration.
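
For reference, a minimal sketch of a standalone DataVolume using the storage API, mirroring the earlier examples; CDI then derives accessModes and volumeMode from the StorageProfile of the storage class:

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: workload-disk
spec:
  storage:                                        # "storage" API instead of "pvc"
    storageClassName: ocs-storagecluster-ceph-rbd
    resources:
      requests:
        storage: 65Gi
  source:
    blank: {}
---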