Description of problem:

I recently encountered a striking performance issue on our CI when comparing a VM IO workload writing to a PVC configured with volumeMode: Filesystem versus volumeMode: Block, like so:

---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  annotations:
    cdi.kubevirt.io/storage.preallocation: "true"
  name: vdbench-pvc-claim
  namespace: benchmark-runner
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: [ "ReadWriteOnce" ]
  volumeMode: Filesystem   # or set to Block
  resources:
    requests:
      storage: 64Gi
---

After some investigation, we found that this happens because we automatically set io=native for block devices, while for filesystems we do not. Now, if we use volumeMode: Filesystem within a DataVolume like so:

---
dataVolumeTemplates:
- apiVersion: cdi.kubevirt.io/v1
  kind: DataVolume
  metadata:
    annotations:
      kubevirt.io/provisionOnNode: worker-0
    name: workload-disk
  spec:
    pvc:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 65Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      volumeMode: Filesystem
    source:
      blank: {}
---

we will still experience ~95% degradation compared to Block. But if we add "preallocation: true" like so:

---
dataVolumeTemplates:
- apiVersion: cdi.kubevirt.io/v1
  kind: DataVolume
  metadata:
    annotations:
      kubevirt.io/provisionOnNode: worker-0
    name: workload-disk
  spec:
    preallocation: true
    pvc:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 65Gi
      storageClassName: ocs-storagecluster-ceph-rbd
      volumeMode: Filesystem
    source:
      blank: {}
---

then it turns out that "preallocation" (which was introduced as a tool to improve performance on thin devices) causes virt-launcher to set io=native for the filesystem-backed disk as well (https://github.com/kubevirt/kubevirt/blob/main/pkg/virt-launcher/virtwrap/converter/converter.go#L480). That workaround only applies to DataVolumes.

The plain PVC scenario is a little more complicated: the workaround there is to manually create a fully preallocated disk.img in the root directory of the PVC. CNV correctly detects that it was preallocated and attaches it to the VM with io=native.

However, both workarounds are far from user-friendly. Only a few people know that using volumeMode: Filesystem causes such severe performance degradation, and even fewer know how to address it, which is why I suggest the following:
1. For DataVolumes - preallocation should be set to true by default.
2. For PVCs - we should implement a way to set io=native.
FWIW, the different IO modes can be specified in the API: http://kubevirt.io/api-reference/main/definitions.html#_v1_disk

The reasoning was as follows:
1. Preallocated files and block devices (which are assumed to be preallocated) work better with io=native.
2. Sparse files were assumed to work better with the threaded IO mode.

Thus, if no IO mode was set, the above reasoning was used to make a smart decision. If we now say that io=threads gives poor performance for sparse files, then we can check whether io=native - in general - also behaves better for sparse-file-backed disks. Can the perf team provide this evaluation?
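For reference, a minimal sketch of pinning the IO mode explicitly per disk via that API field; the VMI name, memory size, and volume wiring here are illustrative placeholders, not taken from the report above:

---
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: example-vmi              # hypothetical name
spec:
  domain:
    memory:
      guest: 2Gi
    devices:
      disks:
      - name: workload-disk
        io: native               # force native AIO instead of the auto-detected mode
        disk:
          bus: virtio
  volumes:
  - name: workload-disk
    dataVolume:
      name: workload-disk
---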
To add to what Fabian says... I think we should conduct a study on io=threads vs io=native using various combinations of storage backends and volumeModes. For example, the following combinations would be interesting:
- ceph rbd Block: io=native vs io=threads
- ceph rbd Filesystem: io=native vs io=threads
- hpp (Filesystem): io=native vs io=threads
- NFS (Filesystem): io=native vs io=threads
- Trident iSCSI Block: io=native vs io=threads
- Trident iSCSI Filesystem: io=native vs io=threads

This could help determine whether io=native is generally better or whether certain storage backends perform better with one vs the other.
While it's hard to block it on the CLI, I'm sure we've already removed it from the UI. If we haven't, this is what this BZ should be focused on.
(side note: Ceph doesn't need preallocation, so I would advise not setting it for rbd)
This issue can be reproduced without the use of CNV.

We saw this issue in CNV under the following conditions:
- The underlying block storage for the DataVolume is rbd, and the volumeMode is Filesystem
- CNV creates an ext4 filesystem on the rbd device, and a raw disk.img file is created on the ext4 filesystem to use as the backing storage for the VM
- CNV automatically sets the VM disk IO mode to "threads" instead of "native", since the disk.img file is thin provisioned

To recreate this outside of CNV:
- Create a 10GB PVC from the ocs-storagecluster-ceph-rbd storage class, using volumeMode Block
- Create a pod which uses the PVC (a sketch of both manifests follows at the end of this comment)
- Within the pod, create an ext4 filesystem on the block device, mount the filesystem, and create a raw disk.img file on it (use dd to create a fully allocated disk.img file)
- Run fio with the sync ioengine against this disk.img file using multiple jobs; the throughput will be low (~500 IOPS), equivalent to what a single job gives you
- Example fio invocation:

fio --name=random-write --filename=/disk/disk.img --offset_increment=1g --direct=1 --ioengine=sync --rw=randwrite --bs=64k --numjobs=9 --size=1g --iodepth=1 --runtime=600 --time_based --end_fsync=1 --group_reporting

Note that we use ioengine=sync and multiple fio jobs in the above example, because that is equivalent to what happens when fio runs inside the guest and the VM is using io=threads instead of io=native.

This slow performance only happens when using ext4 as the filesystem; it does not happen when using xfs. It appears that ext4 is serializing the writes to the disk.img file; a semaphore is serializing the traffic. All but one of the fio processes are blocked on the ext4_file_write_iter wait channel:

[core@worker-0 ~]$ sudo ps -eo pid,ppid,user,stat,pcpu,comm,wchan:32 | grep fio
1088821 1031232 root  Sl+  4.0  fio  hrtimer_nanosleep
1088932 1088821 root  Ds   0.6  fio  ext4_file_write_iter
1088933 1088821 root  Ds   0.5  fio  ext4_file_write_iter
1088934 1088821 root  Ds   0.5  fio  ext4_file_write_iter
1088935 1088821 root  Ds   0.5  fio  -
1088936 1088821 root  Ds   0.6  fio  ext4_file_write_iter
1088937 1088821 root  Ds   0.5  fio  ext4_file_write_iter
1088938 1088821 root  Ds   0.5  fio  ext4_file_write_iter
1088939 1088821 root  Ds   0.5  fio  ext4_file_write_iter
1088940 1088821 root  Ds   0.5  fio  ext4_file_write_iter

The kernel stack tracebacks for the blocked fio processes look like this; they are waiting on a semaphore:

[<0>] rwsem_down_write_slowpath+0x32a/0x610
[<0>] ext4_file_write_iter+0x3cb/0x3e0 [ext4]
[<0>] new_sync_write+0x112/0x160
[<0>] vfs_write+0xa5/0x1a0
[<0>] ksys_write+0x4f/0xb0
[<0>] do_syscall_64+0x5b/0x1a0
[<0>] entry_SYSCALL_64_after_hwframe+0x65/0xca

We need to establish whether this serialized write performance with ext4 on rbd also happens with NVMe storage. If it does not happen with NVMe, then we need to figure out whether this is an issue in ext4 or in the kernel rbd driver.
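A minimal sketch of the Block-mode PVC and pod used for the reproduction steps above; the object names, container image, and device path are illustrative placeholders, not taken from the actual test environment:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: repro-block-pvc            # hypothetical name
spec:
  storageClassName: ocs-storagecluster-ceph-rbd
  accessModes: [ "ReadWriteOnce" ]
  volumeMode: Block
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: repro-fio                  # hypothetical name
spec:
  containers:
  - name: fio
    image: registry.example.com/fio:latest   # placeholder image containing fio and mkfs tools
    command: [ "sleep", "infinity" ]
    securityContext:
      privileged: true             # needed to mkfs/mount inside the pod
    volumeDevices:
    - name: data
      devicePath: /dev/xvda        # raw block device exposed to the container;
                                   # format with ext4, mount at /disk, then run the fio command above
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: repro-block-pvc
---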
The serialized write behavior of ext4 also happens on NVMe storage; most threads are waiting on a semaphore (the raw performance is higher than the ~500 IOPS we see on rbd, but that is just because the underlying storage is faster). So the question that comes out of all of this is whether it is possible to use xfs instead of ext4 when creating the underlying disk.img backing file. I'll defer to a filesystem expert to weigh in here; it seems to me that xfs outperforms ext4 for this particular use case (parallel IO against a large file).
More context on xfs vs ext4 performance, largely validating your findings: https://access.redhat.com/articles/3129891
So it would seem to me that switching to XFS as the default filesystem for the ceph-rbd provisioner would solve this problem. If so, this should be moved to the ODF component.
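For illustration, the filesystem used for Filesystem-mode RBD PVs is controlled by the csi.storage.k8s.io/fstype parameter on the StorageClass. A minimal sketch of an xfs variant; the class name is hypothetical and the provisioner name is assumed from a typical ODF install, with the remaining cluster-specific parameters to be copied from the existing class:

---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ocs-storagecluster-ceph-rbd-xfs          # hypothetical variant of the default class
provisioner: openshift-storage.rbd.csi.ceph.com  # assumed ODF rbd provisioner name
parameters:
  csi.storage.k8s.io/fstype: xfs                 # format Filesystem-mode PVs with xfs instead of ext4
  # ...plus the cluster-specific parameters (clusterID, pool, secret refs)
  # copied from the existing ocs-storagecluster-ceph-rbd class
reclaimPolicy: Delete
volumeBindingMode: Immediate
---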
Niels, what do you think?
TLDR: There is a risk when deployments use RBD volumes on the same nodes where OSDs are running. The risk is lower with ext4 than with the more memory-hungry xfs. Hence the default of ext4, as problems are not seen regularly.

xfs requires a lot of memory when running certain workloads. OSDs receive I/O from the RBD volume, and in order to process the network packets that contain the I/O from the RBD volume and write to the actual disks, the OSD needs to allocate memory. If the node is under memory pressure, dirty pages will get flushed to free up more memory (that the OSD requested). Local filesystems will get the request to flush I/O, and that includes xfs. If the xfs filesystem is on an RBD device, the I/O is sent over the network to the OSD, which in turn requires (again) more memory to process the write requests. This ends up in an endless loop, and a hung_task might be reported in the kernel logs.

https://rook.io/docs/rook/v1.10/Troubleshooting/ceph-common-issues/#a-worker-node-using-rbd-devices-hangs-up

Because of this, it is not trivial to use xfs as the default filesystem on top of RBD devices.
Yes, closing as suggested. Using the storage API is the recommended approach and yields the correct configuration.
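For readers landing here, a minimal sketch of a DataVolume using the storage API (spec.storage instead of spec.pvc), assuming the cdi.kubevirt.io/v1beta1 API; the name and size are placeholders. With the storage API, volumeMode and accessModes are filled in by CDI from the StorageProfile when omitted, so the recommended configuration for the storage class is applied without the user having to know these details:

---
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: workload-disk              # placeholder name
spec:
  storage:                         # storage API: volumeMode and accessModes are
    resources:                     # inferred from the StorageProfile when omitted
      requests:
        storage: 64Gi
    storageClassName: ocs-storagecluster-ceph-rbd
  source:
    blank: {}
---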