Bug 2175242 - [RFE] I/O timeout needed in krbd
Summary: [RFE] I/O timeout needed in krbd
Keywords:
Status: NEW
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RBD
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 8.0
Assignee: Ilya Dryomov
QA Contact: Preethi
URL:
Whiteboard:
Depends On:
Blocks: 2158591
 
Reported: 2023-03-03 16:28 UTC by Adam Litke
Modified: 2023-08-04 20:36 UTC
CC List: 15 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2158591
Environment:
Last Closed:
Embargoed:




Links:
Red Hat Issue Tracker RHCEPH-6226 (last updated 2023-03-03 16:31:39 UTC)

Description Adam Litke 2023-03-03 16:28:56 UTC
+++ This bug was initially created as a clone of Bug #2158591 +++

Description of problem:

The VMs created using OpenShift Virtualization are configured by default to pause if the storage returns an I/O error. This is defined by error_policy='stop' in the libvirt domain XML as below and propagates to QEMU as "werror=stop,rerror=stop":

~~~
    <disk type='block' device='disk' model='virtio-non-transitional'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='native'/>    <<<
      <source dev='/dev/rootdisk' index='2'/>
      <backingStore/>
      <target dev='vda' bus='virtio'/>
      <alias name='ua-rootdisk'/>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </disk>
~~~
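
For reference, libvirt translates error_policy='stop' into the QEMU werror/rerror drive properties. The following is only a simplified sketch of the resulting command-line fragment (the real libvirt-generated command line uses -blockdev/-device and is longer; the device path and drive id are taken from the XML above):

~~~
# Simplified illustration of how the error policy reaches QEMU.
qemu-kvm \
  -drive file=/dev/rootdisk,format=raw,cache=none,aio=native,if=none,id=ua-rootdisk,werror=stop,rerror=stop \
  -device virtio-blk-pci,drive=ua-rootdisk
~~~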

When the Ceph backend is used, the device is mapped with the kernel rbd module and the timeout is controlled by osd_request_timeout, which defaults to 0 [1]. This means the I/O will wait forever and never time out, so QEMU never receives an EIO for the pending I/Os and never moves the VM into the paused state.

Also, we cannot power down the VM in this state. When we try, the virt-launcher pod ends up in "Terminating" status and qemu-kvm ends up in uninterruptible sleep (D state).

~~~
[root@worker-0 ~]# ps aux|grep qemu-kvm
107      1112413 12.7  0.0      0     0 ?        D    18:12   0:47 [qemu-kvm]
~~~
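
A process stuck in uninterruptible sleep can be confirmed to be blocked in the rbd/libceph I/O path by reading its kernel stack (the PID is taken from the ps output above; the exact frames will vary by kernel version):

~~~
# Inspect where the D-state qemu-kvm process is blocked in the kernel.
cat /proc/1112413/stack
# If the hung task detector is enabled, related warnings also show up in the kernel log:
dmesg | grep -i "blocked for more than"
~~~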

Force deleting the virt-launcher pod removes the pod, but the qemu-kvm process remains on the OCP node and the only way to get rid of it is to reboot the node.

It is technically possible to set osd_request_timeout while mapping the device using -o osd_request_timeout=<seconds>, and ceph-csi can pass this via mapOptions. However, [2] and [3] recommend against setting osd_request_timeout, even though other block storage transports such as FC or iSCSI do provide an I/O timeout.

[1] https://github.com/torvalds/linux/blob/85c7000fda0029ec16569b1eec8fd3a8d026be73/include/linux/ceph/libceph.h#L78
[2] https://patchwork.kernel.org/project/ceph-devel/patch/1527132420-10740-1-git-send-email-dongsheng.yang@easystack.cn/
[3] https://github.com/ceph/ceph/pull/20792#pullrequestreview-102251868
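
For completeness, a sketch of the workaround described above. The 30-second value and the pool/image names are illustrative only, and [2]/[3] advise against relying on osd_request_timeout:

~~~
# Map an RBD image with an explicit OSD request timeout (in seconds).
rbd device map -o osd_request_timeout=30 mypool/myimage

# ceph-csi can pass the same option through the StorageClass, e.g.:
#   parameters:
#     mapOptions: "osd_request_timeout=30"
~~~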


Version-Release number of selected component (if applicable):

OpenShift Virtualization   4.11.1

How reproducible:

100%

Steps to Reproduce:

1. Block the communication between the Ceph storage and the worker node where the VM is running (one way to simulate this is sketched after these steps).
2. The VM hangs while the network layer still responds to ping requests; the virsh output still shows the domain as running.
3. Try shutting down the VM. The virt-launcher pod gets stuck in "Terminating".
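
One way to simulate the outage in step 1, assuming the default Ceph ports (3300/6789 for the monitors, 6800-7300 for the OSDs); run this on the worker node and remove the rule afterwards:

~~~
# Drop all outgoing traffic from the worker node to the Ceph public network ports.
iptables -A OUTPUT -p tcp -m multiport --dports 3300,6789,6800:7300 -j DROP
# To restore connectivity:
iptables -D OUTPUT -p tcp -m multiport --dports 3300,6789,6800:7300 -j DROP
~~~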

Actual results:

Virtual machines do not move into "Paused" status when the Ceph storage backend is unavailable.

Expected results:

The customer who reported the issue expects the VM to shut down or be paused when the storage goes down. Since the VM is still pingable, this prevents users from building HA applications that use an election mechanism depending on health reports inferred from network connectivity, even if they spread their workload across multiple AZs.

Additional info:

--- Additional comment from Stefan Hajnoczi on 2023-02-13 12:43:56 UTC ---

I think Ceph needs timeouts so userspace processes like QEMU don't get stuck in uninterruptible sleep forever. QEMU userspace cannot work around that state, so a host kernel solution is needed.

I checked the Linux 6.2.0-rc5 rbd driver code and it doesn't implement the blk_mq_ops->timeout() mechanism for I/O timeouts. The osd_request_timeout parameter mentioned in the bug report might be a solution for the time being with blk_mq_ops->timeout() as a long-term solution.

Ilya: What would you recommend for rbd users that need I/O timeouts?

--- Additional comment from Ilya Dryomov on 2023-02-13 13:22:28 UTC ---

(In reply to Stefan Hajnoczi from comment #1)
> I checked the Linux 6.2.0-rc5 rbd driver code and it doesn't implement the
> blk_mq_ops->timeout() mechanism for I/O timeouts. The osd_request_timeout
> parameter mentioned in the bug report might be a solution for the time being
> with blk_mq_ops->timeout() as a long-term solution.
> 
> Ilya: What would you recommend for rbd users that need I/O timeouts?

Hi Stefan,

Unfortunately, nothing has changed in this area, meaning that both

https://patchwork.kernel.org/project/ceph-devel/patch/1527132420-10740-1-git-send-email-dongsheng.yang@easystack.cn/
https://github.com/ceph/ceph/pull/20792#pullrequestreview-102251868

are still valid.  If a timeout is absolutely needed, the undocumented osd_request_timeout mapping option is the only avenue for krbd.

librbd doesn't support timeouts either but it's a lesser issue there because librbd can be fronted with something that does.  The NBD driver in the kernel is one such thing so another possible solution is to switch to rbd-nbd ("sudo rbd device map -t nbd").
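
A sketch of the rbd-nbd alternative mentioned above; the pool/image name is a placeholder, and the sysfs path assumes a kernel new enough to expose the block-layer io_timeout attribute, so treat it as something to verify (see rbd-nbd(8)) rather than a confirmed interface for every release:

~~~
# Map through rbd-nbd so the kernel NBD driver fronts librbd.
sudo rbd device map -t nbd mypool/myimage
# The NBD block device goes through the block layer's request timeout handling,
# which on recent kernels can be inspected via sysfs, e.g. for /dev/nbd0:
cat /sys/block/nbd0/queue/io_timeout
~~~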

--- Additional comment from Fabian Deutsch on 2023-02-22 12:11:35 UTC ---

Ilya, thanks for your input. Regardless of the solution, where do you suggest an RFE should be filed for the described failure scenario?

--- Additional comment from Ilya Dryomov on 2023-02-22 12:55:51 UTC ---

Hi Fabian,

Technically, if this got implemented upstream, a RHEL kernel BZ would be needed for the backport, but we don't have a krbd sub-component there. I would suggest filing it under the Red Hat Ceph Storage product / RBD component (even though the Version, Target Release, etc. fields won't make sense).

Note though that this would require some pretty heavy lifting in krbd and so it's not a short-term deliverable by any means, even if prioritized.

--- Additional comment from Fabian Deutsch on 2023-02-22 13:47:09 UTC ---

Thanks, Ilya.

@alitke @pelauter it seems like we need this to get qemu unfrozen in case of IO issues with krbd. MoD is running into this. Can you take it from here?

Comment 1 Adam Litke 2023-03-03 16:32:36 UTC
Please see the comment history from Ilya and Stefan.  We are using the krbd driver in OCP and our Pod cannot be cleaned up if the workload loses access to the storage.

Comment 3 Adam Litke 2023-04-18 12:15:42 UTC
Ilya, can we get this bug targeted?

