Description of problem:

Customer is seeing timeout errors whilst accessing an iSCSI lun using iSER over an Infiniband connection. Errors such as this are logged:

sd 4:0:0:1: timing out command, waited 30s

The iSCSI lun is presented by one RHEL5.4 host for a similar host to access. Both hosts have a Mellanox ConnectX Infiniband card and are linked together. The iSCSI target host is using 0.0-6.20091205snap.el5_4.1 to export the lun.

The timeouts occur more often if the iSCSI lun is mirrored, via MD, on the client host to a local lun. The customer observed that during the problem the dirty buffers on the iSCSI host increase up to the maximum, e.g. 40% of 16GB, whilst the md mirroring is running. After dropping the dirty_ratio to 1 on the iSCSI server, the problem was no longer reproducible.

>>>>>> from Mike Christie:

I was looking at the logs some more and the command we are timing out on is the sync cache one. When you do not mess with dirty ratio, then we have lots of data to write/sync up, and so that command takes a long time. We actually only get 3 chances at completing the command and the command gets 30 secs each attempt (I was mistaken before because I thought it was coming from userspace but it is actually coming from the kernel using the passthrough interface).

So when you changed the dirty settings, we write out data sooner and there is not the case like before where we have lots and lots of data to write out and sync up and we cannot do all the writes within the 30 secs.

So we can

1. have the user adjust the vm settings like you are doing. I think having lots of write outs may affect performance. If it does then maybe we can do #2.

2. Add something simple like a modparm that lets you set the timeout for the sync cache command. I did this in the attached patch. It was made over the current rhel5 kernel.

When you load sd_mod just pass in the new mod param sd_sync_cache_tmo

modprobe sd_mod sd_sync_cache_tmo=N

where N is some value in seconds (I would try maybe a couple minutes). (note if they are using scsi for root then you have to pass the mod param on the command line).

And to check that it is getting set you can do

cat /sys/module/sd_mod/parameters/sd_sync_cache_tmo

<<<<<<<<<<<<<<<<<

Steps to Reproduce:
- Create iSCSI connection via iSER
- Mirror iSCSI device with local device via mdadm
- Begin mirror sync
- Wait for timeouts
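To put rough numbers on the explanation quoted above, here is a minimal Python 3 sketch of the arithmetic. The function names are hypothetical, 16GB is taken as 16 GiB, and dirty_ratio is treated as a percentage of total RAM, which is a simplification of how the kernel actually applies it. Under those assumptions, a dirty_ratio of 40 allows about 6.4 GiB of dirty data, which would need roughly 218 MiB/s of sustained writes to flush within a single 30 second attempt; with dirty_ratio at 1 the figure drops to about 5.5 MiB/s.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_dirty_bytes(total_ram_bytes, dirty_ratio_percent):
    """Upper bound on dirty page cache, assuming the ratio applies to total RAM."""
    return total_ram_bytes * dirty_ratio_percent // 100


def required_write_rate(dirty_bytes, timeout_secs):
    """Bytes per second the target must sustain to flush everything in time."""
    return dirty_bytes / timeout_secs


def report(total_ram_bytes, dirty_ratio_percent, timeout_secs):
    """Format one line summarising the worst case for a given setting."""
    dirty = max_dirty_bytes(total_ram_bytes, dirty_ratio_percent)
    rate = required_write_rate(dirty, timeout_secs)
    return "dirty_ratio=%d: up to %.2f GiB dirty, needs %.1f MiB/s for %d s" % (
        dirty_ratio_percent, dirty / GIB, rate / MIB, timeout_secs)


# Compare the default-like setting from the report with the workaround.
for ratio in (40, 1):
    print(report(16 * GIB, ratio, 30))
```

This only bounds the worst case. The real time depends on how much data is actually dirty when the sync cache command arrives and on how fast the backing store on the target can write.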
Patch was sent upstream: http://www.spinics.net/lists/linux-scsi/msg45017.html Adding devel ack for 5.6 and taking bz.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
The patch that went upstream increased the hard coded sync cache timeout from 30 to 60 secs:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=e3b3e6246726cd05950677ed843010b8e8c5884c

If the sync cache is taking longer than that, then we need to adjust the dirty ratio settings on the target, or not use write back caching on the target, or modify the target so that it is smarter about writing data out and handling sync caches.
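One way to judge on the target whether the pending write-back is likely to fit inside the 60 second timeout is to read the Dirty and Writeback counters from /proc/meminfo and divide by a write rate. The following is a minimal Python 3 sketch, not part of the patch: the function names are hypothetical and the write rate is a number you would have to measure or assume for your own storage.

```python
def parse_meminfo_kb(text, keys=("Dirty", "Writeback")):
    """Extract the selected counters (in kB) from /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name in keys and rest.split():
            values[name] = int(rest.split()[0])
    return values


def estimated_flush_secs(meminfo_text, write_rate_mib_per_sec):
    """Seconds needed to write out Dirty + Writeback at the assumed rate."""
    kb = parse_meminfo_kb(meminfo_text)
    pending_mib = (kb.get("Dirty", 0) + kb.get("Writeback", 0)) / 1024
    return pending_mib / write_rate_mib_per_sec


def exceeds_timeout(meminfo_text, write_rate_mib_per_sec, timeout_secs=60):
    """True if the estimate is longer than the sync cache timeout."""
    return estimated_flush_secs(meminfo_text, write_rate_mib_per_sec) > timeout_secs
```

On the target this could be called as `exceeds_timeout(open("/proc/meminfo").read(), 100.0)`, where 100.0 MiB/s is only a placeholder. The estimate is a snapshot, so it is best sampled repeatedly while the mirror sync is running.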
Created attachment 447599
increase sync cache timeout from 30 to 60

This is a port of the patch that went upstream. It increases the sync cache timeout from 30 secs to 60 secs.
in kernel-2.6.18-223.el5 You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5 Detailed testing feedback is always welcomed.
When creating iSCSI over iSER, I got a kernel panic with kernel-2.6.18-229.el5 and iscsi-initiator-utils-6.2.0.872-4.el5.

Please check Bug #664651 for details.

Adding Bug #664651 as a dependency of this bug.
(In reply to comment #9)
> When creating iSCSI over iSER, I got a kernel panic with kernel-2.6.18-229.el5 and
> iscsi-initiator-utils-6.2.0.872-4.el5.
>
> Please check Bug #664651 for details.
>
> Adding Bug #664651 as a dependency of this bug.

Bug #664651 is for Qlogic cards, whereas this bug was opened for a setup with Mellanox cards. The two might not be related at all.
Tested this bug on both RHEL5.6 and RHEL5.5.

[root@ib-test1 ~]# iscsiadm -m session
iser: [1] 192.0.0.1:3260,1 iqn.2010-10.com.example:storage-1000

As I only have one server for testing, I set up the iSCSI target and the iSCSI initiator on the same machine. I could not simulate a 30s timeout over iSER. I tried using dd to write data from urandom to /dev/md0 during the mirror sync.

This patch changes the SCSI sync cache timeout value from 30s to 60s. linux-2.6-scsi-increase-sync-cache-timeout.patch was applied to kernel-2.6.18-236.el5.

Setting this bug as Sanity Only.

Gurhan,
Can you set up IPoIB on the server I am using? I also need another server with an IB connection to this one. Is it possible for you to change the IB switch port speed?

Thank you.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html