Description of problem:
Customer is seeing timeout errors whilst accessing an iSCSI lun using iSER
over an Infiniband connection. Errors such as this are logged:
sd 4:0:0:1: timing out command, waited 30s
The iSCSI lun is presented by one RHEL5.4 host for a similar host to access.
Both hosts have a Mellanox ConnectX Infiniband card and are linked together.
The iSCSI target host is using 0.0-6.20091205snap.el5_4.1 to export the lun.
The timeouts occur more often if the iSCSI lun is mirrored, via MD, on the
client host to a local lun.
The customer observed that, while the problem occurs, the dirty buffers on the
iSCSI target host grow to the maximum, e.g. 40% of 16GB, whilst the md
mirroring is running. After dropping dirty_ratio to 1 on the iSCSI server, the
problem was no longer reproducible.
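To put numbers on the observation above, here is a minimal sketch of how much dirty data each setting allows. It assumes the ratio applies to the full 16GB quoted in the report; the kernel actually applies dirty_ratio to dirtyable memory, which is somewhat less than total RAM, so treat the results as upper bounds. The function name is hypothetical.

```python
GIB = 1024 ** 3


def dirty_ceiling_bytes(dirtyable_bytes: int, dirty_ratio_percent: int) -> int:
    """Approximate ceiling on dirty page cache for a given dirty_ratio.

    Assumption: the ratio is applied to `dirtyable_bytes` as a whole.
    """
    if not 0 <= dirty_ratio_percent <= 100:
        raise ValueError("dirty_ratio is a percentage between 0 and 100")
    return dirtyable_bytes * dirty_ratio_percent // 100


if __name__ == "__main__":
    ram = 16 * GIB  # figure quoted in the report
    for ratio in (40, 1):  # the default seen by the customer, then the workaround
        limit = dirty_ceiling_bytes(ram, ratio)
        print(f"dirty_ratio={ratio}: up to about {limit / GIB:.2f} GiB dirty")
```

With these inputs the ceiling drops from roughly 6.4 GiB to roughly 0.16 GiB, which is the backlog the target has to flush when a sync cache command arrives.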
>>>>>> from Mike Christie:
I was looking at the logs some more and the command we are timing out on is
the sync cache one. When you do not mess with dirty ratio, then we have lots
of data to write/sync up, and so that command takes a long time. We actually
only get 3 chances at completing the command and the command gets 30 secs
each attempt (I was mistaken before because I thought it was coming from
userspace but it is actually coming from the kernel using the passthrough).
So when you changed the dirty settings, we write out data sooner and there
is not the case like before where we have lots and lots of data to write out
and sync up and we cannot do all the writes within the 30 secs.
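The reasoning above can be sketched as simple arithmetic. This is a rough model only: it assumes the whole dirty backlog must reach disk before the sync cache command completes, and that the target writes at a constant rate. The 100 MiB/s rate is a made-up illustration, not a measurement from this bug, and the function names are hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def flush_seconds(dirty_bytes: int, write_bytes_per_sec: float) -> float:
    """Time to write out a dirty backlog at a constant write rate."""
    if write_bytes_per_sec <= 0:
        raise ValueError("write rate must be positive")
    return dirty_bytes / write_bytes_per_sec


def fits_in_timeout(dirty_bytes: int, write_bytes_per_sec: float,
                    timeout_s: float = 30.0) -> bool:
    """True if the backlog can be flushed before the command times out."""
    return flush_seconds(dirty_bytes, write_bytes_per_sec) <= timeout_s


if __name__ == "__main__":
    rate = 100 * MIB  # hypothetical target write rate
    for label, backlog in (("40% of 16 GiB", 16 * GIB * 40 // 100),
                           ("1% of 16 GiB", 16 * GIB // 100)):
        secs = flush_seconds(backlog, rate)
        verdict = "fits" if fits_in_timeout(backlog, rate) else "times out"
        print(f"{label}: {secs:.1f}s to flush, {verdict} at 30s")
```

Under these assumptions the large backlog needs well over 30 seconds, while the backlog allowed by dirty_ratio=1 flushes in a couple of seconds.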
So we can
1. have the user adjust the vm settings like you are doing.
I think having lots of write outs may affect performance. If it does, then
maybe we can do #2.
2. Add something simple like a modparm that lets you set the timeout for
the sync cache command. I did this in the attached patch. It was made over
the current rhel5 kernel.
When you load sd_mod just pass in the new mod param sd_sync_cache_tmo
modprobe sd_mod sd_sync_cache_tmo=N
where N is some value in seconds (I would try maybe a couple minutes).
(note if they are using scsi for root then you have to pass the mod param
on the command line).
And to check that it is getting set you can do
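One way to check, as a sketch: readable module parameters normally appear under /sys/module/<module>/parameters/. This assumes the attached patch declares sd_sync_cache_tmo with read permission, which cannot be confirmed from this report. The helper name read_module_param is hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return a module parameter's current value, or None if it is not exposed.

    Assumption: the parameter was declared with read permission, so sysfs
    publishes it at <sysfs_root>/<module>/parameters/<param>.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        # Module not loaded, parameter not exported, or not readable.
        return None


if __name__ == "__main__":
    value = read_module_param("sd_mod", "sd_sync_cache_tmo")
    if value is None:
        print("sd_sync_cache_tmo is not exposed on this kernel")
    else:
        print(f"sd_sync_cache_tmo = {value} seconds")
```

On a kernel without the patch the helper simply returns None, so it is safe to run before and after loading the patched sd_mod.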
Steps to Reproduce:
-Create iSCSI connection via iSER
-Mirror iSCSI device with local device via mdadm
-Begin mirror sync
-Wait for timeouts
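For the last step, a small sketch that picks the timeout message out of kernel log lines. The pattern is built from the single example message quoted in the description, so other kernels may word it differently. The function name is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# Matches e.g. "sd 4:0:0:1: timing out command, waited 30s"
TIMEOUT_RE = re.compile(
    r"sd (\d+:\d+:\d+:\d+): timing out command, waited (\d+)s"
)


def find_sd_timeouts(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (host:channel:id:lun, seconds waited) for each timeout line."""
    hits = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2))))
    return hits


if __name__ == "__main__":
    sample = [
        "sd 4:0:0:1: timing out command, waited 30s",
        "some unrelated log line",
    ]
    for device, waited in find_sd_timeouts(sample):
        print(f"device {device} timed out after {waited}s")
```

Feeding it the lines of the kernel log (for example, a saved dmesg capture) gives a quick count of how often each device hit the timeout during the mirror sync.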
Patch was sent upstream:
Adding devel ack for 5.6 and taking bz.
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release. Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products. This request is not yet committed for inclusion in an Update
release.
The patch that went upstream increased the hard coded sync cache timeout from 30 to 60 secs:
If the sync cache takes longer than that, then we need to adjust the dirty ratio settings on the target, stop using write-back caching on the target, or modify the target so that it is smarter about writing data out and handling sync caches.
Created attachment 447599
increase sync cache timeout from 30 to 60
This is a port of the patch that went upstream. It increases the sync cache timeout from 30 secs to 60 secs.
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5
Detailed testing feedback is always welcomed.
When creating an iSCSI session over iSER, I got a kernel panic with kernel-2.6.18-229.el5 and iscsi-initiator-utils-188.8.131.522-4.el5.
Please check Bug #664651 for details.
Adding Bug #664651 as a dependency of this bug.
(In reply to comment #9)
> When creating iSCSI over iSER, got kernel panic with kernel-2.6.18-229.el5 and
> Please check Bug #664651 for the detail.
> Add Bug #664651 as dependence of this bug.
Bug #664651 is for QLogic cards, whereas this bug was opened for a setup with Mellanox cards. The two might not be related at all.
Tested this bug both on RHEL5.6 and RHEL5.5.
[root@ib-test1 ~]# iscsiadm -m session
iser:  192.0.0.1:3260,1 iqn.2010-10.com.example:storage-1000
As I only have one server for testing, I set up the iSCSI target and the iSCSI initiator on the same machine.
I could not reproduce the 30s timeout over iSER.
I tried writing /dev/urandom to /dev/md0 with dd during the mirror sync.
This patch changes the SCSI sync cache timeout value from 30s to 60s.
linux-2.6-scsi-increase-sync-cache-timeout.patch was applied to kernel-2.6.18-236.el5
Setting this bug as Sanity Only.
Can you set up IPoIB for the server I am using?
I also need another server with an IB connection to this one.
Is it possible for you to change the IB switch port speed?
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.