Bug 592322 - [RHEL 5] Errors when Accessing iSCSI luns via iSER - timing out command
Summary: [RHEL 5] Errors when Accessing iSCSI luns via iSER - timing out command
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Mike Christie
QA Contact: Gris Ge
URL:
Whiteboard:
Depends On: 664651
Blocks:
Reported: 2010-05-14 14:58 UTC by Flavio Leitner
Modified: 2018-11-14 20:10 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 21:31:58 UTC
Target Upstream Version:
Embargoed:


Attachments
increase sync cache timeout from 30 to 60 (903 bytes, application/octet-stream)
2010-09-16 00:50 UTC, Mike Christie


Links
System: Red Hat Product Errata
ID: RHSA-2011:0017
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update
Last Updated: 2011-01-13 10:37:42 UTC

Description Flavio Leitner 2010-05-14 14:58:20 UTC
Description of problem:

A customer is seeing timeout errors while accessing an iSCSI lun using iSER
over an InfiniBand connection. Errors such as the following are logged:

   sd 4:0:0:1: timing out command, waited 30s

The iSCSI lun is presented by one RHEL5.4 host for a similar host to access.
Both hosts have a Mellanox ConnectX Infiniband card and are linked together.
The iSCSI target host is using 0.0-6.20091205snap.el5_4.1 to export the lun.
The timeouts occur more often if the iSCSI lun is mirrored, via MD, on the
client host to a local lun.

They observed that, while the problem occurs, the dirty buffers on the iSCSI
server grow to the maximum (e.g. 40% of 16GB) while the md mirroring is
running.  After dropping dirty_ratio to 1 on the iSCSI server, the
problem was no longer reproducible.
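
As a rough worked example of the figures above, the following sketch computes the ceiling on dirty data for a given dirty_ratio. It is an approximation only: it applies the percentage to total RAM, as the description does, whereas the exact pool of memory the kernel applies it to depends on the kernel version. The function name is hypothetical.

```python
GIB = 2**30


def dirty_ceiling_bytes(ram_bytes: int, dirty_ratio_percent: int) -> int:
    """Upper bound on dirty page-cache bytes.

    Assumption: the ratio is applied to all of RAM, which overstates
    what the kernel actually allows.
    """
    if not 0 <= dirty_ratio_percent <= 100:
        raise ValueError("dirty_ratio is a percentage between 0 and 100")
    return ram_bytes * dirty_ratio_percent // 100


if __name__ == "__main__":
    ram = 16 * GIB  # the 16GB server from the description
    for ratio in (40, 1):
        ceiling = dirty_ceiling_bytes(ram, ratio)
        print(f"dirty_ratio={ratio}: up to {ceiling / GIB:.2f} GiB dirty")
```

For the 16GB server this gives roughly 6.4 GiB of dirty data at dirty_ratio 40 and roughly 0.16 GiB at dirty_ratio 1, which is the difference in backlog that a single sync has to push out.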


>>>>>> from Mike Christie:
I was looking at the logs some more and the command we are timing out on is 
the sync cache one. When you do not mess with dirty ratio, then we have lots 
of data to write/sync up, and so that command takes a long time. We actually
only get 3 chances at completing the command and the command gets 30 secs 
each attempt (I was mistaken before because I thought it was coming from
userspace but it is actually coming from the kernel using the passthrough
interface).

So when you changed the dirty settings, we write out data sooner, and we no
longer hit the case where there is so much data to write out and sync up
that we cannot do all the writes within the 30 secs.

So we can

1. have the user adjust the vm settings like you are doing.

I think having lots of write outs may affect performance. If it does, then
maybe we can do #2.

2. Add something simple like a modparm that lets you set the timeout for
the sync cache command. I did this in the attached patch. It was made over
the current rhel5 kernel.

When you load sd_mod just pass in the new mod param sd_sync_cache_tmo

modprobe sd_mod sd_sync_cache_tmo=N

where N is some value in seconds (I would try maybe a couple minutes).
(note if they are using scsi for root then you have to pass the mod param
on the command line).

And to check that it is getting set you can do

cat /sys/module/sd_mod/parameters/sd_sync_cache_tmo 
<<<<<<<<<<<<<<<<<
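
The sysfs check above can also be scripted. This is a minimal sketch, assuming the standard /sys/module/<module>/parameters/<name> layout; note that sd_sync_cache_tmo exists only on a kernel carrying the experimental patch described here. The helper name and its sysfs_root argument are hypothetical.

```python
from pathlib import Path
from typing import Optional


def read_module_param(module: str, param: str,
                      sysfs_root: str = "/sys/module") -> Optional[str]:
    """Return a module parameter's current value, or None if it cannot be read.

    None covers both "module not loaded" and "this kernel's module
    has no such parameter".
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None
```

Calling read_module_param("sd_mod", "sd_sync_cache_tmo") returns the configured timeout as a string on a patched kernel, and None on a kernel without the parameter.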

Steps to Reproduce:
-Create iSCSI connection via iSER
-Mirror iSCSI device with local device via mdadm
-Begin mirror sync
-Wait for timeouts
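
While waiting for the timeouts, the growth in dirty data mentioned in the description can be watched on the iSCSI server through the Dirty and Writeback fields of /proc/meminfo. The parser below is a small sketch for that; the function name is hypothetical.

```python
def parse_meminfo_kb(text, fields=("Dirty", "Writeback")):
    """Return {field: value_in_kB} for the requested /proc/meminfo fields."""
    values = {}
    for line in text.splitlines():
        # Lines look like "Dirty:          6553600 kB".
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and name in fields and parts:
            values[name] = int(parts[0])
    return values
```

On the server, parse_meminfo_kb(open("/proc/meminfo").read()) can be polled every few seconds during the mirror sync to see whether Dirty climbs toward the dirty_ratio limit.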

Comment 2 Mike Christie 2010-07-20 15:41:52 UTC
Patch was sent upstream:
http://www.spinics.net/lists/linux-scsi/msg45017.html

Adding devel ack for 5.6 and taking bz.

Comment 3 RHEL Program Management 2010-07-20 15:59:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 4 Mike Christie 2010-09-16 00:40:46 UTC
The patch that went upstream increased the hard coded sync cache timeout from 30 to 60 secs:
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commitdiff;h=e3b3e6246726cd05950677ed843010b8e8c5884c

If the sync cache is taking longer than that, we need to adjust the dirty ratio settings on the target, avoid write-back caching on the target, or modify the target so that it is smarter about writing data out and handling sync caches.
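
The first of these options, adjusting the dirty ratio settings on the target, amounts to writing a new value under /proc/sys/vm. A minimal sketch follows; the helper name and its vm_dir argument are hypothetical, the write needs root, and the change does not persist across a reboot.

```python
from pathlib import Path


def set_vm_tunable(name: str, value: int, vm_dir: str = "/proc/sys/vm") -> int:
    """Write an integer to a vm tunable and return the previous value.

    The previous value is returned so the caller can restore it later.
    """
    path = Path(vm_dir) / name
    previous = int(path.read_text().strip())
    path.write_text(f"{value}\n")
    return previous
```

For example, set_vm_tunable("dirty_ratio", 1) applies the setting that made the problem disappear for the customer and returns the old value.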

Comment 5 Mike Christie 2010-09-16 00:50:08 UTC
Created attachment 447599
increase sync cache timeout from 30 to 60

This is a port of the patch that went upstream. It increases the sync cache timeout from 30 secs to 60 secs.
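
To illustrate what the longer timeout buys, here is a back-of-the-envelope sketch. It assumes a simplified model in which one sync cache command has to wait for the whole dirty backlog to be written at a steady rate. The 150 MiB/s write rate is a made-up figure for illustration, and the function names are hypothetical.

```python
MIB = 2**20


def flush_seconds(dirty_bytes: int, write_bytes_per_s: int) -> float:
    """Time to write out the backlog at a constant rate (simplified model)."""
    if write_bytes_per_s <= 0:
        raise ValueError("write rate must be positive")
    return dirty_bytes / write_bytes_per_s


def sync_cache_fits(dirty_bytes: int, write_bytes_per_s: int,
                    timeout_s: int) -> bool:
    """True if the backlog can be flushed within one command timeout."""
    return flush_seconds(dirty_bytes, write_bytes_per_s) <= timeout_s


if __name__ == "__main__":
    backlog = 16 * 2**30 * 40 // 100  # 40% of 16GB, as in the description
    rate = 150 * MIB                  # hypothetical sustained write rate
    for timeout in (30, 60):
        fits = sync_cache_fits(backlog, rate, timeout)
        print(f"timeout={timeout}s: fits={fits}")
```

Under these assumptions the backlog takes between 43 and 44 seconds to flush, so it misses a 30 second timeout but completes within 60 seconds. A larger backlog or a slower disk would still time out, which is why the dirty ratio settings on the target remain relevant.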

Comment 7 Jarod Wilson 2010-09-21 21:00:19 UTC
in kernel-2.6.18-223.el5
You can download this test kernel (or newer) from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 9 Gris Ge 2010-12-21 06:09:46 UTC
When setting up iSCSI over iSER, I got a kernel panic with kernel-2.6.18-229.el5 and iscsi-initiator-utils-6.2.0.872-4.el5.

Please check Bug #664651 for the details.

Adding Bug #664651 as a dependency of this bug.

Comment 10 Gurhan Ozen 2011-01-04 02:23:09 UTC
(In reply to comment #9)
> When setting up iSCSI over iSER, I got a kernel panic with
> kernel-2.6.18-229.el5 and iscsi-initiator-utils-6.2.0.872-4.el5.
> 
> Please check Bug #664651 for the details.
> 
> Adding Bug #664651 as a dependency of this bug.

Bug #664651 is for QLogic cards, whereas this bug was opened for a setup with Mellanox cards. The two might not be related at all.

Comment 14 Gris Ge 2011-01-05 07:38:50 UTC
Tested this bug both on RHEL5.6 and RHEL5.5.

[root@ib-test1 ~]# iscsiadm -m session
iser: [1] 192.0.0.1:3260,1 iqn.2010-10.com.example:storage-1000
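
A small sketch for checking from a script that every session uses the iSER transport, assuming each line of `iscsiadm -m session` output has the shape shown above (transport, session id in brackets, portal with portal group tag, target name). The function names are hypothetical.

```python
import re

# transport: [sid] portal,tpgt target
SESSION_LINE = re.compile(
    r"^(?P<transport>\S+): \[(?P<sid>\d+)\] "
    r"(?P<portal>\S+),(?P<tpgt>-?\d+) (?P<target>\S+)"
)


def parse_sessions(output):
    """Return one dict per session line that matches the expected shape."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_LINE.match(line.strip())
        if match:
            sessions.append(match.groupdict())
    return sessions


def all_sessions_use(output, transport="iser"):
    """True only if at least one session exists and all use the transport."""
    sessions = parse_sessions(output)
    return bool(sessions) and all(s["transport"] == transport for s in sessions)
```

Fed the output above, parse_sessions returns a single entry with transport "iser" and portal "192.0.0.1:3260".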

As I only have one server for testing, I set up the iSCSI target and the iSCSI initiator on the same machine.

I could not reproduce the 30s timeout over iSER.
I tried writing urandom data to /dev/md0 with dd during the mirror sync.

This patch changes the SCSI sync cache timeout from 30s to 60s.

linux-2.6-scsi-increase-sync-cache-timeout.patch was applied to kernel-2.6.18-236.el5.

Setting bug as Sanity Only.

Gurhan,
Can you set up IPoIB for the server I am using?
I also need another server with an IB connection to this one.

Is it possible for you to change the IB switch port speed?

Thank you.

Comment 18 errata-xmlrpc 2011-01-13 21:31:58 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

