Description of problem:

The time it takes to detect a problem and get IO failed upwards should take nop timeout + nop interval + replacement_timeout seconds. It is taking a lot longer because when the problem occurs the xmit thread could be asleep in sendpage's sk_stream_wait_memory, or it could have partially sent data, then swing around and end up falling into sk_stream_wait_memory.

This is fixed in the iscsi layer with these two patches:
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=b64e77f70b8c11766e967e3485331a9e6ef01390
http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=32382492eb18e8e20be382a1743d0c08469d1e84

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
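For reference (not from the original report), here is a trivial sketch of that arithmetic in C. The values used are only the common open-iscsi defaults, assumed for illustration; the real numbers come from the iscsi db for your node record.

    #include <stdio.h>

    /* Assumed values only -- the real ones come from the iscsi db
     * (iscsiadm -m node -T <target>); these are common defaults. */
    int main(void)
    {
        int noop_out_interval   = 5;    /* node.conn[0].timeo.noop_out_interval   */
        int noop_out_timeout    = 5;    /* node.conn[0].timeo.noop_out_timeout    */
        int replacement_timeout = 120;  /* node.session.timeo.replacement_timeout */

        /* Per the report, IO should be failed upwards within roughly
         * nop timeout + nop interval + replacement_timeout seconds. */
        int expected_worst_case = noop_out_timeout + noop_out_interval
                                  + replacement_timeout;

        printf("expected worst-case failover time: ~%d seconds\n",
               expected_worst_case);
        return 0;
    }

With the defaults above that works out to roughly 130 seconds, which is why failover taking "several minutes" indicates the xmit thread is stuck rather than the timers simply being long.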
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Oh yeah, this is a regression added in 5.3. In the older kernels we would fail IO right away, which allowed us to bypass possibly getting stuck in sk_stream_wait_memory. Now we only fail IO after the code path that handles the xmit thread and sk_stream_wait_memory has run.
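To illustrate the pattern being described (this is only a userspace analogy, not the kernel code, and every name in it is made up): the xmit thread can sit in an unbounded wait for send space, and nothing fails the IO until that wait returns on its own; the gist of the fix is that marking the connection failed also wakes the waiter so it can bail out right away.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Toy model: "send space" never shows up because the peer is gone,
     * so an unbounded wait would hang until an unrelated socket timeout
     * fires.  The fix's idea is that connection teardown also wakes the
     * waiter so it can fail the IO immediately. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  wake = PTHREAD_COND_INITIALIZER;
    static bool have_send_space = false;   /* never set: cable was pulled      */
    static bool conn_failed     = false;   /* set by the nop/ping timeout path */

    static void *xmit_thread(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        /* Analogue of sleeping in sk_stream_wait_memory: wait until either
         * space shows up or the connection is explicitly failed. */
        while (!have_send_space && !conn_failed)
            pthread_cond_wait(&wake, &lock);
        pthread_mutex_unlock(&lock);

        if (conn_failed)
            printf("xmit thread: connection failed, failing IO upwards now\n");
        return NULL;
    }

    int main(void)
    {
        pthread_t tid;
        pthread_create(&tid, NULL, xmit_thread, NULL);

        sleep(1);  /* stand-in for the nop interval + nop timeout expiring */

        pthread_mutex_lock(&lock);
        conn_failed = true;            /* mark the connection dead ...      */
        pthread_cond_broadcast(&wake); /* ... and wake the blocked sender   */
        pthread_mutex_unlock(&lock);

        pthread_join(tid, NULL);
        return 0;
    }

Without the broadcast on failure, the waiter in this sketch would sleep forever, which is the userspace analogue of the multi-minute delay reported here.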
(In reply to comment #0)
> Description of problem:
>
> The time it takes to detect a problem and get IO failed upwards should take
> nop timeout + nop interval + replacement_timeout seconds. It is taking a lot
> longer because when the problem occurs the xmit thread could be asleep in
> sendpage's sk_stream_wait_memory or it could have partially sent data then it
> could swing around and end up falling into sk_stream_wait_memory.
>
> Steps to Reproduce:
> 1.
> 2.
> 3.

Hi Mike,

Could you please give the steps to reproduce the bug?

Thanks a lot.
How to replicate the bug:

1. Run an IO test that produces lots of writes. I use disktest with lots of large writes and lots of threads, or you can run lots of dds, or you can probably run a FS test.

2. While the IO is running, pull a cable, shut down the target, or kill the switch.

3. You should see the noop/ping timeout error within the timeouts set in the iscsi db (run iscsiadm -m node -T your_target to see them). But then, instead of seeing the replacement/recovery timeout message X seconds after that, it will take several minutes before that error finally shows up and the IO is failed upwards.
Oh yeah, you would see the error messages I mention in /var/log/messages. And for #3 you can see the replacement/recovery timeout being used in the same "iscsiadm -m node -T target" command.
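To make step 1 concrete, here is a minimal, hypothetical write-load generator in C. It is just a stand-in for disktest or many dds; the TARGET path, thread count, and buffer size are made-up placeholders you would adjust for your setup. Build with -lpthread and point it at files on the iscsi disk under test.

    /*
     * Toy write-load generator, roughly along the lines of step 1
     * (disktest with lots of large writes and threads, or many dds).
     * Path, thread count and sizes are placeholders.
     */
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define NTHREADS   8
    #define BUFSIZE    (1024 * 1024)          /* 1 MB per write       */
    #define MAXWRITES  256                    /* rewind after ~256 MB */
    #define TARGET     "/mnt/iscsi/testfile"  /* placeholder path     */

    static void *writer(void *arg)
    {
        long id = (long)arg;
        char path[256];
        snprintf(path, sizeof(path), "%s.%ld", TARGET, id);

        char *buf = malloc(BUFSIZE);
        if (!buf)
            return NULL;
        memset(buf, 0xab, BUFSIZE);

        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) {
            perror("open");
            free(buf);
            return NULL;
        }

        /* Keep the session busy with large writes until the run is killed,
         * or until the IO finally gets failed upwards after the cable pull. */
        for (int n = 0; ; n++) {
            if (write(fd, buf, BUFSIZE) < 0) {
                perror("write");
                break;
            }
            if (n % MAXWRITES == MAXWRITES - 1)
                lseek(fd, 0, SEEK_SET);       /* don't fill the disk */
        }

        close(fd);
        free(buf);
        return NULL;
    }

    int main(void)
    {
        pthread_t tids[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tids[i], NULL, writer, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tids[i], NULL);
        return 0;
    }

While this is running, do step 2 and watch /var/log/messages for the noop/ping timeout and replacement/recovery timeout messages described above.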
Created attachment 405551 [details]
Both patches in this BZ combined.
Created attachment 405556 [details]
Spec file; I just renamed my patch to the test patch.
Created attachment 405617 [details]
Updated failover fix

Patch sent to rh-kernel. It has one more fix.
Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and 5.4.z too?
(In reply to comment #21)
> Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and
> 5.4.z too?

Yes, I think it would be useful.
per comment #22, proposing this for 5.3.z and 5.4.z
in kernel-2.6.18-197.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
*** Bug 606801 has been marked as a duplicate of this bug. ***
Verified via test case in comment #9 on 5.6.Server-20101124.1.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html