Bug 570681 - REGRESSION: Fix iscsi failover time
Summary: REGRESSION: Fix iscsi failover time
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 5.6
Assignee: Mike Christie
QA Contact: Storage QE
URL:
Whiteboard:
Duplicates: 606801
Depends On:
Blocks: 557597 570682 580840 583892 583893 583898 583899
 
Reported: 2010-03-05 01:47 UTC by Mike Christie
Modified: 2018-10-27 13:55 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 570682
Environment:
Last Closed: 2011-01-13 20:38:09 UTC
Target Upstream Version:
Embargoed:


Attachments
Both patches in this BZ combined. (1.68 KB, application/octet-stream)
2010-04-09 13:48 UTC, Ben Turner
Spec file, I just renamed my patch to test patch. (815.27 KB, application/octet-stream)
2010-04-09 14:07 UTC, Ben Turner
updated failover fix (4.46 KB, patch)
2010-04-09 18:35 UTC, Mike Christie


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2011:0017 | 0 | normal | SHIPPED_LIVE | Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update | 2011-01-13 10:37:42 UTC

Description Mike Christie 2010-03-05 01:47:52 UTC
Description of problem:

The time it takes to detect a problem and get IO failed upwards should be nop timeout + nop interval + replacement_timeout seconds. It is taking a lot longer because, when the problem occurs, the xmit thread could be asleep in sendpage's sk_stream_wait_memory, or it could have partially sent data and then swing around and end up falling into sk_stream_wait_memory.
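
As a rough illustration of the expected bound described above, the sum can be written out as a small Python sketch. The function name and the sample values are assumptions made for this example and are not taken from this report; the real values come from the iscsi node settings.

```python
def expected_failover_seconds(noop_interval, noop_timeout, replacement_timeout):
    """Expected time, in seconds, to detect the problem and fail IO upwards.

    This is just the sum described in the report:
    nop interval + nop timeout + replacement_timeout.
    """
    return noop_interval + noop_timeout + replacement_timeout


if __name__ == "__main__":
    # Illustrative values only (assumed, not from this bug report).
    print(expected_failover_seconds(noop_interval=5,
                                    noop_timeout=5,
                                    replacement_timeout=120))
```

A failover that takes several minutes longer than this sum is the symptom this bug describes.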

This is fixed in the iscsi layer with these two patches:

http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=b64e77f70b8c11766e967e3485331a9e6ef01390


http://git.kernel.org/?p=linux/kernel/git/jejb/scsi-misc-2.6.git;a=commit;h=32382492eb18e8e20be382a1743d0c08469d1e84


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 RHEL Program Management 2010-03-05 01:48:43 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 2 Mike Christie 2010-03-05 01:51:50 UTC
Oh yeah, this is a regression added in 5.3. In the older kernels we would fail IO right away, and that allowed us to avoid getting stuck in sk_stream_wait_memory. Now we fail IO after the code chunk that handles the xmit thread and sk_stream_wait_memory.

Comment 8 Zhang Kexin 2010-04-08 11:11:32 UTC
(In reply to comment #0)
> Description of problem:
> 
> The time it takes to detect a problem and get IO failed upwards should take nop
> timeout + nop interval + replacement_timeout seconds. It is taking a lot longer
> because when the problem occurs the xmit thread could be asleep in sendpage's
> sk_stream_wait_memory or it could have partially sent data then it could swing
> around and end up falling into sk_stream_wait_memory.
> Steps to Reproduce:
> 1.
> 2.
> 3.
Hi Mike,

Could you please give the steps to reproduce the bug? Thanks a lot.

Comment 9 Mike Christie 2010-04-09 03:28:42 UTC
How to replicate the bug:

1. Run an IO test that produces lots of writes. I use disktest with lots of large writes and lots of threads, or you can run lots of dds, or you can probably run an FS test.

2. While the IO is running, pull a cable, shut down the target, or kill the switch.

3. You should see the noop/ping timeout error within the timeouts set in the iscsi db (run iscsiadm -m node -T your_target to see them). But instead of seeing the replacement/recovery timeout message X seconds after that, it will take several minutes to finally see that error and the IO failed upwards.
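
To work out what "within the timeouts" means for a given node, the "key = value" lines printed by iscsiadm -m node -T your_target can be summed. The following is a minimal Python sketch under stated assumptions: it uses the open-iscsi setting names node.conn[0].timeo.noop_out_interval, node.conn[0].timeo.noop_out_timeout, and node.session.timeo.replacement_timeout, and the SAMPLE text with its values is made up for illustration rather than captured from a real system.

```python
# Settings assumed to hold the three timeouts discussed in this bug.
TIMEOUT_KEYS = (
    "node.conn[0].timeo.noop_out_interval",
    "node.conn[0].timeo.noop_out_timeout",
    "node.session.timeo.replacement_timeout",
)


def parse_node_record(text):
    """Turn 'key = value' lines into a dict, skipping lines without '='."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def failover_budget(text):
    """Sum the three timeouts (in seconds) found in a node record dump."""
    settings = parse_node_record(text)
    return sum(int(settings[key]) for key in TIMEOUT_KEYS)


# Hypothetical excerpt of a node record; the values are illustrative only.
SAMPLE = """\
# BEGIN RECORD
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
# END RECORD
"""

if __name__ == "__main__":
    print(failover_budget(SAMPLE))
```

In practice the text would come from saving the iscsiadm output to a file and reading it in; the parsing is kept separate from running the command so the sketch works without an iSCSI setup.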

Comment 10 Mike Christie 2010-04-09 03:29:35 UTC
Oh yeah, you would see the error messages I mention in /var/log/messages.

And for #3, you can see the replacement/recovery timeout being used in the output of the same "iscsiadm -m node -T target" command.
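
One way to measure the delay between the two messages is to compare their syslog timestamps. The Python sketch below is only an illustration: the marker substrings and the sample log lines are assumptions about what the kernel prints and should be adjusted to match the actual text in /var/log/messages. Classic syslog timestamps omit the year, so the caller supplies it.

```python
from datetime import datetime

# Assumed substrings of the two messages; adjust to the real log text.
PING_MARKER = "ping timeout of"
RECOVERY_MARKER = "session recovery timed out"


def syslog_time(line, year):
    """Parse the 15-character syslog timestamp at the start of a line."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def recovery_gap_seconds(lines, year):
    """Seconds from the first ping timeout to the next recovery timeout.

    Returns None if the pair of messages is not found.
    """
    ping_at = None
    for line in lines:
        if ping_at is None and PING_MARKER in line:
            ping_at = syslog_time(line, year)
        elif ping_at is not None and RECOVERY_MARKER in line:
            return (syslog_time(line, year) - ping_at).total_seconds()
    return None


# Hypothetical log lines, written for this example only.
SAMPLE_LINES = [
    "Apr  9 03:10:05 host kernel:  connection1:0: ping timeout of 5 secs expired",
    "Apr  9 03:15:45 host kernel:  session1: session recovery timed out after 120 secs",
]

if __name__ == "__main__":
    print(recovery_gap_seconds(SAMPLE_LINES, 2010))
```

A gap much larger than the configured replacement timeout would match the behavior reported here.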

Comment 13 Ben Turner 2010-04-09 13:48:30 UTC
Created attachment 405551 [details]
Both patches in this BZ combined.

Comment 14 Ben Turner 2010-04-09 14:07:05 UTC
Created attachment 405556 [details]
Spec file, I just renamed my patch to test patch.

Comment 15 Mike Christie 2010-04-09 18:35:19 UTC
Created attachment 405617 [details]
updated failover fix

Patch sent to rh-kernel. It has one more fix.

Comment 21 Jiri Pirko 2010-04-14 10:15:23 UTC
Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and 5.4.z too?

Comment 22 Mike Christie 2010-04-14 17:47:57 UTC
(In reply to comment #21)
> Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and
> 5.4.z too?    

Yes, I think it would be useful.

Comment 23 Jiri Pirko 2010-04-15 08:39:39 UTC
Per comment #22, proposing this for 5.3.z and 5.4.z.

Comment 27 Jarod Wilson 2010-04-21 19:41:54 UTC
in kernel-2.6.18-197.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 29 Ben Marzinski 2010-08-02 14:45:23 UTC
*** Bug 606801 has been marked as a duplicate of this bug. ***

Comment 32 Barry Donahue 2010-12-01 19:56:45 UTC
Verified via test case in comment #9 on 5.6.Server-20101124.1.

Comment 34 errata-xmlrpc 2011-01-13 20:38:09 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html

