Bug 570681
Summary: | REGRESSION: Fix iscsi failover time
---|---|---|---
Product: | Red Hat Enterprise Linux 5 | Reporter: | Mike Christie <mchristi>
Component: | kernel | Assignee: | Mike Christie <mchristi>
Status: | CLOSED ERRATA | QA Contact: | Storage QE <storage-qe>
Severity: | urgent | Docs Contact: |
Priority: | urgent
Version: | 5.5 | CC: | andriusb, bdonahue, bturner, coughlan, dhoward, eric.wolfe, jpirko, jwest, kzhang, mkent, ogerlitz, pasik, slords, tanvi, tao
Target Milestone: | rc | Keywords: | ZStream
Target Release: | 5.6
Hardware: | All
OS: | Linux
Whiteboard: |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: |
: | 570682 | Environment: |
Last Closed: | 2011-01-13 20:38:09 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: |
Bug Depends On: |
Bug Blocks: | 557597, 570682, 580840, 583892, 583893, 583898, 583899
Description
Mike Christie
2010-03-05 01:47:52 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Oh yeah, this is a regression added in 5.3. In the older kernels we would fail IO right away, and that let us avoid possibly getting stuck in sk_stream_wait_memory. Now we fail IO only after the code chunk that handles the xmit thread and sk_stream_wait_memory.

(In reply to comment #0)
> Description of problem:
>
> The time it takes to detect a problem and get IO failed upwards should take nop
> timeout + nop interval + replacement_timeout seconds. It is taking a lot longer
> because when the problem occurs the xmit thread could be asleep in sendpage's
> sk_stream_wait_memory or it could have partially sent data then it could swing
> around and end up falling into sk_stream_wait_memory.

> Steps to Reproduce:
> 1.
> 2.
> 3.

Hi Mike, could you please give the steps to reproduce the bug? Thanks a lot.

How to replicate the bug:

1. Run an IO test that produces lots of writes. I use disktest with lots of large writes and lots of threads, or you can run lots of dds, or you can probably run a FS test.
2. While the IO is running, pull a cable, shut down the target, or kill the switch.
3. You should see the noop/ping timeout error within the timeouts set in the iscsi db (run iscsiadm -m node -T your_target to see them). But then, instead of seeing the replacement/recovery timeout message X seconds after that, it will take several minutes to finally see that error and the IO failed upwards.

Oh yeah, you would see the error messages I mention in /var/log/messages. And for #3 you can see the replacement/recovery timeout being used in the same "iscsiadm -m node -T target" command.
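To make the expected detection window from the description concrete, here is a minimal sketch that sums the three timeouts (nop timeout + nop interval + replacement_timeout) from text in the "name = value" form printed by iscsiadm -m node -T your_target. The function name expected_failover_window is hypothetical, the sample values in the here-document are illustrative rather than taken from this bug, and the sketch assumes a single connection (conn[0]) using the open-iscsi setting names node.conn[0].timeo.noop_out_interval, node.conn[0].timeo.noop_out_timeout and node.session.timeo.replacement_timeout.

```shell
#!/bin/sh
# Hypothetical helper: read "name = value" node settings on stdin and print
# nop interval + nop timeout + replacement_timeout, in seconds.
# Assumes one connection (conn[0]); settings that are missing count as 0.
expected_failover_window() {
    awk -F' = ' '
        $1 == "node.conn[0].timeo.noop_out_interval"   { interval = $2 }
        $1 == "node.conn[0].timeo.noop_out_timeout"    { timeout = $2 }
        $1 == "node.session.timeo.replacement_timeout" { replacement = $2 }
        END { print interval + timeout + replacement }
    '
}

# Illustrative values only; real ones come from the iscsi db.
expected_failover_window <<'EOF'
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.noop_out_interval = 5
node.conn[0].timeo.noop_out_timeout = 5
EOF
```

On a real initiator the input would come from piping the iscsiadm -m node -T your_target output into the function. If the replacement/recovery timeout message in /var/log/messages shows up minutes later than the printed number of seconds after the cable pull, that would match the regression described here.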
Created attachment 405551
Both patches in this BZ combined.
Created attachment 405556
Spec file, I just renamed my patch to test patch.
Created attachment 405617
Updated failover fix.
Patch sent to rh-kernel. It has one more fix.
Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and 5.4.z too?

(In reply to comment #21)
> Since the regression appeared in 5.3, shouldn't this fix go into 5.3.z and
> 5.4.z too?

Yes, I think it would be useful.

Per comment #22, proposing this for 5.3.z and 5.4.z.

in kernel-2.6.18-197.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

*** Bug 606801 has been marked as a duplicate of this bug. ***

Verified via test case in comment #9 on 5.6.Server-20101124.1.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html