Bug 155456

Summary: SG_IO with timeout value of UINT_MAX results in invalid timeout value passed to scsi_add_timer
Product: Red Hat Enterprise Linux 4 Reporter: Dave Wysochanski <davidw>
Component: kernelAssignee: Doug Ledford <dledford>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0CC: coughlan, davej, davidw
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-08-22 13:32:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
test program to reproduce the failure
none
Possible fix for SG-IO timeout issue
none
Test case reproducer none

Description Dave Wysochanski 2005-04-20 15:07:42 UTC
Description of problem:
According to the comment in /usr/include/scsi/sg.h describing the timeout field
in sg_io_hdr_t, a MAX_UINT value is supposed to indicate "no timeout" for a SCSI
passthru command submitted via the SG_IO ioctl:
  unsigned int timeout;       /* [i] MAX_UINT->no timeout (unit: millisec) */

In RHEL3 and RHEL2.1, this seemed to work correctly.

However, on RHEL4, it looks like the sg module timeout calculation was changed
so that a value of UINT_MAX entered in the 'timeout' field of sg_io_hdr_t
results in an invalid timeout, and the scsi eh thread being called immediately
for a command submitted in this fashion.  I've tested this on both i386 and
x86_64 arch's, and they both behave the same.


Version-Release number of selected component (if applicable):
I've tested both 2.6.9-5.ELsmp and 2.6.9-6.37.ELsmp, both contain the error.

How reproducible:
Every time.

Steps to Reproduce:
1. Enable scsi logging via /sys/module/scsi_mod/scsi_logging_level
2. Use the SG_IO ioctl to issue a scsi passthru command with MAX_UINT timeout (I
sometimes put the attached test program in a while loop to issue the command
over and over for easier reproducibility).
3. Observe that the trace indicates scsi_add_timer getting called with a -2
timeout value, and the scsi eh thread getting woken up immediately as a result
  
Actual results:
SCSI passthru command may complete or may get aborted, depending on who wins the
race.

Expected results:
SCSI passthru command should not get immediately aborted, but should be given
essentially infinite time to complete.  Also, scsi_add_timer should be called
with a valid value, not -2.

Additional info:
I'll attach a little test program you can use to reproduce this case easily.  I
originally found this when testing iscsi for rhel4 u2 on the rhel4 u1 beta
kernel, 2.6.9-6.37.ELsmp.  I've also found the failure on the rhel4 kernel
2.6.9-5.ELsmp.  Running the same test program on rhel3 or rhel2.1 runs fine, and
this invalid timeout calculation is not seen, nor does the scsi eh thread fire
immediately.

Just at a glance, in my rhel3 tree I'm looking at (2.4.21-25.EL), sg_new_write()
calls sg_ms_to_jif(), which is a fairly straightforward function.  In rhel4
though, it looks like sg_new_write() now calls a more generic function,
msecs_to_jiffies(), which is more complicated, and apparently doesn't handle
this MAX_UINT value correctly.

Comment 1 Dave Wysochanski 2005-04-20 15:07:43 UTC
Created attachment 113413 [details]
test program to reproduce the failure

Comment 2 Dave Wysochanski 2005-04-20 15:21:18 UTC
I take back the comment about i386.  It looks like this problem only occurs on
x86_64 from what I can tell.  i386 platform seems ok - at least on 2.6.9-5.ELsmp
kernel.

Comment 3 Dave Wysochanski 2005-04-20 16:31:23 UTC
Just to clarify the scope of this problem.  We've only been able to reproduce it
consistently on x86_64 with 2.6.9-6.37.ELsmp (rhel4 u1 beta).  It doesn't appear
that the original rhel4 kernel - 2.6.9-5.ELsmp is affected.

Comment 4 Dave Wysochanski 2005-04-20 16:46:12 UTC
One more datapoint.  On x86_64, the attached test program seems to run ok on
2.6.9-5.ELsmp.  However, we have another tool that still fails in a similar
fashion on 2.6.9-5.ELsmp x86_64.  This same tool passes on 2.6.9-5.ELsmp i386. 
So  it looks like the scope is limited to x86_64 arch, and is easiest to
reproduce on 2.6.9-6.37.ELsmp.

Comment 6 Dave Wysochanski 2005-05-05 20:55:14 UTC
Is someone chasing this down for inclusion in RHEL4 U2?

Thanks.

Comment 7 Doug Ledford 2005-06-10 16:23:42 UTC
It's one my ToDo list.

Comment 11 Doug Ledford 2005-07-19 18:11:00 UTC
OK, I think I may have found the problem.  It's a small change to possibly fix
it.  I'm building a test RPM right now (but that takes several hours), so if you
want to try the fix sooner I'm uploading the proposed patch.

Comment 12 Doug Ledford 2005-07-19 18:14:13 UTC
Created attachment 116940 [details]
Possible fix for SG-IO timeout issue

Explanation of the patch is embedded in the comments.

Comment 13 Doug Ledford 2005-07-19 22:29:45 UTC
Test kernel located at
http://people.redhat.com/dledford/kernel-smp-2.6.9-11.32.EL.dledford.x86_64.rpm


Comment 14 Doug Ledford 2005-07-21 01:22:36 UTC
I ran the reproducer on the test kernel for 1,000,000 iterations without a
failure.  I'm now booting back to 2.6.9-5.ELsmp in order to try the same thing
there and see if it reproduces on that kernel (I don't have the 2.6.9-6.37.ELsmp
kernel handy).

Comment 15 Doug Ledford 2005-07-21 07:01:26 UTC
Created attachment 117013 [details]
Test case reproducer

The reproducer, as posted, does not work (amongst other things it doesn't even
use the MAX_UINT define for the timeout and therefore doesn't even begin to
test the failing case).  I've modified the reproducer to actually test the
problem at hand.  Using my modified reproducer, an i386 test machine works
perfectly fine running the 2.6.9-6.13.ELdledfordsmp kernel.  On the other hand,
an x86_64 machine fails within usually less than 60 seconds with either the
2.6.9-5.ELsmp or 2.6.9-11.32.ELdledfordsmp kernel.  The test patch doesn't
work, but with the reproducer now working properly I hopefully will be able to
find the actual problem.

In order to narrow this down, can you please check all machines that exhibit
this problem and see if the dmesg output contains a line such as this:

Losing some ticks... checking if CPU frequency changed.
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts

I suspect that timer coherency problems may be related to this issue and
knowing if afflicted machines display this message while non-afflicted machines
don't would narrow that down.  Thanks for checking on this.

Comment 16 Dave Wysochanski 2005-07-22 02:11:59 UTC
Sorry about the snafu with the test program - I must have posted the version of
testing a potential workaround (use a large timeout but not MAX_UINT).

Not sure if I can test this tonight but I'll try.  I'll also try to dig up the
messages file from when I originally repro'd this problem to check for the
things you mention.

Comment 17 Dave Wysochanski 2005-07-22 11:14:39 UTC
Unfortunately, I no longer have the messages file from the machine that I was
hitting the failure on 3 months ago, so I can't check for those messages.

I ran some tests on a x86_64 2.6.9-11.15.ELsmp machine, but using the test 
program (yours or mine - the corrected one) I'm no longer able to repro the
problem.  Looks like I need to run one one of the original kernels I saw the
problem on, or create another test case.

We do have a piece of software in house that can repro the problem on this
latest kernel, but it's more complicated, and I can't just attach the source to
this bz.  When I modify this software to just use a large timeout value but not
MAX_UINT, I don't see the problem on this latest kernel.  So I'm pretty sure the
problem is still there but it's just no longer repoducible with the simple test
case.

At any rate, don't kill yourself for the freeze this week.  I have a workaround
I can use for now.  I'll try to find another test case to repro the problem on
the latest kernel and run on the original kernels where I first saw it.  Let me
know if you have a kernel you want me to test.



Comment 18 Doug Ledford 2005-07-25 18:55:37 UTC
I've forwarded this bug to Ingo Molnar for review.  I suspect the actual problem
is in the scalable timer code that he helped to write.  Specifically, any time
you add a timer, the maximum allowed offset from the current time is jiffies +
MAX_JIFFIE_OFFSET.  At the time the timer is initialized by the SCSI stack, we
check against jiffies to get jiffies + MAX_JIFFIE_OFFSET (it doesn't use no
timer as the docs say, it just uses the maximum allowed timer for MAX_UINT). 
However, the scalable timer code keeps a per CPU list of bases that it sorts
newly added timers into.  On each CPU, it keeps track of what jiffie it is up to
having serviced already.  So, if the CPU in question is lagging behind by a
couple of jiffies, the number jiffies + MAX_JIFFIE_OFFSET will actually be a
wrapped negative number relative to base_jiffies stored on the CPU.  My theory
is that when a CPU that is behind by a couple of ticks in servicing timers
(which is why I posted the messages I did about loosing timer ticks, my test
machine is displaying that at bootup and that could easily make the per CPU
base_jiffies variable get behind the real global jiffies variable) gets a new
timer set to the global jiffies + MAX_JIFFIES_OFFSET, it will erroneously place
that timer into a bucket that is getting ready to be ran basically immediately.
 This would only be a problem on machines that are either A) heavily loaded
enough that the timer servicing routines are running behind or B) on a machine
that's loosing lots of ticks and getting behind as a result (which is more
common on x86_64 due to things like not having proper IDE drivers on newer
chipsets and therefore running with interrupts masked during IDE operations,
which is the case on my test machine, where as i386 hardware is more likely to
have optimized, interrupt friendly drivers and less likely to exhibit this
problem).  I'm not entirely sure what would be required to fix this, and since
Ingo wrote the timer code in question, that's why I sent him an email requesting
a review of this bug.  I suspect that in the mod_timer() function we could check
to see if the timer offset being added needs fixing up with this test:

if (time_after(timer->expires, base_jiffies + MAX_JIFFIE_OFFSET) &&
    time_before_eq(timer->expires, jiffies + MAX_JIFFIE_OFFSET))
        timer->expires = base_jiffies + MAX_JIFFIE_OFFSET;

I'll try that on my test machine, which crashes rather quickly, and see if it
solves the problem (although I'm not sure exactly how the per CPU stuff is
handled in this code, so I may run into cross CPU issues that make the test fail
for other reasons, which is why I want Ingo to review this).

Comment 26 RHEL Program Management 2007-05-09 11:23:12 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 RHEL Program Management 2007-09-07 19:47:42 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.