Bug 155456
| Summary: | SG_IO with timeout value of UINT_MAX results in invalid timeout value passed to scsi_add_timer | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 4 | Reporter: | Dave Wysochanski <davidw> |
| Component: | kernel | Assignee: | Doug Ledford <dledford> |
| Status: | CLOSED WONTFIX | QA Contact: | Brian Brock <bbrock> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.0 | CC: | coughlan, davej, davidw |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2009-08-22 13:32:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | test program to reproduce the failure (attachment 113413), Possible fix for SG-IO timeout issue (attachment 116940), Test case reproducer (attachment 117013) | | |
Description
Dave Wysochanski
2005-04-20 15:07:42 UTC
Created attachment 113413 [details]
test program to reproduce the failure
I take back the comment about i386. It looks like this problem only occurs on x86_64 from what I can tell. The i386 platform seems OK, at least on the 2.6.9-5.ELsmp kernel.

Just to clarify the scope of this problem: we've only been able to reproduce it consistently on x86_64 with 2.6.9-6.37.ELsmp (RHEL4 U1 beta). It doesn't appear that the original RHEL4 kernel, 2.6.9-5.ELsmp, is affected.

One more datapoint. On x86_64, the attached test program seems to run OK on 2.6.9-5.ELsmp. However, we have another tool that still fails in a similar fashion on 2.6.9-5.ELsmp x86_64. This same tool passes on 2.6.9-5.ELsmp i386. So it looks like the scope is limited to the x86_64 arch, and it is easiest to reproduce on 2.6.9-6.37.ELsmp.

Is someone chasing this down for inclusion in RHEL4 U2? Thanks.

It's on my ToDo list.

OK, I think I may have found the problem. It's a small change that may fix it. I'm building a test RPM right now (but that takes several hours), so if you want to try the fix sooner, I'm uploading the proposed patch.

Created attachment 116940 [details]
Possible fix for SG-IO timeout issue
Explanation of the patch is embedded in the comments.
Test kernel located at http://people.redhat.com/dledford/kernel-smp-2.6.9-11.32.EL.dledford.x86_64.rpm

I ran the reproducer on the test kernel for 1,000,000 iterations without a failure. I'm now booting back to 2.6.9-5.ELsmp in order to try the same thing there and see if it reproduces on that kernel (I don't have the 2.6.9-6.37.ELsmp kernel handy).

Created attachment 117013 [details]
Test case reproducer
The reproducer, as posted, does not work (among other things, it doesn't even use the MAX_UINT define for the timeout and therefore doesn't even begin to test the failing case). I've modified the reproducer to actually test the problem at hand. Using my modified reproducer, an i386 test machine works perfectly fine running the 2.6.9-6.13.ELdledfordsmp kernel. On the other hand, an x86_64 machine usually fails in less than 60 seconds with either the 2.6.9-5.ELsmp or 2.6.9-11.32.ELdledfordsmp kernel. The test patch doesn't work, but with the reproducer now working properly I will hopefully be able to find the actual problem.
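For illustration only, here is a minimal sketch of the kind of reproducer described above (this is not the attached test program): it loops issuing a simple TEST UNIT READY through SG_IO with sg_io_hdr.timeout set to UINT_MAX. The device path /dev/sg0 and the iteration count are assumptions.

    /* Hypothetical reproducer sketch: repeatedly issue SG_IO with the
     * timeout field set to UINT_MAX, the case this bug is about. */
    #include <fcntl.h>
    #include <limits.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <scsi/sg.h>

    int main(void)
    {
        unsigned char cdb[6] = { 0x00, 0, 0, 0, 0, 0 };  /* TEST UNIT READY */
        unsigned char sense[32];
        int i;
        int fd = open("/dev/sg0", O_RDWR);   /* device path is an assumption */
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (i = 0; i < 1000000; i++) {
            struct sg_io_hdr io;
            memset(&io, 0, sizeof(io));
            io.interface_id = 'S';
            io.cmd_len = sizeof(cdb);
            io.cmdp = cdb;
            io.mx_sb_len = sizeof(sense);
            io.sbp = sense;
            io.dxfer_direction = SG_DXFER_NONE;
            io.timeout = UINT_MAX;            /* the problematic timeout value */
            if (ioctl(fd, SG_IO, &io) < 0) {
                perror("SG_IO");
                break;
            }
        }
        close(fd);
        return 0;
    }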
In order to narrow this down, can you please check all machines that exhibit this problem and see if the dmesg output contains lines such as these:

    Losing some ticks... checking if CPU frequency changed.
    warning: many lost ticks.
    Your time source seems to be instable or some driver is hogging interupts

I suspect that timer coherency problems may be related to this issue, and knowing whether afflicted machines display this message while non-afflicted machines don't would narrow that down. Thanks for checking on this.
Sorry about the snafu with the test program - I must have posted the version used to test a potential workaround (use a large timeout but not MAX_UINT). Not sure if I can test this tonight but I'll try. I'll also try to dig up the messages file from when I originally repro'd this problem to check for the things you mention.

Unfortunately, I no longer have the messages file from the machine where I was hitting the failure 3 months ago, so I can't check for those messages. I ran some tests on an x86_64 2.6.9-11.15.ELsmp machine, but using the test program (yours or mine - the corrected one) I'm no longer able to repro the problem. Looks like I need to run on one of the original kernels I saw the problem on, or create another test case. We do have a piece of software in house that can repro the problem on this latest kernel, but it's more complicated, and I can't just attach the source to this bz. When I modify this software to use a large timeout value but not MAX_UINT, I don't see the problem on this latest kernel. So I'm pretty sure the problem is still there but it's just no longer reproducible with the simple test case. At any rate, don't kill yourself for the freeze this week. I have a workaround I can use for now. I'll try to find another test case to repro the problem on the latest kernel and run on the original kernels where I first saw it. Let me know if you have a kernel you want me to test.

I've forwarded this bug to Ingo Molnar for review. I suspect the actual problem is in the scalable timer code that he helped to write. Specifically, any time you add a timer, the maximum allowed offset from the current time is jiffies + MAX_JIFFIE_OFFSET. At the time the timer is initialized by the SCSI stack, we check against jiffies to get jiffies + MAX_JIFFIE_OFFSET (it doesn't use "no timer" as the docs say; it just uses the maximum allowed timer for MAX_UINT). However, the scalable timer code keeps a per-CPU list of bases that it sorts newly added timers into. On each CPU, it keeps track of which jiffie it has serviced up to. So, if the CPU in question is lagging behind by a couple of jiffies, the number jiffies + MAX_JIFFIE_OFFSET will actually be a wrapped negative number relative to the base_jiffies stored on the CPU.

My theory is that when a CPU that is behind by a couple of ticks in servicing timers (which is why I posted the messages I did about losing timer ticks; my test machine is displaying that at bootup, and that could easily make the per-CPU base_jiffies variable get behind the real global jiffies variable) gets a new timer set to the global jiffies + MAX_JIFFIE_OFFSET, it will erroneously place that timer into a bucket that is getting ready to be run basically immediately. This would only be a problem on machines that are either A) heavily loaded enough that the timer servicing routines are running behind, or B) losing lots of ticks and getting behind as a result (which is more common on x86_64 due to things like not having proper IDE drivers on newer chipsets and therefore running with interrupts masked during IDE operations, which is the case on my test machine, whereas i386 hardware is more likely to have optimized, interrupt-friendly drivers and less likely to exhibit this problem). I'm not entirely sure what would be required to fix this, and since Ingo wrote the timer code in question, that's why I sent him an email requesting a review of this bug.
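To make the wraparound concrete, here is an illustrative user-space sketch (not kernel code): the time_after() helper mimics the kernel's signed-wraparound comparison, and the values for jiffies, base_jiffies, and MAX_JIFFIE_OFFSET are made up to mirror the discussion above.

    /* Illustration of the theory above: an expiry of jiffies + MAX_JIFFIE_OFFSET
     * lands beyond the maximum allowed offset relative to a per-CPU base that
     * has fallen a few ticks behind the global jiffies counter. */
    #include <stdio.h>

    /* assumed stand-in for the kernel's maximum jiffie offset */
    #define MAX_JIFFIE_OFFSET ((~0UL >> 1) - 1)

    /* true if a is after b, allowing for wraparound (same idea as time_after()) */
    static int time_after(unsigned long a, unsigned long b)
    {
        return (long)(b - a) < 0;
    }

    int main(void)
    {
        unsigned long jiffies      = 1000005UL;  /* global tick count          */
        unsigned long base_jiffies = 1000000UL;  /* this CPU is 5 ticks behind */
        unsigned long expires      = jiffies + MAX_JIFFIE_OFFSET;

        /* Relative to the lagging per-CPU base, the requested expiry is beyond
         * the maximum allowed offset, so an offset computed as
         * (expires - base_jiffies) wraps negative when treated as signed and
         * the timer is placed in a bucket that fires almost immediately. */
        printf("signed offset from base: %ld\n", (long)(expires - base_jiffies));
        printf("past the allowed max?    %s\n",
               time_after(expires, base_jiffies + MAX_JIFFIE_OFFSET) ? "yes" : "no");
        return 0;
    }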
I suspect that in the mod_timer() function we could check whether the timer offset being added needs fixing up with this test:

    if (time_after(timer->expires, base_jiffies + MAX_JIFFIE_OFFSET) &&
        time_before_eq(timer->expires, jiffies + MAX_JIFFIE_OFFSET))
            timer->expires = base_jiffies + MAX_JIFFIE_OFFSET;

I'll try that on my test machine, which crashes rather quickly, and see if it solves the problem (although I'm not sure exactly how the per-CPU stuff is handled in this code, so I may run into cross-CPU issues that make the test fail for other reasons, which is why I want Ingo to review this).

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.