Bug 122729
Summary: | Race condition in tty driver | | |
---|---|---|---
Product: | Red Hat Enterprise Linux 2.1 | Reporter: | Martin Wilck <martin.wilck>
Component: | kernel | Assignee: | Jason Baron <jbaron>
Status: | CLOSED DUPLICATE | QA Contact: |
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 2.1 | CC: | ernst-heinrich.klaas, knoel, raimondi, riel, tao
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2006-02-21 19:03:06 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Martin Wilck
2004-05-07 15:09:02 UTC
Created attachment 100078
Analysis of panic in n_tty_receive_buf()
Created attachment 100079
netdump crash log
We also have a netdump vmcore file, but the core dump is incomplete. We are not
sure why; probably an operator reset the machine during the dump.
n_tty_receive_buf was called from an interrupt handler. Unfortunately, I have no idea which interrupt triggered the problem. Perhaps someone at Red Hat has an idea? I have limited knowledge of the inner workings of the tty layer, and specifically of where the line discipline's function pointers may be called from.

Can you give a list of the modules loaded at the time, so that we can narrow down the code involved?

Created attachment 100121
Sysreport file from server
Hello, I have attached the sysreport file, which contains all this information.
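The open question above is where the line discipline's function pointers get called from. What follows is a minimal userspace model of one plausible path, not the RHEL 2.1 code: struct tty, struct ldisc_ops, sketch_n_tty_receive_buf(), flush_to_ldisc_sketch(), driver_rx_interrupt() and the low_latency flag are all hypothetical simplifications. It assumes, as we understand 2.4-era tty_flip_buffer_push() to behave, that a driver hands received bytes to the line discipline synchronously when a low-latency flag is set and defers the hand-off otherwise. Only the synchronous path runs the ldisc's receive_buf, and therefore the n_tty code, inside the interrupt handler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct tty;

/* Hypothetical ldisc method table: the driver reaches the line
 * discipline only through this function pointer, never by name. */
struct ldisc_ops {
    void (*receive_buf)(struct tty *tty, const unsigned char *cp, int count);
};

/* Hypothetical, much-simplified tty object (not the kernel's layout). */
struct tty {
    struct ldisc_ops ldisc;
    int low_latency;              /* assumed flag: flush synchronously */
    unsigned char flip[64];       /* bytes pending from the driver */
    int flip_count;
    unsigned char read_buf[256];  /* line discipline's input buffer */
    int read_cnt;
    int in_irq;                   /* 1 while the "interrupt handler" runs */
    int last_receive_in_irq;      /* context of the most recent receive */
};

/* Stand-in for n_tty_receive_buf(): appends bytes to read_buf and
 * records whether it was entered from the simulated interrupt. */
static void sketch_n_tty_receive_buf(struct tty *tty,
                                     const unsigned char *cp, int count)
{
    assert(tty->read_cnt + count <= (int)sizeof tty->read_buf);
    memcpy(tty->read_buf + tty->read_cnt, cp, (size_t)count);
    tty->read_cnt += count;
    tty->last_receive_in_irq = tty->in_irq;
}

/* Stand-in for the flush step: hands pending bytes to the ldisc
 * through the function pointer, in whatever context it is called. */
static void flush_to_ldisc_sketch(struct tty *tty)
{
    if (tty->flip_count > 0) {
        tty->ldisc.receive_buf(tty, tty->flip, tty->flip_count);
        tty->flip_count = 0;
    }
}

/* Hypothetical UART receive interrupt: buffers bytes, then either
 * flushes at once (ldisc code runs in IRQ context) or leaves the
 * flush for a later, deferred call. */
static void driver_rx_interrupt(struct tty *tty,
                                const unsigned char *data, int count)
{
    tty->in_irq = 1;
    assert(tty->flip_count + count <= (int)sizeof tty->flip);
    memcpy(tty->flip + tty->flip_count, data, (size_t)count);
    tty->flip_count += count;
    if (tty->low_latency)
        flush_to_ldisc_sketch(tty);
    tty->in_irq = 0;
}
```

With low_latency set, the bytes reach read_buf before driver_rx_interrupt() returns, so the receive function runs in interrupt context; with it cleared, they wait for a later flush_to_ldisc_sketch() call from outside the handler. The report does not say which driver or path was involved in this Oops.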
Well, I'm not sure they are much involved here, but the kernel is tainted. Can you replace bcm5700 with tg3, remove the ipmi module (that's for Intel bonding, I think?), and see if the problem recurs?

As I commented in comment #21 of bug 116738, we cannot easily exchange drivers and components because these are production systems, and the problems have not been reproduced in the lab. I am just asking you to look at the Oops I analyzed and tell me whether you think my analysis is correct, or if not, what I got wrong. There are no signs of the bcm5700 or ipmi drivers being involved in the Oops. Please take a serious look at what we did before rejecting it as tainted.

Furthermore, I am still waiting for an answer to my question (in issue tracker #38803) regarding the NMI watchdog: in Red Hat's experience, how high is the risk that the watchdog will "shoot down" a running production system?

As a temporary workaround, we have told the customer to switch off Hyper-Threading on the servers and thus run as UP. Since then, no further crashes have been reported, and the customer is currently content because HT/SMP does not benefit him much. However, this may change, so we would rather solve the problem than hope that the customer will always run UP; he may even buy SMP servers at some point. The fact that the problems have disappeared since we switched SMP off supports my suspicion that locking problems are involved. The tty layer seems a likely candidate because a) there are known locking problems there, b) the machines do a lot of serial I/O, and c) my analysis above points in this direction.

PS: The ipmi driver is part of our Servermanagement package. It talks to our BMC.

*** This bug has been marked as a duplicate of 131672 ***

Changed to 'CLOSED' state since 'RESOLVED' has been deprecated.
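The suspicion that the crashes are an SMP-only locking problem can be illustrated with a classic lost update. What follows is a deterministic userspace sketch, not kernel code: run_schedule() and struct cpu are hypothetical, and the shared counter merely stands in for something like the line discipline's read count (this does not claim that field is the one involved in the Oops). Two simulated CPUs each perform counter += delta as separate load, add and store steps under a fixed schedule; an optional lock, held from load through store, makes a CPU spin instead of loading while the other holds it.

```c
#include <assert.h>

/* Hypothetical model: one simulated CPU doing "counter += delta". */
struct cpu {
    int delta;  /* amount this CPU adds to the shared counter */
    int reg;    /* private copy of the counter (a CPU register) */
    int pc;     /* next step: 0 = load, 1 = add, 2 = store, 3 = done */
};

/* Runs two CPUs under a fixed schedule and returns the final counter.
 * schedule[i] is the CPU (0 or 1) that gets to step at tick i.
 * With use_lock set, a CPU takes the lock at its load and releases it
 * at its store; a CPU that wants to load while the other holds the
 * lock spins, i.e. its tick is wasted. */
static int run_schedule(int counter, int d0, int d1,
                        const int *schedule, int n, int use_lock)
{
    struct cpu c[2] = { { d0, 0, 0 }, { d1, 0, 0 } };
    int owner = -1;                         /* lock holder, -1 if free */

    for (int i = 0; i < n; i++) {
        int id = schedule[i];
        assert(id == 0 || id == 1);
        struct cpu *p = &c[id];

        if (p->pc == 3)
            continue;                       /* this CPU already finished */
        if (use_lock && p->pc == 0) {
            if (owner != -1 && owner != id)
                continue;                   /* spin: lock held by other CPU */
            owner = id;                     /* acquire */
        }
        switch (p->pc) {
        case 0:
            p->reg = counter;               /* load shared value */
            break;
        case 1:
            p->reg += p->delta;             /* modify private copy */
            break;
        case 2:
            counter = p->reg;               /* store back */
            if (use_lock)
                owner = -1;                 /* release */
            break;
        }
        p->pc++;
    }
    return counter;
}
```

Starting from 10 with deltas +5 and -3, the interleaved schedule {0, 1, 0, 1, 0, 1, 1, 1} ends at 7 without the lock, because CPU 1's store overwrites CPU 0's; with the lock, and with the unlocked serial schedule {0, 0, 0, 1, 1, 1}, it ends at 12. On a UP kernel, masking local interrupts around such an update is enough to keep the interrupt handler out, which corresponds to the serial schedule; on SMP or HT the handler can run on another CPU at the same time, so the interleaved schedule becomes possible unless a real lock is taken. That is consistent with, though not proof of, the observation that the crashes stopped once Hyper-Threading was switched off.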