Bug 352701 - system freezes noapic helps
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: i386 Linux
Priority: low, Severity: high
Assigned To: Brian Maly
QA Contact: Martin Jenner

Reported: 2007-10-25 12:41 EDT by John Sopko
Modified: 2007-12-13 20:29 EST
Doc Type: Bug Fix
Last Closed: 2007-12-13 20:29:54 EST

Attachments:
syslog messages (37.78 KB, text/plain), 2007-10-25 12:41 EDT, John Sopko
syslog messages noapic option (82.61 KB, text/plain), 2007-10-25 12:42 EDT, John Sopko
nvidia bug report output (141.79 KB, text/plain), 2007-10-25 12:43 EDT, John Sopko
nvidia bug report output noapic option (138.41 KB, text/plain), 2007-10-25 12:43 EDT, John Sopko
dmidecode command output (18.18 KB, text/plain), 2007-10-25 15:45 EDT, John Sopko

Description John Sopko 2007-10-25 12:41:58 EDT
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:

system freezes every 2.5 hours or so

Steps to Reproduce:
1. Log in to the gnome desktop.
2. Open a few windows (gnome terminals, firefox).
3. Leave the system idle for about 2.5 hours.

Actual results:

system freezes

Expected results:

system not to freeze

Additional info:

From the Dell sysreport command:

Libsmbios:    0.8.0
System ID:    0x01C3
Service Tag:  J16B4B1
Product Name: Dell DXG051
BIOS Version: A11
Vendor:       Dell Inc.

We have a dual processor Dell XPS 600 with an Nvidia 7900 GS graphics
card and the latest BIOS installed. It is running the latest rhel5
kernel, 2.6.18-8.1.15.el5, and patches. I ran extensive hardware diags
and the system seems fine.

The Dell XPS 600 locks up under Red Hat Enterprise Linux 4 and 5,
32 bit OS. The problem is very repeatable under rhel5. Some time ago,
when rhel4 was installed, the user started reporting that the system
locked up every few days or so. I had been trying to track down the
problem under rhel4 since August and then loaded rhel5 to see if that
would fix it.

The system ran fine under rhel4 for 6 months or so. I believe the problem
is similar to bug 245941:

https://bugzilla.redhat.com/show_bug.cgi?id=245941

The system always boots fine. Under rhel4 the system took several
days to become unresponsive and lock up, anywhere from 1 to 10 days.
Under rhel5 the system consistently locks up after about 2.5 hours.

I would prefer to get rhel5 working since we have been upgrading to it
for some time. I log in to the gnome desktop fine. I open several gnome
terminals and firefox and then walk away. After about 2.5 hours the gnome
clock on the desktop "stops". The keyboard gets flaky, I may or may not
be able to execute commands, I cannot ssh to the machine, and the system
soon locks up. When I am able to run the dmesg command while the system
is flaky, it does not give any clues.

I ran the Dell diags for many days, including letting the system board
diag run for 1.5 days. The system board diag checks the timers and
interrupt controllers. The diags never reported an error.

From what I have researched on the web, it feels like some sort of
interrupt issue. The only clue I can see is in the /var/log/messages
syslog, where the kernel complains about invalid IRQs at boot:

Oct 11 13:13:28 anderson kernel: PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
Oct 11 13:13:28 anderson kernel: pcie_portdrv_probe->Dev[007e:10de] has invalid IRQ. Check vendor BIOS
Oct 11 13:13:28 anderson kernel: pcie_portdrv_probe->Dev[007e:10de] has invalid IRQ. Check vendor BIOS
Oct 11 13:13:28 anderson kernel: pcie_portdrv_probe->Dev[005d:10de] has invalid IRQ. Check vendor BIOS

The syslog also reports the same message that is mentioned in bug 245941:

Oct 24 10:33:57 anderson kernel: ACPI: BIOS IRQ0 pin2 override ignored.
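
For anyone comparing the attached logs, the interrupt-related warnings
quoted above can be pulled out of a syslog file with a few lines of
Python. This is only a sketch: the function name find_irq_warnings and
the two patterns are assumptions chosen to match the messages in this
report, not a complete list of kernel IRQ diagnostics.

```python
import re

# Patterns copied from the messages quoted in this report (assumption:
# other kernel versions may word these warnings differently).
IRQ_PATTERNS = [
    re.compile(r"has invalid IRQ"),
    re.compile(r"ACPI: BIOS IRQ0 pin2 override ignored"),
]


def find_irq_warnings(lines):
    """Return the log lines that match any of the IRQ warning patterns.

    `lines` is any iterable of strings, for example an open log file.
    Trailing newlines are stripped from the returned lines.
    """
    return [
        line.rstrip("\n")
        for line in lines
        if any(pattern.search(line) for pattern in IRQ_PATTERNS)
    ]
```

Calling find_irq_warnings(open("messages.txt")) on the attached log
would list the matching lines, which makes it easier to compare a boot
with noapic against one without it.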

The only thing I think may be causing the problem is the Nvidia
driver/card. The system is running the latest Nvidia driver. We also
run the openafs kernel module.

I tried the "pci=routeirq" kernel parameter suggested in the log above
and it made things much worse: the system would hang at gnome login
while bringing up the gnome desktop.

We have another Dell Precision running rhel5 that was locking up within a
few minutes. I added the following kernel option, which is recommended
in Nvidia's README, and it fixed that particular system:

pci=nommconf

This does not help this XPS 600. The Nvidia README also suggests trying
the famous "noapic" kernel option. Running with noapic definitely makes
the rhel5 system stable. I have not released the system back to the user
yet. I do not know whether the "noapic" option is the proper "fix".
Is it ok to run the old programmable interrupt controller and disable
the new advanced controller?

I tried installing a different Nvidia card, a Quadro FX 550, and the
system still has the same problem.

Attached are the system /var/log/messages from a fresh boot with and
without the noapic kernel option, and log files created by Nvidia's
nvidia-bug-report.sh script. This Nvidia script captures some good info.

messages.txt
messages-noapic.txt
nvidia-bug-report.log
nvidia-bug-report-noapic.log

Comment 1 John Sopko 2007-10-25 12:41:58 EDT
Created attachment 237621
syslog messages
Comment 2 John Sopko 2007-10-25 12:42:50 EDT
Created attachment 237631
syslog messages noapic option
Comment 3 John Sopko 2007-10-25 12:43:17 EDT
Created attachment 237641
nvidia bug report output
Comment 4 John Sopko 2007-10-25 12:43:37 EDT
Created attachment 237651
nvidia bug report output noapic option
Comment 5 John Sopko 2007-10-25 15:45:03 EDT
Created attachment 237921
dmidecode command output
Comment 6 John Sopko 2007-11-10 08:52:00 EST
Upgraded to rhel 5.1:

# uname -a
Linux anderson.cs.unc.edu 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:02 EDT 2007 i686 i686 i386 GNU/Linux

Removed the noapic option; now running with these kernel options:

# cat /proc/cmdline 
ro root=LABEL=/ rhgb quiet pci=nommconf
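
To confirm which of the workaround options discussed in this bug are
active on a given boot, the command line can be split into names and
values. The sketch below is an illustration only; parse_cmdline and
has_option are hypothetical helper names, and quoted values containing
spaces are not handled.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'noapic') map to None. Only the first '='
    separates name from value, so 'root=LABEL=/' keeps 'LABEL=/' as its
    value. If a name is repeated, the last occurrence wins.
    """
    options = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        options[name] = value if sep else None
    return options


def has_option(options, name, value=None):
    """True if `name` is present and, when `value` is given, equal to it."""
    if name not in options:
        return False
    return value is None or options[name] == value
```

On the system above, has_option(parse_cmdline(open("/proc/cmdline").read()), "noapic")
would be expected to return False, since noapic was removed and only
pci=nommconf remains.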


The system has been up for:

08:46:48 up 21:05,  5 users,  load average: 0.04, 0.02, 0.00

I also notice that the "invalid IRQ. Check vendor BIOS" messages
described above are no longer in the syslog. If the system
stays up until Monday I will release it to the user and see if the
upgrade fixes the problem.
Comment 7 John Sopko 2007-11-13 08:16:25 EST
The system was up for 2 days 21 hours. I placed it on the user's desktop.
I would like to see if it runs without locking up for 14 days. The
longest it ran under rhel4 was 12 days.
As described above, it ran for about 2.5 hours under rhel5 before
update 1. It is looking good and the new kernel probably fixed the problem.
Comment 8 Prarit Bhargava 2007-11-13 09:10:14 EST
John, could you update this bugzilla in a few more days and give us final
confirmation that the latest kernel has resolved your problem?

I'll put this BZ in "NEEDINFO" from you until then :)

Thanks,

P.
Comment 9 John Sopko 2007-11-19 12:54:44 EST
Looking good, the system has been up for:

% uptime
 12:53:15 up 7 days,  3:47,  4 users,  load average: 0.02, 0.01, 0.00

I will close Monday November 26th after it has been up for 14 days.
Comment 10 John Sopko 2007-11-27 07:48:40 EST
One of the dozens of kernel fixes in the update resolved this problem.
You can close this issue. Thanks!

% uptime
 07:45:59 up 14 days, 22:40,  1 user,  load average: 0.00, 0.00, 0.00
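
The 14 day target used in this thread can also be checked from the
uptime output itself. The following is a rough sketch, assuming only the
two output forms that appear in this report ("up 21:05," and
"up 7 days,  3:47,"); parse_uptime and reached_target are hypothetical
names, and other forms that uptime can print are not covered.

```python
import re
from datetime import timedelta

# Matches "up 21:05," and "up 7 days,  3:47," as seen in this report.
# Forms such as minutes-only uptimes are deliberately not handled.
UPTIME_RE = re.compile(r"up\s+(?:(\d+)\s+days?,\s+)?(\d+):(\d+),")


def parse_uptime(line):
    """Convert one line of uptime output into a timedelta."""
    match = UPTIME_RE.search(line)
    if match is None:
        raise ValueError("unrecognized uptime output: %r" % line)
    days, hours, minutes = match.groups()
    return timedelta(days=int(days or 0), hours=int(hours), minutes=int(minutes))


def reached_target(line, target_days=14):
    """True once the reported uptime is at least `target_days` days."""
    return parse_uptime(line) >= timedelta(days=target_days)
```

Applied to the output above, reached_target returns True, while the
7 day reading from Comment 9 does not yet meet the target.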

Comment 11 Brian Maly 2007-12-13 20:29:54 EST
Closing issue as per Comment #10
