Bug 245941 - Hard locks after upgrading from 2.6.9-42.0.10.ELsmp to kernel 2.6.9-55.0.2.ELsmp
Summary: Hard locks after upgrading from 2.6.9-42.0.10.ELsmp to kernel 2.6.9-55.0.2.ELsmp
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: i686
OS: Linux
low
high
Target Milestone: ---
: ---
Assignee: Stanislaw Gruszka
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2007-06-27 14:58 UTC by Dave Botsch
Modified: 2018-10-20 02:09 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-22 09:19:55 UTC
Target Upstream Version:
Embargoed:


Attachments
output of lspci -v (13.31 KB, text/plain)
2007-09-28 19:18 UTC, Dave Botsch
/var/log/messages - only shows successful boots (729.05 KB, text/plain)
2007-09-28 19:27 UTC, Dave Botsch
/proc/cpuinfo (2.06 KB, text/plain)
2007-09-28 19:27 UTC, Dave Botsch
lspci -v output (8.76 KB, text/plain)
2008-02-21 03:45 UTC, Dave Botsch

Description Dave Botsch 2007-06-27 14:58:10 UTC
Description of problem:
After upgrading to kernel 2.6.9-55.0.2.ELsmp from 2.6.9-42.0.10.ELsmp, system
hard locks, sometimes during boot as early as the INIT line and sometimes after
being up and running for a while. Downgrading back to 2.6.9-42.0.10.ELsmp has
fixed the issue.

Version-Release number of selected component (if applicable):
someplace between 2.6.9-42.0.10.ELsmp and 2.6.9-55.0.2.ELsmp

How reproducible:
I cannot predict when the lock will happen, but with the 2.6.9-55.0.2.ELsmp
kernel it has occurred on every boot.

Steps to Reproduce:
1. Upgrade kernel
2. Reboot
  
Actual results:
Computer hard locks. Display shows whatever was last displayed. Computer no
longer pings. Keyboard no longer functions (even the LED keys do not light their
LEDs). Have to press reset button on computer.

Expected results:
The system is stable as it is with 2.6.9-42.0.10.ELsmp

Additional info:
Computer has dual dual-core Opteron 2216s (socket F).
4 GB of RAM
forcedeth network driver
Tyan S2915a2nrf motherboard

As mentioned above, the locks happen at different points during different boots.
I've seen them as early as printing out the INIT version line, and then the
computer locking. I've seen them happen during the starting of the HAL daemon.
I've seen them happen while the computer is running. Nothing is left in any of
the log files.

32-bit version of rhel4 with all other updates applied.

Comment 1 Jason Baron 2007-06-29 15:53:48 UTC
ok, i'd like to figure out what patch broke this...via a binary search of the
kernels b/w 42.0.10 and 55.0.2. I've posted the following 4 kernels to
http://people.redhat.com/~jbaron/bz245941/

kernel-smp-2.6.9-42.12.EL.i686.rpm  
kernel-smp-2.6.9-42.25.EL.i686.rpm
kernel-smp-2.6.9-42.37.EL.i686.rpm
kernel-smp-2.6.9-55.EL.i686.rpm


Please start with 42.25, if that works then go to 42.37, else try 42.12. After
you post results I will post more kernels to narrow this down. thanks.
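
The search order described above can be sketched as a small bisection routine. This is only an illustration: the helper name `first_bad`, the `is_good` callback, and the `BUILDS` list (including the assumption that 42.0.10 sorts before 42.12 and that the two endpoints are known good and known bad) are hypothetical and not part of the report.

```python
from typing import Callable, Sequence

# Assumed release order: known-good endpoint first, known-bad endpoint last.
BUILDS = ["42.0.10", "42.12", "42.25", "42.37", "55", "55.0.2"]


def first_bad(builds: Sequence[str], is_good: Callable[[str], bool]) -> str:
    """Return the earliest build for which is_good() is False.

    Assumes builds[0] is good, builds[-1] is bad, and that every build
    after the first bad one is also bad.
    """
    lo, hi = 0, len(builds) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(builds[mid]):
            lo = mid  # regression is later than mid
        else:
            hi = mid  # regression is at mid or earlier
    return builds[hi]
```

With this list the first probe is 42.25; a good result moves the search to 42.37, and a bad result moves it to 42.12, matching the order requested above. In practice `is_good` would be a manual step (boot the kernel and watch for a lock), not a function call.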

Comment 2 Dave Botsch 2007-06-30 04:48:12 UTC
Ok. It is a production server, so, I'll have to schedule time to do this (and
then leave it up and running for a while to see if it stays up).

Also, will need kernel-smp-devel packages for each of those kernels to build a
couple of extra kernel modules needed to boot the system completely.

thanks.

Comment 3 Dave Botsch 2007-07-09 16:14:35 UTC
Hi. Still need appropriate kernel-devel packages for the below kernels. thnx.

(In reply to comment #1)
> ok, i'd like to figure out what patch broke this...via a binary search of the
> kernels b/w 42.0.10 and 55.0.2. I've posted the following 4 kernels to
> http://people.redhat.com/~jbaron/bz245941/
> 
> kernel-smp-2.6.9-42.12.EL.i686.rpm  
> kernel-smp-2.6.9-42.25.EL.i686.rpm
> kernel-smp-2.6.9-42.37.EL.i686.rpm
> kernel-smp-2.6.9-55.EL.i686.rpm
> 
> 
> Please start with 42.25, if that works then go to 42.37, else try 42.12. After
> you post results I will post more kernels to narrow this down. thanks.



Comment 4 Jason Baron 2007-08-06 20:40:19 UTC
ok, i've updated that link with kernel-devel pkgs. thanks.

Comment 5 Jason Baron 2007-08-30 13:42:29 UTC
any test results yet here? thanks.

Comment 6 Dave Botsch 2007-09-06 15:29:20 UTC
This week and next week are my scheduled testing weeks :)

Just done w. the first kernel... 2.6.9-42.25.ELsmp ... seems to be ok. Moving on
and up to the next one.

Comment 7 Dave Botsch 2007-09-07 19:50:17 UTC
2.6.9-42.37.ELsmp appears to be ok. Will move up to the next one on Monday.

Comment 8 Dave Botsch 2007-09-11 15:48:02 UTC
With kernel-smp-2.6.9-55.EL.i686.rpm, the system hard locked after being up for
about 5 minutes. So, kernel-smp-2.6.9-55.EL.i686.rpm does not work, while
2.6.9-42.37.ELsmp works just fine.

Comment 9 Jason Baron 2007-09-25 16:02:29 UTC
ok, thanks for the update. I've put all the kernels b/w 42.37 and 55 at:
http://people.redhat.com/~jbaron/bz245941/

Please binary search these. thanks.

Comment 10 Jason Baron 2007-09-25 16:04:18 UTC
Also just to be clear, 42.38, 42.39, and 42.40, followed by 43, 44....is the
sequential ordering.
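
For anyone scripting the search, the ordering above can be reproduced by comparing the release components numerically. A minimal sketch, assuming package names of the form shown in this report; the function name `release_key` and the regular expression are illustrative choices, not an existing tool.

```python
import re
from typing import Tuple


def release_key(name: str) -> Tuple[int, ...]:
    """Map e.g. 'kernel-smp-2.6.9-42.40.EL.i686.rpm' to (42, 40).

    Tuples compare element by element, so (42, 40) sorts before (43,).
    """
    match = re.search(r"2\.6\.9-([\d.]+)\.EL", name)
    if match is None:
        raise ValueError("unrecognized kernel name: %r" % name)
    return tuple(int(part) for part in match.group(1).split("."))
```

Sorting a list of package names with `sorted(names, key=release_key)` then yields 42.38, 42.39, 42.40, 43, 44 in that order.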

Comment 11 Dave Botsch 2007-09-27 17:39:21 UTC
kernel-smp-2.6.9-47.EL is broke. Moving on to the next kernel to test.

Comment 12 Dave Botsch 2007-09-27 18:23:02 UTC
kernel-smp-2.6.9-43.EL fails. Moving on.

Comment 13 Dave Botsch 2007-09-27 20:14:43 UTC
2.6.9-42.39.ELsmp seems to be running ok. Which leaves just
kernel-smp-2.6.9-42.40.EL.i686.rpm to try. Hopefully tomorrow.

Comment 14 Jason Baron 2007-09-28 17:13:12 UTC
ok, if you look at the changelog for -43 there was only 1 minor change. Thus, if
42.39 is ok, the problematic change was very likely introduced in 42.40.
Unfortunately, 42.40 has a large number of changes...i'm posting it below in
case anything stands out. A further description of the h/w might help us guess.
Is there anything in logs at all? 

thanks.

* Tue Jan 9 2007 Jason Baron <jbaron> [2.6.9-42.40]
-add PCIe power management quirk (Geoff Gustafson) [197009]
-MCE Thresholding support for family 0x10 AMD processors (Bhavana Nagendra) [196894]
-i386: acpi_skip_timer_override for NVIDIA (Brian Maly) [193937 200412]
-fix aic79xx module removal path (David Milburn) [212388]
-fix stack overflow due to recursion between scsi_request_fn and
blk_requeue_request (Neil Horman) [202848]
-Fix JBD race in t_forget list handling (Eric Sandeen) [176738]
-Prevent userspace EOVERFLOW errors from increasing i_ino value (Jeff Layton)
[213652 215549]
-minimal changes to allow Lustre client support (Eric Sandeen) [196637]
-Efficient translation of intel binaries to ppc (Janice Girouard) [196794]
-Update MPTSAS driver to v3.02.73rh (Konrad Rzeszutek) [180936 196789 193760 196782]
-Support the LSI MegaRAID driver (Konrad Rzeszutek) [196790 193056]
-avoid ESB2 IDE interrupt storm (Kimball Murray) [218498]
-new ppc Host Ethernet Adapter Device Driver (Janice Girouard) [196774]
-USB equipment cannot be used on certain Toshiba systems (Brian Maly) [170134]
-Support for p-state transitions on Intel and support for ACPI 3.0 _PSD method
(Brian Maly) [196798]
-Backport the 'IBM eBus patch' from kernel 2.6.16 and 2.6.17 (Janice Girouard)
[196799]
-ext3: fix READA failures cause "directory hole" (Chip Coldwell, Stephen
Tweedie) [213921]
-Audit: 196233: correct/update audit filter rule comparison
(Eric Paris) [196233]
-fix bre channel cable-pull bug (Chip Coldwell) [163818]
-Fix page_is_ram on x86_64 to always return 1 (Chris Lalancette) [221273]
-proc: fix readdir race fix (Nobuhiro Tachino) [212631]
-Add netdump support to 8139cp driver (Chris Lalancette) [217932]
-powernow: remove __initdata from tscsync (Prarit Bhargava) [221975]
-fix incorrect cache_decay_ticks calculation on x86_64 and i386 (Jason Baron)
[192395]
-aio: fix kernel panic in aio_free_ring (Jeff Moyer) [220971] {CVE-2006-5754}
-Fix for missing headers in kernel-xenU-devel (Don Dutile) [219538]
-fix ext2_check_page denial of service (Eric Sandeen) [217021] {CVE-2006-6054}
-PPC64: Return correct value from request_irq() (David Howells) [172436]
-dm: stalls on resume if noflush is used (Milan Broz) [221386]
-NFSv4 client doesn't return a delegation before removing a file (Steve Dickson)
[155929]
-fix listxattr syscall can corrupt user space programs (Eric Sandeen)
[220677] {CVE-2006-5753}


Comment 15 Jason Baron 2007-09-28 17:33:40 UTC
Also what are the extra kernel modules that you are using? Perhaps the problem
lies there...

Comment 16 Jason Baron 2007-09-28 17:34:58 UTC
Also it would be interesting to see if the problem still exists with the current
U6 beta candidate...found at: http://people.redhat.com/~jbaron/rhel4/

Comment 17 Dave Botsch 2007-09-28 19:02:58 UTC
42.40 gave me a lock during the boot process.

With respect to kernel modules, the two custom modules I am using are:

1. openafs
2. forcedeth from nvidia since the forcedeth in the kernel does not support
gigabit (it sorta thinks it does, so, you end up with a network that doesn't work)

That said, some of the locks occurred when booting into single user mode so that
I could compile the gigabit enabled forcedeth. So, the forcedeth from the redhat
kernel was used and the openafs module was not loaded as oafs doesn't start in
single user mode. This would seem to rule out either of the two kernel modules
causing issues.

Comment 18 Dave Botsch 2007-09-28 19:13:31 UTC
One difference I've just noted in the startup.

With kernel -39 and earlier, it would appear that the following is printed out:

Sep 11 11:41:58 smoke kernel: Total of 4 processors activated (19202.47 BogoMIPS).
Sep 11 11:41:58 smoke kernel: ENABLING IO-APIC IRQs
Sep 11 11:41:58 smoke kernel: ..TIMER: vector=0x31 pin1=2 pin2=-1
Sep 11 11:41:58 smoke kernel: ..MP-BIOS bug: 8254 timer not connected to IO-APIC
Sep 11 11:41:58 smoke kernel: ...trying to set up timer (IRQ0) through the 8259A ...  failed.
Sep 11 11:41:58 smoke kernel: ...trying to set up timer as Virtual Wire IRQ... failed.
Sep 11 11:41:58 smoke kernel: ...trying to set up timer as ExtINT IRQ... works.
Sep 11 11:41:58 smoke kernel: checking TSC synchronization across 4 CPUs: 
Sep 11 11:41:58 smoke kernel: CPU#0 had 1344057 usecs TSC skew, fixed it up.
Sep 11 11:41:58 smoke kernel: CPU#1 had 0 usecs TSC skew, fixed it up.
Sep 11 11:41:58 smoke kernel: CPU#2 had 0 usecs TSC skew, fixed it up.
Sep 11 11:41:58 smoke kernel: CPU#3 had 0 usecs TSC skew, fixed it up.
Sep 11 11:41:58 smoke mdmonitor: mdadm startup succeeded
Sep 11 11:41:58 smoke kernel: Brought up 4 CPUs
Sep 11 11:41:58 smoke kernel: zapping low mappings.
Sep 11 11:41:58 smoke kernel: checking if image is initramfs... it is
Sep 11 11:41:58 smoke kernel: Freeing initrd memory: 760k freed
Sep 11 11:41:58 smoke kernel: NET: Registered protocol family 16
Sep 11 11:41:58 smoke kernel: PCI: BIOS BUG #81[00000282] found

Starting with -40, the output differs:
Sep 11 11:26:21 smoke kernel: Total of 4 processors activated (19202.51 BogoMIPS).
Sep 11 11:26:21 smoke kernel: ENABLING IO-APIC IRQs
Sep 11 11:26:21 smoke kernel: ..TIMER: vector=0x31 pin1=0 pin2=-1
Sep 11 11:26:21 smoke kernel: checking TSC synchronization across 4 CPUs: 
Sep 11 11:26:21 smoke kernel: CPU#0 had 1344058 usecs TSC skew, fixed it up.
Sep 11 11:26:21 smoke kernel: CPU#1 had 0 usecs TSC skew, fixed it up.
Sep 11 11:26:21 smoke kernel: CPU#2 had 0 usecs TSC skew, fixed it up.
Sep 11 11:26:21 smoke kernel: CPU#3 had 0 usecs TSC skew, fixed it up.
Sep 11 11:26:21 smoke kernel: Brought up 4 CPUs
Sep 11 11:26:21 smoke kernel: zapping low mappings.
Sep 11 11:26:21 smoke kernel: checking if image is initramfs... it is
Sep 11 11:26:21 smoke kernel: Freeing initrd memory: 714k freed
Sep 11 11:26:21 smoke kernel: NET: Registered protocol family 16
Sep 11 11:26:21 smoke kernel: PCI: BIOS BUG #81[00000282] found
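
A comparison like the one above can be automated by stripping the syslog prefix (date, time, host) from each line and diffing what remains. The sketch below uses Python's standard `difflib`; the function names and the prefix pattern are assumptions made for illustration, and the pattern only handles the timestamp layout seen in these excerpts.

```python
import difflib
import re
from typing import Iterable, List, Tuple

# Matches a prefix such as "Sep 11 11:41:58 smoke " (month, day, time, host).
SYSLOG_PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+")


def strip_prefix(line: str) -> str:
    """Drop the timestamp and hostname so two boots can be compared."""
    return SYSLOG_PREFIX.sub("", line.rstrip("\n"))


def boot_diff(good_lines: Iterable[str],
              bad_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (lines only in the good boot, lines only in the bad boot)."""
    good = [strip_prefix(line) for line in good_lines]
    bad = [strip_prefix(line) for line in bad_lines]
    only_good: List[str] = []
    only_bad: List[str] = []
    matcher = difflib.SequenceMatcher(a=good, b=bad, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            only_good.extend(good[i1:i2])
            only_bad.extend(bad[j1:j2])
    return only_good, only_bad
```

Values that legitimately vary between boots (BogoMIPS, TSC skew, initrd size) will also show up as differences, so the result still needs a human eye to pick out the meaningful lines, such as the `pin1=` value and the MP-BIOS timer messages.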


Comment 19 Dave Botsch 2007-09-28 19:18:02 UTC
Created attachment 210831 [details]
output of lspci -v

Comment 20 Dave Botsch 2007-09-28 19:27:26 UTC
Created attachment 210841 [details]
/var/log/messages - only shows successful boots

Comment 21 Dave Botsch 2007-09-28 19:27:44 UTC
Created attachment 210851 [details]
/proc/cpuinfo

Comment 22 Jason Baron 2007-09-28 20:38:39 UTC
ok, thanks for the info. Looks to me that this is likely caused by:

-i386: acpi_skip_timer_override for NVIDIA (Brian Maly) [193937 200412]

We can confirm the theory by backing that patch out...but i believe that if you
pass 'noapic' with a problematic kernel you may be able to workaround the
problem. Also, there should be a message like: "BIOS IRQ0 pin2 override
ignored." in the problematic boot case. Do you see that?

I will likely build a kernel without that patch on Monday to confirm my suspicions. 
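
Checking a saved boot log for the message mentioned above is a simple substring search. A minimal sketch: the two marker strings are taken from this report, while the function name `classify_boot` and the labels it returns are hypothetical.

```python
# Marker strings quoted in this report.
OVERRIDE_IGNORED = "BIOS IRQ0 pin2 override ignored"
TIMER_FALLBACK = "8254 timer not connected to IO-APIC"


def classify_boot(log_text: str) -> str:
    """Label a boot log by which timer-related message it contains.

    The labels are illustrative only; they are not kernel terminology.
    """
    if OVERRIDE_IGNORED in log_text:
        return "override-ignored"  # the timer override was skipped
    if TIMER_FALLBACK in log_text:
        return "timer-fallback"    # the kernel fell back to another timer setup
    return "unknown"
```

Running this over each boot section of /var/log/messages would show whether the "override ignored" line lines up with the kernels that lock.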

Comment 23 Dave Botsch 2007-09-29 04:46:01 UTC
I only have one /var/log/messages with a recorded boot of a kernel that later
locked. Version 2.6.9-55

And, yes, that boot does have the line you mention.

If you missed it, note that I also attached my lspci, cpuinfo, and the
/var/log/messages showing the recent good kernel boots.

Comment 24 Jason Baron 2007-10-01 19:12:35 UTC
ok, i've placed a 60.1.EL.apic.1 kernel at:
http://people.redhat.com/~jbaron/bz245941/

This kernel is our latest beta kernel built without the apic patch which is
causing this problem. If you can confirm that this kernel is fine, we'll be sure
that we have found the source of this regression.

thanks.

Comment 25 Dave Botsch 2007-10-02 15:36:59 UTC
With the new kernel, the message "MP-BIOS bug: 8254 timer not connected to
IO-APIC" is now again in the output. Running the new kernel, now, to see if it
holds and does not lock up.

Comment 26 Dave Botsch 2007-10-02 21:02:58 UTC
At this point, it's been up for 6 hours running on 2.6.9-60.1.EL.apic.1smp --
since that is longer than it ever stayed up when the problem existed (usually,
never finishing booting or crashing within 15 minutes), I'm going to call the
problem fixed in this kernel.

Comment 27 Jason Baron 2007-10-03 14:15:20 UTC
ok, thanks for testing this...do you have all the latest BIOS updates installed on
this box?

Comment 28 Dave Botsch 2007-10-03 15:40:04 UTC
My understanding is that there is a newer bios, which I have not yet installed.

Comment 29 Jason Baron 2007-10-03 18:50:29 UTC
well it might be worth trying the latest update before we consider changing this
patch. thanks. 

Comment 30 Dave Botsch 2007-10-04 15:40:10 UTC
Ok. Certainly worth a shot, though, it doesn't look promising. About the only
thing in the changelog that comes close is:
Performance tune recommandation from AMD/nVidia
Updated AMD Opteron CPU support

Comment 31 Dave Botsch 2007-10-08 15:42:01 UTC
Just tested with the bios update. Still locks. In fact, the hardware became
unhappy enough after one boot that I had to remove power completely before it
would come back.

Of note, on my first attempt, I happened to try 2.6.9-47 and for the first time
actually got a Kernel panic error message:

MP-BIOS Bug: 8254 timer not connected to IO-APIC (why this is there under -47 is
interesting since it wasn't there with most of the "later" kernels)
Kernel panic - not syncing: IO-APIC + timer doesn't work! Boot with apic=debug
and send a report. Then try booting with the 'noapic' option.

So, this would seem to further confirm the behavior we are seeing with the patch
removed.

If I did not mention it before, the motherboard is a Tyan S2915A2NRF 

Comment 32 Brian Maly 2008-02-15 14:31:05 UTC
This may be related to Bug #432405

Can we get an 'lspci' output to determine which chipset is being used?

Comment 33 Dave Botsch 2008-02-21 03:45:28 UTC
Created attachment 295479 [details]
lspci -v output

Here's the lspci -v output for the affected machine.

thanks.

Comment 34 Dave Botsch 2008-05-01 21:32:33 UTC
Hello? It's been over two months... what's going on with this bug?

thanks.

Comment 35 Brian Maly 2008-09-25 06:51:17 UTC
Re: Comment #26,

This problem is resolved in the newest kernel? Or this is still a problem in the newest kernel? Just trying to clarify.

Comment 36 Dave Botsch 2008-09-26 03:09:15 UTC
Which patch do you think should fix it?

I have not been able to test the newest kernel yet... I am working on moving user data to an alternate server so that I can schedule downtime to test.

thanks!

Comment 37 Brian Maly 2008-09-26 03:29:25 UTC
I was trying to clarify the mention in Comment #26 that the problem is fixed in the 2.6.9-60.1.EL.apic.1smp (and later) kernels. I read this to mean the problem is now fixed, but I'm trying to verify this is the case.

We could try testing the latest kernel to be sure if needed, but we probably don't need to do that unless there is still uncertainty as to whether this is fixed.

Comment 38 Dave Botsch 2008-09-26 11:39:40 UTC
Nothing later than the specially modified 2.6.9-60.1.EL.apic.1smp has been tried. In fact, the particular machine is still running 2.6.9-60.1.EL.apic.1smp (since I hadn't heard anything back after verifying that kernel and doing the bios update test).

So, it's unclear if some patch or removal of a patch later is supposed to fix/work around this issue.

Comment 40 Brian Maly 2008-12-15 22:50:51 UTC
Could the reporter test the newest kernel? The patch that caused this regression had been reworked in a later kernel revision. The new code now checks to see if hpet is enabled first, otherwise any timer override is invalid and not used.
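
As a reading aid only, the behaviour described above can be written as a single boolean decision. This is a paraphrase of the comment, not the kernel source (which is C); the function and parameter names are hypothetical, and restricting the quirk to NVIDIA chipsets is an assumption drawn from the changelog entry title quoted earlier in this report.

```python
def skip_timer_override(nvidia_chipset: bool, hpet_enabled: bool) -> bool:
    """Return True when the BIOS timer override should be ignored.

    Per the description: the HPET state is checked first; without HPET
    the timer override is treated as invalid and is not used.
    """
    return nvidia_chipset and not hpet_enabled
```

Under this reading, a board with HPET enabled keeps the BIOS-provided timer override, while an NVIDIA board without it has the override ignored.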

Comment 41 Issue Tracker 2008-12-17 21:08:57 UTC
The customer has tested kernel-2.6.9-60.1.EL.apic.1smp which was provided
to him earlier and it is working fine for him. Now his server is in
production and he can not test anymore even with the latest available
kernel for RHEL 4. The customer wants to know which RHEL 4 kernel has
fixed his issue so that he can update to that kernel. He wants definitive
answer for this as his server is in production and running a test kernel
provided earlier.


This event sent from IssueTracker by streeter 
 issue 248090

Comment 42 Brian Maly 2008-12-17 21:24:29 UTC
I'm trying to establish that this was fixed but can't do so without having someone test this on the affected hardware. So we are stuck unless someone can test this or we can get access to affected hardware. I would suggest trying a 4.7 kernel as far as testing fixes is concerned.

Comment 43 Dave Botsch 2008-12-23 18:38:49 UTC
Presumably 2.6.9-78.0.8.ELsmp would be sufficient to test with?

I can try and schedule sometime in January or February of 2009 to test this if that kernel has the above mentioned reworked patch.

Comment 44 Brian Maly 2008-12-24 17:11:46 UTC
Yes, 2.6.9-78.0.8.ELsmp should be fine for testing. 


If this kernel does not work, I have a feeling acpi_skip_timer_override should not be set for this hardware. If this is the case I would like to debug this further so as to take advantage of this testing window. So I would like to work out a test patch to test as well. Are you able to test a kernel patch (i.e. patch and build a kernel from source)? If not I can try and spin a test kernel with the patch already applied instead. Please let me know. Thanks.

Comment 45 Dave Botsch 2008-12-24 18:16:04 UTC
Either way works.

Comment 49 Stanislaw Gruszka 2009-05-22 09:19:55 UTC
Since it is fixed in 2.6.9-78.0.8.ELsmp I'm closing this bug.

