Bug 177427

Summary: rhel3u5: kernel panic in fsync(2) while ~idle
Product: Red Hat Enterprise Linux 3 Reporter: Francois-Xavier 'FiX' KOWALSKI <francois-xavier.kowalski>
Component: kernel Assignee: Dave Anderson <anderson>
Status: CLOSED WONTFIX QA Contact: Brian Brock <bbrock>
Severity: urgent Docs Contact:
Priority: medium    
Version: 3.0 CC: dhoward, petrides
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-06-02 21:51:30 UTC Type: ---
Attachments (Description, Flags):
console screen shot, none
console screen shot, none

Description Francois-Xavier 'FiX' KOWALSKI 2006-01-10 16:34:45 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.12) Gecko/20050922 Fedora/1.0.7-1.1.fc3 Firefox/1.0.7

Description of problem:
After a fresh system boot, our Java-based application is started to perform some web-based configuration.  The output of this JVM (Sun HotSpot JDK 1.4.2) is redirected to syslog via a pipe to logger(1).

Here is the crash backtrace:

EIP is at __out_of_line_bug [kernel] 0x17 (2.4.21-32.ELsmp/i686)
eax: 00000026   ebx: f7465f90   ecx: c0383eb4   edx: 01fa7ed7
esi: f7fee000   edi: f7465f90   ebp: 00000009   esp: f7465f34
ds: 0068   es: 0068   ss: 0068
Process syslogd (pid: 766, stackpage=f7465000)
Stack: c02bd964 000000fe c0173c1c 000000fe c0172a90 f7465f90 f7fee000 f7465f90
       c0173a87 f7fee000 f7fee000 f7465f90 c0173df9 00252d88 006afb8a f7464000
       00000000 00000000 00000000 c0162c3b 00000000 bfff8e10 fffffeff 00000000
Call Trace:   [<c0173c1c>] path_init [kernel] 0x16c (0xf7465f3c)
[<c0172a90>] getname [kernel] 0xa0 (0xf7465f44)
[<c0173a87>] path_lookup [kernel] 0x17 (0xf7465f54)
[<c0173df9>] __user_walk [kernel] 0x49 (0xf7465f64)
[<c0162c3b>] sys_access [kernel] 0x7b (0xf7465f80)
[<c0166377>] sys_fsync [kernel] 0x47 (0xf7465f9c)

Code: 0f 0b 37 01 5f d0 2b c0 90 eb fe 8d b4 26 00 00 00 00 8d bc

Kernel panic: Fatal exception 
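
To cross-check a trace like the one above against the stack dump, the entries can be parsed mechanically. The following Python sketch is illustrative only: the names `TRACE_RE` and `parse_call_trace` are hypothetical, and the regular expression assumes the entry layout seen in this report (`[<address>] symbol [module] offset (stack slot)`). The parenthesized value is read here as the stack location holding the return address, which agrees with the dump above: esp is f7465f34, and the third stack word, c0173c1c, sits at f7465f3c.

```python
import re

# Assumed entry layout, taken from this report:
#   [<c0173c1c>] path_init [kernel] 0x16c (0xf7465f3c)
TRACE_RE = re.compile(
    r"\[<([0-9a-f]{8})>\]\s+(\w+)\s+\[(\w+)\]\s+(0x[0-9a-f]+)\s+\((0x[0-9a-f]+)\)"
)


def parse_call_trace(text):
    """Return one dict per trace entry found in `text`, in order of appearance."""
    frames = []
    for match in TRACE_RE.finditer(text):
        address, symbol, module, offset, stack_slot = match.groups()
        frames.append(
            {
                "address": int(address, 16),        # return address
                "symbol": symbol,                   # nearest kernel symbol
                "module": module,                   # "kernel" or a module name
                "offset": int(offset, 16),          # offset inside the symbol
                "stack_slot": int(stack_slot, 16),  # where the address was found
            }
        )
    return frames


if __name__ == "__main__":
    sample = (
        "Call Trace:   [<c0173c1c>] path_init [kernel] 0x16c (0xf7465f3c)\n"
        "[<c0172a90>] getname [kernel] 0xa0 (0xf7465f44)\n"
    )
    for frame in parse_call_trace(sample):
        print(f"{frame['stack_slot']:08x}: {frame['symbol']}+{frame['offset']:#x}")
```

Entries that do not follow this layout are skipped, so a result shorter than expected is a hint that part of the trace was truncated or mistyped.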

Version-Release number of selected component (if applicable):
kernel, `uname -r`=2.4.21-32.ELsmp

How reproducible:
Sometimes

Steps to Reproduce:
No simple scenario: The crash does not seem to be related to any specific user operation.  We are currently working to isolate this issue.


Additional info:

Comment 1 Ernie Petrides 2006-01-10 18:05:08 UTC
Please try to reproduce this on the latest officially released kernel,
which is 2.4.21-37.EL (RHEL3 U6, released this past September).  There
was a post-U5 memory corruption fix that might have accounted for this.

Thanks in advance.


Comment 2 Dave Anderson 2006-01-10 19:22:55 UTC
Also, if it is reproducible with the latest kernel, please set up 
netdump and/or diskdump and forward us the vmcore.  

Comment 3 Francois-Xavier 'FiX' KOWALSKI 2006-01-24 15:44:58 UTC
We were unable to test with more recent kernels because of a third-party
dependency (a kernel module).  However, we have made significant progress.

The problem arises only on machines with very low commit-to-disk
performance.  For example, the machine that exhibited the bug (with the syslog
backtrace) was able to commit only 19 syslog entries to disk per second.

Once the commit-to-disk performance issue was fixed (it was a H/W RAID setup
problem), the problem no longer occurred at all.
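
A rough way to estimate such a commit rate is to time a loop of small appends, each followed by fsync(2), which is roughly what syslogd does for log files not marked with a leading `-` in syslog.conf. The sketch below is illustrative only: the function name `fsync_commits_per_second`, the 80-byte record size, and the use of a scratch file are assumptions, not the measurement actually used on these machines.

```python
import os
import tempfile
import time


def fsync_commits_per_second(directory, count=50, record=b"x" * 79 + b"\n"):
    """Append `count` records to a scratch file in `directory`, calling
    fsync after each one, and return the observed commits per second."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, record)
            os.fsync(fd)  # block until the record has been pushed to the device
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    # Guard against a zero interval on very fast (e.g. memory-backed) storage.
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Point this at a directory on the volume that holds the log files.
    rate = fsync_commits_per_second(tempfile.gettempdir())
    print(f"{rate:.1f} fsync'd records per second")
```

The scratch file must live on the same volume as the syslog files for the figure to be comparable; a memory-backed temporary directory will report a misleadingly high rate.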

Given the above, it is very likely that the long delays spent waiting for
write completion opened concurrency windows (the box has 8 processors) that
exposed the memory corruption you pointed out.

I will update this record when we have a chance to test with the rhel3u6
kernel.

Comment 4 Francois-Xavier 'FiX' KOWALSKI 2006-09-28 07:51:27 UTC
A similar-looking problem was reproduced with the rhel3u6 kernel.  The problem
occurs when rebooting the machine.  Here is the backtrace (it is a manual copy
from a screenshot taken with a digital camera on the console[1]; the JPG is
attached to this bug record):

EIP is at ext3_get_inode_loc [ext3] 0xda (2.4.21-37.ELsmp/i686)
eax: 00000000 ebx: c4de1c00 ecx: 0000000c edx: f791ed3c
esi: 00000060 edi: 00000d00 ebp: 00000003 esp: f6843e30
ds: 0060 es: 0060 ss: 0060
Process reboot (pid: 1090, stackpage=f6843000)
Stack: c017fbd0 f70cb1f8 00000000 c4de1c00 f78cb100 00000003 f78cb100 f78ed080
       f78cb100 f78f5400 f8850d7b f78cb100 f6843e84 00000000 c32e8140 00009a9b
       c4de1c00 c010152a c4de1c00 00009a9b c32e8140 00000000 00000000 f78cb100
Call Trace: [<c017fbd8>] alloc_inode [kernel] 0xc0 (0xf6843e38)
[<f8858d7b>] ext3_read_inode [ext3] 0x1b (0xf6843e60)
[<c018152a>] iget4_locked [kernel] 0x10a (0xf6843e7c)
[<f885a78b>] ext3_lookup [ext3] 0xbb (0xf6843ea4)
[<c017338c>] real_lookup [kernel] 0xec (0xf6843ec8)
[<c01739e7>] link_path_walk [kernel] 0x487 (0xf6843ee8)
[<c0173f69>] path_lookup [kernel] 0x39 (0xf6843f28)
[<c017452e>] open_namei [kernel] 0x7e (0xf6843f38)
[<c0163813>] filp_open [kernel] 0x43 (0xf6843f68)
[<c0163c53>] sys_open [kernel] 0x63 (0xf6843fa0)

[1] By the way, how could we get a text console other than this VGA output?  We
have no serial link available at this site...

About the diskdump/netdump setup: I have requested that it be set up.  I do not
know at this time whether it will be possible.

Comment 5 Francois-Xavier 'FiX' KOWALSKI 2006-09-28 07:53:35 UTC
Created attachment 137288
console screen shot

Comment 6 Francois-Xavier 'FiX' KOWALSKI 2006-09-28 07:54:26 UTC
Created attachment 137289
console screen shot