Bug 184940 - Kernel PANIC while unmounting ext3 volume on LVM2 on SoftRAID
Status: CLOSED DUPLICATE of bug 196915
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.0
Hardware: i686 Linux
Priority: medium  Severity: medium
Assigned To: Eric Sandeen
QA Contact: Brian Brock
Reported: 2006-03-10 04:06 EST by Simon Matter
Modified: 2007-11-30 17:07 EST
CC: 5 users

Doc Type: Bug Fix
Last Closed: 2007-01-15 17:40:04 EST
Attachments: None

Description Simon Matter 2006-03-10 04:06:10 EST
Description of problem:
We are running a backup server which backs up a number of Linux servers using a
modified version of rsnapshot (http://www.rsnapshot.org/), which is basically a
perl script that uses GNU cp and rsync to create "snapshots" as hardlinked copies
of filesystem trees and then rsync the changes.
The problem is that after some running time (~ one week in our case), when I
tried to unmount the filesystem for some reason, the box panicked - always.
fsck'ing the filesystem showed that the fs was corrupt. After fixing it or
creating a new filesystem, everything seemed fine again. After one day I was
still able to unmount; after ~one week I got panics again.
The server I describe here is an internal system running CentOS 4.2. However, we
are running two similar servers at a customer's site which now seem to suffer the
same problem. They are both Dell PE830, which is RHEL4 certified, running
RHES4U2 with a proper 3-year subscription.
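The hardlink-snapshot rotation rsnapshot performs can be sketched roughly as below. This is a minimal, self-contained illustration with hypothetical /tmp paths; real rsnapshot adds configuration, locking, and multi-level rotation, and uses rsync for the final sync step, which is emulated here with rm+cp (replace-by-new-inode, like rsync's temp-file-and-rename) so the sketch needs only coreutils:

```shell
#!/bin/sh
# Sketch of an rsnapshot-style rotation: hardlink-copy the last
# snapshot, then replace only the changed files.  Paths are hypothetical.
set -e
SNAP=/tmp/snapdemo
SRC=$SNAP/source
rm -rf "$SNAP"
mkdir -p "$SRC"
echo "version 1" > "$SRC/file.txt"

# Initial snapshot: a plain recursive copy.
cp -a "$SRC" "$SNAP/daily.0"

# Rotation: cp -al duplicates the tree as hard links (cheap in space)...
cp -al "$SNAP/daily.0" "$SNAP/daily.1"

# ...then each changed file is replaced with a new inode.  Writing in
# place (e.g. with >) would corrupt the old snapshot through the link;
# rsync avoids this the same way, by writing a temp file and renaming.
echo "version 2" > "$SRC/file.txt"
rm -f "$SNAP/daily.0/file.txt"
cp "$SRC/file.txt" "$SNAP/daily.0/file.txt"
```

After the rotation, daily.1/file.txt still reads "version 1" while daily.0/file.txt reads "version 2"; unchanged files remain single inodes shared by both trees, which is how a tree of ~15 million entries ends up with a ~30% hardlink rate.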

Version-Release number of selected component (if applicable):
kernel-2.6.9-22.0.2.ELsmp

How reproducible:
Always


Steps to Reproduce:
1. Create an ext3 filesystem with default options on LVM2, and mount it
2. write lots of files, directories and hardlinks to a directory tree (in my
case rsnapshot was used)
3. try to umount /dev/xxx
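The load in step 2 can be approximated with a small script. This is a sketch only: the path is a stand-in for the mounted ext3 volume, and the original report generated the load with rsnapshot over real server trees:

```shell
#!/bin/sh
# Sketch of the step-2 workload: many small files in a directory tree,
# with roughly every third file also getting a hard link (~30% hardlink
# rate, as in the report).  $T stands in for the mounted ext3 volume.
T=${T:-/tmp/load-demo}
rm -rf "$T"; mkdir -p "$T"
i=0
while [ $i -lt 999 ]; do
    d="$T/dir$((i % 10))"
    mkdir -p "$d"
    echo "data $i" > "$d/f$i"
    if [ $((i % 3)) -eq 0 ]; then
        ln "$d/f$i" "$d/h$i"    # extra hard link to the same inode
    fi
    i=$((i+1))
done
```

Scaled up to millions of entries, this kind of tree is what the umount in step 3 then has to tear down.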
  
Actual results:
The kernel panics

Expected results:
The filesystem should unmount without errors

Additional info:
The server in question is an IBM Netfinity 4-way P3-Xeon. The filesystem is ext3
on a LVM2 volume of 360G. The number of files and directories is ~15 million,
with a hardlink rate around 30%. The oops below shows what happens when shutting
down the box:

Feb 12 11:01:01 backup crond(pam_unix)[27526]: session opened for user root by
(uid=0)
Feb 12 11:01:02 backup crond(pam_unix)[27526]: session closed for user root
Feb 12 11:53:54 backup smartd[3615]: Device: /dev/sdf, Temperature changed -2
Celsius to 31 Celsius since last report 
Feb 12 12:01:01 backup crond(pam_unix)[4651]: session opened for user root by
(uid=0)
Feb 12 12:01:01 backup crond(pam_unix)[4651]: session closed for user root
Feb 12 12:05:01 backup crond(pam_unix)[5293]: session opened for user root by
(uid=0)
Feb 12 12:53:54 backup smartd[3615]: Device: /dev/sdf, Temperature changed 2
Celsius to 33 Celsius since last report 
Feb 12 13:01:01 backup crond(pam_unix)[10868]: session opened for user root by
(uid=0)
Feb 12 13:01:01 backup crond(pam_unix)[10868]: session closed for user root
Feb 12 13:01:17 backup httpd: httpd shutdown succeeded
Feb 12 13:01:52 backup mysqld: Stopping MySQL:  succeeded
Feb 12 13:02:56 backup kernel: kjournald starting.  Commit interval 5 seconds
Feb 12 13:02:56 backup kernel: EXT3 FS on dm-5, internal journal
Feb 12 13:02:56 backup kernel: EXT3-fs: mounted filesystem with ordered data mode.
Feb 12 13:03:25 backup kernel: sb orphan head is 13092667
Feb 12 13:03:25 backup kernel: sb_info orphan list:
Feb 12 13:03:25 backup kernel:   inode (x�8�Y�^TM���?q���
r�g�^R\|kˉ�^_/e�:1332902803 at d7753810: mode 21675, nlink -1937605498, next
-980563492
Feb 12 13:03:25 backup kernel: Unable to handle kernel paging request at virtual
address 567a464f
Feb 12 13:03:25 backup kernel:  printing eip:
Feb 12 13:03:25 backup kernel: f8869b32
Feb 12 13:03:25 backup kernel: *pde = 00000000
Feb 12 13:03:25 backup kernel: Oops: 0000 [#1]
Feb 12 13:03:25 backup kernel: SMP 
Feb 12 13:03:25 backup kernel: Modules linked in: pcspkr md5 ipv6 parport_pc lp
parport autofs4 i2c_dev i2c_core nfs lockd sunrpc uhci_hcd pcnet32 mii e1000
floppy st sg dm_snapshot dm_zero dm_mirror ext3 jbd raid5 xor raid1 dm_mod
aic7xxx sd_mod scsi_mod
Feb 12 13:03:25 backup kernel: CPU:    0
Feb 12 13:03:25 backup kernel: EIP:    0060:[<f8869b32>]    Not tainted VLI
Feb 12 13:03:25 backup kernel: EFLAGS: 00010296   (2.6.9-22.0.2.ELsmp) 
Feb 12 13:03:25 backup kernel: EIP is at dump_orphan_list+0x2a/0x6c [ext3]
Feb 12 13:03:25 backup kernel: eax: 00000075   ebx: 567a464f   ecx: e98cbefc  
edx: f8871cd5
Feb 12 13:03:25 backup kernel: esi: daea8000   edi: e7b98800   ebp: f887ba60  
esp: e98cbf14
Feb 12 13:03:25 backup kernel: ds: 007b   es: 007b   ss: 0068
Feb 12 13:03:25 backup kernel: Process umount (pid: 20783, threadinfo=e98cb000
task=e1db3290)
Feb 12 13:03:25 backup kernel: Stack: daea8000 daeac140 f8869c40 e98cb000
e7b98800 00000000 c015f1f6 e7b98800 
Feb 12 13:03:25 backup kernel:        f7b1c640 f887bc40 e98cb000 c015fbdf
e7b98840 e7b98800 c015f09c 00000000 
Feb 12 13:03:25 backup kernel:        00000000 08051bda c01723a7 e073ea64
f5e35200 f5e6c334 00000202 00000000 
Feb 12 13:03:25 backup kernel: Call Trace:
Feb 12 13:03:25 backup kernel:  [<f8869c40>] ext3_put_super+0xcc/0x149 [ext3]
Feb 12 13:03:25 backup kernel:  [<c015f1f6>] generic_shutdown_super+0xa8/0x154
Feb 12 13:03:25 backup kernel:  [<c015fbdf>] kill_block_super+0xf/0x22
Feb 12 13:03:25 backup kernel:  [<c015f09c>] deactivate_super+0x5b/0x70
Feb 12 13:03:25 backup kernel:  [<c01723a7>] sys_umount+0x65/0x6c
Feb 12 13:03:25 backup kernel:  [<c014f711>] unmap_vma_list+0xe/0x17
Feb 12 13:03:25 backup kernel:  [<c014fa4b>] do_munmap+0x129/0x137
Feb 12 13:03:25 backup kernel:  [<c01723b9>] sys_oldumount+0xb/0xe
Feb 12 13:03:25 backup kernel:  [<c02d137f>] syscall_call+0x7/0xb
Feb 12 13:03:25 backup kernel: Code: c3 56 89 d6 53 8b 42 2c ff b0 e8 00 00 00
68 a3 1c 87 f8 e8 23 87 8b c7 68 bc 1c 87 f8 e8 19 87 8b c7 8b 9e 40 41 00 00 83
c4 0c <8b> 03 0f 18 00 90 8d 86 40 41 00 00 39 c3 74 2f ff 73 a0 8d 53 
Feb 12 13:03:25 backup kernel:  <0>Fatal exception: panic in 5 seconds
Feb 12 13:13:54 backup syslogd 1.4.1: restart.
Feb 12 13:13:54 backup syslog: syslogd startup succeeded
Feb 12 13:13:55 backup kernel: klogd 1.4.1, log source = /proc/kmsg started.
Feb 12 13:13:55 backup kernel: Linux version 2.6.9-22.0.2.ELsmp
(buildcentos@build-i386) (gcc version 3.4.5 20051201 (Red Hat 3.4.5-2)) #1 SMP
Tue Jan 17 07:10:04 CST 2006

I was just checking the kernel changes in U3 but couldn't find any matching
entry. While the bug below seems very similar, I'm quite sure my problem is
absolutely not NFS-related, so I guess that fix doesn't help here.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=163738

I have been trying to fix this for several weeks now. I replaced most parts of
the server hardware (yes, we have spare parts). No luck. What has also worried
me is that this new backup server replaces an old Red Hat 9 PC server which
suffered almost the same problem; I had just assumed it was the bad, cheap
hardware.
Now I have finally found a solution: I installed the XFS-enabled
kernel-smp-2.6.9-22.0.2.106.unsupported from centosplus, created an XFS
filesystem instead of the ext3 one, and let the box run. It has worked for 4
weeks now without any problem. I have unmounted the volume twice in the
meantime and checked it with xfs_check/xfs_db. No problems; it just keeps
running and running, and it even seems to perform better with this large
number of files and hardlinks.
Now I have two problems: I can't fix the two supported RHEL4U2 servers the way
I fixed mine, and I can't test whether it's fixed in U3, because I can't play
with the customer's servers in the remote location and I won't kill our own
backup server again.
Comment 1 Eric Sandeen 2006-08-14 23:09:31 EDT
I have another orphan list related bug... may as well have this one too (though
I don't currently have a good idea of the problem; perhaps I'll try running with
rsnapshot, thanks for the idea) :)
Comment 2 Eric Sandeen 2006-08-22 13:02:44 EDT
Simon, can you give me a little more info on your rsnapshot config?
Was the oopsing filesystem the one holding snapshot_root, or the filesystem
being backed up?  (or were they on the same filesystem?)

Thanks,
-Eric
Comment 3 Simon Matter 2006-08-23 12:04:12 EDT
Eric, here is some more info on my rsnapshot config.
The $snapshot_root is /home/snapshots which is an otherwise unused filesystem on
/dev/VolGroup02/LogVol00.

[root@backup ~]# ll /home/snapshots/
total 0
drwxr-xr-x  9 root root 115 Aug 23 12:05 daily.0
drwxr-xr-x  9 root root 115 Aug 22 12:05 daily.1
drwxr-xr-x  9 root root 115 Aug 21 12:05 daily.2
drwxr-xr-x  9 root root 115 Aug 20 20:12 daily.3
drwxr-xr-x  9 root root 115 Aug 19 12:05 daily.4
drwxr-xr-x  9 root root 115 Aug 18 12:05 daily.5
drwxr-xr-x  9 root root 115 Aug 17 12:05 daily.6
drwxr-xr-x  9 root root 102 Aug 13 19:47 weekly.0
drwxr-xr-x  6 root root  75 Aug 13 07:48 weekly.1
drwxr-xr-x  9 root root 115 Jul 30 19:36 weekly.2

'umount /home/snapshots' has repeatedly triggered the panic above, but always
only after the server had run for a week or two.
The filesystems backed up are the backup server's own filesystems (just the
OS) plus a number of networked servers backed up via ssh/rsh.

Needless to say, the server has run fine since switching to XFS 5 months ago.
Comment 4 Eric Sandeen 2007-01-15 17:40:04 EST
This is almost certainly a dup of bug #196915, which unfortunately has mostly
private comments.

At issue is a race between link & unlink in ext3.

If one thread unlinks a file & puts it on the orphan list, another may
get in and bump nlink back up to 1.  In this case, the inode will usually
not get removed properly from the orphan inode list before it's freed, resulting
in corruption.  Here, the umount code is helpfully trying to dump
out the "remaining" orphan inodes, but unfortunately the sb inode
pointers now point to freed memory, hence the oops.
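As a rough userspace illustration of that access pattern (not a guaranteed reproducer: the window is inside the kernel and timing-dependent, and the paths here are hypothetical), two processes can hammer link() and unlink() on the same inode concurrently:

```shell
#!/bin/sh
# Hypothetical driver for the link/unlink race described above: two
# processes concurrently create and remove hard links to one inode, so
# its i_nlink is bumped and dropped from both sides at once.  On an
# affected kernel a badly timed link() could raise nlink again after
# unlink() had queued the inode; on a fixed kernel this simply finishes.
d=/tmp/orphan-race
rm -rf "$d"; mkdir -p "$d"
echo x > "$d/base"

( i=0; while [ $i -lt 500 ]; do ln "$d/base" "$d/a"; rm -f "$d/a"; i=$((i+1)); done ) &
( i=0; while [ $i -lt 500 ]; do ln "$d/base" "$d/b"; rm -f "$d/b"; i=$((i+1)); done ) &
wait
```

Each loop uses its own link name, so the only shared state is the inode's link count, which is exactly where the reported race lives.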

*** This bug has been marked as a duplicate of 196915 ***
