Bug 159133
| Field | Value |
|---|---|
| Summary | EXT3-fs error Freeing block in system zone |
| Product | Red Hat Enterprise Linux 2.1 |
| Component | kernel |
| Version | 2.1 |
| Hardware | i686 |
| OS | Linux |
| Status | CLOSED CANTFIX |
| Severity | high |
| Priority | medium |
| Reporter | Christian Schnuerer <c.schnuerer> |
| Assignee | Jim Paradis <jparadis> |
| QA Contact | Brian Brock <bbrock> |
| CC | peterm, sct |
| Doc Type | Bug Fix |
| Last Closed | 2005-09-12 19:58:39 UTC |

Description
Christian Schnuerer
2005-05-30 14:21:17 UTC
Created attachment 114973 [details]
last dmesg-output
Have you run a full forced fsck on the filesystem in question? If so, what were the results?

Of course! Each time some errors were repaired. Unfortunately there is no more output of it.

We reinstalled the servers with AS3 last friday! Let's see what happens.

OK; note that the "script" binary (from the util-linux rpm) is often helpful in capturing e2fsck output.

It happened again!

Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPWrite failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: Cannot initiate PPAPRead: Socket closed
Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

Complete new installation of RedHat AS3, we use the standard-kernel as shipped by RedHat, all updates installed!

Release: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel: 2.4.21-32.0.1.ELsmp as

Could that be an application-bug of Helios?

It's generally unlikely to be due to an application. I still need "e2fsck" output showing the nature of the corruption to be able to continue. The fact that it's always the same block being returned is suspicious and makes me wonder if there's a bad sector on disk at that location or an lvm problem, but really without details of the problem it's just guesswork. Are there _any_ other filesystem or storage-related errors in the logs?

A bad sector on disk as cause is impossible, because we have a central storage (SAN) and the problem went from one storage-box to another - even to a new SAN (STK D173 -> HP EVA5000). The problem exists on both cluster-nodes. Therefore a hardware-problem is very unlikely. The error only occurs in the same filesystem - always the volume where the spool-directories reside.

Here the output of the last fsck:

[root@RHHelios1 root]# e2fsck /dev/vgeva2/lvol0
e2fsck 1.32 (09-Nov-2002)
/dev/vgeva2/lvol0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 48581 was part of the orphaned inode list. FIXED.
Inode 48600 was part of the orphaned inode list. FIXED.
Inode 48643 was part of the orphaned inode list. FIXED.
Inode 48650 was part of the orphaned inode list. FIXED.
Inode 502045 was part of the orphaned inode list. FIXED.
Deleted inode 744848 has zero dtime. Fix<y>? yes
Inode 971547 was part of the orphaned inode list. FIXED.
Inode 971698 was part of the orphaned inode list. FIXED.
Inode 1003921 was part of the orphaned inode list. FIXED.
Inode 1003924 was part of the orphaned inode list. FIXED.
Inode 1003947 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -98919 -99029 -1059219 -1507884 -1971399 -2032180 -2032546 Fix<y>? yes
Free blocks count wrong for group #3 (32115, counted=32117). Fix<y>? yes
Free blocks count wrong for group #32 (20266, counted=20267). Fix<y>? yes
Free blocks count wrong for group #46 (27682, counted=27683). Fix<y>? yes
Free blocks count wrong for group #60 (27484, counted=27485). Fix<y>? yes
Free blocks count wrong for group #62 (32201, counted=32203). Fix<y>? yes
Free blocks count wrong (1330925, counted=1330932). Fix<y>? yes
Inode bitmap differences: -48581 -48600 -48643 -48650 -502045 -744848 -971547 -971698 -1003921 -1003924 -1003947 Fix<y>? yes
Free inodes count wrong for group #3 (16007, counted=16011). Fix<y>? yes
Free inodes count wrong for group #31 (16048, counted=16049). Fix<y>? yes
Free inodes count wrong for group #46 (16128, counted=16129). Fix<y>? yes
Free inodes count wrong for group #60 (16003, counted=16005). Fix<y>? yes
Free inodes count wrong for group #62 (16076, counted=16079). Fix<y>? yes
Free inodes count wrong (1014210, counted=1014221). Fix<y>? yes
/dev/vgeva2/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgeva2/lvol0: 22067/1036288 files (0.6% non-contiguous), 741644/2072576 blocks
[root@RHHelios1 root]#

Created attachment 116234 [details]
Screenshot of Kernel Panic
Created attachment 116235 [details]
ksyms
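
Editorial aside on the e2fsck run quoted in this report: the Pass 5 "Free blocks count wrong" and "Free inodes count wrong" lines can be cross-checked mechanically. The following is a minimal sketch, not part of the original thread; the function name `summarize_free_counts` and the `SAMPLE` constant are hypothetical, and the regular expressions assume the message wording shown in this report (e2fsck 1.32).

```python
import re

# Assumed message formats, taken from the e2fsck 1.32 output in this report.
GROUP_RE = re.compile(
    r"Free (blocks|inodes) count wrong for group #(\d+) \((\d+), counted=(\d+)\)"
)
TOTAL_RE = re.compile(r"Free (blocks|inodes) count wrong \((\d+), counted=(\d+)\)")

# Pass 5 count lines copied from the report.
SAMPLE = """
Free blocks count wrong for group #3 (32115, counted=32117).
Free blocks count wrong for group #32 (20266, counted=20267).
Free blocks count wrong for group #46 (27682, counted=27683).
Free blocks count wrong for group #60 (27484, counted=27485).
Free blocks count wrong for group #62 (32201, counted=32203).
Free blocks count wrong (1330925, counted=1330932).
Free inodes count wrong for group #3 (16007, counted=16011).
Free inodes count wrong for group #31 (16048, counted=16049).
Free inodes count wrong for group #46 (16128, counted=16129).
Free inodes count wrong for group #60 (16003, counted=16005).
Free inodes count wrong for group #62 (16076, counted=16079).
Free inodes count wrong (1014210, counted=1014221).
"""


def summarize_free_counts(text):
    """Return per-group and filesystem-wide deltas (counted minus recorded)."""
    summary = {
        "blocks": {"groups": {}, "total": None},
        "inodes": {"groups": {}, "total": None},
    }
    for kind, group, recorded, counted in GROUP_RE.findall(text):
        summary[kind]["groups"][int(group)] = int(counted) - int(recorded)
    for kind, recorded, counted in TOTAL_RE.findall(text):
        summary[kind]["total"] = int(counted) - int(recorded)
    return summary


if __name__ == "__main__":
    result = summarize_free_counts(SAMPLE)
    for kind in ("blocks", "inodes"):
        per_group = result[kind]["groups"]
        print(kind, "per-group deltas:", per_group)
        print(kind, "sum of groups:", sum(per_group.values()),
              "filesystem total:", result[kind]["total"])
```

On the quoted output, the per-group deltas sum to 7 blocks and 11 inodes, which matches both the filesystem-wide deltas and the number of entries in the block and inode bitmap difference lists. That is consistent with blocks and inodes that were really free but still marked in use, which fits the orphaned-inode fixes in Pass 1.
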
Another Crash on the second cluster-node today. But without fs-corruption. Maybe the same cause? Please, see attachments.

That last oops has nothing to do with the filesystem, indeed. There's not a lot I can do to help you right now, I suspect --- you're really going to need to try to capture a crash dump for our support services to investigate. The fact that you're seeing some of the problems associated with a specific LVM volume makes me think that there may be problems there, and we don't ship LVM with AS-2.1. Can you reproduce this without using lvm?

As i wrote on 06.06, it's AS3 (Update 5, Kernel 2.4.21-32.0.1.ELsmp) in the meantime!!!

I have disabled the audit-daemon - no crash since 10 weeks !?

I cloned off Bug 168138 to track the RHEL3 manifestation of this issue. This bug will track the RHEL2.1 manifestation.

RHEL2.1 is in maintenance support mode. This feature/bug is not something we would consider fixing in RHEL2.1 at this time due to the technical impact of the change. If you feel this issue should be addressed in RHEL2.1, please raise the issue to Red Hat Product Management.
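
Editorial aside: the thread notes that the same block keeps turning up in the "Freeing block in system zone" error. A rough way to check that across a longer syslog is to tally the messages per device and block. This is only a sketch, not something from the original report; `tally_system_zone_errors` and `SAMPLE_LOG` are hypothetical names, and the pattern assumes the exact 2.4-era message wording quoted in this bug.

```python
import re
from collections import Counter

# Assumed message format, copied from the kernel log line in this report.
ERROR_RE = re.compile(
    r"EXT3-fs error \(device (?P<device>.+?)\): ext3_free_blocks: "
    r"Freeing block in system zone - block = (?P<block>\d+)"
)

# Log lines copied from the report.
SAMPLE_LOG = [
    "Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPWrite failed: Request aborted",
    "Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted",
    "Jun 20 14:19:20 rhhelios1 anz1og[28737]: Cannot initiate PPAPRead: Socket closed",
    "Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): "
    "ext3_free_blocks: Freeing block in system zone - block = 255",
]


def tally_system_zone_errors(lines):
    """Count 'Freeing block in system zone' errors per (device, block)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[(match.group("device"), int(match.group("block")))] += 1
    return counts


if __name__ == "__main__":
    for (device, block), seen in tally_system_zone_errors(SAMPLE_LOG).items():
        print(f"{device}: block {block} reported {seen} time(s)")
```

To run it against a real log, pass an open file object (for example the system's messages file) instead of `SAMPLE_LOG`; a single (device, block) pair dominating the tally would support the "always the same block" observation.
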