+++ This bug was initially created as a clone of Bug #159133 +++

Bug 159133 was initially filed against RHEL2.1. Customer updated to RHEL3 and the problem still persists. This bug was cloned off to track the RHEL3 version of this issue.

Description of problem:

Serious problems with file-system corruption on two fully patched Dell PE6650 systems running in a cluster (clumanager-1.0.28-1). This problem has existed for some months and seems to happen ever more frequently:

May 29 15:52:51 machine1 bestlaser1[6483]: Cannot initiate PPAPRead: Socket closed
May 29 15:52:51 machine1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

After the message appears the system crashes and we have to power off/on and fsck the system. We know LVM is not really supported in an AS2.1 cluster, but we use a simple configuration without stripes (just one VG per LUN). There are many filesystems in this cluster totalling about 600 GB, and the error occurs only in a small (8 GB) filesystem containing some spool directories of our Helios-NG installation (www.helios.de), and it always arises at the same time as the crash of a print job. Also, the block is always 255!

Two weeks ago we migrated the servers into a completely new SAN with new FC switches and new storage (HP EVA5000) - the problem still exists!

lsmod:
Module                  Size  Used by    Tainted: P
st                     30612   0
lp                      8032   0  (autoclean)
parport                37696   0  (autoclean) [lp]
sr_mod                 17176   0  (autoclean) (unused)
ide-cd                 35168   0  (autoclean)
cdrom                  35392   0  (autoclean) [sr_mod ide-cd]
esm                    81297   1
autofs                 13188   1  (autoclean)
3c59x                  32072   1
bcm5700                71908   1
loop                   11728   0  (autoclean)
usb-ohci               23328   0  (unused)
usbcore                68160   1  [usb-ohci]
ext3                   69952   6
jbd                    54804   6  [ext3]
sg                     35012   0
qla2300               608576   3
qla2300_conf          301344   0
megaraid               28576   7
sd_mod                 13888   7
scsi_mod              125916   6  [st sr_mod sg qla2300 megaraid sd_mod]

Version-Release number of selected component (if applicable):
kernel-smp-2.4.9-e.62

How reproducible:
not reproducible

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
Dell PE6650 with 4 Xeon 1.5 GHz CPUs, 4 GB RAM, hyperthreading enabled; 2 Gb SAN environment; each server has two QLA2340 HBAs with failover enabled (BIOS 1.43, driver 7.05.00-fo); central storage HP EVA5000

-- Additional comment from c.schnuerer on 2005-05-30 10:21 EST --

Created an attachment (id=114973)
last dmesg output

-- Additional comment from sct on 2005-06-06 06:52 EST --

Have you run a full forced fsck on the filesystem in question? If so, what were the results?

-- Additional comment from c.schnuerer on 2005-06-06 10:08 EST --

Of course! Each time some errors were repaired. Unfortunately there is no output of it any more. We reinstalled the servers with AS3 last Friday! Let's see what happens.

-- Additional comment from sct on 2005-06-06 13:03 EST --

OK; note that the "script" binary (from the util-linux rpm) is often helpful in capturing e2fsck output.

-- Additional comment from c.schnuerer on 2005-06-20 08:59 EST --

It happened again!

Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPWrite failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: Cannot initiate PPAPRead: Socket closed
Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

This is a completely new installation of Red Hat AS3; we use the standard kernel as shipped by Red Hat, with all updates installed!
Release: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel: 2.4.21-32.0.1.ELsmp

Could that be an application bug of Helios?

-- Additional comment from sct on 2005-06-20 16:46 EST --

It's generally unlikely to be due to an application. I still need "e2fsck" output showing the nature of the corruption to be able to continue.
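[Editor's note: the "script" approach suggested above can be sketched as follows; the /dev/vgeva2/lvol0 path is taken from the report later in this thread, and the log filename is illustrative. This is a hedged example, not a command from the original thread.]

```shell
# Capture a full e2fsck transcript with script(1) from util-linux.
# The real invocation would target the (unmounted) LVM volume, e.g.:
#   script -c "e2fsck -f -y /dev/vgeva2/lvol0" /root/e2fsck.log
# The capture mechanism itself, demonstrated on a harmless command:
script -q -c "echo filesystem-check-output" /tmp/typescript.log
```

Afterwards /tmp/typescript.log contains everything the wrapped command printed, which is exactly what is needed to preserve e2fsck output across a repair run.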
The fact that it's always the same block being returned is suspicious and makes me wonder if there's a bad sector on the disk at that location, or an LVM problem; but without details of the corruption it's really just guesswork. Are there _any_ other filesystem or storage-related errors in the logs?

-- Additional comment from c.schnuerer on 2005-06-21 06:01 EST --

A bad sector on disk as the cause is impossible, because we have central storage (SAN) and the problem moved from one storage box to another - even to a new SAN (STK D173 -> HP EVA5000). The problem exists on both cluster nodes, so a hardware problem is very unlikely. The error only occurs in the same filesystem - always the volume where the spool directories reside. Here is the output of the last fsck:

[root@RHHelios1 root]# e2fsck /dev/vgeva2/lvol0
e2fsck 1.32 (09-Nov-2002)
/dev/vgeva2/lvol0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 48581 was part of the orphaned inode list.  FIXED.
Inode 48600 was part of the orphaned inode list.  FIXED.
Inode 48643 was part of the orphaned inode list.  FIXED.
Inode 48650 was part of the orphaned inode list.  FIXED.
Inode 502045 was part of the orphaned inode list.  FIXED.
Deleted inode 744848 has zero dtime.  Fix<y>? yes
Inode 971547 was part of the orphaned inode list.  FIXED.
Inode 971698 was part of the orphaned inode list.  FIXED.
Inode 1003921 was part of the orphaned inode list.  FIXED.
Inode 1003924 was part of the orphaned inode list.  FIXED.
Inode 1003947 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -98919 -99029 -1059219 -1507884 -1971399 -2032180 -2032546  Fix<y>? yes
Free blocks count wrong for group #3 (32115, counted=32117).  Fix<y>?
yes
Free blocks count wrong for group #32 (20266, counted=20267).  Fix<y>? yes
Free blocks count wrong for group #46 (27682, counted=27683).  Fix<y>? yes
Free blocks count wrong for group #60 (27484, counted=27485).  Fix<y>? yes
Free blocks count wrong for group #62 (32201, counted=32203).  Fix<y>? yes
Free blocks count wrong (1330925, counted=1330932).  Fix<y>? yes
Inode bitmap differences:  -48581 -48600 -48643 -48650 -502045 -744848 -971547 -971698 -1003921 -1003924 -1003947  Fix<y>? yes
Free inodes count wrong for group #3 (16007, counted=16011).  Fix<y>? yes
Free inodes count wrong for group #31 (16048, counted=16049).  Fix<y>? yes
Free inodes count wrong for group #46 (16128, counted=16129).  Fix<y>? yes
Free inodes count wrong for group #60 (16003, counted=16005).  Fix<y>? yes
Free inodes count wrong for group #62 (16076, counted=16079).  Fix<y>? yes
Free inodes count wrong (1014210, counted=1014221).  Fix<y>? yes

/dev/vgeva2/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgeva2/lvol0: 22067/1036288 files (0.6% non-contiguous), 741644/2072576 blocks
[root@RHHelios1 root]#

-- Additional comment from c.schnuerer on 2005-07-01 07:05 EST --

Created an attachment (id=116234)
Screenshot of kernel panic

-- Additional comment from c.schnuerer on 2005-07-01 07:06 EST --

Created an attachment (id=116235)
ksyms

-- Additional comment from c.schnuerer on 2005-07-01 07:08 EST --

Another crash on the second cluster node today, but without filesystem corruption. Maybe the same cause? Please see the attachments.

-- Additional comment from sct on 2005-07-01 18:42 EST --

That last oops has nothing to do with the filesystem, indeed. There's not a lot I can do to help you right now, I suspect --- you're really going to need to try to capture a crash dump for our support services to investigate. The fact that you're seeing some of the problems associated with a specific LVM volume makes me think that there may be problems there, and we don't ship LVM with AS-2.1.
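[Editor's note: the "same block 255 every time" observation discussed above can be probed directly with the e2fsprogs tools; the device path is taken from the fsck transcript in this thread, and the -b 4096 block size is an assumption. A minimal sketch, demonstrated on a throwaway image so nothing touches a real volume:]

```shell
# On the real (unmounted) volume, one could ask which inode, if any,
# owns block 255, and read-test just that block:
#   debugfs -R "icheck 255" /dev/vgeva2/lvol0
#   badblocks -sv -b 4096 /dev/vgeva2/lvol0 255 255
# The same icheck query, demonstrated on a small scratch ext2 image:
dd if=/dev/zero of=/tmp/test.img bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/test.img
debugfs -R "icheck 255" /tmp/test.img 2>/dev/null
```

If icheck maps block 255 to a metadata region (as the "system zone" error suggests), the corruption is in an allocation structure rather than in file data, which would fit the bitmap differences e2fsck reported.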
Can you reproduce this without using LVM?

-- Additional comment from c.schnuerer on 2005-07-02 10:43 EST --

As I wrote on 06.06, it has been AS3 (Update 5, kernel 2.4.21-32.0.1.ELsmp) in the meantime!!!

-- Additional comment from c.schnuerer on 2005-09-12 03:48 EST --

I have disabled the audit daemon - no crash for 10 weeks!?
It is hard to imagine how audit could be the cause of the underlying problem; audit was not present in AS-2.1. But if the problem recurs, please open an official support ticket. This bug has come to include too many components --- LVM, audit, data warnings from ext3, and panics entirely outside ext3 --- to be conclusively escalated to any single engineering component right now. There's not enough information to identify the problem, so there's nothing concrete for engineering to fix. Support channels are better placed to narrow down the possibilities if it happens again.