Bug 159133

Summary: EXT3-fs error Freeing block in system zone
Product: Red Hat Enterprise Linux 2.1
Reporter: Christian Schnuerer <c.schnuerer>
Component: kernel
Assignee: Jim Paradis <jparadis>
Status: CLOSED CANTFIX
QA Contact: Brian Brock <bbrock>
Severity: high
Priority: medium
Version: 2.1
CC: peterm, sct
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2005-09-12 19:58:39 UTC
Attachments:
Description                   Flags
last dmesg-output             none
Screenshot of Kernel Panic    none
ksyms                         none

Description Christian Schnuerer 2005-05-30 14:21:17 UTC
Description of problem:

Serious problems with file-system corruption on two fully patched Dell PE6650 
systems running in a cluster (clumanager-1.0.28-1).
This problem has existed for some months and seems to be happening more and 
more frequently:

 
May 29 15:52:51 machine1 bestlaser1[6483]: Cannot initiate PPAPRead: Socket closed
May 29 15:52:51 machine1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

After the message appears, the system crashes and we have to power-cycle it 
and run fsck.

We know LVM is not really supported in an AS2.1 cluster, but we use a simple 
configuration without stripes (just one VG per LUN).
There are many filesystems in this cluster, with about 600 GB, and the error 
occurs only in a small (8 GB) filesystem containing some spool directories of 
our Helios-NG installation (www.helios.de).
It always coincides with the crash of a print job.
Also, the block is always 255!

Two weeks ago we migrated the servers to a completely new SAN with new FC 
switches and new storage (HP EVA5000) - the problem still exists!



lsmod:
Module                  Size  Used by    Tainted: P  
st                     30612   0 
lp                      8032   0  (autoclean)
parport                37696   0  (autoclean) [lp]
sr_mod                 17176   0  (autoclean) (unused)
ide-cd                 35168   0  (autoclean)
cdrom                  35392   0  (autoclean) [sr_mod ide-cd]
esm                    81297   1 
autofs                 13188   1  (autoclean)
3c59x                  32072   1 
bcm5700                71908   1 
loop                   11728   0  (autoclean)
usb-ohci               23328   0  (unused)
usbcore                68160   1  [usb-ohci]
ext3                   69952   6 
jbd                    54804   6  [ext3]
sg                     35012   0 
qla2300               608576   3 
qla2300_conf          301344   0 
megaraid               28576   7 
sd_mod                 13888   7 
scsi_mod              125916   6  [st sr_mod sg qla2300 megaraid sd_mod]

Version-Release number of selected component (if applicable):

kernel-smp-2.4.9-e.62

How reproducible:
not reproducible


Additional info:

Dell PE 6650 with 4 Xeon 1.5 GHz CPUs, 4 GB RAM, hyperthreading enabled,
2Gb SAN environment; each server has two QLA2340 HBAs with failover enabled 
(BIOS 1.43, driver 7.05.00-fo);
central storage HP EVA5000

Comment 1 Christian Schnuerer 2005-05-30 14:21:17 UTC
Created attachment 114973 [details]
last dmesg-output

Comment 2 Stephen Tweedie 2005-06-06 10:52:44 UTC
Have you run a full forced fsck on the filesystem in question?  If so, what were
the results?  

Comment 3 Christian Schnuerer 2005-06-06 14:08:40 UTC
Of course!
Each time, some errors were repaired. Unfortunately, the output is no longer 
available.

We reinstalled the servers with AS3 last Friday! 
Let's see what happens.


Comment 4 Stephen Tweedie 2005-06-06 17:03:50 UTC
OK; note that the "script" binary (from the util-linux rpm) is often helpful in
capturing e2fsck output.
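
For anyone without "script" at hand, the same capture can be approximated with a few lines of Python. This is only a sketch and not the "script" utility itself: the helper name run_and_log and the log file name are made up for illustration, and the command used here is a harmless stand-in so the sketch can be tried anywhere. On a real system the command would be an e2fsck invocation against the unmounted volume, as noted in the comment.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(command, log_path):
    """Run command, echoing its combined stdout/stderr to the terminal and to log_path.

    Hypothetical helper for illustration; returns the command's exit status.
    """
    with open(log_path, "w") as log:
        process = subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep errors interleaved with normal output
            text=True,
        )
        for line in process.stdout:
            sys.stdout.write(line)
            log.write(line)
        return process.wait()


# Stand-in command. For a real read-only check of an unmounted filesystem the
# list might instead be ["e2fsck", "-f", "-n", "/dev/vgeva2/lvol0"]
# (-f forces the check, -n opens read-only and answers "no" to every question).
log_path = os.path.join(tempfile.mkdtemp(), "fsck-session.log")
status = run_and_log([sys.executable, "-c", "print('captured')"], log_path)
print("exit status:", status, "log written to", log_path)
```

Because the output is written line by line as it arrives, the log still holds everything printed up to that point if the run is interrupted.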


Comment 5 Christian Schnuerer 2005-06-20 12:59:10 UTC
It happened again!


Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPWrite failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: Cannot initiate PPAPRead: Socket closed
Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

This is a completely new installation of Red Hat AS3; we use the standard 
kernel as shipped by Red Hat, with all updates installed!

Release: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel: 2.4.21-32.0.1.ELsmp as 


Could this be an application bug in Helios?

Comment 6 Stephen Tweedie 2005-06-20 20:46:56 UTC
It's generally unlikely to be due to an application.  I still need "e2fsck"
output showing the nature of the corruption to be able to continue.  The fact
that it's always the same block being returned is suspicious and makes me wonder
if there's a bad sector on disk at that location or an lvm problem, but really
without details of the problem it's just guesswork.  Are there _any_ other
filesystem or storage-related errors in the logs?
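
One way to answer that question systematically is to scan the syslog files for every EXT3-fs error and tally which device and block each one names. The following is a minimal sketch: the regular expression is written against the two kernel messages quoted in this report (unwrapped onto one line each), not against any documented kernel log format, and the function name tally_blocks is invented for illustration.

```python
import re
from collections import Counter

# Written against the message quoted in this report, e.g.
#   EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255
# Other EXT3-fs messages may be worded differently and would not match.
EXT3_ERROR = re.compile(
    r"EXT3-fs error \(device (?P<device>.+?)\): "
    r"(?P<function>\w+): .* - block = (?P<block>\d+)"
)


def tally_blocks(lines):
    """Count how often each (device, block) pair appears in EXT3-fs error lines."""
    counts = Counter()
    for line in lines:
        match = EXT3_ERROR.search(line)
        if match:
            counts[(match.group("device"), int(match.group("block")))] += 1
    return counts


# The log lines quoted so far in this report.
sample_log = [
    "May 29 15:52:51 machine1 bestlaser1[6483]: Cannot initiate PPAPRead: Socket closed",
    "May 29 15:52:51 machine1 kernel: EXT3-fs error (device lvm(58,0)): "
    "ext3_free_blocks: Freeing block in system zone - block = 255",
    "Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted",
    "Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): "
    "ext3_free_blocks: Freeing block in system zone - block = 255",
]

for (device, block), count in sorted(tally_blocks(sample_log).items()):
    print(f"{device} block {block}: {count} occurrence(s)")
```

Feeding it the full /var/log/messages history from both nodes instead of sample_log would show whether any block other than 255, or any other device, has ever been reported.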

Comment 7 Christian Schnuerer 2005-06-21 10:01:26 UTC
A bad sector on disk is impossible as the cause, because we use central storage 
(SAN) and the problem followed us from one storage box to another, even to a 
new SAN (STK D173 -> HP EVA5000). The problem exists on both cluster nodes, so 
a hardware problem is very unlikely.
The error only ever occurs in the same filesystem: the volume where the 
spool directories reside.

Here is the output of the last fsck:

[root@RHHelios1 root]# e2fsck /dev/vgeva2/lvol0
e2fsck 1.32 (09-Nov-2002)
/dev/vgeva2/lvol0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes

Inode 48581 was part of the orphaned inode list.  FIXED.
Inode 48600 was part of the orphaned inode list.  FIXED.
Inode 48643 was part of the orphaned inode list.  FIXED.
Inode 48650 was part of the orphaned inode list.  FIXED.
Inode 502045 was part of the orphaned inode list.  FIXED.
Deleted inode 744848 has zero dtime.  Fix<y>? yes

Inode 971547 was part of the orphaned inode list.  FIXED.
Inode 971698 was part of the orphaned inode list.  FIXED.
Inode 1003921 was part of the orphaned inode list.  FIXED.
Inode 1003924 was part of the orphaned inode list.  FIXED.
Inode 1003947 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -98919 -99029 -1059219 -1507884 -1971399 -2032180 -2032546
Fix<y>? yes

Free blocks count wrong for group #3 (32115, counted=32117).
Fix<y>? yes

Free blocks count wrong for group #32 (20266, counted=20267).
Fix<y>? yes

Free blocks count wrong for group #46 (27682, counted=27683).
Fix<y>? yes

Free blocks count wrong for group #60 (27484, counted=27485).
Fix<y>? yes

Free blocks count wrong for group #62 (32201, counted=32203).
Fix<y>? yes

Free blocks count wrong (1330925, counted=1330932).
Fix<y>? yes

Inode bitmap differences:  -48581 -48600 -48643 -48650 -502045 -744848 -971547 -971698 -1003921 -1003924 -1003947
Fix<y>? yes

Free inodes count wrong for group #3 (16007, counted=16011).
Fix<y>? yes

Free inodes count wrong for group #31 (16048, counted=16049).
Fix<y>? yes

Free inodes count wrong for group #46 (16128, counted=16129).
Fix<y>? yes

Free inodes count wrong for group #60 (16003, counted=16005).
Fix<y>? yes

Free inodes count wrong for group #62 (16076, counted=16079).
Fix<y>? yes

Free inodes count wrong (1014210, counted=1014221).
Fix<y>? yes


/dev/vgeva2/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgeva2/lvol0: 22067/1036288 files (0.6% non-contiguous), 741644/2072576 blocks
[root@RHHelios1 root]# 
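
The counter corrections in this output can be cross-checked mechanically. The sketch below parses the "Free blocks/inodes count wrong" lines copied from the output above and compares the sum of the per-group corrections with the filesystem-wide correction. The regular expressions are written against this e2fsck 1.32 output only, and the function name summarize is made up for illustration.

```python
import re

GROUP_LINE = re.compile(
    r"Free (?P<kind>blocks|inodes) count wrong for group #(?P<group>\d+) "
    r"\((?P<recorded>\d+), counted=(?P<counted>\d+)\)"
)
TOTAL_LINE = re.compile(
    r"Free (?P<kind>blocks|inodes) count wrong "
    r"\((?P<recorded>\d+), counted=(?P<counted>\d+)\)"
)


def summarize(output):
    """Return (per-group corrections, filesystem-wide corrections) by kind."""
    per_group = {"blocks": {}, "inodes": {}}
    totals = {}
    for line in output.splitlines():
        match = GROUP_LINE.search(line)
        if match:
            delta = int(match["counted"]) - int(match["recorded"])
            per_group[match["kind"]][int(match["group"])] = delta
            continue
        match = TOTAL_LINE.search(line)
        if match:
            totals[match["kind"]] = int(match["counted"]) - int(match["recorded"])
    return per_group, totals


# Counter lines copied from the e2fsck run above.
fsck_output = """\
Free blocks count wrong for group #3 (32115, counted=32117).
Free blocks count wrong for group #32 (20266, counted=20267).
Free blocks count wrong for group #46 (27682, counted=27683).
Free blocks count wrong for group #60 (27484, counted=27485).
Free blocks count wrong for group #62 (32201, counted=32203).
Free blocks count wrong (1330925, counted=1330932).
Free inodes count wrong for group #3 (16007, counted=16011).
Free inodes count wrong for group #31 (16048, counted=16049).
Free inodes count wrong for group #46 (16128, counted=16129).
Free inodes count wrong for group #60 (16003, counted=16005).
Free inodes count wrong for group #62 (16076, counted=16079).
Free inodes count wrong (1014210, counted=1014221).
"""

per_group, totals = summarize(fsck_output)
for kind in ("blocks", "inodes"):
    group_sum = sum(per_group[kind].values())
    print(f"{kind}: per-group corrections sum to {group_sum}, total correction {totals[kind]}")
```

For this run the per-group corrections add up to 7 blocks and 11 inodes, which matches both the filesystem-wide corrections and the number of entries in the block and inode bitmap difference lists. In other words, the counters were simply stale for the blocks and inodes released during orphan cleanup; the counter lines show no additional inconsistency.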


Comment 8 Christian Schnuerer 2005-07-01 11:05:14 UTC
Created attachment 116234 [details]
Screenshot of Kernel Panic

Comment 9 Christian Schnuerer 2005-07-01 11:06:03 UTC
Created attachment 116235 [details]
ksyms

Comment 10 Christian Schnuerer 2005-07-01 11:08:55 UTC
Another crash on the second cluster node today, but without filesystem 
corruption. Maybe the same cause?
Please see the attachments.

Comment 11 Stephen Tweedie 2005-07-01 22:42:36 UTC
That last oops has nothing to do with the filesystem, indeed.  There's not a lot
I can do to help you right now, I suspect --- you're really going to need to try
to capture a crash dump for our support services to investigate.  The fact that
you're seeing some of the problems associated with a specific LVM volume makes
me think that there may be problems there, and we don't ship LVM with AS-2.1.

Can you reproduce this without using lvm?


Comment 12 Christian Schnuerer 2005-07-02 14:43:38 UTC
As I wrote on 06.06, we are running AS3 (Update 5, kernel 2.4.21-32.0.1.ELsmp) 
now!!!

Comment 13 Christian Schnuerer 2005-09-12 07:48:32 UTC
I have disabled the audit daemon - no crash for 10 weeks!?

Comment 14 Jim Paradis 2005-09-12 19:53:39 UTC
I cloned off Bug 168138 to track the RHEL3 manifestation of this issue.  This
bug will track the RHEL2.1 manifestation.


Comment 15 Jim Paradis 2005-09-12 19:58:39 UTC
RHEL2.1 is in maintenance support mode. This feature/bug is not something we
would consider fixing in RHEL2.1 at this time due to the technical impact of the
change. If you feel this issue should be addressed in RHEL2.1, please raise the
issue to Red Hat Product Management.