Red Hat Bugzilla – Bug 159133
EXT3-fs error Freeing block in system zone
Last modified: 2013-08-05 21:14:04 EDT
Description of problem:
Serious problems with file-system corruption on two fully patched Dell PE6650
systems running in a cluster (clumanager-1.0.28-1).
This problem has existed for some months and seems to happen ever more frequently:
May 29 15:52:51 machine1 bestlaser1: Cannot initiate PPAPRead: Socket
May 29 15:52:51 machine1 kernel: EXT3-fs error (device lvm(58,0)):
ext3_free_blocks: Freeing block in system zone - block = 255
After the message appears, the system crashes and we have to power-cycle and
fsck the system.
We know LVM is not really supported in an AS2.1 cluster, but we use a simple
configuration without striping (just one VG per LUN).
There are many filesystems in this cluster, about 600 GB in total, and the error
occurs only in a small (8 GB) FS containing some spool dirs of our Helios-NG-
and it always arises at the same time as the crash of a print job.
Also, the block is always 255!
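For reference, the "system zone" the kernel complains about is the per-group
filesystem metadata (block/inode bitmaps and inode tables), and dumpe2fs can
show exactly which blocks that covers. A minimal sketch on a scratch image
(the image path is arbitrary; on the real system you would point dumpe2fs at
the affected device, e.g. /dev/vgeva2/lvol0):

```shell
# Create a small scratch ext2 image and list its group layout.
dd if=/dev/zero of=/tmp/scratch.img bs=1024 count=1024 2>/dev/null
mke2fs -q -F /tmp/scratch.img
# The "Block bitmap at ..." / "Inode table at ..." lines show which
# blocks belong to the system zone of each block group.
dumpe2fs /tmp/scratch.img | grep -E 'bitmap at|Inode table at'
```

If block 255 falls inside one of those ranges on the affected volume, any
attempt to free it will trigger exactly this EXT3-fs error.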
Two weeks ago we migrated the servers into a complete new SAN with new FC-
Switches and new storage (HP EVA5000) - the problem still exists!
Module Size Used by Tainted: P
st 30612 0
lp 8032 0 (autoclean)
parport 37696 0 (autoclean) [lp]
sr_mod 17176 0 (autoclean) (unused)
ide-cd 35168 0 (autoclean)
cdrom 35392 0 (autoclean) [sr_mod ide-cd]
esm 81297 1
autofs 13188 1 (autoclean)
3c59x 32072 1
bcm5700 71908 1
loop 11728 0 (autoclean)
usb-ohci 23328 0 (unused)
usbcore 68160 1 [usb-ohci]
ext3 69952 6
jbd 54804 6 [ext3]
sg 35012 0
qla2300 608576 3
qla2300_conf 301344 0
megaraid 28576 7
sd_mod 13888 7
scsi_mod 125916 6 [st sr_mod sg qla2300 megaraid sd_mod]
Version-Release number of selected component (if applicable):
Steps to Reproduce:
DELL PE 6650 with 4 Xeon 1.5 GHz CPUs, 4 GB RAM, hyperthreading enabled,
2Gb SAN-Environment, each server has two QLA2340 HBAs with failover enabled
(bios 1.43, driver 7.05.00-fo),
central storage HP EVA5000
Created attachment 114973 [details]
Have you run a full forced fsck on the filesystem in question? If so, what were
the results?
Each time some errors were repaired. Unfortunately there is no more output of
the fsck available.
We reinstalled the servers with AS3 last Friday!
Let's see what happens.
OK; note that the "script" binary (from the util-linux rpm) is often helpful in
capturing e2fsck output.
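A minimal sketch of that capture, assuming the affected volume is unmounted
and the log path is arbitrary:

```shell
# Record a full forced e2fsck run: -f forces a check even if the
# filesystem is marked clean, and "script" (from util-linux) saves
# everything printed to the terminal into fsck.log.
script -c "e2fsck -f /dev/vgeva2/lvol0" /root/fsck.log
```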
It happened again!
Jun 20 14:19:20 rhhelios1 anz1og: PPAPWrite failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og: PPAPRead failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og: Cannot initiate PPAPRead: Socket closed
Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)):
ext3_free_blocks: Freeing block in system zone - block = 255
Completely new installation of Red Hat AS3; we use the standard kernel as
shipped by Red Hat, with all updates installed!
Release: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel: 2.4.21-32.0.1.ELsmp
Could this be an application bug in Helios?
It's generally unlikely to be due to an application. I still need "e2fsck"
output showing the nature of the corruption to be able to continue. The fact
that it's always the same block being returned is suspicious and makes me wonder
if there's a bad sector on disk at that location or an lvm problem, but really
without details of the problem it's just guesswork. Are there _any_ other
filesystem or storage-related errors in the logs?
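One quick way to scan for such errors, assuming the standard RHEL syslog
location (the pattern list is just an illustrative starting point):

```shell
# Scan the system log for storage-related messages: SCSI errors,
# I/O errors, EXT3 errors and LVM messages.
grep -iE 'scsi|i/o error|ext3-fs|lvm' /var/log/messages
```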
A bad sector on disk can be ruled out as the cause, because we have central
storage (SAN) and the problem moved from one storage box to another - even to a
new SAN (STK D173 -> HP EVA5000). The problem exists on both cluster nodes.
Therefore a hardware problem is very unlikely.
The error only occurs in the same filesystem - always the volume holding the
Helios spool dirs.
Here the output of the last fsck:
[root@RHHelios1 root]# e2fsck /dev/vgeva2/lvol0
e2fsck 1.32 (09-Nov-2002)
/dev/vgeva2/lvol0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found. Fix<y>? yes
Inode 48581 was part of the orphaned inode list. FIXED.
Inode 48600 was part of the orphaned inode list. FIXED.
Inode 48643 was part of the orphaned inode list. FIXED.
Inode 48650 was part of the orphaned inode list. FIXED.
Inode 502045 was part of the orphaned inode list. FIXED.
Deleted inode 744848 has zero dtime. Fix<y>? yes
Inode 971547 was part of the orphaned inode list. FIXED.
Inode 971698 was part of the orphaned inode list. FIXED.
Inode 1003921 was part of the orphaned inode list. FIXED.
Inode 1003924 was part of the orphaned inode list. FIXED.
Inode 1003947 was part of the orphaned inode list. FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -98919 -99029 -1059219 -1507884 -1971399 -2032180 -
Free blocks count wrong for group #3 (32115, counted=32117).
Free blocks count wrong for group #32 (20266, counted=20267).
Free blocks count wrong for group #46 (27682, counted=27683).
Free blocks count wrong for group #60 (27484, counted=27485).
Free blocks count wrong for group #62 (32201, counted=32203).
Free blocks count wrong (1330925, counted=1330932).
Inode bitmap differences: -48581 -48600 -48643 -48650 -502045 -744848 -971547 -
971698 -1003921 -1003924 -1003947
Free inodes count wrong for group #3 (16007, counted=16011).
Free inodes count wrong for group #31 (16048, counted=16049).
Free inodes count wrong for group #46 (16128, counted=16129).
Free inodes count wrong for group #60 (16003, counted=16005).
Free inodes count wrong for group #62 (16076, counted=16079).
Free inodes count wrong (1014210, counted=1014221).
/dev/vgeva2/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgeva2/lvol0: 22067/1036288 files (0.6% non-contiguous), 741644/2072576 blocks
Created attachment 116234 [details]
Screenshot of Kernel Panic
Created attachment 116235 [details]
Another crash on the second cluster node today, but without fs-corruption.
Maybe the same cause?
Please see the attachments.
That last oops has nothing to do with the filesystem, indeed. There's not a lot
I can do to help you right now, I suspect --- you're really going to need to try
to capture a crash dump for our support services to investigate. The fact that
you're seeing some of the problems associated with a specific LVM volume makes
me think that there may be problems there, and we don't ship LVM with AS-2.1.
Can you reproduce this without using lvm?
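One way to test that, sketched here with a hypothetical partition name
(/dev/sdb1 standing in for the SAN LUN) and a hypothetical mount point:

```shell
# Sketch: create an ext3 filesystem directly on the LUN's partition,
# bypassing LVM, and mount it where the spool dirs live, to see
# whether the corruption still occurs without LVM in the stack.
mke2fs -j /dev/sdb1          # ext3 = ext2 plus a journal (-j)
mount -t ext3 /dev/sdb1 /mnt/spool
```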
As I wrote on 06.06, it's AS3 (Update 5, kernel 2.4.21-32.0.1.ELsmp) in the
I have disabled the audit daemon - no crash for 10 weeks!?
I cloned off Bug 168138 to track the RHEL3 manifestation of this issue. This
bug will track the RHEL2.1 manifestation.
RHEL2.1 is in maintenance support mode. This feature/bug is not something we
would consider fixing in RHEL2.1 at this time due to the technical impact of the
change. If you feel this issue should be addressed in RHEL2.1, please raise the
issue to Red Hat Product Management.