+++ This bug was initially created as a clone of Bug #159133 +++

Bug 159133 was initially filed against RHEL2.1. Customer updated to RHEL3 and the problem still persists. This bug was cloned off to track the RHEL3 version of this issue.

Description of problem:

Serious problems with file-system corruption on two fully patched Dell PE6650 systems running in a cluster (clumanager-1.0.28-1). This problem has existed for some months and seems to happen ever more frequently:

May 29 15:52:51 machine1 bestlaser1[6483]: Cannot initiate PPAPRead: Socket closed
May 29 15:52:51 machine1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

After the message appears the system crashes and we have to power off/on and fsck the system. We know LVM is not really supported in an AS2.1 cluster, but we use a simple configuration without stripes (just one VG per LUN). There are many filesystems in this cluster totalling about 600 GB, and the error occurs only in a small (8 GB) filesystem containing some spool directories of our Helios-NG installation (www.helios.de), and it always arises at the same time as the crash of a print job. Also, the block is always 255!

Two weeks ago we migrated the servers into a completely new SAN with new FC switches and new storage (HP EVA5000) - the problem still exists!

lsmod:
Module                  Size  Used by    Tainted: P
st                     30612   0
lp                      8032   0  (autoclean)
parport                37696   0  (autoclean) [lp]
sr_mod                 17176   0  (autoclean) (unused)
ide-cd                 35168   0  (autoclean)
cdrom                  35392   0  (autoclean) [sr_mod ide-cd]
esm                    81297   1
autofs                 13188   1  (autoclean)
3c59x                  32072   1
bcm5700                71908   1
loop                   11728   0  (autoclean)
usb-ohci               23328   0  (unused)
usbcore                68160   1  [usb-ohci]
ext3                   69952   6
jbd                    54804   6  [ext3]
sg                     35012   0
qla2300               608576   3
qla2300_conf          301344   0
megaraid               28576   7
sd_mod                 13888   7
scsi_mod              125916   6  [st sr_mod sg qla2300 megaraid sd_mod]

Version-Release number of selected component (if applicable):
kernel-smp-2.4.9-e.62

How reproducible:
not reproducible

Steps to Reproduce:
1.
2.
3.
Actual results:

Expected results:

Additional info:
Dell PE6650 with 4 Xeon 1.5 GHz CPUs, 4 GB RAM, hyperthreading enabled; 2 Gb SAN environment; each server has two QLA2340 HBAs with failover enabled (BIOS 1.43, driver 7.05.00-fo); central storage HP EVA5000

-- Additional comment from c.schnuerer on 2005-05-30 10:21 EST --

Created an attachment (id=114973)
last dmesg output

-- Additional comment from sct on 2005-06-06 06:52 EST --

Have you run a full forced fsck on the filesystem in question? If so, what were the results?

-- Additional comment from c.schnuerer on 2005-06-06 10:08 EST --

Of course! Each time some errors were repaired. Unfortunately there is no output of it any more. We reinstalled the servers with AS3 last Friday! Let's see what happens.

-- Additional comment from sct on 2005-06-06 13:03 EST --

OK; note that the "script" binary (from the util-linux rpm) is often helpful in capturing e2fsck output.

-- Additional comment from c.schnuerer on 2005-06-20 08:59 EST --

It happened again!

Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPWrite failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: PPAPRead failed: Request aborted
Jun 20 14:19:20 rhhelios1 anz1og[28737]: Cannot initiate PPAPRead: Socket closed
Jun 20 14:19:20 rhhelios1 kernel: EXT3-fs error (device lvm(58,0)): ext3_free_blocks: Freeing block in system zone - block = 255

This is a completely new installation of Red Hat AS3; we use the standard kernel as shipped by Red Hat, with all updates installed!
Release: Red Hat Enterprise Linux AS release 3 (Taroon Update 5)
Kernel: 2.4.21-32.0.1.ELsmp

Could that be an application bug of Helios?

-- Additional comment from sct on 2005-06-20 16:46 EST --

It's generally unlikely to be due to an application. I still need "e2fsck" output showing the nature of the corruption to be able to continue.
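[Editor's note: the "script" approach suggested above can be sketched as follows; the /dev/vgeva2/lvol0 path is taken from the report later in this thread, and the log filename is illustrative. This is a hedged example, not a command from the original thread.]

```shell
# Capture a full e2fsck transcript with script(1) from util-linux.
# The real invocation would target the (unmounted) LVM volume, e.g.:
#   script -c "e2fsck -f -y /dev/vgeva2/lvol0" /root/e2fsck.log
# The capture mechanism itself, demonstrated on a harmless command:
script -q -c "echo filesystem-check-output" /tmp/typescript.log
```

Afterwards /tmp/typescript.log contains everything the wrapped command printed, which is exactly what is needed to preserve e2fsck output across a repair run.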
The fact that it's always the same block being returned is suspicious and makes me wonder if there's a bad sector on the disk at that location, or an LVM problem; but without details of the corruption it's really just guesswork. Are there _any_ other filesystem or storage-related errors in the logs?

-- Additional comment from c.schnuerer on 2005-06-21 06:01 EST --

A bad sector on disk as the cause is impossible, because we have central storage (SAN) and the problem moved from one storage box to another - even to a new SAN (STK D173 -> HP EVA5000). The problem exists on both cluster nodes, so a hardware problem is very unlikely. The error only occurs in the same filesystem - always the volume where the spool directories reside. Here is the output of the last fsck:

[root@RHHelios1 root]# e2fsck /dev/vgeva2/lvol0
e2fsck 1.32 (09-Nov-2002)
/dev/vgeva2/lvol0 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inodes that were part of a corrupted orphan linked list found.  Fix<y>? yes
Inode 48581 was part of the orphaned inode list.  FIXED.
Inode 48600 was part of the orphaned inode list.  FIXED.
Inode 48643 was part of the orphaned inode list.  FIXED.
Inode 48650 was part of the orphaned inode list.  FIXED.
Inode 502045 was part of the orphaned inode list.  FIXED.
Deleted inode 744848 has zero dtime.  Fix<y>? yes
Inode 971547 was part of the orphaned inode list.  FIXED.
Inode 971698 was part of the orphaned inode list.  FIXED.
Inode 1003921 was part of the orphaned inode list.  FIXED.
Inode 1003924 was part of the orphaned inode list.  FIXED.
Inode 1003947 was part of the orphaned inode list.  FIXED.
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences:  -98919 -99029 -1059219 -1507884 -1971399 -2032180 -2032546  Fix<y>? yes
Free blocks count wrong for group #3 (32115, counted=32117).  Fix<y>?
yes
Free blocks count wrong for group #32 (20266, counted=20267).  Fix<y>? yes
Free blocks count wrong for group #46 (27682, counted=27683).  Fix<y>? yes
Free blocks count wrong for group #60 (27484, counted=27485).  Fix<y>? yes
Free blocks count wrong for group #62 (32201, counted=32203).  Fix<y>? yes
Free blocks count wrong (1330925, counted=1330932).  Fix<y>? yes
Inode bitmap differences:  -48581 -48600 -48643 -48650 -502045 -744848 -971547 -971698 -1003921 -1003924 -1003947  Fix<y>? yes
Free inodes count wrong for group #3 (16007, counted=16011).  Fix<y>? yes
Free inodes count wrong for group #31 (16048, counted=16049).  Fix<y>? yes
Free inodes count wrong for group #46 (16128, counted=16129).  Fix<y>? yes
Free inodes count wrong for group #60 (16003, counted=16005).  Fix<y>? yes
Free inodes count wrong for group #62 (16076, counted=16079).  Fix<y>? yes
Free inodes count wrong (1014210, counted=1014221).  Fix<y>? yes

/dev/vgeva2/lvol0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/vgeva2/lvol0: 22067/1036288 files (0.6% non-contiguous), 741644/2072576 blocks
[root@RHHelios1 root]#

-- Additional comment from c.schnuerer on 2005-07-01 07:05 EST --

Created an attachment (id=116234)
Screenshot of kernel panic

-- Additional comment from c.schnuerer on 2005-07-01 07:06 EST --

Created an attachment (id=116235)
ksyms

-- Additional comment from c.schnuerer on 2005-07-01 07:08 EST --

Another crash on the second cluster node today, but without filesystem corruption. Maybe the same cause? Please see the attachments.

-- Additional comment from sct on 2005-07-01 18:42 EST --

That last oops has nothing to do with the filesystem, indeed. There's not a lot I can do to help you right now, I suspect --- you're really going to need to try to capture a crash dump for our support services to investigate. The fact that you're seeing some of the problems associated with a specific LVM volume makes me think that there may be problems there, and we don't ship LVM with AS-2.1.
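[Editor's note: the "same block 255 every time" observation discussed above can be probed directly with the e2fsprogs tools; the device path is taken from the fsck transcript in this thread, and the -b 4096 block size is an assumption. A minimal sketch, demonstrated on a throwaway image so nothing touches a real volume:]

```shell
# On the real (unmounted) volume, one could ask which inode, if any,
# owns block 255, and read-test just that block:
#   debugfs -R "icheck 255" /dev/vgeva2/lvol0
#   badblocks -sv -b 4096 /dev/vgeva2/lvol0 255 255
# The same icheck query, demonstrated on a small scratch ext2 image:
dd if=/dev/zero of=/tmp/test.img bs=1M count=8 2>/dev/null
mke2fs -q -F /tmp/test.img
debugfs -R "icheck 255" /tmp/test.img 2>/dev/null
```

If icheck maps block 255 to a metadata region (as the "system zone" error suggests), the corruption is in an allocation structure rather than in file data, which would fit the bitmap differences e2fsck reported.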
Can you reproduce this without using LVM?

-- Additional comment from c.schnuerer on 2005-07-02 10:43 EST --

As I wrote on 06.06, it has been AS3 (Update 5, kernel 2.4.21-32.0.1.ELsmp) in the meantime!!!

-- Additional comment from c.schnuerer on 2005-09-12 03:48 EST --

I have disabled the audit daemon - no crash for 10 weeks!?
It is hard to imagine how audit could be the cause of the underlying problem; audit was not present in AS-2.1. But if the problem recurs, please open an official support ticket. This bug has come to include too many components --- LVM, audit, data warnings from ext3, and panics entirely outside ext3 --- to be conclusively escalated to any single engineering component right now. There's not enough information to identify the problem, so there's nothing concrete for engineering to fix. Support channels are better placed to narrow down the possibilities if it happens again.