Description of Problem: 4CPU mail server, 1GB RAM, approx 100GB of storage on /var/spool/mail with quotas enabled (ext3 fs). Machine gets very heavy pop3 & Sendmail traffic. This crash has been happening to them about once every 2 weeks, but it just happened twice today. They experience quota corruption over time and have to recreate all the quotas every so often. (not sure if it is causing the crashes or is a symptom) Version-Release number of selected component (if applicable): 7.3 w/ relevant errata, attached messages from 2.4.18-4. Have updated to 2.4.18-5, however it is a bit too early to tell if that fixes the problem. How Reproducible: Appears to be a result of heavy access to an ext3 filesystem with quotas. Very heavy pop traffic noted before the crash. System is a mail server in production with approximately 25,000 users. System will run from anywhere between 24 hours and 2 weeks before crashing. /var/spool/mail is approximately 100GB and is the only filesystem using quotas. Steps to Reproduce: 1. see above 2. 3. Actual Results: System crashes, quotas appear to get corrupted which take quite a long while to correct with 25000 users on the system. Expected Results: System should not crash :) Additional Information: lspci -mv Device: 00:08.0 Class: SCSI storage controller Vendor: LSI Logic / Symbios Logic (formerly NCR) Device: 53c810 SVendor: LSI Logic / Symbios Logic (formerly NCR) SDevice: 8100S Rev: 23 Device: 00:09.0 Class: Ethernet controller Vendor: Intel Corp. Device: 82557/8/9 [Ethernet Pro 100] SVendor: Intel Corp. SDevice: EtherExpress PRO/100+ Management Adapter Rev: 08 Device: 00:0a.0 Class: VGA compatible controller Vendor: Cirrus Logic Device: GD 5480 SVendor: 1013 SDevice: 00bc Rev: 23 Device: 00:0b.0 Class: PIC Vendor: Intel Corp. Device: 683053 Programmable Interrupt Device ProgIf: 03 Device: 00:0c.0 Class: ISA bridge Vendor: Intel Corp. Device: 82371AB/EB/MB PIIX4 ISA Rev: 02 Device: 00:0c.1 Class: IDE interface Vendor: Intel Corp. Device: 82371AB/EB/MB PIIX4 IDE Rev: 01 ProgIf: 80 Device: 00:0c.2 Class: USB Controller Vendor: Intel Corp. Device: 82371AB/EB/MB PIIX4 USB Rev: 01 Device: 00:0c.3 Class: Bridge Vendor: Intel Corp. Device: 82371AB/EB/MB PIIX4 ACPI Rev: 02 Device: 00:10.0 Class: Host bridge Vendor: Intel Corp. Device: 450NX - 82451NX Memory & I/O Controller Rev: 03 Device: 00:12.0 Class: Host bridge Vendor: Intel Corp. Device: 450NX - 82454NX/84460GX PCI Expander Bridge Rev: 02 Device: 00:13.0 Class: Host bridge Vendor: Intel Corp. Device: 450NX - 82454NX/84460GX PCI Expander Bridge Rev: 02 Device: 01:03.0 Class: SCSI storage controller Vendor: LSI Logic / Symbios Logic (formerly NCR) Device: 53c896 SVendor: LSI Logic / Symbios Logic (formerly NCR) SDevice: 000b Rev: 01 Device: 01:03.1 Class: SCSI storage controller Vendor: LSI Logic / Symbios Logic (formerly NCR) Device: 53c896 SVendor: LSI Logic / Symbios Logic (formerly NCR) SDevice: 000b Rev: 01 Device: 01:04.0 Class: SCSI storage controller Vendor: Adaptec Device: AHA-2940U2/U2W SVendor: Adaptec SDevice: AHA-2940U2W SCSI Controller lsmod: lsmod: Module Size Used by Not tainted eepro100 20816 1 usb-uhci 25604 0 (unused) usbcore 77024 1 [usb-uhci] ext3 70752 7 jbd 53664 7 [ext3] raid5 20736 0 (unused) xor 7536 0 [raid5] aic7xxx 125440 3 sym53c8xx 63236 6 sd_mod 12896 18 scsi_mod 112272 3 [aic7xxx sym53c8xx sd_mod] Log messages: after reboot.. probably due to corrupted quotas: Jul 3 00:36:00 mail3 kernel: VFS: Diskquotas version dquot_6.5.0 initialized Jul 3 00:36:02 mail3 kernel: VFS: Mounted root (ext2 filesystem). Jul 3 16:49:20 mail3 kernel: VFS: Diskquotas version dquot_6.5.0 initialized Jul 3 16:49:52 mail3 kernel: VFS: Mounted root (ext2 filesystem). Jul 3 16:54:51 mail3 kernel: VFS: find_free_dqentry(): Data block full but it shouldn't. Jul 3 16:54:51 mail3 kernel: VFS: Error -5 occured while creating quota. Jul 3 17:01:09 mail3 kernel: VFS: Quota for id 24872 referenced but not present. Jul 3 17:01:09 mail3 kernel: VFS: Can't read quota structure for id 24872. Jul 3 17:14:58 mail3 kernel: VFS: Quota for id 24902 referenced but not present. Jul 3 17:14:58 mail3 kernel: VFS: Can't read quota structure for id 24902. and now for the assertion failure message. Jul 3 00:18:03 mail3 kernel: Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))" Jul 3 00:18:03 mail3 kernel: ------------[ cut here ]------------ Jul 3 00:18:03 mail3 kernel: kernel BUG at journal.c:406! Jul 3 00:18:03 mail3 kernel: invalid operand: 0000 Jul 3 00:18:03 mail3 kernel: eepro100 usb-uhci usbcore ext3 jbd raid5 xor aic7xxx sym53c8xx sd_mod scsi_mod Jul 3 00:18:03 mail3 kernel: CPU: 2 Jul 3 00:18:03 mail3 kernel: EIP: 0010:[] Not tainted Jul 3 00:18:03 mail3 kernel: EFLAGS: 00010282 Jul 3 00:18:03 mail3 kernel: Jul 3 00:18:03 mail3 kernel: EIP is at journal_write_metadata_buffer [jbd] 0x74 (2.4.18-4smp) Jul 3 00:18:03 mail3 kernel: eax: 0000001d ebx: 00000000 ecx: c02eeec0 edx: 00004f0c Jul 3 00:18:03 mail3 kernel: esi: 00000000 edi: e3929840 ebp: f6a99220 esp: f6a51e44 Jul 3 00:18:03 mail3 kernel: ds: 0018 es: 0018 ss: 0018 Jul 3 00:18:03 mail3 kernel: Process kjournald (pid: 198, stackpage=f6a51000) Jul 3 00:18:03 mail3 kernel: Stack: f8882181 00000196 00001845 f777e600 00000000 00000000 d582e7f0 00000000 Jul 3 00:18:03 mail3 kernel: e3929840 f6a99220 f887adf4 e3929840 d582e7f0 f6a51e98 00001a56 f775e6b8 Jul 3 00:18:03 mail3 kernel: 00000000 00000fcc d574d034 00000004 e3929840 d57a2c70 00001a56 00000001 Jul 3 00:18:03 mail3 kernel: Call Trace: [] .rodata.str1.1 [jbd] 0x4e1 Jul 3 00:18:04 mail3 kernel: [] journal_commit_transaction [jbd] 0x7e4 Jul 3 00:18:04 mail3 kernel: [] rw_intr [sd_mod] 0x20f Jul 3 00:18:04 mail3 kernel: [] schedule [kernel] 0x348 Jul 3 00:18:04 mail3 kernel: [] kjournald [jbd] 0x136 Jul 3 00:18:04 mail3 kernel: [] commit_timeout [jbd] 0x0 Jul 3 00:18:04 mail3 kernel: [] kernel_thread [kernel] 0x26 Jul 3 00:18:04 mail3 kernel: [] kjournald [jbd] 0x0 Jul 3 00:18:04 mail3 kernel: Jul 3 00:18:04 mail3 kernel: Jul 3 00:18:04 mail3 kernel: Code: 0f 0b 5e 5f 8b 7c 24 28 8b 4f 0c 85 c9 74 2e c7 44 24 0c 01
I have found one possible cause of this problem, and have a provisional fix awaiting internal testing. I'm fairly sure that the bug has nothing to do with the quota problems being reported: it might be worth opening those in a separate bug report, since the quota system is notionally independent of ext3 and the quota corruption looks like a completely different problem.6. I have more logs which include several more crash messages as well as the boot-up messages if anyone would like to see them.
There is a new kernel, 2.4.18-7, available for testing on http://people.redhat.com/arjanv/testkernels/ I hope that this will fix the problem, and the patch will be released in the next official errata. Please feel free to try this kernel.
*** Bug 66663 has been marked as a duplicate of this bug. ***
*** Bug 70118 has been marked as a duplicate of this bug. ***