Bug 68026 - Frequent systems crashes with 2.4.18-4smp quotas and ext3
Status: CLOSED ERRATA
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware: i386 Linux
Priority: medium
Severity: medium
Assigned To: Stephen Tweedie
Brian Brock
Duplicates: 66663 70118
Reported: 2002-07-05 10:05 EDT by Mike Gahagan
Modified: 2007-04-18 12:43 EDT

Doc Type: Bug Fix
Last Closed: 2002-07-31 10:52:42 EDT


Description Mike Gahagan 2002-07-05 10:05:16 EDT
Description of Problem:

4-CPU mail server, 1 GB RAM, approx. 100 GB of storage on /var/spool/mail with
quotas enabled (ext3 fs). The machine gets very heavy POP3 and Sendmail traffic.
The crash has been happening about once every 2 weeks, but it just happened
twice today. They also experience quota corruption over time and have to
recreate all the quotas every so often (not sure whether this is causing the
crashes or is a symptom).


Version-Release number of selected component (if applicable):

7.3 w/ relevant errata, attached messages from 2.4.18-4. Have updated to
2.4.18-5, however it is a bit too early to tell if that fixes the problem. 

How Reproducible:

Appears to be a result of heavy access to an ext3 filesystem with quotas. Very
heavy pop traffic noted before the crash. System is a mail server in production
with approximately 25,000 users. System will run from anywhere between 24 hours
and 2 weeks before crashing. /var/spool/mail is approximately 100GB and is the
only filesystem using quotas.

Steps to Reproduce:
1. See above.

Actual Results:

The system crashes and the quotas appear to get corrupted, which takes quite a
long while to correct with 25,000 users on the system.

Expected Results:

System should not crash :)

Additional Information:
	
lspci -mv

Device: 00:08.0
Class:  SCSI storage controller
Vendor: LSI Logic / Symbios Logic (formerly NCR)
Device: 53c810
SVendor:        LSI Logic / Symbios Logic (formerly NCR)
SDevice:        8100S
Rev:    23

Device: 00:09.0
Class:  Ethernet controller
Vendor: Intel Corp.
Device: 82557/8/9 [Ethernet Pro 100]
SVendor:        Intel Corp.
SDevice:        EtherExpress PRO/100+ Management Adapter
Rev:    08

Device: 00:0a.0
Class:  VGA compatible controller
Vendor: Cirrus Logic
Device: GD 5480
SVendor:        1013
SDevice:        00bc
Rev:    23

Device: 00:0b.0
Class:  PIC
Vendor: Intel Corp.
Device: 683053 Programmable Interrupt Device
ProgIf: 03

Device: 00:0c.0
Class:  ISA bridge
Vendor: Intel Corp.
Device: 82371AB/EB/MB PIIX4 ISA
Rev:    02

Device: 00:0c.1
Class:  IDE interface
Vendor: Intel Corp.
Device: 82371AB/EB/MB PIIX4 IDE
Rev:    01
ProgIf: 80

Device: 00:0c.2
Class:  USB Controller
Vendor: Intel Corp.
Device: 82371AB/EB/MB PIIX4 USB
Rev:    01

Device: 00:0c.3
Class:  Bridge
Vendor: Intel Corp.
Device: 82371AB/EB/MB PIIX4 ACPI
Rev:    02

Device: 00:10.0
Class:  Host bridge
Vendor: Intel Corp.
Device: 450NX - 82451NX Memory & I/O Controller
Rev:    03

Device: 00:12.0
Class:  Host bridge
Vendor: Intel Corp.
Device: 450NX - 82454NX/84460GX PCI Expander Bridge
Rev:    02

Device: 00:13.0
Class:  Host bridge
Vendor: Intel Corp.
Device: 450NX - 82454NX/84460GX PCI Expander Bridge
Rev:    02

Device: 01:03.0
Class:  SCSI storage controller
Vendor: LSI Logic / Symbios Logic (formerly NCR)
Device: 53c896
SVendor:        LSI Logic / Symbios Logic (formerly NCR)
SDevice:        000b
Rev:    01

Device: 01:03.1
Class:  SCSI storage controller
Vendor: LSI Logic / Symbios Logic (formerly NCR)
Device: 53c896
SVendor:        LSI Logic / Symbios Logic (formerly NCR)
SDevice:        000b
Rev:    01

Device: 01:04.0
Class:  SCSI storage controller
Vendor: Adaptec
Device: AHA-2940U2/U2W
SVendor:        Adaptec
SDevice:        AHA-2940U2W SCSI Controller

lsmod:

Module                  Size  Used by    Not tainted
eepro100               20816   1
usb-uhci               25604   0  (unused)
usbcore                77024   1  [usb-uhci]
ext3                   70752   7
jbd                    53664   7  [ext3]
raid5                  20736   0  (unused)
xor                     7536   0  [raid5]
aic7xxx               125440   3
sym53c8xx              63236   6
sd_mod                 12896  18
scsi_mod              112272   3  [aic7xxx sym53c8xx sd_mod]

Log messages:

After reboot, probably due to corrupted quotas:

Jul  3 00:36:00 mail3 kernel: VFS: Diskquotas version dquot_6.5.0 initialized
Jul  3 00:36:02 mail3 kernel: VFS: Mounted root (ext2 filesystem).
Jul  3 16:49:20 mail3 kernel: VFS: Diskquotas version dquot_6.5.0 initialized
Jul  3 16:49:52 mail3 kernel: VFS: Mounted root (ext2 filesystem).
Jul  3 16:54:51 mail3 kernel: VFS: find_free_dqentry(): Data block full but it shouldn't.
Jul  3 16:54:51 mail3 kernel: VFS: Error -5 occured while creating quota.
Jul  3 17:01:09 mail3 kernel: VFS: Quota for id 24872 referenced but not present.
Jul  3 17:01:09 mail3 kernel: VFS: Can't read quota structure for id 24872.
Jul  3 17:14:58 mail3 kernel: VFS: Quota for id 24902 referenced but not present.
Jul  3 17:14:58 mail3 kernel: VFS: Can't read quota structure for id 24902.
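
When triaging quota corruption like this, it can help to pull the affected ids out of the syslog before rebuilding the quota files. The following is only a minimal sketch: the sample text is copied from the messages above (with wrapped lines rejoined), and the function name `missing_quota_ids` is a hypothetical helper, not part of any quota tool.

```python
import re

# Sample lines copied from this report, with the wrapped lines rejoined.
LOG = """\
Jul  3 16:54:51 mail3 kernel: VFS: find_free_dqentry(): Data block full but it shouldn't.
Jul  3 16:54:51 mail3 kernel: VFS: Error -5 occured while creating quota.
Jul  3 17:01:09 mail3 kernel: VFS: Quota for id 24872 referenced but not present.
Jul  3 17:01:09 mail3 kernel: VFS: Can't read quota structure for id 24872.
Jul  3 17:14:58 mail3 kernel: VFS: Quota for id 24902 referenced but not present.
Jul  3 17:14:58 mail3 kernel: VFS: Can't read quota structure for id 24902.
"""

# Match only the "referenced but not present" message, so each failure is
# picked up once even though the kernel logs two lines per id.
MISSING_QUOTA = re.compile(r"VFS: Quota for id (\d+) referenced but not present")


def missing_quota_ids(log_text):
    """Return the sorted, de-duplicated ids whose quota entry is missing."""
    return sorted({int(m.group(1)) for m in MISSING_QUOTA.finditer(log_text)})


if __name__ == "__main__":
    print(missing_quota_ids(LOG))
```

On a real system the same function could be fed the contents of the syslog file instead of the embedded sample.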

And now for the assertion failure message:

Jul  3 00:18:03 mail3 kernel: Assertion failure in journal_write_metadata_buffer() at journal.c:406: "buffer_jdirty(jh2bh(jh_in))"
Jul  3 00:18:03 mail3 kernel: ------------[ cut here ]------------
Jul  3 00:18:03 mail3 kernel: kernel BUG at journal.c:406!
Jul  3 00:18:03 mail3 kernel: invalid operand: 0000
Jul  3 00:18:03 mail3 kernel: eepro100 usb-uhci usbcore ext3 jbd raid5 xor aic7xxx sym53c8xx sd_mod scsi_mod
Jul  3 00:18:03 mail3 kernel: CPU:    2
Jul  3 00:18:03 mail3 kernel: EIP:    0010:[]    Not tainted
Jul  3 00:18:03 mail3 kernel: EFLAGS: 00010282
Jul  3 00:18:03 mail3 kernel:
Jul  3 00:18:03 mail3 kernel: EIP is at journal_write_metadata_buffer [jbd] 0x74 (2.4.18-4smp)
Jul  3 00:18:03 mail3 kernel: eax: 0000001d   ebx: 00000000   ecx: c02eeec0   edx: 00004f0c
Jul  3 00:18:03 mail3 kernel: esi: 00000000   edi: e3929840   ebp: f6a99220   esp: f6a51e44
Jul  3 00:18:03 mail3 kernel: ds: 0018   es: 0018   ss: 0018
Jul  3 00:18:03 mail3 kernel: Process kjournald (pid: 198, stackpage=f6a51000)
Jul  3 00:18:03 mail3 kernel: Stack: f8882181 00000196 00001845 f777e600 00000000 00000000 d582e7f0 00000000
Jul  3 00:18:03 mail3 kernel:        e3929840 f6a99220 f887adf4 e3929840 d582e7f0 f6a51e98 00001a56 f775e6b8
Jul  3 00:18:03 mail3 kernel:        00000000 00000fcc d574d034 00000004 e3929840 d57a2c70 00001a56 00000001
Jul  3 00:18:03 mail3 kernel: Call Trace: [] .rodata.str1.1 [jbd] 0x4e1
Jul  3 00:18:04 mail3 kernel: [] journal_commit_transaction [jbd] 0x7e4
Jul  3 00:18:04 mail3 kernel: [] rw_intr [sd_mod] 0x20f
Jul  3 00:18:04 mail3 kernel: [] schedule [kernel] 0x348
Jul  3 00:18:04 mail3 kernel: [] kjournald [jbd] 0x136
Jul  3 00:18:04 mail3 kernel: [] commit_timeout [jbd] 0x0
Jul  3 00:18:04 mail3 kernel: [] kernel_thread [kernel] 0x26
Jul  3 00:18:04 mail3 kernel: [] kjournald [jbd] 0x0
Jul  3 00:18:04 mail3 kernel:
Jul  3 00:18:04 mail3 kernel:
Jul  3 00:18:04 mail3 kernel: Code: 0f 0b 5e 5f 8b 7c 24 28 8b 4f 0c 85 c9 74 2e c7 44 24 0c 01
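
One quick sanity check on an oops like this is whether the faulting instruction is a deliberate trap. On x86 the byte pair 0f 0b encodes UD2, which raises an invalid-opcode exception; that matches the "invalid operand: 0000" and "kernel BUG at journal.c:406!" lines and suggests the assertion fired on purpose rather than the CPU running into corrupted code. The sketch below only inspects the bytes from the "Code:" line of this report; the helper names are hypothetical.

```python
# Opcode bytes copied from the "Code:" line of the oops in this report.
CODE_LINE = "0f 0b 5e 5f 8b 7c 24 28 8b 4f 0c 85 c9 74 2e c7 44 24 0c 01"

# x86 encoding of UD2, the undefined-instruction opcode.
UD2 = bytes([0x0F, 0x0B])


def parse_code_bytes(line):
    """Turn a space-separated hex dump into a bytes object."""
    return bytes(int(token, 16) for token in line.split())


def starts_with_ud2(code):
    """True if the first instruction in the dump is UD2."""
    return code[:2] == UD2


if __name__ == "__main__":
    code = parse_code_bytes(CODE_LINE)
    print(len(code), starts_with_ud2(code))
```

This assumes the dump begins at the faulting EIP, as it appears to in this log; it says nothing about why the buffer was not marked jdirty.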
Comment 1 Stephen Tweedie 2002-07-05 15:37:47 EDT
I have found one possible cause of this problem, and have a provisional fix
awaiting internal testing.

I'm fairly sure that the bug has nothing to do with the quota problems being
reported: it might be worth opening those in a separate bug report, since the
quota system is notionally independent of ext3 and the quota corruption looks
like a completely different problem.

6. I have more logs which include several more crash messages as well as the
boot-up messages if anyone would like to see them.
Comment 2 Stephen Tweedie 2002-07-31 10:11:08 EDT
There is a new kernel, 2.4.18-7, available for testing on

    http://people.redhat.com/arjanv/testkernels/

I hope that this will fix the problem, and the patch will be released in the
next official errata.  Please feel free to try this kernel.
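
For anyone checking a fleet of machines against the test kernel, a rough comparison of release strings might look like the sketch below. It is an illustration only: it compares the leading numeric fields of a string such as "2.4.18-4smp" against 2.4.18-7 and ignores suffixes like "smp", which is a simplification of real RPM version ordering. The function names and the "2.4.18-7smp" input are assumptions, not taken from this report.

```python
import re

# Test kernel named in this comment, as (major, minor, patch, release).
CANDIDATE_FIX = (2, 4, 18, 7)

# Leading numeric fields of a release string, e.g. "2.4.18-4smp".
RELEASE_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)")


def parse_kernel(release):
    """Return the four leading numeric fields of a kernel release string."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    return tuple(int(group) for group in match.groups())


def has_candidate_fix(release):
    """True if the release is at or beyond the 2.4.18-7 test kernel."""
    return parse_kernel(release) >= CANDIDATE_FIX


if __name__ == "__main__":
    for release in ("2.4.18-4smp", "2.4.18-5", "2.4.18-7smp"):
        print(release, has_candidate_fix(release))
```

The string to test would normally come from `uname -r` on the machine in question.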
Comment 3 Stephen Tweedie 2002-07-31 10:51:26 EDT
*** Bug 66663 has been marked as a duplicate of this bug. ***
Comment 4 Stephen Tweedie 2002-07-31 10:52:28 EDT
*** Bug 70118 has been marked as a duplicate of this bug. ***
