Bug 240077
| Summary: | Panic under high disk I/O (stack overflow: XFS + LVM) |
|---|---|
| Product: | Fedora |
| Component: | kernel |
| Version: | 6 |
| Hardware: | i686 |
| OS: | Linux |
| Status: | CLOSED DUPLICATE |
| Severity: | high |
| Priority: | medium |
| Reporter: | Nathan Valentine <nvalentine> |
| Assignee: | Eric Sandeen <esandeen> |
| QA Contact: | Brian Brock <bbrock> |
| Doc Type: | Bug Fix |
| Last Closed: | 2007-08-09 19:42:38 UTC |
| Attachments: | loghost capture of kern (attachment 154693) |
Created attachment 154693 [details]
loghost capture of kern
xfs on lvm volumes will do that. Kernels from http://www.linuxant.com/driverloader/wlan/full/downloads.php may work, but the 16K stacks can cause other problems (out of memory when trying to start a new task). Setting /proc/sys/vm/min_free_kbytes to 16000 can help prevent that, but it's not guaranteed.

Is ext3 on LVM known to exhibit this behavior as well? Actually, I guess a better question would be "Where in the Fedora community should I have found documentation about this and similar issues?"

(In reply to comment #3)
> Is ext3 on LVM known to exhibit this behavior as well?

ext3 should be fine. Raw devices probably would work best, though. (Stack overflows with lvm + xfs are well known in the Linux kernel community, though probably not known about enough in the user groups.)

I'll put this under my name - if nothing else I may dup it to another bug that might get WONTFIXed eventually, unfortunately. Stacked IO + XFS + 4k stacks is tough; lots of stack reductions have been done in xfs, but it's unlikely that this is ever going to be 100% robust. 16K stacks are probably overkill; default 8K stacks (a kernel config option on x86) or 8k stacks on x86_64 will probably work fine. As we get more stacked filesystems in the kernel (think unionfs, ecryptfs), the 4k stacks may get more interesting too.

FWIW, I was able to reproduce this panic on both 8k and 16k stacks. We eventually "solved" the problem by one of two methods, depending on the role of the server:

1) Swap XFS for ext3.
2) Move the XFS filesystem from LVM to raw partitions.

Since we made these changes, things have been stable and performant. But I agree with the assessment that stacked storage management is likely not going away, and thus this will continue to be a problem. Thanks for your help.

On 16k stacks? Yikes... ok, that's unexpected. Do you happen to have any stack traces from that kernel? I wonder if you're hitting recursion... I'll look more closely at the kernel log you have posted already. I'm a bit skeptical of it; it seems to show *hundreds* of functions on the stack...? Even with the false positives from dump_stack(), it doesn't seem quite right.

Unfortunately, I didn't save any of the debugging information from testing alternative stack sizes.

Duping this to an earlier 4k+xfs+lvm bug; though the root cause may be slightly different, it's the same issue as far as I can tell - and one without a good solution, I'm afraid.

*** This bug has been marked as a duplicate of 227331 ***
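For reference, a minimal sketch of the min_free_kbytes workaround suggested in the first reply above; 16000 is the value from that comment, and persisting it through /etc/sysctl.conf is an assumption about the local setup:

```
# Raise the kernel's free-memory floor so larger allocations (e.g. 16K
# stack allocations) are less likely to fail under memory pressure.
echo 16000 > /proc/sys/vm/min_free_kbytes

# Equivalent sysctl form; add "vm.min_free_kbytes = 16000" to
# /etc/sysctl.conf to make the setting persist across reboots.
sysctl -w vm.min_free_kbytes=16000
```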
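The thread notes that 8K stacks are a kernel config option on x86. As a hedged aside, one way to check how a given Fedora kernel was built; the path follows the standard Fedora /boot layout, and CONFIG_4KSTACKS is the relevant i386 option:

```
# CONFIG_4KSTACKS=y                       -> 4K kernel stacks (the overflowing configuration)
# "CONFIG_4KSTACKS is not set" or absent  -> the default 8K stacks on i686
grep CONFIG_4KSTACKS /boot/config-$(uname -r)
```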
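A rough sketch of the reporter's second workaround (moving XFS off LVM onto a raw partition) using xfsdump/xfsrestore; the device names, dump path, and labels are hypothetical placeholders, not values from this report:

```
# Stop the database and dump the existing XFS filesystem. The session (-L)
# and media (-M) labels are arbitrary; /backup and /dev/sdb2 are placeholders.
service mysqld stop
xfsdump -L mysql -M backup -f /backup/mysql.dump /var/lib/mysql
umount /var/lib/mysql

# Recreate XFS directly on a raw partition, so no device-mapper layer sits
# between the filesystem and the block driver, then restore the data.
mkfs.xfs /dev/sdb2
mount /dev/sdb2 /var/lib/mysql
xfsrestore -f /backup/mysql.dump /var/lib/mysql
```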
Description of problem:

We have a MySQL database server that hosts several very large and very active (read-intensive) databases on an XFS LVM volume running on top of a 3ware mirrored RAID. Several times a week, the machine kernel panics under high disk I/O. The attached stack trace is from a loghost; we are unable to get the oops message from the console, as the crashes put the machine into a state where it does not accept console input. We are working on putting a serial console on the machine, but we have the following information available now:

```
# dmesg | grep -i 3ware
3ware 9000 Storage Controller device driver for Linux v2.26.02.008.
scsi0 : 3ware 9000 Storage Controller
3w-9xxx: scsi0: Found a 3ware 9000 Storage Controller at 0xda300000, IRQ: 16.

# df -h | grep mysql
/dev/mapper/system-mysql  1000G  431G  570G  44% /var/lib/mysql

# vgdisplay -v
    Finding all volume groups
    Finding volume group "system"
  --- Volume group ---
  VG Name               system
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  11
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                1
  Open LV               1
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               2.63 TB
  PE Size               4.00 MB
  Total PE              689640
  Alloc PE / Size       256000 / 1000.00 GB
  Free  PE / Size       433640 / 1.65 TB
  VG UUID               GgcbZ2-ex1S-mAIJ-D9xl-oH9m-DaPA-Ms9l09

  --- Logical volume ---
  LV Name               /dev/system/mysql
  VG Name               system
  LV UUID               wOtwiF-Qfde-AF2F-Sk4V-7Ufz-CzuO-qNeYMM
  LV Write Access       read/write
  LV Status             available
  # open                1
  LV Size               1000.00 GB
  Current LE            256000
  Segments              1
  Allocation            inherit
  Read ahead sectors    0
  Block device          253:0

  --- Physical volumes ---
  PV Name               /dev/sdb1
  PV UUID               ZUdR21-DBuU-vzF3-RG9l-Xypc-J2jX-Gbwftv
  PV Status             allocatable
  Total PE / Free PE    689640 / 433640

# uname -a
Linux <somehost> 2.6.20-1.2925.fc6 #1 SMP Sat Mar 10 19:15:16 EST 2007 i686 i686 i386 GNU/Linux
```

Version-Release number of selected component (if applicable):
2.6.20-1.2925.fc6

How reproducible:
At least twice a week under "normal" high load.

Additional Info:
Kernel stack trace attached.
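The description notes that the console wedges before the oops can be read. One hedged alternative to a serial console is the kernel's netconsole module, which streams printk output over UDP to a remote loghost; every address, interface name, MAC, and port below is a placeholder, not a value from this report:

```
# Stream kernel messages (including panic traces) over UDP to a remote
# loghost, bypassing the wedged local console. Format is
# [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr].
modprobe netconsole netconsole=@192.168.0.5/eth0,6666@192.168.0.9/00:11:22:33:44:55

# On the loghost, capture the stream with netcat (flag syntax varies by
# netcat variant; OpenBSD nc would be "nc -lu 6666"):
nc -l -u -p 6666
```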