Bug 461506
Summary: kernel BUG at mm/mempool.c:121! caused by lvcreate
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: ia64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Mike Gahagan <mgahagan>
Assignee: Mikuláš Patočka <mpatocka>
QA Contact: Cluster QE <mspqa-list>
CC: agk, christophe.varoqui, coughlan, dwysocha, dzickus, edamato, egoggin, heinzm, jbrassow, jtluka, junichi.nomura, kueda, lmb, mbroz, mjenner, prockai, syeghiay, tranlan
Target Milestone: rc
Target Release: ---
Keywords: Regression, Reopened
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2010-03-30 07:36:39 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Blocks: 525215, 533192

Description
Mike Gahagan
2008-09-08 17:33:55 UTC
I guess we'll have the same thing upstream too? Is it arch-specific?

*** This bug has been marked as a duplicate of bug 443627 ***

I have just seen this panic on an ia64 machine using the 2.6.18-160.el5 kernel with the RHEL5.4-Server-20090729.0 tree. This happened on kernel-xen. The scenario is the same: running the RHTS test /kernel/storage/lvm/snapshot_remove.

Full console log: http://rhts.redhat.com/testlogs/2009/07/80234/240707/1975636/console.txt

device-mapper: snapshots: chunk size 0 in device metadata overrides table chunk size of 32.
kernel BUG at mm/mempool.c:121!
lvcreate[15369]: bugcheck! 0 [1]
Modules linked in: nfs fscache nfsd exportfs nfs_acl auth_rpcgss loop autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath scsi_dh button parport_pc lp parport joydev sr_mod cdrom e1000 qla2xxx lpfc scsi_transport_fc sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage cciss sd_mod scsi_mod raid0 ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 15369, CPU 0, comm: lvcreate
psr : 00001010085a6010 ifs : 800000000000050d ip : [<a000000100128370>] Not tainted (2.6.18-160.el5xen)
ip is at mempool_resize+0x50/0x440
unat: 0000000000000000 pfs : 800000000000050d rsc : 000000000000000b
rnat: a000000100a71170 bsps: fffffffffff00001 pr : 0000000000656659
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a000000100128370 b6 : a0000001000c37e0 b7 : 0000000000000000
f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf
f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1 : a000000100c57ff0 r2 : a000000100a70e58 r3 : a000000100a71170
r8 : 0000000000000023 r9 : a000000100a70e88 r10 : a000000100a70e88
r11 : 0000000000000000 r12 : e0000000209cfc70 r13 : e0000000209c8000
r14 : a000000100a70e58 r15 : 0000000000000000 r16 : fffffffffff04c18
r17 : e0000001a6b87e18 r18 : 0000000000000001 r19 : fffffffffff04c18
r20 : a00000010088d300 r21 : a000000100a58678 r22 : a000000100a70e60
r23 : a000000100a70e60 r24 : fffffffffff00000 r25 : fffffffffff00001
r26 : a000000100a6e178 r27 : 00000320000004b0 r28 : 0000018ffffffe70
r29 : 00000000000007d0 r30 : 00000000018ffe70 r31 : 0000018ffe700000

Call Trace:
[<a00000010001d240>] show_stack+0x40/0xa0 sp=e0000000209cf800 bsp=e0000000209c95f0
[<a00000010001db70>] show_regs+0x870/0x8c0 sp=e0000000209cf9d0 bsp=e0000000209c9598
[<a000000100043720>] die+0x1c0/0x380 sp=e0000000209cf9d0 bsp=e0000000209c9550
[<a000000100043930>] die_if_kernel+0x50/0x80 sp=e0000000209cf9f0 bsp=e0000000209c9520
[<a00000010067e4b0>] ia64_bad_break+0x270/0x4a0 sp=e0000000209cf9f0 bsp=e0000000209c94f8
[<a00000010006b140>] xen_leave_kernel+0x0/0x3e0 sp=e0000000209cfaa0 bsp=e0000000209c94f8
[<a000000100128370>] mempool_resize+0x50/0x440 sp=e0000000209cfc70 bsp=e0000000209c9490
[<a0000002013c78f0>] dm_io_client_resize+0x30/0x60 [dm_mod] sp=e0000000209cfc70 bsp=e0000000209c9468
[<a00000020142cd40>] persistent_read_metadata+0x460/0x880 [dm_snapshot] sp=e0000000209cfc70 bsp=e0000000209c9430
[<a00000020142b620>] snapshot_ctr+0x860/0xe80 [dm_snapshot] sp=e0000000209cfc70 bsp=e0000000209c93b8
[<a0000002013be650>] dm_table_add_target+0x350/0x740 [dm_mod] sp=e0000000209cfc80 bsp=e0000000209c9360
[<a0000002013c3f10>] table_load+0x1f0/0x4a0 [dm_mod] sp=e0000000209cfc90 bsp=e0000000209c9308
[<a0000002013c5de0>] ctl_ioctl+0x6a0/0x7a0 [dm_mod] sp=e0000000209cfca0 bsp=e0000000209c92b0
[<a0000001001b3a60>] do_ioctl+0x140/0x180 sp=e0000000209cfde0 bsp=e0000000209c9270
[<a0000001001b4860>] vfs_ioctl+0xdc0/0xec0 sp=e0000000209cfde0 bsp=e0000000209c9228
[<a0000001001b4a30>] sys_ioctl+0xd0/0x140 sp=e0000000209cfe20 bsp=e0000000209c91a0
[<a00000010006ae40>] xen_trace_syscall+0x100/0x140 sp=e0000000209cfe30 bsp=e0000000209c91a0
[<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e0000000209d0000 bsp=e0000000209c91a0
<0>Kernel panic - not syncing: Fatal exception
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Here I'm posting upstream patches. I will make RHEL-5.4 patches if this bug is approved.

Created attachment 356089 [details]
PATCH 1/4: refactor chunk_io
Created attachment 356090 [details]
PATCH 2/4: Use separate area for the header
This fixes the race condition that corrupts the header.
Created attachment 356091 [details]
PATCH 3/4: refactor set_chunk_size
Created attachment 356092 [details]
PATCH 4/4: check on-disk chunksize
Don't crash if the header is corrupted.
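
The check named in PATCH 4/4 can be illustrated with a minimal user-space sketch. This is not the attached patch: the struct layout and the names snap_header and check_chunk_size are hypothetical, and the power-of-two rule is an assumption about what counts as a usable chunk size. The point is only that a chunk size read from disk is validated before it overrides the table value and reaches the mempool code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical header layout, for illustration only. */
struct snap_header {
    uint32_t magic;
    uint32_t valid;
    uint32_t version;
    uint32_t chunk_size;   /* chunk size as stored on disk, in sectors */
};

/*
 * Validate the on-disk chunk size before using it.
 * Returns 0 and stores the value in *out, or -EINVAL if the
 * header value is zero or (by assumption) not a power of two.
 */
int check_chunk_size(const struct snap_header *h, uint32_t *out)
{
    uint32_t cs = h->chunk_size;

    if (cs == 0)
        return -EINVAL;          /* the corrupted case seen in the log */
    if ((cs & (cs - 1)) != 0)
        return -EINVAL;          /* assumed rule: must be a power of two */

    *out = cs;
    return 0;
}
```

With a check of this kind, a header carrying chunk size 0 would make the table load fail with an error instead of reaching mempool_resize.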
Description of the bug:

There is a race condition in the snapshot code. If the snapshot fills up, a header is written flagging the snapshot as invalid. If a chunk reallocation finishes at the same time, it modifies the same buffer that the header-writing code is using, which may result in an invalid header being written.

I can't prove that this happened in this case (because it is a race, I can't reproduce it). But applying the principle "once we exclude everything impossible, whatever remains is the truth" (none of the normal code paths can write 0 as the chunk size to the header), I came to the conclusion that this race caused these crashes.

Furthermore, when a snapshot whose header is damaged so that the chunk size is zero is activated later, the kernel crashes.

Patches 1 and 2 fix the race that corrupts the header. Patches 3 and 4 fix the crash on a corrupted header.

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.

Created attachment 375317 [details]
A backported patch for RHEL 5.5
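
The idea behind "Use separate area for the header" can be sketched in user space as follows. This is a simplified model and not the backported patch: the struct pstore_model, its field names, and the buffer size are hypothetical, and real code also involves I/O and locking that are omitted here. It only shows that once the header image lives in its own buffer, a chunk allocation that completes while the header is being prepared cannot overwrite it.

```c
#include <stdint.h>
#include <string.h>

#define BUF_BYTES 512   /* illustrative buffer size, not taken from the report */

/* Hypothetical model of a persistent snapshot store. */
struct pstore_model {
    uint8_t area[BUF_BYTES];         /* scratch buffer for chunk-mapping metadata */
    uint8_t header_area[BUF_BYTES];  /* dedicated buffer, used only for the header */
    uint32_t valid;
    uint32_t chunk_size;
};

/* Build the header image in its own buffer, never in ps->area. */
void prepare_header(struct pstore_model *ps)
{
    memset(ps->header_area, 0, sizeof(ps->header_area));
    memcpy(ps->header_area, &ps->valid, sizeof(ps->valid));
    memcpy(ps->header_area + sizeof(ps->valid),
           &ps->chunk_size, sizeof(ps->chunk_size));
}

/* Stand-in for a chunk allocation completing: it touches only ps->area. */
void record_exception(struct pstore_model *ps, uint64_t old_chunk, uint64_t new_chunk)
{
    memcpy(ps->area, &old_chunk, sizeof(old_chunk));
    memcpy(ps->area + sizeof(old_chunk), &new_chunk, sizeof(new_chunk));
}

/* Read the chunk size back out of the prepared header image. */
uint32_t header_chunk_size(const struct pstore_model *ps)
{
    uint32_t cs;

    memcpy(&cs, ps->header_area + sizeof(ps->valid), sizeof(cs));
    return cs;
}
```

In this model, calling record_exception() between prepare_header() and the eventual write leaves the header image untouched, which is the property the shared-buffer design lacked.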
in kernel-2.6.18-178.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.

Confirmed: ia64 has successfully run this test a few times since the -178 kernel.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html