Description of problem: device-mapper: snapshots: Snapshot is marked invalid. device-mapper: snapshots: chunk size 0 in device metadata overrides table chunk size of 32. kernel BUG at mm/mempool.c:121! lvcreate[11490]: bugcheck! 0 [1] Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss loop autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath button parport_pc lp parport joydev sr_mod cdrom e1000 sg dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx lpfc scsi_transport_fc cciss sd_mod scsi_mod raid0 ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 11490, CPU 15, comm: lvcreate psr : 0000101008526030 ifs : 800000000000050d ip : [<a000000100110630>] Not tainted ip is at mempool_resize+0x50/0x440 unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003 rnat: a000000100adbe68 bsps: 0000000000000004 pr : 000000000065a559 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100110630 b6 : a000000100011060 b7 : a00000010000b840 f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000 f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db r1 : a000000100c00ef0 r2 : a000000100a19088 r3 : a0000001009499e0 r8 : 0000000000000023 r9 : a000000100a190b8 r10 : a000000100a190b8 r11 : 0000000000000000 r12 : e0000016d7df7ca0 r13 : e0000016d7df0000 r14 : a000000100a19088 r15 : 0000000000000000 r16 : a0000001009499e8 r17 : e0000100e1a87e18 r18 : 0000000000000000 r19 : 000000000000000d r20 : a000000100849280 r21 : a000000100a01548 r22 : a000000100a19090 r23 : a000000100a19090 r24 : e0000100f9ec1054 r25 : 0000000000000000 r26 : e0000100f9ec105c r27 : e0000100f9ec1040 r28 : e0000100f9ec0008 r29 : 0000005000000078 r30 : 0000000000000000 r31 : 0000000000000000 Call Trace: [<a000000100013ba0>] show_stack+0x40/0xa0 sp=e0000016d7df7830 bsp=e0000016d7df15e8 [<a0000001000144a0>] show_regs+0x840/0x880 
sp=e0000016d7df7a00 bsp=e0000016d7df1590 [<a000000100037b80>] die+0x1c0/0x2c0 sp=e0000016d7df7a00 bsp=e0000016d7df1548 [<a000000100037cd0>] die_if_kernel+0x50/0x80 sp=e0000016d7df7a20 bsp=e0000016d7df1518 [<a000000100644b90>] ia64_bad_break+0x270/0x4a0 sp=e0000016d7df7a20 bsp=e0000016d7df14f0 [<a00000010000c040>] __ia64_leave_kernel+0x0/0x280 sp=e0000016d7df7ad0 bsp=e0000016d7df14f0 [<a000000100110630>] mempool_resize+0x50/0x440 sp=e0000016d7df7ca0 bsp=e0000016d7df1488 [<a00000021e4eb090>] dm_io_client_resize+0x30/0x60 [dm_mod] sp=e0000016d7df7ca0 bsp=e0000016d7df1460 [<a00000021ebd4340>] persistent_read_metadata+0x460/0x820 [dm_snapshot] sp=e0000016d7df7ca0 bsp=e0000016d7df1428 [<a00000021ebd2980>] snapshot_ctr+0x700/0xcc0 [dm_snapshot] sp=e0000016d7df7ca0 bsp=e0000016d7df13b8 [<a00000021e4e2550>] dm_table_add_target+0x350/0x740 [dm_mod] sp=e0000016d7df7cb0 bsp=e0000016d7df1360 [<a00000021e4e7670>] table_load+0x1f0/0x4a0 [dm_mod] sp=e0000016d7df7cc0 bsp=e0000016d7df1308 [<a00000021e4e9540>] ctl_ioctl+0x6a0/0x7a0 [dm_mod] sp=e0000016d7df7cd0 bsp=e0000016d7df12b0 [<a00000010019cac0>] do_ioctl+0x140/0x180 sp=e0000016d7df7e10 bsp=e0000016d7df1270 [<a00000010019d380>] vfs_ioctl+0x880/0x8e0 sp=e0000016d7df7e10 bsp=e0000016d7df1228 [<a00000010019d4b0>] sys_ioctl+0xd0/0x140 sp=e0000016d7df7e20 bsp=e0000016d7df11a0 [<a00000010000bdd0>] __ia64_trace_syscall+0xd0/0x110 sp=e0000016d7df7e30 bsp=e0000016d7df11a0 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e0000016d7df8000 bsp=e0000016d7df11a0 <0>Kernel panic - not syncing: Fatal exception Version-Release number of selected component (if applicable): 2.6.18-108.el5, 5.2 userspace on ia64 How reproducible: not certain, this is the first time it has happened. 
Steps to Reproduce:
1. Install 5.2 on ia64
2. Install 2.6.18-108.el5
3. Run the rhts test: /kernel/storage/lvm/snapshot_remove/

Actual results: panic

Expected results: test runs to completion

Additional info:
http://rhts.redhat.com/testlogs/28444/103180/875970/4163273-test_log--kernel-storage-lvm-snapshot_remove-EXTERNALWATCHDOG.log
http://rhts.redhat.com/testlogs/28444/103180/875970/TESTOUT.log
I guess we'll have the same thing upstream too? Is it arch-specific?
*** This bug has been marked as a duplicate of bug 443627 ***
I have just seen this panic on ia64 machine using 2.6.18-160.el5 kernel with the RHEL5.4-Server-20090729.0 tree. This happened on kernel-xen. Scenario is the same - running RHTS test /kernel/storage/lvm/snapshot_remove. Full console log: http://rhts.redhat.com/testlogs/2009/07/80234/240707/1975636/console.txt device-mapper: snapshots: chunk size 0 in device metadata overrides table chunk size of 32. kernel BUG at mm/mempool.c:121! lvcreate[15369]: bugcheck! 0 [1] Modules linked in: nfs fscache nfsd exportfs nfs_acl auth_rpcgss loop autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath scsi_dh button parport_pc lp parport joydev sr_mod cdrom e1000 qla2xxx lpfc scsi_transport_fc sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage cciss sd_mod scsi_mod raid0 ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 15369, CPU 0, comm: lvcreate psr : 00001010085a6010 ifs : 800000000000050d ip : [<a000000100128370>] Not tainted (2.6.18-160.el5xen) ip is at mempool_resize+0x50/0x440 unat: 0000000000000000 pfs : 800000000000050d rsc : 000000000000000b rnat: a000000100a71170 bsps: fffffffffff00001 pr : 0000000000656659 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a000000100128370 b6 : a0000001000c37e0 b7 : 0000000000000000 f6 : 1003e00000000000000a0 f7 : 1003e20c49ba5e353f7cf f8 : 1003e00000000000004e2 f9 : 1003e000000000fa00000 f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db r1 : a000000100c57ff0 r2 : a000000100a70e58 r3 : a000000100a71170 r8 : 0000000000000023 r9 : a000000100a70e88 r10 : a000000100a70e88 r11 : 0000000000000000 r12 : e0000000209cfc70 r13 : e0000000209c8000 r14 : a000000100a70e58 r15 : 0000000000000000 r16 : fffffffffff04c18 r17 : e0000001a6b87e18 r18 : 0000000000000001 r19 : fffffffffff04c18 r20 : a00000010088d300 r21 : a000000100a58678 r22 : a000000100a70e60 r23 : a000000100a70e60 r24 : 
fffffffffff00000 r25 : fffffffffff00001 r26 : a000000100a6e178 r27 : 00000320000004b0 r28 : 0000018ffffffe70 r29 : 00000000000007d0 r30 : 00000000018ffe70 r31 : 0000018ffe700000 Call Trace: [<a00000010001d240>] show_stack+0x40/0xa0 sp=e0000000209cf800 bsp=e0000000209c95f0 [<a00000010001db70>] show_regs+0x870/0x8c0 sp=e0000000209cf9d0 bsp=e0000000209c9598 [<a000000100043720>] die+0x1c0/0x380 sp=e0000000209cf9d0 bsp=e0000000209c9550 [<a000000100043930>] die_if_kernel+0x50/0x80 sp=e0000000209cf9f0 bsp=e0000000209c9520 [<a00000010067e4b0>] ia64_bad_break+0x270/0x4a0 sp=e0000000209cf9f0 bsp=e0000000209c94f8 [<a00000010006b140>] xen_leave_kernel+0x0/0x3e0 sp=e0000000209cfaa0 bsp=e0000000209c94f8 [<a000000100128370>] mempool_resize+0x50/0x440 sp=e0000000209cfc70 bsp=e0000000209c9490 [<a0000002013c78f0>] dm_io_client_resize+0x30/0x60 [dm_mod] sp=e0000000209cfc70 bsp=e0000000209c9468 [<a00000020142cd40>] persistent_read_metadata+0x460/0x880 [dm_snapshot] sp=e0000000209cfc70 bsp=e0000000209c9430 [<a00000020142b620>] snapshot_ctr+0x860/0xe80 [dm_snapshot] sp=e0000000209cfc70 bsp=e0000000209c93b8 [<a0000002013be650>] dm_table_add_target+0x350/0x740 [dm_mod] sp=e0000000209cfc80 bsp=e0000000209c9360 [<a0000002013c3f10>] table_load+0x1f0/0x4a0 [dm_mod] sp=e0000000209cfc90 bsp=e0000000209c9308 [<a0000002013c5de0>] ctl_ioctl+0x6a0/0x7a0 [dm_mod] sp=e0000000209cfca0 bsp=e0000000209c92b0 [<a0000001001b3a60>] do_ioctl+0x140/0x180 sp=e0000000209cfde0 bsp=e0000000209c9270 [<a0000001001b4860>] vfs_ioctl+0xdc0/0xec0 sp=e0000000209cfde0 bsp=e0000000209c9228 [<a0000001001b4a30>] sys_ioctl+0xd0/0x140 sp=e0000000209cfe20 bsp=e0000000209c91a0 [<a00000010006ae40>] xen_trace_syscall+0x100/0x140 sp=e0000000209cfe30 bsp=e0000000209c91a0 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e0000000209d0000 bsp=e0000000209c91a0 <0>Kernel panic - not syncing: Fatal exception (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
Here I'm posting upstream patches. I will make RHEL-5.4 patches if this bug is approved.
Created attachment 356089 [details] PATCH 1/4: refactor chunk_io
Created attachment 356090 [details] PATCH 2/4: Use separate area for the header This fixes the race condition that corrupts the header.
Created attachment 356091 [details] PATCH 3/4: refactor set_chunk_size
Created attachment 356092 [details] PATCH 4/4: check on-disk chunksize Don't crash if the header is corrupted.
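To illustrate the kind of check that "check on-disk chunksize" refers to, here is a minimal user-space sketch. It is not the attached patch: the function name `read_chunk_size` and the power-of-two requirement are assumptions made for illustration. The idea is only that a chunk size read from the header is rejected with an error instead of being passed on to code that cannot handle zero.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* True only for nonzero powers of two. */
static int is_power_of_two(uint32_t n)
{
    return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical helper: accept a chunk size (in sectors) read from the
 * on-disk header only if it is sane. On failure *chunk_out is left
 * untouched and -EINVAL is returned, so the caller can fail the table
 * load instead of resizing a pool to zero.
 */
int read_chunk_size(uint32_t disk_chunk, uint32_t *chunk_out)
{
    assert(chunk_out != NULL);

    if (!is_power_of_two(disk_chunk))
        return -EINVAL;   /* covers the corrupted "chunk size 0" case */

    *chunk_out = disk_chunk;
    return 0;
}
```

With a check of this shape, a header carrying chunk size 0 would make the snapshot load fail with an error rather than reach mempool_resize.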
Description of the bug: There is a race condition in the snapshot code. If the snapshot fills up, a header is written flagging the snapshot as invalid. If a chunk reallocation finishes at the same time, it modifies the same buffer that the header-writing code is using, which may result in an invalid header being written. I can't prove that this is what happened in this case (because it is a race, I can't reproduce it), but applying the principle "once all impossible things are excluded, whatever remains is the truth" (none of the normal code paths can write 0 as the chunk size to the header), I came to the conclusion that this race caused these crashes. Furthermore, when a snapshot whose header is damaged in such a way that the chunk size is zero is activated later, the kernel crashes. Patches 1 and 2 fix the race that corrupts the header. Patches 3 and 4 fix the crash on a corrupted header.
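The race described above can be pictured with a small single-threaded simulation. This is only a sketch under assumptions: the struct `pstore`, the header field layout and all function names are hypothetical and much simpler than the real driver. It replays one unlucky interleaving (a chunk update landing between preparing the header and submitting it) and shows why a separate header buffer avoids the corruption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define AREA_WORDS     4
#define HDR_VALID      1   /* hypothetical field positions in the header */
#define HDR_CHUNK_SIZE 3

/* Hypothetical, simplified stand-in for the persistent store state. */
struct pstore {
    uint32_t area[AREA_WORDS];        /* buffer for chunk metadata */
    uint32_t header_area[AREA_WORDS]; /* buffer reserved for the header */
    uint32_t chunk_size;
};

/* Fill a buffer with the header fields before it is written out. */
static void prepare_header(uint32_t *buf, uint32_t valid, uint32_t chunk_size)
{
    assert(buf != NULL);
    memset(buf, 0, AREA_WORDS * sizeof(*buf));
    buf[HDR_VALID] = valid;
    buf[HDR_CHUNK_SIZE] = chunk_size;
}

/* A finishing chunk reallocation clears and refills the metadata buffer. */
static void commit_exception(struct pstore *ps, uint32_t old_chunk,
                             uint32_t new_chunk)
{
    memset(ps->area, 0, sizeof(ps->area));
    ps->area[0] = old_chunk;
    ps->area[1] = new_chunk;
}

/*
 * Return the chunk size that would reach disk when a chunk update lands
 * between preparing the header and submitting the write.
 */
uint32_t chunk_size_written(int use_separate_header_area)
{
    struct pstore ps = { .chunk_size = 32 };
    uint32_t *hdr = use_separate_header_area ? ps.header_area : ps.area;

    prepare_header(hdr, 0, ps.chunk_size); /* flag the snapshot invalid */
    commit_exception(&ps, 7, 9);           /* interleaved update of ps.area */
    return hdr[HDR_CHUNK_SIZE];            /* value the header write submits */
}
```

In this model `chunk_size_written(0)` (shared buffer) yields 0, matching the "chunk size 0 in device metadata" message, while `chunk_size_written(1)` (separate header area) keeps the value 32.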
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 375317 [details] A backported patch for RHEL 5.5
in kernel-2.6.18-178.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5 Please update the appropriate value in the Verified field (cf_verified) to indicate this fix has been successfully verified. Include a comment with verification details.
Confirmed: ia64 has successfully run this test a few times since the -178 kernel.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2010-0178.html