Bug 461506

Summary:

kernel BUG at mm/mempool.c:121! caused by lvcreate

Product:

Red Hat Enterprise Linux 5

Reporter:

Mike Gahagan <mgahagan>

Component:

kernel

Assignee:

Mikuláš Patočka <mpatocka>

Status:

CLOSED ERRATA

QA Contact:

Cluster QE <mspqa-list>

Severity:

high

Docs Contact:

Priority:

medium

Version:

5.3

CC:

agk, christophe.varoqui, coughlan, dwysocha, dzickus, edamato, egoggin, heinzm, jbrassow, jtluka, junichi.nomura, kueda, lmb, mbroz, mjenner, prockai, syeghiay, tranlan

Target Milestone:

Keywords:

Regression, Reopened

Target Release:

---

Hardware:

ia64

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Last Closed:

2010-03-30 07:36:39 UTC

Bug Blocks:

525215, 533192

Attachments:

Description	Flags
PATCH 1/4: refactor chunk_io	none
PATCH 2/4: Use separate area for the header	none
PATCH 3/4: refactor set_chunk_size	none
PATCH 4/4: check on-disk chunksize	none
A backported patch for RHEL 5.5	none

Description Mike Gahagan 2008-09-08 17:33:55 UTC

Description of problem:

device-mapper: snapshots: Snapshot is marked invalid.
device-mapper: snapshots: chunk size 0 in device metadata overrides table chunk size of 32.
kernel BUG at mm/mempool.c:121!
lvcreate[11490]: bugcheck! 0 [1]
Modules linked in: nfs fscache nfsd exportfs lockd nfs_acl auth_rpcgss loop autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath button parport_pc lp parport joydev sr_mod cdrom e1000 sg dm_snapshot dm_zero dm_mirror dm_mod usb_storage qla2xxx lpfc scsi_transport_fc cciss sd_mod scsi_mod raid0 ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 11490, CPU 15, comm:             lvcreate
psr : 0000101008526030 ifs : 800000000000050d ip  : [<a000000100110630>]    Not tainted
ip is at mempool_resize+0x50/0x440
unat: 0000000000000000 pfs : 000000000000050d rsc : 0000000000000003
rnat: a000000100adbe68 bsps: 0000000000000004 pr  : 000000000065a559
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100110630 b6  : a000000100011060 b7  : a00000010000b840
f6  : 1003e00000000000000a0 f7  : 1003e20c49ba5e353f7cf
f8  : 1003e00000000000004e2 f9  : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1  : a000000100c00ef0 r2  : a000000100a19088 r3  : a0000001009499e0
r8  : 0000000000000023 r9  : a000000100a190b8 r10 : a000000100a190b8
r11 : 0000000000000000 r12 : e0000016d7df7ca0 r13 : e0000016d7df0000
r14 : a000000100a19088 r15 : 0000000000000000 r16 : a0000001009499e8
r17 : e0000100e1a87e18 r18 : 0000000000000000 r19 : 000000000000000d
r20 : a000000100849280 r21 : a000000100a01548 r22 : a000000100a19090
r23 : a000000100a19090 r24 : e0000100f9ec1054 r25 : 0000000000000000
r26 : e0000100f9ec105c r27 : e0000100f9ec1040 r28 : e0000100f9ec0008
r29 : 0000005000000078 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a000000100013ba0>] show_stack+0x40/0xa0
                                sp=e0000016d7df7830 bsp=e0000016d7df15e8
 [<a0000001000144a0>] show_regs+0x840/0x880
                                sp=e0000016d7df7a00 bsp=e0000016d7df1590
 [<a000000100037b80>] die+0x1c0/0x2c0
                                sp=e0000016d7df7a00 bsp=e0000016d7df1548
 [<a000000100037cd0>] die_if_kernel+0x50/0x80
                                sp=e0000016d7df7a20 bsp=e0000016d7df1518
 [<a000000100644b90>] ia64_bad_break+0x270/0x4a0
                                sp=e0000016d7df7a20 bsp=e0000016d7df14f0
 [<a00000010000c040>] __ia64_leave_kernel+0x0/0x280
                                sp=e0000016d7df7ad0 bsp=e0000016d7df14f0
 [<a000000100110630>] mempool_resize+0x50/0x440
                                sp=e0000016d7df7ca0 bsp=e0000016d7df1488
 [<a00000021e4eb090>] dm_io_client_resize+0x30/0x60 [dm_mod]
                                sp=e0000016d7df7ca0 bsp=e0000016d7df1460
 [<a00000021ebd4340>] persistent_read_metadata+0x460/0x820 [dm_snapshot]
                                sp=e0000016d7df7ca0 bsp=e0000016d7df1428
 [<a00000021ebd2980>] snapshot_ctr+0x700/0xcc0 [dm_snapshot]
                                sp=e0000016d7df7ca0 bsp=e0000016d7df13b8
 [<a00000021e4e2550>] dm_table_add_target+0x350/0x740 [dm_mod]
                                sp=e0000016d7df7cb0 bsp=e0000016d7df1360
 [<a00000021e4e7670>] table_load+0x1f0/0x4a0 [dm_mod]
                                sp=e0000016d7df7cc0 bsp=e0000016d7df1308
 [<a00000021e4e9540>] ctl_ioctl+0x6a0/0x7a0 [dm_mod]
                                sp=e0000016d7df7cd0 bsp=e0000016d7df12b0
 [<a00000010019cac0>] do_ioctl+0x140/0x180
                                sp=e0000016d7df7e10 bsp=e0000016d7df1270
 [<a00000010019d380>] vfs_ioctl+0x880/0x8e0
                                sp=e0000016d7df7e10 bsp=e0000016d7df1228
 [<a00000010019d4b0>] sys_ioctl+0xd0/0x140
                                sp=e0000016d7df7e20 bsp=e0000016d7df11a0
 [<a00000010000bdd0>] __ia64_trace_syscall+0xd0/0x110
                                sp=e0000016d7df7e30 bsp=e0000016d7df11a0
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e0000016d7df8000 bsp=e0000016d7df11a0
 <0>Kernel panic - not syncing: Fatal exception

Version-Release number of selected component (if applicable):
2.6.18-108.el5, 5.2 userspace on ia64

How reproducible:
Not certain; this is the first time it has happened.

Steps to Reproduce:
1. Install 5.2 on ia64
2. Install 2.6.18-108.el5
3. Run the rhts test: /kernel/storage/lvm/snapshot_remove/
  
Actual results:
panic

Expected results:
test runs to completion

Additional info:

http://rhts.redhat.com/testlogs/28444/103180/875970/4163273-test_log--kernel-storage-lvm-snapshot_remove-EXTERNALWATCHDOG.log
http://rhts.redhat.com/testlogs/28444/103180/875970/TESTOUT.log

Comment 1 Alasdair Kergon 2008-09-26 23:44:02 UTC

I guess we'll have the same thing upstream too?  Is it arch-specific?

Comment 3 Mikuláš Patočka 2008-09-29 04:24:23 UTC


*** This bug has been marked as a duplicate of bug 443627 ***

Comment 4 Jan Tluka 2009-08-03 12:26:02 UTC

I have just seen this panic on an ia64 machine using the 2.6.18-160.el5 kernel with the RHEL5.4-Server-20090729.0 tree. This happened on kernel-xen.

The scenario is the same: running the RHTS test /kernel/storage/lvm/snapshot_remove.

Full console log: http://rhts.redhat.com/testlogs/2009/07/80234/240707/1975636/console.txt

device-mapper: snapshots: chunk size 0 in device metadata overrides table chunk size of 32.
kernel BUG at mm/mempool.c:121!
lvcreate[15369]: bugcheck! 0 [1]
Modules linked in: nfs fscache nfsd exportfs nfs_acl auth_rpcgss loop autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api vfat fat dm_multipath scsi_dh button parport_pc lp parport joydev sr_mod cdrom e1000 qla2xxx lpfc scsi_transport_fc sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage cciss sd_mod scsi_mod raid0 ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 15369, CPU 0, comm:             lvcreate
psr : 00001010085a6010 ifs : 800000000000050d ip  : [<a000000100128370>]    Not tainted (2.6.18-160.el5xen)
ip is at mempool_resize+0x50/0x440
unat: 0000000000000000 pfs : 800000000000050d rsc : 000000000000000b
rnat: a000000100a71170 bsps: fffffffffff00001 pr  : 0000000000656659
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100128370 b6  : a0000001000c37e0 b7  : 0000000000000000
f6  : 1003e00000000000000a0 f7  : 1003e20c49ba5e353f7cf
f8  : 1003e00000000000004e2 f9  : 1003e000000000fa00000
f10 : 1003e000000003b9aca00 f11 : 1003e431bde82d7b634db
r1  : a000000100c57ff0 r2  : a000000100a70e58 r3  : a000000100a71170
r8  : 0000000000000023 r9  : a000000100a70e88 r10 : a000000100a70e88
r11 : 0000000000000000 r12 : e0000000209cfc70 r13 : e0000000209c8000
r14 : a000000100a70e58 r15 : 0000000000000000 r16 : fffffffffff04c18
r17 : e0000001a6b87e18 r18 : 0000000000000001 r19 : fffffffffff04c18
r20 : a00000010088d300 r21 : a000000100a58678 r22 : a000000100a70e60
r23 : a000000100a70e60 r24 : fffffffffff00000 r25 : fffffffffff00001
r26 : a000000100a6e178 r27 : 00000320000004b0 r28 : 0000018ffffffe70
r29 : 00000000000007d0 r30 : 00000000018ffe70 r31 : 0000018ffe700000

Call Trace:
 [<a00000010001d240>] show_stack+0x40/0xa0
                                sp=e0000000209cf800 bsp=e0000000209c95f0
 [<a00000010001db70>] show_regs+0x870/0x8c0
                                sp=e0000000209cf9d0 bsp=e0000000209c9598
 [<a000000100043720>] die+0x1c0/0x380
                                sp=e0000000209cf9d0 bsp=e0000000209c9550
 [<a000000100043930>] die_if_kernel+0x50/0x80
                                sp=e0000000209cf9f0 bsp=e0000000209c9520
 [<a00000010067e4b0>] ia64_bad_break+0x270/0x4a0
                                sp=e0000000209cf9f0 bsp=e0000000209c94f8
 [<a00000010006b140>] xen_leave_kernel+0x0/0x3e0
                                sp=e0000000209cfaa0 bsp=e0000000209c94f8
 [<a000000100128370>] mempool_resize+0x50/0x440
                                sp=e0000000209cfc70 bsp=e0000000209c9490
 [<a0000002013c78f0>] dm_io_client_resize+0x30/0x60 [dm_mod]
                                sp=e0000000209cfc70 bsp=e0000000209c9468
 [<a00000020142cd40>] persistent_read_metadata+0x460/0x880 [dm_snapshot]
                                sp=e0000000209cfc70 bsp=e0000000209c9430
 [<a00000020142b620>] snapshot_ctr+0x860/0xe80 [dm_snapshot]
                                sp=e0000000209cfc70 bsp=e0000000209c93b8
 [<a0000002013be650>] dm_table_add_target+0x350/0x740 [dm_mod]
                                sp=e0000000209cfc80 bsp=e0000000209c9360
 [<a0000002013c3f10>] table_load+0x1f0/0x4a0 [dm_mod]
                                sp=e0000000209cfc90 bsp=e0000000209c9308
 [<a0000002013c5de0>] ctl_ioctl+0x6a0/0x7a0 [dm_mod]
                                sp=e0000000209cfca0 bsp=e0000000209c92b0
 [<a0000001001b3a60>] do_ioctl+0x140/0x180
                                sp=e0000000209cfde0 bsp=e0000000209c9270
 [<a0000001001b4860>] vfs_ioctl+0xdc0/0xec0
                                sp=e0000000209cfde0 bsp=e0000000209c9228
 [<a0000001001b4a30>] sys_ioctl+0xd0/0x140
                                sp=e0000000209cfe20 bsp=e0000000209c91a0
 [<a00000010006ae40>] xen_trace_syscall+0x100/0x140
                                sp=e0000000209cfe30 bsp=e0000000209c91a0
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e0000000209d0000 bsp=e0000000209c91a0
 <0>Kernel panic - not syncing: Fatal exception
 (XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Comment 5 Mikuláš Patočka 2009-08-04 01:46:31 UTC

Here I'm posting upstream patches. I will make RHEL-5.4 patches if this bug is approved.

Comment 6 Mikuláš Patočka 2009-08-04 01:47:31 UTC

Created attachment 356089 [details]
PATCH 1/4: refactor chunk_io

Comment 7 Mikuláš Patočka 2009-08-04 01:51:36 UTC

Created attachment 356090 [details]
PATCH 2/4: Use separate area for the header

This fixes the race condition that corrupts the header.

Comment 8 Mikuláš Patočka 2009-08-04 01:52:24 UTC

Created attachment 356091 [details]
PATCH 3/4: refactor set_chunk_size

Comment 9 Mikuláš Patočka 2009-08-04 01:53:37 UTC

Created attachment 356092 [details]
PATCH 4/4: check on-disk chunksize

Don't crash if the header is corrupted.

Comment 10 Mikuláš Patočka 2009-08-04 02:01:16 UTC

Description of the bug:

There is a race condition in the snapshot code. When the snapshot fills up, a header is written flagging the snapshot as invalid. If a chunk reallocation finishes at the same time, it modifies the same buffer that the header-writing code is using, which may result in an invalid header being written. I can't prove that this happened in this case (because it is a race, I can't reproduce it), but applying the principle "once we exclude everything impossible (no normal code path can write 0 as the chunk size to the header), whatever remains is the truth", I conclude that this race caused these crashes.

Furthermore, when a snapshot whose header has been damaged so that the chunk size is zero is later activated, the kernel crashes.

Patches 1 and 2 fix the race that corrupts the header. Patches 3 and 4 fix the crash on a corrupted header.

Comment 11 RHEL Program Management 2009-08-04 02:05:03 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 23 RHEL Program Management 2009-09-25 17:41:42 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 24 Mikuláš Patočka 2009-12-02 04:30:47 UTC

Created attachment 375317 [details]
A backported patch for RHEL 5.5

Comment 25 Don Zickus 2009-12-09 18:11:14 UTC

in kernel-2.6.18-178.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 27 Mike Gahagan 2010-03-05 21:08:38 UTC

Confirmed: ia64 has successfully run this test a few times since the -178 kernel.

Comment 29 errata-xmlrpc 2010-03-30 07:36:39 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html