Bug 1201473

Summary: unsynced raid snapshot creation/deletion causes panic
Product: Red Hat Enterprise Linux 6 Reporter: Corey Marthaler <cmarthal>
Component: lvm2 Assignee: Heinz Mauelshagen <heinzm>
lvm2 sub component: Mirroring and RAID (RHEL6) QA Contact: cluster-qe <cluster-qe>
Status: CLOSED NEXTRELEASE Docs Contact:
Severity: urgent    
Priority: unspecified CC: agk, dhoward, heinzm, jbrassow, msnitzer, prajnoha, prockai, tlavigne, zkabelac
Version: 6.7 Keywords: Regression, TestBlocker
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2015-12-17 18:18:44 UTC Type: Bug
Bug Blocks: 1268411    
Attachments:
Description                 Flags
my virt setup information   none

Description Corey Marthaler 2015-03-12 18:36:26 UTC
Description of problem:
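Creating and then removing a snapshot of a raid1 volume that has not finished its initial sync panics the kernel. If I add a sleep right after the RAID creation in the following loop, so that the array is fully in sync before the snapshot is taken, the panic does not happen.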

[root@host-111 ~]# vgcreate test /dev/sd[abcdefgh]1
  Volume group "test" successfully created

[root@host-111 ~]# while true; do lvcreate --type raid1 -m 1 -n exclusive_origin -L 100M test; lvcreate -s test/exclusive_origin -n rsnap -L 20M; lvremove -f test/rsnap; lvremove -f test/exclusive_origin; sleep 1; done
  Logical volume "exclusive_origin" created.
  Logical volume "rsnap" created.
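
The sleep-based workaround described above can also be written as an explicit wait for the array to report full sync. This is a minimal sketch, not a tested fix: the helper names sync_percent and wait_for_sync are hypothetical, and it assumes the installed lvs supports the copy_percent report field and prints "100.00" for a fully synced raid1 LV.

```shell
# Hypothetical helper: print the sync percentage of an LV as reported by lvs
# (assumes the copy_percent report field, e.g. "100.00").
sync_percent() {
    lvs --noheadings -o copy_percent "$1" | tr -d ' '
}

# Hypothetical helper: poll once per second until the LV reports 100% in sync.
# Gives up and returns 1 after $2 polls (default 60).
wait_for_sync() {
    lv=$1
    tries=${2:-60}
    while [ "$tries" -gt 0 ]; do
        [ "$(sync_percent "$lv")" = "100.00" ] && return 0
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}

# Intended use inside the reproducer loop, in place of a fixed sleep:
#   lvcreate --type raid1 -m 1 -n exclusive_origin -L 100M test
#   wait_for_sync test/exclusive_origin || echo "still not in sync"
#   lvcreate -s test/exclusive_origin -n rsnap -L 20M
```

Waiting on the reported sync state only avoids the trigger condition; it does not address the underlying kernel problem.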


BUG: unable to handle kernel paging request at ffffc9000123a048
IP: [<ffffffffa04293f0>] do_table_event+0x10/0x20 [dm_raid]
PGD 3f109067 PUD 3f10a067 PMD 3aaff067 PTE 0
Oops: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-6/dm/suspended
CPU 0 



Mar 12 13:10:48 host-111 lvm[2182]: Monitoring RAID device test-exclusive_origin for events.
Mar 12 13:10:48 host-111 lvm[2182]: raid1 array, test-exclusive_origin, is not in-sync.
general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/virtual/block/dm-5/dm/suspended
CPU 0 
Modules linked in: dm_snapshot dm_bufio dm_raid raid10 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 26, comm: md_misc/0 Not tainted 2.6.32-540.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<ffffffffa04293f0>]  [<ffffffffa04293f0>] do_table_event+0x10/0x20 [dm_raid]
RSP: 0018:ffff88003ea5be30  EFLAGS: 00010202
RAX: 2f4065676e616863 RBX: ffff880002218e40 RCX: ffff880002218e48
RDX: ffff88003aa34bf0 RSI: ffff880002218e48 RDI: ffff88003aa34bf0
RBP: ffff88003ea5be30 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880002218e40
R13: ffffffffa04293e0 R14: ffff88003ea5bfd8 R15: ffff880002218e48
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fff2935b210 CR3: 0000000039e71000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md_misc/0 (pid: 26, threadinfo ffff88003ea58000, task ffff88003ea4eab0)
Stack:
 ffff88003ea5bee0 ffffffff8109a710 0000000000000000 0000000000000000
<d> ffff88003ea5be60 ffff88003ea4f128 ffff88003ea4eab0 ffff88003ea4eab0
<d> ffff88003ea4eab0 ffff880002218e58 0000000000000000 ffff88003ea4eab0
Call Trace:
 [<ffffffff8109a710>] worker_thread+0x170/0x2a0
 [<ffffffff810a12e0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a5a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a0e4e>] kthread+0x9e/0xc0
 [<ffffffff8109a5a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff8100c20a>] child_rip+0xa/0x20
 [<ffffffff810a0db0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
Code: 07 ef fe ff c9 c3 0f 1f 44 00 00 48 8d 7a 10 e8 d7 22 fe ff c9 c3 0f 1f 44 00 00 55 48 89 e5 0f 1f 44 00 00 48 8b 87 10 fc ff ff <48> 8b 78 08 e8 17 c4 bd ff c9 c3 0f 1f 44 00 00 55 48 89 e5 41 
RIP  [<ffffffffa04293f0>] do_table_event+0x10/0x20 [dm_raid]
 RSP <ffff88003ea5be30>
---[ end trace c9dfd9c9539fb6d3 ]---
Kernel panic - not syncing: Fatal exception
Pid: 26, comm: md_misc/0 Tainted: G      D    ---------------    2.6.32-540.el6.x86_64 #1
Call Trace:
 [<ffffffff815340af>] ? panic+0xa7/0x16f
 [<ffffffff81538e84>] ? oops_end+0xe4/0x100
 [<ffffffff81010edb>] ? die+0x5b/0x90
 [<ffffffff81538962>] ? do_general_protection+0x152/0x160
 [<ffffffffa04293e0>] ? do_table_event+0x0/0x20 [dm_raid]
 [<ffffffff81538135>] ? general_protection+0x25/0x30
 [<ffffffffa04293e0>] ? do_table_event+0x0/0x20 [dm_raid]
 [<ffffffffa04293f0>] ? do_table_event+0x10/0x20 [dm_raid]
 [<ffffffff8109a710>] ? worker_thread+0x170/0x2a0
 [<ffffffff810a12e0>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a5a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a0e4e>] ? kthread+0x9e/0xc0
 [<ffffffff8109a5a0>] ? worker_thread+0x0/0x2a0
 [<ffffffff8100c20a>] ? child_rip+0xa/0x20
 [<ffffffff810a0db0>] ? kthread+0x0/0xc0
 [<ffffffff8100c200>] ? child_rip+0x0/0x20
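
A note on the trace above: the faulting instruction in the Code: line (<48> 8b 78 08, a load from RAX+8) dereferences RAX = 2f4065676e616863, which is not a canonical address, hence the general protection fault. Read as little-endian bytes, that value is printable ASCII. This would be consistent with the structure having been freed and its memory reused for string data before the queued work item ran, but that reading is an inference and is not confirmed anywhere in this report. A small sketch to decode the register value (decode_le_ascii is a hypothetical helper name):

```shell
# Hypothetical helper: decode a hex register value as little-endian ASCII.
# Expects an even number of hex digits; returns 1 otherwise.
decode_le_ascii() {
    hex=$1
    out=""
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    while [ -n "$hex" ]; do
        byte=${hex#"${hex%??}"}     # last two hex digits = lowest-address byte first
        hex=${hex%??}
        # Convert the byte to an octal escape so this works with POSIX printf.
        out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
    done
    printf '%s\n' "$out"
}

decode_le_ascii 2f4065676e616863   # prints: change@/
```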



Comment 1 Corey Marthaler 2015-03-12 19:02:34 UTC
Version-Release number of selected component (if applicable):

2.6.32-540.el6.x86_64

lvm2-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
lvm2-libs-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
lvm2-cluster-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
udev-147-2.57.el6    BUILT: Thu Jul 24 08:48:47 CDT 2014
device-mapper-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-libs-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-event-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-event-libs-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015

Comment 5 Corey Marthaler 2015-03-19 23:40:49 UTC
I cannot reproduce this bug on a physical machine. I can, however, continue to reproduce it on my virt machines, even when running a newer kernel. I've varied the size of the RAID volume on both types of machines to get varying degrees of RAID sync. I'll attach the information on the setup of my virt machines.


2.6.32-544.el6.x86_64
lvm2-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
lvm2-libs-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
lvm2-cluster-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
udev-147-2.61.el6    BUILT: Mon Mar  2 05:08:11 CST 2015
device-mapper-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-libs-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-event-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-event-libs-1.02.94-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 08:43:06 CDT 2014
cmirror-2.02.117-1.el6    BUILT: Wed Mar  4 09:30:04 CST 2015


BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
IP: [<0000000000000004>] 0x4
PGD 0 
Oops: 0010 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-9/dm/suspended
CPU 0 
Modules linked in: dm_snapshot dm_bufio dm_raid raid10 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx iptable_filter ip_tables autofs4 sg sd_mod crc_t10dif be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_multipath microcode serio_raw virtio_balloon virtio_net i2c_piix4 i2c_core ext4 jbd2 mbcache virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded: speedstep_lib]

Pid: 26, comm: md_misc/0 Not tainted 2.6.32-544.el6.x86_64 #1 Red Hat KVM
RIP: 0010:[<0000000000000004>]  [<0000000000000004>] 0x4
RSP: 0018:ffff88003ea5be08  EFLAGS: 00010202
RAX: 0000000000000004 RBX: ffff880037fb6800 RCX: ffff880002218e88
RDX: 0000000000000000 RSI: ffff880002218e88 RDI: 000000066474e551
RBP: ffff88003ea5be20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff880002218e80
R13: ffffffffa042e3e0 R14: ffff88003ea5bfd8 R15: ffff880002218e88
FS:  0000000000000000(0000) GS:ffff880002200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000004 CR3: 000000003ce30000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process md_misc/0 (pid: 26, threadinfo ffff88003ea58000, task ffff88003ea4eab0)
Stack:
 ffffffffa0005859 ffff88003ea5be30 ffff880002218e80 ffff88003ea5be30
<d> ffffffffa042e3f9 ffff88003ea5bee0 ffffffff8109a730 0000000000000000
<d> 0000000000000000 ffff88003ea5be60 ffff88003ea4f128 ffff88003ea4eab0
Call Trace:
 [<ffffffffa0005859>] ? dm_table_event+0x49/0x60 [dm_mod]
 [<ffffffffa042e3f9>] do_table_event+0x19/0x20 [dm_raid]
 [<ffffffff8109a730>] worker_thread+0x170/0x2a0
 [<ffffffff810a1300>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a5c0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a0e6e>] kthread+0x9e/0xc0
 [<ffffffff8109a5c0>] ? worker_thread+0x0/0x2a0
 [<ffffffff8100c28a>] child_rip+0xa/0x20
 [<ffffffff810a0dd0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20
Code:  Bad RIP value.
RIP  [<0000000000000004>] 0x4
 RSP <ffff88003ea5be08>
CR2: 0000000000000004
---[ end trace ac1a7c7bdfa3a583 ]---
Kernel panic - not syncing: Fatal exception
Pid: 26, comm: md_misc/0 Tainted: G      D    ---------------    2.6.32-544.el6.x86_64 #1
Call Trace:
 [<ffffffff8153426f>] ? panic+0xa7/0x16f
 [<ffffffff81539044>] ? oops_end+0xe4/0x100
 [<ffffffff8104e8cb>] ? no_context+0xfb/0x260
 [<ffffffff8104eb55>] ? __bad_area_nosemaphore+0x125/0x1e0
 [<ffffffff8104ec23>] ? bad_area_nosemaphore+0x13/0x20
 [<ffffffff8104f31c>] ? __do_page_fault+0x30c/0x500
 [<ffffffff8106f6d2>] ? enqueue_entity+0x112/0x440
 [<ffffffff810602c4>] ? check_preempt_wakeup+0x1a4/0x260
 [<ffffffff8106fa64>] ? enqueue_task_fair+0x64/0x100
 [<ffffffff8105a7ec>] ? check_preempt_curr+0x7c/0x90
 [<ffffffff810670de>] ? try_to_wake_up+0x24e/0x3e0
 [<ffffffff8153af6e>] ? do_page_fault+0x3e/0xa0
 [<ffffffffa042e3e0>] ? do_table_event+0x0/0x20 [dm_raid]
 [<ffffffff81538325>] ? page_fault+0x25/0x30
 [<ffffffffa042e3e0>] ? do_table_event+0x0/0x20 [dm_raid]
 [<ffffffffa0005859>] ? dm_table_event+0x49/0x60 [dm_mod]
 [<ffffffffa042e3f9>] ? do_table_event+0x19/0x20 [dm_raid]
 [<ffffffff8109a730>] ? worker_thread+0x170/0x2a0
 [<ffffffff810a1300>] ? autoremove_wake_function+0x0/0x40
 [<ffffffff8109a5c0>] ? worker_thread+0x0/0x2a0
 [<ffffffff810a0e6e>] ? kthread+0x9e/0xc0
 [<ffffffff8109a5c0>] ? worker_thread+0x0/0x2a0
 [<ffffffff8100c28a>] ? child_rip+0xa/0x20
 [<ffffffff810a0dd0>] ? kthread+0x0/0xc0
 [<ffffffff8100c280>] ? child_rip+0x0/0x20

Comment 6 Corey Marthaler 2015-03-19 23:41:52 UTC
Created attachment 1004251
my virt setup information

Comment 11 Corey Marthaler 2015-03-20 22:13:28 UTC
Heinz,

My virt nodes use iSCSI devices for storage.

Comment 12 Heinz Mauelshagen 2015-03-26 19:18:01 UTC
(In reply to Corey Marthaler from comment #11)
> Heinz,
> 
> My virt nodes use iscsi devices for storage.

Can you reproduce this on your VMs with another type of storage?

Comment 17 Stephen Gilson 2015-04-13 19:02:35 UTC
This issue needs to be described in the Release Notes for RHEL 6.7.

Content Services needs your input to make that happen.

Please complete the Doc Text field for this bug by April 20 using the Cause, Consequence, Workaround, and Result model, as follows:

Cause: The actions or circumstances that cause this bug to occur on a customer's system.

Consequence: What happens to the customer's system or application when the bug occurs.

Workaround (if any): If a workaround for the issue exists, describe it in detail. If more than one workaround is available, describe each one.

Result: Describe what happens when a workaround is applied. If the issue is completely circumvented by the workaround, state so. Any side effects caused by the workaround should also be noted here. If no reliable workaround exists, try to describe some preventive measures that help to avoid the bug scenario.

Comment 26 Jonathan Earl Brassow 2015-08-11 19:34:08 UTC
Moving back to ASSIGNED so that any updated patch is not forgotten for posting in 6.8.