Bug 871630

Summary: DM RAID: kernel panic when attempting to activate partial RAID LV (i.e. an array that has missing devices)
Product: Red Hat Enterprise Linux 6
Reporter: Jonathan Earl Brassow <jbrassow>
Component: kernel
Assignee: Jonathan Earl Brassow <jbrassow>
Status: CLOSED ERRATA
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: unspecified
Priority: unspecified
Version: 6.3
CC: cmarthal
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-2.6.32-339.el6
Doc Type: Bug Fix
Last Closed: 2013-02-21 06:54:41 UTC
Type: Bug

Description Jonathan Earl Brassow 2012-10-30 22:07:22 UTC
Trying to activate a RAID LV that has missing devices results in a nasty kernel panic:

[root@bp-02 ~]# lvchange -ay --partial vg/raid1
  PARTIAL MODE. Incomplete logical volumes will be processed.
  ** hang/machine_reboot **

From the console:
device-mapper: raid: Failed to read superblock of device at position 0
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-7/queue/scheduler
CPU 6 
Modules linked in: dm_raid raid10 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx bfa sunrpc ipv6 power_meter microcode dcdbas serio_raw fam15h_power k10temp i2c_piix4 i2c_core amd64_edac_mod edac_core edac_mce_amd bnx2 sg ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: bfa]

Pid: 2304, comm: lvchange Not tainted 2.6.32-335.el6.x86_64 #1 Dell Inc. PowerEdge R415/08WNM9
RIP: 0010:[<ffffffffa00d2e8d>]  [<ffffffffa00d2e8d>] raid_ctr+0xdcd/0x1274 [dm_raid]
RSP: 0018:ffff880416c69c68  EFLAGS: 00010297
RAX: dead000000200200 RBX: ffff8804172c5000 RCX: ffff8804172c5438
RDX: dead000000100100 RSI: ffffffff81fc7440 RDI: ffff8804172c5448
RBP: ffff880416c69d08 R08: ffff8804172c5448 R09: 0000000000000249
R10: ffff880220076f80 R11: 0000000000000000 R12: dead000000100100
R13: ffff8804172c55c8 R14: ffff8804172c5028 R15: 0000000000000000
FS:  00007ffa378fd700(0000) GS:ffff880227400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002830000 CR3: 00000004189fc000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lvchange (pid: 2304, threadinfo ffff880416c68000, task ffff880417742ae0)
Stack:
 0000000000000180 0000000000000002 ffffffffa00d35c3 ffff8804172c5010
<d> 0000000216c69d34 0000000000200000 ffff8804172c5028 0000000000000001
<d> ffffc90007424040 ffff8804172c5438 ffffc9000741e160 0000000000000400
Call Trace:
 [<ffffffffa0005f7f>] dm_table_add_target+0x13f/0x3b0 [dm_mod]
 [<ffffffffa00086f9>] table_load+0xc9/0x340 [dm_mod]
 [<ffffffffa0009984>] ctl_ioctl+0x1b4/0x270 [dm_mod]
 [<ffffffffa0008630>] ? table_load+0x0/0x340 [dm_mod]
 [<ffffffffa0009a53>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
 [<ffffffff811907d2>] vfs_ioctl+0x22/0xa0
 [<ffffffff81190974>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81190ef1>] sys_ioctl+0x81/0xa0
 [<ffffffff810d8255>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: a8 4c 89 e7 41 83 ef 01 48 c7 41 08 00 00 00 00 49 c7 44 24 30 00 00 00 00 e8 b0 15 1b e1 4d 8b 24 24 4d 39 f4 0f 84 38 01 00 00 <49> 83 7c 24 28 00 74 eb 49 8b 4c 24 38 49 c7 44 24 68 00 00 00 
RIP  [<ffffffffa00d2e8d>] raid_ctr+0xdcd/0x1274 [dm_raid]
 RSP <ffff880416c69c68>


Steps to reproduce:
~> lvcreate --type raid1 -m 1 -L 1G -n lv vg
# Wait for sync
~> vgchange -ay vg
# Disable a device in vg/lv
~> lvchange -ay --partial vg/lv  ######## BANG!

Comment 1 Jonathan Earl Brassow 2012-10-30 22:08:32 UTC
This bug was not present in 6.3 - it has turned up in 6.4 testing.

Comment 2 RHEL Program Management 2012-10-30 22:11:03 UTC
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 3 Jonathan Earl Brassow 2012-10-31 01:54:13 UTC
Issue is not present in upstream kernel (3.7.0-rc2).

Comment 4 Jonathan Earl Brassow 2012-10-31 03:24:45 UTC
Turns out, we've been over this problem upstream already:

Here's the upstream commit that fixed this problem:
commit a9ad8526bb1af0741a5c0e01155dac08e7bdde60
Author: Jonathan Brassow <jbrassow>
Date:   Tue Apr 24 10:23:13 2012 +1000

    DM RAID: Use safe version of rdev_for_each
    
    Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe
    
    Commit dafb20fa34320a472deb7442f25a0c086e0feb33 mistakenly replaced a safe
    iterator with an unsafe one when making some macro changes.
    
    Signed-off-by: Jonathan Brassow <jbrassow>
    Signed-off-by: NeilBrown <neilb>
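
The register dump above is consistent with exactly that: RDX/R12 hold
dead000000100100 and RAX holds dead000000200200, the x86_64 LIST_POISON
values that list_del() writes into an unlinked entry. In other words, the
superblock read fails for the missing device, the constructor's loop unlinks
that rdev, and the unsafe iterator then chases the poisoned next pointer.
The safe variant (rdev_for_each_safe(), built on list_for_each_entry_safe())
caches the next pointer before the loop body runs, so unlinking the current
entry is harmless. Below is a minimal, self-contained userspace sketch of
that difference; the struct and field names (fake_rdev, superblock_ok) are
invented for illustration, and this is not the drivers/md/dm-raid.c code.

/*
 * Userspace sketch only -- not kernel code.  Shows why deleting the
 * current node during an "unsafe" list walk dereferences freed memory,
 * and how caching the successor pointer first avoids it.
 */
#include <stdio.h>
#include <stdlib.h>

struct fake_rdev {
    int superblock_ok;              /* stand-in for a readable superblock */
    struct fake_rdev *next;
};

int main(void)
{
    struct fake_rdev *head = NULL, *n, *tmp, **prev;

    /* Build a three-entry list; the middle entry has a "bad superblock". */
    for (int i = 0; i < 3; i++) {
        n = malloc(sizeof(*n));
        n->superblock_ok = (i != 1);
        n->next = head;
        head = n;
    }

    /*
     * Unsafe form (what plain rdev_for_each amounts to):
     *     for (n = head; n; n = n->next)
     *         if (!n->superblock_ok)
     *             free(n);   // n->next is then read from freed memory
     *
     * Safe form: grab the successor before the body may unlink/free n.
     */
    prev = &head;
    for (n = head; n; n = tmp) {
        tmp = n->next;              /* cached up front, like the _safe macro */
        if (!n->superblock_ok) {
            *prev = tmp;            /* unlink, analogous to list_del() */
            free(n);
            continue;
        }
        prev = &n->next;
    }

    for (n = head; n; n = n->next)
        printf("kept a device with a good superblock\n");

    /* Tear down what is left. */
    for (n = head; n; n = tmp) {
        tmp = n->next;
        free(n);
    }
    return 0;
}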

Comment 7 Corey Marthaler 2012-10-31 21:02:21 UTC
FWIW, this can be reproduced with the following:

./raid_sanity -t raid10 -e vgcfgrestore_raid_with_missing_pv

Comment 8 Jarod Wilson 2012-11-06 21:19:36 UTC
Patch(es) available on kernel-2.6.32-339.el6

Comment 11 Corey Marthaler 2012-11-14 21:00:27 UTC
This has been verified fixed in the latest kernel (2.6.32-339.el6.x86_64).

SCENARIO (raid10) - [vgcfgrestore_raid_with_missing_pv]
Create a raid, force remove a leg, and then restore its VG
taft-01: lvcreate --type raid10 -i 2 -n missing_pv_raid -L 100M --nosync raid_sanity
  WARNING: New raid10 won't be synchronised. Don't read what you didn't write!
Deactivating missing_pv_raid raid
Backup the VG config
taft-01 vgcfgbackup -f /tmp/raid_sanity.bkup.6320 raid_sanity
Force removing PV /dev/sdc2 (used in this raid)
taft-01: 'echo y | pvremove -ff /dev/sdc2'
Really WIPE LABELS from physical volume "/dev/sdc2" of volume group "raid_sanity" [y/n]?   WARNING: Wiping physical volume label from /dev/sdc2 of volume group "raid_sanity"
Verifying that this VG is now corrupt
  No physical volume label read from /dev/sdc2
  Failed to read physical volume "/dev/sdc2"
Attempt to restore the VG back to its original state (should not segfault)
taft-01 vgcfgrestore -f /tmp/raid_sanity.bkup.6320 raid_sanity
  Couldn't find device with uuid yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs.
  Cannot restore Volume Group raid_sanity with 1 PVs marked as missing.
  Restore failed.
Checking syslog to see if vgcfgrestore segfaulted
Activating VG in partial readonly mode
taft-01 vgchange -ay --partial raid_sanity
  PARTIAL MODE. Incomplete logical volumes will be processed.
  Couldn't find device with uuid yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs.
Recreating PV using its old uuid
taft-01 pvcreate --norestorefile --uuid "yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs" /dev/sdc2
Restoring the VG back to its original state
taft-01 vgcfgrestore -f /tmp/raid_sanity.bkup.6320 raid_sanity
Reactivating VG
Deactivating raid missing_pv_raid... and removing

Comment 12 Jonathan Earl Brassow 2012-11-19 15:26:39 UTC
*** Bug 867644 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2013-02-21 06:54:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html