871630 – DM RAID: kernel panic when attempting to activate partial RAID LV (i.e. an array that has missing devices)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	6.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	rc
Target Release:	---
Assignee:	Jonathan Earl Brassow
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	867644
Depends On:
Blocks:

Reported:	2012-10-30 22:07 UTC by Jonathan Earl Brassow
Modified:	2013-02-21 06:54 UTC
CC List:	1 user
Fixed In Version:	kernel-2.6.32-339.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2013-02-21 06:54:41 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2013:0496	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 6 kernel update	2013-02-20 21:40:54 UTC

Description Jonathan Earl Brassow 2012-10-30 22:07:22 UTC

Trying to activate a RAID LV that has missing devices results in a nasty kernel panic:

[root@bp-02 ~]# lvchange -ay --partial vg/raid1
  PARTIAL MODE. Incomplete logical volumes will be processed.
  ** hang/machine_reboot **

From the console:
device-mapper: raid: Failed to read superblock of device at position 0
general protection fault: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/block/dm-7/queue/scheduler
CPU 6 
Modules linked in: dm_raid raid10 raid1 raid456 async_raid6_recov async_pq raid6_pq async_xor xor async_memcpy async_tx bfa sunrpc ipv6 power_meter microcode dcdbas serio_raw fam15h_power k10temp i2c_piix4 i2c_core amd64_edac_mod edac_core edac_mce_amd bnx2 sg ext4 mbcache jbd2 sr_mod cdrom sd_mod crc_t10dif pata_acpi ata_generic pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas scsi_transport_fc scsi_tgt dm_mirror dm_region_hash dm_log dm_mod [last unloaded: bfa]

Pid: 2304, comm: lvchange Not tainted 2.6.32-335.el6.x86_64 #1 Dell Inc. PowerEdge R415/08WNM9
RIP: 0010:[<ffffffffa00d2e8d>]  [<ffffffffa00d2e8d>] raid_ctr+0xdcd/0x1274 [dm_raid]
RSP: 0018:ffff880416c69c68  EFLAGS: 00010297
RAX: dead000000200200 RBX: ffff8804172c5000 RCX: ffff8804172c5438
RDX: dead000000100100 RSI: ffffffff81fc7440 RDI: ffff8804172c5448
RBP: ffff880416c69d08 R08: ffff8804172c5448 R09: 0000000000000249
R10: ffff880220076f80 R11: 0000000000000000 R12: dead000000100100
R13: ffff8804172c55c8 R14: ffff8804172c5028 R15: 0000000000000000
FS:  00007ffa378fd700(0000) GS:ffff880227400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000002830000 CR3: 00000004189fc000 CR4: 00000000000406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process lvchange (pid: 2304, threadinfo ffff880416c68000, task ffff880417742ae0)
Stack:
 0000000000000180 0000000000000002 ffffffffa00d35c3 ffff8804172c5010
<d> 0000000216c69d34 0000000000200000 ffff8804172c5028 0000000000000001
<d> ffffc90007424040 ffff8804172c5438 ffffc9000741e160 0000000000000400
Call Trace:
 [<ffffffffa0005f7f>] dm_table_add_target+0x13f/0x3b0 [dm_mod]
 [<ffffffffa00086f9>] table_load+0xc9/0x340 [dm_mod]
 [<ffffffffa0009984>] ctl_ioctl+0x1b4/0x270 [dm_mod]
 [<ffffffffa0008630>] ? table_load+0x0/0x340 [dm_mod]
 [<ffffffffa0009a53>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
 [<ffffffff811907d2>] vfs_ioctl+0x22/0xa0
 [<ffffffff81190974>] do_vfs_ioctl+0x84/0x580
 [<ffffffff81190ef1>] sys_ioctl+0x81/0xa0
 [<ffffffff810d8255>] ? __audit_syscall_exit+0x265/0x290
 [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: a8 4c 89 e7 41 83 ef 01 48 c7 41 08 00 00 00 00 49 c7 44 24 30 00 00 00 00 e8 b0 15 1b e1 4d 8b 24 24 4d 39 f4 0f 84 38 01 00 00 <49> 83 7c 24 28 00 74 eb 49 8b 4c 24 38 49 c7 44 24 68 00 00 00 
RIP  [<ffffffffa00d2e8d>] raid_ctr+0xdcd/0x1274 [dm_raid]
 RSP <ffff880416c69c68>


Steps to reproduce:
~> lvcreate --type raid1 -m 1 -L 1G -n lv vg
# Wait for sync
~> vgchange -an vg
# Disable a device in vg/lv
~> lvchange -ay --partial vg/lv  ######## BANG!

Comment 1 Jonathan Earl Brassow 2012-10-30 22:08:32 UTC

This bug was not present in 6.3; it turned up in 6.4 testing.

Comment 2 RHEL Program Management 2012-10-30 22:11:03 UTC

This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.

Comment 3 Jonathan Earl Brassow 2012-10-31 01:54:13 UTC

Issue is not present in upstream kernel (3.7.0-rc2).

Comment 4 Jonathan Earl Brassow 2012-10-31 03:24:45 UTC

Turns out we've already been over this problem upstream. Here's the upstream commit that fixed it:
commit a9ad8526bb1af0741a5c0e01155dac08e7bdde60
Author: Jonathan Brassow <jbrassow>
Date:   Tue Apr 24 10:23:13 2012 +1000

    DM RAID: Use safe version of rdev_for_each
    
    Fix segfault caused by using rdev_for_each instead of rdev_for_each_safe
    
    Commit dafb20fa34320a472deb7442f25a0c086e0feb33 mistakenly replaced a safe
    iterator with an unsafe one when making some macro changes.
    
    Signed-off-by: Jonathan Brassow <jbrassow>
    Signed-off-by: NeilBrown <neilb>

Comment 7 Corey Marthaler 2012-10-31 21:02:21 UTC

FWIW, this can be reproduced with the following:

./raid_sanity -t raid10 -e vgcfgrestore_raid_with_missing_pv

Comment 8 Jarod Wilson 2012-11-06 21:19:36 UTC

Patch(es) available on kernel-2.6.32-339.el6

Comment 11 Corey Marthaler 2012-11-14 21:00:27 UTC

This has been verified fixed in the latest kernel (2.6.32-339.el6.x86_64).

SCENARIO (raid10) - [vgcfgrestore_raid_with_missing_pv]
Create a raid, force remove a leg, and then restore it's VG
taft-01: lvcreate --type raid10 -i 2 -n missing_pv_raid -L 100M --nosync raid_sanity
WARNING: New raid10 won't be synchronised. Don't read what you didn't write!
Deactivating missing_pv_raid raid
Backup the VG config
taft-01 vgcfgbackup -f /tmp/raid_sanity.bkup.6320 raid_sanity
Force removing PV /dev/sdc2 (used in this raid)
taft-01: 'echo y | pvremove -ff /dev/sdc2'
Really WIPE LABELS from physical volume "/dev/sdc2" of volume group "raid_sanity" [y/n]? WARNING: Wiping physical volume label from /dev/sdc2 of volume group "raid_sanity"
Verifying that this VG is now corrupt
No physical volume label read from /dev/sdc2
Failed to read physical volume "/dev/sdc2"
Attempt to restore the VG back to it's original state (should not segfault)
taft-01 vgcfgrestore -f /tmp/raid_sanity.bkup.6320 raid_sanity
Couldn't find device with uuid yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs.
Cannot restore Volume Group raid_sanity with 1 PVs marked as missing.
Restore failed.
Checking syslog to see if vgcfgrestore segfaulted
Activating VG in partial readonly mode
taft-01 vgchange -ay --partial raid_sanity
PARTIAL MODE. Incomplete logical volumes will be processed.
Couldn't find device with uuid yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs.
Recreating PV using it's old uuid
taft-01 pvcreate --norestorefile --uuid "yRBOXP-7IVO-3yeH-dvtr-wG8H-KZJo-Ah4yRs" /dev/sdc2
Restoring the VG back to it's original state
taft-01 vgcfgrestore -f /tmp/raid_sanity.bkup.6320 raid_sanity
Reactivating VG
Deactivating raid missing_pv_raid... and removing

Comment 12 Jonathan Earl Brassow 2012-11-19 15:26:39 UTC

*** Bug 867644 has been marked as a duplicate of this bug. ***

Comment 14 errata-xmlrpc 2013-02-21 06:54:41 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0496.html
