RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1060294 - raid10 multiple device failure scenario causes vgreduce to deadlock
Summary: raid10 multiple device failure scenario causes vgreduce to deadlock
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: x86_64
OS: Linux
medium
high
Target Milestone: rc
: ---
Assignee: Heinz Mauelshagen
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: BrassowRHEL7Bugs 1205796
TreeView+ depends on / blocked
 
Reported: 2014-01-31 17:34 UTC by Corey Marthaler
Modified: 2023-03-08 07:26 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-05-22 20:48:12 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
kernel stack (467.57 KB, text/plain)
2014-01-31 17:36 UTC, Corey Marthaler
no flags Details

Description Corey Marthaler 2014-01-31 17:34:27 UTC
Description of problem:
This may be related to bug 1031204.

./black_bird -f -e kill_two_synced_raid10_3legs

================================================================================
Iteration 0.33 started at Thu Jan 30 18:48:58 CST 2014
================================================================================
Scenario kill_two_synced_raid10_3legs: Kill two legs (none of which share the same stripe leg) of synced 3 leg raid10 volume(s)

********* RAID hash info for this scenario *********
* names:              synced_two_raid10_3legs_1
* sync:               1
* type:               raid10
* -m |-i value:       3
* leg devices:        /dev/sdc1 /dev/sdb1 /dev/sda1 /dev/sdd1 /dev/sde1 /dev/sdh1
* failpv(s):          /dev/sdc1 /dev/sda1
* failnode(s):        host-002.virt.lab.msp.redhat.com
* additional snap:    /dev/sdb1
* lvmetad:             1
* raid fault policy:   warn
******************************************************

Creating raids(s) on host-002.virt.lab.msp.redhat.com...
host-002.virt.lab.msp.redhat.com: lvcreate --type raid10 -i 3 -n synced_two_raid10_3legs_1 -L 500M black_bird /dev/sdc1:0-2000 /dev/sdb1:0-2000 /dev/sda1:0-2000 /dev/sdd1:0-2000 /dev/sde1:0-2000 /dev/sdh1:0-2000
WARNING: ext3 signature detected on /dev/black_bird/synced_two_raid10_3legs_1 at offset 1080. Wipe it? [y/n] 
  1 existing signature left on the device.

Current mirror/raid device structure(s):
  LV                                   Attr       LSize   Cpy%Sync Devices
   synced_two_raid10_3legs_1            rwi-aor--- 504.00m     5.56 synced_two_raid10_3legs_1_rimage_0(0),synced_two_raid10_3legs_1_rimage_1(0),synced_two_raid10_3legs_1_rimage_2(0),synced_two_raid10_3legs_1_rimage_3(0),synced_two_raid10_3legs_1_rimage_4(0),synced_two_raid10_3legs_1_rimage_5(0)
   [synced_two_raid10_3legs_1_rimage_0] Iwi-aor--- 168.00m          /dev/sdc1(1)
   [synced_two_raid10_3legs_1_rimage_1] Iwi-aor--- 168.00m          /dev/sdb1(1)
   [synced_two_raid10_3legs_1_rimage_2] Iwi-aor--- 168.00m          /dev/sda1(1)
   [synced_two_raid10_3legs_1_rimage_3] Iwi-aor--- 168.00m          /dev/sdd1(1)
   [synced_two_raid10_3legs_1_rimage_4] Iwi-aor--- 168.00m          /dev/sde1(1)
   [synced_two_raid10_3legs_1_rimage_5] Iwi-aor--- 168.00m          /dev/sdh1(1)
   [synced_two_raid10_3legs_1_rmeta_0]  ewi-aor---   4.00m          /dev/sdc1(0)
   [synced_two_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdb1(0)
   [synced_two_raid10_3legs_1_rmeta_2]  ewi-aor---   4.00m          /dev/sda1(0)
   [synced_two_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdd1(0)
   [synced_two_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sde1(0)
   [synced_two_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sdh1(0)

* NOTE: not enough available devices for allocation fault polices to fully work *
(well technically, since we have 1, some allocation should work)

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 56.36% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Creating ext on top of mirror(s) on host-002.virt.lab.msp.redhat.com...
mke2fs 1.42.9 (28-Dec-2013)
Mounting mirrored ext filesystems on host-002.virt.lab.msp.redhat.com...

PV=/dev/sda1
        synced_two_raid10_3legs_1_rimage_2: 2
        synced_two_raid10_3legs_1_rmeta_2: 2
PV=/dev/sdc1
        synced_two_raid10_3legs_1_rimage_0: 2
        synced_two_raid10_3legs_1_rmeta_0: 2

Creating a snapshot volume of each of the raids

Writing verification files (checkit) to mirror(s) on...
        ---- host-002.virt.lab.msp.redhat.com ----

Sleeping 15 seconds to get some outsanding EXT I/O locks before the failure 
Verifying files (checkit) on mirror(s) on...
        ---- host-002.virt.lab.msp.redhat.com ----

Disabling device sdc on host-002.virt.lab.msp.redhat.com (lvmetad is on)
Disabling device sda on host-002.virt.lab.msp.redhat.com (lvmetad is on)

Getting recovery check start time from /var/log/messages: Jan 30 18:49
Attempting I/O to cause mirror down conversion(s) on host-002.virt.lab.msp.redhat.com
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 0.354805 s, 118 MB/s

Verifying current sanity of lvm after the failure

Current mirror/raid device structure(s):
  PV OuO3aj-5uki-4uH3-H03t-uoOc-rE5R-BfJj4f not recognised. Is the device missing?
  [...]
  PV UgfkxC-hZBz-mbks-GYO1-V3h1-he4F-ixdwRf not recognised. Is the device missing?
  LV                                   Attr       LSize   Cpy%Sync Devices
   bb_snap1                             swi-a-s--- 252.00m          /dev/sdb1(43)
   synced_two_raid10_3legs_1            owi-aor-p- 504.00m   100.00 synced_two_raid10_3legs_1_rimage_0(0),synced_two_raid10_3legs_1_rimage_1(0),synced_two_raid10_3legs_1_rimage_2(0),synced_two_raid10_3legs_1_rimage_3(0),synced_two_raid10_3legs_1_rimage_4(0),synced_two_raid10_3legs_1_rimage_5(0)
   [synced_two_raid10_3legs_1_rimage_0] iwi-aor-p- 168.00m          unknown device(1)
   [synced_two_raid10_3legs_1_rimage_1] iwi-aor--- 168.00m          /dev/sdb1(1)
   [synced_two_raid10_3legs_1_rimage_2] iwi-aor-p- 168.00m          unknown device(1)
   [synced_two_raid10_3legs_1_rimage_3] iwi-aor--- 168.00m          /dev/sdd1(1)
   [synced_two_raid10_3legs_1_rimage_4] iwi-aor--- 168.00m          /dev/sde1(1)
   [synced_two_raid10_3legs_1_rimage_5] iwi-aor--- 168.00m          /dev/sdh1(1)
   [synced_two_raid10_3legs_1_rmeta_0]  ewi-aor-p-   4.00m          unknown device(0)
   [synced_two_raid10_3legs_1_rmeta_1]  ewi-aor---   4.00m          /dev/sdb1(0)
   [synced_two_raid10_3legs_1_rmeta_2]  ewi-aor-p-   4.00m          unknown device(0)
   [synced_two_raid10_3legs_1_rmeta_3]  ewi-aor---   4.00m          /dev/sdd1(0)
   [synced_two_raid10_3legs_1_rmeta_4]  ewi-aor---   4.00m          /dev/sde1(0)
   [synced_two_raid10_3legs_1_rmeta_5]  ewi-aor---   4.00m          /dev/sdh1(0)

Verifying FAILED device /dev/sdc1 is *NOT* in the volume(s)
Verifying FAILED device /dev/sda1 is *NOT* in the volume(s)
Verifying IMAGE device /dev/sdb1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdd1 *IS* in the volume(s)
Verifying IMAGE device /dev/sde1 *IS* in the volume(s)
Verifying IMAGE device /dev/sdh1 *IS* in the volume(s)
verify the rimage/rmeta dm devices remain after the failures

Checking EXISTENCE and STATE of synced_two_raid10_3legs_1_rimage_2 on: host-002.virt.lab.msp.redhat.com 
Checking EXISTENCE and STATE of synced_two_raid10_3legs_1_rmeta_2 on: host-002.virt.lab.msp.redhat.com 
Checking EXISTENCE and STATE of synced_two_raid10_3legs_1_rimage_0 on: host-002.virt.lab.msp.redhat.com 
Checking EXISTENCE and STATE of synced_two_raid10_3legs_1_rmeta_0 on: host-002.virt.lab.msp.redhat.com 

Verify the raid image order is what's expected based on raid fault policy
EXPECTED LEG ORDER: unknown /dev/sdb1 unknown /dev/sdd1 /dev/sde1 /dev/sdh1
ACTUAL LEG ORDER: unknown /dev/sdb1 unknown /dev/sdd1 /dev/sde1 /dev/sdh1

Fault policy is warn, manually repairing failed raid volumes
host-002.virt.lab.msp.redhat.com: 'lvconvert --yes --repair black_bird/synced_two_raid10_3legs_1'
  PV OuO3aj-5uki-4uH3-H03t-uoOc-rE5R-BfJj4f not recognised. Is the device missing?
  PV UgfkxC-hZBz-mbks-GYO1-V3h1-he4F-ixdwRf not recognised. Is the device missing?
  Insufficient suitable allocatable extents for logical volume : 254 more required
  Failed to allocate replacement images for black_bird/synced_two_raid10_3legs_1
  Attempting replacement of 1 devices instead of 2

Waiting until all mirror|raid volumes become fully syncd...
   0/1 mirror(s) are fully synced: ( 20.96% )
   1/1 mirror(s) are fully synced: ( 100.00% )

Verifying files (checkit) on mirror(s) on...
        ---- host-002.virt.lab.msp.redhat.com ----

Enabling device sdc on host-002.virt.lab.msp.redhat.com (lvmetad is on)
Enabling device sda on host-002.virt.lab.msp.redhat.com (lvmetad is on)

-------------------------------------------------------------------------------
Force a vgreduce to clean up the corrupt additional LV
-------------------------------------------------------------------------------
  PV OuO3aj-5uki-4uH3-H03t-uoOc-rE5R-BfJj4f not recognised. Is the device missing?
  PV UgfkxC-hZBz-mbks-GYO1-V3h1-he4F-ixdwRf not recognised. Is the device missing?
  PV OuO3aj-5uki-4uH3-H03t-uoOc-rE5R-BfJj4f not recognised. Is the device missing?
  PV UgfkxC-hZBz-mbks-GYO1-V3h1-he4F-ixdwRf not recognised. Is the device missing?
  Internal error: Performing unsafe table load while 18 device(s) are known to be suspended:  (253:16) 
  
  [DEADLOCK]


# From syslog:
qarshd[25797]: Running cmdline: echo "- - -" > /sys/class/scsi_host/host2/scan
kernel: [ 7427.071965] scsi 2:0:0:1: Direct-Access     IET      VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
kernel: [ 7427.076708] sd 2:0:0:1: Attached scsi generic sg8 type 0
kernel: [ 7427.077472] sd 2:0:0:1: [sda] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
kernel: [ 7427.081328] sd 2:0:0:1: [sda] Write Protect is off
kernel: [ 7427.082969] sd 2:0:0:1: [sda] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
kernel: [ 7427.091721]  sda: sda1
kernel: [ 7427.098108] sd 2:0:0:1: [sda] Attached SCSI disk
qarshd[25806]: Running cmdline: echo "- - -" > /sys/class/scsi_host/host7/scan
kernel: [ 7427.206696] scsi 7:0:0:1: Direct-Access     IET      VIRTUAL-DISK     0001 PQ: 0 ANSI: 5
kernel: [ 7427.210626] sd 7:0:0:1: [sdc] 20971520 512-byte logical blocks: (10.7 GB/10.0 GiB)
kernel: [ 7427.212202] sd 7:0:0:1: Attached scsi generic sg10 type 0
kernel: [ 7427.217196] sd 7:0:0:1: [sdc] Write Protect is off
kernel: [ 7427.220121] sd 7:0:0:1: [sdc] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA
kernel: [ 7427.231385]  sdc: sdc1
kernel: [ 7427.239565] sd 7:0:0:1: [sdc] Attached SCSI disk
qarshd[25810]: Running cmdline: vgreduce --removemissing --force black_bird
systemd: Starting LVM2 PV scan on device 8:33...
systemd: Starting LVM2 PV scan on device 8:1...
pvscan: WARNING: Inconsistent metadata found for VG black_bird
systemd: Started LVM2 PV scan on device 8:33.
kernel: [ 7427.545488] device-mapper: raid: Failed to read superblock of device at position 0
pvscan: WARNING: Inconsistent metadata found for VG black_bird
pvscan: Refusing refresh of partial LV synced_two_raid10_3legs_1. Use --partial to override.
lvm[3300]: No longer monitoring snapshot black_bird-bb_snap1
lvm[3300]: Monitoring snapshot black_bird-bb_snap1
kernel: [ 7427.671092] md/raid10:mdX: active with 5 out of 6 devices
kernel: [ 7427.673447] created bitmap (1 pages) for device mdX
lvm[3300]: No longer monitoring snapshot black_bird-bb_snap1
lvm[3300]: No longer monitoring RAID device black_bird-synced_two_raid10_3legs_1-real for events.
pvscan: Refusing refresh of partial LV synced_two_raid10_3legs_1. Use --partial to override.
kernel: [ 7427.753724] mdX: bitmap initialized from disk: read 1 pages, set 112 of 1008 bits
lvm[3300]: Monitoring snapshot black_bird-bb_snap1
pvscan: Refusing refresh of partial LV synced_two_raid10_3legs_1. Use --partial to override.
pvscan: Attempted to decrement suspended device counter below zero.
pvscan: Refusing refresh of partial LV synced_two_raid10_3legs_1. Use --partial to override.
lvm[3300]: No longer monitoring snapshot black_bird-bb_snap1
systemd-udevd: worker [25799] /devices/virtual/block/dm-13 timeout; kill it
systemd-udevd: seq 8955 '/devices/virtual/block/dm-13' killed
systemd-udevd: worker [25800] /devices/virtual/block/dm-15 timeout; kill it
systemd-udevd: seq 8967 '/devices/virtual/block/dm-15' killed


Version-Release number of selected component (if applicable):
3.10.0-80.el7.x86_64

lvm2-2.02.105-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
lvm2-libs-2.02.105-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
lvm2-cluster-2.02.105-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
device-mapper-1.02.84-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
device-mapper-libs-1.02.84-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
device-mapper-event-1.02.84-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
device-mapper-event-libs-1.02.84-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014
device-mapper-persistent-data-0.2.8-3.el7    BUILT: Fri Dec 27 13:40:56 CST 2013
cmirror-2.02.105-2.el7    BUILT: Sun Jan 26 09:00:43 CST 2014

Comment 1 Corey Marthaler 2014-01-31 17:36:33 UTC
Created attachment 857936 [details]
kernel stack

Comment 6 Jonathan Earl Brassow 2015-04-10 22:16:33 UTC
corey, is it possible to reproduce this?

I tried some time back (March 2014) but was never able to reproduce it.

I will close if you are unable.

Comment 7 Corey Marthaler 2015-05-22 20:48:12 UTC
I don't remember seeing this in the testing of the latest 7.1 rpms. Let's close this and we can reopen it if this is reproduced in 7.2.


Note You need to log in before you can comment on or make changes to this bug.