Bug 892991 - [lvmetad] RAID or mirror leg failure is not handled when using lvmetad
Summary: [lvmetad] RAID or mirror leg failure is not handled when using lvmetad
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Petr Rockai
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On: 889465
Blocks:
 
Reported: 2013-01-08 12:02 UTC by Marian Csontos
Modified: 2014-10-14 08:23 UTC
CC List: 12 users

Fixed In Version: lvm2-2.02.108-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: When using lvmetad, dmeventd could see metadata that was not up to date at the time of a RAID volume repair.
Consequence: The repair would not proceed because, based on the outdated information, the RAID volume appeared healthy.
Fix: The repair code now forces a refresh of metadata for the PVs that host the RAID volume.
Result: Automatic RAID volume repair using dmeventd and manual repair using lvconvert --repair now work as expected with or without lvmetad enabled.
Clone Of:
Environment:
Last Closed: 2014-10-14 08:23:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System: Red Hat Product Errata
ID: RHBA-2014:1387
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: lvm2 bug fix and enhancement update
Last Updated: 2014-10-14 01:39:47 UTC

Description Marian Csontos 2013-01-08 12:02:08 UTC
Description of problem:
When using lvmetad, removing a leg from a RAID LV is not handled by dmeventd, nor is the failure visible to userspace.

Effectively, mirrors and RAID volumes are not usable with lvmetad.

Version-Release number of selected component (if applicable):
lvm2-2.02.98-6.18 (release 6.el6 + additional patches designed for 7.el6)

How reproducible:
100%

Steps to Reproduce:
0. start lvmetad
1. create a RAID or mirror LV
2. remove one of the LV's legs (a command-level sketch follows below)
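
A minimal command-level sketch of these steps, assuming a VG named "vg" and /dev/sdc as the leg's backing device (both are illustrative, not from the original report):

    # start lvmetad (standard init script name on RHEL 6)
    service lvm2-lvmetad start

    # create a two-leg RAID1 LV (name and size are arbitrary)
    lvcreate --type raid1 -m 1 -L 256M -n lv vg

    # take one leg's backing device offline to simulate the failure
    echo offline > /sys/block/sdc/device/state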

Actual results:

While lvmetad is running, `lvs -avo+devices` still lists the removed device.

After stopping lvmetad, "unknown device" is listed, but by then it is no longer handled by dmeventd.

I was unable to use vgreduce --removemissing afterwards; I had to resort to overwriting the PV headers.


Expected results:

dmeventd should handle the situation. (When using lvmetad, should it explicitly rescan PVs?)
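
For context, a sketch of forcing that rescan by hand so lvmetad picks up the changed device state (the device path is an illustrative assumption):

    # re-read metadata from a single device and push it to lvmetad
    pvscan --cache /dev/sdc1

    # or re-scan all devices
    pvscan --cache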

Additional info:

Comment 2 Marian Csontos 2013-01-08 13:36:45 UTC
I applied the suggested fix to generate a udev event and wrote "remove" to the device's uevent file, but the problem persists.

Also, dmeventd now incorrectly claims the device was replaced:

Jan  8 06:49:24 zaphodc1-node02 lvm[5748]: Faulty devices in black_bird/synced_primary_raid10_3legs_1 successfully replaced.

This was generated immediately after removing the device, so `pvscan --cache` had little chance to run.

zkabelac is right: dmeventd should either bypass lvmetad or wait for updated metadata.

Comment 3 Marian Csontos 2013-01-08 13:49:04 UTC
The original scenario uses `echo offline > /sys/block/$DEV/device/state` to remove the leg.

Adding `echo remove > /sys/block/$DEV/uevent` has no effect. Will retry with `echo 1 > /sys/block/$DEV/device/delete` (the test may not recover from that - but it will die here anyway...)

Comment 4 Marian Csontos 2013-01-08 14:25:52 UTC
Works with `echo 1 > /sys/block/$DEV/device/delete`. Will check whether this is reliable, as there may still be room for a race where dmeventd handles the missing device before pvscan --cache runs.

Comment 5 RHEL Program Management 2013-01-12 06:47:26 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Petr Rockai 2013-04-29 14:41:14 UTC
Marian, can you please verify that dmsetup info prints the correct leg status in your scenario with "offline"? I suppose it should, and I suspect the problem is in the lvconvert --repair code which uses that info.

Comment 7 Marian Csontos 2013-07-23 14:56:09 UTC
Needinfoing myself.

Comment 8 Marian Csontos 2013-12-02 15:38:53 UTC
Running without lvmetad, I set one of the RAID's legs offline:

    (08:53:31) [root@barb-03c1-node01:~]$ echo offline > /sys/block/sdc/device/state 
    (08:53:44) [root@barb-03c1-node01:~]$ dmsetup info -c
    Name             Maj Min Stat Open Targ Event  UUID                                                                
    VolGroup-lv_swap 253   1 L--w    1    1      0 LVM-DBL1oNg7Kf3uNKw0uYfWXmXOfTo56f1Ckt4N8UXIbdM4W8qk6c7NEd0oQhhSMHQo
    VolGroup-lv_root 253   0 L--w    1    1      0 LVM-DBL1oNg7Kf3uNKw0uYfWXmXOfTo56f1C2DYW25PpysmuNtNoEu5Dk10ZuTKrvmHV
    vg-lv            253   6 L--w    0    1      1 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GR5cRo3cwCqJeHy3BuLDlAYrFHI9AEUlym
    vg-lv_rmeta_1    253   4 L--w    1    1      0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRmitGwCKYbplJDHC7tSYi8n6fHrQZmuLD
    vg-lv_rmeta_0    253   2 L--w    1    1      0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GR3dZZzVdf0RbZWWP8uzhd58AoSMwfC2xe
    vg-lv_rimage_1   253   5 L--w    1    1      0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRpofPyXjbPBOukVoI3Qwe7mHOHqzbIzPL
    vg-lv_rimage_0   253   3 L--w    1    1      0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRyeBuAzW6qy2DD2gzSzJmFGUOXp4Ertdl
    (08:53:48) [root@barb-03c1-node01:~]$ dmsetup status
    VolGroup-lv_swap: 0 4128768 linear 
    VolGroup-lv_root: 0 9519104 linear 
    vg-lv: 0 524288 raid raid1 2 AA 524288/524288 idle 0
    vg-lv_rmeta_1: 0 8192 linear 
    vg-lv_rmeta_0: 0 8192 linear 
    vg-lv_rimage_1: 0 524288 linear 
    vg-lv_rimage_0: 0 524288 linear 

From the DM point of view everything still looks sane, and there is nothing in the messages log.

Now after I run lvs:

    (08:53:55) [root@barb-03c1-node01:~]$ lvs
      /dev/sdc1: read failed after 0 of 512 at 42935844864: Input/output error
      /dev/sdc1: read failed after 0 of 512 at 42935918592: Input/output error
      /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
      /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
      /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
      Couldn't find device with uuid LaR3xv-GunU-HSmZ-pMyK-35q2-3r6K-YyNVSw.
      Couldn't find device with uuid ZPo1VU-EimH-nqUG-Znm0-gxeZ-Sdn1-JdqULS.
      LV      VG       Attr       LSize   Pool Origin Data%  Move Log Cpy%Sync Convert
      lv_root VolGroup -wi-ao----   4.54g                                             
      lv_swap VolGroup -wi-ao----   1.97g                                             
      lv      vg       rwi-a-r-p- 256.00m                               100.00        

The missing leg gets correctly reported:

    (08:54:08) [root@barb-03c1-node01:~]$ dmsetup status
    VolGroup-lv_swap: 0 4128768 linear 
    VolGroup-lv_root: 0 9519104 linear 
    vg-lv: 0 524288 raid raid1 2 DA 524288/524288 idle 0
    vg-lv_rmeta_1: 0 8192 linear 
    vg-lv_rmeta_0: 0 8192 linear 
    vg-lv_rimage_1: 0 524288 linear 
    vg-lv_rimage_0: 0 524288 linear

And this triggers the disk failure handling:

    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: md/raid1:mdX: Disk failure on dm-3, disabling device.
    Dec  2 08:54:08 barb-03c1-node01 kernel: md/raid1:mdX: Operation continuing on 1 devices.
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: Device #0 of raid1 array, vg-lv, has failed.
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 42935844864: Input/output error
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 42935918592: Input/output error
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: Couldn't find device with uuid LaR3xv-GunU-HSmZ-pMyK-35q2-3r6K-YyNVSw.
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: Couldn't find device with uuid ZPo1VU-EimH-nqUG-Znm0-gxeZ-Sdn1-JdqULS.
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 lvm[7045]: Use 'lvconvert --repair vg/lv' to replace failed device.
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
    Dec  2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device

Comment 9 Petr Rockai 2014-07-28 15:39:44 UTC
This should now work as expected since commit 5dc6671bb550f4b480befee03d234373d08e188a: dmeventd first issues lvscan --cache for the affected LV before proceeding with lvconvert --repair, which updates the PV status in lvmetad appropriately.
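
A sketch of the equivalent manual sequence (the vg/lv names are assumptions matching the examples earlier in this bug):

    # refresh lvmetad's view of the PVs backing the affected LV
    lvscan --cache vg/lv

    # then repair the RAID LV; dmeventd's automated path applies the
    # raid_fault_policy configured in lvm.conf
    lvconvert --repair vg/lv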

Comment 12 Nenad Peric 2014-08-04 13:33:14 UTC
The failure is recognized and the device failure is handled as defined in lvm.conf.
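
For reference, a minimal lvm.conf excerpt showing the setting in question (the rest of the configuration is omitted):

    # /etc/lvm/lvm.conf (excerpt)
    activation {
        # "warn" only logs the failure; "allocate" replaces the failed
        # leg with extents from another PV in the VG
        raid_fault_policy = "allocate"
    }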

In the case below, raid_fault_policy is set to "allocate":

[root@tardis-01 raid]# echo offline >/sys/block/sdd/device/state
[root@tardis-01 raid]# lvs -a -o+devices
  PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
  /dev/sdd1: read failed after 0 of 512 at 16104947712: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 16105054208: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 0: Input/output error
  /dev/sdd1: read failed after 0 of 512 at 4096: Input/output error
  PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
  LV               VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                            
  raid1            vg          rwi-aor---   2.00g                                    100.00           raid1_rimage_0(0),raid1_rimage_1(0)
  [raid1_rimage_0] vg          iwi-aor---   2.00g                                                     /dev/sde1(1)                       
  [raid1_rimage_1] vg          iwi-aor---   2.00g                                                     /dev/sdd1(1)                       
  [raid1_rmeta_0]  vg          ewi-aor---   4.00m                                                     /dev/sde1(0)                       
  [raid1_rmeta_1]  vg          ewi-aor---   4.00m                                                     /dev/sdd1(0)                       
  lv_home          vg_tardis01 -wi-ao---- 224.88g                                                     /dev/sda2(12800)                   
  lv_root          vg_tardis01 -wi-ao----  50.00g                                                     /dev/sda2(0)                       
  lv_swap          vg_tardis01 -wi-ao----   4.00g                                                     /dev/sda2(70368)                   
[root@tardis-01 raid]# dmsetup status
vg-raid1_rmeta_1: 0 8192 linear 
vg-raid1_rmeta_0: 0 8192 linear 
vg-raid1_rimage_1: 0 4194304 linear 
vg_tardis01-lv_home: 0 471597056 linear 
vg-raid1_rimage_0: 0 4194304 linear 
vg-raid1: 0 4194304 raid raid1 2 AA 4194304/4194304 idle 0
vg_tardis01-lv_swap: 0 8388608 linear 
vg_tardis01-lv_root: 0 104857600 linear 

But the repair has already started:

[root@tardis-01 raid]# lvs -a -o+devices
  PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
  PV fModjW-yXvP-XOFZ-87lP-IeTa-zyoR-dUHcmM not recognised. Is the device missing?
  PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
  PV fModjW-yXvP-XOFZ-87lP-IeTa-zyoR-dUHcmM not recognised. Is the device missing?
  LV               VG          Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices                            
  raid1            vg          rwi-aor---   2.00g                                    6.25             raid1_rimage_0(0),raid1_rimage_1(0)
  [raid1_rimage_0] vg          iwi-aor---   2.00g                                                     /dev/sde1(1)                       
  [raid1_rimage_1] vg          Iwi-aor---   2.00g                                                     /dev/sdf1(1)                       
  [raid1_rmeta_0]  vg          ewi-aor---   4.00m                                                     /dev/sde1(0)                       
  [raid1_rmeta_1]  vg          ewi-aor---   4.00m                                                     /dev/sdf1(0)                       
  lv_home          vg_tardis01 -wi-ao---- 224.88g                                                     /dev/sda2(12800)                   
  lv_root          vg_tardis01 -wi-ao----  50.00g                                                     /dev/sda2(0)                       
  lv_swap          vg_tardis01 -wi-ao----   4.00g                                                     /dev/sda2(70368)                   

which is reflected in the new status as well:

[root@tardis-01 raid]# dmsetup status
vg-raid1_rmeta_1: 0 8192 linear 
vg-raid1_rmeta_0: 0 8192 linear 
vg-raid1_rimage_1: 0 4194304 linear 
vg_tardis01-lv_home: 0 471597056 linear 
vg-raid1_rimage_0: 0 4194304 linear 
vg-raid1: 0 4194304 raid raid1 2 Aa 1830016/4194304 recover 0
vg_tardis01-lv_swap: 0 8388608 linear 
vg_tardis01-lv_root: 0 104857600 linear 



The test was done with lvmetad running and enabled, using:

lvm2-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-libs-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-cluster-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
udev-147-2.56.el6    BUILT: Fri Jul 11 16:53:07 CEST 2014
device-mapper-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-libs-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-libs-1.02.87-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6    BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.108-1.el6    BUILT: Thu Jul 24 17:29:50 CEST 2014


Marking VERIFIED.

Comment 13 errata-xmlrpc 2014-10-14 08:23:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html

