Bug 892991
| Summary: | [lvmetad] RAID or mirror leg failure is not handled when using lvmetad | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Marian Csontos <mcsontos> |
| Component: | lvm2 | Assignee: | Petr Rockai <prockai> |
| lvm2 sub component: | Mirroring and RAID (RHEL6) | QA Contact: | Cluster QE <mspqa-list> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | agk, cmarthal, dwysocha, heinzm, jbrassow, mcsontos, msnitzer, nperic, prajnoha, prockai, thornber, zkabelac |
| Version: | 6.4 | | |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | lvm2-2.02.108-1.el6 | Doc Type: | Bug Fix |
| Doc Text: | Cause: When using lvmetad, dmeventd could see metadata that was not up to date at the time of a RAID volume repair.<br>Consequence: The repair would not proceed because, based on the outdated information, the RAID volume appeared healthy.<br>Fix: The repair code now forces a refresh of metadata for the PVs that host the RAID volume.<br>Result: Automatic RAID volume repair using dmeventd and manual repair using lvconvert --repair now work as expected with or without lvmetad enabled. | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2014-10-14 08:23:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 889465 | | |
| Bug Blocks: | | | |
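The fix text above covers both the dmeventd-driven automatic repair and manual lvconvert --repair. For orientation, the automatic behavior is selected by the raid_fault_policy setting in the activation section of lvm.conf (exercised with "allocate" during the verification later in this report); a minimal sketch, with the value shown purely as an example:

activation {
    # "warn": only log the failure and leave repair to the administrator.
    # "allocate": let dmeventd replace the failed leg from eligible PVs.
    raid_fault_policy = "allocate"
}

# Manual repair of a degraded RAID LV, with or without dmeventd:
lvconvert --repair vg/lv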
Description
Marian Csontos
2013-01-08 12:02:08 UTC
I applied the suggested fix to generate a udev event and wrote "remove" to device/uevent, but the problem persists. Worse, dmeventd now incorrectly claims the device was replaced:

Jan 8 06:49:24 zaphodc1-node02 lvm[5748]: Faulty devices in black_bird/synced_primary_raid10_3legs_1 successfully replaced.

This message was generated immediately after removing the device, so `pvscan --cache` had little chance to run. zkabelac is right: dmeventd should either skip lvmetad or wait for updated metadata.

The original scenario uses `echo offline > /sys/block/$DEV/device/state` to remove the leg. Adding `echo remove > /sys/block/$DEV/uevent` has no effect. Will retry with `echo 1 > /sys/block/$DEV/device/delete` (the test may not recover from that, but it will die here anyway...)

Works with `echo 1 > /sys/block/$DEV/device/delete`. Will check whether this is reliable, as there may still be room for a race where dmeventd handles the missing device before `pvscan --cache` does. (All three injection methods are recapped in a sketch after the session log below.)

This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux.

Marian, can you please verify that dmsetup info prints the correct leg status in your scenario with "offline"? I suppose it should, and I suspect the problem is in the lvconvert --repair code which uses that info. Needinfoing myself.

Running without lvmetad, I set one of the RAID's legs offline:
(08:53:31) [root@barb-03c1-node01:~]$ echo offline > /sys/block/sdc/device/state
(08:53:44) [root@barb-03c1-node01:~]$ dmsetup info -c
Name Maj Min Stat Open Targ Event UUID
VolGroup-lv_swap 253 1 L--w 1 1 0 LVM-DBL1oNg7Kf3uNKw0uYfWXmXOfTo56f1Ckt4N8UXIbdM4W8qk6c7NEd0oQhhSMHQo
VolGroup-lv_root 253 0 L--w 1 1 0 LVM-DBL1oNg7Kf3uNKw0uYfWXmXOfTo56f1C2DYW25PpysmuNtNoEu5Dk10ZuTKrvmHV
vg-lv 253 6 L--w 0 1 1 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GR5cRo3cwCqJeHy3BuLDlAYrFHI9AEUlym
vg-lv_rmeta_1 253 4 L--w 1 1 0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRmitGwCKYbplJDHC7tSYi8n6fHrQZmuLD
vg-lv_rmeta_0 253 2 L--w 1 1 0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GR3dZZzVdf0RbZWWP8uzhd58AoSMwfC2xe
vg-lv_rimage_1 253 5 L--w 1 1 0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRpofPyXjbPBOukVoI3Qwe7mHOHqzbIzPL
vg-lv_rimage_0 253 3 L--w 1 1 0 LVM-28wUMp1qzZd5ldLHKnupcm5TiuYrg9GRyeBuAzW6qy2DD2gzSzJmFGUOXp4Ertdl
(08:53:48) [root@barb-03c1-node01:~]$ dmsetup status
VolGroup-lv_swap: 0 4128768 linear
VolGroup-lv_root: 0 9519104 linear
vg-lv: 0 524288 raid raid1 2 AA 524288/524288 idle 0
vg-lv_rmeta_1: 0 8192 linear
vg-lv_rmeta_0: 0 8192 linear
vg-lv_rimage_1: 0 524288 linear
vg-lv_rimage_0: 0 524288 linear
From the DM point of view everything still looks sane, and there is nothing in /var/log/messages.
Now I run lvs:
(08:53:55) [root@barb-03c1-node01:~]$ lvs
/dev/sdc1: read failed after 0 of 512 at 42935844864: Input/output error
/dev/sdc1: read failed after 0 of 512 at 42935918592: Input/output error
/dev/sdc1: read failed after 0 of 512 at 0: Input/output error
/dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
/dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
Couldn't find device with uuid LaR3xv-GunU-HSmZ-pMyK-35q2-3r6K-YyNVSw.
Couldn't find device with uuid ZPo1VU-EimH-nqUG-Znm0-gxeZ-Sdn1-JdqULS.
LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
lv_root VolGroup -wi-ao---- 4.54g
lv_swap VolGroup -wi-ao---- 1.97g
lv vg rwi-a-r-p- 256.00m 100.00
The missing leg gets correctly reported:
(08:54:08) [root@barb-03c1-node01:~]$ dmsetup status
VolGroup-lv_swap: 0 4128768 linear
VolGroup-lv_root: 0 9519104 linear
vg-lv: 0 524288 raid raid1 2 DA 524288/524288 idle 0
vg-lv_rmeta_1: 0 8192 linear
vg-lv_rmeta_0: 0 8192 linear
vg-lv_rimage_1: 0 524288 linear
vg-lv_rimage_0: 0 524288 linear
And this I/O triggers the disk failure handling:
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: md/raid1:mdX: Disk failure on dm-3, disabling device.
Dec 2 08:54:08 barb-03c1-node01 kernel: md/raid1:mdX: Operation continuing on 1 devices.
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: Device #0 of raid1 array, vg-lv, has failed.
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 42935844864: Input/output error
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 42935918592: Input/output error
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 0: Input/output error
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 512 at 4096: Input/output error
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: /dev/sdc1: read failed after 0 of 2048 at 0: Input/output error
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: Couldn't find device with uuid LaR3xv-GunU-HSmZ-pMyK-35q2-3r6K-YyNVSw.
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: Couldn't find device with uuid ZPo1VU-EimH-nqUG-Znm0-gxeZ-Sdn1-JdqULS.
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 lvm[7045]: Use 'lvconvert --repair vg/lv' to replace failed device.
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
Dec 2 08:54:08 barb-03c1-node01 kernel: sd 7:0:0:1: rejecting I/O to offline device
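To recap, three sysfs failure-injection methods were tried earlier in this report; a minimal consolidated sketch, where $DEV stands in for the leg's kernel device name (e.g. sdc) and is an assumption for illustration:

# Method 1 (the original scenario): mark the device offline.
# The kernel rejects further I/O, but no udev event is generated.
echo offline > /sys/block/$DEV/device/state

# Method 2: additionally emit a synthetic "remove" uevent.
# Per the comments above, this had no effect on the problem.
echo remove > /sys/block/$DEV/uevent

# Method 3: delete the SCSI device entirely. This produces a real remove
# event, so pvscan --cache can update lvmetad, but per the comments above
# the test may not recover from it.
echo 1 > /sys/block/$DEV/device/delete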
This should now work as expected since commit 5dc6671bb550f4b480befee03d234373d08e188a: dmeventd first issues lvscan --cache for the affected LV before proceeding with lvconvert --repair, updating the PV status in lvmetad appropriately.

The failure is recognized and the device failure is handled as defined in lvm.conf. In the case below, raid_fault_policy is "allocate":

[root@tardis-01 raid]# echo offline > /sys/block/sdd/device/state
[root@tardis-01 raid]# lvs -a -o+devices
PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
/dev/sdd1: read failed after 0 of 512 at 16104947712: Input/output error
/dev/sdd1: read failed after 0 of 512 at 16105054208: Input/output error
/dev/sdd1: read failed after 0 of 512 at 0: Input/output error
/dev/sdd1: read failed after 0 of 512 at 4096: Input/output error
PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
raid1 vg rwi-aor--- 2.00g 100.00 raid1_rimage_0(0),raid1_rimage_1(0)
[raid1_rimage_0] vg iwi-aor--- 2.00g /dev/sde1(1)
[raid1_rimage_1] vg iwi-aor--- 2.00g /dev/sdd1(1)
[raid1_rmeta_0] vg ewi-aor--- 4.00m /dev/sde1(0)
[raid1_rmeta_1] vg ewi-aor--- 4.00m /dev/sdd1(0)
lv_home vg_tardis01 -wi-ao---- 224.88g /dev/sda2(12800)
lv_root vg_tardis01 -wi-ao---- 50.00g /dev/sda2(0)
lv_swap vg_tardis01 -wi-ao---- 4.00g /dev/sda2(70368)
[root@tardis-01 raid]# dmsetup status
vg-raid1_rmeta_1: 0 8192 linear
vg-raid1_rmeta_0: 0 8192 linear
vg-raid1_rimage_1: 0 4194304 linear
vg_tardis01-lv_home: 0 471597056 linear
vg-raid1_rimage_0: 0 4194304 linear
vg-raid1: 0 4194304 raid raid1 2 AA 4194304/4194304 idle 0
vg_tardis01-lv_swap: 0 8388608 linear
vg_tardis01-lv_root: 0 104857600 linear

But the repair has already started; the failed leg has been reallocated from /dev/sdd1 to /dev/sdf1 and is resynchronizing:

[root@tardis-01 raid]# lvs -a -o+devices
PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
PV fModjW-yXvP-XOFZ-87lP-IeTa-zyoR-dUHcmM not recognised. Is the device missing?
PV 8ozawT-pdg3-6sPH-e9xv-UXyc-5kz4-hqPOa0 not recognised. Is the device missing?
PV fModjW-yXvP-XOFZ-87lP-IeTa-zyoR-dUHcmM not recognised. Is the device missing?
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert Devices
raid1 vg rwi-aor--- 2.00g 6.25 raid1_rimage_0(0),raid1_rimage_1(0)
[raid1_rimage_0] vg iwi-aor--- 2.00g /dev/sde1(1)
[raid1_rimage_1] vg Iwi-aor--- 2.00g /dev/sdf1(1)
[raid1_rmeta_0] vg ewi-aor--- 4.00m /dev/sde1(0)
[raid1_rmeta_1] vg ewi-aor--- 4.00m /dev/sdf1(0)
lv_home vg_tardis01 -wi-ao---- 224.88g /dev/sda2(12800)
lv_root vg_tardis01 -wi-ao---- 50.00g /dev/sda2(0)
lv_swap vg_tardis01 -wi-ao---- 4.00g /dev/sda2(70368)

which is shown in the new status as well:

[root@tardis-01 raid]# dmsetup status
vg-raid1_rmeta_1: 0 8192 linear
vg-raid1_rmeta_0: 0 8192 linear
vg-raid1_rimage_1: 0 4194304 linear
vg_tardis01-lv_home: 0 471597056 linear
vg-raid1_rimage_0: 0 4194304 linear
vg-raid1: 0 4194304 raid raid1 2 Aa 1830016/4194304 recover 0
vg_tardis01-lv_swap: 0 8388608 linear
vg_tardis01-lv_root: 0 104857600 linear

The test was done with lvmetad running and enabled, with:

lvm2-2.02.108-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-libs-2.02.108-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
lvm2-cluster-2.02.108-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
udev-147-2.56.el6 BUILT: Fri Jul 11 16:53:07 CEST 2014
device-mapper-1.02.87-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-libs-1.02.87-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-1.02.87-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-event-libs-1.02.87-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014
device-mapper-persistent-data-0.3.2-1.el6 BUILT: Fri Apr 4 15:43:06 CEST 2014
cmirror-2.02.108-1.el6 BUILT: Thu Jul 24 17:29:50 CEST 2014

Marking VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-1387.html
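For readers reproducing the verification above, the raid line of dmsetup status is the quickest health check; a minimal sketch of pulling out the relevant fields with awk (the device name vg-raid1 and the field positions are assumptions based on the output shown in this report):

# Status line format seen above:
#   <name>: <start> <len> raid <type> <ndev> <health> <synced>/<total> <action> ...
# Health characters per leg: "A" = alive and in sync, "a" = alive but
# resyncing, "D" = dead/failed (e.g. "AA" healthy, "DA" leg 0 failed,
# "Aa" leg 1 resyncing).
dmsetup status vg-raid1 | awk '{ print "health:", $7, "sync:", $8, "action:", $9 }'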