Bug 919604
Summary: | thinpool stacked on mirror volume fails to recover from device failure | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
Component: | lvm2 | Assignee: | Zdenek Kabelac <zkabelac> |
Status: | CLOSED ERRATA | QA Contact: | Cluster QE <mspqa-list> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 6.4 | CC: | agk, dwysocha, heinzm, jbrassow, msnitzer, prajnoha, prockai, slevine, thornber, zkabelac |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | lvm2-2.02.100-4.el6 | Doc Type: | Bug Fix |
Doc Text: |
Users who wish to have device fault tolerance for their thinpool logical volumes should use the RAID segment types for this purpose (e.g. "raid1"). This is especially encouraged for thinpool metadata. The 'lvconvert' command can be used for this purpose. Here is an example of converting the metadata portion of a thinpool named "my_thinpool" to the "raid1" segment type:
~> lvconvert --type raid1 -m 1 my_vg/my_thinpool_tmeta
The 'raid1' segment type is the new implementation of mirroring in LVM. The legacy mirror segment type is called 'mirror'. Conversions that result in thinpools layered on logical volumes of 'mirror' segment type are no longer allowed. That is, it is no longer possible to create thinpools on top of logical volumes of 'mirror' segment type. This is due to the possibility of I/O hangs and a failure to complete repairs during failure events. Users can still gain the desired fault tolerance by using the 'raid1' segment type, which does not suffer from the same limitations.
Users who have already created thinpools with data or metadata areas of 'mirror' segment type will still be able to activate those logical volumes. However, they should convert them to the 'raid1' segment type as soon as possible. This can be quickly accomplished via the 'lvconvert' command. For example, the following command would convert the data portion of a thinpool named 'my_thinpool' in the volume group 'my_vg' from the 'mirror' segment type to the newer 'raid1' segment type:
~> lvconvert --type raid1 my_vg/my_thinpool_tdata
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2013-11-21 23:21:48 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 960054 |
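As a rough sketch of the conversion advice in the Doc Text above, the shell function below scans lvs-style output and prints the matching 'lvconvert --type raid1' command for each thinpool data or metadata sub-LV that still uses the 'mirror' segment type. The function name emit_raid1_conversions and the three-column input layout are assumptions made for illustration. It only prints commands and runs nothing.

```shell
#!/usr/bin/env bash
# Illustrative helper (hypothetical name, not part of lvm2).
# Input on stdin: one "VG LV SEGTYPE" triple per line, for example from
#   lvs -a --noheadings -o vg_name,lv_name,segtype
# Output: one "lvconvert --type raid1 VG/LV" line per thinpool sub-LV
# (_tdata or _tmeta) whose segment type is the legacy 'mirror'.
emit_raid1_conversions() {
    awk '
        NF >= 3 {
            vg = $1; lv = $2; seg = $3
            # lvs prints hidden LVs in brackets: [pool_tdata] -> pool_tdata
            gsub(/^\[|\]$/, "", lv)
            if (seg == "mirror" && lv ~ /_(tdata|tmeta)$/)
                printf "lvconvert --type raid1 %s/%s\n", vg, lv
        }
    '
}
```

It could be fed with something like `lvs -a --noheadings -o vg_name,lv_name,segtype | emit_raid1_conversions`, with the printed commands reviewed before any of them is run.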
Description
Corey Marthaler
2013-03-08 21:35:32 UTC
A manual test (i.e. non-dmeventd initiated) seems to work for the repair. If the device is then reinstated, the messages are very confusing and lead to disastrous results!

[root@bp-xen-01 lvm2]# devices vg
  LV                    Attr       Cpy%Sync Devices
  [lvol0_pmspare]       ewi-------          /dev/sda1(257)
  pool                  twi-a-tz--          pool_tdata(0)
  [pool_tdata]          mwi-aom---   100.00 pool_tdata_mimage_0(0),pool_tdata_mimage_1(0)
  [pool_tdata_mimage_0] iwi-aom---          /dev/sda1(0)
  [pool_tdata_mimage_1] iwi-aom---          /dev/sdb1(0)
  [pool_tdata_mlog]     lwi-aom---          /dev/sdg1(0)
  [pool_tmeta]          ewi-ao----          /dev/sda1(256)
[root@bp-xen-01 lvm2]# off.sh sda
Turning off sda
[root@bp-xen-01 lvm2]# lvconvert --repair vg/pool
  /dev/sda1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sda1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sda1: read failed after 0 of 512 at 0: Input/output error
  /dev/sda1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid RA4cDd-471I-9HM1-PJG8-Db1t-X2gg-gmicT3.
  Only inactive pool can be repaired.
[root@bp-xen-01 lvm2]# lvconvert --repair vg/pool_tdata
  /dev/sda1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sda1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sda1: read failed after 0 of 512 at 0: Input/output error
  /dev/sda1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid RA4cDd-471I-9HM1-PJG8-Db1t-X2gg-gmicT3.
  Mirror status: 1 of 2 images failed.
Attempt to replace failed mirror images (requires full device resync)? [y/n]: y
  Trying to up-convert to 2 images, 1 logs.
  vg/pool_tdata: Converted: 2.0%
  vg/pool_tdata: Converted: 71.9%
  vg/pool_tdata: Converted: 100.0%
[root@bp-xen-01 lvm2]# devices vg
  /dev/sda1: read failed after 0 of 512 at 898381381632: Input/output error
  /dev/sda1: read failed after 0 of 512 at 898381488128: Input/output error
  /dev/sda1: read failed after 0 of 512 at 0: Input/output error
  /dev/sda1: read failed after 0 of 512 at 4096: Input/output error
  /dev/sda1: read failed after 0 of 2048 at 0: Input/output error
  Couldn't find device with uuid RA4cDd-471I-9HM1-PJG8-Db1t-X2gg-gmicT3.
  LV                    Attr       Cpy%Sync Devices
  [lvol0_pmspare]       ewi-----p-          unknown device(257)
  pool                  twi-a-tzp-          pool_tdata(0)
  [pool_tdata]          mwi-aom---   100.00 pool_tdata_mimage_0(0),pool_tdata_mimage_1(0)
  [pool_tdata_mimage_0] iwi-aom---          /dev/sdb1(0)
  [pool_tdata_mimage_1] iwi-aom---          /dev/sdc1(0)
  [pool_tdata_mlog]     lwi-aom---          /dev/sdg1(1)
  [pool_tmeta]          ewi-ao--p-          unknown device(256)
[root@bp-xen-01 lvm2]# on.sh sda
Turning on sda
[root@bp-xen-01 lvm2]# lvs
  WARNING: Inconsistent metadata found for VG vg - updating to use version 13
  Missing device /dev/sda1 reappeared, updating metadata for VG vg to version 13.
  Device still marked missing because of allocated data on it, remove volumes and consider vgreduce --removemissing.
  Failed to parse thin pool params: Fail.
  Failed to parse thin pool params: Fail.
  dm_report_object: report function failed for field data_percent
  LV      VG         Attr       LSize Pool Origin Data% Move Log Cpy%Sync Convert
  pool    vg         twi-a-tzp- 1.00g
  lv_root vg_bpxen01 -wi-ao---- 5.54g
  lv_swap vg_bpxen01 -wi-ao---- 1.97g
[root@bp-xen-01 lvm2]# vgreduce --removemissing vg
  WARNING: Partial LV pool needs to be repaired or removed.
  WARNING: Partial LV pool_tmeta needs to be repaired or removed.
  WARNING: Partial LV lvol0_pmspare needs to be repaired or removed.
  There are still partial LVs in VG vg.
  To remove them unconditionally use: vgreduce --removemissing --force.
  Proceeding to remove empty missing PVs.
[root@bp-xen-01 lvm2]# vgreduce --removemissing --force vg
  Removing partial LV pool.
  Logical volume "pool" successfully removed
  Wrote out consistent volume group vg
[root@bp-xen-01 lvm2]# devices vg
  WARNING: Inconsistent metadata found for VG vg - updating to use version 18
  Removing PV /dev/sda1 (RA4cDd-471I-9HM1-PJG8-Db1t-X2gg-gmicT3) that no longer belongs to VG vg
[root@bp-xen-01 lvm2]# devices vg
[root@bp-xen-01 lvm2]# vgs
  VG         #PV #LV #SN Attr   VSize VFree
  vg           6   0   0 wz--n- 4.90t 4.90t
  vg_bpxen01   1   2   0 wz--n- 7.51g    0

(In reply to Jonathan Earl Brassow from comment #5)
> A manual test (i.e. non-dmeventd initiated) seems to work for the repair.
> If the device is then reinstated, the messages are very confusing and lead
> to disastrous results!
>

Yeah, the mirror repair worked fine, but the rest of the test is potentially bogus because I killed the one device that was part of the mirrored tdata and the only device in tmeta.

Easier to gather logs without using dmeventd.

1) Create mirror
2) convert it to thinpool
3) create thinlv
4) wait for mirror sync
5) killall -9 dmeventd (we want to do the repair, not dmeventd)
6) kill any mirror device (but not one that is also _tmeta)
** 7) write a small amount to thinlv to make kernel notice failed dev **
8) Attempt repair - it will hang.

Note that if #7 is not done, the repair will complete just fine. This is odd because in both cases any write I/O done to the mirror will hang and the case where #7 is performed (the case where there is even more info that the device is dead) is the case that fails.

(In reply to Jonathan Earl Brassow from comment #7)
> Easier to gather logs without using dmeventd.
>
> 1) Create mirror
> 2) convert it to thinpool
> 3) create thinlv
> 4) wait for mirror sync
> 5) killall -9 dmeventd (we want to do the repair, not dmeventd)
> 6) kill any mirror device (but not one that is also _tmeta)
> ** 7) write a small amount to thinlv to make kernel notice failed dev **
> 8) Attempt repair - it will hang.
>
> Note that if #7 is not done, the repair will complete just fine. This is
> odd because in both cases any write I/O done to the mirror will hang and the
> case where #7 is performed (the case where there is even more info that the
> device is dead) is the case that fails.

Using this method to hang the 'lvconvert --repair', attaching gdb to the process, and then replacing the mirror with an error target allows us to get the following backtrace:

(gdb) bt
#0  0x00000038154db400 in __open_nocancel () from /lib64/libc.so.6
#1  0x000000000045ec8e in dev_open_flags (dev=0x8928e8, flags=278528, direct=1, quiet=1) at device/dev-io.c:470
#2  0x000000000045f163 in dev_open_readonly_quiet (dev=0x8928e8) at device/dev-io.c:553
#3  0x0000000000468623 in _passes_partitioned_filter (f=0x842950, dev=0x8928e8) at filters/filter-partitioned.c:27
#4  0x00000000004650a7 in _and_p (f=0x83c830, dev=0x8928e8) at filters/filter-composite.c:24
#5  0x00000000004661d7 in _lookup_p (f=0x838770, dev=0x8928e8) at filters/filter-persistent.c:295
#6  0x000000000045d921 in dev_iter_get (iter=0x836740) at device/dev-cache.c:1011
#7  0x000000000044e081 in lvmcache_label_scan (cmd=0x7ff0f0, full_scan=0) at cache/lvmcache.c:691
#8  0x000000000049ce06 in _vg_read (cmd=0x7ff0f0, vgname=0x879c82 "vg", vgid=0x0, warnings=1, consistent=0x7fff1433f9e4, precommitted=0) at metadata/metadata.c:2997
#9  0x000000000049e1d3 in vg_read_internal (cmd=0x7ff0f0, vgname=0x879c82 "vg", vgid=0x0, warnings=1, consistent=0x7fff1433f9e4) at metadata/metadata.c:3413
#10 0x000000000049f9ef in _vg_lock_and_read (cmd=0x7ff0f0, vg_name=0x879c82 "vg", vgid=0x0, lock_flags=36, status_flags=514, misc_flags=1048576) at metadata/metadata.c:4112
#11 0x000000000049fd6a in vg_read (cmd=0x7ff0f0, vg_name=0x879c82 "vg", vgid=0x0, flags=1048576) at metadata/metadata.c:4216
#12 0x000000000049fdab in vg_read_for_update (cmd=0x7ff0f0, vg_name=0x879c82 "vg", vgid=0x0, flags=0) at metadata/metadata.c:4227
#13 0x000000000041cb9a in _get_lvconvert_vg (cmd=0x7ff0f0, name=0x879c82 "vg", uuid=0x0) at lvconvert.c:562
#14 0x0000000000423e32 in get_vg_lock_and_logical_volume (cmd=0x7ff0f0, vg_name=0x879c82 "vg", lv_name=0x7fff143408a0 "lv_tdata") at lvconvert.c:2649
#15 0x0000000000424071 in lvconvert_single (cmd=0x7ff0f0, lp=0x7fff1433fbf0) at lvconvert.c:2687
#16 0x0000000000424646 in lvconvert (cmd=0x7ff0f0, argc=1, argv=0x7fff1433fee8) at lvconvert.c:2796
#17 0x000000000042e4a4 in lvm_run_command (cmd=0x7ff0f0, argc=1, argv=0x7fff1433fee8) at lvmcmdline.c:1168
#18 0x000000000042f9bd in lvm2_main (argc=7, argv=0x7fff1433feb8) at lvmcmdline.c:1604
#19 0x0000000000447690 in main (argc=7, argv=0x7fff1433feb8) at lvm.c:21

I am going to disallow thin* on top of mirror logical volumes. Users will have to use the "raid1" segment type if they want this.

This bug has come down to a choice between:
1) Disallowing thin-LVs from being used as PVs.
2) Disallowing thinpools on top of mirrors.

The problem is that the code in dev_manager.c:device_is_usable() is unable to tell whether there is a mirror device lower in the stack from the device being checked. Pretty much anything layered on top of a mirror will suffer from this problem. (Snapshots are a good example of this; and option #1 above has been chosen to deal with them. This can also be seen in dev_manager.c:device_is_usable().)

When a mirror failure occurs, the kernel blocks all I/O to it. If there is an LVM command that comes along to do the repair (or a different operation that requires label reading), it would normally avoid the mirror when it sees that it is blocked. However, if there is a snapshot or a thin-LV that is on a mirror, the above code will not detect the mirror underneath and will issue label reading I/O. This causes the command to hang.

Choosing #1 would mean that thin-LVs could never be used as PVs - even if they are stacked on something other than mirrors. Choosing #2 means that thinpools can never be placed on mirrors.
This is probably better than we think, since it is preferred that people use the "raid1" segment type in the first place. However, RAID* cannot currently be used in a cluster volume group - even in EX-only mode. Thus, a complete solution for option #2 must include the ability to activate RAID logical volumes (and perform RAID operations) in a cluster volume group. I've already begun working on this.

This bug has been addressed by better integration of RAID + thinpool and disallowing mirror + thinpool. The necessary commit IDs are:
ca514351536c2dd8929944bb6b01a64587cb0a46
2691f1d764182722195cda80be1f511e968480aa
82228acfc95fa4dbe9acca2d3bfc5a89087fd5e4

Users will have to make use of RAID rather than mirror. Perhaps that qualifies this bug as a "WONTFIX". However, there were necessary changes to RAID and the fact that mirror has been disallowed requires a bug to pull in the changes. This is as good a bug as any to use for that purpose.

Fix verified (in that this op is no longer allowed) in the latest rpms.
2.6.32-410.el6.x86_64
lvm2-2.02.100-4.el6                       BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-libs-2.02.100-4.el6                  BUILT: Fri Sep 27 09:05:32 CDT 2013
lvm2-cluster-2.02.100-4.el6               BUILT: Fri Sep 27 09:05:32 CDT 2013
udev-147-2.48.el6                         BUILT: Fri Aug 9 06:09:50 CDT 2013
device-mapper-1.02.79-4.el6               BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-libs-1.02.79-4.el6          BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-1.02.79-4.el6         BUILT: Fri Sep 27 09:05:32 CDT 2013
device-mapper-event-libs-1.02.79-4.el6    BUILT: Fri Sep 27 09:05:32 CDT 2013
cmirror-2.02.100-4.el6                    BUILT: Fri Sep 27 09:05:32 CDT 2013

[root@taft-01 ~]# lvs -a -o +devices
  LV                              Attr       LSize   Log                       Cpy%Sync Devices
  to_pool_convert                 mwi-a-m--- 100.00m to_pool_convert_mlog        100.00 to_pool_convert_mimage_0(0),to_pool_convert_mimage_1(0)
  [to_pool_convert_mimage_0]      iwi-aom--- 100.00m                                    /dev/sdc1(0)
  [to_pool_convert_mimage_1]      iwi-aom--- 100.00m                                    /dev/sdd1(0)
  [to_pool_convert_mlog]          lwi-aom---   4.00m                                    /dev/sde1(0)
  to_pool_meta_convert            mwi-a-m--- 100.00m to_pool_meta_convert_mlog   100.00 to_pool_meta_convert_mimage_0(0),to_pool_meta_convert_mimage_1(0)
  [to_pool_meta_convert_mimage_0] iwi-aom--- 100.00m                                    /dev/sdc1(25)
  [to_pool_meta_convert_mimage_1] iwi-aom--- 100.00m                                    /dev/sdd1(25)
  [to_pool_meta_convert_mlog]     lwi-aom---   4.00m                                    /dev/sde1(1)
[root@taft-01 ~]# lvconvert --thinpool snapper_thinp/to_pool_convert --poolmetadata to_pool_meta_convert
  Mirror logical volumes cannot be used as thinpools.
  Try "raid1" segment type instead.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1704.html
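
The limitation discussed in the analysis above (a check that looks only at the device being tested cannot see a mirror lower in the stack) can be illustrated with a small sketch. This is not the lvm2 code: device_is_usable() is C code in dev_manager.c, while the function below, its name stack_has_mirror, and its "NAME TYPE CHILD..." input format are all hypothetical and exist only to show the idea of walking the whole stack.

```shell
#!/usr/bin/env bash
# Illustrative only (hypothetical name and input format, not lvm2 code).
# Usage: stack_has_mirror TOP < stack_description
# Each input line: NAME TYPE [CHILD ...], listing the devices NAME sits on.
# Exit status 0 if TOP or anything beneath it has TYPE "mirror", else 1.
stack_has_mirror() {
    awk -v top="$1" '
        function walk(dev,    n, parts, i) {
            if (seen[dev]++) return 0          # guard against loops
            if (type[dev] == "mirror") return 1
            n = split(kids[dev], parts, " ")
            for (i = 1; i <= n; i++)
                if (walk(parts[i])) return 1
            return 0
        }
        {
            type[$1] = $2
            kids[$1] = ""
            for (i = 3; i <= NF; i++) kids[$1] = kids[$1] " " $i
        }
        END { exit (walk(top) ? 0 : 1) }
    '
}
```

For a stack like the one in this report (thin-LV on pool, pool on a mirrored pool_tdata), looking only at the type of the top device reports "thin", whereas the walk above finds the mirror two levels down.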