| Summary: | mirror conversion and mirror image splitting results in device i/o errors | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | Corey Marthaler <cmarthal> |
| Component: | lvm2 | Assignee: | Jonathan Earl Brassow <jbrassow> |
| Status: | CLOSED ERRATA | QA Contact: | Corey Marthaler <cmarthal> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 6.2 | CC: | agk, dwysocha, heinzm, jbrassow, mbroz, prajnoha, prockai, thornber, zkabelac |
| Target Milestone: | rc | Keywords: | Regression, TestBlocker |
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | lvm2-2.02.87-5.el6 | Doc Type: | Bug Fix |
| Doc Text: |
Do not document.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-12-06 17:02:37 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 747120 | ||
|
Description
Corey Marthaler
2011-08-24 19:33:03 UTC
Easier way to repo this is just start off with a 2-way mirror. [root@taft-01 ~]# lvs -a -o +devices LV VG Attr LSize Log Copy% Devices split_last_leg split_image mwi-a- 300.00m split_last_leg_mlog 100.00 split_last_leg_mimage_0(0),split_last_leg_mimage_1(0) [split_last_leg_mimage_0] split_image iwi-ao 300.00m /dev/sdh1(0) [split_last_leg_mimage_1] split_image iwi-ao 300.00m /dev/sdg1(0) [split_last_leg_mlog] split_image lwi-ao 4.00m /dev/sdc1(0) [root@taft-01 ~]# lvconvert --splitmirrors 1 --name new split_image/split_last_leg LV split_image/split_last_leg_mimage_0 in use: not deactivating [root@taft-01 ~]# lvs -a -o +devices LV VG Attr LSize Log Copy% Devices new split_image -wi-a- 300.00m /dev/sdg1(0) split_last_leg split_image -wi-a- 300.00m /dev/sdh1(0) split_last_leg_mimage_0 split_image vwi-a- 300.00m split_last_leg_mlog split_image -wi-a- 4.00m /dev/sdc1(0) Aug 24 14:42:26 taft-01 lvm[1243]: No longer monitoring mirror device split_image-split_last_leg for events. Aug 24 14:42:26 taft-01 kernel: __ratelimit: 127 callbacks suppressed Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76784 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76784 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76798 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76798 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 0 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 0 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 1 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76799 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76799 Aug 24 14:42:26 taft-01 kernel: Buffer I/O error on device dm-3, logical block 76799 The I/O errors don't bother me. They come from udev trying to access split_last_leg_mimage_0, which is now an error target. The real problem here is the fact that we get 'LV split_image/split_images_sequentially_mimage_0 in use: not deactivating' - probably an [de]activation sequence issue. Please test this bug again with the changes made for bug 732142 - the ordering of the split operations has changed and will likely affect this bug. It will need to be reconfirmed. I will mark this one as POST along with bug 732142. Need devel unit testing results from this fix. Devel should really be trying these test cases before moving to POST. SCENARIO - [mirror_split_fs_io_verification] Create 3-way mirror with simple fs data (checkit) and then split the images and verify taft-02: lvcreate -m 2 -n split_io_verify --nosync -L 4G mirror_sanity WARNING: New mirror won't be synchronised. Don't read what you didn't write! Placing a lock_nolock gfs2 filesystem on split_io_verify volume Mounting split_io_verify volume Writing files to /mnt/split_io_verify/taft-02 Splitting: lvconvert --splitmirrors 1 --name new1 mirror_sanity/split_io_verify LV mirror_sanity/split_io_verify_mimage_2 in use: not deactivating Failed to deactivate former mirror image, split_io_verify_mimage_2 couldn't split split_io_verify into new1 on taft-02 [root@taft-02 ~]# lvs -a -o +devices LV VG Attr Log Copy% Devices split_io_verify mirror_sanity Mwi-a- split_io_verify_mlog 100.00 split_io_verify_mimage_0(0),split_io_verify_mimage_1(0) [split_io_verify_mimage_0] mirror_sanity iwi-ao /dev/sdh1(0) [split_io_verify_mimage_1] mirror_sanity iwi-ao /dev/sdg1(0) split_io_verify_mimage_2 mirror_sanity -wi-a- /dev/sdf1(0) [split_io_verify_mlog] mirror_sanity lwi-ao /dev/sdc1(0) Even the basic test case still fails. [root@taft-02 ~]# lvs -a -o +devices LV Attr LSize Log Copy% Devices mirror mwi-a- 100.00m mirror_mlog 100.00 mirror_mimage_0(0),mirror_mimage_1(0) [mirror_mimage_0] iwi-ao 100.00m /dev/sdb1(0) [mirror_mimage_1] iwi-ao 100.00m /dev/sdc1(0) [mirror_mlog] lwi-ao 4.00m /dev/sdh1(0) [root@taft-02 ~]# lvconvert --splitmirrors 1 --name new taft/mirror LV taft/mirror_mimage_0 in use: not deactivating Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 25598 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 25598 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 0 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 0 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 1 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 25599 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 25599 Sep 9 17:34:30 taft-02 kernel: Buffer I/O error on device dm-4, logical block 255992.6.32-193.el6.x86_64 lvm2-2.02.87-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 lvm2-libs-2.02.87-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 lvm2-cluster-2.02.87-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 udev-147-2.37.el6 BUILT: Wed Aug 10 07:48:15 CDT 2011 device-mapper-1.02.66-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 device-mapper-libs-1.02.66-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 device-mapper-event-1.02.66-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 device-mapper-event-libs-1.02.66-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 cmirror-2.02.87-2.el6 BUILT: Fri Sep 9 09:35:36 CDT 2011 FYI - This case doesn't always fail, I'd say it fails roughly 1/2 the time, and by fail, it doesn't do the actual mirror split. ============================================================ Iteration 1 of 10000 started at Mon Sep 12 11:50:30 CDT 2011 ============================================================ SCENARIO - [split_off_last_leg_leaving_no_mirror] Create a 2-way mirror and split off the last leg leaving no mirror taft-03: lvcreate -m 1 -n split_last_leg -L 300M split_image Waiting until all mirrors become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) splitting off last leg... Deactivating mirror new... and removing Deactivating mirror split_last_leg... and removing ============================================================ Iteration 2 of 10000 started at Mon Sep 12 11:50:51 CDT 2011 ============================================================ SCENARIO - [split_off_last_leg_leaving_no_mirror] Create a 2-way mirror and split off the last leg leaving no mirror taft-03: lvcreate -m 1 -n split_last_leg -L 300M split_image Waiting until all mirrors become fully syncd... 1/1 mirror(s) are fully synced: ( 100.00% ) splitting off last leg... couldn't split off last leg [root@taft-03 ~]# lvs -a -o +devices LV VG Attr LSize Copy% Devices split_last_leg split_image -wi-a- 300.00m /dev/sdh1(0) split_last_leg_mimage_0 split_image vwi-a- 300.00m split_last_leg_mimage_1 split_image -wi-a- 300.00m /dev/sdg1(0) split_last_leg_mlog split_image -wi-a- 4.00m /dev/sdc1(0) When an image is split from a 2-way mirror, the original mirror is converted to a linear device. To do this, the top "layer" must be removed. The segments are transferred from the sub-lv to the top-level LV and the link is severed. The former sub-lv - having its segments transferred - now contains a temporary error target. When the original LV is resumed, the old sub-lv that now contains an error segment is activated and scanned. This is what causes the I/O error messages. There are three ways to fix this problem: 1) Do not set the sub-lv which contains the error target as "visible" before suspending the original LV. This way, when the original is resumed, the sub-lv device node is not created and it is not scanned - avoiding the error messages. The problem with this approach is that if the machine crashes after the resume, it leaves the *hidden* LV in place and the user has a more difficult time noticing that it needs to be cleaned up. Thus, this type of processing is frowned upon. 2) Do like _remove_mirror_images does and suspend the original, then suspend the sub-lv (the error target), then resume the sub-lv, and finally resume the original LV. This seems like extra pointless operations to me, but it does not produce the error message (although, I'm not sure why) and it allows us to leave the visible flag in place. 3) Flag the sub-lv (error target) with a "do not scan" flag. This seems like the cleanest approach, but I have been unable to find the method for doing this. This same bug exists in RAID1 and the solution could be applied there as well. Option #2 was chosen from the solutions in comment 9 because it is the least invasive. Before the patch was applied, I would get: Sep 13 09:02:28 bp-01 kernel: __ratelimit: 118 callbacks suppressed Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1008 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1008 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1022 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1022 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 0 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 0 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1023 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1023 Sep 13 09:02:28 bp-01 kernel: Buffer I/O error on device dm-4, logical block 1023 I do not get these messages after the patch. Patch checked in upstream - Version 2.02.89 Jon, I filed bug 738714 for the fact these lvconvert split operations (device i/o errors aside) are failing most of the time when there are more then 2 legs of the mirror. Still seeing errors from time to time on multiple clusters with the latest rpms. The errors appear on the nodes not doing the split operation. SCENARIO - [split_off_last_leg_leaving_no_mirror] Create a 2-way mirror and split off the last leg leaving no mirror hayes-01: lvcreate -m 1 -n split_last_leg -L 300M split_image Waiting until all mirrors become fully syncd... 0/1 mirror(s) are fully synced: ( 31.67% ) 0/1 mirror(s) are fully synced: ( 38.83% ) 0/1 mirror(s) are fully synced: ( 54.33% ) 0/1 mirror(s) are fully synced: ( 83.00% ) 1/1 mirror(s) are fully synced: ( 100.00% ) splitting off last leg... couldn't split off last leg Error locking on node hayes-03: LV split_image/split_last_leg_mimage_0 in use: not deactivating Error locking on node hayes-02: LV split_image/split_last_leg_mimage_0 in use: not deactivating From the logs: Sep 23 11:07:48 taft-01 qarshd[31974]: Running cmdline: lvconvert --splitmirrors 1 --name new3 split_image/split_images_sequentiay Sep 23 11:07:54 taft-02 lvm[1229]: No longer monitoring mirror device split_image-split_images_sequentially for events. Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76784 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76798 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76798 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 0 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 0 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 1 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76799 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76799 Sep 23 11:07:54 taft-02 kernel: Buffer I/O error on device dm-3, logical block 76799 2.6.32-198.el6.x86_64 lvm2-2.02.87-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 lvm2-libs-2.02.87-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 lvm2-cluster-2.02.87-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 udev-147-2.38.el6 BUILT: Fri Sep 9 16:25:50 CDT 2011 device-mapper-1.02.66-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 device-mapper-libs-1.02.66-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 device-mapper-event-1.02.66-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 device-mapper-event-libs-1.02.66-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 cmirror-2.02.87-3.el6 BUILT: Wed Sep 21 09:54:55 CDT 2011 I'm seeing this while doing lv convert operations as well. --------------------------------------------- TEST CASE=22 base legs 1; convert to 0 legs; base log disk; convert to core log; active 1; sync=1; pvs=sufficient --------------------------------------------- creating a base mirror on taft-03 Waiting until all mirrors become fully syncd... 0/1 mirror(s) are fully synced: ( 28.36% ) 0/1 mirror(s) are fully synced: ( 49.21% ) 0/1 mirror(s) are fully synced: ( 72.86% ) 0/1 mirror(s) are fully synced: ( 93.71% ) 1/1 mirror(s) are fully synced: ( 100.00% ) LV VG Attr LSize Log Copy% Devices centi_base centipede mwi-a- 700.00m centi_base_mlog 100.00 centi_base_mimage_0(0),centi_base_mimage_1(0) [centi_base_mimage_0] centipede iwi-ao 700.00m /dev/sdh2(0) [centi_base_mimage_1] centipede iwi-ao 700.00m /dev/sdh1(0) [centi_base_mlog] centipede lwi-ao 4.00m /dev/sdd2(0) Converting from 1 leg(s) disk log; to 0 leg(s) core log on taft-03 legs_up= logs_up= legs_down=/dev/sdh2 logs_down=/dev/sdd2 total=/dev/sdh2 /dev/sdd2 lvconvert --mirrorlog core -m 0 centipede/centi_base /dev/sdh2 /dev/sdd2 sufficient pvs convert case failed on taft-03 Error locking on node taft-02: LV centipede/centi_base_mimage_0 in use: not deactivating Error locking on node taft-01: LV centipede/centi_base_mimage_0 in use: not deactivating Error locking on node taft-04: LV centipede/centi_base_mimage_0 in use: not deactivating Error locking on node taft-03: LV centipede/centi_base_mimage_0 in use: not deactivating Solution #2 from comment 9 does not work in a cluster. The suspend/resume of the sub-lv is not performed on the remote nodes - thus, the error target is scanned on the remote nodes which causes the I/O errors. Of the remaining two solutions in comment 9, #1 is the simplest but also the dirtiest for the reasons mentioned above. Solution #3 seems somewhat more promising, but I'd rather figure out why the resume is resuming all of the sub-lvs without passing them through _get_udev_flags(). It is the lack of calling _get_udev_flags() that makes udev think it should scan the sub-lv error target (two reasons that the device should not be scanned). I'll try to pursue both courses and implement the best. If I fail to complete the modified solution #3, #1 can be a temporary fall-back. Comment 9, solution 1 stops the I/O errors, but fails to rename the devices... Comment 9, solution 1 also fails to resume the sub-lv (the error target LV that represents the layer removed) as an error target - presumably because it is not visible and not attached to the top-level LV anymore. Although, it does still get resumed. A proper place to reset the udev flags has been found and the splitmirrors code has been updated. This required three check-ins: 1) Revert original work-around that failed to solve the problem in a cluster Upstream git commit id: f7235e7cb439d0c6cbae4ba94a912d50d46c5d25 2) Add code to properly set udev flags of mirror sub-LVs during changes Upstream git commit id: 857339758f8181575512fc51757291deb07078ec 3) Update splitmirror (_split_mirror_images) code Upstream git commit id: 1986e51928ae159781a273972d809a25990fd339 Tests confirm the error messages (access to intermediate error target) do not happen after the changes. Splitmirrors also properly functions for single machine mirrors, cluster mirrors, and exclusively activated mirrors in a cluster. Both the mirror conversion and mirror image splitting regression tests passed without any "Buffer I/O" errors. Marking verified. 2.6.32-207.el6.x86_64 lvm2-2.02.87-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 lvm2-libs-2.02.87-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 lvm2-cluster-2.02.87-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 udev-147-2.40.el6 BUILT: Fri Sep 23 07:51:13 CDT 2011 device-mapper-1.02.66-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 device-mapper-libs-1.02.66-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 device-mapper-event-1.02.66-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 device-mapper-event-libs-1.02.66-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011 cmirror-2.02.87-5.el6 BUILT: Wed Oct 12 10:47:46 CDT 2011
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Do not document.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1522.html |