Bug 1301744
| Field | Value |
|---|---|
| Summary: | pvmove of device in two or more subvolumes can cause seg fault |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | lvm2 |
| lvm2 sub component: | Thin Provisioning (RHEL6) |
| Version: | 6.8 |
| Hardware: | x86_64 |
| OS: | Linux |
| Status: | CLOSED ERRATA |
| Severity: | medium |
| Priority: | unspecified |
| Keywords: | Regression |
| Reporter: | Corey Marthaler <cmarthal> |
| Assignee: | Ondrej Kozina <okozina> |
| QA Contact: | cluster-qe <cluster-qe> |
| CC: | agk, heinzm, jbrassow, msnitzer, prajnoha, prockai, thornber, tlavigne, zkabelac |
| Target Milestone: | rc |
| Fixed In Version: | lvm2-2.02.143-1.el6 |
| Doc Type: | Bug Fix |
| Type: | Bug |
| Last Closed: | 2016-05-11 01:20:35 UTC |
After the pvmove failure, two pvmove0 volumes show up:

```
[root@host-116 ~]# lvs -a -o +devices
  LV              VG            Attr       LSize Pool Origin Data%  Meta%  Move      Cpy%Sync Devices
  POOL            snapper_thinp twi-aotz-- 4.00g             4.07   4.10                      POOL_tdata(0)
  [POOL_tdata]    snapper_thinp TwI-ao---- 4.00g                                              pvmove0(0)
  [POOL_tmeta]    snapper_thinp ewi-ao---- 4.00m                                              /dev/sdc1(0)
  [lvol0_pmspare] snapper_thinp ewI-a----- 4.00m                                              pvmove0(0)
  origin          snapper_thinp Vwi-aotz-- 1.00g POOL        16.25
  other1          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other2          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other3          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other4          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other5          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  [pvmove0]       snapper_thinp p-C-aom--- 4.00g                    /dev/sdd1 0.10            /dev/sdd1(0),/dev/sde1(0)
  [pvmove0]       snapper_thinp p-C-aom--- 4.00g                    /dev/sdd1 0.10            /dev/sdd1(1),/dev/sde1(1)
  snap1           snapper_thinp Vwi-a-tz-- 1.00g POOL origin 16.25
```

I think this has to do with the lvol0_pmspare being on the same PV as the PV getting moved:

```
[root@host-114 ~]# lvcreate --thinpool POOL --zero n -L 4G --poolmetadatasize 4M snapper_thinp
  Logical volume "POOL" created.

[root@host-114 ~]# lvs -a -o +devices
  LV              VG            Attr       LSize Pool Data% Meta% Devices
  POOL            snapper_thinp twi-a-t--- 4.00g      0.00  1.27  POOL_tdata(0)
  [POOL_tdata]    snapper_thinp Twi-ao---- 4.00g                  /dev/sda1(1)
  [POOL_tmeta]    snapper_thinp ewi-ao---- 4.00m                  /dev/sdh1(0)
  [lvol0_pmspare] snapper_thinp ewi------- 4.00m                  /dev/sda1(0)

[root@host-114 ~]# pvmove /dev/sda1
Segmentation fault (core dumped)

  [pvmove0]       snapper_thinp p-C-aom--- 4.00g /dev/sda1 0.10 /dev/sda1(0),/dev/sdd1(0)
  [pvmove0]       snapper_thinp p-C-aom--- 4.00g /dev/sda1 0.10 /dev/sda1(1),/dev/sdd1(1)
```

With no pool metadata spare on the PV, the same pvmove runs to completion:

```
[root@host-114 ~]# lvcreate --thinpool POOL --pool-metadataspare n --zero n -L 4G foo
  WARNING: recovery of pools without pool metadata spare LV is not automated.
  Logical volume "POOL" created.

[root@host-114 ~]# lvs -a -o +devices
  LV           VG  Attr       LSize Pool Data% Meta% Devices
  POOL         foo twi-a-t--- 4.00g      0.00  1.27  POOL_tdata(0)
  [POOL_tdata] foo Twi-ao---- 4.00g                  /dev/sdc1(0)
  [POOL_tmeta] foo ewi-ao---- 4.00m                  /dev/sdh1(0)

[root@host-114 ~]# pvmove /dev/sdc1
  /dev/sdc1: Moved: 0.1%
  /dev/sdc1: Moved: 16.4%
  /dev/sdc1: Moved: 31.0%
  [...]
```

This bug affects non-thin-pool volumes as well. In all reproductions, it appears that the PV being moved is in use by multiple sub volumes.

```
SCENARIO (raid4) - [pvmove_during_io]
Pvmove a raid volume during active I/O
host-114.virt.lab.msp.redhat.com: lvcreate --type raid4 -i 3 -n move_during_io -L 800M raid_sanity
Starting io to raid to be pvmoved
700+0 records in
700+0 records out
734003200 bytes (734 MB) copied, 4.95192 s, 148 MB/s
Attempting pvmove of /dev/sda1 on host-114.virt.lab.msp.redhat.com
pvmove /dev/sda1

Jan 28 15:24:49 host-114 kernel: pvmove[14307]: segfault at 7fa8ad10141a ip 00007fa8aa1b5301 sp 00007ffd47633fd8 error 4 in libc-2.12.so[7fa8aa134000+18a000]

[root@host-114 ~]# lvs -a -o +devices
  LV                        Attr       LSize   Move      Cpy%Sync Devices
  move_during_io            rwi-a-r--- 804.00m           100.00   move_during_io_rimage_0(0),move_during_io_rimage_1(0),move_during_io_rimage_2(0),move_during_io_rimage_3(0)
  [move_during_io_rimage_0] iwI-aor--- 268.00m                    pvmove0(0)
  [move_during_io_rimage_1] iwi-aor--- 268.00m                    /dev/sda2(1)
  [move_during_io_rimage_2] iwi-aor--- 268.00m                    /dev/sdb1(1)
  [move_during_io_rimage_3] iwi-aor--- 268.00m                    /dev/sdb2(1)
  [move_during_io_rmeta_0]  ewI-aor--- 4.00m                      pvmove0(0)
  [move_during_io_rmeta_1]  ewi-aor--- 4.00m                      /dev/sda2(0)
  [move_during_io_rmeta_2]  ewi-aor--- 4.00m                      /dev/sdb1(0)
  [move_during_io_rmeta_3]  ewi-aor--- 4.00m                      /dev/sdb2(0)
  [pvmove0]                 p-C-aom--- 272.00m /dev/sda1 98.53    /dev/sda1(1),/dev/sdc1(0)
  [pvmove0]                 p-C-aom--- 272.00m /dev/sda1 98.53    /dev/sda1(0),/dev/sdc1(67)
```

Another clue: if I remove all the /etc/lvm/archive files associated with that volume group, the segfault goes away and the pvmove works in future iterations.

Attempting the pvmove again after the failure causes the move to finish. Attempting an abort afterwards causes the next attempt to also fail.

### Attempt 1 (re-attempt the pvmove)

```
[root@host-113 ~]# lvs -a -o +devices
  LV                        VG          Attr       LSize   Cpy%Sync Devices
  move_during_io            raid_sanity rwi-a-r--- 804.00m 100.00   move_during_io_rimage_0(0),move_during_io_rimage_1(0),move_during_io_rimage_2(0),move_during_io_rimage_3(0)
  [move_during_io_rimage_0] raid_sanity iwi-aor--- 268.00m          /dev/sdd2(1)
  [move_during_io_rimage_1] raid_sanity iwi-aor--- 268.00m          /dev/sdd1(1)
  [move_during_io_rimage_2] raid_sanity iwi-aor--- 268.00m          /dev/sdc2(1)
  [move_during_io_rimage_3] raid_sanity iwi-aor--- 268.00m          /dev/sdc1(1)
  [move_during_io_rmeta_0]  raid_sanity ewi-aor--- 4.00m            /dev/sdd2(0)
  [move_during_io_rmeta_1]  raid_sanity ewi-aor--- 4.00m            /dev/sdd1(0)
  [move_during_io_rmeta_2]  raid_sanity ewi-aor--- 4.00m            /dev/sdc2(0)
  [move_during_io_rmeta_3]  raid_sanity ewi-aor--- 4.00m            /dev/sdc1(0)

[root@host-113 ~]# pvmove -vvvv /dev/sdd2 > /tmp/pvmove 2>&1
Segmentation fault (core dumped)
[root@host-113 ~]# pvmove /dev/sdd2
  Detected pvmove in progress for /dev/sdd2
  /dev/sdd2: Moved: 98.5%
  /dev/sdd2: Moved: 100.0%
```

### Attempt 2 (attempt an abort before re-attempting the pvmove)

```
[root@host-113 ~]# lvs -a -o +devices
  LV                        VG          Attr       LSize   Cpy%Sync Devices
  move_during_io            raid_sanity rwi-a-r--- 804.00m 100.00   move_during_io_rimage_0(0),move_during_io_rimage_1(0),move_during_io_rimage_2(0),move_during_io_rimage_3(0)
  [move_during_io_rimage_0] raid_sanity iwi-aor--- 268.00m          /dev/sdd2(1)
  [move_during_io_rimage_1] raid_sanity iwi-aor--- 268.00m          /dev/sdd1(1)
  [move_during_io_rimage_2] raid_sanity iwi-aor--- 268.00m          /dev/sdc2(1)
  [move_during_io_rimage_3] raid_sanity iwi-aor--- 268.00m          /dev/sdc1(1)
  [move_during_io_rmeta_0]  raid_sanity ewi-aor--- 4.00m            /dev/sdd2(0)
  [move_during_io_rmeta_1]  raid_sanity ewi-aor--- 4.00m            /dev/sdd1(0)
  [move_during_io_rmeta_2]  raid_sanity ewi-aor--- 4.00m            /dev/sdc2(0)
  [move_during_io_rmeta_3]  raid_sanity ewi-aor--- 4.00m            /dev/sdc1(0)

[root@host-113 ~]# pvmove /dev/sdd2
Segmentation fault (core dumped)
[root@host-113 ~]# pvmove --abort /dev/sdd2
[root@host-113 ~]# pvmove /dev/sdd2
Segmentation fault (core dumped)
[root@host-113 ~]# pvmove --abort /dev/sdd2
[root@host-113 ~]# pvmove /dev/sdd2
Segmentation fault (core dumped)
[root@host-113 ~]# pvmove /dev/sdd2
  Detected pvmove in progress for /dev/sdd2
  /dev/sdd2: Moved: 98.5%
  /dev/sdd2: Moved: 100.0%
```

Created attachment 1122200 [details]
verbose pvmove attempt
I think this is related to the issue in bug 1257366. It doesn't appear to matter which specific archive file is left around, but how many.

```
# a lot of files show up w/ many hours of testing
[root@host-116 archive]# ls -l | wc -l
36140
```

All my machines quickly end up with many files due to the nature of testing, but once cleaned up, the problem goes away. I was then able to run 25 iterations of this case. As a result, I'll remove the TestBlocker flag and lower the severity, but I do believe this may still be a regression, since all other releases would have had just as many archive files on our test machines when running regression tests.

```
============================================================
Iteration 25 of 25 started at Mon Feb 8 11:45:41 CST 2016
============================================================
SCENARIO (raid4) - [pvmove_during_io]
Pvmove a raid volume during active I/O
host-113.virt.lab.msp.redhat.com: lvcreate --type raid4 -i 3 -n move_during_io -L 800M raid_sanity
Starting io to raid to be pvmoved
700+0 records in
700+0 records out
734003200 bytes (734 MB) copied, 4.81797 s, 152 MB/s
Attempting pvmove of /dev/sdd2 on host-113.virt.lab.msp.redhat.com
pvmove /dev/sdd2
Quick regression check: performing raid scrubbing (lvchange --syncaction repair) on raid raid_sanity/move_during_io
Waiting until all mirror|raid volumes become fully syncd...
   1/1 mirror(s) are fully synced: ( 100.00% )
Deactivating raid move_during_io... and removing
```

(In reply to Corey Marthaler from comment #9)
> I think this is related to the issue in bug 1257366. It doesn't appear to
> matter which specific archive file is left around, but how many.
>
> # a lot of files show up w/ many hours of testing
> [root@host-116 archive]# ls -l | wc -l
> 36140

Thanks Corey, this helped! :) In fact, with so many archive files it seems any pvmove command segfaults no matter what setup I use. There are badly mixed pool allocations and pool deallocations in the pvmove command.
The bug was introduced with the pvmove rewrite that made it use process_each_pv(). The destroy_processing_handle() call at tools/pvmove.c:903 unfortunately may also destroy data allocated inside the process_each_pv() cycle.

Fixed in https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=8a80cc5bc3799483a211b2ff9677a392a7b47e8b

Proper URL for the fix: https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=0daf9d7ac5009d7c5642a6b1c7d8bb3fbc0c4382

Fix verified with the latest rpms. This test case ran all night with close to 9000 archive files present.

```
[root@host-118 ~]# ls -l /etc/lvm/archive/* | wc -l
8854

============================================================
Iteration 313 of 1000000 started at Thu Feb 25 10:43:16 CST 2016
============================================================
SCENARIO - [thin_pool_pvmove]
Create snapshots of origin with fs data, and then pvmove the data from the thin pool volume
Making pool volume
lvcreate --thinpool POOL -L 4G --profile thin-performance --zero y --poolmetadatasize 4M snapper_thinp
Sanity checking pool device (POOL) metadata
lvcreate -L 4M -n meta_swap snapper_thinp
lvconvert --yes --thinpool snapper_thinp/POOL --poolmetadata snapper_thinp/meta_swap
lvchange -ay snapper_thinp/meta_swap
thin_check /dev/mapper/snapper_thinp-meta_swap
examining superblock
examining devices tree
examining mapping tree
checking space map counts
meta data NOT corrupt
lvchange -an snapper_thinp/meta_swap
lvconvert --yes --thinpool snapper_thinp/POOL --poolmetadata snapper_thinp/meta_swap
lvremove snapper_thinp/meta_swap
lvchange -ay --yes --select 'lv_name=POOL || pool_lv=POOL'
Making origin volume
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n origin
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other1
lvcreate -V 1G -T snapper_thinp/POOL -n other2
lvcreate -V 1G -T snapper_thinp/POOL -n other3
lvcreate -V 1G -T snapper_thinp/POOL -n other4
  WARNING: Sum of all thin volume sizes (5.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
lvcreate -V 1G -T snapper_thinp/POOL -n other5
  WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/origin
Checking files on /mnt/origin
syncing before snap creation...
Making 1st snapshot of origin volume
lvcreate -k n -s /dev/snapper_thinp/origin -n snap1
Moving data from thin pool volume...
pvmove /dev/sdb1
Mounting 1st snap volume
Checking files on /mnt/snap1
Writing files to /mnt/origin
syncing before snap creation...
Making 2nd snapshot of origin volume
lvcreate -k n -s /dev/snapper_thinp/origin -n snap2
Extending the snapshot size by +250M
  WARNING: Sum of all thin volume sizes (8.25 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Again, moving data from thin pool volume...
pvmove /dev/sda1
Mounting 2nd snap volume
Checking files on /mnt/snap2
umount /mnt/snap1 /mnt/snap2 /mnt/origin
Removing snap volume snapper_thinp/snap1
lvremove -f /dev/snapper_thinp/snap1
Removing snap volume snapper_thinp/snap2
lvremove -f /dev/snapper_thinp/snap2
Removing thin origin and other virtual thin volumes
Removing thinpool snapper_thinp/POOL
```

```
2.6.32-616.el6.x86_64
lvm2-2.02.143-1.el6                              BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-libs-2.02.143-1.el6                         BUILT: Wed Feb 24 07:59:50 CST 2016
lvm2-cluster-2.02.143-1.el6                      BUILT: Wed Feb 24 07:59:50 CST 2016
udev-147-2.71.el6                                BUILT: Wed Feb 10 07:07:17 CST 2016
device-mapper-1.02.117-1.el6                     BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-libs-1.02.117-1.el6                BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-1.02.117-1.el6               BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-event-libs-1.02.117-1.el6          BUILT: Wed Feb 24 07:59:50 CST 2016
device-mapper-persistent-data-0.6.2-0.1.rc5.el6  BUILT: Wed Feb 24 07:07:09 CST 2016
cmirror-2.02.143-1.el6                           BUILT: Wed Feb 24 07:59:50 CST 2016
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0964.html
Description of problem:

I'm having a hard time reproducing this on other machines currently.

```
SCENARIO - [thin_pool_pvmove]
Create snapshots of origin with fs data, and then pvmove the data from the thin pool volume
Making pool volume
lvcreate --thinpool POOL --zero y -L 4G --poolmetadatasize 4M snapper_thinp
Sanity checking pool device (POOL) metadata
examining superblock
examining devices tree
examining mapping tree
checking space map counts
Making origin volume
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n origin
lvcreate -V 1G -T snapper_thinp/POOL -n other1
lvcreate -V 1G -T snapper_thinp/POOL -n other2
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other3
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other4
  WARNING: Sum of all thin volume sizes (5.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
lvcreate --virtualsize 1G -T snapper_thinp/POOL -n other5
  WARNING: Sum of all thin volume sizes (6.00 GiB) exceeds the size of thin pool snapper_thinp/POOL (4.00 GiB)!
Placing an xfs filesystem on origin volume
Mounting origin volume
Writing files to /mnt/origin
Checking files on /mnt/origin
syncing before snap creation...
Making 1st snapshot of origin volume
lvcreate -k n -s /dev/snapper_thinp/origin -n snap1
```

```
[root@host-116 ~]# lvs -a -o +devices
  LV              VG            Attr       LSize Pool Origin Data%  Meta% Devices
  POOL            snapper_thinp twi-aotz-- 4.00g             4.07   4.10  POOL_tdata(0)
  [POOL_tdata]    snapper_thinp Twi-ao---- 4.00g                          /dev/sdd1(1)
  [POOL_tmeta]    snapper_thinp ewi-ao---- 4.00m                          /dev/sdc1(0)
  [lvol0_pmspare] snapper_thinp ewi------- 4.00m                          /dev/sdd1(0)
  origin          snapper_thinp Vwi-aotz-- 1.00g POOL        16.25
  other1          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other2          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other3          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other4          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  other5          snapper_thinp Vwi-a-tz-- 1.00g POOL        0.00
  snap1           snapper_thinp Vwi-a-tz-- 1.00g POOL origin 16.25

[root@host-116 ~]# pvmove /dev/sdd1
Segmentation fault (core dumped)
```

```
Jan 25 15:19:44 host-116 kernel: pvmove[3999]: segfault at 7fa3e5c4b4a2 ip 00007fa3e2d8f301 sp 00007ffd24c77938 error 4 in libc-2.12.so[7fa3e2d0e000+18a000]
```

```
Core was generated by `pvmove /dev/sda1'.
Program terminated with signal 11, Segmentation fault.
(gdb) bt
#0  0x00007f0a9fcf2321 in __strlen_sse2 () from /lib64/libc.so.6
#1  0x00007f0aa048cc7c in dm_pool_strdup (p=0x7f0aa2211ad0, str=0x7f0aa309a7b2 <Address 0x7f0aa309a7b2 out of bounds>) at mm/pool.c:51
#2  0x00007f0aa0f2e736 in _create_id (cmd=0x7f0aa220c110, pv_name=0x7f0aa2254e8e "/dev/sda1", uuid=0x7f0aa2254e38 "wMk2B15qfHXuJ1zQG4IK9e8cTvzC6RPn7Z12OYbCcIkjcL7AdDmtfh8vFBYNLV4e", vg_name=0x7f0aa309a7b2 <Address 0x7f0aa309a7b2 out of bounds>, lv_name=<value optimized out>, background=0) at pvmove.c:774
#3  pvmove_poll (cmd=0x7f0aa220c110, pv_name=0x7f0aa2254e8e "/dev/sda1", uuid=0x7f0aa2254e38 "wMk2B15qfHXuJ1zQG4IK9e8cTvzC6RPn7Z12OYbCcIkjcL7AdDmtfh8vFBYNLV4e", vg_name=0x7f0aa309a7b2 <Address 0x7f0aa309a7b2 out of bounds>, lv_name=<value optimized out>, background=0) at pvmove.c:796
#4  0x00007f0aa0f2e936 in pvmove (cmd=0x7f0aa220c110, argc=<value optimized out>, argv=0x7ffc2053f4c8) at pvmove.c:931
#5  0x00007f0aa0f23fb7 in lvm_run_command (cmd=0x7f0aa220c110, argc=1, argv=0x7ffc2053f4c0) at lvmcmdline.c:1655
#6  0x00007f0aa0f28239 in lvm2_main (argc=2, argv=0x7ffc2053f4b8) at lvmcmdline.c:2121
#7  0x00007f0a9fc8fd1d in __libc_start_main () from /lib64/libc.so.6
#8  0x00007f0aa0f0cc89 in _start ()
```

Version-Release number of selected component (if applicable):

```
2.6.32-604.el6.x86_64
lvm2-2.02.140-3.el6                       BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-libs-2.02.140-3.el6                  BUILT: Thu Jan 21 05:40:10 CST 2016
lvm2-cluster-2.02.140-3.el6               BUILT: Thu Jan 21 05:40:10 CST 2016
udev-147-2.66.el6                         BUILT: Mon Jan 18 02:42:20 CST 2016
device-mapper-1.02.114-3.el6              BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-libs-1.02.114-3.el6         BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-1.02.114-3.el6        BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-event-libs-1.02.114-3.el6   BUILT: Thu Jan 21 05:40:10 CST 2016
device-mapper-persistent-data-0.6.0-1.el6 BUILT: Wed Jan 20 11:23:29 CST 2016
cmirror-2.02.140-3.el6                    BUILT: Thu Jan 21 05:40:10 CST 2016
```