Bug 1481383

Summary: need to protect against multiple queued up merge attempts while origin is inactive
Product: Red Hat Enterprise Linux 7
Component: lvm2
lvm2 sub component: Snapshots
Reporter: Corey Marthaler <cmarthal>
Assignee: Zdenek Kabelac <zkabelac>
QA Contact: cluster-qe <cluster-qe>
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
CC: agk, germano.massullo, heinzm, jbrassow, lmiksik, msnitzer, prajnoha, prockai, zkabelac
Version: 7.4
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Fixed In Version: lvm2-2.02.176-4.el7
Last Closed: 2018-04-10 15:20:44 UTC
Type: Bug

Description Corey Marthaler 2017-08-14 19:42:00 UTC
Description of problem:
While the origin is inactive, 'lvconvert --merge' can be issued against several snapshots of the same origin and each request is queued for the next activation. Activating the volume group afterwards deadlocks during merge setup (see the dmsetup status and hung-task traces below).

[root@host-075 tmp]# lvs -a -o +devices
  LV     VG      Attr       LSize   Pool Origin Devices        
  merge1 snapper swi---s---   2.00g      origin /dev/sda1(1024)
  merge2 snapper swi---s---   2.00g      origin /dev/sda1(1536)
  merge3 snapper swi---s---   2.00g      origin /dev/sda1(2048)
  merge4 snapper swi---s---   2.00g      origin /dev/sda1(2560)
  merge5 snapper swi---s---   2.00g      origin /dev/sda1(3072)
  origin snapper owi---s---   4.00g             /dev/sda1(0)   

[root@host-075 tmp]# lvconvert --merge snapper/merge1
  Merging of snapshot snapper/merge1 will occur on next activation of snapper/origin.
[root@host-075 tmp]# lvconvert --merge snapper/merge2
  Merging of snapshot snapper/merge2 will occur on next activation of snapper/origin.
[root@host-075 tmp]# lvconvert --merge snapper/merge3
  Merging of snapshot snapper/merge3 will occur on next activation of snapper/origin.
[root@host-075 tmp]# lvconvert --merge snapper/merge4
  Merging of snapshot snapper/merge4 will occur on next activation of snapper/origin.
[root@host-075 tmp]# lvconvert --merge snapper/merge5
  Merging of snapshot snapper/merge5 will occur on next activation of snapper/origin.
[root@host-075 tmp]# vgchange -ay snapper
  1 logical volume(s) in volume group "snapper" now active
  Background polling started for 1 logical volume(s) in volume group "snapper"

[root@host-075 ~]# dmsetup status
snapper-merge5-cow: 0 4194304 linear 
snapper-merge5: 0 8388608 error 
snapper-merge4: 0 8388608 error 
snapper-merge3: 0 8388608 error 
snapper-merge1-cow: 
snapper-merge2: 0 8388608 error 
snapper-merge1: 0 8388608 error 
snapper-origin-real: 0 8388608 linear 
snapper-merge2-cow: 
rhel_host--075-swap: 0 1679360 linear 
snapper-merge3-cow: 
rhel_host--075-root: 0 12992512 linear 
snapper-merge4-cow: 
snapper-origin: 0 8388608 snapshot-merge 512/4194304 512

[root@host-075 tmp]# lvs -a -o +devices
[Deadlock]

Aug 14 12:35:18 host-075 lvmpolld: W: #011LVPOLL: PID 27744: STDERR: '  Internal error: Performing unsafe table load while 12 device(s) are known to be suspended:  (253:5) '
Aug 14 12:36:18 host-075 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 27744) has timed out
Aug 14 12:36:26 host-075 systemd: Started Session 7749 of user root.
Aug 14 12:36:26 host-075 systemd: Starting Session 7749 of user root.
Aug 14 12:36:26 host-075 systemd-logind: New session 7749 of user root.
Aug 14 12:36:26 host-075 dbus[619]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Aug 14 12:36:26 host-075 dbus-daemon: dbus[619]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
Aug 14 12:36:26 host-075 dbus[619]: [system] Successfully activated service 'org.freedesktop.problems'
Aug 14 12:36:26 host-075 dbus-daemon: dbus[619]: [system] Successfully activated service 'org.freedesktop.problems'
Aug 14 12:37:18 host-075 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 27744) has timed out
Aug 14 12:38:18 host-075 lvmpolld[27740]: LVMPOLLD: LVM2 cmd is unresponsive too long (PID 27744) (no output for 180 seconds)
Aug 14 12:38:18 host-075 lvmpolld: W: LVMPOLLD: polling for output of the lvm cmd (PID 27744) has timed out
Aug 14 12:38:38 host-075 kernel: INFO: task kworker/0:0:25794 blocked for more than 120 seconds.
Aug 14 12:38:38 host-075 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 14 12:38:38 host-075 kernel: kworker/0:0     D ffff880016f217d4     0 25794      2 0x00000080
Aug 14 12:38:38 host-075 kernel: Workqueue: ksnaphd do_metadata [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: ffff880028e5bc10 0000000000000046 ffff88001dd7bf40 ffff880028e5bfd8
Aug 14 12:38:38 host-075 kernel: ffff880028e5bfd8 ffff880028e5bfd8 ffff88001dd7bf40 ffff88003fc16cc0
Aug 14 12:38:38 host-075 kernel: 0000000000000000 7fffffffffffffff ffff88001dd7bf40 ffff880037fe0c60
Aug 14 12:38:38 host-075 kernel: Call Trace:
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a8469>] schedule+0x29/0x70
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a5f79>] schedule_timeout+0x239/0x2c0
Aug 14 12:38:38 host-075 kernel: [<ffffffff812f8e50>] ? submit_bio+0x70/0x150
Aug 14 12:38:38 host-075 kernel: [<ffffffff81062efe>] ? kvm_clock_get_cycles+0x1e/0x20
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a7aed>] io_schedule_timeout+0xad/0x130
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000acf0>] ? bvec_next_page+0x10/0x10 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a8a9d>] wait_for_completion_io+0xfd/0x140
Aug 14 12:38:38 host-075 kernel: [<ffffffff810c4850>] ? wake_up_state+0x20/0x20
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000b5bf>] dm_io+0x26f/0x2e0 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000ae90>] ? dm_io_client_destroy+0x30/0x30 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000acf0>] ? bvec_next_page+0x10/0x10 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f2931>] do_metadata+0x21/0x30 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffff810a885a>] process_one_work+0x17a/0x440
Aug 14 12:38:38 host-075 kernel: [<ffffffff810a9526>] worker_thread+0x126/0x3c0
Aug 14 12:38:38 host-075 kernel: [<ffffffff810a9400>] ? manage_workers.isra.24+0x2a0/0x2a0
Aug 14 12:38:38 host-075 kernel: [<ffffffff810b09cf>] kthread+0xcf/0xe0
Aug 14 12:38:38 host-075 kernel: [<ffffffff810b0900>] ? insert_kthread_work+0x40/0x40
Aug 14 12:38:38 host-075 kernel: [<ffffffff816b3ed8>] ret_from_fork+0x58/0x90
Aug 14 12:38:38 host-075 kernel: [<ffffffff810b0900>] ? insert_kthread_work+0x40/0x40
Aug 14 12:38:38 host-075 kernel: INFO: task lvm:27744 blocked for more than 120 seconds.
Aug 14 12:38:38 host-075 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 14 12:38:38 host-075 kernel: lvm             D ffff8800371dba50     0 27744  27743 0x00000080
Aug 14 12:38:38 host-075 kernel: ffff8800371db8f0 0000000000000082 ffff88001dd79fa0 ffff8800371dbfd8
Aug 14 12:38:38 host-075 kernel: ffff8800371dbfd8 ffff8800371dbfd8 ffff88001dd79fa0 ffff8800371dba70
Aug 14 12:38:38 host-075 kernel: 7fffffffffffffff ffff8800371dba68 ffff88001dd79fa0 ffff8800371dba50
Aug 14 12:38:38 host-075 kernel: Call Trace:
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a8469>] schedule+0x29/0x70
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a5f79>] schedule_timeout+0x239/0x2c0
Aug 14 12:38:38 host-075 kernel: [<ffffffff810c1315>] ? check_preempt_curr+0x85/0xa0
Aug 14 12:38:38 host-075 kernel: [<ffffffff810c1349>] ? ttwu_do_wakeup+0x19/0xd0
Aug 14 12:38:38 host-075 kernel: [<ffffffff816a881d>] wait_for_completion+0xfd/0x140
Aug 14 12:38:38 host-075 kernel: [<ffffffff810c4850>] ? wake_up_state+0x20/0x20
Aug 14 12:38:38 host-075 kernel: [<ffffffff810a750c>] flush_workqueue+0x12c/0x5b0
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f2a36>] chunk_io+0xf6/0x140 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f2910>] ? persistent_prepare_exception+0x90/0x90 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffff811c192b>] ? vmalloc+0x4b/0x50
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f0210>] ? alloc_completed_exception+0x50/0x50 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f31e6>] persistent_read_metadata+0xa6/0x560 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffff8118573d>] ? mempool_create_node+0xdd/0x140
Aug 14 12:38:38 host-075 kernel: [<ffffffff811df75c>] ? kmem_cache_alloc_trace+0x3c/0x200
Aug 14 12:38:38 host-075 kernel: [<ffffffffc03f0baa>] snapshot_ctr+0x5fa/0x670 [dm_snapshot]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc0005a57>] dm_table_add_target+0x177/0x430 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc0009417>] table_load+0x157/0x390 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc00092c0>] ? retrieve_status+0x1c0/0x1c0 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000a015>] ctl_ioctl+0x1e5/0x500 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffffc000a343>] dm_ctl_ioctl+0x13/0x20 [dm_mod]
Aug 14 12:38:38 host-075 kernel: [<ffffffff812151fd>] do_vfs_ioctl+0x33d/0x540
Aug 14 12:38:38 host-075 kernel: [<ffffffff812b766f>] ? file_has_perm+0x9f/0xb0
Aug 14 12:38:38 host-075 kernel: [<ffffffff812154a1>] SyS_ioctl+0xa1/0xc0
Aug 14 12:38:38 host-075 kernel: [<ffffffff816b3f89>] system_call_fastpath+0x16/0x1b


Version-Release number of selected component (if applicable):
3.10.0-686.el7.x86_64

lvm2-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-libs-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
lvm2-cluster-2.02.171-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-event-libs-1.02.140-8.el7    BUILT: Wed Jun 28 13:28:58 CDT 2017
device-mapper-persistent-data-0.7.0-0.1.rc6.el7    BUILT: Mon Mar 27 10:15:46 CDT 2017
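
For reference, the sequence above condenses into a reproducer along these lines (a sketch only; the PV, VG/LV names, sizes and chunk size follow this report and are otherwise illustrative):

vgcreate snapper /dev/sda1
lvcreate -L 4G snapper -n origin
for i in 1 2 3 4 5; do
    lvcreate -s /dev/snapper/origin -c 128 -n merge$i -L 2G
done
vgchange -an snapper                     # deactivate origin and snapshots
for i in 1 2 3 4 5; do
    lvconvert --merge snapper/merge$i    # each merge is only queued for the next activation
done
vgchange -ay snapper                     # activation then hangs (snapshot_ctr -> do_metadata)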

Comment 2 Zdenek Kabelac 2017-11-08 12:17:47 UTC
Solved with upstream commit:

https://www.redhat.com/archives/lvm-devel/2017-November/msg00014.html
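
On releases without this fix, a caller can avoid queuing a second merge by first checking whether the origin already has a merging snapshot recorded, e.g. via lv_attr. This is a sketch only, assuming (per lvs(8)) that 'O' in the first lv_attr character marks an origin with a merging snapshot:

# Guard sketch: skip the merge if snapper/origin already has a merge recorded.
attr=$(lvs --noheadings -o lv_attr snapper/origin | tr -d ' ')
case "$attr" in
    O*) echo "snapper/origin already has a merging snapshot; not queuing another" ;;
    *)  lvconvert --merge snapper/merge2 ;;
esac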

Comment 4 Corey Marthaler 2017-12-08 23:17:25 UTC
Fix verified in the latest rpms.

3.10.0-811.el7.x86_64
lvm2-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-libs-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-cluster-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-lockd-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
lvm2-python-boom-0.8.1-5.el7    BUILT: Wed Dec  6 04:15:40 CST 2017
cmirror-2.02.176-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-event-libs-1.02.145-5.el7    BUILT: Wed Dec  6 04:13:07 CST 2017
device-mapper-persistent-data-0.7.3-2.el7    BUILT: Tue Oct 10 04:00:07 CDT 2017



SCENARIO - [write_to_snap_merge]
Create snaps of origin with fs data, verify data on snaps, change data on snaps, merge data back to origin, verify origin data
Making origin volume
lvcreate -L 4G snapper -n origin
Placing an ext filesystem on origin volume
mke2fs 1.42.9 (28-Dec-2013)
Mounting origin volume

Writing files to /mnt/origin

Checking files on /mnt/origin


Making 6 snapshots of the origin volume, mounting, and verifying original data
lvcreate -s /dev/snapper/origin -c 128 -n merge1 -L 2G
+++ Mounting and verifying snapshot merge1 data +++
Checking files on /mnt/merge1

lvcreate -s /dev/snapper/origin -c 128 -n merge2 -L 2G
+++ Mounting and verifying snapshot merge2 data +++
Checking files on /mnt/merge2

lvcreate -s /dev/snapper/origin -c 128 -n merge3 -L 2G
+++ Mounting and verifying snapshot merge3 data +++
Checking files on /mnt/merge3

lvcreate -s /dev/snapper/origin -c 128 -n merge4 -L 2G
+++ Mounting and verifying snapshot merge4 data +++
Checking files on /mnt/merge4

lvcreate -s /dev/snapper/origin -c 128 -n merge5 -L 2G
+++ Mounting and verifying snapshot merge5 data +++
Checking files on /mnt/merge5

lvcreate -s /dev/snapper/origin -c 128 -n merge6 -L 2G
+++ Mounting and verifying snapshot merge6 data +++
Checking files on /mnt/merge6

Writing new snapshot data and then merging back each of the snapshot volumes
+++ snapshot snapper/merge1 +++
Writing files to /mnt/merge1
Checking files on /mnt/merge1
Deactivating origin/snap volume(s)
Merge snapshot snapper/merge1 back into the origin
lvconvert --merge snapper/merge1

Attempting regression check for multiple queued up merge attempts
1. Same snap already initiated (bug 1481403): lvconvert --merge snapper/merge1
  Command on LV snapper/merge1 is invalid on LV with properties: lv_is_merging_cow .
  Command not permitted on LV snapper/merge1.
2. Different snap with other initiated (bug 1481383): lvconvert --merge snapper/merge2
  Cannot merge snapshot snapper/merge2 into the origin snapper/origin with merging snapshot snapper/merge1.

Activating origin/snap volume(s)
Waiting for the snap merge to complete...
  Failed to find logical volume "snapper/merge1"
Mount and verify the proper data now exists on the origin
Checking files on /mnt/origin


+++ snapshot snapper/merge2 +++
Writing files to /mnt/merge2
Checking files on /mnt/merge2
Deactivating origin/snap volume(s)
Merge snapshot snapper/merge2 back into the origin
lvconvert --merge snapper/merge2

Attempting regression check for multiple queued up merge attempts
1. Same snap already initiated (bug 1481403): lvconvert --merge snapper/merge2
  Command on LV snapper/merge2 is invalid on LV with properties: lv_is_merging_cow .
  Command not permitted on LV snapper/merge2.
2. Different snap with other initiated (bug 1481383): lvconvert --merge snapper/merge3
  Cannot merge snapshot snapper/merge3 into the origin snapper/origin with merging snapshot snapper/merge2.

Activating origin/snap volume(s)
Waiting for the snap merge to complete...
  Failed to find logical volume "snapper/merge2"
Mount and verify the proper data now exists on the origin
Checking files on /mnt/origin
[...]

Comment 7 errata-xmlrpc 2018-04-10 15:20:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0853