Bug 803271 - lvmetad udev rules deadlocks pvmove in cluster
lvmetad udev rules deadlocks pvmove in cluster
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2 (Show other bugs)
6.3
x86_64 Linux
urgent Severity urgent
: rc
: ---
Assigned To: Peter Rajnoha
Cluster QE
: Regression, Reopened
Depends On:
Blocks:
  Show dependency treegraph
 
Reported: 2012-03-14 07:01 EDT by Nenad Peric
Modified: 2013-02-21 03:09 EST (History)
14 users (show)

See Also:
Fixed In Version: lvm2-2.02.98-3.el6
Doc Type: Bug Fix
Doc Text:
Cause: Using lvmetad (global/use_lvmetad=1 LVM2 configuration option) when processing LVM2 commands in cluster environment (global/locking_type=3). Consequence: LVM2 commands working incorrectly and issuing various error messages. Fix: If clustered locking is set, disable lvmetad use in LVM2 automatically as this configuration is not yet supported. Result: There's now a fallback to non-lvmetad operation in LVM2 if clustered locking is used and a warning message is issued: "WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet."
Story Points: ---
Clone Of:
Environment:
Last Closed: 2013-02-21 03:09:14 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)
pvmove + clvmd logs (223.85 KB, text/plain)
2012-03-27 07:06 EDT, Milan Broz
no flags Details

  None (edit)
Description Nenad Peric 2012-03-14 07:01:32 EDT
Description of problem:

CLVMD crashed at the end of pvmove while updating LVM metadata. 

Version-Release number of selected component (if applicable):


lvm2-2.02.96-0.75.el6.x86_64
lvm2-libs-2.02.96-0.75.el6.x86_64
lvm2-cluster-2.02.96-0.75.el6.x86_64

cmirror-2.02.96-0.75.el6.x86_64

device-mapper-1.02.75-0.75.el6.x86_64
device-mapper-libs-1.02.75-0.75.el6.x86_64
device-mapper-event-libs-1.02.75-0.75.el6.x86_64
device-mapper-event-1.02.75-0.75.el6.x86_64


kernel-2.6.32-220 and greater (last tried 251)


How reproducible:

Everytime

Steps to Reproduce:

Here's what's been done:

[root@node02:~]$ lvcreate -n linear -L 3.5G pv_shuffle_A
  Logical volume "linear" created
[root@node02:~]$ lvcreate -i 3 -n stripe -L 3.5G pv_shuffle_A
  Using default stripesize 64.00 KiB
  Rounding size (896 extents) up to stripe boundary size (897 extents)
  Logical volume "stripe" created
[root@node02:~]$ lvcreate -n linear -L 3.5G pv_shuffle_B
  Logical volume "linear" created
[root@node02:~]$ lvcreate -i 3 -n stripe -L 3.5G pv_shuffle_B
  Using default stripesize 64.00 KiB
  Rounding size (896 extents) up to stripe boundary size (897 extents)
  Logical volume "stripe" created

[root@node02:~]$ vgextend pv_shuffle_A /dev/sdf1
  Volume group "pv_shuffle_A" successfully extended
[root@node02:~]$ vgextend pv_shuffle_B /dev/sdg2
  Volume group "pv_shuffle_B" successfully extended

[root@node02:~]$ vgchange -an pv_shuffle_A
  0 logical volume(s) in volume group "pv_shuffle_A" now active
[root@node02:~]$ vgchange -an pv_shuffle_B
  0 logical volume(s) in volume group "pv_shuffle_B" now active


[root@node02:~]$ pvs
  PV         VG           Fmt  Attr PSize PFree  
  /dev/sda1  pv_shuffle_B lvm2 a--  4.88g 212.00m
  /dev/sda2  pv_shuffle_A lvm2 a--  4.88g   4.88g
  /dev/sdb1  pv_shuffle_A lvm2 a--  4.88g   3.71g
  /dev/sdb2  pv_shuffle_A lvm2 a--  4.88g 212.00m
  /dev/sdd1  pv_shuffle_B lvm2 a--  4.88g   3.71g
  /dev/sdd2  pv_shuffle_B lvm2 a--  4.88g   3.71g
  /dev/sde1  pv_shuffle_A lvm2 a--  4.88g   4.88g
  /dev/sde2  pv_shuffle_A lvm2 a--  4.88g   3.71g
  /dev/sdf1  pv_shuffle_A lvm2 a--  4.88g   4.88g
  /dev/sdf2               lvm2 a--  4.88g   4.88g
  /dev/sdg1               lvm2 a--  4.88g   4.88g
  /dev/sdg2  pv_shuffle_B lvm2 a--  4.88g   4.88g
  /dev/sdi1  pv_shuffle_B lvm2 a--  4.88g   4.88g
  /dev/sdi2  pv_shuffle_B lvm2 a--  4.88g   4.88g
  /dev/vda2  VolGroup     lvm2 a--  9.51g      0 


[root@node02:~]$ pvmove -v /dev/sdb2 /dev/sdf1


Actual results:


[root@node02:~]$ pvmove -v /dev/sdb2 /dev/sdf1
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-userspace
    Archiving volume group "pv_shuffle_A" metadata (seqno 4).
    Creating logical volume pvmove0
    Moving 896 extents of logical volume pv_shuffle_A/linear
    Moving 299 extents of logical volume pv_shuffle_A/stripe
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 5).
    Checking progress before waiting every 15 seconds
  /dev/sdb2: Moved: 0.3%
  /dev/sdb2: Moved: 10.3%
  /dev/sdb2: Moved: 19.2%
  /dev/sdb2: Moved: 21.8%
  /dev/sdb2: Moved: 28.9%
  /dev/sdb2: Moved: 38.7%
  /dev/sdb2: Moved: 44.4%
  /dev/sdb2: Moved: 50.9%
  /dev/sdb2: Moved: 62.2%
  /dev/sdb2: Moved: 71.2%
  /dev/sdb2: Moved: 72.6%
  /dev/sdb2: Moved: 75.0%
    Updating volume group metadata
  Error locking on node node03: Command timed out
  Error locking on node node02: Command timed out
  Error locking on node node01: Command timed out



Mar 14 05:54:39 node02 kernel: INFO: task clvmd:2018 blocked for more than 120 seconds.
Mar 14 05:54:39 node02 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 14 05:54:39 node02 kernel: clvmd         D 0000000000000000     0  2018      1 0x00000000
Mar 14 05:54:39 node02 kernel: ffff880037439b18 0000000000000082 ffff8800ffffffff 00000402c6b7f594
Mar 14 05:54:39 node02 kernel: ffffffffa0002720 ffff88003748d980 0000000000057db0 ffffffffb09f8e19
Mar 14 05:54:39 node02 kernel: ffff88003d0e1ab8 ffff880037439fd8 000000000000fb48 ffff88003d0e1ab8
Mar 14 05:54:39 node02 kernel: Call Trace:
Mar 14 05:54:39 node02 kernel: [<ffffffffa0002720>] ? dm_unplug_all+0x0/0x70 [dm_mod]
Mar 14 05:54:39 node02 kernel: [<ffffffff8109b5d9>] ? ktime_get_ts+0xa9/0xe0
Mar 14 05:54:39 node02 kernel: [<ffffffff814f81b3>] io_schedule+0x73/0xc0
Mar 14 05:54:39 node02 kernel: [<ffffffff811b473e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Mar 14 05:54:39 node02 kernel: [<ffffffff8125b7bd>] ? get_disk+0x7d/0xf0
Mar 14 05:54:39 node02 kernel: [<ffffffff811b4c2e>] __blockdev_direct_IO+0x5e/0xd0
Mar 14 05:54:39 node02 kernel: [<ffffffff811b14f0>] ? blkdev_get_blocks+0x0/0xc0
Mar 14 05:54:39 node02 kernel: [<ffffffff8127065a>] ? kobject_get+0x1a/0x30
Mar 14 05:54:39 node02 kernel: [<ffffffff811b2357>] blkdev_direct_IO+0x57/0x60
Mar 14 05:54:39 node02 kernel: [<ffffffff811b14f0>] ? blkdev_get_blocks+0x0/0xc0
Mar 14 05:54:39 node02 kernel: [<ffffffff8111491b>] generic_file_aio_read+0x6bb/0x700
Mar 14 05:54:39 node02 kernel: [<ffffffff81216d11>] ? avc_has_perm+0x71/0x90
Mar 14 05:54:39 node02 kernel: [<ffffffff8121080f>] ? security_inode_permission+0x1f/0x30
Mar 14 05:54:39 node02 kernel: [<ffffffff8117901a>] do_sync_read+0xfa/0x140
Mar 14 05:54:39 node02 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 14 05:54:39 node02 kernel: [<ffffffff811b192c>] ? block_ioctl+0x3c/0x40
Mar 14 05:54:39 node02 kernel: [<ffffffff8118bfc2>] ? vfs_ioctl+0x22/0xa0
Mar 14 05:54:39 node02 kernel: [<ffffffff8121c2fb>] ? selinux_file_permission+0xfb/0x150
Mar 14 05:54:39 node02 kernel: [<ffffffff8120f6a6>] ? security_file_permission+0x16/0x20
Mar 14 05:54:39 node02 kernel: [<ffffffff81179a15>] vfs_read+0xb5/0x1a0
Mar 14 05:54:39 node02 kernel: [<ffffffff81179b51>] sys_read+0x51/0x90
Mar 14 05:54:39 node02 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b


clmvd gets stuck and times out. 

Expected results:

pvmove should complete successfully

Additional info:
Comment 2 Milan Broz 2012-03-22 08:21:27 EDT
Please can you retest with 6.3 packages on errata? Thanks.

Also note there is bug #799071 (could be related).
Comment 3 Nenad Peric 2012-03-22 10:57:02 EDT
I have tested it today with the following:

lvm2-libs-2.02.96-0.82.el6.x86_64
lvm2-2.02.96-0.82.el6.x86_64
lvm2-cluster-2.02.96-0.82.el6.x86_64
device-mapper-libs-1.02.75-0.82.el6.x86_64
device-mapper-1.02.75-0.82.el6.x86_64
device-mapper-event-1.02.75-0.82.el6.x86_64
device-mapper-event-libs-1.02.75-0.82.el6.x86_64
cmirror-2.02.96-0.82.el6.x86_64

kernel: 2.6.32-251.el6.x86_64



(09:34:25) [root@node01:~]$ vgcreate pv_shuffle_A /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1
  No physical volume label read from /dev/sdb1
  No physical volume label read from /dev/sdc1
  No physical volume label read from /dev/sdd1
  Physical volume "/dev/sdb1" successfully created
  Physical volume "/dev/sdc1" successfully created
  Physical volume "/dev/sdd1" successfully created
  Clustered volume group "pv_shuffle_A" successfully created


(09:36:01) [root@node01:~]$ lvcreate -n linear -L 3.5G pv_shuffle_A
  Logical volume "linear" created

(09:36:50) [root@node01:~]$ lvcreate -i 3 -n stripe -L 3.5G pv_shuffle_A
  Using default stripesize 64.00 KiB
  Rounding size (896 extents) up to stripe boundary size (897 extents)
  Logical volume "stripe" created

(09:37:06) [root@node01:~]$ lvs
  LV      VG           Attr     LSize Pool Origin Data%  Move Log Copy%  Convert
  lv_root VolGroup     -wi-ao-- 7.54g                                           
  lv_swap VolGroup     -wi-ao-- 1.97g                                           
  linear  pv_shuffle_A -wi-a--- 3.50g                                           
  stripe  pv_shuffle_A -wi-a--- 3.50g            


(09:37:19) [root@node01:~]$ vgextend pv_shuffle_A /dev/sde1
  No physical volume label read from /dev/sde1
  Physical volume "/dev/sde1" successfully created
  Volume group "pv_shuffle_A" successfully extended

(09:37:37) [root@node01:~]$ vgchange -an pv_shuffle_A
  0 logical volume(s) in volume group "pv_shuffle_A" now active

(09:38:31) [root@node01:~]$ pvs
  PV         VG           Fmt  Attr PSize PFree
  /dev/sda1  pv_shuffle_A lvm2 a--  9.76g 5.09g
  /dev/sdb1  pv_shuffle_A lvm2 a--  9.76g 8.59g
  /dev/sdc1  pv_shuffle_A lvm2 a--  9.76g 8.59g
  /dev/sdd1  pv_shuffle_A lvm2 a--  9.76g 9.76g
  /dev/sde1  pv_shuffle_A lvm2 a--  9.76g 9.76g
  /dev/vda2  VolGroup     lvm2 a--  9.51g    0 

(09:38:39) [root@node01:~]$ lvs -o +devices
  LV      VG           Attr     LSize Pool Origin Data%  Move Log Copy%  Convert Devices                                 
  lv_root VolGroup     -wi-ao-- 7.54g                                            /dev/vda2(0)                            
  lv_swap VolGroup     -wi-ao-- 1.97g                                            /dev/vda2(1930)                         
  linear  pv_shuffle_A -wi----- 3.50g                                            /dev/sda1(0)                            
  stripe  pv_shuffle_A -wi----- 3.50g                                            /dev/sda1(896),/dev/sdb1(0),/dev/sdc1(0)

pvmove -v /dev/sda1 /dev/sde1
    Finding volume group "pv_shuffle_A"
    Executing: /sbin/modprobe dm-log-userspace
    Archiving volume group "pv_shuffle_A" metadata (seqno 4).
    Creating logical volume pvmove0
    Moving 896 extents of logical volume pv_shuffle_A/linear
    Moving 299 extents of logical volume pv_shuffle_A/stripe
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/pv_shuffle_A" (seqno 5).
    Checking progress before waiting every 15 seconds
  /dev/sda1: Moved: 0.6%
  /dev/sda1: Moved: 24.5%
  /dev/sda1: Moved: 43.3%
  /dev/sda1: Moved: 45.9%
  /dev/sda1: Moved: 54.6%
  /dev/sda1: Moved: 63.7%
  /dev/sda1: Moved: 64.4%
  /dev/sda1: Moved: 68.9%
  /dev/sda1: Moved: 75.0%
    Updating volume group metadata
  Error locking on node node03: Command timed out
  Error locking on node node02: Command timed out
  Error locking on node node01: Command timed out


And again in /var/log/messages:

Mar 22 09:47:28 node01 kernel: INFO: task clvmd:3350 blocked for more than 120 seconds.
Mar 22 09:47:28 node01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 09:47:28 node01 kernel: clvmd         D 0000000000000000     0  3350      1 0x00000000
Mar 22 09:47:28 node01 kernel: ffff88003b69db18 0000000000000086 ffff8800ffffffff 00000730195c24c8
Mar 22 09:47:28 node01 kernel: ffff88003b69dae8 ffff880037485bb0 0000000000050250 ffffffffb094ced0
Mar 22 09:47:28 node01 kernel: ffff880038f61058 ffff88003b69dfd8 000000000000fb48 ffff880038f61058
Mar 22 09:47:28 node01 kernel: Call Trace:
Mar 22 09:47:28 node01 kernel: [<ffffffff8109b5d9>] ? ktime_get_ts+0xa9/0xe0
Mar 22 09:47:28 node01 kernel: [<ffffffff814f81b3>] io_schedule+0x73/0xc0
Mar 22 09:47:28 node01 kernel: [<ffffffff811b473e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Mar 22 09:47:28 node01 kernel: [<ffffffff8125b7bd>] ? get_disk+0x7d/0xf0
Mar 22 09:47:28 node01 kernel: [<ffffffff811b4c2e>] __blockdev_direct_IO+0x5e/0xd0
Mar 22 09:47:28 node01 kernel: [<ffffffff811b14f0>] ? blkdev_get_blocks+0x0/0xc0
Mar 22 09:47:28 node01 kernel: [<ffffffff8127065a>] ? kobject_get+0x1a/0x30
Mar 22 09:47:28 node01 kernel: [<ffffffff811b2357>] blkdev_direct_IO+0x57/0x60
Mar 22 09:47:28 node01 kernel: [<ffffffff811b14f0>] ? blkdev_get_blocks+0x0/0xc0
Mar 22 09:47:28 node01 kernel: [<ffffffff8111491b>] generic_file_aio_read+0x6bb/0x700
Mar 22 09:47:28 node01 kernel: [<ffffffff81216d11>] ? avc_has_perm+0x71/0x90
Mar 22 09:47:28 node01 kernel: [<ffffffff8121080f>] ? security_inode_permission+0x1f/0x30
Mar 22 09:47:28 node01 kernel: [<ffffffff8117901a>] do_sync_read+0xfa/0x140
Mar 22 09:47:28 node01 kernel: [<ffffffff81090990>] ? autoremove_wake_function+0x0/0x40
Mar 22 09:47:28 node01 kernel: [<ffffffff811b192c>] ? block_ioctl+0x3c/0x40
Mar 22 09:47:28 node01 kernel: [<ffffffff8118bfc2>] ? vfs_ioctl+0x22/0xa0
Mar 22 09:47:28 node01 kernel: [<ffffffff8121c2fb>] ? selinux_file_permission+0xfb/0x150
Mar 22 09:47:28 node01 kernel: [<ffffffff8120f6a6>] ? security_file_permission+0x16/0x20
Mar 22 09:47:28 node01 kernel: [<ffffffff81179a15>] vfs_read+0xb5/0x1a0
Mar 22 09:47:28 node01 kernel: [<ffffffff81179b51>] sys_read+0x51/0x90
Mar 22 09:47:28 node01 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b



I am not sure if this is the same case like  bug #799071 since it does not really say that any device is busy.. It just gets stuck and then crashes. 

The process still waits:

(09:51:23) [root@node01:/var/log]$ ps -ef | grep pvmo
root      3602  3382  0 09:41 pts/1    00:00:00 pvmove -v /dev/sda1 /dev/sde1


(09:54:30) [root@node01:/var/log]$ clustat
Cluster Status for STSRHTS7541 @ Thu Mar 22 09:55:07 2012
Member Status: Quorate

 Member Name                                 ID   Status
 ------ ----                                 ---- ------
 node01                                          1 Online, Local
 node02                                          2 Online
 node03                                          3 Online

(10:02:34) [root@node01:/var/log]$ grep -e '^\s*locking_type ' /etc/lvm/lvm.conf
    locking_type = 3


If you want me to test it with some specific versions of packages, please let me know and I can do that.
Comment 4 Corey Marthaler 2012-03-22 15:48:39 EDT
I'm able to reproduce this issue with the latest errata rpms as well. This is a regression from the rhel6.2 rpms.


2.6.32-251.el6.x86_64
lvm2-2.02.95-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
lvm2-libs-2.02.95-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
lvm2-cluster-2.02.95-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
udev-147-2.40.el6    BUILT: Fri Sep 23 07:51:13 CDT 2011
device-mapper-1.02.74-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-libs-1.02.74-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-event-1.02.74-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
device-mapper-event-libs-1.02.74-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012
cmirror-2.02.95-2.el6    BUILT: Fri Mar 16 08:39:54 CDT 2012


Mar 22 14:38:41 grant-01 kernel: INFO: task clvmd:7411 blocked for more than 120 seconds.
Mar 22 14:38:41 grant-01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 22 14:38:41 grant-01 kernel: clvmd         D 0000000000000000     0  7411      1 0x00000080
Mar 22 14:38:41 grant-01 kernel: ffff88011b7c5b18 0000000000000082 0000000000000000 ffffffffa00041fc
Mar 22 14:38:41 grant-01 kernel: ffff88011b7c5ae8 0000000012bfce9b ffff88011b7c5b08 ffff880102c33ec0
Mar 22 14:38:41 grant-01 kernel: ffff88011b80f098 ffff88011b7c5fd8 000000000000fb48 ffff88011b80f098
Mar 22 14:38:41 grant-01 kernel: Call Trace:
Mar 22 14:38:41 grant-01 kernel: [<ffffffffa00041fc>] ? dm_table_unplug_all+0x5c/0x100 [dm_mod]
Mar 22 14:38:41 grant-01 kernel: [<ffffffff814f81b3>] io_schedule+0x73/0xc0
Mar 22 14:38:41 grant-01 kernel: [<ffffffff811b473e>] __blockdev_direct_IO_newtrunc+0x6fe/0xb90
Mar 22 14:38:41 grant-01 kernel: [<ffffffff8125b7bd>] ? get_disk+0x7d/0xf0
Mar 22 14:38:41 grant-01 kernel: [<ffffffff811b4c2e>] __blockdev_direct_IO+0x5e/0xd0
Mar 22 14:38:41 grant-01 kernel: [<ffffffff811b14f0>] ? blkdev_get_blocks+0x0/0xc0
[...]
Comment 5 Milan Broz 2012-03-25 14:49:56 EDT
Can you try to remove (you can just delete it) lvmetad udev file?
/lib/udev/rules.d/69-dm-lvm-metad.rules
and retry?

What happens here:

This rule always calls on every change event (without any check if lvmetad is running)

ENV{ID_FS_TYPE}=="LVM2_member|LVM1_member", RUN+="$env{LVM_SBIN_PATH}/lvm pvscan --cache --major $major --minor $minor"                     

For some reason, this code always tries to take VG_GLOBAL READ lock, this is propagated across the cluster and deadlocks clvmd.

This code is completely broken. It MUST NOT take any lock here, especially when lvmetad is not running (default).
Comment 6 Milan Broz 2012-03-27 07:06:38 EDT
Created attachment 573019 [details]
pvmove + clvmd logs

To reporoduce, you need two volumes active on one moved PV:

[root@bar-01 ~]# vgchange -a y vg_test
  2 logical volume(s) in volume group "vg_test" now active

[root@bar-01 ~]# lvs -o +devices vg_test
  LV   VG      Attr     LSize Pool Origin Data%  Move Log Copy%  Convert Devices    
  lv1  vg_test -wi-a--- 4.00m                                            /dev/sdf(0)
  lv2  vg_test -wi-a--- 4.00m                                            /dev/sdf(1)

and now run
#pvmove -i1 -vvvv /dev/sdf
Comment 7 Milan Broz 2012-03-27 07:11:28 EDT
and of course pvscan is in ps ax:

bar-01 (local)
13116 ?        S<     0:00 /sbin/lvm pvscan --cache --major 8 --minor 64
13127 ?        S<     0:00 /sbin/lvm pvscan --cache --major 8 --minor 80

8:80 is /dev/sdf
8:64 is /dev/sde

bar-02 (remote)
18243 ?        S<     0:00 /sbin/lvm pvscan --cache --major 8 --minor 80
Comment 8 Milan Broz 2012-03-28 11:20:21 EDT
Here is the description what happens here, new udev rule is just trigger, bug was introduced in pvmove by Alasdair's fixes...

- you have segmented pvmove (2 or more LVs active on one PV moving)

- you have running clvmd (no need to clustered pvmove, exclusive pvmove on one node is enough)

- pvmove suspends top-level devices and configures additional pvmove mirror device,
then resumes all -> pvmove starts

- when the first segment (LV) finishes move, poll daemon updates metadata (vg_write)

- vg_write triggers watch rule
- shortly after pvmove suspends temporary mirror (but not top-level devices)

- vg_write, asynchronously, triggers multiple pvscan --cache on PVs from udev

- pvscan(s) take VG_GLOBAL read lock

- clvmd receives VG_GLOBAL lock and calls refresh cache (expected effect of VG_GLOBAL)

- inside cache refresh, clvmd tries to read top-level device (which is not suspended) but waits
because underlying pvmove layer device IS suspended

- clvmd is now blocked inside lvm lock, queuing all operations (including pvmove resume itself)

Logic in pvmove was changed by the patch

commit 76564e3168b61b79c7486a034b2b53819ee4b28f
Author: Alasdair Kergon <agk@redhat.com>
Date:   Sat Jun 11 00:03:06 2011 +0000

    Major pvmove fix to issue ioctls in the correct order when multiple LVs
    are affected by the move.  (Currently it's possible for I/O to become
    trapped between suspended devices amongst other problems.
    
    The current fix was selected so as to minimise the testing surface.  I
    hope eventually to replace it with a cleaner one that extends the
    deptree code.
    
    Some lvconvert scenarios still suffer from related problems.


This patch DOES NOT suspend top level devices on next segment update, only temporary mirror in between.

So the usual logic (ignoring mirrors and suspended devices) doesn't work here to filter top-level devices out.

Previously (before the patch), both top-level devices and temporary mirror layer were suspended, so cache scan ignored these devices.

Not suspending top level device seems to be in conflict with lvm internal principles, so without explanation from Alasdair we cannot fix this - it can reintroduce some bug which was perhaps fixed by this patch.
Comment 9 Alasdair Kergon 2012-03-28 12:41:35 EDT
I see 3 problems there:

1) pvscan --cache from the udev rule for lvmetad should NOT be getting a clvmd VG_GLOBAL lock:  we do not yet support lvmetad in combination with clvmd.
If pvscan --cache is run and locking_type is not 0,1 or 5, the code should exit without getting any locks or updating anything.

2) the 'is this dev suspended?' check needs fixing to check also recursively if any device this dev is using is suspended if it's not doing that already

3) the vg_write is not changing anything that udev cares about so it should be exempt from watch rules if we can find a way to do that


"Not suspending top level device seems to be in conflict with lvm internal
principles" - the table of the top-level device is not changing so there is no point in suspending it.

Dev A uses Dev B that uses Dev C.

Only the table for Dev B is changing, so Dev B needs reloading.  Dev A is unaffected.

E.g. This becomes important if there are 100 Dev As using Dev B.
After the patch only 1 dev is suspended and resumed.
Before the patch, 101 devs got suspended and resumed - a considerable amount of extra work for no reason.
Comment 10 Zdenek Kabelac 2012-03-28 13:23:44 EDT
I think change 2) is very expensive compared with occasional need of having 100 devs suspend and resumed - and would have to be done for every tree operation - while those potentially unreasonable suspend/resume happens only on very few occasions (and probably with significantly smaller number of devs) 

Moreover it makes it FAR more obvious from the plan check of dmsetup info -c why device has stopped accepting any bio and waits for something (suspend device below)

I'd make things simpler and would rather stay with starting suspend from toplevel devices  (and there is still my acceleration patch for avoid suspend of devices deeper in the tree - which is IMHO much better optimization we should focus on)


3) avoiding watch rule for vg_write would be hard - you can't control them - user may add as many watches as they want...


1) pvscan is accessing same device which is locked by another VG lock while only using global lock - IMHO this breaks some internal assumption in the lvm code - also it's quite questionable from the performance point of view that each modified device invokes immediate parallel scan from udev rule... - i.e. 1 VG on 300PVs this is highly inefficient....
Comment 11 Peter Rajnoha 2012-03-29 04:38:04 EDT
(In reply to comment #9)
> 3) the vg_write is not changing anything that udev cares about so it should be
> exempt from watch rules if we can find a way to do that

(In reply to comment #10)
> 3) avoiding watch rule for vg_write would be hard - you can't control them -
> user may add as many watches as they want...

Also, unfortunately, we can't control watch rule behaviour for *non-dm devices* (no cookie). And this one is exactly the case - the "pvscan --cache" is triggered based on the watch rule set in 60-persistent-storage.rules where it's set for underlying block devices in general. So the solution can't count with 3).
Comment 12 Peter Rajnoha 2012-04-10 05:47:39 EDT
(In reply to comment #8)
> - when the first segment (LV) finishes move, poll daemon updates metadata
> (vg_write)
> 
> - vg_write triggers watch rule
> - shortly after pvmove suspends temporary mirror (but not top-level devices)
> 
> - vg_write, asynchronously, triggers multiple pvscan --cache on PVs from udev

The watch udev rule has been removed in lvm2-2.02.95-3.el6 build, so we should not run into this conrete problem anymore. But still, we need to think about how to avoid such situations to happen and consider all consequences in general when the global lock is taken and it fires the rescan like we saw here...
Comment 13 Milan Broz 2012-04-10 10:49:26 EDT
That bug disappeare with watch rule removal.

lvmetad & cluster is not yet supported configuration, closing this fixed in currrent release.
Comment 14 Corey Marthaler 2012-10-23 13:17:30 EDT
Reopening this now that lvmetad is supported in 6.4. What is the solution, to have a document telling everyone that they need to remove the udev rules?
Comment 21 Corey Marthaler 2012-10-24 16:58:09 EDT
Furthermore, as soon as I kill lvmetad and turn it off in lvm.conf, I can start creating cluster volumes again. Then, turn it back on, start it again, and I can no longer create them.

2.6.32-330.el6.x86_64

lvm2-2.02.98-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
lvm2-libs-2.02.98-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
lvm2-cluster-2.02.98-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
udev-147-2.43.el6    BUILT: Thu Oct 11 05:59:38 CDT 2012
device-mapper-1.02.77-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
device-mapper-libs-1.02.77-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
device-mapper-event-1.02.77-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
device-mapper-event-libs-1.02.77-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
cmirror-2.02.98-2.el6    BUILT: Tue Oct 16 05:15:59 CDT 2012
Comment 22 Peter Rajnoha 2012-10-25 06:35:04 EDT
I did a scratch build with a change that I believe is the cause of the problem (the problem might be using the global lock in pvscan --cache while that one is called in udev rules at the same time). Please, give it a try whether you can still hit the issues reported. Thanks.

  https://brewweb.devel.redhat.com/taskinfo?taskID=5014490
Comment 24 Peter Rajnoha 2012-10-25 07:56:36 EDT
Thing with cluster is that lvmetad *should* detect locking_type=3 used and if it's used, it should do everything without lvmetad - so a fallback to the classical original way (as we don't support lvmetad + cluster now).

However, the 69-dm-lvmetad.rules does not know whether the device processed is clustered or not (and doesn't need to!) - it calls pvscan --cache and that one takes global lock. Either we must remove the global lock (which is the scratch build I attached in comment #22) OR we must check whether locking_type=3 is used and if yes, we should exit from pvscan cleanly, effectively doing no operation until cluster is supported with pvscan cache. And I'm afraid we're not doing that right now.
Comment 25 Marian Csontos 2012-10-25 08:00:32 EDT
At the end the bug is still present with the rules present though the reproducibility is lower than it used to be.
Comment 26 Peter Rajnoha 2012-10-25 08:02:19 EDT
IOW, we need locking_type=3 => use_lvmetad=0, not matter what the current setting of use_lvmetad is!
Comment 27 Peter Rajnoha 2012-10-25 08:08:38 EDT
So, it seems, we don't have such check yet! This must be documented as "Known Issue" in 6.4 documentation. Mornfall is preparing a simple patch that will cover this hole (a fallback to use_lvmetad=0 in case cluster locking is used).
Comment 29 Peter Rajnoha 2012-10-25 08:55:51 EDT
Patched, I'll add that in next 6.4 build. See also bug #814779 comment #9.
Comment 30 Peter Rajnoha 2012-10-25 09:41:59 EDT
So here's a 6.4 scratch with the patch from previous comment:

  https://brewweb.devel.redhat.com/taskinfo?taskID=5015211
Comment 31 Peter Rajnoha 2012-11-06 15:14:42 EST
There's now a warning message and a fallback to locking type 1 (file-based locking) if lvmetad is used with locking type 3 (clustered) which is not yet supported - bug #814779.
Comment 33 Nenad Peric 2012-11-13 10:15:26 EST
Is it supposed to turn of use_lvmetad or to move to locking_type 1?

So far any command executed under cluster with lvmetad enabled and running gives out:

(09:16:38) [root@r6-node01:~]$ lvs
  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
  WARNING: lvmetad is running but disabled. Restart lvmetad before enabling it!
  LV      VG       Attr      LSize Pool Origin Data%  Move Log Cpy%Sync Convert
  lv_root VolGroup -wi-ao--- 7.54g                                             
  lv_swap VolGroup -wi-ao--- 1.97g                                             


So it turns off use_lvmetad setting in the lvm.conf and keeps clustered locking. 

This in its own right is not an issue for me, just asking if this is what was intended since it differs from what is said in #Comment 31 .
Comment 34 Marian Csontos 2012-11-13 13:49:08 EST
It's supposed to turn off lvmetad.
Comment 35 Corey Marthaler 2012-11-14 14:39:04 EST
The pvmove tests pass now with the latest rpms, giving the warning that Nenad mentioned in comment #33.

  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.
  WARNING: configuration setting use_lvmetad overriden to 0 due to locking_type 3. Clustered environment not supported by lvmetad yet.

Marking Verified.

2.6.32-338.el6.x86_64

lvm2-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
lvm2-libs-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
lvm2-cluster-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
udev-147-2.43.el6    BUILT: Thu Oct 11 05:59:38 CDT 2012
device-mapper-1.02.77-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
device-mapper-libs-1.02.77-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
device-mapper-event-1.02.77-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
device-mapper-event-libs-1.02.77-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
cmirror-2.02.98-3.el6    BUILT: Mon Nov  5 06:45:48 CST 2012
Comment 36 errata-xmlrpc 2013-02-21 03:09:14 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0501.html

Note You need to log in before you can comment on or make changes to this bug.