Bug 1174884
| Field | Value |
|---|---|
| Summary: | clvmd/cmirror hangs on pvmove > 2TB |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | lvm2 |
| lvm2 sub component: | Clustered Mirror / cmirrord (RHEL6) |
| Reporter: | Jon Magrini <jmagrini> |
| Assignee: | Jonathan Earl Brassow <jbrassow> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | urgent |
| CC: | agk, cww, djeffery, heinzm, jbrassow, jherrman, jpriddy, jruemker, kstrickl, msnitzer, nperic, prajnoha, prockai, wnix, zkabelac |
| Version: | 6.6 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | lvm2-2.02.117-1.el6 |
| Doc Type: | Bug Fix |
| Doc Text: | Do not document, already released via ZStream |
| Story Points: | --- |
| Clone Of: | |
| : | 1212089 (view as bug list) |
| Environment: | |
| Last Closed: | 2015-07-22 07:37:57 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | --- |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 682771 |
| Bug Blocks: | 1075802, 1212089 |
Description
Jon Magrini
2014-12-16 16:33:58 UTC
Reproducer:

# grep gfs2 /proc/mounts
/dev/dm-2 /gfs2_data gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0

# service cmirrord start
Starting cmirrord:                                         [  OK  ]

# vgextend -v vg_gfs2 /dev/vdc1
    DEGRADED MODE. Incomplete RAID LVs will be processed.
    Checking for volume group "vg_gfs2"
    Archiving volume group "vg_gfs2" metadata (seqno 6).
    Wiping cache of LVM-capable devices
    Adding physical volume '/dev/vdc1' to volume group 'vg_gfs2'
    Volume group "vg_gfs2" will be extended by 1 new physical volumes
    Creating volume group backup "/etc/lvm/backup/vg_gfs2" (seqno 7).
  Volume group "vg_gfs2" successfully extended

# pvscan
  PV /dev/vdb1   VG vg_gfs2   lvm2 [2.00 TiB / 0    free]
  PV /dev/vdc1   VG vg_gfs2   lvm2 [2.00 TiB / 2.00 TiB free]

# lvs -o +devices
  LV      VG      Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lv_gfs2 vg_gfs2 -wi-ao---- 2.00t                                                     /dev/vdb1(0)

# pvmove -v /dev/vdb1 /dev/vdc1
    DEGRADED MODE. Incomplete RAID LVs will be processed.
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "vg_gfs2"
    Archiving volume group "vg_gfs2" metadata (seqno 7).
    Creating logical volume pvmove0
    Moving 524519 extents of logical volume vg_gfs2/lv_gfs2
    Updating volume group metadata
    <pauses_here>
  Error locking on node rhel6-node2-priv: Command timed out
  ABORTING: Temporary pvmove mirror activation failed. Run pvmove --abort.
  Error locking on node rhel6-node2-priv: Command timed out
  Unable to resume logical volumes

# tail -f /var/log/messages
Dec 16 19:06:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: version 1.3.0 loaded
Dec 16 19:06:38 rhel6-node2 rsyslogd-2177: imuxsock begins to drop messages from pid 2294 due to rate-limiting
Dec 16 19:06:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/3] - retrying
Dec 16 19:07:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/4] - retrying
Dec 16 19:07:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/5] - retrying
Dec 16 19:07:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/6] - retrying
Dec 16 19:07:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/7] - retrying
Dec 16 19:08:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/8] - retrying
Dec 16 19:08:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/9] - retrying
Dec 16 19:08:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/10] - retrying
Dec 16 19:08:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/11] - retrying
Dec 16 19:09:05 rhel6-node2 kernel: INFO: task gfs2_quotad:2385 blocked for more than 120 seconds.
Dec 16 19:09:05 rhel6-node2 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Dec 16 19:09:05 rhel6-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 16 19:09:05 rhel6-node2 kernel: gfs2_quotad   D 0000000000000000     0  2385      2 0x00000080
Dec 16 19:09:05 rhel6-node2 kernel: ffff88003ad17c20 0000000000000046 0000000537ead000 0000000400000018
Dec 16 19:09:05 rhel6-node2 kernel: ffff88000000001c ffffffffa0389030 ffff88003dfe3098 ffffffffa0388fe0
Dec 16 19:09:05 rhel6-node2 kernel: ffff880000000005 ffff88003dfe3148 ffff88003b30c638 ffff88003ad17fd8
Dec 16 19:09:05 rhel6-node2 kernel: Call Trace:
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0389030>] ? gdlm_ast+0x0/0x210 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0388fe0>] ? gdlm_bast+0x0/0x50 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0368100>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa036810e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152ac7f>] __wait_on_bit+0x5f/0x90
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0368100>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152ad28>] out_of_line_wait_on_bit+0x78/0x90
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109eb80>] ? wake_bit_function+0x0/0x50
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa03690e5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa036c619>] gfs2_glock_nq+0x2c9/0x410 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff81087fdb>] ? try_to_del_timer_sync+0x7b/0xe0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0385469>] gfs2_statfs_sync+0x59/0x1c0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152a84a>] ? schedule_timeout+0x19a/0x2e0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0385461>] ? gfs2_statfs_sync+0x51/0x1c0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037cc27>] quotad_check_timeo+0x57/0xb0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037ceb4>] gfs2_quotad+0x234/0x2b0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037cc80>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Dec 16 19:09:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/12] - retrying
Dec 16 19:09:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/13] - retrying
Dec 16 19:09:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/14] - retrying
Dec 16 19:09:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/15] - retrying
Dec 16 19:10:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/16] - retrying
Dec 16 19:10:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/17] - retrying
Dec 16 19:17:27 rhel6-node2 fenced[2365]: fence rhel6-node1-priv success
Dec 16 19:17:28 rhel6-node2 kernel: GFS2: fsid=ex-cluster:vg_gfs2.1: jid=0: Looking at journal...
Dec 16 19:17:28 rhel6-node2 kernel: GFS2: fsid=ex-cluster:vg_gfs2.1: jid=0: Done
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [QUORUM] Members[2]: 1 2
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [QUORUM] Members[2]: 1 2
Dec 16 19:17:54 rhel6-node2 kernel: dlm: got connection from 1

The following is then logged continuously:

Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
...

Infinite loop when failing to issue checkpoint creation:

525 sync_create_retry:
526         rv = saCkptSectionCreate(h, &section_attr,
527                                  cp->sync_bits, cp->bitmap_size);
528         if (rv == SA_AIS_ERR_TRY_AGAIN) {
529                 LOG_ERROR("Sync checkpoint section create retry");
530                 usleep(1000);

(gdb) p section_attr
$5 = {sectionId = 0x7fff960b3480, expirationTime = 9223372036854775807}
(gdb) p *cp
$6 = {requester = 2, uuid = "LVM-8Es4nqRaLqAdfX4zkQl7Hzfp7pZx3NyFdbKogJ8W2CYs2vEJIWyztVCuGGIgFCDt", '\000' <repeats 59 times>, bitmap_size = 524292, sync_bits = 0x7f2a74b0a5b0 "", clean_bits = 0x7f2a74a8a5a0 "", recovering_region = 0x7f2a74563a90 "0 1", next = 0x7f2a74885f60}

This seems like a bug in the checkpoint code to me, but we have already worked around it in RHEL 7, where there is no checkpoint support. I will try that option and see if things work better.

This is still a problem even when not using the old checkpoint code. Also, I found a very simple reproducer. Increasing the region_size is a workaround for this issue.
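The bitmap_size value in the gdb dump above (524292) lines up with a back-of-the-envelope calculation. A minimal sketch, assuming pvmove used the default 512 KiB mirror region size and one bit per region (the few extra bytes of header are not modeled here):

```shell
# Estimate the cmirror sync bitmap size for the 2 TiB pvmove above.
lv_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB logical volume
region_bytes=$(( 512 * 1024 ))                  # default mirror region size (assumed)
regions=$(( lv_bytes / region_bytes ))          # one bit per region
bitmap_bytes=$(( regions / 8 ))                 # bits -> bytes
echo "$regions regions -> $bitmap_bytes byte bitmap"
```

The 524288-byte result is close to the 524292 bytes seen in gdb, and a checkpoint carrying both the sync_bits and clean_bits arrays shown in the struct would land right around the ~1 MB payload limit discussed below.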
[root@bp-01 ~]# lvcreate --type mirror --corelog --regionsize 2M -L 2T -n lv vg
  Logical volume "lv" created.
[root@bp-01 ~]# ^C
[root@bp-01 ~]# lvremove -ff vg
  Logical volume "lv" successfully removed
[root@bp-01 ~]# lvcreate --type mirror --corelog --regionsize 1M -L 2T -n lv vg
*** HANG ***

There seems to be an undocumented size limit on messages that is being hit when sending checkpoint data. Rather than returning CS_ERR_TOO_BIG or something similar, CS_ERR_TRY_AGAIN is returned, which invites repeated failures. I'll ask the CPG folks about the size limit.

Confirmed. There is an ~1MB payload limit that is not documented. They are working on lifting the limit, but I don't know when that will land. The region_size will need to be adjusted by the code to limit the bitmap size, thereby avoiding the issue.

This will be fixed by Bug 682771. That bug is not targeted at 6.7, however, so we will need to warn users in the meantime.

The CPG payload size limit affects more than just 'pvmove'. In a cluster, it affects:
- pvmove
- creation/activation of mirrors
- conversion from linear to mirror

It also affects mirrors smaller than 2TB if the 'region_size' of the mirror is reduced. The region size can be, at a minimum, the system page size - usually 4kB. If we reduce the region size to 4kB, we can encounter this bug when creating a mirror of just 8GB (e.g. lvcreate -m1 --regionsize 4k -L 8G -n hanging_mirror vg). Increasing the region size will alleviate the problem. I will make the LVM code auto-adjust the region size higher to cope with this issue until it is resolved by the CPG library.

Sorry, 16G. Creating a mirror with a region size of 4k will cause the problem to occur for a mirror that is 16GB in size.

Fix (aka work-around) committed upstream:

commit dd0ee35378cc2ff405183fea9a3d970aab96ac63
Author: Jonathan Brassow <jbrassow>
Date:   Wed Feb 25 14:42:15 2015 -0600

    cmirror: Adjust region size to work around CPG msg limit to avoid hang.
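The corrected 16G figure checks out arithmetically. A minimal sketch, assuming one bit per region and that both the sync and clean bitmaps (both present in the cp struct shown earlier) are carried in the checkpoint:

```shell
# Why a 16 GiB mirror with 4 KiB regions also trips the ~1 MB CPG limit.
lv_bytes=$(( 16 * 1024 * 1024 * 1024 ))         # 16 GiB mirror
region_bytes=$(( 4 * 1024 ))                    # 4 KiB region size (page size)
bitmap_bytes=$(( lv_bytes / region_bytes / 8 )) # one bit per region
total=$(( 2 * bitmap_bytes ))                   # sync_bits + clean_bits (assumed)
echo "$bitmap_bytes bytes per bitmap, $total bytes checkpointed"
```

Two 512 KiB bitmaps come to exactly 1 MiB, which is why 16 GiB (not 8 GiB) is where a 4k-region mirror starts hanging.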
    cmirror uses the CPG library to pass messages around the cluster and maintain
    its bitmaps. When a cluster mirror starts up, it must send the current state
    to any joining members - a checkpoint. When mirrors are large (or the region
    size is small), the bitmap size can exceed the message limit of the CPG
    library. When this happens, the CPG library returns CPG_ERR_TRY_AGAIN.
    (This is also a bug in CPG, since the message will never be successfully sent.)

    There is an outstanding bug (bug 682771) that is meant to lift this message
    length restriction in CPG, but for now we work around the issue by increasing
    the mirror region size. This limits the size of the bitmap and avoids any
    issues we would otherwise have around checkpointing. Since this issue only
    affects cluster mirrors, the region size adjustments are only made on
    cluster mirrors.

    This patch handles cluster mirror issues involving pvmove, lvconvert (from
    linear to mirror), and lvcreate. It also ensures that when users convert a
    VG from single-machine to clustered, any mirrors with too many regions
    (i.e. a bitmap that would be too large to properly checkpoint) are trapped.

Tested this with lvcreate with >2TB PVs, with pvmove, and additionally with a small region size. The region size was updated to accommodate the message limit. No hang was encountered during any of the commands:

[root@duck-03 ~]# lvcreate --type mirror --corelog --regionsize 1M -L 2T -n lv vg
  Logical volume "lv" created.
[root@duck-03 ~]# pvmove -v /dev/sdl1 /dev/sdl3
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "vg"
    Archiving volume group "vg" metadata (seqno 2).
    Creating logical volume pvmove0
    Moving 524288 extents of logical volume vg/lv_mimage_0
    Increasing mirror region size from 0 to 2048 sectors.
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vg" (seqno 3).
    Checking progress before waiting every 15 seconds
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.1%

[root@duck-03 ~]# lvcreate --type mirror -m1 --regionsize 4k -L16G -n hang vg
  Logical volume "hang" created.
[root@duck-03 ~]# lvs -o+regionsize
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log       Cpy%Sync Convert Region
  hang vg   mwi-a-m--- 16.00g                               hang_mlog  0.07             8.00k

Marking VERIFIED with:

2.6.32-554.el6.x86_64
lvm2-2.02.118-2.el6                        BUILT: Wed Apr 15 13:34:08 CEST 2015
lvm2-libs-2.02.118-2.el6                   BUILT: Wed Apr 15 13:34:08 CEST 2015
lvm2-cluster-2.02.118-2.el6                BUILT: Wed Apr 15 13:34:08 CEST 2015
udev-147-2.61.el6                          BUILT: Mon Mar  2 12:08:11 CET 2015
device-mapper-1.02.95-2.el6                BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-libs-1.02.95-2.el6           BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-event-1.02.95-2.el6          BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-event-libs-1.02.95-2.el6     BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-persistent-data-0.3.2-1.el6  BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.118-2.el6                     BUILT: Wed Apr 15 13:34:08 CEST 2015

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1411.html
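The verification run shows pvmove raising the region size "from 0 to 2048 sectors" (1 MiB) for a 2 TiB volume. A hypothetical sketch of how such an adjustment can be derived by doubling the region size until one bitmap fits a per-bitmap budget; the 256 KiB budget here is an illustrative assumption, not the exact threshold used by the lvm2 patch:

```shell
# Double the mirror region size until the sync bitmap fits comfortably
# under the (undocumented) ~1 MB CPG payload limit.
lv_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB logical volume
region=$(( 512 * 1024 ))                        # start from the 512 KiB default
budget=$(( 256 * 1024 ))                        # per-bitmap budget (assumed)
while [ $(( lv_bytes / region / 8 )) -gt "$budget" ]; do
    region=$(( region * 2 ))
done
echo "region size: $region bytes ($(( region / 512 )) sectors)"
```

With these assumed numbers the loop settles on a 1 MiB region size, i.e. the same 2048 sectors reported in the pvmove output above.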