Bug 1174884
| Field | Value |
|---|---|
| Summary: | clvmd/cmirror hangs on pvmove > 2TB |
| Product: | Red Hat Enterprise Linux 6 |
| Component: | lvm2 |
| lvm2 sub component: | Clustered Mirror / cmirrord (RHEL6) |
| Reporter: | Jon Magrini <jmagrini> |
| Assignee: | Jonathan Earl Brassow <jbrassow> |
| QA Contact: | cluster-qe <cluster-qe> |
| Docs Contact: | |
| Status: | CLOSED ERRATA |
| Severity: | high |
| Priority: | urgent |
| CC: | agk, cww, djeffery, heinzm, jbrassow, jherrman, jpriddy, jruemker, kstrickl, msnitzer, nperic, prajnoha, prockai, wnix, zkabelac |
| Version: | 6.6 |
| Keywords: | ZStream |
| Target Milestone: | rc |
| Target Release: | --- |
| Hardware: | x86_64 |
| OS: | Linux |
| Whiteboard: | |
| Fixed In Version: | lvm2-2.02.117-1.el6 |
| Doc Type: | Bug Fix |
| Doc Text: | Do not document, already released via ZStream |
| Story Points: | --- |
| Clone Of: | |
| : | 1212089 (view as bug list) |
| Environment: | |
| Last Closed: | 2015-07-22 07:37:57 UTC |
| Type: | Bug |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | --- |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Embargoed: | |
| Bug Depends On: | 682771 |
| Bug Blocks: | 1075802, 1212089 |
Description
Jon Magrini
2014-12-16 16:33:58 UTC
Reproducer:

# grep gfs2 /proc/mounts
/dev/dm-2 /gfs2_data gfs2 rw,seclabel,relatime,hostdata=jid=1 0 0

# service cmirrord start
Starting cmirrord:                                         [  OK  ]

# vgextend -v vg_gfs2 /dev/vdc1
    DEGRADED MODE. Incomplete RAID LVs will be processed.
    Checking for volume group "vg_gfs2"
    Archiving volume group "vg_gfs2" metadata (seqno 6).
    Wiping cache of LVM-capable devices
    Adding physical volume '/dev/vdc1' to volume group 'vg_gfs2'
    Volume group "vg_gfs2" will be extended by 1 new physical volumes
    Creating volume group backup "/etc/lvm/backup/vg_gfs2" (seqno 7).
  Volume group "vg_gfs2" successfully extended

# pvscan
  PV /dev/vdb1   VG vg_gfs2   lvm2 [2.00 TiB / 0    free]
  PV /dev/vdc1   VG vg_gfs2   lvm2 [2.00 TiB / 2.00 TiB free]

# lvs -o +devices
  LV      VG      Attr       LSize Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert Devices
  lv_gfs2 vg_gfs2 -wi-ao---- 2.00t                                                     /dev/vdb1(0)

# pvmove -v /dev/vdb1 /dev/vdc1
    DEGRADED MODE. Incomplete RAID LVs will be processed.
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "vg_gfs2"
    Archiving volume group "vg_gfs2" metadata (seqno 7).
    Creating logical volume pvmove0
    Moving 524519 extents of logical volume vg_gfs2/lv_gfs2
    Updating volume group metadata
    <pauses_here>
  Error locking on node rhel6-node2-priv: Command timed out
  ABORTING: Temporary pvmove mirror activation failed. Run pvmove --abort.
  Error locking on node rhel6-node2-priv: Command timed out
  Unable to resume logical volumes

# tail -f /var/log/messages
Dec 16 19:06:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: version 1.3.0 loaded
Dec 16 19:06:38 rhel6-node2 rsyslogd-2177: imuxsock begins to drop messages from pid 2294 due to rate-limiting
Dec 16 19:06:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/3] - retrying
Dec 16 19:07:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/4] - retrying
Dec 16 19:07:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/5] - retrying
Dec 16 19:07:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/6] - retrying
Dec 16 19:07:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/7] - retrying
Dec 16 19:08:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/8] - retrying
Dec 16 19:08:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/9] - retrying
Dec 16 19:08:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/10] - retrying
Dec 16 19:08:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/11] - retrying
Dec 16 19:09:05 rhel6-node2 kernel: INFO: task gfs2_quotad:2385 blocked for more than 120 seconds.
Dec 16 19:09:05 rhel6-node2 kernel:      Not tainted 2.6.32-504.el6.x86_64 #1
Dec 16 19:09:05 rhel6-node2 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 16 19:09:05 rhel6-node2 kernel: gfs2_quotad   D 0000000000000000     0  2385      2 0x00000080
Dec 16 19:09:05 rhel6-node2 kernel: ffff88003ad17c20 0000000000000046 0000000537ead000 0000000400000018
Dec 16 19:09:05 rhel6-node2 kernel: ffff88000000001c ffffffffa0389030 ffff88003dfe3098 ffffffffa0388fe0
Dec 16 19:09:05 rhel6-node2 kernel: ffff880000000005 ffff88003dfe3148 ffff88003b30c638 ffff88003ad17fd8
Dec 16 19:09:05 rhel6-node2 kernel: Call Trace:
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0389030>] ? gdlm_ast+0x0/0x210 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0388fe0>] ? gdlm_bast+0x0/0x50 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0368100>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa036810e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152ac7f>] __wait_on_bit+0x5f/0x90
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0368100>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152ad28>] out_of_line_wait_on_bit+0x78/0x90
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109eb80>] ? wake_bit_function+0x0/0x50
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa03690e5>] gfs2_glock_wait+0x45/0x90 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa036c619>] gfs2_glock_nq+0x2c9/0x410 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff81087fdb>] ? try_to_del_timer_sync+0x7b/0xe0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0385469>] gfs2_statfs_sync+0x59/0x1c0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8152a84a>] ? schedule_timeout+0x19a/0x2e0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa0385461>] ? gfs2_statfs_sync+0x51/0x1c0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037cc27>] quotad_check_timeo+0x57/0xb0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037ceb4>] gfs2_quotad+0x234/0x2b0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffffa037cc80>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109e66e>] kthread+0x9e/0xc0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8109e5d0>] ? kthread+0x0/0xc0
Dec 16 19:09:05 rhel6-node2 kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
Dec 16 19:09:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/12] - retrying
Dec 16 19:09:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/13] - retrying
Dec 16 19:09:35 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/14] - retrying
Dec 16 19:09:50 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/15] - retrying
Dec 16 19:10:05 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/16] - retrying
Dec 16 19:10:20 rhel6-node2 kernel: device-mapper: dm-log-userspace: [JvdmdQtn] Request timed out: [5/17] - retrying
Dec 16 19:17:27 rhel6-node2 fenced[2365]: fence rhel6-node1-priv success
Dec 16 19:17:28 rhel6-node2 kernel: GFS2: fsid=ex-cluster:vg_gfs2.1: jid=0: Looking at journal...
Dec 16 19:17:28 rhel6-node2 kernel: GFS2: fsid=ex-cluster:vg_gfs2.1: jid=0: Done
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [QUORUM] Members[2]: 1 2
Dec 16 19:17:48 rhel6-node2 corosync[2294]:   [QUORUM] Members[2]: 1 2
Dec 16 19:17:54 rhel6-node2 kernel: dlm: got connection from 1

The following is then logged continuously:

Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
Dec 16 19:17:55 rhel6-node2 cmirrord[3429]: Sync checkpoint section create retry
...

Infinite loop when failing to issue checkpoint creation:

525 sync_create_retry:
526         rv = saCkptSectionCreate(h, &section_attr,
527                                  cp->sync_bits, cp->bitmap_size);
528         if (rv == SA_AIS_ERR_TRY_AGAIN) {
529                 LOG_ERROR("Sync checkpoint section create retry");
530                 usleep(1000);

(gdb) p section_attr
$5 = {sectionId = 0x7fff960b3480, expirationTime = 9223372036854775807}
(gdb) p *cp
$6 = {requester = 2, uuid = "LVM-8Es4nqRaLqAdfX4zkQl7Hzfp7pZx3NyFdbKogJ8W2CYs2vEJIWyztVCuGGIgFCDt", '\000' <repeats 59 times>, bitmap_size = 524292, sync_bits = 0x7f2a74b0a5b0 "", clean_bits = 0x7f2a74a8a5a0 "", recovering_region = 0x7f2a74563a90 "0 1", next = 0x7f2a74885f60}

This seems like a bug in the checkpoint code to me, but we have already worked around it in RHEL 7, where there is no checkpoint support. I will try that option and see if things work better.

This is still a problem even when not using the old checkpoint code. Also, I found a very simple reproducer. Increasing the region_size is a workaround for this issue.
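The bitmap_size value in the gdb dump above (524292) lines up with a back-of-the-envelope calculation. A minimal sketch, assuming pvmove used the default 512 KiB mirror region size and one bit per region (the few extra bytes of header are not modeled here):

```shell
# Estimate the cmirror sync bitmap size for the 2 TiB pvmove above.
lv_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB logical volume
region_bytes=$(( 512 * 1024 ))                  # default mirror region size (assumed)
regions=$(( lv_bytes / region_bytes ))          # one bit per region
bitmap_bytes=$(( regions / 8 ))                 # bits -> bytes
echo "$regions regions -> $bitmap_bytes byte bitmap"
```

The 524288-byte result is close to the 524292 bytes seen in gdb, and a checkpoint carrying both the sync_bits and clean_bits arrays shown in the struct would land right around the ~1 MB payload limit discussed below.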
[root@bp-01 ~]# lvcreate --type mirror --corelog --regionsize 2M -L 2T -n lv vg
  Logical volume "lv" created.
[root@bp-01 ~]# ^C
[root@bp-01 ~]# lvremove -ff vg
  Logical volume "lv" successfully removed
[root@bp-01 ~]# lvcreate --type mirror --corelog --regionsize 1M -L 2T -n lv vg
*** HANG ***

There seems to be an undocumented size limit on messages that is being hit when sending checkpoint data. Rather than returning CS_ERR_TOO_BIG or something similar, CS_ERR_TRY_AGAIN is returned, which invites repeated failures. I'll ask the CPG folks about the size limit.

Confirmed. There is an ~1MB payload limit that is not documented. They are working on lifting the limit, but I don't know when that will land. The region_size will need to be adjusted by the code to limit the bitmap size, thereby avoiding the issue.

This will be fixed by Bug 682771. That bug is not targeted at 6.7, however, so we will need to warn users in the meantime.

The CPG payload size limit affects more than just 'pvmove'. In a cluster, it affects:
- pvmove
- creation/activation of mirrors
- conversion from linear to mirror

It also affects mirrors smaller than 2TB if the 'region_size' of the mirror is reduced. The region size can be, at a minimum, the system page size - usually 4kB. If we reduce the region size to 4kB, we can encounter this bug when creating a mirror of just 8GB (e.g. lvcreate -m1 --regionsize 4k -L 8G -n hanging_mirror vg). Increasing the region size will alleviate the problem. I will make the LVM code auto-adjust the region size higher to cope with this issue until it is resolved by the CPG library.

Sorry, 16G. Creating a mirror with a region size of 4k will cause the problem to occur for a mirror that is 16GB in size.

Fix (aka work-around) committed upstream:

commit dd0ee35378cc2ff405183fea9a3d970aab96ac63
Author: Jonathan Brassow <jbrassow>
Date:   Wed Feb 25 14:42:15 2015 -0600

    cmirror: Adjust region size to work around CPG msg limit to avoid hang.
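The corrected 16G figure checks out arithmetically. A minimal sketch, assuming one bit per region and that both the sync and clean bitmaps (both present in the cp struct shown earlier) are carried in the checkpoint:

```shell
# Why a 16 GiB mirror with 4 KiB regions also trips the ~1 MB CPG limit.
lv_bytes=$(( 16 * 1024 * 1024 * 1024 ))         # 16 GiB mirror
region_bytes=$(( 4 * 1024 ))                    # 4 KiB region size (page size)
bitmap_bytes=$(( lv_bytes / region_bytes / 8 )) # one bit per region
total=$(( 2 * bitmap_bytes ))                   # sync_bits + clean_bits (assumed)
echo "$bitmap_bytes bytes per bitmap, $total bytes checkpointed"
```

Two 512 KiB bitmaps come to exactly 1 MiB, which is why 16 GiB (not 8 GiB) is where a 4k-region mirror starts hanging.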
    cmirror uses the CPG library to pass messages around the cluster and maintain
    its bitmaps. When a cluster mirror starts up, it must send the current state
    to any joining members - a checkpoint. When mirrors are large (or the region
    size is small), the bitmap size can exceed the message limit of the CPG
    library. When this happens, the CPG library returns CPG_ERR_TRY_AGAIN.
    (This is also a bug in CPG, since the message will never be successfully sent.)

    There is an outstanding bug (bug 682771) that is meant to lift this message
    length restriction in CPG, but for now we work around the issue by increasing
    the mirror region size. This limits the size of the bitmap and avoids any
    issues we would otherwise have around checkpointing. Since this issue only
    affects cluster mirrors, the region size adjustments are only made on
    cluster mirrors.

    This patch handles cluster mirror issues involving pvmove, lvconvert (from
    linear to mirror), and lvcreate. It also ensures that when users convert a
    VG from single-machine to clustered, any mirrors with too many regions
    (i.e. a bitmap that would be too large to properly checkpoint) are trapped.

Tested this with lvcreate with >2TB PVs, with pvmove, and additionally with a small region size. The region size was updated to accommodate the message limit. No hang was encountered during any of the commands:

[root@duck-03 ~]# lvcreate --type mirror --corelog --regionsize 1M -L 2T -n lv vg
  Logical volume "lv" created.
[root@duck-03 ~]# pvmove -v /dev/sdl1 /dev/sdl3
    Executing: /sbin/modprobe dm-log-userspace
    Finding volume group "vg"
    Archiving volume group "vg" metadata (seqno 2).
    Creating logical volume pvmove0
    Moving 524288 extents of logical volume vg/lv_mimage_0
    Increasing mirror region size from 0 to 2048 sectors.
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/vg" (seqno 3).
    Checking progress before waiting every 15 seconds
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.0%
  /dev/sdl1: Moved: 0.1%

[root@duck-03 ~]# lvcreate --type mirror -m1 --regionsize 4k -L16G -n hang vg
  Logical volume "hang" created.
[root@duck-03 ~]# lvs -o+regionsize
  LV   VG   Attr       LSize  Pool Origin Data%  Meta%  Move Log       Cpy%Sync Convert Region
  hang vg   mwi-a-m--- 16.00g                               hang_mlog  0.07             8.00k

Marking VERIFIED with:

2.6.32-554.el6.x86_64
lvm2-2.02.118-2.el6                        BUILT: Wed Apr 15 13:34:08 CEST 2015
lvm2-libs-2.02.118-2.el6                   BUILT: Wed Apr 15 13:34:08 CEST 2015
lvm2-cluster-2.02.118-2.el6                BUILT: Wed Apr 15 13:34:08 CEST 2015
udev-147-2.61.el6                          BUILT: Mon Mar  2 12:08:11 CET 2015
device-mapper-1.02.95-2.el6                BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-libs-1.02.95-2.el6           BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-event-1.02.95-2.el6          BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-event-libs-1.02.95-2.el6     BUILT: Wed Apr 15 13:34:08 CEST 2015
device-mapper-persistent-data-0.3.2-1.el6  BUILT: Fri Apr  4 15:43:06 CEST 2014
cmirror-2.02.118-2.el6                     BUILT: Wed Apr 15 13:34:08 CEST 2015

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1411.html
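The verification run shows pvmove raising the region size "from 0 to 2048 sectors" (1 MiB) for a 2 TiB volume. A hypothetical sketch of how such an adjustment can be derived by doubling the region size until one bitmap fits a per-bitmap budget; the 256 KiB budget here is an illustrative assumption, not the exact threshold used by the lvm2 patch:

```shell
# Double the mirror region size until the sync bitmap fits comfortably
# under the (undocumented) ~1 MB CPG payload limit.
lv_bytes=$(( 2 * 1024 * 1024 * 1024 * 1024 ))   # 2 TiB logical volume
region=$(( 512 * 1024 ))                        # start from the 512 KiB default
budget=$(( 256 * 1024 ))                        # per-bitmap budget (assumed)
while [ $(( lv_bytes / region / 8 )) -gt "$budget" ]; do
    region=$(( region * 2 ))
done
echo "region size: $region bytes ($(( region / 512 )) sectors)"
```

With these assumed numbers the loop settles on a 1 MiB region size, i.e. the same 2048 sectors reported in the pvmove output above.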