Bug 181574
Summary: | device-mapper mirror removal stuck on kcopyd_client_destroy (pvmove hangs) | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Corey Marthaler <cmarthal> |
Component: | kernel | Assignee: | Alasdair Kergon <agk> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | medium | Docs Contact: | |
Priority: | high | ||
Version: | 4.0 | CC: | jbaron, jbrassow, kanderso, mauelshagen-do-not-use, pjones, rkenna |
Target Milestone: | --- | Keywords: | Regression |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHSA-2006-0132 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2006-03-07 21:23:38 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 168430 |
Description
Corey Marthaler
2006-02-14 22:40:27 UTC
With problems where a device-mapper operation is hanging in the kernel, I need diagnostics similar to those requested on bug 179786, such as sysrq t for every process on the machine, and dmsetup info -c etc. This has nothing to do with mirror creation. The mirror gets created just fine, from there though, any LVM cmd on that mirror, it's VG, or PVs, deadlock. [root@link-08 lvm]# ps -ef | grep vg root 7152 3709 0 09:49 pts/1 00:00:00 qarsh root link-08 vgchange -an root 7154 7153 0 09:49 ? 00:00:00 vgchange -an root 7429 7232 0 10:03 pts/2 00:00:00 grep vg [root@link-08 lvm]# lsof -p 7154 COMMAND PID USER FD TYPE DEVICE SIZE NODE NAME lvm 7154 root cwd DIR 253,0 4096 2 / lvm 7154 root rtd DIR 253,0 4096 2 / lvm 7154 root txt REG 253,0 497688 4176867 /usr/sbin/lvm lvm 7154 root mem REG 253,0 48524848 81922 /usr/lib/locale/locale-archive lvm 7154 root mem REG 253,0 105080 1409026 /lib64/ld-2.3.4.so lvm 7154 root mem REG 253,0 1493186 1409049 /lib64/tls/libc-2.3.4.so lvm 7154 root mem REG 253,0 17943 1409051 /lib64/libdl-2.3.4.so lvm 7154 root mem REG 253,0 64904 1409075 /lib64/libdevmapper.so.1.02 lvm 7154 root mem REG 253,0 229824 4172535 /usr/lib64/libreadline.so.4.3 lvm 7154 root mem REG 253,0 62504 1409260 /lib64/libselinux.so.1 lvm 7154 root mem REG 253,0 1018858 4167671 /usr/lib64/libncurses.so.5.4 lvm 7154 root 0u IPv4 26789 TCP link-08.lab.msp.redhat.com:34666->link-08.lab.msp.redhat.com:5010 (ESTABLISHED) lvm 7154 root 1u IPv4 26791 TCP link-08.lab.msp.redhat.com:34667->link-08.lab.msp.redhat.com:5011 (ESTABLISHED) lvm 7154 root 2u IPv4 26793 TCP link-08.lab.msp.redhat.com:34668->link-08.lab.msp.redhat.com:5012 (ESTABLISHED) lvm 7154 root 3uR REG 253,0 0 722964 /var/lock/lvm/V_mirror_6_9215 lvm 7154 root 4r BLK 8,1 22662 /dev/sda1 lvm 7154 root 5r BLK 8,2 22681 /dev/sda2 lvm 7154 root 6r BLK 8,17 22798 /dev/sdb1 lvm 7154 root 7r BLK 8,18 22811 /dev/sdb2 lvm 7154 root 8r BLK 8,33 22923 /dev/sdc1 lvm 7154 root 9r BLK 8,34 22933 /dev/sdc2 lvm 7154 root 10r BLK 8,49 23046 /dev/sdd1 lvm 7154 root 11r BLK 8,50 23064 /dev/sdd2 lvm 7154 root 12r BLK 8,65 23185 /dev/sde1 lvm 7154 root 13r BLK 8,66 23180 /dev/sde2 lvm 7154 root 14r BLK 8,81 23308 /dev/sdf1 lvm 7154 root 15r BLK 8,82 23327 /dev/sdf2 lvm 7154 root 16u CHR 10,63 1258 /dev/mapper/control [root@link-08 bin]# dmsetup info -c --noopencount [HANG] Feb 15 10:16:39 link-08 kernel: lvm D 000001003c8075a8 0 7154 7153 (NOTLB) Feb 15 10:16:39 link-08 kernel: 000001001bbb9cb8 0000000000000002 000001001f4d38e0 ffffffff0000006a Feb 15 10:16:39 link-08 kernel: 000001001ab097f0 000000000000006a 0000010001709fc0 000000001bbb9c20 Feb 15 10:16:39 link-08 kernel: 000001001c9f4030 0000000000007f85 Feb 15 10:16:39 link-08 kernel: Call Trace:<ffffffffa00ad764>{:dm_mod:dev_remove+0} <ffffffffa00b0276>{:dm_mod:kcopyd_clie nt_destroy+118} Feb 15 10:16:39 link-08 kernel: <ffffffff80134dce>{autoremove_wake_function+0} <ffffffff80134dce>{autoremove_wake_f unction+0} Feb 15 10:16:39 link-08 kernel: <ffffffffa00f35cc>{:dm_mirror:mirror_dtr+101} <ffffffffa00ab7d9>{:dm_mod:table_dest roy+72} Feb 15 10:16:39 link-08 kernel: <ffffffffa00ad02e>{:dm_mod:__hash_remove+124} <ffffffffa00ad7fd>{:dm_mod:dev_remove +153} Feb 15 10:16:39 link-08 kernel: <ffffffffa00aeae6>{:dm_mod:ctl_ioctl+599} <ffffffff80189241>{sys_ioctl+853} Feb 15 10:16:39 link-08 kernel: <ffffffff801101c6>{system_call+126} Feb 15 10:16:39 link-08 kernel: sshd S 0000000000000008 0 7230 2622 7232 7437 3638 (NOTLB) Feb 15 10:16:39 link-08 kernel: 00000100177e3d78 0000000000000006 0000000000003fd0 00000100177e3d80 Feb 15 10:16:39 link-08 kernel: 000000d000000246 0000000000000246 0000000000000256 000000000000c780 Feb 15 10:16:39 link-08 kernel: 000001001ab097f0 0000000000000b91 Feb 15 10:16:39 link-08 kernel: Call Trace:<ffffffff80223e81>{tty_ldisc_try+60} <ffffffff80305461>{schedule_timeout+101} Feb 15 10:16:39 link-08 kernel: <ffffffff80223fb4>{tty_ldisc_deref+103} <ffffffff80189c3b>{do_select+939} Feb 15 10:16:39 link-08 kernel: <ffffffff801897d5>{__pollwait+0} <ffffffff80189fba>{sys_select+820} Feb 15 10:16:39 link-08 kernel: <ffffffff8019154c>{dnotify_parent+34} <ffffffff801101c6>{system_call+126} Feb 15 10:16:39 link-08 kernel: Feb 15 10:16:39 link-08 kernel: bash S 0000007fbffff7b4 0 7232 7230 7433 (NOTLB) Feb 15 10:16:39 link-08 kernel: 000001001ac1beb8 0000000000000002 0000000100000246 0000000000000075 Feb 15 10:16:39 link-08 kernel: 000001001f27b030 0000000000000075 0000010020713cc0 0000000100000000 Feb 15 10:16:39 link-08 kernel: 000001001b2547f0 0000000000005d23 Feb 15 10:16:39 link-08 kernel: Call Trace:<ffffffff8013af23>{do_wait+3298} <ffffffff801333a4>{default_wake_function+0} Feb 15 10:16:39 link-08 kernel: <ffffffff801333a4>{default_wake_function+0} <ffffffff801101c6>{system_call+126} Feb 15 10:16:39 link-08 kernel: Feb 15 10:16:39 link-08 kernel: dmsetup D 0000000000000000 0 7433 7232 (NOTLB) Feb 15 10:16:39 link-08 kernel: 000001003b653d08 0000000000000006 0000000000000000 000001003a4af608 Feb 15 10:16:39 link-08 kernel: 0000000100000001 0000000000000000 0000002a983bf000 000000011f27b030 Feb 15 10:16:39 link-08 kernel: 000001001f27b030 00000000001046fc Feb 15 10:16:39 link-08 kernel: Call Trace:<ffffffff801a8ca7>{proc_root_lookup+49} <ffffffff8030562b>{__down_write+134} Feb 15 10:16:39 link-08 kernel: <ffffffffa00ad150>{:dm_mod:list_devices+63} <ffffffff8016e9d4>{__vmalloc+216} Feb 15 10:16:40 link-08 kernel: <ffffffffa00ad111>{:dm_mod:list_devices+0} <ffffffffa00aeae6>{:dm_mod:ctl_ioctl+599 } Feb 15 10:16:40 link-08 kernel: <ffffffff80189241>{sys_ioctl+853} <ffffffff801101c6>{system_call+126} Feb 15 10:16:40 link-08 kernel: Feb 15 10:16:40 link-08 kernel: dmsetup D 0000000000000000 0 7434 3595 (NOTLB) Feb 15 10:16:40 link-08 kernel: 000001001befdd08 0000000000000002 0000000000000000 00000100177fa608 Feb 15 10:16:40 link-08 kernel: 0000000100000001 0000000000000000 0000002a983bf000 000000003a72a7f0 Feb 15 10:16:40 link-08 kernel: 000001003a72a7f0 0000000000150d32 Feb 15 10:16:40 link-08 kernel: Call Trace:<ffffffff801a8ca7>{proc_root_lookup+49} <ffffffff8030562b>{__down_write+134} Feb 15 10:16:40 link-08 kernel: <ffffffffa00ad150>{:dm_mod:list_devices+63} <ffffffff8016e9d4>{__vmalloc+216} Feb 15 10:16:40 link-08 kernel: <ffffffffa00ad111>{:dm_mod:list_devices+0} <ffffffffa00aeae6>{:dm_mod:ctl_ioctl+599 } Feb 15 10:16:40 link-08 kernel: <ffffffff80189241>{sys_ioctl+853} <ffffffff801101c6>{system_call+126} Feb 15 10:16:40 link-08 kernel: These are the rpms I'm using: [root@tng1-1 ~]# rpm -q device-mapper device-mapper-1.02.03-1.0.RHEL4 [root@tng1-1 ~]# rpm -q lvm2 lvm2-2.02.02-1.0.RHEL4 [root@tng1-1 ~]# They seem to work fine. They're somewhat hard to find... http://porkchop.devel.redhat.com/dist/4E-qu-candidate/device-mapper/1.02.03-1.0.RHEL4/SRPMS/device-mapper-1.02.03-1.0.RHEL4.src.rpm http://porkchop.devel.redhat.com/dist/4E-cluster-qu-candidate/lvm2/2.02.02-1.0.RHEL4/SRPMS/lvm2-2.02.02-1.0.RHEL4.src.rpm on x86_64 with the lastest rpms, it's the same problem. ioctl(8, DM_DEV_REMOVE [root@link-08 ~]# rpm -q lvm2 lvm2-2.02.02-1.0.RHEL4 [root@link-08 ~]# rpm -q device-mapper device-mapper-1.02.03-1.0.RHEL4 yup, just confirmed this bug. I was using an earlier kernel (with patches) and the latest rpms, and it works fine. Switching to the new kernel causes lockups. I'll see if I can figure out what has changed. There was a snapshot fix that involved tracking I/O's that go through kcopyd and making sure all were completed before returning from kcopyd_client_destroy. This patch was introduced somewhere between 28.EL and 32.EL. the mirror_dtr is hanging on this. Specifically, after 400 kcopyd_copy requests, and trying to remove the mirror, I get: Feb 17 15:46:58 tng1-1 kernel: Entering kcopyd_client_destroy -- nr_jobs == -400 Obviously, there is a problem incrementing the counter. I'm looking for this. it's the extra atomic_dec in segment_complete. Either that has to be removed, or it needs to be balanced with an atomic_inc in split_job. switching severity to regression, because 'pvmove' will never work with this bug in place. patch posted to rhkernel Does setting mirror_region_size = 32 in the activation section of lvm.conf workaround the problem? Do you also see the problem with snapshots if you set a chunksize larger than 32KB? Alasdair, because unloading dmraid mappings online is rare, this should be a minor issue. FYI on comment #13: should theoretically be 64KB, because SUB_JOB_SIZE is defined as 128 sectors of 512 bytes in kcopyd.c But the test is < not <=. Does setting mirror_region_size = 32 in the activation section of lvm.conf workaround the problem? It does appear to be a work around in this case. Do you also see the problem with snapshots if you set a chunksize larger than 32KB? Nope, snaps appear to work fine. Corey, are you saying, you can delete snapshots with > 64KB chunk_size ok *after* you wrote to either originor snapshot ? Would be astonishing... No one said anything about attempting to write anything earlier. :) Once I wrote to the snap and the origin, this issue is present when attempting to remove. I was afraid so. Kernel respin scheduled fo tonight anyway AFAIK. Can you run a recursion with the new kernel tomorrow, please ? BTW: there's another minor issue with the kcopyd_client_destroy() functionality, which *doesn't* cause a regression. In case a client request a multi-segment copy, multiple wake_up() calls will be carried out redundantly in run_complete_job() rather than just one. Minor overhead not causing a functional problem. fix verified. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html |