Bug 979395 - mirror: lvconvert - udev timeout leaving cookie behind
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: lvm2
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Assigned To: Peter Rajnoha
QA Contact: Cluster QE
Reported: 2013-06-28 08:23 EDT by Marian Csontos
Modified: 2013-11-26 04:59 EST

Doc Type: Bug Fix
Last Closed: 2013-11-26 04:59:54 EST
Type: Bug

Description Marian Csontos 2013-06-28 08:23:29 EDT
Description of problem:
Converting a mirror with lvconvert (same number of legs, only changing the mirror log from core to disk) results in a udev worker timeout:

lvcreate:

    02:49:08 lvcreate -m 1 -n mirror_log_convert --mirrorlog core -l 7506 mirror_sanity /dev/vdb3 /dev/vdc1 /dev/vdc2 /dev/vdc3

    02:49:08 kernel: [  124.894615] bio: create slab <bio-2> at 2
    02:49:08 lvm[2731]: Monitoring mirror device mirror_sanity-mirror_log_convert for events.

This exits immediately. After 5 s, lvconvert is called:

    02:49:13 lvconvert -m 1 --mirrorlog disk /dev/mirror_sanity/mirror_log_convert

    02:49:13 lvm[2731]: No longer monitoring mirror device mirror_sanity-mirror_log_convert for events.
    02:49:13 lvm[2731]: Monitoring mirror device mirror_sanity-mirror_log_convert for events.
    02:49:43 systemd-udevd[287]: worker [3323] /devices/virtual/block/dm-4 timeout; kill it
    02:49:43 systemd-udevd[287]: seq 2392 '/devices/virtual/block/dm-4' killed
    02:49:55 systemd-udevd[287]: worker [3323] terminated by signal 9 (Killed)
    03:02:03 systemd[1]: Starting Cleanup of Temporary Directories...
    03:02:03 systemd[1]: Started Cleanup of Temporary Directories.
    03:04:50 lvm[2731]: mirror_sanity-mirror_log_convert is now in-sync.
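
For reference, a rough way to watch the udev side while reproducing this (a sketch only, not part of the captured logs):

    # in a second terminal, watch kernel and udev events while lvconvert runs
    udevadm monitor --kernel --udev
    # check whether the udev event queue ever drains within a bounded time
    udevadm settle --timeout=10; echo "settle exit code: $?"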


Version-Release number of selected component (if applicable):
RHEL7 nightly build 2.02.99-0.66, git commit 9233e6d0bb5b3b21cc67fd0baae3fb4cb099bdcb

How reproducible:
rarely

Steps to Reproduce:
1. lvcreate -m 1 -n mirror_log_convert --mirrorlog core -l 7506 mirror_sanity /dev/vdb3 /dev/vdc1 /dev/vdc2 /dev/vdc3
2. lvconvert -m 1 --mirrorlog disk /dev/mirror_sanity/mirror_log_convert

Actual results:
udev worker timeout leaving a cookie behind

Expected results:
pass

Additional info:

I am not sure whether this is really an LVM problem, or whether udev is simply not patient enough, killing the worker after approximately 30 s.
Similar to https://bugzilla.redhat.com/show_bug.cgi?id=956387.
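
As a rough sketch (not something taken from this test run), a leftover synchronization cookie can be inspected and released from the command line, assuming the dmsetup udev helper subcommands are available in this build:

    # list udev synchronization cookies (semaphores) that are still outstanding
    dmsetup udevcookies
    # wake up whatever is waiting on one specific stale cookie
    dmsetup udevcomplete <cookie>
    # or, with care, complete all outstanding cookies at once
    dmsetup udevcomplete_all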

    #0  0x00007f3c530f88b7 in semop () from /lib64/libc.so.6
    No symbol table info available.
    #1  0x00007f3c5382ece0 in _udev_wait (cookie=223156231) at libdm-common.c:2433
            semid = 983040
            sb = {sem_num = 0, sem_op = 0, sem_flg = 0}
    #2  0x00007f3c5382edf6 in dm_udev_wait (cookie=223156231) at libdm-common.c:2451
            r = 0
    #3  0x00000000004d59ba in fs_unlock () at activate/fs.c:491                                                                                                                                                                       
    No locals.
    #4  0x00000000004dcde2 in _file_lock_resource (cmd=0x1706040, resource=0x7fff9c781c00 "#sync_names", flags=256, lv=0x0)
        at locking/file_locking.c:270
            lockfile = "\000+obviously a lot of junk..."
            origin_only = 0
            revert = 0
    #5  0x000000000047cd44 in _lock_vol (cmd=0x1706040, resource=0x7fff9c781c00 "#sync_names", flags=256, lv_op=LV_NOOP, lv=0x0)
        at locking/locking.c:382
            lck_type = 0
            lck_scope = 0
            ret = 0
            __PRETTY_FUNCTION__ = "_lock_vol"
    #6  0x000000000047d066 in lock_vol (cmd=0x1706040, vol=0x4fcecc "#sync_names", flags=256, lv=0x0) at locking/locking.c:460
            resource = "#sync_names", '\000' <repeats 246 times>
            lv_op = LV_NOOP
            lck_type = 0
    #7  0x000000000047d9f3 in sync_dev_names (cmd=0x1706040) at locking/locking.c:636
    No locals.
    #8  0x00000000004228e5 in lvconvert_single (cmd=0x1706040, lp=0x7fff9c781da0) at lvconvert.c:2457
            lv = 0x1768d00
            ret = 1
            saved_ignore_suspended_devices = 0
    #9  0x0000000000422c3c in lvconvert (cmd=0x1706040, argc=1, argv=0x7fff9c782060) at lvconvert.c:2536
            lp = {force = 0, snapshot = 0, merge = 0, merge_mirror = 0, thin = 0, yes = 0, zero = 0, origin = 0x0,
              lv_name = 0x7fff9c783ede "mirror_log_convert", lv_split_name = 0x0,
              lv_name_full = 0x7fff9c783ecb "/dev/mirror_sanity/mirror_log_convert", vg_name = 0x175b182 "mirror_sanity",
              wait_completion = 1, need_polling = 0, chunk_size = 0, region_size = 1024, mirrors = 2, mirrors_sign = SIGN_NONE,
              keep_mimages = 0, stripes = 1, stripe_size = 0, read_ahead = 0, segtype = 0x173b340, target_attr = 0,
              alloc = ALLOC_INHERIT, pv_count = 0, pvs = 0x7fff9c782068, pvh = 0x1767cf8, replace_pv_count = 0, replace_pvs = 0x0,
              replace_pvh = 0x0, lv_to_poll = 0x1768d00, passed_args = 0, poolmetadata_size = 0, origin_lv_name = 0x0,
              pool_data_lv_name = 0x0, pool_metadata_lv_name = 0x0, discards = THIN_DISCARDS_IGNORE}
    #10 0x000000000042c45b in lvm_run_command (cmd=0x1706040, argc=1, argv=0x7fff9c782060) at lvmcmdline.c:1131
            ret = 0
            locking_type = -1
            monitoring = 1
            old_cft = 0x42d578 <_nonroot_warning+18>
    #11 0x000000000042d948 in lvm2_main (argc=6, argv=0x7fff9c782038) at lvmcmdline.c:1567
            base = 0x7fff9c783eab "lvconvert"
            ret = 5122829
            alias = 1
            cmd = 0x1706040
    #12 0x00000000004491fc in main (argc=6, argv=0x7fff9c782038) at lvm.c:21
    No locals.
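
The semid in frame #1 is the System V semaphore backing the udev notification cookie; as a sketch, it should remain visible on the system for as long as lvconvert stays blocked:

    # list System V semaphores; the semaphore set backing the stale cookie shows up here
    ipcs -s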
Comment 3 Peter Rajnoha 2013-08-28 04:42:55 EDT
(In reply to Marian Csontos from comment #0)
> Additional info:
> 
> I am not sure this is really an LVM problem or udev is not patient enough
> killing the worker after approximately 30s.
> Similar to https://bugzilla.redhat.com/show_bug.cgi?id=956387.

If it's after 30 s, then yes, it's the udev timeout (which was decreased in recent versions of udev from 3 minutes to 30 seconds).

Marian, you can try to change the timeout to a higher value with:
  udevadm control --timeout=<some value higher than 30s>

You can also check whether udevadm control --children-max=1 helps in any way (I'm not sure at the moment whether the patch for udevd that decreased the default children-max has landed in RHEL7 yet; I need to check...).
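
As a quick sketch of the two knobs mentioned above (the values are arbitrary examples, not recommendations):

    # raise the udev worker timeout, e.g. back to 3 minutes
    udevadm control --timeout=180
    # limit the number of concurrent udev worker processes
    udevadm control --children-max=1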

If this proves to be the timeout, then we can choose among these options:

  A. revert back to the 3 min timeout instead of the 30 s one
  B. investigate where most of the time is spent and try to optimize (not quite easy)
  C. submit an RFE for udev to use some heuristics when the system is under load and increase the timeouts dynamically
  D. tell people to set the timeout and children-max to values suitable for LVM setups with many devices

I see D as the one to avoid (I mentioned it just for completeness). Option C would be nice (but I don't think udev will change that). We should do B, but I'm not sure whether we can speed it up any more. Option B is the most realistic of all of these (but let's expect some flame from udev).
Comment 4 Peter Rajnoha 2013-09-11 09:38:42 EDT
(In reply to Peter Rajnoha from comment #3)
> If this proves to be the timeout, then we can choose among these options:
> 
>   A. revert back to the 3min timeout instead of the 30s one

Well, we can actually change the timeout directly in the udev rules. I have reverted it back to 3 min for LVM devices. Marian, could you check whether you can still hit the reported issue?

  https://git.fedorahosted.org/cgit/lvm2.git/commit/?id=72a9d4f879751d58e531efbffaef1f7d325092c7
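
To check whether the longer timeout is actually in effect on an installed build, something along these lines should work (the rule directories and the option name are assumptions based on the commit, sketch only):

    # look for the per-device event timeout in the installed udev rules
    grep -r event_timeout /usr/lib/udev/rules.d/ /etc/udev/rules.d/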
Comment 5 Marian Csontos 2013-11-26 04:58:37 EST
I have not seen the issue recently, since the timeout value was increased (and I have sped up the tests significantly too).
Comment 6 Peter Rajnoha 2013-11-26 04:59:54 EST
OK, I'll close it then - if you see it again, please reopen.
