Bug 1182243
Field | Value
---|---
Summary | partition scan in losetup does not succeed when bound repeatedly
Product | Red Hat Enterprise Linux 7
Component | kernel
Version | 7.0
Hardware | x86_64
OS | Unspecified
Status | CLOSED ERRATA
Severity | medium
Priority | medium
Reporter | Tomas Dolezal <todoleza>
Assignee | Jarod Wilson <jarod>
QA Contact | xhe <xhe>
CC | jarod, kzak, yanwang
Target Milestone | rc
Fixed In Version | kernel-3.10.0-263.el7
Doc Type | Bug Fix
Type | Bug
Last Closed | 2015-11-19 21:09:05 UTC
Description
Tomas Dolezal
2015-01-14 17:08:00 UTC
This is not the first time I have seen such a report. losetup just calls the kernel ioctl with the LO_FLAGS_PARTSCAN flag; the rest is the kernel's business. Maybe the problem is already improved in more recent kernels (3.19), see http://www.spinics.net/lists/util-linux-ng/msg10301.html. Note that a possible workaround is to force the kernel to re-read the partition table (e.g. blockdev --rereadpt), but it would be nice to have a better solution. It seems the loopdev driver is a little bit fragile right now.

Has anyone tried with more recent upstream kernels, or with older RHEL (i.e. RHEL 6) kernels, to see how prevalent this is, or whether we have somewhere to look to extract fixes? Either way, I'll start digging...

I'm seeing different, but also incorrect, behavior on kernel-3.10.0-229.el7:

```
# losetup -fP isofile
# lsblk /dev/loop0
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0   7:0    0  1M   0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
```

The partitions didn't show up on the first invocation, but they did on the second, and they're still there on the third pass as well. After subsequent reboots, the partition table is read every time, even after several iterations of the three commands. Not sure where to go from here; it doesn't seem that I can reproduce the reported problem. Most recent pass, for the record (Beaker host ibm-x3250m5-01.rhts.eng.bos.redhat.com, DISTRO=RHEL-7.1-20150206.2, x86_64):

```
[root@ibm-x3250m5-01 ~]# ll /dev/loop*
crw-------. 1 root root 10, 237 Feb 13 14:52 /dev/loop-control
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:2    0    1M  0 loop
└─loop0p2 259:3    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:2    0    1M  0 loop
└─loop0p2 259:3    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; sleep 5; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; sleep 5; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# uname -r
3.10.0-229.el7.x86_64
```

I can still reproduce on kernel-3.10.0-229.el7.x86_64 on a virtual machine (tested on RHEL-7.1-20150206.2, a slightly modified OpenStack image). When reproducing on a physical machine, the 5th mount in a row reproduced the issue. (RHEL-7.1-20150206.2 Server x86_64 / ibm-hs22-01.rhts.eng.brq.redhat.com)

I've tried experimenting with the sleep command:

```
# while true; do losetup -D; sleep 1; losetup -fP isofile; sleep 1; lsblk /dev/loop0; done
```

This succeeded most of the time when run on a physical machine (I remember a single failure after >10 iterations). With no sleeps it failed more often, though not nearly as often as on a virtual machine. On a VM running under a (busy) hypervisor, even two one-second sleeps are not enough for the probing to succeed. On that VM it helped if I unloaded the loop module and walked through the scenario again manually, waiting >5 s between steps. (BaseOS machine sheep-65.lab.eng.brq.redhat.com) I definitely suggest testing on a VM.

I'll see what I can do in the way of some testing with a VM this afternoon. It smells like some sort of race if it's easier to reproduce on a resource-constrained VM with a busy hypervisor.

I'm able to reproduce the problem every once in a while on a VM on a mostly idle system, particularly if I force some disk activity at the same time (such as booting another VM while running the losetup commands repeatedly in a loop). Another oddity I see sometimes is loop0p1 and loop0p2 getting minors of 2 and 3 respectively, instead of 0 and 1.

Upstream (well, 3.18.7 in Fedora) is even worse; I don't seem to ever get a partition table there. Here I was, hoping things might be improved, and they're actually worse... Haven't looked at RHEL6 yet.
Definitely some research to do here -- upstream's loop code has been ported from bio to blk-mq, which may or may not be related...

(In reply to Jarod Wilson from comment #11)
> Upstream (well, 3.18.7 in Fedora) is even worse, I don't seem to ever get a
> partition table there. Here I was, hoping things might be improved, and
> they're actually worse... Haven't looked at RHEL6 yet. Definitely some
> research to do here -- upstream's loop code has been ported from bio to
> blk-mq, which may or may not be related...

There may be something else going on here. I just rebooted the guest, and now the lsblk command is always showing partitions. Still poking. This is slightly maddening.

Okay, some more digging, and I can now make lsblk show partitions 100% of the time, even while I've got a dd going in a second guest, which otherwise tends to give me at least a 20% failure rate. What I've done is modify lsblk to issue a BLKRRPART (re-read block device partition table) ioctl when it encounters a device with 0 partitions. This is more or less what fdisk (and sfdisk and cfdisk) do, though I think they don't even bother restricting the ioctl to the zero-partitions case; they just always issue it if BLKRRPART is supported. However... there's a big caveat here: this seems like papering over the actual bug. It appears that whatever populates the sysfs tree with partition information (udev? systemd?) is failing, probably because it is triggered before the device has been fully set up, which we might be able to combat with some code changes on the kernel side in the loop driver.

Well, I had suspicions about kobject_uevent calls being in the wrong places, then I was questioning whether the LO_FLAGS_PARTSCAN flag set by losetup's -P option was being propagated to the kernel. I'm now seeing that it definitely is, and a partition table re-read *is* being triggered, so I'm still not sure where things are falling down.

I'm finally reasonably well convinced that the kernel loop code is doing everything it possibly can to get this right, including triggering a partition rescan at setup time, as requested by losetup's -P flag. I do have the following patch that makes lsblk 100% reliable, just like fdisk, employing the same trick of issuing a partition re-read ioctl:

```
--- misc-utils/lsblk.c	2015-02-27 15:42:26.825565650 -0500
+++ misc-utils/lsblk.c.new	2015-02-27 15:42:08.838594243 -0500
@@ -1123,6 +1123,13 @@ static int set_cxt(struct blkdev_cxt *cx
 	}
 
 	cxt->npartitions = sysfs_count_partitions(&cxt->sysfs, name);
+	if (cxt->npartitions == 0) {
+		fd = open(cxt->filename, O_RDONLY);
+		ioctl(fd, BLKRRPART);
+		close(fd);
+		cxt->npartitions = sysfs_count_partitions(&cxt->sysfs, name);
+	}
+
 	cxt->nholders = sysfs_count_dirents(&cxt->sysfs, "holders");
 	cxt->nslaves = sysfs_count_dirents(&cxt->sysfs, "slaves");
```

This would implement the workaround Karel mentions in comment #2.

Things are definitely not better with a newer kernel, though; I'm able to reproduce the problem reliably on 3.18.7. (The link was for 3.14.19, not 3.19, though I can give 3.19 or 4.0-rc1 a spin as well; I expect more of the same.) I'm not sure if it's the kernel's partition-scanning code failing to find partitions, or a timing issue with udev/systemd simply not noticing the quick remove/re-add sequence. Might have to figure out how to enable some udev/systemd debugging to see what's going on at that level when our partitions fail to show up. Might have to trace some code down in fs/block_dev.c now...
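For reference, the same re-read trick is easy to exercise outside lsblk. The following is a minimal standalone sketch (not from any posted patch in this bug; the /dev/loop0 path is an assumption) that forces a partition-table re-read via BLKRRPART, the same ioctl the patch above and `blockdev --rereadpt` use:

```c
/* Sketch only: force a partition-table re-read on a block device.
 * Requires CAP_SYS_ADMIN; the device path is an assumption. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKRRPART */

int main(void)
{
    int fd = open("/dev/loop0", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* An EBUSY failure here corresponds to the blkdev_reread_part()
     * bail-out discussed in the analysis below. */
    if (ioctl(fd, BLKRRPART) < 0)
        perror("BLKRRPART");
    close(fd);
    return 0;
}
```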
From drivers/block/loop.c's loop_set_status() call to ioctl_by_bdev(), we get to fs/block_dev.c, and ioctl_by_bdev() calls blkdev_ioctl() in block/ioctl.c. For BLKRRPART, that takes us over to block/ioctl.c's blkdev_reread_part(). There are several checks there, one of which may well be dumping us out of blkdev_reread_part() before the call to rescan_partitions(). Now I'm thinking we're likely hitting the -EBUSY return from rescan_partitions(), since another caller of ioctl_by_bdev(), drivers/s390/block/dasd_genhd.c's dasd_scan_partitions(), calls it in a while loop with a few retries while rc == -EBUSY. If I can confirm that's what we're actually hitting, I'll pretty much duplicate the dasd_genhd code into the loop driver and test that out.

This is looking like a winner. I've confirmed that when the loop device fails to show any partitions, it's because the mutex_trylock() in blkdev_reread_part() fails and we get returned an -EBUSY that the loop code currently ignores. I'll whip up a patch for the loop driver as soon as I can.

Well, unfortunately, the retry loop ported over from dasd_genhd didn't help any. Even with increased delays and more retries, if the first call hits -EBUSY, so do all subsequent tries. I'm still clueless as to what is holding the lock, though there's a nasty comment in loop_clr_fd() that describes a somewhat similar situation:

```
/*
 * If we've explicitly asked to tear down the loop device,
 * and it has an elevated reference count, set it for auto-teardown when
 * the last reference goes away. This stops $!~#$@ udev from
 * preventing teardown because it decided that it needs to run blkid on
 * the loopback device whenever they appear. xfstests is notorious for
 * failing tests because blkid via udev races with a losetup
 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
 * command to fail with EBUSY.
 */
```

Our issue is on the create side though, not the destroy side, so this may be a red herring. I guess I'll have to go figure out lock debugging...

So kernel-debug isn't turning up anything new, but given that a clean boot and an initial insmod of loop.ko (with some debugging spew added) show this busy behavior, I'm inclined to believe that various parts of losetup are racing against udev, and maybe we can do something about it by adding some additional locking to the loop driver. Basically, I think udev is screwing things over, and it's more prevalent on slow storage like the file-backed disk of a VM. Still digging, determined to get to the bottom of this... :)

So, as illustrated by the comments in an older loop.c commit:

```
commit 5370019dc2d2c2ff90e95d181468071362934f3a
Author: Guo Chao <yan.ibm.com>
Date:   Thu Feb 21 15:16:45 2013 -0800

    loopdev: fix a deadlock

    bd_mutex and lo_ctl_mutex can be held in different order.

    Path #1:

    blkdev_open
     blkdev_get
      __blkdev_get (hold bd_mutex)
       lo_open (hold lo_ctl_mutex)

    Path #2:

    blkdev_ioctl
     lo_ioctl (hold lo_ctl_mutex)
      lo_set_capacity (hold bd_mutex)
    ...
```

The problem we have here is that device setup and partition scanning aren't an atomic operation; they're done by two ioctls that losetup calls, first LOOP_SET_FD, then LOOP_SET_STATUS64. In between the two of them, udev makes a call to blkdev_open(), which grabs bd_mutex and then calls lo_open(), which grabs lo_ctl_mutex. The LOOP_SET_STATUS64 call grabs lo_ctl_mutex first, then tries to scan partitions, which requires grabbing bd_mutex. It's a classic AB-BA deadlock scenario if we try to force the issue and take bd_mutex ourselves.
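To make that race window concrete, here is a userspace sketch of the two-ioctl sequence just described, roughly what `losetup -fP` does. This is illustrative, not losetup's actual source; the backing-file name "isofile" and the abbreviated error handling are assumptions. udev can open the new device node at any point between the two ioctls:

```c
/* Sketch of the non-atomic two-ioctl loop device setup. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
    int ctl = open("/dev/loop-control", O_RDWR);
    int nr = ioctl(ctl, LOOP_CTL_GET_FREE);   /* first free device, e.g. 0 */

    char dev[32];
    snprintf(dev, sizeof(dev), "/dev/loop%d", nr);
    int lo = open(dev, O_RDWR);
    int backing = open("isofile", O_RDWR);

    /* ioctl #1: bind the backing file. udev sees the resulting device
     * change event and may open the device right after this returns. */
    ioctl(lo, LOOP_SET_FD, backing);

    /* ioctl #2: request a partition scan. This takes lo_ctl_mutex and
     * then needs bd_mutex for the rescan; if udev already holds
     * bd_mutex via blkdev_open(), the rescan bails with -EBUSY, which
     * the loop driver at this point silently ignored. */
    struct loop_info64 info;
    memset(&info, 0, sizeof(info));
    info.lo_flags = LO_FLAGS_PARTSCAN;
    if (ioctl(lo, LOOP_SET_STATUS64, &info) < 0)
        perror("LOOP_SET_STATUS64");

    close(backing);
    close(lo);
    close(ctl);
    return 0;
}
```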
The situation does improve some if we temporarily give up lo_ctl_mutex; I can show several cases where this results in success where we'd otherwise have hit failure, but sadly, we still fail sometimes regardless.

Don't know of any reason why this bug can't be public, so making it so.

Patch submitted upstream that helps a fair bit with this: https://lkml.org/lkml/2015/3/31/888

I've been referred back to a post from a few months ago trying to address the same problem: https://lkml.org/lkml/2015/1/26/137 Reading over that and some replies to my own posting, will work on an updated patch...

What should be pretty close to a final set can be found here: https://lkml.org/lkml/2015/4/8/502 Patch 7 just needs a minor tweak, and I think this should get into a tree somewhere, at which point it can be backported to RHEL7.

I've got a VM running a kernel with the upstream patchset, and this scriptlet...

```
for i in `seq 1 1000`
do
	losetup -fP isofile
	sleep .1
	lsblk /dev/loop0
	losetup -D
	sleep .1
done
```

...didn't fail a single time listing partitions. Without the sleep .1 between losetup -fP isofile and lsblk, I'm occasionally racing udev, and roughly 5-10 times out of 1000, lsblk will show only one or none of the partitions, but a subsequent call gives proper results, and in-kernel I can see all the partitions being detected properly. There really isn't much that can be done about that; we just need to allow some time for the device nodes to be set up before we try to list them.

Still waiting for this to get merged into an upstream tree, but the block maintainer, Jens Axboe, has said that he'll prep it for inclusion in 4.2.

Patches are staged for upstream 4.2 in the block maintainer's tree: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/log/?h=for-4.2/drivers

Backporting to RHEL7; will post for internal review after build and sanity-testing.

Patch(es) available on kernel-3.10.0-263.el7

Using the server Tomas provided in #c35, I reproduced this bug on the -229 kernel and verified it on the -320 kernel. Moving to VERIFIED.

********** Reproduced on kernel 3.10.0-229.el7.x86_64 **********

My steps:

```
1. Reboot the system
2. # losetup -fP isofile
3. # lsblk
NAME                                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                                 252:0    0   80G  0 disk
├─vda1                              252:1    0  500M  0 part /boot
└─vda2                              252:2    0 79.5G  0 part
  ├─rhel_cloud--qe--16--vm--06-swap 253:0    0    2G  0 lvm  [SWAP]
  ├─rhel_cloud--qe--16--vm--06-root 253:1    0   50G  0 lvm  /
  └─rhel_cloud--qe--16--vm--06-home 253:2    0 27.5G  0 lvm  /home
loop0                                 7:0    0    1M  0 loop
├─loop0p1                           259:0    0    1M  0 loop
└─loop0p2                           259:1    0  938K  0 loop
4. # losetup -D
5. # losetup -fP isofile
6. # lsblk
NAME                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                     252:0    0   70G  0 disk
├─vda1                  252:1    0  500M  0 part /boot
└─vda2                  252:2    0 69.5G  0 part
  ├─rhel_sheep--25-swap 253:0    0    3G  0 lvm  [SWAP]
  ├─rhel_sheep--25-root 253:1    0 44.7G  0 lvm  /
  └─rhel_sheep--25-home 253:2    0 21.8G  0 lvm  /home
loop0                     7:0    0    1M  0 loop
^^^^^^^^^^^^^^^^^^^ NOTE: loop0p1/2 are not detected
```

The loop0p1/2 partitions are detected the first time only; they do not show up in subsequent iterations.
********** Verified on kernel 3.10.0-320.el7.x86_64 **********

```
[root@sheep-25 ~]# uname -r
3.10.0-320.el7.x86_64
[root@sheep-25 ~]# bash test.sh
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html