Bug 1182243
Field | Value
---|---
Summary | partition scan in losetup does not succeed when bound repeatedly
Product | Red Hat Enterprise Linux 7
Component | kernel
Version | 7.0
Hardware | x86_64
OS | Unspecified
Status | CLOSED ERRATA
Severity | medium
Priority | medium
Reporter | Tomas Dolezal <todoleza>
Assignee | Jarod Wilson <jarod>
QA Contact | xhe <xhe>
CC | jarod, kzak, yanwang
Target Milestone | rc
Fixed In Version | kernel-3.10.0-263.el7
Doc Type | Bug Fix
Type | Bug
Last Closed | 2015-11-19 21:09:05 UTC
Description
Tomas Dolezal
2015-01-14 17:08:00 UTC
This is not the first time I have seen such a report. losetup just calls the kernel ioctl with the LO_FLAGS_PARTSCAN flag; the rest is the kernel's business. Maybe the problem is already improved in more recent kernels (3.19), see http://www.spinics.net/lists/util-linux-ng/msg10301.html. Note that a possible workaround is to force the kernel to re-read the partition table (e.g. blockdev --rereadpt), but it would be nice to have a better solution. It seems the loopdev driver is a little bit fragile right now.

Has anyone tried with more recent upstream kernels, or with older RHEL (i.e. RHEL 6) kernels, to see how prevalent this is, or whether we have somewhere to look to extract fixes? Either way, I'll start digging...

I'm seeing different, but also incorrect, behavior on kernel-3.10.0-229.el7:

```
# losetup -fP isofile
# lsblk /dev/loop0
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
loop0   7:0    0  1M   0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
```

The partitions didn't show up on the first invocation, but they did on the second, and they're still there on the third pass as well. After subsequent reboots, the partition table is read every time, even after several iterations of the three commands. Not sure where to go from here; it doesn't seem that I can reproduce the reported problem. Most recent pass, for the record (Beaker host ibm-x3250m5-01.rhts.eng.bos.redhat.com, DISTRO=RHEL-7.1-20150206.2, x86_64):

```
[root@ibm-x3250m5-01 ~]# ll /dev/loop*
crw-------. 1 root root 10, 237 Feb 13 14:52 /dev/loop-control
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D
[root@ibm-x3250m5-01 ~]# losetup -fP isofile
[root@ibm-x3250m5-01 ~]# lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:2    0    1M  0 loop
└─loop0p2 259:3    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:2    0    1M  0 loop
└─loop0p2 259:3    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; sleep 5; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# losetup -D; sleep 5; losetup -fP isofile; lsblk /dev/loop0
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
[root@ibm-x3250m5-01 ~]# uname -r
3.10.0-229.el7.x86_64
```

I can still reproduce on kernel-3.10.0-229.el7.x86_64 on a virtual machine (tested on RHEL-7.1-20150206.2, a slightly modified OpenStack image). When reproducing on a physical machine, the 5th mount in a row reproduced the issue. (RHEL-7.1-20150206.2 Server x86_64 / ibm-hs22-01.rhts.eng.brq.redhat.com)

I've tried experimenting with the sleep command:

```
# while true; do losetup -D; sleep 1; losetup -fP isofile; sleep 1; lsblk /dev/loop0; done
```

This succeeded most of the time when run on a physical machine (I remember a single failure after >10 iterations). With no sleeps it failed more often, though not nearly as often as on a virtual machine. On a VM running under a (busy) hypervisor, even two one-second sleeps are not enough for the probing to succeed. On that VM it helped if I unloaded the loop module and walked through the scenario again manually, waiting >5 s between steps. (BaseOS machine sheep-65.lab.eng.brq.redhat.com) I definitely suggest testing on a VM.

I'll see what I can do in the way of some testing with a VM this afternoon. It smells like some sort of race if it's easier to reproduce on a resource-constrained VM with a busy hypervisor.

I'm able to reproduce the problem every once in a while on a VM on a mostly idle system, particularly if I force some disk activity at the same time (such as booting another VM while running the losetup commands repeatedly in a loop). Another oddity I see sometimes is loop0p1 and loop0p2 getting minors of 2 and 3 respectively, instead of 0 and 1.

Upstream (well, 3.18.7 in Fedora) is even worse; I don't seem to ever get a partition table there. Here I was, hoping things might be improved, and they're actually worse... Haven't looked at RHEL6 yet.
Definitely some research to do here -- upstream's loop code has been ported from bio to blk-mq, which may or may not be related...

(In reply to Jarod Wilson from comment #11)
> Upstream (well, 3.18.7 in Fedora) is even worse, I don't seem to ever get a
> partition table there. Here I was, hoping things might be improved, and
> they're actually worse... Haven't looked at RHEL6 yet. Definitely some
> research to do here -- upstream's loop code has been ported from bio to
> blk-mq, which may or may not be related...

There may be something else going on here. I just rebooted the guest, and now the lsblk command is always showing partitions. Still poking. This is slightly maddening.

Okay, some more digging, and I can now make lsblk show partitions 100% of the time, even while I've got a dd going in a second guest, which otherwise tends to give me at least a 20% failure rate. What I've done is modify lsblk to issue a BLKRRPART (re-read block device partition table) ioctl when it encounters a device with 0 partitions. This is more or less what fdisk (and sfdisk and cfdisk) do, though I think they don't even bother restricting the ioctl to the zero-partitions case; they just always issue it if BLKRRPART is supported. However... there's a big caveat here: this seems like papering over the actual bug. It appears that whatever populates the sysfs tree with partition information (udev? systemd?) is failing, probably because it is triggered before the device has been fully set up, which we might be able to combat with some code changes on the kernel side in the loop driver.

Well, I had suspicions about kobject_uevent calls being in the wrong places, then I was questioning whether the LO_FLAGS_PARTSCAN flag set by losetup's -P option was being propagated to the kernel. I'm now seeing that it definitely is, and a partition table re-read *is* being triggered, so I'm still not sure where things are falling down.

I'm finally reasonably well convinced that the kernel loop code is doing everything it possibly can to get this right, including triggering a partition rescan at setup time, as requested by losetup's -P flag. I do have the following patch that makes lsblk 100% reliable, just like fdisk, employing the same trick of issuing a partition re-read ioctl:

```
--- misc-utils/lsblk.c	2015-02-27 15:42:26.825565650 -0500
+++ misc-utils/lsblk.c.new	2015-02-27 15:42:08.838594243 -0500
@@ -1123,6 +1123,13 @@ static int set_cxt(struct blkdev_cxt *cx
 	}
 
 	cxt->npartitions = sysfs_count_partitions(&cxt->sysfs, name);
+	if (cxt->npartitions == 0) {
+		fd = open(cxt->filename, O_RDONLY);
+		ioctl(fd, BLKRRPART);
+		close(fd);
+		cxt->npartitions = sysfs_count_partitions(&cxt->sysfs, name);
+	}
+
 	cxt->nholders = sysfs_count_dirents(&cxt->sysfs, "holders");
 	cxt->nslaves = sysfs_count_dirents(&cxt->sysfs, "slaves");
```

This would implement the workaround Karel mentions in comment #2.

Things are definitely not better with a newer kernel, though; I'm able to reproduce the problem reliably on 3.18.7. (The link was for 3.14.19, not 3.19, though I can give 3.19 or 4.0-rc1 a spin as well; I expect more of the same.) I'm not sure if it's the kernel's partition-scanning code failing to find partitions, or a timing issue with udev/systemd simply not noticing the quick remove/re-add sequence. Might have to figure out how to enable some udev/systemd debugging to see what's going on at that level when our partitions fail to show up. Might have to trace some code down in fs/block_dev.c now...
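For reference, the same re-read trick is easy to exercise outside lsblk. The following is a minimal standalone sketch (not from any posted patch in this bug; the /dev/loop0 path is an assumption) that forces a partition-table re-read via BLKRRPART, the same ioctl the patch above and `blockdev --rereadpt` use:

```c
/* Sketch only: force a partition-table re-read on a block device.
 * Requires CAP_SYS_ADMIN; the device path is an assumption. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKRRPART */

int main(void)
{
    int fd = open("/dev/loop0", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* An EBUSY failure here corresponds to the blkdev_reread_part()
     * bail-out discussed in the analysis below. */
    if (ioctl(fd, BLKRRPART) < 0)
        perror("BLKRRPART");
    close(fd);
    return 0;
}
```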
From drivers/block/loop.c's loop_set_status() call to ioctl_by_bdev(), we get to fs/block_dev.c, and ioctl_by_bdev() calls blkdev_ioctl() in block/ioctl.c. For BLKRRPART, that takes us over to block/ioctl.c's blkdev_reread_part(). There are several checks there, one of which may well be dumping us out of blkdev_reread_part() before the call to rescan_partitions(). Now I'm thinking we're likely hitting the -EBUSY return from rescan_partitions(), since another caller of ioctl_by_bdev(), drivers/s390/block/dasd_genhd.c's dasd_scan_partitions(), calls it in a while loop with a few retries while rc == -EBUSY. If I can confirm that's what we're actually hitting, I'll pretty much duplicate the dasd_genhd code into the loop driver and test that out.

This is looking like a winner. I've confirmed that when the loop device fails to show any partitions, it's because the mutex_trylock() in blkdev_reread_part() fails and we get returned an -EBUSY that the loop code currently ignores. I'll whip up a patch for the loop driver as soon as I can.

Well, unfortunately, the retry loop ported over from dasd_genhd didn't help any. Even with increased delays and more retries, if the first call hits -EBUSY, so do all subsequent tries. I'm still clueless as to what is holding the lock, though there's a nasty comment in loop_clr_fd() that describes a somewhat similar situation:

```
/*
 * If we've explicitly asked to tear down the loop device,
 * and it has an elevated reference count, set it for auto-teardown when
 * the last reference goes away. This stops $!~#$@ udev from
 * preventing teardown because it decided that it needs to run blkid on
 * the loopback device whenever they appear. xfstests is notorious for
 * failing tests because blkid via udev races with a losetup
 * <dev>/do something like mkfs/losetup -d <dev> causing the losetup -d
 * command to fail with EBUSY.
 */
```

Our issue is on the create side though, not the destroy side, so this may be a red herring. I guess I'll have to go figure out lock debugging...

So kernel-debug isn't turning up anything new, but given that a clean boot and an initial insmod of loop.ko (with some debugging spew added) show this busy behavior, I'm inclined to believe that various parts of losetup are racing against udev, and maybe we can do something about it by adding some additional locking to the loop driver. Basically, I think udev is screwing things over, and it's more prevalent on slow storage like the file-backed disk of a VM. Still digging, determined to get to the bottom of this... :)

So, as illustrated by the comments in an older loop.c commit:

```
commit 5370019dc2d2c2ff90e95d181468071362934f3a
Author: Guo Chao <yan.ibm.com>
Date:   Thu Feb 21 15:16:45 2013 -0800

    loopdev: fix a deadlock

    bd_mutex and lo_ctl_mutex can be held in different order.

    Path #1:

    blkdev_open
     blkdev_get
      __blkdev_get (hold bd_mutex)
       lo_open (hold lo_ctl_mutex)

    Path #2:

    blkdev_ioctl
     lo_ioctl (hold lo_ctl_mutex)
      lo_set_capacity (hold bd_mutex)
    ...
```

The problem we have here is that device setup and partition scanning aren't an atomic operation; they're done by two ioctls that losetup calls, first LOOP_SET_FD, then LOOP_SET_STATUS64. In between the two of them, udev makes a call to blkdev_open(), which grabs bd_mutex and then calls lo_open(), which grabs lo_ctl_mutex. The LOOP_SET_STATUS64 call grabs lo_ctl_mutex first, then tries to scan partitions, which requires grabbing bd_mutex. It's a classic AB-BA deadlock scenario if we try to force the issue and take bd_mutex ourselves.
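To make that race window concrete, here is a userspace sketch of the two-ioctl sequence just described, roughly what `losetup -fP` does. This is illustrative, not losetup's actual source; the backing-file name "isofile" and the abbreviated error handling are assumptions. udev can open the new device node at any point between the two ioctls:

```c
/* Sketch of the non-atomic two-ioctl loop device setup. */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/loop.h>

int main(void)
{
    int ctl = open("/dev/loop-control", O_RDWR);
    int nr = ioctl(ctl, LOOP_CTL_GET_FREE);   /* first free device, e.g. 0 */

    char dev[32];
    snprintf(dev, sizeof(dev), "/dev/loop%d", nr);
    int lo = open(dev, O_RDWR);
    int backing = open("isofile", O_RDWR);

    /* ioctl #1: bind the backing file. udev sees the resulting device
     * change event and may open the device right after this returns. */
    ioctl(lo, LOOP_SET_FD, backing);

    /* ioctl #2: request a partition scan. This takes lo_ctl_mutex and
     * then needs bd_mutex for the rescan; if udev already holds
     * bd_mutex via blkdev_open(), the rescan bails with -EBUSY, which
     * the loop driver at this point silently ignored. */
    struct loop_info64 info;
    memset(&info, 0, sizeof(info));
    info.lo_flags = LO_FLAGS_PARTSCAN;
    if (ioctl(lo, LOOP_SET_STATUS64, &info) < 0)
        perror("LOOP_SET_STATUS64");

    close(backing);
    close(lo);
    close(ctl);
    return 0;
}
```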
The situation does improve some if we temporarily give up lo_ctl_mutex; I can show several cases where this results in success where we'd otherwise have hit failure, but sadly, we still fail sometimes regardless.

Don't know of any reason why this bug can't be public, so making it so.

Patch submitted upstream that helps a fair bit with this: https://lkml.org/lkml/2015/3/31/888

I've been referred back to a post from a few months ago trying to address the same problem: https://lkml.org/lkml/2015/1/26/137 Reading over that and some replies to my own posting, will work on an updated patch...

What should be pretty close to a final set can be found here: https://lkml.org/lkml/2015/4/8/502 Patch 7 just needs a minor tweak, and I think this should get into a tree somewhere, at which point it can be backported to RHEL7.

I've got a VM running a kernel with the upstream patchset, and this scriptlet...

```
for i in `seq 1 1000`
do
	losetup -fP isofile
	sleep .1
	lsblk /dev/loop0
	losetup -D
	sleep .1
done
```

...didn't fail a single time listing partitions. Without the sleep .1 between losetup -fP isofile and lsblk, I'm occasionally racing udev, and roughly 5-10 times out of 1000, lsblk will show only one or none of the partitions, but a subsequent call gives proper results, and in-kernel I can see all the partitions being detected properly. There really isn't much that can be done about that; we just need to allow some time for the device nodes to be set up before we try to list them.

Still waiting for this to get merged into an upstream tree, but the block maintainer, Jens Axboe, has said that he'll prep it for inclusion in 4.2.

Patches are staged for upstream 4.2 in the block maintainer's tree: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git/log/?h=for-4.2/drivers

Backporting to RHEL7; will post for internal review after build and sanity-testing.

Patch(es) available on kernel-3.10.0-263.el7

Using the server Tomas provided in #c35, I reproduced this bug on the -229 kernel and verified it on the -320 kernel. Moving to VERIFIED.

********** Reproduced on kernel 3.10.0-229.el7.x86_64 **********

My steps:

```
1. Reboot the system
2. # losetup -fP isofile
3. # lsblk
NAME                                MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                                 252:0    0   80G  0 disk
├─vda1                              252:1    0  500M  0 part /boot
└─vda2                              252:2    0 79.5G  0 part
  ├─rhel_cloud--qe--16--vm--06-swap 253:0    0    2G  0 lvm  [SWAP]
  ├─rhel_cloud--qe--16--vm--06-root 253:1    0   50G  0 lvm  /
  └─rhel_cloud--qe--16--vm--06-home 253:2    0 27.5G  0 lvm  /home
loop0                                 7:0    0    1M  0 loop
├─loop0p1                           259:0    0    1M  0 loop
└─loop0p2                           259:1    0  938K  0 loop
4. # losetup -D
5. # losetup -fP isofile
6. # lsblk
NAME                    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
vda                     252:0    0   70G  0 disk
├─vda1                  252:1    0  500M  0 part /boot
└─vda2                  252:2    0 69.5G  0 part
  ├─rhel_sheep--25-swap 253:0    0    3G  0 lvm  [SWAP]
  ├─rhel_sheep--25-root 253:1    0 44.7G  0 lvm  /
  └─rhel_sheep--25-home 253:2    0 21.8G  0 lvm  /home
loop0                     7:0    0    1M  0 loop
^^^^^^^^^^^^^^^^^^^ NOTE: loop0p1/2 are not detected
```

The loop0p1/2 partitions are detected the first time only; they do not show up in subsequent iterations.
********** Verified on kernel 3.10.0-320.el7.x86_64 **********

```
[root@sheep-25 ~]# uname -r
3.10.0-320.el7.x86_64
[root@sheep-25 ~]# bash test.sh
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
NAME      MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0       7:0    0    1M  0 loop
├─loop0p1 259:0    0    1M  0 loop
└─loop0p2 259:1    0  938K  0 loop
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2015-2152.html