Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
For bugs related to Red Hat Enterprise Linux 5 product line. The current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745 to report new issues.

Bug 1100080

Summary: `xenstore-read` sometimes reads the block device backend store entry after that entry was removed via `xenstore-rm`, and due to this xen fails to release the loopback device
Product: Red Hat Enterprise Linux 5 Reporter: Lingfei Kong <lkong>
Component: xen    Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED WONTFIX QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 5.11    CC: drjones, harald, knoel, leiwang, ovasik, tdosek, udev-maint-list, vkuznets, wshi, xen-maint
Target Milestone: rc    Keywords: Regression
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Known Issue
Doc Text:
Cause: Two scripts for block devices are executed by udev rules when an instance dies: /etc/xen/scripts/block and /etc/xen/scripts/xen-hotplug-cleanup. When a loopback device is used for storage access (e.g. when the disk is specified as 'file:/' in the guest config file), it can remain connected after guest shutdown under certain circumstances. The root cause of this bug is that the /etc/xen/scripts/block script grabs the /var/run/xen-hotplug/block lock too late, so it races with /etc/xen/scripts/xen-hotplug-cleanup.
Consequence: The loopback device remains connected. It can be seen with 'losetup -a' and removed with 'losetup -d'.
Workaround (if any): 1) Changing the udev log level to 'info' ('udevcontrol log_priority=info' at runtime, or by editing /etc/udev/udev.conf and rebooting) is known to help. Alternatively, 2) add 'sleep 1' before the 'claim_lock "block"' line in /etc/xen/scripts/xen-hotplug-cleanup.
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-04-18 21:58:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                 Flags
udevmonitor_goodcase.txt    none
udevmonitor_badcase.txt     none
proof-of-concept patch      none

Description Lingfei Kong 2014-05-22 00:27:13 UTC
Description of problem:
When destroying a guest that has a loopback device, xen first uses the command `xenstore_read "$XENBUS_PATH/node"` (file /etc/xen/scripts/block, line 307) to get the name of the loopback device, then uses `losetup -d "$node"` to release the loopback device, and then executes the script /etc/xen/scripts/xen-hotplug-cleanup (lines 38 and 39) to remove the block device backend store entry.

But sometimes xen removes the backend store entry first. This causes xen to fail to get the loopback device name from xenstore, because the entry has already been removed, so the loopback device fails to be released.
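The race described above can be reduced to a small self-contained sketch (hypothetical names only: flock(1) stands in for xen's claim_lock helper, and a temp directory stands in for the xenstore backend entry):

```shell
#!/bin/sh
# Hypothetical reduction of the race (illustrative names only):
# "block" must read the node entry before "cleanup" removes it;
# whichever script grabs the lock first wins.
store=$(mktemp -d)            # stands in for backend/vbd/<domid>/<dev>
lock="$store/lock"
echo /dev/loop0 > "$store/node"

# Models the remove path of /etc/xen/scripts/block.
block_remove() {
    (
        flock 9               # stand-in for claim_lock "block"
        if [ -f "$store/node" ]; then
            cat "$store/node"                # would feed losetup -d
        else
            echo "couldn't read path node"   # the failure seen in the log
        fi
    ) 9>"$lock"
}

# Models /etc/xen/scripts/xen-hotplug-cleanup removing the entry.
cleanup() {
    ( flock 9; rm -f "$store/node" ) 9>"$lock"
}

good=$(block_remove)    # block wins the race: node is still readable
cleanup
bad=$(block_remove)     # cleanup already won: the read fails
echo "good case: $good"
echo "bad case:  $bad"
```

In the bad ordering, the read failure mirrors the kind of `xenstore-read: couldn't read path ...` message quoted under Additional info below.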

While trying to find out what calls the script /etc/xen/scripts/xen-hotplug-cleanup, I found the following udev rules in /etc/udev/rules.d/xen-backend.rules:
#cat /etc/udev/rules.d/xen-backend.rules
...
SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"
...
SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

I checked the version of udev and found it was updated from udev-095-14.29.el5 to udev-095-14.31.el5 in RHEL5.11-Server-20140508.0. I uninstalled udev-095-14.31.el5 and installed udev-095-14.29.el5 on the same machine, then redid the test; this time all the loopback devices were released.


Version-Release number of selected component (if applicable):
rhel5.11 host: kernel-xen-2.6.18-389.el5
rhel7.0 guest: kernel-3.10.0-115.el7
udev-095-14.31.el5
xen-3.0.3-144.el5


How reproducible:
70%


Steps to Reproduce:
1. Create a test script:
#cat test.sh 
#!/bin/bash
i=0
while :
do
    xm create xm-test.conf >/dev/null
    sleep 5
    xm destroy vm1
    echo $i
    ((i++))
done

2. Create a vm config file named xm-test.conf 
#cat xm-test.conf 
name = "vm1"
uuid = "1efb30c3-86fd-9dd7-4934-9b72b6a833fc"
builder = "hvm"
kernel = "/usr/lib/xen/boot/hvmloader"
memory = 1024
maxmem = 1024
vcpus = 4
on_poweroff = "destroy"
on_reboot = "restart"
on_crash = "preserve"
pae = 1
acpi = 1
apic = 1
localtime = 0
device_model = "/usr/lib64/xen/bin/qemu-dm"
disk = ['file:/root/xen.autotest.new/client/tests/xen/images/RHEL-Server-7.0-64-hvm.raw,hda,w']
vif = ['mac=00:19:30:1F:68:16,script=vif-bridge,bridge=xenbr0,model=e1000,type=ioemu']
vnc = 1
vncunused = 1
vnclisten = "0.0.0.0"
sdl = 0
serial = "pty"
soundhw = "es1370"

3. Run the script and use `losetup -a` to check the loopback devices


Actual results:
After a number of create-destroy operations (maybe 10), you will find the following error message on your tty:
Error: Device 768 (vbd) could not be connected. /etc/xen/scripts/block failed; error detected.

And if you type 'losetup -a', you will find all the loopback devices are in use:
#losetup -a
/dev/loop0: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop1: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop2: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop3: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop4: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop5: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop6: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
/dev/loop7: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)


Expected results:
No more than one loopback device should be listed when you type `losetup -a`, and no `xm create xm-test.conf` should fail with the error message:
Failed to create domain: 'Using config file "/tmp/xm-test.conf". Error: Device 768 (vbd) could not be connected. /etc/xen/scripts/block failed; error detected. '


Additional info:
When xen fails to release the loopback device after `xm destroy $DomU_ID`, you will find the following message in the log file /var/log/xen/xen-hotplug.log:
xenstore-read: couldn't read path backend/vbd/6/768/node

Comment 1 Wei Shi 2014-05-22 01:40:16 UTC
No problems for:
udev-095-14.29.el5 (RHEL5.10 release)

Can be reproduced from:
udev-095-14.30.el5


kernel: 2.6.18-388.el5xen
xen: xen-3.0.3-144.el5

Comment 3 Andrew Jones 2014-05-22 13:54:30 UTC
(In reply to Wei Shi from comment #1)
> No problems for
> udev-095-14.29.el5(RHEL5.10 release)
> 
> Can reproduced from:
> udev-095-14.30.el5
> 
> 

There's no doubt that being able to run xen is the perfect test case if you want to find all the corner cases. This bug appears to me to be just that - xen found an issue with a udev update. Kicking this BZ over to udev. udev shouldn't break userspace, not even xen.

Comment 4 Harald Hoyer 2014-05-27 13:48:12 UTC
Is it reproducible if you move /etc/udev/redhat/90-dm.rules away?

Comment 5 Lingfei Kong 2014-05-27 23:59:55 UTC
(In reply to Harald Hoyer from comment #4)
> is it reproducible, if you move /etc/udev/redhat/90-dm.rules away?

It is reproducible.  I did not find that file on the xen host

#ls /etc/udev/
makedev.d  rules.d  udev.conf
#ls /etc/udev/rules.d/
05-udev-early.rules  50-udev.rules  51-hotplug.rules  95-pam-console.rules
 

I downloaded udev-095-14.31.el5.x86_64.rpm and unpacked it; no /etc/udev/redhat/90-dm.rules file was found in the rpm package:
#rpm2cpio ../udev-095-14.31.el5.x86_64.rpm |cpio -div
#ls
etc  lib  sbin  usr
#find . -name "*redhat*"
<Nothing>
#find . -name "*90-dm.rules*"
<Nothing>

But I can find '90-dm.rules' in udev-095-14.29.el5.x86_64.rpm

Comment 6 Harald Hoyer 2014-05-28 11:17:53 UTC
(In reply to Lingfei Kong from comment #5)
> (In reply to Harald Hoyer from comment #4)
> > is it reproducible, if you move /etc/udev/redhat/90-dm.rules away?
> 
> It is reproducible.  I did not find that file on the xen host
> 
> #ls /etc/udev/
> makedev.d  rules.d  udev.conf
> #ls /etc/udev/rules.d/
> 05-udev-early.rules  50-udev.rules  51-hotplug.rules  95-pam-console.rules
>  
> 
> I download udev-095-14.31.el5.x86_64.rpm and unpack it, no
> /etc/udev/redhat/90-dm.rules file found in the rpm package
> #rpm2cpio ../udev-095-14.31.el5.x86_64.rpm |cpio -div
> #ls
> etc  lib  sbin  usr
> #find . -name "*redhat*"
> <Nothing>
> #find . -name "*90-dm.rules*"
> <Nothing>
> 
> But I can found '90-dm.rules' in udev-095-14.29.el5.x86_64.rpm

Please test: install udev-095-14.29.el5.x86_64.rpm, move /etc/udev/redhat/90-dm.rules to e.g. /tmp and test.

Comment 7 Lingfei Kong 2014-05-29 05:30:46 UTC
(In reply to Harald Hoyer from comment #6)
> Please test: install udev-095-14.29.el5.x86_64.rpm, move
> /etc/udev/redhat/90-dm.rules to e.g. /tmp and test.

I installed udev-095-14.29.el5.x86_64.rpm, moved /etc/udev/rules.d/90-dm.rules to ~/, and then repeatedly created and destroyed the xen guest; all the loopback devices were released.

Comment 8 Harald Hoyer 2014-07-23 13:47:49 UTC
(In reply to Lingfei Kong from comment #7)
> (In reply to Harald Hoyer from comment #6)
> > Please test: install udev-095-14.29.el5.x86_64.rpm, move
> > /etc/udev/redhat/90-dm.rules to e.g. /tmp and test.
> 
> I installed udev-095-14.29.el5.x86_64.rpm and moved
> /etc/udev/rules.d/90-dm.rules to ~/ then repeat create and destroy xen
> guest, all the loopback device are realeased.

Well, other than the change with 90-dm.rules, nothing in your area has changed with udev-095-14.31.el5.

So my rough guess is that it is a timing issue and has nothing to do with the changes.

Shouldn't there be a

   claim_lock "block"
   release_lock "block"

in the remove section of /etc/xen/scripts/block as well?

Also, I see in /etc/udev/rules.d/xen-backend.rules:

SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

has no KERNEL=="vbd*" matching.

So, if any non-vbd device triggers xen-hotplug-cleanup, couldn't that be the source of the problems?
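A narrowed variant of that cleanup rule would look something like the following (hypothetical, shown only to illustrate the question above; not a tested fix, since cleanup may legitimately need to run for non-vbd devices too):

```
# Hypothetical variant: restrict cleanup to vbd devices, mirroring
# the KERNEL match on the "block" rule.
SUBSYSTEM=="xen-backend", KERNEL=="vbd*", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"
```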


Can you compare the output of:

# /usr/sbin/udevmonitor --env

for the good and the bad case?

Comment 9 Lingfei Kong 2014-07-30 13:49:41 UTC
(In reply to Harald Hoyer from comment #8)
> well, other than the change with 90-dm.rules, nothing in your area has
> changed with udev-095-14.31.el5
> 
> So, my rough guess is, that it is a timing issue and has nothing to do with
> the changes.
> 
> Shouldn't there be a 
> 
>    claim_lock "block"
>    release_lock "block"
> 
Hi Harald, thanks for your analysis.
I guess this is a timing issue too. I did some testing and found that if I sleep 0.04 seconds in /etc/xen/scripts/xen-hotplug-cleanup, the loopback device is removed successfully. I did this by adding "sleep 0.04" above the claim_lock "block" line in /etc/xen/scripts/xen-hotplug-cleanup.
------------------------
# Any other scripts that do similar things will have to have their lock
# claimed too.
# This is pretty horrible, but there's not really a nicer way of solving this.
sleep 0.04
claim_lock "block"

# get /vm/UUID path
vm=$(xenstore_read_default "/local/domain/${path_array[2]}/vm" "")
------------------------

I also found that with udev-095-14.29.el5, /etc/xen/scripts/block is always triggered first, but with udev-095-14.32.el5, /etc/xen/scripts/xen-hotplug-cleanup is always triggered first. So in the above test I gave /etc/xen/scripts/xen-hotplug-cleanup a 0.04-second sleep to wait for /etc/xen/scripts/block to be triggered.

I guess that when shutting down/destroying a xen guest, xen or the xen guest first triggers the udev rule:
SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"
ACTION=remove
then trigger udev rule:
SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

So maybe the problem is that udev calls /etc/xen/scripts/block after /etc/xen/scripts/xen-hotplug-cleanup even if RUN+="/etc/xen/scripts/block $env{ACTION}" is triggered first?

> Also, I see in /etc/udev/rules.d/xen-backend.rules:
> 
> SUBSYSTEM=="xen-backend", ACTION=="remove",
> RUN+="/etc/xen/scripts/xen-hotplug-cleanup"
> 
> has no KERNEL=="vbd*" matching.
> 
> So, if any non-vbd device triggers xen-hotplug-cleanup, couldn't that be the
> source of the problems?
>
 
A vif device or other device can trigger xen-hotplug-cleanup too, but if a vif device triggers xen-hotplug-cleanup it can only remove the backend that belongs to that vif device, not the backend of the vbd device in xenstore (xen gets the loopback device name from the vbd's backend).

> 
> Can you compare the output of:
> 
> # /usr/sbin/udevmonitor --env
> 
> for the good and the bad case?
I checked the udevmonitor logs and found udev did the same things in both cases; you can have a double check.
udevmonitor_goodcase.txt is for the good case
udevmonitor_badcase.txt is for the bad case

Steps:
1. Start '/usr/sbin/udevmonitor --env |tee result' in the background
2. Create a xen guest and then shut it down
3. Collect the result file for the good case or the bad case

As we are still not clear which component this problem belongs to, I needinfo'd Vitaly to have a check too :-)

Comment 10 Lingfei Kong 2014-07-30 13:50:26 UTC
Created attachment 922599 [details]
udevmonitor_goodcase.txt

Comment 11 Lingfei Kong 2014-07-30 13:50:48 UTC
Created attachment 922600 [details]
udevmonitor_badcase.txt

Comment 12 Vitaly Kuznetsov 2014-08-11 08:05:21 UTC
(In reply to Lingfei Kong from comment #9)
> 
> As we still not clear which component this problem belong to, i needinfo
> Vitaly to have a check too :-)

Unfortunately I have nothing to add here. The xen package has not changed for more than a year. The fact that changing udev from udev-095-14.32.el5 back to udev-095-14.29.el5 solves the issue also shows that we need to examine udev first.

I compared these udev versions and only one change seems relevant:

-install -m 0644 etc/udev/redhat/90-dm.rules $RPM_BUILD_ROOT {_sysconfdir}/udev/rules.d/90-dm.rules
-%config %attr(0644,root,root) %{_sysconfdir}/udev/rules.d/90-dm.rules

Lingfei, you can just go ahead and try putting this file back (take it from udev-095-14.29.el5).

Comment 13 Lingfei Kong 2014-08-13 00:46:46 UTC
(In reply to Vitaly Kuznetsov from comment #12)
> I compared these udev versions and only one change seems relevant:
> 
> -install -m 0644 etc/udev/redhat/90-dm.rules $RPM_BUILD_ROOT
> {_sysconfdir}/udev/rules.d/90-dm.rules
> -%config %attr(0644,root,root) %{_sysconfdir}/udev/rules.d/90-dm.rules
> 
> Lingfei, you can just go ahead and try putting this file back (take it from
> udev-095-14.29.el5).

I moved the 90-dm.rules from udev-095-14.29.el5 to /etc/udev/rules.d/ and ran the test; the loopback devices are still not released after shutting down/destroying the guest. Also see Comment 7.

Comment 15 Harald Hoyer 2014-08-14 09:16:27 UTC
(In reply to Lingfei Kong from comment #9)
> Also i found with udev-095-14.29.el5, /etc/xen/scripts/block will always be
> triggered first. But with udev-095-14.32.el5,
> /etc/xen/scripts/xen-hotplug-cleanup will always be triggered first. So in
> the above test, i let  /etc/xen/scripts/xen-hotplug-cleanup to have a 0.04
> seconds sleep to wait for  /etc/xen/scripts/block to be triggered. 
> 

Just fix the rules and scripts. I don't think this is a udev problem, but a problem with the xen rules and scripts.

Comment 16 Vitaly Kuznetsov 2014-08-15 15:09:28 UTC
(In reply to Harald Hoyer from comment #15)

> 
> Just fix the rules and scripts. I don't think this is a udev problem, but a
> problem with the xen rules and scripts.

I did some investigation and it seems the udev rules were always racy. For some reason we lose the race with the latest udev.

So what actually happens: we have the following rules:
SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"
SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

Both these scripts use the /var/run/xen-hotplug/block lock, and we expect the 'block' script to grab it first. However, it grabs the lock somewhere in the middle of the script, not when it starts. This is racy, and if 'xen-hotplug-cleanup' grabs it first we get what we get.

There are lots of things which can influence this race. For me, even switching the udev log level with 'udevcontrol log_priority=info' changes the behavior, and the 'block' script always wins.

I will attach a proof-of-concept patch which fixes the issue.
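The attached patch itself is not reproduced in this record. A hypothetical sketch of the approach described above (claim the lock before the first xenstore read in the remove path of /etc/xen/scripts/block, using the claim_lock/release_lock helpers the scripts already share) might look like:

```
# Hypothetical sketch, not the attached patch: take the lock before
# reading the node entry, so xen-hotplug-cleanup cannot remove the
# backend store entry in between.
remove)
    claim_lock "block"
    node=$(xenstore_read "$XENBUS_PATH/node")
    losetup -d "$node"
    release_lock "block"
    ;;
```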

Comment 17 Vitaly Kuznetsov 2014-08-15 15:12:36 UTC
Created attachment 927142 [details]
proof-of-concept patch

Comment 22 Chris Williams 2017-04-18 21:58:25 UTC
Red Hat Enterprise Linux 5 shipped its last minor release, 5.11, on September 14th, 2014. On March 31st, 2017 RHEL 5 exited Production Phase 3 and entered Extended Life Phase. For RHEL releases in the Extended Life Phase, Red Hat will provide limited ongoing technical support. No bug fixes, security fixes, hardware enablement or root-cause analysis will be available during this phase, and support will be provided on existing installations only. If the customer purchases the Extended Life-cycle Support (ELS), certain critical-impact security fixes and selected urgent priority bug fixes for the last minor release will be provided. For more details please consult the Red Hat Enterprise Linux Life Cycle Page:
https://access.redhat.com/support/policy/updates/errata

This BZ does not appear to meet ELS criteria so is being closed WONTFIX. If this BZ is critical for your environment and you have an Extended Life-cycle Support Add-on entitlement, please open a case in the Red Hat Customer Portal, https://access.redhat.com ,provide a thorough business justification and ask that the BZ be re-opened for consideration of an errata. Please note, only certain critical-impact security fixes and selected urgent priority bug fixes for the last minor release can be considered.