Bug 1100080
| Summary: | `xenstore-read` sometimes reads the block device backend store entry after that entry has already been removed via `xenstore-rm`, and due to this xen fails to release the loopback device | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | Lingfei Kong <lkong> | ||||||||
| Component: | xen | Assignee: | Xen Maintainance List <xen-maint> | ||||||||
| Status: | CLOSED WONTFIX | QA Contact: | Virtualization Bugs <virt-bugs> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | high | ||||||||||
| Version: | 5.11 | CC: | drjones, harald, knoel, leiwang, ovasik, tdosek, udev-maint-list, vkuznets, wshi, xen-maint | ||||||||
| Target Milestone: | rc | Keywords: | Regression | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Known Issue | |||||||||
| Doc Text: |
Cause: Two scripts for the block device are executed by udev rules when an instance dies: /etc/xen/scripts/block and /etc/xen/scripts/xen-hotplug-cleanup. When a loopback device is used for storage access (e.g. when the disk is specified as 'file:/' in the guest config file), it can remain connected after guest shutdown under certain circumstances.
The root cause of this bug is that the /etc/xen/scripts/block script grabs the /var/run/xen-hotplug/block lock too late, and a race with /etc/xen/scripts/xen-hotplug-cleanup occurs.
Consequence:
The loopback device remains connected. It can be seen with 'losetup -a' and removed with 'losetup -d'.
Workaround (if any):
1) Changing the udev loglevel to 'info' ('udevcontrol log_priority=info' at runtime, or by editing /etc/udev/udev.conf and rebooting) is known to help.
alternatively
2) Add 'sleep 1' before the 'claim_lock "block"' line in /etc/xen/scripts/xen-hotplug-cleanup.
|
Story Points: | --- | ||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2017-04-18 21:58:25 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Attachments: |
|
||||||||||
No problems for udev-095-14.29.el5 (RHEL5.10 release).

Can reproduce from: udev-095-14.30.el5

kernel: 2.6.18-388.el5xen
xen: xen-3.0.3-144.el5

(In reply to Wei Shi from comment #1)
> No problems for udev-095-14.29.el5 (RHEL5.10 release)
> Can reproduce from: udev-095-14.30.el5

There's no doubt that being able to run xen is the perfect test case, if you want to find all the corner cases. This bug appears to me to be just that - xen found an issue with a udev update. Kicking this BZ over to udev. udev shouldn't break userspace, not even xen.

Is it reproducible if you move /etc/udev/redhat/90-dm.rules away?

(In reply to Harald Hoyer from comment #4)
> Is it reproducible if you move /etc/udev/redhat/90-dm.rules away?

It is reproducible. I did not find that file on the xen host:

    # ls /etc/udev/
    makedev.d  rules.d  udev.conf
    # ls /etc/udev/rules.d/
    05-udev-early.rules  50-udev.rules  51-hotplug.rules  95-pam-console.rules

I downloaded udev-095-14.31.el5.x86_64.rpm and unpacked it; there is no /etc/udev/redhat/90-dm.rules file in the rpm package:

    # rpm2cpio ../udev-095-14.31.el5.x86_64.rpm | cpio -div
    # ls
    etc  lib  sbin  usr
    # find . -name "*redhat*"
    <Nothing>
    # find . -name "*90-dm.rules*"
    <Nothing>

But I can find '90-dm.rules' in udev-095-14.29.el5.x86_64.rpm.

(In reply to Lingfei Kong from comment #5)
> It is reproducible. I did not find that file on the xen host
> [...]

Please test: install udev-095-14.29.el5.x86_64.rpm, move /etc/udev/redhat/90-dm.rules to e.g. /tmp and test.

(In reply to Harald Hoyer from comment #6)
> Please test: install udev-095-14.29.el5.x86_64.rpm, move
> /etc/udev/redhat/90-dm.rules to e.g. /tmp and test.

I installed udev-095-14.29.el5.x86_64.rpm and moved /etc/udev/rules.d/90-dm.rules to ~/, then repeated creating and destroying the xen guest; all the loopback devices are released.

(In reply to Lingfei Kong from comment #7)
> I installed udev-095-14.29.el5.x86_64.rpm and moved
> /etc/udev/rules.d/90-dm.rules to ~/ [...] all the loopback devices are released.

Well, other than the change with 90-dm.rules, nothing in your area has changed with udev-095-14.31.el5. So my rough guess is that it is a timing issue and has nothing to do with the changes.

Shouldn't there be a

    claim_lock "block"
    release_lock "block"

in the remove section also in /etc/xen/scripts/block?

Also, I see in /etc/udev/rules.d/xen-backend.rules:

    SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

has no KERNEL=="vbd*" matching. So, if any non-vbd device triggers xen-hotplug-cleanup, couldn't that be the source of the problems?

Can you compare the output of:

    # /usr/sbin/udevmonitor --env

for the good and the bad case?

(In reply to Harald Hoyer from comment #8)
> So, my rough guess is that it is a timing issue and has nothing to do with
> the changes.

Hi Harald, thanks for your analysis. I guess this is a timing issue too. I did some testing and found that if I sleep 0.04 seconds in /etc/xen/scripts/xen-hotplug-cleanup, the loopback device is removed successfully. I did this by adding "sleep 0.04" above the line `claim_lock "block"` in /etc/xen/scripts/xen-hotplug-cleanup:

    ------------------------
    # Any other scripts that do similar things will have to have their lock
    # claimed too.
    # This is pretty horrible, but there's not really a nicer way of solving this.
    sleep 0.04
    claim_lock "block"

    # get /vm/UUID path
    vm=$(xenstore_read_default "/local/domain/${path_array[2]}/vm" "")
    ------------------------

Also I found that with udev-095-14.29.el5, /etc/xen/scripts/block is always triggered first, but with udev-095-14.32.el5, /etc/xen/scripts/xen-hotplug-cleanup is always triggered first. So in the above test I let /etc/xen/scripts/xen-hotplug-cleanup sleep 0.04 seconds to wait for /etc/xen/scripts/block to be triggered.

I guess that when shutting down/destroying a xen guest, xen or the xen guest first triggers the udev rule

    SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"

with ACTION=remove, and then triggers the udev rule

    SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

So maybe the problem is that udev calls /etc/xen/scripts/block after /etc/xen/scripts/xen-hotplug-cleanup even if RUN+="/etc/xen/scripts/block $env{ACTION}" is triggered first?

> Also, I see in /etc/udev/rules.d/xen-backend.rules [...] has no
> KERNEL=="vbd*" matching. So, if any non-vbd device triggers
> xen-hotplug-cleanup, couldn't that be the source of the problems?

A vif device or other device can trigger xen-hotplug-cleanup too, but if a vif device triggers xen-hotplug-cleanup it can only remove the backend belonging to that vif device, not the backend of the vbd device in xenstore (xen gets the loopback device name from the vbd's backend).

> Can you compare the output of:
> # /usr/sbin/udevmonitor --env
> for the good and the bad case?

I checked the log of udevmonitor and found it did the same thing; you can have a double check. udevmonitor_goodcase.txt is for the good case, udevmonitor_badcase.txt is for the bad case.

Steps:
1. Start '/usr/sbin/udevmonitor --env |tee result' at the backend
2. Create a xen guest and then shut down this xen guest
3. Get the result file for the good case or bad case

As we are still not clear which component this problem belongs to, I needinfo Vitaly to have a check too :-)

Created attachment 922599 [details]
udevmonitor_goodcase.txt
Created attachment 922600 [details]
udevmonitor_badcase.txt
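The ordering question above can be made concrete with a small simulation (hypothetical paths; a plain directory stands in for xenstore, and `xenstore_read` below is a local stand-in, not the real binary). If the cleanup step removes the backend entry before the block script reads it, the read fails exactly as in the report:

```shell
#!/bin/sh
# Simulated xenstore: one file per key (illustrative layout only).
store=$(mktemp -d)
mkdir -p "$store/backend/vbd/6/768"
echo /dev/loop0 > "$store/backend/vbd/6/768/node"

xenstore_read() {   # stand-in for the real xenstore-read
    cat "$store/$1" 2>/dev/null \
        || echo "xenstore-read: couldn't read path $1" >&2
}

# Bad ordering: cleanup removes the entry first, then block tries to read it.
rm -rf "$store/backend/vbd/6/768"
node=$(xenstore_read backend/vbd/6/768/node)
echo "node='$node'"   # empty: losetup -d has nothing to detach
rm -rf "$store"
```

With the good ordering (read before remove), `node` would hold `/dev/loop0` and the loop device could be detached.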
(In reply to Lingfei Kong from comment #9)
> As we are still not clear which component this problem belongs to, I needinfo
> Vitaly to have a check too :-)

Unfortunately I have nothing to add here. The xen package has not changed for more than a year. The fact that changing udev from udev-095-14.32.el5 to udev-095-14.29.el5 solves the issue also shows us we need to examine udev first.

I compared these udev versions and only one change seems relevant:

    -install -m 0644 etc/udev/redhat/90-dm.rules $RPM_BUILD_ROOT {_sysconfdir}/udev/rules.d/90-dm.rules
    -%config %attr(0644,root,root) %{_sysconfdir}/udev/rules.d/90-dm.rules

Lingfei, you can just go ahead and try putting this file back (take it from udev-095-14.29.el5).

(In reply to Vitaly Kuznetsov from comment #12)
> Lingfei, you can just go ahead and try putting this file back (take it from
> udev-095-14.29.el5).

I moved the 90-dm.rules from udev-095-14.29 to /etc/udev/rules.d/ and redid the test; the loopback devices are still not released after shutting down/destroying the guest. Also see comment 7.

(In reply to Lingfei Kong from comment #9)
> Also I found that with udev-095-14.29.el5, /etc/xen/scripts/block is always
> triggered first, but with udev-095-14.32.el5,
> /etc/xen/scripts/xen-hotplug-cleanup is always triggered first. [...]

Just fix the rules and scripts. I don't think this is a udev problem, but a problem with the xen rules and scripts.

(In reply to Harald Hoyer from comment #15)
> Just fix the rules and scripts. I don't think this is a udev problem, but a
> problem with the xen rules and scripts.

I did some investigation and it seems the udev rules were always racy; for some reason we lose the race with the latest udev. So what actually happens: we have the following rules:

    SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"
    SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

Both of these scripts use the /var/run/xen-hotplug/block lock, and we expect the 'block' script to grab it first. However, it grabs the lock somewhere in the middle, not when it starts. It is racy, and if 'xen-hotplug-cleanup' grabs it first we get what we get. There are lots of things which can influence this race; for me, even switching the udev loglevel with 'udevcontrol log_priority=info' changes the behavior and the 'block' script always wins.

I will attach a proof-of-concept patch which fixes the issue.

Created attachment 927142 [details]
proof-of-concept patch
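The idea described in the comment above is that the 'block' script must take the shared lock at the start, before reading the xenstore entry, so cleanup cannot delete the entry mid-read. The serialization pattern can be sketched with flock(1) (hypothetical paths and file layout; the real scripts use their own claim_lock/release_lock helpers, and the PoC patch itself is in the attachment, not reproduced here):

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
state="$work/node"       # stands in for backend/vbd/.../node in xenstore
lock="$work/block.lock"

echo /dev/loop0 > "$state"

# 'block remove' role: take the lock FIRST, then read the entry it needs.
read_node() {
    flock "$lock" cat "$state"
}

# 'xen-hotplug-cleanup' role: takes the same lock before removing the entry.
cleanup() {
    flock "$lock" rm -f "$state"
}

node=$(read_node)    # serialized: cleanup cannot delete the entry mid-read
cleanup
echo "detach $node"  # the real script would now run: losetup -d "$node"
rm -rf "$work"
```

Because both sides wrap their critical section in the same lock from the very first operation, whichever script udev starts second simply blocks until the first finishes, instead of racing.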
Red Hat Enterprise Linux 5 shipped its last minor release, 5.11, on September 14th, 2014. On March 31st, 2017, RHEL 5 exited Production Phase 3 and entered the Extended Life Phase. For RHEL releases in the Extended Life Phase, Red Hat will provide limited ongoing technical support. No bug fixes, security fixes, hardware enablement or root-cause analysis will be available during this phase, and support will be provided on existing installations only. If the customer purchases the Extended Life-cycle Support (ELS), certain critical-impact security fixes and selected urgent-priority bug fixes for the last minor release will be provided. For more details please consult the Red Hat Enterprise Linux Life Cycle page: https://access.redhat.com/support/policy/updates/errata

This BZ does not appear to meet ELS criteria, so it is being closed WONTFIX. If this BZ is critical for your environment and you have an Extended Life-cycle Support Add-on entitlement, please open a case in the Red Hat Customer Portal (https://access.redhat.com), provide a thorough business justification, and ask that the BZ be re-opened for consideration of an errata. Please note, only certain critical-impact security fixes and selected urgent-priority bug fixes for the last minor release can be considered. |
Description of problem:
When destroying a guest which has a loopback device, xen first uses the command `xenstore_read "$XENBUS_PATH/node"` (file /etc/xen/scripts/block, line 307) to get the name of the loopback device, then uses `losetup -d "$node"` to release the loopback device, and then executes the script /etc/xen/scripts/xen-hotplug-cleanup (lines 38 and 39) to remove the block device backend store entry. But sometimes xen removes the backend store entry first; this causes xen to fail to get the loopback device from xenstore, because the entry has already been removed. Thus the loopback device fails to be released.

When I tried to find out what calls the script /etc/xen/scripts/xen-hotplug-cleanup, I found the following udev rules in /etc/udev/rules.d/xen-backend.rules:

    # cat /etc/udev/rules.d/xen-backend.rules
    ...
    SUBSYSTEM=="xen-backend", KERNEL=="vbd*", RUN+="/etc/xen/scripts/block $env{ACTION}"
    ...
    SUBSYSTEM=="xen-backend", ACTION=="remove", RUN+="/etc/xen/scripts/xen-hotplug-cleanup"

I checked the version of udev and found it was updated from udev-095-14.29.el5 to udev-095-14.31.el5 in RHEL5.11-Server-20140508.0. I uninstalled udev-095-14.31.el5 and installed udev-095-14.29.el5 on the same machine, redid the test, and this time all the loopback devices were released.

Version-Release number of selected component (if applicable):
rhel5.11 host: kernel-xen-2.6.18-389.el5
rhel7.0 guest: kernel-3.10.0-115.el7
udev-095-14.31.el5
xen-3.0.3-144.el5

How reproducible:
70%

Steps to Reproduce:
1. Create a test script:

    # cat test.sh
    #!/bin/bash
    i=0
    while :
    do
        xm create xm-test.conf >/dev/null
        sleep 5
        xm destroy vm1
        echo $i
        ((i++))
    done

2. Create a vm config file named xm-test.conf:

    # cat xm-test.conf
    name = "vm1"
    uuid = "1efb30c3-86fd-9dd7-4934-9b72b6a833fc"
    builder = "hvm"
    kernel = "/usr/lib/xen/boot/hvmloader"
    memory = 1024
    maxmem = 1024
    vcpus = 4
    on_poweroff = "destroy"
    on_reboot = "restart"
    on_crash = "preserve"
    pae = 1
    acpi = 1
    apic = 1
    localtime = 0
    device_model = "/usr/lib64/xen/bin/qemu-dm"
    disk = ['file:/root/xen.autotest.new/client/tests/xen/images/RHEL-Server-7.0-64-hvm.raw,hda,w']
    vif = ['mac=00:19:30:1F:68:16,script=vif-bridge,bridge=xenbr0,model=e1000,type=ioemu']
    vnc = 1
    vncunused = 1
    vnclisten = "0.0.0.0"
    sdl = 0
    serial = "pty"
    soundhw = "es1370"

3. Run the script and use `losetup -a` to check the loopback devices.

Actual results:
After a number of create-destroy operations (maybe 10), the following error message appears on your tty:

    Error: Device 768 (vbd) could not be connected. /etc/xen/scripts/block failed; error detected.

And `losetup -a` shows that all the loopback devices are in use:

    # losetup -a
    /dev/loop0: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop1: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop2: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop3: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop4: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop5: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop6: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)
    /dev/loop7: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)

Expected results:
No more than one loopback device is listed when you run `losetup -a`, and no `xm create xm-test.conf` fails with the error message:

    Failed to create domain: 'Using config file "/tmp/xm-test.conf".
    Error: Device 768 (vbd) could not be connected.
    /etc/xen/scripts/block failed; error detected. '

Additional info:
When xen fails to release the loopback device after `xm destroy $DomU_ID`, you will find the following message in the log file /var/log/xen/xen-hotplug.log:

    xenstore-read: couldn't read path backend/vbd/6/768/node
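As a quick check for the leak, the `losetup -a` output shown under "Actual results" can be filtered for devices still bound to the guest image. This is a pure text-processing sketch: `leaked_loops` is a hypothetical helper (not part of any shipped script), and actually detaching the devices would still require running `losetup -d` as root.

```shell
#!/bin/sh
# Hypothetical helper: print loop devices whose backing file matches $1,
# reading 'losetup -a'-formatted lines on stdin.
leaked_loops() {
    awk -F: -v img="$1" 'index($0, img) { print $1 }'
}

# Sample lines in the format shown under "Actual results" above.
printf '%s\n' \
  '/dev/loop0: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)' \
  '/dev/loop1: [fd00]:4587587 (/var/run/xen-autotest/images/RHEL-Server-7.0-64-hvm.raw)' \
| leaked_loops RHEL-Server-7.0-64-hvm.raw
# prints /dev/loop0 and /dev/loop1; each could then be freed with losetup -d
```

In practice one would pipe the real command into the helper, e.g. `losetup -a | leaked_loops RHEL-Server-7.0-64-hvm.raw`.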