Bug 456926
Summary: | virt-install with 8GB of ram aborts after it creates the storage file. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Barry Donahue <bdonahue> |
Component: | xen | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 5.2 | CC: | berrange, clalance, crobinso, cward, jburke, jpirko, pveiga, riel, syeghiay, tao, xen-maint |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | xen-3.0.3-86.el5 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2009-09-02 10:07:20 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 477162 | ||
Attachments: |
Description
Barry Donahue
2008-07-28 17:28:42 UTC
Can you try the virt-install command without the --nonsparse and see if you get an error? If so, can you try with --nodisks and remove the -f piece? Also please attach /var/log/xen/xend.log and /var/log/xen/xen-hotplug.log I deleted the --nonsparse option and the system is installing now. It's past the point where it failed before. Okay. Can you try and reproduce the original error again with the same virt-install command, and post the above mentioned logs? (/var/log/xen/xend.log and /var/log/xen/xen-hotplug.log) Thanks. Created attachment 312866 [details]
nex daemon log
Created attachment 312868 [details]
hotplug log
*** Bug 465866 has been marked as a duplicate of this bug. *** Just want to mention, the duped bug has a link to more logs, so anyone checking this out can find extra info there as well. What will likely happening is that creating the non-sparse disk is filling up the host OS with cached I/O. Then starting the guest requires significant work by Dom0 to flush this data out to disk, and the time/overhead of this is probably causing hotplug scripts to timeout Now they really shouldn't timeout, so we need to figure out why. As a first step you need to collect info from the hotplug scripts by add *.* /var/log/debug.log to syslog.conf, and restart syslog. This should log anything bad that the hotplug scripts do. Created attachment 320467 [details]
Relevant piece of /var/log/debug.log using unaltered virt-install
Command exits with:
[Wed, 15 Oct 2008 13:52:09 virt-install 8171] ERROR (virt-install:506) virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Traceback (most recent call last):
File "/usr/sbin/virt-install", line 502, in ?
main()
File "/usr/sbin/virt-install", line 462, in main
dom = guest.start_install(conscb,progresscb)
File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 818, in start_install
return self._do_install(consolecb, meter)
File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 839, in _do_install
self.domain = self.conn.createLinux(install_xml, 0)
File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Created attachment 320469 [details]
/var/log/debug.log using O_SYNC
I changed virtinst to open the file with O_SYNC to see if it would help. Much slower, but system isn't sluggish afterwards. The command still fails, but with a different error:
[Wed, 15 Oct 2008 13:24:56 virt-install 7768] ERROR (virt-install:506) virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 51712 (tap) could not be connected. xenstore-read backend/tap/2/51712/params failed.')
Traceback (most recent call last):
File "/usr/sbin/virt-install", line 502, in ?
main()
File "/usr/sbin/virt-install", line 462, in main
dom = guest.start_install(conscb,progresscb)
File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 818, in start_install
return self._do_install(consolecb, meter)
File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 839, in _do_install
self.domain = self.conn.createLinux(install_xml, 0)
File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 51712 (tap) could not be connected. xenstore-read backend/tap/2/51712/params failed.')
Created attachment 320470 [details]
/var/log/debug.log using O_SYNC (2)
Wrong attachment the first time.
We are still running into this issue in RHTS. [2009-02-02 11:34:29 xend 4996] DEBUG (DevController:154) Waiting for devices vif. [2009-02-02 11:34:29 xend 4996] DEBUG (DevController:160) Waiting for 0. [2009-02-02 11:34:29 xend 4996] DEBUG (DevController:496) hotplugStatusCallback /local/domain/0/backend/vif/1/0/hotplug-status. [2009-02-02 11:36:09 xend 4996] ERROR (SrvBase:88) Request wait_for_devices failed. Traceback (most recent call last): File "/usr/lib64/python2.4/site-packages/xen/web/SrvBase.py", line 85, in perform return op_method(op, req) File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDomain.py", line 72, in op_wait_for_devices return self.dom.waitForDevices() File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1831, in waitForDevices self.waitForDevices_(c) File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1206, in waitForDevices_ return self.getDeviceController(deviceClass).waitForDevices() File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 156, in waitForDevices return map(self.waitForDevice, self.deviceIDs()) File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 166, in waitForDevice raise VmError("Device %s (%s) could not be connected. " VmError: Device 0 (vif) could not be connected. Hotplug scripts not working. Seems to reproduce fairly often with sun-x4600m2-01.rhts.bos.redhat.com Creating virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Domain installation may not have been successful. If it was, you can restart your domain by running 'virsh start guest-80-207.rhts.bos.redhat.com'; otherwise, please restart your installation. Sat, 07 Feb 2009 01:33:49 ERROR virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Traceback (most recent call last): File "/usr/sbin/virt-install", line 560, in ? main() File "/usr/sbin/virt-install", line 492, in main dom = guest.start_install(conscb, progresscb, wait=(not wait)) File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 822, in start_install return self._do_install(consolecb, meter, wait) File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 843, in _do_install self.domain = self.conn.createLinux(install_xml, 0) File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self) libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Install error: Creating | 14 GB *** Bug 462917 has been marked as a duplicate of this bug. *** *** Bug 408901 has been marked as a duplicate of this bug. *** I've been able to reproduce this on my machine although I had to run dd in parallel to really make xend timeout on waiting for devices. So this would confirm Daniel's thoughts in comment #8. The reason why xend timeouts is that hotplug scripts fail to start and finish in time. Actually nothing seems to be wrong with the scripts themselves, it's just that the kernel fails to read all required scripts and binaries from disk in time because it is so busy with writing. While we could increase the timeouts, I think this should be addressed in kernel. Writing to disk shouldn't really have such a huge impact on reads. Any other opinions? Created attachment 340523 [details]
Patch to make hotplug timeout configurable
After applying this patch, (device-create-timeout SECONDS) configuration option can be used in /etc/xen/xend-config.sxp to change hotplug timeout.
This failed with the latest installable RHEL5.4 nightly (RHEL5.4-Server-20090412.nightly) Link to the log: http://rhts.redhat.com/testlogs/56948/190903/1593982/TESTOUT.log Note: About the log above. After a couple of days the above link gets zipped/compressed for space saving. You may need to add a .gz on the end Snippet virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Domain installation may not have been successful. If it was, you can restart your domain by running 'virsh start guest-83-224.rhts.bos.redhat.com'; otherwise, please restart your installation. Fri, 01 May 2009 15:06:00 ERROR virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Traceback (most recent call last): File "/usr/sbin/virt-install", line 560, in ? main() File "/usr/sbin/virt-install", line 492, in main dom = guest.start_install(conscb, progresscb, wait=(not wait)) File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 822, in start_install return self._do_install(consolecb, meter, wait) File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 843, in _do_install self.domain = self.conn.createLinux(install_xml, 0) File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self) libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.') Jeff, Could you download test packages from http://people.redhat.com/jdenemar/xen/ and try your test again with the following line appended to /etc/xen/xend-config.sxp: (device-create-timeout 600) That would increase hotplug timeout to 10 minutes (default is 100 seconds). Thanks I was finally able to get to sun-x4600m2-01.rhts.bos.redhat.com which causes these troubles and successfully reproduce this bug with the following command: virt-install --name ble --mac 00:16:3E:53:E0:55 --location nfs:bigpapi.bos.redhat.com:/vol/engineering/redhat/nightly/RHEL5.4-Server-20090412.nightly/5/x86_64/os --ram=1024 --vcpus=1 --file-size=20 --paravirt --nonsparse --extra-args ks=http://lab.rhts.bos.redhat.com/kickstarts/hosts/guest-83-224.rhts.bos.redhat.com/ks.cfg --noreboot -f /var/lib/xen/images/ble.img Hotplug scripts take about 4 minutes to finish. No need to run any I/O in parallel. The reason why it takes so long is that the xen hotplug machinery gets blocked completely until all caches are synced. The following is what happens: 1 virt-install creates 20GB files, most of which stays in caches 2 xend is instructed to create a domain 3 hotplug scripts are started for domain's devices 4 hotplug scripts need to log something and start logger for that purpose 5 logger sends the message through a socket to syslogd 6 syslogd writes it to a file and calls fsync() 7 the whole hotplug machinery is blocked until fsync() in syslogd finishes If we remove all the 'logger' calls from the hotplug scripts, does hotplug work at a more reasonable speed ? These log messages are simply thrown away on a standard RHEL-5 install, unless you have manually reconfigured syslog to actually record them. So if removing or disabling logger by default fixes it, that'd be simpler than fixing fsync()+ext3. If we remove logger calls from hotplug scripts, hotplug works fine with no timeouts. However I don't really like the idea of removing them completely as they may be very useful. Anyway the scenario turned out to be a bit more complicated than I thought when writing c#32... syslogd doesn't call fsync() on behalf of log messages sent by hogtplug scripts, as they are dropped in default configuration. fsync() is called because another process logs a message which is written to /var/log/messages. Log messages from hotplug scripts are blocked until fsync() finishes, than they are dropped or written to a file depending on syslog configuration. How about adding a sysconfig file for the hotplug scripts... $ cat /etc/sysconfig/xenhotplug # # Uncomment to enable syslog of hotplug problems, and # adjust /etc/syslog.conf to record them #SYSLOG=yes And then the Xen hotplug-command.sh can source this file, and default to SYSLOG=no. All calls to 'logger' would be protected by test "$SYSLOG" = "yes" && logger "some message" Created attachment 344923 [details]
Patch 1/3: Add xenhotplug.sysconfig and change xen.spec
This part is a patch for xen.spec to install new /etc/sysconfig/xenhotplug file.
Created attachment 344924 [details]
Patch 2/3: Do not log anything by default in hotplug scripts
This part is a patch for hotplug scripts to log messages only when SYSLOG=yes
in /etc/sysconfig/xenhotplug.
Created attachment 344926 [details]
Patch 3/3: Make device create timeout configurable
This is a backport of upstream staging c/s 19617 to make hotplug timeout
configurable.
There are machines in RHTS where guest creation would fail with hotplug timeout when virt-install is told to create a nonsparse file. The reason is a crappy implementation of fsync(). The following is what happens: - virt-install creates a non-sparse file, most of which stays in cache - xend is asked to create a domain using that file - hotplug scripts are started for domain's devices - hotplug scripts need to log something and start logger for that purpose - in the meantime: - another process logs something through syslogd - syslogd writes the log messages to a log file and calls fsync() on it - fsync() blocks until all caches are flushed to a disk - logger sends the message through a socket to syslogd - as syslogd is blocked on fsync(), logger is blocked as well - the whole hotplug machinery is blocked until fsync() in syslogd finishes And just to be sure something is really logged during the hotplug process, some part of device backend for the image file always logs a message :-) This series fixes this in the following way: (part 1) a new configuration file /etc/sysconfig/xenhotplug is created (part 2) log function in hotplug scripts is changed to no-op if SYSLOG != yes in /etc/sysconfig/xenhotplug; SYSLOG is "no" by default (part 3) hotplug timeout is made configurable in /etc/xen/xend-config.sxp Log messages from hotplug scripts are logged with debug priority and are dropped by default so not sending the messages to syslogd at all does not change anything for normal users. Users who need log messages from hotplug scripts have to uncomment SYSLOG=yes line in /etc/sysconfig/xenhotplug. They are also advised to increase hotplug timeout if they start seeing timeouts during domain creation. Fix built into xen-3.0.3-86.el5 ~~ Attention - RHEL 5.4 Beta Released! ~~ RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner! If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity. Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative. An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on therefore solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2009-1328.html |