Bug 456926

Summary: virt-install with 8GB of ram aborts after it creates the storage file.
Product: Red Hat Enterprise Linux 5 Reporter: Barry Donahue <bdonahue>
Component: xenAssignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: medium    
Version: 5.2CC: berrange, clalance, crobinso, cward, jburke, jpirko, pveiga, riel, syeghiay, tao, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: xen-3.0.3-86.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2009-09-02 10:07:20 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 477162    
Attachments:
Description Flags
nex daemon log
none
hotplug log
none
Relevant piece of /var/log/debug.log using unaltered virt-install
none
/var/log/debug.log using O_SYNC
none
/var/log/debug.log using O_SYNC (2)
none
Patch to make hotplug timeout configurable
none
Patch 1/3: Add xenhotplug.sysconfig and change xen.spec
none
Patch 2/3: Do not log anything by default in hotplug scripts
none
Patch 3/3: Make device create timeout configurable none

Description Barry Donahue 2008-07-28 17:28:42 UTC
Description of problem:I'm trying to  create an 8GB guest.

 virt-install -v -n w2k3_large -r 8192 --vcpus=4 -f
/var/lib/xen/images/w2k3_large.img -c /var/lib/xen/images/Win2003R2-64-disk1.iso
-s 16 --nonsparse

Version-Release number of selected component (if applicable):
kernel: 2.6.18-92.1.4.el5xen

How reproducible: Every time.


Steps to Reproduce:
1.Just run the enclosed virt-install command
2.
3.
  
Actual results:
 virt-install -v -n w2k3_large -r 8192 --vcpus=4 -f
/var/lib/xen/images/w2k3_large.img -c /var/lib/xen/images/Win2003R2-64-disk1.iso
-s 16 --nonsparse
Would you like to enable graphics support? (yes or no) yes
 

Starting install...
Creating storage file...  100% |=========================|  16 GB    05:35     
virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif)
could not be connected. Hotplug scripts not working.')
Domain installation may not have been
 successful.  If it was, you can restart your domain
 by running 'virsh start w2k3_large'; otherwise, please
 restart your installation.
Thu, 19 Jun 2008 20:39:51 ERROR    virDomainCreateLinux() failed POST operation
failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not
working.')
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 502, in ?
    main()
  File "/usr/sbin/virt-install", line 462, in main
    dom = guest.start_install(conscb,progresscb)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 813, in
start_install
    return self._do_install(consolecb, meter)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 834, in
_do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err
'Device 0 (vif) could not be connected. Hotplug scripts not working.')



Expected results: It should install the guest.


Additional info:

Comment 1 Cole Robinson 2008-07-28 18:03:27 UTC
Can you try the virt-install command without the --nonsparse and see if you get
an error? If so, can you try with --nodisks and remove the -f piece?

Also please attach /var/log/xen/xend.log and /var/log/xen/xen-hotplug.log

Comment 2 Barry Donahue 2008-07-28 19:06:27 UTC
I deleted the --nonsparse option and the system is installing now. It's past the
point where it failed before.

Comment 3 Cole Robinson 2008-07-28 19:32:43 UTC
Okay. Can you try and reproduce the original error again with the same
virt-install command, and post the above mentioned logs? (/var/log/xen/xend.log
and /var/log/xen/xen-hotplug.log) Thanks.

Comment 4 Barry Donahue 2008-07-29 13:05:35 UTC
Created attachment 312866 [details]
nex daemon log

Comment 5 Barry Donahue 2008-07-29 13:08:12 UTC
Created attachment 312868 [details]
hotplug log

Comment 6 Chris Lalancette 2008-10-09 14:08:12 UTC
*** Bug 465866 has been marked as a duplicate of this bug. ***

Comment 7 Cole Robinson 2008-10-09 14:52:31 UTC
Just want to mention, the duped bug has a link to more logs, so anyone checking this out can find extra info there as well.

Comment 8 Daniel Berrangé 2008-10-09 14:58:52 UTC
What will likely happening is that creating the non-sparse disk is filling up the host OS with cached I/O. Then starting the guest requires significant work by Dom0 to flush this data out to disk, and the time/overhead of this is probably causing hotplug scripts to timeout

Now they really shouldn't timeout, so we need to figure out why. As a first step you need to collect info from the hotplug scripts by add

  *.*  /var/log/debug.log  


to syslog.conf, and restart syslog. This should log anything bad that the hotplug scripts  do.

Comment 10 Cole Robinson 2008-10-15 18:05:20 UTC
Created attachment 320467 [details]
Relevant piece of /var/log/debug.log using unaltered virt-install

Command exits with:

[Wed, 15 Oct 2008 13:52:09 virt-install 8171] ERROR (virt-install:506) virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 502, in ?
    main()
  File "/usr/sbin/virt-install", line 462, in main
    dom = guest.start_install(conscb,progresscb)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 818, in start_install
    return self._do_install(consolecb, meter)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 839, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')

Comment 11 Cole Robinson 2008-10-15 18:07:02 UTC
Created attachment 320469 [details]
/var/log/debug.log using O_SYNC

I changed virtinst to open the file with O_SYNC to see if it would help. Much slower, but system isn't sluggish afterwards. The command still fails, but with a different error:

[Wed, 15 Oct 2008 13:24:56 virt-install 7768] ERROR (virt-install:506) virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 51712 (tap) could not be connected. xenstore-read backend/tap/2/51712/params failed.')
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 502, in ?
    main()
  File "/usr/sbin/virt-install", line 462, in main
    dom = guest.start_install(conscb,progresscb)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 818, in start_install
    return self._do_install(consolecb, meter)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 839, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 51712 (tap) could not be connected. xenstore-read backend/tap/2/51712/params failed.')

Comment 12 Cole Robinson 2008-10-15 18:10:39 UTC
Created attachment 320470 [details]
/var/log/debug.log using O_SYNC (2)

Wrong attachment the first time.

Comment 16 Jeff Burke 2009-02-02 18:39:53 UTC
We are still running into this issue in RHTS.

[2009-02-02 11:34:29 xend 4996] DEBUG (DevController:154) Waiting for devices vif.
[2009-02-02 11:34:29 xend 4996] DEBUG (DevController:160) Waiting for 0.
[2009-02-02 11:34:29 xend 4996] DEBUG (DevController:496) hotplugStatusCallback /local/domain/0/backend/vif/1/0/hotplug-status.
[2009-02-02 11:36:09 xend 4996] ERROR (SrvBase:88) Request wait_for_devices failed.
Traceback (most recent call last):
  File "/usr/lib64/python2.4/site-packages/xen/web/SrvBase.py", line 85, in perform
    return op_method(op, req)
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/SrvDomain.py", line 72, in op_wait_for_devices
    return self.dom.waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1831, in waitForDevices
    self.waitForDevices_(c)
  File "/usr/lib64/python2.4/site-packages/xen/xend/XendDomainInfo.py", line 1206, in waitForDevices_
    return self.getDeviceController(deviceClass).waitForDevices()
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 156, in waitForDevices
    return map(self.waitForDevice, self.deviceIDs())
  File "/usr/lib64/python2.4/site-packages/xen/xend/server/DevController.py", line 166, in waitForDevice
    raise VmError("Device %s (%s) could not be connected. "
VmError: Device 0 (vif) could not be connected. Hotplug scripts not working.

Comment 18 Jeff Burke 2009-02-09 15:07:40 UTC
Seems to reproduce fairly often with sun-x4600m2-01.rhts.bos.redhat.com

Creating
virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Domain installation may not have been
 successful.  If it was, you can restart your domain
 by running 'virsh start guest-80-207.rhts.bos.redhat.com'; otherwise, please
 restart your installation.
Sat, 07 Feb 2009 01:33:49 ERROR    virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 560, in ?
    main()
  File "/usr/sbin/virt-install", line 492, in main
    dom = guest.start_install(conscb, progresscb, wait=(not wait))
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 822, in start_install
    return self._do_install(consolecb, meter, wait)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 843, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Install error: 
Creating |  14 GB

Comment 19 Cole Robinson 2009-02-13 21:01:16 UTC
*** Bug 462917 has been marked as a duplicate of this bug. ***

Comment 20 Bill Burns 2009-02-27 19:09:16 UTC
*** Bug 408901 has been marked as a duplicate of this bug. ***

Comment 21 Jiri Denemark 2009-04-02 09:29:07 UTC
I've been able to reproduce this on my machine although I had to run dd in parallel to really make xend timeout on waiting for devices. So this would confirm Daniel's thoughts in comment #8. The reason why xend timeouts is that hotplug scripts fail to start and finish in time. Actually nothing seems to be wrong with the scripts themselves, it's just that the kernel fails to read all required scripts and binaries from disk in time because it is so busy with writing.

While we could increase the timeouts, I think this should be addressed in kernel. Writing to disk shouldn't really have such a huge impact on reads. Any other opinions?

Comment 28 Jiri Denemark 2009-04-21 13:32:31 UTC
Created attachment 340523 [details]
Patch to make hotplug timeout configurable

After applying this patch, (device-create-timeout SECONDS) configuration option can be used in /etc/xen/xend-config.sxp to change hotplug timeout.

Comment 29 Jeff Burke 2009-05-01 19:47:30 UTC
This failed with the latest installable RHEL5.4 nightly (RHEL5.4-Server-20090412.nightly)

Link to the log:
http://rhts.redhat.com/testlogs/56948/190903/1593982/TESTOUT.log
Note: About the log above. After a couple of days the above link gets zipped/compressed for space saving. You may need to add a .gz on the end

Snippet
virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Domain installation may not have been
 successful.  If it was, you can restart your domain
 by running 'virsh start guest-83-224.rhts.bos.redhat.com'; otherwise, please
 restart your installation.
Fri, 01 May 2009 15:06:00 ERROR    virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 560, in ?
    main()
  File "/usr/sbin/virt-install", line 492, in main
    dom = guest.start_install(conscb, progresscb, wait=(not wait))
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 822, in start_install
    return self._do_install(consolecb, meter, wait)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 843, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 573, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: virDomainCreateLinux() failed POST operation failed: (xend.err 'Device 0 (vif) could not be connected. Hotplug scripts not working.')

Comment 30 Jiri Denemark 2009-05-05 09:39:22 UTC
Jeff,

Could you download test packages from http://people.redhat.com/jdenemar/xen/ and try your test again with the following line appended to /etc/xen/xend-config.sxp:

(device-create-timeout 600)

That would increase hotplug timeout to 10 minutes (default is 100 seconds).

Thanks

Comment 32 Jiri Denemark 2009-05-15 13:53:44 UTC
I was finally able to get to sun-x4600m2-01.rhts.bos.redhat.com which causes these troubles and successfully reproduce this bug with the following command:

virt-install --name ble --mac 00:16:3E:53:E0:55 --location nfs:bigpapi.bos.redhat.com:/vol/engineering/redhat/nightly/RHEL5.4-Server-20090412.nightly/5/x86_64/os --ram=1024 --vcpus=1 --file-size=20 --paravirt --nonsparse --extra-args ks=http://lab.rhts.bos.redhat.com/kickstarts/hosts/guest-83-224.rhts.bos.redhat.com/ks.cfg --noreboot -f /var/lib/xen/images/ble.img

Hotplug scripts take about 4 minutes to finish. No need to run any I/O in parallel.

The reason why it takes so long is that the xen hotplug machinery gets blocked completely until all caches are synced. The following is what happens:

1  virt-install creates 20GB files, most of which stays in caches
2  xend is instructed to create a domain
3  hotplug scripts are started for domain's devices
4  hotplug scripts need to log something and start logger for that purpose
5  logger sends the message through a socket to syslogd
6  syslogd writes it to a file and calls fsync()
7  the whole hotplug machinery is blocked until fsync() in syslogd finishes

Comment 33 Daniel Berrangé 2009-05-15 14:07:10 UTC
If we remove all the 'logger' calls from the hotplug scripts, does hotplug work at a more reasonable speed ?  These log messages are simply thrown away on a standard RHEL-5 install, unless you have manually reconfigured syslog to actually record them. So if removing or disabling logger by default fixes it, that'd be simpler than fixing fsync()+ext3.

Comment 34 Jiri Denemark 2009-05-18 09:30:20 UTC
If we remove logger calls from hotplug scripts, hotplug works fine with no timeouts. However I don't really like the idea of removing them completely as they may be very useful.

Anyway the scenario turned out to be a bit more complicated than I thought when writing c#32... syslogd doesn't call fsync() on behalf of log messages sent by hogtplug scripts, as they are dropped in default configuration. fsync() is called because another process logs a message which is written to /var/log/messages. Log messages from hotplug scripts are blocked until fsync() finishes, than they are dropped or written to a file depending on syslog configuration.

Comment 35 Daniel Berrangé 2009-05-18 09:44:08 UTC
How about adding a sysconfig file for the hotplug scripts...


  $ cat /etc/sysconfig/xenhotplug
  #
  # Uncomment to enable syslog of hotplug problems, and 
  # adjust /etc/syslog.conf to record them
  #SYSLOG=yes


And then the Xen  hotplug-command.sh can source this file, and default to SYSLOG=no. All calls to 'logger' would be protected by  test "$SYSLOG" = "yes" && logger "some message"

Comment 39 Jiri Denemark 2009-05-21 07:58:04 UTC
Created attachment 344923 [details]
Patch 1/3: Add xenhotplug.sysconfig and change xen.spec

This part is a patch for xen.spec to install new /etc/sysconfig/xenhotplug file.

Comment 40 Jiri Denemark 2009-05-21 08:00:01 UTC
Created attachment 344924 [details]
Patch 2/3: Do not log anything by default in hotplug scripts

This part is a patch for hotplug scripts to log messages only when SYSLOG=yes
in /etc/sysconfig/xenhotplug.

Comment 41 Jiri Denemark 2009-05-21 08:00:59 UTC
Created attachment 344926 [details]
Patch 3/3: Make device create timeout configurable

This is a backport of upstream staging c/s 19617 to make hotplug timeout
configurable.

Comment 42 Jiri Denemark 2009-05-21 08:03:08 UTC
There are machines in RHTS where guest creation would fail with hotplug
timeout when virt-install is told to create a nonsparse file. The reason is a
crappy implementation of fsync(). The following is what happens:

- virt-install creates a non-sparse file, most of which stays in cache
- xend is asked to create a domain using that file
- hotplug scripts are started for domain's devices
- hotplug scripts need to log something and start logger for that purpose
- in the meantime:
    - another process logs something through syslogd
    - syslogd writes the log messages to a log file and calls fsync() on it
    - fsync() blocks until all caches are flushed to a disk
- logger sends the message through a socket to syslogd
- as syslogd is blocked on fsync(), logger is blocked as well
- the whole hotplug machinery is blocked until fsync() in syslogd finishes

And just to be sure something is really logged during the hotplug process,
some part of device backend for the image file always logs a message :-)


This series fixes this in the following way:

(part 1)    a new configuration file /etc/sysconfig/xenhotplug is created
(part 2)    log function in hotplug scripts is changed to no-op if
            SYSLOG != yes in /etc/sysconfig/xenhotplug;
            SYSLOG is "no" by default
(part 3)    hotplug timeout is made configurable in /etc/xen/xend-config.sxp

Log messages from hotplug scripts are logged with debug priority and are
dropped by default so not sending the messages to syslogd at all does not
change anything for normal users. Users who need log messages from hotplug
scripts have to uncomment SYSLOG=yes line in /etc/sysconfig/xenhotplug. They
are also advised to increase hotplug timeout if they start seeing timeouts
during domain creation.

Comment 43 Jiri Denemark 2009-05-21 15:39:55 UTC
Fix built into xen-3.0.3-86.el5

Comment 45 Chris Ward 2009-07-03 18:05:28 UTC
~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results, and if available, update Verified field with the appropriate value.

Questions can be posted to this bug or your customer or partner representative.

Comment 48 errata-xmlrpc 2009-09-02 10:07:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on therefore solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-1328.html