Created attachment 386851 [details]
RHTS XML script for queuing this job. Note TESTDISTRO needs to be set.

Description of problem:
Multiple virt installs have failed in RHTS against the 20091227 and 20100117 RHEL 5.5 distros with the message below. The guest1_install log says:

Mon, 25 Jan 2010 14:21:28 ERROR    internal error unable to start guest: char device redirected to /dev/pts/1
char device redirected to /dev/pts/2

Domain installation may not have been successful.  If it was, you can restart your domain by running 'virsh start guest1'; otherwise, please restart your installation.
Mon, 25 Jan 2010 14:21:28 ERROR    internal error unable to start guest: char device redirected to /dev/pts/1
char device redirected to /dev/pts/2
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 889, in ?
    main()
  File "/usr/sbin/virt-install", line 751, in main
    start_time, guest.start_install)
  File "/usr/sbin/virt-install", line 813, in do_install
    dom = install_func(conscb, progresscb, wait=(not wait))
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 541, in start_install
    return self._do_install(consolecb, meter, removeOld, wait)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 633, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 974, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: internal error unable to start guest: char device redirected to /dev/pts/1
char device redirected to /dev/pts/2

A Libvirt error occurred during installation.

One of the RHTS jobs that failed is http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=120318 (the second recipe set).

Version-Release number of selected component (if applicable):
RHEL5.5-Server-20100117.0

How reproducible:
Almost always, though it sometimes seems to depend on the box; other times it works.

Steps to Reproduce:
1. Queue the attached RHTS XML-style job (attached for review and queuing).
Can I get someone to look into this? My ISV testing in virt environments runs into this 80% of the time. Since it works sometimes, a race condition is suspected. Thanks, Barry
Can we get the XML for the domain used? The attachment provided doesn't even indicate what arch or hypervisor is being used! Daniel
Take a look at RHTS job http://rhts.redhat.com/cgi-bin/rhts/recipes.cgi?id=328768, the second recipe set. The job is still hanging on because it has a built-in reserve_workflow for the host system. pw: rhts. Barry
Scratch that, wrong job (though this one died too). The right job is http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=121429 Barry
The test host is ibm-x3950m2-01.rhts.eng.bos.redhat.com. The virt-install command did not get far enough along to create a guest:

# virsh list --all
 Id Name                 State
----------------------------------

The virt-install command used was:

# virt-install --name guest1 --mac 00:16:3E:53:E4:2A --cdrom /var/lib/libvirt/images/guest1.iso --file /var/lib/libvirt/images/guest1.img -s 80 --nonsparse --hvm --debug --extra-args ks=http://lab2.rhts.eng.bos.redhat.com/cblr/svc/op/ks/system/guest-83-228.rhts.bos.redhat.com --prompt --accelerate --os-variant=virtio26 --network bridge:br0 --noreboot --vnc
I see the process running:

/usr/libexec/qemu-kvm -S -M pc -m 512 -smp 1 -name guest1 -uuid 07021de3-8e93-d39e-9710-53453564c384 -no-kvm-pit-reinjection -monitor pty -pidfile /var/run/libvirt/qemu//guest1.pid -no-reboot -boot d -drive file=/var/lib/libvirt/images/guest1.img,if=virtio,index=0,cache=none -drive file=/var/lib/libvirt/images/guest1.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=00:16:3e:53:e4:2a,vlan=0,model=virtio -net tap,fd=16,script=,vlan=0,ifname=vnet0 -serial pty -parallel none -usb -vnc 127.0.0.1:0 -k en-us

I'm attaching the XML I extracted for it. Daniel
This issue was also seen on system intel-s3ea2-02.rhts.eng.bos.redhat.com:

virt-install --name guest1 --mac 00:16:3E:53:4A:DE --cdrom /var/lib/libvirt/images/guest1.iso --file /var/lib/libvirt/images/guest1.img -s 180 --nonsparse --hvm --debug --extra-args ks=http://lab2.rhts.eng.bos.redhat.com/cblr/svc/op/ks/system/guest-83-74.rhts.bos.redhat.com --prompt --accelerate --os-variant=virtio26 --network bridge:br0 --noreboot --vnc

Thu, 28 Jan 2010 17:10:51 DEBUG    Received libvirt URI 'qemu:///system'
Thu, 28 Jan 2010 17:10:51 DEBUG    Requesting virt method 'hvm'
Thu, 28 Jan 2010 17:10:51 DEBUG    Received virt method 'hvm'
Thu, 28 Jan 2010 17:10:51 DEBUG    Hypervisor name is 'kvm'
How much RAM should be allocated (in megabytes)? 512
Thu, 28 Jan 2010 17:10:51 DEBUG    Setting os type to 'linux' for variant 'virtio26'
Thu, 28 Jan 2010 17:10:51 DEBUG    DistroInstaller location is a local file/path: /var/lib/libvirt/images/guest1.iso
Thu, 28 Jan 2010 17:10:51 DEBUG    Setting size for existing storage to '0.00978851318359'
Thu, 28 Jan 2010 17:10:51 DEBUG    Detected storage as type 'file'

Starting install...
Thu, 28 Jan 2010 17:10:51 DEBUG    Setting size for existing storage to '0.00978851318359'
Thu, 28 Jan 2010 17:10:51 DEBUG    Detected storage as type 'file'
Creating   |  142 MB
Creating   |  266 MB
Creating   |  390 MB
. . . .
Creating   |  180 GB
Creating
Thu, 28 Jan 2010 17:56:41 DEBUG    Creating guest from:
<domain type='kvm'>
  <name>guest1</name>
  <currentMemory>524288</currentMemory>
  <memory>524288</memory>
  <uuid>1cd3150e-3ed0-b967-2d4b-a00299bca836</uuid>
  <os>
    <type arch='x86_64'>hvm</type>
    <boot dev='cdrom'/>
  </os>
  <features>
    <acpi/><apic/><pae/>
  </features>
  <clock offset="utc"/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>destroy</on_reboot>
  <on_crash>destroy</on_crash>
  <vcpu>1</vcpu>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <console type='pty'/>
    <disk type='file' device='disk'>
      <driver name='qemu' cache='none'/>
      <source file='/var/lib/libvirt/images/guest1.img'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='file' device='cdrom'>
      <source file='/var/lib/libvirt/images/guest1.iso'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
    </disk>
    <interface type='bridge'>
      <source bridge='br0'/>
      <mac address='00:16:3E:53:4A:DE'/>
      <model type='virtio'/>
    </interface>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='-1' keymap='en-us'/>
  </devices>
</domain>

Thu, 28 Jan 2010 17:58:30 ERROR    internal error unable to start guest:
Domain installation may not have been successful.  If it was, you can restart your domain by running 'virsh start guest1'; otherwise, please restart your installation.
Thu, 28 Jan 2010 17:58:30 ERROR    internal error unable to start guest:
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 889, in ?
    main()
  File "/usr/sbin/virt-install", line 751, in main
    start_time, guest.start_install)
  File "/usr/sbin/virt-install", line 813, in do_install
    dom = install_func(conscb, progresscb, wait=(not wait))
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 541, in start_install
    return self._do_install(consolecb, meter, removeOld, wait)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 633, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 974, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: internal error unable to start guest:
A Libvirt error occurred during installation.
This issue is definitely compounded by creating the guest image using the --nonsparse option. I found I could reproduce the issue by kicking off a load prior to starting the virt-install command:

- First, make sure the image file from any prior run no longer exists:
  # rm -f /var/lib/libvirt/images/vm1.img

- Start a small load:
  # /bin/bash /etc/cron.daily/rpm

- Run the install command; my example is below:
  # virt-install --hvm --name vm1 --ram 1024 --vnc --cdrom http://qafiler.bos.redhat.com/redhat/rel-eng/RHEL5.5-Server-20100117.0/5/x86_64/os --file /var/lib/libvirt/images/vm1.img --file-size 12 --accelerate --os-variant=virtio26 --debug --nonsparse

On some runs I saw slightly different output from libvirt, but the symptom of qemu exiting or timing out remains. The alternate libvirt trace was:

Fri, 29 Jan 2010 14:58:58 DEBUG    Removing /var/lib/libvirt/boot/virtinst-boot.iso.5OdWNI
Fri, 29 Jan 2010 14:58:58 ERROR    internal error unable to start guest:
Domain installation may not have been successful.  If it was, you can restart your domain by running 'virsh start vm1'; otherwise, please restart your installation.
Fri, 29 Jan 2010 14:58:58 ERROR    internal error unable to start guest:
Traceback (most recent call last):
  File "/usr/sbin/virt-install", line 889, in ?
    main()
  File "/usr/sbin/virt-install", line 751, in main
    start_time, guest.start_install)
  File "/usr/sbin/virt-install", line 813, in do_install
    dom = install_func(conscb, progresscb, wait=(not wait))
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 541, in start_install
    return self._do_install(consolecb, meter, removeOld, wait)
  File "/usr/lib/python2.4/site-packages/virtinst/Guest.py", line 633, in _do_install
    self.domain = self.conn.createLinux(install_xml, 0)
  File "/usr/lib64/python2.4/site-packages/libvirt.py", line 974, in createLinux
    if ret is None:raise libvirtError('virDomainCreateLinux() failed', conn=self)
libvirtError: internal error unable to start guest:
Actually, I could only reproduce the issue with the '--nonsparse' switch; if I left this option off the virt-install line, the guest always started for me.
Yeah, it sounds like bug 456926, which we faced with xen. The problem there was that dom0 was so busy flushing out dirty cache pages that it was very slow at doing other things, especially reading from the disk, which significantly delayed the execution of new processes. However, this is not xen, which was waiting with a timeout for an asynchronous event triggered by udev. I hope qemu does not do something similarly stupid.
It might be a libvirt timeout in this case. There are several timeouts in the libvirt qemu driver's startup code: qemu could legitimately be taking too long due to the cache issues, and libvirt might be timing out.
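For illustration only: libvirt's qemu driver is written in C, and the sketch below is not its code (a later comment identifies the real function as qemudReadLogOutput(), with a hardcoded 3-second timeout). The failure pattern described here is a daemon tailing qemu's startup log and giving up if nothing appears before a fixed deadline; a hypothetical Python sketch of that pattern:

import os
import time

# Hypothetical sketch of a hardcoded-deadline log read; libvirt's real
# implementation is C code in the qemu driver, not this.
def read_startup_log(logpath, timeout=3.0, interval=0.1):
    deadline = time.time() + timeout
    fd = os.open(logpath, os.O_RDONLY)
    try:
        while time.time() < deadline:
            chunk = os.read(fd, 4096)
            if chunk:
                return chunk  # libvirt keeps parsing this for the monitor pty path
            time.sleep(interval)  # at EOF so far; poll again until the deadline
        raise RuntimeError("Timed out while reading console log output")
    finally:
        os.close(fd)

Under heavy dirty-page writeback, qemu can take longer than such a deadline just to produce its first output, which would match the intermittent failures above.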
If heavy dirty page writing is the issue, my questions are:

1. Is the creation of the image container being written with direct I/O? We all know that page cache writes on memory-starved systems can be quite slow, and we are asking for a 180 GB image.

2. Who sets the timeout values for the libvirt qemu driver?

Barry
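One hypothetical way to see the effect behind question 1 (paths and sizes here are arbitrary examples, not from this bug): time the same large write buffered versus with O_DSYNC. A buffered write returns quickly and leaves dirty pages for the kernel to flush later, which is what starves other I/O; O_DSYNC pays the cost at write time.

import os
import time

def timed_write(path, extra_flags, total=1 << 30, block=1 << 20):
    # Write `total` bytes of zeros in `block`-sized chunks; return elapsed seconds.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | extra_flags)
    zeros = b"\0" * block
    start = time.time()
    try:
        for _ in range(total // block):
            os.write(fd, zeros)
    finally:
        os.close(fd)
    return time.time() - start

print("buffered: %.1fs" % timed_write("/var/tmp/buffered.img", 0))
print("O_DSYNC:  %.1fs" % timed_write("/var/tmp/dsync.img", os.O_DSYNC))

The buffered run finishes fast but leaves a large writeback backlog behind; the O_DSYNC run takes roughly the disk's sequential write time but leaves the cache clean.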
Timeouts in the libvirt qemu driver are hardcoded, not manually configurable at this point (and really, they shouldn't be yet another knob that needs tweaking for things to work well!). Daniel
I was able to reproduce it locally using the virt-install command from comment #10. Sometimes I don't need to do anything special; sometimes running sync just before the image creation finishes helps to trigger the issue, and running dd at the end of image creation makes it reliably reproducible.

I haven't seen the error described by the reporter; I only hit the one described in comment #10. But both happen in qemudReadLogOutput() as a result of unusual behavior when libvirt tries to read qemu's output. In my case the root error in libvirt was

error : internal error Timed out while reading console log output

which happens in qemudReadLogOutput(), called from qemudWaitForMonitor() with a timeout of 3 seconds.

By modifying virt-install to open the image file with O_DSYNC, we can avoid this issue when virt-install is the only process doing heavy I/O. When, however, another process is writing a lot to a disk, I guess the only solution is increasing the timeout, although starting qemu can take an arbitrarily long time in such a situation. As installing a guest in parallel with an I/O-bound process is probably quite uncommon, we may be fine with just fixing virt-install.
Hum, that's complex... What flag is virt-install using now for opening the image file?

We should probably split this bug: clearly there isn't one magic change that would fix the issue forever, but incremental improvements to the two components may help get rid of it in most cases:

- fix virt-install's opening flags, if the semantics of O_DSYNC don't create any risk
- possibly change the libvirt timeout, either to a longer value or to something more accommodating of delays under load.

We have an update of python-virtinst for 5.5 (https://errata.devel.redhat.com/errata/show/9266), but it may be too late to push this. Daniel
virt-install creates the image with O_WRONLY | O_CREAT. I don't see how O_DSYNC could bring any risk.

Barry, could you try the python-virtinst package from https://brewweb.devel.redhat.com/taskinfo?taskID=2241466 ? I think it could solve the issue for you; it solved it for me.
Created attachment 388551 [details]
Open nonsparse image files with O_DSYNC

This is a patch for python-virtinst which should solve the most common cause of this bug. If it really fixes it, I'll move this bug to python-virtinst and clone it back for libvirt to address the other cause.
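For readers without access to the attachment, a minimal sketch of the approach, with hypothetical names (the actual patch to virtinst's storage code will differ in detail): preallocate the nonsparse image through a descriptor opened with O_DSYNC, so each block reaches disk before the next write instead of accumulating as dirty pages that the kernel flushes while qemu is trying to start.

import os

def create_nonsparse_image(path, size_bytes, block=1 << 20):
    # Hypothetical sketch, not the attached patch: preallocate an image
    # file, forcing each write to disk via O_DSYNC.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC)
    zeros = b"\0" * block
    try:
        written = 0
        while written < size_bytes:
            chunk = min(block, size_bytes - written)
            os.write(fd, zeros[:chunk])
            written += chunk
    finally:
        os.close(fd)

The trade-off is that image creation itself gets slower, since every write waits for the disk, but the host never builds up the writeback backlog that delays qemu's startup past libvirt's timeout.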
We should also make a similar change in the libvirt storage code, since virtinst/virt-manager will use that instead for many configurations.
First test with the bits in https://brewweb.devel.redhat.com/taskinfo?taskID=2241466 succeeded, for both i386 and x86_64. I want to run it on more I/O-challenged systems before I declare victory. Barry
I also hit this "internal error Timed out while reading console log output" in my stress tests: I just start/stop 4 VMs on one physical machine every 20 seconds. The bug occurred roughly once an hour. When I manually increased the wait time in libvirt from 3 seconds to 10 seconds, the bug seems to be gone.
I tested this fix on single-spindle host machines with both i386 and x86_64 guests. In both cases the install with the updated python-virtinst succeeded. Barry
Patch from comment #20 built into python-virtinst-0.400.3-8.el5. Thanks Jiri!
Verified, fixed with the following packages:
libvirt-0.6.3-33.el5
python-virtinst-0.400.3-9.el5
kvm-83-161.el5
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2010-0282.html