Description of problem:
Attempting to run virt-sparsify on a fresh RHEL 6.4 install.  After some time, the process seemed to freeze, and virt-manager showed the guestfs VM as paused.  I manually unpaused it, and it made a small amount of forward progress before pausing again.  Repeated this for some time, then the guestfs qemu crashed with attached core.

Version-Release number of selected component:
qemu-system-x86-1.4.2-4.fc19

Additional info:
reporter:         libreport-2.1.5
backtrace_rating: 4
cmdline:          /usr/bin/qemu-system-x86_64 -machine accel=kvm -name guestfs-radorwbctnktvu5t -S -machine pc-i440fx-1.4,accel=kvm,usb=off -m 500 -smp 1,sockets=1,cores=1,threads=1 -uuid 2a9b6621-437c-4e0f-ae68-68f92a0b6d7a -nographic -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/guestfs-radorwbctnktvu5t.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-reboot -no-shutdown -no-acpi -kernel /var/tmp/.guestfs-0/kernel.20473 -initrd /var/tmp/.guestfs-0/initrd.20473 -append 'panic=1 console=ttyS0 udevtimeout=600 no_timer_check acpi=off printk.time=1 cgroup_disable=memory root=/dev/sdb selinux=0 TERM=xterm-256color' -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x4 -drive file=/var/tmp/sparsify04fe47.qcow2,if=none,id=drive-scsi0-0-0-0,format=qcow2,cache=none -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1 -drive file=/var/tmp/libguestfs1OyZuh/snapshot1,if=none,id=drive-scsi0-0-1-0,format=qcow2,cache=unsafe -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0 -chardev socket,id=charserial0,path=/var/tmp/libguestfs1OyZuh/console.sock -device isa-serial,chardev=charserial0,id=serial0 -chardev socket,id=charchannel0,path=/var/tmp/libguestfs1OyZuh/guestfsd.sock -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.libguestfs.channel.0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
crash_function:   scsi_dma_restart_bh
executable:       /usr/bin/qemu-system-x86_64
kernel:           3.9.9-302.fc19.x86_64
runlevel:         N 5
uid:              107
var_log_messages: Aug 5 19:29:24 voom abrt[22277]: Saved core dump of pid 20503 (/usr/bin/qemu-system-x86_64) to /var/tmp/abrt/ccpp-2013-08-05-19:29:22-20503 (1260548096 bytes)
xsession_errors:

Truncated backtrace:
Thread no. 1 (8 frames)
 #4 scsi_dma_restart_bh at hw/scsi-bus.c:101
 #5 aio_bh_poll at async.c:69
 #6 aio_poll at aio-posix.c:148
 #7 aio_ctx_dispatch at async.c:166
 #10 glib_select_poll at main-loop.c:209
 #11 os_host_main_loop_wait at main-loop.c:236
 #12 main_loop_wait at main-loop.c:416
 #13 main_loop at vl.c:2001
Created attachment 782738 [details] File: backtrace
Created attachment 782739 [details] File: cgroup
Created attachment 782740 [details] File: core_backtrace
Created attachment 782741 [details] File: dso_list
Created attachment 782742 [details] File: environ
Created attachment 782743 [details] File: limits
Created attachment 782744 [details] File: maps
Created attachment 782745 [details] File: open_fds
Created attachment 782746 [details] File: proc_pid_status
(In reply to David Gibson from comment #0)
> Description of problem:
> Attempting to run virt-sparsify on a fresh RHEL 6.4 install.  After some
> time, the process seemed to freeze, and virt-manager showed the guestfs VM
> as paused.  I manually unpaused it, and it made a small amount of forward
> progress before pausing again.  Repeated this for some time, then the
> guestfs qemu crashed with attached core.

qemu should never crash, so there's a bug of some sort here.

However the original reason for virt-sparsify "pausing" is likely to be
because you ran out of disk space.  See:

https://bugzilla.redhat.com/show_bug.cgi?id=745576
http://libguestfs.org/virt-sparsify.1.html#environment-variables
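For reference, a minimal sketch of pointing virt-sparsify's temporary copy at a filesystem with more free space via TMPDIR, as described in the environment-variables link above.  The scratch directory and image names are placeholders, not the reporter's actual files:

# Hypothetical scratch directory on a filesystem large enough to hold
# the temporary copy that virt-sparsify writes during the run.
TMPDIR=/mnt/scratch virt-sparsify /path/to/indisk.img /path/to/outdisk.img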
I don't think it's out of disk space.  I hit the freeze earlier when the disk really was full, and changed TMPDIR to work around that.  When it froze up this time, none of my filesystems showed as close to full.  I also wouldn't have expected it to be able to make forward progress when manually unpaused if it were out of disk space; each time I unpaused it, it made several percent of extra progress.
It seems a bit of a mystery what could be causing qemu to pause.  Libguestfs never pauses the appliance.  As far as I know, libvirt won't pause a guest without being asked to.  The only thing I know of that could cause a guest to pause is ENOSPC, but that appears to be ruled out (comment 11).

Are there any errors in the libvirt log file
(~/.cache/libvirt/qemu/log/guestfs-radorwbctnktvu5t.log)?

Are there any errors (e.g. I/O errors) in dmesg?

It would be helpful if you could try to reproduce this, and enable full
debugging at the same time:

http://libguestfs.org/guestfs-faq.1.html#debugging-libguestfs

This will create a huge amount of output, which you should capture to a file:

virt-sparsify [...] |& tee /tmp/log
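A sketch of such a debug run, assuming the standard libguestfs debug and trace environment variables described in the FAQ link above; the disk image names are placeholders:

# Turn on libguestfs debugging and API tracing for this run, then capture
# everything (it will be very large) to /tmp/log.
export LIBGUESTFS_DEBUG=1
export LIBGUESTFS_TRACE=1
virt-sparsify indisk.img outdisk.img |& tee /tmp/log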
Ok, tried this again with the debugging enabled as suggested.  Again I got repeated pauses.  This time I didn't get a qemu crash, but after pausing/resuming quite a lot of times, it seemed to stop making forward progress.

At the time of the pause I ran these commands:

[root@voom base-images]# cp sparsify.log sparsify.log-pause1
[root@voom base-images]# du -h /var/tmp/sparsify127052.qcow2
87G     /var/tmp/sparsify127052.qcow2
[root@voom base-images]# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/dm-1              225G  165G   60G  74% /
devtmpfs               3.7G     0  3.7G   0% /dev
tmpfs                  3.7G  144K  3.7G   1% /dev/shm
tmpfs                  3.7G  968K  3.7G   1% /run
tmpfs                  3.7G     0  3.7G   0% /sys/fs/cgroup
tmpfs                  3.7G   98M  3.6G   3% /tmp
/dev/sda1              488M  126M  327M  28% /boot
/dev/dm-1              225G  165G   60G  74% /home
mcgurkus:/volume1/ISO  2.7T  1.2T  1.6T  44% /var/lib/libvirt/images/mcgurkus-ISO

When it seemed to stop making forward progress despite attempting to unpause, I ran these commands:

[root@voom base-images]# killall virt-sparsify
[root@voom base-images]# du -h /var/tmp/sparsify127052.qcow2
91G     /var/tmp/sparsify127052.qcow2
[root@voom base-images]# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/dm-1              225G  169G   56G  76% /
devtmpfs               3.7G     0  3.7G   0% /dev
tmpfs                  3.7G  144K  3.7G   1% /dev/shm
tmpfs                  3.7G  956K  3.7G   1% /run
tmpfs                  3.7G     0  3.7G   0% /sys/fs/cgroup
tmpfs                  3.7G   98M  3.6G   3% /tmp
/dev/sda1              488M  126M  327M  28% /boot
/dev/dm-1              225G  169G   56G  76% /home
mcgurkus:/volume1/ISO  2.7T  1.2T  1.6T  44% /var/lib/libvirt/images/mcgurkus-ISO
[root@voom base-images]# rm /var/tmp/sparsify127052.qcow2
rm: remove regular file ‘/var/tmp/sparsify127052.qcow2’? y
[root@voom base-images]# df -h
Filesystem             Size  Used Avail Use% Mounted on
/dev/dm-1              225G   74G  147G  34% /
devtmpfs               3.7G     0  3.7G   0% /dev
tmpfs                  3.7G  144K  3.7G   1% /dev/shm
tmpfs                  3.7G  956K  3.7G   1% /run
tmpfs                  3.7G     0  3.7G   0% /sys/fs/cgroup
tmpfs                  3.7G   98M  3.6G   3% /tmp
/dev/sda1              488M  126M  327M  28% /boot
/dev/dm-1              225G   74G  147G  34% /home
mcgurkus:/volume1/ISO  2.7T  1.2T  1.6T  44% /var/lib/libvirt/images/mcgurkus-ISO

It turns out that between the first pause and no longer making progress, nothing further was output in the debug trace, i.e. sparsify.log and sparsify.log-pause1 compared equal.

I'll attach the final sparsify.log and /var/log/libvirt/qemu/guestfs-i5ttt07coy5v8du0.log
Created attachment 786828 [details] sparsify debugging output
Created attachment 786829 [details] libvirt log
The pause or whatever it is happens when filling free space:

mount -o /dev/vg_rhel6u4serverx8664/lv_root /sysroot/
[ 1654.502299] EXT4-fs (dm-0): mounted filesystem with ordered data mode. Opts: (null)
guestfsd: main_loop: proc 1 (mount) took 0.13 seconds
libguestfs: trace: mount = 0
Fill free space in /dev/vg_rhel6u4serverx8664/lv_root with zero ...
libguestfs: trace: zero_free_space "/"
guestfsd: main_loop: new request, len 0x30
random filename: /sysroot//ufy633xl.qwa

I'm no closer to understanding why this happens.  It doesn't appear to be running out of disk space on the rootfs.  I wonder if there's some kind of maximum file size limit (on the qcow2 file, or on files in general, or a quota).

For these types of huge images, we will be able to do better in future once discard is fully supported.
(In reply to Richard W.M. Jones from comment #16)
> For these types of huge images, we will be able to do better
> in future once discard is fully supported.

To continue on this theme: you've got a 128 GB disk of which only about 3.5 GB is being used.  This is really the worst case for virt-sparsify, since it has to write a temporary file of around 124 GB (of zeroes!).  Although there is enough free space, for some reason -- perhaps a quota or a maximum file size somewhere -- it only managed 91 GB.

In-place sparsification is what's needed in this case:

http://libguestfs.org/guestfs-faq.1.html#why-doesnt-virt-sparsify-work-on-the-disk-image-in-place%2d

There is also a non-upstream qemu patch which makes writing files full of zeroes much more disk efficient:

https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03747.html
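As a side note for anyone reading later: more recent libguestfs releases (around 1.26 and later) expose in-place sparsification directly, which avoids writing the huge temporary file at all.  A sketch, assuming a new enough virt-sparsify and a stack with working discard support; disk.img is a placeholder:

# In-place mode rewrites the image without a temporary copy; it needs a
# virt-sparsify/qemu combination where discard is supported end to end.
virt-sparsify --in-place disk.img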
I realise it needs to write a huge intermediate file; I just can't see why it's failing to do so.  I've worked around it by using a smaller disk (since I realised I can give the guest more space by adding more disks later).  It's still weird, though.
I see this thread talking about a crash in aio_bh_poll_sync and trim:

https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg03267.html

Not sure if it's related.  Any thoughts, Paolo, Stefan?
(In reply to Cole Robinson from comment #19)
> I see this thread talking about a crash in aio_bh_poll_sync and trim:
>
> https://lists.gnu.org/archive/html/qemu-devel/2013-05/msg03267.html
>
> Not sure if it's related.  Any thoughts, Paolo, Stefan?

Cole: I'm not sure they are related.

Richard: The QEMU and qcow2 file size limits are around 2^63 bytes, so this case should not hit the limit.

David: What is the output of "virsh dominfo <domain>" and "virsh domblkerror <domain>" when the guest has paused for the first time?
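A sketch of gathering that the next time the appliance pauses.  The guestfs-* domain is transient and its name changes on every run, so list it first; the name below is simply the one from this report's command line, and the connection URI may differ depending on which libvirt instance libguestfs used:

# Find the transient appliance domain, then query its state and any
# recorded block I/O errors.
virsh -c qemu:///session list --all
virsh -c qemu:///session dominfo guestfs-radorwbctnktvu5t
virsh -c qemu:///session domblkerror guestfs-radorwbctnktvu5t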
Can this be reproduced with a newer QEMU?

BTW, for QEMU 1.8, you will be able to write zeroes efficiently to a qcow2 file using virtio-scsi and the BLKZEROOUT ioctl (from the guest, of course).  The BLKZEROOUT ioctl can be used unconditionally, even on older QEMU; it just won't be as effective.  This should have the same effect as the patch in https://lists.gnu.org/archive/html/qemu-devel/2012-07/msg03747.html.
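For illustration only, a sketch of exercising BLKZEROOUT from inside the guest via util-linux's blkdiscard front-end (the -z/--zeroout option requires a reasonably recent util-linux; the device, offset and length are placeholders):

# Zero a 1 GiB range of the block device via the BLKZEROOUT ioctl,
# instead of writing zero-filled buffers from userspace.
blkdiscard -z -o 0 -l 1073741824 /dev/sdX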
This message is a notice that Fedora 19 is now at end of life.  Fedora has stopped maintaining and issuing updates for Fedora 19.  It is Fedora's policy to close all bug reports from releases that are no longer maintained.  Approximately four weeks from now this bug will be closed as EOL if it remains open with a Fedora 'version' of '19'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue; we are sorry that we were not able to fix it before Fedora 19 reached end of life.  If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events.  Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 19 changed to end-of-life (EOL) status on 2015-01-06.  Fedora 19 is no longer maintained, which means that it will not receive any further security or bug fix updates.  As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora, please feel free to reopen it against that version.  If you are unable to reopen this bug, please file a new report against the current release.  If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.