Description of problem:
virt-df (see command 1 below) hangs, using 100% of one core, until killed. A normal sudo kill stops it immediately. It is invoked as many as 30 times at once on a virtualization host as part of a cronjob that checks disk usage of virtual machines. Most invocations succeed, but about 1 in 100 fails with this problem. We have investigated the /proc/<pid>/fd directory, and it shows that one of the VPS disks has been opened for reading.

Version-Release number of selected component (if applicable):
1.40.2 running on Ubuntu 20.04.6

How reproducible:
Unsure how to reproduce, but it happens every day on our Ubuntu servers. We can provide more information from our systems.

Steps to Reproduce:
Unknown.

Actual results:
virt-df hangs using 100% of one core until killed.

Expected results:
virt-df finishes with an error code or a result.

Additional info:
- The virtual disks are LVM volumes.
- About 1 in 100 invocations fails with this problem.

Command 1:
/usr/bin/qemu-system-x86_64 -global virtio-blk-pci.scsi=off -no-user-config -enable-fips -nodefaults -display none -machine accel=kvm:tcg -cpu host -m 768 -no-reboot -rtc driftfix=slew -no-hpet -global kvm-pit.lost_tick_policy=discard -kernel /var/tmp/.guestfs-0/appliance.d/kernel -initrd /var/tmp/.guestfs-0/appliance.d/initrd -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device virtio-scsi-pci,id=scsi -drive file.file.filename=/tmp/libguestfsuEQ2nY/overlay1.qcow2,file.driver=qcow2,file.backing.file.locking=off,cache=unsafe,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-0/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw -device scsi-hd,drive=appliance -device virtio-serial-pci -serial stdio -chardev socket,path=/tmp/libguestfs9sWCeV/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 quiet TERM=linux guestfs_identifier=thread_2_domain_2
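The /proc inspection described above can be scripted for quick triage when a hang recurs. This is only a sketch: the placeholder PID (`$$`, the current shell) is an assumption so the commands are runnable as written; substitute the real hung virt-df PID, e.g. from `pgrep -f virt-df`.

```shell
# Quick triage of a hung virt-df process.
pid=$$                      # placeholder PID; replace with the hung virt-df PID
ls -l "/proc/$pid/fd"       # open files: look for the guest's LVM volume here
grep -E '^(State|Threads)' "/proc/$pid/status"   # State R = busy-spinning
cat "/proc/$pid/stack" 2>/dev/null || true       # kernel stack (needs root)
```

If the fd listing shows the guest disk open and State is R, the process is spinning rather than blocked on I/O, which matches the 100% CPU symptom.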
Can you run 'libguestfs-test-tool' and attach the complete, unedited output.
Created attachment 1991048 [details] libguestfs-test-tool output
OK, now run:

LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df [etc]

(for whichever virt-df command hangs) and attach that complete output.
We are investigating the exact command being run and will come back with this information as soon as we can.
The exact command being run is: "virt-df --csv". I have attached the stdout and stderr of running "LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df --csv". The attached run is a non-failing one; I am trying to reproduce a failing run, but it happens only rarely.
Created attachment 1991988 [details] err.log of success-run
Created attachment 1991989 [details] out.log of success-run
We finally managed to reproduce a failing run: see err5.log for stderr and out5.log for stdout. This is the command we ran:

LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df --csv 2> err5.log > out5.log

It hung at line 47993, "libguestfs: command: run: \ -rf /tmp/libguestfsfFUBad", using 100% of one CPU for a long time until I killed it with sudo kill.
Created attachment 1993586 [details] err5.log - stderr of failing run
Created attachment 1993587 [details] out5.log -- stdout of failing run
There's something up with qemu and/or the kernel which causes intermittent hangs, so this isn't really a virt-df or libguestfs issue. I would try something like this:

while guestfish -vx -a /dev/null run >& /tmp/log1 ; do echo -n . ; done

and leave that to run for a long time. If it hangs, examine /tmp/log1. If it appears to run forever then try running several of those loops in parallel (with differently named log files).

Another tool to look at is: https://people.redhat.com/~rjones/qemu-sanity-check/
> There's something up with qemu and/or the kernel which causes intermittent hangs, so this isn't really a virt-df or libguestfs issue.

Thanks a lot for investigating this for us. A quick follow-up: what in the logs points towards your conclusion?

> I would try something like this:

Thanks! We will try!
> libguestfs: error: guestfs_launch failed, see earlier error messages

However because all of the messages from all the appliances are mixed together, it's hard to see exactly which appliance failed to start or why. Therefore having separated logs as suggested in comment 11 will help to diagnose exactly where the kernel boot is hanging.

BTW if this is TCG then you may be hitting the infamous https://rwmj.wordpress.com/2023/06/18/follow-up-to-i-booted-linux-292612-times/ (However that only affects TCG, and is fixed in recent kernels.)
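One way to get separated logs is to run virt-df once per libvirt domain instead of once for all guests, so each appliance writes its own debug output. A sketch, assuming libvirt domains are listed by `virsh list --name` on this host; the log directory path is an assumption:

```shell
# Per-domain virt-df runs so each appliance's debug log is separate.
LOGDIR=/tmp/virt-df-logs   # assumed log location, adjust as needed
mkdir -p "$LOGDIR"
for dom in $(virsh list --name); do
  LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 \
    virt-df --csv -d "$dom" > "$LOGDIR/$dom.out" 2> "$LOGDIR/$dom.err"
done
```

When a run hangs, the domain whose .err file ends mid-boot identifies which appliance (and which guest disk) triggered the hang.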
I ran the command you suggested several thousand times, and it hit the hanging bug:

guestfish -vx -a /dev/null run >& /tmp/log2099

I attached a successful run and a failing/hanging run. Might this be the infamous bug you are referring to? If so, could a kernel update fix it?
Created attachment 1993649 [details] failing run on guestfish
Created attachment 1993650 [details] successful run of guestfish
It's not the infamous bug because it seems this is not TCG.

From the good/bad outputs I would guess that the next line should be:

[ 0.169181] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (family: 0x6, model: 0x4f, stepping: 0x1)

This is classically either a kernel or qemu bug. As these are quite old versions of both (especially qemu) I'd suggest upgrading them. The latest versions are very stable. Else open a bug against Ubuntu LTS, since qemu should always be able to boot the kernel reliably. For RHEL we have started to use qemu-sanity-check to verify this.
Closing as this is not a bug in libguestfs itself.