Bug 2241293 - virt-df hangs using 100% of one core
Summary: virt-df hangs using 100% of one core
Status: CLOSED NOTABUG
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libguestfs
Version: unspecified
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Richard W.M. Jones
Reported: 2023-09-29 07:10 UTC by Martin J.
Modified: 2023-10-12 19:23 UTC
CC: 2 users

Last Closed: 2023-10-12 19:23:33 UTC


Attachments
 - libguestfs-test-tool output (56.64 KB, text/plain), 2023-09-29 07:59 UTC, Martin J.
 - err.log of success-run (446.77 KB, text/plain), 2023-10-04 08:44 UTC, Martin J.
 - out.log of success-run (426 bytes, text/plain), 2023-10-04 08:44 UTC, Martin J.
 - err5.log - stderr of failing run (2.47 MB, text/plain), 2023-10-12 08:37 UTC, Martin J.
 - out5.log - stdout of failing run (1.78 KB, text/plain), 2023-10-12 08:38 UTC, Martin J.
 - failing run on guestfish (14.13 KB, text/plain), 2023-10-12 18:26 UTC, Martin J.
 - successful run of guestfish (56.05 KB, text/plain), 2023-10-12 18:26 UTC, Martin J.

Description Martin J. 2023-09-29 07:10:04 UTC
Description of problem: virt-df (see Command 1 below for the qemu process it spawns) hangs using 100% of one core until killed. A normal sudo kill stops it immediately. It is invoked up to 30 times at once on a virtualization host as part of a cron job that checks the disk usage of virtual machines. Most invocations succeed, but roughly 1 in 100 fails with this problem.

We have investigated the /proc/<pid>/fd directory, and it shows that one of the VPS disks has been opened for reading.
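That check can be sketched as follows (a minimal sketch, not taken from the report: the PID is a placeholder, and the current shell's PID is used here purely so the command has something to show; substitute the PID of the hung virt-df process):

```shell
# List the file descriptors the hung process holds open.
# "pid" is a placeholder; set it to the PID of the hung virt-df
# process. Here we use the current shell's PID for illustration.
pid=$$
ls -l /proc/"$pid"/fd
```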

Version-Release number of selected component (if applicable): 1.40.2 running on Ubuntu 20.04.6.


How reproducible: Unsure how to reproduce, but it happens every day on our Ubuntu servers. We can provide more information from our systems.

Steps to Reproduce:
Unknown.

Actual results: virt-df hangs using 100% of one core until killed.


Expected results: virt-df finishes with either a result or an error code.


Additional info:

 - The virtual disks are LVM volumes
 - About 1/100 invocations fail with this problem.

Command 1: /usr/bin/qemu-system-x86_64 -global virtio-blk-pci.scsi=off -no-user-config -enable-fips -nodefaults -display none -machine accel=kvm:tcg -cpu host -m 768 -no-reboot -rtc driftfix=slew -no-hpet -global kvm-pit.lost_tick_policy=discard -kernel /var/tmp/.guestfs-0/appliance.d/kernel -initrd /var/tmp/.guestfs-0/appliance.d/initrd -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -device virtio-scsi-pci,id=scsi -drive file.file.filename=/tmp/libguestfsuEQ2nY/overlay1.qcow2,file.driver=qcow2,file.backing.file.locking=off,cache=unsafe,id=hd0,if=none -device scsi-hd,drive=hd0 -drive file=/var/tmp/.guestfs-0/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw -device scsi-hd,drive=appliance -device virtio-serial-pci -serial stdio -chardev socket,path=/tmp/libguestfs9sWCeV/guestfsd.sock,id=channel0 -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 -append panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 quiet TERM=linux guestfs_identifier=thread_2_domain_2

Comment 1 Richard W.M. Jones 2023-09-29 07:45:40 UTC
Can you run 'libguestfs-test-tool' and attach the complete, unedited output.

Comment 2 Martin J. 2023-09-29 07:59:44 UTC
Created attachment 1991048 [details]
libguestfs-test-tool output

Comment 3 Richard W.M. Jones 2023-09-29 08:13:32 UTC
OK now run:

LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df [etc]

(for whichever virt-df command hangs) and attach the complete output.

Comment 4 Martin J. 2023-09-29 09:02:28 UTC
We are investigating the exact command being run and will come back with this information as soon as we can.

Comment 5 Martin J. 2023-10-04 08:43:17 UTC
The exact command being run is: "virt-df --csv"

I have attached the stdout and stderr of running "LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df --csv". This is a non-failing run; I am trying to reproduce a failing one, but it happens only rarely.

Comment 6 Martin J. 2023-10-04 08:44:01 UTC
Created attachment 1991988 [details]
err.log of success-run

Comment 7 Martin J. 2023-10-04 08:44:20 UTC
Created attachment 1991989 [details]
out.log of success-run

Comment 8 Martin J. 2023-10-12 08:37:10 UTC
We finally managed to reproduce a failing run; see err5.log for stderr and out5.log for stdout.

This is the command we ran:

LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 virt-df --csv 2> err5.log > out5.log

It hung at 100% CPU on line 47993, "libguestfs: command: run: \ -rf /tmp/libguestfsfFUBad", for a long time until I killed it with sudo kill.

Comment 9 Martin J. 2023-10-12 08:37:56 UTC
Created attachment 1993586 [details]
err5.log - stderr of failing run

Comment 10 Martin J. 2023-10-12 08:38:48 UTC
Created attachment 1993587 [details]
out5.log -- stdout of failing run

Comment 11 Richard W.M. Jones 2023-10-12 08:54:53 UTC
There's something up with qemu and/or the kernel which causes intermittent
hangs, so this isn't really a virt-df or libguestfs issue.  I would try
something like this:

while guestfish -vx -a /dev/null run >& /tmp/log1 ; do echo -n . ; done

and leave that to run for a long time.  If it hangs, examine /tmp/log1.
If it appears to run forever then try running several of those loops
in parallel (with differently named log files).
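The parallel variant can be sketched like this (a sketch only, not from the comment above: running 4 loops and the /tmp/log$i naming are arbitrary assumptions):

```shell
# Run several copies of the reproducer loop in parallel, each
# writing to its own log file; a hanging iteration leaves its log
# behind for inspection. The count of 4 is an arbitrary choice.
for i in 1 2 3 4; do
    ( while guestfish -vx -a /dev/null run >& "/tmp/log$i"; do
          echo -n .
      done ) &
done
wait
```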

Another tool to look at is:
https://people.redhat.com/~rjones/qemu-sanity-check/

Comment 12 Martin J. 2023-10-12 10:02:24 UTC
> There's something up with qemu and/or the kernel which causes intermittent hangs, so this isn't really a virt-df or libguestfs issue.

Thanks a lot for investigating this for us. A quick follow up, what in the logs points towards your conclusion?

> I would try something like this:

Thanks! We will try!

Comment 13 Richard W.M. Jones 2023-10-12 10:14:56 UTC
> libguestfs: error: guestfs_launch failed, see earlier error messages

However, because all of the messages from all the appliances are mixed
together, it's hard to see exactly which appliance failed to start or
why.  Therefore having separated logs as suggested in comment 11
will help to diagnose exactly where the kernel boot is hanging.

BTW if this is TCG then you may be hitting the infamous
https://rwmj.wordpress.com/2023/06/18/follow-up-to-i-booted-linux-292612-times/
(However that only affects TCG, and is fixed in recent kernels)

Comment 14 Martin J. 2023-10-12 18:24:41 UTC
I ran the command you suggested several thousand times, and it hit the hanging bug.

guestfish -vx -a /dev/null run >& /tmp/log2099

I attached a successful run and a failing/hanging run. Might this be the infamous bug you are referring to? If so, could a kernel update fix it?

Comment 15 Martin J. 2023-10-12 18:26:05 UTC
Created attachment 1993649 [details]
failing run on guestfish

Comment 16 Martin J. 2023-10-12 18:26:48 UTC
Created attachment 1993650 [details]
successful run of guestfish

Comment 17 Richard W.M. Jones 2023-10-12 19:23:11 UTC
It's not the infamous bug because it seems this is not TCG.

From the good/bad outputs I would guess that the next line should be:

[    0.169181] smpboot: CPU0: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (family: 0x6, model: 0x4f, stepping: 0x1)

This is classically either a kernel or qemu bug.  As these are quite
old versions of both (especially qemu) I'd suggest upgrading them.  The
latest versions are very stable.

Else open a bug against Ubuntu LTS, since qemu should always be able
to boot the kernel reliably.

For RHEL we have started to use qemu-sanity-check to verify this.

Comment 18 Richard W.M. Jones 2023-10-12 19:23:33 UTC
Closing as this is not a bug in libguestfs itself.

