Bug 1771976
Summary: | Occasional supermin failures when the appliance is partially deleted perhaps by a tmp cleaner | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Lubomír Sedlář <lsedlar> | |
Component: | supermin | Assignee: | Richard W.M. Jones <rjones> | |
Status: | CLOSED WONTFIX | QA Contact: | YongkuiGuo <yoguo> | |
Severity: | unspecified | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 7.7 | CC: | bthurber, fdupont, nperic, ptoscano, rjones, tzheng, xchen, yoguo | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1796120 (view as bug list) | Environment: | ||
Last Closed: | 2020-06-29 09:45:23 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1813809 | |||
Bug Blocks: |
Description
Lubomír Sedlář
2019-11-13 10:23:54 UTC
Does /var/tmp/.guestfs-1000/appliance.d/root exist? How about other files under /var/tmp/.guestfs-1000 ? The files exist now, but there may have been successful runs since seeing the issues. ls -lh /var/tmp/.guestfs-1000/appliance.d/ total 337M -rw-r--r--. 1 rcm-auto-compose rcm-auto-compose 3.3M Nov 13 03:34 initrd -rwxr-xr-x. 1 rcm-auto-compose rcm-auto-compose 6.5M Nov 13 03:34 kernel -rw-r--r--. 1 rcm-auto-compose rcm-auto-compose 4.0G Nov 13 03:34 root /var/tmp/.guestfs-1000 ├── appliance.d │ ├── initrd │ ├── kernel │ └── root ├── lock ├── qemu-5258400-1565614036.devices ├── qemu-5258400-1565614036.help ├── qemu-5258400-1565614036.qmp-schema ├── qemu-5258400-1565614036.query-kvm └── qemu-5258400-1565614036.stat Could this be caused by multiple instances of guestmount running at the same time? > Could this be caused by multiple instances of guestmount running at the same time?
No, if the appliance needs to be rebuilt then it takes a lock so
parallel runs are fine.
Anyway it seems as if /var/tmp/.guestfs-1000/appliance.d/root now exists. Does
it work now?
I can't test it since I don't have access to the machine in question. I'll ask owners to test. One one affected machine the /var/tmp/.guestfs-1000 is empty, and there it seems to not be working. What is responsible for creating the structure? The directory is a cache, populated on demand when any libguestfs-using tool runs. You can delete it at any time if you want to free up space, at the cost of slower start up next time you use a libguestfs tool. This bug will be addressed in next major release. Cannot reproduce this issue with the following steps: Version of packages: libguestfs-1.40.2-5.el7_7.1.x86_64 qemu-kvm-1.5.3-167.el7.x86_64 kernel-3.10.0-1062.el7.x86_64 1. Install libguestfs relevant packages in a rhel7.7 vm 2. # wget http://download-node-02.eng.bos.redhat.com/rhel-7/nightly/RHEL-7/RHEL-7.8-20191031.n.1/work/x86_64/buildinstall/Server/images/efiboot.img # mkdir /tmp/guestmount-test # LIBGUESTFS_BACKEND=direct guestmount -a /home/efiboot.img -m /dev/sda /tmp/guestmount-test # ll /tmp/guestmount-test total 2 drwxr-xr-x. 3 root root 2048 Oct 31 19:47 EFI Move back against RHEL 7.7. We have seen the same problem again. The directory it complained about was /var/tmp/.guestfs-101448/appliance.d/root and right after the failure it did not exist. How can I debug this to provide more information? Definitely enable debugging all the time, and capture the messages on every run. When it happens we'd want to see the full messages. (It's probably not going to help to enable debugging right now and run it again once. We want to see debug from the time it starts happening). (In reply to Lubomír Sedlář from comment #11) > We have seen the same problem again. The directory it complained about was > /var/tmp/.guestfs-101448/appliance.d/root and right after the failure it did > not exist. Under a properly cached appliance, you will see three files, something like (with different sizes of course): $ du -hcs /var/tmp/.guestfs-101448/appliance.d/ 1.6M tmp/.guestfs-1000/appliance.d/initrd 9.9M tmp/.guestfs-1000/appliance.d/kernel 383M tmp/.guestfs-1000/appliance.d/root 394M total Since 'root' is a big file (at least 300/350M) and that the directory is under /var/tmp, my theory is that systemd-tmpfiles is cleaning that to save space. In particular, /usr/lib/tmpfiles.d/tmp.conf sets 30d (at least on Fedora 31) as age for /var/tmp, so in case libguestfs was not run for more than 30 days then I think that files might be cleaned. When this happens, can you please check the timestamp of the appliance.d and all the files inside it? - `stat /var/tmp/.guestfs-101448/appliance.d/` - `stat /var/tmp/.guestfs-101448/appliance.d/*` --- The issue I see is that supermin (that creates the appliance in appliance.d) does not check for the existance of all the files it creates, but only whether the directory exists. Most probably it should be smarter and check which files exists -- sadly ATM it is not easy to do so, as the files created change depending on the appliance type (tgz vs chroot) and depending on the files themselves. For the latest failure I have this. It was run with LIBGUESTFS_DEBUG=1 and LIBGUESTFS_TRACE=1. Is there more debugging information I can turn on? 2020-01-28 14:10:12 [ERROR ] Command ['guestmount', '-a', '/mnt/redhat/rhel-8/devel/candidate-trees/RHEL-8/RHEL-8.1.1-updates-20200128.3/work/global/tmp/tweak_buildinstall_EkC1x6/images/efiboot.img', '-m', '/dev/sda', '/tmp/iso-mount-G5ZZbg'] exited with 1 and output: libguestfs: trace: set_verbose true libguestfs: trace: set_verbose = 0 libguestfs: trace: set_backend "direct" libguestfs: trace: set_backend = 0 libguestfs: create: flags = 0, handle = 0x563460863b50, program = guestmount libguestfs: trace: set_recovery_proc false libguestfs: trace: set_recovery_proc = 0 libguestfs: trace: add_drive "/mnt/redhat/rhel-8/devel/candidate-trees/RHEL-8/RHEL-8.1.1-updates-20200128.3/work/global/tmp/tweak_buildinstall_EkC1x6/images/efiboot.img" libguestfs: trace: add_drive = 0 libguestfs: trace: launch libguestfs: trace: max_disks libguestfs: trace: max_disks = 255 libguestfs: trace: get_tmpdir libguestfs: trace: get_tmpdir = "/tmp" libguestfs: trace: version libguestfs: trace: version = <struct guestfs_version = major: 1, minor: 40, release: 2, extra: rhel=7,release=5.el7_7.2,libvirt, > libguestfs: trace: get_backend libguestfs: trace: get_backend = "direct" libguestfs: launch: program=guestmount libguestfs: launch: version=1.40.2rhel=7,release=5.el7_7.2,libvirt libguestfs: launch: backend registered: unix libguestfs: launch: backend registered: uml libguestfs: launch: backend registered: libvirt libguestfs: launch: backend registered: direct libguestfs: launch: backend=direct libguestfs: launch: tmpdir=/tmp/libguestfskaSVES libguestfs: launch: umask=0002 libguestfs: launch: euid=101448 libguestfs: trace: get_cachedir libguestfs: trace: get_cachedir = "/var/tmp" libguestfs: begin building supermin appliance libguestfs: run supermin libguestfs: command: run: /usr/bin/supermin5 libguestfs: command: run: \ --build libguestfs: command: run: \ --verbose libguestfs: command: run: \ --if-newer libguestfs: command: run: \ --lock /var/tmp/.guestfs-101448/lock libguestfs: command: run: \ --copy-kernel libguestfs: command: run: \ -f ext2 libguestfs: command: run: \ --host-cpu x86_64 libguestfs: command: run: \ /usr/lib64/guestfs/supermin.d libguestfs: command: run: \ -o /var/tmp/.guestfs-101448/appliance.d supermin: version: 5.1.19 supermin: rpm: detected RPM version 4.13 supermin: package handler: fedora/rpm supermin: acquiring lock on /var/tmp/.guestfs-101448/lock supermin: if-newer: output does not need rebuilding libguestfs: finished building supermin appliance libguestfs: begin testing qemu features libguestfs: trace: get_cachedir libguestfs: trace: get_cachedir = "/var/tmp" libguestfs: checking for previously cached test results of /usr/libexec/qemu-kvm, in /var/tmp/.guestfs-101448 libguestfs: loading previously cached test results libguestfs: QMP parse error: '[' or '{' expected near end of file (ignored) libguestfs: qemu version: 1.5 libguestfs: qemu mandatory locking: no libguestfs: qemu KVM: disabled libguestfs: trace: get_backend_setting "force_tcg" libguestfs: trace: get_backend_setting = NULL (error) libguestfs: trace: get_sockdir libguestfs: trace: get_sockdir = "/tmp" libguestfs: finished testing qemu features libguestfs: trace: get_backend_setting "gdb" libguestfs: trace: get_backend_setting = NULL (error) libguestfs: command: run: dmesg | grep -Eoh 'lpj=[[:digit:]]+' libguestfs: read_lpj_from_dmesg: calculated lpj=2499998 /usr/libexec/qemu-kvm \ -global virtio-blk-pci.scsi=off \ -no-user-config \ -enable-fips \ -nodefaults \ -display none \ -machine accel=kvm:tcg \ -m 768 \ -no-reboot \ -rtc driftfix=slew \ -no-hpet \ -global kvm-pit.lost_tick_policy=discard \ -kernel /var/tmp/.guestfs-101448/appliance.d/kernel \ -initrd /var/tmp/.guestfs-101448/appliance.d/initrd \ -object rng-random,filename=/dev/urandom,id=rng0 \ -device virtio-rng-pci,rng=rng0 \ -device virtio-scsi-pci,id=scsi \ -drive file=/mnt/redhat/rhel-8/devel/candidate-trees/RHEL-8/RHEL-8.1.1-updates-20200128.3/work/global/tmp/tweak_buildinstall_EkC1x6/images/efiboot.img,cache=writeback,id=hd0,if=none \ -device scsi-hd,drive=hd0 \ -drive file=/var/tmp/.guestfs-101448/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw \ -device scsi-hd,drive=appliance \ -device virtio-serial-pci \ -serial stdio \ -device sga \ -chardev socket,path=/tmp/libguestfsiIhnQy/guestfsd.sock,id=channel0 \ -device virtserialport,chardev=channel0,name=org.libguestfs.channel.0 \ -append "panic=1 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check lpj=2499998 printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=linux" Could not access KVM kernel module: No such file or directory failed to initialize KVM: No such file or directory Back to tcg accelerator. qemu-kvm: -drive file=/var/tmp/.guestfs-101448/appliance.d/root,snapshot=on,id=appliance,cache=unsafe,if=none,format=raw: could not open disk image /var/tmp/.guestfs-101448/appliance.d/root: Could not open '/var/tmp/.guestfs-101448/appliance.d/root': No such file or directory libguestfs: error: appliance closed the connection unexpectedly, see earlier error messages libguestfs: child_cleanup: 0x563460863b50: child process died libguestfs: sending SIGTERM to process 46402 libguestfs: error: guestfs_launch failed, see earlier error messages libguestfs: trace: launch = -1 (error) libguestfs: trace: close libguestfs: closing guestfs handle 0x563460863b50 (state 0) libguestfs: command: run: rm libguestfs: command: run: \ -rf /tmp/libguestfskaSVES libguestfs: command: run: rm libguestfs: command: run: \ -rf /tmp/libguestfsiIhnQy There is no more debug. > supermin: if-newer: output does not need rebuilding This message is consistent with what Pino says. https://github.com/libguestfs/supermin/blob/62d5c774d6c8fcac11e28fcba99754b5478e5088/src/supermin.ml#L232 Note on line 239 it only checks the date of the output directory. In this case the appliance.d/root file is 4.0G. Looking around the machine, I see there is /var/tmp/.guestfs-$UID/ for multiple users where the only content is an empty appliance.d directory. I can replicate the error now: /var/tmp/.guestfs-$UID is missing => guestmount works /var/tmp/.guestfs-$UID contains all files => guestmount works /var/tmp/.guestfs-$UID/appliance.d is empty directory and nothing else exists in the directory => error /usr/lib/tmpfiles.d/tmp.conf defines that files should be deleted after 30 days. v /var/tmp 1777 root root 30d Is there something I can change about the guestmount command to avoid this? The only workaround I can probably do is to unconditionally remove /var/tmp/.guestfs-$UID before running it. I think this is an actual bug in supermin, as described in Pino in comment 13. I wonder why we've not seen it before though - maybe the defaults in tmpfiles changed recently? While removing /var/tmp/.guestfs-* will work it means it'll rebuild the cache each time, so YMMV. > While removing /var/tmp/.guestfs-* will work it means it'll rebuild the cache each time, so YMMV.
Yeah, I will probably check if the appliance.d is empty, and only remove the directory in such case. That still leaves some room for possible race condition, but it seems unlikely enough to not warrant the performance penalty.
Anyway, thank you all for debugging this!
Closing per comment 21. |