Bug 1784961
Summary: | libguestfs failing on power9 images | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Kevin Fenzi <kevin> |
Component: | qemu | Assignee: | Fedora Virtualization Maintainers <virt-maint> |
Status: | CLOSED EOL | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | | |
Version: | 32 | CC: | amit, awilliam, berrange, cfergeau, crobinso, dan, dgibson, dwmw2, gustavold, hannsj_uhl, itamar, jcajka, lvivier, normand, pbonzini, rjones, virt-maint |
Target Milestone: | --- | | |
Target Release: | --- | | |
Hardware: | ppc64le | | |
OS: | Unspecified | | |
Whiteboard: | | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2021-05-25 15:14:31 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | | |
Bug Depends On: | | | |
Bug Blocks: | 1071880 | | |
Description
Kevin Fenzi
2019-12-18 20:03:32 UTC
There may be more information from the libguestfs-test-tool run if you look in /var/log/libvirt/qemu/guestfs-z2tu3i19vx35na9x.log. Also, if qemu segfaulted then abrt/coredumpctl may have captured a core dump. However, the basic problem is that qemu is crashing, so this is most likely to be a qemu (or possibly kernel/firmware) problem.

The /var/log/libvirt/guestfs*.log:

PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-guestfs-z2tu3i19vx35 \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-guestfs-z2tu3i19vx35/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-guestfs-z2tu3i19vx35/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-guestfs-z2tu3i19vx35/.config \
QEMU_AUDIO_DRV=none \
TMPDIR=/var/tmp \
/usr/bin/qemu-system-ppc64 \
-name guest=guestfs-z2tu3i19vx35na9x,debug-threads=on \
-S \
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-guestfs-z2tu3i19vx35/master-key.aes \
-machine pseries-4.1,accel=kvm,usb=off,dump-guest-core=off \
-m 1024 \
-overcommit mem-lock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 0024f525-c02c-42c5-9326-0bfa6acf4a9a \
-display none \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=32,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-no-reboot \
-boot strict=on \
-kernel /var/tmp/.guestfs-0/appliance.d/kernel \
-initrd /var/tmp/.guestfs-0/appliance.d/initrd \
-append 'panic=1 console=hvc0 console=ttyS0 edd=off udevtimeout=6000 udev.event-timeout=6000 no_timer_check printk.time=1 cgroup_disable=memory usbcore.nousb cryptomgr.notests tsc=reliable 8250.nr_uarts=1 root=/dev/sdb selinux=0 guestfs_verbose=1 TERM=screen' \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x1 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x2 \
-drive file=/tmp/libguestfsI8FPWT/scratch1.img,format=raw,if=none,id=drive-scsi0-0-0-0,cache=unsafe \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0,bootindex=1,write-cache=on \
-drive file=/tmp/libguestfsI8FPWT/overlay2.qcow2,format=qcow2,if=none,id=drive-scsi0-0-1-0,cache=unsafe \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=1,lun=0,device_id=drive-scsi0-0-1-0,drive=drive-scsi0-0-1-0,id=scsi0-0-1-0,write-cache=on \
-chardev socket,id=charserial0,path=/tmp/libguestfsSJfQVT/console.sock \
-device spapr-vty,chardev=charserial0,id=serial0,reg=0x30000000 \
-chardev socket,id=charchannel0,path=/tmp/libguestfsSJfQVT/guestfsd.sock \
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.libguestfs.channel.0 \
-object rng-random,id=objrng0,filename=/dev/urandom \
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x3 \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
2019-12-18 19:56:10.134+0000: Domain id=1 is tainted: custom-argv
2019-12-18 19:56:14.213+0000: shutting down, reason=shutdown

There's no crash or core...

Any ideas or news here? Still happening. ;(

For the record, I have already seen the "restarting" VM when playing with RHEL 8 cloud images (and perhaps with others too). The symptom was similar: boot starts, writes the grub boot menu, and instead of booting the selected OS, it boots again to the grub menu. On the second grub run, it allows Linux to boot.

Also, have you tried passing the same number of threads as the host? I mean "-smp 1,sockets=1,cores=1,threads=1" (P9 usually has 4 threads per core). See Bug 1789199.

[]'s
Gustavo

I meant to write "-smp 1,sockets=1,cores=1,threads=4" should be "-smp 4,sockets=1,cores=1,threads=4" :-) But I see no change.

Kevin's command line gives a good reproducer, so let's switch to qemu or start a new bug against qemu.

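For anyone replaying the -smp variants above, here is a minimal Python sketch of the consistency rule behind the "-smp 1" to "-smp 4" correction. It assumes the simple case where no maxcpus is given, so the vCPU count is expected to equal sockets x cores x threads. The helper names parse_smp and topology_is_consistent are made up for illustration and are not part of qemu or libguestfs.

```python
def parse_smp(arg: str) -> dict:
    """Parse the value of a qemu -smp option into integer fields.

    Illustrative helper only; it handles the key=value forms seen in this bug.
    """
    fields = {}
    for index, part in enumerate(arg.split(",")):
        if "=" in part:
            key, value = part.split("=", 1)
        elif index == 0:
            # A bare leading number is the vCPU count.
            key, value = "cpus", part
        else:
            raise ValueError(f"unexpected -smp field: {part!r}")
        fields[key] = int(value)
    return fields


def topology_is_consistent(arg: str) -> bool:
    """Check cpus == sockets * cores * threads (assumes no maxcpus)."""
    smp = parse_smp(arg)
    product = smp.get("sockets", 1) * smp.get("cores", 1) * smp.get("threads", 1)
    return smp.get("cpus", product) == product


if __name__ == "__main__":
    for value in (
        "1,sockets=1,cores=1,threads=1",
        "1,sockets=1,cores=1,threads=4",
        "4,sockets=1,cores=1,threads=4",
    ):
        print(value, topology_is_consistent(value))
```

Note that, as reported above, a consistent topology did not change the outcome; the sketch only explains the correction.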
It could be related to this warning:

qemu-system-ppc64: warning: kernel_irqchip allowed but unavailable: IRQ_XIVE capability must be present for KVM
Falling back to kernel-irqchip=off

because it appears when the Linux kernel is loaded/booted for the first time. It's missing in the second boot. I think there has been a bug for it already.

Seems it's the machine type problem again; using -M pseries-4.0 makes the "double boot" problem in qemu go away.

(In reply to Dan Horák from comment #9)
> Seems it's the machine type problem again, using -M pseries-4.0 makes the
> "double boot" problem in qemu go away.

This has been fixed upstream by 8deb8019d696 ("spapr: Don't trigger a CAS reboot for XICS/XIVE mode changeover").

Laurent, could it be backported to qemu 4.2? Are there any prerequisite patches?

Cedric has already explained the "double boot" to me in https://bugzilla.redhat.com/show_bug.cgi?id=1769600#c34, so I'm wondering which route we should take. Running regular VMs is OK, but libguestfs can't deal with the double boot. libguestfs would allow us to override the machine parameters via http://libguestfs.org/guestfs.3.html#qemu-wrappers, but it would be easier to use a patched qemu.

(In reply to Dan Horák from comment #11)
> Laurent, could it be backported to qemu 4.2?

Yes, and it's straightforward. No prerequisite patches.

(In reply to Laurent Vivier from comment #12)
> (In reply to Dan Horák from comment #11)
> > Laurent, could it be backported to qemu 4.2?
>
> Yes, and it's straightforward. No prerequisite patches.

And how about qemu 4.1? Because that's the version in F-31 that is installed on the Fedora builders.

(In reply to Dan Horák from comment #13)
> (In reply to Laurent Vivier from comment #12)
> > (In reply to Dan Horák from comment #11)
> > > Laurent, could it be backported to qemu 4.2?
> >
> > Yes, and it's straightforward. No prerequisite patches.
>
> and how about qemu 4.1? Because that's the version in F-31 that installed on
> the Fedora builders.

It's more complicated: 4.1 needs more patches to regenerate the device tree when CAS is called and to add a code path that activates and deactivates interrupt controllers, plus a new version of SLOF. dgibson knows better than me what the list of needed patches is.

OK, then it would be better to use the virt stack from the virt-preview repo, which has 4.2 for F-31 and F-30.

And I confirm that the "double boot" goes away when I use qemu 4.2 with the 8deb8019d696 patch applied. I'll now check if oz/image-factory works too.

And I got an image created:

============ Final Image Details ============
UUID: 48b2d109-d0ef-4f0b-8e20-4a62424ea25b
Type: base_image
Image filename: /var/lib/imagefactory/storage/48b2d109-d0ef-4f0b-8e20-4a62424ea25b.body
Image build completed SUCCESSFULLY!

So the main question is about the next steps. Can infra use qemu from virt-preview? How do we best integrate the patch into our qemu package?

So to clarify, this is in the guest, right? We could do a qemu build + the patch in our infra repo... that would upgrade all the builders, but I guess that might be ok?

Dan,

There is perhaps another way to fix the problem easily, if the problem is with the no-reboot parameter and not with the double-boot.

The commit 9146206eb26c ("spapr: Use SHUTDOWN_CAUSE_SUBSYSTEM_RESET for CAS reboots") allows the reboot for the CAS negotiation to happen even if the --no-reboot parameter is provided.

It is already included in 4.2 and easy to backport to 4.1.

(In reply to Kevin Fenzi from comment #18)
> So to clarify, this is in the guest right?

Yes, qemu in the builder VM needs the update.

> We could do a qemu build + the patch in our infra repo... that would upgrade
> all the builders, but I guess that might be ok?

Yes, it should be OK; you could give the other arches some testing in the staging env first.

(In reply to Laurent Vivier from comment #19)
> Dan,
>
> There is perhaps another way to fix the problem easily if the problem is
> with the no-reboot parameter and not with the double-boot.
>
> The commit 9146206eb26c ("spapr: Use SHUTDOWN_CAUSE_SUBSYSTEM_RESET for CAS
> reboots") allows to reboot to do the CAS negotiation even if the --no-reboot
> parameter is provided.
>
> It is already included in 4.2 and easy to backport to 4.1

I think it's a question for the libguestfs guys and how libguestfs communicates with qemu. Also, I think we have no way to pass additional parameters to the domain XML or qemu, so qemu 4.2 + the patch looks like a good solution to me :-)

Not really sure of the question, but maybe this diagram helps? http://libguestfs.org/guestfs-internals.1.html#architecture

We try not to need custom tweaking for each architecture. If qemu can't by default boot a kernel image then we usually think of that architecture as needing to be fixed. I even have a tool to test this: https://people.redhat.com/~rjones/qemu-sanity-check/

Applying 8deb8019d696 to 4.1 will require a *lot* of preliminary patches. However, for the problem you have specifically with libguestfs (which occurs because you use -no-reboot), it should only be necessary to use the stopgap fix in 9146206eb26c1436c80a7c2ca1e4c5f86b27179d "spapr: Use SHUTDOWN_CAUSE_SUBSYSTEM_RESET for CAS reboots". That one should apply to 4.1 much more easily.

The reboots turned out to actually be a problem for openQA as well; it was just a bit more subtle than I first realized. We have some tests which are set to specify kernel parameters by typing them into the bootloader when the VM boots. But those tests are broken by the reboot behaviour, because the test only types the parameters on the *first* boot, then the reboot happens and they are effectively lost. Rewriting the test code to handle the VM spontaneously rebooting like this would be a bit awkward. For now, Kevin has done a backport of qemu 4.2 in the infra repo, and I have that deployed on the openQA VMs as well; with a newer SLOF it seems to be working OK.

Ah, right. 9146206eb26c1436c80a7c2ca1e4c5f86b27179d alone won't fix that kernel parameters problem. For that you will need 8deb8019d696 and all its preliminaries.

Switching to qemu to build an update with the mentioned patch applied.

Cole (or another qemu maintainer), could you build new qemu 4.2 (f32 + rawhide + virt-preview) with 8deb8019d696 ("spapr: Don't trigger a CAS reboot for XICS/XIVE mode changeover") applied?

Thanks, Dan.

Patches pushed and the f32 build is done; the rawhide qemu build is failing due to some kernel headers breakage though: https://bugzilla.redhat.com/show_bug.cgi?id=1804330

This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to this bug being closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

Fedora 32 changed to end-of-life (EOL) status on 2021-05-25. Fedora 32 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug.

Thank you for reporting this bug and we are sorry it could not be fixed.
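
For reference, the machine-type workaround mentioned earlier in this bug (-M pseries-4.0 makes the "double boot" go away, possibly applied through a libguestfs qemu wrapper) could be sketched roughly as below. This is only an illustration, not what was deployed; the thread settled on qemu 4.2 plus the patch. The function name rewrite_machine_type is made up, and the sketch assumes the machine name is the first field of the -machine/-M value, as in the command line from the description.

```python
import re


def rewrite_machine_type(argv, new_type="pseries-4.0"):
    """Return a copy of argv with the pseries machine type replaced.

    Hypothetical helper: only a leading "pseries..." name is rewritten,
    so trailing properties such as accel=kvm are preserved.
    """
    out = list(argv)
    for index, arg in enumerate(out[:-1]):
        if arg in ("-machine", "-M"):
            out[index + 1] = re.sub(r"^pseries[^,]*", new_type, out[index + 1])
    return out


if __name__ == "__main__":
    original = [
        "-name", "guest=guestfs-z2tu3i19vx35na9x,debug-threads=on",
        "-machine", "pseries-4.1,accel=kvm,usb=off,dump-guest-core=off",
        "-m", "1024",
    ]
    print(rewrite_machine_type(original))
```

A wrapper script built on this would finish by calling os.execv on the real binary (/usr/bin/qemu-system-ppc64 in the log above) with the rewritten arguments, and libguestfs would be pointed at the wrapper through the LIBGUESTFS_HV environment variable described in the qemu-wrappers section of guestfs(3). Whether that is usable on the builders is an open assumption, given the comment above that there was no way to pass additional parameters there.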