Description of problem: When we run imagefactory tool to create an image (e.g. container image), vmlinuz and initrd.img file gets download first. Later, vm gets launched for further installation process. Within few seconds after launch, vm crashes and install process aborts. Version-Release number of selected component (if applicable): qemu-2.11.0-4.fc28.1.ppc64le How reproducible: Run imagefactory tool $ sudo imagefactory --verbose --debug base_image --file-parameter install_script <kickstart_file> <tdl_xml_template> --parameter offline_icicle true exception thrown by imagefactory: Exception encountered in _build_image_from_template thread Unable to find libguestfs inspectable image or rpm-ostree image on %(diskfile) I couldn't fine any smaller scenario to reproduce this bug Additional info: * imagefactory works fine for building Fedora 27 images. Crash happen only when f28/rawhide image is built. * Pagure ticket with some more detail - https://pagure.io/releng/issue/7326 * oz log from one of the failed container image build - https://kojipkgs.fedoraproject.org//work/tasks/3788/25123788/oz-ppc64le.log * Seems to also impact ppc64 - https://koji.fedoraproject.org/koji/taskinfo?taskID=25123781
There's not much information to go on. Is there an error from qemu? What is the error? If qemu is being run under libvirt you should find any error/warning messages in either /var/log/libvirt/qemu or $HOME/.cache/libvirt/qemu/log/
Not quite what you asked for but the linked to pagure issue does have this in it: ``` Couldn't emulate instruction 0x7c00491c (op 31 xop 142) kvmppc_exit_pr_progint: emulation at c00000000004e0b0 failed (7c00491c) Couldn't emulate instruction 0x7c00491c (op 31 xop 142) kvmppc_exit_pr_progint: emulation at c00000000004e0b0 failed (7c00491c) ``` Might help in the mean time while sinny grabs the qemu logs for you. Dusty
Created attachment 1398753 [details] qemu log from imagefactory run Attachment contains qemu log which I see in /var/log/libvirt/qemu/ . Let me know if any additional information is required.
The factory-build-*.log file seems to indicate that qemu crashed. If so then something (crashdumpctl perhaps?) should have collected a core dump. It would be very useful to see the stack trace from that.
(In reply to Richard W.M. Jones from comment #4) > The factory-build-*.log file seems to indicate that qemu crashed. > If so then something (crashdumpctl perhaps?) should have collected a > core dump. It would be very useful to see the stack trace from that. It doesn't look like any crash dump has been captured for this crash. # coredumpctl list /usr/bin/qemu-system-ppc64 TIME PID UID GID SIG COREFILE EXE ... Mon 2018-02-19 10:36:17 CET 3780 107 107 6 none /usr/bin/qemu-system-ppc64 There is no qemu log from today.
> Mon 2018-02-19 10:36:17 CET 3780 107 107 6 none /usr/bin/qemu-system-ppc64 Could it be this one? Can you run it again and see if you can collect a core dump? I think without at least a stack trace we're going to have a hard time helping.
Created attachment 1398815 [details] qemu-system-ppc64 coredump log from last run Just for reference, I have attached log from last run which I see in coredumpctl. Log is available in attachment which is output from "coredumpctl info 3780". It doesn't look to me crash report we want to see. I re-ran imagefactory couple of times both as root and non-root user but don't see any new dump in 'coredumpctl list'. I will keep digging and will update here if I find something useful.
I think you need to run: coredumpctl gdb and in gdb use the ‘t a a bt’ command to get the stack trace.
(In reply to Richard W.M. Jones from comment #8) > I think you need to run: > > coredumpctl gdb Running this doesn't provide gdb prompt, it just prints what gets printed from "coredumpctl info 3780". Additionally, it appends message "Coredump entry has no core attached (neither internally in the journal nor externally on disk).".
Sinny, please check /etc/systemd/coredump.conf. Try increasing ProcessSizeMax, and ExternalSizeMax. Another reason for the missing core file could be SELinux.
I'm looking at the last image creation failure - https://koji.fedoraproject.org/koji/taskinfo?taskID=25403811 - and if I see correctly in the oz-ppc64le.log file, then VM installation succeeds ... Waiting for factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e to finish installing, 7170/7200 Waiting for factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e to finish shutdown, 90/90 Install of factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e succeeded ... but it crashed in the subsequent boot that is used to create the actual image.
Sinny, I'm tryign to reproduce that, but getting an error from imagefactory ... escription>Fedora-Container-Minimal-Base-28-20180301.n.1.ppc64le OS</description>\n <disk>\n <size>5G</size>\n </disk>\n</template>\n", 'percent_complete': 0, 'type': 'BaseImage', 'properties': {}} 2018-03-02 16:37:01,637 ERROR imgfac.Builder.Builder thread(770bf29a) Message: Exception encountered in _build_image_from_template thread 2018-03-02 16:37:01,637 ERROR imgfac.Builder.Builder thread(770bf29a) Message: 'NoneType' object has no attribute 'create_base_image' Traceback (most recent call last): File "/usr/lib/python2.7/site-packages/imgfac/Builder.py", line 132, in _build_image_from_template self.os_plugin.create_base_image(self, template, parameters) AttributeError: 'NoneType' object has no attribute 'create_base_image' Image build FAILED with error: 'NoneType' object has no attribute 'create_base_image' I'm replicating https://koji.fedoraproject.org/koji/taskinfo?taskID=25403811 so I did wget https://kojipkgs.fedoraproject.org//work/tasks/3811/25403811/tdl-ppc64le.xml wget https://kojipkgs.fedoraproject.org//work/tasks/3811/25403811/koji-f28-build-25403811-base.ks to get the KS and TDL files and then I'm running imagefactory --verbose --debug base_image --file-parameter install_script koji-f28-build-25403811-base.ks tdl-ppc64le.xml --parameter offline_icicle true What am I missing?
And the answer is plugins :-) from https://fedoramagazine.org/building-fedora-rawhide-images-imagefactory/ sudo dnf install imagefactory imagefactory-plugins-TinMan imagefactory-plugins-Docker
(In reply to Dan Horák from comment #13) > And the answer is plugins :-) > > from https://fedoramagazine.org/building-fedora-rawhide-images-imagefactory/ > sudo dnf install imagefactory imagefactory-plugins-TinMan > imagefactory-plugins-Docker does that mean you were able to do an imagefactory run? did you observe the issue documented in this BZ?
(In reply to Dusty Mabe from comment #14) > (In reply to Dan Horák from comment #13) > > And the answer is plugins :-) > > > > from https://fedoramagazine.org/building-fedora-rawhide-images-imagefactory/ > > sudo dnf install imagefactory imagefactory-plugins-TinMan > > imagefactory-plugins-Docker > > does that mean you were able to do an imagefactory run? did you observe the > issue documented in this BZ? yes, I were able to run imagefactory, and no, I haven't seen the crash yet, but I have tried in slightly different environments - F-27 running on physical host (so no nested virt) and in F-28 based VM. Need to check with F-27 VM too.
(In reply to Michal Schmidt from comment #10) > Sinny, please check /etc/systemd/coredump.conf. Try increasing > ProcessSizeMax, and ExternalSizeMax. Another reason for the missing core > file could be SELinux. Default size of ExternalSizeMax and ProcessSizeMax were 2G which I have increased to 7G for each. Also, I disabled selinux on host. Sadly, there is still no coredunmp created or any error message seen which can give a clue if there were a problem while creating coredump.
(In reply to Dan Horák from comment #11) > I'm looking at the last image creation failure - > https://koji.fedoraproject.org/koji/taskinfo?taskID=25403811 - and if I see > correctly in the oz-ppc64le.log file, then VM installation succeeds > > ... > Waiting for factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e to finish > installing, 7170/7200 > Waiting for factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e to finish > shutdown, 90/90 > Install of factory-build-7945fe38-a9be-4934-bf66-e9f52e7a609e succeeded > ... No, it didn't really succeed VM installation. It just starts VM installation process and crashes within 1 minute. Installation process takes at least 15 minutes. Crash occurs in between kernel boot and before anaconda install starts. > but it crashed in the subsequent boot that is used to create the actual > image. Subsequent boot fails because VM install was incomplete.
(In reply to Dan Horák from comment #15) > (In reply to Dusty Mabe from comment #14) > > (In reply to Dan Horák from comment #13) > > > And the answer is plugins :-) > > > > > > from https://fedoramagazine.org/building-fedora-rawhide-images-imagefactory/ > > > sudo dnf install imagefactory imagefactory-plugins-TinMan > > > imagefactory-plugins-Docker > > > > does that mean you were able to do an imagefactory run? did you observe the > > issue documented in this BZ? > > yes, I were able to run imagefactory, and no, I haven't seen the crash yet, > but I have tried in slightly different environments - F-27 running on > physical host (so no nested virt) and in F-28 based VM. Need to check with > F-27 VM too. Imagefactory on ppc64le works fine for me on F27 guest locally. It fails locally on F28 and F29 guest. Our Fedora koji builders uses F27 guest for building images but no idea why it is failing. See comment https://pagure.io/releng/issue/7326#comment-496085 .
So finally I was able to reproduce the crash when running imagefactory. Seems the trigger could be nested virt with # of vcpu > # of host cpu, where host is a VM. Together with SMT threads vs. virtual/physical cores or something like that.
the crashing instruction is "Message Send Privileged" if my decode is correct 0: 1c 49 00 7c msgsndp r9
hopefully I have a complete set of instructions for reproducing the issue - take a power8 host, in my case it's 8247-21L with single 10-core P8, running F-27 ppc64le - install F-28 guest (2 CPUs, 16 GB RAM, 20GB disk) using Fedora-Server-netinst-ppc64le-28-20180319.n.0, all default, I have used virt-manager to do that, it was the only guest on the host - log into the new guest - dnf install imagefactory imagefactory-plugins-TinMan imagefactory-plugins-Docker - we are replicating https://koji.fedoraproject.org/koji/taskinfo?taskID=25835596 - wget https://kojipkgs.fedoraproject.org//work/tasks/5596/25835596/tdl-ppc64le.xml - wget https://kojipkgs.fedoraproject.org//work/tasks/5596/25835596/koji-f28-build-25835596-base.ks - vi /etc/oz/oz.cfg and change cpus = 2 and memory = 3072 - systemctl start libvirtd - imagefactory --verbose --debug base_image --file-parameter install_script koji-f28-build-25835596-base.ks tdl-ppc64le.xml --parameter offline_icicle true - you should now get the fail - dmesg will show ... [ 669.200479] Couldn't emulate instruction 0x7c00491c (op 31 xop 142) [ 669.200533] kvmppc_exit_pr_progint: emulation at c00000000004a190 failed (7c00491c) ...
as no such problem exists when running imagefactory on physical host directly, I would call the issue "nested virt with multi-cpu guest problem"
This message is a reminder that Fedora 28 is nearing its end of life. On 2019-May-28 Fedora will stop maintaining and issuing updates for Fedora 28. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '28'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 28 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 28 changed to end-of-life (EOL) status on 2019-05-28. Fedora 28 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
This bug still occur in F30, Power9 nested vm. Re-opening it
CCing some kvm/qemu ppc folks. power9 nested VM crash in Fedora 28 through Fedora 30 that apparently didn't exist in older Fedora? Reproducer in comment #21.
The errors in comment 2 suggest that KVM PR is in use. But KVM PR won't (and won't ever) work for a POWER9 guest. If you want nested POWER9 guests you'll need to use the nested KVM HV support.
This message is a reminder that Fedora 30 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 30 on 2020-05-26. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '30'. Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version. Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 30 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged change the 'version' to a later Fedora version prior this bug is closed as described in the policy above. Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 30 changed to end-of-life (EOL) status on 2020-05-26. Fedora 30 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days