Bug 1685358
| Summary: | Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 9 | Reporter: | Yanan Fu <yfu> |
| Component: | dracut | Assignee: | Pavel Valena <pvalena> |
| Status: | CLOSED ERRATA | QA Contact: | Yanan Fu <yfu> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | unspecified | CC: | bdas, chayang, coli, coughlan, dracut-maint-list, dtardon, jen, jinzhao, juzhang, knoel, kwolf, mhou, pvalena, qinwang, rbalakri, virt-maint, xuwei, yanghliu, yfu |
| Target Milestone: | rc | Keywords: | Bugfix, TestOnly, Triaged |
| Target Release: | --- | Flags: | pm-rhel: mirror+ |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | dracut-057-13.git20220816.el9 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1717323 | Environment: | |
| Last Closed: | 2022-08-17 18:46:04 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 2066816 | | |
| Bug Blocks: | 1717323 | | |
| Attachments: | | | |

(In reply to Yanan Fu from comment #0)
> Created attachment 1540799
> Test log- serial log, debug log, screendumps
>
> Description of problem:
> Reboot rhel8.0 guest repeatedly hit kernel panic:
>
> 2019-03-04 08:33:18: [ 16.544142] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
> 2019-03-04 08:33:18: [ 16.544142]
> 2019-03-04 08:33:18: [ 16.547177] CPU: 2 PID: 1 Comm: shutdown Not tainted 4.18.0-75.el8.x86_64 #1
> 2019-03-04 08:33:18: [ 16.548884] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8+2529+a9686a4d 04/01/2014
> 2019-03-04 08:33:18: [ 16.550764] Call Trace:
> 2019-03-04 08:33:18: [ 16.551801] dump_stack+0x5c/0x80
> 2019-03-04 08:33:18: [ 16.553082] panic+0xe7/0x247
> 2019-03-04 08:33:18: [ 16.554199] do_exit.cold.22+0x26/0xc1
> 2019-03-04 08:33:18: [ 16.555351] do_group_exit+0x3a/0xa0
> 2019-03-04 08:33:18: [ 16.556411] __x64_sys_exit_group+0x14/0x20
> 2019-03-04 08:33:18: [ 16.557548] do_syscall_64+0x5b/0x1b0
> 2019-03-04 08:33:18: [ 16.558622] entry_SYSCALL_64_after_hwframe+0x65/0xca
> 2019-03-04 08:33:18: [ 16.559859] RIP: 0033:0x7fae609f3e2e
> 2019-03-04 08:33:18: [ 16.560901] Code: Bad RIP value.
> 2019-03-04 08:33:18: [ 16.561896] RSP: 002b:00007ffcb24a39d8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
> 2019-03-04 08:33:18: [ 16.563495] RAX: ffffffffffffffda RBX: 00007fae609fc528 RCX: 00007fae609f3e2e
> 2019-03-04 08:33:18: [ 16.565043] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
> 2019-03-04 08:33:18: [ 16.566597] RBP: 00007fae60c02e00 R08: 00000000000000e7 R09: 00007ffcb24a38e8
> 2019-03-04 08:33:18: [ 16.568168] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
> 2019-03-04 08:33:18: [ 16.569723] R13: 0000000000000001 R14: 00007fae60c02e40 R15: 00007fae60c02e30
>
> Version-Release number of selected component (if applicable):
> host kernel: kernel-4.18.0-72.el8.x86_64
> qemu-kvm: qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64 (2.12.0-62 also have this problem)
> guest-kernel: kernel-4.18.0-75.el8.x86_64 (4.18.0-72 also have this problem)
>
> How reproducible:
> 1/20
>
> Steps to Reproduce:
> 1. Boot a RHEL8 VM
> 2. execute "shutdown -r now" after login vm
> 3. repeat step2 after guest bootup repeatedly.
>
> Actual results:
> kernel panic during reboot.
>
> Expected results:
> no panic, guest work well
>
> Additional info:
> 1. Same host kernel version, guest kernel version:
>

Is the rest of the guest userspace the same ?

It looks like shutdown is executed when init is still bringing up the system. Although I believe init should ignore any SIGTERM or SIGKILL. Is the shutdown scripted/automated ?

(In reply to Bandan Das from comment #1)
> >
> > Additional info:
> > 1. Same host kernel version, guest kernel version:
> >
>
> Is the rest of the guest userspace the same ?

Yes, it is the same. I tried with the same guest image and only changed the qemu-kvm version.
Let me update with the latest result:
I reran it another 50 times last night with "qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64" and hit this issue too.

>
> It looks like shutdown is executed when init is still bringing up the system.
> Although I believe init should ignore any SIGTERM or SIGKILL.
> Is the shutdown scripted/automated ?

Yes, it is an automation case. I checked the whole logic:
1. After the guest boots up, log in to the VM and get a session.
2. Send "shutdown -r now" through the session.
3. Check that the guest goes down successfully.
4. Check that the guest boots up successfully.
5. Repeat steps 1~4.

A little more verbose output:

2019-03-22 14:44:58: [ 18.504311] [3492]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [ 18.510911] [3492]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [ 18.515469] [3493]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [ 18.521877] [3493]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [ 18.527759] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
2019-03-22 14:44:58: [ 18.530465] systemd-shutdown[1]: Deactivating swaps.
2019-03-22 14:44:58: [ 18.533060] systemd-shutdown[1]: All swaps deactivated.
2019-03-22 14:44:58: [ 18.535461] systemd-shutdown[1]: Detaching loop devices.
2019-03-22 14:44:58: [ 18.538151] systemd-shutdown[1]: All loop devices detached.
2019-03-22 14:44:58: [ 18.540221] systemd-shutdown[1]: Detaching DM devices.
2019-03-22 14:44:58: [ 18.553508] [3494]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [ 18.558148] [3494]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [ 18.598036] shutdown: 9 output lines suppressed due to ratelimiting
2019-03-22 14:44:58: [ 18.603021] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
2019-03-22 14:44:58: [ 18.603021]
2019-03-22 14:44:58: [ 18.608125] CPU: 2 PID: 1 Comm: shutdown Not tainted 4.18.0-80.el8_panic.x86_64 #1
2019-03-22 14:44:58: [ 18.610101] Hardware name: Red Hat KVM, BIOS 1.12.0-1.module+el8+2706+3c6581b6 04/01/2014
2019-03-22 14:44:58: [ 18.612152] Call Trace:
2019-03-22 14:44:58: [ 18.613342] dump_stack+0x5c/0x80
2019-03-22 14:44:58: [ 18.614660] panic+0xe7/0x247
2019-03-22 14:44:58: [ 18.615911] do_exit.cold.22+0x26/0xc1
2019-03-22 14:44:58: [ 18.617321] do_group_exit+0x3a/0xa0
2019-03-22 14:44:58: [ 18.618634] __x64_sys_exit_group+0x14/0x20
2019-03-22 14:44:58: [ 18.620031] do_syscall_64+0x5b/0x1b0
2019-03-22 14:44:58: [ 18.621402] entry_SYSCALL_64_after_hwframe+0x65/0xca
2019-03-22 14:44:58: [ 18.622912] RIP: 0033:0x7f7524a81e2e
2019-03-22 14:44:58: [ 18.624218] Code: Bad RIP value.

So, device mapper isn't done but we are shutting down anyway. I don't think that would cause a panic but the underlying cause for why remounting failed is probably the culprit. The usual suspect is a sync() that hasn't completed. It's definitely a bug if it still exists but just to be sure I will try a newer kernel as well as upstream and see if it makes a difference.

Also hit it on RHEL8.1.0.
Versions:
kernel-4.18.0-80.19.el8.x86_64
qemu-kvm-2.12.0-67.module+el8.1.0+3088+c3b61d6f

One more question about the trace function: is "scsi_*" enough? Like this:
# trace-cmd record -p function -l "scsi_*"

Hi Karen and Bandan,

Thanks for your reply.
Enlarging "TimeoutSec" in "umount.target" is OK for QE, but there may be some risks:
1. "300s" may not be long enough sometimes; we could still hit this issue and cause the gating test to fail.
2. We may miss product BZs after enlarging the timeout, since this workaround will be applied in our automation script after installation finishes, so all of the cases will run with the modified guest, not only the "reboot" test.
From the developer's perspective, please help check if these risks are acceptable for our product.
If it is OK, we can use this method as a workaround at the current stage.
Many thanks!

Best regards
Yanan Fu

(In reply to Yanan Fu from comment #27)
> Hi Karen and Bandan,
>
> Thanks for your reply.
> Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have some risk:
> 1. "300s" is not long enough sometimes, we still can hit this issue, and cause gating test failed.

Ok, I think I misread your comments about the outcome of using TimeoutSec. I understood from your comments that we do *not* hit the issue if we use the parameter. If we are still hitting it, obviously, it cannot be a workaround.

> 2. Miss product bz after enlarging timeout, since this workaround will be used in our automation script after installation finished, all of the cases will run with the modified guest, not only "reboot" test.
>

I think this should be doable. Can you not script a new install to change the Timeout only for a reboot test ?

> From developer's perspective, please help check if these risks are acceptable for our product.
> If it is ok, we can use this method as a workaround at current stage.
> Many thanks!
>

Here's what I would suggest:
Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic, we can remove this as a test blocker but we can continue investigating this bug by collecting traces of qemu scsi functions.
Once we root cause this and have a fix, you can go back to removing TimeoutSec altogether.

>
> Best regards
> Yanan Fu

(In reply to Bandan Das from comment #28)
> (In reply to Yanan Fu from comment #27)
> > Hi Karen and Bandan,
> >
> > Thanks for your reply.
> > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have some risk:
> > 1. "300s" is not long enough sometimes, we still can hit this issue, and cause gating test failed.
>
> Ok, I think I misread your comments about the outcome of using TimeoutSec.
> I understood from your comments that we do *not* hit the issue if we use the parameter.
> If we are still hitting it, obviously, it cannot be a workaround.

This is about the potential risk, not an actual test result. With "300s", I really can't reproduce this issue in my test now. But there is a risk that "300s" may not be enough and cause the gating test to fail.

>
> > 2. Miss product bz after enlarging timeout, since this workaround will be used in our automation script after installation finished, all of the cases will run with the modified guest, not only "reboot" test.
> >
> I think this should be doable. Can you not script a new install to change the Timeout only for a reboot test ?

I am sorry, modifying only the "reboot" automation case is achievable, but we cannot do that, because this issue also affects other automation cases that need to reboot the VM; that is why we marked it as "blocker" before. If we only modify "reboot", other cases can still fail because of this issue and fail the gating test.

>
> > From developer's perspective, please help check if these risks are acceptable for our product.
> > If it is ok, we can use this method as a workaround at current stage.
> > Many thanks!
> >
>
> Here's what I would suggest:
> Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic, we can remove this as a test blocker but we can continue investigating this bug by collecting traces of qemu scsi functions.
> Once we root cause this and have a fix, you can go back to removing TimeoutSec altogether.
>
> >
> > Best regards
> > Yanan Fu

(In reply to Bandan Das from comment #28)
> (In reply to Yanan Fu from comment #27)
> > Hi Karen and Bandan,
> >
> > Thanks for your reply.
> > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have some risk:
> > 1. "300s" is not long enough sometimes, we still can hit this issue, and cause gating test failed.
>
> Ok, I think I misread your comments about the outcome of using TimeoutSec.
> I understood from your comments that we do *not* hit the issue if we use the parameter.
> If we are still hitting it, obviously, it cannot be a workaround.

Hi Bandan,

According to QE testing (over hundreds of runs), 300s is enough, so we think it could be a workaround.

But there may still be a risk that in some situations 300s is not enough and causes the gating test to fail, for example with a compose (ISO) update or on different systems.
We cannot say 300s is always enough for all situations; even if the risk is low, we still cannot guarantee it 100%.

So QE would like to double-confirm whether such a risk is acceptable to the developers. Hope you can understand.

>
> > 2. Miss product bz after enlarging timeout, since this workaround will be used in our automation script after installation finished, all of the cases will run with the modified guest, not only "reboot" test.
> >
> I think this should be doable. Can you not script a new install to change the Timeout only for a reboot test ?

As Yanan mentioned, many cases in the gating test call the reboot function, so updating only the reboot case may not work; we could still hit this issue in other cases.

>
> > From developer's perspective, please help check if these risks are acceptable for our product.
> > If it is ok, we can use this method as a workaround at current stage.
> > Many thanks!
> >
>
> Here's what I would suggest:
> Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic, we can remove this as a test blocker but we can continue investigating this bug by collecting traces of qemu scsi functions.
> Once we root cause this and have a fix, you can go back to removing TimeoutSec altogether.

Agree.
Based on QE testing, the workaround works well currently.
QE agrees to remove TestBlocker if the risk I mentioned above is acceptable to the developers.
We can use this workaround until we get a fix.

Thanks.

(In reply to CongLi from comment #30)
> (In reply to Bandan Das from comment #28)
> > (In reply to Yanan Fu from comment #27)
> > > Hi Karen and Bandan,
> > >
> > > Thanks for your reply.
> > > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have some risk:
> > > 1. "300s" is not long enough sometimes, we still can hit this issue, and cause gating test failed.
> >
> > Ok, I think I misread your comments about the outcome of using TimeoutSec.
> > I understood from your comments that we do *not* hit the issue if we use the parameter. If we are still hitting it, obviously, it cannot be a workaround.
>
> Hi Bandan,
>
> According to QE testing (over hundreds of times), 300s is enough, we think it could be a workaround.
>
> But there still may risk in some situations that 300s is not enough and cause gating test failed, like compose(iso) update, different systems...
> We could not say 300s is always enough for all situations, even it's low risk, we still can not 100% guarantee.
>
> So QE would like to double confirm if such risk is acceptable for developer, hope you can understand.
>
> >
> > > 2. Miss product bz after enlarging timeout, since this workaround will be used in our automation script after installation finished, all of the cases will run with the modified guest, not only "reboot" test.
> > >
> > I think this should be doable. Can you not script a new install to change the Timeout only for a reboot test ?
>
> As Yanan mentioned, there are many cases in gating test call reboot function, so only updating reboot case may does not work, we still could meet this issue in other cases.
>
> >
> > > From developer's perspective, please help check if these risks are acceptable for our product.
> > > If it is ok, we can use this method as a workaround at current stage.
> > > Many thanks!
> > >
> >
> > Here's what I would suggest:
> > Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic, we can remove this as a test blocker but we can continue investigating this bug by collecting traces of qemu scsi functions.
> > Once we root cause this and have a fix, you can go back to removing TimeoutSec altogether.
>
> Agree.
> Based on QE testing, the workaround works well currently.
> QE agree remove TestBlocker if the risk I mentioned above is acceptable for developer.
> We can use this workaround until we get a fix.
>

Hi everyone,

Thank you very much for the clarification. Yes, I think this is acceptable. I have spoken about this to both Rick and Karen and they both agree as well.

So:
1. Remove TestBlocker.
2. Set up Qemu tracing and gather results.
3. Set up a local reproducer.

I really want to work on 3 to rule out this being an issue with your setup. I remember trying this in the past with instructions from Yanan but was not able to reproduce it. I will grab a system from beaker and ping him again to help me with the setup.

> Thanks.

Based on comment 31, remove 'TestBlocker' keyword.
Thanks.

Created attachment 1639406
debug info

QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Same with the bz 1717323 (for Advanced Virtualization).

QE can still hit this issue from time to time with RHEL8.3, RHEL8.4, and also with RHEL9; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1922896

Bulk update: Move RHEL8 bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Hello Folks
I hit this issue on the real-time kernel. It occurs when I try to start a guest. Here are my reproduction steps.
qemu and libvirt version:
qemu-kvm-7.0.0-6.el9.x86_64
libvirt-8.4.0-2.el9.x86_64
host kernel version: 5.14.0-109.rt21.109.el9.x86_64
1. Prepare a rhel9.1 guest (kernel version: 5.14.0-114.el9.x86_64).
2. create a guest xml as below.
# cat rhel9.1.xml
<domain type="kvm">
<name>rhel9.1</name>
<memory unit="KiB">8388608</memory>
<currentMemory unit="KiB">8388608</currentMemory>
<memoryBacking>
<hugepages>
<page size="1048576" unit="KiB" />
</hugepages>
<access mode="shared" />
</memoryBacking>
<vcpu placement="static">3</vcpu>
<resource>
<partition>/machine</partition>
</resource>
<os>
<type arch="x86_64" machine="q35">hvm</type>
<boot dev="hd" />
</os>
<features>
<acpi />
<pmu state="off" />
<vmport state="off" />
<ioapic driver="qemu" />
</features>
<clock offset="utc">
<timer name="rtc" tickpolicy="catchup" />
<timer name="pit" tickpolicy="delay" />
<timer name="hpet" present="no" />
</clock>
<on_poweroff>destroy</on_poweroff>
<on_reboot>restart</on_reboot>
<on_crash>restart</on_crash>
<pm>
<suspend-to-mem enabled="no" />
<suspend-to-disk enabled="no" />
</pm>
<devices>
<emulator>/usr/libexec/qemu-kvm</emulator>
<disk type="file" device="disk">
<driver name="qemu" type="qcow2" />
<source file="/root/rhel9.1-latest.qcow2" />
<target dev="vda" bus="virtio" />
<address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0" />
</disk>
<controller type="usb" index="0" model="none" />
<controller type="pci" index="0" model="pcie-root" />
<controller type="pci" index="1" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="1" port="0x10" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" />
</controller>
<controller type="pci" index="2" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="2" port="0x11" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" />
</controller>
<controller type="pci" index="3" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="3" port="0x8" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0" />
</controller>
<controller type="pci" index="4" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="4" port="0x9" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" />
</controller>
<controller type="pci" index="5" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="5" port="0xa" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0" />
</controller>
<controller type="pci" index="6" model="pcie-root-port">
<model name="pcie-root-port" />
<target chassis="6" port="0xb" />
<address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0" />
</controller>
<controller type="sata" index="0">
<address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2" />
</controller>
<interface type="bridge">
<mac address="52:54:00:bb:63:7e" />
<source bridge="virbr0" />
<model type="virtio" />
<address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0" />
</interface>
<serial type="pty">
<target type="isa-serial" port="0">
<model name="isa-serial" />
</target>
</serial>
<console type="pty">
<target type="serial" port="0" />
</console>
<input type="mouse" bus="ps2" />
<input type="keyboard" bus="ps2" />
<graphics type="vnc" port="-1" autoport="yes" listen="0.0.0.0">
<listen type="address" address="0.0.0.0" />
</graphics>
<video>
<model type="cirrus" vram="16384" heads="1" primary="yes" />
<address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0" />
</video>
<memballoon model="virtio">
<address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0" />
</memballoon>
<iommu model="intel">
<driver intremap="on" caching_mode="on" iotlb="on" />
</iommu>
</devices>
<seclabel type="dynamic" model="selinux" relabel="yes" />
</domain>
3. Start the guest; the console shows the call trace below.
# virsh define rhel9.1.xml
# virsh console rhel9.1
Fatal glibc error: CPU does not support x86-64-v2
[ 3.202929] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[ 3.204111] CPU: 0 PID: 1 Comm: init Not tainted 5.14.0-114.el9.x86_64 #1
[ 3.205148] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.0-3.el9 04/01/2014
[ 3.206182] Call Trace:
[ 3.206562] dump_stack_lvl+0x34/0x44
[ 3.207133] panic+0x102/0x2d4
[ 3.207609] do_exit.cold+0x87/0x9f
[ 3.208151] do_group_exit+0x33/0xa0
[ 3.208699] __x64_sys_exit_group+0x14/0x20
[ 3.209341] do_syscall_64+0x5c/0x80
[ 3.209888] ? do_writev+0x6b/0x110
[ 3.210431] ? syscall_exit_to_user_mode+0x12/0x30
[ 3.211166] ? do_syscall_64+0x69/0x80
[ 3.211737] ? syscall_exit_to_user_mode+0x12/0x30
[ 3.212471] ? do_syscall_64+0x69/0x80
[ 3.213048] ? exc_page_fault+0x62/0x140
[ 3.213644] ? asm_exc_page_fault+0x8/0x30
[ 3.214274] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 3.215124] RIP: 0033:0x7f8e515eb311
[ 3.215670] Code: c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa be e7 00 00 00 ba 3c 00 00 00 eb 0d 89 d0 0f 05 48 3d 00 f0 ff ff 77 1c f4 89 f0 0f 05 <48> 3d 00 f0 ff ff 76 e7 f7 d8 89 05 bf fe 00 00 eb dd 0f 1f 44 00
[ 3.218493] RSP: 002b:00007ffe2839da58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[ 3.219639] RAX: ffffffffffffffda RBX: 00007f8e515e6050 RCX: 00007f8e515eb311
[ 3.220714] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 000000000000007f
[ 3.221788] RBP: 0000565470979040 R08: 00007ffe2839d5c9 R09: 0000000000000000
[ 3.222865] R10: 00000000ffffffff R11: 0000000000000246 R12: 000000000000000d
[ 3.223939] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[ 3.225167] Kernel Offset: 0x12200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3.228731] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 ]---
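For reference, the exitcode value in the panic line can be decoded by hand. This is a minimal sketch, assuming the kernel reports it in the usual wait-status layout (exit status in bits 8-15, terminating signal in the low 7 bits). `decode_exitcode` is a name made up for this illustration, not something from the kernel or from this report.

```python
def decode_exitcode(raw: int) -> dict:
    """Split a wait-status-style value into its parts.

    Assumption: bits 8-15 hold the exit status, bits 0-6 the signal
    number, and bit 7 the core-dump flag.
    """
    return {
        "exit_status": (raw >> 8) & 0xFF,
        "signal": raw & 0x7F,
        "core_dumped": bool(raw & 0x80),
    }


decoded = decode_exitcode(0x00007F00)
print(decoded)  # {'exit_status': 127, 'signal': 0, 'core_dumped': False}
```

Under that assumption, 0x00007f00 means PID 1 exited on its own with status 127 rather than being killed by a signal, which is consistent with RDI=0x7f and the `__x64_sys_exit_group` frame in the trace above.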
qemu command line:
/usr/libexec/qemu-kvm -name guest=rhel9.1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-4-rhel9.1/master-key.aes"} -machine pc-q35-rhel9.0.0,usb=off,vmport=off,dump-guest-core=off,kernel_irqchip=split,memory-backend=pc.ram -accel kvm -cpu qemu64,pmu=off -m 8192 -object {"qom-type":"memory-backend-file","id":"pc.ram","mem-path":"/dev/hugepages/libvirt/qemu/4-rhel9.1","share":true,"x-use-canonical-path-for-ramblock-id":false,"prealloc":true,"size":8589934592} -overcommit mem-lock=off -smp 3,sockets=3,cores=1,threads=1 -uuid 39c5ac63-050f-4e18-b895-54c21fae2a1a -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=24,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"intel-iommu","intremap":"on","caching-mode":true,"device-iotlb":true} -device {"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x3"} -device {"driver":"pcie-root-port","port":8,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x4"} -device {"driver":"pcie-root-port","port":9,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x5"} -device {"driver":"pcie-root-port","port":10,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x6"} -device {"driver":"pcie-root-port","port":11,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x7"} -blockdev {"driver":"file","filename":"/root/rhel9.1-latest.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null} -device 
{"driver":"virtio-blk-pci","bus":"pci.1","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1} -netdev tap,fd=27,vhost=on,vhostfd=29,id=hostnet0 -device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:bb:63:7e","bus":"pci.2","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -audiodev {"id":"audio1","driver":"none"} -vnc 0.0.0.0:1,audiodev=audio1 -device {"driver":"cirrus-vga","id":"video0","bus":"pci.5","addr":"0x0"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.6","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
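The "Fatal glibc error: CPU does not support x86-64-v2" message, together with `-cpu qemu64` in the command line above, suggests that the guest CPU model lacks features the guest userspace expects. Below is a rough sketch for checking this from inside a Linux guest, assuming the x86-64-v2 level corresponds to the /proc/cpuinfo flag names listed in the code. The helper names and any sample flag strings are hypothetical and not taken from this report.

```python
from typing import List

# /proc/cpuinfo names for the x86-64-v2 feature level
# (CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3).
X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}


def missing_v2_flags(flags_line: str) -> List[str]:
    """Return the x86-64-v2 flags that are absent from a cpuinfo 'flags' line."""
    present = set(flags_line.split(":", 1)[-1].split())
    return sorted(X86_64_V2_FLAGS - present)


def read_cpu_flags(path: str = "/proc/cpuinfo") -> str:
    """Return the first 'flags' line from cpuinfo, or an empty string if unavailable."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("flags"):
                    return line
    except OSError:
        pass
    return ""


if __name__ == "__main__":
    missing = missing_v2_flags(read_cpu_flags())
    print("missing x86-64-v2 flags:", missing or "none")
```

If flags turn out to be missing, one thing to try is giving the guest a richer CPU model. The domain XML above has no `<cpu>` element, so something like `<cpu mode="host-model"/>` may be worth testing.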
I uploaded my guest image to this URL [1]. The guest kernel is 5.14.0-114.el9.

[1] http://netqe-bj.usersys.redhat.com/share/mhou/image/rhel9.1-latest.qcow2

Hello,

please test, if you can, whether the planned rebase for 9.1 works for you.

RPMS: https://github.com/pvalena/rpms/tree/main/dracut/2066816

(In reply to Pavel Valena from comment #53)
> Hello, please test, if you can, whether the planned rebase for 9.1 works for you.
>
> RPMS: https://github.com/pvalena/rpms/tree/main/dracut/2066816

Hi Pavel,

Thanks for your scratch build, I will test with it.
This is a probabilistic problem, so we may need a bit more time to verify it. Hope you can understand, thanks a lot!

Best regards
Yanan Fu

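Since the panic is intermittent, it may help to estimate how many clean reboots are needed before a fix can be trusted. This is a back-of-the-envelope sketch, assuming each reboot is an independent trial and using the 1/20 reproduction rate from the original description (the real rate may differ between setups). The function names are made up for this illustration.

```python
import math


def p_no_panic(rate: float, reboots: int) -> float:
    """Probability of seeing zero panics in `reboots` tries if the bug is still present."""
    return (1.0 - rate) ** reboots


def reboots_needed(rate: float, confidence: float) -> int:
    """Smallest number of clean reboots after which a still-present bug
    would have shown up at least once with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - rate))


rate = 1 / 20  # assumed per-reboot reproduction rate
print(f"P(no panic in 50 reboots) = {p_no_panic(rate, 50):.3f}")            # 0.077
print(f"clean reboots for 99% confidence = {reboots_needed(rate, 0.99)}")  # 90
```

Under these assumptions, a 50-reboot run passes by chance roughly 8% of the time even with the bug present, so a longer run gives a stronger signal.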
Created attachment 1540799
Test log- serial log, debug log, screendumps

Description of problem:
Reboot rhel8.0 guest repeatedly hit kernel panic:

2019-03-04 08:33:18: [ 16.544142] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
2019-03-04 08:33:18: [ 16.544142]
2019-03-04 08:33:18: [ 16.547177] CPU: 2 PID: 1 Comm: shutdown Not tainted 4.18.0-75.el8.x86_64 #1
2019-03-04 08:33:18: [ 16.548884] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8+2529+a9686a4d 04/01/2014
2019-03-04 08:33:18: [ 16.550764] Call Trace:
2019-03-04 08:33:18: [ 16.551801] dump_stack+0x5c/0x80
2019-03-04 08:33:18: [ 16.553082] panic+0xe7/0x247
2019-03-04 08:33:18: [ 16.554199] do_exit.cold.22+0x26/0xc1
2019-03-04 08:33:18: [ 16.555351] do_group_exit+0x3a/0xa0
2019-03-04 08:33:18: [ 16.556411] __x64_sys_exit_group+0x14/0x20
2019-03-04 08:33:18: [ 16.557548] do_syscall_64+0x5b/0x1b0
2019-03-04 08:33:18: [ 16.558622] entry_SYSCALL_64_after_hwframe+0x65/0xca
2019-03-04 08:33:18: [ 16.559859] RIP: 0033:0x7fae609f3e2e
2019-03-04 08:33:18: [ 16.560901] Code: Bad RIP value.
2019-03-04 08:33:18: [ 16.561896] RSP: 002b:00007ffcb24a39d8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
2019-03-04 08:33:18: [ 16.563495] RAX: ffffffffffffffda RBX: 00007fae609fc528 RCX: 00007fae609f3e2e
2019-03-04 08:33:18: [ 16.565043] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
2019-03-04 08:33:18: [ 16.566597] RBP: 00007fae60c02e00 R08: 00000000000000e7 R09: 00007ffcb24a38e8
2019-03-04 08:33:18: [ 16.568168] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
2019-03-04 08:33:18: [ 16.569723] R13: 0000000000000001 R14: 00007fae60c02e40 R15: 00007fae60c02e30

Version-Release number of selected component (if applicable):
host kernel: kernel-4.18.0-72.el8.x86_64
qemu-kvm: qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64 (2.12.0-62 also have this problem)
guest-kernel: kernel-4.18.0-75.el8.x86_64 (4.18.0-72 also have this problem)

How reproducible:
1/20

Steps to Reproduce:
1. Boot a RHEL8 VM
2. execute "shutdown -r now" after login vm
3. repeat step2 after guest bootup repeatedly.

Actual results:
kernel panic during reboot.

Expected results:
no panic, guest work well

Additional info:
1. Same host kernel version, guest kernel version:
Test with fast train "qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64", repeat the automation case for 50 times (reboot 25 times in one case), didn't hit this issue.
Test with slow train "qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64", repeat automation case for 50 times, hit three times.
2. After hit this issue, boot vm with the same guest image, no problem.
3. Related log was added in attachment
4. Full qemu command line:
MALLOC_PERTURB_=1 /usr/libexec/qemu-kvm \
    -S \
    -name 'avocado-vt-vm1' \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_jn507cv_/monitor-qmpmonitor1-20190304-081632-4oGe4uXH,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_jn507cv_/monitor-catch_monitor-20190304-081632-4oGe4uXH,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idTo3L8N \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_jn507cv_/serial-serial0-20190304-081632-4oGe4uXH,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -chardev socket,id=seabioslog_id_20190304-081632-4oGe4uXH,path=/var/tmp/avocado_jn507cv_/seabios-20190304-081632-4oGe4uXH,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190304-081632-4oGe4uXH,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel80-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:fb:fc:fd:fe:ff,id=idlDJ4wH,vectors=4,netdev=idPP6wdA,bus=pci.0,addr=0x4 \
    -netdev tap,id=idPP6wdA,vhost=on,vhostfd=22,fd=14 \
    -m 7168 \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2 \
    -cpu 'Skylake-Server',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm