Bug 1685358

Summary: Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
Product: Red Hat Enterprise Linux 9
Reporter: Yanan Fu <yfu>
Component: dracut
Assignee: Pavel Valena <pvalena>
Status: CLOSED ERRATA
QA Contact: Yanan Fu <yfu>
Severity: urgent
Priority: high
Version: unspecified
CC: bdas, chayang, coli, coughlan, dracut-maint-list, dtardon, jen, jinzhao, juzhang, knoel, kwolf, mhou, pvalena, qinwang, rbalakri, virt-maint, xuwei, yanghliu, yfu
Keywords: Bugfix, TestOnly, Triaged
Flags: pm-rhel: mirror+
Hardware: x86_64
OS: Linux
Fixed In Version: dracut-057-13.git20220816.el9
Doc Type: If docs needed, set a value
Clones: 1717323
Last Closed: 2022-08-17 18:46:04 UTC
Type: Bug
Bug Depends On: 2066816
Bug Blocks: 1717323

Attachments:

Description	Flags
Test log- serial log, debug log, screendumps	none
debug info	none

Description Yanan Fu 2019-03-05 03:06:46 UTC

Created attachment 1540799 [details]
Test log- serial log, debug log, screendumps

Description of problem:
Rebooting a RHEL 8.0 guest repeatedly eventually hits a kernel panic:

2019-03-04 08:33:18: [   16.544142] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
2019-03-04 08:33:18: [   16.544142]
2019-03-04 08:33:18: [   16.547177] CPU: 2 PID: 1 Comm: shutdown Not tainted 4.18.0-75.el8.x86_64 #1
2019-03-04 08:33:18: [   16.548884] Hardware name: Red Hat KVM, BIOS 1.11.1-3.module+el8+2529+a9686a4d 04/01/2014
2019-03-04 08:33:18: [   16.550764] Call Trace:
2019-03-04 08:33:18: [   16.551801]  dump_stack+0x5c/0x80
2019-03-04 08:33:18: [   16.553082]  panic+0xe7/0x247
2019-03-04 08:33:18: [   16.554199]  do_exit.cold.22+0x26/0xc1
2019-03-04 08:33:18: [   16.555351]  do_group_exit+0x3a/0xa0
2019-03-04 08:33:18: [   16.556411]  __x64_sys_exit_group+0x14/0x20
2019-03-04 08:33:18: [   16.557548]  do_syscall_64+0x5b/0x1b0
2019-03-04 08:33:18: [   16.558622]  entry_SYSCALL_64_after_hwframe+0x65/0xca
2019-03-04 08:33:18: [   16.559859] RIP: 0033:0x7fae609f3e2e
2019-03-04 08:33:18: [   16.560901] Code: Bad RIP value.
2019-03-04 08:33:18: [   16.561896] RSP: 002b:00007ffcb24a39d8 EFLAGS: 00000202 ORIG_RAX: 00000000000000e7
2019-03-04 08:33:18: [   16.563495] RAX: ffffffffffffffda RBX: 00007fae609fc528 RCX: 00007fae609f3e2e
2019-03-04 08:33:18: [   16.565043] RDX: 000000000000007f RSI: 000000000000003c RDI: 000000000000007f
2019-03-04 08:33:18: [   16.566597] RBP: 00007fae60c02e00 R08: 00000000000000e7 R09: 00007ffcb24a38e8
2019-03-04 08:33:18: [   16.568168] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000002
2019-03-04 08:33:18: [   16.569723] R13: 0000000000000001 R14: 00007fae60c02e40 R15: 00007fae60c02e30
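
The `exitcode=0x00007f00` value in the panic line is a wait-status word rather than a plain exit code. As a minimal sketch, assuming the conventional Linux encoding (exit status in bits 8-15, terminating signal in the low 7 bits), it can be decoded as below. The helper name `decode_panic_exitcode` is made up for illustration.

```python
def decode_panic_exitcode(code: int) -> dict:
    """Split a Linux wait-status word into exit status and signal number."""
    return {
        "exit_status": (code >> 8) & 0xFF,  # bits 8-15: value passed to exit()
        "signal": code & 0x7F,              # bits 0-6: terminating signal, 0 if none
    }


decoded = decode_panic_exitcode(0x00007F00)
print(decoded)  # {'exit_status': 127, 'signal': 0}
```

Exit status 127 with no signal is consistent with the register dump above: ORIG_RAX is 0xe7 (exit_group on x86_64) and RDI, the first syscall argument, is 0x7f. In other words, PID 1 appears to have exited on its own with status 127 rather than being killed by a signal.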


Version-Release number of selected component (if applicable):
host kernel: kernel-4.18.0-72.el8.x86_64
qemu-kvm: qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64 (2.12.0-62 also has this problem)
guest-kernel: kernel-4.18.0-75.el8.x86_64 (4.18.0-72 also has this problem)


How reproducible:
1/20 

Steps to Reproduce:
1. Boot a RHEL 8 VM.
2. Log in to the VM and execute "shutdown -r now".
3. Repeat step 2 each time the guest finishes booting.

Actual results:
Kernel panic during reboot.

Expected results:
No panic; the guest works well.

Additional info:
1. With the same host kernel and guest kernel versions:

Tested with fast train "qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64": repeated the automation case 50 times (25 reboots per case) and didn't hit this issue.

Tested with slow train "qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64": repeated the automation case 50 times and hit it three times.

2. After hitting this issue, booting the VM again with the same guest image shows no problem.
3. Related logs are attached.
4. Full qemu command line:
MALLOC_PERTURB_=1  /usr/libexec/qemu-kvm \
    -S  \
    -name 'avocado-vt-vm1' \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/avocado_jn507cv_/monitor-qmpmonitor1-20190304-081632-4oGe4uXH,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/avocado_jn507cv_/monitor-catch_monitor-20190304-081632-4oGe4uXH,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=idTo3L8N  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/avocado_jn507cv_/serial-serial0-20190304-081632-4oGe4uXH,server,nowait \
    -device isa-serial,chardev=serial_id_serial0  \
    -chardev socket,id=seabioslog_id_20190304-081632-4oGe4uXH,path=/var/tmp/avocado_jn507cv_/seabios-20190304-081632-4oGe4uXH,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190304-081632-4oGe4uXH,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x3 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/home/kvm_autotest_root/images/rhel80-64-virtio-scsi.qcow2 \
    -device scsi-hd,id=image1,drive=drive_image1 \
    -device virtio-net-pci,mac=9a:fb:fc:fd:fe:ff,id=idlDJ4wH,vectors=4,netdev=idPP6wdA,bus=pci.0,addr=0x4  \
    -netdev tap,id=idPP6wdA,vhost=on,vhostfd=22,fd=14 \
    -m 7168  \
    -smp 4,maxcpus=4,cores=2,threads=1,sockets=2  \
    -cpu 'Skylake-Server',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm

Comment 1 Bandan Das 2019-03-05 18:12:43 UTC

(In reply to Yanan Fu from comment #0)
> Created attachment 1540799 [details]
> Test log- serial log, debug log, screendumps
> 
> Description of problem:
> Reboot rhel8.0 guest repeatedly hit kernel panic:
> 
> 2019-03-04 08:33:18: [   16.544142] Kernel panic - not syncing: Attempted to
> kill init! exitcode=0x00007f00
> 2019-03-04 08:33:18: [   16.544142]
> 2019-03-04 08:33:18: [   16.547177] CPU: 2 PID: 1 Comm: shutdown Not tainted
> 4.18.0-75.el8.x86_64 #1
> 2019-03-04 08:33:18: [   16.548884] Hardware name: Red Hat KVM, BIOS
> 1.11.1-3.module+el8+2529+a9686a4d 04/01/2014
> 2019-03-04 08:33:18: [   16.550764] Call Trace:
> 2019-03-04 08:33:18: [   16.551801]  dump_stack+0x5c/0x80
> 2019-03-04 08:33:18: [   16.553082]  panic+0xe7/0x247
> 2019-03-04 08:33:18: [   16.554199]  do_exit.cold.22+0x26/0xc1
> 2019-03-04 08:33:18: [   16.555351]  do_group_exit+0x3a/0xa0
> 2019-03-04 08:33:18: [   16.556411]  __x64_sys_exit_group+0x14/0x20
> 2019-03-04 08:33:18: [   16.557548]  do_syscall_64+0x5b/0x1b0
> 2019-03-04 08:33:18: [   16.558622]  entry_SYSCALL_64_after_hwframe+0x65/0xca
> 2019-03-04 08:33:18: [   16.559859] RIP: 0033:0x7fae609f3e2e
> 2019-03-04 08:33:18: [   16.560901] Code: Bad RIP value.
> 2019-03-04 08:33:18: [   16.561896] RSP: 002b:00007ffcb24a39d8 EFLAGS:
> 00000202 ORIG_RAX: 00000000000000e7
> 2019-03-04 08:33:18: [   16.563495] RAX: ffffffffffffffda RBX:
> 00007fae609fc528 RCX: 00007fae609f3e2e
> 2019-03-04 08:33:18: [   16.565043] RDX: 000000000000007f RSI:
> 000000000000003c RDI: 000000000000007f
> 2019-03-04 08:33:18: [   16.566597] RBP: 00007fae60c02e00 R08:
> 00000000000000e7 R09: 00007ffcb24a38e8
> 2019-03-04 08:33:18: [   16.568168] R10: 0000000000000000 R11:
> 0000000000000202 R12: 0000000000000002
> 2019-03-04 08:33:18: [   16.569723] R13: 0000000000000001 R14:
> 00007fae60c02e40 R15: 00007fae60c02e30
> 
> 
> Version-Release number of selected component (if applicable):
> host kernel: kernel-4.18.0-72.el8.x86_64
> qemu-kvm: qemu-kvm-2.12.0-63.module+el8+2833+c7d6d092.x86_64 (2.12.0-62 also
> have this problem)
> guest-kernel: kernel-4.18.0-75.el8.x86_64 (4.18.0-72 also have this problem)
> 
> 
> How reproducible:
> 1/20 
> 
> Steps to Reproduce:
> 1. Boot a RHEL8 VM
> 2. execute "shutdown -r now" after login vm
> 3. repeat step2 after guest bootup repeatedly.
> 
> Actual results:
> kernel panic during reboot.
> 
> Expected results:
> no panic, guest work well
> 
> Additional info:
> 1. Same host kernel version, guest kernel version:
> 

Is the rest of the guest userspace the same?

It looks like shutdown is executed while init is still bringing up the system, although I believe init should ignore any SIGTERM or SIGKILL.
Is the shutdown scripted/automated?

Comment 3 Yanan Fu 2019-03-06 06:49:35 UTC

(In reply to Bandan Das from comment #1)
> > 
> > Additional info:
> > 1. Same host kernel version, guest kernel version:
> > 
> 
> Is the rest of the guest userspace the same ?

Yes, it is the same. I tried with the same guest image and changed only the qemu-kvm version.

Let me update with the latest result:
I reran the test another 50 times last night with "qemu-kvm-3.1.0-18.module+el8+2834+fa8bb6e2.x86_64" and hit
this issue too.

> 
> It looks like shutdown is executed when init is still bringing up the
> system. Although I believe init should ignore any SIGTERM or SIGKILL. 
> Is the shutdown scripted/automated ?

Yes, it is an automation case. I checked the whole logic:
1. After the guest boots up, log in to the VM and get a session.
2. Send "shutdown -r now" through the session.
3. Check that the guest goes down successfully.
4. Check that the guest boots up successfully.
5. Repeat steps 1-4.
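
The loop described above can be sketched roughly as follows. This is not the actual automation code: `run_reboot_loop` and the three callables are hypothetical stand-ins for whatever session and monitoring helpers the real test harness provides.

```python
from typing import Callable


def run_reboot_loop(
    iterations: int,
    send_command: Callable[[str], None],
    wait_until_down: Callable[[], bool],
    wait_until_up: Callable[[], bool],
) -> int:
    """Reboot the guest `iterations` times.

    Returns the 1-based index of the first failing iteration,
    or 0 if every reboot succeeded.
    """
    for i in range(1, iterations + 1):
        send_command("shutdown -r now")  # step 2: issue the reboot in the session
        if not wait_until_down():        # step 3: guest must go down
            return i
        if not wait_until_up():          # step 4: guest must come back up
            return i
    return 0


# Dry run with stand-in callables that simulate a guest which fails to
# come back up on the third reboot.
boots = iter([True, True, False])
failed_at = run_reboot_loop(
    iterations=25,
    send_command=lambda cmd: None,
    wait_until_down=lambda: True,
    wait_until_up=lambda: next(boots),
)
print(failed_at)  # 3
```

Passing the guest operations in as callables keeps the loop itself testable without a VM; a real run would plug in functions that talk to the guest session and the serial console.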

Comment 9 Bandan Das 2019-03-22 19:20:45 UTC

A little more verbose output:
2019-03-22 14:44:58: [   18.504311] [3492]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [   18.510911] [3492]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [   18.515469] [3493]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [   18.521877] [3493]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [   18.527759] systemd-shutdown[1]: Not all file systems unmounted, 1 left.
2019-03-22 14:44:58: [   18.530465] systemd-shutdown[1]: Deactivating swaps.
2019-03-22 14:44:58: [   18.533060] systemd-shutdown[1]: All swaps deactivated.
2019-03-22 14:44:58: [   18.535461] systemd-shutdown[1]: Detaching loop devices.
2019-03-22 14:44:58: [   18.538151] systemd-shutdown[1]: All loop devices detached.
2019-03-22 14:44:58: [   18.540221] systemd-shutdown[1]: Detaching DM devices.
2019-03-22 14:44:58: [   18.553508] [3494]: Remounting '/' read-only in with options 'seclabel,attr2,inode64,noquota'.
2019-03-22 14:44:58: [   18.558148] [3494]: Failed to remount '/' read-only: Device or resource busy
2019-03-22 14:44:58: [   18.598036] shutdown: 9 output lines suppressed due to ratelimiting
2019-03-22 14:44:58: [   18.603021] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
2019-03-22 14:44:58: [   18.603021]
2019-03-22 14:44:58: [   18.608125] CPU: 2 PID: 1 Comm: shutdown Not tainted 4.18.0-80.el8_panic.x86_64 #1
2019-03-22 14:44:58: [   18.610101] Hardware name: Red Hat KVM, BIOS 1.12.0-1.module+el8+2706+3c6581b6 04/01/2014
2019-03-22 14:44:58: [   18.612152] Call Trace:
2019-03-22 14:44:58: [   18.613342]  dump_stack+0x5c/0x80
2019-03-22 14:44:58: [   18.614660]  panic+0xe7/0x247
2019-03-22 14:44:58: [   18.615911]  do_exit.cold.22+0x26/0xc1
2019-03-22 14:44:58: [   18.617321]  do_group_exit+0x3a/0xa0
2019-03-22 14:44:58: [   18.618634]  __x64_sys_exit_group+0x14/0x20
2019-03-22 14:44:58: [   18.620031]  do_syscall_64+0x5b/0x1b0
2019-03-22 14:44:58: [   18.621402]  entry_SYSCALL_64_after_hwframe+0x65/0xca
2019-03-22 14:44:58: [   18.622912] RIP: 0033:0x7f7524a81e2e
2019-03-22 14:44:58: [   18.624218] Code: Bad RIP value.

So, device mapper isn't done but we are shutting down anyway. I don't think that would cause a panic, but the underlying reason the
remount failed is probably the culprit. The usual suspect is a sync() that hasn't completed. It's definitely a bug if it still exists, but
just to be sure I will try a newer kernel as well as upstream and see if it makes a difference.
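
When many serial logs have to be checked for this pattern, a small scanner can help. The following is only a sketch: the two regular expressions are written against the message formats visible in the log above, and `summarize_shutdown_log` is a hypothetical helper, not part of any existing tool.

```python
import re

REMOUNT_FAIL = re.compile(r"Failed to remount '([^']+)' read-only: (.+)$")
PANIC = re.compile(r"Kernel panic - not syncing: (.+)$")


def summarize_shutdown_log(lines):
    """Count remount failures seen up to the first kernel panic line."""
    failures = 0
    for line in lines:
        if REMOUNT_FAIL.search(line):
            failures += 1
        panic = PANIC.search(line)
        if panic:
            return {"remount_failures": failures, "panic": panic.group(1)}
    return {"remount_failures": failures, "panic": None}


sample = [
    "[   18.510911] [3492]: Failed to remount '/' read-only: Device or resource busy",
    "[   18.527759] systemd-shutdown[1]: Not all file systems unmounted, 1 left.",
    "[   18.603021] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00",
]
summary = summarize_shutdown_log(sample)
print(summary)
```

A run that reports remount failures but no panic would be worth comparing against a panicking run, since the log above shows both together.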

Comment 10 Xueqiang Wei 2019-04-30 07:38:06 UTC

Also hit it on RHEL8.1.0.

Versions:
kernel-4.18.0-80.19.el8.x86_64
qemu-kvm-2.12.0-67.module+el8.1.0+3088+c3b61d6f

Comment 23 Yanan Fu 2019-06-11 07:21:29 UTC

One more question about the function trace: is "scsi_*" enough? Like this:
# trace-cmd record -p function -l "scsi_*"

Comment 27 Yanan Fu 2019-06-12 08:44:05 UTC

Hi Karen and Bandan,

Thanks for your reply. 
Enlarging "TimeoutSec" in "umount.target" is OK for QE, but there may be some risks:
1. "300s" may sometimes not be long enough; we could still hit this issue and cause gating tests to fail.
2. We may miss product bugs after enlarging the timeout, since this workaround will be applied in our automation
   script after installation finishes, so all of the cases will run with the modified guest, not only the
   "reboot" test.

From the developer's perspective, please help check whether these risks are acceptable for our product.
If they are, we can use this method as a workaround at the current stage.
Many thanks! 


Best regards
Yanan Fu

Comment 28 Bandan Das 2019-06-12 15:22:10 UTC

(In reply to Yanan Fu from comment #27)
> Hi Karen and Bandan,
> 
> Thanks for your reply. 
> Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have
> some risk:
> 1. "300s" is not long enough sometimes, we still can hit this issue, and
> cause gating test failed.

Ok, I think I misread your comments about the outcome of using TimeoutSec.
I understood from your comments that we do *not* hit the issue if we use the 
parameter. If we are still hitting it, obviously, it cannot be a workaround.
  
> 2. Miss product bz after enlarging timeout, since this workaround will be
> used in our automation
>    script after installation finished, all of the cases will run with the
> modified guest, not only
>    "reboot" test.
> 
I think this should be doable. Can you not script a new install to change the timeout
only for the reboot test?
 
> From developer's perspective, please help check if these risks are
> acceptable for our product.
> If it is ok, we can use this method as a workaround at current stage.
> Many thanks! 
> 

Here's what I would suggest:
Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic, we can remove this as 
a test blocker but we can continue investigating this bug by collecting traces of qemu scsi functions.
Once we root cause this and have a fix, you can go back to removing TimeoutSec altogether.

> 
> Best regards
> Yanan Fu

Comment 29 Yanan Fu 2019-06-13 07:52:55 UTC

(In reply to Bandan Das from comment #28)
> (In reply to Yanan Fu from comment #27)
> > Hi Karen and Bandan,
> > 
> > Thanks for your reply. 
> > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have
> > some risk:
> > 1. "300s" is not long enough sometimes, we still can hit this issue, and
> > cause gating test failed.
> 
> Ok, I think I misread your comments about the outcome of using TimeoutSec.
> I understood from your comments that we do *not* hit the issue if we use the 
> parameter. If we are still hitting it, obviously, it cannot be a workaround.


I was describing a potential risk, not an actual test result.
With "300s", I really can't reproduce this issue in my tests now.

But there is a risk that "300s" may not be enough, which would cause gating tests to fail.


>   
> > 2. Miss product bz after enlarging timeout, since this workaround will be
> > used in our automation
> >    script after installation finished, all of the cases will run with the
> > modified guest, not only
> >    "reboot" test.
> > 
> I think this should be doable. Can you not script a new install to change
> the Timeout
> only for a reboot test ?

I am sorry, modifying only the "reboot" automation case is achievable, but we cannot take that approach.
This issue also affects other automation cases that need to reboot the VM, which is why
we marked it as a blocker before.

If we modify only "reboot", other cases can still fail because of this issue and fail the gating
test.

>  
> > From developer's perspective, please help check if these risks are
> > acceptable for our product.
> > If it is ok, we can use this method as a workaround at current stage.
> > Many thanks! 
> > 
> 
> Here's what I would suggest:
> Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic,
> we can remove this as 
> a test blocker but we can continue investigating this bug by collecting
> traces of qemu scsi functions.
> Once we root cause this and have a fix, you can go back to removing
> TimeoutSec altogether.
> 
> > 
> > Best regards
> > Yanan Fu

Comment 30 CongLi 2019-06-13 11:30:53 UTC

(In reply to Bandan Das from comment #28)
> (In reply to Yanan Fu from comment #27)
> > Hi Karen and Bandan,
> > 
> > Thanks for your reply. 
> > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have
> > some risk:
> > 1. "300s" is not long enough sometimes, we still can hit this issue, and
> > cause gating test failed.
> 
> Ok, I think I misread your comments about the outcome of using TimeoutSec.
> I understood from your comments that we do *not* hit the issue if we use the 
> parameter. If we are still hitting it, obviously, it cannot be a workaround.

Hi Bandan,

According to QE testing (hundreds of runs), 300s is enough, and we think it
could be a workaround.

But there may still be a risk that in some situations 300s is not enough and causes
gating tests to fail, for example after a compose (ISO) update or on different systems.
We cannot say 300s is always enough for all situations; even if the risk is low, we still
cannot guarantee it 100%.

So QE would like to double-check whether such a risk is acceptable to the developers.
We hope you can understand.

>   
> > 2. Miss product bz after enlarging timeout, since this workaround will be
> > used in our automation
> >    script after installation finished, all of the cases will run with the
> > modified guest, not only
> >    "reboot" test.
> > 
> I think this should be doable. Can you not script a new install to change
> the Timeout
> only for a reboot test ?

As Yanan mentioned, many cases in the gating test call the reboot function,
so updating only the reboot case may not work; we could still meet this issue
in other cases.

>  
> > From developer's perspective, please help check if these risks are
> > acceptable for our product.
> > If it is ok, we can use this method as a workaround at current stage.
> > Many thanks! 
> > 
> 
> Here's what I would suggest:
> Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic,
> we can remove this as 
> a test blocker but we can continue investigating this bug by collecting
> traces of qemu scsi functions.
> Once we root cause this and have a fix, you can go back to removing
> TimeoutSec altogether.

Agreed.
Based on QE testing, the workaround currently works well.
QE agrees to remove TestBlocker if the risk I mentioned above is acceptable to the developers.
We can use this workaround until we get a fix.


Thanks.

Comment 31 Bandan Das 2019-06-13 20:09:36 UTC

(In reply to CongLi from comment #30)
> (In reply to Bandan Das from comment #28)
> > (In reply to Yanan Fu from comment #27)
> > > Hi Karen and Bandan,
> > > 
> > > Thanks for your reply. 
> > > Enlarge "TimeoutSec" in "umount.target" is ok for QE, but there may have
> > > some risk:
> > > 1. "300s" is not long enough sometimes, we still can hit this issue, and
> > > cause gating test failed.
> > 
> > Ok, I think I misread your comments about the outcome of using TimeoutSec.
> > I understood from your comments that we do *not* hit the issue if we use the 
> > parameter. If we are still hitting it, obviously, it cannot be a workaround.
> 
> Hi Bandan,
> 
> According to QE testing (over hundreds of times), 300s is enough, we think
> it 
> could be a workaround.
> 
> But there still may risk in some situations that 300s is not enough and
> cause 
> gating test failed, like compose(iso) update, different systems...
> We could not say 300s is always enough for all situations, even it's low
> risk, we still 
> can not 100% guarantee.
> 
> So QE would like to double confirm if such risk is acceptable for developer, 
> hope you can understand.
> 
> >   
> > > 2. Miss product bz after enlarging timeout, since this workaround will be
> > > used in our automation
> > >    script after installation finished, all of the cases will run with the
> > > modified guest, not only
> > >    "reboot" test.
> > > 
> > I think this should be doable. Can you not script a new install to change
> > the Timeout
> > only for a reboot test ?
> 
> As Yanan mentioned, there are many cases in gating test call reboot
> function, 
> so only updating reboot case may does not work, we still could meet this
> issue 
> in other cases.
> 
> >  
> > > From developer's perspective, please help check if these risks are
> > > acceptable for our product.
> > > If it is ok, we can use this method as a workaround at current stage.
> > > Many thanks! 
> > > 
> > 
> > Here's what I would suggest:
> > Do a reboot test with n=50 and TimeoutSec=300. If you don't hit the panic,
> > we can remove this as 
> > a test blocker but we can continue investigating this bug by collecting
> > traces of qemu scsi functions.
> > Once we root cause this and have a fix, you can go back to removing
> > TimeoutSec altogether.
> 
> Agree.
> Based on QE testing, the workaround works well currently. 
> QE agree remove TestBlocker if the risk I mentioned above is acceptable for
> developer.
> We can use this workaround until we get a fix.
> 
> 

Hi everyone, 

Thank you very much for the clarification. Yes, I think this is acceptable. I have spoken about this
to both Rick and Karen and they both agree as well. 

So:
1. Remove TestBlocker.
2. Set up QEMU tracing and gather results.
3. Set up a local reproducer.

I really want to work on 3 to rule out this being an issue with your setup.
I remember trying this in the past with instructions from Yanan but was not 
able to reproduce it. I will grab a system from beaker and ping him again to 
help me with the setup.


> Thanks.

Comment 32 CongLi 2019-06-17 01:11:15 UTC

Based on comment 31, removing the 'TestBlocker' keyword.

Thanks.

Comment 35 qing.wang 2019-11-25 08:50:48 UTC

Created attachment 1639406 [details]
debug info

Comment 36 Ademar Reis 2020-02-05 22:54:40 UTC

QEMU has been recently split into sub-components and as a one-time operation to avoid breakage of tools, we are setting the QEMU sub-component of this BZ to "General". Please review and change the sub-component if necessary the next time you review this BZ. Thanks

Comment 43 Yanan Fu 2021-03-01 03:44:57 UTC

Same as bz 1717323 (for Advanced Virtualization).

QE can still hit this issue from time to time with RHEL 8.3, RHEL 8.4,
and also RHEL 9; refer to https://bugzilla.redhat.com/show_bug.cgi?id=1922896

Comment 44 John Ferlan 2021-09-17 16:10:16 UTC

Bulk update: Move RHEL8 bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

Comment 50 mhou 2022-06-21 07:04:24 UTC

Hello Folks

I hit this issue on the real-time kernel. It occurs when I try to start a guest. Here are my reproduction steps.
qemu and libvirt version:
qemu-kvm-7.0.0-6.el9.x86_64
libvirt-8.4.0-2.el9.x86_64

host kernel version: 5.14.0-109.rt21.109.el9.x86_64

1. Prepare a RHEL 9.1 guest (kernel version: 5.14.0-114.el9.x86_64).
2. Create a guest XML as below.
# cat rhel9.1.xml 
<domain type="kvm">
  <name>rhel9.1</name>
  <memory unit="KiB">8388608</memory>
  <currentMemory unit="KiB">8388608</currentMemory>
  <memoryBacking>
    <hugepages>
      <page size="1048576" unit="KiB" />
    </hugepages>
    <access mode="shared" />
  </memoryBacking>
  <vcpu placement="static">3</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch="x86_64" machine="q35">hvm</type>
    <boot dev="hd" />
  </os>
  <features>
    <acpi />
    <pmu state="off" />
    <vmport state="off" />
    <ioapic driver="qemu" />
  </features>
  <clock offset="utc">
    <timer name="rtc" tickpolicy="catchup" />
    <timer name="pit" tickpolicy="delay" />
    <timer name="hpet" present="no" />
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <pm>
    <suspend-to-mem enabled="no" />
    <suspend-to-disk enabled="no" />
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="qcow2" />
      <source file="/root/rhel9.1-latest.qcow2" />
      <target dev="vda" bus="virtio" />
      <address type="pci" domain="0x0000" bus="0x01" slot="0x00" function="0x0" />
    </disk>
    <controller type="usb" index="0" model="none" />
    <controller type="pci" index="0" model="pcie-root" />
    <controller type="pci" index="1" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="1" port="0x10" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x02" function="0x0" />
    </controller>
    <controller type="pci" index="2" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="2" port="0x11" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x03" function="0x0" />
    </controller>
    <controller type="pci" index="3" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="3" port="0x8" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x04" function="0x0" />
    </controller>
    <controller type="pci" index="4" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="4" port="0x9" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x05" function="0x0" />
    </controller>
    <controller type="pci" index="5" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="5" port="0xa" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x06" function="0x0" />
    </controller>
    <controller type="pci" index="6" model="pcie-root-port">
      <model name="pcie-root-port" />
      <target chassis="6" port="0xb" />
      <address type="pci" domain="0x0000" bus="0x00" slot="0x07" function="0x0" />
    </controller>
    <controller type="sata" index="0">
      <address type="pci" domain="0x0000" bus="0x00" slot="0x1f" function="0x2" />
    </controller>
    <interface type="bridge">
      <mac address="52:54:00:bb:63:7e" />
      <source bridge="virbr0" />
      <model type="virtio" />
      <address type="pci" domain="0x0000" bus="0x02" slot="0x00" function="0x0" />
    </interface>
    <serial type="pty">
      <target type="isa-serial" port="0">
        <model name="isa-serial" />
      </target>
    </serial>
    <console type="pty">
      <target type="serial" port="0" />
    </console>
    <input type="mouse" bus="ps2" />
    <input type="keyboard" bus="ps2" />
    <graphics type="vnc" port="-1" autoport="yes" listen="0.0.0.0">
      <listen type="address" address="0.0.0.0" />
    </graphics>
    <video>
      <model type="cirrus" vram="16384" heads="1" primary="yes" />
      <address type="pci" domain="0x0000" bus="0x05" slot="0x00" function="0x0" />
    </video>
    <memballoon model="virtio">
      <address type="pci" domain="0x0000" bus="0x06" slot="0x00" function="0x0" />
    </memballoon>
    <iommu model="intel">
      <driver intremap="on" caching_mode="on" iotlb="on" />
    </iommu>
  </devices>
  <seclabel type="dynamic" model="selinux" relabel="yes" />
</domain>
3. Start the guest and get the call trace:
# virsh define rhel9.1.xml
# virsh console rhel9.1
Fatal glibc error: CPU does not support x86-64-v2
[    3.202929] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00
[    3.204111] CPU: 0 PID: 1 Comm: init Not tainted 5.14.0-114.el9.x86_64 #1
[    3.205148] Hardware name: Red Hat KVM/RHEL, BIOS 1.16.0-3.el9 04/01/2014
[    3.206182] Call Trace:
[    3.206562]  dump_stack_lvl+0x34/0x44
[    3.207133]  panic+0x102/0x2d4
[    3.207609]  do_exit.cold+0x87/0x9f
[    3.208151]  do_group_exit+0x33/0xa0
[    3.208699]  __x64_sys_exit_group+0x14/0x20
[    3.209341]  do_syscall_64+0x5c/0x80
[    3.209888]  ? do_writev+0x6b/0x110
[    3.210431]  ? syscall_exit_to_user_mode+0x12/0x30
[    3.211166]  ? do_syscall_64+0x69/0x80
[    3.211737]  ? syscall_exit_to_user_mode+0x12/0x30
[    3.212471]  ? do_syscall_64+0x69/0x80
[    3.213048]  ? exc_page_fault+0x62/0x140
[    3.213644]  ? asm_exc_page_fault+0x8/0x30
[    3.214274]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    3.215124] RIP: 0033:0x7f8e515eb311
[    3.215670] Code: c3 0f 1f 84 00 00 00 00 00 f3 0f 1e fa be e7 00 00 00 ba 3c 00 00 00 eb 0d 89 d0 0f 05 48 3d 00 f0 ff ff 77 1c f4 89 f0 0f 05 <48> 3d 00 f0 ff ff 76 e7 f7 d8 89 05 bf fe 00 00 eb dd 0f 1f 44 00
[    3.218493] RSP: 002b:00007ffe2839da58 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
[    3.219639] RAX: ffffffffffffffda RBX: 00007f8e515e6050 RCX: 00007f8e515eb311
[    3.220714] RDX: 000000000000003c RSI: 00000000000000e7 RDI: 000000000000007f
[    3.221788] RBP: 0000565470979040 R08: 00007ffe2839d5c9 R09: 0000000000000000
[    3.222865] R10: 00000000ffffffff R11: 0000000000000246 R12: 000000000000000d
[    3.223939] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000001
[    3.225167] Kernel Offset: 0x12200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    3.228731] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00 ]---

qemu command line:
/usr/libexec/qemu-kvm -name guest=rhel9.1,debug-threads=on -S -object {"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-4-rhel9.1/master-key.aes"} -machine pc-q35-rhel9.0.0,usb=off,vmport=off,dump-guest-core=off,kernel_irqchip=split,memory-backend=pc.ram -accel kvm -cpu qemu64,pmu=off -m 8192 -object {"qom-type":"memory-backend-file","id":"pc.ram","mem-path":"/dev/hugepages/libvirt/qemu/4-rhel9.1","share":true,"x-use-canonical-path-for-ramblock-id":false,"prealloc":true,"size":8589934592} -overcommit mem-lock=off -smp 3,sockets=3,cores=1,threads=1 -uuid 39c5ac63-050f-4e18-b895-54c21fae2a1a -no-user-config -nodefaults -chardev socket,id=charmonitor,fd=24,server=on,wait=off -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=delay -no-hpet -no-shutdown -global ICH9-LPC.disable_s3=1 -global ICH9-LPC.disable_s4=1 -boot strict=on -device {"driver":"intel-iommu","intremap":"on","caching-mode":true,"device-iotlb":true} -device {"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","addr":"0x2"} -device {"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x3"} -device {"driver":"pcie-root-port","port":8,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x4"} -device {"driver":"pcie-root-port","port":9,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x5"} -device {"driver":"pcie-root-port","port":10,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x6"} -device {"driver":"pcie-root-port","port":11,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x7"} -blockdev {"driver":"file","filename":"/root/rhel9.1-latest.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"} -blockdev {"node-name":"libvirt-1-format","read-only":false,"driver":"qcow2","file":"libvirt-1-storage","backing":null} -device 
{"driver":"virtio-blk-pci","bus":"pci.1","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1} -netdev tap,fd=27,vhost=on,vhostfd=29,id=hostnet0 -device {"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:bb:63:7e","bus":"pci.2","addr":"0x0"} -chardev pty,id=charserial0 -device {"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0} -audiodev {"id":"audio1","driver":"none"} -vnc 0.0.0.0:1,audiodev=audio1 -device {"driver":"cirrus-vga","id":"video0","bus":"pci.5","addr":"0x0"} -device {"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.6","addr":"0x0"} -sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny -msg timestamp=on
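
The glibc message printed just before this panic ("Fatal glibc error: CPU does not support x86-64-v2"), together with "-cpu qemu64" in the command line above, suggests this occurrence may have a different trigger from the shutdown-time panics earlier in this bug: RHEL 9 userspace is built for the x86-64-v2 level, and init exits when the virtual CPU does not advertise it. A minimal sketch of checking a "flags" line from /proc/cpuinfo against that level follows. The flag list reflects the x86-64-v2 definition in the x86-64 psABI; the helper name and the example flag line are hypothetical and were not captured from the guest above.

```python
# /proc/cpuinfo names for the features x86-64-v2 adds on top of baseline
# x86-64: CMPXCHG16B, LAHF/SAHF, POPCNT, SSE3, SSE4.1, SSE4.2, SSSE3.
X86_64_V2_FLAGS = {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"}


def missing_v2_flags(cpuinfo_flags: str) -> list:
    """Return the x86-64-v2 flags absent from a /proc/cpuinfo 'flags' value."""
    present = set(cpuinfo_flags.split())
    return sorted(X86_64_V2_FLAGS - present)


# Hypothetical flag line, for illustration only.
example_flags = "fpu sse sse2 pni cx16 lahf_lm syscall nx lm"
missing = missing_v2_flags(example_flags)
print(missing)  # ['popcnt', 'sse4_1', 'sse4_2', 'ssse3']
```

An empty result would mean the CPU model exposes everything the v2 level needs; a non-empty one points at the CPU model in the domain definition rather than at the guest image.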

Comment 51 mhou 2022-06-21 07:41:59 UTC

I uploaded my guest image to this URL [1]. The guest kernel is 5.14.0-114.el9.

[1] http://netqe-bj.usersys.redhat.com/share/mhou/image/rhel9.1-latest.qcow2

Comment 53 Pavel Valena 2022-07-13 21:57:12 UTC

Hello, please test, if you can, whether the planned rebase for 9.1 works for you.

RPMS: https://github.com/pvalena/rpms/tree/main/dracut/2066816

Comment 54 Yanan Fu 2022-07-14 01:35:25 UTC

(In reply to Pavel Valena from comment #53)
> Hello, please test, if you can, whether the planned rebase for 9.1 works for
> you.
> 
> RPMS: https://github.com/pvalena/rpms/tree/main/dracut/2066816

Hi Pavel,

Thanks for your scratch build; I will test with it.

This is a probabilistic problem, so we may need a bit more time to verify it. I hope you
can understand. Thanks a lot!


Best regards
Yanan Fu
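
Because the panic is intermittent, a handful of clean reboots says little about whether a fix works. As a rough sketch, the number of consecutive clean trials needed before a clean result becomes meaningful can be estimated as below. It assumes trials are independent and takes the "1/20" from the original report as the per-trial failure rate; both are assumptions, since the report does not say whether 1/20 is per reboot or per test run. `clean_runs_needed` is a made-up helper name.

```python
import math


def clean_runs_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n for which n consecutive clean trials would occur with
    probability at most (1 - confidence) if the bug were still present."""
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


print(clean_runs_needed(1 / 20))  # 59
```

Under those assumptions, about 59 clean trials in a row would be needed for 95% confidence; a lower true failure rate would push the number up accordingly.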