Bug 1892669
| Field | Value |
|---|---|
| Summary | VM start hang - qemu stuck in query-balloon |
| Product | Red Hat Enterprise Linux 9 |
| Component | qemu-kvm |
| qemu-kvm sub component | Devices |
| Version | unspecified |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED DEFERRED |
| Severity | low |
| Priority | low |
| Keywords | Reopened, Triaged |
| Whiteboard | libvirt_OSP_INT |
| Reporter | chhu |
| Assignee | Marcelo Tosatti <mtosatti> |
| QA Contact | Pei Zhang <pezhang> |
| CC | broskos, chayang, jinzhao, junzhao, juzhang, kchamart, mhou, mkletzan, mprivozn, mtosatti, nilal, pezhang, virt-maint, yanghliu, ymankad, yuhuang |
| Flags | pm-rhel: mirror+ |
| Target Milestone | rc |
| Type | Bug |
| Last Closed | 2023-02-19 07:27:44 UTC |
| Bug Blocks | 1883636, 1922007 |
Created attachment 1725057 [details]
Guest xml
Created attachment 1725058 [details]
client_hang_libvirtd.log
Chenli and I were trying to reproduce another bug (bug 1821277) and when we hit this I dumped the stack trace:

```
#0  0x00007f2aecc7048c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x000056513d9e1aed in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x56513e254fa0 <qemu_global_mutex>, file=0x56513da850b0 "/builddir/build/BUILD/qemu-4.2.0/cpus.c", line=1275) at util/qemu-thread-posix.c:173
#2  0x000056513d6b30f7 in qemu_wait_io_event (cpu=0x56513eadf7b0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1275
#3  0x000056513d6b4b58 in qemu_kvm_cpu_thread_fn (arg=0x56513eadf7b0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1323
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513eb06c90) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 7 (Thread 0x7f2adf6fc700 (LWP 201805)):
#0  0x00007f2aecc7048c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x000056513d9e1aed in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x56513e254fa0 <qemu_global_mutex>, file=0x56513da850b0 "/builddir/build/BUILD/qemu-4.2.0/cpus.c", line=1275) at util/qemu-thread-posix.c:173
#2  0x000056513d6b30f7 in qemu_wait_io_event (cpu=0x56513eab7880) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1275
#3  0x000056513d6b4b58 in qemu_kvm_cpu_thread_fn (arg=0x56513eab7880) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1323
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513eadef70) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 6 (Thread 0x7f2adeefb700 (LWP 201804)):
#0  0x00007f2aecc7048c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x000056513d9e1aed in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x56513e254fa0 <qemu_global_mutex>, file=0x56513da850b0 "/builddir/build/BUILD/qemu-4.2.0/cpus.c", line=1275) at util/qemu-thread-posix.c:173
#2  0x000056513d6b30f7 in qemu_wait_io_event (cpu=0x56513ea8ed10) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1275
#3  0x000056513d6b4b58 in qemu_kvm_cpu_thread_fn (arg=0x56513ea8ed10) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1323
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513eab6240) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 5 (Thread 0x7f2ade6fa700 (LWP 201803)):
#0  0x00007f2aecc7048c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x000056513d9e1aed in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x56513e254fa0 <qemu_global_mutex>, file=0x56513da850b0 "/builddir/build/BUILD/qemu-4.2.0/cpus.c", line=1275) at util/qemu-thread-posix.c:173
#2  0x000056513d6b30f7 in qemu_wait_io_event (cpu=0x56513ea666a0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1275
#3  0x000056513d6b4b58 in qemu_kvm_cpu_thread_fn (arg=0x56513ea666a0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1323
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513ea8e690) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 4 (Thread 0x7f2addef9700 (LWP 201801)):
#0  0x00007f2aecc7048c in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib64/libpthread.so.0
#1  0x000056513d9e1aed in qemu_cond_wait_impl (cond=<optimized out>, mutex=0x56513e254fa0 <qemu_global_mutex>, file=0x56513da850b0 "/builddir/build/BUILD/qemu-4.2.0/cpus.c", line=1275) at util/qemu-thread-posix.c:173
#2  0x000056513d6b30f7 in qemu_wait_io_event (cpu=0x56513ea13df0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1275
#3  0x000056513d6b4b58 in qemu_kvm_cpu_thread_fn (arg=0x56513ea13df0) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/cpus.c:1323
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513ea3c6c0) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 3 (Thread 0x7f2add6f8700 (LWP 201800)):
#0  0x00007f2aec990f21 in poll () from target:/lib64/libc.so.6
#1  0x00007f2af115f9b6 in g_main_context_iterate.isra () from target:/lib64/libglib-2.0.so.0
#2  0x00007f2af115fd72 in g_main_loop_run () from target:/lib64/libglib-2.0.so.0
#3  0x000056513d7b8771 in iothread_run (opaque=0x56513e994c00) at iothread.c:82
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513ea11880) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 2 (Thread 0x7f2ae5114700 (LWP 201785)):
#0  0x00007f2aec9966ed in syscall () from target:/lib64/libc.so.6
#1  0x000056513d9e1f9f in qemu_futex_wait (val=<optimized out>, f=<optimized out>) at /usr/src/debug/qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64/include/qemu/futex.h:29
#2  qemu_event_wait (ev=ev@entry=0x56513e2852e8 <rcu_call_ready_event>) at util/qemu-thread-posix.c:459
#3  0x000056513d9f4202 in call_rcu_thread (opaque=<optimized out>) at util/rcu.c:260
#4  0x000056513d9e1734 in qemu_thread_start (args=0x56513e8baac0) at util/qemu-thread-posix.c:519
#5  0x00007f2aecc6a2de in start_thread () from target:/lib64/libpthread.so.0
#6  0x00007f2aec99be83 in clone () from target:/lib64/libc.so.6

Thread 1 (Thread 0x7f2af1cf5f00 (LWP 201751)):
#0  0x00007f2aec991016 in ppoll () from target:/lib64/libc.so.6
#1  0x000056513d9dd625 in ppoll (__ss=0x0, __timeout=0x7ffc9a3006d0, __nfds=<optimized out>, __fds=<optimized out>) at /usr/include/bits/poll2.h:77
#2  qemu_poll_ns (fds=<optimized out>, nfds=<optimized out>, timeout=timeout@entry=962000000) at util/qemu-timer.c:348
#3  0x000056513d9de4c5 in os_host_main_loop_wait (timeout=962000000) at util/main-loop.c:237
#4  main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:518
#5  0x000056513d7bdda1 in main_loop () at vl.c:1828
#6  0x000056513d669852 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:450
```

Hi Chenli,
IIUC, when libvirt starts a VM, the VM is in the paused state at the very beginning (because of '-S' on the qemu command line); a 'cont' command is then sent to the qemu monitor to resume the VM. So if you run 'virsh list' just before the 'cont' command, the VM state showing as paused is expected.
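For reference, the resume step described above is a plain QMP exchange. A minimal sketch of the messages involved (the command names are real QMP commands; the framing here is only illustrative, and the real session also sends these over the monitor socket after capability negotiation):

```python
import json

# QMP commands relevant to this bug: libvirt resumes a domain started
# with `qemu ... -S` by sending "cont", and "query-balloon" is the
# command this qemu instance was reportedly stuck in.
negotiate = json.dumps({"execute": "qmp_capabilities"})
resume = json.dumps({"execute": "cont"})
balloon = json.dumps({"execute": "query-balloon"})

for msg in (negotiate, resume, balloon):
    print(msg)
```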
Below is my test; the 'paused' state appears only during the second in which the VM starts, after which the VM goes into the running state.
# for i in {1..200}; do date; virsh start rhel8; sleep 5; date; virsh destroy rhel8; sleep 2; done
...
Fri Oct 30 03:18:44 EDT 2020
Domain rhel8 destroyed
Fri Oct 30 03:18:46 EDT 2020
Domain rhel8 started
Fri Oct 30 03:18:52 EDT 2020
Domain rhel8 destroyed
...
# for i in {1..5000}; do date; virsh list --all; sleep 1; done
...
Fri Oct 30 03:18:43 EDT 2020
Id Name State
------------------------
465 rhel8 running
Fri Oct 30 03:18:44 EDT 2020
Id Name State
------------------------
- rhel8 shut off
Fri Oct 30 03:18:45 EDT 2020
Id Name State
------------------------
- rhel8 shut off
Fri Oct 30 03:18:46 EDT 2020
Id Name State
-----------------------
466 rhel8 paused
Fri Oct 30 03:18:47 EDT 2020
Id Name State
------------------------
466 rhel8 running
Fri Oct 30 03:18:48 EDT 2020
Id Name State
------------------------
466 rhel8 running
...
I ran the test on 8.3 av,
qemu-kvm-5.1.0-13.module+el8.3.0+8382+afc3bbea
libvirt-client-6.6.0-6.module+el8.3.0+8125+aefcf088.x86_64
kernel-4.18.0-240.4.el8.x86_64
Would you please try the above commands with your packages? Thanks.
Tried with a server from the KVM-RT testing pool; I failed to reproduce this issue with the versions below. After starting/destroying the VM 400 times, the VM doesn't pause.

Versions:
4.18.0-193.30.1.rt13.79.el8_2.x86_64
libvirt-libs-6.0.0-25.4.module+el8.2.1+8060+c0c58169.x86_64
qemu-kvm-4.2.0-29.module+el8.2.1+7990+27f1e480.4.x86_64

Server info:

```
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              20
On-line CPU(s) list: 0-19
Thread(s) per core:  1
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               63
Model name:          Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping:            2
CPU MHz:             2297.448
BogoMIPS:            4594.74
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            25600K
NUMA node0 CPU(s):   0,2,4,6,8,10,12,14,16,18
NUMA node1 CPU(s):   1,3,5,7,9,11,13,15,17,19
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm arat pln pts md_clear flush_l1d
```

I was able to reproduce with the xml Chenli attached on her host.
However, if cputune is removed from the xml, the issue is gone: the guest status always changes from paused to running within the first 2 or 3 seconds.
```xml
<cputune>
  <vcpupin vcpu='0' cpuset='12'/>
  <vcpupin vcpu='1' cpuset='14'/>
  <vcpupin vcpu='2' cpuset='16'/>
  <vcpupin vcpu='3' cpuset='18'/>
  <vcpupin vcpu='4' cpuset='20'/>
  <emulatorpin cpuset='11,13,15,17'/>
  <emulatorsched scheduler='fifo' priority='1'/>
  <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
  <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
  <vcpusched vcpus='2' scheduler='fifo' priority='1'/>
  <vcpusched vcpus='3' scheduler='fifo' priority='1'/>
  <vcpusched vcpus='4' scheduler='fifo' priority='1'/>
</cputune>
```
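To verify whether the vcpu and emulator threads actually received the requested policies, the per-thread scheduler can be read back from the kernel. A minimal Linux-only sketch (the helper name is hypothetical; point it at the qemu PID instead of `os.getpid()` on a real host):

```python
import os

# Map scheduling-policy constants to the short names `ps -o policy` prints.
SCHED_NAMES = {os.SCHED_OTHER: "TS", os.SCHED_FIFO: "FF",
               os.SCHED_RR: "RR", os.SCHED_BATCH: "B"}

def thread_policies(pid):
    """Return {tid: policy-name} for every thread of `pid` (Linux /proc layout)."""
    return {int(tid): SCHED_NAMES.get(os.sched_getscheduler(int(tid)), "?")
            for tid in os.listdir(f"/proc/{pid}/task")}

# Inspect our own process as a stand-in for the qemu-kvm process.
print(thread_policies(os.getpid()))
```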
(In reply to Yumei Huang from comment #9)
> I was able to reproduce with the xml Chenli attached on her host.
> However, if remove cputune from the xml, the issue is gone.
> [cputune XML as above]

Also tried adding the above cputune to my own xml and ran 'virsh start' and 'virsh destroy' 2000 times on Chenli's rt host; sometimes the guest stays paused for about 50 seconds before running.

BTW, the above cputune xml can't run on a non-rt host; it hits an error:

```
# virsh start rhel8
error: Failed to start domain rhel8
error: Cannot set scheduler parameters for pid 504844: Operation not permitted
```

(In reply to Yumei Huang from comment #10)
> [comments 9 and 10 quoted above]

Testing with the cputune part but without emulatorsched and vcpusched on a non-rt host, starting and destroying the guest 5000 times: the guest stays paused for two seconds at most, then changes to running. In conclusion, the issue is only reproducible on an rt host with vcpusched.

Luiz - given the above analysis, I'll assign to your team for further analysis. Probably need to adjust the component/subcomponent, but I wasn't sure what to choose.

Hi Luiz, Chenli,

On looking at the shared XML I didn't find anything unusual (there were a few things that I skip in my configuration, but I don't think those configs should cause the reported issue). However, I am wondering if there is any specific reason for using the 8.2.z batch#1 rt-kernel here? If not, then as a first step we should reproduce the issue with batch#4 or preferably the latest batch#5 candidate. I tried reproducing the issue at my end with the reported VM config but failed to do so for over 300 start/destroy operations.

@Chenli, once you reproduce the issue with the latest 8.2.z kernel, please share the machine with me and I can investigate further. Thanks

Per Comment 14: Hi, I wanted to check if there is any update on the testing with the latest z-stream kernel? Thanks

Hello Minxi, I didn't hit this issue in any of the past 8.2.z testings. I would ask: did you hit this issue again in your recent testings? If yes, would you please share your server with Nitesh to debug once you hit it again? Thanks a lot. Best regards, Pei

Sorry for the late reply, we didn't run this test these days; let's wait for Minxi's reply to see if she hits it again on the .z build in their testing.

Created attachment 1743162 [details]
g3.xml
Created attachment 1743172 [details]
libvirtd.log
Here is the summary of the findings so far: on an rt-kernel, if we keep starting and destroying a VM with the emulator thread sched:priority set to fifo:1, it ends up in the paused state after some iterations and fails to recover. The minimal configuration with which I was able to reproduce this:

```xml
<cputune>
  <vcpupin vcpu='0' cpuset='12'/>
  <vcpupin vcpu='1' cpuset='14'/>
  <emulatorpin cpuset='11,13'/>
  <emulatorsched scheduler='fifo' priority='1'/>
  <vcpusched vcpus='0' scheduler='fifo' priority='1'/>
  <vcpusched vcpus='1' scheduler='fifo' priority='1'/>
</cputune>
```

Another scenario that I tried was with only 1 CPU designated for the emulator thread. In this case, the VM fails to start the first time: it just remains in the paused state.

The above two could be two different issues, or they could be the result of the same issue that is triggered when we have emulatorsched set to fifo:1. I am adding Martin, who worked on BZ1580229, to see if he has any suggestions on how to find some meaningful information in the libvirt logs. In the meanwhile, I have started capturing some kernel traces to see if I can find something in there.

The first log suggests that qemu stops responding slightly after the daemon sets affinity for an I/O thread. The I/O thread does not seem to be pinned anywhere, so my guess is that it might sometimes starve the emulator thread out of its cpu time. I can't find anything else, at least for now. Try setting the I/O thread pinning using <iothreadpin/>, restricting its scheduler using <iothreadsched/>, or not using io='threads', at least to test the theory. I wonder why the scheduler and pinning of I/O threads default to the same ones as the emulator. I guess we ought to fix that, but there probably was some reasoning behind it. Or maybe it defaulted to the common behaviour for vcpu pinning.

Hi Martin, thanks for taking a look.
So, one change that I made and forgot to mention in the guest XML was to set the sched:priority of vcpu0 to fifo:1 as well, as that's what we use and recommend for KVM-RT. I also tried removing io='threads', pinning the iothread to one of the housekeeping CPUs, and setting the iothreadsched to fifo, but unfortunately nothing helped. I think Minxi has started using the test box for some other testing, so I couldn't look into the logs.

Another interesting thing worth mentioning here is that we recently found that the issue (Bug 1580229) for which this emulatorsched was introduced in the KVM-RT testing is not reproducible anymore.

(In reply to Nitesh Narayan Lal from comment #30)

Thanks for getting back to me so quickly. The vcpu0 change should not be significant.

Did you also try changing the scheduler of the I/O thread to some other scheduler? It already inherited the same pinning and scheduler settings as the emulator thread. I'd try changing it to something else, e.g. move it to a different housekeeping cpu (not any of those for the emulator) and/or change the scheduler to something else, ideally both.

About removing the I/O thread: if this happens without io='threads', I would probably try forcing it to io='native' if you haven't tried that already.

Also, when this bug happens, could you check what threads and processes are running on the cpus that the emulator thread is assigned to? Another idea I had in the meantime was that maybe QEMU changed and it is now spawning some new threads/processes which might hog the CPU. Maybe?

About the previous bug that stopped happening: do you mean it stopped happening after upgrading to new versions (i.e. the ones with the fix)? That would be expected, no? If you mean it stopped happening even with the older libvirt, then it just confirms my doubt in the commit for the fix: "If the scheduler is set before vCPU0 cannot be moved into its cpu,cpuacct cgroup. While it is not yet known whether this is a bug or not..." -- https://www.redhat.com/archives/libvir-list/2019-May/msg00620.html

(In reply to Martin Kletzander from comment #31)

Thanks for taking a look.

> Did you also try changing the scheduler of the I/O thread to some other scheduler? It already inherited the same pinning and scheduler settings as the emulator thread.

Good point, I was not particularly sure about this. I didn't change the scheduler for the iothread, but I did try changing the emulator thread scheduler to batch, and the issue was resolved with that.

> I'd try changing it to something else, e.g. move it to a different housekeeping cpu (not any of those for the emulator) and/or change the scheduler to something else, ideally both.

So, I did move the iothread to a housekeeping CPU which is not used for any other purposes in the VM, but it didn't help. I have lost access to the test environment, so I am trying to reproduce this on one of my own boxes. I can try changing the iothread scheduler to something else along with pinning it to a HK CPU once I reproduce this.

> Also when this bug happens, could you check what threads and processes are running on the cpus that the emulator thread is assigned to?

I will try checking the processes and threads running on the CPUs to which the iothread and emulator threads are pinned.

> About the previous bug that stopped happening. Do you mean it stopped happening after upgrading to new versions (i.e. the ones with the fix)?

Yes, and I think there is a gap in my understanding of the issue. I was under the impression that the failed-reboot issue was happening because the emulator thread was starved by other vcpu threads running with SCHED_FIFO. Then we added the emulatorsched option in libvirt, and as we explicitly started setting it to SCHED_FIFO as well, the issue was resolved.

> That would be expected, no? If you mean it stopped happening even with the older libvirt, then it just confirms my doubt in the commit for the fix.

I haven't tried with the older libvirt version.

(In reply to Nitesh Narayan Lal from comment #32)

Interesting. So making the emulator *not* run RT actually fixed it? Now I wonder whether reverting the patch for Bug 1580229 would also fix it. Let me know if you want to try that and I can do a scratch build. Unfortunately it also means that now I have no idea how it works.

(In reply to Martin Kletzander from comment #33)

TBH, I am not particularly sure about the root cause either. I have been trying to reproduce the issue at my end with the emulator threads set to SCHED_FIFO, and even by explicitly pinning the emulator and iothread to the same CPU, but like last time I am again failing to do so.

@Minxi, can you please make the environment available again for some more debugging? We can surely try the scratch build on Minxi's setup.

Thanks, Minxi for confirming.
The issue here is the usage of the userspace-based IOAPIC with an emulator thread that is running with SCHED_FIFO. In this scenario, the userspace IOAPIC thread is sometimes starved because the higher-priority emulator thread holds a mutex lock and never releases it. This is why either removing the emulatorsched that is set to fifo:1, setting it to batch, or removing the userspace IOAPIC resolves the issue.

Since the userspace-based IOAPIC is not supported with the KVM-RT configuration, which requires the vcpu and emulator threads to run with SCHED_FIFO, I am closing this BZ as Not A Bug.

(In reply to Nitesh Narayan Lal from comment #38)
> [closing rationale quoted above]

Nitesh,

Attempting to change ioapic from "qemu" to "kvm" results in:

```
[root@dell-per430-11 ~]# virsh edit rhel8.2.0.z_rt_1vcpu
error: XML error: IOMMU interrupt remapping requires split I/O APIC (ioapic driver='qemu')
Failed. Try again? [y,n,i,f,?]:
```

with this IOMMU configuration in the guest XML:

```xml
<iommu model='intel'>
  <driver intremap='on' caching_mode='on' iotlb='on'/>
</iommu>
```

https://wiki.qemu.org/Features/VT-d

And guest vIOMMU, IIRC, is necessary for DPDK to properly enable the FPGA in the guest. Nitesh, do you have a guest with DPDK enabled in the virtlab testbox? Maybe Brent can confirm as well. (Reopening for now.)
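The starvation pattern described in the closing rationale (a FIFO-priority thread holding a mutex that the IOAPIC thread needs) can be mimicked with plain threads. This is only an analogy for the lock-wait dependency, not real RT scheduling: setting SCHED_FIFO from userspace needs privileges and is not attempted here, and the thread names are stand-ins, not QEMU's.

```python
import threading
import time

lock = threading.Lock()        # stands in for the contended QEMU mutex
acquired = threading.Event()   # signals that the "emulator" holds the lock
result = []

def emulator_thread():
    # Stand-in for the SCHED_FIFO emulator thread: grabs the mutex and
    # keeps "running" without releasing it.
    with lock:
        acquired.set()
        time.sleep(1.0)

def ioapic_thread():
    # Stand-in for the userspace IOAPIC thread: it makes no progress
    # while the higher-priority holder keeps the mutex.
    result.append(lock.acquire(timeout=0.2))

holder = threading.Thread(target=emulator_thread)
holder.start()
acquired.wait()                # ensure the holder takes the lock first
waiter = threading.Thread(target=ioapic_thread)
waiter.start()
waiter.join()
holder.join()
print(result)                  # [False]: the waiter timed out while the holder ran
```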
(In reply to Marcelo Tosatti from comment #39)
> Attempting to change ioapic from "qemu" to "kvm" results in [the error quoted above]

But why do you have to manually change the ioapic from "qemu" to "kvm", since "kvm" is already the default mode?

> Nitesh, do you have a guest with DPDK enabled in the virtlab testbox?

No. Thanks

(In reply to Nitesh Narayan Lal from comment #41)
> But why do you have to manually change the ioapic from "qemu" to "kvm", since "kvm" is already the default mode?

So that interrupt remapping works (see above).

(In reply to Marcelo Tosatti from comment #42)
> So that interrupt remapping works (see above).

Ah, I see. So we need the userspace IOAPIC for the IOMMU, which is required by DPDK to properly enable the FPGA in the guest, but with emulatorsched set to fifo:1 this might not work. In that case, we can explore two options:
- a way by which the userspace IOAPIC thread can inherit the sched priority of the emulator thread, or
- adding another tag to specify the sched priority of the userspace IOAPIC thread, so that it can be set to fifo:1 as well.

Martin may have more suggestions on this.

Marcelo, based on Brent's comment, is it right to conclude that we don't need vIOMMU for FPGA enablement? If so, we can close the bug; or is there any other use case where this might be required?

(In reply to Nitesh Narayan Lal from comment #46)

Nitesh, I think we should understand the problem (it's probably fixable), but it is perhaps not high priority for RHEL 8.4 (we should double-check that CISCO is not using the IOMMU in the guest... do we have a guest kernel command line from them?). You can reproduce it, correct? We would have to look into what each thread is doing when the lockup happens (I can help you with that if necessary).

So based on the previous comments, it doesn't look like anyone is using vIOMMU (hence the userspace IOAPIC) with KVM-RT at the moment. Hence this looks like a low-priority item for now; however, it would still be good to get this issue resolved. I am currently occupied with some other high-priority items at the moment, so I will get back to this once I have some free cycles. Thanks

(In reply to Nitesh Narayan Lal from comment #38)
> [closing rationale quoted above]
Some threads do not have SCHED_FIFO priority:

[root@hp-dl388g10-02 ~]# ps -L -A -o pid,tname,time,cmd,policy,rtprio,lwp | grep qemu
58229 ?        00:01:24 /usr/libexec/qemu-kvm -name  FF  1  58229
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  TS  -  58264
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  TS  -  58266
58229 ?        00:00:02 /usr/libexec/qemu-kvm -name  TS  -  58276
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  TS  -  58277
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  FF  1  58278
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  FF  1  58279
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  FF  1  58280
58229 ?        00:00:00 /usr/libexec/qemu-kvm -name  FF  1  58281
58302 pts/1    00:00:00 grep --color=auto qemu      TS  -  58302

Feb 18 14:39:12 hp-dl388g10-02 kernel: call_rcu        R  running task    0 58264      1 0x000803a4
Feb 18 14:39:12 hp-dl388g10-02 kernel: Call Trace:
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __schedule+0x316/0x7c0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  schedule+0x39/0xd0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait_queue_me+0xbb/0x110
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait+0x133/0x230
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? finish_task_switch+0x108/0x300
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __switch_to+0x419/0x470
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_futex+0x308/0x670
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __seccomp_filter+0x3e/0x490
Feb 18 14:39:12 hp-dl388g10-02 kernel:  __x64_sys_futex+0x143/0x180
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_syscall_64+0x87/0x1a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RIP: 0033:0x7f149beb96ed
Feb 18 14:39:12 hp-dl388g10-02 kernel: Code: Bad RIP value.

Feb 18 14:39:12 hp-dl388g10-02 kernel: worker          R  running task    0 58266      1 0x000803a0
Feb 18 14:39:12 hp-dl388g10-02 kernel: Call Trace:
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __schedule+0x316/0x7c0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? _raw_spin_unlock_irqrestore+0x20/0x60
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? hrtimer_start_range_ns+0x21f/0x390
Feb 18 14:39:12 hp-dl388g10-02 kernel:  schedule+0x39/0xd0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait_queue_me+0xbb/0x110
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait+0x133/0x230
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __hrtimer_init_sleeper+0x60/0x60
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_futex+0x308/0x670
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __seccomp_filter+0x3e/0x490
Feb 18 14:39:12 hp-dl388g10-02 kernel:  __x64_sys_futex+0x143/0x180
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_syscall_64+0x87/0x1a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RIP: 0033:0x7f149c196082
Feb 18 14:39:12 hp-dl388g10-02 kernel: Code: Bad RIP value.
Feb 18 14:39:12 hp-dl388g10-02 kernel: RSP: 002b:00007f149360f600 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RAX: ffffffffffffffda RBX: 00007f149360f6a0 RCX: 00007f149c196082
Feb 18 14:39:12 hp-dl388g10-02 kernel: RDX: 0000000000000000 RSI: 0000000000000189 RDI: 0000564372f2bbc8
Feb 18 14:39:12 hp-dl388g10-02 kernel: RBP: 0000564372f2bbc8 R08: 0000000000000000 R09: 00000000ffffffff
Feb 18 14:39:12 hp-dl388g10-02 kernel: R10: 00007f149360f6a0 R11: 0000000000000246 R12: 0000000000000000
Feb 18 14:39:12 hp-dl388g10-02 kernel: R13: 0000000000000000 R14: 00007f149360f6a0 R15: 00007f149360f800

Feb 18 14:39:12 hp-dl388g10-02 kernel: IO mon_iothread  R  running task    0 58276      1 0x000843a0
Feb 18 14:39:12 hp-dl388g10-02 kernel: Call Trace:
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __schedule+0x316/0x7c0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? ___preempt_schedule+0x16/0x18
Feb 18 14:39:12 hp-dl388g10-02 kernel:  preempt_schedule_common+0x23/0x80
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ___preempt_schedule+0x16/0x18
Feb 18 14:39:12 hp-dl388g10-02 kernel:  rt_mutex_postunlock+0x5a/0x60
Feb 18 14:39:12 hp-dl388g10-02 kernel:  rt_mutex_futex_unlock+0xa1/0xb0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  rt_spin_unlock+0x39/0x40
Feb 18 14:39:12 hp-dl388g10-02 kernel:  eventfd_write+0xbe/0x290
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? migrate_enable+0x3a0/0x3a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  vfs_write+0xa5/0x1a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ksys_write+0x52/0xc0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_syscall_64+0x87/0x1a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RIP: 0033:0x7f149c196af7
Feb 18 14:39:12 hp-dl388g10-02 kernel: Code: Bad RIP value.
Feb 18 14:39:12 hp-dl388g10-02 kernel: RSP: 002b:00007f148bffd470 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
Feb 18 14:39:12 hp-dl388g10-02 kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007f149c196af7
Feb 18 14:39:12 hp-dl388g10-02 kernel: RDX: 0000000000000008 RSI: 00005643711a2eb8 RDI: 0000000000000004
Feb 18 14:39:12 hp-dl388g10-02 kernel: RBP: 00005643711a2eb8 R08: 0000000000000000 R09: 0000000000000000
Feb 18 14:39:12 hp-dl388g10-02 kernel: R10: 0000000000000019 R11: 0000000000000293 R12: 0000000000000008
Feb 18 14:39:12 hp-dl388g10-02 kernel: R13: 0000564372e9b1a0 R14: 0000000000000049 R15: 0000000000000168

Feb 18 14:39:12 hp-dl388g10-02 kernel: CPU 0/KVM       S                  0 58277      1 0x000843a0
Feb 18 14:39:12 hp-dl388g10-02 kernel: Call Trace:
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __schedule+0x316/0x7c0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  schedule+0x39/0xd0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait_queue_me+0xbb/0x110
Feb 18 14:39:12 hp-dl388g10-02 kernel:  futex_wait+0x133/0x230
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? get_futex_key+0x3a1/0x420
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_futex+0x308/0x670
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? __seccomp_filter+0x3e/0x490
Feb 18 14:39:12 hp-dl388g10-02 kernel:  ? do_vfs_ioctl+0xa4/0x630
Feb 18 14:39:12 hp-dl388g10-02 kernel:  __x64_sys_futex+0x143/0x180
Feb 18 14:39:12 hp-dl388g10-02 kernel:  do_syscall_64+0x87/0x1a0
Feb 18 14:39:12 hp-dl388g10-02 kernel:  entry_SYSCALL_64_after_hwframe+0x65/0xca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RIP: 0033:0x7f149c19348c
Feb 18 14:39:12 hp-dl388g10-02 kernel: Code: Bad RIP value.
Feb 18 14:39:12 hp-dl388g10-02 kernel: RSP: 002b:00007f14909485e0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
Feb 18 14:39:12 hp-dl388g10-02 kernel: RAX: ffffffffffffffda RBX: 0000564372fabcf0 RCX: 00007f149c19348c
Feb 18 14:39:12 hp-dl388g10-02 kernel: RDX: 0000000000000000 RSI: 0000000000000080 RDI: 0000564372fabd1c
Feb 18 14:39:12 hp-dl388g10-02 kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 000056437188c6e0
Feb 18 14:39:12 hp-dl388g10-02 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000056437186ffa0
Feb 18 14:39:12 hp-dl388g10-02 kernel: R13: 000000000000000b R14: 0000000000000000 R15: 0000564372fabd1c

Agree with the previous conclusion that this can be closed: it's not a customer-supported configuration, so closing again.

Bulk update: Move RHEL-AV bugs to RHEL9. If necessary to resolve in RHEL8, then clone to the current RHEL8 release.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
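The per-thread policy check shown in the `ps` output above (`FF` = SCHED_FIFO, `TS` = normal time-sharing) can also be done programmatically when triaging this kind of mixed-policy starvation. Below is a minimal sketch, not from the original report: it assumes Linux and Python 3, and inspects the current process only as a stand-in for the qemu-kvm PID.

```python
import os

# ps POLICY column names for the policies relevant to this bug.
POLICY_NAMES = {
    os.SCHED_OTHER: "TS",  # normal time-sharing
    os.SCHED_FIFO: "FF",   # real-time FIFO (the starvation risk)
    os.SCHED_RR: "RR",     # real-time round-robin
}

def thread_policies(pid):
    """Map each thread id of `pid` to its scheduler policy name."""
    policies = {}
    for tid in os.listdir(f"/proc/{pid}/task"):
        policy = os.sched_getscheduler(int(tid))
        policies[int(tid)] = POLICY_NAMES.get(policy, "other")
    return policies

# In practice `pid` would be the qemu-kvm PID (e.g. from `pgrep -o
# qemu-kvm`); the current process is used here so the sketch runs anywhere.
print(thread_policies(os.getpid()))
```

A mix of `FF` and `TS` entries for one qemu-kvm process, as in the listing above, is the condition under which a SCHED_FIFO thread can hold a mutex needed by a starved normal-priority thread.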
This bug has bounced into various closed states a few times and has had its stale date extended due to reopening; however, there's no indication in those recent changes of what active work is being done to resolve the problem.

As it stands now, the stale date will occur during the current release cycle, but it doesn't seem we're any closer to resolution.

Can you please provide an update on expectations for resolution? Is this actively being worked, or is it just a low-priority backlog task that could be tracked outside of Bugzilla?

tks -

(In reply to John Ferlan from comment #71)
> This bug has bounced into various closed states a few times and has had its
> stale date extended due to reopening; however, there's no indication in
> those recent changes of what active work is being done to resolve the
> problem.
>
> As it stands now, the stale date will occur during the current release
> cycle, but it doesn't seem we're any closer to resolution.
>
> Can you please provide an update on expectations for resolution? Is this
> actively being worked, or is it just a low-priority backlog task that
> could be tracked outside of Bugzilla?
>
> tks -

John,

This is an item that we'd like to fix, but it is on the lower-priority backlog. Is there a problem with continuing to track it in Bugzilla?

There is valuable information in the BZ; that's why it would be worthwhile to keep it open.

(In reply to Marcelo Tosatti from comment #72)
> John,
>
> This is an item that we'd like to fix, but it is on the lower-priority
> backlog. Is there a problem with continuing to track it in Bugzilla?
>
> There is valuable information in the BZ; that's why it would be worthwhile
> to keep it open.

In theory it's not a problem to keep bugs open; however, we deal with RHEL processes (and bz bots) that handle stale/aging bugs (e.g. bugs open longer than 18 months) by setting a deadline date for resolution. As you've seen above, when we reach that date the bug gets closed, and then someone has reopened it (more than once). It's perfectly fine to extend the date, but for how long can one reasonably expect to keep a bug open without resolution? The conundrum is whether Bugzilla should be used as a long-term todo list or as a bug/feature tracker for things that can be addressed within 18 months.

I try to be proactive, and in an effort to reduce the number of open backlog bugs (>200 open bugs within the Virt SST) I give a bump/push to the oldest bugs to make sure developers consider addressing them. The reality is that if something hasn't been fixed after 2+ years, what's the likelihood that the problem will be addressed in the next 6-12 months? Additionally, there's no customer case on this bug.

Still, as a former developer I realize some bugs can take time to resolve and other priorities exist, but we should at least update the status from time to time, especially when the stale date approaches or has to be extended.

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.

Changing the CLOSE reason to DEFERRED.
As indicated by Marcelo in #c72 - we intend to fix this BZ, but we consider it a lower priority right now. Will re-open the BZ in the future if the priority ever changes. |
Description of problem:
After starting and destroying a VM many times, `virsh start` hangs and qemu is stuck in query-balloon.

Tested on packages:
libvirt-daemon-kvm-6.0.0-25.module+el8.2.1+7154+47ffd890.x86_64
qemu-kvm-4.2.0-29.module+el8.2.1+7297+a825794d.x86_64
RT kernel: 4.18.0-193.rt13.51.el8.x86_64

Test steps:
1. Start the VM successfully with xml: g2.xml, then destroy the VM.
2. Start and destroy the VM 20 times:
# for i in {1..20}; do virsh start g2; sleep 5; virsh destroy g2; done
3. Try to start and destroy the VM 200 times on terminal 1, and run `virsh list --all` on terminal 2. `virsh start` hangs on terminal 1 and the VM status is paused.
Terminal 1:
# for i in {1..200}; do virsh start g2; sleep 1; virsh destroy g2; done
Terminal 2:
# for i in {1..500}; do virsh list --all; sleep 1; done
 Id   Name   State
-----------------------
 1    g2     paused
...

Additional information:
- libvirtd.log
- g2.xml
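Because the failure mode is `virsh start` blocking indefinitely, a reproduction driver benefits from bounding each iteration with a timeout, so the loop fails fast and records which iteration hung instead of stalling like terminal 1 above. A minimal sketch, not part of the original report; `echo` and `sleep` stand in for the actual `virsh start`/`virsh destroy` commands:

```python
import subprocess

def run_with_timeout(cmd, timeout_s=30.0):
    """Run `cmd`; return (ok, stdout).

    ok is False on timeout or a non-zero exit status, mirroring a hung
    or failed `virsh start`.
    """
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
        return result.returncode == 0, result.stdout
    except subprocess.TimeoutExpired:
        return False, ""

# Stand-in for the real loop body, e.g.:
#   run_with_timeout(["virsh", "start", "g2"], timeout_s=30.0)
ok, out = run_with_timeout(["echo", "started"], timeout_s=5.0)
```

A loop built on this wrapper can stop at the first iteration where `ok` is False, which is the point at which to attach gdb and dump thread backtraces as was done in this report.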