Bug 1852298

Summary: Install VM fails and RT host get Call Trace "WARNING: CPU: 18 PID: 2947 at fs/eventfd.c:68 eventfd_signal+0x7e/0x90"
Product: Red Hat Enterprise Linux 8 Reporter: Pei Zhang <pezhang>
Component: kernel-rtAssignee: Juri Lelli <jlelli>
kernel-rt sub component: File Systems - Other QA Contact: Pei Zhang <pezhang>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: ailan, bhu, chayang, dhowells, jinzhao, jlelli, jmoyer, juzhang, lcapitulino, lgoncalv, mszeredi, mtosatti, nilal, peterx, swhiteho, virt-maint
Version: 8.3Keywords: Regression, TestBlocker
Target Milestone: rc   
Target Release: 8.3   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-rt-4.18.0-229.rt7.41.el8 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-04 02:26:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1817732, 1823810    
Attachments:
Description Flags
VM installation hang here none

Description Pei Zhang 2020-06-30 05:23:44 UTC
Description of problem:
In RT host, install a VM fails and RT host get Call Trace.

Version-Release number of selected component (if applicable):
4.18.0-221.rt7.33.el8.x86_64
qemu-kvm-5.0.0-0.module+el8.3.0+6612+6b86f0c9.x86_64
libvirt-6.4.0-1.scrmod+el8.3.0+6977+09119430.x86_64


How reproducible:
2/2

Steps to Reproduce:
1. Setup RT host

2. Install a VM. The installation looks hang.

# virt-install \
--graphics type=spice,listen=0.0.0.0 \
--name=rhel8.3_rt_1vcpu \
--machine q35 \
--memory=4096,hugepages=yes \
--memorybacking hugepages=yes,size=1,unit=G,locked=yes,access.mode=shared \
--cpu host-passthrough,cache.mode=passthrough,numa.cell0.memory=4194304,numa.cell0.cpus=0-1 \
--numatune=1 \
--vcpus=2,cpuset=18,19 \
--controller type=scsi,model=virtio-scsi,driver.iommu=on  \
--controller type=virtio-serial,driver.iommu=on  \
--rng type=/dev/random,driver.iommu=on  \
--memballoon virtio,driver.iommu=on  \
--disk path=/home/images_nfv-virt-rt-kvm/rhel8.3_rt_1vcpu.qcow2,size=40,target.bus=scsi \
-l http://download.eng.pek2.redhat.com/nightly/RHEL-8.3.0-20200620.n.0/compose/BaseOS/x86_64/os/ \
--noautoconsole \
--noreboot -x ks=http://10.73.72.41/kickstart-rhel8.cfg  \
--network bridge=switch,model=virtio,mac=38:66:77:5f:aa:01,driver.iommu=on

3. Check RT host, there are Call Trace.

[  199.611601] WARNING: CPU: 18 PID: 2947 at fs/eventfd.c:68 eventfd_signal+0x7e/0x90
[  199.611603] Modules linked in: vhost_net vhost vhost_iotlb tap xt_CHECKSUM ipt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 nft_compat nft_chain_route_ipv6 nft_chain_nat_ipv6 nf_nat_ipv6 nft_counter nft_chain_route_ipv4 nft_chain_nat_ipv4 nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink tun bridge stp llc intel_rapl_msr iTCO_wdt iTCO_vendor_support dcdbas kvm_intel kvm irqbypass intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp intel_cstate intel_uncore intel_rapl_perf pcspkr lpc_ich ipmi_ssif ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mei_me mei ip_tables xfs libcrc32c sd_mod sg mgag200 drm_vram_helper drm_ttm_helper i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul syscopyarea sysfillrect sysimgblt crc32c_intel fb_sys_fops ttm ghash_clmulni_intel drm ahci libahci libata tg3 megaraid_sas mxm_wmi wmi sunrpc dm_mirror dm_region_hash dm_log dm_mod
[  199.611634] CPU: 18 PID: 2947 Comm: vhost-2906 Kdump: loaded Not tainted 4.18.0-221.rt7.33.el8.x86_64 #1
[  199.611634] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.0.1 04/11/2016
[  199.611636] RIP: 0010:eventfd_signal+0x7e/0x90
[  199.611638] Code: 01 00 00 00 be 03 00 00 00 4c 89 ef e8 ab 61 dd ff 65 ff 0d f4 22 d0 47 4c 89 ef e8 8c 25 5c 00 5b 48 89 e8 5d 41 5c 41 5d c3 <0f> 0b 31 ed eb f0 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00
[  199.611639] RSP: 0018:ffffc00d8cfffd58 EFLAGS: 00010202
[  199.611641] RAX: 0000000000000001 RBX: ffff99184e1c46c0 RCX: 000000001aa00001
[  199.611642] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff99189b061b40
[  199.611642] RBP: 000000000000004e R08: 0000000000000001 R09: 0000000000000000
[  199.611643] R10: ffff99184bf98640 R11: 0000000000000000 R12: ffffc00d8cfffe18
[  199.611644] R13: ffff99184e1c0000 R14: ffff991899a47498 R15: ffff99184e1c46c0
[  199.611645] FS:  0000000000000000(0000) GS:ffff99189fa80000(0000) knlGS:0000000000000000
[  199.611646] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  199.611647] CR2: 000056022ba55378 CR3: 0000001037566003 CR4: 00000000001626e0
[  199.611647] Call Trace:
[  199.611656]  vhost_tx_batch.isra.17+0x9d/0xf0 [vhost_net]
[  199.611659]  handle_tx_copy+0x181/0x580 [vhost_net]
[  199.611665]  ? vhost_dev_init+0x2c0/0x2c0 [vhost]
[  199.611667]  handle_tx+0xa7/0xe0 [vhost_net]
[  199.611670]  vhost_worker+0xd6/0x140 [vhost]
[  199.611674]  kthread+0x112/0x130
[  199.611676]  ? kthread_flush_work_fn+0x10/0x10
[  199.611680]  ret_from_fork+0x35/0x40
[  199.611683] ---[ end trace 0000000000000002 ]---


Actual results:
VM installation fails and RT host hang.

Expected results:
VM installation should work well and no Call Trace in RT host.

Additional info:
1. This should be a regression bug.

4.18.0-217.rt7.29.el8.x86_64  works well
4.18.0-221.rt7.33.el8.x86_64  fails

Comment 1 Pei Zhang 2020-06-30 05:25:21 UTC
Created attachment 1699236 [details]
VM installation hang here

Comment 2 Pei Zhang 2020-07-01 13:19:56 UTC
This is real time only issue. We cannot reproduce with standard kernel.

4.18.0-221.rt7.33.el8.x86_64    fail
4.18.0-221.el8.x86_64           works well

Comment 3 Juri Lelli 2020-07-02 07:51:29 UTC
Hi,

I had a look and commit 68848557caf6 ("[fs] eventfd: track eventfd_signal()
recursion depth") (kernel-4.18.0-218.el8) introduced the check that generates
the WARNING reported.

Now, it is not immediately clear to me if the erroneous condition pointed
out by the warning and the fact that VM install hangs are correlated.

I'm auditing recent changes (more info below), but as a first try I'd kindly
ask Pei to try to replicate with kernel-rt-4.18.0-218.rt7.30.el8 (so that we
restrict the number of potential problematic changes). As a second tentative
(don't have high hopes on it, but still think it's worth a shot) I'd try using
a more recent compose to start the VM install, as I seem to remember I had
troubles with testing using some compose from last week (don't remember the
specific one, sorry). Could you also please try to replicate with, say,
http://download.eng.pek2.redhat.com/nightly/RHEL-8.3.0-20200702.n.0/compose/BaseOS/x86_64/os/
? Asking this because I noticed dracut failing to download artifacts from the
install log you shared.

Anyway, by looking at code, I noticed that eventfd_signal_count() is not
called from anywhere, which looks suspicious as eventfd_signal() comment
states:

"Deadlock or stack overflow issues can happen if we recurse here
 through waitqueue wakeup handlers. If the caller users potentially
 nested waitqueues with custom wakeup handlers, then it should
 check eventfd_signal_count() before calling this function. If
 it returns true, the eventfd_signal() call should be deferred to a
 safe context."

In upstream RT (v5.6.17-rt10) eventfd_signal_count() is called from two
locations:

 - fs/io_uring.c <<io_poll_wake>>
 - fs/aio.c <<aio_poll_wake>>

Both locations are different in 8.3 (more recent code got pulled from
upstream with the major fs update that got into 218). So, I'll try to
understand if a problem might reside somewhere around here.

Adding Jeff in Cc, as he backported the fs changes and might have
suggestions on how to tackle this.

Comment 4 Pei Zhang 2020-07-02 09:17:18 UTC
(In reply to Juri Lelli from comment #3)
> Hi,
> 
> I had a look and commit 68848557caf6 ("[fs] eventfd: track eventfd_signal()
> recursion depth") (kernel-4.18.0-218.el8) introduced the check that generates
> the WARNING reported.
> 
> Now, it is not immediately clear to me if the erroneous condition pointed
> out by the warning and the fact that VM install hangs are correlated.
> 
> I'm auditing recent changes (more info below), but as a first try I'd kindly
> ask Pei to try to replicate with kernel-rt-4.18.0-218.rt7.30.el8 (so that we
> restrict the number of potential problematic changes). As a second tentative

Hi Juri,

4.18.0-218.rt7.30.el8.x86_64 hit this issue.

So 4.18.0-218.rt7.30.el8.x86_64 is the first version which hit this problem.


> (don't have high hopes on it, but still think it's worth a shot) I'd try
> using
> a more recent compose to start the VM install, as I seem to remember I had
> troubles with testing using some compose from last week (don't remember the
> specific one, sorry). Could you also please try to replicate with, say,
> http://download.eng.pek2.redhat.com/nightly/RHEL-8.3.0-20200702.n.0/compose/
> BaseOS/x86_64/os/
> ? Asking this because I noticed dracut failing to download artifacts from the
> install log you shared.

Yes, I'm testing with the latest repo with 4.18.0-218.rt7.30.el8.x86_64. Reproduced this issue.

Thank you.

Best regards,

Pei

Comment 5 Pei Zhang 2020-07-02 09:20:59 UTC
Another info:

When using more vcpus (--vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15) to install, everything looks good, didn't hit this issue.

virt-install \
        --graphics type=spice,listen=0.0.0.0 \
        --name=rhel8.3_rt_8vcpu \
        --machine q35 \
        --memory=4096,hugepages=yes \
        --memorybacking hugepages=yes,size=1,unit=G,locked=yes,access.mode=shared \
        --cpu host-passthrough,cache.mode=passthrough,numa.cell0.memory=4194304,numa.cell0.cpus=0-9 \
        --numatune=1 \
        --vcpus=10,cpuset=16,18,1,3,5,7,9,11,13,15 \
        --controller type=scsi,model=virtio-scsi,driver.iommu=on  \
        --controller type=virtio-serial,driver.iommu=on  \
        --rng type=/dev/random,driver.iommu=on  \
        --memballoon virtio,driver.iommu=on  \
        --disk path=/home/images_nfv-virt-rt-kvm/rhel8.3_rt_8vcpu.qcow2,size=40,target.bus=scsi \
        -l http://download.eng.pek2.redhat.com/nightly/RHEL-8.3.0-20200701.n.0/compose/BaseOS/x86_64/os/ \
        --noautoconsole \
        --noreboot \
        -x ks=http://10.73.72.41/kickstart-rhel8.cfg  \
        --network bridge=switch,model=virtio,mac=38:66:77:5f:dd:95,driver.iommu=on

Comment 6 Juri Lelli 2020-07-02 16:46:30 UTC
I did some further investigation on this issue, which I'll try to summarize below.

Root cause of the hang seems really to be commit 68848557caf6; I could
verify that cherry picking this single commit on -217.rt brings the issue
in the last known good kernel and that by applying the following latest
RT kernel works (I could install and start a VM). So all the other changes
that came with -218 should be OK.

---
 fs/eventfd.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index ce1d1711fbbaf..cf9df57fa3719 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -65,8 +65,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
         * it returns true, the eventfd_signal() call should be deferred to a
         * safe context.
         */
-       if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
-               return 0;
+       WARN_ON_ONCE(this_cpu_read(eventfd_wake_count));

        spin_lock_irqsave(&ctx->wqh.lock, flags);
        this_cpu_inc(eventfd_wake_count);

This is of course not fixing the potential issue, but no user of eventfd_signal()
is currently checking its return value, so it looks like everybody is expecting
this to always work (and the fact that it doesn't in this case makes the install
process to hang, guess waiting on a wakeup event that never happens).

Upstream RT (v5.6.17-rt10) has the same problem. So, I'll soon report it. Apart
from the issue above, I think RT also has a problem if the comment above
eventfd_signal() is correct in saying that the function is expected to be
working in atomic ctx: spin_lock_irqsave is converted to a mutex in RT and
that of course should be allowed to sleep. I don't think we witness the
related warning though, still a potential issue is present.

Biggest problem in my understanding is that issue is blocking kvm-rt testing
and it might take a while before we get a proper fix upstream. Do we want
a stop gap "solution" of the kind above in the meanwhile? In the end the
actual issue has always been present (without causing crashes or observable
latencies) and we just uncovered it.

Comment 7 Marcelo Tosatti 2020-07-02 16:52:26 UTC
(In reply to Juri Lelli from comment #6)
> I did some further investigation on this issue, which I'll try to summarize
> below.
> 
> Root cause of the hang seems really to be commit 68848557caf6; I could
> verify that cherry picking this single commit on -217.rt brings the issue
> in the last known good kernel and that by applying the following latest
> RT kernel works (I could install and start a VM). So all the other changes
> that came with -218 should be OK.
> 
> ---
>  fs/eventfd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index ce1d1711fbbaf..cf9df57fa3719 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -65,8 +65,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>          * it returns true, the eventfd_signal() call should be deferred to a
>          * safe context.
>          */
> -       if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> -               return 0;
> +       WARN_ON_ONCE(this_cpu_read(eventfd_wake_count));
> 
>         spin_lock_irqsave(&ctx->wqh.lock, flags);
>         this_cpu_inc(eventfd_wake_count);
> 
> This is of course not fixing the potential issue, but no user of
> eventfd_signal()
> is currently checking its return value, so it looks like everybody is
> expecting
> this to always work (and the fact that it doesn't in this case makes the
> install
> process to hang, guess waiting on a wakeup event that never happens).
> 
> Upstream RT (v5.6.17-rt10) has the same problem. So, I'll soon report it.
> Apart
> from the issue above, I think RT also has a problem if the comment above
> eventfd_signal() is correct in saying that the function is expected 

It is expected to be called from interrupt context.

> to be
> working in atomic ctx: spin_lock_irqsave is converted to a mutex in RT and
> that of course should be allowed to sleep. I don't think we witness the
> related warning though, still a potential issue is present.
> 
> Biggest problem in my understanding is that issue is blocking kvm-rt testing
> and it might take a while before we get a proper fix upstream. Do we want
> a stop gap "solution" of the kind above in the meanwhile? In the end the
> actual issue has always been present (without causing crashes or observable
> latencies) and we just uncovered it.

Comment 8 Nitesh Narayan Lal 2020-07-02 21:44:01 UTC
(In reply to Juri Lelli from comment #6)
> I did some further investigation on this issue, which I'll try to summarize
> below.
> 
> Root cause of the hang seems really to be commit 68848557caf6; I could
> verify that cherry picking this single commit on -217.rt brings the issue
> in the last known good kernel and that by applying the following latest
> RT kernel works (I could install and start a VM). So all the other changes
> that came with -218 should be OK.
> 
> ---
>  fs/eventfd.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index ce1d1711fbbaf..cf9df57fa3719 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -65,8 +65,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>          * it returns true, the eventfd_signal() call should be deferred to a
>          * safe context.
>          */
> -       if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> -               return 0;
> +       WARN_ON_ONCE(this_cpu_read(eventfd_wake_count));
> 
>         spin_lock_irqsave(&ctx->wqh.lock, flags);
>         this_cpu_inc(eventfd_wake_count);
> 
> This is of course not fixing the potential issue, but no user of
> eventfd_signal()
> is currently checking its return value, so it looks like everybody is
> expecting
> this to always work (and the fact that it doesn't in this case makes the
> install
> process to hang, guess waiting on a wakeup event that never happens).
> 
> Upstream RT (v5.6.17-rt10) has the same problem. So, I'll soon report it.

Not sure if you have seen the following upstream patches from He Zhe:
v1: https://lore.kernel.org/lkml/1586257192-58369-1-git-send-email-zhe.he@windriver.com/
v2: https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/

Maybe we start discussing the initial ideas in the same thread along with the 
other issue that you have mentioned.


> Apart
> from the issue above, I think RT also has a problem if the comment above
> eventfd_signal() is correct in saying that the function is expected to be
> working in atomic ctx: spin_lock_irqsave is converted to a mutex in RT and
> that of course should be allowed to sleep. I don't think we witness the
> related warning though, still a potential issue is present.
> 
> Biggest problem in my understanding is that issue is blocking kvm-rt testing
> and it might take a while before we get a proper fix upstream. Do we want
> a stop gap "solution" of the kind above in the meanwhile? In the end the
> actual issue has always been present (without causing crashes or observable
> latencies) and we just uncovered it.

Comment 9 Juri Lelli 2020-07-03 07:59:57 UTC
(In reply to Nitesh Narayan Lal from comment #8)
> (In reply to Juri Lelli from comment #6)
> > I did some further investigation on this issue, which I'll try to summarize
> > below.
> > 
> > Root cause of the hang seems really to be commit 68848557caf6; I could
> > verify that cherry picking this single commit on -217.rt brings the issue
> > in the last known good kernel and that by applying the following latest
> > RT kernel works (I could install and start a VM). So all the other changes
> > that came with -218 should be OK.
> > 
> > ---
> >  fs/eventfd.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> > 
> > diff --git a/fs/eventfd.c b/fs/eventfd.c
> > index ce1d1711fbbaf..cf9df57fa3719 100644
> > --- a/fs/eventfd.c
> > +++ b/fs/eventfd.c
> > @@ -65,8 +65,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
> >          * it returns true, the eventfd_signal() call should be deferred to a
> >          * safe context.
> >          */
> > -       if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> > -               return 0;
> > +       WARN_ON_ONCE(this_cpu_read(eventfd_wake_count));
> > 
> >         spin_lock_irqsave(&ctx->wqh.lock, flags);
> >         this_cpu_inc(eventfd_wake_count);
> > 
> > This is of course not fixing the potential issue, but no user of
> > eventfd_signal()
> > is currently checking its return value, so it looks like everybody is
> > expecting
> > this to always work (and the fact that it doesn't in this case makes the
> > install
> > process to hang, guess waiting on a wakeup event that never happens).
> > 
> > Upstream RT (v5.6.17-rt10) has the same problem. So, I'll soon report it.
> 
> Not sure if you have seen the following upstream patches from He Zhe:
> v1:
> https://lore.kernel.org/lkml/1586257192-58369-1-git-send-email-zhe.
> he/
> v2: https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/

I didn't, thanks for the pointers!

v2 seems to work, but I wonder if that "raising the bar" approach isn't too
fragile.

> 
> Maybe we start discussing the initial ideas in the same thread along with
> the other issue that you have mentioned.

Anyway, I'll follow the discussion and report the potential problem with
upstream RT.

Comment 18 Pei Zhang 2020-08-05 02:33:43 UTC
Verification:

This issue cannot be reproduced with kernel-rt-4.18.0-230.rt7.42.el8.x86_64 any more.

Versions:
kernel-rt-4.18.0-230.rt7.42.el8.x86_64
qemu-kvm-4.2.0-31.module+el8.3.0+7437+4bb96e0d.x86_64
tuned-2.14.0-3.el8.noarch
libvirt-6.5.0-1.scrmod+el8.3.0+7493+a5e196a4.x86_64
python3-libvirt-6.0.0-1.module+el8.3.0+6423+e4cb6418.x86_64
microcode_ctl-20200609-2.el8.x86_64
rt-tests-1.8-8.el8.x86_64

12h cyclictest results:
(1)Single VM with 1 rt vCPU:
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00022

(2)Single VM with 8 rt vCPUs:
# Min Latencies: 00008 00012 00012 00012 00012 00012 00012 00012
# Avg Latencies: 00009 00012 00012 00012 00012 00012 00012 00012
# Max Latencies: 00020 00022 00022 00022 00022 00022 00020 00022

(3)Multiple VMs each with 1 rt vCPU:
- VM1
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00019

- VM2
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00019

- VM3
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00022

- VM4
# Min Latencies: 00007
# Avg Latencies: 00009
# Max Latencies: 00023

Beaker job:
https://beaker.engineering.redhat.com/jobs/4457653

So this bug has been fixed very well. Move to 'VERIFIED'.

Comment 21 errata-xmlrpc 2020-11-04 02:26:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: kernel-rt security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4609