Bug 1292902 - rt: netpoll: live lock with NAPI polling and busy polling on realtime kernel
Summary: rt: netpoll: live lock with NAPI polling and busy polling on realtime kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel-rt
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 7.3
Assignee: Clark Williams
QA Contact: Zhang Kexin
URL:
Whiteboard:
Duplicates: 1273264
Depends On:
Blocks: 1274397 RT7.3-Prio 1293230 1295884 1313485
Reported: 2015-12-18 16:55 UTC by Clark Williams
Modified: 2016-11-03 19:38 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Clones: 1293230
Environment:
Last Closed: 2016-11-03 19:38:57 UTC
Target Upstream Version:
Embargoed:


Attachments
netpoll: Always take poll_lock when doing polling (2.17 KB, patch)
2015-12-18 18:47 UTC, Clark Williams
Revert "ixgbevf: Prevent livelock spinning grabbing ixgbevf_qv_lock" (3.18 KB, patch)
2016-01-07 02:37 UTC, Luis Claudio R. Goncalves
revert "ixgbe: Prevent livelock spinning grabbing ixgbe_qv_lock" (3.53 KB, patch)
2016-01-07 02:38 UTC, Luis Claudio R. Goncalves


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHSA-2016:2584 | 0 | normal | SHIPPED_LIVE | Important: kernel-rt security, bug fix, and enhancement update | 2016-11-03 12:08:49 UTC

Description Clark Williams 2015-12-18 16:55:55 UTC
A livelock has been seen with NAPI-polling-capable NICs, such as ixgbe and sfc.

Synchronization between NAPI polling and busy polling is done by looping on the NAPI_STATE_SCHED bit. This works on a non-RT kernel because a softirq cannot be preempted, and the thread poll is called with local_bh_disable(), which prevents softirqs from running and preempting it. On RT, however, this code can be preempted. The code may therefore be preempted while it holds the NAPI_STATE_SCHED bit, opening a window for a livelock.
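
The window described above can be illustrated with a small user-space model in C. This is only a sketch under stated assumptions, not kernel code: sched_bit is a hypothetical stand-in for NAPI_STATE_SCHED, and poll_try_claim(), poll_release() and poll_spin_claim() are names invented for the illustration. The spin is bounded by max_spins so that the model terminates; the loop described in this report has no such bound.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the NAPI_STATE_SCHED bit (not the kernel's definition). */
static atomic_bool sched_bit;

/* Try to claim the poll context by setting the bit; true if this caller set it. */
static bool poll_try_claim(void)
{
    return !atomic_exchange(&sched_bit, true);
}

/* Release the poll context; only the current owner may call this. */
static void poll_release(void)
{
    assert(atomic_load(&sched_bit));
    atomic_store(&sched_bit, false);
}

/*
 * Busy-wait for the bit, as a spinning poller would.
 * Returns the number of failed attempts before success,
 * or -1 if the bit was still held after max_spins attempts.
 */
static long poll_spin_claim(long max_spins)
{
    for (long failed = 0; failed < max_spins; failed++) {
        if (poll_try_claim())
            return failed;
    }
    return -1;
}
```

If the owner claims the bit and is then preempted before releasing it, every attempt in poll_spin_claim() fails and the call returns -1. Once the owner gets to run and calls poll_release(), the next attempt succeeds immediately. A spinner that keeps the owner off the CPU therefore never sees the bit clear.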

Comment 1 Clark Williams 2015-12-18 18:47:55 UTC
Created attachment 1107309
netpoll: Always take poll_lock when doing polling

Patch to synchronize NAPI polling and busy-polling to prevent live-lock.
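
The general idea of serializing both polling paths on one lock can be sketched in user-space C as follows. This does not reproduce the attached patch: struct fake_napi, poll_once(), poller() and run_two_pollers() are hypothetical names, and a pthread mutex merely stands in for a lock whose waiter blocks rather than spins, so that the lock holder can run and release it.

```c
#include <pthread.h>
#include <stddef.h>

#define POLLS_PER_THREAD 100000

/* Hypothetical per-device state; the field name only echoes the patch title. */
struct fake_napi {
    pthread_mutex_t poll_lock;
    long packets_polled;
};

/* One poll pass, serialized by poll_lock. A waiter sleeps instead of spinning. */
static void poll_once(struct fake_napi *n)
{
    pthread_mutex_lock(&n->poll_lock);
    n->packets_polled++;
    pthread_mutex_unlock(&n->poll_lock);
}

/* Thread body: poll a fixed number of times. */
static void *poller(void *arg)
{
    struct fake_napi *n = arg;

    for (int i = 0; i < POLLS_PER_THREAD; i++)
        poll_once(n);
    return NULL;
}

/* Run two competing pollers and return the total poll count, or -1 on error. */
static long run_two_pollers(void)
{
    struct fake_napi n = { .packets_polled = 0 };
    pthread_t a, b;
    long total = -1;

    if (pthread_mutex_init(&n.poll_lock, NULL) != 0)
        return -1;
    if (pthread_create(&a, NULL, poller, &n) == 0) {
        if (pthread_create(&b, NULL, poller, &n) == 0) {
            pthread_join(b, NULL);
            pthread_join(a, NULL);
            total = n.packets_polled;
        } else {
            pthread_join(a, NULL);
        }
    }
    pthread_mutex_destroy(&n.poll_lock);
    return total;
}
```

Both pollers finish and the count equals 2 * POLLS_PER_THREAD, because neither path ever busy-waits on state that only a descheduled owner can clear.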

Comment 2 Clark Williams 2015-12-18 18:49:34 UTC
Note: the RT engineering team originally thought this was a problem in the ixgbe driver code, but further BZs revealed that it is a consequence of how RT is implemented, combined with the NAPI polling and busy-polling code in the network driver framework.

Comment 6 Luis Claudio R. Goncalves 2016-01-07 02:37:31 UTC
Created attachment 1112320
Revert "ixgbevf: Prevent livelock spinning grabbing ixgbevf_qv_lock"

Comment 7 Luis Claudio R. Goncalves 2016-01-07 02:38:26 UTC
Created attachment 1112321
revert "ixgbe: Prevent livelock spinning grabbing ixgbe_qv_lock"

Comment 9 Zhenjie Chen 2016-06-06 05:14:54 UTC
QE update,

Reproduced on 3.10.0-327.rt56.204.el7.x86_64 with a test like the one in https://bugzilla.redhat.com/show_bug.cgi?id=1293230#c14

[ 1112.876788] INFO: rcu_preempt self-detected stall on CPU { 13}  (t=60000 jiffies g=4995 c=4994 q=0)          
[ 1112.876789] sending NMI to all CPUs:
[ 1112.876793] NMI backtrace for cpu 0
[ 1112.876796] CPU: 0 PID: 788 Comm: irq/86-0000:07: Not tainted 3.10.0-327.rt56.204.el7.x86_64 #1              
[ 1112.876797] Hardware name: HP ProLiant DL388p Gen8, BIOS P70 12/14/2012
[ 1112.876799] task: ffff880416031780 ti: ffff880416040000 task.ti: ffff880416040000
[ 1112.876807] RIP: 0010:[<ffffffff810a9f8f>]  [<ffffffff810a9f8f>] migrate_disable+0xf/0xf0
[ 1112.876808] RSP: 0018:ffff880416043b38  EFLAGS: 00000203
[ 1112.876808] RAX: ffff880416043fd8 RBX: ffff88042f613680 RCX: 0000000000000020
[ 1112.876809] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000200
[ 1112.876810] RBP: ffff880416043b78 R08: 000000000000003c R09: 0000000000000001
[ 1112.876810] R10: ffff880419a1368e R11: ffff880416efc980 R12: 0000000000013680
[ 1112.876811] R13: 0000000000000200 R14: 0000000000000020 R15: ffff880416031780
[ 1112.876812] FS:  0000000000000000(0000) GS:ffff88042f600000(0000) knlGS:0000000000000000
[ 1112.876813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1112.876813] CR2: 00000000006eb0f8 CR3: 00000000bb4b9000 CR4: 00000000000407f0
[ 1112.876814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1112.876815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1112.876824] Stack:
[ 1112.876827]  ffff880416043b78 ffffffff81501dd4 ffff8800bc82dca0 0000000000000200
[ 1112.876828]  ffff8804165fb000 ffff8800bc82dcb8 ffff880416efc980 0000000000000001
[ 1112.876830]  ffff880416043b98 ffffffff815024b1 ffff880416efc000 ffff8804165fb000
[ 1112.876831] Call Trace:
[ 1112.876836]  [<ffffffff81501dd4>] ? __netdev_alloc_frag+0x54/0xe0
[ 1112.876838]  [<ffffffff815024b1>] __alloc_rx_skb+0x51/0xb0
[ 1112.876840]  [<ffffffff8150252b>] __netdev_alloc_skb+0x1b/0x40
[ 1112.876869]  [<ffffffffa04c423f>] __efx_rx_packet+0xff/0x5f0 [sfc]
[ 1112.876877]  [<ffffffffa04c49d9>] efx_rx_packet+0x2a9/0x3f0 [sfc]
[ 1112.876884]  [<ffffffffa04be90b>] efx_ef10_ev_process+0x3bb/0x6b0 [sfc]
[ 1112.876887]  [<ffffffff81512ef9>] ? netif_receive_skb+0x89/0xe0
[ 1112.876893]  [<ffffffffa04a8469>] efx_process_channel+0x99/0x1b0 [sfc]
[ 1112.876898]  [<ffffffffa04a8760>] efx_poll+0xb0/0x230 [sfc]
[ 1112.876900]  [<ffffffff81513f5b>] net_rx_action+0x1fb/0x360
[ 1112.876903]  [<ffffffff81077558>] do_current_softirqs+0x1d8/0x3c0
[ 1112.876906]  [<ffffffff8110bfc0>] ? irq_thread_fn+0x50/0x50
[ 1112.876908]  [<ffffffff810777b4>] local_bh_enable+0x74/0xa0
[ 1112.876909]  [<ffffffff8110c001>] irq_forced_thread_fn+0x41/0x70
[ 1112.876911]  [<ffffffff8110c49f>] irq_thread+0x12f/0x180
[ 1112.876912]  [<ffffffff8110c080>] ? wake_threads_waitq+0x50/0x50
[ 1112.876914]  [<ffffffff8110c370>] ? irq_thread_check_affinity+0x30/0x30
[ 1112.876917]  [<ffffffff81099e41>] kthread+0xc1/0xd0
[ 1112.876919]  [<ffffffff81099d80>] ? kthread_worker_fn+0x170/0x170
[ 1112.876922]  [<ffffffff81631558>] ret_from_fork+0x58/0x90
[ 1112.876923]  [<ffffffff81099d80>] ? kthread_worker_fn+0x170/0x170
[ 1112.876934] Code: 75 08 48 83 87 88 07 00 00 01 e8 ed b1 ff ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 65 48 8b 04 25 78 c0 00 00 <48> 89 e5 41 55 41 54 53 65 48 8b 1c 25 80 c0 00 00 f7 80 44 c0


Verified on 3.10.0-415.rt56.298.el7.x86_64.
Ran the reproducer for several hours; no problems found.

Comment 10 Beth Uptagrafft 2016-07-29 14:06:19 UTC
*** Bug 1273264 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2016-11-03 19:38:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2584.html

