
Bug 1292902

Summary: rt: netpoll: live lock with NAPI polling and busy polling on realtime kernel
Product: Red Hat Enterprise Linux 7
Reporter: Clark Williams <williams>
Component: kernel-rt
Assignee: Clark Williams <williams>
kernel-rt sub component: Misc
QA Contact: Zhang Kexin <kzhang>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: bhu, daolivei, kzhang, lgoncalv, zshi
Version: 7.3
Keywords: ZStream
Target Milestone: rc
Target Release: 7.3
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1293230 (view as bug list)
Environment:
Last Closed: 2016-11-03 19:38:57 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1274397, 1282922, 1293230, 1295884, 1313485
Attachments:
netpoll: Always take poll_lock when doing polling (Flags: none)
Revert "ixgbevf: Prevent livelock spinning grabbing ixgbevf_qv_lock" (Flags: none)
revert "ixgbe: Prevent livelock spinning grabbing ixgbe_qv_lock" (Flags: none)

Description Clark Williams 2015-12-18 16:55:55 UTC
A livelock has been seen with NICs whose drivers support NAPI polling, such as ixgbe and sfc.

Synchronization between NAPI polling and busy polling is done by looping on the NAPI_STATE_SCHED bit. This works on a non-RT kernel because a softirq cannot be preempted, and the poll from thread context is called under local_bh_disable(), which prevents softirqs from running and preempting it. On RT, however, this code can be preempted. The code may therefore be preempted while it holds the NAPI_STATE_SCHED bit, opening a window for a livelock.
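
The window described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: `PollState`, `try_sched`, `release`, and `slices_until_acquired` are hypothetical names, and the "scheduler" is a loop that gives every time slice to a higher-priority spinner unless the preempted owner is allowed to run.

```python
SCHED = 0x1  # stand-in for the NAPI_STATE_SCHED bit (assumption: toy value)


class PollState:
    """Toy model of a state word guarded only by a test-and-set bit."""

    def __init__(self):
        self.bits = 0

    def try_sched(self):
        """Set SCHED; return True only if the bit was previously clear."""
        was_set = bool(self.bits & SCHED)
        self.bits |= SCHED
        return not was_set

    def release(self):
        """Clear SCHED, as the owner does once its poll has finished."""
        self.bits &= ~SCHED


def slices_until_acquired(budget, owner_can_run):
    """Return the slice in which the spinner gets the bit, or None.

    The low-priority owner sets the bit and is then preempted. The
    high-priority spinner loops on the bit for `budget` time slices.
    If `owner_can_run` is False the owner never gets CPU time, so it
    can never clear the bit and the spinner never makes progress.
    """
    state = PollState()
    state.try_sched()  # owner takes the bit, then is preempted
    for slice_no in range(budget):
        if state.try_sched():
            return slice_no
        if owner_can_run:
            state.release()  # owner ran and finished its poll
    return None


if __name__ == "__main__":
    print("owner starved:", slices_until_acquired(1000, owner_can_run=False))
    print("owner runs:   ", slices_until_acquired(1000, owner_can_run=True))
```

In the starved case the spinner burns its whole budget without acquiring the bit, which is the shape of the livelock: the waiter is runnable and busy, yet the only task that can unblock it never runs.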

Comment 1 Clark Williams 2015-12-18 18:47:55 UTC
Created attachment 1107309
netpoll: Always take poll_lock when doing polling

Patch to synchronize NAPI polling and busy-polling to prevent live-lock.
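
The attachment itself is not reproduced in this report, so the following is only a sketch of the general idea its title suggests: both polling paths take the same lock around the poll, so a contending poller blocks instead of spinning on a bit. All names here (`Napi`, `poll_lock`, `poll`, `run_pollers`) are hypothetical, and Python's `threading.Lock` has no priority inheritance, so the example shows only the mutual exclusion, not RT scheduling behaviour.

```python
import threading


class Napi:
    """Toy poll context whose poll() is always serialized by one lock."""

    def __init__(self):
        self.poll_lock = threading.Lock()  # hypothetical stand-in for poll_lock
        self.in_poll = 0       # pollers currently inside the critical section
        self.max_in_poll = 0   # highest concurrency ever observed
        self.polls = 0         # total completed polls

    def poll(self):
        # A contending caller sleeps on the lock rather than looping on a
        # state bit, so the current holder can be scheduled and finish.
        with self.poll_lock:
            self.in_poll += 1
            self.max_in_poll = max(self.max_in_poll, self.in_poll)
            self.polls += 1
            self.in_poll -= 1


def run_pollers(rounds=1000):
    """Run two pollers (think: NAPI path and busy-poll path) concurrently."""
    napi = Napi()

    def worker():
        for _ in range(rounds):
            napi.poll()

    threads = [threading.Thread(target=worker) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return napi


if __name__ == "__main__":
    result = run_pollers()
    print("polls:", result.polls, "max concurrent:", result.max_in_poll)
```

Because every poll runs under the lock, the two paths never overlap, and neither path needs a retry loop to wait for the other.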

Comment 2 Clark Williams 2015-12-18 18:49:34 UTC
Note: the RT engineering team originally thought this was a problem in the ixgbe driver code, but further BZs revealed that it is a consequence of how RT is implemented, combined with the NAPI polling and busy-polling code in the network driver framework.

Comment 6 Luis Claudio R. Goncalves 2016-01-07 02:37:31 UTC
Created attachment 1112320
Revert "ixgbevf: Prevent livelock spinning grabbing ixgbevf_qv_lock"

Comment 7 Luis Claudio R. Goncalves 2016-01-07 02:38:26 UTC
Created attachment 1112321
revert "ixgbe: Prevent livelock spinning grabbing ixgbe_qv_lock"

Comment 9 Zhenjie Chen 2016-06-06 05:14:54 UTC
QE update:

Reproduced on 3.10.0-327.rt56.204.el7.x86_64 with a test like the one in https://bugzilla.redhat.com/show_bug.cgi?id=1293230#c14

[ 1112.876788] INFO: rcu_preempt self-detected stall on CPU { 13}  (t=60000 jiffies g=4995 c=4994 q=0)          
[ 1112.876789] sending NMI to all CPUs:
[ 1112.876793] NMI backtrace for cpu 0
[ 1112.876796] CPU: 0 PID: 788 Comm: irq/86-0000:07: Not tainted 3.10.0-327.rt56.204.el7.x86_64 #1              
[ 1112.876797] Hardware name: HP ProLiant DL388p Gen8, BIOS P70 12/14/2012
[ 1112.876799] task: ffff880416031780 ti: ffff880416040000 task.ti: ffff880416040000
[ 1112.876807] RIP: 0010:[<ffffffff810a9f8f>]  [<ffffffff810a9f8f>] migrate_disable+0xf/0xf0
[ 1112.876808] RSP: 0018:ffff880416043b38  EFLAGS: 00000203
[ 1112.876808] RAX: ffff880416043fd8 RBX: ffff88042f613680 RCX: 0000000000000020
[ 1112.876809] RDX: 0000000000000000 RSI: 0000000000000020 RDI: 0000000000000200
[ 1112.876810] RBP: ffff880416043b78 R08: 000000000000003c R09: 0000000000000001
[ 1112.876810] R10: ffff880419a1368e R11: ffff880416efc980 R12: 0000000000013680
[ 1112.876811] R13: 0000000000000200 R14: 0000000000000020 R15: ffff880416031780
[ 1112.876812] FS:  0000000000000000(0000) GS:ffff88042f600000(0000) knlGS:0000000000000000
[ 1112.876813] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1112.876813] CR2: 00000000006eb0f8 CR3: 00000000bb4b9000 CR4: 00000000000407f0
[ 1112.876814] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1112.876815] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1112.876824] Stack:
[ 1112.876827]  ffff880416043b78 ffffffff81501dd4 ffff8800bc82dca0 0000000000000200
[ 1112.876828]  ffff8804165fb000 ffff8800bc82dcb8 ffff880416efc980 0000000000000001
[ 1112.876830]  ffff880416043b98 ffffffff815024b1 ffff880416efc000 ffff8804165fb000
[ 1112.876831] Call Trace:
[ 1112.876836]  [<ffffffff81501dd4>] ? __netdev_alloc_frag+0x54/0xe0
[ 1112.876838]  [<ffffffff815024b1>] __alloc_rx_skb+0x51/0xb0
[ 1112.876840]  [<ffffffff8150252b>] __netdev_alloc_skb+0x1b/0x40
[ 1112.876869]  [<ffffffffa04c423f>] __efx_rx_packet+0xff/0x5f0 [sfc]
[ 1112.876877]  [<ffffffffa04c49d9>] efx_rx_packet+0x2a9/0x3f0 [sfc]
[ 1112.876884]  [<ffffffffa04be90b>] efx_ef10_ev_process+0x3bb/0x6b0 [sfc]
[ 1112.876887]  [<ffffffff81512ef9>] ? netif_receive_skb+0x89/0xe0
[ 1112.876893]  [<ffffffffa04a8469>] efx_process_channel+0x99/0x1b0 [sfc]
[ 1112.876898]  [<ffffffffa04a8760>] efx_poll+0xb0/0x230 [sfc]
[ 1112.876900]  [<ffffffff81513f5b>] net_rx_action+0x1fb/0x360
[ 1112.876903]  [<ffffffff81077558>] do_current_softirqs+0x1d8/0x3c0
[ 1112.876906]  [<ffffffff8110bfc0>] ? irq_thread_fn+0x50/0x50
[ 1112.876908]  [<ffffffff810777b4>] local_bh_enable+0x74/0xa0
[ 1112.876909]  [<ffffffff8110c001>] irq_forced_thread_fn+0x41/0x70
[ 1112.876911]  [<ffffffff8110c49f>] irq_thread+0x12f/0x180
[ 1112.876912]  [<ffffffff8110c080>] ? wake_threads_waitq+0x50/0x50
[ 1112.876914]  [<ffffffff8110c370>] ? irq_thread_check_affinity+0x30/0x30
[ 1112.876917]  [<ffffffff81099e41>] kthread+0xc1/0xd0
[ 1112.876919]  [<ffffffff81099d80>] ? kthread_worker_fn+0x170/0x170
[ 1112.876922]  [<ffffffff81631558>] ret_from_fork+0x58/0x90
[ 1112.876923]  [<ffffffff81099d80>] ? kthread_worker_fn+0x170/0x170
[ 1112.876934] Code: 75 08 48 83 87 88 07 00 00 01 e8 ed b1 ff ff 5d c3 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 65 48 8b 04 25 78 c0 00 00 <48> 89 e5 41 55 41 54 53 65 48 8b 1c 25 80 c0 00 00 f7 80 44 c0


Verified on 3.10.0-415.rt56.298.el7.x86_64
Ran the reproducer for several hours; no problems found.

Comment 10 Beth Uptagrafft 2016-07-29 14:06:19 UTC
*** Bug 1273264 has been marked as a duplicate of this bug. ***

Comment 13 errata-xmlrpc 2016-11-03 19:38:57 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2584.html