Bug 483445

Summary: Packet Loss with Netdump
Product: Red Hat Enterprise Linux 4
Reporter: Qian Cai <qcai>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED NOTABUG
QA Contact: Martin Jenner <mjenner>
Severity: medium
Priority: low
Version: 4.8
CC: nhorman
Target Milestone: rc   
Target Release: ---   
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2009-02-01 19:09:23 UTC

Description Qian Cai 2009-02-01 12:28:30 UTC
Description of problem:
This bug tracks an additional issue seen with the fix for,

Bug 477945 - Kernel Panic with Bnx2 - Badness in local_bh_enable at kernel/softirq.c:141

I consistently see packet loss while running "echo t >/proc/sysrq-trigger" in a loop.

From the affected machine's serial console,
# while :; do echo t >/proc/sysrq-trigger; done

From another host,
$ ping hp-dl785g5-01.rhts.bos.redhat.com
...

I see a lot of packet loss here.

It likely happens on machines using the bnx2 driver:

hp-dl785g5-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.englab.brq.redhat.com

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.23.EL + patch from,

https://bugzilla.redhat.com/show_bug.cgi?id=477945#c11

How reproducible:
always

Steps to Reproduce:
1. Reserve one of the affected machines.
2. On its serial console, run: while :; do echo t >/proc/sysrq-trigger; done
3. From another host,
$ ping <the affected machine>
  
Actual results:
packet loss

Expected results:
no packet loss
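
To put a number on the loss instead of eyeballing the ping output, the summary line can be parsed. This is a minimal sketch, assuming the iputils-style summary line ("N packets transmitted, M received, X% packet loss"); the helper name loss_percent is hypothetical and not part of any tool mentioned in this report.

```shell
# loss_percent (hypothetical helper): read ping output on stdin and
# print the packet-loss percentage from the summary line.
# Assumes a summary line such as:
#   "10 packets transmitted, 7 received, 30% packet loss, time 9012ms"
loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}
```

Possible usage from the other host: ping -c 100 <the affected machine> | loss_percent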

Comment 1 Neil Horman 2009-02-01 19:09:23 UTC
This isn't a bug; you're exercising the pessimal case of netpoll. In the prior bug that you mention, we found a problem wherein shared data was accessed from multiple contexts, causing a panic. The fix for that was to enforce the needed mutual exclusion between those contexts. Since one of the contexts was the nominal receive fast path (net_rx_action), netpoll now (correctly) blocks receive operations while calling the poll_controller/poll methods of a driver. Doing this puts us at risk for frame loss. By sending multiple sysrq-t's, you effectively create multiple windows of time where we can't receive frames, leading to overflow and frame drops. This is working as it should.
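
The overflow effect can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the kernel's netpoll or bnx2 code: one frame arrives per tick, the receive ring holds ring_size frames, and the ring is fully drained on any tick where receive processing is not blocked. The function name simulate_rx and all of its parameters are hypothetical.

```shell
# simulate_rx (hypothetical toy model): count dropped frames when receive
# processing is blocked for the first `blocked` ticks of every `period` ticks.
# Args: ring_size ticks period blocked
simulate_rx() {
    ring_size=$1; ticks=$2; period=$3; blocked=$4
    queued=0; dropped=0; t=0
    while [ "$t" -lt "$ticks" ]; do
        # One frame arrives per tick; it is dropped if the ring is already full.
        if [ "$queued" -lt "$ring_size" ]; then
            queued=$((queued + 1))
        else
            dropped=$((dropped + 1))
        fi
        # Receive processing drains the ring only outside the blocked window.
        if [ $((t % period)) -ge "$blocked" ]; then
            queued=0
        fi
        t=$((t + 1))
    done
    echo "$dropped"
}
```

In this model, "simulate_rx 4 20 10 3" prints 0 because a 4-entry ring absorbs a 3-tick blocked window, while "simulate_rx 4 20 10 8" prints 10 because each 8-tick window outlasts the ring. Repeating the blocked window back to back, as a sysrq-t loop does, repeats the drops.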