Bug 483445

Summary:	Packets Loss with Netdump
Product:	Red Hat Enterprise Linux 4	Reporter:	Qian Cai <qcai>
Component:	kernel	Assignee:	Neil Horman <nhorman>
Status:	CLOSED NOTABUG	QA Contact:	Martin Jenner <mjenner>
Severity:	medium	Docs Contact:
Priority:	low
Version:	4.8	CC:	nhorman
Target Milestone:	rc
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2009-02-01 19:09:23 UTC	Type:	---

Description Qian Cai 2009-02-01 12:28:30 UTC

Description of problem:
This bug tracks an additional issue seen with the fix for:

Bug 477945 - Kernel Panic with Bnx2 - Badness in local_bh_enable at kernel/softirq.c:141

I consistently see packet loss while running "echo t >/proc/sysrq-trigger" in a loop.

From the affected machine's serial console,
# while :; do echo t >/proc/sysrq-trigger; done

From another host,
$ ping hp-dl785g5-01.rhts.bos.redhat.com
...

I see heavy packet loss here.

It likely happens on machines using the bnx2 driver. Affected machines so far:

hp-dl785g5-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.bos.redhat.com
dell-pe1950-01.rhts.englab.brq.redhat.com

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.23.EL plus the patch from:

https://bugzilla.redhat.com/show_bug.cgi?id=477945#c11

How reproducible:
always

Steps to Reproduce:
1. Reserve one of the affected machines.
2. On its serial console, run:
# while :; do echo t >/proc/sysrq-trigger; done
3. From another host, run:
$ ping <the affected machine>
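
To put a number on the loss seen from the other host, the ping summary can be parsed. This is a minimal sketch, assuming an iputils-style summary line that contains "N% packet loss". The helper name parse_loss and the TARGET variable are hypothetical and not part of the original report.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print only the
# packet-loss percentage taken from the summary line.
# Assumes a summary of the form "... , 25% packet loss, ...".
parse_loss() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example usage (left commented out so that sourcing this file sends no traffic;
# TARGET is a placeholder for the affected machine):
#   ping -c 100 "$TARGET" | parse_loss
```

Running it once with the sysrq loop active and once without gives two figures that can be compared directly.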
  
Actual results:
packet loss

Expected results:
no packet loss

Comment 1 Neil Horman 2009-02-01 19:09:23 UTC

This isn't a bug; you're exercising the pessimal case of netpoll.  In the prior bug that you mention, we found a problem wherein shared data was accessed from multiple contexts, causing a panic.  The fix for that was to enforce the needed mutual exclusion between those contexts.  Since one of the contexts was the nominal receive fast path (net_rx_action), netpoll now (correctly) blocks receive operations while calling the poll_controller/poll methods of a driver.  Doing this puts us at risk for frame loss.  By sending multiple sysrq-t's, you effectively create multiple windows of time where we can't receive frames, leading to overflow and frame drops.  This is working as it should.
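
The effect of repeated blocked windows can be illustrated with a toy model. This is only a sketch under stated assumptions, not the kernel's or the bnx2 driver's actual behaviour: a fixed-size receive ring, a constant arrival rate per tick, and a receive path that drains nothing while it is blocked. The function name simulate_drops and all of its parameters are hypothetical.

```shell
#!/bin/sh
# Toy model of a receive ring that is not drained during blocked windows.
# Arguments (all hypothetical, in arbitrary units):
#   $1 ring size in frames      $2 frames arriving per tick
#   $3 frames drained per tick  $4 blocked ticks per cycle
#   $5 unblocked ticks per cycle $6 number of cycles
# Prints the total number of frames dropped because the ring was full.
simulate_drops() {
    ring_size=$1; arrivals=$2; drain=$3
    blocked_ticks=$4; open_ticks=$5; cycles=$6
    queued=0; dropped=0; cycle=0
    period=$((blocked_ticks + open_ticks))
    while [ "$cycle" -lt "$cycles" ]; do
        tick=0
        while [ "$tick" -lt "$period" ]; do
            # Frames arrive every tick, blocked or not.
            queued=$((queued + arrivals))
            # Anything beyond the ring capacity is dropped.
            if [ "$queued" -gt "$ring_size" ]; then
                dropped=$((dropped + queued - ring_size))
                queued=$ring_size
            fi
            # The ring is drained only once the blocked window has ended.
            if [ "$tick" -ge "$blocked_ticks" ]; then
                queued=$((queued - drain))
                [ "$queued" -lt 0 ] && queued=0
            fi
            tick=$((tick + 1))
        done
        cycle=$((cycle + 1))
    done
    echo "$dropped"
}
```

In this model no frames are dropped when there are no blocked ticks and the drain rate exceeds the arrival rate. Drops appear once a blocked window lasts longer than the ring can absorb, and more cycles mean more such windows.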