Bug 251978 - kernel panic : kernel BUG at include/linux/netdevice.h:888!
Summary: kernel panic : kernel BUG at include/linux/netdevice.h:888!
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.5
Hardware: i386
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 430698
 
Reported: 2007-08-13 18:17 UTC by Keshav Sharma
Modified: 2018-10-20 02:46 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-22 13:30:50 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
patch file (1.82 KB, patch) - 2007-08-13 18:17 UTC, Keshav Sharma
patch to hide device from the poll list during netpoll operations to avoid net_rx_action races (975 bytes, patch) - 2007-08-30 18:54 UTC, Neil Horman
new patch to test (1.99 KB, patch) - 2007-08-31 12:14 UTC, Neil Horman
patch to serialize netpoll code (1.90 KB, patch) - 2007-08-31 20:41 UTC, Neil Horman
patch example w/o additional locking (2.41 KB, patch) - 2007-09-04 03:20 UTC, Tina Yang
patch example w/o additional locking (2.41 KB, patch) - 2007-09-04 03:20 UTC, Tina Yang
patch forbids cross-cpu dev processing (2.67 KB, patch) - 2007-09-04 21:36 UTC, Tina Yang
alternate patch possibility (2.46 KB, patch) - 2007-09-05 14:17 UTC, Neil Horman
revised version of my alternate patch (2.83 KB, patch) - 2007-09-05 18:14 UTC, Neil Horman
patch prohibits cross-cpu dev processing (2.49 KB, patch) - 2007-09-06 21:38 UTC, Tina Yang
new version of my patch (8.77 KB, patch) - 2007-09-07 18:48 UTC, Neil Horman
correct patch (2.83 KB, patch) - 2007-09-07 19:33 UTC, Neil Horman
updated patch to fix possible deadlock (2.83 KB, patch) - 2007-09-13 11:45 UTC, Neil Horman

Description Keshav Sharma 2007-08-13 18:17:24 UTC
Description of problem:

Client upgraded from ocfs2 v1.2.3 to v1.2.5 and began experiencing evictions
due to node fencing within a day of upgrading.

This is due to
PANIC: "kernel BUG at include/linux/netdevice.h:888!"

The chain of events appears to be a period of higher load on one system (e.g.
node1) causing it to slow and eventually fail to respond to the OCFS2
heartbeat.  This load, however, is lower than loads seen at other times which
cause no problems.  Node1's messages file fills with "kernel: fown:
signal 29, pid 10946" messages as it slows, until ...

  15:11 - node1 panics and dumps vmcore
  15:12 - node2 initiates the eviction of node1 from the cluster
  15:14 - node2 reports that the DLM eviction was successful
  15:56 - node1 finally restarts

This sequence of events is the same regardless of which node slows and is
eventually fenced out.

Analysis: The tg3 driver is detecting an extremely rare state inconsistency
which is mishandled by the kernel, resulting in a kernel panic.
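
For context, netdevice.h:888 is the BUG_ON() inside netif_rx_complete(); a
paraphrased sketch of the 2.6.9-era inline (reconstructed from memory, not
verbatim):

static inline void netif_rx_complete(struct net_device *dev)
{
	unsigned long flags;

	local_irq_save(flags);
	/* this is the check that fires: the device must still be marked
	 * as scheduled for rx polling when polling completes */
	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
	list_del(&dev->poll_list);
	smp_mb__before_clear_bit();
	clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
	local_irq_restore(flags);
}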


Version-Release number of selected component (if applicable):

kernel-2.6.9-55.EL

How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:
Patch attached and fix is queued for mainline 2.6.23.

Comment 1 Keshav Sharma 2007-08-13 18:17:24 UTC
Created attachment 161198 [details]
patch file

Comment 2 Neil Horman 2007-08-29 20:25:54 UTC
Sorry, but this just looks like a hack to prevent netif_rx_complete from
checking the __LINK_STATE_RX_SCHED flag when it's clearly been improperly
cleared.  The right thing to do is figure out why the RX_SCHED flag is cleared
on entry to this function and prevent that from happening, not just shortcut
it with an additional flag.  Looking at the most recent RHEL4 build, I don't
see anything between poll_napi, the tg3_poll routine and netif_rx_complete
that changes the state of that flag.  I've not looked closely, but that would
suggest to me that this is either a bug that's been fixed in an intervening
release (meaning a test of the latest kernel is in order), or that something
outside of tg3 or the netpoll code is messing with that flag.  Has the ocfs
code been audited to verify that this is not the case?

Comment 3 Tina Yang 2007-08-30 17:38:27 UTC
The vmcore dump indicates a potential race condition for dev->poll at the time
of the crash (the flag check) when tg3 is not even on the per-cpu poll_list.
The code clearly shows more than one such scenario between poll_napi and
net_rx_action, since by design poll_napi is not regulated by the per-cpu
poll_list (the relevant loop is sketched below):
1) A race window exists in net_rx_action between finding the device on the
poll_list and acquiring the poll_lock - i.e. poll_napi could already be inside
dev->poll and remove the device from the poll_list in the interim.
2) Another window exists between checking list_empty and getting the next list
entry in the same code segment, while poll_napi could remove the last entry
from the poll_list, resulting in another type of crash (invalid pointer).
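
To make the windows concrete, a simplified paraphrase of the 2.6.9-era
net_rx_action loop (not verbatim; see the hunks quoted later in this bug for
the exact context):

static void net_rx_action(struct softirq_action *h)
{
	struct softnet_data *queue = &__get_cpu_var(softnet_data);
	int budget = netdev_max_backlog;
	struct net_device *dev;
	void *have;

	local_irq_disable();

	/* window 2: poll_napi on another cpu can pull the last entry
	 * off the list between the emptiness check and list_entry */
	while (!list_empty(&queue->poll_list)) {
		local_irq_enable();

		dev = list_entry(queue->poll_list.next,
				 struct net_device, poll_list);

		/* window 1: poll_napi may already hold the poll_lock,
		 * run dev->poll() and netif_rx_complete() (which does
		 * the list_del) before we acquire the lock here */
		have = netpoll_poll_lock(dev);

		if (dev->quota <= 0 || dev->poll(dev, &budget)) {
			/* requeue or complete, as in the stock code */
		}
	}
}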

The patch proposed by Olaf Kirch is not perfect, but it is a reasonable
minimal one that addresses all of these races; it was posted to netdev, acked
by Dave Miller, and queued for 2.6.23.


Comment 4 Neil Horman 2007-08-30 18:26:49 UTC
"The vmcore dump indicates a potential race conditions for dev->poll at the time
of the crash"
You never provided me with the vmcore.  That would help alot.  Please do so in
the future.

"This patch proposed by Olaf Kirch is not a perfect"
No its not perfect.  Its a hack IMO.  I'll propose it since Miller took it
upstream, but that really doesn't make it any better.  Looking at it off the top
of my head it would have seemed easier to simply remove the device from the
poll_list after calling poll_controller under the protection of
local_irq_disable in netpoll_poll.  Either way though....

Comment 5 RHEL Program Management 2007-08-30 18:34:35 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 6 Neil Horman 2007-08-30 18:54:38 UTC
Created attachment 181601 [details]
patch to hide device from the poll list during netpoll operations to avoid net_rx_action races

Actually, scrap my last comment: 
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.24.git;a=commit;h=2e27afb300b56d83bb03fbfa68852b9c1e2920c6


It appears this is rather worse than a hack.  This is complete breakage of at
least one (possibly more) big drivers.  Point me to that vmcore so we can
figure this out properly.

Also, I'm attaching a patch for you to try.  It's just off the top of my head,
so it may not work, but it seems to me that if we remove the device from the
poll_list and re-init the list head, then we avoid the risk of racing with
net_rx_action, since we'll never poll from there.

Comment 7 Tina Yang 2007-08-30 19:08:06 UTC
1) What breakage?  Could you elaborate on it?
2) The vmcore is rather huge (several GBs), and support has probably already
deleted it.
3) The poll_list is a per-cpu list associated with the cpu to which the
interrupt is delivered, while the netpoll thread can run on a cpu other than
the one whose list the dev is on, so local_irq_disable is not going to help.



Comment 8 Neil Horman 2007-08-30 19:32:49 UTC
1) did you not read the link I attached in comment #6? I'll attach it again.
http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.24.git;a=commit;h=2e27afb300b56d83bb03fbfa68852b9c1e2920c6

2) Do you not have the vmcore someplace I can download it from?  If you don't
have it anymore, then there is nothing we can do, but it would be helpful if
you kept debug material around until the problem was solved.

3) Yes, you make a good point; you'll have to globally disable irqs briefly
with irq_save/restore_flags.  Can you make that change to the patch?
 


Comment 9 Tina Yang 2007-08-30 20:49:23 UTC
2) The vmcore is from a customer, roughly 28GB, and I couldn't find it any
more; it was probably removed for space.
3) Is there a way to synchronously and globally disable irqs?  Also, could
net_rx_action run from a non-interrupt context (e.g. ksoftirqd)?

Comment 10 Tina Yang 2007-08-30 22:20:24 UTC
2) ok, we found a backup copy of the vmcore, but since it's a customer's we are
not at liberty to provide it to others.  But I can certainly look up bug-related
information for you.

Comment 11 Neil Horman 2007-08-31 00:18:54 UTC
comment #9:
net_rx_action already runs as a softirq.  If you want to disable irqs
globally, use save_irq_flags or one of its brother functions as I suggested.

Comment #10:
I'm glad you found the vmcore, but why can't you provide it to us?  If they
are running RHEL, they should be a mutual customer.  Or are they not running
RHEL?



Comment 12 Tina Yang 2007-08-31 01:52:42 UTC
>If you want to disable irqs globally use save_irq_flags or one of its brother
functions as I suggested.
I couldn't find any save_irq_flags throughout the entire kernel; are you
talking about spin_lock_irqsave?  The irqsave of spin_lock_irqsave is for the
local processor only.  The global control effect of spin_lock_irqsave actually
comes from the spin_lock itself.  From what I know, there is no such thing as
"globally disabled irqs".

>net_rx_action runs as a softirq already.
net_rx_action can run either from an interrupt context or from a process
context.  "Globally disabled interrupts" can't prevent the latter.

>I'm glad you found the vmcore, but why can't you provide it to us?
It's company's policy, not up to me to decide.


Comment 13 Neil Horman 2007-08-31 12:09:56 UTC
1) "I couldn't find any save_irq_flags"
You're right, it was local_irq_save I was thinking of, which obviously is a
local only irq disable.  To provide list syncronization, we'll need a global
spinlock. I'm attaching a new patch to do that.

2) "It's company's policy, not up to me to decide."
You seem to have missed the point.  Is this customer of yours running RHEL?  If
they are, then they should be a mututal customer and you should have a process
by which you can provide that data to us.  Or are you saying that they aren't
running RHEL?

The attached patch is untested again.  Please build test it at your customer
site and report results.

Comment 14 Neil Horman 2007-08-31 12:14:25 UTC
Created attachment 183141 [details]
new patch to test

new patch

Comment 15 Tina Yang 2007-08-31 17:30:29 UTC
>Again, the attached patch is untested.  Please build and test it
Does this patch work?
1) First of all, an unnecessary global lock will kill SMP performance down to
that of one cpu.
2) Secondly, once you blindly remove the dev from the list as you do in
netpoll_poll_dev, neither net_rx_action nor poll_napi is going to process it,
period.
3) Thirdly, if the device driver keeps interrupts on and ksoftirqd is holding
the lock, a deadlock ensues.
4) Maybe more ... I just don't want to think about it.
Do you think maybe you should fully understand the scope of the problem before
making a proposal?

>Is this customer of yours running RHEL?  If they are, then they should be a
>mutual customer and you should have a process by which you can provide that
>data to us.
They are running RHEL; as a matter of fact it's they who requested that we
file this bug with you.  Maybe it would be easier for you to contact them
directly.  I just located a record (dated 8/10/07) mentioning that your end of
the contact for this customer is Payne, Robert Justin (RHCE).  The customer is
GRUPO INDUSTRIAL SALTILLO SAB DE CV, located in Mexico.


Comment 16 Neil Horman 2007-08-31 18:40:37 UTC
"Does this patch work ?"
I've already told you I haven't had time to build/test it yet, and that it was
off the top of my head.  I would think its pretty clear that I'm not 100% sure
that it works yet.

1) What in your mind makes the global lock unnecessary?  Yes, it would be nice
if we could avoid it, but we need to protect against a race across all cpus
with the netpoll code.  Yes, SMP performance will be impacted of course, but a
BUG() halt is a worse impact on performance.  We'll worry about the
performance once the race is closed.  Off the top of my head, we could replace
the spinlock with an rwlock_t: read_lock in the net_rx_action path to allow
for parallel execution, and write_lock in the netpoll path to protect all cpus
(see the sketch at the end of this comment).

2) Yes, that's rather the point: netpoll_poll_dev will process the removed dev
via poll_napi instead.

3) That's what the local_irq_disable is for in the net_rx_action path.  If we
update to use rwlocks, we should probably use the irqsave version in the
netpoll path.

4) Thank you for the customer contact, I'll work directly with our contact here.
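
For (1), an untested sketch of the rwlock idea (the lock name here is made up
for illustration; primitives are the RHEL4-era ones, from memory):

static rwlock_t netpoll_sync_lock = RW_LOCK_UNLOCKED;

/* net_rx_action path: softirqs on all cpus may run in parallel */
read_lock(&netpoll_sync_lock);
/* ... walk this cpu's poll_list as the stock code does ... */
read_unlock(&netpoll_sync_lock);

/* netpoll path: exclude every net_rx_action instance while we touch a
 * device that may sit on another cpu's poll_list */
write_lock_irqsave(&netpoll_sync_lock, flags);
/* ... hide the device from whatever poll_list it is on ... */
write_unlock_irqrestore(&netpoll_sync_lock, flags);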


Comment 17 Neil Horman 2007-08-31 20:41:39 UTC
Created attachment 183961 [details]
patch to serialize netpoll code

Jeffrey, I'm currently testing this patch; once I have verified that it's not
breaking anything, I'll build packages for your customer and they can test on
site.

Comment 18 Tina Yang 2007-08-31 21:17:09 UTC
The challenge is to do this without introducing additional locking.  In the
case of your proposal, since the poll list is per-cpu, a lock is needed only
for this cpu and the netpoll cpu; why lock down totally unconcerned cpus?  If
you have 16 cpus, as this customer has, are you going to hold all of them for
that long a duration?  One big lock can solve everything, but it's certainly
not the way to go.  If the performance is unacceptable, are you going to redo
the patch all over again?

Lastly, the customer has switched all their tg3 NICs to e1000 to avoid hitting
this problem.  The environment that most readily induces it is tg3 + the
latest ocfs2 release + netconsole.



Comment 19 Neil Horman 2007-08-31 23:06:00 UTC
No kidding, the challenge is to fix this without extra locking, but we need an
extra lock here at the moment to avoid the race condition.  The race for which
this issue is reported happens because 'unconcerned' cpus may place a device
on the poll list while the netpoll code is trying to remove it, or vice versa
(hence the change in the status of the __LINK_STATE_RX_SCHED flag in
net_rx_action that leads to the bug halt).  These lists are per-cpu, but the
list head that each device uses spans cpus, meaning we need a shared lock to
protect any list it might be on (the data structures are sketched below).  It
would be nice to avoid the use of an extra lock, but given that a BUG halt is
worse than impeded performance, IMO, we'll at least start with this.  With my
latest patch, at least multiple net_rx_action softirqs can run in parallel.
We only have to serialize when we issue netpoll-sent frames (which are
hopefully rare in the nominal case).
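
To illustrate the point about the list head spanning cpus, paraphrasing the
2.6.9 data structures from memory (not verbatim):

struct softnet_data {
	/* ... */
	struct list_head poll_list;	/* per-cpu: devices awaiting poll */
};

struct net_device {
	/* ... */
	struct list_head poll_list;	/* the single node that links this
					 * device into *some* cpu's
					 * softnet_data.poll_list; netpoll
					 * running on cpu A can therefore
					 * list_del() an entry living on
					 * cpu B's list */
};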

I've got this patch built here.  I'll test it out a bit locally on Monday and
then post kernels for you, Jeffrey, for the customer to test with as much as
they are able.

Comment 20 Tina Yang 2007-09-01 01:26:26 UTC
>No kidding, the challenge is to fix this without extra locking, but we need an
>extra lock here at the moment to avoid the race condition.
Your approach of trying to remove the device from the poll list inside netpoll
is what created that demand.  There could be other ways that do without it.
Theoretically speaking, the per-cpu poll list can only be touched by its owner
cpu in the original design.  Your method has already violated that assumption.

>These lists are per-cpu, but the list head that each device uses spans cpus,
>meaning we need a shared lock to protect any list it might be on.
Not necessarily.  The goal is to ensure the integrity of the individual lists,
which have no relation to one another whatsoever.  A shared lock is a bad
idea.

>but given that a BUG halt is worse than impeded performance IMO, we'll at least
>start with this.
We want both.

>With my latest patch, at least multiple net_rx_action softirqs can run in
>parallel.
The locking penalty comes not only from contention but also from the locking
operation itself, especially when net_rx_action is one of the heaviest paths
at peak time.

Comment 21 Neil Horman 2007-09-01 01:51:39 UTC
"Your approach in trying to remove the device from netpoll has made the demand"
I don't know what you're trying to say here

"Theoretically speaking, the
per cpu poll list can only be touched by its owner cpu in the original design. 
Your method has already violated that assumption."
Again, no kidding.  The assumption was violated prior to my patch, however,
and that is what is causing the problem.  Netpoll is able to work on any of
the cpus' lists.  So we can either protect the list, or remove netpoll (which
we can't do).

"Not necessarily.  The goal is to ensure the integrity of the individual list"
yes which we need to do with a shared lock.  We need this because netpoll
manipulates the shared device data (the _RX_SCHED flag) while its may be on
another cpu's poll list.  I can't think of a better way  to approach handling
this.  If you can, please present it.

"We want both."
So do I, but I'm trying to at least stop the BUG halt first.  We'll worry
about performance after we have the system stable.

"Locking penalty is not only from contention but also from the locking operation
itself especially when net_rx_action is on one of the heaviest path in peak time."
Again, I know this.  If you have a better idea, please present it.




Comment 22 Tina Yang 2007-09-03 20:46:27 UTC
"Your approach in trying to remove the device from netpoll has made the demand"
>I don't know what you're trying to say here
To avoid locking, I would probably try to see if I can get away with
additional synchronization flags or parameters in "struct net_device" under
the existing dev poll_lock umbrella.

"Theoretically speaking, the
per cpu poll list can only be touched by its owner cpu in the original design. 
Your method has already violated that assumption."
>Again, no kidding.  The assumption was violated however, prior to my patch, and
>that is causing the problem.  Netpoll is able to work on any of the cpu lists. 
>So we can either protect the list, or remove netpoll (which we can't do)
That's why I think the netpoll implementation is a hack, hence the many
problems it has resulted in.  In my opinion, netpoll is an emergency path that
should only be used when the normal interrupt-driven path (net_rx_action) has
failed and is shut down, such as when the system crashes: in order to reliably
do a netdump, we do everything synchronously from a single cpu - send data,
poll the device for i/o.  So I think a better way is to return to the original
design and have one clear network path at a time (normal vs crash).  However,
the scale of the changes might be significant enough that the original
designer should be consulted and involved.

"Not necessarily.  The goal is to ensure the integrity of the individual list"
>yes which we need to do with a shared lock.  We need this because netpoll
>manipulates the shared device data (the _RX_SCHED flag) while its may be on
>another cpu's poll list.  I can't think of a better way  to approach handling
>this.  If you can, please present it.
Finer-grained locking is what I am trying to say - "one lock per list" - so
that a cpu that is neither the list owner nor the contending netpoll's at that
instant will not be held up unnecessarily.

"We want both."
>So do I, but I'm trying to at least avoid an panic before we bug halt.  We'll
>worry about performance after we have the system stable.
The performance revision might end up being a complete redesign of the patch.
Why waste the energy?

"Locking penalty is not only from contention but also from the locking operation
itself especially when net_rx_action is on one of the heaviest path in peak time."
>Again, I know this.  If you have a better idea, please present it.
Same as the first comment above.

Comment 23 Tina Yang 2007-09-04 03:20:06 UTC
Created attachment 185531 [details]
patch example w/o additional locking

Comment 24 Tina Yang 2007-09-04 03:20:41 UTC
Created attachment 185541 [details]
patch example w/o additional locking

Comment 25 Tina Yang 2007-09-04 03:45:43 UTC
The attachments in #23 and #24 are identical.
Since dev removal occurs in dev->poll, which is protected by dev->poll_lock,
we can safely decide whether to skip the poll based on the current status of
dev->poll_list once we get hold of the lock.  The overhead is near 0.
However, the right way is to fix the basic design of netpoll, as I mentioned
before.

Comment 26 Tina Yang 2007-09-04 05:07:04 UTC
Comment on attachment 185541 [details]
patch example w/o additional locking

>--- linux-2.6.9/net/core/dev.c.orig     2007-06-28 00:53:51.178885000 -0700
>+++ linux-2.6.9/net/core/dev.c  2007-09-03 20:01:34.767326000 -0700
>@@ -1760,6 +1760,8 @@ static void net_rx_action(struct softirq
>        unsigned long start_time = jiffies;
>        int budget = netdev_max_backlog;
>        void *have;
>+       struct net_device *null_dev = list_entry(&queue->poll_list,
>+                                     struct net_device, poll_list);
>
>        local_irq_disable();
>
>@@ -1774,8 +1776,22 @@ static void net_rx_action(struct softirq
>                dev = list_entry(queue->poll_list.next,
>                                 struct net_device, poll_list);
>
>+               /* The last dev could have been removed by netpoll
>+                */
>+               if (dev == null_dev)
>+                       break;
>+
>                have = netpoll_poll_lock(dev);
>
>+               /* netpoll could have just released the lock,
>+                * check if dev has already been polled
>+                */
>+               if (dev->poll_list.next != LIST_POISON1) {
>+                       netpoll_poll_unlock(have);
>+                       local_irq_disable();
>+                       continue;
>+               }
>+
>                if (dev->quota <= 0 || dev->poll(dev, &budget)) {
>                        netpoll_poll_unlock(have);
>                        local_irq_disable();
>
>
>--- linux-2.6.9/net/core/netpoll.c.orig 2007-06-28 00:53:27.847957000 -0700
>+++ linux-2.6.9/net/core/netpoll.c      2007-09-03 20:04:08.655594000 -0700
>@@ -96,13 +96,17 @@ static void poll_napi(struct net_device
>        if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) &&
>            npinfo->poll_owner != smp_processor_id() &&
>            spin_trylock(&npinfo->poll_lock)) {
>-               npinfo->rx_flags |= NETPOLL_RX_DROP;
>-               atomic_inc(&trapped);
>+               /* Check if dev has already been processed
>+                */
>+               if (dev->poll_list.next != LIST_POISON1) {
>+                       npinfo->rx_flags |= NETPOLL_RX_DROP;
>+                       atomic_inc(&trapped);
>
>-               dev->poll(dev, &budget);
>+                       dev->poll(dev, &budget);
>
>-               atomic_dec(&trapped);
>-               npinfo->rx_flags &= ~NETPOLL_RX_DROP;
>+                       atomic_dec(&trapped);
>+                       npinfo->rx_flags &= ~NETPOLL_RX_DROP;
>+               }
>                spin_unlock(&npinfo->poll_lock);
>        }
> }

Comment 27 Neil Horman 2007-09-04 11:55:14 UTC
Well, I can't disagree with you; netpoll is a hack, always has been.  But for
the sake of functionality (netdump/netconsole) and the preservation of ABI, we
are stuck with it.  That being said, have you tested this?  Looking at it, I
agree with you; it's definitely got less overhead than my approach.  However,
your approach relies on the net device being deleted from any per-cpu poll
list that it may be on, and your patch doesn't add any code to poll_napi to do
that.  I think you need a list_del(&dev->poll_list); right before the call to
dev->poll to make this work, otherwise it's effectively a no-op.

Comment 28 Tina Yang 2007-09-04 17:38:24 UTC
The removal of the dev (list_del) is always done in netif_rx_complete,
embedded in dev->poll (tg3_poll in this case).  The list_del currently in
net_rx_action is for a device that is eligible for another run, so it's
removed from the head and queued to the tail.  Does this make sense to you?
And no, I haven't tested it yet.

Comment 29 Tina Yang 2007-09-04 17:57:45 UTC
Changing
if (dev->poll_list.next != LIST_POISON1)
to
if (test_bit(__LINK_STATE_RX_SCHED, &dev->state))
is better.


--- linux-2.6.9/net/core/dev.c.orig     2007-06-28 00:53:51.178885000 -0700
+++ linux-2.6.9/net/core/dev.c  2007-09-04 10:48:42.409418000 -0700
@@ -1760,6 +1760,8 @@ static void net_rx_action(struct softirq
        unsigned long start_time = jiffies;
        int budget = netdev_max_backlog;
        void *have;
+       struct net_device *null_dev = list_entry(&queue->poll_list, 
+                                      struct net_device, poll_list);
 
        local_irq_disable();
 
@@ -1774,8 +1776,22 @@ static void net_rx_action(struct softirq
                dev = list_entry(queue->poll_list.next,
                                 struct net_device, poll_list);
 
+               /* The last dev could have been removed by netpoll
+                */
+               if (dev == null_dev)
+                       break;
+
                have = netpoll_poll_lock(dev);
 
+               /* netpoll could have just released the lock,
+                * check if dev has already been polled 
+                */
+               if (!test_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
+                       netpoll_poll_unlock(have);
+                       local_irq_disable();
+                       continue;
+               }
+ 
                if (dev->quota <= 0 || dev->poll(dev, &budget)) {
                        netpoll_poll_unlock(have);
                        local_irq_disable();

--- linux-2.6.9/net/core/netpoll.c.orig 2007-06-28 00:53:27.847957000 -0700
+++ linux-2.6.9/net/core/netpoll.c      2007-09-04 10:43:42.667574000 -0700
@@ -96,13 +96,17 @@ static void poll_napi(struct net_device 
        if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) &&
            npinfo->poll_owner != smp_processor_id() &&
            spin_trylock(&npinfo->poll_lock)) {
-               npinfo->rx_flags |= NETPOLL_RX_DROP;
-               atomic_inc(&trapped);
+               /* Check if dev has already been processed 
+                */
+               if (test_bit(__LINK_STATE_RX_SCHED, &dev->state)) {
+                       npinfo->rx_flags |= NETPOLL_RX_DROP;
+                       atomic_inc(&trapped);
 
-               dev->poll(dev, &budget);
+                       dev->poll(dev, &budget);
 
-               atomic_dec(&trapped);
-               npinfo->rx_flags &= ~NETPOLL_RX_DROP;
+                       atomic_dec(&trapped);
+                       npinfo->rx_flags &= ~NETPOLL_RX_DROP;
+               }
                spin_unlock(&npinfo->poll_lock);
        }
 }


Comment 30 Neil Horman 2007-09-04 18:08:14 UTC
Yeah, I see what you're saying.  I was looking at the list_del in
net_rx_action and assumed it was the only place we removed net_devices from
the list, and didn't bother to look further.

I'm still a little leery of this patch, however.  Consider the following
scenario:

1) net_devices A and B are added to the poll_list for cpu 0
2) cpu 1 attempts to perform a netpoll operation on net_device B
3) net_rx_action runs on cpu 0 and begins handling the poll operation for
net_device A

Since the poll_locks for A and B are separate, cpu 0 and cpu 1 can perform
their operations unimpeded in parallel (which is, as we agree, nominally a
good thing).  If, however, both cpu 0 and cpu 1 get through their respective
devices' poll routines and call netif_rx_complete at the same time, we have a
problem.  Even though net_device B is on the poll_list for cpu 0, the
netif_rx_complete operation for device B is being run on cpu 1.  If the
list_del operations occur close together on cpu 1 and cpu 0, we can wind up
with list corruption, and an oops will result.

So I think your patch may well solve the problem reported, but it will create
another one in the process, one not easily reproduced.  Thoughts?

Comment 31 Tina Yang 2007-09-04 21:36:49 UTC
Created attachment 186681 [details]
patch forbids cross-cpu dev processing

Glad you caught it - we still can't escape the netpoll design flaw.
Here is a new approach (patch attached).  The basic idea is to forbid
cross-cpu processing.  Let me know what you think.

Comment 32 Neil Horman 2007-09-05 14:17:09 UTC
Created attachment 187461 [details]
alternate patch possibility

I like it.  It seems like it should work to me, and with a minimum of
performance impact.  It needs some very minor manipulation to avoid ABI
breakage on RHEL4, but it looks upstream-ready to me.  While you were writing
this, I also wrote the above patch, which is a variation on your last patch
but avoids the list corruption we discussed, and hopefully the oops that the
originally reverted upstream patch caused.  Let me know which implementation
you like better, and I'll roll a test kernel for our customer.  Thanks!

Comment 33 Tina Yang 2007-09-05 17:34:08 UTC
About your patch -
If net_rx_action happens to run first, before RX_SCHED can be cleared in
netif_rx_complete, poll_napi might run dev->poll a second time as soon as the
poll_lock is freed by net_rx_action - same as in the original patch (the one
proposed by Olaf), which, although it avoids the netdevice.h:888 panic, might
have subtle hw/sw impact and reportedly did not work, according to your
earlier note as well as a testing report we received in-house two days ago.

Without a doubt, I like the "non-cpu-crossing" approach much better, simply
because it restores the original design order and is clean, less convoluted,
and efficient.  We shouldn't let a secondary piece of functionality such as
netpoll overburden the normal, primary network stack unnecessarily, given
that's where the most performance is demanded.  That said, however, I would
like to have it tested first, to make sure it's not breaking anything, before
further discussion.  I'll see if I can do it today.  What do you think?

Comment 34 Neil Horman 2007-09-05 18:14:10 UTC
Created attachment 187831 [details]
revised version of my alternate patch

good catch, I hadn't considered the alternate case where net_rx_action runs
first.  I think that can be fixed by only checking the __LINK_STATE_RX_SCHED
flag after we obtain the poll lock in poll_napi, and skipping the device if
it's clear.

As for the e1000 problem that caused the original version's reverting, I've
not received much (any) detail regarding the problem.  I was hoping to avoid
that problem in this version by checking the new flag in net_rx_action rather
than just inside netif_rx_complete.  I'm honestly not sure, though.  At any
rate, I have e1000 adapters here, so I can do some minimal testing before we
send it out.

That all being said, I think both patches have a good deal of merit.  Unless
you have a better idea, I'll roll two kernels (one with your patch, one with
mine) and put them up on my people page, and we can start testing from there.

Comment 35 Tina Yang 2007-09-05 19:13:45 UTC
Let me do some basic testing on my patch first before you put it up.  Thanks.

Comment 36 Neil Horman 2007-09-05 20:37:00 UTC
copy that, I'll focus on testing my patch until I hear from you.

Comment 37 Neil Horman 2007-09-06 17:26:37 UTC
Ok, I'm greenlighting my patch for external testing.  I've built it internally
here, and am able to run it on an SMP system with onboard e1000 NICs using
both netconsole and netdump during a flood ping.

Mr Payne, Mr West, if you would please indicate to the customer that a test
kernel rpm for this bz can be found here:
http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-57.EL.bz251978.rht.i686.rpm

Tina, if you would like to test this out as well and offer feedback, it would be
appreciated.  Also, if you could let me know what your testing status is, so we
have a better idea of what our current upstream options are, I would appreciate
it.  Thanks!

Comment 38 Tina Yang 2007-09-06 21:38:11 UTC
Created attachment 189261 [details]
patch prohibits cross-cpu dev processing

Above is my revised patch, with limited testing done, including netconsole and
netdump under normal network traffic.  Let me know if you have questions about
the patch.  In the meantime, I'll review your latest patch.  Thanks.

Comment 39 Tina Yang 2007-09-06 22:46:22 UTC
About your patch, here is a scenario - see if you can dispute it:
. netpoll_poll calls dev->poll_controller to schedule the dev and turn off the
hw interrupt
. poll_napi sets NETPOLL_SERVICED and turns the hw interrupt back on, but
without clearing RX_SCHED
. much time elapses ...
. packets come in, the hw interrupts, sees RX_SCHED, disables interrupts
. net_rx_action is invoked and sees NETPOLL_SERVICED, which is stale from the
previous event.  It removes the dev from the list and exits without polling.
. since nothing is done with the current hw event and the interrupt is
disabled, the device is stalled until maybe the next netpoll_poll.

Also, I did a little investigation of the netpoll path on a live system.  It
seems the calling frequency is rather low (it normally won't go through it
except in an extremely stressful situation such as running out of skbs),
which is why this part of the logic, although broken, hardly ever gets caught.
As I said before, the customer didn't run into it until they upgraded OCFS2,
with tg3 specifically.  A thought: to have a more productive test, maybe the
code needs to be tweaked a bit to go through this path intentionally.
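
One cheap (hypothetical, untested) way to do that tweaking, assuming the 2.6.9
netpoll_send_skb() retry loop looks roughly as I remember it:

/* in net/core/netpoll.c, netpoll_send_skb(): the stock code only falls
 * back to polling when the tx attempt fails; forcing the call on every
 * send would exercise the netpoll path constantly */
status = dev->hard_start_xmit(skb, dev);
netpoll_poll(np);	/* tweak: call unconditionally instead of only
			 * when status indicates the tx queue was busy */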


Comment 40 Neil Horman 2007-09-07 13:17:45 UTC
Yes, I understand your concern, and I believe the patch handles it.  If I may
paraphrase, you're worried about the situation in which we conduct a netpoll
operation and no subsequent frames come in for a 'long time' after that (a
'long time' being a time gap of arbitrary length after the netpoll operation
completes).  The concern is that on reception of the first frame after the
netpoll operation we still have __LINK_STATE_NETPOLL_SERVICED set, which we
clear, and as an artifact we skip the poll operation on that device, and so we
may not accept that frame into the network stack without a potentially large
latency.

The answer is that we have two ways out of that situation.  One is inside the
clause that I added to net_rx_action: after we clear the
__LINK_STATE_NETPOLL_SERVICED bit and the RX_SCHED bit, we look at the
device's quota.  If we still have a quota > 0 on the device, we immediately
reschedule it for servicing on the same cpu.  Alternatively, if subsequent
packets arrive in close succession, the driver interrupt routine will re-set
the __LINK_STATE_RX_SCHED bit and re-enqueue the device for servicing via
net_rx_action.  Either way, there is a 'hiccup' in napi processing for the
device, in that we skip the net device once in our receive processing, but we
immediately recover and catch up.

As for the netpoll path, you are correct: its use is (ideally) fairly
infrequent and reserved for exceptional cases.  As you discovered, it is used
in extreme low-memory situations.  It is also used in situations that require
synchronous network processing outside the scope of the normal network
infrastructure (netconsole, netdump, kgdb, etc. all use netpoll to
communicate).  And yes, netpoll is a horribly broken design, but for all its
irritations it is unfortunately still needed in several cases.

If you want to go through this path intentionally, we don't really need to
tweak much.  My test should exercise it fairly well.  What I did was the
following:

1) configure netconsole and netdump on the system
2) start the netdump/netconsole services and make sure they are operating on
the remote receiving machine
3) flood ping out the interface to a remote host
4) run a script that generates _lots_ of console messages
5) crash the machine with `echo c > /proc/sysrq-trigger`

Step 4 should cause the netpoll path to be run quite frequently, with
netconsole spewing messages to the remote netlog server.  In conjunction with
the normal network processing going on with the flood ping in step 3, this
should give us a pretty good idea of how well this patch handles integrating
the netpoll and non-netpoll paths.

Step 5 should test the patch's ability to switch over to exclusive netpoll
use, since netdump only sends frames via the netpoll path.

Let me know what you think.

Also, Mr. Payne, if you could provide the test kernel I mentioned previously to
the customer for testing, I would appreciate it.  Thanks!

Comment 41 Tina Yang 2007-09-07 18:06:27 UTC
>If we still have a quota > 0 on the device we immediately reschedule it
>for servicing on the same cpu again.
Ok, but quota > 0 doesn't mean there is work to be done.  Rescheduling the dev
might be a waste of cycles in case the same event has already been processed
by netpoll.  That's what Olaf's patch does too, and it seems to have some
subtle impact on the driver/hw which I haven't really looked into yet.

Also, in your net_rx_action
+			if (dev->quota > 0)
+				netif_rx_schedule(dev);
+			continue;
1) What if quota < 0?  It'll lose its turn forever.
2) It will get into a deadlock if it doesn't release the poll_lock before the
"continue".

>1) configure netconsole and netdump on the system
>2) start the netdump/netconsole services and make sure they are operating on
>the remote receiving machine
>3) flood ping out the interface to a remote host
>4) run a script that generates _lots_ of console messages
>5) crash the machine with `echo c > /proc/sysrq-trigger`
These were pretty much what I did for my testing too, but with tg3 instead of
e1000.  I'll see if I can tally the frequency.

Comment 42 Neil Horman 2007-09-07 18:31:26 UTC
"Ok, but quota > 0 doesn't mean there is work to be done"
But it does, or at least its the best heuristic available to us.  Note that
right below the section I added, we preform the exact same check:
if (dev->quota <= 0 || dev->poll(dev, &budget)) {
...
} else {
 netpoll_poll_unlock(have);
 dev_put(dev);
 local_irq_disable();
}
if quota is greater than zero here, the device falls into the else clause and
we simply unlock it and move on to the next item in the list, without ever
calling dev->poll.  This means the device is still on the cpu's poll list, and
the RX_SCHED bit is still set, so the next scheduling of net_rx_action will
pick it up.  My added code emulates that behavior, and should be safe.

 "1) What if quota < 0 ?  It'll lost its turn forever."
No. if quota is less than zero, then it needs to wait until such time as the
hardware interrupts again and re-schedules the interface to run, at which point
(unless netpoll has locked the device again) net_rx_action will run its nominal
path and start receiving frames on the device normally again.

"2) It will get into deadlock if it doesn't release the poll_lock before "continue""
Dang, your right.  I'll fix that up and attach a new version of the patch.  Thanks!

Comment 43 Neil Horman 2007-09-07 18:48:52 UTC
Created attachment 190311 [details]
new version of my patch

This is a new version of my patch.  I'm rebuilding my test kernel now.  Let me
know how your testing goes.

Comment 44 Tina Yang 2007-09-07 19:01:39 UTC
>if quota is greater than zero here, the device falls into the else clause
>and we simply unlock it and move on to the next item in the list, without
>ever calling dev->poll.
No.  If quota > 0, you will definitely do dev->poll, because of the "or"
condition in the "if" clause.

>"1) What if quota < 0 ?  It'll lost its turn forever."
>No. if quota is less than zero, then it needs to wait until such time as the
>hardware interrupts again and re-schedules the interface to run, at which point
>(unless netpoll has locked the device again) net_rx_action will run its nominal
>path and start receiving frames on the device normally again.
But the original code logic is it will be queued to the tail of the list with
added weight this time (below), while in your case, it simply "continue" w/o
rescheduling,
 
                if (dev->quota <= 0 || dev->poll(dev, &budget)) {
                        netpoll_poll_unlock(have);
                        local_irq_disable();
                        list_del(&dev->poll_list);
                        list_add_tail(&dev->poll_list, &queue->poll_list);
                        if (dev->quota < 0)
                                dev->quota += dev->weight;
                        else
                                dev->quota = dev->weight;
                } else {

BTW, it's the wrong patch (nfs) you put up in Comment #43

Comment 45 Neil Horman 2007-09-07 19:33:53 UTC
Created attachment 190351 [details]
correct patch

correct version of the patch.  Sorry, a few too many plates spinning today.
As for the scheduling, it's not a concern: note the logic in the while loop.
The loop exits only when the list is empty (!list_empty).  Since we're on the
list at the head, it's not empty.  And since the device we just checked is at
the head of the list, it gets selected again in list_entry.  Since we just
cleared the _NETPOLL_SERVICED bit, we skip that if statement and proceed as
normal through the net_rx_action code.  So we're covered.

Comment 46 Tina Yang 2007-09-07 20:11:27 UTC
>Since we're on the list at the head, it's not empty.  And since the device
>we just checked is at the head of the list, it gets selected again in
>list_entry.
It's not at the head of the list; it's removed from the list as the first
thing in the if (test_and_set_bit()) clause.

I'm not sure you answered my previous questions:
1) if dev->quota <= 0, the dev is removed without being rescheduled before the
"continue"?
2) if dev->quota > 0, dev->poll possibly gets executed a second time in the
rescheduled run.
 

+			list_del(&dev->poll_list);
+			clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
+			/*
+			 * Check to see if any pending frames have come in
+			 * while we were mucking about.  If they have lets reschedule
+			 * this device
+			 */
+			if (dev->quota > 0) {
+				netif_rx_schedule(dev);
+				netpoll_poll_unlock(have);
+			}
+			continue;


Comment 47 Neil Horman 2007-09-07 23:46:52 UTC
Shoot, you're right, I do remove it, and I mean to do that too.  Hmm.  Well, I
think for the purposes of testing we can ignore this issue, since it really is
a corner case.  To observe the bug, you would have to get the kernel to issue
a netpoll, after which you would have to stop the incoming flow of frames at
exactly the right time, then you would need to send in a single frame to be
received on the regular input path, after which you would need to stop the
incoming flow of frames again, and the result would be a handful of delayed
frames.  A bug that needs to be fixed, for certain, but nothing that will have
a severe impact on upper-level protocol behavior, and not one we are likely to
see.

As for how to fix it, I would say we could probably add a
clear_bit(__LINK_STATE_NETPOLL_SERVICED) call to __netif_rx_schedule.  What
are your thoughts?

As for your questions:

1) if dev->quota <= 0, then we have the bug we are talking about, which I
agree needs fixing, but it is not serious, at least for the purposes of
testing the originally reported BUG_ON halt, and it will certainly be fixable
with a minor adjustment to this patch, as I propose above.

2) if dev->quota > 0, then all is well, as we'll just reschedule the poll and
work in the normal receive path.

I've got the new kernel with my last patch built; let's go ahead with testing
here and at the customer site.  Justin, let us know what the customer gets for
results with the kernel on my people page; I've just uploaded the new version
there.

Comment 48 Tina Yang 2007-09-10 20:12:07 UTC
The testing status of my patch remains the same as in Comment #38, if you need
an alternative approach.  Otherwise, please keep me posted on the progress of
your patch and its upstream filing status, so that I can put a closure to this
bug on my end.  Thanks.

Comment 49 Neil Horman 2007-09-12 18:56:10 UTC
update regarding the race condition from comment #39: After looking at it a
bit longer, I'm back to thinking it's not an issue, because the 'stale'
NETPOLL_SERVICED bit will never occur, due to the way we set the __RX_SCHED
bit.  From looking at it, when we call ->poll_controller in netpoll_poll_dev,
if there are frames to be processed, the poll_controller path in the driver
will call netif_rx_schedule, which does two things (sketched below):
1) ensures that the __LINK_STATE_RX_SCHED bit is set in the device's state
flags
2) if it sets the __LINK_STATE_RX_SCHED bit, it enqueues the device to the
local cpu's poll_list
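
Paraphrasing the 2.6.9-era helpers from memory (not verbatim; quota handling
omitted), the logic is roughly:

static inline int netif_rx_schedule_prep(struct net_device *dev)
{
	return netif_running(dev) &&
	       !test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state);
}

static inline void __netif_rx_schedule(struct net_device *dev)
{
	unsigned long flags;

	local_irq_save(flags);
	dev_hold(dev);
	/* enqueue on *this* cpu's poll list and kick the softirq */
	list_add_tail(&dev->poll_list,
		      &__get_cpu_var(softnet_data).poll_list);
	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
	local_irq_restore(flags);
}

static inline void netif_rx_schedule(struct net_device *dev)
{
	/* only the cpu that wins the test_and_set_bit enqueues */
	if (netif_rx_schedule_prep(dev))
		__netif_rx_schedule(dev);
}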

In the event that __LINK_STATE_RX_SCHED was already set by another cpu, the
device will already be enqueued on that remote cpu's poll_list.  Either way,
when we exit poll_napi, we are guaranteed that the device we just polled is
waiting to be serviced by the net_rx_action softirq, either on this or another
cpu, and as soon as it runs, that bit will be cleared and the device can
return to normal processing.  So it's never possible for us to be in the
condition where we netpoll a device and its NETPOLL_SERVICED bit remains set
while no pending net_rx_action is queued on it.  The patch is good as it is.

Comment 50 Tina Yang 2007-09-12 19:44:46 UTC
>2) if it sets the __LINK_STATE_RX_SCHED bit, it enqueues the device to the
>local cpu's poll_list.
Since it's poll_controller that enqueues it to the poll_list, not an actual hw
interrupt, the softirq/net_rx_action will never be invoked.

>when we exit
>poll_napi, we are guaranteed that the device we just polled is waiting to be
>serviced by the net_rx_action softirq, either on this or another cpu, and as
>soon as it runs, that bit will be cleared, and the device can return to normal
>processing.  So its never possible for us to be in the condition where we
>netpoll a device and its NETPOLL_SERVICED bit remains set while no pending
>net_rx_action is also queued on it.
Not necessarily.  Same as above -
the softirq is raised only when there is a hw interrupt.  If it's the
poll_controller that turns off the interrupt, sets RX_SCHED and does
netif_rx_schedule, net_rx_action will never get called.

Comment 51 Neil Horman 2007-09-12 20:08:57 UTC
Yes, it works.  The net_rx_action softirq is raised from inside
netif_rx_schedule, which is called from the interrupt handler for a given
driver, which in turn is called when a hardware interrupt is encountered _or_
when a properly written poll_controller method is called.  Either way, the
device in question is added to the poll list of one of the cpus in the system
if there are frames to be read.  At that point either net_rx_action runs first
and services the device, in which case poll_napi skips it, or poll_napi gets
it first and net_rx_action skips it, clearing the NETPOLL_SERVICED bit.
Either way, both poll_napi and net_rx_action will interrogate the device
struct every time netpoll_poll_dev is called and frames are received.

Comment 52 Tina Yang 2007-09-12 21:11:33 UTC
>Yes, it works.  The net_rx_action softirq is raised from inside
>netif_rx_schedule, which is called from the interrupt handler for a given
>driver, which in turn is called when a hardware interrupt is encountered.
Which hw interrupt?  The device's interrupt is turned off, unless you count on
other types of interrupts.

Comment 53 Tina Yang 2007-09-12 21:15:27 UTC
Never mind.  The previous argument is invalid.

Comment 54 Tina Yang 2007-09-12 22:03:32 UTC
One more thing:
The NETPOLL_SERVICED bit is already protected by the poll_lock everywhere you
access it, so why use heavyweight bitops (test_and_clear_bit, set_bit, etc.),
especially in net_rx_action?  That overhead is equivalent to a lock.

Comment 55 Tina Yang 2007-09-13 00:30:25 UTC
Continuing from Comment #54,
the following code segment looks suspicious.
First of all, shouldn't netpoll_poll_unlock be outside of "if (dev->quota)"?
Secondly, why reschedule the dev based on dev->quota > 0?  If poll_napi has
just done a netif_rx_complete on this event, won't you end up running
dev->poll a second time in net_rx_action?

+			/*
+			 * Check to see if any pending frames have come in
+			 * while we were mucking about.  If they have lets reschedule
+			 * this device
+			 */
+			if (dev->quota > 0) {
+				netif_rx_schedule(dev);
+				netpoll_poll_unlock(have);
+			}
+			continue;


Comment 56 Tina Yang 2007-09-13 06:54:14 UTC
Continuing from previous comments:
poll_napi runs with interrupts enabled.  Will it deadlock after the trylock if
an interrupt/softirq occurs before the lock is released?

-	if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) &&
-	    npinfo->poll_owner != smp_processor_id() &&
+	if (npinfo->poll_owner != smp_processor_id() &&
 	    spin_trylock(&npinfo->poll_lock)) {
+		/* net_rx_action already got to us */
+		if (!test_bit(__LINK_STATE_RX_SCHED, &dev->state))
+			goto skip;
+		/*
+		 * This gets cleared in net_rx_action
+		 */
+		set_bit(__LINK_STATE_NETPOLL_SERVICED, &dev->state);
 		npinfo->rx_flags |= NETPOLL_RX_DROP;
 		atomic_inc(&trapped);
 

Comment 57 Neil Horman 2007-09-13 11:44:22 UTC
Comment 53 - It's possible for the hw to interrupt on another cpu.
Alternatively the bit can be set from the poll_controller method (which should
be calling something equivalent to the hardware driver's interrupt routine,
which in turn will set the RX_SCHED bit).

Comment 54 - Well, I'm using bitops for two reasons: primarily because it's
the data structure we have to work with, and secondly because interrupts are
only disabled on the local cpu when we work with that flag.  There is no
reason the hardware can't issue an irq to be serviced by another cpu, which
means we can have two cpus reading/writing that flag at the same time.

Comment 55 - Yup, you're right, we leave the lock dangling there.  The unlock
should be moved outside of the conditional, right before the continue
statement.  I'll attach a patch to correct that.

Comment 56 - Hmm, you would appear to be correct, although that deadlock
condition would appear to be present outside the scope of this patch.  We
should probably tackle it separately, since it seems like a condition that
could happen rather easily, yet it doesn't.  We should look at this more
closely.

Comment 58 Neil Horman 2007-09-13 11:45:44 UTC
Created attachment 194551 [details]
updated patch to fix possible deadlock

Comment 59 Tina Yang 2007-09-20 22:57:48 UTC
I asked one of our test engineers to try your patch about 10 days ago; the
system hung on both 2.6.18 and 2.6.23, and there was a "transmit timeout ...
tg3 resetting" message shown on 2.6.18's console.  Then I had him try my
patch; the system didn't hang this time, but the "transmit timeout" was still
there.  It appears that dev->poll_controller(), while called synchronously by
the netpoll thread, is competing with the ISR thread from the interrupt on
another cpu.  We tested it again with that line removed, and all is well now.
I redid the patch (below) to skip the ->poll_controller() + poll_napi()
altogether, and initial testing shows no performance impact and no hangs or
timeouts.  The reason for giving up on patching the existing netpoll path is
clear - it costs too much, implementation- and performance-wise, while the
benefits are little to none.  Please let me know what you think.

+++ linux-2.6.9/net/core/netpoll.c      2007-09-15 11:07:25.000000000 -0700
@@ -127,14 +127,21 @@
 void netpoll_poll_dev(struct net_device *dev)
 {
        struct netpoll_info *npi = dev_wrapper(dev)->npinfo;
+       int polling_mode = netdump_mode;
 
        if(!netif_running(dev) || !dev->poll_controller)
                return;
 
        /* Process pending work on NIC */
-       dev->poll_controller(dev);
-       if (dev->poll)
-               poll_napi(dev);
+	/* Only do the polling when the machine is
+	 * single-threaded, i.e. only one cpu active
+	 * and with interrupt disabled such as in 
+	 * crash mode, because the code below is 
+	 * neither smp nor irq safe.
+	 */ 
+       if (polling_mode) {
+               dev->poll_controller(dev);
+               if (dev->poll)
+                       poll_napi(dev);
+       }
 
        service_arp_queue(npi);


Comment 61 Vince Worthington 2007-10-05 20:08:27 UTC
Flipping back to NEEDINFO; this happened because of the IT update.

--vince

Comment 62 Neil Horman 2008-01-18 21:49:24 UTC
Tina, the patch looks reasonable for the netdump case, but it ignores other
users of the netpoll infrastructure; netconsole and kgdb spring immediately to
mind.  I expect this patch breaks both of those, since it prevents either
service from receiving frames.  Also, there are times when netdump leaves
netdump mode, such as when the netdump server sends a COMM_SHOWSTATE message.
During those times we won't be able to receive other netdump messages.

Can you provide some detail as to how my previous patch hung?  It worked fine
for me here (although I may well not have stressed it as much as you did).
Sysrq-t output would be great.



Comment 63 Tina Yang 2008-01-18 23:18:08 UTC
>the patch looks reasonable for the netdump case, but it ignores other
>users of the netpoll infrastructure.  netconsole and kgdb spring immediately to
>mind.  I expect that  this patch breaks both of those, since it prevents either
>service from receiving frames.

     No, it doesn't prevent receiving frames; all it does is let the
     receiving take its course through the regular fastpath process instead
     of this intrusive netpoll approach.  We did test this with both netdump
     and netconsole, and it's fine.  I tried submitting a refined version of
     this patch to netdev, but it was rejected for no convincing reason.

>Can you provide some detail as to how my previous patch hung?  It worked fine
>for me here (although I may well have not stressed it as much as you).  Sysrq-t
>output would be great.

     It's a hard hang; nothing can be gotten out of the console.  It seems
     easier to trigger on 2.6.23 than on 2.6.18 (see
     http://bugzilla.kernel.org/show_bug.cgi?id=9124).
     Maybe modifying the source to increase the call frequency of
     netpoll_poll() (i.e. forcing the call every time you send an skb) would
     get it to occur sooner?!



Comment 64 Neil Horman 2008-01-19 01:34:31 UTC
"the regular fastpath process instead of
     this intrusive netpoll approach.  We did test this with both netdump and
     netconsole, and it's fine"
Of course it will be fine for netdump, the patch is written for that purpose,
and netconsole will be fine in most cases, because it doesn't nominally receive
frames, it nomincally only sends them.  kgdb is another case however, as are the
netdump corner cases.  Did you try logging the state of a large machine during a
netdump?  I expect we will drop reboot requests if the state takes too long to
dump. 

Regarding your upstream submission, do you have a pointer to the discussion
thread?  The last traffic I see on netdev from you regarding this issue is
from last October, and the comments made at that time regarding that version
of the patch seem rather relevant to me.  Regardless, if upstream rejected
this fix, then we definitely should avoid it here.



Comment 65 Tina Yang 2008-01-19 03:23:21 UTC
This patch was written mainly for the netconsole hang, not for netdump, and
yes, we've considered and tested all situations with the patch I submitted
upstream.  I didn't receive any response regarding this after the rejection
and am not tracking it any more, since the patch we have here runs fine.

Comment 66 Neil Horman 2008-01-19 14:28:22 UTC
This patch was written mainly for the netconsole hang, not for netdump?  I'm
not sure what that means.  The hang as described in this bug should be one and
the same.  Are there two separate problems in your view?

You've considered and tested all situations with the patch?  Can you guarantee
that services other than netdump are able to receive frames properly with this
patch?  I see how it could be possible to receive frames through the normal
receive path here, but I'd like to be certain.

Regarding your upstream post, I'd still like a link to the netdev post in the
archives so that I can follow up on it, but I don't see anything after your
October exchange.  Is that the post set you're referring to?  I ask because
the post I looked at bears no resemblance to your latest patch here.

Comment 69 Tina Yang 2008-03-04 20:28:44 UTC
I only posted once to netdev (last October), and it was based on the patch I
posted here - basically eliminating poll_controller() and poll_napi()
altogether during normal operation.  Yes, even poll_controller() shouldn't be
called: consider that a netpoll thread can contend with a real interrupt
running the same ISR, both manipulating the device's interrupt state.  Even
though the two sides initiate their PIOs in the correct order, regulated by
atomics/locking, they are not synchronized at the actual IO-to-bus level
because they run on different cpus; e.g. device interrupt enables/disables
could arrive at the device in the wrong order, resulting in interrupts being
turned off forever.

Anyway, I saw a fix posted to 2.6.24 to address specifically this
"netdevice.h:888" race, albeit not all the potential (race) issues (like the
one above) found in the netpoll code.

Comment 72 RHEL Program Management 2008-03-25 14:09:22 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 76 John Sobecki 2009-02-18 23:13:35 UTC
Hi,

Can we get this BZ re-opened?  It's a critical patch for customers running
netconsole/netdump; otherwise they will experience outages.

Also note that we have been shipping Tina's patch since OEL 4.6
(2.6.9-67.0.0.0.1.EL) and it has been through extensive QA/stress load here at
Oracle.  Thanks, John

Comment 77 Neil Horman 2009-02-19 01:20:13 UTC
Thought you didn't fork OEL? :)

You can reopen this if you like, but I'll just close it again unless you can reproduce it on the latest RHEL4 kernel.  Most notably, we fixed bz's 474479 and 477202, which should have fixed this bug as well.

