Description of problem:
Client upgraded from ocfs2 v1.2.3 to v1.2.5 and began experiencing evictions due to node fencing within a day of upgrading. This is due to PANIC: "kernel BUG at include/linux/netdevice.h:888!"

The chain of events appears to be a period of higher load on one system (e.g. node1) causing it to slow and eventually fail to respond to the OCFS2 heartbeat. This load, however, is less than loads at other times that cause no problems. Node1's messages file is busy reporting "kernel: fown: signal 29, pid 10946" messages as it slows, until:

15:11 - node1 panics and dumps vmcore
15:12 - node2 initiates the eviction of node1 from the cluster
15:14 - node2 reports that the DLM eviction was successful
15:56 - node1 finally restarts

This sequence of occurrences is the same regardless of which node slows and is eventually fenced out.

Analysis:
The tg3 driver is detecting an extremely rare state inconsistency which is being mishandled by the kernel, resulting in a kernel panic.

Version-Release number of selected component (if applicable):
kernel-2.6.9.55

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Patch attached; the fix is queued for mainline 2.6.23.
Created attachment 161198 [details] patch file
Sorry, but this just looks like a hack to prevent netif_rx_complete from checking the __LINK_STATE_RX_SCHED flag when it's clearly improperly cleared. The right thing to do is figure out why the RX_SCHED flag is cleared on entry to this function and prevent that from happening, not just shortcut it with an additional flag. Looking at the most recent RHEL4 build, I don't see anything between poll_napi, the tg3_poll routine, and netif_rx_complete that changes the state of that flag. I've not looked, but that would suggest to me that this is either a bug that has been fixed in an intervening release (meaning a test of the latest kernel is in order), or that something outside of tg3 or the netpoll code is messing with that flag. Has the ocfs code been audited to verify that this is not the case?
The vmcore dump indicates a potential race condition for dev->poll at the time of the crash (flag checking) when tg3 is not even on the per-cpu poll_list. The code clearly shows more than one such scenario between poll_napi and net_rx_action, since by design poll_napi is not regulated by the per-cpu poll_list.

1) A race window exists in net_rx_action between finding the device on the poll_list and waiting on the poll_lock - i.e. poll_napi could already have been in dev->poll and removed the device from the poll_list in the interim.

2) Another window exists between checking list_empty and getting the next list entry in the same code segment, while poll_napi could remove the last entry from the poll_list, resulting in another type of crash (invalid pointer).

This patch proposed by Olaf Kirch is not perfect, but it is a reasonable minimal one that addresses all the races; it was posted to netdev, acked by Dave Miller, and queued for 2.6.23.
"The vmcore dump indicates a potential race condition for dev->poll at the time of the crash"

You never provided me with the vmcore. That would help a lot. Please do so in the future.

"This patch proposed by Olaf Kirch is not perfect"

No, it's not perfect. It's a hack IMO. I'll propose it since Miller took it upstream, but that really doesn't make it any better. Off the top of my head it would have seemed easier to simply remove the device from the poll_list after calling poll_controller, under the protection of local_irq_disable in netpoll_poll. Either way though....
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Created attachment 181601 [details] patch to hide device from the poll list during netpoll operations to avoid net_rx_action races

Actually, scrap my last comment: http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.24.git;a=commit;h=2e27afb300b56d83bb03fbfa68852b9c1e2920c6 It appears this is rather worse than a hack: it completely breaks at least one (possibly more) big drivers. Point me to that vmcore so we can figure this out properly. Also, I'm attaching a patch for you to try. It's just off the top of my head, so it may not work, but it seems to me that if we remove the device from the poll_list and re-init the list head, then we avoid the risk of a race with net_rx_action, since we'll never poll from there.
1) What breakage? Could you elaborate on it?
2) The vmcore is rather huge (several GBs), and support has probably already deleted it.
3) The poll_list is a per-cpu list associated with the cpu to which the interrupt is delivered, while the netpoll thread could run on any cpu other than the one whose list the dev is on, so local_irq_disable is not going to help.
1) Did you not read the link I attached in comment #6? I'll attach it again. http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.24.git;a=commit;h=2e27afb300b56d83bb03fbfa68852b9c1e2920c6
2) Do you not have the vmcore someplace I can download it from? If you don't have it anymore then there is nothing we can do, but it would be helpful if you kept debug material around until the problem was solved.
3) Yes, you make a good point; you'll have to globally disable irqs briefly with irq_save/restore_flags. Can you make that change to the patch?
2) The vmcore is from a customer, about 28GB, and I couldn't find it any more; it was probably removed for space.
3) Is there a way to synchronously and globally disable irqs? Also, could net_rx_action run from a non-interrupt context (e.g. ksoftirqd)?
2) OK, we found a backup copy of the vmcore, but since it's a customer's, we are not at liberty to provide it to others. But I can certainly look up bug-related information for you.
Comment #9: net_rx_action runs as a softirq already. If you want to disable irqs globally, use save_irq_flags or one of its brother functions as I suggested.
Comment #10: I'm glad you found the vmcore, but why can't you provide it to us? If they are running RHEL, they should be a mutual customer. Or are they not running RHEL?
>If you want to disable irqs globally, use save_irq_flags or one of its brother functions as I suggested.

I couldn't find any save_irq_flags in the entire kernel; are you talking about spin_lock_irqsave? The irqsave part of spin_lock_irqsave is for the local processor only; the global control effect of spin_lock_irqsave actually comes from the spin_lock itself. From what I know, there is no such thing as "globally disabled irqs".

>net_rx_action runs as a softirq already.

net_rx_action can run either from an interrupt context or from a process context (ksoftirqd). "Globally disabled interrupts" can't prevent the latter.

>I'm glad you found the vmcore, but why can't you provide it to us?

It's company policy; it's not up to me to decide.
1) "I couldn't find any save_irq_flags" You're right, it was local_irq_save I was thinking of, which obviously is a local-only irq disable. To provide list synchronization, we'll need a global spinlock. I'm attaching a new patch to do that.
2) "It's company policy; it's not up to me to decide." You seem to have missed the point. Is this customer of yours running RHEL? If they are, then they should be a mutual customer and you should have a process by which you can provide that data to us. Or are you saying that they aren't running RHEL?
The attached patch is again untested. Please build and test it at your customer site and report results.
Created attachment 183141 [details] new patch to test new patch
>The attached patch is again untested. Please build and test it

Does this patch work?
1) First of all, an unnecessary global lock will kill SMP performance down to one cpu.
2) Secondly, once you blindly remove the dev from the list as you did in netpoll_poll_dev, neither net_rx_action nor poll_napi is going to process it, period.
3) Thirdly, if the device driver keeps interrupts on and ksoftirqd is holding the lock, a deadlock ensues.
4) Maybe more ... I just don't want to think about it.
Do you think maybe you should fully understand the scope of the problem before making a proposal?

>Is this customer of yours running RHEL? If they are, then they should be a
>mutual customer and you should have a process by which you can provide that
>data to us.

They are running RHEL; as a matter of fact it's they who requested that we file this bug with you. Maybe it would be easier for you to contact them directly. I just located a record (dated 8/10/07) mentioning that your end of the contact for this customer is Payne, Robert Justin (RHCE). The customer is GRUPO INDUSTRIAL SALTILLO SAB DE CV, located in Mexico.
"Does this patch work ?"

I've already told you I haven't had time to build/test it yet, and that it was off the top of my head. I would think it's pretty clear that I'm not 100% sure it works yet.
1) What in your mind makes the global lock unnecessary? Yes, it would be nice if we could avoid it, but we need to protect against a race across all cpus with the netpoll code. Yes, SMP performance will be impacted of course, but a BUG() halt is a worse impact on performance. We'll worry about the performance once the race is closed. Off the top of my head, we could replace the spinlock with an rwlock_t: read_lock in the net_rx_action path to allow for parallel execution, and write_lock in the netpoll path to protect all cpus.
2) Yes, that's rather the point, since netpoll_poll_dev will process the removed dev via poll_napi instead.
3) That's what the local_irq_disable is for in the net_rx_action path. If we update to use rwlocks, we should probably use the irqsave version in the netpoll path.
4) Thank you for the customer contact, I'll work directly with our contact here.
Created attachment 183961 [details] patch to serialize netpoll code

Jeffrey, I'm currently testing this patch; once I have verified that it's not breaking anything, I'll build packages for your customer and they can test on site.
The challenge is to do this without introducing additional locking. In your proposal, since the poll list is per-cpu-based, a lock is needed only for this cpu and the netpoll cpu, so why lock down totally unconcerned cpus? If you have 16 cpus, as this customer does, are you going to hold all of them for that long a duration? One big lock can solve everything, but it's certainly not the way to go. If the performance is unacceptable, are you going to redo the patch all over again? Lastly, the customer has switched all their tg3 to e1000 to avoid hitting this problem. The environment that most induces the problem is tg3 + the latest ocfs2 release + netconsole.
No kidding, the challenge is to fix this without extra locking, but we need an extra lock here at the moment to avoid the race condition. The race for which this issue is reported happens because 'unconcerned' cpus may place a device on the poll list while the netpoll code is trying to remove it, or vice versa (hence the change in the status of the __LINK_STATE_RX_SCHED flag in net_rx_action that leads to the bug halt). These lists are per-cpu, but the list head that each device uses spans cpus, meaning we need a shared lock to protect any list it might be on. It would be nice to avoid the use of an extra lock, but given that a BUG halt is worse than impeded performance IMO, we'll at least start with this. With my latest patch, at least multiple net_rx_action softirqs can run in parallel; we only have to serialize when we issue netpoll-sent frames (which are hopefully rare in the nominal case). I've got this patch built here. I'll test it out a bit on Monday locally and then post kernels for you, Jeffrey, for the customer to test with as much as they are able.
>No kidding, the challenge is to fix this without extra locking, but we need an
>extra lock here at the moment to avoid the race condition.

Your approach in trying to remove the device from netpoll has made the demand. There could be other ways that can do without. Theoretically speaking, the per cpu poll list can only be touched by its owner cpu in the original design. Your method has already violated that assumption.

>These lists are per-cpu, but the list head that each device uses spans cpus,
>meaning we need a shared lock to protect any list it might be on.

Not necessarily. The goal is to ensure the integrity of the individual lists, which have no relation to one another whatsoever. A shared lock is a bad idea.

>but given that a BUG halt is worse than impeded performance IMO, we'll at least
>start with this.

We want both.

>With my latest patch, at least multiple net_rx_action softirqs can run in
>parallel.

Locking penalty comes not only from contention but also from the locking operation itself, especially when net_rx_action is on one of the heaviest paths at peak time.
"Your approach in trying to remove the device from netpoll has made the demand"

I don't know what you're trying to say here.

"Theoretically speaking, the per cpu poll list can only be touched by its owner cpu in the original design. Your method has already violated that assumption."

Again, no kidding. The assumption was violated, however, prior to my patch, and that is what is causing the problem. Netpoll is able to work on any of the cpu lists, so we can either protect the list or remove netpoll (which we can't do).

"Not necessarily. The goal is to ensure the integrity of the individual lists"

Yes, which we need to do with a shared lock. We need this because netpoll manipulates the shared device data (the _RX_SCHED flag) while it may be on another cpu's poll list. I can't think of a better way to approach handling this. If you can, please present it.

"We want both."

So do I, but I'm trying to at least avoid a panic before we bug halt. We'll worry about performance after we have the system stable.

"Locking penalty comes not only from contention but also from the locking operation itself, especially when net_rx_action is on one of the heaviest paths at peak time."

Again, I know this. If you have a better idea, please present it.
"Your approach in trying to remove the device from netpoll has made the demand"

>I don't know what you're trying to say here

To avoid locking, I will probably try to see if I can get away with additional synchronization flags or parameters in "struct net_device" under the existing dev poll_lock umbrella.

"Theoretically speaking, the per cpu poll list can only be touched by its owner cpu in the original design. Your method has already violated that assumption."

>Again, no kidding. The assumption was violated, however, prior to my patch, and
>that is what is causing the problem. Netpoll is able to work on any of the cpu
>lists, so we can either protect the list or remove netpoll (which we can't do).

That's why I think the netpoll implementation is a hack, hence the many problems it has resulted in. In my opinion, netpoll is an emergency path that should only be used when the normal interrupt-driven path (net_rx_action) has failed and is shut down, such as when the system crashes; in order to reliably do a netdump, we do everything synchronously from a single cpu - send data, poll the device for i/o. So I think a better way is to return to the original design and have one clear network path at a time (normal vs crash). However, the scale of the changes might be significant enough that the original designer should be consulted and involved.

"Not necessarily. The goal is to ensure the integrity of the individual lists"

>Yes, which we need to do with a shared lock. We need this because netpoll
>manipulates the shared device data (the _RX_SCHED flag) while it may be on
>another cpu's poll list. I can't think of a better way to approach handling
>this. If you can, please present it.

Finer-grained locking is what I am trying to say - "one lock per list" - so that a cpu that is neither the list owner nor the netpoll contender at that instant will not be held up unnecessarily.

"We want both."

>So do I, but I'm trying to at least avoid a panic before we bug halt. We'll
>worry about performance after we have the system stable.

The performance revision might end up being a complete redesign of the patch. Why waste the energy?

"Locking penalty comes not only from contention but also from the locking operation itself, especially when net_rx_action is on one of the heaviest paths at peak time."

>Again, I know this. If you have a better idea, please present it.

Same as the first comment above.
Created attachment 185531 [details] patch example w/o additional locking
Created attachment 185541 [details] patch example w/o additional locking
Attachments in #23 and #24 are identical. Since dev removal occurs in dev->poll, which is protected by dev->poll_lock, we can safely decide whether to skip the poll based on the current status of dev->poll_list once we get hold of the lock. The overhead is near zero. However, the right way is to fix the basic design of netpoll, as I mentioned before.
Comment on attachment 185541 [details] patch example w/o additional locking

>--- linux-2.6.9/net/core/dev.c.orig 2007-06-28 00:53:51.178885000 -0700
>+++ linux-2.6.9/net/core/dev.c 2007-09-03 20:01:34.767326000 -0700
>@@ -1760,6 +1760,8 @@ static void net_rx_action(struct softirq
> unsigned long start_time = jiffies;
> int budget = netdev_max_backlog;
> void *have;
>+ struct net_device *null_dev = list_entry(&queue->poll_list,
>+ struct net_device, poll_list);
>
> local_irq_disable();
>
>@@ -1774,8 +1776,22 @@ static void net_rx_action(struct softirq
> dev = list_entry(queue->poll_list.next,
> struct net_device, poll_list);
>
>+ /* The last dev could have been removed by netpoll
>+ */
>+ if (dev == null_dev)
>+ break;
>+
> have = netpoll_poll_lock(dev);
>
>+ /* netpoll could have just released the lock,
>+ * check if dev has already been polled
>+ */
>+ if (dev->poll_list->next == LIST_POISON1) {
>+ netpoll_poll_unlock(have);
>+ local_irq_disable();
>+ continue;
>+ }
>+
> if (dev->quota <= 0 || dev->poll(dev, &budget)) {
> netpoll_poll_unlock(have);
> local_irq_disable();
>
>
>--- linux-2.6.9/net/core/netpoll.c.orig 2007-06-28 00:53:27.847957000 -0700
>+++ linux-2.6.9/net/core/netpoll.c 2007-09-03 20:04:08.655594000 -0700
>@@ -96,13 +96,17 @@ static void poll_napi(struct net_device
> if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) &&
> npinfo->poll_owner != smp_processor_id() &&
> spin_trylock(&npinfo->poll_lock)) {
>- npinfo->rx_flags |= NETPOLL_RX_DROP;
>- atomic_inc(&trapped);
>+ /* Check if dev has already been processed
>+ */
>+ if (dev->poll_list->next != LIST_POISON1) {
>+ npinfo->rx_flags |= NETPOLL_RX_DROP;
>+ atomic_inc(&trapped);
>
>- dev->poll(dev, &budget);
>+ dev->poll(dev, &budget);
>
>- atomic_dec(&trapped);
>- npinfo->rx_flags &= ~NETPOLL_RX_DROP;
>+ atomic_dec(&trapped);
>+ npinfo->rx_flags &= ~NETPOLL_RX_DROP;
>+ }
> spin_unlock(&npinfo->poll_lock);
> }
> }
Well, I can't disagree with you; netpoll is a hack, always has been. But for the sake of functionality (netdump/netconsole) and the preservation of ABI, we are stuck with it. That being said, have you tested this? Looking at it, I agree with you: it definitely has less overhead than my approach. However, your approach relies on the net device being deleted from any per-cpu poll list that it may be on, and your patch doesn't add any code to poll_napi to do that. I think you need a list_del(&dev->poll_list); right before the call to dev->poll to make this work, otherwise it's effectively a no-op.
The removal of the dev (list_del) is always done in netif_rx_complete, embedded in dev->poll (tg3_poll in this case). The list_del currently in net_rx_action is for a device that is eligible for another run; it's removed from the head and requeued at the tail. Does this make sense to you? And no, I haven't tested it yet.
Change if (dev->poll_list->next != LIST_POISON1) to if (test_bit(__LINK_STATE_RX_SCHED, &dev->state)) is better. --- linux-2.6.9/net/core/dev.c.orig 2007-06-28 00:53:51.178885000 -0700 +++ linux-2.6.9/net/core/dev.c 2007-09-04 10:48:42.409418000 -0700 @@ -1760,6 +1760,8 @@ static void net_rx_action(struct softirq unsigned long start_time = jiffies; int budget = netdev_max_backlog; void *have; + struct net_device *null_dev = list_entry(&queue->poll_list, + struct net_device, poll_list); local_irq_disable(); @@ -1774,8 +1776,22 @@ static void net_rx_action(struct softirq dev = list_entry(queue->poll_list.next, struct net_device, poll_list); + /* The last dev could have been removed by netpoll + */ + if (dev == null_dev) + break; + have = netpoll_poll_lock(dev); + /* netpoll could have just released the lock, + * check if dev has already been polled + */ + if (!test_bit(__LINK_STATE_RX_SCHED, &dev->state)) { + netpoll_poll_unlock(have); + local_irq_disable(); + continue; + } + if (dev->quota <= 0 || dev->poll(dev, &budget)) { netpoll_poll_unlock(have); local_irq_disable(); --- linux-2.6.9/net/core/netpoll.c.orig 2007-06-28 00:53:27.847957000 -0700 +++ linux-2.6.9/net/core/netpoll.c 2007-09-04 10:43:42.667574000 -0700 @@ -96,13 +96,17 @@ static void poll_napi(struct net_device if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) && npinfo->poll_owner != smp_processor_id() && spin_trylock(&npinfo->poll_lock)) { - npinfo->rx_flags |= NETPOLL_RX_DROP; - atomic_inc(&trapped); + /* Check if dev has already been processed + */ + if (test_bit(__LINK_STATE_RX_SCHED, &dev->state)) { + npinfo->rx_flags |= NETPOLL_RX_DROP; + atomic_inc(&trapped); - dev->poll(dev, &budget); + dev->poll(dev, &budget); - atomic_dec(&trapped); - npinfo->rx_flags &= ~NETPOLL_RX_DROP; + atomic_dec(&trapped); + npinfo->rx_flags &= ~NETPOLL_RX_DROP; + } spin_unlock(&npinfo->poll_lock); } }
Yeah, I see what you're saying. I was looking at the list_del in net_rx_action and assumed it was the only place we removed net_devices from the list, and didn't bother to look further. I'm still a little leery of this patch, however. Consider the following scenario:
1) net_devices A and B are added to the poll_list for cpu 0
2) cpu 1 attempts to perform a netpoll operation on net_device B
3) net_rx_action runs on cpu 0 and begins handling the poll operation for net_device A
Since the poll_locks for A and B are separate, cpu 0 and cpu 1 can perform their operations unimpeded in parallel (which is, as we agree, nominally a good thing). If, however, both cpu 0 and cpu 1 get through their respective devices' poll routines and call netif_rx_complete at the same time, we have a problem. Even though net_device B is on the poll_list for cpu 0, the netif_rx_complete operation for device B is being run on cpu 1. If the list_del operations occur close together on cpu 1 and cpu 0, we can wind up with list corruption, and an oops will result. So I think your patch may well solve the problem reported, but it will create another one in the process, one not easily reproduced. Thoughts?
Created attachment 186681 [details] patch forbids cross-cpu dev processing Glad you caught it - still can't escape from the netpoll design flaw. Here is a new approach (patch attached). The basic idea is to forbid cross-cpu processing. Let me know what you think.
Created attachment 187461 [details] alternate patch possibility

I like it. It seems like it should work to me, and with a minimum of performance impact. It needs some very minor manipulation to avoid ABI breakage on RHEL4, but it looks upstream-ready to me. While you were writing this, I also wrote the above patch, which is a variation on your last patch but avoids the list corruption we discussed, and hopefully the oops that got the original upstream patch reverted. Let me know which implementation you like better, and I'll roll a test kernel for our customer. Thanks!
About your patch: if net_rx_action happens to run first, before RX_SCHED can be cleared in netif_rx_complete, poll_napi might process dev->poll a second time as soon as the poll_lock is freed by net_rx_action. That is the same behavior as the original patch (the one proposed by Olaf), which avoids the netdevice.h:888 panic but might have subtle hw/sw impact and reportedly was not working, according to your earlier note as well as a testing report we received here in-house two days ago. Without a doubt, I like the "non-cpu-crossing" approach much better, simply because it restores the original design order and is clean, less convoluted, and efficient. We shouldn't let a secondary functionality such as netpoll overburden the normal, primary network stack unnecessarily, given that's where the most performance is demanded. That said, I would like to have it tested first to make sure it's not breaking anything before further discussion. I'll see if I can do it today. What do you think?
Created attachment 187831 [details] revised version of my alternate patch

Good catch, I hadn't considered the alternate case where net_rx_action ran first. I think that can be fixed by only checking the __LINK_STATE_RX_SCHED flag after we obtain the poll lock in poll_napi, and skipping the device if it's clear. As for the e1000 problem that caused the original version's reverting, I've not received much (any) detail regarding the problem. I was hoping to avoid that problem in this version by checking the new flag in net_rx_action rather than just inside netif_rx_complete. I'm honestly not sure, though. At any rate, I have e1000 adapters here, so I can do some minimal testing before we send it out. That all being said, I think both patches have a good deal of merit. I think, unless you have a better idea, I'll roll two kernels (one with your patch, one with mine), put them up on my people page, and we can start testing from there.
Let me do some basic testing on my patch first before you put it up. Thanks.
copy that, I'll focus on testing my patch until I hear from you.
Ok, I'm greenlighting my patch for external testing. I've built it internally here, and am able to run it on an SMP system with onboard e1000 NICs using both netconsole and netdump during a flood ping. Mr. Payne, Mr. West, if you would please indicate to the customer that a test kernel rpm for this bz can be found here: http://people.redhat.com/nhorman/rpms/kernel-smp-2.6.9-57.EL.bz251978.rht.i686.rpm Tina, if you would like to test this out as well and offer feedback, it would be appreciated. Also, if you could let me know what your testing status is, so we have a better idea of what our current upstream options are, I would appreciate it. Thanks!
Created attachment 189261 [details] patch prohibits cross-cpu dev processing

Above is my revised patch, with limited testing done, including netconsole and netdump under normal network traffic. Let me know if you have questions about the patch. In the meantime, I'll review your latest patch. Thanks.
About your patch, here is a scenario; see if you can dispute it:
. netpoll_poll calls dev->poll_controller to schedule the dev and turn off the hw interrupt
. poll_napi sets NETPOLL_SERVICED and turns the hw interrupt back on, but without clearing RX_SCHED
. much time elapses ...
. packets come in, the hw interrupts, sees RX_SCHED, disables interrupts
. net_rx_action is invoked and sees NETPOLL_SERVICED, which is stale from the previous event. It removes the dev from the list and exits without polling.
. since nothing is done with the current hw event and the interrupt is disabled, it's stalled until maybe the next possible netpoll_poll.

Also, I did a little investigation of the netpoll path on a live system. It seems the calling frequency is rather low (it normally won't go through it except in an extremely stressful situation such as running out of skbs), and that's why this part of the logic, although broken, hardly ever got caught. As I said before, the customer didn't run into it until they upgraded OCFS2, with tg3 specifically. Some thought: to have a more productive test, maybe the code needs to be tweaked a bit to go through this path intentionally.
Yes, I understand your concern, and I believe the patch should handle it. If I may paraphrase, you're worried about the situation in which we conduct a netpoll operation and no subsequent frames come in for a 'long time' after that (a 'long time' being a time gap of arbitrary length after the netpoll operation completes). The concern is that on reception of the first frame after the netpoll operation we still have __LINK_STATE_NETPOLL_SERVICED set, which we clear, and as an artifact we skip the poll operation on that device, so we may not accept that frame into the network stack without a potentially large latency. The answer is that we have two ways out of that situation. One is inside the clause that I added to net_rx_action: after we clear the __LINK_STATE_NETPOLL_SERVICED bit and the RX_SCHED bit, we look at the device's quota. If we still have a quota > 0 on the device, we immediately reschedule it for servicing on the same cpu again. Alternatively, if subsequent packets arrive in close succession, the driver interrupt routine will set the __LINK_STATE_RX_SCHED bit again and re-enqueue the device for servicing via net_rx_action. Either way, there is a 'hiccup' in napi processing for the device, in that we skip the net device once in our receive processing, but we immediately recover and catch up. As for the netpoll path, you are correct: its use is (ideally) fairly infrequent, reserved for exceptional cases. As you discovered, it is used in extreme low-memory situations. It is also used in situations that require synchronous network processing outside the scope of the normal network infrastructure (netconsole, netdump, kgdb, etc. all use netpoll to communicate). And yes, netpoll is a horribly broken design, but for all its irritations it is unfortunately still needed for several cases. If you want to go through this path intentionally, we don't really need to tweak much. My test should exercise it fairly well.
What I did was the following:
1) configure netconsole and netdump on the system
2) start the netdump/netconsole services and make sure they are operating on the remote receiving machine
3) flood ping out the interface to a remote host
4) run a script that generates _lots_ of console messages
5) crash the machine with a `echo c > /proc/sysrq-trigger`
Step 4 should cause the netpoll path to be run quite frequently, with netconsole spewing messages to the remote netlog server. In conjunction with the normal network processing going on with the flood ping in step 3, this should give us a pretty good idea of how well this patch handles integrating the netpoll and non-netpoll paths. Step 5 should test the patch's ability to switch over to exclusive netpoll use, since netdump only sends frames via the netpoll path. Let me know what you think. Also, Mr. Payne, if you could provide the test kernel I mentioned previously to the customer for testing, I would appreciate it. Thanks!
>If we still have a quota > 0 on the device, we immediately reschedule it
>for servicing on the same cpu again.

Ok, but quota > 0 doesn't mean there is work to be done. Rescheduling the dev might be a waste of cycles in case the same event has already been processed by netpoll. That's what Olaf's patch does too, and it seems to have some subtle impact on the driver/hw which I haven't really looked into yet. Also, in your net_rx_action:

+ if (dev->quota > 0)
+ netif_rx_schedule(dev);
+ continue;

1) What if quota < 0? It'll lose its turn forever.
2) It will get into a deadlock if it doesn't release the poll_lock before the "continue".

>1) configure netconsole and netdump on the system
>2) start the netdump/netconsole services and make sure they are operating on
>the remote receiving machine
>3) flood ping out the interface to a remote host
>4) run a script that generates _lots_ of console messages
>5) crash the machine with a `echo c > /proc/sysrq-trigger`

These were pretty much what I did too for my testing, but with tg3 instead of e1000. I'll see if I can tally the frequency.
"Ok, but quota > 0 doesn't mean there is work to be done"

But it does, or at least it's the best heuristic available to us. Note that right below the section I added, we perform the exact same check:

	if (dev->quota <= 0 || dev->poll(dev, &budget)) {
		...
	} else {
		netpoll_poll_unlock(have);
		dev_put(dev);
		local_irq_disable();
	}

If quota is greater than zero here, the device falls into the else clause and we simply unlock it and move on to the next item in the list, without ever calling dev->poll. This means that the device is still on the cpu's poll list, and the RX_SCHED bit is still set, meaning the next scheduling of net_rx_action will pick it up. My added code emulates that behavior, and should be safe.

"1) What if quota < 0 ? It'll lose its turn forever."

No. If quota is less than zero, then it needs to wait until such time as the hardware interrupts again and reschedules the interface to run, at which point (unless netpoll has locked the device again) net_rx_action will run its nominal path and start receiving frames on the device normally again.

"2) It will get into a deadlock if it doesn't release the poll_lock before the "continue""

Dang, you're right. I'll fix that up and attach a new version of the patch. Thanks!
Created attachment 190311 [details] new version of my patch new version of my patch. I'm rebuilding my test kernel now. Let me know how your testing goes.
>if quota is greater than zero here, the device falls into the else clause here
>and we simply unlock it and move on to the next item in the list, without ever
>calling dev->poll.

No, if > 0, you will definitely do dev->poll, because of the "or" condition in the "if" clause.

>"1) What if quota < 0 ? It'll lose its turn forever."
>No. If quota is less than zero, then it needs to wait until such time as the
>hardware interrupts again and reschedules the interface to run, at which point
>(unless netpoll has locked the device again) net_rx_action will run its nominal
>path and start receiving frames on the device normally again.

But the original code logic is that it will be queued to the tail of the list with added weight this time (below), while in your case it simply "continue"s w/o rescheduling:

	if (dev->quota <= 0 || dev->poll(dev, &budget)) {
		netpoll_poll_unlock(have);
		local_irq_disable();
		list_del(&dev->poll_list);
		list_add_tail(&dev->poll_list, &queue->poll_list);
		if (dev->quota < 0)
			dev->quota += dev->weight;
		else
			dev->quota = dev->weight;
	} else {

BTW, it's the wrong patch (nfs) you put up in Comment #43
Created attachment 190351 [details] correct patch Correct version of the patch. Sorry, a few too many plates spinning today. As for the scheduling, it's not a concern: note the logic in the while loop. The loop exits only when the list is empty (!list_empty). Since we're on the list at the head, it's not empty. And since the device we just checked is at the head of the list, it gets selected again in list_entry. Since we just cleared the _NETPOLL_SERVICED bit, we skip that if statement and proceed as normal through the net_rx_action code. So we're covered.
>Since we're on the list at
>the head, its not empty. And since the device we just checked is at the head
>of the list, it gets selected again in list_entry.

It's not at the head of the list; it's removed from the list as the first thing in the if (test_and_set_bit()) clause. Not sure you are answering my previous questions:
1) if dev->quota <= 0, the dev is removed and hits "continue" w/o being rescheduled?
2) if dev->quota > 0, dev->poll possibly gets executed a second time in the rescheduled run.

+		list_del(&dev->poll_list);
+		clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
+		/*
+		 * Check to see if any pending frames have come in
+		 * while we were mucking about. If they have lets reschedule
+		 * this device
+		 */
+		if (dev->quota > 0) {
+			netif_rx_schedule(dev);
+			netpoll_poll_unlock(have);
+		}
+		continue;
Shoot, you're right, I do remove it, and I mean to do that too. Hmm. Well, I think for the purposes of testing we can ignore this issue, since it really is a corner case. To observe the bug, you would have to get the kernel to issue a netpoll, after which you would have to stop the incoming flow of frames at exactly the right time, then you would need to send in a single frame to be received on the regular input path, after which you would need to stop the incoming flow of frames again, and the result would be a handful of delayed frames. A bug that needs to be fixed for certain, but nothing that will have a severe impact on upper level protocol behavior, and not one we are likely to see. As for how to fix it, I would say we could probably add a clear_bit of __LINK_STATE_NETPOLL_SERVICED to __netif_rx_schedule. What are your thoughts?
As for your questions:
1) if dev->quota <= 0, then we have the bug we are talking about, which I agree needs fixing, but is not serious, at least for the purposes of testing the originally reported BUG_ON halt, and certainly will be fixable with a minor adjustment to this patch as I propose above.
2) if dev->quota > 0, then all is well, as we'll just reschedule the poll and work in the normal receive path.
I've got the new kernel with my last patch built; let's go ahead with testing here and at the customer site. Justin, let us know what the customer gets for results with the kernel at my people page, I've just uploaded the new version there.
The testing status on my patch remains the same as in Comment #38 if you need an alternative approach. Otherwise, please keep me posted on the progress of your patch and its upstream filing status so that I can bring this bug to a close on my end. Thanks.
Update regarding the race condition from comment #39: After looking at it a bit longer, I'm back to thinking it's not an issue. I think this because the 'stale' NETPOLL_SERVICED bit will never occur, due to the way we set the __RX_SCHED bit. From looking at it, when we call ->poll_controller in netpoll_poll_dev, if there are frames to be processed, the poll_controller path in the driver will call netif_rx_schedule, which will do 2 things:
1) ensure that the __LINK_STATE_RX_SCHED bit is set on the device's state flags
2) if it sets the __LINK_STATE_RX_SCHED bit, it enqueues the device to the local cpu's poll_list. In the event that __LINK_STATE_RX_SCHED was already set by another cpu, the device will be enqueued to that remote cpu's poll_list.
Either way, when we exit poll_napi, we are guaranteed that the device we just polled is waiting to be serviced by the net_rx_action softirq, either on this or another cpu, and as soon as it runs, that bit will be cleared, and the device can return to normal processing. So it's never possible for us to be in the condition where we netpoll a device and its NETPOLL_SERVICED bit remains set while no pending net_rx_action is also queued on it. The patch is good as it is.
>2) if it sets the __LINK_STATE_RX_SCHED bit, it enqueues the device to the
>local
>cpus poll_list.

Since it's poll_controller that enqueues it to the poll_list, not the actual hw interrupt, the softirq/net_rx_action will never be invoked.

>when we exit
>poll_napi, we are guaranteed that the device we just polled is waiting to be
>serviced by the net_rx_action softirq, either on this or another cpu, and as
>soon as it runs, that bit will be cleared, and the device can return to normal
>processing. So its never possible for us to be in the condition where we
>netpoll a device and its NETPOLL_SERVICED bit remains set while no pending
>net_rx_action is also queued on it.

Not necessarily. Same as above - the softirq is raised only when there is a hw interrupt. If it's the poll_controller that turns off interrupts, sets RX_SCHED and does netif_rx_schedule, net_rx_action will never get called.
Yes, it works. The net_rx_action softirq is raised from inside netif_rx_schedule, which is called from the interrupt handler for a given driver, which in turn is called when a hardware interrupt is encountered, _or_ when a properly written poll_controller method is called. Either way, the device in question is added to the poll_list of one of the cpus in the system if there are frames to be read. At that point either net_rx_action runs first and services the device, in which case poll_napi skips it, or poll_napi gets it first and net_rx_action skips it, clearing the NETPOLL_SERVICED bit. Either way, both poll_napi and net_rx_action will interrogate the device struct every time netpoll_poll_dev is called and frames are received.
>Yes, it works. The net_rx_action Softirq is raised from inside netif_rx_schedule >which is called from the interrupt handler for a given driver, which in turn is >called when a hardware interrupt is encountered. Which hw interrupt ? The device's interrupt is turned off unless you count on other types of interrupts.
Never mind. The previous argument is invalid.
One more thing: the NETPOLL_SERVICED bit is already protected by poll_lock everywhere you access it, so why use heavyweight bitops (test_and_clear_bit, set_bit, etc.), especially in net_rx_action? That overhead is equivalent to locking.
Continuing from Comment #54, the following code segment looks suspicious. First of all, shouldn't netpoll_poll_unlock be outside of "if (dev->quota)"? Secondly, why reschedule the dev based on dev->quota > 0? If poll_napi has just done a netif_rx_complete on this event, will you end up calling dev->poll a second time in net_rx_action?

+		/*
+		 * Check to see if any pending frames have come in
+		 * while we were mucking about. If they have lets reschedule
+		 * this device
+		 */
+		if (dev->quota > 0) {
+			netif_rx_schedule(dev);
+			netpoll_poll_unlock(have);
+		}
+		continue;
Continuing from the previous comments: poll_napi is running with interrupts enabled. Will it get deadlocked after the trylock when an interrupt/softirq occurs before the lock is released?

-	if (test_bit(__LINK_STATE_RX_SCHED, &dev->state) &&
-	    npinfo->poll_owner != smp_processor_id() &&
+	if (npinfo->poll_owner != smp_processor_id() &&
 	    spin_trylock(&npinfo->poll_lock)) {
+		/* net_rx_action already got to us */
+		if (!test_bit(__LINK_STATE_RX_SCHED, &dev->state))
+			goto skip;
+		/*
+		 * This gets cleared in net_rx_action
+		 */
+		set_bit(__LINK_STATE_NETPOLL_SERVICED, &dev->state);
 		npinfo->rx_flags |= NETPOLL_RX_DROP;
 		atomic_inc(&trapped);
Comment 53 - It's possible for the hw to interrupt on another cpu. Alternatively the bit can be set from the poll_controller method (which should be calling something equivalent to the hardware driver's interrupt routine, which in turn will set the RX_SCHED bit).
Comment 54 - Well, I'm using bitops for two reasons - primarily because it's the data structure we have to work with, and secondly because interrupts are only disabled at the local cpu when we work with that flag. There is no reason the hardware can't issue an irq to be serviced by another cpu, which means we can have two cpus reading/writing that flag at the same time.
Comment 55 - Yup, you're right, we leave the lock dangling there. The unlock should be moved outside of the conditional, right before the continue statement. I'll attach a patch to correct that.
Comment 56 - Hmm, you would appear to be correct, although that deadlock condition would appear to be present outside the scope of this patch. We should probably tackle that separately, since that seems like a condition that could happen rather easily, yet it doesn't. We should probably look at this more closely.
Created attachment 194551 [details] updated patch to fix possible deadlock
I asked one of our testing engineers to try your patch about 10 days ago; the system hangs on both 2.6.18 and 2.6.23, and there was this "transmit timeout ... tg3 resetting" message shown on 2.6.18's console. Then I had him try my patch; the system didn't hang this time, but the "transmit timeout" was still there. It appears that dev->poll_controller(), while called synchronously by the netpoll thread, is competing with the ISR thread from the interrupt on another cpu. We tested it again with that line removed, and all is well now. I redid the patch (below) to skip the ->poll_controller() + poll_napi() altogether, and the initial testing is showing no performance impact and no hangs or timeouts. The reason for giving up on patching the existing netpoll path is clear - it would cost too much implementation- and performance-wise while the benefits are little to none. Please let me know what you think.

+++ linux-2.6.9/net/core/netpoll.c	2007-09-15 11:07:25.000000000 -0700
@@ -127,14 +127,21 @@
 void netpoll_poll_dev(struct net_device *dev)
 {
 	struct netpoll_info *npi = dev_wrapper(dev)->npinfo;
+	int polling_mode = netdump_mode;
 
 	if(!netif_running(dev) || !dev->poll_controller)
 		return;
 
 	/* Process pending work on NIC */
-	dev->poll_controller(dev);
-	if (dev->poll)
-		poll_napi(dev);
+	/* Only do the polling when the machine is
+	 * single-threaded, i.e. only one cpu active
+	 * and with interrupt disabled such as in
+	 * crash mode, because the code below is
+	 * neither smp nor irq safe.
+	 */
+	if (polling_mode) {
+		dev->poll_controller(dev);
+		if (dev->poll)
+			poll_napi(dev);
+	}
 
 	service_arp_queue(npi);
Flipping back to NEEDINFO; this happened b/c of the IT update. --vince
Tina, the patch looks reasonable for the netdump case, but it ignores other users of the netpoll infrastructure. netconsole and kgdb spring immediately to mind. I expect that this patch breaks both of those, since it prevents either service from receiving frames. Also, there are times when netdump leaves netdump mode, like when the netdump server sends a COMM_SHOWSTATE message. During these times we won't be able to receive other netdump messages. Can you provide some detail as to how my previous patch hung? It worked fine for me here (although I may well have not stressed it as much as you). Sysrq-t output would be great.
>the patch looks reasonable for the netdump case, but it ignores other
>users of the netpoll infrastructure. netconsole and kgdb spring immediately to
>mind. I expect that this patch breaks both of those, since it prevents either
>service from receiving frames.

No, it doesn't prevent receiving frames; all it does is let the receiving take its course through the regular fastpath process instead of this intrusive netpoll approach. We did test this with both netdump and netconsole, and it's fine. I've tried submitting a refined version of this patch to netdev but it was rejected for no convincing reason.

>Can you provide some detail as to how my previous patch hung? It worked fine
>for me here (although I may well have not stressed it as much as you). Sysrq-t
>output would be great.

It's a hard hang; nothing can be gotten out of the console. It seems easier to trigger on 2.6.23 than on 2.6.18 (see http://bugzilla.kernel.org/show_bug.cgi?id=9124). Maybe modifying callers such as netpoll_send_skb() to force a call to netpoll_poll() every time you send an skb will get it to occur sooner?!
"the regular fastpath process instead of this intrusive netpoll approach. We did test this with both netdump and netconsole, and it's fine"

Of course it will be fine for netdump, the patch is written for that purpose, and netconsole will be fine in most cases, because it doesn't nominally receive frames, it nominally only sends them. kgdb is another case however, as are the netdump corner cases. Did you try logging the state of a large machine during a netdump? I expect we will drop reboot requests if the state takes too long to dump. Regarding your upstream submission, do you have a pointer to the discussion thread? The last traffic I see on netdev from you in regards to this issue is from last October, and the comments made at that time regarding that version of the patch seem rather relevant to me. Regardless, if upstream rejected this fix, then we definitely should avoid it here.
This patch is written mainly for the netconsole hang, not for netdump, and yes, we've considered and tested all situations with the patch I submitted upstream. I didn't receive any response regarding this after the rejection and am not tracking this any more, since the patch we have here runs fine.
This patch is written mainly for the netconsole hang, not for netdump? I'm not sure what that means. The hang as described in this bug should be one and the same. Are there two separate problems in your view? You've considered and tested all situations with the patch? Can you guarantee that services other than netdump are able to receive frames properly with this patch? I see how it could be possible to receive frames through the normal receive path here, but I'd like to be certain. Regarding your upstream post, I'd still like to have a link to the netdev post in the archives so that I can follow up on it, but I don't see anything after your October exchange. Is that the post set that you're referring to? I ask because the post I looked at bears no resemblance to your latest patch here.
I only posted once to netdev (last October), and it was based on the patch I posted here - basically eliminate the poll_controller() and poll_napi() altogether in normal operations. Yes, even the poll_controller() shouldn't be called, because a netpoll thread can contend with the real interrupt running the same ISR that manipulates the device interrupt: even though the two sides initiate the PIOs in the correct order, regulated by atomics/locking, they are not synchronized at the actual IO-to-bus level because they are running on different cpus. E.g. device interrupt enabling/disabling could end up at the device in the wrong order and result in it being turned off forever. Anyway, I saw a fix posted to 2.6.24 to address specifically this "netdevice.h:888" race, albeit not all the potential (race) issues (like the one above) found in the netpoll code.
Hi, Can we get this BZ re-opened? It's a critical patch for customers loading netconsole/netdump, otherwise they will experience outages. Also note that we have been shipping Tina's patch since OEL 4.6 (2.6.9-67.0.0.0.1.EL) and it has been thru extensive QA/stress load here at Oracle. Thanks, John
Thought you didn't fork OEL? :) You can reopen this if you like, but I'll just close it again unless you can reproduce it on the latest RHEL4 kernel. Most notably, we fixed bz's 474479 and 477202, which should have fixed this bug as well.