Bug 226701
Summary: | netdump client misses COMM_REBOOT with large number of console messages | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Bryn M. Reeves <bmr> |
Component: | netdump | Assignee: | Neil Horman <nhorman> |
Status: | CLOSED ERRATA | QA Contact: | |
Severity: | high | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.4 | CC: | benl, edamato, konradr, poelstra, tao |
Target Milestone: | --- | Keywords: | OtherQA, ZStream |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | RHBA-2007-0827 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2007-11-15 16:11:15 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 222397, 239544 | ||
Attachments: |
Description
Bryn M. Reeves
2007-02-01 00:34:37 UTC
Created attachment 147064 [details]
patch changing order of reply and show_state
This patch just changes the ordering, moving the reply after the call to
show_mem. Tested locally with the same setup originally used to reproduce the
problem; it caused no issues and allowed a successful reboot.
Created attachment 147065 [details]
change netdump-server reboot timeout & interval
This patch also avoids the problem, this time on the server side by increasing
NETDUMP_REBOOT_DELAY_MSEC and NETDUMP_N_REBOOT_REQUESTS.
The patch in comment #1 seems like the "right" way to fix this to me, as it ensures the client & server are in sync, but I'm assuming there's a reason for the comment /* send response first */ in netdump. The second patch works for the test cases I've used so far, but the numbers could probably do with fine tuning (1000ms between requests and 200 requests).

Neither patch seems correct to me. We send the REPLY_SHOW_STATE message before dumping the memory and system state information to avoid a timeout on the server and the resending of the same message again and again (which your first patch would seem to cause to happen more often). Further, the fact that we set netdump_mode to 0 after we send that message should prevent us from actually receiving the subsequent reboot messages from the server until we are done sending the netconsole messages we generate. Unfortunately we have no real way to determine when the console is "done", so this is a best-effort sort of thing (and why we abandoned netdump for kdump).

Why don't you try adding

for (i = 0; i < 1000; i++)
    udelay(1000);

between the call to show_mem and netdump_mode=1 in netpoll_netdump? That should give the console time to flush. If that works, perhaps we can work up a patch that makes that timeout configurable for those individuals who have sufficiently busy networks to see this problem.

Sorry if the point of the patches wasn't clear - they're not really proposed 'fixes' as much as patches I've been using to help debug the problem and figure out what the real fix is. I'll test the delay loop this afternoon, but I suspect the reboot commands will still be long gone by the time we return from show_mem and set netdump_mode=1.

Btw, to clarify - the behavior I seem to be seeing is that when we return from show_mem the server has already sent the reboot command and it has been dropped by netpoll.
I'm not sure how waiting even longer here is going to help avoid that (I can reproduce the same behavior by just replacing show_mem/show_state with a wait of >10s).

> Sorry if the point of the patches wasn't clear - they're not really proposed
> 'fixes'

Then why did you say "The patch in comment #1 seems like the "right" way to fix this to me" in comment #3?

> Btw, to clarify - the behavior

You don't need to clarify, I think it's pretty clear what's going on. On a busy network, transmission delays cause the reboot command from the netdump server to be received by the client after we re-enable netdump mode but before the NIC can transmit all the netconsole messages that were generated as a result of the previous show status command - and we have no way to know when netconsole is done with the work we generate for it over any given period. This doesn't happen on lightly loaded machines or not-so-busy networks because netconsole can clear out the tx queue before they get the reboot command. That's why I'd like to put a delay in before we re-enable netdump_mode, so that we give the console time to send all its data before we allow a reboot. Let me know the results of your testing. Thanks!

> Then why did you say "The patch in comment #1 seems like the "right" way
> to fix this to me" in comment #3?

That's what the "" were about... should have been more clear. Intuitively, it felt more "right" moving the response after the show_mem/show_state than hacking around with the timeouts on the server. Doing that server side means we're OK for this case (thread dumps taking 1-2 minutes) but it's a kludge and will fail for cases where the thread dump takes longer. Again, sorry that wasn't clear - I'll be sure to include "RFC" in the comment next time ;)

I've tested the delay loop suggestion and this had no impact on the problem for my test cases. The client still sits spinning forever.
Lengthening/shortening the period from the 1s delay you recommended also didn't help. I don't see how this delay could help here, as it seems to me it's the delay while calling show_state/show_mem (while polling the controller) that causes the problem - replacing those calls with a similar wait/poll loop of >10s is enough to reproduce this, with no console backlog at all.

> You don't need to clarify, I think it's pretty clear what's going on. On a busy
> network, transmission delays cause the reboot command from the netdump server
> to be received by the client after we re-enable netdump mode but before the
> NIC can transmit all the netconsole messages that were generated as a result
> of the previous show status command.

That's not what I described and it doesn't match what I'm seeing. Can you explain the reasoning behind the packet being received on the client after we re-enable netdump_mode? This seems to contradict what I'm seeing in my tests: I see the reboot packets going to the NIC /while/ netdump_mode == 0, being handed to netdump_rx and discarded. Enabling packet dumps inside netpoll seems to confirm this. During show_mem/show_state, I see exactly 20 UDP packets received and handed through netpoll_rx -> __netpoll_rx -> netdump_rx and being thrown away. Once show_mem returns & netdump_mode is reasserted, I see a tonne of UDP traffic from a neighbouring cluster, but nothing from the netdump-server is ever passed to netpoll_rx after re-setting netdump_mode - all twenty packets are long gone by this stage.

I don't see adding delays to the client code ever preventing this - either the netdump server needs to send the requests later or over a much longer interval (to ensure they hit netpoll after netdump_mode=1 and are handed to add_new_req instead of hitting the floor), or the netdump client needs to delay sending REPLY_SHOW_STATE so that the server doesn't start sending COMM_REBOOT until we're ready to receive it.
I will attach some traces showing the packets being handled by netpoll that show the behavior I'm describing.

Created attachment 147163 [details]
netpoll debug log showing reboot messages being dropped while netdump_mode=0
The time spent with netdump_mode==0 is bracketed by ***waiting/***waited, so:
$ awk '/^.*waiting.*$/,/^.*waited.*$/{print}' screenlog.0.dropped_packets2|
grep dropping | wc -l
20
All the packets being logged here after netdump_mode=1 are UDP broadcasts
unrelated to netdump ("local ip check fails" in the log).
Note that the same behavior (no reboot) is seen when no console logging is done
from netpoll (i.e. just sitting in a loop alternately mdelaying/polling).
Created attachment 147164 [details]
patch to reproduce missed COMM_REBOOT with netpoll logging
Debugging patch - this avoids calling show_state/show_mem and just sits in a
loop:
for (i = 0; i < 300; i++) {
    mdelay(100);
    netpoll_poll(&np);
}
With netpoll debug disabled it is completely silent, with netpoll debugging
enabled, this obviously still sends output to the console.
Created attachment 147165 [details]
simple netpoll trace patch
Netpoll patch used for debugging.
My bad, I read your description backwards. I understood the problem to be that when the kernel receives a SHOW_STATE command it immediately replies and queues all the netdump information to be sent, and then re-enters netdump mode and gets the server's reboot command before it has finished dumping the show state info to the log, leading to lost crash data. But as your last post indicates, the problem is really that we take too much time dumping the log, so that we time out on sending our reboot requests and leave the system in a crashed state.

That's easily fixable without any code changes, through the use of the netdump-reboot script on the netdump server. Just modify that script to pause with a sleep command for an amount of time appropriate for the crashing system in question to complete the show state command, and then exit with a zero return code. If you want to add some intelligence, you can even check the destination log file for continued output to fine-tune the wait period. Since the netdump-reboot script executes before we send the reboot command, this can be used to see that the log gets flushed before we even start to send out COMM_REBOOT messages.

That seems like a worse workaround than changing the timeouts/number of reboot retries on the netdump-server. The server runs scripts in the foreground and blocks for the duration of the script:

"The four executable scripts are run in the foreground by the netdump-server, so any time-consuming task done by the scripts will block the netdump-server from servicing any crash or log requests from any other source. Therefore it may be advisable to run any time-consuming tasks in the background, allowing the script to return back to the netdump-server." (man 8 netdump-server)

Kludges to guesstimate the "appropriate" length of time this should wait also seem suspect - the delay is proportional to the number of threads executing on the system at the time of the crash, the depth of their backtraces & the speed of the console.
How do you envisage approximating that in the scripts? I'll test this out here, but I'm worried we'll end up missing other clients' netdump requests while the script is waiting.

Another (probably dumb) question: in netdump_rx, what are we guarding against by dropping incoming packets while netdump_mode == 0?

void netdump_rx(struct netpoll *np, short source, char *data, int dlen)
{
    req_t *req, *__req = (req_t *)data;

    if (!netdump_mode)
        return;

If we could queue the incoming COMM_REBOOT packets while the stack dumps are happening, the problem goes away as we just peel them off the request list when the dump finishes.

> That seems like a worse workaround than changing the timeouts/number of reboot
> retries on the netdump-server. The server runs scripts in the foreground and
> blocks for the duration of the script:

I disagree. Yes, the script runs in the foreground, as it needs to in order to prevent the netdump server process from continuing until it completes, but we shouldn't be talking about a long period of time here in almost all cases. And since subsequent dumping clients (which would be blocked by this solution) will wait indefinitely for the dump server to service them (see netpoll_netdump and netdump_startup_handshake), I don't really see this as a big deal. Granted, we can construct a pessimal case (consider a serial console running at 2400 baud, which would take a _very_ long time to complete a large show_state), but we can just set a max wait value in the script to avoid blocking forever. The long and the short of it is that this seems like a pretty isolated problem to me, and before we go adding in features for it, it seems like we can do a site-specific fix.

> Kludges to guesstimate the "appropriate" length of time this should wait
> also seem suspect ... How do you envisage approximating that in the scripts?

I am assuming that the netdump server is also capturing netconsole output from the system in question.
If so, then it can periodically check the file that is being logged to for increases in size. As long as the file continues to increase, output continues to flow from the system being captured, and we can be reasonably sure that netdump_mode is not yet enabled on the client. Once growth stops, or a max wait period is reached, we can continue by exiting the script.

> Another (probably dumb) question: in netdump_rx, what are we guarding against
> by dropping incoming packets while netdump_mode == 0?

It's not a dumb question. We do it because we don't write to the netconsole in crashdump mode (see write_msg() in netconsole.c). IIRC there is a deadlock potential if we are in netdump mode and we write something to the netconsole at the same time. I'm not saying that we can't increase our request retransmit count or something similar, but since this is the only report of this I've gotten, I'd really prefer to do a site-specific fix if we can.

> I disagree. Yes, the script runs in the foreground, as it needs to in order to
> prevent the netdump server process from continuing until it completes, but we
> shouldn't be talking about a long period of time here in almost all cases.

Ok - seems reasonable. I'll test this out & get back to you - thanks for your advice.

> It's not a dumb question. We do it because we don't write to the netconsole in
> crashdump mode (see write_msg() in netconsole.c). IIRC there is a deadlock
> potential if we are in netdump mode and we write something to the netconsole
> at the same time.

I'm being dumb again, but I don't see how that relates to queueing requests inside netdump_rx - avoiding console output when netdump_mode is true is one thing, but this is avoiding queuing a request onto netdump's internal request_list while netdump_mode is false.
The only thing netdump_rx does is a GFP_ATOMIC kmalloc and a handful of Dprintks that disappear in any normal build anyway, and then the spin_lock_irqsave/spin_unlock_irqrestore of the request_list lock (that doesn't seem any different than the lock/unlocks in the show_mem/show_state paths). Since all this is happening in a single thread, I can't see how we're going to get ahead of ourselves and start processing those commands before show_mem/show_state return, since the netdump poll loop is blocked inside those functions.

Anyway, there's probably something obvious I'm missing here, but this is another change that gives reliable reboots for my test cases in the face of slow consoles & lots of UDP traffic. Regarding the question in comment #13 and the answer in comment #14 - I still don't see what queuing requests when netdump_mode == 0 has to do with performing console output - there's nothing in that path that would write to the console in a normal build, and besides, the race with console output you mentioned was with netdump_mode == 1, right? Again, sorry if I missed something obvious here.

Created attachment 148441 [details]
Make netdump queue packets while dumping state to the console
This patch adds a netdump_paused flag. This is set when we clear netdump_mode
to allow console output and is checked inside netdump_rx to decide if we want
to queue the packet (netdump_paused == 1), or discard it because netdump hasn't
been activated (netdump_mode == 0 && netdump_paused == 0).
There is a fixed limit of 20 packets queued during this period. This should be
sufficient to avoid the problem of clients dropping all 20 reboot requests sent
by the server, unless someone is simultaneously sending us lots of UDP traffic
on the netdump port.
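The queue-or-drop decision described above can be modelled in plain C. This is a userspace sketch of the logic only, not the actual patch: the flag names netdump_mode and netdump_paused and the 20-packet limit follow the description, while netdump_rx_should_queue is a hypothetical helper introduced purely for illustration.

```c
#define MAX_PAUSED_PACKETS 20   /* fixed queue limit described above */

/* Simplified model of the state the patch adds (not the real kernel code). */
static int netdump_mode;     /* 1 once netdump has been activated */
static int netdump_paused;   /* 1 while dumping state to the console */
static int paused_queued;    /* packets queued so far while paused */

/* Decide what netdump_rx should do with an incoming request packet:
 * returns 1 to queue it onto the request list, 0 to discard it. */
static int netdump_rx_should_queue(void)
{
    if (netdump_mode)
        return 1;   /* normal operation: always queue */
    if (netdump_paused)
        return paused_queued < MAX_PAUSED_PACKETS;  /* bounded queue */
    return 0;       /* netdump never activated: drop, as before */
}
```

With this gating, the up-to-20 COMM_REBOOT retries the server sends while the client is inside show_mem/show_state are held rather than discarded, and can be serviced once netdump_mode is restored.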
I've given this some basic testing with the customer's reproducer - so far, so good. Just setting status back to ASSIGNED for you to take a quick look at the patch. I'll keep testing this here & update as I have more info. Thanks,

Initially the patch looks good. I assume your testing involves just crashing a machine that previously failed to reboot, due to the number of tasks it needed to print for the show_state command? And with this patch, the system is rebooting properly, correct? If yes, then I think I'm willing to push this.

Yes - I have several boxes on slow serial consoles that easily demonstrate the problem. With the queueing patch they've made a couple of dozen successful reboots without issue.
> The patch in comment #1 seems like the "right" way to fix this to me, as it
> ensures the client & server are in sync, but I'm assuming there's a reason for
> the comment /* send response first */ in netdump.
Rather than cluttering up rhkernel-list with netdump minutiae, Jeff and I
were just looking back trying to find the original patch that moved the
COMM_SHOW_STATE_REPLY to be done *prior* to the actual work being
done in the client. But it seems that we must have done it when forward-porting
netdump from AS2.1 to RHEL3, since AS2.1 still does it after the
show_[regs|state|mem]() operations, and RHEL3 always has done it beforehand.
We *think* we did it because there were always "too many timeouts"
messages on the netdump-server machines.
But now, re-looking at the netdump-server code, it looks like the code
there originally intended to prevent the show-status timeouts, but it
seems that the logic is wrong:
static void
status_timeout (NetdumpClient *client,
                guint32 request)
{
  client->n_timeouts++;
  if (client->n_timeouts <= 1 && !client->got_log_packet)
    status_request (client, FALSE);
  else
    {
      syslog (LOG_WARNING, "Got too many timeouts waiting for SHOW_STATUS for "
              "client %s, rebooting it\n", client->ip_addr);
      netdump_client_destroy (client, TRUE);
    }
}
We *believe* that it should be something like:
if (client->n_timeouts <= MAX_NR_RETRIES || client->got_log_packet)
If a log packet comes in while waiting for the SHOW_STATUS to end,
it should just retry the SHOW_STATUS without complaining, or giving up.
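The difference between the two conditions can be sketched as a pair of predicates. This is a model for comparison only; MAX_NR_RETRIES comes from the suggestion above, and the value used here is purely illustrative.

```c
#define MAX_NR_RETRIES 20   /* illustrative value, not taken from the source */

/* Current server logic: retry at most once, and only if no log packet has
 * been seen - so a client still streaming log output gets rebooted. */
static int should_retry_current(int n_timeouts, int got_log_packet)
{
    return n_timeouts <= 1 && !got_log_packet;
}

/* Suggested logic: keep retrying while log packets are still arriving,
 * and give up only after a bounded number of silent timeouts. */
static int should_retry_suggested(int n_timeouts, int got_log_packet)
{
    return n_timeouts <= MAX_NR_RETRIES || got_log_packet;
}
```

Under the current condition, a client mid-way through a long show_state (log packets still flowing) is destroyed on the second timeout; under the suggested one it keeps being retried for as long as output arrives, and the cutoff applies only once the client has gone quiet.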
That all being said, it would require a change to both the client and
the server, and wouldn't account for an "old" netdump-server being used,
with an updated client, etc, etc. Jeff feels that this would be the right
way to solve it, i.e., a two-line change to the client, and a one-line change
to the netdump-server. I'm of the opinion that the less that is done the
better. You guys, I'm sure, have your own opinions...
In any case, we just wanted to explain why it (the movement of the
send_netdump_msg() call) was historically changed to be done that way.
Based on feedback, we have a better solution to this problem, which we are currently implementing.

Created attachment 148827 [details]
netdump kernel patch
Created attachment 148828 [details]
netdump server patch
Bryn, Here are the netdump kernel and server patches based on the fallout of
the on-list conversation we all had. Please test them out and confirm that
they solve the problem appropriately. Thanks!
Hi Neil, Thanks for the patches - looks good. I will get these applied & tested on my boxes here today or tomorrow. Thanks!

Neil, did you actually try to compile that kernel patch?

- if (unlikely(crashdump_mode()))
+ if (unlikely(!netdump_emit_printks)
return;

Created attachment 148947 [details]
revised kernel side patch with balanced parens
No, clearly I didn't test it out myself. You wrote the first patch; I figured that you would be able to correct any minor typos you encountered while testing this one. Thank you for fixing that.

Initial tests are looking good. I've tested different combinations of patched/unpatched server and kernel - the patched kernel makes it through all my tests now. Running an old kernel with the patched server still hangs, but that's expected. I'll keep this in NEEDINFO until I've got a few more runs done, but so far so good. Thanks!

Hi Neil, I've had the test boxes run through about 15 runs with the new patches now without any sign of trouble. Thanks!

Created attachment 149359 [details]
revised kernel patch
Neil's revised kernel side patch

Hi Neil, The revised kernel patch has now been through ten test runs without incident. Thanks!

This bug has been copied as z-stream (EUS) bug #239544 and now must be resolved in the current update release; setting the blocker flag.

Committed in stream U6 build 55.6. A test kernel with this patch is available from http://people.redhat.com/~jbaron/rhel4/

*** Bug 248602 has been marked as a duplicate of this bug. ***

A fix for this issue should have been included in the packages contained in the RHEL4.6 Beta released on RHN (also available at partners.redhat.com). Requested action: please verify that your issue is fixed to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add the *keyword* PartnerVerified (leaving the existing keywords unmodified).
If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2007-0827.html