Bug 157439 - LTC14642-NetDump is too slow to dump...[PATCH]
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Dave Anderson
QA Contact: David Lawrence
Depends On:
Blocks: 156320
Reported: 2005-05-11 13:46 EDT by Issue Tracker
Modified: 2007-11-30 17:07 EST

Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Last Closed: 2005-09-28 11:07:49 EDT

Attachments: None
Description Issue Tracker 2005-05-11 13:46:40 EDT
Escalated to Bugzilla from IssueTracker
Comment 3 Dave Anderson 2005-05-11 13:49:44 EDT
Waiting on IBM to test patch on suspect ppc64 -- it works on x86, ia64 and x86_64
Comment 4 Dave Anderson 2005-05-12 11:29:32 EDT
A RHEL3 kernel src.rpm file, kernel-2.4.21-32.3.ppc64netdump.EL.src.rpm,
can be found here:


It contains this patch:

--- linux-2.4.21/drivers/net/netconsole.c.orig
+++ linux-2.4.21/drivers/net/netconsole.c
@@ -186,11 +186,11 @@ static void zap_completion_queue(void)
        /* maintain jiffies in a polling fashion, based on rdtsc. */
-       {
+       if (netdump_mode) {
                static unsigned long long prev_tick;
                if (t1 - prev_tick >= jiffy_cycles) {
-                       prev_tick += jiffy_cycles;
+                       prev_tick = t1;

This patch addresses two problems:  

(1) The "if (netdump_mode)" addition prevents netconsole from
    incrementing jiffies during normal system runtime while
    netconsole is active.  Without it, the system time could jump
    ahead if, say, a long alt-sysrq-t operation took place.

(2) More importantly, the "prev_tick = t1" assignment addresses
    the problem reported in this case.  It prevents jiffies from
    incrementing out of control while prev_tick slowly works its
    way up to the current timestamp value.  prev_tick should be
    initialized, and maintained, to reflect the current value
    returned by platform_timestamp(), i.e., updated each time
    jiffy_cycles have transpired.

Depending upon the value returned by platform_timestamp(), which
in the ppc64 case is read via a mftb instruction, there could be
a flurry of print_status() calls, one for each command received
from the netdump-server, up until the point that the value of
prev_tick "caught up with" the current value returned by
platform_timestamp().  The fix properly initializes prev_tick the
first time that zap_completion_queue() gets called, and keeps it
in sync for the remainder of the session.  Therefore, the
print_status() output will be displayed once per second as
originally intended, because jiffies will not have been
incrementing out of control.

As far as attempting to emulate diskdump's print_status(), that
task is very simple for diskdump, because the total number of disk
blocks and the current block number are known in advance, and the
two numbers are easy to relate.

However, in the case of netdump, it is not nearly that simple.
This is especially true in the case of systems with memory holes,
such as ia64 machines, which can have multiple 256GB+ holes.
This makes a comparison of a given page number with the
maximum page number invalid.  So it perhaps would make sense
to use the sysinfo.totalram value instead of the maximum
page frame number.  But then the problem becomes the page
number requested in send_netdump_mem(), which would obviously
become greater than the total RAM value in systems with memory
holes.  At first I just tried keeping a "pages sent" counter in
send_netdump_mem(), but in testing it was shown to end up significantly
larger than the total RAM value -- because of retries of the same page
by the netdump-server. Since the netdump client doesn't know it's being
asked for the same page multiple times, it would be extremely difficult
to track exactly how many pages out of the total RAM count were
sent at least one time.

So I'm not pursuing that functionality any more; I just want
to get print_status() back to working as originally intended.

I've tested this patch on ia64, x86 and x86_64 machines, and
verified that the print_status() messages are displayed
once per second right from the start.  It is important, though,
that IBM verify it on a ppc64 machine, especially the one
that exhibited the problem.  I'm presuming that the ppc64 machine
must have had an extremely large value returned by
platform_timestamp() the first time, and that incrementing
prev_tick by jiffy_cycles on each zap_completion_queue() call --
instead of initializing it the first time -- could have caused
that machine to never "catch up".

Since we're not getting much of a response from IT #69160, I will
send this same message to Mike Mason and perhaps a few others
in a direct email.

Dave Anderson

Comment 5 Dave Anderson 2005-05-12 16:28:07 EDT
Here is email response from Mike Mason of IBM:

Subject: Re: IT #69160 -- LTC14642-NetDump is too slow to dump...
   Date: Thu, 12 May 2005 13:12:20 -0700
   From: Mike Mason <mmlnx@us.ibm.com>
     To: Dave Anderson <anderson@redhat.com>

Hi Dave,

Sorry for the lack of response on our end.  This is the first I've seen your
comment.  Ever since they started mirroring IBM's bugs to RH issue tracker
instead of RH bugzilla, I'm not seeing the comments coming from Red Hat.  The
IBM bugzilla to RH bugzilla communication path is broken.  I'll push (again)
on my end to try to get that fixed.

I agree: your solution is simpler, makes more sense than using the diskdump
print_status, and achieves the same result.  I'll test it on ppc64 and try to
test on the machine that exhibited the problem, if I can.  I don't know if I
can still get access to that machine.

Comment 8 Ernie Petrides 2005-06-08 23:26:08 EDT
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.7.EL).
Comment 13 Red Hat Bugzilla 2005-09-28 11:07:49 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

