Bug 157439 - LTC14642-NetDump is too slow to dump...[PATCH]
Summary: LTC14642-NetDump is too slow to dump...[PATCH]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dave Anderson
QA Contact: David Lawrence
URL:
Whiteboard:
Depends On:
Blocks: 156320
Reported: 2005-05-11 17:46 UTC by Issue Tracker
Modified: 2007-11-30 22:07 UTC (History)

Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-28 15:07:49 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHSA-2005:663
Private: 0
Priority: qe-ready
Status: SHIPPED_LIVE
Summary: Important: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 6
Last Updated: 2005-09-28 04:00:00 UTC

Description Issue Tracker 2005-05-11 17:46:40 UTC
Escalated to Bugzilla from IssueTracker

Comment 3 Dave Anderson 2005-05-11 17:49:44 UTC
Waiting on IBM to test patch on suspect ppc64 -- it works on x86, ia64 and x86_64

Comment 4 Dave Anderson 2005-05-12 15:29:32 UTC
A RHEL3 kernel src.rpm file, kernel-2.4.21-32.3.ppc64netdump.EL.src.rpm,
can be found here:

  http://people.redhat.com/anderson/.for_IBM

It contains this patch:

--- linux-2.4.21/drivers/net/netconsole.c.orig
+++ linux-2.4.21/drivers/net/netconsole.c
@@ -186,11 +186,11 @@ static void zap_completion_queue(void)
                }
        }
        /* maintain jiffies in a polling fashion, based on rdtsc. */
-       {
+       if (netdump_mode) {
                static unsigned long long prev_tick;
 
                if (t1 - prev_tick >= jiffy_cycles) {
-                       prev_tick += jiffy_cycles;
+                       prev_tick = t1;
                        jiffies++;
                }
        }

This patch addresses two problems:  

(1) The "if (netdump_mode)" addition prevents netconsole from
    incrementing jiffies during normal system runtime while
    netconsole is active.  Without it, the system time could jump
    ahead if, say, a long alt-sysrq-t operation took place.

(2) More importantly, the "prev_tick = t1" assignment addresses
    the problem reported in this case.  It prevents jiffies from
    incrementing out of control until prev_tick finally catches
    up to the current timestamp value.  prev_tick should be
    initialized to, and kept in sync with, the current value
    returned by platform_timestamp(), i.e., updated each time
    jiffy_cycles have elapsed.

Depending upon the value returned by platform_timestamp(), which
in the ppc64 case is read with an mftb instruction, there could be
a flurry of print_status() calls, one for each command received from
the netdump-server, up until the point that the value of prev_tick
"caught up with" the current value returned by platform_timestamp().
The fix properly initializes prev_tick the first time that
zap_completion_queue() gets called, and keeps it in sync for the
remainder of the session.  Therefore, the print_status() output
will be displayed once per second as originally intended, because
jiffies will no longer increment out of control.
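
The catch-up behavior described above can be illustrated with a small
user-space sketch.  This is not the kernel code: struct tick_state,
poll_old(), poll_new() and simulate() are hypothetical names made up for
illustration, the netdump_mode check is left out, and the timestamp is
simply modeled as a counter that starts at a large value and advances by
a fixed step on each poll.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the static state kept by the polling code. */
struct tick_state {
    uint64_t prev_tick;  /* timestamp at which the last jiffy was counted */
    uint64_t jiffies;    /* simulated jiffies counter */
};

/* Pre-patch logic: prev_tick advances by one jiffy's worth of cycles. */
static void poll_old(struct tick_state *s, uint64_t t1, uint64_t jiffy_cycles)
{
    if (t1 - s->prev_tick >= jiffy_cycles) {
        s->prev_tick += jiffy_cycles;
        s->jiffies++;
    }
}

/* Patched logic: prev_tick is resynchronized to the current timestamp. */
static void poll_new(struct tick_state *s, uint64_t t1, uint64_t jiffy_cycles)
{
    if (t1 - s->prev_tick >= jiffy_cycles) {
        s->prev_tick = t1;
        s->jiffies++;
    }
}

/*
 * Call poll() `polls` times.  The simulated timestamp starts at t_start
 * and grows by `step` cycles after each call.  prev_tick starts at zero,
 * like an uninitialized static variable.  Returns the final jiffies count.
 */
static uint64_t simulate(void (*poll)(struct tick_state *, uint64_t, uint64_t),
                         uint64_t t_start, uint64_t step,
                         uint64_t jiffy_cycles, unsigned int polls)
{
    struct tick_state s = { 0, 0 };
    uint64_t t1 = t_start;
    unsigned int i;

    assert(jiffy_cycles > 0);
    for (i = 0; i < polls; i++) {
        poll(&s, t1, jiffy_cycles);
        t1 += step;
    }
    return s.jiffies;
}
```

With t_start = 1000000000, step = 100, jiffy_cycles = 1000 and 1000 polls,
only 100 jiffies' worth of cycles actually elapse.  In this model
simulate(poll_old, ...) returns 1000, one increment per poll, because
prev_tick never gets close to the timestamp, while simulate(poll_new, ...)
returns 100.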

As far as attempting to emulate diskdump's print_status(), that
task is very simple for diskdump, because the total number of disk
blocks and the current block number are known in advance, and the
two numbers are easy to relate.

However, in the case of netdump, it is not nearly that simple.
This is especially true in the case of systems with memory holes,
such as ia64 machines, which can have multiple 256GB+ holes.
This makes a comparison of a given page number with the
maximum page number invalid.  So it perhaps would make sense
to use the sysinfo.totalram value instead of the maximum
page frame number.  But then the problem becomes the page
number requested in send_netdump_mem(), which would obviously
become greater than the total RAM value in systems with memory
holes.  At first I just tried keeping a "pages sent" counter in
send_netdump_mem(), but in testing it was shown to end up significantly
larger than the total RAM value -- because of retries of the same page
by the netdump-server. Since the netdump client doesn't know it's being
asked for the same page multiple times, it would be extremely difficult
to track exactly how many pages out of the total RAM count were
sent at least one time.
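
A toy model of the two progress estimates discussed above, under loudly
stated assumptions: struct dump_progress, on_page_request(), replay() and
percent() are hypothetical helpers, not netdump code, and the request
sequence is invented.  It only shows why neither the requested page
number nor a naive "pages sent" counter relates cleanly to total RAM.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical progress bookkeeping for a netdump client. */
struct dump_progress {
    unsigned long last_pfn;    /* most recently requested page frame number */
    unsigned long pages_sent;  /* every request served, retries included */
};

/* Record one page request from the server. */
static void on_page_request(struct dump_progress *p, unsigned long pfn)
{
    p->last_pfn = pfn;
    p->pages_sent++;
}

/* Replay a sequence of requested page frame numbers. */
static struct dump_progress replay(const unsigned long *pfns, size_t n)
{
    struct dump_progress p = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++)
        on_page_request(&p, pfns[i]);
    return p;
}

/* Integer percentage of `part` relative to `total`. */
static unsigned long percent(unsigned long part, unsigned long total)
{
    assert(total > 0);
    return part * 100UL / total;
}
```

Take a machine with 4 pages of RAM at page frame numbers 0, 1, 1000 and
1001 (a hole in between) and a request sequence of 0, 1, 1, 1000, 1000,
1001, where two requests are retries.  The counter-based estimate is
percent(6, 4) = 150 and the page-number-based estimate is
percent(1001, 4) = 25025, so both overshoot 100%.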

So, I'm not pursuing that functionality any more; I just want
to get print_status() back to working as originally intended.

I've tested this patch on ia64, x86 and x86_64 machines, and
verified that the print_status() messages are displayed
once per second right from the start.  It is important, though,
that IBM verify it on a ppc64 machine, especially the one
that exhibited the problem.  I'm presuming that on that ppc64
machine platform_timestamp() returned an extremely large value
the first time, and that incrementing prev_tick by jiffy_cycles
on each zap_completion_queue() call, instead of initializing it
the first time, could have caused that machine to never "catch up".

Since we're not getting much of a response from IT #69160, I will
send this same message to Mike Mason and perhaps a few others
in a direct email.

Dave Anderson



Comment 5 Dave Anderson 2005-05-12 20:28:07 UTC
Here is the email response from Mike Mason of IBM:


Subject: Re: IT #69160 -- LTC14642-NetDump is too slow to dump...
   Date: Thu, 12 May 2005 13:12:20 -0700
   From: Mike Mason <mmlnx.com>
     To: Dave Anderson <anderson>

Hi Dave,

Sorry for the lack of response on our end.  This is the first I've seen your
comment.  Ever since they started mirroring IBM's bugs to RH issue tracker
instead of RH bugzilla, I'm not seeing the comments coming from Red Hat.  The
IBM bugzilla to RH bugzilla communication path is broken.  I'll push (again)
on my end to try to get that fixed.

I agree, your solution is simpler, makes more sense than using the diskdump
print_status and achieves the same result.  I'll test it on ppc64 and try to
test on the machine that exhibited the problem, if I can.  I don't know if I
can still get access to that machine.

Thanks,
Mike

Comment 8 Ernie Petrides 2005-06-09 03:26:08 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.7.EL).


Comment 13 Red Hat Bugzilla 2005-09-28 15:07:49 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html


