Bug 74596 - False idle_timeouts occurs frequently
Summary: False idle_timeouts occurs frequently
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Version: 2.1
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Larry Woodman
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2002-09-27 09:41 UTC by Lars Ekman
Modified: 2012-06-20 15:58 UTC (History)
0 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-20 15:58:41 UTC
Target Upstream Version:
Embargoed:



Description Lars Ekman 2002-09-27 09:41:37 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.9-31 i686; Nav)

Description of problem:
The netdump-client (the netconsole.o module) often gets a false idle_timeout and
stops while dumping the processor memory.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Set-up the netdump server and client
2. Make sure that the server requests a memory dump
3. Force a kernel crash
	

Actual Results:  The netdump-client gets an idle_timeout during the transfer
of the processor memory (512MB).

Expected Results:  The transfer should complete without idle_timeout

Additional info:

The processor memory is 512MB.
The false idle_timeout occurs almost always.
The kernel version is 2.4.17

A printk inserted at the idle_timeout check in the code gives:
printk("t: %llu %llu (%llu) > (%llu)\n", t1,  t0, ((t1 - t0) >> 20), \
   (unsigned long long)(mhz * idle_timeout));
Output: t: 649639796377 649639801243 (17592186044415) > (5250)

As you can see in the printout, t1 < t0, which gives a negative difference. The
huge number 17592186044415 is 0xfffffffffff, which is the negative difference
(wrapped around as an unsigned 64-bit value) shifted 20 bits right. I cannot
imagine how t1 < t0; it implies that time goes backward, which seems impossible.
As a temporary work-around I added a check:
  if (t0 && t1 > t0) { ...
in the code. It fixes the problem, and the idle_timeout still seems to work (I
killed the server during the transfer about 10 times). The work-around is not
reliable, since I don't know the cause of the problem; I just handle the symptom.
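
The arithmetic behind the printout can be checked in user space. The following
is a minimal sketch, not the netconsole.c code: the function names scaled_delta
and idle_expired_guarded are hypothetical, and the way the guard is combined
with the timeout comparison is an assumption, since the body of the work-around
is elided in the report.

```c
#include <stdint.h>

/* Elapsed TSC ticks scaled down by 2^20, computed the same way as in the
 * printk above: unsigned 64-bit subtraction followed by a right shift.
 * If t1 < t0 the subtraction wraps around instead of going negative. */
static uint64_t scaled_delta(uint64_t t1, uint64_t t0)
{
    return (t1 - t0) >> 20;
}

/* Assumed shape of the temporary work-around: the timeout can only fire
 * when t0 is set and t1 is strictly later than t0. */
static int idle_expired_guarded(uint64_t t1, uint64_t t0, uint64_t limit)
{
    return t0 && t1 > t0 && scaled_delta(t1, t0) > limit;
}
```

With the values from the printout, t1 - t0 is -4866, which wraps to 2^64 - 4866;
shifting that right by 20 bits leaves 2^44 - 1 = 17592186044415, far above the
limit of 5250. The guarded version rejects the same inputs because t1 < t0.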

Comment 1 Arjan van de Ven 2002-09-27 09:45:08 UTC
humm we haven't shipped a 2.4.17 kernel with netdump/netconsole.....

2.4.17 isn't exactly a good kernel either ;(

Comment 2 Lars Ekman 2002-09-27 10:33:34 UTC
No, we modified the patch to be able to use 2.4.17. I forgot that this was not
the original kernel, but I suspected that the kernel version could have
something to do with this so I included it.

If you think this is a kernel fault, please set this bug report to NOTABUG and I
apologize for the unnecessary bother.

I think the rdtscll() translates more or less directly to an assembly
instruction (rdtsc), so I still can't understand how it can run backwards
regardless of kernel version.

Best Regards,
Lars Ekman


Comment 3 Arjan van de Ven 2002-09-27 10:36:02 UTC
well it depends which version of netdump you took ;)
also rdtsc can't go backwards, but can be out of sync between cpus


Comment 4 Lars Ekman 2002-09-27 12:37:05 UTC
The version is netdump-0.6.7.

We found the bug! I had a discussion with our HW-guru and he suggested that the
global t0 variable was being updated between the read of t1 and the compare. The
simple solution is to store t0 in a local variable before reading t1. From
netconsole.c (updated):
	t0temp = t0;
	rdtscll(t1);
	if (idle_timeout) {
		if (t0temp) {
			if (((t1 - t0temp) >> 20) > (unsigned long long)(mhz * idle_timeout)) {
...

With this fix it seems to work fine. I have tried 5 times successfully (before
it never worked).
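
The ordering problem can be reproduced deterministically in user space. This is
a sketch under stated assumptions, not the driver code: fake_tsc, read_tsc,
packet_activity, timed_out_racy and timed_out_fixed are hypothetical names, the
counter stands in for rdtscll(), and the callback stands in for whatever code
path refreshes the global t0 while the check is running.

```c
#include <stdint.h>

static volatile uint64_t t0;  /* stands in for the global timestamp */
static uint64_t fake_tsc;     /* hypothetical stand-in for the TSC */

static uint64_t read_tsc(void) { return fake_tsc; }

/* Hypothetical: time advances a little, then t0 is refreshed. */
static void packet_activity(void)
{
    fake_tsc += 4866;
    t0 = fake_tsc;
}

/* Racy order: t1 is read first, the global t0 is read at compare time.
 * If t0 is refreshed in between, t1 < t0 and the difference wraps. */
static int timed_out_racy(uint64_t limit, void (*between)(void))
{
    uint64_t t1 = read_tsc();
    if (between)
        between();            /* t0 changes after t1 was sampled */
    return t0 && ((t1 - t0) >> 20) > limit;
}

/* Fixed order, as in the comment above: snapshot t0, then read t1. */
static int timed_out_fixed(uint64_t limit, void (*between)(void))
{
    uint64_t t0temp = t0;     /* snapshot taken before t1 */
    uint64_t t1 = read_tsc();
    if (between)
        between();            /* a later update no longer matters */
    return t0temp && ((t1 - t0temp) >> 20) > limit;
}
```

Because the snapshot is taken before t1 is sampled, t1 cannot be earlier than
t0temp on a single monotonic counter, so the unsigned subtraction cannot wrap.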


Comment 5 Jiri Pallich 2012-06-20 15:58:41 UTC
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.

