Bug 74596

Summary:	False idle_timeouts occur frequently
Product:	Red Hat Enterprise Linux 2.1	Reporter:	Lars Ekman <lars.g.ekman>
Component:	kernel	Assignee:	Larry Woodman <lwoodman>
Status:	CLOSED WONTFIX	QA Contact:
Severity:	medium	Docs Contact:
Priority:	medium
Version:	2.1
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-06-20 15:58:41 UTC	Type:	---

Description Lars Ekman 2002-09-27 09:41:37 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.9-31 i686; Nav)

Description of problem:
The netdump-client (the netconsole.o module) often gets a false idle_timeout and
stops while dumping the processor memory.

Version-Release number of selected component (if applicable):


How reproducible:
Sometimes

Steps to Reproduce:
1. Set-up the netdump server and client
2. Make sure that the server requests a memory dump
3. Force a kernel crash
	

Actual Results:  The netdump-client gets an idle_timeout during the transfer of
the processor memory (512MB).

Expected Results:  The transfer should complete without idle_timeout

Additional info:

The processor memory is 512MB.
The false idle_timeout occurs almost always.
The kernel version is 2.4.17

A printk inserted at the idle_timeout check in the code gives:
printk("t: %llu %llu (%llu) > (%llu)\n", t1, t0, ((t1 - t0) >> 20),
   (unsigned long long)(mhz * idle_timeout));
Output: t: 649639796377 649639801243 (17592186044415) > (5250)

As the printout shows, t1 < t0, which gives a negative difference. Since the
values are treated as unsigned 64-bit numbers, t1 - t0 wraps around, and the
huge number 17592186044415 (0xfffffffffff) is that wrapped difference shifted
right by 20 bits. I cannot see how t1 can be less than t0: it implies that time
goes backward, which seems impossible. As a temporary work-around I added a
check:
  if (t0 && t1 > t0) { ...
to the code. It fixes the problem, and the idle_timeout still seems to work (I
killed the server during the transfer about 10 times). The work-around is not
safe, since I don't know the cause of the problem; I only handle the symptom.

Comment 1 Arjan van de Ven 2002-09-27 09:45:08 UTC

humm we haven't shipped a 2.4.17 kernel with netdump/netconsole.....

2.4.17 isn't exactly a good kernel either ;(

Comment 2 Lars Ekman 2002-09-27 10:33:34 UTC

No, we modified the patch to be able to use 2.4.17. I forgot that this was not
the original kernel, but I suspected that the kernel version could have
something to do with this, so I included it.

If you think this is a kernel fault, please set this bug report to NOTABUG, and
I apologize for the unnecessary bother.

I think rdtscll() translates more or less directly to an assembly instruction
(rdtsc), so I still can't understand how it can run backwards regardless of
kernel version.

Best Regards,
Lars Ekman

Comment 3 Arjan van de Ven 2002-09-27 10:36:02 UTC

well it depends which version of netdump you took ;)
also rdtsc can't go backwards, but can be out of sync between cpus

Comment 4 Lars Ekman 2002-09-27 12:37:05 UTC

The version is netdump-0.6.7.

We found the bug! I had a discussion with our HW guru, and he suggested that
the global t0 variable was updated between the read of t1 and the compare. The
simple solution is to store t0 in a local variable before reading t1. From
netconsole.c (updated):
	t0temp = t0;
	rdtscll(t1);
	if (idle_timeout) {
		if (t0temp) {
			if (((t1 - t0temp) >> 20) > (unsigned long long)(mhz * idle_timeout)) {
...

With this fix it seems to work fine. I have tried 5 times successfully (before
the fix it never worked).

Comment 5 Jiri Pallich 2012-06-20 15:58:41 UTC

Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested a review is now End of Life.
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.