455694 – Linux Kernel hang on __delay() function

Summary: Linux Kernel hang on __delay() function

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.6
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Prarit Bhargava
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2008-07-17 07:17 UTC by Cheng Ho Lin
Modified:	2009-12-09 14:31 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-12-09 14:31:44 UTC
Target Upstream Version:
Embargoed:

Attachments
Intel In-Target Probe snapshot (118.37 KB, application/octet-stream) 2008-07-17 07:17 UTC, Cheng Ho Lin	no flags	Details

Description Cheng Ho Lin 2008-07-17 07:17:54 UTC

Description of problem:
-----------------------

On our system running Linux Advanced Server 4.6, the warm boot test hangs from 
time to time. By probing the system with "Intel In-Target Probe" and checking 
against "System.map-2.6.9-67.ELsmp", we found that the CPU was stuck in an 
endless loop in __delay(), linux-2.6.9-final/arch/x86_64/lib/delay.c . The 
kernel source code is listed as follows:

void __delay(unsigned long loops)
{
	unsigned long bclock, now;
	
	rdtscl(bclock);
	do
	{
		rep_nop(); 
		rdtscl(now);
	}
	while((now-bclock) < loops);
} 

And the corresponding assembly code is listed below:

	rdtsc
	mov rcx, rax
loop:
	pause
	rdtsc
	sub rax, rcx
	cmp rax, rdi
	jb	loop
	ret

This piece of code may have a problem when the TSC value wraps around. For 
example, if rcx (bclock) is 0xfffffffffffffffe at the start, and the following 
values of rax (now) are 3, 15, 27, and so on, the system may hang in __delay() .
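
As a check on the example values above, here is a minimal user-space sketch 
(not kernel code) of the subtraction and comparison the loop performs. The 
helper names elapsed_ticks and delay_done are hypothetical, and the sketch 
assumes full 64-bit counter values, as in the assembly listing.

```c
#include <stdint.h>

/*
 * Ticks elapsed between two readings of a free-running 64-bit counter.
 * Unsigned subtraction in C is performed modulo 2^64, so the result is
 * still the true distance when `now` has wrapped past zero since `start`.
 */
static uint64_t elapsed_ticks(uint64_t start, uint64_t now)
{
    return now - start;
}

/* Same exit test as the do/while loop above: done once enough ticks passed. */
static int delay_done(uint64_t start, uint64_t now, uint64_t loops)
{
    return elapsed_ticks(start, now) >= loops;
}
```

With start = 0xfffffffffffffffe, readings of 3, 15 and 27 give elapsed values 
of 5, 17 and 29, so under these assumptions the comparison still sees time 
advancing across the wrap.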

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Linux kernel version : 2.6.9-67


How reproducible:
-----------------
Repeat warm boots via a cron job.

Steps to Reproduce:
1. Add "*/5 * * * * date > reboot.log; /sbin/reboot" to the crontab

Comment 1 Cheng Ho Lin 2008-07-17 07:17:54 UTC

Created attachment 312012 [details]
Intel In-Target Probe snapshot

Comment 2 Prarit Bhargava 2008-07-18 12:12:59 UTC

Cheng, please attach a sysreport from the system.

Thanks,

P.

Comment 4 Prarit Bhargava 2008-07-22 14:48:23 UTC

I came up with a proposed patch, started testing it, and came across a similar
issue which appears to have been resolved upstream.  __delay can be restarted on
another processor.  When this happens, bclock and now come from different CPUs'
TSCs, so their difference is bogus and the __delay loop misbehaves.

I'll submit a patch for both issues.

P.

Comment 5 Prarit Bhargava 2008-07-22 15:01:51 UTC

The more I look at this issue, the more I agree that this is a bug, but I
wonder whether it is really the issue the reporter is hitting.

The tsc is a 64-bit counter linked to the frequency of the CPU.  For simplicity,
let's assume that the CPU frequency is 2.0 GHz.

That means the tsc will wrap every 4G X 2 seconds (2^64 ticks divided by roughly
2^31 ticks per second).

AFAICT, that is roughly 2.4 million hours, or ~ 100,000 days, or ~ 270 years.  (If
I have my math right)
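
For anyone who wants to redo the arithmetic, a minimal sketch follows. The
helper names counter_wrap_seconds and counter_wrap_years are made up for this
example, and the 2.0 GHz rate is the simplifying assumption used above, not a
measured value.

```c
/* Seconds in a Julian year (365.25 days), used only to express the result. */
#define SECONDS_PER_YEAR 31557600.0

/*
 * Seconds until a free-running counter that is `bits` wide wraps when it
 * increments `hz` times per second.
 */
static double counter_wrap_seconds(unsigned bits, double hz)
{
    double span = 1.0;   /* becomes 2^bits */
    unsigned i;

    for (i = 0; i < bits; i++)
        span *= 2.0;
    return span / hz;
}

/* The same interval expressed in years. */
static double counter_wrap_years(unsigned bits, double hz)
{
    return counter_wrap_seconds(bits, hz) / SECONDS_PER_YEAR;
}
```

counter_wrap_years(64, 2.0e9) comes to a little over 290 years, the same order
of magnitude as the estimate above.  For comparison, a value truncated to 32
bits would wrap in about 2.1 seconds at the same rate.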

I suppose that quantatw could have run a system this long ;).

IMO, it is much more likely that quantatw ran into the strange issue I ran
into -- the __delay was suspended and restarted on another CPU.


P.

Comment 6 Prarit Bhargava 2008-07-22 15:08:41 UTC

Marking as NOTABUG.

P.

Comment 7 Cheng Ho Lin 2008-07-23 00:41:34 UTC

1. After modifying __delay() with reference to 
linux-2.6.26/arch/x86/lib/delay_64.c, the system passed warm boot testing for 
more than 5 days. Before the change, it would hang after every 2~3 days of 
warm boot testing.

The code is listed below for convenience:

void __delay(unsigned long loops)
{
	unsigned bclock, now;
	int cpu;

	preempt_disable();
	cpu = smp_processor_id();
	rdtscl(bclock);
	for (;;) {
		rdtscl(now);
		if ((now - bclock) >= loops)
			break;

		/* Allow RT tasks to run */
		preempt_enable();
		rep_nop();
		preempt_disable();

		/*
		 * It is possible that we moved to another CPU, and
		 * since TSC's are per-cpu we need to calculate
		 * that. The delay must guarantee that we wait "at
		 * least" the amount of time. Being moved to another
		 * CPU could make the wait longer but we just need to
		 * make sure we waited long enough. Rebalance the
		 * counter for this CPU.
		 */
		if (unlikely(cpu != smp_processor_id())) {
			loops -= (now - bclock);
			cpu = smp_processor_id();
			rdtscl(bclock);
		}
	}
	preempt_enable();
}

2. Since all the server machines in this series are still under development 
and are scheduled for other tests, I am sorry that I could not gather a 
sysreport.
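
To illustrate the rebalancing step in the loop from item 1, here is a small 
user-space simulation. It is only a sketch: struct sim_cpu, sim_tick and 
sim_delay are hypothetical names, the two counters stand in for per-CPU TSCs, 
and the "migration" is forced after a fixed number of iterations rather than 
decided by a scheduler.

```c
#include <stdint.h>

/* Hypothetical stand-in for per-CPU TSCs: each simulated CPU has its own counter. */
struct sim_cpu {
    uint32_t tsc;
};

/* Time passes on every CPU at once, so both counters advance together. */
static void sim_tick(struct sim_cpu cpus[2], uint32_t ticks)
{
    cpus[0].tsc += ticks;
    cpus[1].tsc += ticks;
}

/*
 * Simulated rebalancing delay loop. The task starts on CPU 0 and is moved
 * to CPU 1 after `migrate_after` iterations. Returns the real elapsed ticks.
 */
static uint64_t sim_delay(struct sim_cpu cpus[2], uint32_t loops,
                          unsigned migrate_after, uint32_t ticks_per_iter)
{
    int cpu = 0;          /* CPU recorded when the current measurement began */
    int running_on = 0;   /* CPU the task is actually executing on */
    uint64_t elapsed = 0;
    unsigned iter = 0;
    uint32_t bclock = cpus[running_on].tsc;
    uint32_t now;

    for (;;) {
        now = cpus[running_on].tsc;
        if ((uint32_t)(now - bclock) >= loops)
            break;

        /* Window in which time passes and the task may be moved. */
        sim_tick(cpus, ticks_per_iter);
        elapsed += ticks_per_iter;
        if (++iter == migrate_after)
            running_on = 1;

        if (cpu != running_on) {
            /* Credit the ticks already waited, then restart on the new counter. */
            loops -= (uint32_t)(now - bclock);
            cpu = running_on;
            bclock = cpus[running_on].tsc;
        }
    }
    return elapsed;
}
```

Starting with counters at 1000 and 5, a requested delay of 100 ticks, a move 
after 3 iterations and 10 ticks per iteration, the simulated wait is 110 ticks: 
slightly longer than requested, but never shorter, even though the second 
counter is far behind the first.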

Comment 8 Prarit Bhargava 2008-07-23 10:11:08 UTC

Fred, are you saying that you are hitting the issue described in comment #4? 
That switching between CPUs is causing your problem?

I'm confused -- because your initial bug report implies that you thought you had
a tsc overflow issue.

P.

Comment 9 Cheng Ho Lin 2008-07-23 10:55:56 UTC

In the beginning, we guessed the problem was due to TSC value wrap-around. But 
after reproducing the bug and investigating, we switched to the direction 
described in http://www.chineselinuxuniversity.net/articles/12762.shtml . 
Therefore, we modified __delay() and verified it.

PS. Probing with ITP shows that the BSP is in __delay() and the other three APs 
are all in smp_really_stop_cpu(). In principle, the other processors will not 
restart __delay().

void smp_stop_cpu(void)
{
	/*
	 * Remove this CPU:
	 */
	cpu_clear(smp_processor_id(), cpu_online_map);
	local_irq_disable();
	disable_local_APIC();
	local_irq_enable(); 
}

static void smp_really_stop_cpu(void *dummy)
{
	smp_stop_cpu(); 
	for (;;) 
		asm("hlt"); 
}

Comment 10 Prarit Bhargava 2008-07-23 11:14:23 UTC

Fred,

AFAICT, in order for this to happen, CONFIG_PREEMPT must be on in the .config --
it isn't in RHEL5.  So I suspect that there is something else going on.

Could you attach your test program to this BZ?  I'll run the test to see if I
can hit the issue.

P.

Comment 11 Cheng Ho Lin 2008-07-24 00:42:29 UTC

Hi Prarit,

The OS version in question is Red Hat AS 4 update 6 rather than RHEL5.
I checked the system files, and CONFIG_PREEMPT is off in .config.

Our test procedure is via crontab:

*/5 * * * * echo "reboot test"; date > reboot.log; /sbin/reboot

BTW, in another of our projects (different hardware architecture), SLES 10 also 
hung in __delay() after about 9 days of warm-boot tests.

Comment 12 Brian Maly 2008-07-24 03:57:24 UTC

This seems like a BIOS issue. The handoff back to the firmware (when leaving the
OS during a reboot) seems incomplete or broken, and as a result the hardware may
not be re-initialized properly for the next boot. Can we try some different
reboot flags to see if one of them triggers a proper hardware reset during reboot?


Can you try the following boot args and see if the issue goes away? I'm guessing
you want to use the 'cold' flag since the warm reboot hangs.

Try booting with each (one at a time), then try a reboot and see if it hangs:
reboot=hard,cold
reboot=triple,cold
reboot=bios,cold
reboot=kbd,cold

For point of reference, here are all the possible flags for RHEL4 (for
experimentation purposes):

/* reboot=b[ios] | t[riple] | k[bd] [, [w]arm | [c]old] | [a]cpi
   bios	  Use the CPU reboot vector for warm reset
   warm   Don't set the cold reboot flag
   cold   Set the cold reboot flag
   triple Force a triple fault (init)
   kbd    Use the keyboard controller. cold reset (default)
   acpi   Use the ACPI reset mechanism defined in the FADT
 */
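
The flag grammar in the comment above can be sketched as a small user-space
parser. This is an illustration only, not the RHEL4 kernel code:
parse_reboot_arg and struct reboot_opts are hypothetical names, and the sketch
assumes each comma-separated token is recognized by its first letter, as the
bracket notation suggests.

```c
#include <string.h>

/* Hypothetical result type for this sketch; not a kernel structure. */
struct reboot_opts {
    char type;     /* 'b', 't', 'k' or 'a'; 0 if no method was given */
    int  cold;     /* 1 = cold, 0 = warm, -1 = not specified */
    int  unknown;  /* number of tokens that matched no listed flag */
};

/* Parse the text that follows "reboot=" on the kernel command line. */
static struct reboot_opts parse_reboot_arg(const char *str)
{
    struct reboot_opts o = { 0, -1, 0 };

    for (;;) {
        switch (*str) {
        case 'b': case 't': case 'k': case 'a':
            o.type = *str;      /* reset method */
            break;
        case 'w':
            o.cold = 0;         /* warm: do not set the cold reboot flag */
            break;
        case 'c':
            o.cold = 1;         /* cold: set the cold reboot flag */
            break;
        default:
            o.unknown++;        /* not one of the letters in the comment */
            break;
        }
        str = strchr(str, ',');
        if (str == NULL)
            break;
        str++;                  /* skip the comma, parse the next token */
    }
    return o;
}
```

Under that assumption, "bios,cold" yields type 'b' with the cold flag set,
while a token such as "hard" matches none of the letters listed in the comment
and is counted as unknown.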
