Bug 455694 - Linux Kernel hang on __delay() function
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: x86_64 Linux
Priority: low
Severity: medium
Target Milestone: rc
Assigned To: Prarit Bhargava
QA Contact: Martin Jenner
Keywords: Reopened
Reported: 2008-07-17 03:17 EDT by Cheng Ho Lin
Modified: 2009-12-09 09:31 EST (History)
Doc Type: Bug Fix
Last Closed: 2009-12-09 09:31:44 EST

Attachments
Intel In-Target Probe snapshot (118.37 KB, application/octet-stream)
2008-07-17 03:17 EDT, Cheng Ho Lin

Description Cheng Ho Lin 2008-07-17 03:17:54 EDT
Description of problem:
-----------------------

On our system running Linux Advanced Server 4.6, the warm boot test hangs from time 
to time. By probing the system with "Intel In-Target Probe" and checking 
against "System.map-2.6.9-67.ELsmp", we found that the CPU was stuck in an endless 
loop in __delay(), linux-2.6.9-final/arch/x86_64/lib/delay.c. The kernel 
source code is listed as follows:

void __delay(unsigned long loops)
{
	unsigned long bclock, now;
	
	rdtscl(bclock);
	do
	{
		rep_nop(); 
		rdtscl(now);
	}
	while((now-bclock) < loops);
} 

And the corresponding assembly code is listed below:

	rdtsc
	mov rcx, rax
loop:
	pause
	rdtsc
	sub rax, rcx
	cmp rax, rdi
	jb	loop
	ret

This piece of code may misbehave when the TSC value wraps around. For example,
if rcx (bclock) is 0xfffffffffffffffe at the start and the subsequent rax 
(now) values are 3, 15, 27, and so on, the system may hang in __delay().
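
The arithmetic in this scenario can be checked in user space. The sketch below is not kernel code: it assumes both samples are full 64-bit unsigned values, and `tsc_elapsed` and `first_done_sample` are hypothetical helper names. Unsigned subtraction in C is modulo 2^64, so for bclock = 0xfffffffffffffffe the samples 3, 15, 27 yield elapsed counts of 5, 17, 29.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Elapsed ticks between two samples of a 64-bit counter.
 * Unsigned subtraction wraps modulo 2^64 (assumption: full 64-bit samples). */
static uint64_t tsc_elapsed(uint64_t bclock, uint64_t now)
{
	return now - bclock;
}

/* Index of the first sample at which the __delay()-style exit test
 * (now - bclock) >= loops holds, or -1 if none of the n samples does. */
static int first_done_sample(uint64_t bclock, const uint64_t *samples,
			     size_t n, uint64_t loops)
{
	assert(samples != NULL);
	for (size_t i = 0; i < n; i++) {
		if (tsc_elapsed(bclock, samples[i]) >= loops)
			return (int)i;
	}
	return -1;
}
```

With loops = 20, calling first_done_sample() on the samples 3, 15, 27 returns index 2, since 29 is the first elapsed count that reaches 20. Whether the kernel loop really sees full 64-bit values depends on how rdtscl() is defined, which is not shown here.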

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
Linux kernel version : 2.6.9-67


How reproducible:
-----------------
Repeatedly warm boot the system via a cron job.

Steps to Reproduce:
1. add "*/5 * * * * date > reboot.log; /sbin/reboot" into crontab
Comment 1 Cheng Ho Lin 2008-07-17 03:17:54 EDT
Created attachment 312012
Intel In-Target Probe snapshot
Comment 2 Prarit Bhargava 2008-07-18 08:12:59 EDT
Cheng, please attach a sysreport from the system.

Thanks,

P.
Comment 4 Prarit Bhargava 2008-07-22 10:48:23 EDT
I came up with a proposed patch and started testing and came across a similar
issue which appears to have been resolved upstream.  __delay can be restarted on
another processor.  When this happens the values of bclock and now are bogus and
this causes wackiness within the __delay function.

I'll submit a patch for both issues.

P.
Comment 5 Prarit Bhargava 2008-07-22 11:01:51 EDT
The more I look at this issue, the more I agree that this is a bug, but I
wonder whether it is really the issue the reporter is hitting.

The tsc is a 64-bit counter linked to the frequency of the CPU.  For simplicity,
let's assume that the CPU frequency is 2.0 GHz.

That means the tsc will wrap every 4G X 2 seconds (64 bits divided by 31 bits).

AFAICT, that is roughly 2.4 million hours, or ~ 100,000 days, or about 270 years.  (If
I have my math right)
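
The estimate can be reproduced with a short calculation. The sketch below assumes a constant tick rate of exactly 2.0 GHz; `wrap_seconds` and `seconds_to_years` are hypothetical helper names. The comment rounds 2.0 GHz to 2^31 Hz, so an exact 2.0 GHz rate gives somewhat larger figures: about 9.2e9 seconds, or roughly 290 years, for a 64-bit counter.

```c
#include <assert.h>

/* Seconds until a counter that is `bits` wide (1..64) and ticks at
 * `hz` ticks per second wraps around. Assumes a constant tick rate. */
static double wrap_seconds(int bits, double hz)
{
	double span = 1.0;

	assert(bits > 0 && bits <= 64);
	assert(hz > 0.0);
	for (int i = 0; i < bits; i++)
		span *= 2.0;	/* powers of two are exact in a double */
	return span / hz;
}

/* Convert seconds to years, using 365-day years. */
static double seconds_to_years(double seconds)
{
	return seconds / (3600.0 * 24.0 * 365.0);
}
```

For example, seconds_to_years(wrap_seconds(64, 2.0e9)) falls between 292 and 293. The `bits` parameter also allows checking a narrower view of the counter: a 32-bit value would wrap in about 2.1 seconds at this rate.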

I suppose that quantatw could have run a system this long ;).

IMO, it is much more likely that quantatw ran into the strange issue I ran
into -- the __delay was suspended and restarted on another CPU.


P.
 
Comment 6 Prarit Bhargava 2008-07-22 11:08:41 EDT
Marking as NOTABUG.

P.
Comment 7 Cheng Ho Lin 2008-07-22 20:41:34 EDT
1. After modifying __delay() with reference to linux-2.6.26/arch/x86/lib/delay_64.c, 
the system passed warm boot testing for more than 5 days. Before the change, it would 
hang after every 2~3 days of warm boot testing.

The code is listed below for convenience:

void __delay(unsigned long loops)
{
	unsigned bclock, now;
	int cpu;

	preempt_disable();
	cpu = smp_processor_id();
	rdtscl(bclock);
	for (;;) {
		rdtscl(now);
		if ((now - bclock) >= loops)
			break;

		/* Allow RT tasks to run */
		preempt_enable();
		rep_nop();
		preempt_disable();

		/*
		 * It is possible that we moved to another CPU, and
		 * since TSC's are per-cpu we need to calculate
		 * that. The delay must guarantee that we wait "at
		 * least" the amount of time. Being moved to another
		 * CPU could make the wait longer but we just need to
		 * make sure we waited long enough. Rebalance the
		 * counter for this CPU.
		 */
		if (unlikely(cpu != smp_processor_id())) {
			loops -= (now - bclock);
			cpu = smp_processor_id();
			rdtscl(bclock);
		}
	}
	preempt_enable();
}
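
The rebalancing step in this loop can be exercised in user space. The following is only a sketch under stated assumptions: the per-CPU counters, the `struct sim` state and the `sim_*` functions are all hypothetical stand-ins, the preempt calls are left out, and the "migration" is forced at a chosen iteration rather than decided by a scheduler.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_NCPU 2

/* Simulated machine state (hypothetical; stands in for real hardware). */
struct sim {
	uint32_t tsc[SIM_NCPU];	/* per-CPU counters, deliberately unsynchronized */
	int cur_cpu;		/* CPU the delaying task currently runs on */
	uint32_t elapsed;	/* true ticks that have passed on every CPU */
};

/* Read the counter of whichever CPU the task is on right now. */
static uint32_t sim_rdtscl(const struct sim *s)
{
	return s->tsc[s->cur_cpu];
}

/* Advance every CPU's counter by n ticks; stands in for rep_nop(). */
static void sim_tick(struct sim *s, uint32_t n)
{
	for (int i = 0; i < SIM_NCPU; i++)
		s->tsc[i] += n;
	s->elapsed += n;
}

/* Same control flow as the loop above, minus the preempt calls.
 * The task is moved to new_cpu after migrate_iter iterations
 * (migrate_iter == 0 means it never moves). */
static void sim_delay(struct sim *s, uint32_t loops, int migrate_iter,
		      int new_cpu)
{
	uint32_t bclock, now;
	int cpu = s->cur_cpu;
	int iter = 0;

	assert(new_cpu >= 0 && new_cpu < SIM_NCPU);
	bclock = sim_rdtscl(s);
	for (;;) {
		now = sim_rdtscl(s);
		if ((uint32_t)(now - bclock) >= loops)
			break;

		sim_tick(s, 10);
		if (++iter == migrate_iter)
			s->cur_cpu = new_cpu;	/* pretend the scheduler moved us */

		if (cpu != s->cur_cpu) {
			loops -= (now - bclock);	/* credit ticks already waited */
			cpu = s->cur_cpu;
			bclock = sim_rdtscl(s);	/* restart on the new CPU's counter */
		}
	}
}
```

Starting on CPU 0 with counters {1000000, 50} and calling sim_delay(&s, 100, 3, 1) still waits at least the requested 100 ticks, even though the second counter is far behind the first. The wait can come out slightly longer than requested, because the tick consumed around the migration is not credited.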

2. All the server machines of this series under development are scheduled to 
perform other tests, so I am sorry that I could not gather a sysreport.
Comment 8 Prarit Bhargava 2008-07-23 06:11:08 EDT
Fred, are you saying that you are hitting the issue described in comment #4? 
That switching between CPUs is causing your problem?

I'm confused -- because your initial bug report implies that you thought you had
a tsc overflow issue.

P.
Comment 9 Cheng Ho Lin 2008-07-23 06:55:56 EDT
In the beginning, we guessed the problem was due to TSC value wrap-around. But after 
reproducing the bug and investigating, we switched to the direction described in 
http://www.chineselinuxuniversity.net/articles/12762.shtml . Therefore, we 
modified __delay() and verified it.

PS. Probing with ITP shows that the BSP is in __delay() and the other three APs are 
all in smp_really_stop_cpu(). In principle, the other processors will not 
restart __delay().

void smp_stop_cpu(void)
{
	/*
	 * Remove this CPU:
	 */
	cpu_clear(smp_processor_id(), cpu_online_map);
	local_irq_disable();
	disable_local_APIC();
	local_irq_enable(); 
}

static void smp_really_stop_cpu(void *dummy)
{
	smp_stop_cpu(); 
	for (;;) 
		asm("hlt"); 
}
Comment 10 Prarit Bhargava 2008-07-23 07:14:23 EDT
Fred,

AFAICT, in order for this to happen, CONFIG_PREEMPT must be on in the .config --
it isn't in RHEL5.  So I suspect that there is something else going on.

Could you attach your test program to this BZ?  I'll run the test to see if I
can hit the issue.

P.
Comment 11 Cheng Ho Lin 2008-07-23 20:42:29 EDT
Hi Prarit,

The OS version in question is Red Hat AS 4 update 6 rather than RHEL5.
I checked the system files, and CONFIG_PREEMPT in .config is off.

Our test procedure is via crontab:

*/5 * * * * echo "reboot test"; date > reboot.log; /sbin/reboot

BTW, in another project of ours (different hardware architecture), SLES 10 also 
hangs in __delay() after about 9 days of warm-boot tests.
Comment 12 Brian Maly 2008-07-23 23:57:24 EDT
This seems like a BIOS issue. The handoff back to the firmware (when leaving the
OS during a reboot) seems incomplete or broken, and as a result the hardware may
not be re-initialized properly for the next boot. Can we try some different
reboot flags to see if they trigger a proper hardware reset during reboot?


Can you try the following boot args and see if the issue goes away? I'm guessing
you want to use the 'cold' flag since the warm reboot hangs.

Try booting with each (one at a time), then try a reboot and see if it hangs:
reboot=hard,cold
reboot=triple,cold
reboot=bios,cold
reboot=kbd,cold

For point of reference, here are all the possible flags for RHEL4 (for
experimentation purposes):

/* reboot=b[ios] | t[riple] | k[bd] [, [w]arm | [c]old] | [a]cpi
   bios	  Use the CPU reboot vector for warm reset
   warm   Don't set the cold reboot flag
   cold   Set the cold reboot flag
   triple Force a triple fault (init)
   kbd    Use the keyboard controller. cold reset (default)
   acpi   Use the ACPI reset mechanism defined in the FADT
 */ 
