462361 – gettimeofday() jumps many usec when enabling and disabling cpus

Bug 462361 - gettimeofday() jumps many usec when enabling and disabling cpus

Summary: gettimeofday() jumps many usec when enabling and disabling cpus

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.3
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Prarit Bhargava
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-09-15 17:15 UTC by Prarit Bhargava
Modified:	2011-10-17 14:15 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2011-10-17 14:15:16 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments	(Terms of Use)

Description Prarit Bhargava 2008-09-15 17:15:24 UTC

From private email from Cai:

I ran the reproducer [prarit: see BZ 250708], and found it failed in one of boxes while doing online/offline CPUs with .98.el5 Kernel,

[root@nec-em11 ~]# ./gtod_test TPstart_time=1221418633.824041s
start time =1221418648.829208 end time =1221418648.829207 reverse -1000 nsec (-1.999999 )

It could be reproduced this way,

Run gtod_test (the original reproducer, in attachment) in one terminal
window. Then, run the following script in another window,

cd /sys/devices/system/node/node0
while :; do
  echo 0 >cpu1/online
  echo 0 >cpu2/online
  sleep 2
  echo 1 >cpu1/online
  echo 1 >cpu2/online
  sleep 2
done

Sometimes, it takes a little while or needs to re-run the above to see
the problem.

I have tried the same on another machine (nec-em12.rhts.bos.redhat.com)
which could reproduce the original bug with .92 Kernel, but looks like
there was no such issue with .98 Kernel. Nec-em12 has a single dual-core
physical processor, while nec-em11 has two single-core processors with
HT enabled. 

Prarit:

part of /proc/cpuinfo from nec-em11.rhts.bos.redhat.com:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 3
cpu MHz         : 3192.078
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pn
i monitor ds_cpl cid cx16 xtpr
bogomips        : 6386.56
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Comment 3 Brian Maly 2008-09-15 17:54:47 UTC

Please report the clocksource being used? Presumably TSC since this is specific to CPU disablement/enablement. Just want to confirm.

Comment 4 Prarit Bhargava 2008-09-15 18:04:10 UTC

bmaly,

time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
time.c: Detected 3192.109 MHz processor.

P.

Comment 5 Prarit Bhargava 2008-09-15 19:01:27 UTC

I've extended the reservation for nec-em11.rhts for 99 hours.

I've created a new gtod_test, gtod_test01.c which runs the gtod_test in a slightly different manner.  In the new test it does the following:

       while (1){
               if( gettimeofday(&tv_start, NULL) != 0)
                       error(EXIT_FAILURE,errno,"cannot gettimeofday");

               if( gettimeofday(&tv_end, NULL) != 0)
                       error(EXIT_FAILURE,errno,"cannot gettimeofday");
                             if(tv_lt(tv_end,tv_start)){
                       print_difftime(tv_start,tv_end);
                       break;
               }
               tv_start = tv_end ;
       }

ie) compares back-to-back reads of gettimeofday()

I'm currently running the test *WITHOUT* enabling and disabling cpus to see if the problem exists on bare-metal.

If the test is still running tomorrow AM, I will reboot and run the test while enabling and disabling cpus.

FYI,

P.

Comment 6 Prarit Bhargava 2008-09-16 12:09:24 UTC

The test ran fine overnight.  I am going to enable Cai's script to enable and disable cpus and retest.

P.

Comment 9 Prarit Bhargava 2008-09-16 18:29:03 UTC

This doesn't appear to be a gettimeofday issue.  It looks like CPU Hotplug is broken.

[root@nec-em11 prarit]# cd /sys/devices/system/node/node0
[root@nec-em11 node0]# echo 0 >cpu1/online

results in

[  179.894725] CPU 1 is now offline

[root@nec-em11 node0]# echo 1 >cpu1/online

results in *no* output.

[root@nec-em11 node0]# echo 0 >cpu1/online

results in (note the output below includes some debug output I've added)

[  183.565332] Initializing CPU#1
[  183.565361] start_secondary: calling smp_callin
[  183.565365] CPU#1 (phys ID: 6) waiting for CALLOUT
[  183.566224] CALLIN, before setup_local_APIC().
[  183.566230] smp_callin: calling calibrate_delay
[  183.625901] Calibrating delay using timer specific routine.. 6383.23 BogoMIPS (lpj=3191617)
[  183.625905] Stack at about ffff8100026cdef4
[  183.626287]                   Intel(R) Xeon(TM) CPU 3.20GHz stepping 03
[  183.626292] 
[  183.626294] Call Trace:
[  183.626316]  [<ffffffff800768c5>] start_secondary+0x3b/0x4b8
[  183.626352] 
[  183.626354] cpu 1: setting up apic clock
[  183.626906] cpu 1: enabling apic timer
[  186.866164] CPU 1 is now offline


ie) the last "echo 0" results in the processor being brought online and then immediately offline.

I will go speak to dzickus to see if any changes were recently made in CPU Hotplug that I'm unaware of.

Comment 10 Prarit Bhargava 2008-09-17 11:22:30 UTC

After some additional testing it looks like the difference between gettimeofday() reads is not always 1 usec.  It seems to vary anywhere from 10 usec to lt. 1 usec (but obviously not 0 ;) ).

That footprint also indicates that this isn't the same issue encountered previously.  I suspect that some value is not being updated when a cpu is brought back up via the sysfs interface.

P.

Comment 11 Prarit Bhargava 2008-09-17 11:23:18 UTC

example output showing gt. 1 usec difference between gettimeofday() reads:

[root@nec-em11 prarit]# ./gtod_test01 
TPstart_time=1221595467.243089s
start time =1221595469.699872 
end time =1221595469.699867 
reverse -5000 nsec (-1.999995 )

P.

Comment 14 Prarit Bhargava 2011-10-17 14:15:16 UTC

This is still a problem, however, it is a relatively minor problem that occurs only on bare-metal.   Closing as WONTFIX.

P.

Note You need to log in before you can comment on or make changes to this bug.