Bug 462361 - gettimeofday() jumps many usec when enabling and disabling cpus
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Hardware: x86_64 Linux
Priority: medium, Severity: medium
Target Milestone: rc
Assigned To: Prarit Bhargava
QA Contact: Red Hat Kernel QE team
Reported: 2008-09-15 13:15 EDT by Prarit Bhargava
Modified: 2011-10-17 10:15 EDT (History)
Doc Type: Bug Fix
Last Closed: 2011-10-17 10:15:16 EDT

Description Prarit Bhargava 2008-09-15 13:15:24 EDT
From private email from Cai:

I ran the reproducer [prarit: see BZ 250708] and found that it failed on one of the boxes while onlining/offlining CPUs with the .98.el5 kernel:

[root@nec-em11 ~]# ./gtod_test
TPstart_time=1221418633.824041s
start time =1221418648.829208 end time =1221418648.829207 reverse -1000 nsec (-1.999999 )

It can be reproduced this way:

Run gtod_test (the original reproducer, in attachment) in one terminal
window. Then, run the following script in another window,

cd /sys/devices/system/node/node0
# take cpu1 and cpu2 offline, wait, bring them back, wait; repeat forever
while :; do
  echo 0 >cpu1/online
  echo 0 >cpu2/online
  sleep 2
  echo 1 >cpu1/online
  echo 1 >cpu2/online
  sleep 2
done

Sometimes it takes a little while, or the above has to be re-run, before
the problem shows up.
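
For anyone who would rather drive the hotplug toggling from a C test harness than from the shell, the write to the sysfs `online` file can be sketched roughly as below. The function name `set_cpu_online` is made up for this illustration; the path is assumed to be the one used in the script above (for example /sys/devices/system/node/node0/cpu1/online), and root privileges are assumed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "0" or "1" to a CPU's sysfs `online` file.
 * `path` is assumed to be something like
 * /sys/devices/system/node/node0/cpu1/online.
 * Returns 0 on success, -1 on failure.
 */
static int set_cpu_online(const char *path, int online)
{
    FILE *f;
    int ok;

    assert(path != NULL);
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    ok = (fputs(online ? "1\n" : "0\n", f) != EOF);
    /* stdio buffers the data, so a rejected write only shows up at close */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Calling this for cpu1 and cpu2 with 0, sleeping two seconds, calling it again with 1, and sleeping again would mirror one pass of the shell loop.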

I tried the same on another machine (nec-em12.rhts.bos.redhat.com),
which reproduced the original bug with the .92 kernel, but there appears
to be no such issue there with the .98 kernel. nec-em12 has a single dual-core
physical processor, while nec-em11 has two single-core processors with
HT enabled.


part of /proc/cpuinfo from nec-em11.rhts.bos.redhat.com:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      :                   Intel(R) Xeon(TM) CPU 3.20GHz
stepping        : 3
cpu MHz         : 3192.078
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pni monitor ds_cpl cid cx16 xtpr
bogomips        : 6386.56
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:
Comment 3 Brian Maly 2008-09-15 13:54:47 EDT
Please report the clocksource being used. Presumably it is the TSC, since this is specific to CPU disablement/enablement, but I want to confirm.
Comment 4 Prarit Bhargava 2008-09-15 14:04:10 EDT

time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
time.c: Detected 3192.109 MHz processor.

Comment 5 Prarit Bhargava 2008-09-15 15:01:27 EDT
I've extended the reservation for nec-em11.rhts for 99 hours.

I've created a new test, gtod_test01.c, which runs the gtod_test check in a slightly different manner.  The new test does the following:

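       while (1) {
               /* two back-to-back reads of the wall clock */
               if (gettimeofday(&tv_start, NULL) != 0)
                       error(EXIT_FAILURE, errno, "cannot gettimeofday");

               if (gettimeofday(&tv_end, NULL) != 0)
                       error(EXIT_FAILURE, errno, "cannot gettimeofday");
               /* ... comparison of tv_start and tv_end not shown here ... */
               tv_start = tv_end;
       }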

i.e., it compares back-to-back reads of gettimeofday().
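
The source of gtod_test01.c is not attached to this report, so the following is only a rough sketch of what such a back-to-back check could look like, not the actual test. The helper names `tv_before` and `count_reversals` are invented for illustration, and the loop is bounded so that it terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Returns 1 if *a is strictly earlier than *b, 0 otherwise. */
static int tv_before(const struct timeval *a, const struct timeval *b)
{
    assert(a != NULL && b != NULL);
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec;
    return a->tv_usec < b->tv_usec;
}

/*
 * Hypothetical helper: do `iterations` pairs of back-to-back
 * gettimeofday() reads and count the pairs in which the second read is
 * earlier than the first. Returns -1 if gettimeofday() itself fails.
 */
static long count_reversals(long iterations)
{
    struct timeval tv_start, tv_end;
    long reversals = 0;
    long i;

    for (i = 0; i < iterations; i++) {
        if (gettimeofday(&tv_start, NULL) != 0 ||
            gettimeofday(&tv_end, NULL) != 0)
            return -1;
        if (tv_before(&tv_end, &tv_start))
            reversals++;
    }
    return reversals;
}
```

Calling count_reversals() repeatedly while CPUs are being taken offline and brought back in another terminal approximates the setup described in this bug; any nonzero return value means the clock was observed going backwards between two adjacent reads.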

I'm currently running the test *WITHOUT* enabling and disabling cpus to see if the problem exists on bare-metal.

If the test is still running tomorrow AM, I will reboot and run the test while enabling and disabling cpus.


Comment 6 Prarit Bhargava 2008-09-16 08:09:24 EDT
The test ran fine overnight.  I am going to enable Cai's script to enable and disable cpus and retest.

Comment 9 Prarit Bhargava 2008-09-16 14:29:03 EDT
This doesn't appear to be a gettimeofday issue.  It looks like CPU Hotplug is broken.

[root@nec-em11 prarit]# cd /sys/devices/system/node/node0
[root@nec-em11 node0]# echo 0 >cpu1/online

results in

[  179.894725] CPU 1 is now offline

[root@nec-em11 node0]# echo 1 >cpu1/online

results in *no* output.

[root@nec-em11 node0]# echo 0 >cpu1/online

results in (note the output below includes some debug output I've added)

[  183.565332] Initializing CPU#1
[  183.565361] start_secondary: calling smp_callin
[  183.565365] CPU#1 (phys ID: 6) waiting for CALLOUT
[  183.566224] CALLIN, before setup_local_APIC().
[  183.566230] smp_callin: calling calibrate_delay
[  183.625901] Calibrating delay using timer specific routine.. 6383.23 BogoMIPS (lpj=3191617)
[  183.625905] Stack at about ffff8100026cdef4
[  183.626287]                   Intel(R) Xeon(TM) CPU 3.20GHz stepping 03
[  183.626292] 
[  183.626294] Call Trace:
[  183.626316]  [<ffffffff800768c5>] start_secondary+0x3b/0x4b8
[  183.626352] 
[  183.626354] cpu 1: setting up apic clock
[  183.626906] cpu 1: enabling apic timer
[  186.866164] CPU 1 is now offline

i.e., the last "echo 0" results in the processor being brought online and then immediately taken offline.

I will go speak to dzickus to see if any changes were recently made in CPU Hotplug that I'm unaware of.
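
Since the "echo 1" step printed nothing, one way to tell whether the kernel considers the CPU online after each write is to read the same sysfs `online` file back. A minimal sketch, assuming the file contains "0" or "1" as its first character; the function name `get_cpu_online` is made up for this example:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a CPU's sysfs `online` file, for example
 * /sys/devices/system/node/node0/cpu1/online.
 * Returns 1 if the CPU is reported online, 0 if offline, and -1 if the
 * file cannot be opened or does not start with '0' or '1'.
 */
static int get_cpu_online(const char *path)
{
    FILE *f;
    int c;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    c = fgetc(f);
    fclose(f);
    if (c == '0')
        return 0;
    if (c == '1')
        return 1;
    return -1;
}
```

Checking the value after every write in the sequence above would show whether the reported state lags the command that was just issued.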
Comment 10 Prarit Bhargava 2008-09-17 07:22:30 EDT
After some additional testing, it looks like the backwards difference between gettimeofday() reads is not always 1 usec.  It seems to vary anywhere from 10 usec down to less than 1 usec (but obviously not 0 ;) ).

That footprint also indicates that this isn't the same issue encountered previously.  I suspect that some value is not being updated when a cpu is brought back up via the sysfs interface.

Comment 11 Prarit Bhargava 2008-09-17 07:23:18 EDT
Example output showing a difference greater than 1 usec between gettimeofday() reads:

[root@nec-em11 prarit]# ./gtod_test01 
start time =1221595469.699872 
end time =1221595469.699867 
reverse -5000 nsec (-1.999995 )

Comment 14 Prarit Bhargava 2011-10-17 10:15:16 EDT
This is still a problem; however, it is a relatively minor one that occurs only on bare metal.  Closing as WONTFIX.

