Red Hat Bugzilla – Bug 462361
gettimeofday() jumps many usec when enabling and disabling cpus
Last modified: 2011-10-17 10:15:16 EDT
From private email from Cai:
I ran the reproducer [prarit: see BZ 250708], and found it failed in one of boxes while doing online/offline CPUs with .98.el5 Kernel,
[root@nec-em11 ~]# ./gtod_test TPstart_time=1221418633.824041s
start time =1221418648.829208 end time =1221418648.829207 reverse -1000 nsec (-1.999999 )
It could be reproduced this way,
Run gtod_test (the original reproducer, in attachment) in one terminal
window. Then, run the following script in another window,
while :; do
echo 0 >cpu1/online
echo 0 >cpu2/online
echo 1 >cpu1/online
echo 1 >cpu2/online
Sometimes, it takes a little while or needs to re-run the above to see
I have tried the same on another machine (nec-em12.rhts.bos.redhat.com)
which could reproduce the original bug with .92 Kernel, but looks like
there was no such issue with .98 Kernel. Nec-em12 has a single dual-core
physical processor, while nec-em11 has two single-core processors with
part of /proc/cpuinfo from nec-em11.rhts.bos.redhat.com:
processor : 0
vendor_id : GenuineIntel
cpu family : 15
model : 4
model name : Intel(R) Xeon(TM) CPU 3.20GHz
stepping : 3
cpu MHz : 3192.078
cache size : 2048 KB
physical id : 0
siblings : 2
core id : 0
cpu cores : 1
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall lm constant_tsc pn
i monitor ds_cpl cid cx16 xtpr
bogomips : 6386.56
clflush size : 64
cache_alignment : 128
address sizes : 36 bits physical, 48 bits virtual
Please report the clocksource being used? Presumably TSC since this is specific to CPU disablement/enablement. Just want to confirm.
time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
time.c: Detected 3192.109 MHz processor.
I've extended the reservation for nec-em11.rhts for 99 hours.
I've created a new gtod_test, gtod_test01.c which runs the gtod_test in a slightly different manner. In the new test it does the following:
if( gettimeofday(&tv_start, NULL) != 0)
if( gettimeofday(&tv_end, NULL) != 0)
tv_start = tv_end ;
ie) compares back-to-back reads of gettimeofday()
I'm currently running the test *WITHOUT* enabling and disabling cpus to see if the problem exists on bare-metal.
If the test is still running tomorrow AM, I will reboot and run the test while enabling and disabling cpus.
The test ran fine overnight. I am going to enable Cai's script to enable and disable cpus and retest.
This doesn't appear to be a gettimeofday issue. It looks like CPU Hotplug is broken.
[root@nec-em11 prarit]# cd /sys/devices/system/node/node0
[root@nec-em11 node0]# echo 0 >cpu1/online
[ 179.894725] CPU 1 is now offline
[root@nec-em11 node0]# echo 1 >cpu1/online
results in *no* output.
[root@nec-em11 node0]# echo 0 >cpu1/online
results in (note the output below includes some debug output I've added)
[ 183.565332] Initializing CPU#1
[ 183.565361] start_secondary: calling smp_callin
[ 183.565365] CPU#1 (phys ID: 6) waiting for CALLOUT
[ 183.566224] CALLIN, before setup_local_APIC().
[ 183.566230] smp_callin: calling calibrate_delay
[ 183.625901] Calibrating delay using timer specific routine.. 6383.23 BogoMIPS (lpj=3191617)
[ 183.625905] Stack at about ffff8100026cdef4
[ 183.626287] Intel(R) Xeon(TM) CPU 3.20GHz stepping 03
[ 183.626294] Call Trace:
[ 183.626316] [<ffffffff800768c5>] start_secondary+0x3b/0x4b8
[ 183.626354] cpu 1: setting up apic clock
[ 183.626906] cpu 1: enabling apic timer
[ 186.866164] CPU 1 is now offline
ie) the last "echo 0" results in the processor being brought online and then immediately offline.
I will go speak to dzickus to see if any changes were recently made in CPU Hotplug that I'm unaware of.
After some additional testing it looks like the difference between gettimeofday() reads is not always 1 usec. It seems to vary anywhere from 10 usec to lt. 1 usec (but obviously not 0 ;) ).
That footprint also indicates that this isn't the same issue encountered previously. I suspect that some value is not being updated when a cpu is brought back up via the sysfs interface.
example output showing gt. 1 usec difference between gettimeofday() reads:
[root@nec-em11 prarit]# ./gtod_test01
start time =1221595469.699872
end time =1221595469.699867
reverse -5000 nsec (-1.999995 )
This is still a problem, however, it is a relatively minor problem that occurs only on bare-metal. Closing as WONTFIX.