Red Hat Bugzilla – Bug 466056
Unexplained high latency
Last modified: 2012-01-09 17:11:11 EST
Created attachment 319700 [details]
Description of problem:
Test code (attached) exhibits occasional unexpectedly large latencies (estimated at several hundred microseconds).
Version-Release number of selected component (if applicable):
Using kernel 2.6.24-74el5rt and 2.6.24-74el5rttrace
Highly repeatable, though the exact rate and timing of the latency events is unpredictable.
Steps to Reproduce:
1. Reboot machine with 'isolcpus=2,3' kernel option
2. Compile and run attached test code
Test code reports the first 10 high latency events (greater than 500k CPU cycles). Usually occurs after less than 5 seconds on our system.
Test code binds itself to one of the CPUs that isn't in use by the OS at large (CPU 3, in particular), then enters a tight loop. Within the loop, it uses the RDTSC instruction to read the CPU cycle counter twice and compares the results. The latency between these two instructions varies greatly. Normally the latency is about 100 cycles, but it exhibits latency spikes at around 1k, 14k, and 30k clock cycles. Several times a second, the latency spikes to ~800k cycles.
Using the debugfs tracing functionality with a current_tracer of "sched_switch" seems to indicate that no process other than watchdog interrupts the test code while it's in the tight loop.
Opening this bug at the direction of Rohit Persai per Service Request #1862346.
Can you give us the contents of /proc/cpuinfo, /proc/meminfo, and /proc/interrupts,
and the output of lspci -v?
Created attachment 319935 [details]
Contents of /proc/cpuinfo
Created attachment 319936 [details]
Contents of /proc/meminfo
Created attachment 319937 [details]
Contents of /proc/interrupts
Created attachment 319938 [details]
Output of lspci -v
The machine is an HP DL380G5.
My first suspicion would be SMIs. HP makes heavy use of SMIs to gather data for their management utilities. Unfortunately (latency-wise) SMIs aren't handled by the kernel, they go straight to the BIOS, so it's completely out of Linux's hands. From the Linux perspective, a leap-forward in time just occurs.
One thing you can try is going into the BIOS and seeing if there is an option for turning off SMIs. Of course it probably won't be that easy, since I doubt there's a "Turn off System Management Interrupts" option, but there may be something to disable management statistics or some such.
How long does it take to happen?
I ran your test program on both a dual dual-core Xeon and a dual dual-core Opteron for over five minutes with no output.
1. What BIOS level are you running?
2. Are you using IPMI?
I usually see output after 5 seconds or so.
1. The BIOS is reporting as 12/28/2006. I've updated to 6/25/2008, but to no apparent effect.
2. HP's iLO2 (which I believe is their IPMI) is present and had been enabled. It's not in use, so I've disabled it, but to no apparent effect.
I've gone through all the BIOS settings I can see and disabled just about everything. Nothing refers to System Management Interrupts (of course), or to management statistics.
well, it was a shot anyway.
I've run your test case all night on two test systems with no reports of large latencies.
One of the guys here had a good suggestion: try physically disconnecting the system from any network device and running your test again. If iLO2 is the culprit, this should inhibit SMI generation.
Would it be possible for you to record not only the largest time intervals but also their start times? That would give a better idea of the interval between spikes. If they are happening precisely on a given interval, that's a good hint; depending on the interval, it is easy to pinpoint the guilty component.
A few random thoughts about the original test:
* Have you noticed any message about "unstable TSC" in dmesg? I have the impression your test machine has 2 sockets with two cores each. There is a chance that the TSC is not in sync between sockets, which could lead to false positives if your test is not bound to a single CPU. That can be confirmed by looking at the context-switch trace and making sure your test always runs on the same CPU. Another option is running the test with:
taskset -c 1 ./test_time
* It would also be interesting to check which clocksource the system is using, via /sys/devices/system/clocksource/clocksource0/current_clocksource.
* The conversion from cycles to ns is done by:
static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles)
{
	u64 ret = (u64)cycles;
	ret = (ret * cs->mult) >> cs->shift;
	return ret;
}
where mult = (1000000000 << shift) / hz
and hz is the clock frequency.
So, a rough approximation for your system gives a worst case of:
mult = (1000000000 << 22) / 3000000000
mult = 1398101
cyc2ns = (800000 * mult) >> 22
266666ns or 267us
Another thing we should look at is which services are configured to run at boot
time. Would you send us the output of 'chkconfig --list'?
One of the main ones we're looking for is irqbalance, to make sure it's turned
off. If you see that it's running, you can run this:
/sbin/service irqbalance stop
/sbin/chkconfig irqbalance off
Created attachment 320031 [details]
Output of chkconfig --list
I've tried 'service network stop', disconnecting the network cables, and physically removing all PCI cards. None of these appear to have any effect.
The only comments in dmesg regarding TSC are:
"TSC calibrated against HPET
Marking TSC unstable due to TSCs unsynchronized"
One of the first things the test code does is bind itself to CPU 3. I've also tried using "taskset -c 3 ..." to no apparent effect.
/sys/devices/system/clocksource/clocksource0/ reports the current_clocksource is "hpet" and available_clocksource is "hpet acpi_pm jiffies tsc". I tried setting current_clocksource to each of those. The only apparent effect was on "jiffies", which caused the machine to seize.
Ok, let's try one of our test programs. The MRG RT channel comes with a package called rt-tests, which contains a program called 'cyclictest'. If you don't have rt-tests installed you should be able to get it with 'yum install rt-tests'.
Cyclictest measures wakeup latency by getting a timestamp (t1), sleeping for a specific interval, getting another timestamp (t2), then comparing t1+interval to t2. The difference is the wakeup latency and *should* be quite small (usually measured in microseconds). Try running cyclictest like this:
$ sudo cyclictest -mn -p 95 -t
That will run one measurement thread per core with a max priority of 95. If we don't see big max latencies, then the problem is contained to the TSC. If we see large max latencies (> 150us), then there's something else going on...
[root@bender ~]# cyclictest -d 100 -mn -p 95 -t
0.62 0.90 0.79 2/181 9124
T: 0 ( 9121) P:95 I:1000 C: 18937 Min: 3 Act: 7 Avg: 7 Max: 241
T: 1 ( 9122) P:94 I:1100 C: 17210 Min: 3 Act: 5 Avg: 13 Max: 310
T: 2 ( 9123) P:93 I:1200 C: 15770 Min: 3 Act: 23 Avg: 14 Max: 323
T: 3 ( 9124) P:92 I:1300 C: 14552 Min: 3 Act: 35 Avg: 16 Max: 330
Outputting old_time and new_time as well as the difference shows some interesting results:
The times between large latency events seem to cluster at multiples of about 405000000 ticks, or about 135ms. I'm attaching output from time_test that shows 1000 high-latency events.
Created attachment 320045 [details]
Output from time_test, 1000 events
try 'lsmod | grep ipmi' and see if you still have an IPMI module loaded. Luis caught that there's an smi function that runs about once every 100us or so (smi_start_processing) that is a likely suspect. If you have one of the ipmi modules loaded and can 'modprobe -r' it, that might help.
There were two ipmi modules loaded. I removed them with 'rmmod', but cyclictest is still showing max latencies in the hundreds.
Newer HP boxes seem to behave much better with our RT kernel than the older (<= G5) boxes. If you're still seeing this on newer DL380s, please reopen.