466056 – Unexplained high latency

Bug 466056 - Unexplained high latency

Summary: Unexplained high latency

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	1.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Clark Williams
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2008-10-07 22:08 UTC by Brian Kidder
Modified:	2012-01-09 22:11 UTC (History)
CC List:	5 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2012-01-09 22:11:11 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
Demostration code (1.38 KB, text/plain) 2008-10-07 22:08 UTC, Brian Kidder	no flags	Details
Contents of /proc/cpuinfo (2.60 KB, text/plain) 2008-10-09 20:54 UTC, Brian Kidder	no flags	Details
Contents of /proc/meminfo (748 bytes, text/plain) 2008-10-09 20:55 UTC, Brian Kidder	no flags	Details
Contents of /proc/interrupts (2.06 KB, text/plain) 2008-10-09 20:55 UTC, Brian Kidder	no flags	Details
Output of lspci -v (16.10 KB, text/plain) 2008-10-09 20:56 UTC, Brian Kidder	no flags	Details
Output of chkconfig --list (4.33 KB, text/plain) 2008-10-10 16:56 UTC, Brian Kidder	no flags	Details
Output from time_test, 1000 events (40.94 KB, text/plain) 2008-10-10 18:29 UTC, Brian Kidder	no flags	Details
View All

Description Brian Kidder 2008-10-07 22:08:25 UTC

Created attachment 319700 [details]
Demostration code

Description of problem:
Test code (attached) exhibits occasional unexpectedly large latencies (estimated at several hundred microseconds).

Version-Release number of selected component (if applicable):
Using kernel 2.6.24-74el5rt and 2.6.24-74el5rttrace

How reproducible:
Highly repeatable, though the exact rate and timing of the latency events is unpredictable.

Steps to Reproduce:
1. Reboot machine with 'isolcpus=2,3' kernel option
2. Compile and run attached test code
  
Actual results:
Test code reports the first 10 high latency events (greater than 500k CPU cycles).  Usually occurs after less than 5 seconds on our system.

Additional info:
Test code binds itself to one of the CPUs that isn't in use by the OS at large (CPU 3, in particular), then enters a tight loop.  Within the loop, it uses the RDTSC instruction to access the CPU clock counter twice, and compares the results.  The latency between these two instructions varies greatly.  Normaly the latecy is about 100 cycles, but exhibits latency spikes at around 1k clock cycles, 14k cycles, 30k cycles.  Several times a second, the latency spikes to ~800k cycles.

Using the debugfs tracing functionality with a current_tracer of "sched_switch" seems to indicate that no process other than watchdog interrupts the test code while it's in the tight loop.

Opening this bug at the direction of Rohit Persai per Service Request #1862346.

Comment 1 Clark Williams 2008-10-09 18:58:58 UTC

Brian,

Can you give us the contents of:
/proc/cpuinfo
/proc/meminfo
/proc/interrupts 

and the output of lspci -v?

Thanks,
Clark

Comment 2 Brian Kidder 2008-10-09 20:54:58 UTC

Created attachment 319935 [details]
Contents of /proc/cpuinfo

Comment 3 Brian Kidder 2008-10-09 20:55:30 UTC

Created attachment 319936 [details]
Contents of /proc/meminfo

Comment 4 Brian Kidder 2008-10-09 20:55:58 UTC

Created attachment 319937 [details]
Contents of /proc/interrupts

Comment 5 Brian Kidder 2008-10-09 20:56:19 UTC

Created attachment 319938 [details]
Output of lspci -v

Comment 6 Brian Kidder 2008-10-09 20:57:56 UTC

The machine is an HP DL380G5.

Comment 7 Clark Williams 2008-10-09 21:18:28 UTC

My first suspicion would be SMIs. HP makes heavy use of SMIs to gather data for their management utilities. Unfortunately (latency-wise) SMIs aren't handled by the kernel, they go straight to the BIOS, so it's completely out of Linux's hands. From the Linux perspective, a leap-forward in time just occurs. 

One thing you can try is going into the BIOS and see if there is an option for turning off SMIs. Of course it probably won't be that easy, since I doubt there's a "Turn off System Management Interrupts" option, but there maybe something to disable management statistics or some such. 

How long does it take to happen?

I ran your test program on both a dual dual-core Xeon and a dual dual-core Opteron for over five minutes with no output.

Comment 8 Clark Williams 2008-10-09 21:34:29 UTC

More questions:

1. What BIOS level are you running?

2. Are you using IPMI?

Comment 9 Brian Kidder 2008-10-09 23:20:32 UTC

I usually see output after 5 seconds or so.

---

1. The BIOS is reporting as 12/28/2006.  I've updated to 6/25/2008, but to no apparent effect.

2. HP's iLO2 (which I believe is their IPMI) is present and had been enabled.  It's not in use, so I've disabled it, but to no apparent effect.

I've gone through all the BIOS settings I can see and disabled just about everything.  Nothing refers to System Management Interrupts (of course), or to management statistics.

Comment 10 Clark Williams 2008-10-10 14:16:14 UTC

well, it was a shot anyway. 

I've run your test case all night on two test systems with no reports of large latencies. 

One of the guys here had a good suggestion: try physically disconnecting the system from any network device and running your test again. If iLO2 is the culprit, this should inhibit SMI generation. 

Clark

Comment 11 Luis Claudio R. Goncalves 2008-10-10 14:59:22 UTC

Would it be possible for you to record not only the biggest time intervals but also their starting times? That way you could have a better idea of the interval between spikes. If they are happening precisely on a given interval, that's a good hint. Depending on the interval it is easy to point the guilt piece.

A few random thoughts about the original test:

* Have you noticed any message about "unstable TSC" on dmesg? I have the impression your test machine has 2 sockets with two cores on each. There is a change that TSC is not in sync between sockets and that could lead to false positives if your test is not bound to one CPU only. That can be confirmed looking at the context switch trace and making sure your test always run on the same CPU. Another option is running the test with:

      taskset -c 1 ./test_time

* That would be also interesting checking what is the clocksource in use by the system:

   cat /sys/devices/system/clocksource/clocksource0/*

* The conversion from cycles to ns is done by:

  static inline s64 cyc2ns(struct clocksource *cs, cycle_t cycles) 
  {
        u64 ret = (u64)cycles;
        ret = (ret * cs->mult) >> cs->shift;
        return ret;
  }

  where  mult = (1000000000<<shift) / hz

  and hz is the clock frequency.

  So, a rough aproximantion of what may be your system gives a worst case of:

  mult = (1000000000 << 22) / 3000000000
  mult = 1398101

  cyc2ns = (800000 * mult) >> 22
  266666ns  or  267us

Comment 12 Clark Williams 2008-10-10 15:42:13 UTC

Another thing we should look at is which services are configured to run at boot
time. Would you send us the output of:

    /sbin/chkconfig --list

One of the main one's we're looking for is irqbalance to make sure it's turned
off. If you see that it's running, you can run this:

    /sbin/service irqbalance stop
    /sbin/chkconfig irqbalance off

Comment 13 Brian Kidder 2008-10-10 16:56:00 UTC

Created attachment 320031 [details]
Output of chkconfig --list

Comment 14 Brian Kidder 2008-10-10 17:06:51 UTC

Re:#10

I've tried 'service network stop', disconnecting the network cables, and physically removing all PCI cards.  None of these appear to have any effect.

--

Re: #11

The only comments in dmeseg regardiong TSC are:
"TSC calibrated against HPET
Marking TSC unstable due to TSCs unsynchronized"

One of the first things the test code does is bind itself to CPU 3.  I've also tried using "taskset -c 3 ..." to no apparent effect.

/sys/devices/system/clocksource/clocksource0/ reports the current_clocksource is "hpet" and available_clocksource is "hpet acpi_pm jiffies tsc".  I tried setting current_clocksource to each of those.  The only apparent effect was on "jiffies", which caused the machine to seize.

Comment 15 Clark Williams 2008-10-10 17:37:07 UTC

Ok, let's try one of our test programs. The MRG RT channel comes with a package called rt-tests, which contains a program called 'cyclictest'. If you don't have rt-tests installed you should be able to get it with 'yum install rt-tests'.

Cyclictest measure wakeup latency by getting a time stamp (t1), sleeping for a specific interval, getting another timestamp (t2), then comparing t1+interval to t2. The difference is the wakeup latency and *should* be quite small (usually measured in microseconds). Try running cyclictest like this:

     $ sudo cyclictest -mn -p 95 -t

That will run one measurement thread per-core with a max priority of 95. If we don't see  big max latencies, then the problem is contained to the TSC. If we see large max latencies(> 150us), then there's something else going on...

Comment 16 Brian Kidder 2008-10-10 18:11:03 UTC

[root@bender ~]# cyclictest -d 100 -mn -p 95 -t
0.62 0.90 0.79 2/181 9124          

T: 0 ( 9121) P:95 I:1000 C:  18937 Min:      3 Act:    7 Avg:    7 Max:     241
T: 1 ( 9122) P:94 I:1100 C:  17210 Min:      3 Act:    5 Avg:   13 Max:     310
T: 2 ( 9123) P:93 I:1200 C:  15770 Min:      3 Act:   23 Avg:   14 Max:     323
T: 3 ( 9124) P:92 I:1300 C:  14552 Min:      3 Act:   35 Avg:   16 Max:     330

Comment 17 Brian Kidder 2008-10-10 18:27:57 UTC

Outputing old_time and new_time as well as the difference shows some interesting results:

The times between large latency events seems to cluster at multiples of about 405000000 ticks or about 135ms.  I'm attaching an output from time_test that shows 1000 high latency events.

Comment 18 Brian Kidder 2008-10-10 18:29:05 UTC

Created attachment 320045 [details]
Output from time_test, 1000 events

Comment 19 Clark Williams 2008-10-10 18:41:39 UTC

try 'lsmod | grep ipmi' and see if you still have an IPMI module loaded. Luis caught that there's an smi function that runs about once every 100us or so (smi_start_processing) that is a likely suspect. If you have one of the ipmi modules loaded and can 'modprobe -r' it, that might help.

Comment 20 Brian Kidder 2008-10-10 19:30:53 UTC

There were two impi modules loaded.  I removed them with 'rmmod', but cyclictest is still showing max latencies in the hundreds.

Comment 21 Clark Williams 2012-01-09 22:11:11 UTC

New HP boxes seem to behave much better with our RT kernel that the older (<= G5) boxes. If you're still seeing this on newer DL380's, please reopen.

Note You need to log in before you can comment on or make changes to this bug.