446944 – Max latency near 1ms encountered while running cyclictest

Bug 446944 - Max latency near 1ms encountered while running cyclictest

Summary: Max latency near 1ms encountered while running cyclictest

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Enterprise MRG
Classification:	Red Hat
Component:	realtime-kernel
Sub Component:
Version:	beta
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	1.0
Target Release:	---
Assignee:	Steven Rostedt
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	447235
TreeView+	depends on / blocked

Reported:	2008-05-16 18:17 UTC by Clark Williams
Modified:	2008-08-14 20:51 UTC (History)
CC List:	6 users (show)
Fixed In Version:	2.6.24.7-65.el5rt
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-08-14 20:51:43 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
allow delay to preempt (4.49 KB, patch) 2008-05-27 17:26 UTC, Steven Rostedt	no flags	Details \| Diff
View All

Description Clark Williams 2008-05-16 18:17:01 UTC

Description of problem:
While running cyclictest on various systems (with load in the background), large
(>=1ms) latencies can be observed

Version-Release number of selected component (if applicable):
2.6.24.4-30+

How reproducible:
Sometimes takes a while but in general show up in less than 5 minutes of runtime

Steps to Reproduce:
1. Boot RT kernel
2. run: cyclictest -n -p 90 -t 4
3. run: (cd LTPdir; while true; do ./runltp; done)
4. run: (cd kernelsrc; while true; do make clean bzImage modules; done)

Actual results:
One of the four cyclictest threads will eventually show a latency in the
900us-1ms range. Avg latency on a dual 2GHz Opteron system will stay in the
single microseconds:

$ sudo cyclictest -n -p 90 -t 4
0.11 0.18 0.14 1/158 14964          

T: 0 ( 3419) P:90 I:1000 C: 819087 Min:      4 Act:    5 Avg:    5 Max:      40
T: 1 ( 3420) P:89 I:1500 C: 546059 Min:      4 Act:    6 Avg:    5 Max:     960
T: 2 ( 3421) P:88 I:2000 C: 409544 Min:      4 Act:    4 Avg:    5 Max:      32
T: 3 ( 3422) P:87 I:2500 C: 327635 Min:      4 Act:    5 Avg:    5 Max:      30

Expected results:
No Max Latency over 100us

Additional info:

I've run this test on a number of the test boxes in Huntsville and all have
exhibited this behavior, with production kernel-rt packages.

Comment 1 Steven Rostedt 2008-05-18 23:33:24 UTC

Clark, try this:

 # sysctl kernel.sched_nr_migrate=2

And see if that solves you problem. With the new 2.6.24.7-rt7 coming out (with
latency stealing), I was hitting that 900+ all the time. Looking at ftrace, I
found that the load balancing was taking up to 400us at a time. Compounded by
4way or more boxes that can be running this balancing at the same time. My trace
showed that the double_rq_lock took 424us!!! during this balance code and that
was because the CPU of the rq it was trying to grab was also trying to do balancing.

This needs to be noted (Lana ;-)

Comment 2 Steven Rostedt 2008-05-19 00:42:55 UTC

Nevermind, I just hit the 900us+ latency with sched_nr_migrate=2. I'll look
deeper into it.

Comment 3 Clark Williams 2008-05-21 00:57:48 UTC

What ftrace setup are you using to try and trap this thing? If you can post a
setup here, I'll try and duplicate it and get a trace for you.

Comment 4 Clark Williams 2008-05-22 20:40:15 UTC

elevating to blocker status

Comment 5 Steven Rostedt 2008-05-27 17:26:48 UTC

Created attachment 306803 [details]
allow delay to preempt

Seems that x86 delay uses tsc, and Andrew Morton made the delay non-preempt to
keep problems from non-synced tsc's from causing problems on SMP. This
disabling of preemption caused us large latencies.

The patch breaks this preemption and removes the 1ms latencies that we are
seeing.

Comment 6 Steven Rostedt 2008-05-27 17:28:05 UTC

Changing status to MODIFIED to show that this bug is considered resolved, but
needs testing. The -rt11 version will contain this patch.

Comment 7 Clark Williams 2008-05-28 21:59:37 UTC

kernel-rt-2.6.24.7-60.el5rt was built with -rt11 and is currently under test.
Early signs show that the __delay fix has addressed this bug.

Comment 8 Clark Williams 2008-07-02 14:42:54 UTC

fixed in GA

Comment 9 Beth Uptagrafft 2008-07-28 16:37:08 UTC

Need to reopen and set state to Modified for 1.0.1 release.

Comment 13 Beth Uptagrafft 2008-08-14 20:51:43 UTC

Closing.  Fixed in GA.

Note You need to log in before you can comment on or make changes to this bug.