Bug 990145 - Still have latencies with the 40 core 4 socket system
Summary: Still have latencies with the 40 core 4 socket system
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 2.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: 2.4
Assignee: John Kacur
QA Contact: MRG Quality Engineering
URL:
Whiteboard:
Depends On: 928003
Blocks:
 
Reported: 2013-07-30 13:26 UTC by John Kacur
Modified: 2013-09-17 14:23 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Lock contention in the SLUB allocator's cpu_partial handling (unfreeze_partials() in slub.c).
Consequence: High latency when cpu_partial batches are flushed with interrupts disabled.
Fix: Set cpu_partials to 0 by default for PREEMPT_RT_FULL.
Result: Good real-time latency is restored.
Clone Of: 928003
Environment:
Last Closed: 2013-09-17 14:23:39 UTC
Target Upstream Version:
Embargoed:



Description John Kacur 2013-07-30 13:26:45 UTC
+++ This bug was initially created as a clone of Bug #928003 +++

Bugzilla 858396 solved some of the issues that caused the high latency on the 40-core system, but there was still one more, and this one was a bit strange.

When we had tracing enabled, the latency never appeared, but when we disabled tracing, a 200-300+ us latency appeared.

--- Additional comment from Steven Rostedt on 2013-03-26 12:54:29 EDT ---

After many rounds of tracing, adding a little tracing at a time (if you added too much, the latency would not trigger), I finally got to the issue.

Adding a backtrace of the timer interrupt to the tracing pointed to the __slab_free() code in slub.c.  Looking at that code, there's a function called unfreeze_partials() with a loop that grabs and releases raw_spin_locks, all while interrupts are disabled.
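
For illustration, here is a simplified sketch of that pattern (this is not the actual unfreeze_partials() from mm/slub.c; move_to_node_partial() is a made-up stand-in for the real bookkeeping, and list_lock is shown as a raw lock as in the RT tree). The point is the shape of the code: interrupts stay off for the entire walk, and the node's list_lock is taken and released once per partial page, so the total IRQs-off time is the sum of every lock wait:

 /*
  * Simplified sketch of the pattern described above, not the real
  * mm/slub.c code. move_to_node_partial() is a hypothetical helper.
  */
 static void unfreeze_partials_sketch(struct kmem_cache *s,
                                      struct kmem_cache_cpu *c)
 {
         struct page *page;
         unsigned long flags;

         local_irq_save(flags);                  /* interrupts off for the whole walk */

         while ((page = c->partial)) {
                 struct kmem_cache_node *n = get_node(s, page_to_nid(page));

                 c->partial = page->next;

                 raw_spin_lock(&n->list_lock);   /* each acquire can wait tens of us under contention */
                 move_to_node_partial(n, page);  /* hypothetical: return the page to the node partial list */
                 raw_spin_unlock(&n->list_lock);
         }

         local_irq_restore(flags);               /* IRQs-off time grows with the size of the batch */
 }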

I added a function_graph trace on unfreeze_partials() and it proved to be the location of the latency. It reported that the function could take up to 260us to complete. Again, this is with interrupts disabled.

More tracing proved that it had more to do with the lock contention than with the loop itself. The code took several locks, some of which took 60us to acquire, and the total time spent waiting on the locks in the loop exceeded our latency requirements.

Talking with Christoph Lameter (the author of the slub.c code), I learned that you can disable the slab partials via the sysfs files. I did the following:

 # ls /sys/kernel/slab/*/cpu_partial | while read f; do echo 0 > $f; done

After doing this, I ran rteval for 24 hours, and it had a max latency of 135us, well within our latency range for such a machine.

The slab partials batch work that is done when kfree() is called, and this batching can cause non-deterministic results. The only downside of disabling the partials is that it causes kfree() to take a little bit longer. A slight decrease in performance perhaps, but much more deterministic results.
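
Roughly, the batching behaves like the sketch below (again a simplification, not the real put_cpu_partial() from mm/slub.c; flush_batch_to_node() is a made-up stand-in for the unfreeze path shown earlier). Each kfree() parks the now-partial page on a per-cpu list, and only when the accumulated count passes the cpu_partial threshold is the whole batch flushed at once, which is the long interrupts-off section. With the threshold at 0 the batch never grows, so each kfree() does a little more work but the flush stays short:

 /*
  * Simplified sketch of how cpu_partial batching behaves on the free
  * path; not the real put_cpu_partial() from mm/slub.c.
  */
 static void put_cpu_partial_sketch(struct kmem_cache *s,
                                    struct kmem_cache_cpu *c,
                                    struct page *page)
 {
         if (c->partial && c->partial->pobjects > s->cpu_partial) {
                 unsigned long flags;

                 /*
                  * The batch is "full": flush everything accumulated so
                  * far. With a large cpu_partial this flush is long and
                  * runs with interrupts disabled; with cpu_partial == 0
                  * it happens on nearly every free but is always tiny.
                  */
                 local_irq_save(flags);
                 flush_batch_to_node(s, c);      /* hypothetical stand-in for the unfreeze path */
                 local_irq_restore(flags);
         }

         /* Park the newly freed-from page on the per-cpu partial list. */
         page->next = c->partial;
         c->partial = page;
 }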

As for why tracing would make the latency disappear: since the problem was in how the batching worked, timing is critical. Too much tracing could spread the batching out more evenly, so a bunch of work done in a short time period never built up a large batch job and never caused a high latency.

--- Additional comment from John Kacur on 2013-06-18 07:05:08 EDT ---

Is this method of disabling the slab partials via the sysfs file system a quick and dirty proof of concept, or something we want to implement?

--- Additional comment from Clark Williams on 2013-07-09 11:09:40 EDT ---

This is the second version of Christoph Lameter's patch to make SLUB cpu partial processing a configurable option.

--- Additional comment from Steven Rostedt on 2013-07-17 13:45:12 EDT ---

The patch to disable SLUB cpu partials did not backport as easily as I hoped, and when running it, it would lock up the system. I found that just setting the cpu_partials default size to 0 was a much easier and safer patch.

This also avoids the races that exist between a user-space task setting the cpu_partials to size zero and a new slab cache being created (and thus still having cpu_partials set to something other than zero).

Another nice thing about this patch is that the user can still manually set the cpu_partials if they want the performance, at the sacrifice of determinism.
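
A minimal sketch of what such a default could look like in the cache setup path (illustrative only, not the actual patch shipped in the -rt kernels; the function name is made up, and the non-RT branch mirrors the usual size-scaled defaults). Under PREEMPT_RT_FULL the per-cache value simply starts at 0, and the existing /sys/kernel/slab/<cache>/cpu_partial attribute can still raise it later:

 /* Illustrative sketch of the default-to-zero approach, not the actual patch. */
 static void set_cpu_partial_sketch(struct kmem_cache *s)
 {
 #ifdef CONFIG_PREEMPT_RT_FULL
         /* No per-cpu partial batching: deterministic kfree() latency. */
         s->cpu_partial = 0;
 #else
         /* Mainline-style defaults that scale with object size. */
         if (s->size >= PAGE_SIZE)
                 s->cpu_partial = 2;
         else if (s->size >= 1024)
                 s->cpu_partial = 6;
         else if (s->size >= 256)
                 s->cpu_partial = 13;
         else
                 s->cpu_partial = 30;
 #endif
 }

Either way, a user who prefers throughput over determinism can write a non-zero value back through the same sysfs files used in the earlier workaround.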

--- Additional comment from John Kacur on 2013-07-26 11:33:10 EDT ---

Fixed in 3.6.11.5-rt37.53

Comment 1 John Kacur 2013-07-30 13:31:54 UTC
Fix in v3.8.13-rt14.18

