Description of problem:
When two or more tasks are running in user-space and taking 100% of a nohz_full CPU, top reports 60%-100% system time utilization. This is wrong, as most of the CPU time is actually spent in user-space (in each 1ms period, the kernel executes just once, for a couple of microseconds).
Running the same test-case against non-nohz_full CPUs shows 100% user-space time and close to 0% system time (which is the expected behavior).
This issue has been reported upstream:
[BUG nohz]: wrong user and system time accounting
This email also contains initial debugging information and other people have found different ways to reproduce the problem.
Version-Release number of selected component (if applicable): kernel-3.10.0-617.el7.x86_64 (also present on latest upstream 4.11-rc4)
Steps to Reproduce:
1. Set up a CPU to be nohz_full (that is, pass nohz_full= and isolcpus=)
2. Pin two tasks that run a busy loop in user-space to that CPU
3. Run 'top -d1'
4. Check user time and system time for that CPU
Actual results: System time is between 60% and 100%.
Expected results: User time is at or close to 100%, system time is at or close to 0%.
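For anyone who wants to script steps 2-4 instead of watching top, here is a minimal sketch. It assumes Linux, that the per-CPU lines in /proc/stat are an acceptable stand-in for what top displays, and that the nohz_full CPU number is passed in by the caller. The function names (parse_cpu_times, usage_percent, run_reproducer) are made up for illustration and are not part of any existing tool.

```python
import multiprocessing
import os
import time

# First eight per-CPU counters in /proc/stat, in order.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_times(stat_text, cpu):
    """Return the tick counters of one CPU from the contents of /proc/stat."""
    label = f"cpu{cpu}"
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == label:
            return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))
    raise ValueError(f"{label} not found in /proc/stat contents")


def usage_percent(before, after):
    """Convert two counter snapshots into percentages of the elapsed ticks."""
    deltas = {name: after[name] - before[name] for name in before}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in deltas}
    return {name: 100.0 * value / total for name, value in deltas.items()}


def _busy_loop(cpu):
    """Pin the calling process to one CPU and spin in user-space forever."""
    os.sched_setaffinity(0, {cpu})
    while True:
        pass


def run_reproducer(cpu, seconds=5.0, tasks=2):
    """Run `tasks` pinned busy loops on `cpu` and return its usage percentages."""
    workers = [
        multiprocessing.Process(target=_busy_loop, args=(cpu,), daemon=True)
        for _ in range(tasks)
    ]
    for worker in workers:
        worker.start()
    try:
        with open("/proc/stat") as stat:
            before = parse_cpu_times(stat.read(), cpu)
        time.sleep(seconds)
        with open("/proc/stat") as stat:
            after = parse_cpu_times(stat.read(), cpu)
    finally:
        for worker in workers:
            worker.terminate()
            worker.join()
    return usage_percent(before, after)
```

Calling run_reproducer(3) on a machine where CPU 3 is nohz_full should return a dictionary whose "system" entry is large on an affected kernel and close to 0 on a fixed one. The script itself must run on a housekeeping CPU so that only the busy loops land on the isolated CPU.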
I'm on top of this so I'm taking this BZ, but feel free to re-assign.
I've found the following while debugging this issue:
- I can only reproduce if the tick is re-activated on the nohz_full CPU. If the tick is de-activated (that is, we have only the 1HZ tick) I don't reproduce this issue. That's why the issue triggers when we have two tasks running (the tick is activated) and not when we have a single task (tick de-activated)
- While tracing this, I've seen the following pattern:
1. Task is happily executing in user-space
2. There's a timer interrupt
3. We transition from user-space to kernel-space
4. User time accounting is skipped because vtime_delta() returns zero (that is, jiffies hasn't changed since the last accounting)
5. The interrupt handler for the tick executes
6. We return to user-space, system time is accounted for because now vtime_delta() returns non-zero
As I have several thousand entries in my tracing data, it's hard to tell if all wrong accounting is happening as described. But if it is, then the problem is that we're skipping user time accounting in step 4.
I think I understand why we didn't see this before, it goes like this:
1. In the beginning of KVM-RT, we knew that nohz_full added a big overhead to kernel/user-space transitions. Applications such as cyclictest (which is basically a loop around a system call) would hit this case pretty hard, giving bad latency numbers. Because of this problem, we'd run KVM-RT test-cases WITHOUT nohz_full in the guest (yes, this is somewhat counter-intuitive, but we'd get better latencies without nohz_full due to the overhead problem; for the sysjitter test-case and for DPDK/NFV use-cases, we'd have it enabled)
2. Rik fixed the overhead issue upstream in commit ff9a9b4c433 last year
3. Recently, I realized the fix had reached RHEL7 and re-enabled nohz_full in the guest. The latencies were excellent so I kept it enabled in my testing
4. I first realized the bad accounting last week, when running KVM-RT stress testing against latest downstream RT kernel. I usually watch top on KVM-RT stress testing to see if the housekeeping CPUs are getting stressed. But then I saw the nohz_full CPU with 100% system time while running threads in user-space...
A small status update:
- There's consensus upstream that the root cause of this issue is the fact that the ticks on the nohz_full CPUs and the timekeeper CPU are getting aligned. This causes the timekeeper CPU to always update jiffies right after the nohz_full CPU enters kernel-space from user-space (due to its own tick) but before it returns to user-space. This ends up causing the nohz_full CPU to always account for system time when returning to user-space
- Passing skew_tick=1 to the kernel command-line mostly fixes this issue, but I can still see it happening with a KVM guest with 8-vCPUs (where 6 CPUs are nohz_full). This is still being debugged
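To make the alignment problem concrete, here is a toy model of jiffies-based accounting. It is only a sketch and not the kernel's code. It assumes a 1ms tick, a handler that runs for 5us, and a timekeeper that bumps jiffies 2us after its own tick fires. All of those numbers, the 500us skew used below, and the function name simulate are arbitrary assumptions chosen for illustration.

```python
def simulate(ticks, skew_us, period_us=1000, handler_us=5, update_delay_us=2):
    """Return (user_jiffies, system_jiffies) accounted on the nohz_full CPU.

    skew_us is the offset of the nohz_full CPU's tick relative to the
    timekeeper CPU's tick (0 means the two ticks are aligned).
    """
    events = []
    for k in range(ticks):
        base = k * period_us
        # Timekeeper CPU updates jiffies shortly after its tick fires.
        events.append((base + update_delay_us, "jiffies_update"))
        # nohz_full CPU takes its own tick: user -> kernel -> user.
        events.append((base + skew_us, "kernel_entry"))
        events.append((base + skew_us + handler_us, "kernel_exit"))
    events.sort()  # process everything in time order

    jiffies = 0   # global counter maintained by the timekeeper
    snap = 0      # jiffies value at the last accounting point
    utime = 0
    stime = 0
    for _, kind in events:
        if kind == "jiffies_update":
            jiffies += 1
            continue
        delta = jiffies - snap  # plays the role of vtime_delta()
        snap = jiffies
        if kind == "kernel_entry":
            utime += delta      # time since last snapshot charged to user
        else:
            stime += delta      # time since last snapshot charged to system
    return utime, stime


aligned = simulate(ticks=1000, skew_us=0)
skewed = simulate(ticks=1000, skew_us=500)
print("aligned ticks:", aligned)  # (0, 1000)
print("skewed ticks: ", skewed)   # (1000, 0)
```

With aligned ticks the delta is always zero at kernel entry and one at kernel exit, so every jiffy is charged to system time even though the CPU spends 995us of every 1000us in user-space. With a skew, jiffies changes while the task is in user-space, so the delta is picked up at kernel entry and charged to user time.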
The issue I was seeing turned out to be "expected". Meaning, no form of accounting is perfect. Even for tick-based accounting, it is possible to cook up an application that will trigger incorrect accounting.
Upstream has agreed that making skew_tick=1 default when nohz_full is enabled is the best solution. A patch implementing this has been posted:
I'll backport this patch as soon as it's merged upstream.
NOTE: Just to emphasize that even with this patch, it is possible that we'll see wrong accounting on some CPUs when running KVM-RT test-cases as our test-case is good at triggering the bad accounting scenario.
Further discussion on upstream concluded that the issue I was seeing in comment 5 should be fixed. There's a proposal by Thomas Gleixner on how to properly fix all cases.
However, this is still under discussion and the patches are getting more and more complex. Moving this to 7.5, as I don't think the series will make it in time for 7.4.
Affected workloads can use skew_tick=1 as a workaround on 7.4 (this is possibly going to be the default for RT, see bug 1447938 comment 12).
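A quick way to confirm that a booted system actually has the workaround on its command line is sketched below. It assumes the parameters show up in /proc/cmdline exactly as passed at boot, and it handles only plain CPU numbers and ranges in nohz_full= (not stride syntax). The helper names are hypothetical. Note that it checks only for an explicit skew_tick=1 and cannot tell whether a kernel enables the skew by default.

```python
def expand_cpulist(spec):
    """Expand a simple CPU list such as '2-7,9' into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        if not part:
            continue
        low, has_range, high = part.partition("-")
        cpus.update(range(int(low), int(high if has_range else low) + 1))
    return cpus


def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value-or-None} dict."""
    params = {}
    for token in cmdline.split():
        key, has_value, value = token.partition("=")
        params[key] = value if has_value else None
    return params


def check_workaround(cmdline):
    """Return (nohz_full CPU set, True if skew_tick=1 was passed explicitly)."""
    params = parse_cmdline(cmdline)
    cpus = expand_cpulist(params.get("nohz_full") or "")
    return cpus, params.get("skew_tick") == "1"
```

On a live system this would be used as check_workaround(open("/proc/cmdline").read()). A non-empty CPU set together with False means nohz_full is enabled but skew_tick=1 was not passed.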
New series fixing this issue posted upstream by Frederic Weisbecker:
[RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
Even though it's an RFC, it is getting ACKs and I fully tested it and it fixes all instances of the issue for me.
Some additional comments regarding the reproducer from the description:
- To reproduce in the guest, it is sometimes necessary to have load in the host (I don't know why this is so). What I'm doing is running a kernel build in the host while running the reproducer in the guest (make sure all CPUs are busy in the host)
- Also use acct-bug to reproduce (follow the instructions in the source-file), in both host and guests:
- Run KVM-RT test-cases with load in the host
Looks like the patch series got stuck on upstream. Any thoughts how to proceed?
(In reply to Stanislav Kozina from comment #12)
> Looks like the patch series got stuck on upstream. Any thoughts how to
As far as I understand the following commit:
Author: Wanpeng Li <firstname.lastname@example.org>
Date: Thu Jun 29 19:15:11 2017 +0200
sched/cputime: Accumulate vtime on top of nsec clocksource
and its predecessors are fixing the issue. Present in 4.13-rc1.
*** Bug 1467266 has been marked as a duplicate of this bug. ***
Patch(es) committed on kernel-3.10.0-966.el7
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.