Bug 1436351 - [nohz]: wrong user and system time accounting
Summary: [nohz]: wrong user and system time accounting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Yauheni Kaliuta
QA Contact: Chunyu Hu
URL:
Whiteboard:
Duplicates: 1467266 (view as bug list)
Depends On:
Blocks: 1175461 1472889 1548445 1549423 1649835
 
Reported: 2017-03-27 17:27 UTC by Luiz Capitulino
Modified: 2019-08-06 12:05 UTC (History)
8 users (show)

Fixed In Version: kernel-3.10.0-966.el7
Doc Type: Bug Fix
Doc Text:
Cause: The kernel accounted a whole jiffy of time to a single activity.
Consequence: Some virtual CPU time was lost or misattributed.
Fix: Time is now accounted in nanoseconds, and the uneven remainder between jiffies is saved.
Result: Virtual CPU time is accounted correctly.
Clone Of:
Environment:
Last Closed: 2019-08-06 12:05:16 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:2029 None None None 2019-08-06 12:05:42 UTC

Description Luiz Capitulino 2017-03-27 17:27:11 UTC
Description of problem:

When there are two or more tasks running in user-space and taking 100% of a nohz_full CPU, top reports 60%-100% system time utilization. This is wrong, as most of the CPU time is actually spent in user-space (in a 1ms window, the kernel executes only once, for a couple of microseconds).

Running the same test-case on non-nohz_full CPUs shows 100% user-space time and close to 0% system time (the expected behavior).

This issue has been reported upstream:

[BUG nohz]: wrong user and system time accounting
https://lkml.org/lkml/2017/3/23/764

That email also contains initial debugging information, and other people have found additional ways to reproduce the problem.


Version-Release number of selected component (if applicable): kernel-3.10.0-617.el7.x86_64 (also present on latest upstream 4.11-rc4)


How reproducible:


Steps to Reproduce:
1. Set up a CPU to be nohz_full (that is, pass nohz_full= and isolcpus=)
2. Pin two tasks that run a busy loop in user-space to that CPU
3. Run 'top -d1'
4. Check user time and system time for that CPU

Actual results:

System time is between 60% and 100%.


Expected results:

User time is at or close to 100%, system time is at or close to 0%.

Comment 2 Luiz Capitulino 2017-03-27 17:28:31 UTC
I'm on top of this so I'm taking this BZ, but feel free to re-assign.

Comment 3 Luiz Capitulino 2017-03-27 17:46:39 UTC
I've found the following while debugging this issue:

- I can only reproduce this if the tick is re-activated on the nohz_full CPU. If the tick is de-activated (that is, we only have the 1Hz tick), the issue doesn't reproduce. That's why the issue triggers when two tasks are running (the tick is activated) but not when there's a single task (the tick is de-activated)

- While tracing this, I've seen the following pattern:

1. The task is happily executing in user-space
2. A timer interrupt fires
3. We transition from user-space to kernel-space
4. User time accounting is skipped because vtime_delta() returns zero (that is, jiffies hasn't changed since the last accounting)
5. The interrupt handler for the tick executes
6. We return to user-space; system time is accounted because vtime_delta() now returns non-zero

As I have several thousand entries in my tracing data, it's hard to tell whether all the wrong accounting happens as described. But if it does, then the problem is that we're skipping user-time accounting in step 4.

Comment 4 Luiz Capitulino 2017-03-28 15:05:21 UTC
I think I understand why we didn't see this before, it goes like this:

1. In the beginning of KVM-RT, we knew that nohz_full added a big overhead to kernel/user-space transitions. Applications such as cyclictest (which are basically a loop around a system call) would hit this case pretty hard, giving bad latency numbers. Because of this problem, we'd run KVM-RT test-cases WITHOUT nohz_full in the guest (yes, this is somewhat counter-intuitive, but we'd get better latencies without nohz_full due to the overhead problem; for the sysjitter test-case and for DPDK/NFV use-cases, we'd have it enabled)

2. Rik fixed the overhead issue upstream in commit ff9a9b4c433 last year

3. Recently, I realized the fix had reached RHEL7 and re-enabled nohz_full in the guest. The latencies were excellent, so I kept it enabled in my testing

4. I first noticed the bad accounting last week, when running KVM-RT stress testing against the latest downstream RT kernel. I usually watch top during KVM-RT stress testing to see whether the housekeeping CPUs are getting stressed. But then I saw the nohz_full CPU with 100% system time while running threads in user-space...

Comment 5 Luiz Capitulino 2017-04-03 19:13:48 UTC
A small status update:

- There's consensus upstream that the root cause of this issue is that the ticks on the nohz_full CPUs and the timekeeper CPU are getting aligned. This causes the timekeeper CPU to always update jiffies right after the nohz_full CPU enters kernel-space from user-space (due to its own tick) but before it returns to user-space. This ends up causing the nohz_full CPU to always account system time when returning to user-space

- Passing skew_tick=1 on the kernel command-line mostly fixes this issue, but I can still see it happening with a KVM guest with 8 vCPUs (where 6 CPUs are nohz_full). This is still being debugged

Comment 6 Luiz Capitulino 2017-04-07 17:05:54 UTC
The issue I was seeing turned out to be "expected". That is, no form of accounting is perfect; even with tick-based accounting, it is possible to cook up an application that triggers incorrect accounting.

Upstream has agreed that making skew_tick=1 default when nohz_full is enabled is the best solution. A patch implementing this has been posted:

https://lkml.org/lkml/2017/4/6/56

I'll backport this patch as soon as it's merged upstream.

NOTE: Even with this patch, we may still see wrong accounting on some CPUs when running KVM-RT test-cases, as our test-case is good at triggering the bad-accounting scenario.

Comment 7 Luiz Capitulino 2017-05-10 19:50:46 UTC
Further discussion upstream concluded that the issue I was seeing in comment 5 should be fixed. There's a proposal by Thomas Gleixner on how to properly fix all cases.

However, this is still under discussion and the patches are getting more and more complex. Moving this to 7.5, as I don't think the series will make it in time for 7.4.

Affected workloads can use skew_tick=1 as a workaround on 7.4 (this is possibly going to be the default for RT, see bug 1447938 comment 12).

Comment 8 Luiz Capitulino 2017-07-04 16:55:24 UTC
New series fixing this issue posted upstream by Frederic Weisbecker:

[RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1434270.html

Even though it's an RFC, it is getting ACKs, and I fully tested it; it fixes all instances of the issue for me.

Comment 9 Luiz Capitulino 2017-07-04 17:00:18 UTC
Some additional comments regarding the reproducer from the description:

- To reproduce in the guest, it is sometimes necessary to have load on the host (I don't know why this is so). What I'm doing is running a kernel build on the host while running the reproducer in the guest (making sure all host CPUs are busy)

- acct-bug can also be used to reproduce (follow the instructions in the source file), on both host and guests:

  http://people.redhat.com/~lcapitul/real-time/acct-bug.c

- Run KVM-RT test-cases with load in the host

Comment 12 Stanislav Kozina 2017-08-03 11:25:15 UTC
Looks like the patch series got stuck upstream. Any thoughts on how to proceed?

Comment 13 Vitaly Kuznetsov 2017-08-08 15:19:21 UTC
(In reply to Stanislav Kozina from comment #12)
> Looks like the patch series got stuck upstream. Any thoughts on how to
> proceed?

As far as I understand, the following commit:

commit 2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author: Wanpeng Li <wanpeng.li@hotmail.com>
Date:   Thu Jun 29 19:15:11 2017 +0200

    sched/cputime: Accumulate vtime on top of nsec clocksource

and its predecessors fix the issue. Present in 4.13-rc1.

Comment 14 Vitaly Kuznetsov 2017-08-08 16:08:36 UTC
*** Bug 1467266 has been marked as a duplicate of this bug. ***

Comment 30 Bruno Meneguele 2018-11-19 19:45:11 UTC
Patch(es) committed on kernel-3.10.0-966.el7

Comment 40 errata-xmlrpc 2019-08-06 12:05:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2029

