Bug 1436351
| Summary: | [nohz]: wrong user and system time accounting | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Luiz Capitulino <lcapitulino> |
| Component: | kernel | Assignee: | Yauheni Kaliuta <ykaliuta> |
| kernel sub component: | Process management | QA Contact: | Chunyu Hu <chuhu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | unspecified | ||
| Priority: | unspecified | CC: | cye, hhei, lcapitulino, liwan, qzhao, skozina, vkuznets, yacao |
| Version: | 7.4 | ||
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | kernel-3.10.0-966.el7 | Doc Type: | Bug Fix |
| Doc Text: | Cause: The kernel accounts a whole jiffy of time to a single activity. Consequence: Some of the virtual time is lost. Fix: Account time in nanoseconds and save the sub-jiffy remainder between jiffies. Result: Virtual CPU time is accounted properly. | Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2019-08-06 12:05:16 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1175461, 1472889, 1548445, 1549423, 1649835 | ||
Description
Luiz Capitulino
2017-03-27 17:27:11 UTC
I'm on top of this so I'm taking this BZ, but feel free to re-assign.

I've found the following while debugging this issue:

- I can only reproduce if the tick is re-activated on the nohz_full CPU. If the tick is de-activated (that is, we have only the 1HZ tick) I don't reproduce this issue. That's why the issue triggers when we have two tasks running (the tick is activated) and not when we have a single task (tick de-activated)
- On tracing this, I've seen the following pattern:
  1. The task is happily executing in user-space
  2. There's a timer interrupt
  3. We transition from user-space to kernel-space
  4. User time accounting is skipped because vtime_delta() returns zero (that is, jiffies hasn't changed since the last accounting)
  5. The interrupt handler for the tick executes
  6. We return to user-space, and system time is accounted for because now vtime_delta() returns non-zero

As I have several thousand entries in my tracing data, it's hard to tell if all wrong accounting is happening as described. But if it is, then the problem is that we're skipping user time accounting in step 4.

I think I understand why we didn't see this before. It goes like this:

1. In the beginning of KVM-RT, we knew that nohz_full added a big overhead for kernel/user-space transitions. Applications such as cyclictest (which are basically a loop around a system call) would hit this case pretty hard, giving bad latency numbers. Because of this problem, we'd run KVM-RT test-cases WITHOUT nohz_full in the guest (yes, this is somewhat counter-intuitive, but we'd get better latencies without nohz_full due to the overhead problem. For the sysjitter test-case and for DPDK/NFV use-cases, we'd have it enabled)
2. Rik fixed the overhead issue upstream in commit ff9a9b4c433 last year
3. Recently, I realized the fix had reached RHEL7 and re-enabled nohz_full in the guest. The latencies were excellent so I kept it enabled in my testing
4. I first noticed the bad accounting last week, when running KVM-RT stress testing against the latest downstream RT kernel. I usually watch top during KVM-RT stress testing to see if the housekeeping CPUs are getting stressed. But then I saw the nohz_full CPU with 100% system time while running threads in user-space...

A small status update:

- There's consensus upstream that the root cause of this issue is the fact that the ticks on the nohz_full CPUs and the timekeeper CPU are getting aligned. This causes the timekeeper CPU to always update jiffies right after the nohz_full CPU enters kernel-space from user-space (due to its own tick) but before it returns to user-space. This ends up causing the nohz_full CPU to always account for system time when returning to user-space
- Passing skew_tick=1 on the kernel command-line mostly fixes this issue, but I can still see it happening with a KVM guest with 8 vCPUs (where 6 CPUs are nohz_full). This is still being debugged

The issue I was seeing turned out to be "expected". Meaning, no form of accounting is perfect. Even for tick-based accounting, it is possible to cook up an application that will trigger incorrect accounting.

Upstream has agreed that making skew_tick=1 the default when nohz_full is enabled is the best solution. A patch implementing this has been posted:

https://lkml.org/lkml/2017/4/6/56

I'll backport this patch as soon as it's merged upstream.

NOTE: Just to emphasize that even with this patch, it is possible that we'll see wrong accounting on some CPUs when running KVM-RT test-cases, as our test-case is good at triggering the bad accounting scenario.

Further discussion upstream concluded that the issue I was seeing in comment 5 should be fixed. There's a proposal by Thomas Gleixner on how to properly fix all cases. However, this is still under discussion and the patches are getting more and more complex.

Moving this to 7.5 as I don't think the series will make it in time for 7.4.
Affected workloads can use skew_tick=1 as a workaround on 7.4 (this is possibly going to be the default for RT, see bug 1447938 comment 12).

New series fixing this issue posted upstream by Frederic Weisbecker:

[RFC PATCH 0/5] vtime: Fix wrong user and system time accounting
https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1434270.html

Even though it's an RFC, it is getting ACKs. I fully tested it and it fixes all instances of the issue for me.

Some additional comments regarding the reproducer from the description:

- To reproduce in the guest, it is sometimes necessary to have load in the host (I don't know why this is so). What I'm doing is running a kernel build in the host while running the reproducer in the guest (make sure all CPUs are busy in the host)
- Also use acct-bug to reproduce (follow the instructions in the source file), in both host and guests: http://people.redhat.com/~lcapitul/real-time/acct-bug.c
- Run KVM-RT test-cases with load in the host

Looks like the patch series got stuck on upstream. Any thoughts how to proceed?

(In reply to Stanislav Kozina from comment #12)
> Looks like the patch series got stuck on upstream. Any thoughts how to
> proceed?

As far as I understand, the following commit and its predecessors fix the issue:

commit 2a42eb9594a1480b4ead9e036e06ee1290e5fa6d
Author: Wanpeng Li <wanpeng.li>
Date: Thu Jun 29 19:15:11 2017 +0200

sched/cputime: Accumulate vtime on top of nsec clocksource

Present in 4.13-rc1.

*** Bug 1467266 has been marked as a duplicate of this bug. ***

Patch(es) committed on kernel-3.10.0-966.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2029
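
For readers following the thread, the accounting pattern discussed above can be illustrated with a toy model. This is only a sketch, not kernel code: the constants (`TICK_NS`, `IRQ_NS`, `UPDATE_DELAY_NS`) and the helpers `jiffies_at()` and `simulate()` are hypothetical names chosen for illustration, and HZ=1000 is an assumption. It models a nohz_full CPU whose tick interrupts a user-space task, and compares jiffies-granular deltas (aligned and skewed ticks) with nanosecond-granular deltas in the spirit of the upstream fix.

```python
TICK_NS = 1_000_000       # one jiffy, assuming HZ=1000
IRQ_NS = 5_000            # hypothetical time spent in the tick handler
UPDATE_DELAY_NS = 1_000   # hypothetical delay before the timekeeper bumps jiffies


def jiffies_at(t_ns):
    """Toy jiffies counter: updated slightly after each tick boundary."""
    return max(t_ns - UPDATE_DELAY_NS, 0) // TICK_NS


def nanoseconds_at(t_ns):
    """Toy nanosecond clock: the time itself."""
    return t_ns


def simulate(ticks, skew_ns, clock):
    """Return (user, system) totals, in the units of `clock`.

    skew_ns shifts this CPU's tick relative to the timekeeper's tick,
    loosely modelling what skew_tick=1 is meant to achieve.
    """
    user = system = 0
    snapshot = clock(0)
    for k in range(1, ticks + 1):
        enter = k * TICK_NS + skew_ns   # tick interrupts the user-space task
        leave = enter + IRQ_NS          # handler done, back to user-space

        now = clock(enter)              # user -> kernel: account user time
        user += now - snapshot
        snapshot = now

        now = clock(leave)              # kernel -> user: account system time
        system += now - snapshot
        snapshot = now
    return user, system


if __name__ == "__main__":
    print("aligned ticks, jiffies:", simulate(1000, 0, jiffies_at))
    print("skewed ticks, jiffies: ", simulate(1000, TICK_NS // 2, jiffies_at))
    print("aligned ticks, nsec:   ", simulate(1000, 0, nanoseconds_at))
```

In this model, aligned ticks with a jiffies-based delta charge every jiffy to system time, because the delta is zero on kernel entry and one on kernel exit. Skewing the tick moves the whole jiffy to user time, and the nanosecond clock splits each period between user time and the handler's system time.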