Bug 624756
Summary: | idle time accounted for twice in /proc/stat for Xen guest | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Paolo Bonzini <pbonzini> |
Component: | kernel | Assignee: | Laszlo Ersek <lersek> |
Status: | CLOSED ERRATA | QA Contact: | Boris Ranto <branto> |
Severity: | medium | Docs Contact: | |
Priority: | low | ||
Version: | 6.0 | CC: | branto, drjones, gcosta, ketuzsezr, leiwang, lersek, mjenner, mrezanin, notting, pbonzini, qguan, qwan, tburke, xen-maint |
Target Milestone: | rc | Keywords: | Reopened, TestOnly |
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | xen | ||
Fixed In Version: | kernel-2.6.32-238.el6 | Doc Type: | Bug Fix |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2012-06-20 07:35:26 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 760479, 821374 | ||
Bug Blocks: | 523117, 767187 | ||
Attachments: |
Description
Paolo Bonzini
2010-08-17 16:37:33 UTC
This issue has been proposed when we are only considering blocker issues in the current Red Hat Enterprise Linux release. ** If you would still like this issue considered for the current release, ask your support representative to file as a blocker on your behalf. Otherwise ask that it be considered for the next Red Hat Enterprise Linux release. ** Have you tried a more recent kernel? There were no changes in this area since that kernel. Upstream suggested one commit that is scheduled for backport in 6.1; I'll test the latest RHEL6 kernel with that commit, but I don't have very high hopes that it'll fix it. The commit suggested by upstream doesn't help (and the bug was also confirmed on xen-devel by another user). Thank you for your bug report. This issue was evaluated for inclusion in the current release of Red Hat Enterprise Linux. Unfortunately, we are unable to address this request in the current release. Because we are in the final stage of Red Hat Enterprise Linux 6 development, only significant, release-blocking issues involving serious regressions and data corruption can be considered. If you believe this issue meets the release blocking criteria as defined and communicated to you by your Red Hat Support representative, please ask your representative to file this issue as a blocker for the current release. Otherwise, ask that it be evaluated for inclusion in the next minor release of Red Hat Enterprise Linux. This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative. This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release. This request was evaluated by Red Hat Product Management for inclusion in the current release of Red Hat Enterprise Linux. Because the affected component is not scheduled to be updated in the current release, Red Hat is unfortunately unable to address this request at this time. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. If you would like it considered as an exception in the current release, please ask your support representative. This request was erroneously denied for the current release of Red Hat Enterprise Linux. The error has been fixed and this request has been re-proposed for the current release. The only we've done for 6.2 that *may* have changed something wrt to this was to backport "xen: drop xen_sched_clock in favour of using plain wallclock time", but I just tried a latest kernel and it still reproduces the exact same way. Kicking to 6.3. show_stat() in "fs/proc/stat.c" formats those lines. 21 #ifndef arch_idle_time 22 #define arch_idle_time(cpu) 0 23 #endif 25 static int show_stat(struct seq_file *p, void *v) 26 { 29 cputime64_t user, nice, system, idle, iowait, irq, softirq, steal; 75 for_each_online_cpu(i) { 81 idle = kstat_cpu(i).cpustat.idle; 82 idle = cputime64_add(idle, arch_idle_time(i)); 88 seq_printf(p, 89 "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu\n", 90 i, 91 (unsigned long long)cputime64_to_clock_t(user), 92 (unsigned long long)cputime64_to_clock_t(nice), 93 (unsigned long long)cputime64_to_clock_t(system), 94 (unsigned long long)cputime64_to_clock_t(idle), arch_idle_time() seems to be defined only on s390, so the doubling must happen solely in kstat_cpu(i).cpustat.idle. The latter is increased by account_idle_time() in "kernel/sched.c". The two callers that qualify seem to be account_process_tick() and account_idle_ticks(). account_idle_ticks() has a xen-specific caller: in "arch/x86/xen/time.c", xen_setup_timer() binds VIRQ_TIMER to: xen_timer_interrupt() [arch/x86/xen/time.c] -> do_stolen_accounting() -> account_idle_ticks(ticks) [kernel/sched.c] -> account_idle_time(jiffies_to_cputime(ticks)) do_stolen_accounting() pulls some voodoo tricks to compute "ticks" for account_idle_ticks(): 114 static void do_stolen_accounting(void) 115 { 118 s64 blocked, runnable, offline, stolen; 119 cputime_t ticks; 127 /* work out how much time the VCPU has not been runn*ing* */ 128 blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked]; 145 /* Add the appropriate number of ticks of blocked time, 146 including any left-overs from last time. */ 147 blocked += __get_cpu_var(residual_blocked); 148 149 if (blocked < 0) 150 blocked = 0; 151 152 ticks = iter_div_u64_rem(blocked, NS_PER_TICK, &blocked); 153 __get_cpu_var(residual_blocked) = blocked; 154 account_idle_ticks(ticks); 155 } Reproduced with 2.6.32-209.el6.x86_64. Created attachment 528607 [details] systemtap script to count some function calls and argument averages In the guest where I reproduced this (see comment 13), I ran timer-jiffies.stp from the SystemTap Beginner's Guide. (Don't laugh.) It said: jiffies:ms ratio 123:123 => CONFIG_HZ=1000 So far so good. Then I cooked up the attached stap script. After setting the VCPU number to 1 for the guest and rebooting, the script outputs the following (per-second averages, collected over 10 seconds): account_idle_time(): calls = 327, cputime sum = 1995 account_process_tick(): calls = 153 account_idle_ticks(): calls = 173, ticks sum = 1841 do_stolen_accounting(): calls = 162 xen_timer_interrupt(): calls = 162 (The indentation intends to show an incomplete, inverted call tree: more to the right side and towards the bottom are the callers. account_idle_time() has two direct callers here, for example.) So, why is xen_timer_interrupt() called 162 times per second? Working backwards in the hypervisor, send_timer_event() [arch/x86/time.c] sends the VIRQ_TIMER to the guest VCPU. In "common/schedule.c", this function is called in two locations: vcpu_periodic_timer_work() and vcpu_singleshot_timer_fn(). I can't see any automatism for the singleshot timer. For the periodic timer: vcpu_periodic_timer_work() checks if the periodic_period has elapsed and if so, it sends a VIRQ_TIMER, and sets up the next alarm point (periodic_next_event). One would expect that this function (vcpu_periodic_timer_work()) is only called by the periodic timer's alarm function, ie. vcpu_periodic_timer_fn(). (This timer BTW seems to have a fixed 10 msecs period, see vcpu_initialise() in "arch/x86/domain.c": /* PV guests by default have a 100Hz ticker. */ v->periodic_period = MILLISECS(10); ) However, vcpu_periodic_timer_work() is also called from schedule() [common/schedule.c]. vcpu_periodic_timer_work() does not indiscriminately fire off a VIRQ_TIMER, but it does unconditionally *reset* the next event (the alarm) for the timer. And the timer's alarm function does send the VIRQ_TIMER unconditionally. Does this make any sense? http://xenbits.xensource.com/hg/xen-unstable.hg/rev/18416 ? Created attachment 528752 [details]
rhel-5 stap script
In a RHEL-5.7 single-VCPU PV guest, the jiffies script prints:
jiffies:ms ratio 30:123 => CONFIG_HZ=243
Adapting the stap script to RHEL-5, the output goes like:
account_steal_time(): calls = 352
CPUT_account_steal_time[0] = 0
CPUT_account_steal_time[ffffffff804feb80] = 249
timer_interrupt(): calls = 176
The timer interrupt handler is invoked with the same frequency as in the RHEL-6 guest case. timer_interrupt() [arch/i386/kernel/time-xen.c] calls account_steal_time() [kernel/kernel/sched.c] twice, first with a NULL task pointer ("stolen time") and then with "idle_task(cpu)" (idle/wait time).
ffffffff804feb80 D init_task
The cputime value of 249 is added to cpustat->idle in account_steal_time(). It is then printed in show_stat() [fs/proc/proc_misc.c], after conversion with cputime64_to_clock_t(), which is #defined as jiffies_64_to_clock_t().
Since we know that the jiffies:ms ratio is 30:123, 249 jiffies correspond to about 249 * 123 / 30 = 1020 milliseconds, or one second. So RHEL-5 is indeed correct.
Therefore I think do_stolen_accounting() passes a wrong (double) value to account_idle_ticks() in RHEL-6. I suspect 79741dd3. /proc/uptime is broken in the same way (found by grepping the commit log for 79741dd3, and finding 96830a57). Git commit 79741dd changes idle cputime accounting, but unfortunately the /proc/uptime file hasn't caught up. Here the idle time calculation from /proc/stat is copied over. Since /proc/stat is broken (or victim to something else) to begin with... Reproduced with a Fedora 16 PV guest (kernel 3.1.0-0.rc8.git0.1.fc16.x86_64). (In reply to comment #4) > Upstream suggested one > commit that is scheduled for backport in 6.1; I'll test the latest RHEL6 kernel > with that commit, but I don't have very high hopes that it'll fix it. Next time please paste the xen-devel link: http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01068.html I saw the (*event_handler)() callback in xen_timer_interrupt(). It wasn't set in "arch/x86/xen/time.c" anywhere, so I didn't start to investigate that. Now I realize "include/linux/clockchips.h" states: * @event_handler: Assigned by the framework to be called by the low * level handler of the event source If the xen-devel link had been there in the BZ, I wouldn't have wasted days trying to find the root cause in the wrong place. To begin with, you described the cause in detail in the eight paragraph of the upstream thread starter. Basically this BZ has been laying dormant since August 2010 with neither the (known) root cause nor the link to the external description of it. The bug is still present in upstream, I checked the source and also tested a recent kernel in comment 19. Setting Cond-NAK=Upstream and unassigning from myself. This would not be related to https://lkml.org/lkml/2011/10/6/10 ? Yes, it would. :) Created attachment 528873 [details] remove blocked time accounting from xen "clockchip" (patch for upstream) Hi Konrad, (In reply to comment #21) > This would not be related to https://lkml.org/lkml/2011/10/6/10 ? thanks for the hint. Yes, that probably fixes the symptom. I'm not sure though whether that's the right / elegant / canonical way to attack the problem. If it is, then the fix could/should remove more code. Please see the attached patch. It shouldn't affect stolen time. I figure the stolen + blocked time accounting was scavenged from linux-2.6.18-xen.hg, where timer_interrupt() handled both (--> comment 15). I guess in the pvops kernel the new -- how new? -- clock_event_device framework assumed the responsibility of blocked time accounting. Created attachment 528882 [details] remove blocked time accounting from xen "clockchip" (proposed RHEL-6 patch) ... The clock_event_device framework should take care of that. I tested this patch with a 64-bit x86_64 PV guest (RHEL-6.1, kernel upgraded to -210+patch, local build). It seems to pass the test described in comment 0, but I'm still not sure this is the right way to go about the problem. (In reply to comment #23) > Created attachment 528873 [details] > remove blocked time accounting from xen "clockchip" (patch for upstream) http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01279.html Re http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01311.html (1) I tried to figure out what function is called from xen_timer_interrupt() via the event_handler callback. It's hrtimer_interrupt(): #0 hrtimer_interrupt (dev=0xffff880002f7f620) at kernel/hrtimer.c:1233 #1 0xffffffff81007945 in xen_timer_interrupt (irq=<value optimized out>, dev_id=<value optimized out>) at arch/x86/xen/time.c:363 (2) What sets up that callback? I think: init_timers() [kernel/timer.c] -> open_softirq(TIMER_SOFTIRQ, run_timer_softirq) [kernel/softirq.c] Then run_timer_softirq() [kernel/timer.c] -> hrtimer_run_pending() [kernel/hrtimer.c] -> hrtimer_switch_to_hres() -> tick_init_highres() [kernel/time/tick-oneshot.c] -> tick_switch_to_oneshot() The last two functions set the per-CPU tick_cpu_device.evtdev->event_handler pointer to &hrtimer_interrupt. I believe this switch happens on the first timer softirq, dynamically. See the comment in hrtimer_run_pending(): /* * This _is_ ugly: We have to check in the softirq context, * whether we can switch to highres and / or nohz mode. The * clocksource switch happens in the timer interrupt with * xtime_lock held. Notification from there only sets the * check bit in the tick_oneshot code, otherwise we might * deadlock vs. xtime_lock. */ if (tick_check_oneshot_change(!hrtimer_is_hres_enabled())) hrtimer_switch_to_hres(); (3) Where is account_idle_ticks() called from (when not from do_stolen_accounting())? #0 account_idle_ticks (ticks=142593) at kernel/sched.c:5588 #1 0xffffffff810a10c5 in tick_nohz_restart_sched_tick () at kernel/time/tick-sched.c:536 #2 0xffffffff81009e39 in cpu_idle () at arch/x86/kernel/process_64.c:148 #3 0xffffffff814b6a6a in rest_init () at init/main.c:483 #4 0xffffffff81c1df76 in start_kernel () at init/main.c:740 #5 0xffffffff81c1d33a in x86_64_start_reservations (real_mode_data=<value optimized out>) at arch/x86/kernel/head64.c:123 #6 0xffffffff81c21180 in xen_start_kernel () at arch/x86/xen/enlighten.c:1275 (4) How do I connect these? It seems that the xen "clockchip" is configured (by the "framework") to kick the high resolution interrupt handler, see (1) and (2). There's also the idle kernel thread (?) that runs in an endless loop, and is directly responsible for increasing the idle time counter. What I can't determine is if the hrtimer interrupt handler is responsible for waking up the idle thread (via inter-thread comm means) and causing it to run some more iterations around the loop. Because: - If so, then the patch can't just remove the idle time accounting from the end of do_stolen_accounting(): in case the event_handler callback is NULL, nobody would wake up cpu_idle(). (However, is a NULL event_handler even possible, modulo a very short initial phase? If there were no real timer interrupt handler consuming clockchip interrupts, everything should grind to a halt. (*)) - If there's no connection whatsoever between the hrtimer interrupt handler and the cpu_idle() loop, then calling account_idle_ticks() at the end of do_stolen_accounting() is wrong in any case, independently of the NULL-ness of event_handler. (*) Most "clockchip implementations" that I could find in the kernel either just call evt->event_handler() without checking for NULL, or do check for NULL and qualify that case as "spurious timer interrupt". Especially local_apic_timer_interrupt() [arch/x86/kernel/apic/apic.c] is telling: static void local_apic_timer_interrupt(void) { int cpu = smp_processor_id(); struct clock_event_device *evt = &per_cpu(lapic_events, cpu); /* * Normally we should not be here till LAPIC has been initialized but * in some cases like kdump, its possible that there is a pending LAPIC * timer interrupt from previous kernel's context and is delivered in * new kernel the moment interrupts are enabled. * * Interrupts are enabled early and LAPIC is setup much later, hence * its possible that when we get here evt->event_handler is NULL. * Check for event_handler being NULL and discard the interrupt as * spurious. */ if (!evt->event_handler) { pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu); /* Switch it off */ lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt); return; } /* * the NMI deadlock-detector uses this. */ inc_irq_stat(apic_timer_irqs); evt->event_handler(evt); } Perhaps one could argue that event_handler may be NULL for an initial short period (which doesn't really matter wrt. idle time accounting), but afterwards it's always set. Meaning, we could unconditionally remove the account_idle_ticks() call from do_stolen_accounting(). If I pass "highres=off" to the guest kernel, then event_handler points to tick_nohz_handler() [kernel/time/tick-sched.c]. This handler is set up in run_timer_softirq() [kernel/timer.c] -> hrtimer_run_pending() [kernel/hrtimer.c, hrtimer_is_hres_enabled() retval changed] -> tick_check_oneshot_change(1) [kernel/time/tick-sched.c, retval changes] -> tick_nohz_switch_to_nohz() -> tick_switch_to_oneshot() [kernel/time/tick-oneshot.c, see also comment 26 (2) for this] The patched kernel still passes the test with this option. Pristine 6.1 still has the double idle time problem. It appears that the upstream kernel will have to see some deeper changes to fix this properly. http://lists.xensource.com/archives/html/xen-devel/2011-11/msg00523.html "xen/time: Use the pv_ops steal_time" http://lists.xen.org/archives/html/xen-devel/2012-01/msg01926.html I'm not sure how easy it would be to port Konrad's patch to RHEL-6. Our "pv_time_ops" struct type doesn't seem to have the "steal_clock" member. Also no hits for either jump_label_inc() / "paravirt_steal_enabled". Let's make this bug blocked by Rik's steal time backport then. We get lucky every once in a while. Retesting comment 0: # uname -r 2.6.32-250.el6.x86_64 # grep cpu0 /proc/stat; sleep 20 ; grep cpu0 /proc/stat cpu0 185 0 307 28653 434 0 0 76 0 cpu0 185 0 308 30652 434 0 0 76 0 ^ difference: 30652 - 28653 == 1999 Thanks for the tip, Drew! (Also removing dependency on bug 612320 -- that one is also built into -250, but we only considered it as a prerequisite.) *** This bug has been marked as a duplicate of bug 760479 *** Let's still get this tested by QE. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2012-0862.html |