Bug 624756

Summary: idle time accounted for twice in /proc/stat for Xen guest
Product: Red Hat Enterprise Linux 6 Reporter: Paolo Bonzini <pbonzini>
Component: kernelAssignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA QA Contact: Boris Ranto <branto>
Severity: medium Docs Contact:
Priority: low    
Version: 6.0CC: branto, drjones, gcosta, ketuzsezr, leiwang, lersek, mjenner, mrezanin, notting, pbonzini, qguan, qwan, tburke, xen-maint
Target Milestone: rcKeywords: Reopened, TestOnly
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard: xen
Fixed In Version: kernel-2.6.32-238.el6 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-06-20 07:35:26 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 760479, 821374    
Bug Blocks: 523117, 767187    
Attachments:
Description Flags
systemtap script to count some function calls and argument averages
none
rhel-5 stap script
none
remove blocked time accounting from xen "clockchip" (patch for upstream)
none
remove blocked time accounting from xen "clockchip" (proposed RHEL-6 patch) none

Description Paolo Bonzini 2010-08-17 16:37:33 UTC
Description of problem:
/proc/stat counts idle time twice for RHEL6 Xen PV guests.

Version-Release number of selected component (if applicable):
kernel-2.6.32-31.el6

How reproducible:
100%

Steps to Reproduce:
1. Start a RHEL6 Xen PV guest.
2. Invoke "grep cpu0 /proc/stat; sleep 20 ; grep cpu0 /proc/stat"

Actual results:
The fourth number in /proc/stat (idle) increases by approximately 4000.

Expected results:
The fourth number in /proc/stat (idle)  increases by approximately 2000.

Additional info:
If you try instead (you need an otherwise unloaded machine for these):

grep cpu0 /proc/stat; timeout 20s yes > /dev/null ; grep cpu0 /proc/stat

grep cpu0 /proc/stat; timeout 20s dd if=/dev/urandom > /dev/null ; grep cpu0 /proc/stat

you'll notice respectively the first and third number in the /cpu/stat increase by 2000.

Comment 2 RHEL Program Management 2010-08-17 16:58:11 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 3 Bill Nottingham 2010-08-17 18:48:38 UTC
Have you tried a more recent kernel?

Comment 4 Paolo Bonzini 2010-08-18 08:42:13 UTC
There were no changes in this area since that kernel.  Upstream suggested one commit that is scheduled for backport in 6.1; I'll test the latest RHEL6 kernel with that commit, but I don't have very high hopes that it'll fix it.

Comment 5 Paolo Bonzini 2010-08-18 14:49:02 UTC
The commit suggested by upstream doesn't help (and the bug was also confirmed on xen-devel by another user).

Comment 6 RHEL Program Management 2010-08-18 21:23:35 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 7 RHEL Program Management 2011-01-07 03:52:14 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 8 Suzanne Logcher 2011-01-07 16:25:04 UTC
This request was erroneously denied for the current release of Red Hat
Enterprise Linux.  The error has been fixed and this request has been
re-proposed for the current release.

Comment 9 RHEL Program Management 2011-02-01 05:30:22 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 10 RHEL Program Management 2011-02-01 18:33:53 UTC
This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 11 Andrew Jones 2011-09-13 09:21:30 UTC
The only we've done for 6.2 that *may* have changed something wrt to this was to backport "xen: drop xen_sched_clock in favour of using plain wallclock time", but I just tried a latest kernel and it still reproduces the exact same way. Kicking to 6.3.

Comment 12 Laszlo Ersek 2011-09-22 14:10:49 UTC
show_stat() in "fs/proc/stat.c" formats those lines.

    21  #ifndef arch_idle_time
    22  #define arch_idle_time(cpu) 0
    23  #endif

    25  static int show_stat(struct seq_file *p, void *v)
    26  {

    29    cputime64_t user, nice, system, idle, iowait, irq, softirq, steal;

    75    for_each_online_cpu(i) {

    81      idle = kstat_cpu(i).cpustat.idle;
    82      idle = cputime64_add(idle, arch_idle_time(i));

    88      seq_printf(p,
    89        "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu %llu\n",
    90        i,
    91        (unsigned long long)cputime64_to_clock_t(user),
    92        (unsigned long long)cputime64_to_clock_t(nice),
    93        (unsigned long long)cputime64_to_clock_t(system),
    94        (unsigned long long)cputime64_to_clock_t(idle),

arch_idle_time() seems to be defined only on s390, so the doubling must happen solely in kstat_cpu(i).cpustat.idle.

The latter is increased by account_idle_time() in "kernel/sched.c". The two callers that qualify seem to be account_process_tick() and account_idle_ticks().

account_idle_ticks() has a xen-specific caller: in "arch/x86/xen/time.c", xen_setup_timer() binds VIRQ_TIMER to:

xen_timer_interrupt() [arch/x86/xen/time.c]
  -> do_stolen_accounting()
    -> account_idle_ticks(ticks) [kernel/sched.c]
      -> account_idle_time(jiffies_to_cputime(ticks)) 

do_stolen_accounting() pulls some voodoo tricks to compute "ticks" for account_idle_ticks():

   114  static void do_stolen_accounting(void)
   115  {

   118    s64 blocked, runnable, offline, stolen;
   119    cputime_t ticks;

   127    /* work out how much time the VCPU has not been runn*ing*  */
   128    blocked = state.time[RUNSTATE_blocked] - snap->time[RUNSTATE_blocked];

   145    /* Add the appropriate number of ticks of blocked time,
   146       including any left-overs from last time. */
   147    blocked += __get_cpu_var(residual_blocked);
   148  
   149    if (blocked < 0)
   150      blocked = 0;
   151  
   152    ticks = iter_div_u64_rem(blocked, NS_PER_TICK, &blocked);
   153    __get_cpu_var(residual_blocked) = blocked;
   154    account_idle_ticks(ticks);
   155  }

Comment 13 Laszlo Ersek 2011-10-17 12:43:57 UTC
Reproduced with 2.6.32-209.el6.x86_64.

Comment 14 Laszlo Ersek 2011-10-17 17:36:09 UTC
Created attachment 528607 [details]
systemtap script to count some function calls and argument averages

In the guest where I reproduced this (see comment 13), I ran timer-jiffies.stp
from the SystemTap Beginner's Guide. (Don't laugh.) It said:

jiffies:ms ratio 123:123 => CONFIG_HZ=1000

So far so good. Then I cooked up the attached stap script. After setting the VCPU number to 1 for the guest and rebooting, the script outputs the following (per-second averages, collected over 10 seconds):

account_idle_time():         calls = 327, cputime sum = 1995
  account_process_tick():    calls = 153
  account_idle_ticks():      calls = 173, ticks sum = 1841
    do_stolen_accounting():  calls = 162
      xen_timer_interrupt(): calls = 162

(The indentation intends to show an incomplete, inverted call tree: more to the right side and towards the bottom are the callers. account_idle_time() has two direct callers here, for example.)

So, why is xen_timer_interrupt() called 162 times per second?

Working backwards in the hypervisor, send_timer_event() [arch/x86/time.c] sends the VIRQ_TIMER to the guest VCPU. In "common/schedule.c", this function is called in two locations: vcpu_periodic_timer_work() and vcpu_singleshot_timer_fn().

I can't see any automatism for the singleshot timer. For the periodic timer:

vcpu_periodic_timer_work() checks if the periodic_period has elapsed and if so, it sends a VIRQ_TIMER, and sets up the next alarm point (periodic_next_event).

One would expect that this function (vcpu_periodic_timer_work()) is only called by the periodic timer's alarm function, ie. vcpu_periodic_timer_fn().

(This timer BTW seems to have a fixed 10 msecs period, see vcpu_initialise() in "arch/x86/domain.c":

        /* PV guests by default have a 100Hz ticker. */
        v->periodic_period = MILLISECS(10);
)

However, vcpu_periodic_timer_work() is also called from schedule() [common/schedule.c]. vcpu_periodic_timer_work() does not indiscriminately fire off a VIRQ_TIMER, but it does unconditionally *reset* the next event (the alarm) for the timer. And the timer's alarm function does send the VIRQ_TIMER unconditionally. Does this make any sense?

http://xenbits.xensource.com/hg/xen-unstable.hg/rev/18416
?

Comment 15 Laszlo Ersek 2011-10-18 09:30:41 UTC
Created attachment 528752 [details]
rhel-5 stap script

In a RHEL-5.7 single-VCPU PV guest, the jiffies script prints:

jiffies:ms ratio 30:123 => CONFIG_HZ=243

Adapting the stap script to RHEL-5, the output goes like:

account_steal_time(): calls = 352
CPUT_account_steal_time[0] = 0
CPUT_account_steal_time[ffffffff804feb80] = 249
  timer_interrupt():  calls = 176

The timer interrupt handler is invoked with the same frequency as in the RHEL-6 guest case. timer_interrupt() [arch/i386/kernel/time-xen.c] calls account_steal_time() [kernel/kernel/sched.c] twice, first with a NULL task pointer ("stolen time") and then with "idle_task(cpu)" (idle/wait time).

ffffffff804feb80 D init_task

The cputime value of 249 is added to cpustat->idle in account_steal_time(). It is then printed in show_stat() [fs/proc/proc_misc.c], after conversion with cputime64_to_clock_t(), which is #defined as jiffies_64_to_clock_t(). 

Since we know that the jiffies:ms ratio is 30:123, 249 jiffies correspond to about 249 * 123 / 30 = 1020 milliseconds, or one second. So RHEL-5 is indeed correct.

Comment 16 Laszlo Ersek 2011-10-18 09:38:27 UTC
Therefore I think do_stolen_accounting() passes a wrong (double) value to account_idle_ticks() in RHEL-6.

Comment 17 Laszlo Ersek 2011-10-18 10:03:10 UTC
I suspect 79741dd3.

Comment 18 Laszlo Ersek 2011-10-18 10:10:45 UTC
/proc/uptime is broken in the same way (found by grepping the commit log for 79741dd3, and finding 96830a57).

    Git commit 79741dd changes idle cputime accounting, but unfortunately
    the /proc/uptime file hasn't caught up.  Here the idle time calculation
    from /proc/stat is copied over.

Since /proc/stat is broken (or victim to something else) to begin with...

Comment 19 Laszlo Ersek 2011-10-18 12:02:40 UTC
Reproduced with a Fedora 16 PV guest (kernel 3.1.0-0.rc8.git0.1.fc16.x86_64).

Comment 20 Laszlo Ersek 2011-10-18 13:11:35 UTC
(In reply to comment #4)
> Upstream suggested one
> commit that is scheduled for backport in 6.1; I'll test the latest RHEL6 kernel
> with that commit, but I don't have very high hopes that it'll fix it.

Next time please paste the xen-devel link:

http://lists.xensource.com/archives/html/xen-devel/2010-08/msg01068.html

I saw the (*event_handler)() callback in xen_timer_interrupt(). It wasn't set in "arch/x86/xen/time.c" anywhere, so I didn't start to investigate that.

Now I realize "include/linux/clockchips.h" states:

 * @event_handler:	Assigned by the framework to be called by the low
 *			level handler of the event source

If the xen-devel link had been there in the BZ, I wouldn't have wasted days trying to find the root cause in the wrong place. To begin with, you described the cause in detail in the eight paragraph of the upstream thread starter. Basically this BZ has been laying dormant since August 2010 with neither the (known) root cause nor the link to the external description of it.

The bug is still present in upstream, I checked the source and also tested a recent kernel in comment 19. Setting Cond-NAK=Upstream and unassigning from myself.

Comment 21 Konrad Rzeszutek Wilk 2011-10-18 13:33:20 UTC
This would not be related to https://lkml.org/lkml/2011/10/6/10 ?

Comment 22 Paolo Bonzini 2011-10-18 15:34:12 UTC
Yes, it would. :)

Comment 23 Laszlo Ersek 2011-10-18 18:54:11 UTC
Created attachment 528873 [details]
remove blocked time accounting from xen "clockchip" (patch for upstream)

Hi Konrad,

(In reply to comment #21)
> This would not be related to https://lkml.org/lkml/2011/10/6/10 ?

thanks for the hint.

Yes, that probably fixes the symptom. I'm not sure though whether that's the right / elegant / canonical way to attack the problem.

If it is, then the fix could/should remove more code. Please see the attached patch. It shouldn't affect stolen time.

I figure the stolen + blocked time accounting was scavenged from linux-2.6.18-xen.hg, where timer_interrupt() handled both (--> comment 15). I guess in the pvops kernel the new -- how new? -- clock_event_device framework assumed the responsibility of blocked time accounting.

Comment 24 Laszlo Ersek 2011-10-18 19:47:30 UTC
Created attachment 528882 [details]
remove blocked time accounting from xen "clockchip" (proposed RHEL-6 patch)

... The clock_event_device framework should take care of that.

I tested this patch with a 64-bit x86_64 PV guest (RHEL-6.1, kernel upgraded to -210+patch, local build). It seems to pass the test described in comment 0, but I'm still not sure this is the right way to go about the problem.

Comment 25 Laszlo Ersek 2011-10-18 20:59:48 UTC
(In reply to comment #23)
> Created attachment 528873 [details]
> remove blocked time accounting from xen "clockchip" (patch for upstream)

http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01279.html

Comment 26 Laszlo Ersek 2011-10-19 14:41:30 UTC
Re http://lists.xensource.com/archives/html/xen-devel/2011-10/msg01311.html

(1) I tried to figure out what function is called from xen_timer_interrupt() via the event_handler callback. It's hrtimer_interrupt():

#0  hrtimer_interrupt (dev=0xffff880002f7f620) at kernel/hrtimer.c:1233
#1  0xffffffff81007945 in xen_timer_interrupt (irq=<value optimized out>,
    dev_id=<value optimized out>) at arch/x86/xen/time.c:363

(2) What sets up that callback? I think:

init_timers() [kernel/timer.c]
  -> open_softirq(TIMER_SOFTIRQ, run_timer_softirq) [kernel/softirq.c]

Then

run_timer_softirq() [kernel/timer.c]
  -> hrtimer_run_pending() [kernel/hrtimer.c]
    -> hrtimer_switch_to_hres()
      -> tick_init_highres() [kernel/time/tick-oneshot.c]
        -> tick_switch_to_oneshot()

The last two functions set the per-CPU tick_cpu_device.evtdev->event_handler pointer to &hrtimer_interrupt. I believe this switch happens on the first timer softirq, dynamically. See the comment in hrtimer_run_pending():

        /*
         * This _is_ ugly: We have to check in the softirq context,
         * whether we can switch to highres and / or nohz mode. The
         * clocksource switch happens in the timer interrupt with
         * xtime_lock held. Notification from there only sets the
         * check bit in the tick_oneshot code, otherwise we might
         * deadlock vs. xtime_lock.
         */
        if (tick_check_oneshot_change(!hrtimer_is_hres_enabled()))
                hrtimer_switch_to_hres();

(3) Where is account_idle_ticks() called from (when not from do_stolen_accounting())?

#0  account_idle_ticks (ticks=142593) at kernel/sched.c:5588
#1  0xffffffff810a10c5 in tick_nohz_restart_sched_tick () at
    kernel/time/tick-sched.c:536
#2  0xffffffff81009e39 in cpu_idle () at arch/x86/kernel/process_64.c:148
#3  0xffffffff814b6a6a in rest_init () at init/main.c:483
#4  0xffffffff81c1df76 in start_kernel () at init/main.c:740
#5  0xffffffff81c1d33a in x86_64_start_reservations (real_mode_data=<value
    optimized out>) at arch/x86/kernel/head64.c:123
#6  0xffffffff81c21180 in xen_start_kernel () at arch/x86/xen/enlighten.c:1275


(4) How do I connect these?

It seems that the xen "clockchip" is configured (by the "framework") to kick the high resolution interrupt handler, see (1) and (2). There's also the idle kernel thread (?) that runs in an endless loop, and is directly responsible for increasing the idle time counter.

What I can't determine is if the hrtimer interrupt handler is responsible for waking up the idle thread (via inter-thread comm means) and causing it to run some more iterations around the loop. Because:

- If so, then the patch can't just remove the idle time accounting from the end of do_stolen_accounting(): in case the event_handler callback is NULL, nobody would wake up cpu_idle(). (However, is a NULL event_handler even possible, modulo a very short initial phase? If there were no real timer interrupt handler consuming clockchip interrupts, everything should grind to a halt. (*))

- If there's no connection whatsoever between the hrtimer interrupt handler and the cpu_idle() loop, then calling account_idle_ticks() at the end of do_stolen_accounting() is wrong in any case, independently of the NULL-ness of event_handler.

(*) Most "clockchip implementations" that I could find in the kernel either just call evt->event_handler() without checking for NULL, or do check for NULL and qualify that case as "spurious timer interrupt". Especially local_apic_timer_interrupt() [arch/x86/kernel/apic/apic.c] is telling:

  static void local_apic_timer_interrupt(void)
  {
    int cpu = smp_processor_id();
    struct clock_event_device *evt = &per_cpu(lapic_events, cpu);

    /*
     * Normally we should not be here till LAPIC has been initialized but
     * in some cases like kdump, its possible that there is a pending LAPIC
     * timer interrupt from previous kernel's context and is delivered in
     * new kernel the moment interrupts are enabled.
     *
     * Interrupts are enabled early and LAPIC is setup much later, hence
     * its possible that when we get here evt->event_handler is NULL.
     * Check for event_handler being NULL and discard the interrupt as
     * spurious.
     */
    if (!evt->event_handler) {
            pr_warning("Spurious LAPIC timer interrupt on cpu %d\n", cpu);
            /* Switch it off */
            lapic_timer_setup(CLOCK_EVT_MODE_SHUTDOWN, evt);
            return;
    }

    /*
     * the NMI deadlock-detector uses this.
     */
    inc_irq_stat(apic_timer_irqs);

    evt->event_handler(evt);
  }

Perhaps one could argue that event_handler may be NULL for an initial short period (which doesn't really matter wrt. idle time accounting), but afterwards it's always set. Meaning, we could unconditionally remove the account_idle_ticks() call from do_stolen_accounting().

Comment 27 Laszlo Ersek 2011-10-20 10:33:12 UTC
If I pass "highres=off" to the guest kernel, then event_handler points to tick_nohz_handler() [kernel/time/tick-sched.c].

This handler is set up in

run_timer_softirq() [kernel/timer.c]
  -> hrtimer_run_pending() [kernel/hrtimer.c,
                            hrtimer_is_hres_enabled() retval changed]
    -> tick_check_oneshot_change(1) [kernel/time/tick-sched.c,
                                     retval changes]
      -> tick_nohz_switch_to_nohz() 
        -> tick_switch_to_oneshot() [kernel/time/tick-oneshot.c,
                                     see also comment 26 (2) for this]

The patched kernel still passes the test with this option. Pristine 6.1 still has the double idle time problem.

Comment 28 Laszlo Ersek 2011-11-14 09:11:59 UTC
It appears that the upstream kernel will have to see some deeper changes to fix this properly.

http://lists.xensource.com/archives/html/xen-devel/2011-11/msg00523.html

Comment 32 Laszlo Ersek 2012-01-24 11:00:41 UTC
"xen/time: Use the pv_ops steal_time"

http://lists.xen.org/archives/html/xen-devel/2012-01/msg01926.html

I'm not sure how easy it would be to port Konrad's patch to RHEL-6. Our "pv_time_ops" struct type doesn't seem to have the "steal_clock" member. Also no hits for either jump_label_inc() / "paravirt_steal_enabled".

Comment 33 Paolo Bonzini 2012-01-24 11:18:21 UTC
Let's make this bug blocked by Rik's steal time backport then.

Comment 35 Laszlo Ersek 2012-03-07 14:39:12 UTC
We get lucky every once in a while. Retesting comment 0:

# uname -r 
2.6.32-250.el6.x86_64

# grep cpu0 /proc/stat; sleep 20 ; grep cpu0 /proc/stat
cpu0 185 0 307 28653 434 0 0 76 0
cpu0 185 0 308 30652 434 0 0 76 0
               ^
               difference: 30652 - 28653 == 1999

Thanks for the tip, Drew!

(Also removing dependency on bug 612320 -- that one is also built into -250, but we only considered it as a prerequisite.)

*** This bug has been marked as a duplicate of bug 760479 ***

Comment 36 Paolo Bonzini 2012-03-07 15:33:30 UTC
Let's still get this tested by QE.

Comment 42 errata-xmlrpc 2012-06-20 07:35:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0862.html